Re: [Ocfs2-devel] [PATCH V2] ocfs2: Take inode cluster lock before moving reflinked inode from orphan dir
On 05/01/2018 05:08 PM, Andrew Morton wrote:
> On Wed, 11 Apr 2018 12:31:47 -0700 Ashish Samant
> wrote:
>
>> While reflinking an inode, we create a new inode in orphan directory, then
>> take EX lock on it, reflink the original inode to orphan inode and release
>> EX lock. Once the lock is released another node could request it in EX mode
>> from ocfs2_recover_orphans() which causes downconvert of the lock, on this
>> node, to NL mode.
>>
>> Later we attempt to initialize security acl for the orphan inode and move
>> it to the reflink destination. However, while doing this we dont take EX
>> lock on the inode. This could potentially cause problems because we could
>> be starting transaction, accessing journal and modifying metadata of the
>> inode while holding NL lock and with another node holding EX lock on the
>> inode.
> Someone (eg, me) needs to decide which kernel version(s) need the
> patch. And all I have is "this could potentially cause problems".
>
> Can you please be more specific about the end-user impact? *Does* it
> cause problems? Is the problem sufficiently urgent to warrant a merge
> into 4.17? Into -stable kernels?

Hi Andrew,

What the code is doing currently is definitely wrong and should be fixed. This could theoretically cause a DLM race, leading to a kernel panic, although the possibility of that happening in the real world is quite low. Hence the "potentially" in the commit message. So, I'd say it should go into 4.17, but backporting to stable is probably not required.

Thanks,
Ashish

> Please always include all the end-user-impact info when fixing bugs.
>
> Thanks.
>
>> Fix this by taking orphan inode cluster lock in EX mode before
>> initializing security and moving orphan inode to reflink destination.
>> Use the __tracker variant while taking inode lock to avoid recursive
>> locking in the ocfs2_init_security_and_acl() call chain.

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel
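For readers following the downconvert discussion above: what forces the downconvert is lock-mode compatibility between cluster nodes. A minimal sketch of the three modes involved in this thread (NL, PR, EX) — illustrative only, the kernel's DLM implements the full six-mode matrix:

```python
# Toy compatibility table for the three DLM lock modes discussed in this
# thread: NL (null), PR (protected read), EX (exclusive). Illustrative
# sketch; the real kernel DLM covers six modes.
COMPATIBLE = {
    ("NL", "NL"): True,  ("NL", "PR"): True,  ("NL", "EX"): True,
    ("PR", "NL"): True,  ("PR", "PR"): True,  ("PR", "EX"): False,
    ("EX", "NL"): True,  ("EX", "PR"): False, ("EX", "EX"): False,
}

def must_downconvert(held, requested):
    """A node holding `held` must downconvert its lock when another node
    requests an incompatible `requested` mode on the same resource."""
    return not COMPATIBLE[(held, requested)]
```

So in the scenario above, the node holding EX must downconvert to NL when ocfs2_recover_orphans() on another node asks for EX, which is how this node ends up modifying metadata under an NL lock.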
[Ocfs2-devel] [PATCH V2] ocfs2: Take inode cluster lock before moving reflinked inode from orphan dir
While reflinking an inode, we create a new inode in the orphan directory, then take an EX lock on it, reflink the original inode to the orphan inode, and release the EX lock. Once the lock is released, another node could request it in EX mode from ocfs2_recover_orphans(), which causes a downconvert of the lock, on this node, to NL mode.

Later we attempt to initialize the security acl for the orphan inode and move it to the reflink destination. However, while doing this we don't take an EX lock on the inode. This could potentially cause problems because we could be starting a transaction, accessing the journal and modifying metadata of the inode while holding an NL lock and with another node holding an EX lock on the inode.

Fix this by taking the orphan inode cluster lock in EX mode before initializing security and moving the orphan inode to the reflink destination. Use the __tracker variant while taking the inode lock to avoid recursive locking in the ocfs2_init_security_and_acl() call chain.

Signed-off-by: Ashish Samant

V1->V2: Modify the commit message to better reflect the problem in the upstream kernel.
---
 fs/ocfs2/refcounttree.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index ab156e3..1b1283f 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4250,10 +4250,11 @@ static int __ocfs2_reflink(struct dentry *old_dentry,
 static int ocfs2_reflink(struct dentry *old_dentry, struct inode *dir,
 			 struct dentry *new_dentry, bool preserve)
 {
-	int error;
+	int error, had_lock;
 	struct inode *inode = d_inode(old_dentry);
 	struct buffer_head *old_bh = NULL;
 	struct inode *new_orphan_inode = NULL;
+	struct ocfs2_lock_holder oh;
 
 	if (!ocfs2_refcount_tree(OCFS2_SB(inode->i_sb)))
 		return -EOPNOTSUPP;
@@ -4295,6 +4296,14 @@ static int ocfs2_reflink(struct dentry *old_dentry, struct inode *dir,
 		goto out;
 	}
 
+	had_lock = ocfs2_inode_lock_tracker(new_orphan_inode, NULL, 1,
+					    &oh);
+	if (had_lock < 0) {
+		error = had_lock;
+		mlog_errno(error);
+		goto out;
+	}
+
 	/* If the security isn't preserved, we need to re-initialize them. */
 	if (!preserve) {
 		error = ocfs2_init_security_and_acl(dir, new_orphan_inode,
@@ -4302,14 +4311,15 @@ static int ocfs2_reflink(struct dentry *old_dentry, struct inode *dir,
 		if (error)
 			mlog_errno(error);
 	}
-out:
 	if (!error) {
 		error = ocfs2_mv_orphaned_inode_to_new(dir, new_orphan_inode,
 						       new_dentry);
 		if (error)
 			mlog_errno(error);
 	}
+	ocfs2_inode_unlock_tracker(new_orphan_inode, 1, &oh, had_lock);
+out:
 	if (new_orphan_inode) {
 		/*
 		 * We need to open_unlock the inode no matter whether we
-- 
1.9.1
Re: [Ocfs2-devel] [PATCH] ocfs2: Take inode cluster lock before moving reflinked inode from orphan dir
On 04/02/2018 02:17 AM, Joseph Qi wrote: > > On 18/3/31 01:42, Ashish Samant wrote: >> While reflinking an inode, we create a new inode in orphan directory, then >> take EX lock on it, reflink the original inode to orphan inode and release >> EX lock. Once the lock is released another node might request it in PR mode >> which causes downconvert of the lock to PR mode. >> >> Later we attempt to initialize security acl for the orphan inode and move >> it to the reflink destination. However, while doing this we dont take EX >> lock on the inode. So effectively, we are doing this and accessing the >> journal for this inode while holding PR lock. While accessing the journal, >> we make >> >> ci->ci_last_trans = journal->j_trans_id >> >> At this point, if there is another downconvert request on this inode from >> another node (PR->NL), we will trip on the following condition in >> ocfs2_ci_checkpointed() >> >> BUG_ON(lockres->l_level != DLM_LOCK_EX && !checkpointed); >> >> because we hold the lock in PR mode and journal->j_trans_id is not greater >> than ci_last_trans for the inode. >> >> Fix this by taking orphan inode cluster lock in EX mode before >> initializing security and moving orphan inode to reflink destination. >> Use the __tracker variant while taking inode lock to avoid recursive >> locking in the ocfs2_init_security_and_acl() call chain. >> >> Signed-off-by: Ashish Samant > Looks good. > Reviewed-by: Joseph Qi Hi Joseph, Looks like mainline kernel has removed the inode lock call in PR mode for orphan inode in ocfs2_recover_orphans() and it directly takes the lock in EX mode. So, the patch commit log does not reflect the exact problem for mainline kernel. However, it is still possible that we might end up accessing the journal for the inode while holding inode lock in NL mode (instead of PR). 
It might not cause the same problem as described above, but it seems wrong to modify inode metadata with an NL lock held, and it could potentially cause similar problems later. So, I'll submit v2 with a modified commit message.

Thanks,
Ashish
[Ocfs2-devel] [PATCH] ocfs2: Take inode cluster lock before moving reflinked inode from orphan dir
While reflinking an inode, we create a new inode in the orphan directory, then take an EX lock on it, reflink the original inode to the orphan inode, and release the EX lock. Once the lock is released, another node might request it in PR mode, which causes a downconvert of the lock to PR mode.

Later we attempt to initialize the security acl for the orphan inode and move it to the reflink destination. However, while doing this we don't take an EX lock on the inode. So effectively, we are doing this and accessing the journal for this inode while holding a PR lock. While accessing the journal, we make

ci->ci_last_trans = journal->j_trans_id

At this point, if there is another downconvert request on this inode from another node (PR->NL), we will trip on the following condition in ocfs2_ci_checkpointed():

BUG_ON(lockres->l_level != DLM_LOCK_EX && !checkpointed);

because we hold the lock in PR mode and journal->j_trans_id is not greater than ci_last_trans for the inode.

Fix this by taking the orphan inode cluster lock in EX mode before initializing security and moving the orphan inode to the reflink destination. Use the __tracker variant while taking the inode lock to avoid recursive locking in the ocfs2_init_security_and_acl() call chain.
Signed-off-by: Ashish Samant
---
 fs/ocfs2/refcounttree.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index ab156e3..1b1283f 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -4250,10 +4250,11 @@ static int __ocfs2_reflink(struct dentry *old_dentry,
 static int ocfs2_reflink(struct dentry *old_dentry, struct inode *dir,
 			 struct dentry *new_dentry, bool preserve)
 {
-	int error;
+	int error, had_lock;
 	struct inode *inode = d_inode(old_dentry);
 	struct buffer_head *old_bh = NULL;
 	struct inode *new_orphan_inode = NULL;
+	struct ocfs2_lock_holder oh;
 
 	if (!ocfs2_refcount_tree(OCFS2_SB(inode->i_sb)))
 		return -EOPNOTSUPP;
@@ -4295,6 +4296,14 @@ static int ocfs2_reflink(struct dentry *old_dentry, struct inode *dir,
 		goto out;
 	}
 
+	had_lock = ocfs2_inode_lock_tracker(new_orphan_inode, NULL, 1,
+					    &oh);
+	if (had_lock < 0) {
+		error = had_lock;
+		mlog_errno(error);
+		goto out;
+	}
+
 	/* If the security isn't preserved, we need to re-initialize them. */
 	if (!preserve) {
 		error = ocfs2_init_security_and_acl(dir, new_orphan_inode,
@@ -4302,14 +4311,15 @@ static int ocfs2_reflink(struct dentry *old_dentry, struct inode *dir,
 		if (error)
 			mlog_errno(error);
 	}
-out:
 	if (!error) {
 		error = ocfs2_mv_orphaned_inode_to_new(dir, new_orphan_inode,
 						       new_dentry);
 		if (error)
 			mlog_errno(error);
 	}
+	ocfs2_inode_unlock_tracker(new_orphan_inode, 1, &oh, had_lock);
+out:
 	if (new_orphan_inode) {
 		/*
 		 * We need to open_unlock the inode no matter whether we
-- 
1.9.1
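The BUG_ON quoted in the commit message reduces to a small predicate; a hedged sketch follows (simplified from ocfs2_ci_checkpointed(); the transaction-id comparison is approximated, and the field names are taken from the commit message rather than the full kernel source):

```python
DLM_LOCK_EX = "EX"

def ci_checkpointed_ok(l_level, ci_last_trans, j_trans_id):
    """Sketch of the invariant behind the BUG_ON: an inode's metadata may
    only be non-checkpointed while we still hold its cluster lock in EX.
    If the lock was downconverted (PR or NL) while ci_last_trans is still
    current in the journal, the assertion fails."""
    checkpointed = j_trans_id > ci_last_trans
    # BUG_ON(lockres->l_level != DLM_LOCK_EX && !checkpointed)
    return l_level == DLM_LOCK_EX or checkpointed
```

This is why touching the journal for the orphan inode after the downconvert described above can trip the assertion on the next downconvert request.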
Re: [Ocfs2-devel] fstrim corrupts ocfs2 filesystems (become read-only) on SSD device which is managed by multipath
On 10/30/2017 07:07 PM, Gang He wrote:
> Hello Ashish,
>
> Just give my feedback according to the testing script,
> cat /trim_loop.sh
>
> LOG=./trim_loop.log
> DEV=/dev/dm-0
> MOUNTDIR=/mnt/shared
> BLOCKLIST="512 1K 2K 4K"
> CLUSTERLIST="4K 8K 16K 32K 64K 128K 256K 512K 1M"
> BLOCKSZ=1K
> CLUSTERSZ=1M
> set -x
> >> ${LOG}
> for CLUSTERSZ in ${CLUSTERLIST} ;
> do
> for BLOCKSZ in ${BLOCKLIST} ;
> do
> echo y | mkfs.ocfs2 -b ${BLOCKSZ} -C ${CLUSTERSZ} -N 4 ${DEV}
> mount ${DEV} ${MOUNTDIR}
> sleep 1
> fstrim -av || echo "`date` fstrim -av failed in -b ${BLOCKSZ} -C ${CLUSTERSZ}" >> ${LOG}
> sleep 1
> umount ${MOUNTDIR}
> done
> done
>
> I can reproduce this bug in some block/cluster size combinations.
> Mon Oct 30 10:49:05 CST 2017 fstrim -av failed in -b 4K -C 32K
> Mon Oct 30 10:49:11 CST 2017 fstrim -av failed in -b 512 -C 64K
> Mon Oct 30 10:49:21 CST 2017 fstrim -av failed in -b 1K -C 64K
> Mon Oct 30 10:49:37 CST 2017 fstrim -av failed in -b 2K -C 64K
> Mon Oct 30 10:50:03 CST 2017 fstrim -av failed in -b 4K -C 64K
> Mon Oct 30 10:50:10 CST 2017 fstrim -av failed in -b 512 -C 128K
> Mon Oct 30 10:50:19 CST 2017 fstrim -av failed in -b 1K -C 128K
> Mon Oct 30 10:50:36 CST 2017 fstrim -av failed in -b 2K -C 128K
> Mon Oct 30 10:51:02 CST 2017 fstrim -av failed in -b 4K -C 128K
> Mon Oct 30 10:51:08 CST 2017 fstrim -av failed in -b 512 -C 256K
> Mon Oct 30 10:51:18 CST 2017 fstrim -av failed in -b 1K -C 256K
> Mon Oct 30 10:51:34 CST 2017 fstrim -av failed in -b 2K -C 256K
> Mon Oct 30 10:52:00 CST 2017 fstrim -av failed in -b 4K -C 256K
> Mon Oct 30 10:52:07 CST 2017 fstrim -av failed in -b 512 -C 512K
> Mon Oct 30 10:52:16 CST 2017 fstrim -av failed in -b 1K -C 512K
> Mon Oct 30 10:52:33 CST 2017 fstrim -av failed in -b 2K -C 512K
> Mon Oct 30 10:52:59 CST 2017 fstrim -av failed in -b 4K -C 512K
> Mon Oct 30 10:53:06 CST 2017 fstrim -av failed in -b 512 -C 1M
> Mon Oct 30 10:53:15 CST 2017 fstrim -av failed in -b 1K -C 1M
> Mon Oct 30 10:53:32 CST 2017 fstrim -av failed in -b 2K -C 1M
> Mon Oct 30 10:53:58 CST 2017 fstrim -av failed in -b 4K -C 1M
>
> The patch can fix this bug, the test shell script can pass in all the cases.

Thanks for testing this, Gang.

-Ashish

>
> Thanks
> Gang
>
>
>> On 10/28/2017 12:44 AM, Gang He wrote:
>>> Hello Ashish,
>>> Thank for your reply.
>>> From the patch, it looks very related to this bug.
>>> But one thing, I feel a little confused.
>>> Why was I not able to reproduce this bug in local with a SSD disk?
>> Hmm, thats interesting. It could be that the driver for your disk is not
>> zeroing those blocks for some reason ...
>> You could try to simulate this by creating ocfs2 on a loop device and
>> running fstrim on it.
>> loop converts fstrim to fallocate and puches a hole in the range, so it
>> should zero out the range and cause corruption by zeroing the group descriptor.
>>
>>> There are any specific steps to reproduce this issue?
>> I was able to reproduce this with block size 4k and cluster size 1M. No
>> other special options.
>>
>> Thanks,
>> Ashish
>>> e.g. mount option for ocfs2? need to set SSD disk?
>>> According to the patch, the bug is not related to multipath configuration.
>>>
>>> Thanks
>>> Gang
>>>
>>>>>> Ashish Samant 10/28/17 2:06 AM >>>
>>> Hi Gang,
>>>
>>> The following patch sent to the list should fix the issue.
>>>
>>> https://patchwork.kernel.org/patch/10002583/
>>>
>>> Thanks,
>>> Ashish
>>>
>>> On 10/27/2017 02:47 AM, Gang He wrote:
>>>> Hello Guys,
>>>>
>>>> I got a bug from the customer, he said, fstrim command corrupted ocfs2 file
>>>> system on their SSD SAN, the file system became read-only and SSD LUN was
>>>> configured by multipath.
>>>> After umount the file system, the customer ran fsck.ocfs2 on this file
>>>> system, then the file system can be mounted until the next fstrim happens.
>>>> The error messages were likes,
>>>> 2017-10-02T00:00:00.334141+02:00 rz-xen10 systemd[1]: Starting Discard unused blocks...
>>>> 2017-10-02T00:00:00.383805+02:00 rz-xen10 fstrim[36615]: fstrim: /xensan1: >> FITRIM ioctl fehlgeschlagen: Das Dateisystem ist nur lesbar >>>> 2017-10-02T00:00:00.385233+02:00 rz-xen10 kernel: [1092967.091821] OCFS2: >>>> ERROR >> (device dm-5): ocfs2
Re: [Ocfs2-devel] fstrim corrupts ocfs2 filesystems (become read-only) on SSD device which is managed by multipath
On 10/28/2017 12:44 AM, Gang He wrote:
> Hello Ashish,
> Thank for your reply.
> From the patch, it looks very related to this bug.
> But one thing, I feel a little confused.
> Why was I not able to reproduce this bug in local with a SSD disk?

Hmm, that's interesting. It could be that the driver for your disk is not zeroing those blocks for some reason ... You could try to simulate this by creating ocfs2 on a loop device and running fstrim on it. loop converts fstrim to fallocate and punches a hole in the range, so it should zero out the range and cause corruption by zeroing the group descriptor.

> There are any specific steps to reproduce this issue?

I was able to reproduce this with block size 4k and cluster size 1M. No other special options.

Thanks,
Ashish

> e.g. mount option for ocfs2? need to set SSD disk?
> According to the patch, the bug is not related to multipath configuration.
>
> Thanks
> Gang
>
>>>> Ashish Samant 10/28/17 2:06 AM >>>
> Hi Gang,
>
> The following patch sent to the list should fix the issue.
>
> https://patchwork.kernel.org/patch/10002583/
>
> Thanks,
> Ashish
>
> On 10/27/2017 02:47 AM, Gang He wrote:
>> Hello Guys,
>>
>> I got a bug from the customer, he said, fstrim command corrupted ocfs2 file
>> system on their SSD SAN, the file system became read-only and SSD LUN was
>> configured by multipath.
>> After umount the file system, the customer ran fsck.ocfs2 on this file
>> system, then the file system can be mounted until the next fstrim happens.
>> The error messages were likes,
>> 2017-10-02T00:00:00.334141+02:00 rz-xen10 systemd[1]: Starting Discard unused blocks...
>> 2017-10-02T00:00:00.383805+02:00 rz-xen10 fstrim[36615]: fstrim: /xensan1: >> FITRIM ioctl fehlgeschlagen: Das Dateisystem ist nur lesbar >> 2017-10-02T00:00:00.385233+02:00 rz-xen10 kernel: [1092967.091821] OCFS2: >> ERROR (device dm-5): ocfs2_validate_gd_self: Group descriptor #8257536 has >> bad signature <<== here >> 2017-10-02T00:00:00.385251+02:00 rz-xen10 kernel: [1092967.091831] On-disk >> corruption discovered. Please run fsck.ocfs2 once the filesystem is >> unmounted. >> 2017-10-02T00:00:00.385254+02:00 rz-xen10 kernel: [1092967.091836] >> (fstrim,36615,5):ocfs2_trim_fs:7422 ERROR: status = -30 >> 2017-10-02T00:00:00.385854+02:00 rz-xen10 systemd[1]: fstrim.service: Main >> process exited, code=exited, status=32/n/a >> 2017-10-02T00:00:00.386756+02:00 rz-xen10 systemd[1]: Failed to start >> Discard unused blocks. >> 2017-10-02T00:00:00.387236+02:00 rz-xen10 systemd[1]: fstrim.service: Unit >> entered failed state. >> 2017-10-02T00:00:00.387601+02:00 rz-xen10 systemd[1]: fstrim.service: Failed >> with result 'exit-code'. >> >> The similar bug looks like >> https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.launchpad.net_ubuntu_-2Bsource_util-2Dlinux_-2Bbug_1681410&d=DwIFAg&c=RoP1YumCXCgaWHvlZYR8PQcxBKCX5YTpkKY057SbK10&r=f4ohdmGrYxZejY77yzx3eNgTHb1ZAfZytktjHqNVzc8&m=Jdo98IlzJDxBqiDEhsKfqxvEt4B6WpIbZ_woY7zmLFw&s=xp0bUwpDVIHZP9g4EboYYG_1gkenzWEt_O_5KZXyFg8&e= >> . >> Then, I tried to reproduce this bug in local. >> Since I have not a SSD SAN, I found a PC server which has a SSD disk. 
>> I setup a two nodes ocfs2 cluster in VM on this PC server, attach this SSD >> disk to each VM instance twice, then I can configure this SSD disk with >> multipath tool, >> the configuration on each node likes, >> sle12sp3-nd1:/ # multipath -l >> INTEL_SSDSA2M040G2GC_CVGB0490002C040NGN dm-0 ATA,INTEL SSDSA2M040 >> size=37G features='1 retain_attached_hw_handler' hwhandler='0' wp=rw >> |-+- policy='service-time 0' prio=0 status=active >> | `- 0:0:0:0 sda 8:0 active undef unknown >> `-+- policy='service-time 0' prio=0 status=enabled >> `- 0:0:0:1 sdb 8:16 active undef unknown >> >> Next, I do some fstrim command from each node simultaneously, >> I also do dd command to write data to the shared SSD disk during fstrim >> commands. >> But, I can not reproduce this issue, all the things go well. >> >> Then, I'd like to ping the list, did who ever encounter this bug? If yes, >> please help to provide some information. >> I think there are three factors which are related to this bug, SSD device >> type, multipath configuration and simultaneously fstrim. >> >> Thanks a lot. >> Gang >> >> >> >> >> >> ___ >> Ocfs2-devel mailing list >> Ocfs2-devel@oss.oracle.com >> https://oss.oracle.com/mailman/listinfo/ocfs2-devel >> > > > ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
Re: [Ocfs2-devel] fstrim corrupts ocfs2 filesystems (become read-only) on SSD device which is managed by multipath
Hi Gang, The following patch sent to the list should fix the issue. https://patchwork.kernel.org/patch/10002583/ Thanks, Ashish On 10/27/2017 02:47 AM, Gang He wrote: > Hello Guys, > > I got a bug from the customer, he said, fstrim command corrupted ocfs2 file > system on their SSD SAN, the file system became read-only and SSD LUN was > configured by multipath. > After umount the file system, the customer ran fsck.ocfs2 on this file > system, then the file system can be mounted until the next fstrim happens. > The error messages were likes, > 2017-10-02T00:00:00.334141+02:00 rz-xen10 systemd[1]: Starting Discard unused > blocks... > 2017-10-02T00:00:00.383805+02:00 rz-xen10 fstrim[36615]: fstrim: /xensan1: > FITRIM ioctl fehlgeschlagen: Das Dateisystem ist nur lesbar > 2017-10-02T00:00:00.385233+02:00 rz-xen10 kernel: [1092967.091821] OCFS2: > ERROR (device dm-5): ocfs2_validate_gd_self: Group descriptor #8257536 has > bad signature <<== here > 2017-10-02T00:00:00.385251+02:00 rz-xen10 kernel: [1092967.091831] On-disk > corruption discovered. Please run fsck.ocfs2 once the filesystem is unmounted. > 2017-10-02T00:00:00.385254+02:00 rz-xen10 kernel: [1092967.091836] > (fstrim,36615,5):ocfs2_trim_fs:7422 ERROR: status = -30 > 2017-10-02T00:00:00.385854+02:00 rz-xen10 systemd[1]: fstrim.service: Main > process exited, code=exited, status=32/n/a > 2017-10-02T00:00:00.386756+02:00 rz-xen10 systemd[1]: Failed to start Discard > unused blocks. > 2017-10-02T00:00:00.387236+02:00 rz-xen10 systemd[1]: fstrim.service: Unit > entered failed state. > 2017-10-02T00:00:00.387601+02:00 rz-xen10 systemd[1]: fstrim.service: Failed > with result 'exit-code'. 
> > The similar bug looks like > https://urldefense.proofpoint.com/v2/url?u=https-3A__bugs.launchpad.net_ubuntu_-2Bsource_util-2Dlinux_-2Bbug_1681410&d=DwIFAg&c=RoP1YumCXCgaWHvlZYR8PQcxBKCX5YTpkKY057SbK10&r=f4ohdmGrYxZejY77yzx3eNgTHb1ZAfZytktjHqNVzc8&m=Jdo98IlzJDxBqiDEhsKfqxvEt4B6WpIbZ_woY7zmLFw&s=xp0bUwpDVIHZP9g4EboYYG_1gkenzWEt_O_5KZXyFg8&e= > . > Then, I tried to reproduce this bug in local. > Since I have not a SSD SAN, I found a PC server which has a SSD disk. > I setup a two nodes ocfs2 cluster in VM on this PC server, attach this SSD > disk to each VM instance twice, then I can configure this SSD disk with > multipath tool, > the configuration on each node likes, > sle12sp3-nd1:/ # multipath -l > INTEL_SSDSA2M040G2GC_CVGB0490002C040NGN dm-0 ATA,INTEL SSDSA2M040 > size=37G features='1 retain_attached_hw_handler' hwhandler='0' wp=rw > |-+- policy='service-time 0' prio=0 status=active > | `- 0:0:0:0 sda 8:0 active undef unknown > `-+- policy='service-time 0' prio=0 status=enabled >`- 0:0:0:1 sdb 8:16 active undef unknown > > Next, I do some fstrim command from each node simultaneously, > I also do dd command to write data to the shared SSD disk during fstrim > commands. > But, I can not reproduce this issue, all the things go well. > > Then, I'd like to ping the list, did who ever encounter this bug? If yes, > please help to provide some information. > I think there are three factors which are related to this bug, SSD device > type, multipath configuration and simultaneously fstrim. > > Thanks a lot. > Gang > > > > > > ___ > Ocfs2-devel mailing list > Ocfs2-devel@oss.oracle.com > https://oss.oracle.com/mailman/listinfo/ocfs2-devel > ___ Ocfs2-devel mailing list Ocfs2-devel@oss.oracle.com https://oss.oracle.com/mailman/listinfo/ocfs2-devel
[Ocfs2-devel] [PATCH] ocfs2: fstrim: Fix start offset of first cluster group during fstrim
The first cluster group descriptor is not stored at the start of the group but at an offset from the start. We need to take this into account while doing fstrim on the first cluster group. Otherwise we will wrongly start fstrim a few blocks after the desired start block, and the range can cross over into the next cluster group and zero out the group descriptor there. This can cause filesystem corruption that cannot be fixed by fsck.

Signed-off-by: Ashish Samant
Cc: sta...@vger.kernel.org
---
 fs/ocfs2/alloc.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/fs/ocfs2/alloc.c b/fs/ocfs2/alloc.c
index a177eae..addd7c5 100644
--- a/fs/ocfs2/alloc.c
+++ b/fs/ocfs2/alloc.c
@@ -7304,13 +7304,24 @@ int ocfs2_truncate_inline(struct inode *inode, struct buffer_head *di_bh,
 
 static int ocfs2_trim_extent(struct super_block *sb,
 			     struct ocfs2_group_desc *gd,
-			     u32 start, u32 count)
+			     u64 group, u32 start, u32 count)
 {
 	u64 discard, bcount;
+	struct ocfs2_super *osb = OCFS2_SB(sb);
 
 	bcount = ocfs2_clusters_to_blocks(sb, count);
-	discard = le64_to_cpu(gd->bg_blkno) +
-			ocfs2_clusters_to_blocks(sb, start);
+	discard = ocfs2_clusters_to_blocks(sb, start);
+
+	/*
+	 * For the first cluster group, the gd->bg_blkno is not at the start
+	 * of the group, but at an offset from the start. If we add it while
+	 * calculating discard for first group, we will wrongly start fstrim a
+	 * few blocks after the desired start block and the range can cross
+	 * over into the next cluster group. So, add it only if this is not
+	 * the first cluster group.
+	 */
+	if (group != osb->first_cluster_group_blkno)
+		discard += le64_to_cpu(gd->bg_blkno);
 
 	trace_ocfs2_trim_extent(sb, (unsigned long long)discard, bcount);
 
@@ -7318,7 +7329,7 @@ static int ocfs2_trim_extent(struct super_block *sb,
 }
 
 static int ocfs2_trim_group(struct super_block *sb,
-			    struct ocfs2_group_desc *gd,
+			    struct ocfs2_group_desc *gd, u64 group,
 			    u32 start, u32 max, u32 minbits)
 {
 	int ret = 0, count = 0, next;
@@ -7337,7 +7348,7 @@ static int ocfs2_trim_group(struct super_block *sb,
 		next = ocfs2_find_next_bit(bitmap, max, start);
 		if ((next - start) >= minbits) {
-			ret = ocfs2_trim_extent(sb, gd,
+			ret = ocfs2_trim_extent(sb, gd, group,
 						start, next - start);
 			if (ret < 0) {
 				mlog_errno(ret);
@@ -7435,7 +7446,8 @@ int ocfs2_trim_fs(struct super_block *sb, struct fstrim_range *range)
 	}
 
 	gd = (struct ocfs2_group_desc *)gd_bh->b_data;
-	cnt = ocfs2_trim_group(sb, gd, first_bit, last_bit, minlen);
+	cnt = ocfs2_trim_group(sb, gd, group,
+			       first_bit, last_bit, minlen);
 	brelse(gd_bh);
 	gd_bh = NULL;
 	if (cnt < 0) {
-- 
1.9.1
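The arithmetic the patch changes can be sketched in a few lines. This is an illustrative model only — the function names and the toy block numbers below are assumptions for illustration, not ocfs2's actual layout constants:

```python
def trim_discard_start(gd_blkno, start_blocks, first_group_blkno):
    """Block at which discard should begin for one cluster group.

    Mirrors the fixed ocfs2_trim_extent() logic: the first cluster group
    starts at block 0 of the filesystem, but its group descriptor
    (gd_blkno) sits at an offset *inside* the group, so the descriptor
    block must not be added to the start offset for that group.
    """
    if gd_blkno == first_group_blkno:
        return start_blocks            # first group: offset is already absolute
    return gd_blkno + start_blocks     # later groups: descriptor marks group start

def trim_discard_start_buggy(gd_blkno, start_blocks):
    """Pre-patch computation: gd_blkno added unconditionally, so the
    first group's trim range overshoots and can reach into the next
    group's descriptor."""
    return gd_blkno + start_blocks
```

With a (hypothetical) first descriptor at block 64, trimming from offset 8 of the first group should start at block 8, but the buggy version starts at block 72 — shifted past the intended range, which is how the next group's descriptor gets zeroed.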
[Ocfs2-devel] [PATCH] ocfs2: Fix double put of refcount tree in ocfs2_lock_refcount_tree()
In ocfs2_lock_refcount_tree(), if ocfs2_read_refcount_block() returns an error, we do ocfs2_refcount_tree_put() twice (once in ocfs2_unlock_refcount_tree() and once outside it), thereby reducing the refcount of the refcount tree twice, but we don't delete the tree in this case. This will make the refcnt of the tree = 0, and ocfs2_refcount_tree_put() will eventually call ocfs2_mark_lockres_freeing(), setting OCFS2_LOCK_FREEING for the refcount_tree->rf_lockres.

The error returned by ocfs2_read_refcount_block() is propagated all the way back, and for the next iteration of write, ocfs2_lock_refcount_tree() gets the same tree back from ocfs2_get_refcount_tree() because we haven't deleted the tree. Now we have the same tree, but OCFS2_LOCK_FREEING is set for rf_lockres, and eventually, when _ocfs2_lock_refcount_tree() is called in this iteration,

BUG_ON(__ocfs2_cluster_lock:1395 ERROR: Cluster lock called on freeing lockres T0386019775b08d! flags 0x81)

is triggered.

Call stack:

(loop16,11155,0):ocfs2_lock_refcount_tree:482 ERROR: status = -5
(loop16,11155,0):ocfs2_refcount_cow_hunk:3497 ERROR: status = -5
(loop16,11155,0):ocfs2_refcount_cow:3560 ERROR: status = -5
(loop16,11155,0):ocfs2_prepare_inode_for_refcount:2111 ERROR: status = -5
(loop16,11155,0):ocfs2_prepare_inode_for_write:2190 ERROR: status = -5
(loop16,11155,0):ocfs2_file_write_iter:2331 ERROR: status = -5
(loop16,11155,0):__ocfs2_cluster_lock:1395 ERROR: bug expression: lockres->l_flags & OCFS2_LOCK_FREEING
(loop16,11155,0):__ocfs2_cluster_lock:1395 ERROR: Cluster lock called on freeing lockres T0386019775b08d! flags 0x81
[ cut here ]
kernel BUG at fs/ocfs2/dlmglue.c:1395!
invalid opcode: [#1] SMP
CPU 0
Modules linked in: tun ocfs2 jbd2 xen_blkback xen_netback xen_gntdev .. sd_mod crc_t10dif ext3 jbd mbcache
RIP: e030:[] [] __ocfs2_cluster_lock+0x31c/0x740 [ocfs2]
RSP: e02b:88017c0138a0 EFLAGS: 00010086
Process loop16 (pid: 11155, threadinfo 88017c01, task 8801b5374300)
Stack:
 88017bd25880 0081 00017c013920 88017c013960
 001d 0001 88017bd258b4 880172006000
 a07fa410 88017bd202b4
Call Trace:
 [] ocfs2_refcount_lock+0xae/0x130 [ocfs2]
 [] ? __ocfs2_lock_refcount_tree+0x29/0xe0 [ocfs2]
 [] ? _raw_spin_lock+0xe/0x20
 [] __ocfs2_lock_refcount_tree+0x29/0xe0 [ocfs2]
 [] ocfs2_lock_refcount_tree+0xdd/0x320 [ocfs2]
 [] ocfs2_refcount_cow_hunk+0x1cb/0x440 [ocfs2]
 [] ocfs2_refcount_cow+0xa9/0x1d0 [ocfs2]
 [] ? ocfs2_prepare_inode_for_refcount+0x67/0x200 [ocfs2]
 [] ocfs2_prepare_inode_for_refcount+0x115/0x200 [ocfs2]
 [] ? ocfs2_inode_unlock+0xd4/0x140 [ocfs2]
 [] ocfs2_prepare_inode_for_write+0x33b/0x470 [ocfs2]
 [] ? ocfs2_rw_lock+0x80/0x190 [ocfs2]
 [] ocfs2_file_write_iter+0x220/0x8c0 [ocfs2]
 [] ? mempool_free_slab+0x17/0x20
 [] ? bio_free+0x61/0x70
 [] ? aio_kernel_free+0xe/0x10
 [] aio_write_iter+0x2e/0x30

Fix this by avoiding the second call to ocfs2_refcount_tree_put().

Signed-off-by: Ashish Samant
---
 fs/ocfs2/refcounttree.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/ocfs2/refcounttree.c b/fs/ocfs2/refcounttree.c
index 92bbe93..a9d4102 100644
--- a/fs/ocfs2/refcounttree.c
+++ b/fs/ocfs2/refcounttree.c
@@ -478,7 +478,6 @@ again:
 	if (ret) {
 		mlog_errno(ret);
 		ocfs2_unlock_refcount_tree(osb, tree, rw);
-		ocfs2_refcount_tree_put(tree);
 		goto out;
 	}
 
-- 
1.9.1
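The get/put imbalance the patch removes can be shown with a toy reference-counting model. This is a deliberately simplified sketch (the class and helper names are inventions for illustration, not ocfs2 structures):

```python
class RefcountTree:
    """Toy model of the bug: an error path that puts twice for a single
    get drives the refcount to zero while the tree is still cached, so a
    later lookup returns a tree already marked as freeing."""
    def __init__(self):
        self.refs = 1          # reference held by the cache (tree not deleted)
        self.freeing = False

    def get(self):
        self.refs += 1

    def put(self):
        self.refs -= 1
        if self.refs == 0:
            self.freeing = True   # analogue of setting OCFS2_LOCK_FREEING

def buggy_error_path(tree):
    tree.get()   # ocfs2_lock_refcount_tree takes a reference
    tree.put()   # ocfs2_unlock_refcount_tree drops it on error
    tree.put()   # the extra put removed by this patch

def fixed_error_path(tree):
    tree.get()
    tree.put()   # single put balances the single get
```

After the buggy path, the cached tree's refcount is 0 and it is marked freeing, so the next write's lock attempt on the same tree trips the BUG_ON; the fixed path leaves the cache reference intact.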
Re: [Ocfs2-devel] [PATCH] ocfs2: Fix start offset to ocfs2_zero_range_for_truncate()
On 09/14/2016 03:43 PM, Andrew Morton wrote:
> On Thu, 11 Aug 2016 16:12:27 -0700 Ashish Samant
> wrote:
>
>> If we do fallocate with punch hole option on a reflink, with start offset
>> on a cluster boundary and end offset somewhere in another cluster, we
>> dont COW the first cluster starting at the start offset. But in this
>> case, we were wrongly passing this cluster to
>> ocfs2_zero_range_for_truncate() to zero out.
>>
>> Fix this by skipping this cluster in such a scenario.
> How serious is this bug? It sounds like a data-corrupting error? As
> such, this is a high priority fix and it should be backported into the
> -stable kernels?
>
> Please always include such info when fixing bugs.

Yes, it is quite serious, I should have cc'ed stable. Will do it going forward.

Thanks,
Ashish
Re: [Ocfs2-devel] [PATCH v2] ocfs2: Fix start offset to ocfs2_zero_range_for_truncate()
Hi Eric, I am able to reproduce this on 4.8.0-rc3 as well. Can you try again and issue a sync between fallocate and dd? On 08/30/2016 12:38 AM, Eric Ren wrote: > Hi, > > I'm on 4.8.0-rc3 kernel. Hope someone else can double-confirm this;-) > > On 08/30/2016 12:11 PM, Ashish Samant wrote: >> Hmm, thats weird. I see this on 4.7 kernel without the patch: >> >> # xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile >> wrote 10485760/10485760 bytes at offset 0 >> 10 MiB, 2560 ops; 0. sec (683.995 MiB/sec and 175102.5992 ops/sec) >> # reflink -f 10MBfile reflnktest >> # fallocate -p -o 0 -l 1048615 reflnktest >> # dd if=10MBfile iflag=direct bs=1048576 count=1 | hexdump -C >> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >> || >> * >> 1+0 records in >> 1+0 records out >> 1048576 bytes (1.0 MB) copied, 0.0321517 s, 32.6 MB/s >> 0010 >> >> and with patch >> >> # dd if=10MBfile iflag=direct bs=1M count=1 | hexdump -C >> cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd >> || > > I'm not familiar with this code. So why is the output "cd ..."? > because we didn't write anything > into "10MBfile". Is it a magic number when reading from a hole? No, "cd" is what xfs_io wrote into the file. Those are the original contents of the file which are overwritten by 0 in the first cluster because of this bug. Thanks, Ashish > > Eric > >> * >> 1+0 records in >> 1+0 records out >> 0010 > > > >> >> Thanks, >> Ashish >> >> >> On 08/29/2016 08:33 PM, Eric Ren wrote: >>> Hello, >>> >>> On 08/30/2016 03:23 AM, Ashish Samant wrote: >>>> Hi Eric, >>>> >>>> The easiest way to reproduce this is : >>>> >>>> 1. Create a random file of say 10 MB >>>> xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile >>>> 2. Reflink it >>>> reflink -f 10MBfile reflnktest >>>> 3. Punch a hole at starting at cluster boundary with range greater >>>> that 1MB. You can also use a range that will put the end offset in >>>> another extent. >>>> fallocate -p -o 0 -l 1048615 reflnktest >>>> 4. sync >>>> 5. 
Check the first cluster in the source file. (It will be zeroed >>>> out). >>>>dd if=10MBfile iflag=direct bs= count=1 | hexdump -C >>> >>> Thanks! I have a try myself, but I'm not sure what is our expected >>> output and if the test result meet >>> it: >>> >>> 1. After applying this patch: >>> ocfs2dev1:/mnt/ocfs2 # rm 10MBfile reflnktest >>> ocfs2dev1:/mnt/ocfs2 # xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile >>> wrote 10485760/10485760 bytes at offset 0 >>> 10 MiB, 2560 ops; 0. sec (1.089 GiB/sec and 285427.5839 ops/sec) >>> ocfs2dev1:/mnt/ocfs2 # reflink -f 10MBfile reflnktest >>> ocfs2dev1:/mnt/ocfs2 # fallocate -p -o 0 -l 1048615 reflnktest >>> ocfs2dev1:/mnt/ocfs2 # dd if=10MBfile iflag=direct bs=1048576 >>> count=1 | hexdump -C >>> cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd >>> || >>> * >>> 1+0 records in >>> 1+0 records out >>> 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0952464 s, 11.0 MB/s >>> 0010 >>> >>> 2. Before this patch: >>> >>> ocfs2dev1:/mnt/ocfs2 # dd if=10MBfile iflag=direct bs=1048576 >>> count=1 | hexdump -C >>> cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd >>> || >>> * >>> 1+0 records in >>> 1+0 records out >>> 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0954648 s, 11.0 MB/s >>> 0010 >>> >>> 3. debugfs.ocfs2 -R stats /dev/sdb >>> ... >>> Block Size Bits: 12 Cluster Size Bits: 20 >>> ... >>> >>> Eric >>>> >>>> Thanks, >>>> Ashish >>>> >>>> On 08/28/2016 10:39 PM, Eric Ren wrote: >>>>> Hi, >>>>> >>>>> Thanks for this fix. I'd like to reproduce this issue locally and >>>>> test this patch, >>>>> could you elaborate the detailed steps of reproduction? >>>>> >>>>> Thanks, >>>>> Eric >>>>> >>>>> On 08/27/2016 07:04 AM, Ashish Samant wrote: >>>>>> If we punch a hole on a reflink su
Re: [Ocfs2-devel] [PATCH v2] ocfs2: Fix start offset to ocfs2_zero_range_for_truncate()
Hmm, that's weird. I see this on the 4.7 kernel without the patch:

# xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile
wrote 10485760/10485760 bytes at offset 0
10 MiB, 2560 ops; 0. sec (683.995 MiB/sec and 175102.5992 ops/sec)
# reflink -f 10MBfile reflnktest
# fallocate -p -o 0 -l 1048615 reflnktest
# dd if=10MBfile iflag=direct bs=1048576 count=1 | hexdump -C
00000000  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0321517 s, 32.6 MB/s
00100000

and with the patch:

# dd if=10MBfile iflag=direct bs=1M count=1 | hexdump -C
00000000  cd cd cd cd cd cd cd cd  cd cd cd cd cd cd cd cd  |................|
*
1+0 records in
1+0 records out
00100000

Thanks,
Ashish

On 08/29/2016 08:33 PM, Eric Ren wrote:
> Hello,
>
> On 08/30/2016 03:23 AM, Ashish Samant wrote:
>> Hi Eric,
>>
>> The easiest way to reproduce this is:
>>
>> 1. Create a random file of, say, 10 MB:
>>    xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile
>> 2. Reflink it:
>>    reflink -f 10MBfile reflnktest
>> 3. Punch a hole starting at a cluster boundary, with a range greater
>>    than 1 MB. You can also use a range that will put the end offset in
>>    another extent:
>>    fallocate -p -o 0 -l 1048615 reflnktest
>> 4. sync
>> 5. Check the first cluster in the source file. (It will be zeroed out.)
>>    dd if=10MBfile iflag=direct bs= count=1 | hexdump -C
>
> Thanks! I had a try myself, but I'm not sure what our expected
> output is and whether the test result meets it:
>
> 1. After applying this patch:
> ocfs2dev1:/mnt/ocfs2 # rm 10MBfile reflnktest
> ocfs2dev1:/mnt/ocfs2 # xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile
> wrote 10485760/10485760 bytes at offset 0
> 10 MiB, 2560 ops; 0. sec (1.089 GiB/sec and 285427.5839 ops/sec)
> ocfs2dev1:/mnt/ocfs2 # reflink -f 10MBfile reflnktest
> ocfs2dev1:/mnt/ocfs2 # fallocate -p -o 0 -l 1048615 reflnktest
> ocfs2dev1:/mnt/ocfs2 # dd if=10MBfile iflag=direct bs=1048576 count=1 | hexdump -C
> 00000000  cd cd cd cd cd cd cd cd  cd cd cd cd cd cd cd cd  |................|
> *
> 1+0 records in
> 1+0 records out
> 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0952464 s, 11.0 MB/s
> 00100000
>
> 2. Before this patch:
>
> ocfs2dev1:/mnt/ocfs2 # dd if=10MBfile iflag=direct bs=1048576 count=1 | hexdump -C
> 00000000  cd cd cd cd cd cd cd cd  cd cd cd cd cd cd cd cd  |................|
> *
> 1+0 records in
> 1+0 records out
> 1048576 bytes (1.0 MB, 1.0 MiB) copied, 0.0954648 s, 11.0 MB/s
> 00100000
>
> 3. debugfs.ocfs2 -R stats /dev/sdb
> ...
> Block Size Bits: 12   Cluster Size Bits: 20
> ...
>
> Eric
>>
>> Thanks,
>> Ashish
>>
>> On 08/28/2016 10:39 PM, Eric Ren wrote:
>>> Hi,
>>>
>>> Thanks for this fix. I'd like to reproduce this issue locally and
>>> test this patch. Could you elaborate on the detailed steps to
>>> reproduce it?
>>>
>>> Thanks,
>>> Eric
>>>
>>> On 08/27/2016 07:04 AM, Ashish Samant wrote:
>>>> If we punch a hole on a reflink such that the following conditions
>>>> are met:
>>>>
>>>> 1. start offset is on a cluster boundary
>>>> 2. end offset is not on a cluster boundary
>>>> 3. (end offset is somewhere in another extent) or
>>>>    (hole range > MAX_CONTIG_BYTES(1MB)),
>>>>
>>>> we don't COW the first cluster starting at the start offset. But in
>>>> this case, we were wrongly passing this cluster to
>>>> ocfs2_zero_range_for_truncate() to zero out. This will modify the
>>>> cluster in place and zero it in the source too.
>>>>
>>>> Fix this by skipping this cluster in such a scenario.
>>>>
>>>> Reported-by: Saar Maoz
>>>> Signed-off-by: Ashish Samant
>>>> Reviewed-by: Srinivas Eeda
>>>> ---
>>>> v1->v2:
>>>> - Changed the commit msg to include a better and generic description
>>>>   of the problem, for all cluster sizes.
>>>> - Added Reported-by and Reviewed-by tags.
>>>>
>>>>  fs/ocfs2/file.c | 34 ++++++++++++++++++++++++----------
>>>>  1 file changed, 24 insertions(+), 10 deletions(-)
>>>>
>>>> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
>>>> index 4e7b0dc..0b055bf 100644
>>>> --- a/fs/ocfs2/file.c
>>>> +++ b/fs/ocfs2/file.c
>>>> @@ -1506,7 +1506,8 @@ static int ocfs2_zero_partial_clusters(struct inode *inode,
Re: [Ocfs2-devel] [PATCH v2] ocfs2: Fix start offset to ocfs2_zero_range_for_truncate()
Hi Eric,

The easiest way to reproduce this is:

1. Create a random file of, say, 10 MB:
   xfs_io -c 'pwrite -b 4k 0 10M' -f 10MBfile
2. Reflink it:
   reflink -f 10MBfile reflnktest
3. Punch a hole starting at a cluster boundary, with a range greater
   than 1 MB. You can also use a range that will put the end offset in
   another extent:
   fallocate -p -o 0 -l 1048615 reflnktest
4. sync
5. Check the first cluster in the source file. (It will be zeroed out.)
   dd if=10MBfile iflag=direct bs= count=1 | hexdump -C

Thanks,
Ashish

On 08/28/2016 10:39 PM, Eric Ren wrote:
> Hi,
>
> Thanks for this fix. I'd like to reproduce this issue locally and test
> this patch. Could you elaborate on the detailed steps to reproduce it?
>
> Thanks,
> Eric
>
> On 08/27/2016 07:04 AM, Ashish Samant wrote:
>> If we punch a hole on a reflink such that the following conditions are met:
>>
>> 1. start offset is on a cluster boundary
>> 2. end offset is not on a cluster boundary
>> 3. (end offset is somewhere in another extent) or
>>    (hole range > MAX_CONTIG_BYTES(1MB)),
>>
>> we don't COW the first cluster starting at the start offset. But in this
>> case, we were wrongly passing this cluster to
>> ocfs2_zero_range_for_truncate() to zero out. This will modify the
>> cluster in place and zero it in the source too.
>>
>> Fix this by skipping this cluster in such a scenario.
>>
>> Reported-by: Saar Maoz
>> Signed-off-by: Ashish Samant
>> Reviewed-by: Srinivas Eeda
>> ---
>> v1->v2:
>> - Changed the commit msg to include a better and generic description of
>>   the problem, for all cluster sizes.
>> - Added Reported-by and Reviewed-by tags.
>>
>>  fs/ocfs2/file.c | 34 ++++++++++++++++++++++++----------
>>  1 file changed, 24 insertions(+), 10 deletions(-)
>>
>> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
>> index 4e7b0dc..0b055bf 100644
>> --- a/fs/ocfs2/file.c
>> +++ b/fs/ocfs2/file.c
>> @@ -1506,7 +1506,8 @@ static int ocfs2_zero_partial_clusters(struct inode *inode,
>>                                         u64 start, u64 len)
>>  {
>>          int ret = 0;
>> -        u64 tmpend, end = start + len;
>> +        u64 tmpend = 0;
>> +        u64 end = start + len;
>>          struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
>>          unsigned int csize = osb->s_clustersize;
>>          handle_t *handle;
>> @@ -1538,18 +1539,31 @@ static int ocfs2_zero_partial_clusters(struct inode *inode,
>>          }
>>
>>          /*
>> -         * We want to get the byte offset of the end of the 1st cluster.
>> +         * If start is on a cluster boundary and end is somewhere in another
>> +         * cluster, we have not COWed the cluster starting at start, unless
>> +         * end is also within the same cluster. So, in this case, we skip this
>> +         * first call to ocfs2_zero_range_for_truncate() and move on
>> +         * to the next one.
>>           */
>> -        tmpend = (u64)osb->s_clustersize + (start & ~(osb->s_clustersize - 1));
>> -        if (tmpend > end)
>> -                tmpend = end;
>> +        if ((start & (csize - 1)) != 0) {
>> +                /*
>> +                 * We want to get the byte offset of the end of the 1st
>> +                 * cluster.
>> +                 */
>> +                tmpend = (u64)osb->s_clustersize +
>> +                        (start & ~(osb->s_clustersize - 1));
>> +                if (tmpend > end)
>> +                        tmpend = end;
>>
>> -        trace_ocfs2_zero_partial_clusters_range1((unsigned long long)start,
>> -                                                 (unsigned long long)tmpend);
>> +                trace_ocfs2_zero_partial_clusters_range1(
>> +                        (unsigned long long)start,
>> +                        (unsigned long long)tmpend);
>>
>> -        ret = ocfs2_zero_range_for_truncate(inode, handle, start, tmpend);
>> -        if (ret)
>> -                mlog_errno(ret);
>> +                ret = ocfs2_zero_range_for_truncate(inode, handle, start,
>> +                                                    tmpend);
>> +                if (ret)
>> +                        mlog_errno(ret);
>> +        }
>>
>>          if (tmpend < end) {
>>                  /*

___
Ocfs2-devel mailing list
Ocfs2-devel@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-devel
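The reproducer's numbers are chosen to hit all three conditions from the commit message at once, assuming the 1 MB cluster size shown by Eric's debugfs.ocfs2 output (Cluster Size Bits: 20). A quick arithmetic check, not ocfs2 code:

```python
# Conditions from the commit message, checked for the reproducer's
# fallocate -p -o 0 -l 1048615 call on a volume with 1 MB clusters.
MAX_CONTIG_BYTES = 1 << 20   # 1 MB, per the commit message
csize = 1 << 20              # cluster size: "Cluster Size Bits: 20"

start, length = 0, 1048615
end = start + length

cond1 = (start & (csize - 1)) == 0   # start offset on a cluster boundary
cond2 = (end & (csize - 1)) != 0     # end offset not on a cluster boundary
cond3 = length > MAX_CONTIG_BYTES    # hole range > MAX_CONTIG_BYTES (1 MB)

# All three hold, so the first cluster is not COWed -- and the buggy code
# then zeroes that still-shared cluster in place.
assert cond1 and cond2 and cond3
```

Note that 1048615 is only 39 bytes past the 1 MB boundary: just enough to make the hole range exceed MAX_CONTIG_BYTES while leaving the end offset unaligned.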
[Ocfs2-devel] [PATCH v2] ocfs2: Fix start offset to ocfs2_zero_range_for_truncate()
If we punch a hole on a reflink such that the following conditions are met:

1. start offset is on a cluster boundary
2. end offset is not on a cluster boundary
3. (end offset is somewhere in another extent) or
   (hole range > MAX_CONTIG_BYTES(1MB)),

we don't COW the first cluster starting at the start offset. But in this
case, we were wrongly passing this cluster to
ocfs2_zero_range_for_truncate() to zero out. This will modify the cluster
in place and zero it in the source too.

Fix this by skipping this cluster in such a scenario.

Reported-by: Saar Maoz
Signed-off-by: Ashish Samant
Reviewed-by: Srinivas Eeda
---
v1->v2:
- Changed the commit msg to include a better and generic description of
  the problem, for all cluster sizes.
- Added Reported-by and Reviewed-by tags.

 fs/ocfs2/file.c | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 4e7b0dc..0b055bf 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1506,7 +1506,8 @@ static int ocfs2_zero_partial_clusters(struct inode *inode,
                                        u64 start, u64 len)
 {
         int ret = 0;
-        u64 tmpend, end = start + len;
+        u64 tmpend = 0;
+        u64 end = start + len;
         struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
         unsigned int csize = osb->s_clustersize;
         handle_t *handle;
@@ -1538,18 +1539,31 @@ static int ocfs2_zero_partial_clusters(struct inode *inode,
         }

         /*
-         * We want to get the byte offset of the end of the 1st cluster.
+         * If start is on a cluster boundary and end is somewhere in another
+         * cluster, we have not COWed the cluster starting at start, unless
+         * end is also within the same cluster. So, in this case, we skip this
+         * first call to ocfs2_zero_range_for_truncate() and move on
+         * to the next one.
          */
-        tmpend = (u64)osb->s_clustersize + (start & ~(osb->s_clustersize - 1));
-        if (tmpend > end)
-                tmpend = end;
+        if ((start & (csize - 1)) != 0) {
+                /*
+                 * We want to get the byte offset of the end of the 1st
+                 * cluster.
+                 */
+                tmpend = (u64)osb->s_clustersize +
+                        (start & ~(osb->s_clustersize - 1));
+                if (tmpend > end)
+                        tmpend = end;

-        trace_ocfs2_zero_partial_clusters_range1((unsigned long long)start,
-                                                 (unsigned long long)tmpend);
+                trace_ocfs2_zero_partial_clusters_range1(
+                        (unsigned long long)start,
+                        (unsigned long long)tmpend);

-        ret = ocfs2_zero_range_for_truncate(inode, handle, start, tmpend);
-        if (ret)
-                mlog_errno(ret);
+                ret = ocfs2_zero_range_for_truncate(inode, handle, start,
+                                                    tmpend);
+                if (ret)
+                        mlog_errno(ret);
+        }

         if (tmpend < end) {
                 /*
--
1.9.1
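The patch's effect on the first zero range can be modeled outside the kernel. This Python sketch reimplements just the tmpend arithmetic from ocfs2_zero_partial_clusters() before and after the patch (the function and variable names here are mine, not ocfs2's):

```python
def first_zero_range(start, length, csize):
    """Model of the first-cluster logic in ocfs2_zero_partial_clusters().

    Returns (old, new): the byte range the pre-patch code passes to
    ocfs2_zero_range_for_truncate(), and the range the patched code
    passes (None when the patched code skips the call entirely).
    """
    end = start + length
    # Byte offset of the end of the 1st cluster, clamped to end
    # (mirrors: tmpend = csize + (start & ~(csize - 1))).
    tmpend = min(csize + (start & ~(csize - 1)), end)
    old = (start, tmpend)
    # Patched code only zeroes a partial head cluster, i.e. when
    # start is NOT cluster-aligned.
    new = None if (start & (csize - 1)) == 0 else old
    return old, new

csize = 1 << 20  # 1 MB clusters, as in the reproducer

# The bug: start on a boundary -> old code zeroes the whole (shared,
# never-COWed) first cluster; patched code skips it.
old, new = first_zero_range(0, 1048615, csize)
assert old == (0, 1048576) and new is None

# Unaligned start: both versions zero the partial head, as intended.
old, new = first_zero_range(4096, 8192, csize)
assert old == new == (4096, 12288)
```

The second case shows why the fix is safe: whenever a partial head cluster actually exists, it has been COWed by the earlier refcount handling, and both versions zero the same range.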
[Ocfs2-devel] [PATCH] ocfs2: Fix start offset to ocfs2_zero_range_for_truncate()
If we do fallocate with the punch hole option on a reflink, with start
offset on a cluster boundary and end offset somewhere in another cluster,
we don't COW the first cluster starting at the start offset. But in this
case, we were wrongly passing this cluster to
ocfs2_zero_range_for_truncate() to zero out.

Fix this by skipping this cluster in such a scenario.

Signed-off-by: Ashish Samant
---
 fs/ocfs2/file.c | 34 ++++++++++++++++++++++++----------
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
index 4a6e130..ab305aa 100644
--- a/fs/ocfs2/file.c
+++ b/fs/ocfs2/file.c
@@ -1522,7 +1522,8 @@ static int ocfs2_zero_partial_clusters(struct inode *inode,
                                        u64 start, u64 len)
 {
         int ret = 0;
-        u64 tmpend, end = start + len;
+        u64 tmpend = 0;
+        u64 end = start + len;
         struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
         unsigned int csize = osb->s_clustersize;
         handle_t *handle;
@@ -1554,18 +1555,31 @@ static int ocfs2_zero_partial_clusters(struct inode *inode,
         }

         /*
-         * We want to get the byte offset of the end of the 1st cluster.
+         * If start is on a cluster boundary and end is somewhere in another
+         * cluster, we have not COWed the cluster starting at start, unless
+         * end is also within the same cluster. So, in this case, we skip this
+         * first call to ocfs2_zero_range_for_truncate() and move on
+         * to the next one.
          */
-        tmpend = (u64)osb->s_clustersize + (start & ~(osb->s_clustersize - 1));
-        if (tmpend > end)
-                tmpend = end;
+        if ((start & (csize - 1)) != 0) {
+                /*
+                 * We want to get the byte offset of the end of the 1st
+                 * cluster.
+                 */
+                tmpend = (u64)osb->s_clustersize +
+                        (start & ~(osb->s_clustersize - 1));
+                if (tmpend > end)
+                        tmpend = end;

-        trace_ocfs2_zero_partial_clusters_range1((unsigned long long)start,
-                                                 (unsigned long long)tmpend);
+                trace_ocfs2_zero_partial_clusters_range1(
+                        (unsigned long long)start,
+                        (unsigned long long)tmpend);

-        ret = ocfs2_zero_range_for_truncate(inode, handle, start, tmpend);
-        if (ret)
-                mlog_errno(ret);
+                ret = ocfs2_zero_range_for_truncate(inode, handle, start,
+                                                    tmpend);
+                if (ret)
+                        mlog_errno(ret);
+        }

         if (tmpend < end) {
                 /*
--
1.9.1