Re: [PATCH] Btrfs: clear received_uuid field for new writable snapshots
On Wed, 22 May 2013 13:37:15 +0300, Alex Lyakas wrote:
> Hi Stefan,
> I fully understand the first part of your fix, and I believe it's
> quite critical. Indeed, a writable snapshot should have no evidence
> that it has an ancestor that was once received.
>
> Can you please confirm that I understand the second part of your fix?
> In btrfs-progs the following code in tree_search() would have
> prevented us from mistakenly selecting such a snapshot as a parent
> for receive:
>
>     if (type == subvol_search_by_received_uuid) {
>         entry = rb_entry(n, struct subvol_info, rb_received_node);
>         comp = memcmp(entry->received_uuid, uuid, BTRFS_UUID_SIZE);
>         if (!comp) {
>             if (entry->stransid < stransid)
>                 comp = -1;
>             else if (entry->stransid > stransid)
>                 comp = 1;
>             else
>                 comp = 0;
>         }
>
> The code checks both received_uuid (which would have been wrongly
> equal to what we need) and the stransid (which was the ctransid on
> the send side), which would have been zero, so it wouldn't match.
>
> Now after your fix, the stransid field is no longer needed, correct?
> Because if we have a valid received_uuid, this means that either we
> are the received snapshot, or our whole chain of ancestors is
> read-only, and eventually there was an ancestor that was received.
> So we have valid data and can be used as a parent. Is it still needed
> after your fix to check the stransid field? (It doesn't hurt to check
> it.)

Hi Alex,

Yes, the code in tree_search() that evaluates the stransid field can be
removed if compatibility of a new btrfs-progs release with an old kernel
is not a concern. In the improved send/receive code (which uses the UUID
tree and the root tree to retrieve all the information it needs, see
"[PATCH v3 0/4] Btrfs-progs: speedup btrfs send/receive"), this code is
removed and stransid is not evaluated anymore. The evaluation was only
useful as a workaround for the bug that the received_uuid field was not
cleared for writable snapshots.

> Clearing/not clearing the rtransid - does it bring any value?
> rtransid is the local transid of when we had completed the receive
> process for this snap. Is there any interesting usage of this value?

There's no code that makes use of the rtransid field. But since a
read-only snapshot is identical to its parent, there is no need to clear
the field while creating a read-only snapshot. And since I changed this
for the stransid field (which is evaluated in the current btrfs-progs
code), I changed all related fields at the same time, even those that
are not evaluated anywhere.

> On Wed, Apr 17, 2013 at 12:11 PM, Stefan Behrens
> sbehr...@giantdisaster.de wrote:
>> For created snapshots, the full root_item is copied from the source
>> root and afterwards selectively modified. The current code forgets
>> to clear the field received_uuid. The only problem is that it is
>> confusing when you look at it with 'btrfs subv list', since for
>> writable snapshots, the contents of the snapshot can be completely
>> unrelated to the previously received snapshot. The receiver ignores
>> such snapshots anyway because it also checks the field stransid in
>> the root_item, and that value used to be reset to zero for all
>> created snapshots.
>>
>> This commit changes two things:
>> - clear the received_uuid field for new writable snapshots.
>> - don't clear the send/receive related information like the stransid
>>   for read-only snapshots (which makes them usable as a parent for
>>   the automatic selection of parents in the receive code).
>>
>> Signed-off-by: Stefan Behrens sbehr...@giantdisaster.de
>> ---
>>  fs/btrfs/transaction.c | 12 ++++++++----
>>  1 file changed, 8 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
>> index ffac232..94cbd10 100644
>> --- a/fs/btrfs/transaction.c
>> +++ b/fs/btrfs/transaction.c
>> @@ -1170,13 +1170,17 @@ static noinline int create_pending_snapshot(struct btrfs_trans_handle *trans,
>>  	memcpy(new_root_item->uuid, new_uuid.b, BTRFS_UUID_SIZE);
>>  	memcpy(new_root_item->parent_uuid, root->root_item.uuid,
>>  			BTRFS_UUID_SIZE);
>> +	if (!(root_flags & BTRFS_ROOT_SUBVOL_RDONLY)) {
>> +		memset(new_root_item->received_uuid, 0,
>> +		       sizeof(new_root_item->received_uuid));
>> +		memset(&new_root_item->stime, 0, sizeof(new_root_item->stime));
>> +		memset(&new_root_item->rtime, 0, sizeof(new_root_item->rtime));
>> +		btrfs_set_root_stransid(new_root_item, 0);
>> +		btrfs_set_root_rtransid(new_root_item, 0);
>> +	}
>>  	new_root_item->otime.sec = cpu_to_le64(cur_time.tv_sec);
>>  	new_root_item->otime.nsec = cpu_to_le32(cur_time.tv_nsec);
>>  	btrfs_set_root_otransid(new_root_item, trans->transid);
>> -	memset(&new_root_item->stime, 0, sizeof(new_root_item->stime));
>> -
Problem with btrfs send/receive
Hi everyone,

I was trying the new send/receive feature today but can't make it work.
These are the commands I was using:

    btrfs subvol snap -r /mnt/data1/@downloads/ /mnt/data1/snapshots/testsnap
    btrfs send /mnt/data1/snapshots/testsnap | btrfs receive /mnt/data1/snapshots/testreceive/

The second command never finishes. An 'ls /mnt/data1/snapshots/testreceive/'
never finishes either. After killing the send/receive process one can see
that the target subvolume was created but is empty.

Sending the snapshot to a file and using that file for receiving does work.

What am I missing?

Kind regards,
Felix
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Problem with btrfs send/receive
On Thu, 23 May 2013 13:51:59 +0200, Felix Blanke wrote:
> Hi everyone,
>
> I was trying the new send/receive feature today but can't make it work.
> These are the commands I was using:
>
>     btrfs subvol snap -r /mnt/data1/@downloads/ /mnt/data1/snapshots/testsnap
>     btrfs send /mnt/data1/snapshots/testsnap | btrfs receive /mnt/data1/snapshots/testreceive/
>
> This command never finishes. A 'ls /mnt/data1/snapshots/testreceive/'
> never finishes too. After killing the send/receive process one can see
> that the target subvolume was created but is empty.
>
> Sending the snapshot to a file and use this for receiving does work.
>
> What am I missing?

That's a known design flaw in the btrfs send code.

'btrfs send /subvol | sleep 666' is an easier way to block the system.
In that case it is at least interruptible: terminating the sleep task
unblocks it.

btrfs send blocks on the pipe while it is in the kernel. The call chain
looks like this:

    btrfs_ioctl_send -> send_subvol -> full_send_tree -> changed_cb ->
    send_cmd -> write_buf -> vfs_write -> pipe_write -> pipe_wait

full_send_tree() has called btrfs_join_transaction() before that, so
first the whole filesystem and then the whole system is blocked.

You can avoid this by receiving to a different filesystem or by
redirecting the output of btrfs send into a file. In general, the output
of btrfs send must never be blocked or you are lost.
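The workaround named in this reply (redirect the output of btrfs send into a file, then receive from that file) can be sketched as a small shell function. This is only an illustration: the function name send_via_file and its three arguments are hypothetical, and it relies on nothing beyond btrfs send writing its stream to stdout and btrfs receive reading it from stdin, as in the commands quoted above.

```shell
#!/bin/bash
# Hypothetical helper, not part of btrfs-progs: write the send stream to a
# regular file first and receive from that file afterwards, so that the
# output of "btrfs send" never waits on a stalled pipe reader.
#   $1 = read-only snapshot to send
#   $2 = path of the stream file (chosen by the caller)
#   $3 = directory to receive into
send_via_file() {
    local snapshot=$1 stream=$2 target=$3

    # Step 1: let "btrfs send" run to completion into the file.
    btrfs send "$snapshot" > "$stream" || return 1

    # Step 2: only now start the receiver, reading from the file.
    btrfs receive "$target" < "$stream"
}
```

With the paths from the report this would be called as `send_via_file /mnt/data1/snapshots/testsnap /some/path/testsnap.stream /mnt/data1/snapshots/testreceive/`, where the stream file path is a placeholder. Whether the receive step then succeeds on the same filesystem is taken from the reporter's observation, not guaranteed by the sketch.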
Re: Problem with btrfs send/receive
Hi,

thanks for pointing it out. You are right: sending the snapshot to a
different btrfs fixes the problem.

Interesting :) It's not what you would expect, but for my purposes it's
ok, because in the end I'll send it to a different btrfs anyway.

Felix
raid6: rmw writes all the time?
Hi all,

we got a new test system here and I just tested btrfs raid6 on it as
well. Write performance is slightly lower than hw-raid (LSI megasas) and
md-raid6, but it would probably be much better than either of the two if
it didn't read all the time during the writes. Is this a known issue?
This is with linux-3.9.2.

Thanks,
Bernd
Re: raid6: rmw writes all the time?
Quoting Bernd Schubert (2013-05-23 08:55:47)
> Hi all,
>
> we got a new test system here and I just also tested btrfs raid6 on
> that. Write performance is slightly lower than hw-raid (LSI megasas)
> and md-raid6, but it probably would be much better than any of these
> two, if it wouldn't read all the during the writes. Is this a known
> issue? This is with linux-3.9.2.

Hi Bernd,

Any time you do a write smaller than a full stripe, we'll have to do a
read/modify/write cycle to satisfy it. This is true of md raid6 and the
hw-raid as well, but their reads don't show up in vmstat (try iostat
instead).

So the bigger question is where are your small writes coming from. If
they are metadata, you can use raid1 for the metadata.

-chris
Re: raid6: rmw writes all the time?
On 05/23/2013 03:11 PM, Chris Mason wrote:
> Quoting Bernd Schubert (2013-05-23 08:55:47)
>> Hi all,
>>
>> we got a new test system here and I just also tested btrfs raid6 on
>> that. Write performance is slightly lower than hw-raid (LSI megasas)
>> and md-raid6, but it probably would be much better than any of these
>> two, if it wouldn't read all the during the writes. Is this a known
>> issue? This is with linux-3.9.2.
>
> Hi Bernd,
>
> Any time you do a write smaller than a full stripe, we'll have to do a
> read/modify/write cycle to satisfy it. This is true of md raid6 and
> the hw-raid as well, but their reads don't show up in vmstat (try
> iostat instead).

Yeah, I know and I'm using iostat already. md raid6 does not do rmw, but
does not fill the device queue, afaik it flushes the underlying devices
quickly as it does not have barrier support - that is another topic, but
was the reason why I started to test btrfs.

> So the bigger question is where are your small writes coming from. If
> they are metadata, you can use raid1 for the metadata.

I used this command

    /tmp/mkfs.btrfs -L test2 -f -d raid6 -m raid10 /dev/sd[m-x]

so meta-data should be raid10. And I'm using this iozone command:

    iozone -e -i0 -i1 -r1m -l 5 -u 5 -s20g -+n \
        -F /data/fhgfs/storage/md126/testfile1 /data/fhgfs/storage/md126/testfile2 /data/fhgfs/storage/md126/testfile3 \
        /data/fhgfs/storage/md127/testfile1 /data/fhgfs/storage/md127/testfile2 /data/fhgfs/storage/md127/testfile3

Higher IO sizes (e.g. -r16m) don't make a difference, it goes through
the page cache anyway. I'm not familiar with btrfs code at all, but
maybe writepages() submits too small IOs?

Hrmm, just wanted to try direct IO, but then just noticed it went into
RO mode before already:

May 23 14:59:33 c8220a kernel: WARNING: at fs/btrfs/super.c:255 __btrfs_abort_transaction+0xdf/0x100 [btrfs]()
May 23 14:59:33 c8220a kernel: [8105db76] warn_slowpath_fmt+0x46/0x50
May 23 14:59:33 c8220a kernel: [a0b5428a] ? btrfs_free_path+0x2a/0x40 [btrfs]
May 23 14:59:33 c8220a kernel: [a0b4e18f] __btrfs_abort_transaction+0xdf/0x100 [btrfs]
May 23 14:59:33 c8220a kernel: [a0b70b2f] btrfs_save_ino_cache+0x22f/0x310 [btrfs]
May 23 14:59:33 c8220a kernel: [a0b793e2] commit_fs_roots+0xd2/0x1c0 [btrfs]
May 23 14:59:33 c8220a kernel: [815eb3fe] ? mutex_lock+0x1e/0x50
May 23 14:59:33 c8220a kernel: [a0b7a555] btrfs_commit_transaction+0x495/0xa40 [btrfs]
May 23 14:59:33 c8220a kernel: [a0b7af7b] ? start_transaction+0xab/0x4d0 [btrfs]
May 23 14:59:33 c8220a kernel: [81082f30] ? wake_up_bit+0x40/0x40
May 23 14:59:33 c8220a kernel: [a0b72b96] transaction_kthread+0x1a6/0x220 [btrfs]
May 23 14:59:33 c8220a kernel: ---[ end trace 3d91874abeab5984 ]---
May 23 14:59:33 c8220a kernel: BTRFS error (device sdx) in btrfs_save_ino_cache:471: error 28
May 23 14:59:33 c8220a kernel: btrfs is forced readonly
May 23 14:59:33 c8220a kernel: BTRFS warning (device sdx): Skipping commit of aborted transaction.
May 23 14:59:33 c8220a kernel: BTRFS error (device sdx) in cleanup_transaction:1455: error 28

errno 28 - out of disk space?

Going to recreate it and will play with it later on again.

Thanks,
Bernd
Re: raid6: rmw writes all the time?
Quoting Bernd Schubert (2013-05-23 09:22:41)
> On 05/23/2013 03:11 PM, Chris Mason wrote:
>> Quoting Bernd Schubert (2013-05-23 08:55:47)
>>> Hi all,
>>>
>>> we got a new test system here and I just also tested btrfs raid6 on
>>> that. Write performance is slightly lower than hw-raid (LSI megasas)
>>> and md-raid6, but it probably would be much better than any of these
>>> two, if it wouldn't read all the during the writes. Is this a known
>>> issue? This is with linux-3.9.2.
>>
>> Hi Bernd,
>>
>> Any time you do a write smaller than a full stripe, we'll have to do
>> a read/modify/write cycle to satisfy it. This is true of md raid6 and
>> the hw-raid as well, but their reads don't show up in vmstat (try
>> iostat instead).
>
> Yeah, I know and I'm using iostat already. md raid6 does not do rmw,
> but does not fill the device queue, afaik it flushes the underlying
> devices quickly as it does not have barrier support - that is another
> topic, but was the reason why I started to test btrfs.

md should support barriers with recent kernels. You might want to verify
with blktrace that md raid6 isn't doing r/m/w.

>> So the bigger question is where are your small writes coming from. If
>> they are metadata, you can use raid1 for the metadata.
>
> I used this command
>
>     /tmp/mkfs.btrfs -L test2 -f -d raid6 -m raid10 /dev/sd[m-x]

Ok, the stripe size is 64KB, so you want to do IO in multiples of 64KB
times the number of devices on the FS. If you have 13 devices, that's
832K.

Using buffered writes makes it much more likely the VM will break up the
IOs as they go down. The btrfs writepages code does try to do full
stripe IO, and it also caches stripes as the IO goes down. But for
buffered IO it is surprisingly hard to get a 100% hit rate on full
stripe IO at larger stripe sizes.

> so meta-data should be raid10. And I'm using this iozone command:
>
>     iozone -e -i0 -i1 -r1m -l 5 -u 5 -s20g -+n \
>         -F /data/fhgfs/storage/md126/testfile1 /data/fhgfs/storage/md126/testfile2 /data/fhgfs/storage/md126/testfile3 \
>         /data/fhgfs/storage/md127/testfile1 /data/fhgfs/storage/md127/testfile2 /data/fhgfs/storage/md127/testfile3
>
> Higher IO sizes (e.g. -r16m) don't make a difference, it goes through
> the page cache anyway. I'm not familiar with btrfs code at all, but
> maybe writepages() submits too small IOs?
>
> Hrmm, just wanted to try direct IO, but then just noticed it went into
> RO mode before already:

Direct IO will make it easier to get full stripe writes. I thought I had
fixed this abort, but it is just running out of space to write the inode
cache. For now, please just don't mount with the inode cache enabled,
I'll send in a fix for the next rc.

-chris
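The full-stripe arithmetic in this reply can be sketched in a few lines of shell. The function names are hypothetical. The first function follows the rule of thumb exactly as stated in the message (64KB times the number of devices). The second rests on an assumption that is not spelled out in the thread: raid6 keeps two parity stripes per full stripe, so only (devices - 2) stripes carry file data. Note also that the glob /dev/sd[m-x] in the quoted mkfs command names 12 devices, not 13. Treat the numbers as an illustration, not as what the kernel computes.

```shell
#!/bin/bash
# Stripe element size in KB, as given in the thread.
STRIPE_KB=64

# Rule of thumb from the message: stripe size times number of devices.
#   $1 = number of devices
full_stripe_kb() {
    echo $(( STRIPE_KB * $1 ))
}

# Assumption (not from the thread): two of the stripes in every full
# raid6 stripe hold parity, so the data portion is (devices - 2) stripes.
#   $1 = number of devices
data_per_full_stripe_kb() {
    echo $(( STRIPE_KB * ($1 - 2) ))
}

full_stripe_kb 13            # prints 832, the figure quoted in the message
data_per_full_stripe_kb 12   # prints 640 for the 12 devices in sd[m-x]
```

Under the second assumption, an iozone record size of 1m (-r1m) is not a multiple of 640K, which would be consistent with partial-stripe writes even before the page cache splits the IO further.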
Re: raid6: rmw writes all the time?
On 23/05/2013 15:22, Bernd Schubert wrote:
> Yeah, I know and I'm using iostat already. md raid6 does not do rmw,
> but does not fill the device queue, afaik it flushes the underlying
> devices quickly as it does not have barrier support - that is another
> topic, but was the reason why I started to test btrfs.

MD raid6 DOES have barrier support!
Re: Virtual Device Support (N-way mirror code)
On Tuesday, 21 May 2013, 13:19:31, Martin wrote:
> Yep, ReiserFS has stood the test of time very well and I'm still using
> and abusing it still on various servers all the way from something
> like a decade ago!

Very interesting. I only used it for a short time and it worked. But
co-workers lost several ReiserFS filesystems completely.

Well, if you search for the terms "corrupt" and your favorite
filesystem, you will always find hits.

Anyway, I wouldn't use ReiserFS 3 today, for several reasons:

1) It is no longer actively developed, only maintained. I know that for
some this might be a reason to use it, but I think it increases the risk
of breakage rather than reducing it. That said, I haven't heard of any
breakage, and JFS is also in maintenance mode and appears to work as
well.

2) To my knowledge, fsck.reiserfs cannot tell apart the filesystem being
checked and any ReiserFS 3 filesystems stored in virtual machine image
files on it, and happily mixes them together into one big mess.

3) To my knowledge, mount times of large partitions can be quite long
with ReiserFS 3.

That said, I am using BTRFS on my main laptop, even for /home, after
having used it on several other machines for more than a year. Apart
from that weird scrub issue, which I fixed by recreating the filesystem
(the rsync backup appeared to be okay), I am ready to trust my data to
BTRFS. My backup hard disks are BTRFS as well.

I like BTRFS for several reasons; two that immediately come to mind:

1) It can prove to me that the data is intact. I find this rather
valuable.

2) Thanks to snapshots I now have, well, snapshots of my backups, and
even of my /home on the SSD. I am not yet creating them in an automated
way, but I do use them.

Ciao,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
Re: raid6: rmw writes all the time?
On 05/23/2013 03:41 PM, Bob Marley wrote:
> On 23/05/2013 15:22, Bernd Schubert wrote:
>> Yeah, I know and I'm using iostat already. md raid6 does not do rmw,
>> but does not fill the device queue, afaik it flushes the underlying
>> devices quickly as it does not have barrier support - that is another
>> topic, but was the reason why I started to test btrfs.
>
> MD raid6 DOES have barrier support!

For underlying devices yes, but it does not further use it for
additional buffering.
[PATCH] xfstests: btrfs/308: simple sparse copy testcase for btrfs
From: Koen De Wit koen.de@oracle.com

# Tests file clone functionality of btrfs ("reflinks"):
#   - Reflink a file
#   - Reflink the reflinked file
#   - Modify the original file
#   - Modify the reflinked file

[sandeen: add helpers, make several mostly-cosmetic changes to the
original testcase]

Signed-off-by: Koen De Wit koen.de@oracle.com
Signed-off-by: Eric Sandeen sand...@redhat.com
---
Originally submitted as test 297

diff --git a/common/rc b/common/rc
index fe6bbfc..4560715 100644
--- a/common/rc
+++ b/common/rc
@@ -2098,6 +2098,27 @@ _require_dumpe2fs()
 	fi
 }
 
+_require_cp_reflink()
+{
+	cp --help | grep -q reflink || \
+		_notrun "This test requires a cp with --reflink support."
+}
+
+# Given 2 files, verify that they have the same mapping but different
+# inodes - i.e. an undisturbed reflink
+# Silent if so, make noise if not
+_verify_reflink()
+{
+	# not a hard link or symlink?
+	cmp -s <(stat -c '%i' $1) <(stat -c '%i' $2) \
+		&& echo "$1 and $2 are not reflinks: same inode number"
+
+	# same mapping?
+	diff -u <($XFS_IO_PROG -F -c fiemap $1 | grep -v $1) \
+		<($XFS_IO_PROG -F -c fiemap $2 | grep -v $2) \
+		|| echo "$1 and $2 are not reflinks: different extents"
+}
+
 _create_loop_device()
 {
 	file=$1
diff --git a/tests/btrfs/308 b/tests/btrfs/308
new file mode 100755
index 000..1bb8f02
--- /dev/null
+++ b/tests/btrfs/308
@@ -0,0 +1,87 @@
+#! /bin/bash
+# FS QA Test No. btrfs/308
+#
+# Tests file clone functionality of btrfs ("reflinks"):
+#   - Reflink a file
+#   - Reflink the reflinked file
+#   - Modify the original file
+#   - Modify the reflinked file
+#
+#---
+# Copyright (c) 2013, Oracle and/or its affiliates.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+    cd /
+    rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. common/rc
+. common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+
+_require_xfs_io_fiemap
+_require_cp_reflink
+
+TESTDIR1=$TEST_DIR/test-$seq
+rm -rf $TESTDIR1
+mkdir $TESTDIR1
+
+_checksum_files() {
+    for F in original copy1 copy2
+    do
+        md5sum $TESTDIR1/$F | _filter_test_dir
+    done
+}
+
+rm -f $seqres.full
+
+echo "Create the original file and reflink to copy1, copy2"
+$XFS_IO_PROG -F -f -c 'pwrite -S 0x61 0 9000' $TESTDIR1/original >> $seqres.full 2>&1
+cp --reflink $TESTDIR1/original $TESTDIR1/copy1
+cp --reflink $TESTDIR1/copy1 $TESTDIR1/copy2
+_verify_reflink $TESTDIR1/original $TESTDIR1/copy1
+_verify_reflink $TESTDIR1/original $TESTDIR1/copy2
+echo "Original md5sums:"
+_checksum_files
+
+echo "Overwrite original file with new data"
+$XFS_IO_PROG -c 'pwrite -S 0x62 0 9000' $TESTDIR1/original >> $seqres.full 2>&1
+echo "md5sums after overwriting original:"
+_checksum_files
+
+echo "Overwrite copy1 with different new data"
+$XFS_IO_PROG -c 'pwrite -S 0x63 0 9000' $TESTDIR1/copy1 >> $seqres.full 2>&1
+echo "md5sums after overwriting copy1:"
+_checksum_files
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/308.out b/tests/btrfs/308.out
new file mode 100644
index 000..7bccf08
--- /dev/null
+++ b/tests/btrfs/308.out
@@ -0,0 +1,16 @@
+QA output created by 308
+Create the original file and reflink to copy1, copy2
+Original md5sums:
+42d69d1a6d333a7ebdf64792a555e392  TEST_DIR/test-308/original
+42d69d1a6d333a7ebdf64792a555e392  TEST_DIR/test-308/copy1
+42d69d1a6d333a7ebdf64792a555e392  TEST_DIR/test-308/copy2
+Overwrite original file with new data
+md5sums after overwriting original:
+4a847a25439532bf48b68c9e9536ed5b  TEST_DIR/test-308/original
+42d69d1a6d333a7ebdf64792a555e392  TEST_DIR/test-308/copy1
+42d69d1a6d333a7ebdf64792a555e392  TEST_DIR/test-308/copy2
+Overwrite copy1 with different new data
+md5sums after overwriting copy1:
+4a847a25439532bf48b68c9e9536ed5b  TEST_DIR/test-308/original
+e271cd47d9f62ebc96cb4e67ae4d16db  TEST_DIR/test-308/copy1
+42d69d1a6d333a7ebdf64792a555e392  TEST_DIR/test-308/copy2
diff --git a/tests/btrfs/group b/tests/btrfs/group
index bc6c256..f24628c
[PATCH] xfstests: btrfs/309: sparse copy of a directory tree on btrfs
# Tests file clone functionality of btrfs ("reflinks") on directory trees.
#   - Create directory and subdirectory, each having one file
#   - Create 2 recursive reflinked copies of the tree
#   - Modify the original files
#   - Modify one of the copies

[sandeen: mostly cosmetic changes]

Signed-off-by: Koen De Wit koen.de@oracle.com
Signed-off-by: Eric Sandeen sand...@redhat.com
---
Originally submitted as:
xfstests: 298: sparse copy of a directory tree on btrfs

diff --git a/tests/btrfs/309 b/tests/btrfs/309
new file mode 100755
index 000..b3927ba
--- /dev/null
+++ b/tests/btrfs/309
@@ -0,0 +1,98 @@
+#! /bin/bash
+# FS QA Test No. btrfs/309
+#
+# Tests file clone functionality of btrfs ("reflinks") on directory trees.
+#   - Create directory and subdirectory, each having one file
+#   - Create 2 recursive reflinked copies of the tree
+#   - Modify the original files
+#   - Modify one of the copies
+#
+#---
+# Copyright (c) 2013, Oracle and/or its affiliates.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+    cd /
+    rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. common/rc
+. common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+
+_require_xfs_io_fiemap
+_require_cp_reflink
+
+TESTDIR1=$TEST_DIR/test-$seq
+rm -rf $TESTDIR1
+mkdir $TESTDIR1
+
+_checksum_files() {
+    for F in original/file1 original/subdir/file2 \
+             copy1/file1 copy1/subdir/file2 \
+             copy2/file1 copy2/subdir/file2
+    do
+        md5sum $TESTDIR1/$F | _filter_test_dir
+    done
+}
+
+rm -f $seqres.full
+
+mkdir $TESTDIR1/original
+mkdir $TESTDIR1/original/subdir
+
+echo "Create the original files and reflink dirs"
+$XFS_IO_PROG -F -f -c 'pwrite -S 0x61 0 9000' $TESTDIR1/original/file1 >> $seqres.full 2>&1
+$XFS_IO_PROG -F -f -c 'pwrite -S 0x62 0 11000' $TESTDIR1/original/subdir/file2 >> $seqres.full 2>&1
+cp --recursive --reflink $TESTDIR1/original $TESTDIR1/copy1
+cp --recursive --reflink $TESTDIR1/copy1 $TESTDIR1/copy2
+
+_verify_reflink $TESTDIR1/original/file1 $TESTDIR1/copy1/file1
+_verify_reflink $TESTDIR1/original/subdir/file2 $TESTDIR1/copy1/subdir/file2
+_verify_reflink $TESTDIR1/original/file1 $TESTDIR1/copy2/file1
+_verify_reflink $TESTDIR1/original/subdir/file2 $TESTDIR1/copy2/subdir/file2
+
+echo "Original md5sums:"
+_checksum_files
+
+echo "Overwrite original/file1 and original/subdir/file2 with new data"
+$XFS_IO_PROG -c 'pwrite -S 0x63 0 13000' $TESTDIR1/original/file1 >> $seqres.full 2>&1
+$XFS_IO_PROG -c 'pwrite -S 0x64 5000 1000' $TESTDIR1/original/subdir/file2 >> $seqres.full 2>&1
+echo "md5sums now:"
+_checksum_files
+
+echo "Overwrite copy1/file1 and copy1/subdir/file2 with new data"
+$XFS_IO_PROG -c 'pwrite -S 0x65 0 9000' $TESTDIR1/copy1/file1 >> $seqres.full 2>&1
+$XFS_IO_PROG -c 'pwrite -S 0x66 5000 25000' $TESTDIR1/copy1/subdir/file2 >> $seqres.full 2>&1
+echo "md5sums now:"
+_checksum_files
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/309.out b/tests/btrfs/309.out
new file mode 100644
index 000..93197d8
--- /dev/null
+++ b/tests/btrfs/309.out
@@ -0,0 +1,25 @@
+QA output created by 309
+Create the original files and reflink dirs
+Original md5sums:
+42d69d1a6d333a7ebdf64792a555e392  TEST_DIR/test-309/original/file1
+ca390545f0aedb54b808d6128c56a7dd  TEST_DIR/test-309/original/subdir/file2
+42d69d1a6d333a7ebdf64792a555e392  TEST_DIR/test-309/copy1/file1
+ca390545f0aedb54b808d6128c56a7dd  TEST_DIR/test-309/copy1/subdir/file2
+42d69d1a6d333a7ebdf64792a555e392  TEST_DIR/test-309/copy2/file1
+ca390545f0aedb54b808d6128c56a7dd  TEST_DIR/test-309/copy2/subdir/file2
+Overwrite original/file1 and original/subdir/file2 with new data
+md5sums now:
+260f6785c0537fd12582dcae28a3c1a9  TEST_DIR/test-309/original/file1
+b8d91fb600f6f2191f2ba5374860  TEST_DIR/test-309/original/subdir/file2
+42d69d1a6d333a7ebdf64792a555e392  TEST_DIR/test-309/copy1/file1
+ca390545f0aedb54b808d6128c56a7dd  TEST_DIR/test-309/copy1/subdir/file2
+42d69d1a6d333a7ebdf64792a555e392  TEST_DIR/test-309/copy2/file1
[PATCH] xfstests: btrfs/310: moving and deleting sparse copies on btrfs
# Moving and deleting cloned ("reflinked") files on btrfs:
#   - Create a file and a reflink
#   - Move both to a directory
#   - Delete the original (moved) file, check that the copy still exists.

[sandeen: mostly cosmetic changes]

Signed-off-by: Koen De Wit koen.de@oracle.com
Signed-off-by: Eric Sandeen sand...@redhat.com
---
Originally submitted as:
xfstests: 299: moving and deleting sparse copies on btrfs

diff --git a/tests/btrfs/310 b/tests/btrfs/310
new file mode 100755
index 000..f87e782
--- /dev/null
+++ b/tests/btrfs/310
@@ -0,0 +1,79 @@
+#! /bin/bash
+# FS QA Test No. btrfs/310
+#
+# Moving and deleting cloned ("reflinked") files on btrfs:
+#   - Create a file and a reflink
+#   - Move both to a directory
+#   - Delete the original (moved) file, check that the copy still exists.
+#
+#---
+# Copyright (c) 2013, Oracle and/or its affiliates.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+
+seq=`basename $0`
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+    cd /
+    rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+
+_require_xfs_io_fiemap
+_require_cp_reflink
+
+rm -f $seqres.full
+
+TESTDIR1=$TEST_DIR/test-$seq
+rm -rf $TESTDIR1
+mkdir $TESTDIR1
+
+echo "Create the original files and reflink dirs"
+$XFS_IO_PROG -f -c 'pwrite -S 0x61 0 9000' $TESTDIR1/original >> $seqres.full
+cp --reflink $TESTDIR1/original $TESTDIR1/copy
+
+_verify_reflink $TESTDIR1/original $TESTDIR1/copy
+
+echo "Move orig & reflink copy to subdir and md5sum:"
+mkdir $TESTDIR1/subdir
+mv $TESTDIR1/original $TESTDIR1/subdir/original_moved
+mv $TESTDIR1/copy $TESTDIR1/subdir/copy_moved
+_verify_reflink $TESTDIR1/subdir/original_moved $TESTDIR1/subdir/copy_moved
+
+md5sum $TESTDIR1/subdir/original_moved | _filter_test_dir
+md5sum $TESTDIR1/subdir/copy_moved | _filter_test_dir
+
+echo "remove orig from subdir and md5sum reflink copy:"
+rm $TESTDIR1/subdir/original_moved
+md5sum $TESTDIR1/subdir/copy_moved | _filter_test_dir
+rm -rf $TESTDIR1/subdir
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/310.out b/tests/btrfs/310.out
new file mode 100644
index 000..dae889d
--- /dev/null
+++ b/tests/btrfs/310.out
@@ -0,0 +1,7 @@
+QA output created by 310
+Create the original files and reflink dirs
+Move orig & reflink copy to subdir and md5sum:
+42d69d1a6d333a7ebdf64792a555e392  TEST_DIR/test-310/subdir/original_moved
+42d69d1a6d333a7ebdf64792a555e392  TEST_DIR/test-310/subdir/copy_moved
+remove orig from subdir and md5sum reflink copy:
+42d69d1a6d333a7ebdf64792a555e392  TEST_DIR/test-310/subdir/copy_moved
diff --git a/tests/btrfs/group b/tests/btrfs/group
index a5bd6aa..bd624c4 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -11,3 +11,4 @@
 307 auto quick
 308 auto quick reflink
 309 auto quick reflink
+310 auto quick reflink
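The first half of the _verify_reflink helper used by these tests (two names must not share an inode number, yet must hold identical data) can be tried outside xfstests with a sketch like the one below. The function name distinct_inodes_same_data is hypothetical, and the sketch assumes GNU stat (for -c '%i') and cmp. It deliberately does not check that the two files share extents; that is what the fiemap comparison in _verify_reflink is for, so passing this check alone does not prove that a file is a reflink.

```shell
#!/bin/bash
# Hypothetical helper, not part of xfstests: succeed only if the two paths
# are different inodes (so not hard links of each other) whose contents
# are byte-for-byte identical. Shared extents are NOT checked here.
#   $1, $2 = the two files to compare
distinct_inodes_same_data() {
    local ino1 ino2
    ino1=$(stat -c '%i' "$1") || return 1
    ino2=$(stat -c '%i' "$2") || return 1

    # Same inode number: this is a hard link, not a separate copy.
    [ "$ino1" != "$ino2" ] || return 1

    # Different inodes: the data must still match.
    cmp -s "$1" "$2"
}
```

A call such as `distinct_inodes_same_data original copy || echo "original and copy differ or are hard links"` mirrors the "silent if fine, noisy if not" convention of the xfstests helper.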
[PATCH] xfstests: btrfs/311: sparse copy between different filesystems/mountpoints on btrfs
From: Koen De Wit koen.de@oracle.com # Check if creating a sparse copy (reflink) of a file on btrfs # expectedly fails when it's done between different filesystems or # different mount points of the same filesystem. # # For both situations, these actions are executed: #- Copy a file with the reflink=auto option. # A normal copy should be created. #- Copy a file with the reflink=always option. Should result in error, # no file should be created. [sandeen: mostly cosmetic changes] Signed-off-by: Koen De Wit koen.de@oracle.com Signed-off-by: Eric Sandeen sand...@redhat.com --- Originally submitted as: xfstests: 301: sparse copy between different filesystems/mountpoints on btrfs diff --git a/tests/btrfs/311 b/tests/btrfs/311 new file mode 100755 index 000..9256e7a --- /dev/null +++ b/tests/btrfs/311 @@ -0,0 +1,106 @@ +#! /bin/bash +# FS QA Test No. 311 +# +# Check if creating a sparse copy (reflink) of a file on btrfs +# expectedly fails when it's done between different filesystems or +# different mount points of the same filesystem. +# +# For both situations, these actions are executed: +#- Copy a file with the reflink=auto option. +# A normal copy should be created. +#- Copy a file with the reflink=always option. Should result in error, +# no file should be created. +# +#--- +# Copyright (c) 2013, Oracle and/or its affiliates. All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. 
+# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#--- + +seq=`basename $0` +echo QA output created by $seq + +here=`pwd` +tmp=/tmp/$$ +status=1# failure is the default! +trap _cleanup; exit \$status 0 1 2 3 15 + +_cleanup() +{ +umount $SCRATCH_MNT /dev/null +cd / +rm -f $tmp.* +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter + +# real QA test starts here +_supported_fs btrfs +_supported_os Linux + +_require_scratch +_require_cp_reflink + +SOURCE_DIR=$TEST_DIR/test-$seq +CROSS_DEV_DIR=$SCRATCH_MNT/test-$seq +# mount point target for twice-mounted device +TEST_DIR2=$TEST_DIR/mount2 +DUAL_MOUNT_DIR=$SCRATCH_MNT/test-bis-$seq + +rm -rf $SOURCE_DIR +mkdir $SOURCE_DIR + +rm -f $seqres.full + +_scratch_mkfs +$XFS_IO_PROG -f -c 'pwrite -S 0x61 0 9000' $SOURCE_DIR/original $seqres.fll + +_filter_testdirs() +{ + _filter_test_dir | _filter_scratch +} + +_create_reflinks_to() +{ +# auto reflink, should fall back to non-reflink +rm -rf $1; mkdir $1 +echo reflink=auto: +cp --reflink=auto $SOURCE_DIR/original $1/copy +md5sum $SOURCE_DIR/original | _filter_testdirs +md5sum $1/copy | _filter_testdirs + +# always reflink, should fail outright +rm -rf $1; mkdir $1 +echo reflink=always: +cp --reflink=always $SOURCE_DIR/original $1/copyfail 21 | _filter_testdirs + +# The failed target actually gets created by cp: +ls $1/copyfail | _filter_testdirs +} + +echo test reflinks across different devices +_scratch_mount +_create_reflinks_to $CROSS_DEV_DIR +_scratch_unmount + +echo test reflinks across different mountpoints of same device +mount $TEST_DEV $SCRATCH_MNT || _fail Couldn't double-mount $TEST_DEV +_create_reflinks_to $DUAL_MOUNT_DIR +umount $SCRATCH_MNT + +# success, all done +status=0 +exit diff --git a/tests/btrfs/311.out b/tests/btrfs/311.out new file mode 100644 index 000..210727b --- 
/dev/null +++ b/tests/btrfs/311.out @@ -0,0 +1,15 @@ +QA output created by 311 +test reflinks across different devices +reflink=auto: +42d69d1a6d333a7ebdf64792a555e392 TEST_DIR/test-311/original +42d69d1a6d333a7ebdf64792a555e392 SCRATCH_MNT/test-311/copy +reflink=always: +cp: failed to clone `SCRATCH_MNT/test-311/copyfail': Invalid cross-device link +SCRATCH_MNT/test-311/copyfail +test reflinks across different mountpoints of same device +reflink=auto: +42d69d1a6d333a7ebdf64792a555e392 TEST_DIR/test-311/original +42d69d1a6d333a7ebdf64792a555e392 SCRATCH_MNT/test-bis-311/copy +reflink=always: +cp: failed to clone `SCRATCH_MNT/test-bis-311/copyfail': Invalid cross-device link +SCRATCH_MNT/test-bis-311/copyfail diff --git a/tests/btrfs/group b/tests/btrfs/group index bd624c4..c897118 100644 --- a/tests/btrfs/group +++ b/tests/btrfs/group @@ -12,3 +12,4 @@ 308 auto quick reflink 309 auto quick reflink 310 auto
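The test above notes that a failed `cp --reflink=always` still leaves the target file behind. A minimal sketch of how a caller might handle that, assuming GNU cp and assuming the destination did not exist beforehand (the helper name `clone_or_cleanup` is hypothetical and not part of xfstests):

```shell
# clone_or_cleanup SRC DST
# Try a reflink-only copy. If the clone is refused (for example across
# devices or mount points), remove the target that cp may have created.
# Assumption: DST did not exist before the call, so removing it is safe.
clone_or_cleanup() {
    if cp --reflink=always "$1" "$2" 2>/dev/null; then
        return 0
    fi
    rm -f "$2"
    return 1
}

# Usage: works whether or not the filesystem under the temp dir can clone.
demo=$(mktemp -d)
echo data > "$demo/original"
if clone_or_cleanup "$demo/original" "$demo/copy"; then
    echo "cloned"
else
    echo "clone refused, no target left behind"
fi
rm -rf "$demo"
```

Which branch is taken depends on the filesystem, so no particular output should be expected here.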
Re: [PATCH] xfstests: btrfs/309: sparse copy of a directory tree on btrfs
This and the other reflink tests should all be: From: Koen De Wit koen.de@oracle.com to maintain original authorship, sorry I forgot to add that on the patches in the middle. Rich, if you can fix that up on commit it'd be great, unless I need to submit V2s then I can do it. Koen, I haven't gone over all of them, will try to get the rest tidied up and resubmitted unless you want to - which would be just fine! -Eric On 5/23/13 11:43 AM, Eric Sandeen wrote: # Tests file clone functionality of btrfs (reflinks) on directory trees. # - Create directory and subdirectory, each having one file # - Create 2 recursive reflinked copies of the tree # - Modify the original files # - Modify one of the copies [sandeen: mostly cosmetic changes] Signed-off-by: Koen De Wit koen.de@oracle.com Signed-off-by: Eric Sandeen sand...@redhat.com --- Originally submitted as xfstests: 298: sparse copy of a directory tree on btrfs diff --git a/tests/btrfs/309 b/tests/btrfs/309 new file mode 100755 index 000..b3927ba --- /dev/null +++ b/tests/btrfs/309 @@ -0,0 +1,98 @@ +#! /bin/bash +# FS QA Test No. btrfs/309 +# +# Tests file clone functionality of btrfs (reflinks) on directory trees. +# - Create directory and subdirectory, each having one file +# - Create 2 recursive reflinked copies of the tree +# - Modify the original files +# - Modify one of the copies +# +#--- +# Copyright (c) 2013, Oracle and/or its affiliates. All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. 
+# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#--- + +seq=`basename $0` +echo QA output created by $seq + +here=`pwd` +tmp=/tmp/$$ +status=1# failure is the default! +trap _cleanup; exit \$status 0 1 2 3 15 + +_cleanup() +{ +cd / +rm -f $tmp.* +} + +# get standard environment, filters and checks +. common/rc +. common/filter + +# real QA test starts here +_supported_fs btrfs +_supported_os Linux + +_require_xfs_io_fiemap +_require_cp_reflink + +TESTDIR1=$TEST_DIR/test-$seq +rm -rf $TESTDIR1 +mkdir $TESTDIR1 + +_checksum_files() { +for F in original/file1 original/subdir/file2 \ + copy1/file1 copy1/subdir/file2 \ + copy2/file1 copy2/subdir/file2 +do +md5sum $TESTDIR1/$F | _filter_test_dir +done +} + +rm -f $seqres.full + +mkdir $TESTDIR1/original +mkdir $TESTDIR1/original/subdir + +echo Create the original files and reflink dirs +$XFS_IO_PROG -F -f -c 'pwrite -S 0x61 0 9000' $TESTDIR1/original/file1 $seqres.full 21 +$XFS_IO_PROG -F -f -c 'pwrite -S 0x62 0 11000' $TESTDIR1/original/subdir/file2 $seqres.full 21 +cp --recursive --reflink $TESTDIR1/original $TESTDIR1/copy1 +cp --recursive --reflink $TESTDIR1/copy1 $TESTDIR1/copy2 + +_verify_reflink $TESTDIR1/original/file1 $TESTDIR1/copy1/file1 +_verify_reflink $TESTDIR1/original/subdir/file2 $TESTDIR1/copy1/subdir/file2 +_verify_reflink $TESTDIR1/original/file1 $TESTDIR1/copy2/file1 +_verify_reflink $TESTDIR1/original/subdir/file2 $TESTDIR1/copy2/subdir/file2 + +echo Original md5sums: +_checksum_files + +echo Overwrite original/file1 and original/subdir/file2 with new data +$XFS_IO_PROG -c 'pwrite -S 0x63 0 13000' $TESTDIR1/original/file1 $seqres.full 21 +$XFS_IO_PROG -c 'pwrite -S 0x64 5000 1000' $TESTDIR1/original/subdir/file2 $seqres.full 21 +echo md5sums now: +_checksum_files + +echo Overwrite copy1/file1 and copy1/subdir/file2 with new data +$XFS_IO_PROG 
-c 'pwrite -S 0x65 0 9000' $TESTDIR1/copy1/file1 $seqres.full 21 +$XFS_IO_PROG -c 'pwrite -S 0x66 5000 25000' $TESTDIR1/copy1/subdir/file2 $seqres.full 21 +echo md5sums now: +_checksum_files + +# success, all done +status=0 +exit diff --git a/tests/btrfs/309.out b/tests/btrfs/309.out new file mode 100644 index 000..93197d8 --- /dev/null +++ b/tests/btrfs/309.out @@ -0,0 +1,25 @@ +QA output created by 309 +Create the original files and reflink dirs +Original md5sums: +42d69d1a6d333a7ebdf64792a555e392 TEST_DIR/test-309/original/file1 +ca390545f0aedb54b808d6128c56a7dd TEST_DIR/test-309/original/subdir/file2 +42d69d1a6d333a7ebdf64792a555e392 TEST_DIR/test-309/copy1/file1 +ca390545f0aedb54b808d6128c56a7dd
Re: raid6: rmw writes all the time?
On 05/23/2013 03:34 PM, Chris Mason wrote:
> Quoting Bernd Schubert (2013-05-23 09:22:41)
>> On 05/23/2013 03:11 PM, Chris Mason wrote:
>>> Quoting Bernd Schubert (2013-05-23 08:55:47)
>>>> Hi all,
>>>>
>>>> we got a new test system here and I just also tested btrfs raid6 on that. Write performance is slightly lower than hw-raid (LSI megasas) and md-raid6, but it probably would be much better than any of these two, if it wouldn't read all the during the writes. Is this a known issue? This is with linux-3.9.2.
>>>
>>> Hi Bernd,
>>>
>>> Any time you do a write smaller than a full stripe, we'll have to do a read/modify/write cycle to satisfy it. This is true of md raid6 and the hw-raid as well, but their reads don't show up in vmstat (try iostat instead).
>>
>> Yeah, I know and I'm using iostat already. md raid6 does not do rmw, but does not fill the device queue, afaik it flushes the underlying devices quickly as it does not have barrier support - that is another topic, but was the reason why I started to test btrfs.
>
> md should support barriers with recent kernels. You might want to verify with blktrace that md raid6 isn't doing r/m/w.
>
>>> So the bigger question is where are your small writes coming from. If they are metadata, you can use raid1 for the metadata.
>>
>> I used this command
>>
>> /tmp/mkfs.btrfs -L test2 -f -d raid6 -m raid10 /dev/sd[m-x]
>
> Ok, the stripe size is 64KB, so you want to do IO in multiples of 64KB times the number of devices on the FS. If you have 13 devices, that's 832K.

Actually I have 12 devices, but we have to subtract 2 parity disks. In the mean time I also patched btrfsprogs to use a chunksize of 256K. So that should be 2560kiB now if I found the right places.

Btw, any chance to generally use chunksize/chunklen instead of stripe, such as the md layer does it? IMHO it is less confusing to use n-datadisks * chunksize = stripesize.

> Using buffered writes makes it much more likely the VM will break up the IOs as they go down. The btrfs writepages code does try to do full stripe IO, and it also caches stripes as the IO goes down. But for buffered IO it is surprisingly hard to get a 100% hit rate on full stripe IO at larger stripe sizes.

I have not found that part yet, somehow it looks like as if writepages would submit single pages to another layer. I'm going to look into it again during the weekend. I can reserve the hardware that long, but I think we first need to fix striped writes in general.

>> so meta-data should be raid10. And I'm using this iozone command:
>>
>> iozone -e -i0 -i1 -r1m -l 5 -u 5 -s20g -+n \
>>   -F /data/fhgfs/storage/md126/testfile1 /data/fhgfs/storage/md126/testfile2 /data/fhgfs/storage/md126/testfile3 \
>>   /data/fhgfs/storage/md127/testfile1 /data/fhgfs/storage/md127/testfile2 /data/fhgfs/storage/md127/testfile3
>>
>> Higher IO sizes (e.g. -r16m) don't make a difference, it goes through the page cache anyway. I'm not familiar with btrfs code at all, but maybe writepages() submits too small IOs?
>>
>> Hrmm, just wanted to try direct IO, but then just noticed it went into RO mode before already:
>
> Direct IO will make it easier to get full stripe writes. I thought I had fixed this abort, but it is just running out of space to write the inode cache. For now, please just don't mount with the inode cache enabled, I'll send in a fix for the next rc.

Thanks, I already noticed and disabled the inode cache.

Direct-io works as expected and without any RMW cycles. And that provides more than 40% better performance than the Megasas controller or buffered MD writes (I didn't compare with direct-io MD, as that is very slow).

Cheers,
Bernd
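The arithmetic discussed in this message (data disks times per-device stripe size) can be written down as a short sketch. The function name `full_stripe_kib` is made up for illustration; the device count, parity count and stripe sizes are the ones quoted in the thread.

```shell
# full_stripe_kib NDEVS NPARITY STRIPE_KIB
# Prints the full-stripe size in KiB: (devices - parity devices) * per-device stripe.
full_stripe_kib() {
    ndevs=$1 nparity=$2 stripe_kib=$3
    echo $(( (ndevs - nparity) * stripe_kib ))
}

full_stripe_kib 12 2 64     # 12-disk raid6 with the 64 KiB stripe: prints 640
full_stripe_kib 12 2 256    # same array with the patched 256 KiB size: prints 2560
```

The second result matches the 2560 KiB figure given above for the patched btrfs-progs.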
Re: raid6: rmw writes all the time?
Quoting Bernd Schubert (2013-05-23 15:33:24)
> On 05/23/2013 03:34 PM, Chris Mason wrote:
>> Quoting Bernd Schubert (2013-05-23 09:22:41)
>>> On 05/23/2013 03:11 PM, Chris Mason wrote:
>>>> Quoting Bernd Schubert (2013-05-23 08:55:47)
>>>>> Hi all,
>>>>>
>>>>> we got a new test system here and I just also tested btrfs raid6 on that. Write performance is slightly lower than hw-raid (LSI megasas) and md-raid6, but it probably would be much better than any of these two, if it wouldn't read all the during the writes. Is this a known issue? This is with linux-3.9.2.
>>>>
>>>> Hi Bernd,
>>>>
>>>> Any time you do a write smaller than a full stripe, we'll have to do a read/modify/write cycle to satisfy it. This is true of md raid6 and the hw-raid as well, but their reads don't show up in vmstat (try iostat instead).
>>>
>>> Yeah, I know and I'm using iostat already. md raid6 does not do rmw, but does not fill the device queue, afaik it flushes the underlying devices quickly as it does not have barrier support - that is another topic, but was the reason why I started to test btrfs.
>>
>> md should support barriers with recent kernels. You might want to verify with blktrace that md raid6 isn't doing r/m/w.
>>
>>>> So the bigger question is where are your small writes coming from. If they are metadata, you can use raid1 for the metadata.
>>>
>>> I used this command
>>>
>>> /tmp/mkfs.btrfs -L test2 -f -d raid6 -m raid10 /dev/sd[m-x]
>>
>> Ok, the stripe size is 64KB, so you want to do IO in multiples of 64KB times the number of devices on the FS. If you have 13 devices, that's 832K.
>
> Actually I have 12 devices, but we have to subtract 2 parity disks. In the mean time I also patched btrfsprogs to use a chunksize of 256K. So that should be 2560kiB now if I found the right places.

Sorry, thanks for filling in for my pre-coffee email.

> Btw, any chance to generally use chunksize/chunklen instead of stripe, such as the md layer does it? IMHO it is less confusing to use n-datadisks * chunksize = stripesize.

Definitely, it will become much more configurable.

>> Using buffered writes makes it much more likely the VM will break up the IOs as they go down. The btrfs writepages code does try to do full stripe IO, and it also caches stripes as the IO goes down. But for buffered IO it is surprisingly hard to get a 100% hit rate on full stripe IO at larger stripe sizes.
>
> I have not found that part yet, somehow it looks like as if writepages would submit single pages to another layer. I'm going to look into it again during the weekend. I can reserve the hardware that long, but I think we first need to fix striped writes in general.

The VM calls writepages and btrfs tries to suck down all the pages that belong to the same extent. And we try to allocate the extents on boundaries. There is definitely some bleeding into rmw when I do it here, but overall it does well.

But I was using 8 drives. I'll try with 12.

>>> so meta-data should be raid10. And I'm using this iozone command:
>>>
>>> iozone -e -i0 -i1 -r1m -l 5 -u 5 -s20g -+n \
>>>   -F /data/fhgfs/storage/md126/testfile1 /data/fhgfs/storage/md126/testfile2 /data/fhgfs/storage/md126/testfile3 \
>>>   /data/fhgfs/storage/md127/testfile1 /data/fhgfs/storage/md127/testfile2 /data/fhgfs/storage/md127/testfile3
>>>
>>> Higher IO sizes (e.g. -r16m) don't make a difference, it goes through the page cache anyway. I'm not familiar with btrfs code at all, but maybe writepages() submits too small IOs?
>>>
>>> Hrmm, just wanted to try direct IO, but then just noticed it went into RO mode before already:
>>
>> Direct IO will make it easier to get full stripe writes. I thought I had fixed this abort, but it is just running out of space to write the inode cache. For now, please just don't mount with the inode cache enabled, I'll send in a fix for the next rc.
>
> Thanks, I already noticed and disabled the inode cache.
>
> Direct-io works as expected and without any RMW cycles. And that provides more than 40% better performance than the Megasas controller or buffered MD writes (I didn't compare with direct-io MD, as that is very slow).

You can improve MD performance quite a lot by increasing the size of the stripe cache.

-chris
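For the md stripe cache mentioned here, a rough sketch of what a larger cache costs in memory. It follows the formula given in the md(4) man page (page size * number of disks * stripe_cache_size). The 4 KiB page size is an assumption, the function name `stripe_cache_kib` is made up, and the array name `md126` in the comment is only guessed from the directory names in the iozone command.

```shell
# stripe_cache_kib ENTRIES NDISKS [PAGE_KIB]
# Approximate memory used by the md raid5/6 stripe cache, in KiB.
# PAGE_KIB defaults to 4, which is an assumption about the platform.
stripe_cache_kib() {
    entries=$1 ndisks=$2 page_kib=${3:-4}
    echo $(( entries * ndisks * page_kib ))
}

stripe_cache_kib 256 12     # md default of 256 entries on 12 disks: prints 12288
stripe_cache_kib 8192 12    # a larger cache on the same array: prints 393216

# Applying a value on a real array needs root, for example (array name assumed):
#   echo 8192 > /sys/block/md126/md/stripe_cache_size
```

So going from the default to 8192 entries on 12 disks would raise the cache from about 12 MiB to about 384 MiB under these assumptions.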
Re: raid6: rmw writes all the time?
On 05/23/2013 09:37 PM, Chris Mason wrote:
> Quoting Bernd Schubert (2013-05-23 15:33:24)
>> Btw, any chance to generally use chunksize/chunklen instead of stripe, such as the md layer does it? IMHO it is less confusing to use n-datadisks * chunksize = stripesize.
>
> Definitely, it will become much more configurable.

Actually I meant in the code. I'm going to write a patch during the weekend.

>>> Using buffered writes makes it much more likely the VM will break up the IOs as they go down. The btrfs writepages code does try to do full stripe IO, and it also caches stripes as the IO goes down. But for buffered IO it is surprisingly hard to get a 100% hit rate on full stripe IO at larger stripe sizes.
>>
>> I have not found that part yet, somehow it looks like as if writepages would submit single pages to another layer. I'm going to look into it again during the weekend. I can reserve the hardware that long, but I think we first need to fix striped writes in general.
>
> The VM calls writepages and btrfs tries to suck down all the pages that belong to the same extent. And we try to allocate the extents on boundaries. There is definitely some bleeding into rmw when I do it here, but overall it does well.
>
> But I was using 8 drives. I'll try with 12.

Hmm, I already tried with 10 drives (8+2), doesn't make a difference for RMW.

>> Direct-io works as expected and without any RMW cycles. And that provides more than 40% better performance than the Megasas controller or buffered MD writes (I didn't compare with direct-io MD, as that is very slow).
>
> You can improve MD performance quite a lot by increasing the size of the stripe cache.

I'm already doing that, without a higher stripe cache the performance is much lower.
Re: raid6: rmw writes all the time?
Quoting Bernd Schubert (2013-05-23 15:45:36)
> On 05/23/2013 09:37 PM, Chris Mason wrote:
>> Quoting Bernd Schubert (2013-05-23 15:33:24)
>>> Btw, any chance to generally use chunksize/chunklen instead of stripe, such as the md layer does it? IMHO it is less confusing to use n-datadisks * chunksize = stripesize.
>>
>> Definitely, it will become much more configurable.
>
> Actually I meant in the code. I'm going to write a patch during the weekend.

The btrfs raid code refers to stripes because a chunk is a very large (~1GB) slice of a set of drives that we allocate into raid levels.

We have full stripes and device stripes, I'm afraid there are so many different terms in other projects that it is hard to pick something clear.

>>>> Using buffered writes makes it much more likely the VM will break up the IOs as they go down. The btrfs writepages code does try to do full stripe IO, and it also caches stripes as the IO goes down. But for buffered IO it is surprisingly hard to get a 100% hit rate on full stripe IO at larger stripe sizes.
>>>
>>> I have not found that part yet, somehow it looks like as if writepages would submit single pages to another layer. I'm going to look into it again during the weekend. I can reserve the hardware that long, but I think we first need to fix striped writes in general.
>>
>> The VM calls writepages and btrfs tries to suck down all the pages that belong to the same extent. And we try to allocate the extents on boundaries. There is definitely some bleeding into rmw when I do it here, but overall it does well.
>>
>> But I was using 8 drives. I'll try with 12.
>
> Hmm, I already tried with 10 drives (8+2), doesn't make a difference for RMW.

My benchmarks were on flash, so the rmw I was seeing may not have had as big an impact.

-chris
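To make the "full stripe" versus "device stripe" distinction concrete, here is a simplified sketch of when a single write would need a read/modify/write cycle. It looks at one write in isolation and ignores page-cache merging and extent allocation, so it is only a model, not how btrfs decides. The function name `needs_rmw` is made up for illustration.

```shell
# needs_rmw OFFSET_KIB LEN_KIB FULL_STRIPE_KIB
# Succeeds (exit 0) when the write does not begin and end on a full-stripe
# boundary, i.e. when at least one stripe would be only partially written.
needs_rmw() {
    off=$1 len=$2 full=$3
    [ $(( off % full )) -ne 0 ] || [ $(( (off + len) % full )) -ne 0 ]
}

# 1 MiB records (iozone -r1m) against a 640 KiB full stripe (10 data disks * 64 KiB):
if needs_rmw 0 1024 640; then echo "partial stripe, rmw likely"; fi
# Two full stripes written at once:
if ! needs_rmw 0 1280 640; then echo "full stripes only"; fi
```

Under this model a lone 1 MiB write at offset 0 covers one full stripe plus 384 KiB of the next one.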
kernel BUG on balance, 3.10-rc2 skinny
Filesystem was created fairly recently under kernel 3.9 iirc.

Data, RAID0: total=2.51TB, used=2.49TB
System, RAID1: total=32.00MB, used=188.00KB
System: total=4.00MB, used=0.00
Metadata, RAID1: total=5.00GB, used=3.72GB

Total devices 2 FS bytes used 2.49TB
devid 1 size 2.73TB used 1.26TB path /dev/sdd
devid 2 size 2.73TB used 1.26TB path /dev/sdc

mount:
/dev/sdd on /home/fredrik type btrfs (rw,noatime,compress=lzo,space_cache)

log attached

--
Fredrik Rinnestam

[16608.738909] btrfs: relocating block group 7988232323072 flags 9
[16615.765058] btrfs: found 3 extents
[16620.296989] [ cut here ]
[16620.297009] kernel BUG at fs/btrfs/relocation.c:3296!
[16620.297025] invalid opcode: [#1] PREEMPT SMP
[16620.297043] Modules linked in: auth_rpcgss oid_registry nfsv4 nfsv3 nfs lockd sunrpc loop fuse w83627ehf hwmon_vid snd_hda_codec_realtek btusb ath3k bluetooth kvm_intel kvm crc32c_intel aesni_intel evdev aes_x86_64 glue_helper lrw gf128mul ablk_helper snd_hda_intel r8169 snd_hda_codec mii snd_hwdep sr_mod firewire_ohci snd_pcm cdrom snd_page_alloc firewire_core snd_timer crc_itu_t snd xhci_hcd soundcore thermal fan 8250 serial_core rtc_cmos button
[16620.297256] CPU: 0 PID: 30752 Comm: btrfs Not tainted 3.10.0-rc2 #2
[16620.297276] Hardware name: System manufacturer System Product Name/P8P67 DELUXE, BIOS 3602 10/31/2012
[16620.297308] task: 880272015280 ti: 880101b4c000 task.ti: 880101b4c000
[16620.297332] RIP: 0010:[8118af95] [8118af95] __add_tree_block+0xe8/0xfb
[16620.297364] RSP: 0018:880101b4da88 EFLAGS: 00010287
[16620.297380] RAX: 0731283b7000 RBX: 88040c41ae10 RCX:
[16620.297403] RDX: 0011 RSI: 880007cd1684 RDI: 880101b4da78
[16620.297426] RBP: 0001 R08: 880101b4da78 R09: 8803c2d46988
[16620.297449] R10: 1000 R11: 1600 R12: 0731283f4000
[16620.297472] R13: 880407879000 R14: 880101b4db60 R15: 1000
[16620.297495] FS: 7f6018cd5740() GS:88041ec0() knlGS:
[16620.297522] CS: 0010 DS: ES: CR0: 80050033
[16620.297540] CR2: 7f62954a4000 CR3: 000104447000 CR4: 000407f0
[16620.297562] DR0: DR1: DR2:
[16620.297585] DR3: DR6: 0ff0 DR7: 0400
[16620.297608] Stack:
[16620.297611] 880101b4db77 00ff81164783 a90731283b70
[16620.297640] 1000 88040c4907e0 880407879000 0eba
[16620.297669] 880101b4db60 880003a71400 880101b4db77 8118b049
[16620.297697] Call Trace:
[16620.297704] [8118b049] ? add_data_references+0xa1/0x1af
[16620.297726] [81164783] ? btrfs_get_token_64+0x76/0xc6
[16620.297746] [8118d318] ? relocate_block_group+0x1f8/0x4db
[16620.297767] [8118d73c] ? btrfs_relocate_block_group+0x141/0x268
[16620.297790] [8116f13d] ? btrfs_relocate_chunk.isra.65+0x4b/0x398
[16620.297814] [813fd010] ? _raw_spin_unlock+0x1c/0x28
[16620.297832] [81166fe8] ? release_extent_buffer+0x90/0x97
[16620.297853] [81171db2] ? btrfs_balance+0x947/0xb11
[16620.297872] [81177a19] ? btrfs_ioctl_balance+0x228/0x2a4
[16620.297892] [8117a4cb] ? btrfs_ioctl+0xf32/0x18bf
[16620.297911] [8102014d] ? __do_page_fault+0x284/0x322
[16620.297931] [810986b2] ? vma_link+0x6e/0x8c
[16620.297948] [810bb586] ? vfs_ioctl+0x1e/0x31
[16620.297964] [810bbd63] ? do_vfs_ioctl+0x3b4/0x3f6
[16620.297983] [810bbde1] ? SyS_ioctl+0x3c/0x67
[16620.298000] [813fdc52] ? system_call_fastpath+0x16/0x1b
[16620.298019] Code: 4c 89 f7 31 ed e8 19 d1 ff ff 48 85 c0 75 1e e9 6e ff ff ff 4c 89 f1 48 89 da 4c 89 ef 48 8d 74 24 0f e8 80 e0 ff ff 89 c5 eb bd 0f 0b 48 83 c4 28 89 e8 5b 5d 41 5c 41 5d 41 5e 41 5f c3 41 57
[16620.298162] RIP [8118af95] __add_tree_block+0xe8/0xfb
[16620.298182] RSP 880101b4da88
[16620.306795] ---[ end trace bdf9370f4dbb18f4 ]---
Re: [PATCH] xfstests: btrfs/308: simple sparse copy testcase for btrfs
On Thu, May 23, 2013 at 11:36:32AM -0500, Eric Sandeen wrote:
> From: Koen De Wit koen.de@oracle.com
>
> # Tests file clone functionality of btrfs (reflinks):
> #   - Reflink a file
> #   - Reflink the reflinked file
> #   - Modify the original file
> #   - Modify the reflinked file
>
> [sandeen: add helpers, make several mostly-cosmetic
>  changes to the original testcase]
>
> Signed-off-by: Koen De Wit koen.de@oracle.com
> Signed-off-by: Eric Sandeen sand...@redhat.com
> ---
>
> Originally submitted as test 297

FWIW, this will conflict with Josef's new btrfs/308 patch.

You don't have to use monotonically increasing numbers for the tests anymore in the filesystem-specific test directories - you could make these tests btrfs/00[1-4] :)

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
Re: Virtual Device Support (N-way mirror code)
On 05/23/2013 09:08 AM, Martin Steigerwald wrote:
> 3) As to my knowledge mount times of large partitions can be quite long with ReiserFS 3.

That may well be, but I certainly wouldn't consider btrfs mount times fast in such cases.

[root@localhost ghmitch]# time mount LABEL=BACKUP /backup

real    0m18.133s
user    0m0.000s
sys     0m0.190s
[root@localhost ghmitch]#
Re: [PATCH] xfstests: btrfs/308: simple sparse copy testcase for btrfs
On Thu, May 23, 2013 at 11:36:32AM -0500, Eric Sandeen wrote: From: Koen De Wit koen.de@oracle.com # Tests file clone functionality of btrfs (reflinks): # - Reflink a file # - Reflink the reflinked file # - Modify the original file # - Modify the reflinked file [sandeen: add helpers, make several mostly-cosmetic changes to the original testcase] Signed-off-by: Koen De Wit koen.de@oracle.com Signed-off-by: Eric Sandeen sand...@redhat.com --- Originally submitted as test 297 diff --git a/common/rc b/common/rc index fe6bbfc..4560715 100644 --- a/common/rc +++ b/common/rc @@ -2098,6 +2098,27 @@ _require_dumpe2fs() fi } +_require_cp_reflink() +{ + cp --help | grep -q reflink || \ + _notrun This test requires a cp with --reflink support. +} + +# Given 2 files, verify that they have the same mapping but different +# inodes - i.e. an undisturbed reflink +# Silent if so, make noise if not +_verify_reflink() +{ + # not a hard link or symlink? + cmp -s (stat -c '%i' $1) (stat -c '%i' $2) \ + echo $1 and $2 are not reflinks: same inode number + + # same mapping? + diff -u ($XFS_IO_PROG -F -c fiemap $1 | grep -v $1) \ + ($XFS_IO_PROG -F -c fiemap $2 | grep -v $2) \ + || echo $1 and $2 are not reflinks: different extents I'm not sure if -F is still needed after commit 96fce07 xfstests: automatically add -F to xfs_io on non-xfs +} + _create_loop_device() { file=$1 diff --git a/tests/btrfs/308 b/tests/btrfs/308 new file mode 100755 index 000..1bb8f02 --- /dev/null +++ b/tests/btrfs/308 @@ -0,0 +1,87 @@ +#! /bin/bash +# FS QA Test No. btrfs/308 +# +# Tests file clone functionality of btrfs (reflinks): +# - Reflink a file +# - Reflink the reflinked file +# - Modify the original file +# - Modify the reflinked file +# +#--- +# Copyright (c) 2013, Oracle and/or its affiliates. All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. 
+# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#--- +# + +seq=`basename $0` +echo QA output created by $seq + +here=`pwd` +tmp=/tmp/$$ +status=1# failure is the default! +trap _cleanup; exit \$status 0 1 2 3 15 + +_cleanup() +{ +cd / +rm -f $tmp.* +} + +# get standard environment, filters and checks +. common/rc +. common/filter + +# real QA test starts here +_supported_fs btrfs +_supported_os Linux + +_require_xfs_io_fiemap +_require_cp_reflink + +TESTDIR1=$TEST_DIR/test-$seq +rm -rf $TESTDIR1 +mkdir $TESTDIR1 + +_checksum_files() { +for F in original copy1 copy2 +do +md5sum $TESTDIR1/$F | _filter_test_dir +done +} + +rm -f $seqres.full + +echo Create the original file and reflink to copy1, copy2 +$XFS_IO_PROG -F -f -c 'pwrite -S 0x61 0 9000' $TESTDIR1/original $seqres.full 21 Here too, maybe we can drop -F option? Also 309 has this issue too. 310 and 311 are fine. 
Thanks, Eryu Guan +cp --reflink $TESTDIR1/original $TESTDIR1/copy1 +cp --reflink $TESTDIR1/copy1 $TESTDIR1/copy2 +_verify_reflink $TESTDIR1/original $TESTDIR1/copy1 +_verify_reflink $TESTDIR1/original $TESTDIR1/copy2 +echo Original md5sums: +_checksum_files + +echo Overwrite original file with new data +$XFS_IO_PROG -c 'pwrite -S 0x62 0 9000' $TESTDIR1/original $seqres.full 21 +echo md5sums after overwriting original: +_checksum_files + +echo Overwrite copy1 with different new data +$XFS_IO_PROG -c 'pwrite -S 0x63 0 9000' $TESTDIR1/copy1 $seqres.full 21 +echo md5sums after overwriting copy1: +_checksum_files + +# success, all done +status=0 +exit diff --git a/tests/btrfs/308.out b/tests/btrfs/308.out new file mode 100644 index 000..7bccf08 --- /dev/null +++ b/tests/btrfs/308.out @@ -0,0 +1,16 @@ +QA output created by 308 +Create the original file and reflink to copy1, copy2 +Original md5sums: +42d69d1a6d333a7ebdf64792a555e392 TEST_DIR/test-308/original +42d69d1a6d333a7ebdf64792a555e392 TEST_DIR/test-308/copy1 +42d69d1a6d333a7ebdf64792a555e392 TEST_DIR/test-308/copy2 +Overwrite original file with new data +md5sums after overwriting original: +4a847a25439532bf48b68c9e9536ed5b TEST_DIR/test-308/original
Re: [PATCH v2 00/12] VFS hot tracking
Hi Al Viro,

I have incorporated the comments from all reviewers and have been waiting for a long time. If you have no further comments, could you merge the patchset?

Thanks.
Re: [PATCH] xfstests: btrfs/308: simple sparse copy testcase for btrfs
On 5/23/13 10:09 PM, Eryu Guan wrote:
> On Thu, May 23, 2013 at 11:36:32AM -0500, Eric Sandeen wrote:
>> From: Koen De Wit koen.de@oracle.com
>>
>> # Tests file clone functionality of btrfs (reflinks):
>> #   - Reflink a file
>> #   - Reflink the reflinked file
>> #   - Modify the original file
>> #   - Modify the reflinked file
>>
>> [sandeen: add helpers, make several mostly-cosmetic
>>  changes to the original testcase]
>>
>> Signed-off-by: Koen De Wit koen.de@oracle.com
>> Signed-off-by: Eric Sandeen sand...@redhat.com
>> ---
>>
>> Originally submitted as test 297
>>
>> diff --git a/common/rc b/common/rc
>> index fe6bbfc..4560715 100644
>> --- a/common/rc
>> +++ b/common/rc
>> @@ -2098,6 +2098,27 @@ _require_dumpe2fs()
>>      fi
>>  }
>>
>> +_require_cp_reflink()
>> +{
>> +    cp --help | grep -q reflink || \
>> +        _notrun "This test requires a cp with --reflink support."
>> +}
>> +
>> +# Given 2 files, verify that they have the same mapping but different
>> +# inodes - i.e. an undisturbed reflink
>> +# Silent if so, make noise if not
>> +_verify_reflink()
>> +{
>> +    # not a hard link or symlink?
>> +    cmp -s <(stat -c '%i' $1) <(stat -c '%i' $2) \
>> +        && echo "$1 and $2 are not reflinks: same inode number"
>> +
>> +    # same mapping?
>> +    diff -u <($XFS_IO_PROG -F -c fiemap $1 | grep -v $1) \
>> +        <($XFS_IO_PROG -F -c fiemap $2 | grep -v $2) \
>> +        || echo "$1 and $2 are not reflinks: different extents"
>
> I'm not sure if -F is still needed after commit
> 96fce07 xfstests: automatically add -F to xfs_io on non-xfs

Right, it's not, oops. :(

Old habits (and old patch, TBH)

I can fix & resend all of them I guess.

Thanks,
-Eric
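The first check in the quoted `_verify_reflink` helper (a reflink must not share an inode with its source, unlike a hard link) can be tried on its own. A minimal sketch, assuming GNU stat and a temp directory that supports hard links. The helper name `same_inode` is made up, and it compares inode numbers only, not devices.

```shell
# same_inode FILE1 FILE2
# Succeeds when both names report the same inode number.
same_inode() {
    [ "$(stat -c '%i' "$1")" = "$(stat -c '%i' "$2")" ]
}

demo=$(mktemp -d)
echo data > "$demo/original"
ln "$demo/original" "$demo/hardlink"    # shares the inode
cp "$demo/original" "$demo/copy"        # gets an inode of its own
same_inode "$demo/original" "$demo/hardlink" && echo "hardlink: same inode"
same_inode "$demo/original" "$demo/copy" || echo "copy: different inode"
rm -rf "$demo"
```

The second half of the helper, comparing fiemap output, still needs xfs_io and a filesystem that reports extents, so it is left out of this sketch.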