Re: Strange performance degradation when COW writes happen at fixed offsets
2012/2/24 Nik Markovic nmarkovi.nav...@gmail.com:

To add... I also tried the nodatasum (only) and nodatacow options. I found somewhere that nodatacow doesn't really mean that COW is disabled. Test data is still the same - CPU spikes and times are the same.

On Fri, Feb 24, 2012 at 2:38 PM, Nik Markovic nmarkovi.nav...@gmail.com wrote:

On Fri, Feb 24, 2012 at 12:38 AM, Duncan 1i5t5.dun...@cox.net wrote:

Nik Markovic posted on Thu, 23 Feb 2012 20:31:02 -0600 as excerpted:

I noticed a few errors in the script that I used. I corrected it and it seems that degradation is occurring even at fully random writes:

I don't have an ssd, but is it possible that you're simply seeing erase-block related degradation due to multi-write-block sized erase-blocks? It seems to me that when originally written to the btrfs-on-ssd, the file will likely be written block-sequentially enough that the file as a whole takes up relatively few erase-blocks. As you COW-write individual blocks, they'll be written elsewhere, perhaps all the changed blocks to a new erase-block, perhaps each to a different erase block.

This is a very interesting insight. I wasn't even aware of the erase-block issue, so I did some reading up on it...

As you increase the successive COW generation count, the file's filesystem/write blocks will be spread thru more and more erase-blocks, basically fragmentation but of the SSD-critical type, into more and more erase blocks, thus affecting modification and removal time but not read time.

OK, so time to write would increase due to fragmentation and writing, it now makes sense (though I don't see why small writes would affect this, but my concerns are not writes anyway), but why would cp --reflink time increase so much? Yes, new extents would be created, but btrfs doesn't write into data blocks, does it? I figured its metadata would be kept in one place. I figure the only thing BTRFS would do on cp --reflink=always:

1. Take a collection of extents owned by source.
2. Make the new copy use the same collection of extents.
3. Write the collection of extents to the directory.

Now this process seems to be CPU intensive. When I remove or make a reflink copy, one core spikes up to 100%, which tells me that there's a performance issue there, not an ssd issue. Also, only one CPU thread is being used for this. I figured that I can improve this by some setting. Maybe the thread_pool mount option? Are there any updates in later kernels that I should possibly pick up?

[...]

Unless I am wrong, this would disable COW completely and reflink copy. Reflinks are a crucial component and the sole reason I picked BTRFS for the system that I am writing for my company. The autodefrag option addresses multiple writes. Writing is not the problem, but cp --reflink should be near-instant. That was the reason we chose BTRFS over ZFS, which seemed to be the only feasible alternative. ZFS snapshots complicate the design and deduplication copy time is the same as (or not much better than) raw copy.

[...]

As I mentioned above, the COW is the crucial component of our system, XFS won't do. Our system does not do random writes. In fact it is mainly heavy on read operation. The system does occasional rotation of rust on large files in a way that version control system would (large files are modified and then used as a new baseline).

The symptoms you are reporting are quite similar to what I'm seeing in our Ceph cluster:

http://comments.gmane.org/gmane.comp.file-systems.btrfs/15413

AFAIK, Chris and Josef are working on it, but you'll have to wait for kernel 3.4 until this is available in mainline. If you are feeling adventurous, you could try the patches in Josef's git tree, but I think it's still experimental.

Regards,
Christian

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: LABEL only 1 device
On Mon, Feb 27, 2012 at 07:44:00AM +0100, Helmut Hullen wrote:

Hello, Hugo, you wrote on 26.02.12:

mkfs.btrfs creates a new filesystem. The -L option sets the label for the newly-created FS. It *cannot* be used to change the label of an existing FS.

The safest way may be deleting this option ... it seems to work as expected only when I create a new FS on 1 disk/partition.

I've said this several times: Your expectations are wrong. You don't label partitions. You label filesystems. You are using the wrong tool (filesystem labels) for the job (uniquely identifying partitions). If you want to do that, use btrfs filesystem label.

And that seems to work as I expected - fine. Adding a device works, deleting a device works. Fine! Now I'll try the job with my Terabyte disks. (Yes - I have backups ...)

Best regards!
Helmut

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Be pure. Be vigilant. Behave. ---
Re: btrfs-convert options
On Mon, Feb 27, 2012 at 08:52:00AM +0100, Helmut Hullen wrote:

Hello, linux-btrfs,

I want to change some TByte disks (at least one) from ext4 to btrfs. And I want -d raid0 -m raid1. Is it possible to pass these options for data and metadata to btrfs-convert? Or do I have to use mkfs.btrfs (and then copy the backup) when I want these options?

The other alternative is to get hold of 3.3-rcX and the latest userspace tools (currently in the dangerdonteveruse branch, but should be in master very soon), and use the restriper after you've converted.

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Be pure. Be vigilant. Behave. ---
Re: btrfs-convert options
Hello, Hugo, you wrote on 27.02.12:

I want to change some TByte disks (at least one) from ext4 to btrfs. And I want -d raid0 -m raid1. Is it possible to pass these options for data and metadata to btrfs-convert? Or do I have to use mkfs.btrfs (and then copy the backup) when I want these options?

The other alternative is to get hold of 3.3-rcX and the latest userspace tools (currently in the dangerdonteveruse branch, but should be in master very soon), and use the restriper after you've converted.

H - I don't yet dare these experiments with really big disks ...

Best regards!
Helmut
Re: LABEL only 1 device
Hello, Hugo, you wrote on 27.02.12:

mkfs.btrfs creates a new filesystem. The -L option sets the label for the newly-created FS. It *cannot* be used to change the label of an existing FS.

The safest way may be deleting this option ... it seems to work as expected only when I create a new FS on 1 disk/partition.

I've said this several times: Your expectations are wrong. You don't label partitions.

Yes - now I know. But I'm afraid other people also have the wrong expectation - when I use mkfs.ext[234] then this option works (in another way than with mkfs.btrfs).

Best regards!
Helmut
Re: LABEL only 1 device
On Sun, Feb 26, 2012 at 04:44:00PM +, Hugo Mills wrote:

OK, the real problem you're seeing is that when btrfs removes a device from the filesystem, that device is not modified in any way. This means that the old superblock is left behind on it, containing the FS label information. What you need to do is, immediately after removing a device from the FS, zero the first part of the partition with dd and /dev/zero.

A correction here: if the device being removed is writable, the superblock is cleared so it's not recognized as a part of any other fs:

    int btrfs_rm_device(struct btrfs_root *root, char *device_path)
    ...
            /*
             * at this point, the device is zero sized.  We want to
             * remove it from the devices list and zero out the old super
             */
            if (clear_super) {
                    /* make sure this device isn't detected as part of
                     * the FS anymore
                     */
                    memset(&disk_super->magic, 0, sizeof(disk_super->magic));
                    set_buffer_dirty(bh);
                    sync_dirty_buffer(bh);
            }

Doing this manually means zeroing a 4k block at all of the following offsets, up to the partition size:

Superblock 0 offset 65536
Superblock 1 offset 67108864
Superblock 2 offset 274877906944
Superblock 3 offset 1125899906842624
Superblock 4 offset 4611686018427387904

david
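The offsets listed above follow a simple pattern: 64 KiB for copy 0, and 16 KiB shifted left by 12 bits per copy after that. As a rough sketch of the manual procedure (not a tested recovery tool), the shell functions below compute those offsets and zero one 4 KiB block at each offset that fits inside the target. The function names btrfs_sb_offset and wipe_btrfs_superblocks are made up for illustration, and the caller is assumed to supply the target size in bytes. Try it on a scratch image file first; pointing it at a real partition destroys data there.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of btrfs-progs.

# Print the byte offset of superblock copy N (0..4), matching the list above:
# 64 KiB for copy 0, otherwise 16 KiB << (12 * N).
btrfs_sb_offset() {
    if [ "$1" -eq 0 ]; then
        echo 65536
    else
        echo $(( 16384 << (12 * $1) ))
    fi
}

# Zero one 4 KiB block at every superblock offset that lies fully inside
# the target.  $1 = image file or block device, $2 = its size in bytes.
wipe_btrfs_superblocks() {
    target=$1
    size=$2
    n=0
    while [ "$n" -le 4 ]; do
        off=$(btrfs_sb_offset "$n")
        if [ $(( off + 4096 )) -le "$size" ]; then
            # conv=notrunc keeps the rest of a regular file intact.
            dd if=/dev/zero of="$target" bs=4096 count=1 \
               seek=$(( off / 4096 )) conv=notrunc 2>/dev/null
        fi
        n=$(( n + 1 ))
    done
}
```

On a 128 MiB scratch image, for example, wipe_btrfs_superblocks scratch.img 134217728 would touch only the blocks at 65536 and 67108864, since the remaining offsets lie beyond the end of the image.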
Re: LABEL only 1 device
Hello, David, you wrote on 27.02.12:

[deleting btrfs partition]

OK, the real problem you're seeing is that when btrfs removes a device from the filesystem, that device is not modified in any way. This means that the old superblock is left behind on it, containing the FS label information. What you need to do is, immediately after removing a device from the FS, zero the first part of the partition with dd and /dev/zero.

A correction here: if the device being removed is writable, the superblock is cleared so it's not recognized as a part of any other fs:

[...]

Doing this manually means zeroing a 4k block at all of the following offsets, up to the partition size:

Superblock 0 offset 65536
Superblock 1 offset 67108864
Superblock 2 offset 274877906944
Superblock 3 offset 1125899906842624
Superblock 4 offset 4611686018427387904

My actual experiments:

dd if=/dev/zero of=/dev/sdxn bs=16M count=1

seems to be enough. Perhaps deleting the first 16 MByte is too much.

Best regards!
Helmut
Re: [PATCH] Btrfs: hold enough space for global_rsv
On Tue, 17 Jan 2012 17:51:59 +0800, Liu Bo liubo2...@cn.fujitsu.com wrote:

I've kept hitting enospc warnings of global_rsv while running defragment on files:

btrfs: block rsv returned -28
WARNING: at fs/btrfs/extent-tree.c:5984 btrfs_alloc_free_block+0x333/0x340 [btrfs]()
...

I used a fio job to create a file with lots of fragments:

$ filefrag /mnt/btrfs/foobar
/mnt/btrfs/foobar: 66964 extents found

and then

btrfs fi defrag /mnt/btrfs/foobar
sync

would pop the warnings.

I found that the global_rsv size is just not enough for defragment, and didn't find any space leak in using global_rsv, so double it and go ahead.

Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
---
 fs/btrfs/extent-tree.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 8603ee4..77ea23c 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3979,7 +3979,7 @@ static u64 calc_global_metadata_size(struct btrfs_fs_info *fs_info)
 	num_bytes += div64_u64(data_used + meta_used, 50);
 
 	if (num_bytes * 3 > meta_used)
-		num_bytes = div64_u64(meta_used, 3);
+		num_bytes = div64_u64(meta_used, 3) * 2;
 
 	return ALIGN(num_bytes, fs_info->extent_root->leafsize << 10);
 }

This patch breaks my system. With this applied, all services fail on boot with "no space left" messages.
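To get a feel for what the one-line change does to the reservation, the arithmetic in the hunk can be replayed with shell integer math. This is only a sketch under stated assumptions: the function name global_rsv_size is made up, the comparison in the if is taken to be a greater-than test, whatever num_bytes holds before the quoted hunk is treated as zero, and the final ALIGN() rounding is left out.

```shell
#!/bin/sh
# Sketch only: mirrors the arithmetic of the quoted hunk, not the kernel code.
# $1 = data bytes used, $2 = metadata bytes used,
# $3 = cap multiplier (1 = before the patch, 2 = with the patch).
global_rsv_size() {
    data_used=$1
    meta_used=$2
    mult=$3
    # Assumption: the part of num_bytes computed before the hunk is zero.
    num_bytes=$(( (data_used + meta_used) / 50 ))
    # Cap the reservation relative to the metadata in use.
    if [ $(( num_bytes * 3 )) -gt "$meta_used" ]; then
        num_bytes=$(( meta_used / 3 * mult ))
    fi
    echo "$num_bytes"
}
```

With 100 GiB of data and 1 GiB of metadata in use, global_rsv_size 107374182400 1073741824 1 prints 357913941, while multiplier 2 prints 715827882, i.e. roughly two thirds of the metadata currently in use. Whether that larger reservation is what triggers the "no space left" failures reported above is not established in this thread.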
Re: Set nodatacow per file?
Hello, Since several people asked to post the results, here they are. I tried raw virtio disk with and without -z -C set and also qcow2 virtio disk without -z -C set and did not notice any difference in performance at all - Redhat 6.2 Minimal installs in 10 minutes in each case. The abysmal performance as it was some several months ago (like 10 minutes just for virtual disk formatting) under the same conditions is no more at least on 3.3.0-rc5. best ~dima On 02/24/2012 02:22 PM, dima wrote: On 02/13/2012 04:17 PM, Ralf-Peter Rohbeck wrote: Hello, is it possible to set nodatacow on a per-file basis? I couldn't find anything. If not, wouldn't that be a great feature to get around the performance issues with VM and database storage? Of course cloning should still cause COW. Hello, Going back to the original question from Ralf I wanted to share my experience. Yesterday I set up KVM+qemu and set -z -C with David's 'fileflags' utility for the VM image file. I was very pleased with results - Redhat 6 Minimal installation was installed in 10 minutes whereas it was taking 'forever' the last time I tried it some 4 months ago. Writes during installation were very moderate. Performance of VM is excellent. Installing some big packages with yum inside VM goes very quickly with the speed indistinguishable from that of bare metal installs. I am not quite sure should this improvement be attributed to the nocow and nocompress flags or to the overall improvement of btrfs (I am on 3.3-rc4 kernel) but KVM is definitely more than usable on btrfs now. I am yet to test the install speed and performance without those flags set. 
best ~dima
Re: [BUG] Kernel Bug at fs/btrfs/volumes.c:3638
On Sun, Feb 26, 2012 at 11:42:58PM -0500, Jérôme Carretero wrote:

At some point, I would appreciate some kind of thorough evaluation using a fuzzer on small disk images. The btrfs developers could for instance:

- provide a script to create a filesystem image with a known layout (known corpus)
- provide .config and reference to kernel sources to build the kernel
- provide a minimal root filesystem to be run under qemu; it would run a procedure on the other disk image at boot. Crashes wouldn't affect the host, which is good.
- provide a way to retrieve the test parameters and results for every test case. In case of bug, the test can be reproduced by the developers since the configuration is known.
- expect volunteers to run the scenarios (I know I would)

The tricky part is of course the potentially super-costly procedure... Simplest case: flipping every bit / writing blocks with pseudo-random data, even on meta-data only, as the outcome on data is supposed to be known. Smarter: flipping bits on every btrfs meta-data structure type at every possible logical location.

There is a dangerdonteveruse(tm) utility, btrfs-corrupt-block, able to target a specific metadata structure and corrupt it, with the fsck counterpart for the rescue. I believe we'll see more updates in that area.

The block checksums are supposed to catch bitflips after they were written down to the device (provided the data were correct up to the checksum point). If you're talking about random bitflips in metadata structures during processing, that's very likely to crash in many ways of course. I think some logic needs to be added to those corruptions and accompanied by the fsck part.

The kind of stuff that would help all this could be something like Python bindings for a *btrfs library*. Helpful even for prototyping fsck stuff, making illustrations, etc.

We could see btrfsprogs turn into a library + tool, someday. Added to project page.

As of today, how are btrfs developers testing the filesystem implementation (except with xfstests)?

If there is a patch fixing a particular bug, I try to set up an environment stressing exactly that bug (and sometimes finding another one ...). The xfstests suite is a must before any testing. There are common loads raising the chances to hit a bug, like repeated snapshots (and deletions), exhausting data/metadata space, 'fi defrag', 'fi sync', 'fi balance'. Sometimes it's enough to run a specific xfstest in a loop. I have a set of hackish scripts doing just these tasks, or wrappers around xfstests to create a filesystem with desired raid levels (where applicable) and let the suite run on top of it.

Another dimension of testing is mount options: there are some combinations likely to exercise specific parts of code, or create files in a way that may confuse different mount options (like nodatasum).

We've seen btrfs-specific tests added to xfstests, so it's mostly changing the outer environment for the testsuite.

david
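The "simplest case" mentioned above, flipping individual bits in a small disk image, can be sketched with standard tools. This is a generic illustration rather than btrfs-corrupt-block or anything from xfstests: the helper name flip_bit is hypothetical, it knows nothing about btrfs structures, and it should only ever be run against throwaway image files.

```shell
#!/bin/sh
# Sketch only: flip one bit in a file in place.
# $1 = file, $2 = byte offset, $3 = bit number (0-7).
flip_bit() {
    file=$1
    offset=$2
    bit=$3
    # Read the current byte as an unsigned decimal number.
    byte=$(od -An -tu1 -j "$offset" -N1 "$file" 2>/dev/null | tr -d ' \n')
    # Nothing to flip if the offset is past the end of the file.
    [ -n "$byte" ] || return 1
    # Toggle the selected bit.
    new=$(( byte ^ (1 << bit) ))
    # Write the modified byte back without truncating the file.
    printf "\\$(printf '%03o' "$new")" |
        dd of="$file" bs=1 seek="$offset" count=1 conv=notrunc 2>/dev/null
}
```

Calling flip_bit twice with the same arguments restores the original byte, which makes it easy to undo a single corruption between test runs on the same image.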
Re: Errors in dmesg, no crashes though.
On Sat, Feb 25, 2012 at 09:14:01AM +1030, Jordan Windsor wrote:

I'm running Ubuntu under KVM, with btrfs on the host where the Qemu/KVM image is stored; the VM was also running at the time. I was going to check something unrelated in the dmesg output, and as I did that I noticed some errors in it about btrfs. Here they are:

[ 4294.431807] btrfs: block rsv returned -28
[ 4294.431811] [ cut here ]
[ 4294.431831] WARNING: at fs/btrfs/extent-tree.c:5985

This is a warning and it shows up from time to time, across recent releases.

This didn't cause any crashes or related issues. Also unrelated (I think?) is that I get very low performance (about 5MBps according to iostat) in a guest under Qemu/KVM.

3.2.7-1-ARCH x86_64 Arch Linux

Yes, the performance goes down as some pathological should-not-happen code path is taken. I myself haven't seen this error recently during testing, but at the time I did, it slowed down the machine for a while, i.e. it's not an infinite loop. Seems there's a dark corner of space reservations left for Josef.

david
Re: LABEL only 1 device
Helmut Hullen posted on Mon, 27 Feb 2012 11:27:00 +0100 as excerpted: Du meintest am 27.02.12: mkfs.btrfs creates a new filesystem. The -L option sets the label for the newly-created FS. The safest way may be deleting this option ... it seems to work as expected only when I create a new FS on 1 disk/partition. I've said this several times: Your expectations are wrong. You don't label partitions. Yes - now I know. But I'm afraid other people also expect wrong - when I use mkfs.ext[234] then this option works (in another way than with mkfs.btrfs). AFAIK, it works in the same way... that is, it labels the, in that case, ext2/3/4 filesystem, in this case (mkfs.btrfs), btrfs filesystem. From the manpages: mkfs.btrfs (aka mkbtrfs): -L, --label name Specify a label for the filesystem. mkfs.ext2/3/4 (aka mke2fs): -L new-volume-label Set the volume label for the filesystem to new-volume-label. The maximum length of the volume label is 16 bytes. e2label: e2label will display or change the filesystem label on the ext2, ext3, or ext4 filesystem located on device. mkreiserfs: -l | --label LABEL Sets the volume label of the filesystem. LABEL can at most be 16 characters long; if it is longer than 16 characters, mkreiserfs will truncate it. reiserfstune: -l | --label LABEL Set the volume label of the filesystem. LABEL can be at most 16 characters long; if it is longer than 16 characters, reiserfstune will truncate it. The mkswap manpage does make things a bit more confusing, until you realize that the device they're referencing is a swap device, which can be a file, not just a block device. mkswap sets up a Linux swap area on a device or in a file. [...] -L, --label label Specify a label for the device, to allow swapon by label. fstab indicates the filesystem label: The first field (fs_spec). This field describes the block special device or remote filesystem to be mounted. 
For ordinary mounts it will hold (a link to) a block special device node (as created by mknod(8)) for the device to be mounted, like `/dev/cdrom' or `/dev/sdb7'. [...] Instead of giving the device explicitly, one may indicate the (ext2 or xfs) filesystem that is to be mounted by its UUID or volume label (cf. e2label(8) or xfs_admin(8)), writing LABEL=label or UUID=uuid, e.g., `LABEL=Boot'[.] This will make the system more robust: adding or removing a SCSI disk changes the disk device name but not the filesystem volume label. mount seems to be confused, using label in both the filesystem and device context (it also discusses selinux labels, etc, which are of course different). I'm not going to quote it here as the bits discussing label are dispersed and getting context clear on all of them would take a lot of space. Searching the manpage for label (case insensitive search) works, tho, again noting that it uses label in selinux and other contexts as well. In another post I mentioned that gpt partitions do have names, which /could/ function similarly to labels, tho Linux including the mount command generally ignores them at present. From the gdisk (part of gptfdisk) manpage (the cgdisk and sgdisk manpages, same package, are similarly worded, including the note on the distinction between gpt partition name and filesystem label): c Change the GPT name of a partition. This name is encoded as a UTF-16 string, but proper entry and display of anything beyond basic ASCII values requires suitable locale and font support. For the most part, Linux ignores the partition name, but it may be important in some OSes. GPT fdisk sets a default name based on the partition type code. Note that the GPT partition name is different from the filesystem name, which is encoded in the filesystem's data structures. Note especially that last sentence, above. So a filesystem label is just that, a /filesystem/ label. 
That there's normally a 1:1 correspondence between filesystem and the block device(s) it's on is simply an accident. But it's NOT an accident when a btrfs filesystem label applies to ALL the devices that compose the filesystem, since it's a FILESYSTEM label, NOT a PARTITION label. As the gptfdisk manpages make clear, partition names/labels, where they exist as in gpt based partitioning, are quite distinct from the filesystem names/labels. However, the above manpage research does point out that while usage is generally quite
Re: Errors in dmesg, no crashes though.
I've just seen this too on Fedora 16 while I was investigating an NFS issue. I was trying to copy a file from an NFS mount to a btrfs partition. The NFS transfers for large files were occurring in bursts for some reason and I was aborting the copy at times. This NFS problem was not related to btrfs (cat NFS file /dev/null was also bursting and slow). Originally I ran 3.2.1-3.fc16, but just upgraded to 3.2.7-1.fc16. The file system was formatted with 3.2.1 originally. I can't say for sure what caused this - whether it was the NFS being slow, the copy being interrupted, or btrfs itself. Regards, Nik On Mon, Feb 27, 2012 at 9:03 AM, David Sterba d...@jikos.cz wrote: On Sat, Feb 25, 2012 at 09:14:01AM +1030, Jordan Windsor wrote: I'm running Ubuntu under KVM, with btrfs on the host where the Qemu/KVM image is stored, the VM was also running at the time. I was going to check something unrelated in the dmesg output, as I did that I noticed some errors in it about btrfs here they are: [ 4294.431807] btrfs: block rsv returned -28 [ 4294.431811] [ cut here ] [ 4294.431831] WARNING: at fs/btrfs/extent-tree.c:5985 This is a warning and it shows up from time to time, accross recent releases. This didn't cause any crashes or related issues, also unrelated (I think?) is that I get very low performance (about 5MBps according to iostat) in a guest under Qemu/KVM. 3.2.7-1-ARCH x86_64 Arch Linux Yes, the performance goes down as some pathological should-not-happen code path is taken. I myself haven't seen this error recently during testing, but at the time I did, it slowed down the machine for a while, ie. it's not a inifinite loop. Seems there's a dark corner of space reservations left for Josef. 
david
Re: LABEL only 1 device
Hello, Duncan, you wrote on 27.02.12:

I've said this several times: Your expectations are wrong. You don't label partitions.

Yes - now I know. But I'm afraid other people also have the wrong expectation - when I use mkfs.ext[234] then this option works (in another way than with mkfs.btrfs).

AFAIK, it works in the same way... that is, it labels the, in that case, ext2/3/4 filesystem, in this case (mkfs.btrfs), btrfs filesystem. From the manpages:

mkfs.btrfs (aka mkbtrfs):
-L, --label name
Specify a label for the filesystem.

mkfs.ext2/3/4 (aka mke2fs):
-L new-volume-label
Set the volume label for the filesystem to new-volume-label. The maximum length of the volume label is 16 bytes.

But there's a small difference:

mke2fs -L MyLabel /dev/sdn4

only sets/changes the label (ok - it tests the type of the partition and refuses labeling if the type doesn't fit).

mkfs.btrfs -L MyLabel /dev/sdn4

not only sets/changes the label but also (re-)creates a btrfs filesystem, using the default parameters. I had to learn this difference ...

Best regards!
Helmut
[RFC PATCH 21/22] btrfs: add support for read_iter, write_iter, and direct_IO_bvec
Some helpers were broken out of btrfs_direct_IO() in order to avoid code duplication in new bio_vec-based function. Signed-off-by: Dave Kleikamp dave.kleik...@oracle.com Cc: Zach Brown z...@zabbo.net Cc: Chris Mason chris.ma...@oracle.com Cc: linux-btrfs@vger.kernel.org --- fs/btrfs/file.c |2 + fs/btrfs/inode.c | 116 +- 2 files changed, 82 insertions(+), 36 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 859ba2d..7a2fbc0 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -1880,6 +1880,8 @@ const struct file_operations btrfs_file_operations = { .aio_read = generic_file_aio_read, .splice_read= generic_file_splice_read, .aio_write = btrfs_file_aio_write, + .read_iter = generic_file_read_iter, + .write_iter = generic_file_write_iter, .mmap = btrfs_file_mmap, .open = generic_file_open, .release= btrfs_release_file, diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 32214fe..52199e7 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -6151,24 +6151,14 @@ static ssize_t check_direct_IO(struct btrfs_root *root, int rw, struct kiocb *io out: return retval; } -static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb, - const struct iovec *iov, loff_t offset, - unsigned long nr_segs) + +static ssize_t btrfs_pre_direct_IO(int writing, loff_t offset, size_t count, + struct inode *inode, int *write_bits) { - struct file *file = iocb-ki_filp; - struct inode *inode = file-f_mapping-host; struct btrfs_ordered_extent *ordered; struct extent_state *cached_state = NULL; u64 lockstart, lockend; ssize_t ret; - int writing = rw WRITE; - int write_bits = 0; - size_t count = iov_length(iov, nr_segs); - - if (check_direct_IO(BTRFS_I(inode)-root, rw, iocb, iov, - offset, nr_segs)) { - return 0; - } lockstart = offset; lockend = offset + count - 1; @@ -6176,7 +6166,7 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb, if (writing) { ret = btrfs_delalloc_reserve_space(inode, count); if (ret) - goto out; + return ret; } while (1) { @@ -6191,8 +6181,8 @@ 
static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb, lockend - lockstart + 1); if (!ordered) break; - unlock_extent_cached(BTRFS_I(inode)-io_tree, lockstart, lockend, -cached_state, GFP_NOFS); + unlock_extent_cached(BTRFS_I(inode)-io_tree, lockstart, +lockend, cached_state, GFP_NOFS); btrfs_start_ordered_extent(inode, ordered, 1); btrfs_put_ordered_extent(ordered); cond_resched(); @@ -6203,46 +6193,99 @@ static ssize_t btrfs_direct_IO(int rw, struct kiocb *iocb, * the dirty or uptodate bits */ if (writing) { - write_bits = EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING; - ret = set_extent_bit(BTRFS_I(inode)-io_tree, lockstart, lockend, -EXTENT_DELALLOC, 0, NULL, cached_state, -GFP_NOFS); - if (ret) { + *write_bits = EXTENT_DELALLOC | EXTENT_DO_ACCOUNTING; + ret = set_extent_bit(BTRFS_I(inode)-io_tree, lockstart, +lockend, EXTENT_DELALLOC, 0, NULL, +cached_state, GFP_NOFS); + if (ret) clear_extent_bit(BTRFS_I(inode)-io_tree, lockstart, -lockend, EXTENT_LOCKED | write_bits, +lockend, EXTENT_LOCKED | *write_bits, 1, 0, cached_state, GFP_NOFS); - goto out; - } } - free_extent_state(cached_state); - cached_state = NULL; - ret = __blockdev_direct_IO(rw, iocb, inode, - BTRFS_I(inode)-root-fs_info-fs_devices-latest_bdev, - iov, offset, nr_segs, btrfs_get_blocks_direct, NULL, - btrfs_submit_direct, 0); + return ret; +} + +static ssize_t btrfs_post_direct_IO(ssize_t ret, loff_t offset, size_t count, + struct inode *inode, int *write_bits) +{ + struct extent_state *cached_state = NULL; if (ret 0 ret != -EIOCBQUEUED) { clear_extent_bit(BTRFS_I(inode)-io_tree, offset, - offset + iov_length(iov, nr_segs) - 1, -
Re: LABEL only 1 device
On Mon, Feb 27, 2012 at 10:15:00PM +0100, Helmut Hullen wrote:

You wrote on 27.02.12:

I've said this several times: Your expectations are wrong. You don't label partitions.

Yes - now I know. But I'm afraid other people also have the wrong expectation - when I use mkfs.ext[234] then this option works (in another way than with mkfs.btrfs).

AFAIK, it works in the same way... that is, it labels the, in that case, ext2/3/4 filesystem, in this case (mkfs.btrfs), btrfs filesystem. From the manpages:

mkfs.btrfs (aka mkbtrfs):
-L, --label name
Specify a label for the filesystem.

mkfs.ext2/3/4 (aka mke2fs):
-L new-volume-label
Set the volume label for the filesystem to new-volume-label. The maximum length of the volume label is 16 bytes.

But there's a small difference:

mke2fs -L MyLabel /dev/sdn4

only sets/changes the label (ok - it tests the type of the partition and refuses labeling if the type doesn't fit).

That feels really weird. It wouldn't ever occur to me to look at a mkfs tool to relabel a filesystem without destroying the data on it. I view this behaviour as a bug.

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Great oxymorons of the world, no. 6: Mature Student ---
Re: LABEL only 1 device
Hi Helmut,

are you sure that 'mkfs.ext2/3/4 -L label /dev/xxx' doesn't create a new fs? AFAIK, to change the label of an existing ext2/3/4 filesystem you should use tune2fs. I don't have a Linux system available right now, but this is what I would expect, and it would make a lot more sense than changing a label via mkfs.ext2/3/4.

If you are correct about that labeling behaviour, then the btrfs way makes about 1000x more sense than the way ext2/3/4 does it. mkfs should only be used for creating filesystems. For changing an existing fs, tools like tune2fs, btrfs etc. should be used.

Regards,
Felix

On 2/27/12 10:15 PM, Helmut Hullen wrote:
> Hello, Duncan,
>
> You wrote on 27.02.12:
>
>>> I've said this several times: Your expectations are wrong. You don't label partitions.
>>
>> Yes - now I know. But I'm afraid other people also have the wrong expectation: when I use mkfs.ext[234], this option works (in a different way than with mkfs.btrfs).
>
> AFAIK, it works in the same way... that is, it labels the filesystem: ext2/3/4 in that case, btrfs in this case (mkfs.btrfs). From the manpages:
>
> mkfs.btrfs (aka mkbtrfs):
>   -L, --label name
>       Specify a label for the filesystem.
>
> mkfs.ext2/3/4 (aka mke2fs):
>   -L new-volume-label
>       Set the volume label for the filesystem to new-volume-label. The maximum length of the volume label is 16 bytes.

But there's a small difference:

  mke2fs -L MyLabel /dev/sdn4

only sets/changes the label (ok - it tests the type of the partition and refuses labeling if the type doesn't fit).

  mkfs.btrfs -L MyLabel /dev/sdn4

not only sets/changes the label but also (re-)creates a btrfs filesystem, using the default parameters.

I had to learn this difference ...

Best regards!
Helmut

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: LABEL only 1 device
On Mon, Feb 27, 2012 at 10:15:00PM +0100, Helmut Hullen wrote:
> You wrote on 27.02.12:
>
>>>> I've said this several times: Your expectations are wrong. You don't label partitions.
>>>
>>> Yes - now I know. But I'm afraid other people also have the wrong expectation: when I use mkfs.ext[234], this option works (in a different way than with mkfs.btrfs).
>>
>> AFAIK, it works in the same way... that is, it labels the filesystem: ext2/3/4 in that case, btrfs in this case (mkfs.btrfs). From the manpages:
>>
>> mkfs.btrfs (aka mkbtrfs):
>>   -L, --label name
>>       Specify a label for the filesystem.
>>
>> mkfs.ext2/3/4 (aka mke2fs):
>>   -L new-volume-label
>>       Set the volume label for the filesystem to new-volume-label. The maximum length of the volume label is 16 bytes.
>
> But there's a small difference:
>
>   mke2fs -L MyLabel /dev/sdn4
>
> only sets/changes the label (ok - it tests the type of the partition and refuses labeling if the type doesn't fit).

OK, I have just tried this out. It does set the filesystem label. It also wipes the filesystem, as I expected it to. You clearly aren't doing this on existing filesystems with data in them.

hrm@ruth:~ $ sudo mke2fs /dev/loop0
mke2fs 1.42 (29-Nov-2011)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
Stride=0 blocks, Stripe width=0 blocks
25688 inodes, 102400 blocks
5120 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=67371008
13 block groups
8192 blocks per group, 8192 fragments per group
1976 inodes per group
Superblock backups stored on blocks:
        8193, 24577, 40961, 57345, 73729

Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done

hrm@ruth:~ $ sudo mount /dev/loop0 /mnt
hrm@ruth:~ $ sudo dd if=/dev/urandom of=/mnt/foo bs=1M count=5
5+0 records in
5+0 records out
5242880 bytes (5.2 MB) copied, 1.74158 s, 3.0 MB/s
hrm@ruth:~ $ ls /mnt
foo  lost+found
hrm@ruth:~ $ sudo umount /mnt
hrm@ruth:~ $ sudo mke2fs -L newlabel /dev/loop0
mke2fs 1.42 (29-Nov-2011)
Filesystem label=newlabel
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
Stride=0 blocks, Stripe width=0 blocks
25688 inodes, 102400 blocks
5120 blocks (5.00%) reserved for the super user
First data block=1
Maximum filesystem blocks=67371008
13 block groups
8192 blocks per group, 8192 fragments per group
1976 inodes per group
Superblock backups stored on blocks:
        8193, 24577, 40961, 57345, 73729

Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done

hrm@ruth:~ $ sudo mount /dev/loop0 /mnt
hrm@ruth:~ $ ls /mnt
lost+found
hrm@ruth:~ $

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Great oxymorons of the world, no. 6: Mature Student ---
Re: LABEL only 1 device
Hello, Hugo,

You wrote on 27.02.12:

>> But there's a small difference:
>>
>>   mke2fs -L MyLabel /dev/sdn4
>>
>> only sets/changes the label (ok - it tests the type of the partition and refuses labeling if the type doesn't fit).
>
> OK, I have just tried this out. It does set the filesystem label. It also wipes the filesystem, as I expected it to. You clearly aren't doing this on existing filesystems with data in them.

I should have tested it ... sorry.

I've always labeled my ext[234] partitions with e2label. And because I hadn't found an equally simple command for btrfs, I used "mkfs.btrfs -L" instead of "btrfs fi label mountpoint".

Best regards!
Helmut
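The exchange above turns on the fact that a filesystem label is a field in the filesystem's own superblock, not a property of the partition. That is why a dedicated tool can change it in place while mkfs writes a whole new filesystem. The following is a minimal Python sketch of that idea for ext2/3/4 only. The offsets are the standard ext2 superblock layout (superblock at byte 1024, s_magic at offset 56, the 16-byte s_volume_name at offset 120). The function names and the synthetic image are hypothetical illustrations, not part of any tool mentioned in the thread.

```python
import struct

EXT_SUPERBLOCK_OFFSET = 1024  # the primary superblock starts 1024 bytes into the device
EXT_MAGIC_OFFSET = 56         # s_magic, little-endian 16-bit value
EXT_MAGIC = 0xEF53
EXT_LABEL_OFFSET = 120        # s_volume_name
EXT_LABEL_LEN = 16            # matches the 16-byte limit quoted from the mke2fs manpage


def read_ext_label(image: bytes) -> str:
    """Return the volume label stored in an ext2/3/4 superblock."""
    sb = image[EXT_SUPERBLOCK_OFFSET:EXT_SUPERBLOCK_OFFSET + 1024]
    if len(sb) < EXT_LABEL_OFFSET + EXT_LABEL_LEN:
        raise ValueError("image too small to hold a superblock")
    (magic,) = struct.unpack_from("<H", sb, EXT_MAGIC_OFFSET)
    if magic != EXT_MAGIC:
        raise ValueError("not an ext2/3/4 superblock")
    raw = sb[EXT_LABEL_OFFSET:EXT_LABEL_OFFSET + EXT_LABEL_LEN]
    # The label is NUL-padded; keep only the part before the first NUL.
    return raw.split(b"\0", 1)[0].decode("utf-8", errors="replace")


def make_fake_image(label: str) -> bytes:
    """Build a synthetic image with only the magic and label set (demo helper)."""
    sb = bytearray(1024)
    struct.pack_into("<H", sb, EXT_MAGIC_OFFSET, EXT_MAGIC)
    encoded = label.encode("utf-8")[:EXT_LABEL_LEN]
    sb[EXT_LABEL_OFFSET:EXT_LABEL_OFFSET + len(encoded)] = encoded
    return bytes(EXT_SUPERBLOCK_OFFSET) + bytes(sb)


if __name__ == "__main__":
    print(read_ext_label(make_fake_image("newlabel")))
```

The sketch only reads the field. Writing it back by hand would not be safe in general (newer ext4 features checksum the superblock), which is one more reason to use e2label or tune2fs rather than anything home-grown.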
Re: Set nodatacow per file?
On Mon, Feb 27, 2012 at 7:54 AM, dima dole...@parallels.com wrote:
> Hello,
>
> Since several people asked to post the results, here they are. I tried a raw virtio disk with and without -z -C set, and also a qcow2 virtio disk without -z -C set, and did not notice any difference in performance at all - Redhat 6.2 Minimal installs in 10 minutes in each case. The abysmal performance of several months ago (like 10 minutes just for virtual disk formatting) under the same conditions is gone, at least on 3.3.0-rc5.

Just to make sure, this is a _new_ virtual disk, right? I can barely contain my excitement right now. This is amazing progress.

> best
> ~dima
>
> On 02/24/2012 02:22 PM, dima wrote:
>> On 02/13/2012 04:17 PM, Ralf-Peter Rohbeck wrote:
>>> Hello,
>>> is it possible to set nodatacow on a per-file basis? I couldn't find anything. If not, wouldn't that be a great feature to get around the performance issues with VM and database storage? Of course cloning should still cause COW.
>>
>> Hello,
>> Going back to the original question from Ralf, I wanted to share my experience. Yesterday I set up KVM+qemu and set -z -C with David's 'fileflags' utility for the VM image file. I was very pleased with the results - a Redhat 6 Minimal installation took 10 minutes, whereas it was taking 'forever' the last time I tried it some 4 months ago. Writes during installation were very moderate. Performance of the VM is excellent. Installing some big packages with yum inside the VM goes very quickly, at a speed indistinguishable from that of bare metal installs.
>>
>> I am not quite sure whether this improvement should be attributed to the nocow and nocompress flags or to the overall improvement of btrfs (I am on a 3.3-rc4 kernel), but KVM is definitely more than usable on btrfs now. I have yet to test the install speed and performance without those flags set.
>>
>> best
>> ~dima
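For readers wondering what "setting nocow and nocompress on a file" amounts to, here is a hedged Python sketch using the generic Linux inode-flag ioctls. It assumes that the -C and -z options mentioned above correspond to the FS_NOCOW_FL and FS_NOCOMP_FL bits from <linux/fs.h>; the thread does not show how the 'fileflags' utility is implemented, so that mapping is an assumption. The request numbers are the 64-bit Linux values, and the function names are hypothetical.

```python
import fcntl
import os
import struct

# Request numbers and flag bits from <linux/fs.h> (values for 64-bit Linux).
FS_IOC_GETFLAGS = 0x80086601
FS_IOC_SETFLAGS = 0x40086602
FS_NOCOMP_FL = 0x00000400  # do not compress this file
FS_NOCOW_FL = 0x00800000   # do not copy-on-write this file


def add_flags(current: int, nocow: bool = True, nocompress: bool = True) -> int:
    """Return the flag word with the requested bits added, others untouched."""
    if nocow:
        current |= FS_NOCOW_FL
    if nocompress:
        current |= FS_NOCOMP_FL
    return current


def set_nocow(path: str) -> int:
    """Set NOCOW and NOCOMP on path and return the new flag word.

    Raises OSError on filesystems that do not support these ioctls.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # The kernel reads and writes a 32-bit value at the start of the buffer.
        buf = bytearray(8)
        fcntl.ioctl(fd, FS_IOC_GETFLAGS, buf)
        (flags,) = struct.unpack_from("I", buf, 0)
        flags = add_flags(flags)
        struct.pack_into("I", buf, 0, flags)
        fcntl.ioctl(fd, FS_IOC_SETFLAGS, buf)
        return flags
    finally:
        os.close(fd)
```

A typical use would be to create an empty image file first and call set_nocow() on it before writing any data. As far as I know, btrfs only honours the NOCOW bit reliably when it is set on a file that has no data yet, so treat the ordering as part of the sketch's assumptions.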
Re: SELinux inode size gotcha in btrfs.
On Mon, Feb 27, 2012 at 10:18:55PM +, Alex wrote:
> I've come across the 'gotcha' in XFS where the inode size defaults to 256 [1] whereas for SELinux the attributes play better when you initialise it at creation to 512.

A btrfs inode structure is 136 bytes in size. xattrs and any inline file data are separate from the inode structure, stored with additional keys in the FS tree (which means that they're quite likely to appear in the same page as the inode data, but not guaranteed).

> From my reading of the btrfs specs [2] it doesn't look like you'll get caught with that as the inodes will not contain embedded file data or extended attribute data. These things are stored in other item types. Have I read that right? I've seen xattr bugs patches etc but nothing that would hit the SE Linux domain.

It's not clear from looking at the gentoo doc what the problem actually is with different inode sizes... Without some kind of indication what the issue really is, it's kind of hard to say how this might affect btrfs.

Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 515C238D from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- The enemy have elected for Death by Powerpoint. That's what ---
    they shall get. -- gdb
Re: SELinux inode size gotcha in btrfs.
On Mon, Feb 27, 2012 at 10:18:55PM +, Alex wrote:
> From my reading of the btrfs specs [2] it doesn't look like you'll get caught with that as the inodes will not contain embedded file data or extended attribute data. These things are stored in other item types. Have I read that right? I've seen xattr bugs patches etc but nothing that would hit the SE Linux domain.

That's right. An inode, represented as a btrfs_inode_item, does not contain any xattr fields; they're stored independently as a btrfs_dir_item of type BTRFS_FT_XATTR. Due to the way the b-tree keys are built, the xattr item key should be stored near the inode item key; that's for the tree search side. The xattr data are always stored inline in the b-tree leaf.

david
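The remark about key construction can be made concrete with a small Python sketch. btrfs orders items by the key triple (objectid, type, offset), compared field by field, which is also how Python compares tuples. The key type numbers below are the ones defined in the btrfs on-disk format; the inode numbers and the xattr name hash are made-up values for illustration, and the helper name is hypothetical.

```python
# Key type values from the btrfs on-disk format.
BTRFS_INODE_ITEM_KEY = 1
BTRFS_INODE_REF_KEY = 12
BTRFS_XATTR_ITEM_KEY = 24
BTRFS_EXTENT_DATA_KEY = 108


def key(objectid: int, item_type: int, offset: int = 0) -> tuple:
    """A btrfs key is compared as (objectid, type, offset), like a tuple."""
    return (objectid, item_type, offset)


# Items for two files, inserted in arbitrary order. For xattr items the
# offset is a hash of the attribute name; 0x1234 is a placeholder.
items = [
    key(258, BTRFS_EXTENT_DATA_KEY),
    key(257, BTRFS_XATTR_ITEM_KEY, 0x1234),
    key(257, BTRFS_EXTENT_DATA_KEY),
    key(258, BTRFS_INODE_ITEM_KEY),
    key(257, BTRFS_INODE_ITEM_KEY),
    key(257, BTRFS_INODE_REF_KEY, 256),
]

ordered = sorted(items)

if __name__ == "__main__":
    for objectid, item_type, offset in ordered:
        print(objectid, item_type, hex(offset))
```

After sorting, everything belonging to inode 257 is contiguous and the xattr item sits two slots after the inode item, ahead of the file's extent data. Whether the two end up in the same leaf depends on where leaf boundaries fall, which is why the thread says "likely" and "near" rather than "always".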
Re: Set nodatacow per file?
On 02/28/2012 07:10 AM, Chester wrote:
> On Mon, Feb 27, 2012 at 7:54 AM, dima dole...@parallels.com wrote:
>> Hello,
>>
>> Since several people asked to post the results, here they are. I tried a raw virtio disk with and without -z -C set, and also a qcow2 virtio disk without -z -C set, and did not notice any difference in performance at all - Redhat 6.2 Minimal installs in 10 minutes in each case. The abysmal performance of several months ago (like 10 minutes just for virtual disk formatting) under the same conditions is gone, at least on 3.3.0-rc5.
>
> Just to make sure, this is a _new_ virtual disk, right? I can barely contain my excitement right now. This is amazing progress.

Yes, it is a newly created virtual disk. By the way, one thing that slipped out of my message: in the raw case I pre-allocated the entire image, but in the qcow2 case I unchecked this box in virt-manager, so the disk grew as the system was installing. Nevertheless, I did not notice any performance degradation during the install.

best
~dima
Re: [PATCH] Btrfs: hold enough space for global_rsv
On 02/27/2012 09:29 PM, Johannes Hirte wrote:
> On Tue, 17 Jan 2012 17:51:59 +0800, Liu Bo liubo2...@cn.fujitsu.com wrote:
>
>> I've kept hitting enospc warnings of global_rsv while running defragment on files:
>>
>> btrfs: block rsv returned -28
>> WARNING: at fs/btrfs/extent-tree.c:5984 btrfs_alloc_free_block+0x333/0x340 [btrfs]()
>> ...
>>
>> I used a fio job to create a file with lots of fragments:
>>
>> $ filefrag /mnt/btrfs/foobar
>> /mnt/btrfs/foobar: 66964 extents found
>>
>> and then
>>     btrfs fi defrag /mnt/btrfs/foobar
>>     sync
>> would pop the warnings.
>>
>> I found that the global_rsv size is just not enough for defragment, and didn't find any space leak in using global_rsv, so double it and go ahead.
>>
>> Signed-off-by: Liu Bo liubo2...@cn.fujitsu.com
>> ---
>>  fs/btrfs/extent-tree.c |    2 +-
>>  1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
>> index 8603ee4..77ea23c 100644
>> --- a/fs/btrfs/extent-tree.c
>> +++ b/fs/btrfs/extent-tree.c
>> @@ -3979,7 +3979,7 @@ static u64 calc_global_metadata_size(struct btrfs_fs_info *fs_info)
>>  	num_bytes += div64_u64(data_used + meta_used, 50);
>>
>>  	if (num_bytes * 3 > meta_used)
>> -		num_bytes = div64_u64(meta_used, 3);
>> +		num_bytes = div64_u64(meta_used, 3) * 2;
>>
>>  	return ALIGN(num_bytes, fs_info->extent_root->leafsize << 10);
>>  }
>
> This patch breaks my system. With this applied, all services fail on boot with "no space left" messages.

It's weird, since this patch is just aiming to enlarge our metadata reservation count. So you've tried a revert or a bisect, right?

Can you show me the environment or any log messages?

thanks,
liubo
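To see what the one-line change does to the numbers, here is a rough Python sketch of only the arithmetic visible in the quoted hunk. Anything the kernel function adds to num_bytes before the hunk is deliberately left out, the 4096-byte leaf size is an assumed default, and the function names are hypothetical. It is meant as a way to compare the old and new caps, not as a model of the real reservation logic.

```python
def align_up(value: int, boundary: int) -> int:
    """Round value up to a multiple of boundary (a power of two), like ALIGN()."""
    return (value + boundary - 1) & ~(boundary - 1)


def global_rsv_size(data_used: int, meta_used: int, *, patched: bool,
                    leafsize: int = 4096) -> int:
    """Approximate the size computed by the lines shown in the hunk."""
    # Terms computed earlier in the kernel function are omitted here.
    num_bytes = (data_used + meta_used) // 50
    if num_bytes * 3 > meta_used:
        # Old cap: one third of used metadata. Patched cap: two thirds.
        num_bytes = meta_used // 3
        if patched:
            num_bytes *= 2
    return align_up(num_bytes, leafsize << 10)


if __name__ == "__main__":
    GiB = 1 << 30
    MiB = 1 << 20
    before = global_rsv_size(100 * GiB, 1 * GiB, patched=False)
    after = global_rsv_size(100 * GiB, 1 * GiB, patched=True)
    print(before // MiB, "MiB ->", after // MiB, "MiB")
```

With 100 GiB of data and 1 GiB of metadata in use (illustrative figures only), the cap applies in both cases and the result goes from 344 MiB to 684 MiB. The patch only has an effect on filesystems where the cap is reached, that is, where used metadata is small relative to used data.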