Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
On 09/15/2017 12:18 AM, Hugo Mills wrote:
> As far as I know, both of these are basically known issues, with no
> good solution, other than not using O_DIRECT. Certainly the first
> issue is one I recognise. The second isn't one I recognise directly,
> but is unsurprising to me.
>
> There have been discussions -- including developers -- on this list
> as recent as a month or so ago. The general outcome seems to be that
> any problems with O_DIRECT are not going to be fixed.

I missed this thread; could you point me to it?

If csum and O_DIRECT are not reliable together, why not disallow one of them, i.e. allow O_DIRECT only on nodatasum files? ZFS (on Linux) does not support O_DIRECT at all...

In fact, most of the applications which benefit from O_DIRECT (VMs and databases come to mind) are the same ones which also need nodatasum to get good performance.

One of the strongest points of BTRFS is its checksums, but these are not effective when the file is opened with O_DIRECT. Worse, there are cases where the file is corrupted and the application gets -EIO, not to mention that dmesg is filled with "csum failed" messages.

> Hugo.
>
> On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote:
>> Hi all,
>>
>> I discovered two bugs when O_DIRECT is used...
>> [...]
Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
On 09/15/2017 05:55 AM, Andrei Borzenkov wrote:
> 15.09.2017 01:00, Goffredo Baroncelli wrote:
>>
>> 2) The second bug, is a more severe bug. If during a writing of a buffer
>> with O_DIRECT, the buffer is updated at the same time by a second process,
>> the checksum may be incorrect.
>>
>
> Is it btrfs specific? If buffer is updated before it was actually
> consumed by kernel, this likely means data corruption on any filesystem.

I don't see any corruption on other filesystems. The fact that an application pushes garbage to the filesystem does not entitle the filesystem to become corrupted. In this case the filesystem does become corrupted, because another application which tries to read the data (without O_DIRECT) may get -EIO.

I repeat: the problem is a data race which happens while the data is in the filesystem's hands, and the kernel then computes a wrong checksum.

IMHO, BTRFS should disallow O_DIRECT (which is what ZFS on Linux does); I think it could be allowed only for nodatasum files.

> I.e. there should be clear indication from kernel that buffer can be
> reused by application, in your example - when pwrite returns. So when
> data corruption happens - during pwrite or after?
> If data is corrupted
> during pwrite, it is arguably application fault - it should disallow
> concurrent access.
>

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
15.09.2017 01:00, Goffredo Baroncelli wrote:
>
> 2) The second bug, is a more severe bug. If during a writing of a buffer with
> O_DIRECT, the buffer is updated at the same time by a second process, the
> checksum may be incorrect.
>

Is it btrfs specific? If the buffer is updated before it was actually consumed by the kernel, this likely means data corruption on any filesystem.

I.e. there should be a clear indication from the kernel that the buffer can be reused by the application; in your example, that is when pwrite returns. So when does the data corruption happen: during pwrite or after? If the data is corrupted during pwrite, it is arguably an application fault; the application should disallow concurrent access.
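The discipline Andrei describes (do not let anyone modify the buffer until pwrite() has returned) can be sketched in C. This is only a minimal illustration under stated assumptions, not code from the thread: the helper name write_stable_copy, the DIO_ALIGN value of 4096 and the bounce-buffer approach are hypothetical choices. It snapshots the caller's data into a private, aligned buffer, so a second process sharing the original memory cannot change the bytes while the kernel consumes them. It does not answer Goffredo's objection that the filesystem should stay consistent whatever the application does.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Assumed alignment for O_DIRECT; real code should query the device. */
#define DIO_ALIGN 4096

/* If the headers do not expose O_DIRECT, degrade to buffered I/O. */
#ifndef O_DIRECT
#define O_DIRECT 0
#endif

/*
 * Hypothetical helper: write `len` bytes from `src` at `offset` using a
 * private copy of the data. `len` and `offset` must be multiples of
 * DIO_ALIGN. Returns the number of bytes written, or -1 on error.
 */
ssize_t write_stable_copy(const char *path, const void *src,
                          size_t len, off_t offset)
{
    void *bounce = NULL;
    ssize_t written = -1;
    int fd;

    if (len == 0 || len % DIO_ALIGN != 0 || offset % DIO_ALIGN != 0) {
        errno = EINVAL;
        return -1;
    }
    if (posix_memalign(&bounce, DIO_ALIGN, len) != 0)
        return -1;

    /* Snapshot the data: later changes to `src` cannot reach the kernel. */
    memcpy(bounce, src, len);

    fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0600);
    if (fd < 0 && errno == EINVAL)
        /* Some filesystems reject O_DIRECT; fall back to buffered I/O. */
        fd = open(path, O_WRONLY | O_CREAT, 0600);

    if (fd >= 0) {
        /* The bounce buffer is untouched until pwrite() returns. */
        written = pwrite(fd, bounce, len, offset);
        close(fd);
    }
    free(bounce);
    return written;
}
```

A caller would pass the shared buffer as `src`; the extra memcpy is the price paid for a stable view of the data, which may or may not be acceptable for the VM and database workloads mentioned in the thread.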
Re: snapshots of encrypted directories?
14.09.2017 18:32, Hugo Mills wrote:
> On Thu, Sep 14, 2017 at 04:57:39PM +0200, Ulli Horlacher wrote:
>> I use encfs on top of btrfs.
>> I can create btrfs snapshots, but I have no suggestive access to the files
>> in these snapshots, because they look like:
>>
>> drwx------ framstag users        - 2017-09-08 11:47:18 uHjprldmxo3-nSfLmcH54HMW
>> drwxr-xr-x framstag users        - 2017-09-08 11:47:18 wNEWaDCgyXTj0d-Myk8wXZfh
>> -rw-r--r-- framstag users      377 2015-06-12 14:02:53 -zDmc7xfobKDkbl8z7oKOHxv
>> -rw-r--r-- framstag users    2,367 2012-07-10 14:32:30 7pfKs27K9k5zANE4WOQEuFa2
>> -rw------- framstag users      692 2009-10-20 13:45:41 8SQElYCph85kDdcFasUHybVr
>> -rw------- framstag users    2,872 2017-08-31 16:21:52 bm,yNi1e4fsAClDv7lNxxSfJ
>> lrwxrwxrwx framstag users        - 2017-06-01 15:53:00 GZxNYI0Gy96R18fz40f7k5rl ->
>> wvuQKHYzdFbar18fW6jjOerXk2IsS4OAA2fnHalBZjMQ,7Kw0j-zE3IJqxhmmGBN8G9
>> -rw-r--r-- framstag users      182 2016-12-01 13:34:31 rqtNBbiYDym0hPMbBL-VLJZcFZu6nkNxlsjTX-sU88I4I1
>>
>> I have to mount the snapshot with encfs, to have access to the (decrypted)
>> files.
>>
>> Any better ideas?
>
> I'd say it's doing exactly what it should be doing. You're making a
> copy of an encrypted data store,

With all due respect, a snapshot is not a copy.

> and the result is encrypted. In order
> to read it, it needs to have the decryption layer applied to it with
> the correct key (which is the need to mount the snapshot with encfs).
>

But the snapshot *is* mounted implicitly, as it is part of the mounted btrfs filesystem. So I can see that this behavior could be rather unexpected.

> Would you _really_ want a system where the encrypted contents of a
> subvolume can be decrypted by simply snapshotting it?

The actual question is: do you need to mount each individual btrfs subvolume when using encfs? If yes, this behavior is at least consistent. If not, how are snapshots different?
Re: [PATCH 1/2] btrfs-progs: build: generate all dependency files
On 2017-09-14 21:41, David Sterba wrote:
> On Thu, Sep 14, 2017 at 07:10:46PM +0900, Naohiro Aota wrote:
>> We're missing several dependency files like:
>>
>> $ diff -u <(find -name '*.o'|cut -d. -f2|sort) <(find -name '*.o.d'|cut -d. -f2|sort)
>>  --- /proc/self/fd/11 2017-09-14 18:17:44.460564620 +0900
>>  +++ /proc/self/fd/12 2017-09-14 18:17:44.460564620 +0900
>
> Please note that an actual diff in the changelog is understood as start
> of the patch by git-am, indenting the --- or +++ lines makes it work
> again.

Oops, I forgot about that limitation. Thank you for the fix.

>
>> @@ -3,7 +3,6 @@
>>  /btrfs-corrupt-block
>>  /btrfs-debug-tree
>>  /btrfs-find-root
>> -/btrfs-list
>>  /btrfs-map-logical
>>  /btrfs-select-super
>>  /btrfstune
>> @@ -29,11 +28,6 @@
>>  /cmds-scrub
>>  /cmds-send
>>  /cmds-subvolume
>> -/convert/common
>> -/convert/main
>> -/convert/source-ext2
>> -/convert/source-fs
>> -/convert/source-reiserfs
>>  /ctree
>>  /dir-item
>>  /disk-io
>>
>>
>> This is due to moving things out of objects and cmds_objects variables. Such
>> missing dependency files cause mis-building of some source files (try touch
>> utils.h; make mkfs/main.o).
>>
>> This patch introduces a new variable "all_objects" to keep all the objects and
>> uses the variable to generate proper dependency file building rules.
>>
>> Signed-off-by: Naohiro Aota
>
> Applied, thanks.
>
Re: defragmenting best practice?
On 14 September 2017 at 19:53, Austin S. Hemmelgarn wrote:
[..]
> While it's not for BTRFS, a tool called e4rat might be of interest to you
> regarding this. It reorganizes files on an ext4 filesystem so that stuff
> used by the boot loader is right at the beginning of the device, and I've
> known people to get insane performance improvements (on the order of 20x in
> some pathologically bad cases) in the time taken from the BIOS handing
> things off to GRUB to GRUB handing execution off to the kernel.

Do you realize that what you've just written has nothing to do with fragmentation? Intentionally or not, you are just trying to change the subject.

[..]
> This shouldn't need examples. It's trivial math combined with basic
> knowledge of hardware behavior. Every request to a device has a minimum
> amount of overhead. On traditional hard drives, this is usually dominated
> by seek latency, but on SSDs, the request setup, dispatch, and completion
> are the dominant factor. Assuming you have a 2 micro-second overhead
> per-request (not an exact number, just chosen for demonstration purposes
> because it makes the math easy), and a 1GB file, the time difference between
> reading ten 100MB extents and reading ten thousand 100kB extents is just
> short of 0.02 seconds, or a factor of about one thousand (which, no surprise
> here, is the factor of difference between the number of extents).

So to produce a delay of a few seconds during boot you would need a few hundred thousand, if not millions, more I/Os than when reading everything with ideal long sequential reads. Almost every package upgrade rewrites its files in full, which with COW produces fully contiguous areas per file. You know, there are not that many files in a typical distribution installation to produce such a measurable impact.

On my current laptop I have a lot of devel and debug stuff installed, and still I have only

$ rpm -qal | wc -l
276489

files (of which only a small fraction are ELF DSOs or executables), installed by

$ rpm -qa | wc -l
2314

packages.

I can bet that even a very complicated boot process touches (with read I/Os) only a few hundred files. None of those files will be read sequentially, because this is not how executable content is usually loaded into the buffer cache.

Simply changing the block device readahead may improve boot time enough without putting all blocks in perfect order. All you need is to run "blockdev --setra N" early enough, where N is greater than the default of 256. All this can be done without thinking about fragmentation.

It seems you don't know that Linux by default reads data from a block device in chunks of at least 256 sectors (512 bytes each, so 128 KiB), because that I/O size is part of the default RA settings. You can change those settings just for boot time and you will get a much lower number of I/Os, and still no significant improvement such as a boot time several times shorter. Fragmentation will in that case be a secondary factor. All this could be done without bothering about fragmentation.

In other words, you are still talking about hypothetically possible results which will be falsified if you try even once to do some real tests and measurements.

The last time I did boot time measurements, it was about sequential start of all services vs. maximum parallelization. And yes, that way it was possible to improve boot time several times over, all without bothering about fragmentation.

The current Fedora systemd base service definitions can be improved in many places by adding more dependencies and executing many small services in parallel. All those corrections can be done without even thinking about fragmentation. Because this base set of systemd services comes with the systemd source code, those improvements can be made for almost all systemd-based Linux distros.

kloczek
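The per-request arithmetic that both sides of this exchange rely on can be checked with a few lines of C. This is a sketch of the back-of-the-envelope model only: the 2 microsecond overhead is the illustrative figure quoted above, not a measurement, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Illustrative per-request overhead from the thread, in microseconds.
 * An assumed round number, not a measured value. */
#define PER_REQUEST_US 2.0

/* Fixed overhead, in microseconds, of issuing `requests` separate reads. */
double request_overhead_us(unsigned long requests)
{
    return (double)requests * PER_REQUEST_US;
}

/* Number of requests whose fixed overhead alone adds up to `seconds`. */
unsigned long requests_for_delay(double seconds)
{
    assert(seconds >= 0.0);
    return (unsigned long)(seconds * 1e6 / PER_REQUEST_US);
}

/* Print the 1 GB comparison discussed in the thread. */
void print_extent_overhead(void)
{
    double few  = request_overhead_us(10);     /* ten 100 MB extents */
    double many = request_overhead_us(10000);  /* ten thousand 100 kB extents */

    printf("10 extents:    %.0f us\n", few);
    printf("10000 extents: %.0f us\n", many);
    printf("difference:    %.5f s\n", (many - few) / 1e6);
    printf("requests per second of overhead: %lu\n",
           requests_for_delay(1.0));
}
```

Under this model the difference for the 1 GB file is 0.01998 s, which matches "just short of 0.02 seconds", and one second of pure overhead corresponds to 500,000 extra requests, which is where the "few hundred thousand, if not millions" estimate comes from. On a spinning disk the per-request cost is dominated by seeks and is much larger than 2 microseconds, so the model says nothing about that case.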
[PATCH] Btrfs: cleanup 'start' subtraction from try uncompressed inline extent
The subtraction was added in:
c8b978188c9a0fd3d535c13debd19d522b726f1f "Btrfs: Add zlib compression support"
and has survived until now (since 08.10.2008).

Because 'start' is checked for zero before this branch, 'actual_end - start' is always equal to 'actual_end' here, so it is safe to remove the subtraction.

Signed-off-by: Timofey Titovets
---
 fs/btrfs/inode.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 02ef32149c15..81123408e82e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -570,7 +570,7 @@ static noinline void compress_file_range(struct inode *inode,
 cont:
 	if (start == 0) {
 		/* lets try to make an inline extent */
-		if (ret || total_in < (actual_end - start)) {
+		if (ret || total_in < actual_end) {
 			/* we didn't compress the entire range, try
 			 * to make an uncompressed inline extent.
 			 */
--
2.14.1
Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
As far as I know, both of these are basically known issues, with no good solution, other than not using O_DIRECT. Certainly the first issue is one I recognise. The second isn't one I recognise directly, but is unsurprising to me.

There have been discussions -- including developers -- on this list as recent as a month or so ago. The general outcome seems to be that any problems with O_DIRECT are not going to be fixed.

Hugo.

On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote:
> Hi all,
>
> I discovered two bugs when O_DIRECT is used...
> [...]
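Since the first issue in the report means an O_DIRECT read can succeed while handing back a block filled with 0x01, an application that cannot avoid O_DIRECT could add its own sanity check on the data it reads. The sketch below is an assumption-laden heuristic, not a fix and not something proposed in the thread: the function names are hypothetical, the 0x01 fill value is only what the report observed on that kernel, and legitimate data may of course consist entirely of 0x01 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Fill byte observed in the report for blocks that failed checksum
 * verification on an O_DIRECT read. Kernel-specific; an assumption here. */
#define SUSPECT_FILL 0x01

/* Return 1 if every byte of the block equals SUSPECT_FILL, else 0. */
int block_looks_poisoned(const unsigned char *block, size_t len)
{
    size_t i;

    if (len == 0)
        return 0;
    for (i = 0; i < len; i++)
        if (block[i] != SUSPECT_FILL)
            return 0;
    return 1;
}

/* Count suspicious blocks in a buffer of nblocks * block_size bytes. */
size_t count_poisoned_blocks(const unsigned char *buf, size_t nblocks,
                             size_t block_size)
{
    size_t i, hits = 0;

    for (i = 0; i < nblocks; i++)
        hits += (size_t)block_looks_poisoned(buf + i * block_size,
                                             block_size);
    return hits;
}
```

A hit is only a hint to cross-check dmesg for "csum failed" or to re-read the range without O_DIRECT, where the thread shows that -EIO is returned.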
BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
Hi all,

I discovered two bugs when O_DIRECT is used...

1) A corrupted file doesn't return -EIO when O_DIRECT is used

Normally BTRFS prevents access to the contents of a corrupted file; however, I was able to read the content of a corrupted file simply by using O_DIRECT.

# in a new btrfs filesystem, create a file
$ sudo mkfs.btrfs -f /dev/sdd5
$ mount /dev/sdd5 t
$ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd bs=$((16*1024)) iflag=fullblock count=1024

# corrupt the file
$ sudo filefrag -v t/abcd
Filesystem type is: 9123683e
File size of t/abcd is 16777216 (4096 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    3475:      70656..     74131:   3476:
   1:     3476..    4095:      74212..     74831:    620:      74132: last,eof
t/abcd: 2 extents found
$ sudo umount t
$ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 /dev/sdd5
mirror 1 logical 289406976 physical 289406976 device /dev/sdd5
corrupting 289406976 copy 1

# try to access the file; expected result: -EIO
$ sudo mount /dev/sdd5 t
$ dd if=t/abcd | hexdump -c | head
dd: error reading 't/abcd': Input/output error
0+0 records in
0+0 records out
0 bytes copied, 0.000477413 s, 0.0 kB/s

# try to access the file using O_DIRECT; expected result: -EIO, instead the file is accessible
$ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head
0000000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001
*
0001000   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
0001010   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
0001020   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
0001030   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
0001040   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
0001050   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
0001060   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
0001070   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g

(dmesg reports the checksum mismatch)
[13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x0ab6be80 mirror 1

Note that the first 4k is filled with 0x01!

Conclusion: even if the file is corrupted and BTRFS normally prevents access to it, when using O_DIRECT
a) no error is returned to the caller
b) instead of the page stored on the disk, a page filled with 0x01 is returned (see also the function __readpage_endio_check())


2) The second bug is more severe. If a buffer is being written with O_DIRECT and is updated at the same time by a second process, the checksum may be incorrect.

At the end of the email there is the code which shows the problem: two processes share the same memory; the first writes it to the disk, the second updates the buffer continuously. A third process tries to read the file, but from time to time it gets -EIO.

If you run my code on a btrfs filesystem you get a lot of

ERROR: read thread; r = 8192, expected = 16384
ERROR: read thread; r = 8192, expected = 16384
ERROR: read thread; e = 5 - Input/output error
ERROR: read thread; e = 5 - Input/output error

The first lines are related to a short read (which may happen). The last lines are related to a checksum mismatch. The dmesg is filled with lines like

[...]
[14873.573547] BTRFS warning (device sdd5): csum failed root 5 ino 259 off 4096 csum 0x0683c6df expected csum 0x55eb85e6 mirror 1
[...]

This is definitely a bug.

I think that using O_DIRECT and updating a page at the same time could happen in a VM. In BTRFS this could lead to a wrong checksum. The problem is what happens when BTRFS detects a checksum error during a read:

a) if O_DIRECT is not used in the read
   * -EIO is returned
   Definitely BAD

b) if O_DIRECT is used in the read
   * it doesn't return the error to the caller
   * it returns a page filled with 0x01 instead of the data from the disk
   Even worse than a)

Note1: even when using O_DIRECT with O_SYNC, the problem still persists.
Note2: the man page of open(2) is full of notes about O_DIRECT, but it also states that using O_DIRECT+fork()+mmap(... MAP_SHARED) is legal.
Note3: even "ZFS on linux" has its troubles with O_DIRECT: in fact ZFS doesn't support it; see https://github.com/zfsonlinux/zfs/issues/224

BR
G.Baroncelli

--- cut --- cut --- cut ---

#define _GNU_SOURCE
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include

#define FILESIZE (4096*4)

int fd;
char *buffer = NULL;

void read_thread(const char *nf)
{
	/* private scratch buffer for the reader */
	void *data = mmap(NULL, FILESIZE, PROT_READ|PROT_WRITE,
			  MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
	/* mmap() reports failure with MAP_FAILED, not NULL */
	assert(data != MAP_FAILED);
	fprintf(stderr, "read_thread: data = %p\n", data);

	int rfd;
	rfd = open(nf, O_RDONLY);
Re: defragmenting best practice?
On Thu, 14 Sep 2017 18:48:54 +0100, Tomasz Kłoczko wrote:

> On 14 September 2017 at 16:24, Kai Krakow wrote:
> [..]
> > Getting e.g. boot files into read order or at least nearby improves
> > boot time a lot. Similar for loading applications.
>
> By how much is it possible to improve boot time?
> Please just give some example which I can try to replay and which will
> show that we get similar results.
> I still have one of my laptops with a spindle and btrfs as root fs (and
> no other FSes in use), so I would be able to confirm whether my numbers
> are close enough to yours.

I need to create a test setup because this system uses bcache. The difference (according to systemd-analyze) between warm bcache and no bcache at all ranges from 16-30s boot time vs. 3+ minutes boot time.

I could turn off bcache, do a boot trace, try to rearrange boot files, boot again. However, that is not very reproducible as the current file layout is not defined. It'd be better to set up a separate machine where I could start over from a "well defined" state before applying optimization steps to see the differences between different strategies. At least readahead is not very helpful, I tested that in the past. It reduces boot time just by a few seconds, maybe 20-30, thus going from 3+ minutes to 2+ minutes.

I still have an old laptop lying around: single spindle, should make a good test scenario. I'll have to see if I can get it back into shape. It will take me some time.

> > Shake tries to
> > improve this by rewriting the files - and this works because file
> > systems (given enough free space) already do a very good job at
> > doing this. But constant system updates degrade this order over
> > time.
>
> OK. Please prepare some database and import data whose size is a few
> times the unused RAM (best if this factor is at least 10). Then run a
> batch of selects, measuring the latency distribution of those queries.

Well, this is pretty easy. Systemd-journald is a real beast when it comes to cow fragmentation. Results can be easily generated and reproduced. There are long traces of discussions on the systemd mailing list, and I simply decided to make the files nocow right from the start and that fixed it for me. I can simply revert it and create benchmarks.

> This will give you numbers for non-fragmented data.

Well, I would probably do it the other way around: generate a fragmented journal file (as that is how journald creates the file over time), then rewrite it in some manner to reduce extents, then run journal operations again on this file. Does it bother you to turn this around?

> Then, as the next stage, apply some number of update queries, reboot
> the system or drop all caches, and repeat the same set of selects.
> After this, all you need to do is compare the latency distributions.

Which tool should I use to measure which latencies?

Speaking of latencies: what's of interest here is perceived performance, resulting mostly from seek overhead (except probably in the journal file case, which just overwhelms by the pure amount of extents). I'm not sure if measuring VFS latencies would provide any useful insights here. VFS probably still works fast enough in this case.

> > It really doesn't matter if some big file is laid out in 1
> > allocation of 1 GB or in 250 allocations of 4MB: It really doesn't
> > make a big difference.
> >
> > Recombining extents into bigger ones, tho, can make a big
> > difference in an aging btrfs, even on SSDs.
>
> That may be an issue with using extents.

I can't follow why you argue that a file with thousands of extents vs. a file of the same size but only a few extents would make no difference to operate on. And of course this has to do with extents. But btrfs uses extents. Do you suggest using ZFS instead?

Due to how cow works, the effect would probably be smaller or barely noticeable for writes, but read scanning through the file becomes slow, with clearly more "noise" from the moving heads.

> Again: please show results from some test which anyone can replay to
> confirm or refute that this effect really exists.
>
> If the problem really exists and is related to extents, you would have
> a real-world explanation of why ZFS does not use extents.

That was never the discussion. You brought in the ZFS point. I read about the design reasoning behind ZFS when it appeared and started to gain public interest years back.

> btrfs is not too far from the classic approach to FS because it still
> uses allocation structures.
> This is not the case for ZFS because this technology has no
> information about what is already allocated.

What about the btrfs free space tree? Isn't that more or less the same? But I don't believe that makes a significant difference for desktop-sized storage. I think the free space tree was introduced for the sake of performance on many-TB file systems up to petabyte storage (and beyond, of course).

> ZFS uses
Re: defragmenting best practice?
On 2017-09-14 13:48, Tomasz Kłoczko wrote: On 14 September 2017 at 16:24, Kai Krakow wrote: [..] Getting e.g. boot files into read order or at least nearby improves boot time a lot. Similar for loading applications. By how much it is possible to improve boot time? Just please some example which I can try to replay which ill be showing that we have similar results. I still have one one of my laptops with spindle on btrfs root fs ( and no other FSess in use) so I could be able to confirm that my numbers are enough close to your numbers. While it's not for BTRFS< a tool called e4rat might be of interest to you regarding this. It reorganizes files on an ext4 filesystem so that stuff used by the boot loader is right at the beginning of the device, and I've know people to get insane performance improvements (on the order of 20x in some pathologicallyb ad cases) in the time taken from the BIOS handing things off to GRUB to GRUB handing execution off to the kernel. Shake tries to improve this by rewriting the files - and this works because file systems (given enough free space) already do a very good job at doing this. But constant system updates degrade this order over time. OK. Please prepare some database, import some data which size will be few times of not used RAM (best if this multiplication factor will be at least 10). Then do some batch of selects measuring distribution latencies of those queries. This will give you some data about. not fragmented data. Then on next stage try to apply some number of update queries and after reboot the system or drop all caches. and repeat the same set of selects. After this all what you need to do is compare distribution of the latencies. It really doesn't matter if some big file is laid out in 1 allocation of 1 GB or in 250 allocations of 4MB: It really doesn't make a big difference. Recombining extents into bigger once, tho, can make a big difference in an aging btrfs, even on SSDs. That it may be an issue with using extents. 
> Again: please show some results of some test unit which anyone will be able to replay and confirm or not that this effect really exist.

This shouldn't need examples. It's trivial math combined with basic knowledge of hardware behavior. Every request to a device has a minimum amount of overhead. On traditional hard drives, this is usually dominated by seek latency, but on SSDs, the request setup, dispatch, and completion are the dominant factor. Assuming you have a 2 microsecond overhead per request (not an exact number, just chosen for demonstration purposes because it makes the math easy), and a 1GB file, the time difference between reading ten 100MB extents and reading ten thousand 100kB extents is just short of 0.02 seconds (20 microseconds of total overhead versus 20 milliseconds), or a factor of about one thousand (which, no surprise here, is the factor of difference between the number of extents).

> If problem really exist and is related to extents you should have real scenario explanation why ZFS is not using extents.

Extents have nothing to do with it. What matters is how much of the file data is contiguous (and therefore can be read as a single request) and how smart the FS is about figuring that out. Extents help figure that out, but the primary reason to use them is to save space encoding block allocations within a file (go take a look at how ext2 handles allocations, and then compare that to ext4; the difference is insane in terms of space savings).

> btrfs is not too far from classic approach to FS because it still uses allocation structures. This is not the case in context of ZFS because this technology has no information about what is already allocated. ZFS uses free lists so by negation whatever is not on free list is already allocated.
> I'm not trying to point that ZFS is better but only point that by changing allocation strategy you may not be blasted by something like some extents bottleneck (which still needs to be proven)
>
> There are at least few very good reason why it is even necessary to change sometimes strategy from allocations structures to free lists.
> First: ZFS free list management is very similar to known from Linux memory SLAB allocator. Did you heard that someone needs to do system memory defragmentation because fragmented memory adds some additional latency to memory access?
> Other consequence is that with growing size of the files and number of files or directories FS metadata are growing exponentially with size and numbers of such objects. In case of free lists there is no such growth and all structures are growing with linear correlation.
> Caching in memory free list data takes much less than caching b-trees.
> Last thing is effort on deallocating something in FS with allocation structure and with free lists. In classic approach number of such operations is growing with depth of b-trees. In case free list all that you need to do is compare ctime of the allocated block with volume or snapshot ctime to make decision about return or not b
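As a rough check of the per-request overhead estimate in the reply above, the arithmetic can be sketched in a few lines of Python. The 2 microsecond overhead is the illustrative figure from the message, not a measurement, and the function name `request_overhead_seconds` is made up for this sketch.

```python
def request_overhead_seconds(num_requests, overhead_us=2.0):
    """Total fixed per-request overhead in seconds (transfer time is ignored)."""
    return num_requests * overhead_us / 1_000_000


# A 1 GB file read as ten 100 MB extents versus ten thousand 100 kB extents.
few = request_overhead_seconds(10)
many = request_overhead_seconds(10_000)

print(f"10 requests:    {few:.6f} s")
print(f"10000 requests: {many:.6f} s")
print(f"difference:     {many - few:.6f} s")
print(f"ratio:          {many / few:.0f}x")
```

Only the fixed cost per request is modelled here. The time needed to transfer the data itself is the same for both layouts and cancels out of the difference.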
Re: [PATCH 02/15] btrfs: Use pagevec_lookup_range_tag()
On Thu, Sep 14, 2017 at 03:18:06PM +0200, Jan Kara wrote: > We want only pages from given range in btree_write_cache_pages() and > extent_write_cache_pages(). Use pagevec_lookup_range_tag() instead of > pagevec_lookup_tag() and remove unnecessary code. > > CC: linux-btrfs@vger.kernel.org > CC: David Sterba > Signed-off-by: Jan Kara Reviewed-by: David Sterba -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: defragmenting best practice?
On 14 September 2017 at 16:24, Kai Krakow wrote: [..]
> Getting e.g. boot files into read order or at least nearby improves
> boot time a lot. Similar for loading applications.

By how much is it possible to improve boot time? Please give some example which I can try to replay and which will show that we get similar results. I still have one of my laptops with a spindle and btrfs as root fs (and no other FSes in use), so I would be able to confirm whether my numbers are close enough to your numbers.

> Shake tries to
> improve this by rewriting the files - and this works because file
> systems (given enough free space) already do a very good job at doing
> this. But constant system updates degrade this order over time.

OK. Please prepare some database and import some data whose size is a few times the unused RAM (best if this multiplication factor is at least 10). Then run a batch of selects, measuring the latency distribution of those queries. This will give you some data about non-fragmented data. In the next stage, apply some number of update queries, then reboot the system or drop all caches, and repeat the same set of selects. After this, all you need to do is compare the latency distributions.

> It really doesn't matter if some big file is laid out in 1 allocation
> of 1 GB or in 250 allocations of 4MB: It really doesn't make a big
> difference.
>
> Recombining extents into bigger ones, tho, can make a big difference in
> an aging btrfs, even on SSDs.

That may be an issue with using extents.
Again: please show some results of a test case which anyone will be able to replay, to confirm or refute that this effect really exists.

If the problem really exists and is related to extents, you should have a real-scenario explanation of why ZFS is not using extents.
btrfs is not too far from the classic approach to FS because it still uses allocation structures.
This is not the case in the context of ZFS because this technology has no information about what is already allocated. ZFS uses free lists, so by negation whatever is not on the free list is already allocated.
I'm not trying to claim that ZFS is better, only pointing out that by changing the allocation strategy you may avoid being hit by something like an extents bottleneck (which still needs to be proven).

There are at least a few very good reasons why it is sometimes even necessary to change strategy from allocation structures to free lists.
First: ZFS free list management is very similar to the SLAB allocator known from Linux memory management. Have you heard of anyone needing to do system memory defragmentation because fragmented memory adds additional latency to memory access?
Another consequence is that with growing file sizes and numbers of files or directories, FS metadata grows exponentially with the size and number of such objects. In the case of free lists there is no such growth, and all structures grow linearly.
Caching free list data in memory takes much less than caching b-trees.
The last thing is the effort of deallocating something in an FS with allocation structures versus one with free lists. In the classic approach the number of such operations grows with the depth of the b-trees. In the case of a free list, all that you need to do is compare the ctime of the allocated block with the volume or snapshot ctime to decide whether or not to return the block to the free list. No matter how many snapshots, volumes, files or directories there are, it will always be *just one compare* of the block and vol/snapshot ctime.
With the necessity to do only one compare comes much more predictable behavior of the whole FS and simplicity of the code making such decisions.
In other words, ZFS internally uses the well-known SLAB allocator, caching some data about the best possible location to allocate units of different sizes (the allocation unit size multiplied by 2^n), like you can see on Linux in /proc/slabinfo in the case of the *kmalloc* SLABs.
This is why in the case of ZFS the number of volumes and snapshots has zero impact on the average speed of interactions over the VFS layer.

If you are able to present a real impact of the fragmentation (again, *if*), this may trigger other actions. So far, AFAIK, no one has been able to deliver real numbers or scenarios about such impact.
And *if* such impact really exists, one of the solutions may be to just mimic what ZFS is doing (maybe there are other solutions).
So please show us a test case exposing the problem, with a measurement methodology presenting the pathology related to fragmentation.

> Bees is, btw, not about defragmentation: I have some OS containers
> running and I want to deduplicate data after updates.

Deduplication done in userspace has natural consequences in the form of security issues. An executable doing such things will need full access to everything, and some API/ABI allowing it to fiddle with the content of the btrfs needs to be exposed, which adds a second batch of security-related risks.

Try to have a look at how deduplication works in the case of ZFS, without offline deduplication.

>> In other words if someone is thinking that such
Re: [PATCH] Btrfs: fix confusing worker helper info
On Wed, Sep 13, 2017 at 12:09:28PM -0600, Liu Bo wrote:
> We've seen the following backtrace stack in ftrace or dmesg log,
>
> kworker/u16:10-4244 [000] 241942.480955: function: btrfs_put_ordered_extent
> kworker/u16:10-4244 [000] 241942.480956: kernel_stack: <stack trace>
> => finish_ordered_fn (a0384475)
> => btrfs_scrubparity_helper (a03ca577)  <-"incorrect"
> => btrfs_freespace_write_helper (a03ca98e)  <-"correct"
> => process_one_work (81117b2f)
> => worker_thread (81118c2a)
> => kthread (81121de0)
> => ret_from_fork (81d7087a)
>
> btrfs_freespace_write_helper is actually calling normal_worker_helper
> instead of btrfs_scrubparity_helper, so somehow kernel has parsed the
> incorrect function address while unwinding the stack,
> btrfs_scrubparity_helper really shouldn't be shown up.
>
> It's caused by compiler doing inline for our helper function, adding a
> noinline tag can fix that.
>
> Signed-off-by: Liu Bo
> cc: David Sterba

Ok, understood now, thanks. I suggest using noinline_for_stack, which is made exactly for this situation (I'll change it so you don't need to resend).

Reviewed-by: David Sterba
Re: [PATCH 2/2 v2] Btrfs: remove bio_flags which indicates a meta block of log-tree
On Wed, Sep 13, 2017 at 12:18:22PM -0600, Liu Bo wrote:
> Since both committing transaction and writing log-tree are doing
> plugging on metadata IO, we can unify to use %sync_writers to benefit
> both cases, instead of checking bio_flags while writing meta blocks of
> log-tree.
>
> We can remove this bio_flags because in order to write dirty blocks,
> log tree also uses btrfs_write_marked_extents(), inside which we
> has enabled %sync_writers, therefore, every write goes in a
> synchronous way, so does checksumming.
>
> Please also note that, bio_flags is applied per-context while
> %sync_writers is applied per-inode, so this might incur some overhead, ie.
>
> 1) while log tree is flushing its dirty blocks via
>    btrfs_write_marked_extents(), in which %sync_writers is increased
>    by one.
>
> 2) in the meantime, some writeback operations may happen upon btrfs's
>    metadata inode, so these writes go synchronously, too.
>
> However, AFAICS, the overhead is not a big one while the win is that
> we unify the two places that needs synchronous way and remove a
> special hack/flag.
>
> This removes the bio_flags related stuff for writing log-tree.
>
> Signed-off-by: Liu Bo

Much better, thanks.

Reviewed-by: David Sterba
Re: defragmenting best practice?
On Thu, 14 Sep 2017 17:24:34 +0200, Kai Krakow wrote:

Errors corrected, see below...

> On Thu, 14 Sep 2017 14:31:48 +0100,
> Tomasz Kłoczko wrote:
>
> > On 14 September 2017 at 12:38, Kai Krakow
> > wrote: [..]
> > >
> > > I suggest you only ever defragment parts of your main subvolume or
> > > rely on autodefrag, and let bees do optimizing the snapshots.
>
> Please read that again including the parts you omitted.
>
> > > Also, I experimented with adding btrfs support to shake, still
> > > working on better integration but currently lacking time... :-(
> > >
> > > Shake is an adaptive defragger which rewrites files. With my
> > > current patches it clones each file, and then rewrites it to its
> > > original location. This approach is currently not optimal as it
> > > simply bails out if some other process is accessing the file and
> > > leaves you with an (intact) temporary copy you need to move back
> > > in place manually.
> >
> > If you really want to have real and *ideal* distribution of the data
> > across physical disk first you need to build time travel device.
> > This device will allow you to put all blocks which needs to be read
> > in perfect order (to read all data only sequentially without seek).
> > However it will be working only in case of spindles because in case
> > of SSDs there is no seek time.
> > Please let us know when you will write drivers/timetravel/ Linux
> > kernel driver. When such driver will be available I promise I'll
> > write all necessary btrfs code by myself in matter of few days (it
> > will be piece of cake compare to build such device).
> >
> > But seriously ..
>
> Seriously: Defragmentation on spindles is IMHO not about getting the
> perfect continuous allocation but providing better spatial layout of
> the files you work with.
>
> Getting e.g. boot files into read order or at least nearby improves
> boot time a lot. Similar for loading applications.
> Shake tries to
> improve this by rewriting the files - and this works because file
> systems (given enough free space) already do a very good job at doing
> this. But constant system updates degrade this order over time.
>
> It really doesn't matter if some big file is laid out in 1 allocation
> of 1 GB or in 250 allocations of 4MB: It really doesn't make a big
> difference.
>
> Recombining extents into bigger ones, tho, can make a big difference
> in an aging btrfs, even on SSDs.
>
> Bees is, btw, not about defragmentation: I have some OS containers
> running and I want to deduplicate data after updates. It seems to do a
> good job here, better than other deduplicators I found. And if some
> defrag tools destroyed your snapshot reflinks, bees can also help
> here. On its way it may recombine extents so it may improve
> fragmentation. But usually it probably defragments because it needs
                                         ^^^
It fragments!

> to split extents that a defragger combined.
>
> But well, I think getting 100% continuous allocation is really not the
> achievement you want to get, especially when reflinks are a primary
> concern.
>
> > Only context/scenario when you may want to lower defragmentation is
> > when you are something needs to allocate continuous area lower than
> > free space and larger than largest free chunk. Something like this
> > happens only when volume is working on almost 100% allocated space.
> > In such scenario even you bees cannot do to much as it may be not
> > enough free space to move some other data in larger chunks to
> > defragment FS physical space.
>
> Bees does not do that.
>
> > If your workload will be still writing
> > new data to FS such defragmentation may give you (maybe) few more
> > seconds and just after this FS will be 100% full,
> >
> > In other words if someone is thinking that such defragmentation
> > daemon is solving any problems he/she may be 100% right .. such
> > person is only *thinking* that this is truth.
>
> Bees is not about that.
>
> > kloczek
> > PS. Do you know first McGyver rule? -> "If it ain't broke, don't fix
> > it".
>
> Do you know the saying "think first, then act"?
>
> > So first show that fragmentation is hurting latency of the
> > access to btrfs data and it will be possible to measurable such
> > impact. Before you will start measuring this you need to learn how to
> > sample for example VFS layer latency. Do you know how to do this to
> > deliver such proof?
>
> You didn't get the point. You only read "defragmentation" and your
> alarm lights lit up. You even think bees would be a defragmenter. It
> probably is more the opposite because it introduces more fragments in
> exchange for more reflinks.
>
> > PS2. The same "discussions" about fragmentation were in the past
> > about +10 years ago after ZFS has been introduced. Just to let you
> > know that after initial ZFS introduction up to now was not written
> > even single line of ZFS code to handle active fragmenta
Re: snapshots of encrypted directories?
On Thu, Sep 14, 2017 at 04:57:39PM +0200, Ulli Horlacher wrote:
> I use encfs on top of btrfs.
> I can create btrfs snapshots, but I have no suggestive access to the files
> in these snapshots, because they look like:
>
> drwx------ framstag users     - 2017-09-08 11:47:18 uHjprldmxo3-nSfLmcH54HMW
> drwxr-xr-x framstag users     - 2017-09-08 11:47:18 wNEWaDCgyXTj0d-Myk8wXZfh
> -rw-r--r-- framstag users   377 2015-06-12 14:02:53 -zDmc7xfobKDkbl8z7oKOHxv
> -rw-r--r-- framstag users 2,367 2012-07-10 14:32:30 7pfKs27K9k5zANE4WOQEuFa2
> -rw------- framstag users   692 2009-10-20 13:45:41 8SQElYCph85kDdcFasUHybVr
> -rw------- framstag users 2,872 2017-08-31 16:21:52 bm,yNi1e4fsAClDv7lNxxSfJ
> lrwxrwxrwx framstag users     - 2017-06-01 15:53:00 GZxNYI0Gy96R18fz40f7k5rl -> wvuQKHYzdFbar18fW6jjOerXk2IsS4OAA2fnHalBZjMQ,7Kw0j-zE3IJqxhmmGBN8G9
> -rw-r--r-- framstag users   182 2016-12-01 13:34:31 rqtNBbiYDym0hPMbBL-VLJZcFZu6nkNxlsjTX-sU88I4I1
>
> I have to mount the snapshot with encfs, to have access to the (decrypted)
> files.
>
> Any better ideas?

I'd say it's doing exactly what it should be doing. You're making a copy of an encrypted data store, and the result is encrypted. In order to read it, it needs to have the decryption layer applied to it with the correct key (which is the need to mount the snapshot with encfs).

Would you _really_ want a system where the encrypted contents of a subvolume can be decrypted by simply snapshotting it?

Hugo.

-- 
Hugo Mills             | Great films about cricket: Umpire of the Rising Sun
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |
Re: defragmenting best practice?
On Thu, 14 Sep 2017 14:31:48 +0100, Tomasz Kłoczko wrote:

> On 14 September 2017 at 12:38, Kai Krakow
> wrote: [..]
> >
> > I suggest you only ever defragment parts of your main subvolume or
> > rely on autodefrag, and let bees do optimizing the snapshots.

Please read that again including the parts you omitted.

> > Also, I experimented with adding btrfs support to shake, still
> > working on better integration but currently lacking time... :-(
> >
> > Shake is an adaptive defragger which rewrites files. With my current
> > patches it clones each file, and then rewrites it to its original
> > location. This approach is currently not optimal as it simply bails
> > out if some other process is accessing the file and leaves you with
> > an (intact) temporary copy you need to move back in place
> > manually.
>
> If you really want to have real and *ideal* distribution of the data
> across physical disk first you need to build time travel device. This
> device will allow you to put all blocks which needs to be read in
> perfect order (to read all data only sequentially without seek).
> However it will be working only in case of spindles because in case of
> SSDs there is no seek time.
> Please let us know when you will write drivers/timetravel/ Linux
> kernel driver. When such driver will be available I promise I'll
> write all necessary btrfs code by myself in matter of few days (it
> will be piece of cake compare to build such device).
>
> But seriously ..

Seriously: Defragmentation on spindles is IMHO not about getting the perfect continuous allocation but providing better spatial layout of the files you work with.

Getting e.g. boot files into read order or at least nearby improves boot time a lot. Similar for loading applications. Shake tries to improve this by rewriting the files - and this works because file systems (given enough free space) already do a very good job at doing this. But constant system updates degrade this order over time.
It really doesn't matter if some big file is laid out in 1 allocation of 1 GB or in 250 allocations of 4MB: It really doesn't make a big difference.

Recombining extents into bigger ones, tho, can make a big difference in an aging btrfs, even on SSDs.

Bees is, btw, not about defragmentation: I have some OS containers running and I want to deduplicate data after updates. It seems to do a good job here, better than other deduplicators I found. And if some defrag tools destroyed your snapshot reflinks, bees can also help here. On its way it may recombine extents so it may improve fragmentation. But usually it probably defragments because it needs to split extents that a defragger combined.

But well, I think getting 100% continuous allocation is really not the achievement you want to get, especially when reflinks are a primary concern.

> Only context/scenario when you may want to lower defragmentation is
> when you are something needs to allocate continuous area lower than
> free space and larger than largest free chunk. Something like this
> happens only when volume is working on almost 100% allocated space.
> In such scenario even you bees cannot do to much as it may be not
> enough free space to move some other data in larger chunks to
> defragment FS physical space.

Bees does not do that.

> If your workload will be still writing
> new data to FS such defragmentation may give you (maybe) few more
> seconds and just after this FS will be 100% full,
>
> In other words if someone is thinking that such defragmentation daemon
> is solving any problems he/she may be 100% right .. such person is
> only *thinking* that this is truth.

Bees is not about that.

> kloczek
> PS. Do you know first McGyver rule? -> "If it ain't broke, don't fix
> it".

Do you know the saying "think first, then act"?

> So first show that fragmentation is hurting latency of the
> access to btrfs data and it will be possible to measurable such
> impact.
> Before you will start measuring this you need to learn how to
> sample for example VFS layer latency. Do you know how to do this to
> deliver such proof?

You didn't get the point. You only read "defragmentation" and your alarm lights lit up. You even think bees would be a defragmenter. It probably is more the opposite because it introduces more fragments in exchange for more reflinks.

> PS2. The same "discussions" about fragmentation were in the past
> about +10 years ago after ZFS has been introduced. Just to let you
> know that after initial ZFS introduction up to now was not written
> even single line of ZFS code to handle active fragmentation and no one
> been able to prove that something about active defragmentation needs
> to be done in case of ZFS.

Btrfs has autodefrag to reduce the number of fragments by rewriting small portions of the file being written to. This is needed, otherwise the feature wouldn't be there. Why? Have you tried working with 1GB files broken into 10+ of fragments just because of how CoW works? Try
snapshots of encrypted directories?
I use encfs on top of btrfs.
I can create btrfs snapshots, but I have no suggestive access to the files in these snapshots, because they look like:

drwx------ framstag users     - 2017-09-08 11:47:18 uHjprldmxo3-nSfLmcH54HMW
drwxr-xr-x framstag users     - 2017-09-08 11:47:18 wNEWaDCgyXTj0d-Myk8wXZfh
-rw-r--r-- framstag users   377 2015-06-12 14:02:53 -zDmc7xfobKDkbl8z7oKOHxv
-rw-r--r-- framstag users 2,367 2012-07-10 14:32:30 7pfKs27K9k5zANE4WOQEuFa2
-rw------- framstag users   692 2009-10-20 13:45:41 8SQElYCph85kDdcFasUHybVr
-rw------- framstag users 2,872 2017-08-31 16:21:52 bm,yNi1e4fsAClDv7lNxxSfJ
lrwxrwxrwx framstag users     - 2017-06-01 15:53:00 GZxNYI0Gy96R18fz40f7k5rl -> wvuQKHYzdFbar18fW6jjOerXk2IsS4OAA2fnHalBZjMQ,7Kw0j-zE3IJqxhmmGBN8G9
-rw-r--r-- framstag users   182 2016-12-01 13:34:31 rqtNBbiYDym0hPMbBL-VLJZcFZu6nkNxlsjTX-sU88I4I1

I have to mount the snapshot with encfs, to have access to the (decrypted) files.

Any better ideas?

-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK
Universitaet Stuttgart         E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<20170914145739.ga32...@rus.uni-stuttgart.de>
Re: defragmenting best practice?
On 14 September 2017 at 12:38, Kai Krakow wrote: [..]
>
> I suggest you only ever defragment parts of your main subvolume or rely
> on autodefrag, and let bees do optimizing the snapshots.
>
> Also, I experimented with adding btrfs support to shake, still working
> on better integration but currently lacking time... :-(
>
> Shake is an adaptive defragger which rewrites files. With my current
> patches it clones each file, and then rewrites it to its original
> location. This approach is currently not optimal as it simply bails out
> if some other process is accessing the file and leaves you with an
> (intact) temporary copy you need to move back in place manually.

If you really want a real and *ideal* distribution of the data across the physical disk, you first need to build a time travel device. This device will allow you to put all blocks which need to be read in perfect order (to read all data only sequentially, without seeks). However, it will work only in the case of spindles, because in the case of SSDs there is no seek time.
Please let us know when you have written the drivers/timetravel/ Linux kernel driver. When such a driver is available, I promise I'll write all the necessary btrfs code by myself in a matter of a few days (it will be a piece of cake compared to building such a device).

But seriously ..
The only context/scenario in which you may want to lower fragmentation is when something needs to allocate a contiguous area smaller than the free space but larger than the largest free chunk. Something like this happens only when a volume is working at almost 100% allocated space. In such a scenario even your bees cannot do too much, as there may not be enough free space to move other data in larger chunks to defragment the FS physical space.
If your workload is still writing new data to the FS, such defragmentation may give you (maybe) a few more seconds, and just after this the FS will be 100% full.

In other words, if someone is thinking that such a defragmentation daemon is solving any problems, he/she may be 100% right .. such a person is only *thinking* that this is the truth.

kloczek
PS. Do you know the first MacGyver rule? -> "If it ain't broke, don't fix it".
So first show that fragmentation is hurting the latency of access to btrfs data, and that it is possible to measure such impact. Before you start measuring this, you need to learn how to sample, for example, VFS layer latency. Do you know how to do this to deliver such proof?
PS2. The same "discussions" about fragmentation took place 10+ years ago, after ZFS was introduced. Just to let you know: from the initial ZFS introduction up to now, not even a single line of ZFS code has been written to handle active defragmentation, and no one has been able to prove that something about active defragmentation needs to be done in the case of ZFS.
Why? Because it all stands on the shoulders of a clever enough *allocation algorithm*. Only this and nothing more.
PS3. Please can we stop this/EOT?
-- 
Tomasz Kłoczko | LinkedIn: http://lnkd.in/FXPWxH
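One way to approach the measurement asked for above is to time individual read calls from userspace and compare the resulting distributions before and after a fragmenting workload. The following is only a minimal sketch under stated assumptions: it measures syscall-level read latency as seen by a process, which is an approximation of VFS layer latency and not a kernel-side probe, and the helper names `sample_read_latencies` and `summarize` are invented for this illustration.

```python
import os
import statistics
import time


def sample_read_latencies(path, block_size=4096, max_reads=10_000):
    """Time sequential read() calls on `path`; return per-call latencies in seconds."""
    latencies = []
    fd = os.open(path, os.O_RDONLY)
    try:
        for _ in range(max_reads):
            start = time.perf_counter()
            data = os.read(fd, block_size)
            elapsed = time.perf_counter() - start
            if not data:  # end of file reached, do not record this call
                break
            latencies.append(elapsed)
    finally:
        os.close(fd)
    return latencies


def summarize(latencies):
    """Return median, approximate 99th percentile and maximum of a non-empty sample."""
    ordered = sorted(latencies)
    p99_index = min(len(ordered) - 1, int(len(ordered) * 0.99))
    return {
        "median": statistics.median(ordered),
        "p99": ordered[p99_index],
        "max": ordered[-1],
    }
```

For a meaningful comparison the page cache has to be cold for each run, for example by rebooting or by writing 3 to /proc/sys/vm/drop_caches as root, as the thread already suggests. The two summaries (fresh file versus the same file after many updates) can then be compared directly.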
[PATCH 02/15] btrfs: Use pagevec_lookup_range_tag()
We want only pages from given range in btree_write_cache_pages() and
extent_write_cache_pages(). Use pagevec_lookup_range_tag() instead of
pagevec_lookup_tag() and remove unnecessary code.

CC: linux-btrfs@vger.kernel.org
CC: David Sterba
Signed-off-by: Jan Kara
---
 fs/btrfs/extent_io.c | 19 ++++---------------
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 0f077c5db58e..9b7936ea3a88 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3819,8 +3819,8 @@ int btree_write_cache_pages(struct address_space *mapping,
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		tag_pages_for_writeback(mapping, index, end);
 	while (!done && !nr_to_write_done && (index <= end) &&
-	       (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
+	       (nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
+			tag, PAGEVEC_SIZE))) {
 		unsigned i;
 
 		scanned = 1;
@@ -3830,11 +3830,6 @@ int btree_write_cache_pages(struct address_space *mapping,
 			if (!PagePrivate(page))
 				continue;
 
-			if (!wbc->range_cyclic && page->index > end) {
-				done = 1;
-				break;
-			}
-
 			spin_lock(&mapping->private_lock);
 			if (!PagePrivate(page)) {
 				spin_unlock(&mapping->private_lock);
@@ -3966,8 +3961,8 @@ static int extent_write_cache_pages(struct address_space *mapping,
 		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
 	while (!done && !nr_to_write_done && (index <= end) &&
-	       (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
+	       (nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
+			tag, PAGEVEC_SIZE))) {
 		unsigned i;
 
 		scanned = 1;
@@ -3992,12 +3987,6 @@ static int extent_write_cache_pages(struct address_space *mapping,
 				continue;
 			}
 
-			if (!wbc->range_cyclic && page->index > end) {
-				done = 1;
-				unlock_page(page);
-				continue;
-			}
-
 			if (wbc->sync_mode != WB_SYNC_NONE) {
 				if (PageWriteback(page))
 					flush_fn(data);
-- 
2.12.3
Re: [PATCH] Btrfs: do not backup tree roots when fsync
On Thu, Sep 14, 2017 at 09:55:48AM +0800, Qu Wenruo wrote:
>
>
> On 2017-09-14 02:25, Liu Bo wrote:
> > It doesn't make sense to backup tree roots when doing fsync, since
> > during fsync those tree roots have not been consistent on disk.
> >
> > Signed-off-by: Liu Bo
>
> Reviewed-by: Qu Wenruo
>
> With a pit can be improved.
> > ---
> >  fs/btrfs/disk-io.c | 9 ++++++++-
> >  1 file changed, 8 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 79ac228..a145a88 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -3668,7 +3668,14 @@ int write_all_supers(struct btrfs_fs_info *fs_info,
> > int max_mirrors)
> >  	u64 flags;
> >
> >  	do_barriers = !btrfs_test_opt(fs_info, NOBARRIER);
> > -	backup_super_roots(fs_info);
> > +
> > +	/*
> > +	 * max_mirrors == 0 indicates we're from commit_transaction,
> > +	 * not from fsync where the tree roots in fs_info have not
> > +	 * been consistent on disk.
> > +	 */
> > +	if (max_mirrors == 0)
> > +		backup_super_roots(fs_info);
>
> BTW, the @max_mirrors naming here is really confusing.
> Normally I would expect max_mirrors == 0 means we don't need to backup
> super roots...

Agreed it's confusing, could be something like "bool write_backups" (in a separate patch).
Re: [PATCH 2/2] btrfs: build: omit unnecessary -MD flag
On Thu, Sep 14, 2017 at 07:10:56PM +0900, Naohiro Aota wrote:
> According to gcc(1), "-MD is equivalent to -M -MF file, except that -E is not
> implied." Since the rule in the Makefile is just generating dependency file
> and not building object file, it is no use to have "-MD" here. Also, it's
> overridden and conflicting with the following "-MM" flag. I guess we can drop
> it.
>
> Signed-off-by: Naohiro Aota

Applied, thanks.
Re: [PATCH 1/2] btrfs-progs: build: generate all dependency files
On Thu, Sep 14, 2017 at 07:10:46PM +0900, Naohiro Aota wrote:
> We're missing several dependency files like:
>
> $ diff -u <(find -name '*.o'|cut -d. -f2|sort) <(find -name '*.o.d'|cut -d. -f2|sort)
> --- /proc/self/fd/11	2017-09-14 18:17:44.460564620 +0900
> +++ /proc/self/fd/12	2017-09-14 18:17:44.460564620 +0900

Please note that an actual diff in the changelog is understood as the start
of the patch by git-am; indenting the --- or +++ lines makes it work again.

> @@ -3,7 +3,6 @@
>  /btrfs-corrupt-block
>  /btrfs-debug-tree
>  /btrfs-find-root
> -/btrfs-list
>  /btrfs-map-logical
>  /btrfs-select-super
>  /btrfstune
> @@ -29,11 +28,6 @@
>  /cmds-scrub
>  /cmds-send
>  /cmds-subvolume
> -/convert/common
> -/convert/main
> -/convert/source-ext2
> -/convert/source-fs
> -/convert/source-reiserfs
>  /ctree
>  /dir-item
>  /disk-io
>
> This is due to moving things out of objects and cmds_objects variables. Such
> missing dependency files cause mis-building of some source files (try touch
> utils.h; make mkfs/main.o).
>
> This patch introduces a new variable "all_objects" to keep all the objects and
> uses the variable to generate proper dependency file building rules.
>
> Signed-off-by: Naohiro Aota

Applied, thanks.
Re: defragmenting best practice?
On 2017-09-14 03:54, Duncan wrote:
> Austin S. Hemmelgarn posted on Tue, 12 Sep 2017 13:27:00 -0400 as excerpted:
>
> > The tricky part though is that differing workloads are impacted
> > differently by fragmentation. Using just four generic examples:
> >
> > * Mostly sequential write focused workloads (like security recording
> >   systems) tend to be impacted by free space fragmentation more than data
> >   fragmentation. Balancing filesystems used for such workloads is likely
> >   to give a noticeable improvement, but defragmenting probably won't give
> >   much.
> > * Mostly sequential read focused workloads (like a streaming media
> >   server) tend to be the most impacted by data fragmentation, but aren't
> >   generally impacted by free space fragmentation. As a result, defrag
> >   will help here a lot, but balance won't as much.
> > * Mostly random write focused workloads (like most database systems or
> >   virtual machines) are often impacted by both free space and data
> >   fragmentation, and are a pathological case for CoW filesystems. Balance
> >   and defrag will help here, but they won't help for long.
> > * Mostly random read focused workloads (like most non-multimedia desktop
> >   usage) are not impacted much by either aspect, but if you're on a
> >   traditional hard drive they can be impacted significantly by how the
> >   data is spread across the disk. Balance can help here, but only because
> >   it improves data locality, not because it compacts free space.
>
> This is a very useful analysis, particularly given the examples. Maybe put
> it on the wiki under the defrag discussion? (Assuming something like it
> isn't already there. I've not looked in awhile.)

I've actually been meaning to write up something more thorough about this
online (probably as a Gist). When I finally get around to that (probably in
the next few weeks), I'll try to make sure a link ends up on the defrag page
on the wiki.
Re: defragmenting best practice?
On Tue, 12 Sep 2017 18:28:43 +0200, Ulli Horlacher wrote:

> On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
> > When I do a
> >   btrfs filesystem defragment -r /directory
> > does it defragment really all files in this directory tree, even if
> > it contains subvolumes?
> > The man page does not mention subvolumes on this topic.
>
> No answer so far :-(
>
> But I found another problem in the man-page:
>
>   Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as
>   well as with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or
>   >= 3.13.4 will break up the ref-links of COW data (for example files
>   copied with cp --reflink, snapshots or de-duplicated data). This may
>   cause considerable increase of space usage depending on the broken up
>   ref-links.
>
> I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several
> snapshots.
> Therefore, I better should avoid calling "btrfs filesystem defragment
> -r"?
>
> What is the defragmenting best practice?
> Avoid it completely?

You may want to try https://github.com/Zygo/bees. It is a daemon that
watches file system generation changes, scans the blocks, and then
recombines them. Of course, this process somewhat defeats the purpose of
defragging in the first place, as it will undo some of the defragmenting.

I suggest you only ever defragment parts of your main subvolume or rely on
autodefrag, and let bees optimize the snapshots.

Also, I experimented with adding btrfs support to shake; I'm still working
on better integration but currently lacking time... :-(

Shake is an adaptive defragger which rewrites files. With my current
patches it clones each file, and then rewrites it to its original location.
This approach is currently not optimal, as it simply bails out if some other
process is accessing the file and leaves you with an (intact) temporary copy
that you need to move back into place manually.
Shake works very well with the idea of detecting how fragmented, how old,
and how far away from an "ideal" position a file is, and it exploits
standard Linux file system behavior to place files optimally by rewriting
them. It then records its status per file in extended attributes. It also
works with non-btrfs file systems.

My patches try to avoid defragging files with shared extents, so this may
help your situation. However, it will still shuffle files around if they are
too far from their ideal position, thus destroying shared extents. A future
patch could use extent recombining and skip shared extents in that process.
But first I'd like to clean up some of the rough edges together with the
original author of shake.

Look here: https://github.com/unbrice/shake and also check out the pull
requests and comments there.

You shouldn't currently run shake unattended, and you should run it only on
the specific parts of your FS that you feel need defragmenting.

--
Regards,
Kai

Replies to list-only preferred.
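The kernel-version condition in the man-page excerpt quoted in this message can be written out as a small check. This is only a sketch in Python: the function name and the (major, minor, patch) tuple format are hypothetical, it transcribes the quoted wording and nothing else, and it treats every 3.14 release as affected because release candidates such as 3.14-rc1 are not modeled.

```python
# Stable series named in the quoted man-page text, mapped to the first
# patch level that is said to break up ref-links on defragment.
STABLE_THRESHOLDS = {(3, 10): 31, (3, 12): 12, (3, 13): 4}


def defrag_breaks_reflinks(version):
    """Return True if the quoted man-page text says defragmenting on this
    kernel breaks up ref-links of COW data.

    version: (major, minor, patch) tuple, e.g. (4, 10, 0). Hypothetical
    input format chosen for this sketch.
    """
    series = tuple(version[:2])
    if series < (3, 9):
        return True   # "kernel versions < 3.9"
    if series >= (3, 14):
        return True   # ">= 3.14-rc2" (rc1 is not modeled, see above)
    threshold = STABLE_THRESHOLDS.get(series)
    # Only the listed stable series are affected, from the listed patch level on.
    return threshold is not None and version[2] >= threshold
```

For the 4.10 kernel mentioned in the question, `defrag_breaks_reflinks((4, 10, 0))` is true under this reading, which is why the advice above is to defragment only selected parts of the main subvolume.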
Re: Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
On 07/09/2017 16:43, Peter Becker wrote:
> 2017-09-07 16:37 GMT+02:00 Marco Lorenzo Crociani:
> [...]
> > I got:
> >
> > 00-49: 1
> > 50-79: 0
> > 80-89: 0
> > 90-99: 1
> > 100: 25540
> >
> > this means that fs has only one block group used under 50% and 1 between 90
> > and 99% while the rest are all full?
>
> yes .. imo, balance wouldn't help

Hi,
after

btrfs balance start -musage=50 /data/R6HW/

and

btrfs balance start -musage=99 /data/R6HW/

I wasn't able to reproduce those messages.

Regards,

--
Marco Crociani
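For readers wondering how a usage histogram like the one quoted above is built, the bucketing can be sketched as follows. The script that actually produced those numbers is not shown in this message, so this is an illustration only: the function name and the (used_bytes, length_bytes) input shape are assumptions, and only the bucket boundaries are taken from the quoted output.

```python
# Bucket boundaries copied from the quoted output (inclusive percent ranges).
BUCKETS = [
    (0, 49, "00-49"),
    (50, 79, "50-79"),
    (80, 89, "80-89"),
    (90, 99, "90-99"),
    (100, 100, "100"),
]


def usage_histogram(block_groups):
    """Count block groups per usage bucket.

    block_groups: iterable of (used_bytes, length_bytes) pairs with
    length_bytes > 0 (hypothetical input shape for this sketch).
    """
    counts = {label: 0 for _, _, label in BUCKETS}
    for used, length in block_groups:
        percent = used * 100 // length  # integer percent, rounded down
        for low, high, label in BUCKETS:
            if low <= percent <= high:
                counts[label] += 1
                break
    return counts
```

Read this way, the quoted result says that nearly every block group is completely full, so a balance filtered by usage has almost nothing it can compact.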
[PATCH 1/2] btrfs-progs: build: generate all dependency files
We're missing several dependency files like:

  $ diff -u <(find -name '*.o'|cut -d. -f2|sort) <(find -name '*.o.d'|cut -d. -f2|sort)
  --- /proc/self/fd/11	2017-09-14 18:17:44.460564620 +0900
  +++ /proc/self/fd/12	2017-09-14 18:17:44.460564620 +0900
  @@ -3,7 +3,6 @@
   /btrfs-corrupt-block
   /btrfs-debug-tree
   /btrfs-find-root
  -/btrfs-list
   /btrfs-map-logical
   /btrfs-select-super
   /btrfstune
  @@ -29,11 +28,6 @@
   /cmds-scrub
   /cmds-send
   /cmds-subvolume
  -/convert/common
  -/convert/main
  -/convert/source-ext2
  -/convert/source-fs
  -/convert/source-reiserfs
   /ctree
   /dir-item
   /disk-io

This is due to moving things out of objects and cmds_objects variables. Such
missing dependency files cause mis-building of some source files (try touch
utils.h; make mkfs/main.o).

This patch introduces a new variable "all_objects" to keep all the objects and
uses the variable to generate proper dependency file building rules.

Signed-off-by: Naohiro Aota
---
 Makefile | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index a114eca..c00dff6 100644
--- a/Makefile
+++ b/Makefile
@@ -121,6 +121,9 @@ libbtrfs_headers = send-stream.h send-utils.h send.h kernel-lib/rbtree.h btrfs-l
 convert_objects = convert/main.o convert/common.o convert/source-fs.o \
 		  convert/source-ext2.o convert/source-reiserfs.o
 mkfs_objects = mkfs/main.o mkfs/common.o
+image_objects = image/main.o
+all_objects = $(objects) $(cmds_objects) $(libbtrfs_objects) $(convert_objects) \
+	      $(mkfs_objects) $(image_objects)
 
 TESTS = fsck-tests.sh convert-tests.sh
 
@@ -591,5 +594,5 @@ uninstall:
 	cd $(DESTDIR)$(bindir); $(RM) -f -- btrfsck fsck.btrfs $(progs_install)
 
 ifneq ($(MAKECMDGOALS),clean)
--include $(objects:.o=.o.d) $(cmds_objects:.o=.o.d) $(subst .btrfs,, $(filter-out btrfsck.o.d, $(progs:=.o.d)))
+-include $(all_objects:.o=.o.d) $(subst .btrfs,, $(filter-out btrfsck.o.d, $(progs:=.o.d)))
 endif
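The check done with diff and find in the changelog above can also be expressed as a small pure function, which may be handy in a build sanity test. This is a sketch under assumptions, not part of the patch: the function name is hypothetical, and it expects a flat list of relative paths that the caller has already collected from the build tree.

```python
def missing_depfiles(paths):
    """Return the object files in `paths` that lack a matching dependency file.

    paths: iterable of relative path strings from a build tree, containing
    both objects ("foo.o") and dependency files ("foo.o.d"). The ".o.d"
    naming follows the Makefile rule shown in the patch.
    """
    present = set(paths)
    return sorted(
        path for path in present
        if path.endswith(".o") and path + ".d" not in present
    )
```

An empty result corresponds to the diff in the changelog producing no output; any name it returns is an object that would not be rebuilt when one of its headers changes.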
[PATCH 2/2] btrfs: build: omit unnecessary -MD flag
According to gcc(1), "-MD is equivalent to -M -MF file, except that -E is not
implied." Since the rule in the Makefile is just generating dependency file
and not building object file, it is no use to have "-MD" here. Also, it's
overridden and conflicting with the following "-MM" flag. I guess we can drop
it.

Signed-off-by: Naohiro Aota
---
 Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index c00dff6..60c802a 100644
--- a/Makefile
+++ b/Makefile
@@ -264,7 +264,7 @@ else
 endif
 
 %.o.d: %.c
-	$(Q)$(CC) -MD -MM -MG -MF $@ -MT $(@:.o.d=.o) -MT $(@:.o.d=.static.o) -MT $@ $(CFLAGS) $<
+	$(Q)$(CC) -MM -MG -MF $@ -MT $(@:.o.d=.o) -MT $(@:.o.d=.static.o) -MT $@ $(CFLAGS) $<
 
 #
 # Pick from per-file variables, btrfs_*_cflags
Re: defragmenting best practice?
Austin S. Hemmelgarn posted on Tue, 12 Sep 2017 13:27:00 -0400 as excerpted:

> The tricky part though is that differing workloads are impacted
> differently by fragmentation. Using just four generic examples:
>
> * Mostly sequential write focused workloads (like security recording
>   systems) tend to be impacted by free space fragmentation more than data
>   fragmentation. Balancing filesystems used for such workloads is likely
>   to give a noticeable improvement, but defragmenting probably won't give
>   much.
> * Mostly sequential read focused workloads (like a streaming media
>   server) tend to be the most impacted by data fragmentation, but aren't
>   generally impacted by free space fragmentation. As a result, defrag
>   will help here a lot, but balance won't as much.
> * Mostly random write focused workloads (like most database systems or
>   virtual machines) are often impacted by both free space and data
>   fragmentation, and are a pathological case for CoW filesystems. Balance
>   and defrag will help here, but they won't help for long.
> * Mostly random read focused workloads (like most non-multimedia desktop
>   usage) are not impacted much by either aspect, but if you're on a
>   traditional hard drive they can be impacted significantly by how the
>   data is spread across the disk. Balance can help here, but only because
>   it improves data locality, not because it compacts free space.

This is a very useful analysis, particularly given the examples. Maybe put
it on the wiki under the defrag discussion? (Assuming something like it
isn't already there. I've not looked in awhile.)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman