Re: qemu-kvm VM died during partial raid1 problems of btrfs
On Wed, Sep 13, 2017 at 08:21:01AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-09-12 17:13, Adam Borowski wrote:
> > On Tue, Sep 12, 2017 at 04:12:32PM -0400, Austin S. Hemmelgarn wrote:
> > > On 2017-09-12 16:00, Adam Borowski wrote:
> > > > Noted. Both Marat's and my use cases, though, involve VMs that are off
> > > > most of the time, and at least for me, turned on only to test something.
> > > > Touching mtime makes rsync run again, and it's freaking _slow_: worse
> > > > than 40 minutes for a 40GB VM (source: SSD, target: deduped HDD).
> > > 40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if
> > > you're going direct to a hard drive. I get better performance than that
> > > on my somewhat pathetic NUC-based storage cluster (I get roughly 20 MB/s
> > > there, but it's for archival storage so I don't really care). I'm
> > > actually curious what the exact rsync command you are using is (you can
> > > obviously redact paths as you see fit), as the only way I can think of
> > > that it should be that slow is if you're using both --checksum (but if
> > > you're using this, you can tell rsync to skip the mtime check, and that
> > > issue goes away) and --inplace, _and_ your HDD is slow to begin with.
> > rsync -axX --delete --inplace --numeric-ids /mnt/btr1/qemu/ mordor:$BASE/qemu
> > The target is single, compress=zlib SAMSUNG HD204UI, 34976 hours old but
> > with nothing notable on SMART, in a Qnap 253a, kernel 4.9.
> compress=zlib is probably your biggest culprit. As odd as this sounds, I'd
> suggest switching that to lzo (seriously, the performance difference is
> ludicrous), and then setting up a cron job (or systemd timer) to run defrag
> over things to switch to zlib.
> As a general point of comparison, we do archival backups to a file server
> running BTRFS where I work, and the archiving process runs about four to
> ten times faster if we take this approach (LZO for initial compression,
> then recompress using defrag once the initial transfer is done) than just
> using zlib directly.

Turns out that lzo is actually the slowest, but only by a bit.

I tried a different disk, in the same Qnap; also an old disk, but 7200 rpm
rather than 5400. Mostly empty, only a handful of subvolumes, not much
reflinking. I made three separate copies, ran fallocate -d on them, upgraded
Windows inside the VM, then:

[/mnt/btr1/qemu]$ for x in none lzo zlib; do time rsync -axX --delete --inplace --numeric-ids win10.img mordor:/SOME/DIR/$x/win10.img; done

real    31m37.459s
user    27m21.587s
sys      2m16.210s

real    33m28.258s
user    27m19.745s
sys      2m17.642s

real    32m57.058s
user    27m24.297s
sys      2m17.640s

Note the "user" values. So rsync does something bad on the source side.
Despite fragmentation, reads on the source are not a problem:

[/mnt/btr1/qemu]$ time cat win10.img >/dev/null

real    1m28.815s
user    0m0.061s
sys     0m48.094s

[/mnt/btr1/qemu]$ /usr/sbin/filefrag win10.img
win10.img: 63682 extents found
[/mnt/btr1/qemu]$ btrfs fi def win10.img
[/mnt/btr1/qemu]$ /usr/sbin/filefrag win10.img
win10.img: 18015 extents found
[/mnt/btr1/qemu]$ time cat win10.img >/dev/null

real    1m17.879s
user    0m0.076s
sys     0m37.757s

> `--inplace` is probably not helping (especially if most of the file
> changed; on BTRFS, it actually is marginally more efficient to just write
> out a whole new file and then replace the old one with a rename if you're
> rewriting most of the file), but is probably not as much of an issue as
> compress=zlib.

Yeah, scp + dedupe would run faster. For deduplication, instead of
duperemove it'd be better to call FILE_EXTENT_SAME on the first 128K, then
the second, ... -- without even hashing the blocks beforehand.
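The chunk-at-a-time FILE_EXTENT_SAME idea can be sketched from userspace with xfs_io's `dedupe` command (a wrapper around the same kernel dedupe ioctl). A rough sketch, assuming a reflink-capable filesystem (btrfs or xfs) and xfsprogs installed; the file names here are stand-ins, not the ones from the thread:

```shell
# Sketch: dedupe a file against its backup in fixed 128K chunks, letting the
# kernel compare the ranges instead of hashing them in userspace first.
dd if=/dev/urandom of=src.img bs=128K count=8 2>/dev/null
cp src.img dst.img                 # a byte-identical stand-in "backup"
size=$(stat -c %s src.img)
off=0
while [ "$off" -lt "$size" ]; do
    len=131072
    [ $((off + len)) -gt "$size" ] && len=$((size - off))
    # The kernel verifies the two ranges match before sharing extents;
    # on non-reflink filesystems this returns EOPNOTSUPP and the chunk
    # is simply left alone.
    xfs_io -c "dedupe src.img $off $off $len" dst.img >/dev/null 2>&1 || true
    off=$((off + len))
done
cmp src.img dst.img                # contents are unchanged either way
```

On btrfs, chunks that matched now share extents; chunks that differed are skipped by the kernel, which is exactly the "no pre-hashing" behavior described above.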
Not that this particular VM takes enough backup space to make spending too
much time worthwhile, but it's a good test case for performance issues like
this.

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity. You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so. I recommend Skepticism
⠈⠳⣄ (funeral doom metal).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On 09/13/2017 04:15 PM, Marat Khalili wrote:
> On 13/09/17 16:23, Chris Murphy wrote:
>> Right, known problem. To use O_DIRECT implies also using nodatacow (or
>> at least nodatasum), e.g. xattr +C is set, done by qemu-img -o nocow=on
>> https://www.spinics.net/lists/linux-btrfs/msg68244.html
> Can you please elaborate? I don't have exactly the same problem as
> described by the link, but I'm still worried that qemu in particular can
> be less resilient to partial raid1 failures even on newer kernels, due to
> missing checksums for instance. (BTW I didn't find any xattrs on my VM
> images, nor do I plan to set any.)

From what Josef Bacik wrote, I understood that it is not only related to
RAID1. I tried to ask for further clarification, without success :(

It seems that simply using O_DIRECT can allow checksum mismatches. My
understanding is that, to avoid copying the data between buffers, the
checksum computation is subject to a data race: it is possible that the
kernel computes the checksum *while* the userspace program is changing the
data. This leads to an I/O error on a subsequent read. To avoid that, BTRFS
would have to copy the data into a temporary buffer first and then compute
the checksum -- but that copy is exactly what common sense suggests O_DIRECT
should avoid.

If I understood correctly (which is a BIG if), I think O_DIRECT should be
unsupported (i.e. return -EINVAL) unless the file is marked "nodatasum".

I looked at what ZFS on Linux does: it seems that it doesn't support
O_DIRECT [1], for the same reason (see the comment by 'ryao' from Jul 23,
2015 for further details).

Anyway, I suggest reading what the open(2) man page says about O_DIRECT: it
has to be used carefully when doing fork(); the man page concludes:

[...] In summary, O_DIRECT is a potentially powerful tool that should be
used with caution. It is recommended that applications treat use of
O_DIRECT as a performance option which is disabled by default. [...]
[1] https://github.com/zfsonlinux/zfs/issues/224

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On 2017-09-13 10:47, Martin Raiber wrote:
> Hi,
>
> On 12.09.2017 23:13 Adam Borowski wrote:
>> On Tue, Sep 12, 2017 at 04:12:32PM -0400, Austin S. Hemmelgarn wrote:
>>> On 2017-09-12 16:00, Adam Borowski wrote:
>>>> Noted. Both Marat's and my use cases, though, involve VMs that are off
>>>> most of the time, and at least for me, turned on only to test something.
>>>> Touching mtime makes rsync run again, and it's freaking _slow_: worse
>>>> than 40 minutes for a 40GB VM (source: SSD, target: deduped HDD).
>>> 40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if
>>> you're going direct to a hard drive. I get better performance than that
>>> on my somewhat pathetic NUC-based storage cluster (I get roughly 20 MB/s
>>> there, but it's for archival storage so I don't really care). I'm
>>> actually curious what the exact rsync command you are using is (you can
>>> obviously redact paths as you see fit), as the only way I can think of
>>> that it should be that slow is if you're using both --checksum (but if
>>> you're using this, you can tell rsync to skip the mtime check, and that
>>> issue goes away) and --inplace, _and_ your HDD is slow to begin with.
>> rsync -axX --delete --inplace --numeric-ids /mnt/btr1/qemu/ mordor:$BASE/qemu
>> The target is single, compress=zlib SAMSUNG HD204UI, 34976 hours old but
>> with nothing notable on SMART, in a Qnap 253a, kernel 4.9.
>>
>> Both source and target are btrfs, but here switching to send|receive
>> wouldn't give much, as this particular guest is Win10 Insider Edition --
>> a thingy that shows what the folks from Redmond have cooked up, with
>> roughly weekly updates to the tune of ~10GB of writes and 10GB of
>> deletions (even if they do incremental transfers, the installation still
>> rewrites the whole system).
>>
>> Lemme look a bit more, rsync performance is indeed really abysmal
>> compared to what it should be.
> Self promo, but consider using UrBackup (OSS software, too) instead? For
> Windows VMs I would install the client in the VM. It excludes unnecessary
> stuff like e.g.
> page files or the shadow storage area from the image backups, and it has a
> mode to store image backups as raw btrfs files. Linux VMs I'd back up as
> files, either from the hypervisor or from in the VM. If you want to back
> up big btrfs image files it can do that too, and faster than rsync; plus
> it can do incremental backups with sparse files.

Even without UrBackup (I'll need to look into that actually; we're looking
for new backup software where I work, since MS has been debating removing
File History, and the custom scripts my predecessor wrote are showing their
20+ year age at this point), it's usually better to just run the backup from
inside the VM if at all possible. You end up saving space, and don't waste
time backing up stuff you don't need. In this particular use case, it would
also save other system resources, since you only need to back up the VM if
something has changed, and by definition nothing could have changed in the
VM (at least, nothing could have legitimately changed) if it's not running.
Re: qemu-kvm VM died during partial raid1 problems of btrfs
Hi,

On 12.09.2017 23:13 Adam Borowski wrote:
> On Tue, Sep 12, 2017 at 04:12:32PM -0400, Austin S. Hemmelgarn wrote:
>> On 2017-09-12 16:00, Adam Borowski wrote:
>>> Noted. Both Marat's and my use cases, though, involve VMs that are off
>>> most of the time, and at least for me, turned on only to test something.
>>> Touching mtime makes rsync run again, and it's freaking _slow_: worse
>>> than 40 minutes for a 40GB VM (source: SSD, target: deduped HDD).
>> 40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if
>> you're going direct to a hard drive. I get better performance than that
>> on my somewhat pathetic NUC-based storage cluster (I get roughly 20 MB/s
>> there, but it's for archival storage so I don't really care). I'm
>> actually curious what the exact rsync command you are using is (you can
>> obviously redact paths as you see fit), as the only way I can think of
>> that it should be that slow is if you're using both --checksum (but if
>> you're using this, you can tell rsync to skip the mtime check, and that
>> issue goes away) and --inplace, _and_ your HDD is slow to begin with.
> rsync -axX --delete --inplace --numeric-ids /mnt/btr1/qemu/ mordor:$BASE/qemu
> The target is single, compress=zlib SAMSUNG HD204UI, 34976 hours old but
> with nothing notable on SMART, in a Qnap 253a, kernel 4.9.
>
> Both source and target are btrfs, but here switching to send|receive
> wouldn't give much, as this particular guest is Win10 Insider Edition --
> a thingy that shows what the folks from Redmond have cooked up, with
> roughly weekly updates to the tune of ~10GB of writes and 10GB of
> deletions (even if they do incremental transfers, the installation still
> rewrites the whole system).
>
> Lemme look a bit more, rsync performance is indeed really abysmal compared
> to what it should be.

Self promo, but consider using UrBackup (OSS software, too) instead? For
Windows VMs I would install the client in the VM. It excludes unnecessary
stuff like e.g.
page files or the shadow storage area from the image backups, and it has a
mode to store image backups as raw btrfs files. Linux VMs I'd back up as
files, either from the hypervisor or from in the VM. If you want to back up
big btrfs image files it can do that too, and faster than rsync; plus it can
do incremental backups with sparse files.

Regards,
Martin Raiber
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On 13/09/17 16:23, Chris Murphy wrote:
> Right, known problem. To use O_DIRECT implies also using nodatacow (or
> at least nodatasum), e.g. xattr +C is set, done by qemu-img -o nocow=on
> https://www.spinics.net/lists/linux-btrfs/msg68244.html

Can you please elaborate? I don't have exactly the same problem as described
by the link, but I'm still worried that qemu in particular can be less
resilient to partial raid1 failures even on newer kernels, due to missing
checksums for instance. (BTW I didn't find any xattrs on my VM images, nor
do I plan to set any.)

-- 

With Best Regards,
Marat Khalili
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On Tue, Sep 12, 2017 at 10:02 AM, Marat Khalili wrote:
> (3) it is possible that it uses O_DIRECT or something, and btrfs raid1
> does not fully protect this kind of access.

Right, known problem. To use O_DIRECT implies also using nodatacow (or at
least nodatasum), e.g. xattr +C is set, done by qemu-img -o nocow=on

https://www.spinics.net/lists/linux-btrfs/msg68244.html

-- 
Chris Murphy
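For reference, the nodatacow route Chris describes can be done by hand as well as via qemu-img. A minimal sketch (file name hypothetical); the key detail is that the +C attribute only takes effect while the file is still empty, so it must be set before the image receives any data:

```shell
# Create a NOCOW (and therefore checksum-free) image file on btrfs.
img=vm-test.img
touch "$img"                  # file must exist and be empty for +C to stick
chattr +C "$img" 2>/dev/null \
    || echo "note: this filesystem has no +C attribute"
truncate -s 1G "$img"         # size the (sparse) image; the thread's VM is 40G
lsattr "$img" 2>/dev/null     # a 'C' in the flags column means NOCOW is set
rm -f "$img"
```

`qemu-img create -o nocow=on` performs the same touch-then-+C dance internally before sizing the image.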
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On 2017-09-12 20:52, Timofey Titovets wrote:
> No, no, no, no...
> No new ioctl, no change in fallocate.
> First: a VM can do punch hole; if you use qemu -> qemu knows how to do it.
> A Windows guest also knows how to do it. Different hypervisor? -> google ->
> file an issue asking for support; Linux, Windows and Mac OS all support
> holes in files.

Not everybody who uses sparse files is using virtual machines.

> No new code, no new strange stuff to fix things that aren't broken.

Um, the fallocate PUNCH_HOLE mode _is_ broken. There's a race condition that
can trivially cause data loss.

> You want to replace zeroes? EXTENT_SAME can do that.

But only on a small number of filesystems, and it requires extra work that
shouldn't be necessary.

> truncate -s 4M test_hole
> dd if=/dev/zero of=./test_zero bs=4M count=1
> duperemove -vhrd ./test_hole ./test_zero

And performance for this approach is absolute shit compared to fallocate -d.
Actual numbers, using a 4G test file (which is still small for what you're
talking about) and a 4M hole file:

fallocate -d:     0.19 user,   0.85 system,   1.26 real
duperemove -vhrd: 0.75 user, 137.70 system, 144.80 real

So, for a 4G file, it took duperemove (and the EXTENT_SAME ioctl) 114.92
times as long to achieve the same net effect. From a practical perspective,
this isn't viable for regular usage just because of how long it takes.

Most of that overhead is that the EXTENT_SAME ioctl does a byte-by-byte
comparison of the ranges to make sure they match, but that isn't strictly
necessary to avoid this race condition. All that's actually needed is
determining whether there is outstanding I/O on that region, and if so, some
special handling prior to freezing the region.
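A scaled-down sketch of the comparison above (64M instead of 4G, so it finishes quickly); absolute timings depend entirely on the machine and filesystem, and the duperemove step additionally needs a dedupe-capable filesystem and duperemove installed, so it is left commented out:

```shell
# Reproduce the fallocate -d vs. EXTENT_SAME comparison, scaled down.
dd if=/dev/zero of=test_zero bs=1M count=64 conv=fsync 2>/dev/null
truncate -s 4M test_hole            # the all-holes reference file
time fallocate -d test_zero         # punch holes in the zeroed regions directly
du -hs test_zero test_hole          # test_zero should now occupy ~0 on disk
# time duperemove -hrd test_hole test_zero   # the EXTENT_SAME route, if available
rm -f test_zero test_hole
```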
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On 2017-09-12 17:13, Adam Borowski wrote:
> On Tue, Sep 12, 2017 at 04:12:32PM -0400, Austin S. Hemmelgarn wrote:
>> On 2017-09-12 16:00, Adam Borowski wrote:
>>> Noted. Both Marat's and my use cases, though, involve VMs that are off
>>> most of the time, and at least for me, turned on only to test something.
>>> Touching mtime makes rsync run again, and it's freaking _slow_: worse
>>> than 40 minutes for a 40GB VM (source: SSD, target: deduped HDD).
>> 40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if
>> you're going direct to a hard drive. I get better performance than that
>> on my somewhat pathetic NUC-based storage cluster (I get roughly 20 MB/s
>> there, but it's for archival storage so I don't really care). I'm
>> actually curious what the exact rsync command you are using is (you can
>> obviously redact paths as you see fit), as the only way I can think of
>> that it should be that slow is if you're using both --checksum (but if
>> you're using this, you can tell rsync to skip the mtime check, and that
>> issue goes away) and --inplace, _and_ your HDD is slow to begin with.
> rsync -axX --delete --inplace --numeric-ids /mnt/btr1/qemu/ mordor:$BASE/qemu
> The target is single, compress=zlib SAMSUNG HD204UI, 34976 hours old but
> with nothing notable on SMART, in a Qnap 253a, kernel 4.9.

compress=zlib is probably your biggest culprit. As odd as this sounds, I'd
suggest switching that to lzo (seriously, the performance difference is
ludicrous), and then setting up a cron job (or systemd timer) to run defrag
over things to switch to zlib. As a general point of comparison, we do
archival backups to a file server running BTRFS where I work, and the
archiving process runs about four to ten times faster if we take this
approach (LZO for initial compression, then recompress using defrag once
the initial transfer is done) than just using zlib directly.
`--inplace` is probably not helping (especially if most of the file changed;
on BTRFS, it actually is marginally more efficient to just write out a whole
new file and then replace the old one with a rename if you're rewriting most
of the file), but is probably not as much of an issue as compress=zlib.

> Both source and target are btrfs, but here switching to send|receive
> wouldn't give much, as this particular guest is Win10 Insider Edition --
> a thingy that shows what the folks from Redmond have cooked up, with
> roughly weekly updates to the tune of ~10GB of writes and 10GB of
> deletions (even if they do incremental transfers, the installation still
> rewrites the whole system).
>
> Lemme look a bit more, rsync performance is indeed really abysmal compared
> to what it should be.
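Concretely, the write-a-new-file-then-rename behavior is rsync's default; the suggestion amounts to dropping `--inplace` from the command in the thread. A sketch against local scratch directories (the real target is a remote host):

```shell
# rsync without --inplace: writes a temporary copy on the receiver and
# renames it over the old file, which on btrfs avoids rewriting (and
# re-fragmenting) the existing extents when most of the file changed.
src=$(mktemp -d)
dst=$(mktemp -d)
dd if=/dev/zero of="$src/win10.img" bs=1M count=4 2>/dev/null
rsync -axX --delete --numeric-ids "$src/" "$dst/"   # note: no --inplace
ls -l "$dst"
rm -rf "$src" "$dst"
```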
Re: qemu-kvm VM died during partial raid1 problems of btrfs
2017-09-13 0:13 GMT+03:00 Adam Borowski:
> On Tue, Sep 12, 2017 at 04:12:32PM -0400, Austin S. Hemmelgarn wrote:
>> On 2017-09-12 16:00, Adam Borowski wrote:
>>> Noted. Both Marat's and my use cases, though, involve VMs that are off
>>> most of the time, and at least for me, turned on only to test something.
>>> Touching mtime makes rsync run again, and it's freaking _slow_: worse
>>> than 40 minutes for a 40GB VM (source: SSD, target: deduped HDD).
>> 40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if
>> you're going direct to a hard drive. I get better performance than that
>> on my somewhat pathetic NUC-based storage cluster (I get roughly 20 MB/s
>> there, but it's for archival storage so I don't really care). I'm
>> actually curious what the exact rsync command you are using is (you can
>> obviously redact paths as you see fit), as the only way I can think of
>> that it should be that slow is if you're using both --checksum (but if
>> you're using this, you can tell rsync to skip the mtime check, and that
>> issue goes away) and --inplace, _and_ your HDD is slow to begin with.
>
> rsync -axX --delete --inplace --numeric-ids /mnt/btr1/qemu/ mordor:$BASE/qemu
> The target is single, compress=zlib SAMSUNG HD204UI, 34976 hours old but
> with nothing notable on SMART, in a Qnap 253a, kernel 4.9.
>
> Both source and target are btrfs, but here switching to send|receive
> wouldn't give much, as this particular guest is Win10 Insider Edition --
> a thingy that shows what the folks from Redmond have cooked up, with
> roughly weekly updates to the tune of ~10GB of writes and 10GB of
> deletions (even if they do incremental transfers, the installation still
> rewrites the whole system).
>
> Lemme look a bit more, rsync performance is indeed really abysmal compared
> to what it should be.
>
> Meow!
> -- 
> ⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
> ⣾⠁⢰⠒⠀⣿⡁ productivity.
> You can read it, too, you just need the
> ⢿⡄⠘⠷⠚⠋⠀ right music while doing so. I recommend Skepticism
> ⠈⠳⣄ (funeral doom metal).

No, no, no, no...
No new ioctl, no change in fallocate.

First: a VM can do punch hole; if you use qemu -> qemu knows how to do it.
A Windows guest also knows how to do it. Different hypervisor? -> google ->
file an issue asking for support; Linux, Windows and Mac OS all support
holes in files.

No new code, no new strange stuff to fix things that aren't broken.

You want to replace zeroes? EXTENT_SAME can do that.

truncate -s 4M test_hole
dd if=/dev/zero of=./test_zero bs=4M count=1
duperemove -vhrd ./test_hole ./test_zero

~ du -hs test_*
0	test_hole
0	test_zero

-- 
Have a nice day,
Timofey.
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On Tue, Sep 12, 2017 at 04:12:32PM -0400, Austin S. Hemmelgarn wrote:
> On 2017-09-12 16:00, Adam Borowski wrote:
> > Noted. Both Marat's and my use cases, though, involve VMs that are off
> > most of the time, and at least for me, turned on only to test something.
> > Touching mtime makes rsync run again, and it's freaking _slow_: worse
> > than 40 minutes for a 40GB VM (source: SSD, target: deduped HDD).
> 40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if
> you're going direct to a hard drive. I get better performance than that on
> my somewhat pathetic NUC-based storage cluster (I get roughly 20 MB/s
> there, but it's for archival storage so I don't really care). I'm actually
> curious what the exact rsync command you are using is (you can obviously
> redact paths as you see fit), as the only way I can think of that it
> should be that slow is if you're using both --checksum (but if you're
> using this, you can tell rsync to skip the mtime check, and that issue
> goes away) and --inplace, _and_ your HDD is slow to begin with.

rsync -axX --delete --inplace --numeric-ids /mnt/btr1/qemu/ mordor:$BASE/qemu

The target is single, compress=zlib SAMSUNG HD204UI, 34976 hours old but
with nothing notable on SMART, in a Qnap 253a, kernel 4.9.

Both source and target are btrfs, but here switching to send|receive
wouldn't give much, as this particular guest is Win10 Insider Edition --
a thingy that shows what the folks from Redmond have cooked up, with roughly
weekly updates to the tune of ~10GB of writes and 10GB of deletions (even if
they do incremental transfers, the installation still rewrites the whole
system).

Lemme look a bit more, rsync performance is indeed really abysmal compared
to what it should be.

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity. You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so. I recommend Skepticism
⠈⠳⣄ (funeral doom metal).
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On 2017-09-12 16:00, Adam Borowski wrote:
> On Tue, Sep 12, 2017 at 03:11:52PM -0400, Austin S. Hemmelgarn wrote:
>> On 2017-09-12 14:43, Adam Borowski wrote:
>>> On Tue, Sep 12, 2017 at 01:36:48PM -0400, Austin S. Hemmelgarn wrote:
>>>> On 2017-09-12 13:21, Adam Borowski wrote:
>>>>> There's fallocate -d, but that for some reason touches mtime which
>>>>> makes rsync go again. This can be handled manually but is still not
>>>>> nice.
>>> Yeah, the underlying ioctl does modify the file; it's merely fallocate -d
>>> calling it on regions that are already zero. The ioctl doesn't know
>>> that, so fallocate would have to restore the mtime by itself.
>>>
>>> There's also another problem: such a check + ioctl are racey. Unlike
>>> defrag or FILE_EXTENT_SAME, you can't thus use it on a file that's in
>>> use (or could suddenly become in use). Fixing this would need kernel
>>> support, either as FILE_EXTENT_SAME with /dev/zero or as a new mode of
>>> fallocate.
>> A new fallocate mode would be more likely. Adding special code to the
>> EXTENT_SAME ioctl and then requiring implementation on filesystems that
>> don't otherwise support it is not likely to get anywhere. A new fallocate
>> mode though would be easy, especially considering that a naive
>> implementation is easy
> Sounds like a good idea. If we go this way, there's a question about the
> interface: there's a choice between:
> A) check if the whole range is zero; if even a single bit is one, abort
> B) dig many holes, with a given granulation (perhaps left to the
>    filesystem's choice)
> or even both. The former is more consistent with FILE_EXTENT_SAME, the
> latter can be smarter (like, digging a 4k hole is bad for fragmentation
> but replacing a whole extent, no matter how small, is always a win).

The first. It's more flexible, and the logic required for the second option
is policy, which should not be in the kernel. Matching the EXTENT_SAME
semantics would probably also make the implementation significantly easier,
and with some minor work might give a trivial implementation for any FS that
already supports that ioctl.

>> That said, I'm not 100% certain if it's necessary. Intentionally calling
>> fallocate on a file in use is not something most people are going to do
>> normally anyway, since there is already a TOCTOU race in the fallocate -d
>> implementation as things are right now.
> _Current_ fallocate -d suffers from races; the whole gain from doing this
> kernel-side would be eliminating those races. Use cases are about the same
> as FILE_EXTENT_SAME: you don't need to stop the world. Heck, as I
> mentioned before, it conceptually _is_ FILE_EXTENT_SAME with /dev/zero,
> other than your (good) point about non-btrfs non-xfs.

I meant we shouldn't worry about a race involving the mtime check, given
that there's an existing race inherent in the ioctl already.

>>> For now, though, I wonder -- should we send fine folks at util-linux a
>>> patch to make fallocate -d restore mtime, either always or on an option?
>> It would need to be an option, because it also suffers from a TOCTOU race
>> (other things might have changed the mtime while you were punching
>> holes), and it breaks from existing behavior. I think such an option
>> would be useful, but not universally (for example, I don't care if the
>> mtime on my VM images changes, as it typically matches the current date
>> and time since the VMs are running constantly other than when doing
>> maintenance like punching holes in the images).
> Noted. Both Marat's and my use cases, though, involve VMs that are off
> most of the time, and at least for me, turned on only to test something.
> Touching mtime makes rsync run again, and it's freaking _slow_: worse
> than 40 minutes for a 40GB VM (source: SSD, target: deduped HDD).

40 minutes for 40GB is insanely slow (that's just short of 18 MB/s) if
you're going direct to a hard drive. I get better performance than that on
my somewhat pathetic NUC-based storage cluster (I get roughly 20 MB/s there,
but it's for archival storage so I don't really care). I'm actually curious
what the exact rsync command you are using is (you can obviously redact
paths as you see fit), as the only way I can think of that it should be that
slow is if you're using both --checksum (but if you're using this, you can
tell rsync to skip the mtime check, and that issue goes away) and --inplace,
_and_ your HDD is slow to begin with.
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On Tue, Sep 12, 2017 at 03:11:52PM -0400, Austin S. Hemmelgarn wrote:
> On 2017-09-12 14:43, Adam Borowski wrote:
> > On Tue, Sep 12, 2017 at 01:36:48PM -0400, Austin S. Hemmelgarn wrote:
> > > On 2017-09-12 13:21, Adam Borowski wrote:
> > > > There's fallocate -d, but that for some reason touches mtime which
> > > > makes rsync go again. This can be handled manually but is still not
> > > > nice.
> > Yeah, the underlying ioctl does modify the file; it's merely fallocate -d
> > calling it on regions that are already zero. The ioctl doesn't know
> > that, so fallocate would have to restore the mtime by itself.
> >
> > There's also another problem: such a check + ioctl are racey. Unlike
> > defrag or FILE_EXTENT_SAME, you can't thus use it on a file that's in
> > use (or could suddenly become in use). Fixing this would need kernel
> > support, either as FILE_EXTENT_SAME with /dev/zero or as a new mode of
> > fallocate.
> A new fallocate mode would be more likely. Adding special code to the
> EXTENT_SAME ioctl and then requiring implementation on filesystems that
> don't otherwise support it is not likely to get anywhere. A new fallocate
> mode though would be easy, especially considering that a naive
> implementation is easy

Sounds like a good idea. If we go this way, there's a question about the
interface: there's a choice between:

A) check if the whole range is zero; if even a single bit is one, abort
B) dig many holes, with a given granulation (perhaps left to the
   filesystem's choice)

or even both. The former is more consistent with FILE_EXTENT_SAME, the
latter can be smarter (like, digging a 4k hole is bad for fragmentation but
replacing a whole extent, no matter how small, is always a win).

> That said, I'm not 100% certain if it's necessary. Intentionally calling
> fallocate on a file in use is not something most people are going to do
> normally anyway, since there is already a TOCTOU race in the fallocate -d
> implementation as things are right now.
_Current_ fallocate -d suffers from races; the whole gain from doing this
kernel-side would be eliminating those races. Use cases are about the same
as FILE_EXTENT_SAME: you don't need to stop the world. Heck, as I mentioned
before, it conceptually _is_ FILE_EXTENT_SAME with /dev/zero, other than
your (good) point about non-btrfs non-xfs.

> > For now, though, I wonder -- should we send fine folks at util-linux a
> > patch to make fallocate -d restore mtime, either always or on an option?
> It would need to be an option, because it also suffers from a TOCTOU race
> (other things might have changed the mtime while you were punching holes),
> and it breaks from existing behavior. I think such an option would be
> useful, but not universally (for example, I don't care if the mtime on my
> VM images changes, as it typically matches the current date and time since
> the VMs are running constantly other than when doing maintenance like
> punching holes in the images).

Noted. Both Marat's and my use cases, though, involve VMs that are off most
of the time, and at least for me, turned on only to test something.
Touching mtime makes rsync run again, and it's freaking _slow_: worse than
40 minutes for a 40GB VM (source: SSD, target: deduped HDD).

Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ I've read an article about how lively happy music boosts
⣾⠁⢰⠒⠀⣿⡁ productivity. You can read it, too, you just need the
⢿⡄⠘⠷⠚⠋⠀ right music while doing so. I recommend Skepticism
⠈⠳⣄ (funeral doom metal).
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On 2017-09-12 14:47, Christoph Hellwig wrote:
> On Tue, Sep 12, 2017 at 08:43:59PM +0200, Adam Borowski wrote:
> > For now, though, I wonder -- should we send fine folks at util-linux a patch
> > to make fallocate -d restore mtime, either always or on an option?
> Don't do that. Please just add a new ioctl or fallocate command that
> punches a hole if the range is zeroed, similar to what dedup does. It can
> probably even reuse a few helpers.

Agreed, that would be far preferred.
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On 2017-09-12 14:43, Adam Borowski wrote:
> On Tue, Sep 12, 2017 at 01:36:48PM -0400, Austin S. Hemmelgarn wrote:
> > On 2017-09-12 13:21, Adam Borowski wrote:
> > > There's fallocate -d, but that for some reason touches mtime which makes
> > > rsync go again. This can be handled manually but is still not nice.
> > It touches mtime because it updates the block allocations, which in turn
> > touch ctime, which on most (possibly all, not sure though) POSIX systems
> > implies an mtime update. It's essentially the same as truncate updating the
> > mtime when you extend the file, the only difference is that the
> > FALLOC_FL_PUNCH_HOLE operation doesn't change the file size.
> Yeah, the underlying ioctl does modify the file, it's merely fallocate -d
> calling it on regions that are already zero. The ioctl doesn't know that,
> so fallocate would have to restore the mtime by itself.
>
> There's also another problem: such a check + ioctl are racey. Unlike defrag
> or FILE_EXTENT_SAME, you can't thus use it on a file that's in use (or could
> suddenly become in use). Fixing this would need kernel support, either as
> FILE_EXTENT_SAME with /dev/zero or as a new mode of fallocate.

A new fallocate mode would be more likely. Adding special code to the
EXTENT_SAME ioctl and then requiring implementation on filesystems that
don't otherwise support it is not likely to get anywhere. A new fallocate
mode though would be easy, especially considering that a naive
implementation is easy (block further requests to that range, complete all
outstanding ones, check the range, punch the hole if possible, and then
reopen requests for the range).

That said, I'm not 100% certain if it's necessary. Intentionally calling
fallocate on a file in use is not something most people are going to do
normally anyway, since there is already a TOCTOU race in the fallocate -d
implementation as things are right now.

> For now, though, I wonder -- should we send fine folks at util-linux a patch
> to make fallocate -d restore mtime, either always or on an option?
It would need to be an option, because it also suffers from a TOCTOU race
(other things might have changed the mtime while you were punching holes),
and it breaks from existing behavior. I think such an option would be
useful, but not universally (for example, I don't care if the mtime on my VM
images changes, as it typically matches the current date and time since the
VMs are running constantly other than when doing maintenance like punching
holes in the images). You're the one with particular interest though, so I
guess it's ultimately up to you how you choose to implement things in the
patch ;)
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On Tue, Sep 12, 2017 at 08:43:59PM +0200, Adam Borowski wrote:
> For now, though, I wonder -- should we send fine folks at util-linux a patch
> to make fallocate -d restore mtime, either always or on an option?

Don't do that. Please just add a new ioctl or fallocate command that
punches a hole if the range is zeroed, similar to what dedup does. It can
probably even reuse a few helpers.
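For reference, the dedup interface being alluded to is the FIDEDUPERANGE ioctl (the VFS-level descendant of BTRFS_IOC_FILE_EXTENT_SAME, lifted out of btrfs around kernel 4.5). A sketch of how a userspace caller would pack the request, based on the structs in <linux/fs.h>; the helper name is illustrative only, and actually issuing the ioctl requires a filesystem that implements it (btrfs, xfs):

```python
import struct

# struct file_dedupe_range header, from <linux/fs.h>:
#   __u64 src_offset; __u64 src_length; __u16 dest_count;
#   __u16 reserved1;  __u32 reserved2;
DEDUPE_HDR = struct.Struct("=QQHHI")   # 24 bytes

# struct file_dedupe_range_info, one per destination range:
#   __s64 dest_fd; __u64 dest_offset; __u64 bytes_deduped;
#   __s32 status;  __u32 reserved;
DEDUPE_INFO = struct.Struct("=qQQiI")  # 32 bytes

# _IOWR(0x94, 54, struct file_dedupe_range) computed by hand:
#   dir(read|write)=3 << 30 | sizeof(hdr) << 16 | type=0x94 << 8 | nr=54
FIDEDUPERANGE = (3 << 30) | (DEDUPE_HDR.size << 16) | (0x94 << 8) | 54


def dedupe_request(src_offset, length, dest_fd, dest_offset):
    """Pack one FIDEDUPERANGE request with a single destination range.
    The real call would then be fcntl.ioctl(src_fd, FIDEDUPERANGE, buf),
    and the kernel compares the ranges byte-for-byte before sharing them."""
    return (DEDUPE_HDR.pack(src_offset, length, 1, 0, 0)
            + DEDUPE_INFO.pack(dest_fd, dest_offset, 0, 0, 0))
```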
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On Tue, Sep 12, 2017 at 01:36:48PM -0400, Austin S. Hemmelgarn wrote:
> On 2017-09-12 13:21, Adam Borowski wrote:
> > There's fallocate -d, but that for some reason touches mtime which makes
> > rsync go again. This can be handled manually but is still not nice.
> It touches mtime because it updates the block allocations, which in turn
> touch ctime, which on most (possibly all, not sure though) POSIX systems
> implies an mtime update. It's essentially the same as truncate updating the
> mtime when you extend the file, the only difference is that the
> FALLOC_FL_PUNCH_HOLE operation doesn't change the file size.

Yeah, the underlying ioctl does modify the file, it's merely fallocate -d
calling it on regions that are already zero. The ioctl doesn't know that,
so fallocate would have to restore the mtime by itself.

There's also another problem: such a check + ioctl are racey. Unlike defrag
or FILE_EXTENT_SAME, you can't thus use it on a file that's in use (or could
suddenly become in use). Fixing this would need kernel support, either as
FILE_EXTENT_SAME with /dev/zero or as a new mode of fallocate.

For now, though, I wonder -- should we send fine folks at util-linux a patch
to make fallocate -d restore mtime, either always or on an option?

Meow!
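The proposed util-linux behavior (punch holes, then put the old mtime back) can be sketched in a few lines. The callback name is hypothetical, standing in for whatever issues the actual FALLOC_FL_PUNCH_HOLE calls, and the same TOCTOU caveat from the thread applies: a concurrent writer's mtime update gets silently clobbered.

```python
import os


def punch_holes_keep_mtime(path, punch):
    """Run a hole-punching callback over `path`, then restore the original
    timestamps -- what the proposed util-linux option would do.  Racy: if
    another process updates mtime while we punch, that update is lost.
    `punch` is a placeholder for the real FALLOC_FL_PUNCH_HOLE work."""
    st = os.stat(path)
    punch(path)
    # Restore atime and mtime with nanosecond precision; ctime still moves,
    # since the kernel updates it on any metadata change including utime.
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns))
```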
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On 2017-09-12 13:21, Adam Borowski wrote:
> On Tue, Sep 12, 2017 at 02:26:39PM +0300, Marat Khalili wrote:
> > On 12/09/17 14:12, Adam Borowski wrote:
> > > Why would you need support in the hypervisor if cp --reflink=always is
> > > enough?
> > +1 :)
> >
> > But I've already found one problem: I use rsync snapshots for backups, and
> > although rsync does have --sparse argument, apparently it conflicts with
> > --inplace. You cannot have all nice things :(

(Replying here to the above, as I can't seem to find the original in my
e-mail client to reply to.)

--inplace and --sparse are inherently at odds with each other. The only way
that they could work together is if rsync was taught about the
FALLOC_FL_PUNCH_HOLE operation, and that isn't likely to ever happen because
it's Linux specific (at least, it's functionally Linux specific). Without
it, the only way to create a sparse file is to seek over areas that are
supposed to be empty when writing the file out initially, but you can't do
that with an existing file because you then have old data where you're
supposed to have zeroes.

> There's fallocate -d, but that for some reason touches mtime which makes
> rsync go again. This can be handled manually but is still not nice.

It touches mtime because it updates the block allocations, which in turn
touch ctime, which on most (possibly all, not sure though) POSIX systems
implies an mtime update. It's essentially the same as truncate updating the
mtime when you extend the file, the only difference is that the
FALLOC_FL_PUNCH_HOLE operation doesn't change the file size.
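The seek-over-zeroes technique described here (what rsync --sparse does when writing a fresh file) can be sketched like this; the block size is an arbitrary assumption, not rsync's actual granularity:

```python
import os

BLOCK = 4096  # zero-check granularity; an assumption for the sketch


def write_sparse(src_path, dst_path, block=BLOCK):
    """Copy src to dst, seeking over all-zero blocks instead of writing
    them, so the kernel never allocates those ranges.  As noted in the
    thread, this only works for a freshly created file: seeking over a
    region of an *existing* file leaves whatever old data was there,
    not zeroes -- which is exactly why --sparse and --inplace conflict."""
    zero = bytes(block)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(block)
            if not chunk:
                break
            if chunk == zero:
                dst.seek(len(chunk), os.SEEK_CUR)  # leave a hole
            else:
                dst.write(chunk)
        dst.truncate()  # fix the size if the file ends in a hole
```

Reading the copy back yields identical bytes on any filesystem; holes read as zeroes.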
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On Tue, Sep 12, 2017 at 02:26:39PM +0300, Marat Khalili wrote:
> On 12/09/17 14:12, Adam Borowski wrote:
> > Why would you need support in the hypervisor if cp --reflink=always is
> > enough?
> +1 :)
>
> But I've already found one problem: I use rsync snapshots for backups, and
> although rsync does have --sparse argument, apparently it conflicts with
> --inplace. You cannot have all nice things :(

There's fallocate -d, but that for some reason touches mtime which makes
rsync go again. This can be handled manually but is still not nice.

Meow!
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On 12/09/17 14:12, Adam Borowski wrote:
> Why would you need support in the hypervisor if cp --reflink=always is
> enough?

+1 :)

But I've already found one problem: I use rsync snapshots for backups, and
although rsync does have a --sparse argument, apparently it conflicts with
--inplace. You cannot have all nice things :(

I think I'll simply try to minimize the size of VM root partitions and won't
think too much about a gig or two of extra zeroes in backup, at least until
some autopunchholes mount option arrives.

-- 
With Best Regards,
Marat Khalili
Re: qemu-kvm VM died during partial raid1 problems of btrfs
2017-09-12 14:12 GMT+03:00 Adam Borowski:
> On Tue, Sep 12, 2017 at 02:01:53PM +0300, Timofey Titovets wrote:
>> > On 12/09/17 13:32, Adam Borowski wrote:
>> >> Just use raw -- btrfs already has every feature that qcow2 has, and
>> >> does it better. This doesn't mean btrfs is the best choice for hosting
>> >> VM files, just that raw-over-btrfs is strictly better than
>> >> qcow2-over-btrfs.
>> >
>> > Thanks for advice, I wasn't sure I won't lose features, and was too lazy
>> > to investigate/ask. Now it looks simple.
>>
>> The main problem with Raw over Btrfs is that (IIRC) no one support
>> btrfs features.
>>
>> - Patches for libvirt not merged and obsolete
>> - Patches for Proxmox also not merged
>> - Other VM hypervisor like Virtualbox, VMware just ignore btrfs features.
>>
>> So with raw you will have a problems like: no snapshot support
>
> Why would you need support in the hypervisor if cp --reflink=always is
> enough? Likewise, I wouldn't expect hypervisors to implement support for
> every dedup tool -- it'd be a layering violation[1]. It's not emacs or
> systemd, you really can use an external tool instead of adding a lawnmower
> to the kitchen sink.
>
> Meow!
>
> [1] Yeah, talking about layering violations in btrfs context is a bit weird,
> but it's better to at least try.

In that case, why do hypervisors add support for LVM snapshots, ZFS, RBD
snapshots, etc.? The user can do that by hand, so it's useless, no?
(rhetorical question)

This is not about a layering violation, it's about teaming and integration
between tools.

-- 
Have a nice day,
Timofey.
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On Tue, Sep 12, 2017 at 02:01:53PM +0300, Timofey Titovets wrote:
> > On 12/09/17 13:32, Adam Borowski wrote:
> >> Just use raw -- btrfs already has every feature that qcow2 has, and
> >> does it better. This doesn't mean btrfs is the best choice for hosting
> >> VM files, just that raw-over-btrfs is strictly better than
> >> qcow2-over-btrfs.
> >
> > Thanks for advice, I wasn't sure I won't lose features, and was too lazy to
> > investigate/ask. Now it looks simple.
>
> The main problem with Raw over Btrfs is that (IIRC) no one support
> btrfs features.
>
> - Patches for libvirt not merged and obsolete
> - Patches for Proxmox also not merged
> - Other VM hypervisor like Virtualbox, VMware just ignore btrfs features.
>
> So with raw you will have a problems like: no snapshot support

Why would you need support in the hypervisor if cp --reflink=always is
enough? Likewise, I wouldn't expect hypervisors to implement support for
every dedup tool -- it'd be a layering violation[1]. It's not emacs or
systemd, you really can use an external tool instead of adding a lawnmower
to the kitchen sink.

Meow!

[1] Yeah, talking about layering violations in btrfs context is a bit weird,
but it's better to at least try.
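For the curious: cp --reflink=always boils down to a single FICLONE ioctl on the destination, which is why "snapshot support" for a raw image needs nothing from the hypervisor. A hedged sketch (Linux-only; the ioctl succeeds only on filesystems with reflink support such as btrfs, and raises OSError elsewhere):

```python
import fcntl

# _IOW(0x94, 9, int): dir(write)=1 << 30 | size=4 << 16 | type=0x94 << 8 | 9.
# This is the ioctl `cp --reflink=always` issues under the hood.
FICLONE = (1 << 30) | (4 << 16) | (0x94 << 8) | 9


def reflink_snapshot(src_path, dst_path):
    """Create a cheap CoW copy of a raw VM image -- a 'snapshot' with no
    hypervisor involvement.  The clone shares extents with the source until
    either side is written, at which point btrfs CoWs the touched blocks."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
```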
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On Tue, 12 Sep 2017 12:32:14 +0200 Adam Borowski wrote:
> discard in the guest (not supported over ide and virtio, supported over
> scsi and virtio-scsi)

IDE does support discard in QEMU, I use that all the time. It got broken
briefly in QEMU 2.1 [1], but then fixed again.

[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=757927

-- 
With respect,
Roman
Re: qemu-kvm VM died during partial raid1 problems of btrfs
2017-09-12 13:39 GMT+03:00 Marat Khalili:
> On 12/09/17 13:01, Duncan wrote:
>> AFAIK that's wrong -- the only time the app should see the error on btrfs
>> raid1 is if the second copy is also bad
>
> So thought I, but...
>
>> IIRC from what I've read on-list, qcow2 isn't the best alternative for
>> hosting VMs on top of btrfs.
>
> Yeah, I've seen discussions about it here too, but in my case VMs write very
> little (mostly logs and distro updates), so I decided it can live as it is
> for a while. But I'm looking for better solutions as long as they are not
> too complicated.
>
> On 12/09/17 13:32, Adam Borowski wrote:
>> Just use raw -- btrfs already has every feature that qcow2 has, and does it
>> better. This doesn't mean btrfs is the best choice for hosting VM files,
>> just that raw-over-btrfs is strictly better than qcow2-over-btrfs.
>
> Thanks for advice, I wasn't sure I won't lose features, and was too lazy to
> investigate/ask. Now it looks simple.

The main problem with raw over btrfs is that (IIRC) no one supports btrfs
features:

- Patches for libvirt are not merged and obsolete
- Patches for Proxmox are also not merged
- Other hypervisors like VirtualBox and VMware just ignore btrfs features.

So with raw you will have problems like no snapshot support. But yes, raw
over btrfs is the best performance-wise solution.

-- 
Have a nice day,
Timofey.
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On 12/09/17 13:01, Duncan wrote:
> AFAIK that's wrong -- the only time the app should see the error on btrfs
> raid1 is if the second copy is also bad

So thought I, but...

> IIRC from what I've read on-list, qcow2 isn't the best alternative for
> hosting VMs on top of btrfs.

Yeah, I've seen discussions about it here too, but in my case VMs write very
little (mostly logs and distro updates), so I decided it can live as it is
for a while. But I'm looking for better solutions as long as they are not
too complicated.

On 12/09/17 13:32, Adam Borowski wrote:
> Just use raw -- btrfs already has every feature that qcow2 has, and does it
> better. This doesn't mean btrfs is the best choice for hosting VM files,
> just that raw-over-btrfs is strictly better than qcow2-over-btrfs.

Thanks for the advice, I wasn't sure I wouldn't lose features, and was too
lazy to investigate/ask. Now it looks simple.

-- 
With Best Regards,
Marat Khalili
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On Tue, Sep 12, 2017 at 10:01:07AM +0000, Duncan wrote:
> BTW, I am most definitely /not/ a VM expert, and won't pretend to
> understand the details or be able to explain further, but IIRC from what
> I've read on-list, qcow2 isn't the best alternative for hosting VMs on
> top of btrfs. Something about it being cow-based as well, which means cow
> (qcow2)-on-cow(btrfs), which tends to lead to /extreme/ fragmentation,
> leading to low performance.
>
> I don't know enough about it to know what the alternatives to qcow2 are,
> but something that not itself cow when it's on cow-based btrfs, would
> presumably be a better alternative.

Just use raw -- btrfs already has every feature that qcow2 has, and does it
better. This doesn't mean btrfs is the best choice for hosting VM files,
just that raw-over-btrfs is strictly better than qcow2-over-btrfs.

And like qcow2, with raw over btrfs you have the choice between a fully
pre-written nocow file and a sparse file. For the latter, you want discard
in the guest (not supported over ide and virtio, supported over scsi and
virtio-scsi), and you get the full list of btrfs goodies like snapshots or
dedup.

Meow!
Re: qemu-kvm VM died during partial raid1 problems of btrfs
Marat Khalili posted on Tue, 12 Sep 2017 11:42:52 +0300 as excerpted:

> On 12/09/17 11:25, Timofey Titovets wrote:
>> AFAIK, if while read BTRFS get Read Error in RAID1, application will
>> also see that error and if application can't handle it -> you got a
>> problems
>>
>> So Btrfs RAID1 ONLY protect data, not application (qemu in your case).
> That's news to me! Why doesn't it try another copy and when does it
> correct the error then?

AFAIK that's wrong -- the only time the app should see the error on btrfs
raid1 is if the second copy is also bad (and if it's good, the bad copy is
automatically rewritten... elsewhere of course, due to cow)... or if the
problem with btrfs is bad enough it sends the entire filesystem read-only,
which I don't believe happened in your case (it was the ext4 on the VM that
went ro). So you should be able to rest easy on that, at least. =:^)

> Any idea on how to work it around at least for qemu? (Assemble the array
> from within the VM?)

BTW, I am most definitely /not/ a VM expert, and won't pretend to understand
the details or be able to explain further, but IIRC from what I've read
on-list, qcow2 isn't the best alternative for hosting VMs on top of btrfs.
Something about it being cow-based as well, which means cow(qcow2)-on-cow(btrfs),
which tends to lead to /extreme/ fragmentation, leading to low performance.
I'd guess that due to the additional stress, it may also trigger race
conditions and/or deadlocks that wouldn't ordinarily trigger.

I don't know enough about it to know what the alternatives to qcow2 are, but
something that is not itself cow, when it's on cow-based btrfs, would
presumably be a better alternative.

Sorry I can't do better on that, but this should at least give you enough
information to look for more, if no one reposts the details here.

-- 
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program,
he is your master."
Richard Stallman
Re: qemu-kvm VM died during partial raid1 problems of btrfs
2017-09-12 12:29 GMT+03:00 Marat Khalili:
> On 12/09/17 12:21, Timofey Titovets wrote:
>> Can't reproduce that on latest kernel: 4.13.1
>
> Great! Thank you very much for the test. Do you know if it's fixed in 4.10?
> (or what particular version does?)

Nope. I've been reading all list messages for at least 3 years and I can't
remember merged patches that could fix that; maybe it's related to the
latest BIO API rework and changes, but I'm unsure. Sorry.

-- 
Have a nice day,
Timofey.
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On 12/09/17 12:21, Timofey Titovets wrote:
> Can't reproduce that on latest kernel: 4.13.1

Great! Thank you very much for the test. Do you know if it's fixed in 4.10?
(or what particular version does?)

-- 
With Best Regards,
Marat Khalili
Re: qemu-kvm VM died during partial raid1 problems of btrfs
2017-09-12 11:42 GMT+03:00 Marat Khalili:
> On 12/09/17 11:25, Timofey Titovets wrote:
>> AFAIK, if while read BTRFS get Read Error in RAID1, application will
>> also see that error and if application can't handle it -> you got a
>> problems
>>
>> So Btrfs RAID1 ONLY protect data, not application (qemu in your case).
>
> That's news to me! Why doesn't it try another copy and when does it correct
> the error then? Any idea on how to work it around at least for qemu?
> (Assemble the array from within the VM?)

Can't reproduce that on the latest kernel: 4.13.1

To reproduce, I use 2 USB flash disks in btrfs raid1 + a fio test to
generate load. During the test I pull out one flash drive; some time ago (a
year?) this produced a fio error, but now the test continues without
problems.

-- 
Have a nice day,
Timofey.
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On 12/09/17 11:25, Timofey Titovets wrote:
> AFAIK, if while read BTRFS get Read Error in RAID1, application will also
> see that error and if application can't handle it -> you got a problems
>
> So Btrfs RAID1 ONLY protect data, not application (qemu in your case).

That's news to me! Why doesn't it try another copy and when does it correct
the error then? Any idea on how to work it around at least for qemu?
(Assemble the array from within the VM?)

-- 
With Best Regards,
Marat Khalili
Re: qemu-kvm VM died during partial raid1 problems of btrfs
2017-09-12 11:02 GMT+03:00 Marat Khalili:
> Thanks to the help from the list I've successfully replaced part of a btrfs
> raid1 filesystem. However, while I waited for best opinions on the course
> of actions, the root filesystem of one of the qemu-kvm VMs went read-only,
> and this root was of course based in a qcow2 file on the problematic btrfs
> (the root filesystem of the VM itself is ext4, not btrfs). It is very well
> possible that it is a coincidence or something induced by heavier than
> usual IO load, but it is hard for me to ignore the possibility that somehow
> the hardware error was propagated to the VM. Is it possible?
>
> No other processes on the machine developed any problems, but:
> (1) it is very well possible that the problematic sector belonged to this
> qcow2 file;
> (2) it is a kernel VM after all, and it might bypass normal IO paths of
> userspace processes;
> (3) it is possible that it uses O_DIRECT or something, and btrfs raid1
> does not fully protect this kind of access.
> Does this make any sense?
>
> I could not login to the VM normally to see logs, and made the big mistake
> of rebooting it. Now all I see in its logs is a big hole, since, well, it
> went read-only :( I'll try to find out if (1) above is true after I finish
> migrating data from the HDD and remove it. I wonder where else can I look?

AFAIK, if btrfs gets a read error while reading in RAID1, the application
will also see that error, and if the application can't handle it -> you've
got problems.

So btrfs RAID1 ONLY protects data, not the application (qemu in your case).

-- 
Have a nice day,
Timofey.
qemu-kvm VM died during partial raid1 problems of btrfs
Thanks to the help from the list I've successfully replaced part of a btrfs
raid1 filesystem. However, while I waited for best opinions on the course of
actions, the root filesystem of one of the qemu-kvm VMs went read-only, and
this root was of course based in a qcow2 file on the problematic btrfs (the
root filesystem of the VM itself is ext4, not btrfs). It is very well
possible that it is a coincidence or something induced by heavier than usual
IO load, but it is hard for me to ignore the possibility that somehow the
hardware error was propagated to the VM. Is it possible?

No other processes on the machine developed any problems, but:
(1) it is very well possible that the problematic sector belonged to this
qcow2 file;
(2) it is a kernel VM after all, and it might bypass normal IO paths of
userspace processes;
(3) it is possible that it uses O_DIRECT or something, and btrfs raid1 does
not fully protect this kind of access.
Does this make any sense?

I could not login to the VM normally to see logs, and made the big mistake
of rebooting it. Now all I see in its logs is a big hole, since, well, it
went read-only :( I'll try to find out if (1) above is true after I finish
migrating data from the HDD and remove it. I wonder where else can I look?

-- 
With Best Regards,
Marat Khalili