Re: recommendations and contraindications of using btrfs for Oracle Database Server
I'd advise migrating from this configuration ASAP. Each of the following is reason enough to do so: 1. Running BTRFS on top of kernel 3.10 (there can be vendor backports, but most likely they only include absolutely critical fixes for the BTRFS part, not everything). And RedHat (that's your vendor, right?) recently abandoned BTRFS. 2. Running a database on top of BTRFS with snapshots, unless the write load is very light and snapshots are very rare and few. Besides, taking storage snapshots is hardly a good way to back up databases. Oracle certainly provides better methods like online replication.

On 11/01/18 13:51, Ext-Strii-Houttemane Philippe wrote:
> ORA-63999: data file suffered media failure
> ORA-01114: IO error writing block to file 99 (block # 4)
> ORA-01110: data file 99: '/oradata/PS92PRD/data/pcapp.dbf'
> ORA-27072: File I/O error
> Linux-x86_64 Error: 5: Input/output error
> Additional information: 4
> Additional information: 4

There might be messages in syslog/dmesg about this.

> It never happens with other filesystem types; all the hardware has been checked. I suspect a Btrfs feature activated via our mount options rather than a bug: Oracle sees a ghost or a duplicated block even though the copy-on-write feature is disabled.

I never saw this, even while using BTRFS on kernel 3.11. But note that snapshots kind of disable nodatacow, so you still effectively have cow among them (that's the price of convenience).

> Mount options: defaults,nofail,nodatacow,nobarrier,noatime

"nobarrier" looks a bit scary, unless you're using some fancy battery-backed controller.

> btrfs fi show:
> Label: 'oradataBtrfs'  Total devices 1  FS bytes used 1.98TiB
> devid 1 size 3.18TiB used 2.02TiB path /dev/sdb1

At least you are not using BTRFS RAID in this configuration, good.

-- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Recommendations for balancing as part of regular maintenance?
On 08/01/18 19:34, Austin S. Hemmelgarn wrote:
> A: While not strictly necessary, running regular filtered balances (for example `btrfs balance start -dusage=50 -dlimit=2 -musage=50 -mlimit=4`, see `man btrfs-balance` for more info on what the options mean) can help keep a volume healthy by mitigating the things that typically cause ENOSPC errors.

The choice of words is not very fortunate IMO. In my view, a volume ceasing to be "healthy" during normal operation presumes some bugs (at least shortcomings) in the filesystem code. In this case I'd prefer to have a detailed understanding of the situation before copy-pasting commands from wiki pages. Remember, most users don't run cutting-edge kernels and tools, preferring LTS distribution releases instead, so one size might not fit all.

On 08/01/18 23:29, Martin Raiber wrote:
> There have been reports of (rare) corruption caused by balance (won't be detected by a scrub) here on the mailing list. So I would stay away from btrfs balance unless it is absolutely needed (ENOSPC), and while it is run I would try not to do anything else w.r.t. writes simultaneously.

This is my opinion too as a normal user, based upon reading this list and my own attempts to recover from ENOSPC. I'd rather re-create the filesystem from scratch, or at least make a full verified backup, before attempting to fix problems with balance.

-- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: A Big Thank You, and some Notes on Current Recovery Tools.
> I think the 1-3TB Seagate drives are garbage. There are known problems with ST3000DM001, but first of all you should not put PC-oriented disks in RAID, they are not designed for it on multiple levels (vibration tolerance, error reporting...) There are similar horror stories about people filling whole cases with WD Greens and observing their (non-BTRFS) RAID 6 fail. (Sorry for OT.) -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Improve subvolume usability for a normal user
On 07/12/17 08:27, Qu Wenruo wrote:
> When doing snapshot, btrfs only needs to increase reference of 2nd highest level tree blocks of original snapshot, other than "walking the tree". (If tree root level is 2, then level 2 node is copied, while all reference of level 1 tree blocks get increased)

Out of curiosity, how does it interact with nocow files? Does every write to these files involve a backref walk?

-- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: command to get quota limit
AFAIK you don't need the subvolume id, `btrfs qgroup show -ref /path/to/subvolume` shows the necessary qgroup for me. Extracting just the value you need is more involved:

btrfs qgroup show -ref --raw /path/to/subvolume | tail -n +3 | tr -s ' ' | cut -d ' ' -f 4

Not sure how robust this is, though.

-- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: notification about corrupt files from "btrfs scrub" in cron
On 23/11/17 15:59, Mike Fleetwood wrote:
> Cron starts configured jobs at the scheduled time asynchronously. I.e. it doesn't block waiting for each command to finish. Cron notices when the job finishes and any output produced, written to stdout and/or stderr, by the job is emailed to the user. So no, a 2 hour job is not a problem for cron.

Minor additional advice -- prepend your command with:

flock --nonblock /var/run/scrub.lock

to avoid running several scrubs simultaneously in case one takes more than 24 hours to finish (a sample cron entry follows below).

-- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
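For example, an /etc/cron.d entry could look something like this (untested; path, schedule and options are just examples; -B keeps the scrub in the foreground so the lock is held until it finishes):

0 3 * * * root flock --nonblock /var/run/scrub.lock btrfs scrub start -B /mnt/data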
Re: Tiered storage?
On 15/11/17 10:11, waxhead wrote: hint: you need more than two for raid1 if you want to stay safe Huh? Two is not enough? Having three or more makes a difference? (Or, you mean hot spare?) -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)
On 14/11/17 06:39, Dave wrote:
> My rsync command currently looks like this:
> rsync -axAHv --inplace --delete-delay --exclude-from="/some/file" "$source_snapshop/" "$backup_location"

As I learned from Kai Krakow on this mailing list, you should also add --no-whole-file if both sides are local. Otherwise target space usage can be much worse (but fragmentation much better). I wonder what your justification for --delete-delay is; I just use --delete.

Here's what I use: --verbose --archive --hard-links --acls --xattrs --numeric-ids --inplace --delete --delete-excluded --stats. Since in my case the source is always remote, there's no --no-whole-file, but there's --numeric-ids.

> In particular, I want to know if I should or should not be using these options:
> -H, --hard-links          preserve hard links
> -A, --acls                preserve ACLs (implies -p)
> -X, --xattrs              preserve extended attributes
> -x, --one-file-system     don't cross filesystem boundaries

I don't know of any semantic use of hard links in modern systems. There are ACLs on some files in /var/log/journal on systems with systemd. Synology actively uses ACLs, but its implementation is sadly incompatible with rsync. There can always be some ACLs or xattrs set by the sysadmin manually. End result, I always specify the first three options where possible, just in case (even though the man page says that --hard-links may affect performance).

> I had to use the "x" option to prevent rsync from deleting files in snapshots in the backup location (as the source location does not retain any snapshots). Is there a better way?

Don't keep snapshots under the rsync target, place them under ../snapshots (if snapper supports this):

# find . -maxdepth 2
.
./snapshots
./snapshots/2017-11-08T13:18:20+00:00
./snapshots/2017-11-08T15:10:03+00:00
./snapshots/2017-11-08T23:28:44+00:00
./snapshots/2017-11-09T23:41:30+00:00
./snapshots/2017-11-10T22:44:36+00:00
./snapshots/2017-11-11T21:48:19+00:00
./snapshots/2017-11-12T21:27:41+00:00
./snapshots/2017-11-13T23:29:49+00:00
./rsync

Or, specify them in --exclude and avoid using --delete-excluded. Or keep using -x if it works, why not?

-- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
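Assembled into one invocation, the above would be something like (untested, paths are just examples; --no-whole-file added per the first paragraph, assuming both sides are local):

rsync --verbose --archive --hard-links --acls --xattrs --numeric-ids --inplace --no-whole-file --delete --delete-excluded --stats /mnt/source/current/ /mnt/backup/rsync/

with snapshots kept in /mnt/backup/snapshots next to the target rather than inside it.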
Re: Should cp default to reflink?
Obviously (for me) yes, but who will decide? There should be --no-reflink for people trying to defragment something. >Seems to me any request to duplicate should be optimized by default >with an auto reflink when possible, and require an explicit option to >inhibit. Key word is "any". I much more often use rsync than cp within the same volume. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Problem with file system
> How is this an issue? Discard is issued only once we're positive there's no reference to the freed blocks anywhere. At that point, they're also open for reuse, thus they can be arbitrarily scribbled upon.

The point was, how about keeping this reference for some time period?

> Unless your hardware is seriously broken (such as lying about barriers, which is nearly-guaranteed data loss on btrfs anyway), there's no way the filesystem will ever reference such blocks.

Buggy hardware happens. So do buggy filesystems ;) Besides, most filesystems let the user recover most data after losing just one sector; it would be a pity if BTRFS with all its COW coolness didn't.

> Why would you special-case metadata? Metadata that points to overwritten or discarded blocks is of no use either.

It takes significant time to overwrite a noticeable portion of the data on disk, but loss of metadata makes it all gone in a moment. Moreover, the user is usually prepared to lose some recently changed data in a crash, but not data that they didn't even touch.

-- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: defragmenting best practice?
On 02/11/17 04:39, Dave wrote:
> I'm going to make this change now. What would be a good way to implement this so that the change applies to the $HOME/.cache of each user?

I'd make each user's .cache a symlink (this should work, but if it doesn't, use a bind mount) to a per-user directory in some separately mounted volume with the necessary options.

-- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
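A minimal per-user sketch of what I mean (untested; "alice" and /mnt/cache are just examples, and /mnt/cache is assumed to be a volume mounted separately with the desired options, e.g. nodatacow, outside any snapshotted subvolume):

u=alice
mkdir -p "/mnt/cache/$u"
chown "$u:$u" "/mnt/cache/$u"
rm -rf "/home/$u/.cache"
ln -s "/mnt/cache/$u" "/home/$u/.cache"
# if a symlink confuses some application, use a bind mount instead
# (plus an /etc/fstab entry to make it persistent):
#   mount --bind "/mnt/cache/$u" "/home/$u/.cache"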
Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)
On 01/11/17 09:51, Dave wrote:
>> As already said by Romain Mamedov, rsync is a viable alternative to send-receive with much less hassle. According to some reports it can even be faster.
> Thanks for confirming. I must have missed those reports. I had never considered this idea until now -- but I like it. Are there any blogs or wikis where people have done something similar to what we are discussing here?

I don't know any. Probably someone needs to write it.

>>> We will delete most snapshots on the live volume, but retain many (or all) snapshots on the backup block device. Is that a good strategy, given my goals?
>> Depending on the way you use it, retaining even a dozen snapshots on a live volume might hurt performance (for high-performance databases) or be completely transparent (for user folders). You may want to experiment with this number.
> We do experience severe performance problems now, especially with Firefox. Part of my experiment is to reduce the number of snapshots on the live volumes, hence this question.

Just for statistics, how many snapshots do you have and how often do you take them? It's on SSD, right?

> Thanks. I hope you do find time to publish it. (And what do you mean by portable?) For now, Snapper has a cleanup algorithm that we can use. At least one of the tools listed here has a thinout algorithm too: https://btrfs.wiki.kernel.org/index.php/Incremental_Backup

It is currently a small part of yet another home-grown backup tool which is itself fairly big and tuned to a particular environment. I have thought many times that it would be very nice to have the thinning tool separately and with no unnecessary dependencies, but... BTW, beware of deleting too many snapshots at once with any tool. Delete a few and let the filesystem stabilize before proceeding.

>>> Should I consider a dedup tool like one of these?
>> Certainly NOT for snapshot-based backups: it is already deduplicated almost as much as possible, dedup tools can only make it *less* deduplicated.
> The question is whether to use a dedup tool on the live volume which has a few snapshots. Even with the new strategy (based on rsync), the live volume may sometimes have two snapshots (pre- and post- pacman upgrades).

For a deduplication tool to be useful you ought to have some duplicate data on your live volume. Do you have any (e.g. many LXC containers with the same distribution)?

> Also still wondering about these options: no-holes, skinny metadata, or extended inode refs?

I don't know anything about any of these, sorry.

P.S. I still think you need some off-system backup solution too, either rsync+snapshot-based over ssh or e.g. Burp (shameless advertising: http://burp.grke.org/ ).

-- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)
I'm an active user of backups using btrfs snapshots. Generally it works, with some caveats. You seem to have two tasks: (1) same-volume snapshots (I would not call them backups) and (2) updating some backup volume (preferably on a different box). By solving them separately you can avoid some complexity, like accidental removal of a snapshot that's still needed for updating the backup volume.

> To reconcile those conflicting goals, the only idea I have come up with so far is to use btrfs send-receive to perform incremental backups as described here: https://btrfs.wiki.kernel.org/index.php/Incremental_Backup .

As already said by Romain Mamedov, rsync is a viable alternative to send-receive with much less hassle. According to some reports it can even be faster.

> Given the hourly snapshots, incremental backups are the only practical option. They take mere moments. Full backups could take an hour or more, which won't work with hourly backups.

I don't see much sense in re-doing full backups to the same physical device. If you care about backup integrity, it is probably more important to invest in backup verification. (OTOH, while you didn't reveal the data size, if a full backup takes just an hour on your system then why not?)

> We will delete most snapshots on the live volume, but retain many (or all) snapshots on the backup block device. Is that a good strategy, given my goals?

Depending on the way you use it, retaining even a dozen snapshots on a live volume might hurt performance (for high-performance databases) or be completely transparent (for user folders). You may want to experiment with this number. In any case I'd not recommend retaining ALL snapshots on the backup device, even if you have infinite space. Such a filesystem would be as dangerous as the demon core, only good for adding more snapshots (not even deleting them), and any little mistake will blow everything up. Keep a few dozen, a hundred at most. Unlike other backup systems, you can fairly easily remove snapshots in the middle of the sequence; use this opportunity. My thin-out rule is: remove a snapshot if the resulting gap will be less than some fraction (e.g. 1/4) of its age (a rough sketch follows below the message). One day I'll publish a portable solution on github.

> Given this minimal retention of snapshots on the live volume, should I defrag it (assuming there is at least 50% free space available on the device)? (BTW, is defrag OK on an NVMe drive? or an SSD?) In the above procedure, would I perform that defrag before or after taking the snapshot? Or should I use autodefrag?

I ended up using autodefrag, didn't try manual defragmentation. I don't use SSDs as backup volumes.

> Should I consider a dedup tool like one of these?

Certainly NOT for snapshot-based backups: it is already deduplicated almost as much as possible, dedup tools can only make it *less* deduplicated.

> * Footnote: On the backup device, maybe we will never delete snapshots. In any event, that's not a concern now. We'll retain many, many snapshots on the backup device.

Again, DO NOT do this, btrfs in its current state does not support it. A good rule of thumb for the time of some operations is the data size multiplied by the number of snapshots (raised to some power >= 1) and divided by the IO/CPU speed. By creating snapshots it is very easy to create petabytes of data for the kernel to process, which it won't be able to do in many years.

-- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
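For what it's worth, the thin-out rule above could look something like this in bash (untested sketch; it assumes the snapshots live in one directory and are named with ISO 8601 timestamps that GNU date can parse, and it only prints what it would delete):

#!/bin/bash
snapdir=/mnt/backup/snapshots          # hypothetical location
fraction=4                             # delete if the resulting gap stays below age/4
now=$(date +%s)
mapfile -t snaps < <(ls -1 "$snapdir" | sort)
i=1
while (( i < ${#snaps[@]} - 1 )); do   # never touch the oldest or the newest snapshot
    prev=$(date -d "${snaps[i-1]}" +%s)
    cur=$(date -d "${snaps[i]}" +%s)
    next=$(date -d "${snaps[i+1]}" +%s)
    # gap that removing snaps[i] would leave behind, compared to its age
    if (( (next - prev) * fraction < now - cur )); then
        echo "would delete $snapdir/${snaps[i]}"    # replace echo with: btrfs subvolume delete
        unset 'snaps[i]'
        snaps=("${snaps[@]}")          # re-index after the removal
    else
        (( i++ ))
    fi
done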
Re: Problem with file system
On 31/10/17 00:37, Chris Murphy wrote: But off hand it sounds like hardware was sabotaging the expected write ordering. How to test a given hardware setup for that, I think, is really overdue. It affects literally every file system, and Linux storage technology. It kinda sounds like to me something other than supers is being overwritten too soon, and that's why it's possible for none of the backup roots to find a valid root tree, because all four possible root trees either haven't actually been written yet (still) or they've been overwritten, even though the super is updated. But again, it's speculation, we don't actually know why your system was no longer mountable. Just a detached view: I know hardware should respect ordering/barriers and such, but how hard is it really to avoid overwriting at least one complete metadata tree for half an hour (even better, yet another one for a day)? Just metadata, not data extents. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs-subv-backup v0.1b
Hello Austin, Looks very useful. Two questions: 1. Can you release it under some standard license recognized by github, in case someone wants to include it in other projects? AGPL-3.0 would be nice. 2. I don't understand mentioned restore performance issues. It shouldn't apply if data is restored _after_ subvolume structure is re-created, but even if (1) data is already there, and (2) copyless move doesn't work between subvolumes (really a limitation of some older systems, not Python), there's a known workaround of creating a reflink and then removing the original. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: subvolume: outputs message only when operation succeeds
On 25/09/17 17:33, Qu Wenruo wrote: (Any in this case, anyone in the maillist can help review messages) If this is a question, I can help with assigning levels to messages. Although I think many levels are only required for complex daemons or network tools, while btrfs utils mostly perform atomic operations which either succeed or fail. But it's of course hard to be sure without seeing all actual messages, probably there's some use for 4 levels. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: subvolume: outputs message only when operation succeeds
On 25/09/17 11:08, Qu Wenruo wrote: What about redirecting stdout to /dev/null and redirecting stderr to mail if return value is not 0? s/if return value is not 0/if return value is 0/. The main point is, if btrfs returns 0, then nothing to worry about. (Unless there is a bug, even in that case keeping an eye on stderr should be enough to catch that) Redirection to /dev/null will work. However, 1) It will always look suspicious. grep -v with the expected message is at least clear about its intent and consequences. 2) Although shorter than grep -v, it will still take space in shell scripts and force one to remember which btrfs commands it has to be added after. This is already inconvenient enough to want a fix. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs-progs: subvolume: outputs message only when operation succeeds
On 25/09/17 10:52, Hugo Mills wrote:
>> Isn't the correct way to catch the return value instead of grepping the output?
> It is, but if, for example, you're using the command in a cron script which is expected to work, you don't want it producing output because then you get a mail every time the script runs. So you have to grep -v on the "success" output to make the successful script silent.
>> If it's some command not returning value properly, would you please report it as a bug so we can fix it.
> It's not the return value that's problematic (although those used to be a real mess). It's the fact that a successful run of the command produces noise on stdout, which most commands don't.

Yes, exactly: cron, mail -E and just long scripts where btrfs operations are small steps here and there. (On top of this, actually catching the return value from the right command before `| grep -v` with errexit and pipefail on is so difficult that I usually end up rewriting the whole mess in Python. Which would be a nice result in itself if it didn't take a whole day in place of one minute for a bash line. A sketch of the least painful bash workaround I know of follows below.)

-- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
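The workaround I mean looks roughly like this (untested sketch; the snapshot path is just an example and the exact "success" message prefix may differ between btrfs-progs versions):

#!/bin/bash
set -o errexit -o pipefail
snap=$1                                   # path of the snapshot to delete
# Check the exit status of the btrfs command itself first, then filter the
# "success" line out of the captured output so cron stays quiet.
if ! out=$(btrfs subvolume delete "$snap" 2>&1); then
    printf '%s\n' "$out" >&2
    exit 1
fi
printf '%s\n' "$out" | grep -v '^Delete subvolume' || true   # grep exits 1 when everything is filtered out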
Re: [PATCH] btrfs-progs: subvolume: outputs message only when operation succeeds
On 25/09/17 10:30, Nikolay Borisov wrote:
> On 19.09.2017 10:41, Misono, Tomohiro wrote:
>> "btrfs subvolume create/delete" outputs the message of "Create/Delete subvolume ..." even when an operation fails. Since it is confusing, let's output the message only when an operation succeeds.
> Please change the verb to past tense, more strongly signaling success - i.e. "Created subvolume"

What about recalling some UNIX standards and returning to NOT outputting any message when the operation succeeds? My scripts are full of grep -v calls after each btrfs command, and this sucks (and I don't think I'm alone in this situation). If you change the message, a lot of scripts will have to be changed; at least make it worth it.

-- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Does btrfs use crc32 for error correction?
Would be cool, but probably not wise IMHO, since on modern hardware you almost never get one-bit errors (usually it's a whole sector of garbage), and therefore you'd more often see an incorrect recovery than actually fixed bit. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: A user cannot remove his readonly snapshots?!
On 16/09/17 13:19, Ulli Horlacher wrote:
> How do I know the btrfs filesystem for a given subvolume? Do I really have to manually test the directory path upwards?

It was discussed recently: the answer is, unfortunately, yes, until someone patches df to do it for us. You can do it more or less efficiently by analyzing /proc/mounts (a rough sketch follows below).

-- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
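Something along these lines (untested sketch; assumes paths without spaces and a kernel that reports subvol= in /proc/mounts):

#!/bin/bash
# Walk the given path upwards until it is itself a btrfs mount point, then
# print every mount of the same device that has subvol=/ (the volume root).
path=$(readlink -f "$1")
is_btrfs_mnt() { awk -v p="$1" '$2 == p && $3 == "btrfs" {found=1} END {exit !found}' /proc/mounts; }
while [ "$path" != "/" ] && ! is_btrfs_mnt "$path"; do
    path=$(dirname "$path")
done
dev=$(awk -v p="$path" '$2 == p && $3 == "btrfs" {print $1; exit}' /proc/mounts)
awk -v d="$dev" '$1 == d && $3 == "btrfs" && $4 ~ /(^|,)subvol=\/(,|$)/ {print $2}' /proc/mounts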
Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
May I state my user's point of view: I know one application that uses O_DIRECT, and it is subtly broken on BTRFS. I know of no applications that use O_DIRECT and are not broken. (Really, more statistics would help here; probably some exist that provably work.) According to developers, making O_DIRECT work on BTRFS is difficult if not impossible. Isn't it time to disable O_DIRECT, like ZFS does AFAIU? Data safety is certainly more important than the performance gain it may or may not give some applications. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On 13/09/17 16:23, Chris Murphy wrote: Right, known problem. To use o_direct implies also using nodatacow (or at least nodatasum), e.g. xattr +C is set, done by qemu-img -o nocow=on https://www.spinics.net/lists/linux-btrfs/msg68244.html Can you please elaborate? I don't have exactly the same problem as described by the link, but I'm still worried that particularly qemu can be less resilient to partial raid1 failures even on newer kernels, due to missing checksums for instance. (BTW I didn't find any xattrs on my VM images, nor do I plan to set any.) -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On 12/09/17 14:12, Adam Borowski wrote:
> Why would you need support in the hypervisor if cp --reflink=always is enough?

+1 :) But I've already found one problem: I use rsync snapshots for backups, and although rsync does have a --sparse argument, apparently it conflicts with --inplace. You cannot have all the nice things :( I think I'll simply try to minimize the size of VM root partitions and won't worry too much about a gig or two of extra zeroes in the backup, at least until some autopunchholes mount option arrives.

-- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
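One workaround I'm considering (untested; paths are just examples): keep --inplace for rsync and re-punch the holes afterwards with util-linux fallocate, assuming the backup filesystem supports hole punching:

rsync --archive --inplace root@host:/var/lib/libvirt/images/vm.raw /mnt/backup/images/
fallocate --dig-holes /mnt/backup/images/vm.raw   # re-sparsify blocks that are all zeroes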
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On 12/09/17 13:01, Duncan wrote: AFAIK that's wrong -- the only time the app should see the error on btrfs raid1 is if the second copy is also bad So thought I, but... IIRC from what I've read on-list, qcow2 isn't the best alternative for hosting VMs on top of btrfs. Yeah, I've seen discussions about it here too, but in my case VMs write very little (mostly logs and distro updates), so I decided it can live as it is for a while. But I'm looking for better solutions as long as they are not too complicated. On 12/09/17 13:32, Adam Borowski wrote: Just use raw -- btrfs already has every feature that qcow2 has, and does it better. This doesn't mean btrfs is the best choice for hosting VM files, just that raw-over-btrfs is strictly better than qcow2-over-btrfs. Thanks for advice, I wasn't sure I won't lose features, and was too lazy to investigate/ask. Now it looks simple. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On 12/09/17 12:21, Timofey Titovets wrote: Can't reproduce that on latest kernel: 4.13.1 Great! Thank you very much for the test. Do you know if it's fixed in 4.10? (or what particular version does?) -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qemu-kvm VM died during partial raid1 problems of btrfs
On 12/09/17 11:25, Timofey Titovets wrote: AFAIK, if while read BTRFS get Read Error in RAID1, application will also see that error and if application can't handle it -> you got a problems So Btrfs RAID1 ONLY protect data, not application (qemu in your case). That's news to me! Why doesn't it try another copy and when does it correct the error then? Any idea on how to work it around at least for qemu? (Assemble the array from within the VM?) -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
qemu-kvm VM died during partial raid1 problems of btrfs
Thanks to the help from the list I've successfully replaced part of a btrfs raid1 filesystem. However, while I waited for the best opinions on the course of action, the root filesystem of one of the qemu-kvm VMs went read-only, and this root was of course based in a qcow2 file on the problematic btrfs (the root filesystem of the VM itself is ext4, not btrfs). It is very well possible that it is a coincidence or something induced by heavier than usual IO load, but it is hard for me to ignore the possibility that somehow the hardware error was propagated to the VM. Is it possible? No other processes on the machine developed any problems, but: (1) it is very well possible that the problematic sector belonged to this qcow2 file; (2) it is a Kernel VM after all, and it might bypass the normal IO paths of userspace processes; (3) it is possible that it uses O_DIRECT or something, and btrfs raid1 does not fully protect this kind of access. Does this make any sense? I could not log in to the VM normally to see the logs, and made the big mistake of rebooting it. Now all I see in its logs is a big hole, since, well, it went read-only :( I'll try to find out if (1) above is true after I finish migrating data from the HDD and remove it. I wonder where else can I look? -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Please help with exact actions for raid1 hot-swap
Patrik, Duncan, thank you for the help. The `btrfs replace start /dev/sdb7 /dev/sdd7 /mnt/data` worked without a hitch (though I didn't try to reboot yet, still have grub/efi/several mdadm partitions to copy). It also worked much faster than mdadm would take, apparently only moving 126GB used, not 2.71TB total. Interestingly, according to HDD lights it mostly read from the remaining /dev/sda, not from replaced /dev/sdb (which must be completely readable now according to smartctl -- problematic sector got finally remapped after ~1day). It now looks like follows: $ sudo blkid /dev/sda7 /dev/sdb7 /dev/sdd7 /dev/sda7: LABEL="data" UUID="37d3313a-e2ad-4b7f-98fc-a01d815952e0" UUID_SUB="db644855-2334-4d61-a27b-9a591255aa39" TYPE="btrfs" PARTUUID="c5ceab7e-e5f8-47c8-b922-c5fa0678831f" /dev/sdb7: PARTUUID="493923cd-9ecb-4ee8-988b-5d0bfa8991b3" /dev/sdd7: LABEL="data" UUID="37d3313a-e2ad-4b7f-98fc-a01d815952e0" UUID_SUB="9c2f05e9-5996-479f-89ad-f94f7ce130e6" TYPE="btrfs" PARTUUID="178cd274-7251-4d25-9116-ce0732d2410b" $ sudo btrfs fi show /dev/sdb7 ERROR: no btrfs on /dev/sdb7 $ sudo btrfs fi show /dev/sdd7 Label: 'data' uuid: 37d3313a-e2ad-4b7f-98fc-a01d815952e0 Total devices 2 FS bytes used 108.05GiB devid1 size 2.71TiB used 131.03GiB path /dev/sda7 devid2 size 2.71TiB used 131.03GiB path /dev/sdd7 Does this mean: * I should not be afraid to reboot and find /dev/sdb7 mounted again? * I will not be able to easily mount /dev/sdb7 on a different computer to do some tests? Also, although /dev/sdd7 is much larger than /dev/sdb7 was, `btrfs fi show` still displays it as 2.71TiB, why? -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Please help with exact actions for raid1 hot-swap
It doesn't need replaced disk to be readable, right? Then what prevents same procedure to work without a spare bay? -- With Best Regards, Marat Khalili On September 9, 2017 1:29:08 PM GMT+03:00, Patrik Lundquist <patrik.lundqu...@gmail.com> wrote: >On 9 September 2017 at 12:05, Marat Khalili <m...@rqc.ru> wrote: >> Forgot to add, I've got a spare empty bay if it can be useful here. > >That makes it much easier since you don't have to mount it degraded, >with the risks involved. > >Add and partition the disk. > ># btrfs replace start /dev/sdb7 /dev/sdc(?)7 /mnt/data > >Remove the old disk when it is done. > >> -- >> >> With Best Regards, >> Marat Khalili >> >> On September 9, 2017 10:46:10 AM GMT+03:00, Marat Khalili ><m...@rqc.ru> wrote: >>>Dear list, >>> >>>I'm going to replace one hard drive (partition actually) of a btrfs >>>raid1. Can you please spell exactly what I need to do in order to get >>>my >>>filesystem working as RAID1 again after replacement, exactly as it >was >>>before? I saw some bad examples of drive replacement in this list so >I >>>afraid to just follow random instructions on wiki, and putting this >>>system out of action even temporarily would be very inconvenient. >>> >>>For this filesystem: >>> >>>> $ sudo btrfs fi show /dev/sdb7 >>>> Label: 'data' uuid: 37d3313a-e2ad-4b7f-98fc-a01d815952e0 >>>> Total devices 2 FS bytes used 106.23GiB >>>> devid1 size 2.71TiB used 126.01GiB path /dev/sda7 >>>> devid2 size 2.71TiB used 126.01GiB path /dev/sdb7 >>>> $ grep /mnt/data /proc/mounts >>>> /dev/sda7 /mnt/data btrfs >>>> rw,noatime,space_cache,autodefrag,subvolid=5,subvol=/ 0 0 >>>> $ sudo btrfs fi df /mnt/data >>>> Data, RAID1: total=123.00GiB, used=104.57GiB >>>> System, RAID1: total=8.00MiB, used=48.00KiB >>>> Metadata, RAID1: total=3.00GiB, used=1.67GiB >>>> GlobalReserve, single: total=512.00MiB, used=0.00B >>>> $ uname -a >>>> Linux host 4.4.0-93-generic #116-Ubuntu SMP Fri Aug 11 21:17:51 UTC >>>> 2017 x86_64 x86_64 x86_64 GNU/Linux >>> >>>I've got this in dmesg: >>> >>>> [Sep 8 20:31] ata6.00: exception Emask 0x0 SAct 0x7ecaa5ef SErr 0x0 >>>> action 0x0 >>>> [ +0.51] ata6.00: irq_stat 0x4008 >>>> [ +0.29] ata6.00: failed command: READ FPDMA QUEUED >>>> [ +0.38] ata6.00: cmd 60/70:18:50:6c:f3/00:00:79:00:00/40 tag >3 >>>> ncq 57344 in >>>>res 41/40:00:68:6c:f3/00:00:79:00:00/40 >Emask >>>> 0x409 (media error) >>>> [ +0.94] ata6.00: status: { DRDY ERR } >>>> [ +0.26] ata6.00: error: { UNC } >>>> [ +0.001195] ata6.00: configured for UDMA/133 >>>> [ +0.30] sd 6:0:0:0: [sdb] tag#3 FAILED Result: >hostbyte=DID_OK >>>> driverbyte=DRIVER_SENSE >>>> [ +0.05] sd 6:0:0:0: [sdb] tag#3 Sense Key : Medium Error >>>> [current] [descriptor] >>>> [ +0.04] sd 6:0:0:0: [sdb] tag#3 Add. Sense: Unrecovered read >>>> error - auto reallocate failed >>>> [ +0.05] sd 6:0:0:0: [sdb] tag#3 CDB: Read(16) 88 00 00 00 00 >00 >>> >>>> 79 f3 6c 50 00 00 00 70 00 00 >>>> [ +0.03] blk_update_request: I/O error, dev sdb, sector >>>2045996136 >>>> [ +0.47] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0, >>>rd >>>> 1, flush 0, corrupt 0, gen 0 >>>> [ +0.62] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0, >>>rd >>>> 2, flush 0, corrupt 0, gen 0 >>>> [ +0.77] ata6: EH complete >>> >>>There's still 1 in Current_Pending_Sector line of smartctl output as >of >>> >>>now, so it probably won't heal by itself. 
>>> >>>-- >>> >>>With Best Regards, >>>Marat Khalili >>>-- >>>To unsubscribe from this list: send the line "unsubscribe >linux-btrfs" >>>in >>>the body of a message to majord...@vger.kernel.org >>>More majordomo info at http://vger.kernel.org/majordomo-info.html >> -- >> To unsubscribe from this list: send the line "unsubscribe >linux-btrfs" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >-- >To unsubscribe from this list: send the line "unsubscribe linux-btrfs" >in >the body of a message to majord...@vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Please help with exact actions for raid1 hot-swap
Forgot to add, I've got a spare empty bay if it can be useful here. -- With Best Regards, Marat Khalili On September 9, 2017 10:46:10 AM GMT+03:00, Marat Khalili <m...@rqc.ru> wrote: >Dear list, > >I'm going to replace one hard drive (partition actually) of a btrfs >raid1. Can you please spell exactly what I need to do in order to get >my >filesystem working as RAID1 again after replacement, exactly as it was >before? I saw some bad examples of drive replacement in this list so I >afraid to just follow random instructions on wiki, and putting this >system out of action even temporarily would be very inconvenient. > >For this filesystem: > >> $ sudo btrfs fi show /dev/sdb7 >> Label: 'data' uuid: 37d3313a-e2ad-4b7f-98fc-a01d815952e0 >> Total devices 2 FS bytes used 106.23GiB >> devid1 size 2.71TiB used 126.01GiB path /dev/sda7 >> devid2 size 2.71TiB used 126.01GiB path /dev/sdb7 >> $ grep /mnt/data /proc/mounts >> /dev/sda7 /mnt/data btrfs >> rw,noatime,space_cache,autodefrag,subvolid=5,subvol=/ 0 0 >> $ sudo btrfs fi df /mnt/data >> Data, RAID1: total=123.00GiB, used=104.57GiB >> System, RAID1: total=8.00MiB, used=48.00KiB >> Metadata, RAID1: total=3.00GiB, used=1.67GiB >> GlobalReserve, single: total=512.00MiB, used=0.00B >> $ uname -a >> Linux host 4.4.0-93-generic #116-Ubuntu SMP Fri Aug 11 21:17:51 UTC >> 2017 x86_64 x86_64 x86_64 GNU/Linux > >I've got this in dmesg: > >> [Sep 8 20:31] ata6.00: exception Emask 0x0 SAct 0x7ecaa5ef SErr 0x0 >> action 0x0 >> [ +0.51] ata6.00: irq_stat 0x4008 >> [ +0.29] ata6.00: failed command: READ FPDMA QUEUED >> [ +0.38] ata6.00: cmd 60/70:18:50:6c:f3/00:00:79:00:00/40 tag 3 >> ncq 57344 in >>res 41/40:00:68:6c:f3/00:00:79:00:00/40 Emask >> 0x409 (media error) >> [ +0.94] ata6.00: status: { DRDY ERR } >> [ +0.26] ata6.00: error: { UNC } >> [ +0.001195] ata6.00: configured for UDMA/133 >> [ +0.30] sd 6:0:0:0: [sdb] tag#3 FAILED Result: hostbyte=DID_OK >> driverbyte=DRIVER_SENSE >> [ +0.05] sd 6:0:0:0: [sdb] tag#3 Sense Key : Medium Error >> [current] [descriptor] >> [ +0.04] sd 6:0:0:0: [sdb] tag#3 Add. Sense: Unrecovered read >> error - auto reallocate failed >> [ +0.05] sd 6:0:0:0: [sdb] tag#3 CDB: Read(16) 88 00 00 00 00 00 > >> 79 f3 6c 50 00 00 00 70 00 00 >> [ +0.03] blk_update_request: I/O error, dev sdb, sector >2045996136 >> [ +0.47] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0, >rd >> 1, flush 0, corrupt 0, gen 0 >> [ +0.000062] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0, >rd >> 2, flush 0, corrupt 0, gen 0 >> [ +0.77] ata6: EH complete > >There's still 1 in Current_Pending_Sector line of smartctl output as of > >now, so it probably won't heal by itself. > >-- > >With Best Regards, >Marat Khalili >-- >To unsubscribe from this list: send the line "unsubscribe linux-btrfs" >in >the body of a message to majord...@vger.kernel.org >More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Please help with exact actions for raid1 hot-swap
Dear list, I'm going to replace one hard drive (partition actually) of a btrfs raid1. Can you please spell exactly what I need to do in order to get my filesystem working as RAID1 again after replacement, exactly as it was before? I saw some bad examples of drive replacement in this list so I afraid to just follow random instructions on wiki, and putting this system out of action even temporarily would be very inconvenient. For this filesystem: $ sudo btrfs fi show /dev/sdb7 Label: 'data' uuid: 37d3313a-e2ad-4b7f-98fc-a01d815952e0 Total devices 2 FS bytes used 106.23GiB devid1 size 2.71TiB used 126.01GiB path /dev/sda7 devid2 size 2.71TiB used 126.01GiB path /dev/sdb7 $ grep /mnt/data /proc/mounts /dev/sda7 /mnt/data btrfs rw,noatime,space_cache,autodefrag,subvolid=5,subvol=/ 0 0 $ sudo btrfs fi df /mnt/data Data, RAID1: total=123.00GiB, used=104.57GiB System, RAID1: total=8.00MiB, used=48.00KiB Metadata, RAID1: total=3.00GiB, used=1.67GiB GlobalReserve, single: total=512.00MiB, used=0.00B $ uname -a Linux host 4.4.0-93-generic #116-Ubuntu SMP Fri Aug 11 21:17:51 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux I've got this in dmesg: [Sep 8 20:31] ata6.00: exception Emask 0x0 SAct 0x7ecaa5ef SErr 0x0 action 0x0 [ +0.51] ata6.00: irq_stat 0x4008 [ +0.29] ata6.00: failed command: READ FPDMA QUEUED [ +0.38] ata6.00: cmd 60/70:18:50:6c:f3/00:00:79:00:00/40 tag 3 ncq 57344 in res 41/40:00:68:6c:f3/00:00:79:00:00/40 Emask 0x409 (media error) [ +0.94] ata6.00: status: { DRDY ERR } [ +0.26] ata6.00: error: { UNC } [ +0.001195] ata6.00: configured for UDMA/133 [ +0.30] sd 6:0:0:0: [sdb] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE [ +0.05] sd 6:0:0:0: [sdb] tag#3 Sense Key : Medium Error [current] [descriptor] [ +0.04] sd 6:0:0:0: [sdb] tag#3 Add. Sense: Unrecovered read error - auto reallocate failed [ +0.05] sd 6:0:0:0: [sdb] tag#3 CDB: Read(16) 88 00 00 00 00 00 79 f3 6c 50 00 00 00 70 00 00 [ +0.03] blk_update_request: I/O error, dev sdb, sector 2045996136 [ +0.47] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0, rd 1, flush 0, corrupt 0, gen 0 [ +0.62] BTRFS error (device sda7): bdev /dev/sdb7 errs: wr 0, rd 2, flush 0, corrupt 0, gen 0 [ +0.77] ata6: EH complete There's still 1 in Current_Pending_Sector line of smartctl output as of now, so it probably won't heal by itself. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Is autodefrag recommended? -- re-duplication???
Dear experts, At first my reaction to just switching autodefrag on was positive, but the mentions of re-duplication are very scary. The main use of BTRFS here is backup snapshots, so re-duplication would be disastrous. In order to stick to a concrete example, let there be two files, 4KB and 4GB in size, referenced in read-only snapshots 100 times each, and 4KB of each file is rewritten each night and then another snapshot is created (let's ignore snapshot deletion here). AFAIU 8KB of additional space (+metadata) will be allocated each night without autodefrag. With autodefrag will it be perhaps 4KB+128KB, or something much worse? -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Is autodefrag recommended?
Hello list, good time of the day, More than once I have seen it mentioned in this list that the autodefrag option solves problems with no apparent drawbacks, yet it's not the default. Can you recommend just switching it on indiscriminately on all installations? I'm currently on kernel 4.4 and can switch to 4.10 if necessary (it's Ubuntu that gives us this strange choice, no idea why it's not 4.9). Only spinning rust here, no SSDs. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: number of subvolumes
> We find that typically apt is very slow on a machine with 50 or so snapshots and raid10. Slow as in probably 10x slower than doing the same update on a machine with 'single' and no snapshots.
>
> Other operations seem to be the same speed, especially disk benchmarks do not seem to indicate any performance degradation.

For a meaningful discussion it is important to take into account the fact that dpkg infamously calls fsync after changing every bit of information, so basically you're measuring fsync speed. Which is slow on btrfs (compared to simpler filesystems), but unrelated to normal work. I've got two near-identical servers here with several containers each, differing only in filesystem: btrfs-raid1 on one (for historical reasons) and ext4/mdadm-raid1 on the other; no snapshots, no reflinks. Each time, the containers on ext4 update several times faster, but in everyday operation there's no significant difference.

-- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: finding root filesystem of a subvolume?
Hmm, now I'm really confused, I just checked on the Ubuntu 17.04 and 16.04.3 VM's I have (I only run current and the most recent LTS version), and neither of them behave like this. Was also shocked, but: $ lsb_release -a No LSB modules are available. Distributor ID:Ubuntu Description:Ubuntu 16.04.3 LTS Release:16.04 Codename:xenial $ df -T | grep /mnt/data/lxc $ df -T /mnt/data/lxc Filesystem Type 1K-blocks Used Available Use% Mounted on - -2907008836 90829848 2815107576 4% /mnt/data/lxc -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: finding root filesystem of a subvolume?
I have no subvol=/ option at all: Probably depends on kernel, but I presume missing subvol means the same as subvol=/ . I am only interested in mounted volumes. If your initial path (/local/.backup/home) is a subvolume but it's not itself present in /proc/mounts then it's probably mounted as a part some higher-level subvolume, but this higher-level subvolume does not have to be root. Do you need volume root or just some higher-level subvolume that's mounted? -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: finding root filesystem of a subvolume?
On 22/08/17 15:50, Ulli Horlacher wrote: It seems, I have to scan the subvolume path upwards until I found a real mount point, I think searching /proc/mounts for the same device and subvol=/ in options is most straightforward. But what makes you think it's mounted at all? -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: slow btrfs with a single kworker process using 100% CPU
I'm a btrfs user, not a developer; developers can probably provide more detailed explanation by looking at stack traces in dmesg etc., but it's possible that there's just no quick fix (yet). I presume these are 1413 _full-volume_ snapshots. Then some operations have to process 43.65TiB*1413=62PiB of data -- well, metadata for that data, but it's still a lot as you may guess, especially if it's all heavily fragmented. You can either gradually reduce number of snapshots and wait (it may drag for weeks and months), or copy everything to a different device and reformat this one, then don't create that many snapshots again. As for "blocked for more than 120 seconds" messages in dmesg, I see them every night after I delete about a dozen of snapshots ~10TiB in _total_ volume, albeit with qgroups. These messages usually subside after about couple of hours. They only tell you what you already know: some btrfs operations are painfully slow. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: slow btrfs with a single kworker process using 100% CPU
I've one system where a single kworker process is using 100% CPU sometimes a second process comes up with 100% CPU [btrfs-transacti]. Is there anything i can do to get the old speed again or find the culprit? 1. Do you use quotas (qgroups)? 2. Do you have a lot of snapshots? Have you deleted some recently? More info about your system would help too. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Massive loss of disk space
On August 3, 2017 7:01:06 PM GMT+03:00, Goffredo Baroncelli >The file is physically extended > >ghigo@venice:/tmp$ fallocate -l 1000 foo.txt For clarity let's replace the fallocate above with: $ head -c 1000 foo.txt >ghigo@venice:/tmp$ ls -l foo.txt >-rw-r--r-- 1 ghigo ghigo 1000 Aug 3 18:00 foo.txt >ghigo@venice:/tmp$ fallocate -o 500 -l 1000 foo.txt >ghigo@venice:/tmp$ ls -l foo.txt >-rw-r--r-- 1 ghigo ghigo 1500 Aug 3 18:00 foo.txt >ghigo@venice:/tmp$ According to explanation by Austin the foo.txt at this point somehow occupies 2000 bytes of space because I can reflink it and then write another 1000 bytes of data into it without losing 1000 bytes I already have or getting out of drive space. (Or is it only true while there are open file handles?) -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Massive loss of disk space
On 02/08/17 20:52, Goffredo Baroncelli wrote: consider the following scenario: a) create a 2GB file b) fallocate -o 1GB -l 2GB c) write from 1GB to 3GB after b), the expectation is that c) always succeed [1]: i.e. there is enough space on the filesystem. Due to the COW nature of BTRFS, you cannot rely on the already allocated space because there could be a small time window where both the old and the new data exists on the disk. Just curious. With current implementation, in the following case: a) create a 2GB file1 && create a 2GB file2 b) fallocate -o 1GB -l 2GB file1 && fallocate -o 1GB -l 2GB file2 c) write from 1GB to 3GB file1 && write from 1GB to 3GB file2 will (c) always succeed? I.e. does fallocate really allocate 2GB per file, or does it only allocate additional 1GB and check free space for another 1GB? If it's only the latter, it is useless. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Best Practice: Add new device to RAID1 pool
>> This may be a stupid question , but are your pool of butter (or BTRFS pool) >> by any chance hooked up via USB? If this is USB2.0 at 480mitb/s then it is >> about 57MB/s / 4 drives = roughly 14.25 or about 11MB/s if you shave off >> some overhead. > >Nope, USB 3. Typically on scrubs I get 110MB/s that winds down to >60MB/s as it progresses to the slow parts of the disk. It could have degraded to USB2 due to bad connection/loose electrical contacts. You know USB3 needs extra wires, and if it lost some it'd connect (or reconnect) in USB2 mode. I'd check historical kernel messages just in case, and/or unmount and reconnect to be sure. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel btrfs file system wedged -- is it toast?
> The btrfs developers should have known this, and announced this, a long time ago, in various prominent ways that it would be difficult for potential new users to miss.

I'm also a user like you, and I felt like this too when I came here (BTW there are several traps in BTRFS, and others cause partial or whole filesystem loss, so you're lucky). There's truth in your words that some warning is needed, but in this open-source business it is not clear who should give it to whom. Developers on the list are actually spending their time on adding such warnings to the kernel and command-line tools, but e.g. people using a GUI and not reading dmesg over breakfast won't see them anyway. The whole situation is unfortunate because hardware and OS vendors keep hyping BTRFS and making it the default in their products when it is clearly not ready, but you're now talking to and blaming the wrong people. Personally, for me coming to this list was the most helpful thing in understanding BTRFS's current state and limitations. I'm still using it, although in a very careful and controlled manner. But browsing the list every day sadly takes time. If you can't afford it or are running something absolutely critical, better look to other, more mature filesystems. After all, as the adage says: "legacy is what we run in production".

-- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Exactly what is wrong with RAID5/6
On 21/06/17 06:48, Chris Murphy wrote:
> Another possibility is to ensure a new write is written to a new *not* full stripe, i.e. dynamic stripe size. So if the modification is a 50K file on a 4 disk raid5; instead of writing 3 64K data strips + 1 64K parity strip (a full stripe write); write out 1 64K data strip + 1 64K parity strip. In effect, a 4 disk raid5 would quickly get not just 3 data + 1 parity strip Btrfs block groups; but 1 data + 1 parity, and 2 data + 1 parity chunks, and direct those writes to the proper chunk based on size. Anyway that's beyond my ability to assess how much allocator work that is. Balance I'd expect to rewrite everything to max data strips possible; the optimization would only apply to normal operation COW.

This will make some filesystems mostly RAID1, negating all the space savings of RAID5, won't it? Isn't it easier to recalculate the parity block using the previous state of the two rewritten strips, parity and data? I don't understand all the performance implications, but it might scale better with the number of devices.

-- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: understanding differences in recoverability of raid1 vs raid10 and performance implications of unusual numbers of devices
> raid 1 writes data on all disks synchronously all the time, no tricks.
> btrfs raid1 reads data by PID%2
> 0 - first copy
> 1 - second copy

Meaning, a single-process database will only ever read one copy? At least the meaning of first/second relative to physical devices depends on the extent, right?

-- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: btrfs mounts RO, kernel oops on RW
I'm not asking for a specific endorsement, but should I be considering something like the seagate ironwolf or WD red drives? You need two qualitative features from an HDD for RAID usage: 1) being failfast (TLER in WD talk), and 2) being designed to tolerate vibrations from other disks in a box. Therefore you need _at least_ a WD Red or an alternative from Seagate. Paying more can only bring you quantitative benefits AFAIK. Just don't put desktop drives in RAID. (Sorry for being off-topic, but after some long recent discussions I don't feel as guilty. :) ) -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
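For completeness, the failfast part is something you can check, and on supporting drives set, from userspace; a hedged example with a placeholder device name:

    # Query the drive's error recovery timeout (TLER/ERC); NAS-class drives report
    # a value, most desktop drives report that it is disabled or unsupported.
    smartctl -l scterc /dev/sdX
    # Where supported, 7 seconds (the value is in tenths of a second) for both
    # reads and writes is the usual RAID-friendly setting:
    smartctl -l scterc,70,70 /dev/sdX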
Re: btrfs-tools/linux 4.11: btrfs-cleaner misbehaving
If you do have scripted snapshots being taken, be sure you have a script thinning down your snapshot history as well. I know Ivan P never mentioned qgroups, but just a warning for future readers: *with qgroups don't let this script delete more than a couple dozen snapshots at once*; then wait for btrfs kernel activity to subside before trying again. Especially be very careful when running this script in production for the very first time: it will most likely find too many snapshots to delete. (Temporarily removing all affected qgroups beforehand may also work.) -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
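Something like the following is what I mean; an untested sketch with made-up paths, the important part being the small batches and the wait for cleanup in between:

    #!/bin/bash
    # Delete doomed snapshots in small batches and let btrfs-cleaner catch up
    # in between, instead of queueing hundreds of deletions at once.
    shopt -s nullglob
    BATCH=20
    n=0
    for snap in /mnt/vol/.snapshots/to-delete/*; do
        btrfs subvolume delete "$snap"
        n=$(( n + 1 ))
        if (( n % BATCH == 0 )); then
            btrfs subvolume sync /mnt/vol   # block until deleted subvolumes are cleaned
        fi
    done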
Re: snapshot destruction making IO extremely slow
Hello, It occurs when enabling quotas on a volume. When there are a lot of snapshots that are deleted, the system becomes extremely unresponsive (IO often waiting for 30s on a SSD). When I don't have quotas, removing snapshots is fast. Same problem here. It is now common knowledge in the list that qgroups cause performance problems. I try to avoid deleting many snapshots at once because of this. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QGroups Semantics
Hit "Send" a little too early: More complete workaround would be delayed cleanup. What about (re-)mount time? (Should also handle qgroups remaining ... after subvolumes deleted on previous kernels.) -- With Best Regards, Marat Khalili On 23/05/17 08:38, Marat Khalili wrote: Just some user's point of view: I propose the following changes: 1) We always cleanup level-0 qgroups by default, with no opt-out. I see absolutely no reason to keep these around. It WILL break scripts that try to do this cleanup themselves. OTOH it will simplify writing new ones. Since qgroups are assigned sequential numbers, it must be possible to partially work it around by not returning error on repeated delete. But you cannot completely emulate qgroup presence without actually keeping it, so some scripts will still break. More complete workaround would be delayed cleanup. What about (re-)mount time? (Should also handle qgroups remaining ) We do not allow the creation of level-0 qgroups for (sub)volumes that do not exist. Probably I'm mistaken, but I see no reasons for doing it even now, since I don't think it's possible to reliably assign existing 0-level qgroup to a new subvolume. So this change should break nothing. Why do we allow deleting a level 0 qgroup for a currently existing subvolume? 4) Add a flag to the qgroup_delete_v2 ioctl, NO_SUBVOL_CHECK. If the flag is present, it will allow you to delete qgroups which reference active subvolumes. Some people doing cleanup in the reverse order? Other than this, I don't understand why this feature is needed, so IMO it's unlikely to be needed in a new API. Of course, this is all just one datapoint for you. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: QGroups Semantics
Just some user's point of view: I propose the following changes: 1) We always cleanup level-0 qgroups by default, with no opt-out. I see absolutely no reason to keep these around. It WILL break scripts that try to do this cleanup themselves. OTOH it will simplify writing new ones. Since qgroups are assigned sequential numbers, it must be possible to partially work it around by not returning error on repeated delete. But you cannot completely emulate qgroup presence without actually keeping it, so some scripts will still break. More complete workaround would be delayed cleanup. What about (re-)mount time? (Should also handle qgroups remaining ) We do not allow the creation of level-0 qgroups for (sub)volumes that do not exist. Probably I'm mistaken, but I see no reasons for doing it even now, since I don't think it's possible to reliably assign existing 0-level qgroup to a new subvolume. So this change should break nothing. Why do we allow deleting a level 0 qgroup for a currently existing subvolume? 4) Add a flag to the qgroup_delete_v2 ioctl, NO_SUBVOL_CHECK. If the flag is present, it will allow you to delete qgroups which reference active subvolumes. Some people doing cleanup in the reverse order? Other than this, I don't understand why this feature is needed, so IMO it's unlikely to be needed in a new API. Of course, this is all just one datapoint for you. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Backing up BTRFS metadata
Indeed. This has been tried before, and I don't think it came to anything. What can/did go wrong? I suspect it's still only capturing metadata, rather than data. Yes. But data should still be there, on disk, right? -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Backing up BTRFS metadata
On 11/05/17 18:19, Chris Murphy wrote: btrfs-image Looks great, thank you! -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
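For the archives, usage is roughly as follows (hedged, from my reading of the manual; device names and paths are placeholders):

    # Dump metadata of an unmounted filesystem into a file; -c9 compresses the
    # image, -t4 uses four threads, -s would additionally sanitize file names.
    btrfs-image -c9 -t4 /dev/sdXN /backup/metadata.img
    # Restore onto a scratch device for post-mortem inspection; note that file
    # data is NOT included, only the trees.
    btrfs-image -r /backup/metadata.img /dev/sdYN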
Backing up BTRFS metadata
Sorry if the question sounds unorthodox: is there some simple way to read (and back up) all BTRFS metadata from a volume? The motivation, of course, is the possibility of quickly recovering from catastrophic filesystem failures on a logical level. Some small amount of the actual data that this metadata references may be overwritten between the backup and restore moments, but due to checksumming that can easily be caught (and either individually restored from backup or discarded). -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: runtime btrfsck
Hello, (Warning: I'm a user, not a developer here.) In my experience (on kernel 4.4) it processed larger and slower devices within a day, BUT according to some recent topics the runaway fragmentation (meaning in this case a large number of small extents regardless of their relative location) can significantly slow down BTRFS operations to the point of making them infeasible. Possible reasons for fragmentation are snapshotting volumes too often and/or running VM images from BTRFS without taking some precautions. On top of this, the mount device name makes one suspect there's another layer between BTRFS and the hardware. Are you sure that's not the bottleneck in this case? -- With Best Regards, Marat Khalili On 10/05/17 10:02, Stefan Priebe - Profihost AG wrote: I'm now trying btrfs progs 4.10.2. Is anybody out there who can tell me something about the expected runtime or how to fix bad key ordering? Greets, Stefan On 06.05.2017 at 07:56, Stefan Priebe - Profihost AG wrote: It's still running. Is this the normal behaviour? Is there any other way to fix the bad key ordering? Greets, Stefan On 02.05.2017 at 08:29, Stefan Priebe - Profihost AG wrote: Hello list, i wanted to check an fs cause it has bad key ordering. But btrfscheck has now been running for 7 days. Current output: # btrfsck -p --repair /dev/mapper/crypt_md0 enabling repair mode Checking filesystem on /dev/mapper/crypt_md0 UUID: 37b15dd8-b2e1-4585-98d0-cc0fa2a5a7c9 bad key ordering 39 40 checking extents [O] FS is a 12TB BTRFS Raid 0 over 3 mdadm Raid 5 devices. How long should btrfsck run and is there any way to speed it up? btrfs tools is 4.8.5 Thanks! Greets, Stefan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
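If fragmentation is suspected, a rough way to get a feel for it (a hedged sketch, paths made up) is to count extents of the largest files:

    # Count extents of the biggest files; thousands of extents per gigabyte-sized
    # file hint at runaway fragmentation. Compressed files legitimately show many
    # small extents, so treat this as a hint, not a verdict.
    find /mnt/fs -xdev -type f -size +1G -exec filefrag {} + | sort -t: -k2 -rn | head -n 20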
BTRFS warning (device sda7): block group 181491728384 has wrong amount of free space
Dear all, I cannot understand two messages in syslog, could someone please shed some light? Here they are: Apr 29 08:54:03 container-name kernel: [792742.662375] BTRFS warning (device sda7): block group 181491728384 has wrong amount of free space Apr 29 08:54:03 container-name kernel: [792742.662381] BTRFS warning (device sda7): failed to load free space cache for block group 181491728384, rebuilding it now Especially strange is the fact that the messages appear in the LXC container's syslog, but not in the syslog of the host system. I only saw network and apparmor-related messages in container syslogs before. I didn't run any usermode btrfs tools at the time (especially in the container, since they are not even installed there), but there's a quota set for this subvolume, and it was coming close to being exhausted by a large mysql database. There're no snapshots this time. smartmon finds no problems. marat@host:~$ uname -a Linux host 4.4.0-72-generic #93-Ubuntu SMP Fri Mar 31 14:07:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux marat@host:~$ lsb_release -a No LSB modules are available. Distributor ID: Ubuntu Description: Ubuntu 16.04.2 LTS Release: 16.04 Codename: xenial marat@host:~$ btrfs --version btrfs-progs v4.4 marat@host:~$ sudo btrfs qgroup show -F -pcre /mnt/lxc/container-name/rootfs qgroupid rfer excl max_rfer max_excl parent child 0/802 63.93GiB 63.93GiB 64.00GiB none --- --- marat@host:~$ sudo btrfs filesystem show /dev/sda7 # run after freeing space by clearing database Label: 'data' uuid: 37d3313a-e2ad-4b7f-98fc-a01d815952e0 Total devices 2 FS bytes used 47.73GiB devid1 size 2.71TiB used 114.01GiB path /dev/sda7 devid2 size 2.71TiB used 114.01GiB path /dev/sdb7 marat@host:~$ sudo btrfs filesystem df /mnt/lxc/container-name/rootfs # run after freeing space by clearing database Data, RAID1: total=111.00GiB, used=46.83GiB System, RAID1: total=8.00MiB, used=32.00KiB Metadata, RAID1: total=3.00GiB, used=983.11MiB GlobalReserve, single: total=336.00MiB, used=0.00B marat@host:~$ sudo lxc-attach -n container-name cat /proc/mounts | grep sda7 /dev/sda7 / btrfs rw,relatime,space_cache,subvolid=802,subvol=/lxc/container-name/rootfs 0 0 -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
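In case it keeps recurring: as far as I understand, the warning itself already makes the kernel rebuild the cache for that one block group, but the v1 free space cache (the space_cache option visible in /proc/mounts above) can also be dropped and rebuilt wholesale with a one-off mount option. A hedged example, with a placeholder mount point:

    # clear_cache is only needed for a single mount; later mounts can go back to
    # the usual options and the space cache is regenerated as block groups are used.
    umount /mnt/data
    mount -o clear_cache /dev/sda7 /mnt/data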
Re: Problem with file system
On 25/04/17 03:26, Qu Wenruo wrote: IIRC qgroup for subvolume deletion will cause full subtree rescan which can cause tons of memory. Could it be this bad, 24GB of RAM for a 5.6TB volume? What does it even use this absurd amount of memory for? Is it swappable? Haven't read about RAM limitations for running qgroups before, only about CPU load (which importantly only requires patience, does not crash servers). -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Prevent escaping btrfs quota
Just some food for thought: there's already a tag that correctly assigns filesystem objects to users. It is called owner(ship). Instead of making qgroups repeat ownership logic, why not base qgroup assignments on ownership itself? (At least on per-subvolume basis.) -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Encountered kernel bug#72811. Advice on recovery?
Even making such a warning conditional on kernel version is problematic, because many distros backport major blocks of code, including perhaps btrfs fixes, and the nominally 3.14 or whatever kernel may actually be running btrfs and other fixes from 4.14 or later, by the time they actually drop support for whatever LTS distro version and quit backporting fixes. This information could be stored in kernel and made available for usermode tools via some proc file. This would be very useful _especially_ considering backporting. Raid56 could be fixed already (or not) by the time it is implemented, but no doubt there will still be other highly experimental capabilities judging by how things go. And this feature itself could easily be backported. Some machine-readable readiness level (ok/warning/override flag needed/known but disabled in kernel) plus one-line text message displayed to users in cases 2-4 is all we need. If proc file is missing or doesn't contain information about specific capability, tools could default to current behavior (AFAIR there're already warnings in some cases). Message should tersely cover any known issues, including stability, performance, compatibility and general readiness, and may contain links (to btrfs wiki?) for more information. I expect whole file to easily fit in 512 bytes. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
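Purely as an illustration and nothing more, such a file could look like this; the path, the format and every value below are invented here just to make the proposal concrete, nothing like it exists today:

    $ cat /proc/fs/btrfs/feature-status      # hypothetical path
    raid1       ok
    raid56      warning    parity may be left stale after unclean shutdown, see wiki
    qgroups     warning    high CPU cost with many snapshots
    # one line per capability: level (ok / warning / override-flag-needed /
    # disabled) followed by a terse message for tools to show to the user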
Deduplication tools
After reading this maillist for a while I became a bit more cautious about using various BTRFS features, so decided to ask just in case: is it safe to use out-of-band deduplication tools <https://btrfs.wiki.kernel.org/index.php/Deduplication>, and which of them are considered more stable/mainstream? Also, won't running these tools exacerbate often mentioned stability/performance problems with too-many-snapshots? Any first-hand experience is very welcome. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
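To make the question concrete, the kind of invocation I have in mind (hedged; duperemove, which uses the kernel clone/extent-same ioctl, is the tool I'd try first, and the paths are made up):

    # A first pass without -d only scans and reports duplicate extents:
    duperemove -r /mnt/vol/some-folder
    # Actual deduplication (-d); keeping a hash file makes later re-runs incremental:
    duperemove -dr --hashfile=/var/tmp/dedup.hash /mnt/vol/some-folder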
Re: Do different btrfs volumes compete for CPU?
On 04/04/17 20:36, Peter Grandi wrote: SATA works for external use, eSATA works well, but what really matters is the chipset of the adapter card. eSATA might be sound electrically, but mechanically it is awful. Try to run it for months in a crowded server room, and inevitably you'll get disconnections and data corruption. Tried different cables, brackets -- same result. If you ever used an eSATA connector, you'd feel it. In my experience JMicron is not so good, Marvell a bit better, best is to use a recent motherboard chipset with a SATA-eSATA internal cable and bracket. That's exactly what I used to use: the internal controller of a Z77 chipset + bracket(s). But that does not change the fact that it is a library and work is initiated by user requests which are not per-subvolume, but in effect per-volume. That's the answer I was looking for. It is a way to do so and not a very good way. There is no obviously good way to define "real usage" in the presence of hard-links and reflinking, and qgroups use just one way to define it. A similar problem happens with processes in the presence of shared pages, multiple mapped shared libraries etc. No need to over-generalize. There's an obviously good way to define "real usage" of a subvolume and its snapshots as long as it doesn't share any data with other subvolumes, as is often the case. If it does share, two figures -- exclusive and referenced, like in qgroups -- are sufficient for most tasks. The problem is that both hard-links and ref-linking create really significant ambiguities as to used space. Plus the same problem would happen with directories instead of subvolumes and hard-links instead of reflinked snapshots. You're right, although with hard-links there's at least a remote chance to estimate storage use with usermode scripts. ASMedia USB3 chipsets are fairly reliable at the least the card ones on the system side. The ones on the disk side I don't know much about. This is getting increasingly off-topic, but our mainstay is CFI 5-disk DAS boxes (8253JDGG to be exact) filled with WD Red-s in RAID5 configuration. They are no longer produced and are getting harder and harder to source, but have shown themselves to be very reliable. According to lsusb they contain a JMicron JMS567 SATA 6Gb/s bridge. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: mix ssd and hdd in single volume
On 02/04/17 03:13, Duncan wrote: Meanwhile, since you appear to be designing a mass-market product, it's worth mentioning that btrfs is considered, certainly by its devs and users on this list, to be "still stabilizing, not fully stable and mature." [...] That doesn't sound like a particularly good choice for a mass-market NAS product to me. Of course there's rockstor and others out there already shipping such products, but they're risking their reputation and the safety of their customer's data in the process, altho there's certainly a few customers out there with the time, desire and technical know-how to ensure the recommended backups and following current kernel and list, and that's exactly the sort of people you'll find already here. But that's not sufficiently mass-market to appeal to most vendors. You may want to look here: https://www.synology.com/en-global/dsm/Btrfs . Somebody forgot to tell Synology, which already supports btrfs in all hardware-capable devices. I think Rubicon has been crossed in 'mass-market NAS[es]', for good or not. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Do different btrfs volumes compete for CPU?
On 01/04/17 13:17, Peter Grandi wrote: That "USB-connected is a rather bad idea. On the IRC channel #Btrfs whenever someone reports odd things happening I ask "is that USB?" and usually it is and then we say "good luck!" :-). You're right, but USB/eSATA arrays are dirt cheap in comparison with similar-performance SAN/NAS etc. things, which we unfortunately cannot really afford here. Just a bit of a back-story: I tried to use eSATA and ext4 first, but observed silent data corruption and irrecoverable kernel hangs -- apparently, SATA is not really designed for external use. That's when I switched to both USB and, coincidentally, btrfs, and stability became orders of magnitude better even on a re-purposed consumer-grade PC (Z77 chipset, 3rd gen. i5) with a horribly outdated kernel. Now I'm rebuilding the same configuration on server-grade hardware (C610 chipset, 40 io-channel Xeon) and a modern kernel, and thus would be very surprised to find problems in USB throughput. As written that question is meaningless: despite the current mania for "threads"/"threadlets" a filesystem driver is a library, not a set of processes (all those '[btrfs-*]' threadlets are somewhat misguided ways to do background stuff). But these threadlets, misguided as they are, do exist, don't they? * Qgroups are famously system CPU intensive, even if less so than in earlier releases, especially with subvolumes, so the 16 hours CPU is both absurd and expected. I think that qgroups are still effectively unusable. I understand that qgroups are very much a work in progress, but (correct me if I'm wrong) right now they are the only way to estimate the real usage of a subvolume and its snapshots. For instance, if I have a dozen 1TB subvolumes each having ~50 snapshots and suddenly run out of space on a 24TB volume, how do I find the culprit without qgroups? Keeping an eye on storage use is essential for any real life use of snapshots, and they are too convenient as a backup de-duplication tool to give up. Just a stray thought: btrfs seems to lack an object type in between volume and subvolume, one that would keep track of storage use by several subvolumes+their snapshots, allow snapshotting/transferring multiple subvolumes at once etc. Some kind of super-subvolume (supervolume?) that is hierarchical. With increasing use of subvolumes/snapshots within a single system installation, and multiple system installations (belonging to different users) in one volume due to liberal use of LXC and similar technologies, this will become more and more of a pressing problem. * The scheduler gives excessive priority to kernel threads, so they can crowd out user processes. When for whatever reason the system CPU percentage rises everything else usually suffers. I thought it was clear, but probably needs spelling out: while 1 core was completely occupied with the [btrfs-transacti] thread, 5 more were mostly idle serving occasional network requests without any problems. And only a process that used storage intensively died. Fortunately or not, it's the only data point so far -- smaller snapshot cullings do not cause problems. Only Intel/AMD USB chipsets and a few others are fairly reliable, and for mass storage only with USB3 with UASPI, which is basically SATA-over-USB (more precisely SCSI-command-set over USB). Your system-side card seems to be recent enough to do UASPI, but probably the peripheral-side chipset isn't. Things are so bad with third-party chipsets that even several types of add-on SATA and SAS cards are too buggy. Thank you very much for this hint. 
The card is indeed an unknown factor here and I'll keep a close eye on it. The chip is an ASM1142, sadly not Intel/AMD, but quite popular nevertheless. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
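(For the record, the kind of query I mean when talking about finding the culprit, assuming quotas are enabled; the --sort option appeared in reasonably recent btrfs-progs, so the exact syntax may differ on older tools:)

    # Show per-subvolume usage and sort by exclusive bytes to spot which
    # subvolume or snapshot chain is eating the space:
    btrfs qgroup show -pcre --sort=-excl /mnt/bigvolume | head -n 20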
Re: Do different btrfs volumes compete for CPU?
Thank you very much for the reply and suggestions, more comments below. Still, is there a definite answer to the root question: are different btrfs volumes independent in terms of CPU, or are there some shared workers that can be a point of contention? What would have been interesting would have been if you had any reports from for instance htop during that time, showing wait percentage on the various cores and status (probably D, disk-wait) of the innocent process. iotop output would of course have been even better, but also rather more special-case so less commonly installed. Curiously, I have had iotop but not htop running. [btrfs-transacti] had some low-level activity in iotop (I still assume it was CPU-limited); the innocent process did not have any activity anywhere. Next time I'll also take notice of the process state in ps (sadly, my omission). I believe you will find that the problem isn't btrfs, but rather, I/O contention This possibility did not come to my mind. Can USB drivers still be that bad in 4.4? Is there any way to distinguish between these two situations (btrfs vs usb load)? BTW, the USB adapter used is this one (though the storage array only supports USB 3.0): https://www.asus.com/Motherboard-Accessory/USB_31_TYPEA_CARD/ and that if you try the same thing with one of the filesystems being for instance ext4, you'll see the same problem there as well Not sure if it's possible to reproduce the problem with ext4, since it's not possible to perform such extensive metadata operations there, and simply moving large amounts of data never created any problems for me regardless of filesystem. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Do different btrfs volumes compete for CPU?
Approximately 16 hours ago I ran a script that deleted >~100 snapshots and started a quota rescan on a large USB-connected btrfs volume (5.4 of 22 TB occupied now). The quota rescan only completed just now, with 100% load from [btrfs-transacti] throughout this period, which is probably ~ok depending on your view on things. What worries me is an innocent process using _another_, SATA-connected btrfs volume that hung right after I started my script and took >30 minutes to be sigkilled. There's nothing interesting in the kernel log, and attempts to attach strace to the process output nothing, but I of course suspect that it froze on a disk operation. I wonder: 1) Can there be contention for CPU or some mutexes between kernel btrfs threads belonging to different volumes? 2) If yes, can anything be done about it other than mounting volumes from (different) VMs? $ uname -a; btrfs --version Linux host 4.4.0-66-generic #87-Ubuntu SMP Fri Mar 3 15:29:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux btrfs-progs v4.4 -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Qgroups are not applied when snapshotting a subvol?
If we were going to reserve something, it should be a high number, not a low one. Having 0 reserved makes some sense, but reserving other low numbers seems kind of odd when they aren't already reserved. I did some experiments. Currently assigning higher-level qgroup to lower-level qgroup is not possible. Consequently, assigning anything to 0-level qgroup is not possible. On the other hand, assigning qgroups while skipping levels (e.g. qgroup 2/P to 10/Q) is possible. So setting default snapshot level high is technically possible, but you'll not be able to assign these high-level qgroups anywhere low later. although I hadn't realized that the snapshot command _does not_ have this argument, when it absolutely should. It does here in 4.4, it's just not documented :) I too found it by accident. Perhaps have an option Options always suit everyone except developers who need to implement and support them :) Here I'd like to wrap up since I seriously doubt any real btrfs developers are still reading our discussion :) -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Qgroups are not applied when snapshotting a subvol?
There are a couple of reasons I'm advocating the specific behavior I outlined: Some of your points are valid, but some break current behaviour and expectations or create technical difficulties. 1. It doesn't require any specific qgroup setup. By definition, you can be 100% certain that the destination qgroup exists, and that you won't need to create new qgroups behind the user's back (given your suggestion, what happens when qgroup 1/N doesn't exist?). This is a general problem with current qgroups: you have to reference them by some random numbers, not by user-assigned names like files. It would need to be fixed sooner or later with syntax like L: in place of L/N, or even some special syntax made specifically for path snapshots. BTW, what about reserving level 1 for qgroups describing subvolumes and all their snapshots and forbidding manual management of qgroups at this level? 2. Just because it's the default, doesn't mean that the subvolume can't be reassigned to a different qgroup. This also would not remove the ability to assign a specific qgroup through the snapshot creation command. This is arguably a general point in favor of having any default of course, but it's still worth pointing out. Currently 0/N qgroups are special in that they are created automatically and belong to the bottom of the hierarchy. It would be very nice to keep it this way. Changing qgroup assignments after snapshot creation is very inconvenient because it requires quota rescan and thus blocks all other quota operations. 3. Because BTRFS has COW semantics, the new snapshot should take up near zero space in the qgroup of it's parent. Indeed it works this way in my experiments if you assign snapshot to 1/N qgroup at creation where 0/N also belongs. Moreover, it does not require quota rescan, which is very nice. 4. This correlates with the behavior most people expect based on ZFS and LVM, which is that snapshots are tied to their parent. I'm not against tying it to the parent. I'm against removing snapshot's own qgroup. At a minimum, it should belong to _some_ qgroup. This could also be covered by having a designated 'default' qgroup that all new subvolumes created without a specified qgroup get put in, but I feel that that is somewhat orthogonal to the issue of how snapshots are handled. It belongs to its own 0/N' qgroup, but this is not probably what you mean. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
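A sketch of the arrangement described above (qgroup ids, sizes and paths are made up for the example):

    # Suppose the subvolume of interest is 0/257; group it and its future
    # snapshots under 1/100 and put the limit on the group:
    btrfs qgroup create 1/100 /mnt/vol
    btrfs qgroup assign 0/257 1/100 /mnt/vol      # a one-off rescan may be needed here
    btrfs qgroup limit 50G 1/100 /mnt/vol
    # New snapshots are placed into 1/100 at creation time via -i, so no further
    # rescan is required:
    btrfs subvolume snapshot -i 1/100 /mnt/vol/data /mnt/vol/.snapshots/data-20170328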
Re: Qgroups are not applied when snapshotting a subvol?
The default should be to inherit the qgroup of the parent subvolume. This behaviour is only good for this particular use-case. In general case, qgroups of subvolume and snapshots should exist separately, and both can be included in some higher level qgroup (after all, that's what qgroup hierarchy is for). In my system I found it convenient to include subvolume and its snapshots in qgroup 1/N, where 0/N is qgroup of bare subvolume. I think adopting this behaviour as default would be more sensible. -- With Best Regards, Marat Khalili On 28/03/17 14:24, Austin S. Hemmelgarn wrote: On 2017-03-27 15:32, Chris Murphy wrote: How about if qgroups are enabled, then non-root user is prevented from creating new subvolumes? Or is there a way for a new nested subvolume to be included in its parent's quota, rather than the new subvolume having a whole new quota limit? Tricky problem. The default should be to inherit the qgroup of the parent subvolume. The organization of subvolumes is hierarchical, and sane people expect things to behave as they look. Taking another angle, on ZFS, 'nested' (nested in quotes because ZFS' definition of 'nested' zvols is weird) inherit their parent's quota and reservations (essentially reverse quota), and they're not even inherently nested in the filesystem like subvolumes are, so we're differing from the only other widely used system that implements things in a similar manner. As far as the subvolume thing, there should be an option to disable user creation of subvolumes, and ideally it should be on by default because: 1. Users can't delete subvolumes by default. This means they can create but not destroy a resource by default, which means that a user can pretty easily accidentally cause issues for the system as a whole. 2. Correlating with 1, users being able to delete subvolumes by default is not safe on multiple levels (easy accidental data loss, numerous other issues), and thus user subvolume removal being off by default is significantly safer. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: backing up a file server with many subvolumes
Just some consideration, since I've faced a similar but not exactly the same problem: use rsync, but create the snapshots on the target machine. Blind rsync will destroy the deduplication of your snapshots and take a huge amount of storage, so it's not a solution. But you can rsync --inplace your snapshots in chronological order to some folder and re-take snapshots of that folder, thus recreating your snapshot structure on the target. Obviously, it can/should be automated. -- With Best Regards, Marat Khalili On 26/03/17 06:00, J. Hart wrote: I have a Btrfs filesystem on a backup server. This filesystem has a directory to hold backups for filesystems from remote machines. In this directory is a subdirectory for each machine. Under each machine subdirectory is one directory for each filesystem (ex /boot, /home, etc) on that machine. In each filesystem subdirectory are incremental snapshot subvolumes for that filesystem. The scheme is something like this: /backup/// I'd like to try to back up (duplicate) the file server filesystem containing these snapshot subvolumes for each remote machine. The problem is that I don't think I can use send/receive to do this. "Btrfs send" requires "read-only" snapshots, and snapshots are not recursive as yet. I think there are too many subvolumes which change too often to make doing this without recursion practical. Any thoughts would be most appreciated. J. Hart -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
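An untested sketch of the whole loop (host name and paths are made up, and it assumes snapshot names contain no spaces and sort chronologically):

    #!/bin/bash
    # "current" is an ordinary writable subvolume on the backup server; every
    # source snapshot is rsynced into it in order and then frozen as a read-only
    # snapshot, so unchanged blocks stay shared between the re-taken snapshots.
    SRC=remote-host:/snapshots
    DST=/backup/remote-host/home
    for name in $(ssh remote-host ls /snapshots); do
        rsync -a --inplace --delete "$SRC/$name/" "$DST/current/"
        btrfs subvolume snapshot -r "$DST/current" "$DST/$name"
    done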
Re: partial quota rescan
Suddenly discovered undocumented (in man or anywhere) -i parameter of 'btrfs subvolume snapshot' that adds snapshot to qgroup without invalidating statistics. Amazing! -- With Best Regards, Marat Khalili On 08/02/17 18:46, Marat Khalili wrote: I'm using trying to use qgroups to keep track of storage occupied by snapshots. I noticed that: a) no two rescans can run in parallel, and there's no way to schedule another rescan while one is running; b) seems like it's a whole-disk operation regardless of path specified in CLI. I only just started to fill my new 24Tb btrfs volume using qgroups, but rescans already take a long time, and due to (a) above I each time have to wait for previous rescan to finish in my scripts. Can anything be done about it, like trashing and recomputing only statistics for specific qgroup? Linux host 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux btrfs-progs v4.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
partial quota rescan
I'm trying to use qgroups to keep track of the storage occupied by snapshots. I noticed that: a) no two rescans can run in parallel, and there's no way to schedule another rescan while one is running; b) it seems to be a whole-disk operation regardless of the path specified in the CLI. I have only just started to fill my new 24TB btrfs volume using qgroups, but rescans already take a long time, and due to (a) above I each time have to wait for the previous rescan to finish in my scripts. Can anything be done about it, like trashing and recomputing statistics only for a specific qgroup? Linux host 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux btrfs-progs v4.4 -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
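(For reference, the kind of waiting I mean; the volume path is a placeholder and the flags are as I understand current btrfs-progs:)

    # See whether a rescan is currently in progress:
    btrfs quota rescan -s /mnt/bigvolume
    # Start a rescan, or join the one already running, and block until it finishes:
    btrfs quota rescan -w /mnt/bigvolume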