Re: btrfs-cleaner / snapshot performance analysis
> I am trying to better understand how the cleaner kthread
> (btrfs-cleaner) impacts foreground performance, specifically
> during snapshot deletion. My experience so far has been that
> it can be dramatically disruptive to foreground I/O.

That's such a warmly innocent and optimistic question! This post
gives the answer, and to an even more general question:

http://www.sabi.co.uk/blog/17-one.html?170610#170610

> the response tends to be "use less snapshots," or "disable
> quotas," both of which strike me as intellectually
> unsatisfying answers, especially the former in a filesystem
> where snapshots are supposed to be "first-class citizens."

They are "first class" but not "cost-free". In particular every
extent is linked in a forward map and a reverse map, and deleting
a snapshot involves materializing and updating a join of the two,
which seems to be done with a classic nested-loop join strategy,
resulting in N^2 running time. I suspect that quota processing
suffers from a similar problem.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
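A toy illustration (plain Python, not actual btrfs code) of the cost
difference being claimed: joining two maps of N extents with a nested
loop costs on the order of N^2 comparisons, while sorting and merging
the same inputs costs on the order of N log N.

```python
# Hypothetical sketch: join a "forward map" with a "reverse map" of
# extent ids two ways, counting comparisons to show the N^2 blow-up.

def nested_loop_join(forward, reverse):
    """O(len(forward) * len(reverse)) comparisons."""
    comparisons = 0
    matches = []
    for f in forward:
        for r in reverse:
            comparisons += 1
            if f == r:
                matches.append(f)
    return matches, comparisons

def merge_join(forward, reverse):
    """Sort both sides, then a single linear matching pass."""
    comparisons = 0
    matches = []
    fs, rs = sorted(forward), sorted(reverse)
    i = j = 0
    while i < len(fs) and j < len(rs):
        comparisons += 1
        if fs[i] == rs[j]:
            matches.append(fs[i]); i += 1; j += 1
        elif fs[i] < rs[j]:
            i += 1
        else:
            j += 1
    return matches, comparisons

extents = list(range(1000))
m1, c1 = nested_loop_join(extents, extents)
m2, c2 = merge_join(extents, extents)
assert m1 == m2
print(c1, c2)   # the nested loop does vastly more work for the same result
```

The point being that the same join result can be computed either way;
the complaint above is about which strategy appears to be in use.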
Re: Btrfs reserve metadata problem
> When testing Btrfs with fio 4k random write,

That's an exceptionally narrowly defined workload. Also it is
narrower than that, because it must be without 'fsync' after each
write, or else there would be no accumulation of dirty blocks in
memory at all.

> I found that volume with smaller free space available has
> lower performance.

That's an inappropriate use of "performance"... The speed may be
lower; the performance is another matter.

> It seems that the smaller the free space of volume is, the
> smaller amount of dirty page filesystem could have.

Is this a problem? Consider: all filesystems do less well when
there is less free space (smaller chance of finding spatially
compact allocations), and it is usually good to minimize the
amount of dirty pages anyhow (even if there are reasons to delay
writing them out).

> [ ... ] btrfs will reserve metadata for every write. The
> amount to reserve is calculated as follows: nodesize *
> BTRFS_MAX_LEVEL(8) * 2, i.e., it reserves 256KB of metadata.
> The maximum amount of metadata reservation depends on size of
> metadata currently in use and free space within volume (free
> chunk size /16). When metadata reaches the limit, btrfs will
> need to flush the data to release the reservation.

I don't understand here: under POSIX semantics filesystems are
not really allowed to avoid flushing *metadata* to disk for most
operations, that is, metadata operations have an implied 'fsync'.
In your case of the "4k random write" with "cow disabled" the
only metadata that should get updated is the last-modified
timestamp, unless the user/application has been so amazingly
stupid as to not preallocate the file, and then they deserve
whatever they get.

> 1. Is there any logic behind the value (free chunk size /16)?

> /*
>  * If we have dup, raid1 or raid10 then only half of the free
>  * space is actually useable.
>  * For raid56, the space info used
>  * doesn't include the parity drive, so we don't have to
>  * change the math
>  */
> if (profile & (BTRFS_BLOCK_GROUP_DUP |
>                BTRFS_BLOCK_GROUP_RAID1 |
>                BTRFS_BLOCK_GROUP_RAID10))
>         avail >>= 1;

As written there is a plausible logic, but it is quite crude.

> /*
>  * If we aren't flushing all things, let us overcommit up to
>  * 1/2th of the space. If we can flush, don't let us overcommit
>  * too much, let it overcommit up to 1/8 of the space.
>  */
> if (flush == BTRFS_RESERVE_FLUSH_ALL)
>         avail >>= 3;
> else
>         avail >>= 1;

Presumably overcommitting brings some benefits on other
workloads. In particular other parts of Btrfs don't behave
awesomely well when free space runs out.

> 2. Is there any way to improve this problem?

Again, is it a problem? More interestingly, if it is a problem,
is a solution available that does not impact other workloads? It
is simply impossible to optimize a filesystem perfectly for every
workload.

I'll try to summarize your report as I understand it:

* If:
  - The workload is "4k random write" (without 'fsync').
  - On a "cow disabled" file.
  - The file is not preallocated.
  - There is not much free space available.

* Then allocation overcommitting results in a higher frequency of
  unrequested metadata flushes, and those metadata flushes slow
  down a specific benchmark.
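The figures quoted above can be checked with a little arithmetic. This
is only a sketch mirroring the quoted snippet (the shift amounts and
the per-write formula), not a reimplementation of the btrfs logic:

```python
# Hedged arithmetic sketch of the quoted reservation figures.

BTRFS_MAX_LEVEL = 8

def metadata_reservation(nodesize):
    # nodesize * BTRFS_MAX_LEVEL * 2, per the quoted formula
    return nodesize * BTRFS_MAX_LEVEL * 2

def overcommit_limit(avail, profile_halves_space, flush_all):
    # dup/raid1/raid10 only expose half the raw free space
    if profile_halves_space:
        avail >>= 1
    # mirror the quoted shift amounts: 1/8 when flushing all, 1/2 otherwise
    avail >>= 3 if flush_all else 1
    return avail

# 16 KiB nodes give the 256 KiB per-write reservation mentioned above
assert metadata_reservation(16 * 1024) == 256 * 1024
print(overcommit_limit(1 << 30, True, True))   # 1 GiB raw -> 64 MiB limit
```

Which makes the report plausible: with little free space, the computed
overcommit limit shrinks quickly and flushes trigger more often.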
Re: Unexpected raid1 behaviour
[ ... ]

> The advantage of writing single chunks when degraded, is in
> the case where a missing device returns (is readded,
> intact). Catching up that device with the first drive, is a
> manual but simple invocation of 'btrfs balance start
> -dconvert=raid1,soft -mconvert=raid1,soft' The alternative is
> a full balance or full scrub. It's pretty tedious for big
> arrays.

That is merely an after-the-fact rationalization for a design
that is at the same time entirely logical and quite broken: that
the intended replication factor is the same as the current number
of members of the volume, so if a volume has (currently) only one
member, then only "single" chunks get created.

A design that would work better for operations would be to have
"profiles" be a concept entirely independent of the number of
members, or perhaps more precisely to have the "desired" profile
of a chunk be distinct from the "actual" profile (dependent on
the actual number of members of a volume) of that chunk, so that
if a volume has only one member, chunks could be created that
have "desired" profile 'raid1' but "actual" profile 'single', or
perhaps more sensibly 'raid1-with-missing-mirror', with checks
that the "actual" profile be usable, else the volume is not
mountable.

Note: ideally every chunk would have both a static desired
profile and a desired stripe width, and a computed actual profile
and an actual stripe width. Or perhaps the desired profile and
width would be properties of the volume (for each of the three
types of data).

For example in MD RAID it is perfectly legitimate to create a
RAID6 set with "desired" width of 6 and "actual" width of 4 (in
which case it can be activated as degraded) or a RAID5 set with
"desired" width of 5 and actual width of 3 (in which case it
cannot be activated at all until at least another member is
added).
The difference with MD RAID is that in MD RAID there is (except
in one case, during conversion) an exact match between "desired"
profile stripe width and number of members, while at least in
principle a Btrfs volume can have any number of chunks of any
profile of any desired stripe size (except that the current
implementation is not so flexible in most profiles). That would
require scanning all chunks to determine whether a volume is
mountable at all or mountable only as degraded, while MD RAID can
just count the members. Apparently recent versions of the Btrfs
'raid1' profile do just that.
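The "desired vs actual" idea above can be made concrete with a tiny
model. This is purely illustrative (the names `Chunk`, `desired_copies`
and `volume_state` are invented here, not btrfs structures); it shows
the kind of per-chunk scan that would decide mountability:

```python
# Illustrative model of chunks that record both what replication they
# want ("desired") and what they currently have ("actual").

from dataclasses import dataclass

@dataclass
class Chunk:
    desired_copies: int   # e.g. 2 for a 'raid1'-style profile
    actual_copies: int    # copies currently present on live members

def volume_state(chunks):
    if any(c.actual_copies == 0 for c in chunks):
        return "unmountable"      # some data has no live copy at all
    if any(c.actual_copies < c.desired_copies for c in chunks):
        return "degraded"         # readable, but redundancy is missing
    return "healthy"

# a raid1-style volume that lost one mirror of one chunk
vol = [Chunk(2, 2), Chunk(2, 1)]
print(volume_state(vol))
```

As the text notes, the price of this flexibility is that state must be
computed by scanning chunks rather than by counting members.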
Re: Unexpected raid1 behaviour
>> The fact is, the only cases where this is really an issue is
>> if you've either got intermittently bad hardware, or are
>> dealing with external

> Well, the RAID1+ is all about the failing hardware.

>> storage devices. For the majority of people who are using
>> multi-device setups, the common case is internally connected
>> fixed storage devices with properly working hardware, and for
>> that use case, it works perfectly fine.

> If you're talking about "RAID"-0 or storage pools (volume
> management) that is true. But if you imply, that RAID1+ "works
> perfectly fine as long as hardware works fine" this is
> fundamentally wrong.

I really agree with this; the argument about "properly working
hardware" is utterly ridiculous. I'll add to this: apparently I
am not the first one to discover the "anomalies" in the "RAID"
profiles, but I may have been the first to document some of them,
e.g. the famous issues with the 'raid1' profile.

How did I discover them? Well, I had used Btrfs in single device
mode for a bit, and wanted to try multi-device, and the docs
seemed "strange", so I did tests before trying it out. The tests
were simply done on a spare PC with a bunch of old disks: create
two block devices (partitions), put them in 'raid1' first
natively, then by adding a new member to an existing partition,
and then 'remove' one, or simply unplug it (actually
'echo 1 > /sys/block/.../device/delete') initially. I wanted to
check exactly what happened: resync times, speed, behaviour and
speed when degraded, just ordinary operational tasks. Well, I
found significant problems after less than one hour.

I can't imagine anyone with some experience of hw or sw RAID
(especially hw RAID, as hw RAID firmware is often fantastically
buggy, especially as to RAID operations) who wouldn't have done
the same tests before operational use, and would not have found
the same issues straight away.
The only guess I could draw is that whoever designed the "RAID"
profiles had zero operational system administration experience.

> If the hardware needs to work properly for the RAID to work
> properly, no one would need this RAID in the first place.

It is not just that, but some maintenance operations are needed
even if the hardware works properly: for example preventive
maintenance, replacing drives that are becoming too old,
expanding capacity, testing hardware bits periodically. Systems
engineers don't just say "it works, let's assume it continues to
work properly, why worry".

My impression is that multi-device and "chunks" were designed in
one way by someone, and someone else did not understand the
intent, confused them with "RAID", and based the 'raid' profiles
on that confusion. For example the 'raid10' profile seems the
least confused to me, and that's I think because the "RAID"
aspect is kept more distinct from the "multi-device" aspect. But
perhaps I am an optimist...

To simplify a longer discussion: to have "RAID" one needs an
explicit design concept of "stripe", which in Btrfs needs to be
quite different from that of "set of member devices" and
"chunks", so that for example adding/removing to a "stripe" is
not quite the same thing as adding/removing members to a volume,
plus a distinction between online and offline members, not just
added and removed ones, and well-defined state machine
transitions (e.g. in response to hardware problems) among all
those, like in MD RAID. But the importance of such distinctions
may not be apparent to everybody.

But I may have read comments in which "block device" (a data
container on some medium), "block device inode" (a descriptor for
that) and "block device name" (a path to a "block device inode")
were hopelessly confused, so I don't hold a lot of hope.
:-(
Re: Unexpected raid1 behaviour
>> I haven't seen that, but I doubt that it is the radical
>> redesign of the multi-device layer of Btrfs that is needed to
>> give it operational semantics similar to those of MD RAID,
>> and that I have vaguely described previously.

> I agree that btrfs volume manager is incomplete in view of
> data center RAS requisites, there are couple of critical
> bugs and inconsistent design between raid profiles, but I
> doubt if it needs a radical redesign.

Well, it needs a radical redesign because the original design was
based on an entirely consistent and logical concept that was
quite different from that required for sensible operations, and
then special-case code was added (and keeps being added) to fix
the consequences.

But I suspect that it does not need a radical *recoding*, because
most if not all of the needed code is already there. All that
needs changing most likely is the member state-machine; that's
the bit that needs a radical redesign, and it is a relatively
small part of the whole. The closer the member state-machine
design is to the MD RAID one the better, as it is a very
workable, proven model.

Sometimes I suspect that the design needs to be changed to also
add a formal notion of "stripe" to the Btrfs internals, where a
"stripe" is a collection of chunks that are "related" (and
something like that is already part of the 'raid10' profile), but
I think that need not be user-visible.
Re: Unexpected raid1 behaviour
"Duncan"'s reply is slightly optimistic in parts, so some further
information... [ ... ]

> Basically, at this point btrfs doesn't have "dynamic" device
> handling. That is, if a device disappears, it doesn't know
> it.

That's just the consequence of what is a completely broken
conceptual model: the current way most multi-device profiles are
designed is that block-devices can only be "added" or "removed",
and cannot be "broken"/"missing". Therefore if IO fails, that is
just one IO failing, not the entire block-device going away. The
time when a block-device is noticed as sort-of missing is when it
is not available for "add"-ing at start.

Put another way, the multi-device design is/was based on the
demented idea that block-devices that are missing are/should be
"remove"d, so that a 2-device volume with a 'raid1' profile
becomes a 1-device volume with a 'single'/'dup' profile, and not
a 2-device volume with a missing block-device and an incomplete
'raid1' profile, even if things have been awkwardly moving in
that direction in recent years.

Note: the above is not totally accurate today because various
hacks have been introduced to work around the various issues.

> Thus, if a device disappears, to get it back you really have
> to reboot, or at least unload/reload the btrfs kernel module,
> in ordered to clear the stale device state and have btrfs
> rescan and reassociate devices with the matching filesystems.

IIRC that is not quite accurate: a "missing" device can nowadays
be "replace"d (by "devid") or "remove"d, the latter possibly
implying profile changes:

https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Using_add_and_delete

Terrible tricks like this also work:

https://www.spinics.net/lists/linux-btrfs/msg48394.html

> Meanwhile, as mentioned above, there's active work on proper
> dynamic btrfs device tracking and management.
> It may or may
> not be ready for 4.16, but once it goes in, btrfs should
> properly detect a device going away and react accordingly,

I haven't seen that, but I doubt that it is the radical redesign
of the multi-device layer of Btrfs that is needed to give it
operational semantics similar to those of MD RAID, and that I
have vaguely described previously.

> and it should detect a device coming back as a different
> device too.

That is disagreeable because of poor terminology: I guess that
what was intended is that it should be able to detect a previous
member block-device becoming available again as a different
device inode, which currently is very dangerous in some vital
situations.

> Longer term, there's further patches that will provide a
> hot-spare functionality, automatically bringing in a device
> pre-configured as a hot- spare if a device disappears, but
> that of course requires that btrfs properly recognize devices
> disappearing and coming back first, so one thing at a time.

That would be trivial if the complete redesign of block-device
states of the Btrfs multi-device layer happened, for example
adding an "active" flag and an "accessible" flag to describe new
member states.

My guess is that while logically consistent, the current
multi-device logic is fundamentally broken from an operational
point of view, and needs a complete replacement instead of fixes.
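The two-flag member state suggested above can be sketched in a few
lines. This is a hypothetical model (the names `Member`, `active`,
`accessible` and `degraded` are invented for illustration), showing how
"device temporarily missing" becomes distinct from "device removed":

```python
# Hypothetical member state machine: "active" (a configured member of
# the volume) is tracked separately from "accessible" (IO currently
# possible), roughly as in MD RAID.

class Member:
    def __init__(self, name):
        self.name = name
        self.active = True        # still a configured member
        self.accessible = True    # device currently answers IO

    def io_error(self):
        # a failed IO marks the device missing; it is NOT removed
        self.accessible = False

    def reappears(self):
        # the same member coming back can be resynced, not re-added
        self.accessible = True

    def remove(self):
        # only an explicit administrative action drops membership
        self.active = False

def degraded(members):
    # degraded: some configured member is currently unreachable
    return any(m.active and not m.accessible for m in members)

a, b = Member("sda"), Member("sdb")
b.io_error()
print(degraded([a, b]))   # still a 2-member volume, one missing
```

The point is that an IO failure changes `accessible`, never `active`,
so the volume stays "a raid1 with a missing mirror" rather than
silently becoming a smaller volume.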
Re: [PATCH 0/7] retry write on error
> [ ... ] btrfs incorporates disk management which is actually a
> version of md layer, [ ... ]

As far as I know Btrfs has no disk management, and was wisely
designed without any, just like MD: Btrfs volumes and MD sets can
be composed from "block devices", not disks, and block devices
are quite high level abstractions, as they closely mimic the
semantics of a UNIX file, not a physical device.
Re: [PATCH 0/7] retry write on error
>>> If the underlying protocol doesn't support retry and there
>>> are some transient errors happening somewhere in our IO
>>> stack, we'd like to give an extra chance for IO.

>> A limited number of retries may make sense, though I saw some
>> long stalls after retries on bad disks.

Indeed! One of the major issues in actual storage administration
is to find ways to reliably disable most retries, or to shorten
them, both at the block device level and the device level,
because in almost all cases where storage reliability matters
what is important is simply swapping out the failing device
immediately and then examining and possibly refreshing it
offline. To the point that many device manufacturers deliberately
cripple retry shortening or disabling options in cheaper products
to force long stalls, so that people who care about reliability
more than price will buy the more expensive version that can
disable or shorten retries.

> Seems preferable to avoid issuing retries when the underlying
> transport layer(s) has already done so, but I am not sure
> there is a way to know that at the fs level.

Indeed, and to use an euphemism, a third layer of retries at the
filesystem level is currently a thoroughly imbecilic idea :-), as
whether retries are worth doing is not a filesystem dependent
issue (but then plugging is done at the block io level when it is
entirely device dependent whether it is worth doing, so there is
famous precedent).

There are excellent reasons why error recovery has in general not
been done at the filesystem level for around 20 years, which do
not need repeating every time. However one of them is that where
it makes sense device firmware does retries, and the block device
layer does retries too, which is often a bad idea; and where it
is not, the block io level should do that, not the filesystem.
A large part of the above discussion would not be needed if Linux
kernel "developers" exposed a clear notion of hardware device and
block device state machines and related semantics, or even knew
that it was desirable, but that's an idea that is only 50 years
old, so it may not have yet reached popularity :-).
Re: Fixed subject: updatedb does not index separately mounted btrfs subvolumes
>> The issue is that updatedb by default will not index bind
>> mounts, but by default Fedora and probably other distros
>> put /home on a subvolume and then mount that subvolume, which
>> is in effect a bind mount.

> So the issue isn't /home being btrfs (as you said in the
> subject), but rather, it's /home being an explicitly mounted
> subvolume, since btrfs uses bind-mounts internally for
> subvolume mounts.

That to me seems like rather improper terminology and notes, and
I would consider these to be more appropriate:

* There are entities known as "root directories", and their main
  property is that all inodes reachable from one in the same
  filesystem have the same "device id".

* Each "filesystem" has at least one, and a Btrfs "volume" has
  one for every "subvolume", including the "top subvolume".

* A "root directory" can be "mounted" on a "mount point"
  directory of another "filesystem", which allows navigating from
  one filesystem to another.

* A "mounted" root directory can be identified by the device id
  of '.' being different from that of '..'.

* In Linux a "root directory" can be "mounted" onto several
  "mount point" directories at the same time.

* In Linux a "bind" operation is not a "mount" operation; it is
  in effect a kind of temporary "hard link", one that makes a
  directory aliased to a "bind point" directory.
Looking at this:

tree# tail -3 /proc/mounts
/dev/mapper/sda7 /fs/sda7 btrfs rw,nodiratime,relatime,nossd,nospace_cache,user_subvol_rm_allowed,subvolid=5,subvol=/ 0 0
/dev/mapper/sda7 /fs/sda7/bind btrfs rw,nodiratime,relatime,nossd,nospace_cache,user_subvol_rm_allowed,subvolid=431,subvol=/= 0 0
/dev/mapper/sda7 /fs/sda7/bind-tmp btrfs rw,nodiratime,relatime,nossd,nospace_cache,user_subvol_rm_allowed,subvolid=431,subvol=/=/tmp 0 0

tree# stat --format "%3D %6i %N" {,/fs,/fs/sda7}/{.,..} /fs/sda7/{=,=/subvol,=/subvol/dir,=/tmp,bind,bind-tmp}/{.,..}
806      2 ‘/.’
806      2 ‘/..’
 23  36176 ‘/fs/.’
806      2 ‘/fs/..’
 26    256 ‘/fs/sda7/.’
 23  36176 ‘/fs/sda7/..’
 27    256 ‘/fs/sda7/=/.’
 26    256 ‘/fs/sda7/=/..’
 2b    256 ‘/fs/sda7/=/subvol/.’
 27    256 ‘/fs/sda7/=/subvol/..’
 2b    258 ‘/fs/sda7/=/subvol/dir/.’
 2b    256 ‘/fs/sda7/=/subvol/dir/..’
 27 344618 ‘/fs/sda7/=/tmp/.’
 27    256 ‘/fs/sda7/=/tmp/..’
 27    256 ‘/fs/sda7/bind/.’
 26    256 ‘/fs/sda7/bind/..’
 27 344618 ‘/fs/sda7/bind-tmp/.’
 26    256 ‘/fs/sda7/bind-tmp/..’

It shows that subvolume root directories are "mount points" and
not "bind points" (note that ‘/fs/sda7/=/subvol’ is not
explicitly mounted, yet its '.' and '..' have different device
ids), and that "bind points" appear as if they were ordinary
directories (an unwise decision I suspect).

Many tools for UNIX-like systems don't cross "mount point"
directories (or follow symbolic links), by default or with an
option, but will cross "bind point" directories as they look like
ordinary directories. For 'mlocate' the "bind point" directories
are a special case, handled by looking up every directory
examined in the list of "bind point" directories, as per line 381
here:

https://pagure.io/mlocate/blob/master/f/src/bind-mount.c#_381
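The device-id test described in the list above is easy to demonstrate
in code. A minimal sketch (the helper name `is_mount_point` is mine; it
applies the same '.'-versus-'..' comparison, with the extra case that
at the true root '.' and '..' are the same inode):

```python
# A directory is a mount point (including an explicitly mounted btrfs
# subvolume) when '.' and '..' report different st_dev values.

import os

def is_mount_point(path):
    st = os.stat(path)
    parent = os.stat(os.path.join(path, ".."))
    # different device ids, or '.' and '..' are the same inode
    # (which happens at the root directory itself)
    return st.st_dev != parent.st_dev or st.st_ino == parent.st_ino

print(is_mount_point("/"))      # True
print(is_mount_point("/tmp"))   # True only if /tmp is itself a mount
```

Note that a "bind point" of a directory on the same filesystem fails
this test, which is exactly why tools like updatedb treat it as an
ordinary directory.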
Re: defragmenting best practice?
> Another one is to find the most fragmented files first, or all
> files of at least 1M with at least say 100 fragments, as in:

> find "$HOME" -xdev -type f -size +1M -print0 | xargs -0 filefrag \
> | perl -n -e 'print "$1\0" if (m/(.*): ([0-9]+) extents/ && $1 > 100)' \
> | xargs -0 btrfs fi defrag

That should have "&& $2 > 100".
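The corrected filter can also be expressed in Python for clarity. This
is my own equivalent of the fixed perl one-liner (the function name
`fragmented_files` is invented): it parses 'filefrag' report lines and
keeps files whose extent count ($2, not the file name $1) exceeds the
threshold.

```python
# Keep files whose filefrag report shows more than `threshold` extents.

import re

def fragmented_files(filefrag_lines, threshold=100):
    out = []
    for line in filefrag_lines:
        # filefrag prints e.g. "/path/file: 3879 extents found"
        m = re.match(r"(.*): (\d+) extents? found", line)
        if m and int(m.group(2)) > threshold:
            out.append(m.group(1))
    return out

report = [
    "/home/u/places.sqlite: 3879 extents found",
    "/home/u/kinto.sqlite: 2 extents found",
]
print(fragmented_files(report))   # ['/home/u/places.sqlite']
```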
Re: Need help with incremental backup strategy (snapshots, defragmentingt & performance)
[ ... ]

> The poor performance has existed from the beginning of using
> BTRFS + KDE + Firefox (almost 2 years ago), at a point when
> very few snapshots had yet been created. A comparison system
> running similar hardware as well as KDE + Firefox (and LVM +
> EXT4) did not have the performance problems. The difference
> has been consistent and significant.

That seems rather unlikely to depend on Btrfs, as I use Firefox
56 + KDE4 + Btrfs without issue on a somewhat old/small desktop
and laptop, and it is implausible on general grounds. You haven't
provided so far any indication or quantification of your "speed"
problem (which may or may not be a "performance" issue). The
things to look at are usually disk IO latency and rates, and
system CPU time, while the bad speed is observable (user CPU time
is usually stuck at 100% on any JS based site, as written
earlier).

To look at IO latency and rates the #1 choice is always:

  iostat -dk -zyx 1

and to look at system CPU (and user CPU) and other interesting
details I suggest using 'htop' with the attached configuration
file written to "$HOME/.config/htop/htoprc".

> Sometimes I have used Snapper settings like this:

> TIMELINE_MIN_AGE="1800"
> TIMELINE_LIMIT_HOURLY="36"
> TIMELINE_LIMIT_DAILY="30"
> TIMELINE_LIMIT_MONTHLY="12"
> TIMELINE_LIMIT_YEARLY="10"

> However, I also have some computers set like this:

> TIMELINE_MIN_AGE="1800"
> TIMELINE_LIMIT_HOURLY="10"
> TIMELINE_LIMIT_DAILY="10"
> TIMELINE_LIMIT_WEEKLY="0"
> TIMELINE_LIMIT_MONTHLY="0"
> TIMELINE_LIMIT_YEARLY="0"

The first seems a bit "aspirational". IIRC "someone" confessed
that the SUSE default of 'TIMELINE_LIMIT_YEARLY="10"' was imposed
by external forces in the SUSE default configuration:

https://github.com/openSUSE/snapper/blob/master/data/default-config
https://wiki.archlinux.org/index.php/Snapper#Set_snapshot_limits
https://lists.opensuse.org/yast-devel/2014-05/msg00036.html

# Beware! This file is rewritten by htop when settings are changed in the interface.
# The parser is also very primitive, and not human-friendly.
fields=0 48 38 39 40 44 62 63 2 46 13 14 1
sort_key=47
sort_direction=1
hide_threads=1
hide_kernel_threads=1
hide_userland_threads=1
shadow_other_users=0
show_thread_names=1
highlight_base_name=1
highlight_megabytes=1
highlight_threads=1
tree_view=0
header_margin=0
detailed_cpu_time=1
cpu_count_from_zero=1
update_process_names=0
color_scheme=0
delay=15
left_meters=AllCPUs Memory Swap
left_meter_modes=1 1 1
right_meters=Tasks LoadAverage Uptime
right_meter_modes=2 2 2
Re: defragmenting best practice?
> When defragmenting individual files on a BTRFS filesystem with
> COW, I assume reflinks between that file and all snapshots are
> broken. So if there are 30 snapshots on that volume, that one
> file will suddenly take up 30 times more space... [ ... ]

Defragmentation works by effectively making a copy of the file
contents (simplistic view), so the end result is one copy with 29
reflinked contents, and one copy with defragmented contents.

> Can you also give an example of using find, as you suggested
> above? [ ... ]

Well, one way is to use 'find' as a filtering replacement for
'defrag' option '-r', as in for example:

  find "$HOME" -xdev '(' -name '*.sqlite' -o -name '*.mk4' ')' \
    -type f -print0 | xargs -0 btrfs fi defrag

Another one is to find the most fragmented files first, or all
files of at least 1M with at least say 100 fragments, as in:

  find "$HOME" -xdev -type f -size +1M -print0 | xargs -0 filefrag \
  | perl -n -e 'print "$1\0" if (m/(.*): ([0-9]+) extents/ && $1 > 100)' \
  | xargs -0 btrfs fi defrag

But there are many 'find' web pages and that is not quite a Btrfs
related topic.

> [ ... ] The easiest way I know to exclude cache from
> BTRFS snapshots is to put it on a separate subvolume. I assumed
> this would make several things related to snapshots more
> efficient too.

Only slightly.

> Background: I'm not sure why our Firefox performance is so
> terrible

As I always say, "performance" is not the same as "speed", and
probably your Firefox "performance" is sort of OKish even if the
"speed" is terrible, and neither is likely related to the profile
or the cache being on Btrfs: most JavaScript based sites are
awfully horrible regardless of browser:

http://www.sabi.co.uk/blog/13-two.html?130817#130817

and if Firefox makes a special contribution it tends to leak
memory in several odd but common cases:

https://utcc.utoronto.ca/~cks/space/blog/web/FirefoxResignedToLeaks?showcomments

Plus it tends to cache too much, e.g. recently closed tabs.
But Firefox is not special, because most web browsers are not
designed to run for a long time without a restart, and
Chromium/Chrome simply have a different set of problem sites.
Maybe the new "Quantum" Firefox 57 will improve matters because
it has a far more restrictive plugin API. The overall problem is
insoluble; hipster UX designers will be the second against the
wall when the revolution comes :-).
Re: defragmenting best practice?
> I'm following up on all the suggestions regarding Firefox
> performance on BTRFS. [ ... ]

I haven't read that yet, so maybe I am missing something, but I
use Firefox with Btrfs all the time and I haven't got issues.

[ ... ]

> 1. BTRFS snapshots have proven to be too useful (and too
>    important to our overall IT approach) to forego. [ ... ]
> 3. We have large amounts of storage space (and can add more),
>    but not enough to break all reflinks on all snapshots.

Firefox profiles get fragmented only in the databases contained
in them, and they are tiny, as in dozens of MB. That's usually
irrelevant. Also nothing forces you to defragment a whole
filesystem; you can just defragment individual files or
directories by using 'find' with it.

My top "$HOME" fragmented files are the aKregator RSS feed
databases, usually a few hundred fragments each, and the
'.sqlite' files for Firefox. Occasionally, like just now, I do
this:

tree$ sudo filefrag .firefox/default/*.sqlite | sort -t: -k 2n | tail -4
.firefox/default/cleanup.sqlite: 43 extents found
.firefox/default/content-prefs.sqlite: 67 extents found
.firefox/default/formhistory.sqlite: 87 extents found
.firefox/default/places.sqlite: 3879 extents found

tree$ sudo btrfs fi defrag .firefox/default/*.sqlite

tree$ sudo filefrag .firefox/default/*.sqlite | sort -t: -k 2n | tail -4
.firefox/default/webappsstore.sqlite: 1 extent found
.firefox/default/favicons.sqlite: 2 extents found
.firefox/default/kinto.sqlite: 2 extents found
.firefox/default/places.sqlite: 44 extents found

> 2. Put $HOME/.cache on a separate BTRFS subvolume that is
>    mounted nocow -- it will NOT be snapshotted

The cache can be simply deleted, and usually files in it are not
updated in place, so they don't get fragmented; no worry. Also,
you can declare the '.firefox/default/' directory to be NOCOW,
and that "just works". I haven't even bothered with that.
RE: SLES 11 SP4: can't mount btrfs
>> But it could simply be that you have forgotten to refresh the
>> 'initramfs' with 'mkinitrd' after modifying the '/etc/fstab'.

> I finally managed it. I'm pretty sure having changed
> /boot/grub/menu.lst, but somehow changes got lost/weren't
> saved ?

So the next thing to check would indeed have been that the GRUB2
script had been updated, which you can do with 'grub2-mkconfig'.
Also double check that in '/etc/sysconfig/bootloader' there is a
line 'LOADER_TYPE="grub"' instead of "none".

The system config tools will update the 'initramfs' and the
'menu.lst' automatically only if you make system config changes
using them alone, but you changed the UUID of '/' "manually", and
this perhaps put the GRUB2 config and the system state out of
sync.

> After entering the new UUID from my Btrfs partition system
> boots.

Alternatively you could have used 'btrfstune -U ... ...' to
change the UUID of the newly created '/' volume to the old one.
RE: SLES 11 SP4: can't mount btrfs
> I formatted the / partition with Btrfs again and could restore
> the files from a backup. Everything seems to be there, I can
> mount the Btrfs manually. [ ... ] But SLES finds from where I
> don't know a UUID (see screenshot). This UUID is commented out
> in fstab and replaced by /dev/vg1/lv_root. Using
> /dev/vg1/lv_root I can manually mount my Btrfs without any
> problem. Where does my SLES find that UUID ?

This sounds like a SLES issue, rather than a Btrfs one. But it
could simply be that you have forgotten to refresh the
'initramfs' with 'mkinitrd' after modifying the '/etc/fstab'.
Re: Is it safe to use btrfs on top of different types of devices?
[ ... ]

>> are USB drives really that unreliable [ ... ]

[ ... ]

> There are similar SATA chips too (occasionally JMicron and
> Marvell for example are somewhat less awesome than they could
> be), and practically all Firewire bridge chips of old "lied" a
> lot [ ... ]

> That plus Btrfs is designed to work on top of a "well defined"
> block device abstraction that is assumed to "work correctly"
> (except for data corruption), [ ... ]

When I insist on the reminder that Btrfs is designed to use the
block-device protocol and state machine, rather than USB and SATA
devices, it is because that makes more explicit that the various
layers between the filesystem and the USB or SATA device can
"lie" too, including for example the Linux page cache, which is
just below the block-device layer. But also the disk scheduler,
the SCSI protocol handler, the USB and SATA drivers and disk
drivers, the PCIe chipset, the USB or SATA host bus adapter, the
cable, the backplane.

This paper reports the results of some testing of "enterprise
grade" storage systems at CERN, and some of the symptoms imply
that "lies" can happen *anywhere*. It is scary. It supports
having data checksumming in the filesystem, a rather extreme
choice.
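The checksumming idea can be illustrated with a toy sketch (this
shows only the principle, not Btrfs's actual on-disk format; the
file names and the use of POSIX 'cksum' are invented for the
example):

```shell
# Toy illustration: record a checksum when a block is written, verify
# it when the block is read back, so a layer that silently corrupts
# data ("lies") produces a *detected* read error instead of bad data.
tmp=$(mktemp -d)
printf 'hello' > "$tmp/block0"
crc=$(cksum < "$tmp/block0")        # checksum recorded at write time
printf 'hellp' > "$tmp/block0"      # the "device" flips a bit silently
if [ "$(cksum < "$tmp/block0")" != "$crc" ]; then
    echo "checksum mismatch: the lie has been detected"
fi
rm -r "$tmp"
```

The same logic is why a checksum in the filesystem catches "lies"
that no amount of device-level error reporting would surface.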
Re: Is it safe to use btrfs on top of different types of devices?
[ ... ]

>>> Oh please, please a bit less silliness would be welcome here.
>>> In a previous comment on this tedious thread I had written:

> If the block device abstraction layer and lower layers work
> correctly, Btrfs does not have problems of that sort when
> adding new devices; conversely if the block device layer and
> lower layers do not work correctly, no mainline Linux
> filesystem I know can cope with that.

> Note: "work correctly" does not mean "work error-free".

>>> The last line is very important and I added it advisedly.

[ ... ]

>> Filesystems run on top of *block-devices* with a definite
>> interface and a definite state machine, and filesystems in
>> general assume that the block-device works *correctly*.

> They do run on top of USB or SATA devices, otherwise a
> significant majority of systems running Linux and/or BSD
> should not be operating right now.

That would be big news to any Linux/UNIX filesystem developer,
who would have to rush to add SATA and USB protocol and state
machine handling to their implementations, which currently only
support the block-device protocol and state machine. Please send
patches :-)

Note to some readers: there are filesystems designed to work not
on top of block devices, for example on top of the MTD
abstraction layer.

> Yes, they don't directly access them, but the block layer
> isn't much more than command translation, scheduling, and
> accounting, so this distinction is meaningless and largely
> irrelevant.

More tedious silliness, and grossly ignorant too, because the
protocol and state machine of the block-device layer are
completely different from those of both SATA and USB, and the
mapping of the SATA or USB protocols and state machines onto the
block-device ones is actually a very complex, difficult, and
error-prone task, involving mountains of very hairy code. In
particular, since the block-device protocol and state machine are
rather simplistic, a lot is lost in translation.
Note: the SATA handling firmware in a disk device often involves
*tens of thousands* of lines of code, and "all it does" is "just"
reading the device and passing the content over the IO bus.

Filesystems are designed against that very simplistic protocol
and state machine for good reasons, and sometimes they are
designed against even just a subset; for example most filesystem
designs assume that block-device writes never fail (that is, bad
sector sparing is done by a lower layer), and only some handle
block-device read failures gracefully.

> [ ... ] to refer to a block-device connected via interface 'X'
> as an 'X device' or an 'X storage device'.

More tedious silliness, as this is a grossly misleading shorthand
when the point of the discussion is the error recovery protocol
and state machine assumed by filesystem designers. To me it seems
that if people use that shorthand in that context, as if it were
not a shorthand, they don't know what they are talking about, or
they are trying to mislead the discussion.

> [ ... ] For an end user, it generally doesn't matter whether a
> given layer reported the error or passed it on (or generated
> it), it matters whether it was corrected or not. [ ... ]

You seem unable or unwilling to appreciate how detected and
undetected errors are fundamentally different, and how layering
of greatly different protocols is a complicated issue highly
relevant to error recovery, so you seem to assume that other end
users are likewise unable or unwilling. But I am not so
dismissive of "end users", and I assume that there are end users
who can eventually understand that Btrfs in the main is not
designed to handle devices that "lie", because Btrfs actually is
designed to use the block-device layer, which is assumed to "work
correctly" (except for checksums).
Re: Is it safe to use btrfs on top of different types of devices?
> [ ... ] when writes to a USB device fail due to a temporary
> disconnection, the kernel can actually recognize that a write
> error happened. [ ... ]

Usually, but who knows? Maybe half the transfer gets written;
maybe the data gets written to the wrong address; maybe stuff
gets written but failure is reported, and this not just if the
connection dies, but also if it does not.

> are USB drives really that unreliable [ ... ]

Welcome to the "real world", also called "Shenzhen" :-). There
aren't that many "USB drives"; as I wrote somewhere, there are
usually USB host bus adapters (on the system side) and USB to IO
bus (usually SATA) bridges (on the device side). They both have
to do difficult feats of conversion and signaling, and in the USB
case they are usually designed by a stressed, overworked engineer
in Guangzhou or Taiwan, employed by a no-name contractor who
submitted the lowest bid to a no-name manufacturer, and was told
to do the design cheapest to fabricate in the shortest possible
time.

Most of the time they mostly work, good enough for keyboards and
mice, and for photos of cats on USB sticks; most users just
unplug and replug them if they flake out. BTW my own USB keyboard
and mice and their USB host bus adapter occasionally crash too,
and the cases where my webcam flakes out are more common than
those where it does not. USB is a mixed bag of poorly designed
protocols, and complex too, and it is very easy to do a bad
implementation.

There are similar SATA chips too (occasionally JMicron and
Marvell for example are somewhat less awesome than they could
be), and practically all Firewire bridge chips of old "lied" a
lot, except a few Oxford Semi ones (the legendary 911 series). I
have even seen lying SAS "enterprise" grade storage
interconnects.
I had indeed previously written:

> If you have concerns about the reliability of specific
> storage and system configurations you should become or find a
> system integration and qualification engineer who understands
> the many subtleties of storage devices and device-system
> interconnects and who would run extensive tests on them;
> storage and system commissioning is often far from trivial
> even in seemingly simple cases, due in part to the enormous
> complexity of interfaces, even when they have few bugs, and
> tests made with one combination often do not have the same
> results even on apparently similar combinations.

On the #Btrfs IRC channel there is a small group of cynical
helpers, and when someone mentions "strange things happening" one
of them usually immediately asks "USB?", and in most cases the
answer is "how did you know?".

That plus Btrfs is designed to work on top of a "well defined"
block device abstraction that is assumed to "work correctly"
(except for data corruption), and the Linux block device
abstraction and the SATA and USB layers beneath it are not
designed to handle devices that "lie" (well, there are blacklists
with workarounds for known systematic bugs, but that is partial).
Re: Is it safe to use btrfs on top of different types of devices?
> [ ... ] However, the disappearance of the device doesn't get
> propagated up to the filesystem correctly,

Indeed, sometimes it does, sometimes it does not, in part because
of chipset bugs, in part because the USB protocol signaling side
does not handle errors well even if the chipset were bug free.

> and that is what causes the biggest issue with BTRFS. Because
> BTRFS just knows writes are suddenly failing for some reason,
> it doesn't try to release the device so that things get
> properly cleaned up in the kernel, and thus when the same
> device reappears (as it will when the disconnect was due to a
> transient bus error, which happens a lot), it shows up as a
> different device node, which gets scanned for filesystems by
> udev, and BTRFS then gets really confused because it now sees
> 3 (or more) devices for a 2 device filesystem.

That's a good description that should be on the wiki.
Re: Is it safe to use btrfs on top of different types of devices?
[ ... ]

>> Oh please, please a bit less silliness would be welcome here.
>> In a previous comment on this tedious thread I had written:

>> > If the block device abstraction layer and lower layers work
>> > correctly, Btrfs does not have problems of that sort when
>> > adding new devices; conversely if the block device layer and
>> > lower layers do not work correctly, no mainline Linux
>> > filesystem I know can cope with that.

>> > Note: "work correctly" does not mean "work error-free".

>> The last line is very important and I added it advisedly.

> Even looking at things that way though, Zoltan's assessment
> that reliability is essentially a measure of error rate is
> correct.

It is instead based on a grave confusion between two very
different kinds of "error rate", a confusion also partially based
on the ridiculous misunderstanding, which I have already pointed
out, that UNIX filesystems run on top of SATA or USB devices:

> Internal SATA devices absolutely can randomly drop off the bus
> just like many USB storage devices do,

Filesystems run on top of *block devices* with a definite
interface and a definite state machine, and filesystems in
general assume that the block device works *correctly*.

> but it almost never happens (it's a statistical impossibility
> if there are no hardware or firmware issues), so they are more
> reliable in that respect.

What the OP was doing was using "unreliable" both for the case
where the device "lies" and the case where the device does not
"lie" but reports a failure. Both of these are malfunctions in a
wide sense:

* The [block] device "lies" as to its status or what it has done.
* The [block] device reports truthfully that an action has failed.

But they are of very different nature and need completely
different handling. Hint: one is an extensional property and the
other is a modal one; there is a huge difference between "this
data is wrong" and "I know that this data is wrong".
The really important "detail" is that filesystems are, as a rule
with very few exceptions, designed to work only if the block
device layer (and those below it) does not "lie" (see "Byzantine
failures" below), that is, "works correctly": it reports the
failure of every operation that fails and the success of every
operation that succeeds, and never gets into an unexpected state.
In particular, filesystem designs are nearly always based on the
assumption that there are no undetected errors at the block
device level or below. Then the expected *frequency* of detected
errors influences how much redundancy and what kind of recovery
are desirable, but the frequency of "lies" is assumed to be zero.

The one case where Btrfs does not assume that the storage layer
works *correctly* is checksumming: it is quite expensive and
makes sense only if the block device is expected to (sometimes)
"lie" about having written the data correctly or having read it
correctly. The role of the checksum is to spot when a block
device "lies" and turn an undetected read error into a detected
one (checksums could also be used to detect correct writes that
are misreported as having failed).

The crucial difference that exists between SATA and USB is not
that USB chips have higher rates of detected failures (even if
they often do), but that in my experience SATA interfaces from
reputable suppliers don't "lie" (more realistically, have
negligible "lie" rates), while USB interfaces (both host bus
adapters and IO bus bridges) "lie" both systematically and
statistically at non-negligible rates, and anyhow the USB mass
storage protocol is not very good at error reporting and
handling.

>> The "working incorrectly" general case is the so called
>> "Byzantine generals problem" [ ... ]

This is compsci for beginners, and someone dealing with storage
issues (and not just those) should be intimately familiar with
the implications:

https://en.wikipedia.org/wiki/Byzantine_fault_tolerance

  Byzantine failures are considered the most general and most
  difficult class of failures among the failure modes. The
  so-called fail-stop failure mode occupies the simplest end of
  the spectrum. Whereas the fail-stop failure model simply means
  that the only way to fail is a node crash, detected by other
  nodes, Byzantine failures imply no restrictions, which means
  that the failed node can generate arbitrary data, pretending to
  be a correct one, which makes fault tolerance difficult.
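The fail-stop vs. Byzantine distinction can be made concrete with
a toy sketch (the two "read" functions are invented for the
illustration; they are not any real device interface):

```shell
# A fail-stop "device" reports its failure (non-zero exit status), so
# the caller knows and can retry or fall back on redundancy; a
# Byzantine one returns wrong data while claiming success, so only an
# end-to-end check (such as a checksum) can notice.
failstop_read()  { echo "I/O error" >&2; return 1; }  # detected error
byzantine_read() { printf 'garbage'; return 0; }      # undetected lie

if ! data=$(failstop_read 2>/dev/null); then
    echo "fail-stop: error detected by the caller"
fi
if data=$(byzantine_read); then
    echo "byzantine: got '$data' with a success status"
fi
```

The filesystem's position is that of the caller: it can be
written to handle the first case, but the second is invisible to
it without checksums.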
Re: Is it safe to use btrfs on top of different types of devices?
> [ ... ] After all, btrfs would just have to discard one copy
> of each chunk. [ ... ] One more thing that is not clear to me
> is the replication profile of a volume. I see that balance can
> convert chunks between profiles, for example from single to
> raid1, but I don't see how the default profile for new chunks
> can be set or queried. [ ... ]

My impression is that the design rationale and aims for the Btrfs
two-level allocation scheme (in other fields known as a "BIBOP"
scheme) were not fully shared among Btrfs developers, that
perhaps it could have benefited from some further reflection on
its implications, and that its behaviour may have evolved
"opportunistically", maybe without much worrying as to conceptual
integrity. (I am trying to be euphemistic)

So while I am happy with the "Rodeh" core of Btrfs (COW,
subvolumes, checksums), the RAID-profile functionality and
especially the multi-device layer are not something I find
particularly to my taste. (I am trying to be euphemistic)

So when it comes to allocation, RAID profiles, and multiple
devices, I usually expect some random "surprising
functionality". (I am trying to be euphemistic)
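For what it is worth, the conversion the poster mentions is done
with 'btrfs balance' filters, and the current per-profile
allocation can be queried with 'btrfs filesystem df'; new chunks
are then allocated following the profile of the converted ones.
A sketch, assuming '/mnt' is a mounted Btrfs volume (adjust the
path; these are administrative commands, not something to run
blindly):

```shell
# Show how existing chunks are allocated, per profile
# (lines such as "Data, RAID1: ..."):
btrfs filesystem df /mnt

# Convert data and metadata chunks to raid1; newly allocated
# chunks then use the raid1 profile as well:
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt
```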
Re: Is it safe to use btrfs on top of different types of devices?
>> I forget sometimes that people insist on storing large
>> volumes of data on unreliable storage...

Here obviously "unreliable" is used in the sense of storage that
can work incorrectly, not in the sense of storage that can fail.

> In my opinion the unreliability of the storage is the exact
> reason for wanting to use raid1. And I think any problem one
> encounters with an unreliable disk can likely happen with more
> reliable ones as well, only less frequently, so if I don't
> feel comfortable using raid1 on an unreliable medium then I
> wouldn't trust it on a more reliable one either.

Oh please, please a bit less silliness would be welcome here. In
a previous comment on this tedious thread I had written:

> If the block device abstraction layer and lower layers work
> correctly, Btrfs does not have problems of that sort when
> adding new devices; conversely if the block device layer and
> lower layers do not work correctly, no mainline Linux
> filesystem I know can cope with that.

> Note: "work correctly" does not mean "work error-free".

The last line is very important and I added it advisedly. You
seem to be using "unreliable" in two completely different
meanings, without realizing it, as both "working incorrectly" and
"reporting a failure". They are really very different.

The "working incorrectly" general case is the so-called
"Byzantine generals problem", and (depending on assumptions) it
is insoluble. Btrfs has some limited ability to detect (and
sometimes recover from) "working incorrectly" storage layers, but
don't expect too much from that.
Re: Is it safe to use btrfs on top of different types of devices?
> A few years ago I tried to use a RAID1 mdadm array of a SATA
> and a USB disk, which lead to strange error messages and data
> corruption.

That's common; there are quite a few reports of similar issues in
previous entries in this mailing list, and for many other
filesystems.

> I did some searching back then and found out that using
> hot-pluggable devices with mdadm is a paved road to data
> corruption.

That's an amazing jump of logic.

> Reading through that old bug again I see that it was
> autoclosed due to old age but still hasn't been addressed:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/320638

I suspect that it is very easy to misinterpret what the reported
issue is. However it is an interesting corner case that could
happen with any type of hardware device, not just hot-pluggable
ones, and one that I will try to remember even if it is unlikely
to occur in practice. I was only aware (dimly) of something quite
similar in the case of different logical sector sizes.

> I would like to ask whether btrfs may also be prone to data
> corruption issues in this scenario

Btrfs, like (nearly) all UNIX/Linux filesystems, does not run on
top of "devices", but on top of "files" of type "block device".
If the block device abstraction layer and lower layers work
correctly, Btrfs does not have problems of that sort when adding
new devices; conversely if the block device layer and lower
layers do not work correctly, no mainline Linux filesystem I know
can cope with that.

Note: "work correctly" does not mean "work error-free".

> (due to the same underlying issue as the one described in the
> bug above for mdadm), or is btrfs unaffected by the underlying
> issue

"Socratic method" questions:

* What do you think is the underlying issue in that bug report?
  (hint: something to do with host adapters or device bridges)
* Why do you think that bug report is in any way related to your
  issues with "a RAID1 mdadm array of a SATA and a USB disk"?
> and is safe to use with a mix of regular and hot-pluggable
> devices as well?

In my experience Btrfs works very well with a set of block
devices abstracting over both regular and hot-pluggable devices,
as far as that goes. I personally don't like relying on Btrfs
multi-device volumes, but that has nothing to do with your
concerns, only with basic Btrfs multi-device handling design
choices.

If you have concerns about the reliability of specific storage
and system configurations you should become or find a system
integration and qualification engineer who understands the many
subtleties of storage devices and device-system interconnects and
who would run extensive tests on them; storage and system
commissioning is often far from trivial even in seemingly simple
cases, due in part to the enormous complexity of interfaces, even
when they have few bugs, and tests made with one combination
often do not have the same results even on apparently similar
combinations.

I suspect that you should have asked a completely different set
of questions (XY problem), but the above are I think good answers
to the questions that you have actually asked.
Re: btrfs errors over NFS
>> TL;DR: ran into some btrfs errors and weird behaviour, but
>> things generally seem to work. Just posting some details in
>> case it helps devs or other users. [ ... ] I've run into a
>> btrfs error trying to do a -j8 build of android on a btrfs
>> filesystem exported over NFSv3. [ ... ]

I have an NFS server exporting a Btrfs filesystem, and it is
mostly read-only and low-use, unlike a massive build, but so far
it has worked for me. The issue that was reported a while ago was
that the kernel NFS server does not report checksum validation
failures as errors to clients, it just prints a warning, so for
that reason and a few others I switched to the Ganesha NFS
server.

From your stack traces I noticed that some go pretty deep, so
maybe there is an issue with that (but on 'amd64' the kernel
stack is much bigger than it used to be on 'i386'). Another
possibility is that the volume got somewhat damaged for other
reasons (bugs, media errors, ...) and this is having further
consequences.

BTW 'errno 17' is "File exists", so perhaps there is a race
condition over NFS. The bogus files with mode 0 seem to me to be
bogus directory entries with no files linked to them, which could
again be the result of race conditions. The problems with chunk
allocation reported as "WARNING" are unfamiliar to me.
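The errno reading is easy to reproduce: 'errno 17' is EEXIST,
which is exactly what a second creation attempt reports when two
clients race to create the same entry (the temporary directory
here is invented for the demonstration):

```shell
# EEXIST (errno 17, "File exists"): the error a creation syscall
# returns when the entry already exists, e.g. after losing a race.
tmp=$(mktemp -d)
mkdir "$tmp/d"            # first creation succeeds
mkdir "$tmp/d" || true    # second fails: "... File exists"
rm -r "$tmp"
```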
Re: What means "top level" in "btrfs subvolume list" ?
> I am trying to figure out which means "top level" in the
> output of "btrfs sub list"

The terminology (and sometimes the detailed behaviour) of Btrfs
is not extremely consistent, I guess because of permissive
editorship of the design, in a "let 1000 flowers bloom" sort of
fashion, so that does not matter a lot.

> [ ... ] outputs a "top level ID" equal to the "parent ID" (on
> the basis of the code).

You could have used option '-p' and it would have printed out
both "top level ID" and "parent ID" for extra enlightenment.

> But I am still asking which would be the RIGHT "top level id".

But perhaps one of them is now irrelevant, because 'man btrfs
subvolume' says:

  "If -p is given, then parent is added to the output between ID
  and top level. The parent's ID may be used at mount time via
  the subvolrootid= option."

and 'man 5 btrfs' says:

  "subvolrootid=objectid (irrelevant since: 3.2, formally
  deprecated since: 3.10) A workaround option from times (pre
  3.2) when it was not possible to mount a subvolume that did not
  reside directly under the toplevel subvolume."

> My Hypothesis, it should be the ID of the root subvolume ( or
> 5 if it is not mounted). [ ... ]

Well, a POSIX filesystem typically has a root directory, and it
can be mounted as the system root or at any other point. A Btrfs
filesystem has multiple root directories, which are mounted by
default "somewhere" (a design decision that I think was unwise,
but "whatever"). The subvolume containing the mountpoint
directory of another subvolume's root directory is in no way or
sense its "parent", as there is no derivation relationship; root
directories are independent of each other, and their mountpoint
is (or should be) a runtime entity.
If there is a "parent" relationship, it may be the one between a
snapshot and its origin subvolume (ignoring 'send'/'receive'
...), and having created a few plain and snapshot subvolumes I
get this rather "confusing" output from version 4.4 of the
'btrfs' command:

base# btrfs subvol list -uq -a -p /fs/sda7 | sort -k 6,6n -k 8,8
ID 257 gen 718 parent 5 top level 5 parent_uuid - uuid 2d7b0606-76d9-f24b-8f75-d20a5c0f3521 path =
ID 356 gen 719 parent 5 top level 5 parent_uuid - uuid 9d201029-d2bf-2f43-8381-8c19d090483e path sl1
ID 358 gen 719 parent 5 top level 5 parent_uuid 2d7b0606-76d9-f24b-8f75-d20a5c0f3521 uuid bc0e6a33-b5dc-4d48-b2db-1452b705d227 path sn1
ID 357 gen 715 parent 356 top level 356 parent_uuid - uuid 2abc6399-956d-894f-836b-32eb5b603654 path /sl1/sl2
ID 360 gen 718 parent 356 top level 356 parent_uuid 2d7b0606-76d9-f24b-8f75-d20a5c0f3521 uuid ad896822-e9a5-c645-8cfd-0aca7f5a2298 path /sl1/sn3
ID 361 gen 719 parent 356 top level 356 parent_uuid bc0e6a33-b5dc-4d48-b2db-1452b705d227 uuid 9c1390d2-e485-cb4a-a41b-670248587bfb path /sl1/sn4
ID 359 gen 717 parent 358 top level 358 parent_uuid 2d7b0606-76d9-f24b-8f75-d20a5c0f3521 uuid 72d4f943-2881-6442-b398-2277be8f2fec path /sn1/sn2

The "confusion" is that for some subvolumes the "parent" is the
same but the "parent_uuid" is different, and vice versa. IIRC
this has already been mentioned in part elsewhere.
Re: Btrfs performance with small blocksize on SSD
> i run a few performance tests comparing mdadm, hardware raid
> and the btrfs raid.

Fantastic beginning already! :-)

> I noticed that the performance

I have seen over the years a lot of messages like this one, with
a wanton display of amusing misuses of terminology, of which the
misuse of the word "performance" to mean "speed" is common, and
your results are work-per-time, which is a "speed":

http://www.sabi.co.uk/blog/15-two.html?151023#151023

The "tl;dr" is: you and another guy are told to race the 100m to
win a €10,000 prize, but you have to carry a sack with a 50kg
weight. It takes you a lot longer, as your speed is much lower,
and the other guy gets the prize. Was that because your
performance was much worse? :-)

> for small blocksizes (2k) is very bad on SSD in general and on
> HDD for sequential writing.

Your graphs show pretty decent speed for small-file IO on Btrfs,
depending on conditions, and you are very astutely not explaining
the conditions, even if some can be guessed.

> I wonder about that result, because you say on the wiki that
> btrfs is very effective for small files.

Effectiveness/efficiency are not the same as performance or speed
either. My own simplistic but somewhat meaningful tests show that
Btrfs does relatively well on small files:

http://www.sabi.co.uk/blog/17-one.html?170302#170302

As to "small files" in general, I have read about many attempts
to use filesystems as DBMSes, and I consider them intensely
stupid:

http://www.sabi.co.uk/blog/anno05-4th.html?051016#051016

> I attached my results from raid 1 random write HDD (rH1), SSD
> (rS1) and from sequential write HDD (sH1), SSD (sS1)

Ah, so it was specifically about small *writes* (and presumably,
because of other wording, not small in-place updates of large
files, but creating and writing small files). It is a very basic
beginner-level notion that most storage systems are very
anisotropic as to IO size, and also as to read vs. write, and
never mind with and without 'fsync'.
SSDs without supercapacitor-backed buffers are a particular
issue. Btrfs has a performance envelope where the speed of small
writes (in particular small in-place updates, but also small file
creation, because of POSIX) has been sacrificed for good reasons:

https://btrfs.wiki.kernel.org/index.php/SysadminGuide#Copy_on_Write_.28CoW.29
https://btrfs.wiki.kernel.org/index.php/Gotchas#Fragmentation

Also consider the consequences of the 'max_inline' option for
'mount' and the 'nodesize' option for 'mkfs.btrfs'.

> Hopefully you have an explanation for that.

The best explanation seems to me (euphemism alert) the quite
extensive "misknowledge" in the message I am responding to.
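The cost of small CoW writes can be sketched with some toy
arithmetic (the block size, node size, and tree depth are assumed
round numbers for illustration, not measured Btrfs internals):

```shell
# Toy model: bytes written for one small (2 KiB) write. An in-place
# filesystem overwrites one data block; a CoW filesystem also writes
# a new copy of each B-tree node on the path to the modified leaf.
BLOCK=4096; NODESIZE=16384; TREE_DEPTH=3
inplace=$BLOCK
cow=$(( BLOCK + TREE_DEPTH * NODESIZE ))
echo "in-place: $inplace bytes; CoW: $cow bytes"  # 4096 vs 53248
```

Under these assumed numbers a 2 KiB write costs roughly 13 times
more IO under CoW, which is one way to see why small random
writes are the workload Btrfs trades away.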
Re: A user cannot remove his readonly snapshots?!
[ ... ]

> I can delete normal subvolumes but not the readonly snapshots:

It is because of ordinary permissions, for both subvolumes and
snapshots:

tree$ btrfs sub create /fs/sda7/sub
Create subvolume '/fs/sda7/sub'
tree$ chmod a-w /fs/sda7/sub
tree$ btrfs sub del /fs/sda7/sub
Delete subvolume (no-commit): '/fs/sda7/sub'
ERROR: cannot delete '/fs/sda7/sub': Permission denied
tree$ chmod u+w /fs/sda7/sub
tree$ btrfs sub del /fs/sda7/sub
Delete subvolume (no-commit): '/fs/sda7/sub'

It is however possible to remove an ordinary read-only directory,
*as long as its parent directory is not read-only too*:

tree$ mkdir /fs/sda7/sub
tree$ chmod a-w /fs/sda7/sub
tree$ rmdir /fs/sda7/sub; echo $?
0

IIRC this came up before, and the reason for the difference is
that a subvolume root directory is "special" because its '..'
entry points to itself (inode 256), that is, if it is read-only
then its parent directory (itself) is read-only too.
Re: A user cannot remove his readonly snapshots?!
> [ ... ] mounted with option user_subvol_rm_allowed [ ... ]
> root can delete this snapshot, but not the user. Why? [ ... ]

Ordinary permissions still apply, both to 'create' and 'delete':

tree$ sudo mkdir /fs/sda7/dir
tree$ btrfs sub create /fs/sda7/dir/sub
ERROR: cannot access /fs/sda7/dir/sub: Permission denied
tree$ sudo chmod a+rwx /fs/sda7/dir
tree$ btrfs sub create /fs/sda7/dir/sub
Create subvolume '/fs/sda7/dir/sub'
tree$ btrfs sub delete /fs/sda7/dir/sub
Delete subvolume (no-commit): '/fs/sda7/dir/sub'
ERROR: cannot delete '/fs/sda7/dir/sub': Operation not permitted
tree$ sudo mount -o remount,user_subvol_rm_allowed /fs/sda7
tree$ btrfs sub delete /fs/sda7/dir/sub
Delete subvolume (no-commit): '/fs/sda7/dir/sub'
tree$ btrfs sub create /fs/sda7/dir/sub
Create subvolume '/fs/sda7/dir/sub'
tree$ sudo chmod a-w /fs/sda7/dir
tree$ btrfs sub delete /fs/sda7/dir/sub
Delete subvolume (no-commit): '/fs/sda7/dir/sub'
ERROR: cannot delete '/fs/sda7/dir/sub': Permission denied
tree$ sudo chmod a+w /fs/sda7/dir
tree$ btrfs sub delete /fs/sda7/dir/sub
Delete subvolume (no-commit): '/fs/sda7/dir/sub'
tree$ sudo rmdir /fs/sda7/dir
Re: defragmenting best practice?
[ ... ]

Case #1
2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu cow2
storage -> guest BTRFS filesystem
SQL table row insertions per second: 1-2

Case #2
2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu raw
storage -> guest EXT4 filesystem
SQL table row insertions per second: 10-15

[ ... ]

>> Q 0) what do you think that you measure here?

> Cow's fragmentation impact on SQL write performance.

That's not what you are measuring; you are measuring the impact
on speed of configurations "designed" (perhaps unintentionally)
for maximum flexibility and lowest cost, with complete disregard
for speed.

[ ... ]

> It was quick and dirty task to find, prove and remove
> performance bottleneck at minimal cost.

This is based on the usual confusion between "performance" (the
result of several tradeoffs) and "speed". When you report "row
insertions per second" you are reporting a rate, that is a
"speed", not "performance", which is always multi-dimensional:

http://www.sabi.co.uk/blog/15-two.html?151023#151023

In the cases above speed is low, but I think that, taking into
account flexibility and cost, performance is pretty good.

> AFAIR removing storage cow2 and guest BTRFS storage gave us ~
> 10 times boost.

"Oh doctor, if I stop stabbing my hand with a fork it no longer
hurts, but running while carrying a rucksack full of bricks is
still slower than with a rucksack full of feathers".

[ ... ]
Re: defragmenting best practice?
> Case #1
> 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu cow2 storage
> -> guest BTRFS filesystem
> SQL table row insertions per second: 1-2

"Doctor, if I stab my hand with a fork it hurts a lot: can you cure
that?"

> Case #2
> 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu raw
> storage -> guest EXT4 filesystem
> SQL table row insertions per second: 10-15

"Doctor, I can't run as fast with a backpack full of bricks as
without it: can you cure that?"

:-)
Re: generic name for volume and subvolume root?
> As I am writing some documentation abount creating snapshots:
> Is there a generic name for both volume and subvolume root?

Yes: from the UNIX side it is 'root directory' and from the Btrfs
side 'subvolume'.

Like some other things in Btrfs, its terminology is often
inconsistent, but "volume" *usually* means "the set of devices [and
contained root directories] with the same Btrfs 'fsid'".

I think that the top-level subvolume should not be called the
"volume": while there is no reason why a UNIX-like filesystem should
be limited to a single block-device, one of the fundamental
properties of UNIX-like filesystems is that hard-links are only
possible (if at all possible) within a filesystem, and that 'statfs'
returns a different "device id" per filesystem. Therefore a Btrfs
volume is not properly a filesystem, but potentially a filesystem
forest, as it may contain multiple filesystems each with its own
root directory.

> Is there a simple name for directories I can snapshot?

You can only snapshot *root directories*, of which in Btrfs there
are two types: subvolumes (an unfortunate name perhaps) or
snapshots.

In UNIX-like OSes every filesystem has a "root directory", and some
filesystem types like Btrfs, NILFS2, and potentially JFS can have
more than one, and some can even mount more than one simultaneously.
The root directory mounted as '/' is called the "system root
directory". When unmounted, all filesystem root directories have no
names, just an inode number.

Conceivably the root inode of a UNIX-like filesystem could be an
inode of any type, but I have never seen a recent UNIX-like OS able
to mount anything other than a directory-type root inode (Plan 9 is
not a UNIX-like OS :->).
As someone else observed, the word "root" is overloaded in UNIX-like
OS discourse, like the word "filesystem", and that's unfortunate but
can always be resolved verbosely by using the appropriate qualifier
like "root directory", "system root directory", "'root' user",
"uid 0 capabilities", etc.
Re: test if a subvolume is a snapshot?
> How can I test if a subvolume is a snapshot? [ ... ]

This question is based on the assumption that "snapshot" is a
distinct type of subvolume and not just an operation that creates a
subvolume with reflinked contents. Unfortunately Btrfs does indeed
make snapshots a distinct type of subvolume...

In my 4.4 kernel/progs version of Btrfs it seems that the
'Parent UUID' is that of the source of the snapshot, and the source
of a snapshot somehow comes with a list of all the snapshots taken
from it:

  # ls /fs/sda7
  =        @170826  @170829  @170901  @170903  @170905  @170907
  @170825  @170828  @170830  @170902  @170904  @170906  lost+found
  # btrfs subvolume list /fs/sda7
  ID 431 gen 532441 top level 5 path =
  ID 1619 gen 524915 top level 5 path @170825
  ID 1649 gen 524915 top level 5 path @170826
  ID 1651 gen 524915 top level 5 path @170828
  ID 1652 gen 524915 top level 5 path @170829
  ID 1654 gen 524915 top level 5 path @170830
  ID 1655 gen 523316 top level 5 path @170901
  ID 1656 gen 524034 top level 5 path @170902
  ID 1658 gen 525628 top level 5 path @170903
  ID 1659 gen 527121 top level 5 path @170904
  ID 1660 gen 528719 top level 5 path @170905
  ID 1665 gen 530565 top level 5 path @170906
  ID 1666 gen 532217 top level 5 path @170907
  # btrfs subvolume show /fs/sda7/= | egrep 'UUID|Parent|Top level|Snap|@'
          UUID:           cb99579f-64e5-e94c-b22c-41dcc397c37f
          Parent UUID:    -
          Received UUID:  -
          Parent ID:      5
          Top level ID:   5
          Snapshot(s):
                  @170825 @170826 @170828 @170829 @170830 @170901
                  @170902 @170903 @170904 @170905 @170906 @170907
  # btrfs subvolume show /fs/sda7/@170901 | egrep 'UUID|Parent|Top level|Snap|@'
  /fs/sda7/@170901
          Name:           @170901
          UUID:           851f8ef3-c2af-4b46-89af-0193fd4e6fc4
          Parent UUID:    cb99579f-64e5-e94c-b22c-41dcc397c37f
          Received UUID:  -
          Parent ID:      5
          Top level ID:   5
          Snapshot(s):

Note that with typical Btrfs consistency "Parent UUID" is that of
the source of the snapshot, while "Parent ID" is that of the upper
level subvolume, and in the "flat" layout for this volume the
snapshot parent is '/fs/sda7/=' and the
upper level is '/fs/sda7' instead.

The different results that you get make me suspect that the
top-level subvolume is "special".
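Under the behaviour shown above, the check can be scripted by looking at the 'Parent UUID' field. This is a hedged sketch, not a guaranteed interface: the field layout is the 4.4-era 'btrfs-progs' output from the transcript, and abridged sample outputs are inlined so the sketch is self-contained (on a live system you would capture `btrfs subvolume show <path>` instead).

```shell
# A subvolume created by "snapshot" has a non-empty 'Parent UUID';
# is_snapshot() parses `btrfs subvolume show` output to check that.
is_snapshot() {
    # $1: captured output of `btrfs subvolume show <path>`
    parent=$(printf '%s\n' "$1" |
        sed -n 's/^[[:space:]]*Parent UUID:[[:space:]]*//p')
    [ -n "$parent" ] && [ "$parent" != "-" ]
}

# Sample outputs, abridged from the transcript above.
show_snap='Name:        @170901
UUID:        851f8ef3-c2af-4b46-89af-0193fd4e6fc4
Parent UUID: cb99579f-64e5-e94c-b22c-41dcc397c37f'
show_source='UUID:        cb99579f-64e5-e94c-b22c-41dcc397c37f
Parent UUID: -'

is_snapshot "$show_snap"   && echo '@170901 is a snapshot'
is_snapshot "$show_source" || echo '= is not a snapshot'
```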
Re: 4.13: No space left with plenty of free space (/home/kernel/COD/linux/fs/btrfs/extent-tree.c:6989 __btrfs_free_extent.isra.62+0xc2c/0xdb0)
[ ... ]

> [233787.921018] Call Trace:
> [233787.921031]  ? btrfs_merge_delayed_refs+0x62/0x550 [btrfs]
> [233787.921039]  __btrfs_run_delayed_refs+0x6f0/0x1380 [btrfs]
> [233787.921047]  btrfs_run_delayed_refs+0x6b/0x250 [btrfs]
> [233787.921054]  btrfs_write_dirty_block_groups+0x158/0x390 [btrfs]
> [233787.921063]  commit_cowonly_roots+0x221/0x2c0 [btrfs]
> [233787.921071]  btrfs_commit_transaction+0x46e/0x8d0 [btrfs]

[ ... ]

> [233787.921191] BTRFS: error (device md2) in
> btrfs_run_delayed_refs:3009: errno=-28 No space left
> [233789.507669] BTRFS warning (device md2): Skipping commit of aborted
> transaction.
> [233789.507672] BTRFS: error (device md2) in cleanup_transaction:1873:
> errno=-28 No space left

[ ... ]

So the numbers that matter are:

> Data,single: Size:12.84TiB, Used:7.13TiB
>    /dev/md2       12.84TiB
> Metadata,DUP: Size:79.00GiB, Used:77.87GiB
>    /dev/md2      158.00GiB
> Unallocated:
>    /dev/md2        3.31TiB

The metadata allocation is nearly full, so it could be the usual
story with the two-level allocator that there are no unallocated
chunks for metadata expansion, but since you have 3TiB of
'unallocated' space there is no obvious reason why allocation of the
metadata to do a new root transaction flush should abort, so this is
about "guessing" which corner case or bug applies:

* If you are using the 'space_cache' it has a known issue:
  https://btrfs.wiki.kernel.org/index.php/Gotchas#Free_space_cache
* Some versions of Btrfs (IIRC around 4.8-4.9) had some other
  allocator bug.
* Maybe some previous issue, hw or sw, had damaged internal
  filesystem structures.

I also notice that your volume's data free space seems to be
extremely fragmented, as the large difference here shows:
"Data,single: Size:12.84TiB, Used:7.13TiB". Which may mean that it
is mounted with 'ssd' and/or has gone a long time without a
'balance', and conceivably this can make it easier for the free
space cache to fail finding space (some handwaving here).
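A rough way to watch for this situation is to compare the metadata 'total' and 'used' figures from `btrfs fi df` output. A sketch under assumptions: the 95% threshold below is an arbitrary illustration, and the sample lines are inlined from the figures above rather than read from a live mountpoint.

```shell
# Parse `btrfs fi df`-style output and warn when the metadata chunks
# are nearly full; the 95% threshold is an arbitrary illustration.
df_out='Data, single: total=12.84TiB, used=7.13TiB
Metadata, DUP: total=79.00GiB, used=77.87GiB'

# Extract "total used" (in GiB) from the Metadata line.
meta=$(printf '%s\n' "$df_out" |
    sed -n 's/^Metadata.*total=\([0-9.]*\)GiB, used=\([0-9.]*\)GiB/\1 \2/p')
set -- $meta
awk -v total="$1" -v used="$2" 'BEGIN {
    pct = used / total * 100
    printf "metadata %.1f%% full\n", pct
    if (pct > 95) print "WARNING: metadata chunks nearly full"
}'
```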
Re: speed up big btrfs volumes with ssds
>>> [ ... ] Currently without any ssds i get the best speed with:
>>> - 4x HW Raid 5 with 1GB controller memory of 4TB 3,5" devices
>>> and using btrfs as raid 0 for data and metadata on top of
>>> those 4 raid 5. [ ... ] the write speed is not as good as i
>>> would like - especially for random 8k-16k I/O. [ ... ]

> [ ... ] 64kb data stripe and 16kb parity Btrfs raid0 use 64kb
> as stripe so that can make data access unaligned (or use single
> profile for btrfs) 3. Use btrfs ssd_spread to decrease RMW
> cycles.

This is not a "revolutionary" scientific discovery like the idea of
a working set of a small-size random-write workload, but it still
takes a lot of "optimism" to imagine that it is possible to
"decrease RMW cycles" for "random 8k-16k" writes on 64KiB+16KiB
RAID5 stripes, whether with 'ssd_spread' or not.

To "decrease RMW cycles" seems indeed to me the better aim than
following the "radical" aim of caching the working set of a
random-small-write workload, but it may be less easy to achieve
than desirable :-).

  http://www.baarf.dk/
Re: speed up big btrfs volumes with ssds
>> [ ... ] Currently the write speed is not as good as i would
>> like - especially for random 8k-16k I/O. [ ... ]

> [ ... ] So this 60TB is then 20 4TB disks or so and the 4x 1GB
> cache is simply not very helpful I think. The working set
> doesn't fit in it I guess. If there is mostly single or a few
> users of the fs, a single pcie based bcacheing 4 devices can
> work, but for SATA SSD, I would use 1 SSD per HWraid5. [ ... ]

Probably the idea of the cacheable working set of a random small
write workload is a major new scientific discovery. :-)
Re: read-only for no good reason on 4.9.30
> [ ... ] I ran "btrfs balance" and then it started working
> correctly again. It seems that a btrfs filesystem if left
> alone will eventually get fragmented enough that it rejects
> writes [ ... ]

Free space will get fragmented, because Btrfs has a 2-level
allocator scheme (chunks within devices and leaves within chunks).
The issue is "free space" vs. "unallocated chunks".

> Is this a known issue?

  https://btrfs.wiki.kernel.org/index.php/Problem_FAQ#I_get_.22No_space_left_on_device.22_errors.2C_but_df_says_I.27ve_got_lots_of_space
  https://btrfs.wiki.kernel.org/index.php/FAQ#if_your_device_is_large_.28.3E16GiB.29
  https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_new_tools
  https://btrfs.wiki.kernel.org/index.php/FAQ#Why_is_free_space_so_complicated.3F
  https://btrfs.wiki.kernel.org/index.php/FAQ#What_does_.22balance.22_do.3F

That problem is particularly frequent with the 'ssd' mount option,
which probably should never be used:

  https://btrfs.wiki.kernel.org/index.php/Gotchas#The_ssd_mount_option
Re: speed up big btrfs volumes with ssds
> [ ... ] - needed volume size is 60TB

I wonder how long that takes to 'scrub', 'balance', 'check',
'subvolume delete', 'find', etc.

> [ ... ] 4x HW Raid 5 with 1GB controller memory of 4TB 3,5"
> devices and using btrfs as raid 0 for data and metadata on top
> of those 4 raid 5. [ ... ] the write speed is not as good as
> i would like - especially for random 8k-16k I/O. [ ... ]

Also I noticed that the rain is wet and cold - especially if one
walks around for a few hours in a t-shirt, shorts and sandals. :-)

> My current idea is to use a pcie flash card with bcache on top
> of each raid 5. Is this something which makes sense to speed
> up the write speed.

Well, 'bcache' in the role of write buffer allegedly helps turning
unaligned writes into aligned writes, so it might help, but I wonder
how effective that will be in this case, plus it won't turn low
random-IOPS-per-TB 4TB devices into high ones. Anyhow, if it is
battery-backed the 1GB of HW HBA cache/buffer should do exactly
that, except that again in this case that is rather optimistic.

But this reminds me of the common story: "Doctor, if I repeatedly
stab my hand with a fork it hurts a lot, how to fix that?" "Don't do
it". :-)

PS Random writes of 8-16KiB over 60TB might seem like storing small
records/images in small files. That would be "brave". On a 60TB
RAID50 of 20x 4TB disk drives that might mean around 5-10MB/s of
random small writes, including both data and metadata.
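The 5-10MB/s figure can be roughly sanity-checked with a back-of-envelope calculation. All the inputs below are assumptions, not measurements: roughly 100 random IOPS per 7200rpm drive, 20 drives, and about 4 disk operations per small write to account for RAID5 read-modify-write.

```shell
# Back-of-envelope estimate of random small-write throughput on a
# 20-drive RAID50: drives * iops / rmw gives usable write IOPS, and
# multiplying by the write size gives a rough aggregate rate.
awk 'BEGIN {
    drives = 20; iops = 100; rmw = 4   # all assumed figures
    for (kib = 8; kib <= 16; kib *= 2)
        printf "%2d KiB writes: ~%.1f MB/s\n",
               kib, drives * iops / rmw * kib / 1024
}'
```

That gives roughly 4-8MB/s, in the same ballpark as the 5-10MB/s estimate above.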
Re: number of subvolumes
>> Using hundreds or thousands of snapshots is probably fine
>> mostly.

As I mentioned previously, with a link to the relevant email
describing the details, the real issue is reflinks/backrefs;
usually subvolumes and snapshots involve them.

> We find that typically apt is very slow on a machine with 50
> or so snapshots and raid10. Slow as in probably 10x slower as
> doing the same update on a machine with 'single' and no
> snapshots.

That seems to indicate using snapshots on a '/' volume to provide a
"rollback machine" like SUSE. Since '/' usually has many small files
and installation of upgraded packages involves only a small part of
them, that usually involves a lot of reflinks/backrefs. But that you
find that the system has slowed down significantly in ordinary
operations is unusual, because what is slow in situations with many
reflinks/backrefs per extent is not access, but operations like
'balance' or 'delete'.

Guessing wildly, what you describe seems more the effect of low
locality (aka high fragmentation), which is often the result of the
'ssd' option, which should always be explicitly disabled (even for
volumes on flash SSD storage). I would suggest some use of
'filefrag' to analyze and perhaps use of 'defrag' and 'balance'.
Another possibility is having enabled compression with the presence
of many in-place updates on some files, which can also result in
low locality (high fragmentation).

As usual with Btrfs, there are corner cases to avoid: 'defrag'
should be done before 'balance' and with compression switched off
(IIRC):

  https://wiki.archlinux.org/index.php/Btrfs#Defragmentation

    Defragmenting a file which has a COW copy (either a snapshot
    copy or one made with cp --reflink or bcp) plus using the -c
    switch with a compression algorithm may result in two unrelated
    files effectively increasing the disk usage.

  https://wiki.debian.org/Btrfs

    Mounting with -o autodefrag will duplicate reflinked or
    snapshotted files when you run a balance.
    Also, whenever a portion of the fs is defragmented with "btrfs
    filesystem defragment" those files will lose their reflinks and
    the data will be "duplicated" with n-copies. The effect of this
    is that volumes that make heavy use of reflinks or snapshots
    will run out of space. Additionally, if you have a lot of
    snapshots or reflinked files, please use "-f" to flush data for
    each file before going to the next file.

I prefer dump-and-reload.
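The 'filefrag' analysis suggested above can be scripted along these lines. The helper merely ranks files by extent count; the sample input here stands in for real output of something like `find /path -xdev -type f -exec filefrag {} +`, and the file names are invented for illustration.

```shell
# Rank files by extent count, most fragmented first, from `filefrag`
# summary lines of the form "<file>: <N> extents found".
top_fragmented() {
    awk -F': ' '{ n = $2 + 0; if (n > 1) print n, $1 }' | sort -rn
}

# Sample input standing in for real `filefrag` output.
sample='/var/lib/db/table.ibd: 8421 extents found
/etc/hostname: 1 extent found
/var/log/syslog: 37 extents found'

printf '%s\n' "$sample" | top_fragmented
```

Files at the top of the list are the candidates for 'defrag' (with the reflink-unsharing caveats quoted above kept in mind).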
Re: number of subvolumes
> This is a vanilla SLES12 installation: [ ... ] Why does SUSE
> ignore this "not too many subvolumes" warning?

As in many cases with Btrfs "it's complicated", because of the
interaction of advanced features among themselves and with the
chosen implementation and properties of storage; anisotropy rules.

IIRC the main problem actually is not with "too many subvolumes",
but with too many "reflinks"/"backrefs"; subvolumes, in particular
snapshots, are just the main way to create them:

  https://www.spinics.net/lists/linux-btrfs/msg42808.html

A couple dozen subvolumes without reflinks, as in the '/' scheme
used by SUSE, are going to be almost always fine.

Then there is a different issue: I remember seeing a post by a SUSE
guy saying that the 10/10/10/10 (hourly/daily/monthly/yearly)
snapshots in the default settings for 'snapper' were a bad idea
because they would create way too many snapshots, but that he was
told to set those defaults that high. I can imagine a cowardly but
plausible reason why "management" would want those defaults.

Some semi-useful links:

* Home page for 'snapper': https://snapper.io/
* Announcement of 'snapper':
  https://lizards.opensuse.org/2011/04/01/introducing-snapper/
* Useful maintenance scripts:
  https://github.com/kdave/btrfsmaintenance
Re: user snapshots
> So, still: What is the problem with user_subvol_rm_allowed?

As usual, it is complicated: mostly that while subvol creation is
very cheap, subvol deletion can be very expensive. But then so can
be creating many snapshots, as in this:

  https://www.spinics.net/lists/linux-btrfs/msg62760.html

Also that deleting a subvol can delete a lot of stuff
"inadvertently", including things that the user could not delete
using UNIX-style permissions. But then many of the Btrfs semantics
feel a bit "arbitrary", in part because they break new ground, in
part because of happenstance.

  http://linux-btrfs.vger.kernel.narkive.com/eTtmsQdL/patch-1-2-btrfs-don-t-check-the-permission-of-the-subvolume-which-we-want-to-delete
  http://linux-btrfs.vger.kernel.narkive.com/nR17xtw7/patch-btrfs-allow-subvol-deletion-by-unprivileged-user-with-o-user-subvol-rm-allowed
Re: netapp-alike snapshots?
[ ... ] It is beneficial to not have snapshots in-place. With a
local directory of snapshots, [ ... ]

Indeed, and there is a fair description of some options for
subvolume nesting policies here, which may be interesting to the
original poster:

  https://btrfs.wiki.kernel.org/index.php/SysadminGuide#Layout

It is unsurprising to me that there are tradeoffs involved in every
choice. I find the "Flat" layout particularly desirable.

>>> Netapp snapshots are invisible for tools doing opendir()/
>>> readdir() One could simulate this with symlinks for the
>>> snapshot directory: store the snapshot elsewhere (not inplace)
>>> and create a symlink to it, in every directory.

More precisely, in every subvolume root directory.

>>> My users want the snapshots locally in a .snapshot
>>> subdirectory.

Btrfs snapshots can only be done for a whole subvolume. Subvolumes
and snapshots can be created by users, but too many snapshots (see
below) can cause trouble. For somewhat good reasons subvolumes,
including snapshots, cannot be deleted by users though, unless mount
option 'user_subvol_rm_allowed' is used.

>>> Because Netapp do it this way - for at least 20 years and we
>>> have a multi-PB Netapp storage environment. No chance to change
>>> this.

Send patches :-).

> Not only du works recursivly, but also find and with option
> also ls, grep, etc.

Note also that subvolume root directory inodes are indeed root
directory inodes, so they can be 'mount'ed, and therefore the
transition from a subvolume into a contained subvolume can be
detected at the mountpoint. So 'find' has the '-xdev' option and
'du' has the '-x' option, and so do nearly all other tools, so
perhaps someone expects that to happen :-).

> And it would require a bind mount for EVERY directory. There can
> be hundreds... thousends!
Assumptions that all Btrfs features such as snapshots are infinitely
scalable at no cost may be optimistic:

  https://btrfs.wiki.kernel.org/index.php/Gotchas#Having_many_subvolumes_can_be_very_slow
Re: finding root filesystem of a subvolume?
[ ... ]

>> There is no fixed relationship between the root directory
>> inode of a subvolume and the root directory inode of any
>> other subvolume or the main volume.

> Actually, there is, because it's inherently rooted in the
> hierarchy of the volume itself. That root inode for the
> subvolume is anchored somewhere under the next higher
> subvolume.

This stupid point relies on ignoring that it is not mandatory to
mount the main volume, and that therefore "There is no fixed
relationship between the root directory inode of a subvolume and
the root directory inode of any other subvolume or the main
volume", because the "root directory inode" of the "main volume"
may not be mounted at all.

This stupid point also relies on ignoring that subvolumes can be
mounted *also* under another directory, even if the main volume is
mounted somewhere else. Suppose that the following applies:

  subvol=5    /local
  subvol=383  /local/.backup/home
  subvol=383  /mnt/home-backup

and you are given the mountpoint '/mnt/home-backup': how can you
find the main volume mountpoint '/local' from that? Please explain
how '/mnt/home-backup' is indeed "inherently rooted in the
hierarchy of the volume itself", because there is always a "fixed
relationship between the root directory inode of a subvolume and
the root directory inode of any other subvolume or the main
volume".

[ ... ]

> Again, it does, it's just not inherently exposed to userspace
> unless you mount the top-level subvolume (subvolid=5 and/or
> subvol=/ in mount options).

This extra stupid point is based on ignoring that to "mount the
top-level subvolume" relies on knowing already which one is the
"top-level subvolume", which is begging the question.

[ ... ]
Re: finding root filesystem of a subvolume?
> How do I find the root filesystem of a subvolume?
> Example:
> root@fex:~# df -T
> Filesystem  Type   1K-blocks       Used  Available Use% Mounted on
> -           -     1073740800  104244552  967773976  10% /local/.backup/home
[ ... ]
> I know, the root filesystem is /local,

That question is somewhat misunderstood and uses the wrong concepts
and terms.

In UNIX filesystems a filesystem "root" is a directory inode with a
number that is local to itself, and can be "mounted" anywhere, or
left unmounted, and that is a property of the running system, not
of the filesystem "root". Usually UNIX filesystems have a single
"root" directory inode.

In the case of Btrfs the main volume and its subvolumes all have
filesystem "root" directory inodes, which may or may not be
"mounted", anywhere the administrators of the running system
please, as a property of the running system. There is no fixed
relationship between the root directory inode of a subvolume and
the root directory inode of any other subvolume or the main volume.

Note: in Btrfs terminology "volume" seems to mean both the main
volume and the collection of devices where it and subvolumes are
hosted.

> but who can I show it by command?

The system does not keep an explicit record of which Btrfs "root"
directory inode is related to which other Btrfs "root" directory
inode in the same volume, whether mounted or unmounted. That
relationship has to be discovered by using volume UUIDs, which are
the same for the main subvolume and the other subvolumes, whether
mounted or not, so one has to do:

* For the indicated mounted subvolume "root", read its UUID.
* For every mounted filesystem "root", check whether its type is
  'btrfs' and if it is obtain its UUID.
* If the UUID is the same, and the subvolume id is '5', that's the
  main subvolume, and terminate.
* For every block device which is not mounted, check whether it has
  a Btrfs superblock.
* If the type is 'btrfs' and the volume UUID is the same as that of
  the subvolume, list the block device.
In the latter case, since the main volume is not mounted, the only
way to identify it is to list the block devices that host it.
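The mounted-filesystem part of that search can be sketched in shell. This runs against an inlined sample mount table rather than the live '/proc/self/mounts', and for brevity it matches on the source device rather than the volume UUID; a real version must compare UUIDs (e.g. via 'blkid' or 'findmnt'), since a Btrfs volume can span several devices.

```shell
# Sample mount table standing in for /proc/self/mounts, using the
# hypothetical layout discussed above.
mounts='/dev/sda7 /local btrfs rw,subvolid=5,subvol=/ 0 0
/dev/sda7 /mnt/home-backup btrfs rw,subvolid=383,subvol=/home 0 0
/dev/sda1 /boot ext4 rw 0 0'

# Given a subvolume mountpoint, print where the top-level subvolume
# (subvolid=5) of the same volume is mounted, if anywhere.
find_main_volume() {
    dev=$(printf '%s\n' "$mounts" | awk -v mp="$1" '$2 == mp { print $1 }')
    printf '%s\n' "$mounts" |
        awk -v dev="$dev" \
            '$1 == dev && $3 == "btrfs" && $4 ~ /(^|,)subvolid=5(,|$)/ { print $2 }'
}

find_main_volume /mnt/home-backup
```

If this prints nothing, the top-level subvolume is not mounted, and one falls back to scanning block devices for a matching superblock as described above.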
Re: slow btrfs with a single kworker process using 100% CPU
>>> I've one system where a single kworker process is using 100%
>>> CPU sometimes a second process comes up with 100% CPU
>>> [btrfs-transacti]. [ ... ]

>> [ ... ] 1413 Snapshots. I'm deleting 50 of them every night. But
>> btrfs-cleaner process isn't running / consuming CPU currently.

Reminder that:

  https://btrfs.wiki.kernel.org/index.php/Gotchas#Having_many_subvolumes_can_be_very_slow

  "The cost of several operations, including currently balance,
  device delete and fs resize, is proportional to the number of
  subvolumes, including snapshots, and (slightly super-linearly)
  the number of extents in the subvolumes."

>> [ ... ] btrfs is mounted with compress-force=zlib

> Could be similar issue as what I had recently, with the RAID5 and
> 256kb chunk size. please provide more information about your RAID
> setup.

It is similar, but updating in-place compressed files can create
this situation even without RAID5 RMW:

  https://btrfs.wiki.kernel.org/index.php/Gotchas#Fragmentation

  "Files with a lot of random writes can become heavily fragmented
  (1+ extents) causing thrashing on HDDs and excessive multi-second
  spikes of CPU load on systems with an SSD or large amount a RAM.
  ... Symptoms include btrfs-transacti and btrfs-endio-wri taking
  up a lot of CPU time (in spikes, possibly triggered by syncs)."
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
[ ... ]

>>> Snapshots work fine with nodatacow, each block gets CoW'ed
>>> once when it's first written to, and then goes back to being
>>> NOCOW.
>>> The only caveat is that you probably want to defrag either
>>> once everything has been rewritten, or right after the
>>> snapshot.

>> I thought defrag would unshare the reflinks?

> Which is exactly why you might want to do it. It will get rid
> of the overhead of the single CoW operation, and it will make
> sure there is minimal fragmentation.
> IOW, when mixing NOCOW and snapshots, you either have to use
> extra space, or you deal with performance issues. Aside from
> that though, it works just fine and has no special issues as
> compared to snapshots without NOCOW.

The above illustrates my guess as to why RHEL 7.4 dropped Btrfs
support, which is:

* RHEL is sold to managers who want to minimize the cost of
  upgrades and sysadmin skills.
* Every time a customer creates a ticket, RH profits fall.
* RH had adopted 'ext3' because it was an in-place upgrade from
  'ext2' and "just worked", 'ext4' because it was an in-place
  upgrade from 'ext3' and was supposed to "just work", and then
  was looking at Btrfs as an in-place upgrade from 'ext4', and
  presumably also a replacement for MD RAID, that would "just
  work".
* 'ext4' (and XFS before that) already created a few years ago
  trouble because of the 'O_PONIES' controversy.
* Not only does Btrfs still have "challenges" as to multi-device
  functionality, and in-place upgrades from 'ext4' have
  "challenges" too, it has many "special cases" that need skill
  and discretion to handle, because it tries to cover so many
  different cases, and the first thing many a RH customer would do
  is to create a ticket to ask what to do, or how to fix a choice
  already made.
Try to imagine the impact on the RH ticketing system of a switch
from 'ext4' to Btrfs, with explanations like the above, about NOCOW,
defrag, snapshots, balance, reflinks, and the exact order in which
they have to be performed for best results.
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
[ ... ]

> But I've talked to some friend at the local super computing
> centre and they have rather general issues with CoW at their
> virtualisation cluster.

Amazing news! :-)

> Like SUSE's snapper making many snapshots leading the storage
> images of VMs apparently to explode (in terms of space usage).

Well, this could be an argument that some of your friends are being
"challenged" by running the storage systems of a "super computing
centre" and that they could become "more prepared" about system
administration, for example as to the principle "know which tool to
use for which workload". Or else it could be an argument that they
expect Btrfs to do their job while they watch cat videos from the
intertubes. :-)
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
> We use the crcs to catch storage gone wrong, [ ... ]

And that's an opportunistically feasible idea given that current
CPUs can do that in real-time.

> [ ... ] It's possible to protect against all three without COW,
> but all solutions have their own tradeoffs and this is the setup
> we chose. It's easy to trust and easy to debug and at scale that
> really helps.

Indeed all filesystem designs have pathological workloads, and
system administrators and application developers who are "more
prepared" know which one is best for which workload, or try to
figure it out.

> Some databases also crc, and all drives have correction bits of
> some kind. There's nothing wrong with crcs happening at lots
> of layers.

Well, there is: in theory checksumming should be end-to-end, that
is entirely application-level, so applications that don't need it
don't pay the price; but having it done at other layers can help
the very many applications that don't do it and should, and it is
cheap, and can help when troubleshooting exactly where the problem
is. It is an opportunistic thing to do.

> [ ... ] My real goal is to make COW fast enough that we can
> leave it on for the database applications too. Obviously I
> haven't quite finished that one yet ;) [ ... ]

And this worries me because it portends the usual "marketing" goal
of making Btrfs all things to all workloads, the "OpenStack of
filesystems", with little consideration for complexity,
maintainability, or even sometimes reality.

The reality is that all known storage media have hugely anisotropic
performance envelopes, as to functionality, cost, speed, and
reliability, and there is no way to have an automagic filesystem
that "just works" in all cases, despite the constant demands for
one by "less prepared" storage administrators and application
developers.
The reality is also that if one such filesystem could automagically
adapt to cover optimally the performance envelopes of every possible
device and workload, it would be so complex as to be unmaintainable
in practice.

So Btrfs, in its base "Rodeh" functionality, with COW, checksums,
subvolumes, snapshots, *on a single device*, works pretty well and
reliably and is already very useful, for most workloads. Some people
also like some of its exotic complexities like in-place compression
and defragmentation, but they come at a high cost.

For workloads that inflict lots of small random in-place updates on
storage, like tablespaces for DBMSes etc., perhaps simpler, less
featureful storage abstraction layers are more appropriate, from
OCFS2 to simple DM/LVM2 LVs, and Btrfs NOCOW approximates them well.

BTW as to the specifics of DBMSes and filesystems, there is a
classic paper making eminently reasonable, practical suggestions
that have been ignored for 35 years and some:

  %A M. R. Stonebraker
  %T Operating system support for database management
  %J CACM
  %V 24
  %D JUL 1981
  %P 412-418
Re: Btrfs + compression = slow performance and high cpu usage
[ ... ] > This is the "storage for beginners" version, what happens in > practice however depends a lot on specific workload profile > (typical read/write size and latencies and rates), caching and > queueing algorithms in both Linux and the HA firmware. To add a bit of slightly more advanced discussion, the main reason for larger strips ("chunk size") is to avoid the huge latencies of disk rotation using unsynchronized disk drives, as detailed here: http://www.sabi.co.uk/blog/12-thr.html?120310#120310 That relates weakly to Btrfs.
Re: Btrfs + compression = slow performance and high cpu usage
>> [ ... ] a "RAID5 with 128KiB writes and a 768KiB stripe >> size". [ ... ] several back-to-back 128KiB writes [ ... ] get >> merged by the 3ware firmware only if it has a persistent >> cache, and maybe your 3ware does not have one, > KOS: No I don't have persistent cache. Only the 512 Mb cache > on board of a controller, that is BBU. If it is a persistent cache, that can be battery-backed (as I wrote, but it seems that you don't have too much time to read replies) then the size of the write, 128KiB or not, should not matter much; the write will be reported complete when it hits the persistent cache (whichever technology it uses), and then the HA firmware will spill write cached data to the disks using the optimal operation width. Unless the 3ware firmware is really terrible (and depending on model and vintage it can be amazingly terrible) or the battery is no longer recharging and then the host adapter switches to write-through. That you see very different rates between uncompressed and compressed writes, where the main difference is the limitation on the segment size, seems to indicate that compressed writes involve a lot of RMW, that is sub-stripe updates. As I mentioned already, it would be interesting to retry 'dd' with different 'bs' values without compression and with 'sync' (or 'direct' which only makes sense without compression). > If I had additional SSD caching on the controller I would have > mentioned it. So far you had not mentioned the presence of BBU cache either, which is equivalent, even if in one of your previous messages (which I try to read carefully) there were these lines: Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU So perhaps someone else would have checked long ago the status of the BBU and whether the "No Write Cache if Bad BBU" case has happened.
If the BBU is still working and the policy is still "WriteBack" then things are stranger still. > I was also under impression, that in a situation where mostly > extra large files will be stored on the massive, the bigger > strip size would indeed increase the speed, thus I went > with the 256 Kb strip size. That runs counter to this simple story: suppose a program is doing 64KiB IO: * For *reads*, there are 4 data drives and the strip size is 16KiB: the 64KiB will be read in parallel on 4 drives. If the strip size is 256KiB then the 64KiB will be read sequentially from just one disk, and 4 successive reads will be read sequentially from the same drive. * For *writes* on a parity RAID like RAID5 things are much, much more extreme: the 64KiB will be written with 16KiB strips on a 5-wide RAID5 set in parallel to 5 drives, as one full-stripe update that needs no RMW. But with 256KiB strips it will partially update 5 drives, because the stripe is 1024+256KiB, and it needs to do RMW, and four successive 64KiB writes will need to do that too, even if only one drive is updated. Usually for RAID5 there is an optimization that means that only the specific target drive and the parity drive(s) need RMW, but it is still very expensive. This is the "storage for beginners" version, what happens in practice however depends a lot on specific workload profile (typical read/write size and latencies and rates), caching and queueing algorithms in both Linux and the HA firmware. > Would I be correct in assuming that the RAID strip size of 128 > Kb will be a better choice if one plans to use the BTRFS with > compression? That would need to be tested, because of "depends a lot on specific workload profile, caching and queueing algorithms", but my expectation is that the lower the better. Given that you have 4 drives giving a 3+1 RAID set, perhaps a 32KiB or 64KiB strip size, giving a data stripe size of 96KiB or 192KiB, would be better.
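The simple story above can be put into numbers; this is just back-of-the-envelope shell arithmetic over the geometries mentioned (4 data drives plus parity, a 64KiB IO), not a measurement:

```shell
# A write avoids RMW only if it covers whole stripes: with 16KiB strips
# a 64KiB write is exactly one full data stripe (4 x 16KiB), so parity
# can be computed from the new data alone; with 256KiB strips the same
# write covers a small fraction of one stripe, forcing read-modify-write.
data_drives=4
io_kib=64
for strip_kib in 16 256; do
    stripe_kib=$((data_drives * strip_kib))
    if [ $((io_kib % stripe_kib)) -eq 0 ]; then
        echo "strip ${strip_kib}KiB: data stripe ${stripe_kib}KiB, full-stripe write, no RMW"
    else
        echo "strip ${strip_kib}KiB: data stripe ${stripe_kib}KiB, partial-stripe write, RMW"
    fi
done
```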
Re: Btrfs + compression = slow performance and high cpu usage
> Peter, I don't think the filefrag is showing the correct > fragmentation status of the file when the compression is used. As reported in a previous message, the output of 'filefrag' (and 'filefrag -v') can be used to see what is going on: filefrag /mnt/sde3/testfile /mnt/sde3/testfile: 49287 extents found Most of the latter extents are mercifully rather contiguous, their size is just limited by the compression code, here is an extract from 'filefrag -v' from around the middle:
24757: 1321888.. 1321919: 11339579.. 11339610: 32: 11339594:
24758: 1321920.. 1321951: 11339597.. 11339628: 32: 11339611:
24759: 1321952.. 1321983: 11339615.. 11339646: 32: 11339629:
24760: 1321984.. 1322015: 11339632.. 11339663: 32: 11339647:
24761: 1322016.. 1322047: 11339649.. 11339680: 32: 11339664:
24762: 1322048.. 1322079: 11339667.. 11339698: 32: 11339681:
24763: 1322080.. 1322111: 11339686.. 11339717: 32: 11339699:
24764: 1322112.. 1322143: 11339703.. 11339734: 32: 11339718:
24765: 1322144.. 1322175: 11339720.. 11339751: 32: 11339735:
24766: 1322176.. 1322207: 11339737.. 11339768: 32: 11339752:
24767: 1322208.. 1322239: 11339754.. 11339785: 32: 11339769:
24768: 1322240.. 1322271: 11339771.. 11339802: 32: 11339786:
24769: 1322272.. 1322303: 11339789.. 11339820: 32: 11339803:
But again this is on a fresh empty Btrfs volume. As I wrote, "their size is just limited by the compression code" which results in "128KiB writes". On a "fresh empty Btrfs volume" the compressed extents limited to 128KiB also happen to be pretty physically contiguous, but on a more fragmented free space list they can be more scattered. As I already wrote, the main issue here seems to be that we are talking about a "RAID5 with 128KiB writes and a 768KiB stripe size".
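As a sanity check on the excerpt above (pure arithmetic, no filesystem needed): each listed extent spans 32 blocks, and with 4KiB blocks that is exactly the 128KiB maximum range the compression code processes at a time:

```shell
# 32 blocks per extent x 4KiB per block = 128KiB, the cap on the range
# of a file that Btrfs compression handles in one go, hence the extent
# count of roughly filesize/128KiB on a compressed file.
blocks_per_extent=32
block_kib=4
extent_kib=$((blocks_per_extent * block_kib))
echo "${extent_kib}KiB per compressed extent"
```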
On MD RAID5 the slowdown because of RMW seems only to be around 30-40%, but it looks like several back-to-back 128KiB writes get merged by the Linux IO subsystem (not sure whether that's thoroughly legal), and perhaps they get merged by the 3ware firmware only if it has a persistent cache, and maybe your 3ware does not have one, but you have kept your counsel as to that. My impression is that you read the Btrfs documentation and my replies with a lot less attention than I write them. Some of the things you have done and said make me think that you did not read https://btrfs.wiki.kernel.org/index.php/Compression and 'man 5 btrfs', for example: "How does compression interact with direct IO or COW? Compression does not work with DIO, does work with COW and does not work for NOCOW files. If a file is opened in DIO mode, it will fall back to buffered IO. Are there speed penalties when doing random access to a compressed file? Yes. The compression processes ranges of a file of maximum size 128 KiB and compresses each 4 KiB (or page-sized) block separately." > I am currently defragmenting that mountpoint, ensuring that > everything is compressed with zlib. Defragmenting the used space might help find more contiguous allocations. > p.s. any other suggestion that might help with the fragmentation > and data allocation. Should I try and rebalance the data on the > drive? Yes, regularly, as that defragments the unused space.
Re: Btrfs + compression = slow performance and high cpu usage
> [ ... ] It is hard for me to see a speed issue here with > Btrfs: for comparison I have done a simple test with both a > 3+1 MD RAID5 set with a 256KiB chunk size and a single block > device on "contemporary" 1TB/2TB drives, capable of sequential > transfer rates of 150-190MB/s: [ ... ] The figures after this are a bit on the low side because I realized looking at 'vmstat' that the source block device 'sda6' was being a bottleneck, as the host has only 8GiB instead of the 16GiB I misremembered, and also 'sda' is a relatively slow flash SSD that reads at most around 220MB/s. So I have redone the simple tests with a transfer size of 3GB, which ensures that all reads are from memory cache: with compression: soft# mount -t btrfs -o commit=10,compress-force=zlib /dev/md/test5 /mnt/test5 soft# mount -t btrfs -o commit=10,compress-force=zlib /dev/sdg3 /mnt/sdg3 soft# rm -f /mnt/test5/testfile /mnt/sdg3/testfile soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile bs=1M count=3000 conv=fsync 3000+0 records in 3000+0 records out 3145728000 bytes (3.1 GB) copied, 15.8869 s, 198 MB/s 0.00user 2.80system 0:15.88elapsed 17%CPU (0avgtext+0avgdata 3056maxresident)k 0inputs+6148256outputs (0major+346minor)pagefaults 0swaps soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sdg3/testfile bs=1M count=3000 conv=fsync 3000+0 records in 3000+0 records out 3145728000 bytes (3.1 GB) copied, 16.9663 s, 185 MB/s 0.00user 2.61system 0:16.96elapsed 15%CPU (0avgtext+0avgdata 3056maxresident)k 0inputs+6144672outputs (0major+346minor)pagefaults 0swaps soft# btrfs fi df /mnt/test5/ | grep Data Data, single: total=3.00GiB, used=2.28GiB soft# btrfs fi df /mnt/sdg3 | grep Data Data, single: total=3.00GiB, used=2.28GiB soft# filefrag /mnt/test5/testfile /mnt/sdg3/testfile /mnt/test5/testfile: 8811 extents found /mnt/sdg3/testfile: 8759 extents found Slightly weird that with a 3GB size the number of extents is almost double that for the 10GB, but I guess that
depends on speed. Then without compression: soft# mount -t btrfs -o commit=10 /dev/md/test5 /mnt/test5 soft# mount -t btrfs -o commit=10 /dev/sdg3 /mnt/sdg3 soft# rm -f /mnt/test5/testfile /mnt/sdg3/testfile soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile bs=1M count=3000 conv=fsync 3000+0 records in 3000+0 records out 3145728000 bytes (3.1 GB) copied, 8.06841 s, 390 MB/s 0.00user 3.90system 0:08.80elapsed 44%CPU (0avgtext+0avgdata 2880maxresident)k 0inputs+6153856outputs (0major+345minor)pagefaults 0swaps soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sdg3/testfile bs=1M count=3000 conv=fsync 3000+0 records in 3000+0 records out 3145728000 bytes (3.1 GB) copied, 30.215 s, 104 MB/s 0.00user 4.82system 0:30.93elapsed 15%CPU (0avgtext+0avgdata 2888maxresident)k 0inputs+6152128outputs (0major+347minor)pagefaults 0swaps soft# filefrag /mnt/test5/testfile /mnt/sdg3/testfile /mnt/test5/testfile: 5 extents found /mnt/sdg3/testfile: 3 extents found Also added: soft# rm -f /mnt/test5/testfile /mnt/sdg3/testfile soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 bs=128k count=3000 | dd iflag=fullblock of=/mnt/test5/testfile bs=128k oflag=sync 3000+0 records in 3000+0 records out 393216000 bytes (393 MB) copied, 160.315 s, 2.5 MB/s 0.02user 0.46system 2:40.31elapsed 0%CPU (0avgtext+0avgdata 1992maxresident)k 0inputs+0outputs (0major+124minor)pagefaults 0swaps 3000+0 records in 3000+0 records out 393216000 bytes (393 MB) copied, 160.365 s, 2.5 MB/s soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 bs=128k count=3000 | dd iflag=fullblock of=/mnt/sdg3/testfile bs=128k oflag=sync 3000+0 records in 3000+0 records out 393216000 bytes (393 MB) copied, 113.51 s, 3.5 MB/s 0.02user 0.56system 1:53.51elapsed 0%CPU (0avgtext+0avgdata 2156maxresident)k 0inputs+0outputs (0major+120minor)pagefaults 0swaps 3000+0 records in 3000+0 records out 393216000 bytes (393 MB) copied, 113.544 s, 3.5 MB/s soft# filefrag /mnt/test5/testfile /mnt/sdg3/testfile 
/mnt/test5/testfile: 1 extent found /mnt/sdg3/testfile: 22 extents found soft# rm -f /mnt/test5/testfile /mnt/sdg3/testfile soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 bs=1M count=1000 | dd iflag=fullblock of=/mnt/test5/testfile bs=1M oflag=sync
Re: Btrfs + compression = slow performance and high cpu usage
[ ... ] > Also added: Feeling very generous :-) today, adding these too: soft# mkfs.btrfs -mraid10 -draid10 -L test5 /dev/sd{b,c,d,e}3 [ ... ] soft# mount -t btrfs -o commit=10,compress-force=zlib /dev/sdb3 /mnt/test5 soft# rm -f /mnt/test5/testfile soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile bs=1M count=3000 conv=fsync 3000+0 records in 3000+0 records out 3145728000 bytes (3.1 GB) copied, 14.2166 s, 221 MB/s 0.00user 2.54system 0:14.21elapsed 17%CPU (0avgtext+0avgdata 3056maxresident)k 0inputs+6144768outputs (0major+346minor)pagefaults 0swaps soft# rm -f /mnt/test5/testfile soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile bs=128k count=3000 conv=fsync 3000+0 records in 3000+0 records out 393216000 bytes (393 MB) copied, 2.05933 s, 191 MB/s 0.00user 0.32system 0:02.06elapsed 15%CPU (0avgtext+0avgdata 1996maxresident)k 0inputs+772512outputs (0major+124minor)pagefaults 0swaps soft# rm -f /mnt/test5/testfile soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 bs=1M count=1000 | dd iflag=fullblock of=/mnt/test5/testfile bs=1M oflag=sync 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 60.6019 s, 17.3 MB/s 0.01user 1.04system 1:00.60elapsed 1%CPU (0avgtext+0avgdata 2888maxresident)k 0inputs+0outputs (0major+348minor)pagefaults 0swaps 1000+0 records in 1000+0 records out 1048576000 bytes (1.0 GB) copied, 60.4116 s, 17.4 MB/s soft# rm -f /mnt/test5/testfile soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 bs=128k count=3000 | dd iflag=fullblock of=/mnt/test5/testfile bs=128k oflag=sync 3000+0 records in 3000+0 records out 393216000 bytes (393 MB) copied, 148.04 s, 2.7 MB/s 0.00user 0.62system 2:28.04elapsed 0%CPU (0avgtext+0avgdata 1996maxresident)k 0inputs+0outputs (0major+125minor)pagefaults 0swaps 3000+0 records in 3000+0 records out 393216000 bytes (393 MB) copied, 148.083 s, 2.7 MB/s soft# sysctl vm/drop_caches=3 vm.drop_caches = 3 soft# /usr/bin/time dd iflag=fullblock 
if=/mnt/test5/testfile bs=128k count=3000 of=/dev/zero 3000+0 records in 3000+0 records out 393216000 bytes (393 MB) copied, 1.09729 s, 358 MB/s 0.00user 0.24system 0:01.10elapsed 23%CPU (0avgtext+0avgdata 2164maxresident)k 459768inputs+0outputs (3major+121minor)pagefaults 0swaps
Re: Btrfs + compression = slow performance and high cpu usage
[ ... ] > grep 'model name' /proc/cpuinfo | sort -u > model name : Intel(R) Xeon(R) CPU E5645 @ 2.40GHz Good, contemporary CPU with all accelerations. > The sda device is a hardware RAID5 consisting of 4x8TB drives. [ ... ] > Strip Size : 256 KB So the full RMW data stripe length is 768KiB. > [ ... ] don't see the previously reported behaviour of one of > the kworker consuming 100% of the cputime, but the write speed > difference between the compression ON vs OFF is pretty large. That's weird; of course 'lzo' is a lot cheaper than 'zlib', but in my test the much higher CPU time of the latter was spread across many CPUs, while in your case it wasn't, even if the E5645 has 6 CPUs and can do 12 threads. That seemed to point to some high cost of finding free blocks, that is a very fragmented free list, or something else. > dd if=/dev/sdb of=./testing count=5120 bs=1M status=progress oflag=direct > 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 26.0685 s, 206 MB/s The results with 'oflag=direct' are not relevant, because Btrfs behaves "differently" with that. > mountflags: > (rw,relatime,compress-force=zlib,space_cache=v2,subvolid=5,subvol=/) [ ... ] > dd if=/dev/sdb of=./testing count=5120 bs=1M status=progress conv=fsync > 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 77.4845 s, 69.3 MB/s > mountflags: > (rw,relatime,compress-force=lzo,space_cache=v2,subvolid=5,subvol=/) [ ... ] > dd if=/dev/sdb of=./testing count=5120 bs=1M status=progress conv=fsync > 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 122.321 s, 43.9 MB/s That's pretty good for a RAID5 with 128KiB writes and a 768KiB stripe size, on a 3ware, and it looks like the hw host adapter does not have a persistent cache (battery backed usually). My guess is that watching transfer rates and latencies with 'iostat -dk -zyx 1' did not happen. > mountflags: (rw,relatime,space_cache=v2,subvolid=5,subvol=/) [ ... 
] > dd if=/dev/sdb of=./testing count=5120 bs=1M status=progress conv=fsync > 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 10.1033 s, 531 MB/s I had mentioned in my previous reply the output of 'filefrag'. That to me seems relevant here, because of RAID5 RMW and maximum extent size with Btrfs compression and strip/stripe size. Perhaps redoing the tests with a 128KiB 'bs' *without* compression would be interesting, perhaps even with 'oflag=sync' instead of 'conv=fsync'. It is hard for me to see a speed issue here with Btrfs: for comparison I have done a simple test with both a 3+1 MD RAID5 set with a 256KiB chunk size and a single block device on "contemporary" 1TB/2TB drives, capable of sequential transfer rates of 150-190MB/s: soft# grep -A2 sdb3 /proc/mdstat md127 : active raid5 sde3[4] sdd3[2] sdc3[1] sdb3[0] 729808128 blocks super 1.0 level 5, 256k chunk, algorithm 2 [4/4] [UUUU] with compression: soft# mount -t btrfs -o commit=10,compress-force=zlib /dev/md/test5 /mnt/test5 soft# mount -t btrfs -o commit=10,compress-force=zlib /dev/sdg3 /mnt/sdg3 soft# rm -f /mnt/test5/testfile /mnt/sdg3/testfile soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile bs=1M count=10000 conv=fsync 10000+0 records in 10000+0 records out 10485760000 bytes (10 GB) copied, 94.3605 s, 111 MB/s 0.01user 12.59system 1:34.36elapsed 13%CPU (0avgtext+0avgdata 2932maxresident)k 13042144inputs+20482144outputs (3major+345minor)pagefaults 0swaps soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sdg3/testfile bs=1M count=10000 conv=fsync 10000+0 records in 10000+0 records out 10485760000 bytes (10 GB) copied, 93.5885 s, 112 MB/s 0.03user 12.35system 1:33.59elapsed 13%CPU (0avgtext+0avgdata 2940maxresident)k 13042144inputs+20482400outputs (3major+346minor)pagefaults 0swaps soft# filefrag /mnt/test5/testfile /mnt/sdg3/testfile /mnt/test5/testfile: 48945 extents found /mnt/sdg3/testfile: 49029 extents found soft# btrfs fi df /mnt/test5/ | grep Data Data, single: total=7.00GiB, used=6.55GiB soft# btrfs fi
df /mnt/sdg3 | grep Data Data, single: total=7.00GiB, used=6.55GiB soft# sysctl vm/drop_caches=3 vm.drop_caches = 3 soft# /usr/bin/time dd iflag=fullblock if=/mnt/test5/testfile bs=1M count=10000 of=/dev/zero 10000+0 records in 10000+0 records out 10485760000 bytes (10 GB) copied, 23.2975 s, 450 MB/s 0.01user 7.59system 0:23.32elapsed 32%CPU (0avgtext+0avgdata 2932maxresident)k 13759624inputs+0outputs (3major+344minor)pagefaults 0swaps soft# sysctl vm/drop_caches=3 vm.drop_caches = 3 soft# /usr/bin/time dd iflag=fullblock if=/mnt/sdg3/testfile bs=1M count=10000 of=/dev/zero 10000+0 records in 10000+0 records out 10485760000 bytes (10 GB) copied, 35.0032 s, 300 MB/s 0.01user 8.46system 0:35.03elapsed 24%CPU (0avgtext+0avgdata 2924maxresident)k 13750568inputs+0outputs (3major+345minor)pagefaults 0swaps and
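For anyone following the arithmetic of the "RAID5 with 128KiB writes and a 768KiB stripe size" remark above, a quick sketch (numbers taken from the reported 4x8TB array with 256KB strips) of why every compressed write is necessarily a sub-stripe update:

```shell
# A 4-drive RAID5 has 3 data strips per stripe; Btrfs compression caps
# writes at 128KiB, which covers only a sixth of the 768KiB data
# stripe, so each such write needs a read-modify-write of the parity.
drives=4
strip_kib=256
write_kib=128
data_stripe_kib=$(((drives - 1) * strip_kib))
echo "data stripe: ${data_stripe_kib}KiB"
echo "fraction touched per write: ${write_kib}/${data_stripe_kib}"
```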
Re: Btrfs + compression = slow performance and high cpu usage
In addition to my previous "it does not happen here" comment, if someone is reading this thread, there are some other interesting details: > When the compression is turned off, I am able to get the > maximum 500-600 mb/s write speed on this disk (raid array) > with minimal cpu usage. No details on whether it is a parity RAID or not. > btrfs device usage /mnt/arh-backup1/ > /dev/sda, ID: 2 >Device size:21.83TiB >Device slack: 0.00B >Data,single: 9.29TiB >Metadata,single:46.00GiB >System,single: 32.00MiB >Unallocated:12.49TiB That's exactly 24TB of "Device size", of which around 45% are used, and the string "backup" may suggest that the content is backups, which may indicate a very fragmented freespace. Of course compression does not help with that, in my freshly created Btrfs volume I get as expected: soft# umount /mnt/sde3 soft# mount -t btrfs -o commit=10 /dev/sde3 /mnt/sde3 soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sde3/testfile bs=1M count=10000 conv=fsync 10000+0 records in 10000+0 records out 10485760000 bytes (10 GB) copied, 103.747 s, 101 MB/s 0.00user 11.56system 1:44.86elapsed 11%CPU (0avgtext+0avgdata 3072maxresident)k 20480672inputs+20498272outputs (1major+349minor)pagefaults 0swaps soft# filefrag /mnt/sde3/testfile /mnt/sde3/testfile: 11 extents found versus: soft# umount /mnt/sde3 soft# mount -t btrfs -o commit=10,compress=lzo,compress-force /dev/sde3 /mnt/sde3 soft# /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sde3/testfile bs=1M count=10000 conv=fsync 10000+0 records in 10000+0 records out 10485760000 bytes (10 GB) copied, 109.051 s, 96.2 MB/s 0.02user 13.03system 1:49.49elapsed 11%CPU (0avgtext+0avgdata 3068maxresident)k 20494784inputs+20492320outputs (1major+347minor)pagefaults 0swaps soft# filefrag /mnt/sde3/testfile /mnt/sde3/testfile: 49287 extents found Most of the latter extents are mercifully rather contiguous, their size is just limited by the compression code, here is an extract from 'filefrag -v' from around the middle:
24757: 1321888.. 1321919: 11339579.. 11339610: 32: 11339594:
24758: 1321920.. 1321951: 11339597.. 11339628: 32: 11339611:
24759: 1321952.. 1321983: 11339615.. 11339646: 32: 11339629:
24760: 1321984.. 1322015: 11339632.. 11339663: 32: 11339647:
24761: 1322016.. 1322047: 11339649.. 11339680: 32: 11339664:
24762: 1322048.. 1322079: 11339667.. 11339698: 32: 11339681:
24763: 1322080.. 1322111: 11339686.. 11339717: 32: 11339699:
24764: 1322112.. 1322143: 11339703.. 11339734: 32: 11339718:
24765: 1322144.. 1322175: 11339720.. 11339751: 32: 11339735:
24766: 1322176.. 1322207: 11339737.. 11339768: 32: 11339752:
24767: 1322208.. 1322239: 11339754.. 11339785: 32: 11339769:
24768: 1322240.. 1322271: 11339771.. 11339802: 32: 11339786:
24769: 1322272.. 1322303: 11339789.. 11339820: 32: 11339803:
But again this is on a fresh empty Btrfs volume.
Re: Btrfs + compression = slow performance and high cpu usage
> I am stuck with a problem of btrfs slow performance when using > compression. [ ... ] That to me looks like an issue with speed, not performance, and in particular with PEBCAK issues. As to high CPU usage, when you find a way to do both compression and checksumming without using much CPU time, please send patches urgently :-). In your case the increase in CPU time is bizarre. I have the Ubuntu 4.4 "lts-xenial" kernel and what you report does not happen here (with a few little changes): soft# grep 'model name' /proc/cpuinfo | sort -u model name : AMD FX(tm)-6100 Six-Core Processor soft# cpufreq-info | grep 'current CPU frequency' current CPU frequency is 3.30 GHz (asserted by call to hardware). current CPU frequency is 3.30 GHz (asserted by call to hardware). current CPU frequency is 3.30 GHz (asserted by call to hardware). current CPU frequency is 3.30 GHz (asserted by call to hardware). current CPU frequency is 3.30 GHz (asserted by call to hardware). current CPU frequency is 3.30 GHz (asserted by call to hardware). soft# lsscsi | grep 'sd[ae]' [0:0:0:0]diskATA HFS256G32MNB-220 3L00 /dev/sda [5:0:0:0]diskATA ST2000DM001-1CH1 CC44 /dev/sde soft# mkfs.btrfs -f /dev/sde3 [ ... ] soft# mount -t btrfs -o discard,autodefrag,compress=lzo,compress-force,commit=10 /dev/sde3 /mnt/sde3 soft# df /dev/sda6 /mnt/sde3 Filesystem 1M-blocks Used Available Use% Mounted on /dev/sda6 90048 76046 14003 85% / /dev/sde3 23756819235501 1% /mnt/sde3 The above is useful context information that was "amazingly" omitted from your report.
In dmesg I see (note the "force zlib compression"): [327730.917285] BTRFS info (device sde3): turning on discard [327730.917294] BTRFS info (device sde3): enabling auto defrag [327730.917300] BTRFS info (device sde3): setting 8 feature flag [327730.917304] BTRFS info (device sde3): force zlib compression [327730.917313] BTRFS info (device sde3): disk space caching is enabled [327730.917315] BTRFS: has skinny extents [327730.917317] BTRFS: flagging fs with big metadata feature [327730.920740] BTRFS: creating UUID tree and the result is: soft# pv -tpreb /dev/sda6 | time dd iflag=fullblock of=/mnt/sde3/testfile bs=1M count=10000 oflag=direct 10000+0 records in 17MB/s] [==>] 11% ETA 0:15:06 10000+0 records out 10485760000 bytes (10 GB) copied, 112.845 s, 92.9 MB/s 0.05user 9.93system 1:53.20elapsed 8%CPU (0avgtext+0avgdata 3016maxresident)k 120inputs+20496000outputs (1major+346minor)pagefaults 0swaps 9.77GB 0:01:53 [88.3MB/s] [==>] 11% soft# btrfs fi df /mnt/sde3/ Data, single: total=10.01GiB, used=9.77GiB System, DUP: total=8.00MiB, used=16.00KiB Metadata, DUP: total=1.00GiB, used=11.66MiB GlobalReserve, single: total=16.00MiB, used=0.00B As it was running system CPU time was under 20% of one CPU: top - 18:57:29 up 3 days, 19:27, 4 users, load average: 5.44, 2.82, 1.45 Tasks: 325 total, 1 running, 324 sleeping, 0 stopped, 0 zombie %Cpu0 : 0.0 us, 2.3 sy, 0.0 ni, 91.3 id, 6.3 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu1 : 0.0 us, 1.3 sy, 0.0 ni, 78.5 id, 20.2 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu2 : 0.3 us, 5.8 sy, 0.0 ni, 81.0 id, 12.5 wa, 0.0 hi, 0.3 si, 0.0 st %Cpu3 : 0.3 us, 3.4 sy, 0.0 ni, 91.9 id, 4.4 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu4 : 0.3 us, 10.6 sy, 0.0 ni, 55.4 id, 33.7 wa, 0.0 hi, 0.0 si, 0.0 st %Cpu5 : 0.0 us, 0.3 sy, 0.0 ni, 99.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st KiB Mem: 8120660 total, 5162236 used, 2958424 free, 4440100 buffers KiB Swap:0 total,0 used,0 free.
351848 cached Mem PID PPID USER PR NIVIRTRESDATA %CPU %MEM TIME+ TTY COMMAND 21047 21046 root 20 08872 26161364 12.9 0.0 0:02.31 pts/3dd iflag=fullblo+ 21045 3535 root 20 07928 1948 460 12.3 0.0 0:00.72 pts/3pv -tpreb /dev/s+ 21019 2 root 20 0 0 0 0 1.3 0.0 0:42.88 ? [kworker/u16:1] Of course "oflag=direct" is a rather "optimistic" option in this context, so I tried again with something more sensible: soft# pv -tpreb /dev/sda6 | time dd iflag=fullblock of=/mnt/sde3/testfile bs=1M count=10000 conv=fsync 10000+0 records in .4MB/s] [==>] 11% ETA 0:14:41 10000+0 records out 10485760000 bytes (10 GB) copied, 110.523 s, 94.9 MB/s 0.03user 8.94system 1:50.71elapsed 8%CPU (0avgtext+0avgdata 3024maxresident)k 136inputs+20499648outputs (1major+348minor)pagefaults 0swaps 9.77GB 0:01:50 [90.3MB/s] [==>] 11% soft# btrfs fi df /mnt/sde3/ Data, single: total=7.01GiB, used=6.35GiB System, DUP: total=8.00MiB, used=16.00KiB Metadata, DUP: total=1.00GiB, used=15.81MiB GlobalReserve,
Re: kernel btrfs file system wedged -- is it toast?
> [ ... ] announce loudly and clearly to any potential users, in > multiple places (perhaps a key announcement in a few places > and links to that announcement from many places, https://btrfs.wiki.kernel.org/index.php/Gotchas#Having_many_subvolumes_can_be_very_slow > ... DO expect to first have to learn, the hard way, of > ... whatever special mitigations might apply in ones > ... particular circumstances, before considering deploying > ... btrfs into a production environment where this, or other > ... (what other?) surprising limitations of btrfs may apply. In computer jargon that is called "being a system engineer". > All the prominent places that respond to the question of > whether btrfs is ready for production use (spanning several > years now) should if possible display this warning. https://btrfs.wiki.kernel.org/index.php/Status "The table below aims to serve as an overview for the stability status of the features BTRFS supports. While a feature may be functionally safe and reliable, it does not necessarily mean that its useful, for example in meeting your performance expectations for your specific workload." > [ ... ] Back in my day, such a performance bug would have made > the software containing it unreleasable, _especially_ in > software such as a major file system that is expected to > provide reliable service, where "reliable" means both > preserving data integrity and doing so within an order of > magnitude of a reasonably expected time. 
For the past several decades, since perhaps the Manchester MARK I (or the ZUSE probably), it has been known to "system engineers" and "programmers" that most features of "hardware" and "software" have very anisotropic performance envelopes, both as to speed and usability and reliability, and not all potentially syntactically valid combinations of features are equally robust and excellent under every possible workload, and indeed very few are, and it is part of "being a system engineer" or "being a programmer" to develop insight and experience, leading to knowledge ideally, as to what combinations work well and which work badly. Considering somewhat imprecise car analogies, jet engines on cars tend not to be wholly desirable, even if syntactically valid, and using cars to plow fields does not necessarily yield the highest productivity, even if also syntactically valid. General introduction about anisotropy and performance: http://www.sabi.co.uk/blog/15-two.html?151023#151023 In storage system and filesystems "system engineers" often have to confront a large number of pathological cases of anisotropy, some examples I can find quickly in my links: http://www.sabi.co.uk/blog/1005May.html?100520#100520 http://www.sabi.co.uk/blog/16-one.html?160322#160322 http://www.sabi.co.uk/blog/15-one.html?150203#150203 http://www.sabi.co.uk/blog/12-fou.html?121218#121218 http://www.sabi.co.uk/blog/0802feb.html?080210#080210 http://www.sabi.co.uk/blog/0802feb.html?080216#080216 I try to keep lists of "known pathologies" of various types here: http://www.sabi.co.uk/Notes/linuxStor.html http://www.sabi.co.uk/Notes/linuxFS.html#fsHints Even "legendary" 'ext3' has/had pathological (and common) cases: https://lwn.net/Articles/328370/ https://lwn.net/Articles/328363/ https://bugzilla.kernel.org/show_bug.cgi?id=12309 https://news.ycombinator.com/item?id=7376750 https://news.ycombinator.com/item?id=7377315 https://lkml.org/lkml/2009/4/6/331 the difference between the occasional 5+ second 
pause and the occasional 10+ second pause wasn't really all that interesting. They were both unusable, and both made me kill the background writer almost immediately Wisely, L. Torvalds writes: https://lkml.org/lkml/2010/11/10/233 ext3 [f]sync sucks. We know. All filesystems suck. They just tend to do it in different dimensions.
Re: Exactly what is wrong with RAID5/6
> [ ... ] This will make some filesystems mostly RAID1, negating > all space savings of RAID5, won't it? [ ... ] RAID5/RAID6/... don't merely save space; more precisely, they trade lower resilience and a more anisotropic and smaller performance envelope to gain lower redundancy (= save space).
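The trade in numbers, for a hypothetical n-drive set (simple arithmetic, not from the thread):

```shell
# Usable fraction of raw capacity: RAID1/RAID10 keep n/2 drives' worth,
# RAID5 keeps n-1, RAID6 keeps n-2; the space "saved" by RAID5/6 is
# paid for with parity RMW on sub-stripe writes and longer rebuilds.
n=4
raid10_usable=$((n / 2))
raid5_usable=$((n - 1))
raid6_usable=$((n - 2))
echo "of ${n} drives, usable: raid10=${raid10_usable} raid5=${raid5_usable} raid6=${raid6_usable}"
```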
Re: does using different uid/gid/forceuid/... mount options for different subvolumes work / does fuse.bindfs play nice with btrfs?
> I intend to provide different "views" of the data stored on > btrfs subvolumes. e.g. mount a subvolume in location A rw; > and ro in location B while also overwriting uids, gids, and > permissions. [ ... ] That's not how UNIX/Linux permissions and ACLs are supposed to work; perhaps you should reconsider such a poor idea. Mount options are provided to map non-UNIX/Linux filesystems into the UNIX/Linux permissions and ACLs in a crude way. If you really want, uid/gid/permissions are an inode property, and Btrfs via "reflinking" allows sharing of file data between different inodes. So you can for example create a RW snapshot of a subvolume, change all ids/permissions/ACLs, and the data space will still be shared. This is not entirely cost-free. If you really really want this it may be doable also with mount namespaces combined with user namespaces, with any filesystem, depending on your requirements: http://man7.org/linux/man-pages/man7/user_namespaces.7.html http://man7.org/linux/man-pages/man7/mount_namespaces.7.html As to mounting multiple times on multiple directories, that is possible with Linux VFS regardless of filesystem type, or by using 'mount --bind' or a variant.
Re: Struggling with file system slowness
> Trying to peg down why I have one server that has > btrfs-transacti pegged at 100% CPU for most of the time. Too little information. Is IO happening at the same time? Is compression on? Deduplicated? Lots of subvolumes? SSD? What kind of workload and file size/distribution profile? Typical causes of high CPU are many extents (your defragmentation may not have worked) and 'qgroups', especially with many subvolumes. It could be the free space cache in some rare cases. https://www.google.ca/search?num=100=images_q=cxpu_epq=btrfs-transaction To me something like this happens often, but it is not Btrfs-related: it is triggered for example by near-memory exhaustion in the kernel memory manager.
Re: Ded
> I have a btrfs filesystem mounted at /btrfs_vol/ Every N > minutes, I run bedup for deduplication of data in /btrfs_vol > Inside /btrfs_vol, I have several subvolumes (consider this as > home directories of several users) I have set individual > qgroup limits for each of these subvolumes. [ ... ] Let's hope that you have read the warnings about the potential downsides of deduplication, quota groups, and many subvolumes. It is not as if "syntactically" every feature works with every other feature without downsides, or as if they all scale up without limit.
Re: btrfs, journald logs, fragmentation, and fallocate
>> [ ... ] these extents are all over the place, they're not >> contiguous at all. 4K here, 4K there, 4K over there, back to >> 4K here next to this one, 4K over there...12K over there, 500K >> unwritten, 4K over there. This seems not so consequential on >> SSD, [ ... ] > Indeed there were recent reports that the 'ssd' mount option > causes that, IIRC by Hans van Kranenburg [ ... ] The report included news that "sometimes" the 'ssd' option is automatically switched on at mount even on hard disks. I had promised to put a summary of the issue on the Btrfs wiki, but I regret that I haven't yet done that.
Re: btrfs, journald logs, fragmentation, and fallocate
> [ ... ] Instead, you can use raw files (preferably sparse unless > there's both nocow and no snapshots). Btrfs does natively everything > you'd gain from qcow2, and does it better: you can delete the master > of a cloned image, deduplicate them, deduplicate two unrelated images; > you can turn on compression, etc. Uhm, I understand this argument in the general case (not specifically as to QCOW2 images), and it has some merit, but it is "controversial", as there are two counterarguments: * Application-specific file formats can better match application-specific requirements. * Putting advanced functionality into the filesystem code makes it more complex and less robust, and Btrfs is a bit of a major example of the consequences. I put compression and deduplication as things that I reckon make a filesystem too complex. As to snapshots, I distinguish between filetree snapshots and file snapshots: the first clones a tree as of the snapshot moment, and it is a system management feature, the second provides per-file update rollback. One sort of implies the other, but using the per-file rollback *systematically*, that is, as a feature an application can rely on, seems a bit dangerous to me. > Once you pay the btrfs performance penalty, Uhmmm, Btrfs has a small or negative performance penalty as a general purpose filesystem, and many (more or less well conceived) tests show it performs up there with the best. The only two real costs I attribute to it are the huge CPU cost of doing checksumming all the time, but that's unavoidable if one wants checksumming, and that checksumming usually requires metadata duplication, that is at least 'dup' profile for metadata, and that is indeed a bit expensive. > you may as well actually use its features, The features that I think Btrfs gives that are worth using are checksumming, metadata duplication, and filetree snapshots. > which make qcow2 redundant and harmful.
My impression is that in almost all cases QCOW2 is harmful, because it trades more IOPS and complexity for less disk space, and disk space is cheap and IOPS and complexity are expensive, but of course a lot of people know better :-). My preferred VM setup is a small essentially read-only non-QCOW2 image for '/' and everything else mounted via NFSv4, from the VM host itself or a NAS server, but again lots of people know better and use multi-terabyte-sized QCOW2 images :-).
Re: btrfs, journald logs, fragmentation, and fallocate
> [ ... ] these extents are all over the place, they're not > contiguous at all. 4K here, 4K there, 4K over there, back to > 4K here next to this one, 4K over there...12K over there, 500K > unwritten, 4K over there. This seems not so consequential on > SSD, [ ... ] Indeed there were recent reports that the 'ssd' mount option causes that, IIRC by Hans van Kranenburg (around 2017-04-17), who also noticed issues with the wandering trees in certain situations (around 2017-04-08).
Re: btrfs, journald logs, fragmentation, and fallocate
>> The gotcha though is there's a pile of data in the journal >> that would never make it to rsyslogd. If you use journalctl >> -o verbose you can see some of this. > You can send *all the info* to rsyslogd via imjournal > http://www.rsyslog.com/doc/v8-stable/configuration/modules/imjournal.html > In my setup all the data are stored in json format in the > /var/log/cee.log file: > $ head /var/log/cee.log 2017-04-28T18:41:41.931273+02:00 > venice liblogging-stdlog: @cee: { "PRIORITY": "6", "_BOOT_ID": > "a86d74bab91f44dc974c76aceb97141f", "_MACHINE_ID": [ ... ] Ahh the horror the horror, I will never be able to unsee that. The UNIX way of doing things is truly dead. >> The same behavior happens with NTFS in qcow2 files. They >> quickly end up with 100,000+ extents unless set nocow. >> It's like the worst case scenario. In a particularly demented setup I had to decatastrophize, with great pain, a Zimbra QCOW2 disk image (XFS on NFS on XFS on RAID6) containing an ever-growing Maildir email archive that had ended up with over a million widely scattered microextents: http://www.sabi.co.uk/blog/1101Jan.html?110116#110116
Re: btrfs, journald logs, fragmentation, and fallocate
> [ ... ] And that makes me wonder whether metadata > fragmentation is happening as a result. But in any case, > there's a lot of metadata being written for each journal > update compared to what's being added to the journal file. [ > ... ] That's the "wandering trees" problem in COW filesystems, and manifestations of it in Btrfs have also been reported before. If there is a workload that triggers a lot of "wandering trees" updates, then a filesystem that has "wandering trees" perhaps should not be used :-). > [ ... ] worse, a single file with 20,000 fragments; or 40,000 > separate journal files? *shrug* [ ... ] Well, depends, but probably the single file: it is more likely that the 20,000 fragments will actually be contiguous, and that there will be less metadata IO than for 40,000 separate journal files. The deeper "strategic" issue is that storage systems and filesystems in particular have very anisotropic performance envelopes, and mismatches between the envelopes of application and filesystem can be very expensive: http://www.sabi.co.uk/blog/15-two.html?151023#151023
Re: btrfs, journald logs, fragmentation, and fallocate
> Old news is that systemd-journald journals end up pretty > heavily fragmented on Btrfs due to COW. This has been discussed before in detail indeed here, but also here: http://www.sabi.co.uk/blog/15-one.html?150203#150203 > While journald uses chattr +C on journal files now, COW still > happens if the subvolume the journal is in gets snapshot. e.g. > a week old system.journal has 19000+ extents. [ ... ] It > appears to me (see below URLs pointing to example journals) > that journald fallocated in 8MiB increments but then ends up > doing 4KiB writes; [ ... ] So there are three layers of silliness here: * Writing large files slowly to a COW filesystem and snapshotting it frequently. * A filesystem that does delayed allocation instead of allocate-ahead, and does not have psychic code. * Working around that by using no-COW and preallocation with a fixed size regardless of snapshot frequency. The primary problem here is that there is no way to have slow small writes and frequent snapshots without generating small extents: if a file is written at a rate of 1MiB/hour and gets snapshotted every hour, the extent size will not be larger than 1MiB *obviously*. Filesystem-level snapshots are not designed to snapshot slowly growing files, but to snapshot changing collections of files. There are harsh tradeoffs involved. Application-level snapshots (also known as log rotations :->) are needed for special cases and finer-grained policies. The secondary problem is that a fixed preallocate of 8MiB is good only if in between snapshots the file grows by a little less than 8MiB or by substantially more.
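The arithmetic in that last argument can be sketched as a toy calculation (mine, not anything from the Btrfs code; it only assumes that a snapshot freezes every extent written so far, so later appends cannot be merged into them):

```python
# Toy model: under COW, a snapshot freezes all extents written so far,
# so the largest possible extent in a slowly growing file is bounded by
# the amount of data written between two consecutive snapshots.

def max_extent_bytes(write_rate_bytes_per_hour, snapshot_interval_hours):
    """Upper bound on extent size for a slowly written, snapshotted file."""
    return int(write_rate_bytes_per_hour * snapshot_interval_hours)

def min_extent_count(total_bytes, write_rate_bytes_per_hour, snapshot_interval_hours):
    """Lower bound on the number of extents after total_bytes are written."""
    bound = max_extent_bytes(write_rate_bytes_per_hour, snapshot_interval_hours)
    return -(-total_bytes // bound)  # ceiling division

MiB = 1024 * 1024
# The example from the text: 1 MiB/hour with hourly snapshots means no
# extent can exceed 1 MiB, so a day's worth of writing (24 MiB) cannot
# end up in fewer than 24 extents, no matter what the allocator does.
print(max_extent_bytes(1 * MiB, 1) // MiB)       # 1
print(min_extent_count(24 * MiB, 1 * MiB, 1))    # 24
```

Changing the snapshot interval, not the preallocation size, is what moves this bound, which is the point being made about journald's fixed 8MiB fallocate.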
Re: About free space fragmentation, metadata write amplification and (no)ssd
> [ ... ] This post is way too long [ ... ] Many thanks for your report, it is really useful, especially the details. > [ ... ] using rsync with --link-dest to btrfs while still > using rsync, but with btrfs subvolumes and snapshots [1]. [ > ... ] Currently there's ~35TiB of data present on the example > filesystem, with a total of just a bit more than 90,000 > subvolumes, in groups of 32 snapshots per remote host (daily > for 14 days, weekly for 3 months, monthly for a year), so > that's about 2800 'groups' of them. Inside are millions and > millions and millions of files. And the best part is... it > just works. [ ... ] That kind of arrangement, with a single large pool and very many files and many subdirectories is a worst case scenario for any filesystem type, so it is amazing-ish that it works well so far, especially with 90,000 subvolumes. As I mentioned elsewhere I would rather do a rotation of smaller volumes, to reduce risk, like "Duncan" also on this mailing list likes to do (perhaps to the opposite extreme). As to the 'ssd'/'nossd' issue that is as described in 'man 5 btrfs' (and I wonder whether 'ssd_spread' was tried too) but it is not at all obvious it should impact so much metadata handling. I'll add a new item in the "gotcha" list. It is sad that 'ssd' is used by default in your case, and it is quite perplexing that the "wandering trees" problem (that is "write amplification") is so large with 64KiB write clusters for metadata (and 'dup' profile for metadata). * Probably the metadata and data cluster sizes should be create or mount parameters instead of being implicit in the 'ssd' option. * A cluster size of 2MiB for metadata and/or data presumably has some downsides, otherwise it would be the default. I wonder whether the downsides are related to barriers...
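The subvolume count quoted above is easy to sanity-check against the stated retention scheme (a quick arithmetic check of the report's own numbers, nothing more):

```python
# Sanity check: ~2800 groups of 32 snapshots each should indeed land
# "a bit more than" 90,000 total subvolumes, as the reply says.
groups = 2800              # roughly one 'group' per remote host
snapshots_per_group = 32   # daily/weekly/monthly retention combined
total = groups * snapshots_per_group
print(total)  # 89600, i.e. within rounding of the quoted 90,000
```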
Re: btrfs filesystem keeps allocating new chunks for no apparent reason
[ ... ] >>> I've got a mostly inactive btrfs filesystem inside a virtual >>> machine somewhere that shows interesting behaviour: while no >>> interesting disk activity is going on, btrfs keeps >>> allocating new chunks, a GiB at a time. [ ... ] > Because the allocator keeps walking forward every file that is > created and then removed leaves a blank spot behind. That is a typical "log-structured" filesystem behaviour, and it is not really surprising that Btrfs does something like that, being COW. NILFS2 works like that and it requires a compactor (which does the equivalent of 'balance' and 'defrag'). It is all about tradeoffs. With Btrfs I figured out that fairly frequent 'balance' is really quite important, even with low percent values like "usage=50", and usually even 'usage=90' does not take a long time (while the default often takes a long time, I suspect needlessly). >> From the exact moment I did mount -o remount,nossd on this >> filesystem, the problem vanished. Haha. Indeed. So it switches from "COW" to more like "log structured" with the 'ssd' option. F2FS can switch like that too, with some tunables IIRC. Except that modern flash SSDs already do the "log structured" bit internally, so doing it in Btrfs does not really help that much. >> And even I saw some early prototypes inside the codes to >> allow btrfs do allocation smaller extent than required. >> (E.g. caller needs 2M extent, but btrfs returns 2 1M extents) I am surprised that this is not already there, but it is a terrible fix to a big mistake. The big mistake, that nearly all filesystem designers do, is to assume that contiguous allocation must be done by writing contiguous large blocks or extents.
This big mistake was behind the stupid idea of the BSD FFS to raise the block size from 512B to 4096B plus 512B "tails", and endless stupid proposals to raise page and block sizes that get done all the time, and is behind the stupid idea of doing "delayed allocation", so large extents can be written in one go. The ancient and tried and obvious idea is to preallocate space ahead of it being written, so that a file's physical size may be larger than its logical length; by how much depends on some adaptive logic, or hinting from the application (if the file size is known in advance the whole file can be preallocated). > [ ... ] So, this is why putting your /var/log, /var/lib/mailman and > /var/spool on btrfs is a terrible idea. [ ... ] That is just the old "writing a file slowly" issue, and many if not most filesystems have this issue: http://www.sabi.co.uk/blog/15-one.html?150203#150203 and as that post shows it was already reported for Btrfs here: http://kreijack.blogspot.co.uk/2014/06/btrfs-and-systemd-journal.html > [ ... ] The fun thing is that this might work, but because of > the pattern we end up with, a large write apparently fails > (the files downloaded when doing apt-get update by daily cron) > which causes a new chunk allocation. This is clearly visible > in the videos. Directly after that, the new chunk gets filled > with the same pattern, because the extent allocator now > continues there and next day same thing happens again etc... [ > ... ] The general problem is that filesystems have a very difficult job especially on rotating media and cannot avoid large important degenerate corner cases by using any adaptive logic. Only predictive logic can avoid them, and since psychic code is not possible yet, "predictive" means hints from applications and users, and application developers and users are usually not going to give them, or will give them wrong.
Consider the "slow writing" corner case, common to logging or downloads, that you mention: the filesystem logic cannot do well in the general case because it cannot predict how large the final file will be, or what the rate of writing will be. However if the applications or users hint the total final size or at least a suitable allocation size things are going to be good. But it is already difficult to expect applications to give absolutely necessary 'fsync's, so explicit file size or access pattern hints are a bit of an illusion. It is the ancient 'O_PONIES' issue in one of its many forms. Fortunately it is possible and even easy to do much better *synthetic* hinting than most libraries and kernels do today: http://www.sabi.co.uk/blog/anno05-4th.html?051012d#051012d http://www.sabi.co.uk/blog/anno05-4th.html?051011b#051011b http://www.sabi.co.uk/blog/anno05-4th.html?051011#051011 http://www.sabi.co.uk/blog/anno05-4th.html?051010#051010 But that has not happened because it is no developer's itch to fix. I was instead partially impressed that recently the 'vm_cluster' implementation was "fixed", after only one or two decades from being first reported: http://sabi.co.uk/blog/anno05-3rd.html?050923#050923 https://lwn.net/Articles/716296/ https://lkml.org/lkml/2001/1/30/160 And still the author(s) of the fix don't seem to be persuaded by many decades of
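The forward-walking allocator behaviour quoted earlier in this thread ("every file that is created and then removed leaves a blank spot behind") can be illustrated with a toy model; this is a sketch of the general log-structured pattern, with invented names, not Btrfs's actual allocator:

```python
# Toy model of a forward-walking ("log-structured") allocator: new
# writes always land past the highest offset ever used, so a
# create/delete churn keeps allocating fresh chunks even though the
# amount of live data never grows. A compactor (balance) must reclaim
# the blank spots left behind.

CHUNK = 1024  # allocation-unit size (stand-in for the 1 GiB chunks)

def churn(cycles, size):
    """Create then delete one `size`-unit file per cycle; return
    (chunks touched by forward-only allocation, chunks actually needed)."""
    cursor = 0  # the allocator only ever moves forward
    for _ in range(cycles):
        cursor += size  # allocate at the front...
        # ...then delete: space behind is freed, but cursor stays put.
    forward_chunks = -(-cursor // CHUNK)  # ceiling division
    ideal_chunks = -(-size // CHUNK)      # one file live at a time
    return forward_chunks, ideal_chunks

print(churn(cycles=100, size=512))  # (50, 1): 50 chunks walked, 1 needed
```

This is why the thread concludes that fairly frequent 'balance' matters: it is the compaction step that a log-structured layout cannot avoid.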
Re: Do different btrfs volumes compete for CPU?
> [ ... ] I tried to use eSATA and ext4 first, but observed > silent data corruption and irrecoverable kernel hangs -- > apparently, SATA is not really designed for external use. SATA works for external use, eSATA works well, but what really matters is the chipset of the adapter card. In my experience JMicron is not so good, Marvell a bit better, best is to use a recent motherboard chipset with a SATA-eSATA internal cable and bracket. >> As written that question is meaningless: despite the current >> mania for "threads"/"threadlets" a filesystem driver is a >> library, not a set of processes (all those '[btrfs-*]' >> threadlets are somewhat misguided ways to do background >> stuff). > But these threadlets, misguided as they are, do exist, don't > they? But that does not change the fact that it is a library and work is initiated by user requests which are not per-subvolume, but in effect per-volume. > I understand that qgroups is very much work in progress, but > (correct me if I'm wrong) right now it's the only way to > estimate real usage of subvolume and its snapshots. It is a way to do so and not a very good way. There is no obviously good way to define "real usage" in the presence of hard-links and reflinking, and qgroups use just one way to define it. A similar problem happens with processes in the presence of shared pages, multiple mapped shared libraries etc. > For instance, if I have dozen 1TB subvolumes each having ~50 > snapshots and suddenly run out of space on a 24TB volume, how > do I find the culprit without qgroups? It is not clear what "culprit" means here. The problem is that both hard-links and ref-linking create really significant ambiguities as to used space. Plus the same problem would happen with directories instead of subvolumes and hard-links instead of reflinked snapshots. > [ ... ] The chip is ASM1142, not Intel/AMD sadly but quite > popular nevertheless.
ASMedia USB3 chipsets are fairly reliable, at least the card ones on the system side. The ones on the disk side I don't know much about; I have seen some ASMedia ones that also seem OK. For the disks I use a Seagate and a WDC external box from which I have removed the original disk, as I have noticed that Seagate and WDC for obvious reasons tend to test and use the more reliable chipsets. I have also got an external USB3 dock with a recent ASMedia chipset that also seems good, but I haven't used it much.
Re: Shrinking a device - performance?
[ ... ] >>> $ D='btrfs f2fs gfs2 hfsplus jfs nilfs2 reiserfs udf xfs' >>> $ find $D -name '*.ko' | xargs size | sed 's/^ *//;s/ .*\t//g' >>> text filename >>> 832719 btrfs/btrfs.ko >>> 237952 f2fs/f2fs.ko >>> 251805 gfs2/gfs2.ko >>> 72731 hfsplus/hfsplus.ko >>> 171623 jfs/jfs.ko >>> 173540 nilfs2/nilfs2.ko >>> 214655 reiserfs/reiserfs.ko >>> 81628 udf/udf.ko >>> 658637 xfs/xfs.ko That was Linux AMD64. > udf is 637K on Mac OS 10.6 > exfat is 75K on Mac OS 10.9 > msdosfs is 79K on Mac OS 10.9 > ntfs is 394K (That must be Paragon's ntfs for Mac) ... > zfs is 1.7M (10.9) > spl is 247K (10.9) Similar on Linux AMD64 but smaller: $ size updates/dkms/*.ko | sed 's/^ *//;s/ .*\t//g' text filename 62005 updates/dkms/spl.ko 184370 updates/dkms/splat.ko 3879 updates/dkms/zavl.ko 22688 updates/dkms/zcommon.ko 1012212 updates/dkms/zfs.ko 39874 updates/dkms/znvpair.ko 18321 updates/dkms/zpios.ko 319224 updates/dkms/zunicode.ko > If they are somehow comparable even with the differences, 833K > is not bad for btrfs compared to zfs. I did not look at the > format of the file; it must be binary, but compression may be > optional for third party kexts. So the kernel module sizes are > large for both btrfs and zfs. Given the feature sets of both, > is that surprising? Not surprising and indeed I agree with the statement that appeared earlier that "there are use cases that actually need them". There are also use cases that need realtime translation of file content from chinese to spanish, and one could add to ZFS or Btrfs an extension to detect the language of text files and invoke via HTTP Google Translate, for example with option "translate=chinese-spanish" at mount time; or less flexibly there are many use cases where B-Tree lookup of records in files is useful, and it would be possible to add that to Btrfs or ZFS, so that for example 'lseek(4,"Jane Smith",SEEK_KEY)' would be possible, as in the ancient TSS/370 filesystem design.
But the question is about engineering, where best to implement those "feature sets": in the kernel or higher levels. There is no doubt for me that realtime language translation and seeking by key can be added to a filesystem kernel module, and would "work". The issue is a crudely technical one: "works" for an engineer is not a binary state, but a statistical property over a wide spectrum of cost/benefit tradeoffs. Adding "feature sets" because "there are use cases that actually need them" is fine, adding their implementation to the kernel driver of a filesystem is quite a different proposition, which may have downsides, as the implementations of those feature sets may make code more complex and harder to understand and test, never mind debug, even for the base features. But of course lots of people know better :-). But there is more; look again at some compiled code sizes as a crude proxy for complexity, divided in two groups, both of robust, full-featured designs: 1012212 updates/dkms/zfs.ko 832719 btrfs/btrfs.ko 658637 xfs/xfs.ko 237952 f2fs/f2fs.ko 173540 nilfs2/nilfs2.ko 171623 jfs/jfs.ko 81628 udf/udf.ko The code size for JFS or NILFS2 or UDF is roughly 1/4 the code size for XFS, yet there is little difference in functionality. Compared to ZFS as to base functionality JFS lacks checksums and snapshots (in theory it has subvolumes, but they are disabled), but NILFS2 has snapshots and checksums (but does not verify them on ordinary reads), and yet the code size is 1/6 that of ZFS. ZFS also has RAID, but looking at the code size of the Linux MD RAID modules I see rather smaller numbers. Even so ZFS has a good reputation for reliability despite its amazing complexity, but that is also because SUN invested big into massive release engineering for it, and similarly for XFS. Therefore my impression is that the filesystems in the first group have a lot of cool features like compression or dedup etc.
that could have been implemented at user level, and having them in the kernel is good "for marketing purposes, to win box-ticking competitions".
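The "1/4" and "1/6" ratios claimed above can be checked directly against the module sizes listed earlier (a crude arithmetic check; compiled size is, as the post itself says, only a rough proxy for complexity):

```python
# Module text sizes (bytes) copied from the listings above.
sizes = {
    "zfs": 1012212, "btrfs": 832719, "xfs": 658637,
    "f2fs": 237952, "nilfs2": 173540, "jfs": 171623, "udf": 81628,
}
# JFS vs XFS: the claimed "roughly 1/4".
print(round(sizes["xfs"] / sizes["jfs"], 1))     # 3.8
# NILFS2 vs ZFS: the claimed "1/6".
print(round(sizes["zfs"] / sizes["nilfs2"], 1))  # 5.8
```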
Re: Do different btrfs volumes compete for CPU?
>> Approximately 16 hours ago I've run a script that deleted >> >~100 snapshots and started quota rescan on a large >> USB-connected btrfs volume (5.4 of 22 TB occupied now). That "USB-connected" is a rather bad idea. On the IRC channel #Btrfs whenever someone reports odd things happening I ask "is that USB?" and usually it is and then we say "good luck!" :-). The issues are: * The USB mass storage protocol is poorly designed in particular for error handling. * The underlying USB protocol is very CPU intensive. * Most importantly nearly all USB chipsets, both system-side and peripheral-side, are breathtakingly buggy, but this does not get noticed for most USB devices. >> Quota rescan only completed just now, with 100% load from >> [btrfs-transacti] throughout this period, > [ ... ] are different btrfs volumes independent in terms of > CPU, or are there some shared workers that can be point of > contention? As written that question is meaningless: despite the current mania for "threads"/"threadlets" a filesystem driver is a library, not a set of processes (all those '[btrfs-*]' threadlets are somewhat misguided ways to do background stuff). The real problems here are: * Qgroups are famously system CPU intensive, even if less so than in earlier releases, especially with subvolumes, so the 16 hours CPU is both absurd and expected. I think that qgroups are still effectively unusable. * The scheduler gives excessive priority to kernel threads, so they can crowd out user processes. When for whatever reason the system CPU percentage rises everything else usually suffers. > BTW, USB adapter used is this one (though storage array only > supports USB 3.0): > https://www.asus.com/Motherboard-Accessory/USB_31_TYPEA_CARD/ Only Intel/AMD USB chipsets and a few others are fairly reliable, and for mass storage only with USB3 with UASP, which is basically SATA-over-USB (more precisely SCSI-command-set over USB).
Your system-side card seems to be recent enough to do UASP, but probably the peripheral-side chipset isn't. Things are so bad with third-party chipsets that even several types of add-on SATA and SAS cards are too buggy.
Re: Shrinking a device - performance?
> [ ... ] what the significance of the xargs size limits of > btrfs might be. [ ... ] So what does it mean that btrfs has a > higher xargs size limit than other file systems? [ ... ] Or > does the lower capacity for argument length for hfsplus > demonstrate it is the superior file system for avoiding > breakage? [ ... ] That confuses me, as my understanding of the command argument size limit is that it is a system, not filesystem, property, and for example can be obtained with 'getconf ARG_MAX'. > Personally, I would go back to fossil and venti on Plan 9 for > an archival data server (using WORM drives), In an ideal world we would be using Plan 9. Not necessarily with Fossil and Venti. As to storage/backup/archival, Linux-based options are not bad, even if the platform is far messier than Plan 9 (or some other alternatives). BTW I just noticed with a search that AWS might be offering Plan 9 hosts :-). > and VAX/VMS cluster for an HA server. [ ... ] Uhmmm, however nice it was, it was fairly weird. An IA32 or AMD64 port has been promised however :-). https://www.theregister.co.uk/2016/10/13/openvms_moves_slowly_towards_x86/
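The point that the argument-size limit is a system property, not a filesystem one, is easy to confirm: it is the sysconf value ARG_MAX, the same number that 'getconf ARG_MAX' prints, and it does not change with the filesystem underneath. A quick check (the printed value is platform-dependent; POSIX only guarantees the 4096-byte floor):

```python
import os

# ARG_MAX is a kernel/system limit on the total size of argv+environ
# for exec(); it is exposed via sysconf and is the same whatever
# filesystem the command's arguments happen to name.
arg_max = os.sysconf("SC_ARG_MAX")
print(arg_max)
# POSIX requires ARG_MAX to be at least _POSIX_ARG_MAX = 4096 bytes.
print(arg_max >= 4096)
```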
Re: Shrinking a device - performance?
>>> My guess is that very complex risky slow operations like >>> that are provided by "clever" filesystem developers for >>> "marketing" purposes, to win box-ticking competitions. >>> That applies to those system developers who do know better; >>> I suspect that even some filesystem developers are >>> "optimistic" as to what they can actually achieve. >>> There are cases where there really is no other sane >>> option. Not everyone has the kind of budget needed for >>> proper HA setups, >> Thanks for letting me know, that must have never occurred to >> me, just as it must have never occurred to me that some >> people expect extremely advanced features that imply >> big-budget high-IOPS high-reliability storage to be fast and >> reliable on small-budget storage too :-) > You're missing my point (or intentionally ignoring it). In "Thanks for letting me know" I am not missing your point, I am simply pointing out that I do know that people try to run high-budget workloads on low-budget storage. The argument as to whether "very complex risky slow operations" should be provided in the filesystem itself is a very different one, and I did not develop it fully. But it is quite "optimistic" to simply state "there really is no other sane option", even for people that don't have "proper HA setups". Let's start by assuming, for the time being, that "very complex risky slow operations" are indeed feasible on very reliable high speed storage layers. Then the questions become: * Is it really true that "there is no other sane option" to running "very complex risky slow operations" even on storage that is not "big-budget high-IOPS high-reliability"? * Is it really true that it is a good idea to run "very complex risky slow operations" even on "big-budget high-IOPS high-reliability storage"? > Those types of operations are implemented because there are > use cases that actually need them, not because some developer > thought it would be cool. [ ...
] And this is the really crucial bit, I'll disregard without agreeing too much (but in part I do) with the rest of the response, as those are less important matters, and this is going to be longer than a Twitter message. First, I agree that "there are use cases that actually need them", and I need to explain what I am agreeing to: I believe that computer systems, "system" in a wide sense, have what I call "inevitable functionality", that is functionality that is not optional, but must be provided *somewhere*: for example print spooling is "inevitable functionality" as long as there are multiple users, and spell checking is another example. The only choice as to "inevitable functionality" is *where* to provide it. For example spooling can be done among two users by queuing jobs manually with one saying "I am going to print now", and the other user waits until the print is finished, or by using a spool program that queues jobs on the source system, or by using a spool program that queues jobs on the target printer. Spell checking can be done on the fly in the document processor, batch with a tool, or manually by the document author. All these are valid implementations of "inevitable functionality", just with very different performance envelopes, where the "system" includes the users as "peripherals" or "plugins" :-) in the manual implementations. There is no dispute from me that multiple devices, adding/removing block devices, data compression, structural repair, balancing, growing/shrinking, defragmentation, quota groups, integrity checking, deduplication, ... are all in the general case "inevitable functionality", and every non-trivial storage system *must* implement them.
The big question is *where*: for example when I started using UNIX the 'fsck' tool was several years away, and when the system crashed, like everybody else I did filetree integrity checking and structure recovery myself (with the help of 'ncheck' and 'icheck' and 'adb'), that is 'fsck' was implemented in my head. In the general case there are four places where such "inevitable functionality" can be implemented: * In the filesystem module in the kernel, for example Btrfs scrubbing. * In a tool that uses hooks provided by the filesystem module in the kernel, for example Btrfs deduplication, 'send'/'receive'. * In a tool, for example 'btrfsck'. * In the system administrator. Consider the "very complex risky slow" operation of defragmentation; the system administrator can implement it by dumping and reloading the volume, or a tool can implement it by running on the unmounted filesystem, or a tool and the kernel can implement it by using kernel module hooks, or it can be provided entirely in the kernel module. My argument is that providing "very complex risky slow" maintenance operations as filesystem primitives looks awesomely convenient, a good way to "win box-ticking competitions" for "marketing" purposes, but is a rather bad idea for several reasons, of varying strengths: * Most system
Re: Shrinking a device - performance?
>> [ ... ] CentOS, Redhat, and Oracle seem to take the position >> that very large data subvolumes using btrfs should work >> fine. But I would be curious what the rest of the list thinks >> about 20 TiB in one volume/subvolume. > To be sure I'm a biased voice here, as I have multiple > independent btrfs on multiple partitions here, with no btrfs > over 100 GiB in size, and that's on ssd so maintenance > commands normally return in minutes or even seconds, That's a bit extreme I think, as there are downsides to having too many small volumes too. > not the hours to days or even weeks it takes on multi-TB btrfs > on spinning rust. Or months :-). > But FWIW... 1) Don't put all your data eggs in one basket, > especially when that basket isn't yet entirely stable and > mature. Really good point here. > A mantra commonly repeated on this list is that btrfs is still > stabilizing, My impression is that most 4.x and later versions are very reliable for "base" functionality, that is excluding multi-device, compression, qgroups, ... Put another way, what scratches the Facebook itches works well :-). > [ ... ] the time/cost/hassle-factor of the backup, and being > practically prepared to use them, is even *MORE* important > than it is on fully mature and stable filesystems. Indeed, or at least with *different* filesystems: I back up JFS filesystems to XFS ones, and Btrfs filesystems to NILFS2 ones, for example. > 2) Don't make your filesystems so large that any maintenance > on them, including both filesystem maintenance like btrfs > balance/scrub/check/ whatever, and normal backup and restore > operations, takes impractically long, As per my preceding post, that's the big deal, but so many people "know better" :-). > where "impractically" can be reasonably defined as so long it > discourages you from doing them in the first place and/or so > long that it's going to cause unwarranted downtime. That's the "Very Large DataBase" level of trouble. 
> Some years ago, before I started using btrfs and while I was > using mdraid, I learned this one the hard way. I had a bunch > of rather large mdraids setup, [ ... ] I have recently seen another much "funnier" example: people who "know better" and follow every cool trend decide to consolidate their server farm onto VMs, backed by a storage server with a largish single pool of storage holding the virtual disk images of all the server VMs. They look like geniuses until the storage pool system crashes, and a minimal integrity check on restart takes two days, during which the whole organization is without access to any email, files, databases, ... > [ ... ] And there was a good chance it was /not/ active and > mounted at the time of the crash and thus didn't need > repaired, saving that time entirely! =:^) As to that, I have switched to using 'autofs' to mount volumes only on access, using a simple script that turns '/etc/fstab' into an automounter dynamic map, which means that most of the time most volumes on my (home) systems are not mounted: http://www.sabi.co.uk/blog/anno06-3rd.html?060928#060928 > Eventually I arranged things so I could keep root mounted > read-only unless I was updating it, and that's still the way I > run it today. The ancient way was, instead of having '/' RO and '/var' RW, to have '/' RW and '/usr' RO (so for example it could be shared across many systems via NFS etc.), and while both are good ideas, I prefer the ancient way. But then some people who know better are moving to merge '/' with '/usr' without understanding the history and the advantages. > [ ... ] If it's multiple TBs, chances are it's going to be > faster to simply blow away and recreate from backup, than it > is to try to repair... [ ... ] Or to shrink or defragment or dedup etc., except on very high IOPS-per-TB storage. > [ ... ] how much simpler it would have been had they had an > independent btrfs of say a TB or two for each system they were > backing up. 
That is the general alternative to a single large pool/volume: sharding/chunking of filetrees, sometimes, as with Lustre or Ceph etc., with a "metafilesystem" layer on top. Done manually, my suggestion is to do the sharding per-week (or other suitable period) rather than per-system, in a circular "crop rotation" scheme, so that once a volume has been filled, it becomes read-only and can even be unmounted until it needs to be reused: http://www.sabi.co.uk/blog/12-fou.html?121218b#121218b Then there is the problem that "a TB or two" is less easy with increasing disk capacities, but then I think that disks with a capacity larger than 1TB are not suitable for ordinary workloads, and more for tape-cartridge like usage. > What would they have done had the btrfs gone bad and needed > repaired? [ ... ] In most cases I have seen of designs aimed at achieving the lowest cost and highest flexibility with a "low IOPS single pool" at the expense of scalability and maintainability, the "clever" designer had been promoted or had
Re: Shrinking a device - performance?
> Can you try to first dedup the btrfs volume? This is probably > out of date, but you could try one of these: [ ... ] Yep, > that's probably a lot of work. [ ... ] My recollection is that > btrfs handles deduplication differently than zfs, but both of > them can be very, very slow But the big deal there is that dedup is indeed a very expensive operation, even worse than 'balance'. A balanced, deduped volume will shrink faster in most cases, but the time taken has simply moved from shrinking to preparing. > Again, I'm not an expert in btrfs, but in most cases a full > balance and scrub takes care of any problems on the root > partition, but that is a relatively small partition. A full > balance (without the options) and scrub on 20 TiB must take a > very long time even with robust hardware, would it not? There have been reports of several months for volumes of that size subject to ordinary workload. > CentOS, Redhat, and Oracle seem to take the position that very > large data subvolumes using btrfs should work fine. This is a long-standing controversy, and for example there have been "interesting" debates on the XFS mailing list. Btrfs in this is not really different from others, with one major difference in context: many Btrfs developers work for a company that relies on large numbers of small servers, to the point that fixing multidevice issues has not been a priority. The controversy over large volumes is that while no doubt the logical structures of recent filesystem types can support single volumes of many petabytes (or even much larger), and such volumes have indeed been created and "work"-ish, so they are unquestionably "syntactically valid", the tradeoffs involved, especially as to maintainability, may mean that they don't "work" well and sustainably so. 
The fundamental issue is metadata: while the logical structures, using 48-64 bit pointers, unquestionably scale "syntactically", they don't scale pragmatically when considering whole-volume maintenance like checking, repair, balancing, scrubbing, indexing (which includes making incremental backups etc.). Note: large volumes don't have just a speed problem for whole-volume operations, they also have a memory problem, as most tools hold an in-memory copy of the metadata. There have been cases where indexing or repair of a volume required a lot more RAM (many hundreds of GiB or some TiB of RAM) than was available on the system on which the volume was being used. The problem is of course smaller if the large volume contains mostly large files, and bigger if the volume is stored on low IOPS-per-TB devices and used on small-memory systems. But even with large files, where filetree object metadata (inodes etc.) are relatively few, space metadata must eventually resolve down to single sectors, and that can be a lot of metadata unless both used and free space are very unfragmented. The fundamental technological issue is: *data* IO rates, in both random IOPS and sequential ones, can be scaled "almost" linearly by parallelizing them using RAID or equivalent, allowing large volumes to serve scalably large and parallel *data* workloads, but *metadata* IO rates cannot be easily parallelized, because metadata structures are graphs, not arrays of bytes like files. So a large volume on 100 storage devices can serve in parallel a significant percentage of 100 times the data workload of a small volume on 1 storage device, but not so much for the metadata workload. For example, I have never seen a parallel 'fsck' tool that can take advantage of 100 storage devices to complete a scan of a single volume on those 100 devices in not much longer time than the scan of a volume on 1 of them. 
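That asymmetry can be put in rough Amdahl's-law terms. The sketch below is a back-of-the-envelope model, not a measurement; the serial fractions are purely illustrative assumptions about how much of a whole-volume scan is serial metadata-graph chasing.

```python
# Rough model: a whole-volume scan is part data reads (parallelizable
# across member devices) and part metadata graph traversal (mostly
# serial, since each step depends on following the previous pointer).
def scan_speedup(devices, serial_fraction):
    """Amdahl's law: speedup of a scan where `serial_fraction` of the
    work (metadata traversal) cannot be spread across devices."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / devices)

# With even 10% of the work being serial metadata chasing, 100 devices
# give nowhere near a 100x faster 'fsck'-style scan:
print(round(scan_speedup(100, 0.10), 1))   # ~9.2x, not 100x
print(round(scan_speedup(100, 0.50), 2))   # ~1.98x
```

Which is why data throughput scales with device count while whole-volume maintenance time largely does not.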
> But I would be curious what the rest of the list thinks about > 20 TiB in one volume/subvolume. Personally I think that while volumes of many petabytes "work" syntactically, there are serious maintainability problems (which I have seen happen at a number of sites) with volumes larger than 4TB-8TB with any current local filesystem design. That depends also on the number/size of storage devices, and their nature, that is IOPS, as after all metadata workloads do scale a bit with the number of available IOPS, even if far more slowly than data workloads. For example I think that an 8TB volume is not desirable on a single 8TB disk for ordinary workloads (but then I think that disks above 1-2TB are just not suitable for ordinary filesystem workloads), but with lots of smaller/faster disks a 12TB volume would probably be acceptable, and maybe a number of flash SSDs might make acceptable even a 20TB volume. Of course there are lots of people who know better. :-)
Re: Shrinking a device - performance?
>>> The way btrfs is designed I'd actually expect shrinking to >>> be fast in most cases. [ ... ] >> The proposed "move whole chunks" implementation helps only if >> there are enough unallocated chunks "below the line". If regular >> 'balance' is done on the filesystem there will be some, but that >> just spreads the cost of the 'balance' across time, it does not >> by itself make a «risky, difficult, slow operation» any less so, >> just spreads the risk, difficulty, slowness across time. > Isn't that too pessimistic? Maybe, it depends on the workload impacting the volume and how much it churns the free/unallocated situation. > Most of my filesystems have 90+% of free space unallocated, > even those I never run balance on. That seems quite lucky to me, as it definitely is not my experience or even my expectation in the general case: on my laptop and desktop, with relatively few updates, I have to run 'balance' fairly frequently, and "Knorrie" has produced a nice tool that draws a graphical map of free vs. unallocated space; in most of its example maps users find quite a bit of balancing needs to be done. > For me it wouldn't just spread the cost, it would reduce it > considerably. In your case the cost of the implicit or explicit 'balance' simply does not arise because 'balance' is not necessary, and then moving whole chunks is indeed cheap. The argument here is in part whether used space (extents) or allocated space (chunks) is more fragmented, as well as the amount of metadata to update in either case.
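The free vs. unallocated distinction being argued about can be sketched with a toy model of the two-level bookkeeping. This is an illustrative simplification (fixed 1 GiB chunks, no profiles), not the real allocator:

```python
# Toy model of Btrfs's two-level allocation: space is first carved into
# chunks (here a fixed 1 GiB each), and extents live inside chunks. A
# chunk that is allocated but half-empty is unavailable as *unallocated*
# space until a balance repacks its extents. All numbers are illustrative.
GiB = 1024 ** 3
CHUNK = 1 * GiB

def unallocated(device_size, chunks):
    """chunks: list of used-bytes per allocated chunk."""
    return device_size - len(chunks) * CHUNK

def reclaimable_by_balance(chunks):
    """Space freed if extents were repacked into as few chunks as possible."""
    used = sum(chunks)
    needed = -(-used // CHUNK)           # ceiling division
    return (len(chunks) - needed) * CHUNK

# A 10 GiB device with 8 chunks allocated, each only 25% full:
chunks = [CHUNK // 4] * 8
print(unallocated(10 * GiB, chunks) // GiB)    # 2 GiB truly unallocated
print(reclaimable_by_balance(chunks) // GiB)   # 6 more GiB only after balance
```

The "lucky" case in the quoted message is the one where `reclaimable_by_balance` stays near zero on its own.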
Re: Shrinking a device - performance?
>> My guess is that very complex risky slow operations like that are >> provided by "clever" filesystem developers for "marketing" purposes, >> to win box-ticking competitions. That applies to those system >> developers who do know better; I suspect that even some filesystem >> developers are "optimistic" as to what they can actually achieve. > There are cases where there really is no other sane option. Not > everyone has the kind of budget needed for proper HA setups, Thanks for letting me know, that must have never occurred to me, just as it must have never occurred to me that some people expect extremely advanced features that imply big-budget high-IOPS high-reliability storage to be fast and reliable on small-budget storage too :-) > and if you need maximal uptime and as a result have to reprovision the > system online, then you pretty much need a filesystem that supports > online shrinking. That's a bigger topic than we can address here. The topic used to be known in one related domain as "Very Large Databases", which were defined as databases so large and critical that the time needed for maintenance and backup was too long to take them offline etc.; that is a topic that has largely vanished from discussion, I guess because most management just don't want to hear it :-). > Also, it's not really all that slow on most filesystem, BTRFS is just > hurt by it's comparatively poor performance, and the COW metadata > updates that are needed. Btrfs in realistic situations has pretty good speed *and* performance, and COW actually helps, as it often results in less head repositioning than update-in-place. What makes it a bit slower with metadata is having 'dup' by default, to recover from especially damaging bitflips in metadata, but then that does not impact performance, only speed. >> That feature set is arguably not appropriate for VM images, but >> lots of people know better :-). > That depends on a lot of factors. 
I have no issues personally running > small VM images on BTRFS, but I'm also running on decent SSD's > (>500MB/s read and write speeds), using sparse files, and keeping on > top of managing them. [ ... ] Having (relatively) big-budget high-IOPS storage for high-IOPS workloads helps, that must have never occurred to me either :-). >> XFS and 'ext4' are essentially equivalent, except for the fixed-size >> inode table limitation of 'ext4' (and XFS reportedly has finer >> grained locking). Btrfs is nearly as good as either on most workloads >> in single-device mode [ ... ] > No, if you look at actual data, [ ... ] Well, I have looked at actual data in many published but often poorly made "benchmarks", and to me they seem quite equivalent indeed, within somewhat differently shaped performance envelopes, so the results depend on the testing point within that envelope. I have done my own simplistic actual-data gathering, most recently here: http://www.sabi.co.uk/blog/17-one.html?170302#170302 http://www.sabi.co.uk/blog/17-one.html?170228#170228 and, however simplistic, they are fairly informative (and for writes they point a finger at a layer below the filesystem type). [ ... ] >> "Flexibility" in filesystems, especially on rotating disk >> storage with extremely anisotropic performance envelopes, is >> very expensive, but of course lots of people know better :-). > Time is not free, Your time seems especially and uniquely precious, as you "waste" as little as possible editing your replies into readability. > and humans generally prefer to minimize the amount of time they have > to work on things. This is why ZFS is so popular, it handles most > errors correctly by itself and usually requires very little human > intervention for maintenance. That seems to me a pretty illusion, as it does not contain any magical AI, just pretty ordinary and limited error correction for trivial cases. 
> 'Flexibility' in a filesystem costs some time on a regular basis, but > can save a huge amount of time in the long run. Like everything else. The difficulty is having flexibility at scale with challenging workloads. "An engineer can do for a nickel what any damn fool can do for a dollar" :-). > To look at it another way, I have a home server system running BTRFS > on top of LVM. [ ... ] But usually home servers have "unchallenging" workloads, and it is relatively easy to overbudget their storage, because the total absolute cost is "affordable".
Re: Shrinking a device - performance?
> I’ve glazed over on “Not only that …” … can you make youtube > video of that :)) [ ... ] It’s because I’m special :* Well played again, that's a fairly credible impersonation of a node.js/mongodb developer :-). > On a real note thank’s [ ... ] to much of open source stuff is > based on short comments :/ Yes... In part that's because the "sw engineering" aspect of programming takes a lot of time that unpaid volunteers sometimes cannot afford to take; in part, though, I have noticed that sometimes free sw authors who do get paid to do free sw act as if they had a policy of obfuscation to protect their turf/jobs. Regardless, mailing lists, IRC channel logs, wikis, personal blogs, search engines allow a mosaic of lore to form, which in part remedies the situation, and here we are :-).
Re: Shrinking a device - performance?
>> As a general consideration, shrinking a large filetree online >> in-place is an amazingly risky, difficult, slow operation and >> should be a last desperate resort (as apparently in this case), >> regardless of the filesystem type, and expecting otherwise is >> "optimistic". > The way btrfs is designed I'd actually expect shrinking to be > fast in most cases. It could probably be done by moving whole > chunks at near platter speed, [ ... ] It just hasn't been > implemented yet. That seems to me a rather "optimistic" argument, as most of the cost of shrinking is the 'balance' to pack extents into chunks. As that thread implies, the current implementation in effect does a 'balance' while shrinking, by moving extents from chunks "above the line" to free space in chunks "below the line". The proposed "move whole chunks" implementation helps only if there are enough unallocated chunks "below the line". If regular 'balance' is done on the filesystem there will be some, but that just spreads the cost of the 'balance' across time; it does not by itself make a «risky, difficult, slow operation» any less so, it just spreads the risk, difficulty, slowness across time. More generally, one of the downsides of Btrfs is that because of its two-level (allocated/unallocated chunks, used/free nodes or blocks) design it requires regular 'balance' more than most other designs do, and 'balance' is indeed «risky, difficult, slow». Compare an even more COW design like NILFS2, which similarly (though a bit less) requires running its garbage collector, which is also «risky, difficult, slow». Just like in Btrfs, that is a tradeoff that shrinks the performance envelope in one direction and expands it in another. But in the case of Btrfs it shrinks it perhaps a bit more than it expands it, as the added flexibility of having chunk-based 'profiles' is only very partially taken advantage of. 
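The "move whole chunks" idea can be made concrete with a small sketch. This is an illustrative model of the planning step only (fixed 1 GiB chunks, hypothetical layout), not the kernel's relocation code:

```python
# Sketch of the "move whole chunks" shrink idea: chunks that extend
# beyond the new device size ("above the line") must be relocated into
# unallocated space below it; if there are no free chunk slots below,
# extents must first be repacked by a balance. Illustrative model only.
GiB = 1024 ** 3
CHUNK = 1 * GiB

def shrink_plan(device_size, new_size, chunk_starts):
    """Return (chunks to relocate, free whole-chunk slots below the line)."""
    above = [s for s in chunk_starts if s + CHUNK > new_size]
    below = [s for s in chunk_starts if s + CHUNK <= new_size]
    free_slots_below = new_size // CHUNK - len(below)
    return len(above), free_slots_below

# 8 contiguously allocated chunks on a 10 GiB device, shrinking to 6 GiB:
starts = [i * CHUNK for i in range(8)]
moves, slots = shrink_plan(10 * GiB, 6 * GiB, starts)
print(moves, slots)   # 2 chunks to move, 0 free slots below -> balance needed
```

When `free_slots_below` comes up short, the cheap whole-chunk move degenerates into exactly the extent-repacking 'balance' discussed above.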
Re: Shrinking a device - performance?
> I glazed over at “This is going to be long” … :) >> [ ... ] Not only that, you also top-posted while quoting it pointlessly in its entirety, to the whole mailing list. Well played :-).
Re: Shrinking a device - performance?
> [ ... ] slaps together a large storage system in the cheapest > and quickest way knowing that while it is mostly empty it will > seem very fast regardless and therefore to have awesome > performance, and then the "clever" sysadm disappears surrounded > by a halo of glory before the storage system gets full workload > and fills up; [ ... ] Fortunately or unfortunately Btrfs is particularly suitable for this technique, as it has an enormous number of checkbox-ticking, awesome-looking features: transparent compression, dynamic add/remove, online balance/scrub, different sized member devices, online grow/shrink, online defrag, limitless scalability, online dedup, arbitrary subvolumes and snapshots, COW and reflinking, online conversion of RAID profiles, ... and one can use all of them at the same time, and for the initial period, while volume workload is low and space used not much, it will look absolutely fantastic, cheap, flexible, always available, fast, the work of genius of a very cool sysadm.
Re: Shrinking a device - performance?
> [ ... ] reminded of all the cases where someone left me to > decatastrophize a storage system built on "optimistic" > assumptions. In particular when some "clever" sysadm with a "clever" (or dumb) manager slaps together a large storage system in the cheapest and quickest way, knowing that while it is mostly empty it will seem very fast regardless and therefore appear to have awesome performance, and then the "clever" sysadm disappears surrounded by a halo of glory before the storage system gets full workload and fills up; when that happens usually I get to inherit it. BTW, the same technique can also be applied to HPC clusters. >> I intended to shrink a ~22TiB filesystem down to 20TiB. This >> is still using LVM underneath so that I can’t just remove a >> device from the filesystem but have to use the resize >> command. >> Label: 'backy' uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4 >> Total devices 1 FS bytes used 18.21TiB >> devid1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy Ah, it is indeed a filled-up storage system now running a full workload. At least it wasn't me who inherited it this time. :-)
Re: Shrinking a device - performance?
This is going to be long because I am writing something detailed hoping pointlessly that someone in the future will find it by searching the list archives while doing research before setting up a new storage system, and they will be the kind of person that tolerates reading messages longer than Twitter. :-). > I’m currently shrinking a device and it seems that the > performance of shrink is abysmal. When I read this kind of statement I am reminded of all the cases where someone left me to decatastrophize a storage system built on "optimistic" assumptions. The usual "optimism" is what I call the "syntactic approach", that is the axiomatic belief that any syntactically valid combination of features not only will "work", but very fast too and reliably despite slow cheap hardware and "unattentive" configuration. Some people call that the expectation that system developers provide or should provide an "O_PONIES" option. In particular I get very saddened when people use "performance" to mean "speed", as the difference between the two is very great. As a general consideration, shrinking a large filetree online in-place is an amazingly risky, difficult, slow operation and should be a last desperate resort (as apparently in this case), regardless of the filesystem type, and expecting otherwise is "optimistic". My guess is that very complex risky slow operations like that are provided by "clever" filesystem developers for "marketing" purposes, to win box-ticking competitions. That applies to those system developers who do know better; I suspect that even some filesystem developers are "optimistic" as to what they can actually achieve. > I intended to shrink a ~22TiB filesystem down to 20TiB. This is > still using LVM underneath so that I can’t just remove a device > from the filesystem but have to use the resize command. That is actually a very good idea because Btrfs multi-device is not quite as reliable as DM/LVM2 multi-device. 
> Label: 'backy' uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4 >Total devices 1 FS bytes used 18.21TiB >devid1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy Maybe 'balance' should have been used a bit more. > This has been running since last Thursday, so roughly 3.5days > now. The “used” number in devid1 has moved about 1TiB in this > time. The filesystem is seeing regular usage (read and write) > and when I’m suspending any application traffic I see about > 1GiB of movement every now and then. Maybe once every 30 > seconds or so. Does this sound fishy or normal to you? With consistent "optimism" this is a request to assess whether "performance" of some operations is adequate on a filetree without telling us either what the filetree contents look like, what the regular workload is, or what the storage layer looks like. Being one of the few system administrators crippled by lack of psychic powers :-), I rely on guesses and inferences here, having read the whole thread containing some belated details. From the ~22TB total capacity my guess is that the storage layer involves rotating hard disks, and from later details the filesystem contents seem to be heavily reflinked files of several GB in size, and the workload seems to be backups to those files from several source hosts. Considering the general level of "optimism" in the situation, my wild guess is that the storage layer is based on large slow cheap rotating disks in the 4TB-8TB range, with very low IOPS-per-TB. > Thanks for that info. The 1min per 1GiB is what I saw too - > the “it can take longer” wasn’t really explainable to me. A contemporary rotating disk device can do from around 0.5MB/s transfer rate with small random accesses with barriers up to around 80-160MB/s in purely sequential access without barriers. 
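Given that transfer-rate range, a quick sanity check of what the reported "1GiB moved per minute" implies in sustained bandwidth (pure unit conversion, nothing more):

```python
# What "1 GiB relocated per minute" means in sustained bandwidth:
# every byte moved must be read once and written once.
GiB = 1024 ** 3
MiB = 1024 ** 2

moved_per_minute = 1 * GiB
read_rate = moved_per_minute / 60 / MiB   # MiB/s of reads
write_rate = read_rate                    # plus the same again in writes
print(round(read_rate, 1))                # ~17.1 MiB/s each way
```

That is a small fraction of the drive's sequential rate, but far above its small-random-access rate, which is the point: extent relocation sits between the two because of the randomish, barriered metadata updates.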
1GB/m of simultaneous read-write means around 16MB/s reads plus 16MB/s writes, which is fairly good *performance* (even if slow *speed*) considering that moving extents around, even across disks, involves quite a bit of randomish same-disk updates of metadata; it all usually depends on how much randomish metadata updating needs to be done, on any filesystem type, as those updates must be done with barriers. > As I’m not using snapshots: would large files (100+gb) Using 100GB sized VM virtual disks (never mind with COW) seems very unwise to me to start with, but of course a lot of other people know better :-). Just like a lot of other people know better that large single-pool storage systems are awesome in every respect :-): cost, reliability, speed, flexibility, maintenance, etc. > with long chains of CoW history (specifically reflink copies) > also hurt? Oh yes... They are about one of the worst cases for using Btrfs. But it is also very "optimistic" to think that kind of stuff can work awesomely on *any* filesystem type. > Something I’d like to verify: does having traffic on the > volume have the potential to delay this infinitely? [ ... ] > it’s just slow and we’re looking forward to about
Re: send snapshot from snapshot incremental
[ ... ] > BUT if i take a snapshot from the system, and want to transfer > it to the external HD, i can not set a parent subvolume, > because there isn't any. Questions like this are based on incomplete understanding of 'send' and 'receive', and on IRC user "darkling" explained it fairly well: > When you use -c, you're telling the FS that it can expect to > find a sent copy of that subvol on the receiving side, and > that anything shared with it can be sent by reference. OK, so > with -c on its own, you're telling the FS that "all the data > in this subvol already exists on the remote". > So, when you send your subvol, *all* of the subvol's metadata > is sent, and where that metadata refers to an extent that's > shared with the -c subvol, the extent data isn't sent, because > it's known to be on the other end already, and can be shared > directly from there. > OK. So, with -p, there's a "base" subvol. The send subvol and > the -p reference subvol are both snapshots of that base (at > different times). The -p reference subvol, as with -c, is > assumed to be on the remote FS. However, because it's known to > be an earlier version of the same data, you can be more > efficient in the sending by saying "start from the earlier > version, and modify it in this way to get the new version" > So, with -p, not all of the metadata is sent, because you know > you've already got most of it on the remote in the form of the > earlier version. > So -p is "take this thing and apply these differences to it" > and -c is "build this thing from scratch, but you can share > some of the data with these sources" Also here some additional details: http://logs.tvrrug.org.uk/logs/%23btrfs/2016-06-29.html#2016-06-29T22:39:59 The requirement for read-only is because in that way it is pretty sure that the same stuff is on both origin and target volume. 
It may help to compare with RSYNC: it has to scan both the full origin and target trees, because it cannot be told that there is a parent tree that is the same on origin and target; but with option '--link-dest' it can do something similar to 'send -c'.
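The '-p' vs '-c' distinction darkling describes can be boiled down to a toy model. This is a conceptual sketch only (subvolumes as path-to-extent maps; hypothetical names), not the real send stream format, and it ignores deletions:

```python
# Toy model of 'btrfs send' reference modes: a subvolume is a dict of
# path -> extent id; extents known to exist on the receiver are shared
# by reference instead of being resent.
def send_full(subvol):
    return {"metadata": set(subvol), "extents": set(subvol.values())}

def send_clone(subvol, clone_src):
    # -c: *all* metadata is sent, but extents shared with the clone
    # source are assumed present remotely and sent by reference only.
    remote = set(clone_src.values())
    return {"metadata": set(subvol),
            "extents": set(subvol.values()) - remote}

def send_parent(subvol, parent):
    # -p: only the difference against the earlier snapshot is sent,
    # metadata included ("apply these changes to the earlier version").
    changed = {p for p in subvol if parent.get(p) != subvol[p]}
    return {"metadata": changed,
            "extents": {subvol[p] for p in changed} - set(parent.values())}

base = {"a": 1, "b": 2, "c": 3}
snap = {"a": 1, "b": 2, "c": 4}         # only "c" was rewritten
print(send_clone(snap, base))   # metadata for all 3 paths, 1 new extent
print(send_parent(snap, base))  # metadata for "c" only, 1 new extent
```

Both modes send the same single new extent; the saving of '-p' over '-c' is entirely in the metadata, which is exactly darkling's "differences" vs "from scratch but shared" distinction.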
Re: backing up a file server with many subvolumes
> [ ... ] In each filesystem subdirectory are incremental > snapshot subvolumes for that filesystem. [ ... ] The scheme > is something like this: > /backup/// BTW hopefully this does not amount to too many subvolumes in the '.../backup/' volume, because that can create complications, where "too many" IIRC is more than a few dozen (even if a low number of hundreds is still doable). > I'd like to try to back up (duplicate) the file server > filesystem containing these snapshot subvolumes for each > remote machine. The problem is that I don't think I can use > send/receive to do this. "Btrfs send" requires "read-only" > snapshots, and snapshots are not recursive as yet. Why is that a problem? What is a recursive snapshot? > I think there are too many subvolumes which change too often > to make doing this without recursion practical. It is not clear to me how the «incremental snapshot subvolumes for that filesystem» are made, whether with RSYNC or with 'send' and 'receive' itself. It is also not clear to me why those snapshots «change too often»; why would they change at all? Once a backup is made, in whichever way, to an «incremental snapshot», why would that «incremental snapshot» ever change, except by being deleted? There are some tools that rely on the specific abilities of 'send' with options '-p' and '-c' to save a lot of network bandwidth and target storage space; perhaps you might be interested in searching for them. Anyhow I'll repeat here part of an answer to a similar message: issues like yours are usually based on incomplete understanding of 'send' and 'receive', which on IRC user "darkling" explained fairly well: > When you use -c, you're telling the FS that it can expect to > find a sent copy of that subvol on the receiving side, and > that anything shared with it can be sent by reference. OK, so > with -c on its own, you're telling the FS that "all the data > in this subvol already exists on the remote". 
> So, when you send your subvol, *all* of the subvol's metadata
> is sent, and where that metadata refers to an extent that's
> shared with the -c subvol, the extent data isn't sent, because
> it's known to be on the other end already, and can be shared
> directly from there.

> OK. So, with -p, there's a "base" subvol. The send subvol and
> the -p reference subvol are both snapshots of that base (at
> different times). The -p reference subvol, as with -c, is
> assumed to be on the remote FS. However, because it's known to
> be an earlier version of the same data, you can be more
> efficient in the sending by saying "start from the earlier
> version, and modify it in this way to get the new version".

> So, with -p, not all of the metadata is sent, because you know
> you've already got most of it on the remote in the form of the
> earlier version.

> So -p is "take this thing and apply these differences to it"
> and -c is "build this thing from scratch, but you can share
> some of the data with these sources".

Also here some additional details:

http://logs.tvrrug.org.uk/logs/%23btrfs/2016-06-29.html#2016-06-29T22:39:59

The requirement for read-only is because in that way it is pretty certain that the same stuff is on both origin and target volume.

It may help to compare with RSYNC: it has to scan both the full origin and target trees, because it cannot be told that there is a parent tree that is the same on origin and target; but with option '--link-dest' it can do something similar to 'send -c'.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
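The -p/-c distinction darkling describes can be sketched as a toy model. The function names and data layout below are purely illustrative, and have nothing to do with the real send-stream format:

```python
# Toy model of 'btrfs send -c' vs 'btrfs send -p', following the IRC
# explanation quoted above. Structures are illustrative only: a "subvol"
# is a dict mapping file paths to extent ids, plus the set of its extents.

def send_with_clone_sources(subvol, clone_sources):
    """-c: *all* metadata is sent; extent data shared with a -c subvol
    is sent by reference instead of by content."""
    shared = set()
    for src in clone_sources:
        shared |= src["extents"]
    stream = []
    for path, extent in subvol["files"].items():
        if extent in shared:
            stream.append(("clone", path, extent))   # reference only
        else:
            stream.append(("write", path, extent))   # full data
    return stream

def send_with_parent(subvol, parent):
    """-p: an earlier snapshot of the same subvol is assumed to exist on
    the remote; send only the differences against it."""
    stream = []
    for path, extent in subvol["files"].items():
        if parent["files"].get(path) != extent:
            stream.append(("write", path, extent))
    for path in parent["files"].keys() - subvol["files"].keys():
        stream.append(("unlink", path))
    return stream

base = {"files": {"a": "e1", "b": "e2"}, "extents": {"e1", "e2"}}
snap = {"files": {"a": "e1", "b": "e3"}, "extents": {"e1", "e3"}}

# -c: every file appears in the stream, but shared extents go by reference.
print(send_with_clone_sources(snap, [base]))
# -p: only the changed file appears in the stream at all.
print(send_with_parent(snap, base))
```

With '-c' the whole tree is rebuilt on the remote (sharing data where possible); with '-p' only the delta against the parent is transmitted, which is why '-p' saves both metadata and data traffic.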
Re: BTRFS Metadata Corruption Prevents Scrub and btrfs check
> How can I attempt to rebuild the metadata, with a treescan or
> otherwise?

Unfortunately for backrefs I don't know.

>> In general metadata in Btrfs is fairly intricate and metadata
>> block loss is pretty fatal, that's why metadata should most
>> times be redundant as in 'dup' or 'raid1' or similar:

> All the data and metadata on this system is in raid1 or
> raid10, in fact I discovered this issue while trying to change
> my balance form raid1 to raid10.

> johnf@carbon:~$ sudo btrfs fi df /
> Data, RAID10: total=1.13TiB, used=1.12TiB
> Data, RAID1: total=5.17TiB, used=5.16TiB
> System, RAID1: total=32.00MiB, used=864.00KiB
> Metadata, RAID10: total=3.09GiB, used=3.08GiB
> Metadata, RAID1: total=13.00GiB, used=10.16GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B

That's weird, because as a rule when there is a checksum error it is automatically corrected on read if there is a "good copy". Also because you have both RAID1 and RAID10 data and metadata: you should have just RAID1 metadata and RAID10 data, or both RAID10. There is probably an "interrupted" 'balance'.

Just had a look at your previous message, and it reports 2 uncorrectable errors out of 12.56TB. But you have got everything redundant ('raid1' or 'raid10'), so it looks like somehow those two blocks are supposed to be copies of each other and both are bad:

* 'sdg1' at physical sector 5016524768, volume byte offset 9626194001920
* 'sdh1' at physical sector 5016524768, volume byte offset 9626194001920
* both sectors belong to the tree at byte offset 4804958584832.

Note: BTW I remember someone wrote a guide to decoding Btrfs 'dmesg' lines, but I can't find it anymore, so I am not sure that interpretation is entirely correct.

It is a bit "strange" that it is the same sector, as the Btrfs 'raid1' profile is not necessarily block-for-block; mirrored chunks can be at different offsets.
The "strange" symptoms hint not just at disk issues, but also that some past attempts at conversion (I remember a previous message from you) or recovery have messed things up a bit.

Someone mentioned in some mailing list articles various tools to print out trees and subtrees and to inspect the metadata in general. 'Knorrie' has written a Python library (and a few inspection tools) with which it is possible to traverse various Btrfs trees, but I haven't used it:

https://github.com/knorrie/python-btrfs/

I'd suggest searching the mailing list for related information. Also the relevant tree is described here (the one on kernel.org probably is more up-to-date):

https://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-backrefs.html
https://btrfs.wiki.kernel.org/index.php/Btrfs_design#Explicit_Back_References
https://btrfs.wiki.kernel.org/index.php/Data_Structures
https://btrfs.wiki.kernel.org/index.php/Trees

You might want to use 'btrfs inspect-internal'. Conceivably, as the issue seems related to an extent backref, 'btrfsck --repair' with '--init-extent-tree' might help, but I cannot recommend that, as I don't know whether they are relevant to your problem and/or safe in your situation. Consider this:

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg26816.html

I would use the most recent version of 'btrfsprogs'.

One possibility I would consider is to move a sufficiently large subtree of:

/home/johnf/personal/projects/openwrt/trunk/build_dir/target-mips_r2_uClibc-0.9.32/hostapd-wpad-mini/hostapd-20110117/hostapd/hostapd.eap_user

into its own directory, create a new subvolume, 'cp --reflink' everything except that directory into the new subvolume, and then *perhaps* work on the new subvolume will not access the damaged metadata block.
Re: BTRFS Metadata Corruption Prevents Scrub and btrfs check
> Read error at byte 0, while reading 3975 bytes: Input/output error

Bad news. That means that probably the disk is damaged and further issues may happen.

> corrected errors: 0, uncorrectable errors: 2, unverified errors: 0

Even worse news.

> Incorrect local backref count on 5165855678464 root 259 owner 1732872
> offset 0 found 0 wanted 1 back 0x3ba80f40
> Backref disk bytenr does not match extent record,
> bytenr=5165855678464, ref bytenr=7880454922968236032
> backpointer mismatch on [5165855678464 28672]

"Better" news. In practice a single metadata leaf node is corrupted in back references. You might be lucky and that might be rebuildable, but I don't know enough about the somewhat intricate Btrfs metadata trees to figure that out. Some metadata is rebuildable from other metadata with a tree scan, some not.

In general metadata in Btrfs is fairly intricate and metadata block loss is pretty fatal; that's why metadata should most times be redundant as in 'dup' or 'raid1' or similar:

http://www.sabi.co.uk/blog/16-two.html?160817#160817
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
>> Consider the common case of a 3-member volume with a 'raid1'
>> target profile: if the sysadm thinks that a drive should be
>> replaced, the goal is to take it out *without* converting every
>> chunk to 'single', because with 2-out-of-3 devices half of the
>> chunks will still be fully mirrored.
>> Also, removing the device to be replaced should really not be
>> the same thing as balancing the chunks, if there is space, to be
>> 'raid1' across remaining drives, because that's a completely
>> different operation.

> There is a command specifically for replacing devices. It
> operates very differently from the add+delete or delete+add
> sequences. [ ... ]

Perhaps it was not clear that I was talking about removing a device, as distinct from replacing it, and that I used "removed" instead of "deleted" deliberately, to avoid confusion with the 'delete' command. In the everyday practice of system administration it often happens that a device should be removed first, and replaced later, for example when it is suspected to be faulty, or is intermittently faulty. The replacement can be done with 'replace' or 'add+delete' or 'delete+add', but that's a different matter. Perhaps I should not have used the generic verb "remove", but written "make unavailable".

This brings up again the topic of some "confusion" in the design of the Btrfs multidevice handling logic, where at least initially one could only expand the storage space of a multidevice by 'add' of a new device or shrink the storage space by 'delete' of an existing one; I think it was not conceived at Btrfs design time that storage space could be nominally constant while a device (and the chunks on it) has a state of "available" ("present", "online", "enabled") or "unavailable" ("absent", "offline", "disabled"), either because of events or because of system administrator action.
The 'missing' pseudo-device designator was added later, and 'replace' also later, to avoid having to first expand then shrink (or vice versa) the storage space and the related copying.

My impression is that it would be less "confused" if the Btrfs device handling logic were changed to allow for the state of "member of the multidevice set but not actually available" and the related consequent state for chunks that ought to be on it; that probably would be essential to fixing the confusing current aspects of recovery in a multidevice set. That would be very useful even if it may require a change in the on-disk format to distinguish the distinct states of membership and availability for devices, and to mark chunks as available or not (chunks of course being only possible on member devices).

That is, it would also be nice to have the opposite state of "not member of the multidevice set but actually available to it", that is a spare device, and related logic.

Note: simply setting '/sys/block/$DEV/device/delete' is not a good option, because that makes the device unavailable not just to Btrfs, but also to the whole system. In the ordinary practice of system administration it may well be useful to make a device unavailable to Btrfs but still available to the system, for example for testing, and anyhow they are logically distinct states. That also means a member device might well be available to the system, but marked as "not available" to Btrfs.
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
[ ... on the difference between number of devices and length of a chunk-stripe ... ]

> Note: possibilities get even more interesting with a 4-device
> volume with 'raid1' profile chunks, and similar case involving
> other profiles than 'raid1'.

Consider for example a 4-device volume with 2 devices abruptly missing: if 2-length 'raid1' chunk-stripes have been uniformly laid across devices, then some chunk-stripes will be completely missing (where both chunks in the stripe were on the 2 missing devices), some will be 1-length, and some will be 2-length.

What to do when devices are missing? One possibility is to simply require mount with the 'degraded' option, by default read-only, but allowing read-write, simply as a way to ensure the sysadm knows that some metadata/data *may* not be redundant or *may* even be unavailable (if the chunk-stripe length is less than the minimum to reconstruct the data). Then attempts to read unavailable metadata or data would return an error like a checksum violation without redundancy, dynamically (when the application or 'balance' or 'scrub' attempts to read the unavailable data).
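The 4-device example can be counted exhaustively; here is a small sketch, assuming (purely for illustration) that 2-length 'raid1' chunk-stripes are laid uniformly over all unordered device pairs:

```python
from itertools import combinations

# Enumerate 2-length 'raid1' chunk-stripes on a 4-device volume,
# one stripe per unordered device pair, then see what survives
# when devices 2 and 3 abruptly go missing.
devices = {0, 1, 2, 3}
missing = {2, 3}

# stripe -> number of copies still available after the loss
survivors = {stripe: len(set(stripe) - missing)
             for stripe in combinations(sorted(devices), 2)}

counts = {n: sum(1 for v in survivors.values() if v == n) for n in (0, 1, 2)}
print(counts)   # one stripe completely missing, four 1-length, one 2-length
```

Out of the six possible stripe placements, exactly one loses both copies, four lose one copy, and one is untouched, which matches the "some missing, some 1-length, some 2-length" mix described above.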
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
>> What makes me think that "unmirrored" 'raid1' profile chunks
>> are "not a thing" is that it is impossible to remove
>> explicitly a member device from a 'raid1' profile volume:
>> first one has to 'convert' to 'single', and then the 'remove'
>> copies back to the remaining devices the 'single' chunks that
>> are on the explicitly 'remove'd device. Which to me seems
>> absurd.

> It is, there should be a way to do this as a single operation.
> [ ... ] The reason this is currently the case though is a
> simple one, 'btrfs device delete' is just a special instance
> of balance [ ... ] does no profile conversion, but having
> that as an option would actually be _very_ useful from a data
> safety perspective.

That seems to me an even more "confused" opinion: because removing a device to make it "missing" and removing it permanently should be very different operations. Consider the common case of a 3-member volume with a 'raid1' target profile: if the sysadm thinks that a drive should be replaced, the goal is to take it out *without* converting every chunk to 'single', because with 2-out-of-3 devices half of the chunks will still be fully mirrored. Also, removing the device to be replaced should really not be the same thing as balancing the chunks, if there is space, to be 'raid1' across remaining drives, because that's a completely different operation.

>> Going further in my speculation, I suspect that at the core of
>> the Btrfs multidevice design there is a persistent "confusion"
>> (to use a euphemism) between volumes having a profile, and
>> merely chunks having a profile.

> There generally is. The profile is entirely a property of the
> chunks (each chunk literally has a bit of metadata that says
> what profile it is), not the volume. There's some metadata in
> the volume somewhere that says what profile to use for new
> chunks of each type (I think),

That's the "target" profile for the volume.
> but that doesn't dictate what chunk profiles there are on the
> volume. [ ... ]

But if that's the case, then the current Btrfs logic for determining whether a volume is degraded or not is quite "confused" indeed. Because suppose there is again the simple case of a 3-device volume, where all existing chunks have 'raid1' profile and the volume's target profile is also 'raid1', and one device has gone offline: the volume cannot be said to be "degraded" unless a full examination of all chunks is made. Because it can well happen that in fact *none* of the chunks was mirrored to that device, for example, however unlikely. And vice versa.

Even with 3 devices some chunks may be temporarily "unmirrored" (even if for brief times hopefully). The average case is that half of the chunks will be fully mirrored across the two remaining devices and half will be "unmirrored". Now consider re-adding the third device: at that point the volume has got back all 3 devices, so it is not "degraded", but 50% of the chunks in the volume will still be "unmirrored", even if eventually they will be mirrored on the newly added device.

Note: possibilities get even more interesting with a 4-device volume with 'raid1' profile chunks, and similar cases involving other profiles than 'raid1'.

Therefore the current Btrfs logic for deciding whether a volume is "degraded" seems simply "confused" to me, because whether there are missing devices and whether some chunks are "unmirrored" are not quite the same thing. The same applies to the current logic by which, in a 2-device volume with a device missing, new chunks are created with 'single' profile instead of as "unmirrored" 'raid1' profile: another example of "confusion" between number of devices and chunk profile.
Note: the best that can be said is that a volume has both a "target chunk profile" (one each for data, metadata, system chunks) and a target number of member devices, and that a volume with a number of devices below the target *might* be degraded, and that whether a volume is in fact degraded is not either/or, but given by the percentage of chunks or stripes that are degraded. This is especially made clear by the 'raid1' case, where the chunk-stripe length is always 2, but the number of target devices can be greater than 2. In Btrfs, unlike conventional RAID such as Linux MD, management of devices and management of stripes are rather different operations needing rather different, if related, logic.

My impression is that because of "confusion" between the number of devices in a volume and the status of chunk profiles there are some "surprising" behaviors in Btrfs, and that will take quite a bit to fix, most importantly for the Btrfs developer team to agree among themselves on the semantics attached to both. After 10 years of development that seems the right thing to do :-).
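The idea that "degraded" is a percentage rather than either/or can be made concrete with a toy chunk map; the layout below is hypothetical and purely illustrative:

```python
def degraded_fraction(chunk_map, available):
    """Fraction of 'raid1' chunk-stripes with fewer than 2 live copies.
    chunk_map: list of 2-tuples of device ids (one per chunk-stripe);
    available: set of devices currently present."""
    degraded = sum(1 for stripe in chunk_map
                   if sum(d in available for d in stripe) < 2)
    return degraded / len(chunk_map)

# 3-device volume, all chunks 'raid1'; hypothetical layout.
chunks = [(0, 1), (0, 2), (1, 2), (0, 1)]

print(degraded_fraction(chunks, {0, 1, 2}))  # all devices present
print(degraded_fraction(chunks, {0, 1}))     # device 2 missing
```

With all devices present the fraction is 0.0; with device 2 missing it depends entirely on where the stripes happen to sit (here, half of them), which is exactly why a per-volume "degraded" flag is too coarse.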
Re: raid1 degraded mount still produce single chunks, writeable mount not allowed
> [ ... ] Meanwhile, the problem as I understand it is that at
> the first raid1 degraded writable mount, no single-mode chunks
> exist, but without the second device, they are created. [ ... ]

That does not make any sense, unless there is a fundamental mistake in the design of the 'raid1' profile, which this and other situations make me think is a possibility: that the category of "unmirrored" 'raid1' chunk does not exist in the Btrfs chunk manager. That is, a chunk is either 'raid1' if it has a mirror, or if it has no mirror it must be 'single'.

If a member device of a 'raid1' profile multidevice volume disappears, there will be "unmirrored" 'raid1' profile chunks and some code path must recognize them as such, but the logic of the code does not allow their creation. Question: how does the code know whether a specific 'raid1' chunk is mirrored or not? The chunk must have a link (member, offset) to its mirror, doesn't it?

What makes me think that "unmirrored" 'raid1' profile chunks are "not a thing" is that it is impossible to remove explicitly a member device from a 'raid1' profile volume: first one has to 'convert' to 'single', and then the 'remove' copies back to the remaining devices the 'single' chunks that are on the explicitly 'remove'd device. Which to me seems absurd.

Going further in my speculation, I suspect that at the core of the Btrfs multidevice design there is a persistent "confusion" (to use a euphemism) between volumes having a profile, and merely chunks having a profile.
My additional guess is that the original design concept had multidevice volumes as merely containers for chunks of whichever mixed profiles, so that a subvolume could have 'raid1' profile metadata and 'raid0' profile data, and another could have 'raid10' profile metadata and data; but since handling this turned out to be too hard, this was compromised into volumes where all metadata chunks have the same profile and all data chunks have the same profile, which requires special-case handling of corner cases, like volumes being converted or missing member devices.

So in the case of 'raid1', a volume with say a 'raid1' data profile should have all-'raid1', fully mirrored chunks, and the lack of a member device fails that aim in two ways.
Re: Low IOOP Performance
[ ... ]

> I have a 6-device test setup at home and I tried various setups
> and I think I got rather better than that.

* 'raid1' profile:

  soft# btrfs fi df /mnt/sdb5
  Data, RAID1: total=273.00GiB, used=269.94GiB
  System, RAID1: total=32.00MiB, used=56.00KiB
  Metadata, RAID1: total=1.00GiB, used=510.70MiB
  GlobalReserve, single: total=176.00MiB, used=0.00B

  soft# fio --directory=/mnt/sdb5 --runtime=30 --status-interval=10 blocks-randomish.fio | tail -3
  Run status group 0 (all jobs):
     READ: io=105508KB, aggrb=3506KB/s, minb=266KB/s, maxb=311KB/s, mint=30009msec, maxt=30090msec
    WRITE: io=100944KB, aggrb=3354KB/s, minb=256KB/s, maxb=296KB/s, mint=30009msec, maxt=30090msec

* 'raid10' profile:

  soft# btrfs fi df /mnt/sdb6
  Data, RAID10: total=276.00GiB, used=272.49GiB
  System, RAID10: total=96.00MiB, used=48.00KiB
  Metadata, RAID10: total=3.00GiB, used=512.06MiB
  GlobalReserve, single: total=176.00MiB, used=0.00B

  soft# fio --directory=/mnt/sdb6 --runtime=30 --status-interval=10 blocks-randomish.fio | tail -3
  Run status group 0 (all jobs):
     READ: io=89056KB, aggrb=2961KB/s, minb=225KB/s, maxb=271KB/s, mint=30009msec, maxt=30076msec
    WRITE: io=85248KB, aggrb=2834KB/s, minb=212KB/s, maxb=261KB/s, mint=30009msec, maxt=30076msec

* 'single' profile on MD RAID10:

  soft# btrfs fi df /mnt/md0
  Data, single: total=278.01GiB, used=274.32GiB
  System, single: total=4.00MiB, used=48.00KiB
  Metadata, single: total=2.01GiB, used=615.73MiB
  GlobalReserve, single: total=208.00MiB, used=0.00B

  soft# grep -A1 md0 /proc/mdstat
  md0 : active raid10 sdg1[6] sdb1[0] sdd1[2] sdf1[4] sdc1[1] sde1[3]
        364904232 blocks super 1.0 8K chunks 2 near-copies [6/6] [UU]

  soft# fio --directory=/mnt/md0 --runtime=30 --status-interval=10 blocks-randomish.fio | tail -3
  Run status group 0 (all jobs):
     READ: io=160928KB, aggrb=5357KB/s, minb=271KB/s, maxb=615KB/s, mint=30012msec, maxt=30038msec
    WRITE: io=158892KB, aggrb=5289KB/s, minb=261KB/s, maxb=616KB/s, mint=30012msec, maxt=30038msec

That's a range of 700-1300 4KiB random
mixed-rw IOPS, quite reasonable for 6x 1TB 7200RPM SATA drives, each capable of 100-120. It helps that the test file is just 100G, 10% of the total drive extent, so arm movement is limited. Not surprising that the much more mature MD RAID has an edge; a bit stranger that on this test the 'raid1' profile seems a bit faster than the 'raid10' profile.

The much smaller numbers seem to happen to me too (probably some misfeature of 'fio') with 'buffered=1', and the larger numbers for ZFSonLinux are "suspicious".

> It seems unlikely to me that you got that with a 10-device
> mirror 'vdev', most likely you configured it as a stripe of 5x
> 2-device mirror vdevs, that is RAID10.

Indeed I double checked the end of the attached log and that was the case.

My FIO config file:

  # vim:set ft=ini:
  [global]
  filename=FIO-TEST
  fallocate=keep
  size=100G
  buffered=0
  ioengine=libaio
  io_submit_mode=offload
  iodepth=2
  numjobs=12
  blocksize=4K
  kb_base=1024

  [rand-mixed]
  rw=randrw
  stonewall
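The IOPS range quoted above follows directly from the 'aggrb' figures: with 'kb_base=1024' in the job file, fio reports KiB/s, so at a 4KiB blocksize the IOPS rate is simply aggrb divided by 4. A quick check:

```python
# Convert fio 'aggrb' figures (KiB/s, since the job file sets
# kb_base=1024) into 4KiB IOPS for the three runs above.
aggrb_kib = {
    "btrfs raid1":        {"read": 3506, "write": 3354},
    "btrfs raid10":       {"read": 2961, "write": 2834},
    "btrfs single on MD": {"read": 5357, "write": 5289},
}

iops = {name: {direction: kib // 4 for direction, kib in run.items()}
        for name, run in aggrb_kib.items()}
for name, run in iops.items():
    print(name, run)

rates = [v for run in iops.values() for v in run.values()]
print(min(rates), max(rates))   # per-direction rates, roughly 700-1300
```

The per-direction rates span roughly 708 to 1339 IOPS, i.e. the "700-1300" summarized above.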
Re: Low IOOP Performance
>>> On Mon, 27 Feb 2017 22:11:29 +, p...@btrfs.list.sabi.co.uk (Peter
>>> Grandi) said:

> [ ... ]
>> I have a 6-device test setup at home and I tried various setups
>> and I think I got rather better than that. [ ... ]
> That's a range of 700-1300 4KiB random mixed-rw IOPS,

Rerun with 1M blocksize:

  soft# fio --directory=/mnt/sdb5 --runtime=30 --status-interval=10 --blocksize=1M blocks-randomish.fio | tail -3
  Run status group 0 (all jobs):
     READ: io=2646.0MB, aggrb=89372KB/s, minb=7130KB/s, maxb=7776KB/s, mint=30081msec, maxt=30317msec
    WRITE: io=2297.0MB, aggrb=77584KB/s, minb=6082KB/s, maxb=6796KB/s, mint=30081msec, maxt=30317msec

  soft# fio --directory=/mnt/sdb6 --runtime=30 --status-interval=10 --blocksize=1M blocks-randomish.fio | tail -3
  Run status group 0 (all jobs):
     READ: io=2781.0MB, aggrb=94015KB/s, minb=5932KB/s, maxb=10290KB/s, mint=30121msec, maxt=30290msec
    WRITE: io=2431.0MB, aggrb=82183KB/s, minb=4779KB/s, maxb=9102KB/s, mint=30121msec, maxt=30290msec

  soft# killall -9 fio
  fio: no process found

  soft# fio --directory=/mnt/md0 --runtime=30 --status-interval=10 --blocksize=1M blocks-randomish.fio | tail -3
  Run status group 0 (all jobs):
     READ: io=1504.0MB, aggrb=50402KB/s, minb=3931KB/s, maxb=4387KB/s, mint=30343msec, maxt=30556msec
    WRITE: io=1194.0MB, aggrb=40013KB/s, minb=3158KB/s, maxb=3475KB/s, mint=30343msec, maxt=30556msec

Interesting that Btrfs 'single' on MD RAID10 becomes rather slower (I guess a low level of intrinsic parallelism).
For comparison, the same on a JFS on top of MD RAID10:

  soft# grep -A1 md40 /proc/mdstat
  md40 : active raid10 sdg4[5] sdd4[2] sdb4[0] sdf4[4] sdc4[1] sde4[3]
        486538240 blocks super 1.0 512K chunks 3 near-copies [6/6] [UU]

  soft# fio --directory=/mnt/md40 --runtime=30 --status-interval=10 --blocksize=4K blocks-randomish.fio | grep -A2 '(all jobs)' | tail -3
  Run status group 0 (all jobs):
     READ: io=31408KB, aggrb=1039KB/s, minb=80KB/s, maxb=90KB/s, mint=30206msec, maxt=30227msec
    WRITE: io=27800KB, aggrb=919KB/s, minb=70KB/s, maxb=81KB/s, mint=30206msec, maxt=30227msec

  soft# fio --directory=/mnt/md40 --runtime=30 --status-interval=10 --blocksize=1M blocks-randomish.fio | grep -A2 '(all jobs)' | tail -3
  Run status group 0 (all jobs):
     READ: io=2151.0MB, aggrb=72619KB/s, minb=5865KB/s, maxb=6383KB/s, mint=30134msec, maxt=30331msec
    WRITE: io=1772.0MB, aggrb=59824KB/s, minb=4712KB/s, maxb=5365KB/s, mint=30134msec, maxt=30331msec

XFS is usually better at multithreaded workloads within the same file (rather than across files).
Re: Low IOOP Performance
[ ... ]

> a ten disk raid1 using 7.2k 3 TB SAS drives

Those are really low IOPS-per-TB devices, but a good choice for SAS, as they will have SCT/ERC.

> and used aio to test IOOP rates. I was surprised to measure
> 215 read and 72 write IOOPs on the clean new filesystem.

For that you really want to use the 'raid10' profile; 'raid1' is quite different, and has an odd recovery "gotcha". Also, so far 'raid1' in Btrfs only reads from one of the two mirrors per thread. Anyhow the 72 write IOPS look like a single member device's IOPS rate, and that's puzzling, as if Btrfs is not multithreading writes to a many-device 'raid1' profile volume. I have a 6-device test setup at home and I tried various setups and I think I got rather better than that.

> Sequential writes ran as expected at roughly 650 MB/s.

That's a bit too high: on a single similar drive I get around 65MB/s average with relatively large files, and I would expect around 4-5x that from a 10-device mirrored profile, regardless of filesystem type. I strongly suspect that we have a different notion of "IOPS", perhaps either logical vs. physical IOPS, or randomish vs. sequentialish IOPS. I'll have a look at your attachments in more detail.

> I created a zfs filesystem for comparison on another
> checksumming filesystem using the same layout and measured
> IOOP rates at 4315 read, 1449 write with sync enabled (without
> sync it's clearly just writing to RAM), sequential performance
> was comparable to btrfs.

It seems unlikely to me that you got that with a 10-device mirror 'vdev'; most likely you configured it as a stripe of 5x 2-device mirror vdevs, that is RAID10.
Re: understanding disk space usage
[ ... ]

> The issue isn't total size, it's the difference between total
> size and the amount of data you want to store on it. and how
> well you manage chunk usage. If you're balancing regularly to
> compact chunks that are less than 50% full, [ ... ] BTRFS on
> 16GB disk images before with absolutely zero issues, and have
> a handful of fairly active 8GB BTRFS volumes [ ... ]

Unfortunately balance operations are quite expensive, especially from inside VMs. On the other hand, if the system is not much disk-constrained, relatively frequent balances are a good idea indeed. It is a bit like the advice in the other thread on OLTP to run frequent data defrags, which are also quite expensive. Both combined are like running the compactor/cleaner on log-structured (another variant of "COW") filesystems like NILFS2: running that frequently means tighter space use and better locality, but is quite expensive too.

>> [ ... ] My impression is that the Btrfs design trades space
>> for performance and reliability.

> In general, yes, but a more accurate statement would be that
> it offers a trade-off between space and convenience. [ ... ]

It is not quite "convenience", it is overhead: whole-volume operations like compacting, defragmenting (or fscking) tend to cost significantly in IOPS and also in transfer rate, and on flash SSDs they also consume lifetime. Therefore personally I prefer to have quite a bit of unused space in Btrfs or NILFS2: at a minimum around double, 10-20% rather than the 5-10% that I think is the minimum advisable with conventional designs.
Re: understanding disk space usage
>> My system is or seems to be running out of disk space but I
>> can't find out how or why. [ ... ]
>> Filesystem      Size  Used Avail Use% Mounted on
>> /dev/sda3        28G   26G  2.1G  93% /
>> [ ... ]

> So from chunk level, your fs is already full. And balance
> won't success since there is no unallocated space at all.

To add to this, 28GiB is a bit too small for Btrfs, because at that point chunk size is 1GiB. I have the habit of sizing partitions to an exact number of GiB, and that means that most of 1GiB will never be used by Btrfs: since allocation happens in whole 1GiB chunks, eventually there will be just less than 1GiB left unallocated, too little for another chunk. Unfortunately the chunk size is not manually settable. Example here from 'btrfs fi usage':

  Overall:
      Device size:        88.00GiB
      Device allocated:   86.06GiB
      Device unallocated:  1.94GiB
      Device missing:         0.00B
      Used:               80.11GiB
      Free (estimated):    6.26GiB  (min: 5.30GiB)

That means that I should 'btrfs balance' now, because of the 1.94GiB "unallocated", 0.94GiB will never be allocated, and that leaves just 1GiB "unallocated", which is the minimum for running 'btrfs balance'. I have just done so and this is the result:

  Overall:
      Device size:        88.00GiB
      Device allocated:   82.03GiB
      Device unallocated:  5.97GiB
      Device missing:         0.00B
      Used:               80.11GiB
      Free (estimated):    6.26GiB  (min: 3.28GiB)

At some point I had decided to use 'mixed-bg' allocation to reduce this problem and hopefully improve locality, but that means that metadata and data need to have the same profile, and I really want metadata to be 'dup' because of checksumming, and I don't want data to be 'dup' too.

> [ ... ] To proceed, add a larger device to current fs, and do
> a balance or just delete the 28G partition then btrfs will
> handle the rest well.

Usually for this I use a USB stick, with a 1-3GiB partition plus a bit extra because of that extra bit of space.
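The rounding loss can be checked with a little arithmetic, using the figures from the first 'btrfs fi usage' output above:

```python
# With 1GiB chunks, whatever is left after the last whole chunk can
# never be allocated. Figures (in GiB) from the 88GiB example above,
# before the balance.
device_size = 88.00
allocated   = 86.06
chunk       = 1.00   # chunk size in GiB, not manually settable

unallocated       = device_size - allocated    # 1.94 GiB
never_allocatable = unallocated % chunk        # 0.94 GiB, below chunk size
usable_chunks     = int(unallocated // chunk)  # just 1 whole chunk left

print(round(unallocated, 2), round(never_allocatable, 2), usable_chunks)
```

So of the 1.94GiB "unallocated", only one more whole 1GiB chunk can ever be carved out; the remaining 0.94GiB is dead space until a balance frees up more contiguous chunk room.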
https://btrfs.wiki.kernel.org/index.php/FAQ#How_much_free_space_do_I_have.3F
https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_Btrfs_claims_I.27m_out_of_space.2C_but_it_looks_like_I_should_have_lots_left.21
marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html

Unfortunately, if it is a single-device volume and metadata is 'dup', to remove the extra temporary device one first has to convert the metadata to 'single' and then back to 'dup' after removal.

There are also some additional reasons why space used (rather than allocated) may be larger than expected, in special but not wholly infrequent cases. My impression is that the Btrfs design trades space for performance and reliability.
Re: BTRFS for OLTP Databases
> I have tried BTRFS from Ubuntu 16.04 LTS for write intensive
> OLTP MySQL Workload.

This has a lot of interesting and mostly agreeable information:

https://blog.pgaddict.com/posts/friends-dont-let-friends-use-btrfs-for-oltp

The main target of Btrfs is where one wants checksums and occasional snapshots for backup (rather than rollback), and applications do whole-file rewrites or appends.

> It did not go very well ranging from multi-seconds stalls
> where no transactions are completed

That usually is more because of the "clever" design and defaults of the Linux page cache and block IO subsystem, which are astutely pessimized for every workload, but especially for read-modify-write ones, never mind for RMW workloads on copy-on-write filesystems. That most OS designs are pessimized for anything like a "write intensive OLTP" workload is not new; M Stonebraker complained about that 35 years ago, and nothing much has changed:

http://www.sabi.co.uk/blog/anno05-4th.html?051012d#051012d

> to the finally kernel OOPS with "no space left on device"
> error message and filesystem going read only.

That's because Btrfs has a two-level allocator, where space is allocated in 1GiB chunks (distinct as to data and metadata) and then in 16KiB nodes, and this makes it far more likely for free space fragmentation to occur. Therefore Btrfs has a free space compactor ('btrfs balance') that must be used the more often the more updates happen.

> interested in "free" snapshots which look very attractive

The general problem is that it is pretty much impossible to have read-modify-write rollbacks for cheap, because the writes in general are scattered (that is, their time coherence is very different from their spatial coherence). That means either heavy spatial fragmentation or huge write amplification. The 'snapshot' type of DM/LVM2 device delivers heavy spatial fragmentation; Btrfs strikes a balance between the two.
Another commenter has mentioned the use of 'nodatacow' to prevent RMW from resulting in huge write amplification.

> to use for database recovery scenarios allow instant rollback
> to the previous state.

You may be more interested in NILFS2 for that, but there are significant tradeoffs there too: NILFS2 also requires a free space compactor, and since it gives up on short-term spatial coherence, its compactor needs to compact data space as well.
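A rough illustration of why 'nodatacow' matters for this workload (a toy model only: it assumes 4 KiB application writes, 16 KiB btree nodes and a 3-level path, and it ignores checksum and superblock traffic):

```shell
# Toy model: a 4 KiB random overwrite on a CoW filesystem writes the
# new data block plus, in the worst case, a new copy of every 16 KiB
# node on the btree path (3 levels assumed here for illustration).
write_kib=4
node_kib=16
levels=3
total_kib=$(( write_kib + node_kib * levels ))
echo "written per 4 KiB overwrite: ${total_kib} KiB"
echo "worst-case amplification: $(( total_kib / write_kib ))x"
# With 'nodatacow' the data block is overwritten in place and no new
# extent mapping is inserted, so steady-state amplification for the
# data path approaches 1x.
```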
Re: [Jfs-discussion] benchmark results
> I've had the chance to use a testsystem here and couldn't resist

Unfortunately there seems to be an overproduction of rather meaningless file system benchmarks...

> running a few benchmark programs on them: bonnie++, tiobench,
> dbench and a few generic ones (cp/rm/tar/etc...) on ext{234},
> btrfs, jfs, ufs, xfs, zfs. All with standard mkfs/mount options
> and +noatime for all of them. Here are the results, no graphs -
> sorry: [ ... ]

After having a glance, I suspect that your tests could be enormously improved, and doing so would reduce the pointlessness of the results. A couple of hints:

* In the generic test the 'tar' test bandwidth is exactly the same (276.68 MB/s) for nearly all filesystems.
* There are read transfer rates higher than the one reported by 'hdparm', which is 66.23 MB/s (comically enough, *all* the read transfer rates your benchmarks report are higher).

BTW the use of Bonnie++ is also usually a symptom of a poor understanding of file system benchmarking. On the plus side, test setup context is provided in the env directory, which is rare enough to be commendable.

> Short summary, AFAICT:
> - btrfs, ext4 are the overall winners
> - xfs too, but creating/deleting many files was *very* slow

Maybe, and these conclusions are sort of plausible (but I prefer JFS and XFS for different reasons); however they are not supported by your results, which seem to me to lack much meaning: what is being measured is far from clear, and in particular it does not seem to be file system performance, or at least not an aspect of file system performance that relates to common usage.

I think that it is rather better to run a few simple operations (like the generic test) properly (unlike the generic test), to give a feel for how well implemented the basic operations of the file system design are. Profiling file system performance with a meaningful full-scale benchmark is a rather difficult task requiring great intellectual fortitude and lots of time.
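Read rates above the raw device rate are the signature of measuring the page cache rather than the disk. A minimal sketch of how to avoid that (dry-run: the commands are printed, not executed, since they need root; /dev/sdX is a hypothetical placeholder):

```shell
# Dry-run: print, rather than execute, a read benchmark that bypasses
# the page cache, so the result cannot exceed the device's raw
# transfer rate. /dev/sdX is a placeholder for the device under test.
run() { echo "+ $*"; }

# Drop the page, dentry and inode caches first...
run sh -c 'echo 3 > /proc/sys/vm/drop_caches'
# ...then read with O_DIRECT so the page cache is bypassed entirely.
run dd if=/dev/sdX of=/dev/null bs=1M count=1024 iflag=direct
```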
> - if you need only fast but no cool features or journaling,
>   ext2 is still a good choice :)

That is however a generally valid conclusion, but with a very, very important qualification: for freshly loaded filesystems. Also with several other important qualifications, but "freshly loaded" is a pet peeve of mine :-).