Re: [PATCH] btrfs: check options during subsequent mount
28.04.2017 12:14, Anand Jain wrote:
> We allow recursive mounts with subvol options such as [1]
>
> [1]
> mount -o rw,compress=lzo /dev/sdc /btrfs1
> mount -o ro,subvol=sv2 /dev/sdc /btrfs2
>
> And except for the btrfs-specific subvol and subvolid options
> all other options are just ignored in the subsequent mounts.
>
> In the below example [2] the effective compression is only zlib,
> even though there was no error for the option -o compress=lzo.
>
> [2]
>
> # mount -o compress=zlib /dev/sdc /btrfs1
> #echo $?
> 0
>
> # mount -o compress=lzo /dev/sdc /btrfs
> #echo $?
> 0
>
> #cat /proc/self/mounts
> ::
> /dev/sdc /btrfs1 btrfs
> rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/ 0 0
> /dev/sdc /btrfs btrfs
> rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/ 0 0
>
>
> Further, random string .. has no error as well.
> -
> # mount -o compress=zlib /dev/sdc /btrfs1
> #echo $?
> 0
>
> # mount -o something /dev/sdc /btrfs
> #echo $?
> 0
> -
>
> This patch fixes the above issue, by checking if the passed
> options are only subvol or subvolid in the subsequent mount.
>
> Signed-off-by: Anand Jain
> ---
> fs/btrfs/super.c | 40
> 1 file changed, 40 insertions(+)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 9530a333d302..e0e542345c38 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -389,6 +389,44 @@ static const match_table_t tokens = {
> 	{Opt_err, NULL},
> };
>
> +static int parse_recursive_mount_options(char *data)
> +{
> +	substring_t args[MAX_OPT_ARGS];
> +	char *options, *orig;
> +	char *p;
> +	int ret = 0;
> +
> +	/*
> +	 * This is not a remount thread, but we allow recursive mounts
> +	 * with varying RO/RW flag to support subvol-mounts. So error-out
> +	 * if any other option being passed in here.
> +	 */
> +
> +	options = kstrdup(data, GFP_NOFS);
> +	if (!options)
> +		return -ENOMEM;
> +
> +	orig = options;
> +
> +	while ((p = strsep(&options, ",")) != NULL) {
> +		int token;
> +		if (!*p)
> +			continue;
> +
> +		token = match_token(p, tokens, args);
> +		switch(token) {
> +		case Opt_subvol:
> +		case Opt_subvolid:
> +			break;
> +		default:
> +			ret = -EBUSY;
> +		}
> +	}
> +
> +	kfree(orig);
> +	return ret;
> +}
> +
> /*
>  * Regular mount options parser. Everything that is needed only when
>  * reading in a new superblock is parsed here.
> @@ -1611,6 +1649,8 @@ static struct dentry *btrfs_mount(struct
> file_system_type *fs_type, int flags,
> 		free_fs_info(fs_info);
> 		if ((flags ^ s->s_flags) & MS_RDONLY)
> 			error = -EBUSY;
> +		if (parse_recursive_mount_options(data))
> +			error = -EBUSY;

But if subvol= was passed, it should not reach this place at all -
btrfs_mount() returns earlier in

	if (subvol_name || subvol_objectid != BTRFS_FS_TREE_OBJECTID) {
		/* mount_subvol() will free subvol_name. */
		return mount_subvol(subvol_name, subvol_objectid, flags,
				    device_name, data);
	}

So the check for subvol here seems redundant.

> 	} else {
> 		snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
> 		btrfs_sb(s)->bdev_holder = fs_type;
>

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
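The rule the patch enforces can be illustrated in userspace. The sketch below is not the kernel code: `only_subvol_opts` is a hypothetical shell function that applies the same test as `parse_recursive_mount_options()` to a comma-separated option string, accepting it only if every non-empty entry is `subvol=` or `subvolid=`.

```shell
#!/bin/bash
# Illustrative sketch only; the function name is hypothetical.
# Mirrors the rule in parse_recursive_mount_options() from the patch:
# return 0 if every non-empty option is subvol=... or subvolid=...,
# return 1 otherwise (where the patch would set -EBUSY).
only_subvol_opts() {
    local opts=$1 opt
    local -a parts
    # Split on commas without glob expansion.
    IFS=, read -ra parts <<< "$opts"
    for opt in "${parts[@]}"; do
        [ -z "$opt" ] && continue      # skip empty entries, like the !*p check
        case $opt in
            subvol=*|subvolid=*) ;;    # allowed in a subsequent mount
            *) return 1 ;;             # any other option is rejected
        esac
    done
    return 0
}

only_subvol_opts "subvol=sv2"   && echo "subvol=sv2: accepted"
only_subvol_opts "compress=lzo" || echo "compress=lzo: rejected"
only_subvol_opts "something"    || echo "something: rejected"
```

As the reply points out, when subvol= or subvolid= is passed, btrfs_mount() returns through mount_subvol() before reaching the new check, so the two accepted cases may never be exercised at that call site.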
Re: btrfs, journald logs, fragmentation, and fallocate
Goffredo Baroncelli posted on Fri, 28 Apr 2017 19:05:21 +0200 as excerpted:

> After some thinking I adopted a different strategy: I used journald as
> collector, then I forward all the log to rsyslogd, which used a "log
> append" format. Journald never writes on the root filesystem, only in
> tmp.

Great minds think alike. =:^)

Only here it's syslog-ng that does the permanent writes. I just couldn't
see journald's crazy (for btrfs) write pattern going to permanent storage.

And AFAIK, journald has no pre-write filtering mechanism at all, only
post-write display-time filtering, so even "log-spam" that I don't want/
need logged gets written to it. With syslog-ng, if I see something
spamming continuously (I run git kernels and kde, and do get such spammers
occasionally), I set up a spam filter to kill it, so it never actually
gets written to permanent storage at all.

But the tmpfs journals and btrfs traditional logs give me the best of
both worlds: per-boot journals with all the extra metadata, the last ten
journal entries for a unit when I do systemctl status on it, etc, and a
nice filtered and ordered multi-boot log that I can use traditional text-
based log-administration tools on.

The only part of it I'm not happy with is that journald apparently can't
keep separate user and system journals when set to temporary only --
everything goes to the system journal. Which eventually means that much
of the stdout/stderr debugging spew that kde-based apps like to spew out
ends up in the system journal and (would be in the) log. But that's a
journald "documented bug-feature", and I can and do syslog-ng filter it
before it actually hits the written system log (or console log display).

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
RE: btrfs, journald logs, fragmentation, and fallocate
> -----Original Message-----
> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
> ow...@vger.kernel.org] On Behalf Of Goffredo Baroncelli
> Sent: Saturday, 29 April 2017 3:05 AM
> To: Chris Murphy
> Cc: Btrfs BTRFS
> Subject: Re: btrfs, journald logs, fragmentation, and fallocate
>
>
> In the past I faced the same problems; I collected some data here
> http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html.
> Unfortunately the journald files are very bad, because first the data is
> written (appended), then the index fields are updated. Unfortunately these
> indexes are near after the last write . So fragmentation is unavoidable.

Perhaps a better idea for COW filesystems is to store the index in a
separate file, and/or rewrite the last 1 MB block (or part thereof) of the
data file every time data is appended? That way the data file will use 1 MB
extents and hopefully avoid ridiculous amounts of metadata.

Paul.
Re: btrfs, journald logs, fragmentation, and fallocate
> [ ... ] these extents are all over the place, they're not
> contiguous at all. 4K here, 4K there, 4K over there, back to
> 4K here next to this one, 4K over there...12K over there, 500K
> unwritten, 4K over there. This seems not so consequential on
> SSD, [ ... ]

Indeed there were recent reports that the 'ssd' mount option
causes that, IIRC by Hans van Kranenburg (around 2017-04-17),
who also noticed issues with the wandering trees in certain
situations (around 2017-04-08).
Re: btrfs, journald logs, fragmentation, and fallocate
On Fri, Apr 28, 2017 at 11:41:00AM -0600, Chris Murphy wrote:
> The same behavior happens with NTFS in qcow2 files. They quickly end
> up with 100,000+ extents unless set nocow. It's like the worst case
> scenario.

You should never use qcow2 on btrfs, especially if snapshots are involved.
They both do roughly the same thing, and layering fragmentation upon
fragmentation ɪꜱ ɴᴏᴛ ᴘʀᴇᴛᴛʏ. Layering syncs is bad, too.

Instead, you can use raw files (preferably sparse unless there's both nocow
and no snapshots). Btrfs does natively everything you'd gain from qcow2,
and does it better: you can delete the master of a cloned image, deduplicate
them, deduplicate two unrelated images; you can turn on compression, etc.

Once you pay the btrfs performance penalty, you may as well actually use its
features, which make qcow2 redundant and harmful.

Meow!
--
Don't be racist. White, amber or black, all beers should be judged based
solely on their merits. Heck, even if occasionally a cider applies for a
beer's job, why not?
On the other hand, corpo lager is not a race.
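The "sparse raw file" suggestion can be sketched as follows. This is only an illustration under stated assumptions: `make_sparse_image` is a hypothetical helper name, and the image name and size are placeholders. It relies on truncate(1) from coreutils, which sets a file's apparent size without allocating data blocks.

```shell
#!/bin/bash
# Illustrative sketch: create a sparse raw image file instead of a qcow2 file.
# make_sparse_image is a hypothetical helper name.
make_sparse_image() {
    local path=$1 size=$2
    # truncate sets the apparent size; no data blocks are written.
    truncate -s "$size" "$path"
}

workdir=$(mktemp -d)
make_sparse_image "$workdir/guest.img" 1G
# Show apparent size versus blocks actually allocated.
stat -c 'apparent size: %s bytes, allocated blocks: %b' "$workdir/guest.img"
rm -rf "$workdir"
```

On btrfs, such an image could then be cloned with `cp --reflink=always`, which shares extents between the copies rather than duplicating data.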
Re: File system corruption, btrfsck abort
On Fri, Apr 28, 2017 at 3:10 AM, Christophe de Dinechin wrote:

>
> QEMU qcow2. Host is BTRFS. Guests are BTRFS, LVM, Ext4, NTFS (winXP and
> win10) and HFS+ (macOS Sierra). I think I had 7 VMs installed, planned to
> restore another 8 from backups before my previous disk crash. I usually have
> at least 2 running, often as many as 5 (fedora, ubuntu, winXP, win10, macOS)
> to cover my software testing needs.

That is quite a torture test for any file system but more so Btrfs.
How are the qcow2 files being created? What's the qemu-img create
command? In particular I'm wondering if these qcow2 files are cow or
nocow; if they're compressed by Btrfs; and how many fragments they
have with filefrag.

When I was using qcow2 for backing I used

qemu-img create -f qcow2 -o preallocation=falloc,nocow=on,lazy_refcounts=on

But then later I started using fallocated raw files with chattr +C
applied. And these days I'm just using LVM thin volumes. The journaled
file systems in a guest cause a ton of backing file fragmentation
unless nocow is used on Btrfs. I've seen hundreds of thousands of
extents for a single backing file for a Windows guest.

--
Chris Murphy
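To answer the "how many fragments" question in a script, the extent count can be pulled out of filefrag's summary line, which has the form "<path>: <N> extents found" (or "1 extent found"). The helper name `extent_count_from_summary` and the sample path below are hypothetical; this is a minimal sketch, not part of any tool mentioned in the thread.

```shell
#!/bin/bash
# Illustrative sketch: extract the extent count from filefrag's summary line.
# extent_count_from_summary is a hypothetical helper name.
extent_count_from_summary() {
    # The count is the third field from the end, so paths containing
    # spaces do not shift it.
    awk 'NF >= 3 && $NF == "found" { print $(NF-2) }'
}

# Against a real file (needs filefrag from e2fsprogs):
#   filefrag /path/to/image.qcow2 | extent_count_from_summary
# Here a made-up summary line is fed in instead:
echo "/srv/vm/win10.qcow2: 123456 extents found" | extent_count_from_summary
```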
Re: btrfs, journald logs, fragmentation, and fallocate
On Fri, Apr 28, 2017 at 1:39 PM, Peter Grandi wrote:

> In a particularly demented setup I had to decastrophize with
> great pain a Zimbra QCOW2 disk image (XFS on NFS on XFS on
> RAID6) containing an ever-growing Maildir email archive that
> ended up with over a million widely scattered microextents:
>
> http://www.sabi.co.uk/blog/1101Jan.html?110116#110116

The related Btrfs thread "File system corruption, btrfsck abort"
involves 5 concurrently used VMs with guests using ext4, NTFS, HFS+,
Btrfs and LVM, pointing to qcow2 files on Btrfs for backing. And it's
resulting in problems...

--
Chris Murphy
Re: btrfs, journald logs, fragmentation, and fallocate
On Fri, Apr 28, 2017 at 11:53 AM, Peter Grandi wrote:

> Well, depends, but probably the single file: it is more likely
> that the 20,000 fragments will actually be contiguous, and that
> there will be less metadata IO than for 40,000 separate journal
> files.

You can see from the examples I posted that these extents are all over
the place, they're not contiguous at all. 4K here, 4K there, 4K over
there, back to 4K here next to this one, 4K over there...12K over
there, 500K unwritten, 4K over there. This seems not so consequential
on SSD; at least if it impacts performance it's not so bad that I care.
On a hard drive, it's totally noticeable. And that's why journald went
with chattr +C by default a few versions ago when on Btrfs. And it
does help *if* the parent is never snapshotted, which on a snapshotting
file system can't really be guaranteed. Inadvertent snapshotting could
be inhibited by putting the journals in their own subvolume though.

Anyway, it's difficult to consider Btrfs a general purpose file system
if other general purpose workloads, like journal files, are causing a
problem like wandering tree. Hence the subject of what to do about it,
and that may mean short term and long term. I can't speak for systemd
developers, but if there's a different way to write to the journals
that'd be better for Btrfs and no worse for ext4 and XFS, it might be
considered.

--
Chris Murphy
Re: btrfs, journald logs, fragmentation, and fallocate
On Fri, Apr 28, 2017 at 11:46 AM, Peter Grandi wrote:

> So there are three layers of silliness here:
>
> * Writing large files slowly to a COW filesystem and
>   snapshotting it frequently.
> * A filesystem that does delayed allocation instead of
>   allocate-ahead, and does not have psychic code.
> * Working around that by using no-COW and preallocation
>   with a fixed size regardless of snapshot frequency.
>
> The primary problem here is that there is no way to have slow
> small writes and frequent snapshots without generating small
> extents: if a file is written at a rate of 1MiB/hour and gets
> snapshot every hour the extent size will not be larger than 1MiB
> *obviously*.

Sure.

But in my example, no snapshotting, and +C is inhibited (i.e. I set
/etc/tmpfiles.d/journal-nocow.conf which stops systemd from the new
behavior of setting +C on journals). That's resulting in a 19000+
fragment journal file. In fact snapshotting does not make it worse
though. If it's nocow, then yes snapshotting makes it worse than
nocow, but no worse than cow.

What I'm trying to get at is that default Btrfs behavior and (previous)
default journald behavior have a misalignment resulting in a lot of
fragmentation. Is there a better way around this than merely setting
journals to nocow *and* making sure they stay nocow by preventing
snapshotting? If there's nothing better to be done, then I'll just
re-recommend to systemd folks that the directory containing journals
should be made a subvolume to isolate it from inadvertent
snapshotting. If people want to snapshot it anyway there's nothing we
can do about that.

> Filesystem-level snapshots are not designed to snapshot slowly
> growing files, but to snapshot changing collections of
> files. There are harsh tradeoffs involved. Application-level
> snapshots (also known as log rotations :->) are needed for
> special cases and finer grained policies.
>
> The secondary problem is that a fixed preallocate of 8MiB is
> good only if in between snapshots the file grows by a little
> less than 8MiB or by substantially more.

Just to be clear, none of my own examples involve journals being
snapshot. There are no shared extents for any of those files.

--
Chris Murphy
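The "make the journal directory its own subvolume" recommendation could look roughly like the sketch below. It is hedged in several ways: `run` and `isolate_journal_dir` are hypothetical helper names, the script only prints the commands unless DRY_RUN=0 is set, and it assumes the target path does not exist yet (`btrfs subvolume create` needs a new path, so existing journals would have to be moved aside first).

```shell
#!/bin/bash
# Illustrative sketch; helper names are hypothetical.
# Prints each command; executes it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

isolate_journal_dir() {
    local dir=$1
    # A nested subvolume is a snapshot boundary: snapshots of the parent
    # subvolume do not include its contents.
    run btrfs subvolume create "$dir"
    # +C on a directory makes files newly created inside it nocow.
    run chattr +C "$dir"
}

# Dry run by default: only shows what would be executed.
isolate_journal_dir /var/log/journal
```

This does not stop anyone from snapshotting the journal subvolume deliberately; it only keeps it out of snapshots taken of the parent.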
Re: btrfs, journald logs, fragmentation, and fallocate
>> The gotcha though is there's a pile of data in the journal
>> that would never make it to rsyslogd. If you use journalctl
>> -o verbose you can see some of this.

> You can send *all the info* to rsyslogd via imjournal
> http://www.rsyslog.com/doc/v8-stable/configuration/modules/imjournal.html
> In my setup all the data are stored in json format in the
> /var/log/cee.log file:
> $ head /var/log/cee.log 2017-04-28T18:41:41.931273+02:00
> venice liblogging-stdlog: @cee: { "PRIORITY": "6", "_BOOT_ID":
> "a86d74bab91f44dc974c76aceb97141f", "_MACHINE_ID": [ ... ]

Ahh the horror the horror, I will never be able to unsee
that. The UNIX way of doing things is truly dead.

>> The same behavior happens with NTFS in qcow2 files. They
>> quickly end up with 100,000+ extents unless set nocow.
>> It's like the worst case scenario.

In a particularly demented setup I had to decastrophize with
great pain a Zimbra QCOW2 disk image (XFS on NFS on XFS on
RAID6) containing an ever-growing Maildir email archive that
ended up with over a million widely scattered microextents:

http://www.sabi.co.uk/blog/1101Jan.html?110116#110116
Re: btrfs, journald logs, fragmentation, and fallocate
On 2017-04-28 19:41, Chris Murphy wrote:
> On Fri, Apr 28, 2017 at 11:05 AM, Goffredo Baroncelli wrote:
>
>> In the past I faced the same problems; I collected some data here
>> http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html.
>> Unfortunately the journald files are very bad, because first the data is
>> written (appended), then the index fields are updated. Unfortunately these
>> indexes are near after the last write . So fragmentation is unavoidable.
>>
>> After some thinking I adopted a different strategy: I used journald as
>> collector, then I forward all the log to rsyslogd, which used a "log append"
>> format. Journald never writes on the root filesystem, only in tmp.
>
> The gotcha though is there's a pile of data in the journal that would
> never make it to rsyslogd. If you use journalctl -o verbose you can
> see some of this.

You can send *all the info* to rsyslogd via imjournal

http://www.rsyslog.com/doc/v8-stable/configuration/modules/imjournal.html

In my setup all the data is stored in JSON format in the /var/log/cee.log file:

$ head /var/log/cee.log
2017-04-28T18:41:41.931273+02:00 venice liblogging-stdlog: @cee: { "PRIORITY": "6", "_BOOT_ID": "a86d74bab91f44dc974c76aceb97141f", "_MACHINE_ID": "e84907d099904117b355a99c98378dca", "_HOSTNAME": "venice.bhome", "_SYSTEMD_SLICE": "system.slice", "_UID": "0", "_GID": "0", "_CAP_EFFECTIVE": "3f", "_TRANSPORT": "syslog", "SYSLOG_FACILITY": "23", "SYSLOG_IDENTIFIER": "liblogging-stdlog", "MESSAGE": " [origin software=\"rsyslogd\" swVersion=\"8.24.0\" x-pid=\"737\" x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed", "_PID": "737", "_COMM": "rsyslogd", "_EXE": "\/usr\/sbin\/rsyslogd", "_CMDLINE": "\/usr\/sbin\/rsyslogd -n", "_SYSTEMD_CGROUP": "\/system.slice\/rsyslog.service", "_SYSTEMD_UNIT": "rsyslog.service", "_SYSTEMD_INVOCATION_ID": "18b9a8b27f9143728adef972db7b394c", "_SOURCE_REALTIME_TIMESTAMP": "1493397701931255", "msg": "[origin software=\"rsyslogd\" swVersion=\"8.24.0\" x-pid=\"737\" x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed" }
2017-04-28T18:41:42.058549+02:00 venice liblogging-stdlog: @cee: { "PRIORITY": "6", "_BOOT_ID": "a86d74bab91f44dc974c76aceb97141f", "_MACHINE_ID": "e84907d099904117b355a99c98378dca", "_HOSTNAME": "venice.bhome", "_SYSTEMD_SLICE": "system.slice", "_UID": "0", "_GID": "0", "_CAP_EFFECTIVE": "3f", "_TRANSPORT": "syslog", "SYSLOG_FACILITY": "23", "SYSLOG_IDENTIFIER": "liblogging-stdlog", "MESSAGE": " [origin software=\"rsyslogd\" swVersion=\"8.24.0\" x-pid=\"737\" x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed", "_PID": "737", "_COMM": "rsyslogd", "_EXE": "\/usr\/sbin\/rsyslogd", "_CMDLINE": "\/usr\/sbin\/rsyslogd -n", "_SYSTEMD_CGROUP": "\/system.slice\/rsyslog.service", "_SYSTEMD_UNIT": "rsyslog.service", "_SYSTEMD_INVOCATION_ID": "18b9a8b27f9143728adef972db7b394c", "_SOURCE_REALTIME_TIMESTAMP": "1493397702058441", "msg": "[origin software=\"rsyslogd\" swVersion=\"8.24.0\" x-pid=\"737\" x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed" }
[...]

All the info is stored with the same keys/values that journald uses.
I developed a utility (called clp), which allows querying the log by key and
filtering by boot number or by date.

For example, to show all the log entries related to rsyslog:

$ clp log -t full-details _SYSTEMD_CGROUP=/system.slice/rsyslog.service
2017-04-21 19:12:29.579748
    MESSAGE= [origin software="rsyslogd" swVersion="8.24.0" x-pid="804" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
    PRIORITY=6
    SYSLOG_FACILITY=23
    SYSLOG_IDENTIFIER=liblogging-stdlog
    _BOOT_ID=d77198380c9344248e01166fbd8d60df
    _CAP_EFFECTIVE=3f
    _CMDLINE=/usr/sbin/rsyslogd -n
    _COMM=rsyslogd
    _EXE=/usr/sbin/rsyslogd
    _GID=0
    _HOSTNAME=venice.bhome
    _LOGFILEINITLINE=2017-04-21T19:12:29.579768+02:00 venice liblogging-stdlog:
    _LOGFILELINENUMBER=1
    _LOGFILENAME=/var/log/cee.log.7.gz
    _LOGFILETIMESTAMP=1492794749579768
    _MACHINE_ID=e84907d099904117b355a99c98378dca
    _PID=804
    _SOURCE_REALTIME_TIMESTAMP=1492794749579748
    _SYSTEMD_CGROUP=/system.slice/rsyslog.service
    _SYSTEMD_INVOCATION_ID=8f9cb6c871be4158a3ccb374f4323027
    _SYSTEMD_SLICE=system.slice
    _SYSTEMD_UNIT=rsyslog.service
    _TRANSPORT=syslog
    _UID=0
    msg=[origin software="rsyslogd" swVersion="8.24.0" x-pid="804"
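Because each @cee record is a single line of text, a crude key/value query is also possible with standard tools. The sketch below is not the clp utility: `cee_match` is a hypothetical helper that does plain string matching rather than JSON parsing, and the sample record is abbreviated from the log excerpt quoted in this thread. Values containing "/" are stored escaped as "\/" in the file, so they would need the same escaping in the query.

```shell
#!/bin/bash
# Illustrative sketch (not clp); cee_match is a hypothetical helper name.
# Reads @cee log lines on stdin, keeps those containing an exact
# "KEY": "VALUE" pair, and prints each matching record's timestamp.
cee_match() {
    local key=$1 value=$2
    grep -F -- "\"$key\": \"$value\"" | awk '{ print $1 }'
}

# Abbreviated sample record, for illustration only.
sample='2017-04-28T18:41:41.931273+02:00 venice liblogging-stdlog: @cee: { "PRIORITY": "6", "_SYSTEMD_UNIT": "rsyslog.service", "_PID": "737" }'
printf '%s\n' "$sample" | cee_match _SYSTEMD_UNIT rsyslog.service
```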
Re: [PATCH v2] fstests: regression test for btrfs buffered read's repair
On Fri, Apr 28, 2017 at 10:52:12AM +0100, Filipe Manana wrote:
> On Wed, Apr 26, 2017 at 7:09 PM, Liu Bo wrote:
> > This case tests whether buffered read can repair the bad copy if we
> > have a good copy.
> >
> > Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed
> > and drop checks") introduced the regression.
> >
> > The upstream fix is
> > Btrfs: bring back repair during read
>
> Same issue as reported for the other new test (btrfs/140), the test
> fails on a kernel with the mentioned patch. Seems like it makes wrong
> assumptions somewhere (I haven't investigated, just run the test):

Thanks for running the test. What I missed is that the chunk start
offset depends on disk size, so I'll add a filesystem size limit to mkfs.

Thanks,

-liubo
[PATCH v3] fstests: regression test for btrfs buffered read's repair
This case tests whether buffered read can repair the bad copy if we
have a good copy.

Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed
and drop checks") introduced the regression.

The upstream fix is
	Btrfs: bring back repair during read

Signed-off-by: Liu Bo
---
v2: - Add regression commit and the fix to the description
    - Use btrfs inspect-internal dump-tree to get rid of the dependence on
      btrfs-map-logical
    - Add comments in several places
    - Fix typo, dio->buffered.

v3: - Add 'mkfs -b 1G' to limit filesystem size to 2G in raid1 profile so that
      we get a consistent output.

 tests/btrfs/141     | 169
 tests/btrfs/141.out |  39
 tests/btrfs/group   |   1 +
 3 files changed, 209 insertions(+)
 create mode 100755 tests/btrfs/141
 create mode 100644 tests/btrfs/141.out

diff --git a/tests/btrfs/141 b/tests/btrfs/141
new file mode 100755
index 000..c4e08ed
--- /dev/null
+++ b/tests/btrfs/141
@@ -0,0 +1,169 @@
+#! /bin/bash
+# FS QA Test 141
+#
+# Regression test for btrfs buffered read's repair during read.
+#
+# Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed
+# and drop checks") introduced the regression.
+#
+# The upstream fix is
+# Btrfs: bring back repair during read
+#
+#---
+# Copyright (c) 2017 Liu Bo. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_dev_pool 2
+
+_require_btrfs_command inspect-internal dump-tree
+_require_command "$FILEFRAG_PROG" filefrag
+
+
+# helper to convert 'file offset' to btrfs logical offset
+FILEFRAG_FILTER='
+	if (/blocks? of (\d+) bytes/) {
+		$blocksize = $1;
+		next
+	}
+	($ext, $logical, $physical, $length) =
+		(/^\s*(\d+):\s+(\d+)..\s+\d+:\s+(\d+)..\s+\d+:\s+(\d+):/)
+	or next;
+	($flags) = /.*:\s*(\S*)$/;
+	print $physical * $blocksize, "#",
+	      $length * $blocksize, "#",
+	      $logical * $blocksize, "#",
+	      $flags, " "'
+
+# this makes filefrag output script readable by using a perl helper.
+# output is one extent per line, with three numbers separated by '#'
+# the numbers are: physical, length, logical (all in bytes)
+# sample output: "1234#10#5678" -> physical 1234, length 10, logical 5678
+_filter_extents()
+{
+	tee -a $seqres.full | $PERL_PROG -ne "$FILEFRAG_FILTER"
+}
+
+_check_file_extents()
+{
+	cmd="filefrag -v $1"
+	echo "# $cmd" >> $seqres.full
+	out=`$cmd | _filter_extents`
+	if [ -z "$out" ]; then
+		return 1
+	fi
+	echo "after filter: $out" >> $seqres.full
+	echo $out
+	return 0
+}
+
+_check_repair()
+{
+	filter=${1:-cat}
+	dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac | $filter | grep -q -e "csum failed"
+	if [ $? -eq 0 ]; then
+		echo 1
+	else
+		echo 0
+	fi
+}
+
+_get_physical()
+{
+# $1 is logical address
+# print chunk tree and find devid 2 which is $SCRATCH_DEV
+$BTRFS_UTIL_PROG inspect-internal dump-tree -t 3 $SCRATCH_DEV | grep $1 -A 6 | awk '($1 ~ /stripe/ && $3 ~ /devid/ && $4 ~ /1/) { print $6 }'
+}
+
+_scratch_dev_pool_get 2
+# step 1, create a raid1 btrfs which contains one 128k file.
+echo "step 1..mkfs.btrfs" >>$seqres.full
+
+mkfs_opts="-d raid1 -b 1G"
+_scratch_pool_mkfs $mkfs_opts >>$seqres.full 2>&1
+
+# -o nospace_cache makes sure data is written to the start position of the data
+# chunk
+_scratch_mount -o nospace_cache
+
+$XFS_IO_PROG -f -d -c "pwrite -S 0xaa -b 128K 0 128K" "$SCRATCH_MNT/foobar" | _filter_xfs_io
+
+sync
+
+# step 2, corrupt
[PATCH v2] fstests: regression test for nocsum dio read's repair
Commit 2dabb3248453 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
introduced this regression. It'd cause a 'Segmentation fault' error.

The upstream fix is
	Btrfs: fix segment fault when doing dio read

Signed-off-by: Liu Bo
---
v2: - Add 'mkfs -b 1G' to limit filesystem size to 2G in raid1 profile so that
      we get a consistent output.

 tests/btrfs/142     | 189
 tests/btrfs/142.out |  39 +++
 tests/btrfs/group   |   1 +
 3 files changed, 229 insertions(+)
 create mode 100755 tests/btrfs/142
 create mode 100644 tests/btrfs/142.out

diff --git a/tests/btrfs/142 b/tests/btrfs/142
new file mode 100755
index 000..94566de
--- /dev/null
+++ b/tests/btrfs/142
@@ -0,0 +1,189 @@
+#! /bin/bash
+# FS QA Test 142
+#
+# Regression test for btrfs DIO read's repair during read without checksum.
+#
+# Commit 2dabb3248453 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
+# introduced this regression. It'd cause 'Segmentation fault' error.
+#
+# The upstream fix is
+# Btrfs: fix segment fault when doing dio read
+#
+#---
+# Copyright (c) 2017 Liu Bo. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_dev_pool 2
+
+_require_btrfs_command inspect-internal dump-tree
+_require_command "$FILEFRAG_PROG" filefrag
+
+# helper to convert 'file offset' to btrfs logical offset
+FILEFRAG_FILTER='
+	if (/blocks? of (\d+) bytes/) {
+		$blocksize = $1;
+		next
+	}
+	($ext, $logical, $physical, $length) =
+		(/^\s*(\d+):\s+(\d+)..\s+\d+:\s+(\d+)..\s+\d+:\s+(\d+):/)
+	or next;
+	($flags) = /.*:\s*(\S*)$/;
+	print $physical * $blocksize, "#",
+	      $length * $blocksize, "#",
+	      $logical * $blocksize, "#",
+	      $flags, " "'
+
+# this makes filefrag output script readable by using a perl helper.
+# output is one extent per line, with three numbers separated by '#'
+# the numbers are: physical, length, logical (all in bytes)
+# sample output: "1234#10#5678" -> physical 1234, length 10, logical 5678
+_filter_extents()
+{
+	tee -a $seqres.full | $PERL_PROG -ne "$FILEFRAG_FILTER"
+}
+
+_check_file_extents()
+{
+	cmd="filefrag -v $1"
+	echo "# $cmd" >> $seqres.full
+	out=`$cmd | _filter_extents`
+	if [ -z "$out" ]; then
+		return 1
+	fi
+	echo "after filter: $out" >> $seqres.full
+	echo $out
+	return 0
+}
+
+_check_repair()
+{
+	filter=${1:-cat}
+	dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac | $filter | grep -q -e "direct IO failed"
+	if [ $? -eq 0 ]; then
+		echo 1
+	else
+		echo 0
+	fi
+}
+
+_get_physical()
+{
+# $1 is logical address
+# print chunk tree and find devid 2 which is $SCRATCH_DEV
+$BTRFS_UTIL_PROG inspect-internal dump-tree -t 3 $SCRATCH_DEV | grep $1 -A 6 | awk '($1 ~ /stripe/ && $3 ~ /devid/ && $4 ~ /1/) { print $6 }'
+}
+
+
+SYSFS_BDEV=`_sysfs_dev $SCRATCH_DEV`
+
+start_fail()
+{
+	echo 100 > $DEBUGFS_MNT/fail_make_request/probability
+	echo 1 > $DEBUGFS_MNT/fail_make_request/times
+	echo 0 > $DEBUGFS_MNT/fail_make_request/verbose
+	echo 1 > $SYSFS_BDEV/make-it-fail
+}
+
+stop_fail()
+{
+	echo 0 > $DEBUGFS_MNT/fail_make_request/probability
+	echo 0 > $DEBUGFS_MNT/fail_make_request/times
+	echo 0 > $SYSFS_BDEV/make-it-fail
+}
+
+_scratch_dev_pool_get 2
+# step 1, create a raid1 btrfs which contains one 128k file.
+echo "step 1..mkfs.btrfs" >>$seqres.full
+
+mkfs_opts="-d raid1 -b 1G"
+_scratch_pool_mkfs $mkfs_opts >>$seqres.full 2>&1
+
+# -o
[PATCH v2] fstests: regression test for nocsum buffered read's repair
This is to test whether the buffered read retry-repair code is able to work
in the raid1 case as expected.

Please note that without checksum, btrfs doesn't know if the data used to
repair is correct, so repair is more of a resync which makes sure that both
copies have the same content.

Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed
and drop checks") introduced the regression.

The upstream fix is
	Btrfs: bring back repair during read

Signed-off-by: Liu Bo
---
v2: - Add 'mkfs -b 1G' to limit filesystem size to 2G in raid1 profile so that
      we get a consistent output.

 tests/btrfs/143     | 197
 tests/btrfs/143.out |  39 +++
 tests/btrfs/group   |   1 +
 3 files changed, 237 insertions(+)
 create mode 100755 tests/btrfs/143
 create mode 100644 tests/btrfs/143.out

diff --git a/tests/btrfs/143 b/tests/btrfs/143
new file mode 100755
index 000..70f3f9f
--- /dev/null
+++ b/tests/btrfs/143
@@ -0,0 +1,197 @@
+#! /bin/bash
+# FS QA Test 143
+#
+# Regression test for btrfs buffered read's repair during read without checksum.
+#
+# This is to test whether buffered read retry-repair code is able to work in
+# raid1 case as expected.
+#
+# Please note that without checksum, btrfs doesn't know if the data used to
+# repair is correct, so repair is more of a resync which makes sure that both
+# copies have the same content.
+#
+# Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed and drop
+# checks") introduced the regression.
+#
+# The upstream fix is
+# Btrfs: bring back repair during read
+#
+#---
+# Copyright (c) 2017 Liu Bo. All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_dev_pool 2
+
+_require_btrfs_command inspect-internal dump-tree
+_require_command "$FILEFRAG_PROG" filefrag
+
+# helper to convert 'file offset' to btrfs logical offset
+FILEFRAG_FILTER='
+	if (/blocks? of (\d+) bytes/) {
+		$blocksize = $1;
+		next
+	}
+	($ext, $logical, $physical, $length) =
+		(/^\s*(\d+):\s+(\d+)..\s+\d+:\s+(\d+)..\s+\d+:\s+(\d+):/)
+	or next;
+	($flags) = /.*:\s*(\S*)$/;
+	print $physical * $blocksize, "#",
+	      $length * $blocksize, "#",
+	      $logical * $blocksize, "#",
+	      $flags, " "'
+
+# this makes filefrag output script readable by using a perl helper.
+# output is one extent per line, with three numbers separated by '#'
+# the numbers are: physical, length, logical (all in bytes)
+# sample output: "1234#10#5678" -> physical 1234, length 10, logical 5678
+_filter_extents()
+{
+	tee -a $seqres.full | $PERL_PROG -ne "$FILEFRAG_FILTER"
+}
+
+_check_file_extents()
+{
+	cmd="filefrag -v $1"
+	echo "# $cmd" >> $seqres.full
+	out=`$cmd | _filter_extents`
+	if [ -z "$out" ]; then
+		return 1
+	fi
+	echo "after filter: $out" >> $seqres.full
+	echo $out
+	return 0
+}
+
+_check_repair()
+{
+	filter=${1:-cat}
+	dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac | $filter | grep -q -e "read error corrected"
+	if [ $? -eq 0 ]; then
+		echo 1
+	else
+		echo 0
+	fi
+}
+
+_get_physical()
+{
+# $1 is logical address
+# print chunk tree and find devid 2 which is $SCRATCH_DEV
+$BTRFS_UTIL_PROG inspect-internal dump-tree -t 3 $SCRATCH_DEV | grep $1 -A 6 | awk '($1 ~ /stripe/ && $3 ~ /devid/ && $4 ~ /1/) { print $6 }'
+}
+
+SYSFS_BDEV=`_sysfs_dev $SCRATCH_DEV`
+
+start_fail()
+{
+	echo 100 > $DEBUGFS_MNT/fail_make_request/probability
+	echo 4 >
[PATCH v3] fstests: regression test for btrfs dio read repair
This case tests whether dio read can repair the bad copy if we have a good copy. Commit 2dabb3248453 ("Btrfs: Direct I/O read: Work on sectorsized blocks") introduced the regression. The upstream fix is Btrfs: fix invalid dereference in btrfs_retry_endio Signed-off-by: Liu Bo--- v2: - Add regression commit and the fix to the description - Use btrfs inspect-internal dump-tree to get rid of the dependence btrfs-map-logical - Add comments in several places v3: - Add 'mkfs -b 1G' to limit filesystem size to 2G in raid1 profile so that we get a consistent output. tests/btrfs/140 | 167 tests/btrfs/140.out | 39 tests/btrfs/group | 1 + 3 files changed, 207 insertions(+) create mode 100755 tests/btrfs/140 create mode 100644 tests/btrfs/140.out diff --git a/tests/btrfs/140 b/tests/btrfs/140 new file mode 100755 index 000..dcd8807 --- /dev/null +++ b/tests/btrfs/140 @@ -0,0 +1,167 @@ +#! /bin/bash +# FS QA Test 140 +# +# Regression test for btrfs DIO read's repair during read. +# +# Commit 2dabb3248453 ("Btrfs: Direct I/O read: Work on sectorsized blocks") +# introduced the regression. +# The upstream fix is +# Btrfs: fix invalid dereference in btrfs_retry_endio +# +#--- +# Copyright (c) 2017 Liu Bo. All Rights Reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. 
+# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +#--- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" + +here=`pwd` +tmp=/tmp/$$ +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + cd / + rm -f $tmp.* +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter + +# remove previous $seqres.full before test +rm -f $seqres.full + +# real QA test starts here + +# Modify as appropriate. +_supported_fs btrfs +_supported_os Linux +_require_scratch_dev_pool 2 + +_require_btrfs_command inspect-internal dump-tree +_require_command "$FILEFRAG_PROG" filefrag +_require_odirect + +# helpe to convert 'file offset' to btrfs logical offset +FILEFRAG_FILTER=' + if (/blocks? of (\d+) bytes/) { + $blocksize = $1; + next + } + ($ext, $logical, $physical, $length) = + (/^\s*(\d+):\s+(\d+)..\s+\d+:\s+(\d+)..\s+\d+:\s+(\d+):/) + or next; + ($flags) = /.*:\s*(\S*)$/; + print $physical * $blocksize, "#", + $length * $blocksize, "#", + $logical * $blocksize, "#", + $flags, " "' + +# this makes filefrag output script readable by using a perl helper. +# output is one extent per line, with three numbers separated by '#' +# the numbers are: physical, length, logical (all in bytes) +# sample output: "1234#10#5678" -> physical 1234, length 10, logical 5678 +_filter_extents() +{ + tee -a $seqres.full | $PERL_PROG -ne "$FILEFRAG_FILTER" +} + +_check_file_extents() +{ + cmd="filefrag -v $1" + echo "# $cmd" >> $seqres.full + out=`$cmd | _filter_extents` + if [ -z "$out" ]; then + return 1 + fi + echo "after filter: $out" >> $seqres.full + echo $out + return 0 +} + +_check_repair() +{ + filter=${1:-cat} + dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac | $filter | grep -q -e "csum failed" + if [ $? 
-eq 0 ]; then + echo 1 + else + echo 0 + fi +} + +_get_physical() +{ + # $1 is logical address + # print chunk tree and find devid 2 which is $SCRATCH_DEV + $BTRFS_UTIL_PROG inspect-internal dump-tree -t 3 $SCRATCH_DEV | grep $1 -A 6 | awk '($1 ~ /stripe/ && $3 ~ /devid/ && $4 ~ /1/) { print $6 }' +} + +_scratch_dev_pool_get 2 +# step 1, create a raid1 btrfs which contains one 128k file. +echo "step 1..mkfs.btrfs" >>$seqres.full + +mkfs_opts="-d raid1 -b 1G" +_scratch_pool_mkfs $mkfs_opts >>$seqres.full 2>&1 + +# -o nospace_cache makes sure data is written to the start position of the data +# chunk +_scratch_mount -o nospace_cache + +$XFS_IO_PROG -f -d -c "pwrite -S 0xaa -b 128K 0 128K" "$SCRATCH_MNT/foobar" | _filter_xfs_io + +sync + +# step 2, corrupt the first 64k of one copy
Re: btrfs, journald logs, fragmentation, and fallocate
> [ ... ] And that makes me wonder whether metadata
> fragmentation is happening as a result. But in any case,
> there's a lot of metadata being written for each journal
> update compared to what's being added to the journal file.
> [ ... ]

That's the "wandering trees" problem in COW filesystems, and manifestations of it in Btrfs have also been reported before. If a workload triggers a lot of "wandering trees" updates, then a filesystem that has "wandering trees" perhaps should not be used :-).

> [ ... ] worse, a single file with 20000 fragments; or 40000
> separate journal files? *shrug* [ ... ]

Well, it depends, but probably the single file: it is more likely that the 20,000 fragments will actually be contiguous, and that there will be less metadata IO than for 40,000 separate journal files.

The deeper "strategic" issue is that storage systems, and filesystems in particular, have very anisotropic performance envelopes, and mismatches between the envelopes of application and filesystem can be very expensive:

http://www.sabi.co.uk/blog/15-two.html?151023#151023
Re: btrfs, journald logs, fragmentation, and fallocate
> Old news is that systemd-journald journals end up pretty
> heavily fragmented on Btrfs due to COW.

This has been discussed before in detail, indeed here, but also here:

http://www.sabi.co.uk/blog/15-one.html?150203#150203

> While journald uses chattr +C on journal files now, COW still
> happens if the subvolume the journal is in gets snapshot. e.g.
> a week old system.journal has 19000+ extents. [ ... ] It
> appears to me (see below URLs pointing to example journals)
> that journald fallocated in 8MiB increments but then ends up
> doing 4KiB writes; [ ... ]

So there are three layers of silliness here:

* Writing large files slowly to a COW filesystem and snapshotting it frequently.
* A filesystem that does delayed allocation instead of allocate-ahead, and does not have psychic code.
* Working around that by using no-COW and preallocation with a fixed size regardless of snapshot frequency.

The primary problem here is that there is no way to have slow small writes and frequent snapshots without generating small extents: if a file is written at a rate of 1MiB/hour and gets snapshot every hour, the extent size will not be larger than 1MiB *obviously*.

Filesystem-level snapshots are not designed to snapshot slowly growing files, but to snapshot changing collections of files. There are harsh tradeoffs involved. Application-level snapshots (also known as log rotations :->) are needed for special cases and finer grained policies.

The secondary problem is that a fixed preallocation of 8MiB is good only if in between snapshots the file grows by a little less than 8MiB or by substantially more.
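The bound argued above can be put into numbers. The sketch below is a simplified model, not a description of Btrfs internals: it assumes every snapshot forces subsequent writes into newly allocated extents, so no extent can hold more than the data written between two snapshots. The function names and the one-week horizon are illustrative assumptions; only the 1MiB/hour rate and the hourly snapshot come from the message.

```python
import math

MIB = 1024 * 1024


def max_extent_bytes(write_rate_bytes_per_hour, snapshot_interval_hours):
    """Upper bound on extent size: the data written between two snapshots."""
    return write_rate_bytes_per_hour * snapshot_interval_hours


def min_extent_count(file_size_bytes, write_rate_bytes_per_hour, snapshot_interval_hours):
    """Lower bound on the number of extents in a file grown at a steady rate."""
    bound = max_extent_bytes(write_rate_bytes_per_hour, snapshot_interval_hours)
    return math.ceil(file_size_bytes / bound)


# The example from the message: 1 MiB/hour, one snapshot per hour.
hourly_bound = max_extent_bytes(1 * MIB, 1)

# One week of such writes (168 hours is an assumed horizon, not from the message).
week_extents = min_extent_count(168 * MIB, 1 * MIB, 1)

print(hourly_bound // MIB, "MiB max extent;", week_extents, "extents minimum after a week")
```

This is only a lower bound on fragmentation. The 19000+ extents quoted for a week-old journal are far above it, which would be consistent with the 4KiB write pattern being the dominant factor rather than the snapshot interval alone.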
Re: btrfs, journald logs, fragmentation, and fallocate
On Fri, Apr 28, 2017 at 11:05 AM, Goffredo Baroncelli wrote:

> In the past I faced the same problems; I collected some data here
> http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html.
> Unfortunately the journald files are very bad, because first the data is
> written (appended), then the index fields are updated. Unfortunately these
> indexes are near the last write. So fragmentation is unavoidable.
>
> After some thinking I adopted a different strategy: I used journald as
> collector, then I forward all the logs to rsyslogd, which uses a "log append"
> format. Journald never writes on the root filesystem, only in tmp.

The gotcha though is there's a pile of data in the journal that would never make it to rsyslogd. If you use journalctl -o verbose you can see some of this. There's a bunch of extra metadata in the journal. And filtering based on that metadata is also useful, rather than being limited to grep on a syslog file. Which, you know, is fine for many use cases.

I guess I'm just interested in whether there's an enhancement that can be done to make journals more compatible with Btrfs or vice versa. It's not a huge problem anyway.

> The thing became interesting when I discovered that searching in a
> rsyslog file is faster than journalctl (on rotational media). Unfortunately
> I don't have any data to support this.

Yes, on rotational drives all of these scattered extents cause a lot of head seeking. And I also suspect there's a lot of metadata spread out everywhere too, to account for all of these extents. That's why they moved to chattr +C to make them nocow.

An idea I had on the systemd list was to automatically make the journal directory a Btrfs subvolume, similar to how systemd already creates a /var/lib/machines subvolume for nspawn containers. This prevents the journals from being caught up in a snapshot of the parent subvolume that typically contains the journals (root fs). There's no practical use I can think of for snapshotting logs. You'd really want the logs to always be linear, contiguous, and never get rolled back. Even if something in the system does get rolled back, you'd want the logs to show that and continue on, rather than being rolled back themselves.

So the super simple option would be to continue with +C on journals, plus a separate subvolume to prevent COW from ever happening inadvertently.

The same behavior happens with NTFS in qcow2 files. They quickly end up with 100,000+ extents unless set nocow. It's like the worst case scenario.

-- 
Chris Murphy
Re: btrfs, journald logs, fragmentation, and fallocate
On 2017-04-28 18:16, Chris Murphy wrote:
> Old news is that systemd-journald journals end up pretty heavily
> fragmented on Btrfs due to COW. While journald uses chattr +C on
> journal files now, COW still happens if the subvolume the journal is
> in gets snapshot. e.g. a week old system.journal has 19000+ extents.
>
> The news is I started a systemd thread.
>
> This is the start:
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html
>
> Where it gets interesting, two messages by Andrei Borzenkov: He
> evaluates existing code and does some tests on ext4 and XFS.
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038728.html
>
> And then the question.
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038735.html
>
> Given what journald is doing, is what Btrfs is doing expected? Is
> there something it could do better to be more like ext4 and XFS in the
> same situation? Or is it out of scope for Btrfs?

In the past I faced the same problems; I collected some data here http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html. Unfortunately the journald files are very bad, because first the data is written (appended), then the index fields are updated. Unfortunately these indexes are near the last write. So fragmentation is unavoidable.

After some thinking I adopted a different strategy: I used journald as collector, then I forward all the logs to rsyslogd, which uses a "log append" format. Journald never writes on the root filesystem, only in tmp.

The thing became interesting when I discovered that searching in a rsyslog file is faster than journalctl (on rotational media). Unfortunately I don't have any data to support this.

However if someone is interested I can share more details.

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
btrfs, journald logs, fragmentation, and fallocate
Old news is that systemd-journald journals end up pretty heavily fragmented on Btrfs due to COW. While journald uses chattr +C on journal files now, COW still happens if the subvolume the journal is in gets snapshot. e.g. a week old system.journal has 19000+ extents.

The news is I started a systemd thread.

This is the start:
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html

Where it gets interesting, two messages by Andrei Borzenkov: He evaluates existing code and does some tests on ext4 and XFS.
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038728.html

And then the question.
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038735.html

Given what journald is doing, is what Btrfs is doing expected? Is there something it could do better to be more like ext4 and XFS in the same situation? Or is it out of scope for Btrfs?

It appears to me (see below URLs pointing to example journals) that journald fallocated in 8MiB increments but then ends up doing 4KiB writes; there are a lot of these unused (unwritten) 8MiB extents that appear in both filefrag and btrfs-debug -f outputs.

The +C idea just rearranges the deck chairs; it's not solving the underlying problem except in the case where the containing subvolume is never snapshot.

And in the COW case, I'm seeing about 30 metadata nodes being written out for what amounts to less than a 4KiB journal append. Each time. And that makes me wonder whether metadata fragmentation is happening as a result. But in any case, there's a lot of metadata being written for each journal update compared to what's being added to the journal file.

And then that makes me wonder if a better optimization on Btrfs would be having each write be a separate file. The small updates would have data inline. Which is worse, a single file with 20000 fragments; or 40000 separate journal files? *shrug* At least those individual files would be subject to compression with +c; whereas right now the open-endedness of the active journal means it has not a single compressed extent. Only once rotated do they get compressed (via defragmentation, which journald does only on Btrfs). Journals contain highly compressible data.

Anyway, two example journals. The parent directory has chattr +c, both journals inherited it. The first URL is filefrag -v, the 2nd is btrfs-debug -f; for each journal.

This is a rotated journal. Upon rotation on Btrfs, journald defragments the file, which ends up compressing it when chattr +c is set.
https://da.gd/4NKyq
https://da.gd/zEeYW

This is an active system.journal. No compressed extents (the writes I think are too small).
https://da.gd/cBjX
https://da.gd/YXuI

Extra credit if you've followed this far... The rotated log has piles of unwritten items in it that are making it fairly inefficient even with compression. Just using cat to write its contents to a new file, compression goes from a 1.27 ratio to 5.70. Here are the results after catting that file:
https://da.gd/rE8KT
https://da.gd/PD5qI

-- 
Chris Murphy
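For a rough sense of scale of the metadata overhead described in this message, the sketch below multiplies it out. The "about 30 nodes" and "less than 4KiB" figures are from the message; the 16KiB nodesize is an assumption (the mkfs.btrfs default), since the message does not state the nodesize in use, and the function name is purely illustrative.

```python
KIB = 1024


def metadata_amplification(nodes_written, nodesize_bytes, payload_bytes):
    """Ratio of metadata bytes written to journal payload bytes appended."""
    return (nodes_written * nodesize_bytes) / payload_bytes


# ~30 nodes per append is from the message; the 16 KiB nodesize is an assumption.
ratio = metadata_amplification(
    nodes_written=30,
    nodesize_bytes=16 * KIB,
    payload_bytes=4 * KIB,
)
print(f"about {ratio:.0f}x metadata bytes per payload byte")
```

Since the append is described as smaller than 4KiB, the real ratio under these assumptions would be at least this large.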
Re: No space left on device when doing "mkdir"
Dmarc is off, here's the output of the allocations: it's working correctly right now, I'll update when it does it again. /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/flags:2 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/raid1/used_bytes:3948544 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/raid1/total_bytes:33554432 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/bytes_pinned:0 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/disk_total:67108864 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/bytes_may_use:0 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/bytes_readonly:0 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/bytes_used:3948544 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/bytes_reserved:0 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/disk_used:7897088 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/total_bytes_pinned:0 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/total_bytes:33554432 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/flags:4 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/raid1/used_bytes:65864957952 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/raid1/total_bytes:83751862272 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/bytes_pinned:0 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/disk_total:167503724544 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/bytes_may_use:739508224 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/bytes_readonly:0 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/bytes_used:65864957952 
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/bytes_reserved:1835008 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/disk_used:131729915904 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/total_bytes_pinned:1884160 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/total_bytes:83751862272 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/global_rsv_size:536870912 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/flags:1 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/raid1/used_bytes:23029876707328 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/raid1/total_bytes:23175643529216 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/bytes_pinned:0 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/disk_total:46351287058432 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/bytes_may_use:36474880 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/bytes_readonly:1703936 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/bytes_used:23029876707328 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/bytes_reserved:15003648 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/disk_used:46059753414656 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/total_bytes_pinned:0 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/total_bytes:23175643529216 /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/global_rsv_reserved:536870912 On Thu, Apr 27, 2017 at 6:35 PM, Chris Murphywrote: > On Thu, Apr 27, 2017 at 10:46 AM, Gerard Saraber wrote: >> After a reboot, I found this in the logs: >> [ 322.510152] BTRFS info (device sdm): The free space cache file >> (36114966511616) is invalid. 
skip it >> [ 488.702570] btrfs_printk: 847 callbacks suppressed >> >> >> >> On Thu, Apr 27, 2017 at 10:18 AM, Gerard Saraber wrote: >>> no snapshots and no qgroups, just a straight up large volume. >>> >>> shrapnel gerard-store # btrfs fi df /home/exports >>> Data, RAID1: total=20.93TiB, used=20.86TiB >>> System, RAID1: total=32.00MiB, used=3.73MiB >>> Metadata, RAID1: total=79.00GiB, used=61.10GiB >>> GlobalReserve, single: total=512.00MiB, used=544.00KiB >>> >>> shrapnel gerard-store # btrfs filesystem usage /home/exports >>> Overall: >>> Device size: 69.13TiB >>> Device allocated: 42.01TiB >>> Device unallocated: 27.13TiB >>> Device missing: 0.00B >>> Used: 41.84TiB >>> Free (estimated): 13.63TiB (min: 13.63TiB) >>> Data ratio: 2.00 >>> Metadata ratio: 2.00 >>> Global reserve: 512.00MiB (used: 1.52MiB) >>> >>> On Thu, Apr 27, 2017 at 9:07 AM, Roman Mamedov wrote: On Thu, 27 Apr
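The "Free (estimated)" figure in the quoted `btrfs filesystem usage` output can be cross-checked against the other quoted numbers. The formula below is an assumption about how the tool derives that value (slack inside existing data chunks plus unallocated raw space divided by the data ratio); it is used here only because it reproduces the reported 13.63TiB to within rounding, and the function name is hypothetical.

```python
def free_estimated_tib(unallocated_raw, data_ratio, data_total, data_used):
    """Logical space still usable for data, in TiB.

    Assumed formula: slack in already allocated data chunks plus the
    unallocated raw space divided by the replication ratio.
    """
    return (data_total - data_used) + unallocated_raw / data_ratio


estimate = free_estimated_tib(
    unallocated_raw=27.13,  # "Device unallocated", TiB
    data_ratio=2.00,        # "Data ratio" (RAID1)
    data_total=20.93,       # "Data, RAID1: total", TiB
    data_used=20.86,        # "Data, RAID1: used", TiB
)
print(f"{estimate:.3f} TiB estimated free for data")
```

Note that this estimate concerns data space only. It says nothing about whether metadata chunks have room, which is the more likely place to look when `mkdir` fails with ENOSPC.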
Re: [PATCH 1/2 v2] btrfs-progs: fix btrfs send & receive with -e flag
Hi,

Sorry for the confusion. I've checked once again and the same issue happens in all cases. I didn't notice this because my regular backups are done automatically in a cron task and the snapshots look fine despite the error, so I incorrectly assumed the error didn't happen there, but it actually did. I've clarified this in the last comment on bugzilla.

Sincerely, Nazar Mokrynskyi
github.com/nazar-pc
Tox: A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249

On 28.04.17 13:03, Lakshmipathi.G wrote:
> I can take a look. What I'm wondering about is why it fails only in the HDD
> to SSD case. If -ENODATA is returned with this patch it should mean that there
> was no header data. So is the user sure that this doesn't indicate a valid
> error?
>
> Christian
Re: [PATCH] fstests: regression test for nocsum buffered read's repair
On Wed, Apr 26, 2017 at 9:54 PM, Liu Bo wrote:
> This is to test whether buffered read retry-repair code is able to work in
> raid1 case as expected.
>
> Please note that without checksum, btrfs doesn't know if the data used to
> repair is correct, so repair is more of resync which makes sure that both
> of the copy has the same content.
>
> Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed and
> drop checks") introduced the regression.
>
> The upstream fix is
> Btrfs: bring back repair during read

Same issue as all the other tests you sent, fails on a patched kernel due to mismatch between the $physical_on_scratch offset and the golden output:

root 08:22:02 /home/fdmanana/git/hub/xfstests (master)> ./check btrfs/143
FSTYP         -- btrfs
PLATFORM      -- Linux/x86_64 debian3 4.10.0-rc8-btrfs-next-40+
MKFS_OPTIONS  -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

btrfs/143 - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/143.out.bad)
    --- tests/btrfs/143.out 2017-04-28 08:21:59.358432901 +0100
    +++ /home/fdmanana/git/hub/xfstests/results//btrfs/143.out.bad 2017-04-28 08:22:15.254446208 +0100
    @@ -1,39 +1,39 @@
     QA output created by 143
     wrote 131072/131072 bytes at offset 0
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    -wrote 65536/65536 bytes at offset 244056064
    +wrote 65536/65536 bytes at offset 1103101952
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    -0e8c0000: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
    ...
    (Run 'diff -u tests/btrfs/143.out /home/fdmanana/git/hub/xfstests/results//btrfs/143.out.bad' to see the entire diff)
Ran: btrfs/143
Failures: btrfs/143
Failed 1 of 1 tests

root 08:22:16 /home/fdmanana/git/hub/xfstests (master)>
root 08:22:22 /home/fdmanana/git/hub/xfstests (master)> diff -u tests/btrfs/143.out /home/fdmanana/git/hub/xfstests/results//btrfs/143.out.bad
--- tests/btrfs/143.out 2017-04-28 08:21:59.358432901 +0100
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/143.out.bad 2017-04-28 08:22:15.254446208 +0100
@@ -1,39 +1,39 @@
 QA output created by 143
 wrote 131072/131072 bytes at offset 0
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-wrote 65536/65536 bytes at offset 244056064
+wrote 65536/65536 bytes at offset 1103101952
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-0e8c0000: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0010: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0020: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0030: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0040: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0050: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0060: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0070: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0080: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0090: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00a0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00b0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00c0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00d0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00e0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00f0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0100: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0110: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0120: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0130: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0140: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0150: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0160: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0170: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0180: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0190: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01a0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01b0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01c0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01d0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01e0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01f0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-read 512/512 bytes at offset 244056064
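The mismatch above is between two physical byte offsets, and the dump addresses in the golden output are the same offset written in hexadecimal. The short sketch below converts both; it assumes an 8-digit zero-padded hex address format, and the function and variable names are illustrative only.

```python
def dump_address(offset_bytes):
    """Format a byte offset as an 8-digit hex address (assumed dump format)."""
    return f"{offset_bytes:08x}"


golden = 244056064     # offset recorded in tests/btrfs/143.out
observed = 1103101952  # offset seen in 143.out.bad

print(dump_address(golden))    # 0e8c0000
print(dump_address(observed))  # 41c00000
print((observed - golden) / 2**20, "MiB apart")
```

Because the golden output hardcodes an offset that depends on where the data chunk lands on the scratch device, a device of a different size can place it elsewhere. That appears to be what the v2 posting of this test addresses by passing `-b 1G` to mkfs so the layout is consistent.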
Re: [PATCH 1/2 v2] btrfs-progs: fix btrfs send & receive with -e flag
Hi.

Adding the bug reporter, Nazar, for the discussion (as I'm not familiar with send/receive feature/code).

Cheers,
Lakshmipathi.G
http://www.giis.co.in http://www.webminal.org

On Fri, Apr 28, 2017 at 3:25 PM, Christian Brauner wrote:
>
> Hi,
>
> On Fri, Apr 28, 2017 at 02:55:31PM +0530, Lakshmipathi.G wrote:
> > Seems like user reported an issue with this patch. please check
> > https://bugzilla.kernel.org/show_bug.cgi?id=195597
>
> I can take a look. What I'm wondering about is why it fails only in the HDD
> to SSD case. If -ENODATA is returned with this patch it should mean that there
> was no header data. So is the user sure that this doesn't indicate a valid
> error?
>
> Christian
>
> > Cheers,
> > Lakshmipathi.G
> >
> > On Tue, Apr 4, 2017 at 1:51 AM, Christian Brauner <christian.brau...@ubuntu.com> wrote:
> > > The old check here tried to ensure that empty streams are not considered valid.
> > > The old check however, will always fail when only one run through the while(1)
> > > loop is needed and honor_end_cmd is set. So this:
> > >
> > >     btrfs send /some/subvol | btrfs receive -e /some/
> > >
> > > will consistently fail because -e causes honor_end_cmd to be set and
> > > btrfs_read_and_process_send_stream() to correctly return 1. So the command will
> > > be successful but btrfs receive will error out because the send - receive
> > > concluded in one run through the while(1) loop.
> > >
> > > If we want to exclude empty streams we need a way to tell the difference between
> > > btrfs_read_and_process_send_stream() returning 1 because read_buf() did not
> > > detect any data and read_and_process_cmd() returning 1 because honor_end_cmd was
> > > set. Without introducing too many changes the best way to me seems to have
> > > btrfs_read_and_process_send_stream() return -ENODATA in the first case. The rest
> > > stays the same. We can then check for -ENODATA in do_receive() and report a
> > > proper error in this case. This should also be backwards compatible to previous
> > > versions of btrfs receive. They will fail on empty streams because a negative
> > > value is returned. The only thing that they will lack is a nice error message.
> > >
> > > Signed-off-by: Christian Brauner
> > > ---
> > > Changelog: 2017-04-03
> > > - no changes
> > > ---
> > >  cmds-receive.c | 13 +++++--------
> > >  send-stream.c  |  2 +-
> > >  2 files changed, 6 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/cmds-receive.c b/cmds-receive.c
> > > index 6cf22637..b59f00e4 100644
> > > --- a/cmds-receive.c
> > > +++ b/cmds-receive.c
> > > @@ -1091,7 +1091,6 @@ static int do_receive(struct btrfs_receive *rctx, const char *tomnt,
> > >  	char *dest_dir_full_path;
> > >  	char root_subvol_path[PATH_MAX];
> > >  	int end = 0;
> > > -	int count;
> > >
> > >  	dest_dir_full_path = realpath(tomnt, NULL);
> > >  	if (!dest_dir_full_path) {
> > > @@ -1186,7 +1185,6 @@ static int do_receive(struct btrfs_receive *rctx, const char *tomnt,
> > >  	if (ret < 0)
> > >  		goto out;
> > >
> > > -	count = 0;
> > >  	while (!end) {
> > >  		if (rctx->cached_capabilities_len) {
> > >  			if (g_verbose >= 3)
> > > @@ -1200,16 +1198,15 @@ static int do_receive(struct btrfs_receive *rctx, const char *tomnt,
> > >  							 rctx,
> > >  							 rctx->honor_end_cmd,
> > >  							 max_errors);
> > > -		if (ret < 0)
> > > -			goto out;
> > > -		/* Empty stream is invalid */
> > > -		if (ret && count == 0) {
> > > +		if (ret < 0 && ret == -ENODATA) {
> > > +			/* Empty stream is invalid */
> > >  			error("empty stream is not considered valid");
> > >  			ret = -EINVAL;
> > >  			goto out;
> > > +		} else if (ret < 0) {
> > > +			goto out;
> > >  		}
> > > -		count++;
> > > -		if (ret)
> > > +		if (ret > 0)
> > >  			end = 1;
> > >
> > >  		close_inode_for_write(rctx);
> > > diff --git a/send-stream.c b/send-stream.c
> > > index 5a028cd9..78f2571a 100644
> > > --- a/send-stream.c
> > > +++ b/send-stream.c
> > > @@ -492,7 +492,7 @@ int btrfs_read_and_process_send_stream(int fd,
> > >  		if (ret < 0)
> > >  			goto out;
> > >  		if (ret) {
> > > -			ret = 1;
> > > +			ret = -ENODATA;
> > >  			goto out;
> > >  		}
> > >
> > > --
> > > 2.11.0
Re: [PATCH] fstests: regression test for nocsum dio read's repair
On Wed, Apr 26, 2017 at 9:54 PM, Liu Bo wrote:
> Commit 2dabb3248453 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
> introduced this regression. It'd cause 'Segmentation fault' error.
>
> The upstream fix is
> Btrfs: fix segment fault when doing dio read

Same issue as the other tests, it fails on a patched kernel. The value of $physical_on_scratch doesn't match the value in the golden output:

root 08:18:24 /home/fdmanana/git/hub/xfstests (master)> ./check btrfs/142
FSTYP         -- btrfs
PLATFORM      -- Linux/x86_64 debian3 4.10.0-rc8-btrfs-next-40+
MKFS_OPTIONS  -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

btrfs/142 - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/142.out.bad)
    --- tests/btrfs/142.out 2017-04-28 08:18:22.206251115 +0100
    +++ /home/fdmanana/git/hub/xfstests/results//btrfs/142.out.bad 2017-04-28 08:18:35.946262617 +0100
    @@ -1,39 +1,39 @@
     QA output created by 142
     wrote 131072/131072 bytes at offset 0
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    -wrote 65536/65536 bytes at offset 244056064
    +wrote 65536/65536 bytes at offset 1103101952
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    -0e8c0000: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
    ...
    (Run 'diff -u tests/btrfs/142.out /home/fdmanana/git/hub/xfstests/results//btrfs/142.out.bad' to see the entire diff)
Ran: btrfs/142
Failures: btrfs/142
Failed 1 of 1 tests

root 08:18:36 /home/fdmanana/git/hub/xfstests (master)>
root 08:18:38 /home/fdmanana/git/hub/xfstests (master)> diff -u tests/btrfs/142.out /home/fdmanana/git/hub/xfstests/results//btrfs/142.out.bad
--- tests/btrfs/142.out 2017-04-28 08:18:22.206251115 +0100
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/142.out.bad 2017-04-28 08:18:35.946262617 +0100
@@ -1,39 +1,39 @@
 QA output created by 142
 wrote 131072/131072 bytes at offset 0
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-wrote 65536/65536 bytes at offset 244056064
+wrote 65536/65536 bytes at offset 1103101952
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-0e8c0000: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0010: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0020: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0030: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0040: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0050: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0060: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0070: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0080: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0090: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00a0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00b0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00c0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00d0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00e0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00f0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0100: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0110: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0120: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0130: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0140: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0150: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0160: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0170: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0180: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0190: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01a0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01b0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01c0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01d0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01e0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01f0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-read 512/512 bytes at offset 244056064
+41c00000: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
+41c00010: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
+41c00020: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
+41c00030: aa aa aa aa aa aa aa aa
Re: [PATCH 1/2 v2] btrfs-progs: fix btrfs send & receive with -e flag
Hi,

On Fri, Apr 28, 2017 at 02:55:31PM +0530, Lakshmipathi.G wrote:
> Seems like user reported an issue with this patch. please check
> https://bugzilla.kernel.org/show_bug.cgi?id=195597

I can take a look. What I'm wondering about is why it fails only in the HDD to SSD case. If -ENODATA is returned with this patch it should mean that there was no header data. So is the user sure that this doesn't indicate a valid error?

Christian

> Cheers,
> Lakshmipathi.G
>
> On Tue, Apr 4, 2017 at 1:51 AM, Christian Brauner <christian.brau...@ubuntu.com> wrote:
> > The old check here tried to ensure that empty streams are not considered valid.
> > The old check however, will always fail when only one run through the while(1)
> > loop is needed and honor_end_cmd is set. So this:
> >
> >     btrfs send /some/subvol | btrfs receive -e /some/
> >
> > will consistently fail because -e causes honor_end_cmd to be set and
> > btrfs_read_and_process_send_stream() to correctly return 1. So the command will
> > be successful but btrfs receive will error out because the send - receive
> > concluded in one run through the while(1) loop.
> >
> > If we want to exclude empty streams we need a way to tell the difference between
> > btrfs_read_and_process_send_stream() returning 1 because read_buf() did not
> > detect any data and read_and_process_cmd() returning 1 because honor_end_cmd was
> > set. Without introducing too many changes the best way to me seems to have
> > btrfs_read_and_process_send_stream() return -ENODATA in the first case. The rest
> > stays the same. We can then check for -ENODATA in do_receive() and report a
> > proper error in this case. This should also be backwards compatible to previous
> > versions of btrfs receive. They will fail on empty streams because a negative
> > value is returned. The only thing that they will lack is a nice error message.
> >
> > Signed-off-by: Christian Brauner
> > ---
> > Changelog: 2017-04-03
> > - no changes
> > ---
> >  cmds-receive.c | 13 +++++--------
> >  send-stream.c  |  2 +-
> >  2 files changed, 6 insertions(+), 9 deletions(-)
> >
> > diff --git a/cmds-receive.c b/cmds-receive.c
> > index 6cf22637..b59f00e4 100644
> > --- a/cmds-receive.c
> > +++ b/cmds-receive.c
> > @@ -1091,7 +1091,6 @@ static int do_receive(struct btrfs_receive *rctx, const char *tomnt,
> >  	char *dest_dir_full_path;
> >  	char root_subvol_path[PATH_MAX];
> >  	int end = 0;
> > -	int count;
> >
> >  	dest_dir_full_path = realpath(tomnt, NULL);
> >  	if (!dest_dir_full_path) {
> > @@ -1186,7 +1185,6 @@ static int do_receive(struct btrfs_receive *rctx, const char *tomnt,
> >  	if (ret < 0)
> >  		goto out;
> >
> > -	count = 0;
> >  	while (!end) {
> >  		if (rctx->cached_capabilities_len) {
> >  			if (g_verbose >= 3)
> > @@ -1200,16 +1198,15 @@ static int do_receive(struct btrfs_receive *rctx, const char *tomnt,
> >  							 rctx,
> >  							 rctx->honor_end_cmd,
> >  							 max_errors);
> > -		if (ret < 0)
> > -			goto out;
> > -		/* Empty stream is invalid */
> > -		if (ret && count == 0) {
> > +		if (ret < 0 && ret == -ENODATA) {
> > +			/* Empty stream is invalid */
> >  			error("empty stream is not considered valid");
> >  			ret = -EINVAL;
> >  			goto out;
> > +		} else if (ret < 0) {
> > +			goto out;
> >  		}
> > -		count++;
> > -		if (ret)
> > +		if (ret > 0)
> >  			end = 1;
> >
> >  		close_inode_for_write(rctx);
> > diff --git a/send-stream.c b/send-stream.c
> > index 5a028cd9..78f2571a 100644
> > --- a/send-stream.c
> > +++ b/send-stream.c
> > @@ -492,7 +492,7 @@ int btrfs_read_and_process_send_stream(int fd,
> >  		if (ret < 0)
> >  			goto out;
> >  		if (ret) {
> > -			ret = 1;
> > +			ret = -ENODATA;
> >  			goto out;
> >  		}
> >
> > --
> > 2.11.0
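The return-value convention discussed in this thread can be modelled in isolation. The following is a minimal sketch, not the btrfs-progs code: `old_check()` and `new_check()` and the `outcome` enum are hypothetical names, and they model only the decision taken after one call to btrfs_read_and_process_send_stream(), with `ret` and `count` used as in the quoted diff.

```c
#include <errno.h>

/* Possible outcomes of one pass through the receive loop (hypothetical). */
enum outcome { KEEP_GOING, FINISHED, EMPTY_STREAM, FAILED };

/*
 * Old logic: any non-zero return on the first pass (count == 0) is
 * reported as an empty stream, even when it is the normal "end command
 * seen" return value of 1.
 */
static enum outcome old_check(int ret, int count)
{
	if (ret < 0)
		return FAILED;
	if (ret && count == 0)
		return EMPTY_STREAM;
	return ret ? FINISHED : KEEP_GOING;
}

/*
 * New logic: only -ENODATA means "no data at all"; a positive return
 * always means the stream ended normally.
 */
static enum outcome new_check(int ret)
{
	if (ret == -ENODATA)
		return EMPTY_STREAM;
	if (ret < 0)
		return FAILED;
	return ret > 0 ? FINISHED : KEEP_GOING;
}
```

Under these assumptions, a single-pass stream received with -e yields `old_check(1, 0) == EMPTY_STREAM` but `new_check(1) == FINISHED`, which is the difference the patch description is after.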
Re: [PATCH v2] fstests: regression test for btrfs buffered read's repair
On Wed, Apr 26, 2017 at 7:09 PM, Liu Bo wrote:
> This case tests whether buffered read can repair the bad copy if we
> have a good copy.
>
> Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed
> and drop checks") introduced the regression.
>
> The upstream fix is
> Btrfs: bring back repair during read

Same issue as reported for the other new test (btrfs/140), the test fails on a kernel with the mentioned patch. Seems like it makes wrong assumptions somewhere (I haven't investigated, I just ran the test):

root 08:09:17 /home/fdmanana/git/hub/xfstests (master)> ./check btrfs/141
FSTYP         -- btrfs
PLATFORM      -- Linux/x86_64 debian3 4.10.0-rc8-btrfs-next-40+
MKFS_OPTIONS  -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

btrfs/141 - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/141.out.bad)
    --- tests/btrfs/141.out 2017-04-28 08:09:13.289791597 +0100
    +++ /home/fdmanana/git/hub/xfstests/results//btrfs/141.out.bad 2017-04-28 08:09:28.469804304 +0100
    @@ -1,39 +1,39 @@
     QA output created by 141
     wrote 131072/131072 bytes at offset 0
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    -wrote 65536/65536 bytes at offset 244056064
    +wrote 65536/65536 bytes at offset 1103101952
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    -0e8c0000: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
    ...
    (Run 'diff -u tests/btrfs/141.out /home/fdmanana/git/hub/xfstests/results//btrfs/141.out.bad' to see the entire diff)
Ran: btrfs/141
Failures: btrfs/141
Failed 1 of 1 tests

root 08:09:29 /home/fdmanana/git/hub/xfstests (master)>
root 08:09:31 /home/fdmanana/git/hub/xfstests (master)> diff -u tests/btrfs/141.out /home/fdmanana/git/hub/xfstests/results//btrfs/141.out.bad
--- tests/btrfs/141.out 2017-04-28 08:09:13.289791597 +0100
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/141.out.bad 2017-04-28 08:09:28.469804304 +0100
@@ -1,39 +1,39 @@
 QA output created by 141
 wrote 131072/131072 bytes at offset 0
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-wrote 65536/65536 bytes at offset 244056064
+wrote 65536/65536 bytes at offset 1103101952
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-0e8c0000: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0010: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0020: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0030: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0040: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0050: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0060: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0070: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0080: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0090: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00a0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00b0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00c0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00d0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00e0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00f0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0100: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0110: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0120: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0130: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0140: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0150: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0160: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0170: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0180: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0190: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01a0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01b0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01c0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01d0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01e0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01f0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-read 512/512 bytes at offset 244056064
+41c00000: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
+41c00010: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
[PATCH 3/3 v2] Make max_size consistent with nr
Since we memset tmpl, max_size==0. This does not seem consistent with nr = 1.
In check_extent_refs, we will call:

	set_extent_dirty(root->fs_info->excluded_extents,
			 rec->start,
			 rec->start + rec->max_size - 1);

This ends up with BUG_ON(end < start) in insert_state.

Signed-off-by: Christophe de Dinechin
---
 cmds-check.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/cmds-check.c b/cmds-check.c
index c13f900..d5e2966 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -6193,6 +6193,7 @@ static int add_tree_backref(struct cache_tree *extent_cache, u64 bytenr,
 	tmpl.start = bytenr;
 	tmpl.nr = 1;
 	tmpl.metadata = 1;
+	tmpl.max_size = 1;
 
 	ret = add_extent_rec_nolookup(extent_cache, &tmpl);
 	if (ret)
--
2.9.3
[PATCH 2/3 v2] Prevent attempt to insert extent record with max_size==0
When this happens, we will trip a BUG_ON(end < start) in insert_state
because in check_extent_refs, we use this max_size expecting it's not zero:

	set_extent_dirty(root->fs_info->excluded_extents,
			 rec->start,
			 rec->start + rec->max_size - 1);

See https://bugzilla.redhat.com/show_bug.cgi?id=1435567 for an example
where this scenario occurs.

Signed-off-by: Christophe de Dinechin
---
 cmds-check.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/cmds-check.c b/cmds-check.c
index 2d3ebc1..c13f900 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -6029,6 +6029,7 @@ static int add_extent_rec_nolookup(struct cache_tree *extent_cache,
 	struct extent_record *rec;
 	int ret = 0;
 
+	BUG_ON(tmpl->max_size == 0);
 	rec = malloc(sizeof(*rec));
 	if (!rec)
 		return -ENOMEM;
--
2.9.3
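The arithmetic behind the BUG_ON can be checked in isolation. This is a minimal sketch that assumes only what the message states, namely that the inclusive range end is computed as start + max_size - 1; `range_end()` and `range_is_valid()` are hypothetical helper names, not btrfs-progs functions.

```c
#include <stdint.h>

/* Inclusive end of the range, computed as in the set_extent_dirty() call. */
static uint64_t range_end(uint64_t start, uint64_t max_size)
{
	return start + max_size - 1;
}

/*
 * Mirrors the condition that BUG_ON(end < start) guards against:
 * with max_size == 0 and a non-zero start, the end lands one byte
 * before the start. (With start == 0 the unsigned subtraction wraps
 * around instead, so this check alone would not catch that case.)
 */
static int range_is_valid(uint64_t start, uint64_t max_size)
{
	return range_end(start, max_size) >= start;
}
```

For example, a 16384-byte record starting at 52428800 ends at 52445183, while the same start with max_size == 0 gives an end of 52428799, which is the end < start situation described above.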
Re: [PATCH v2] fstests: regression test for btrfs dio read repair
On Wed, Apr 26, 2017 at 7:09 PM, Liu Bo wrote:
> This case tests whether dio read can repair the bad copy if we have
> a good copy.
>
> Commit 2dabb3248453 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
> introduced the regression.
>
> The upstream fix is
> Btrfs: fix invalid dereference in btrfs_retry_endio
>
> Signed-off-by: Liu Bo

Thanks for doing this. Just tested this, on a kernel with the mentioned fix, and it fails:

root 08:04:11 /home/fdmanana/git/hub/xfstests (master)> ./check btrfs/140
FSTYP         -- btrfs
PLATFORM      -- Linux/x86_64 debian3 4.10.0-rc8-btrfs-next-40+
MKFS_OPTIONS  -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

btrfs/140 - output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/140.out.bad)
    --- tests/btrfs/140.out 2017-04-28 07:59:13.069289130 +0100
    +++ /home/fdmanana/git/hub/xfstests/results//btrfs/140.out.bad 2017-04-28 08:04:18.209544574 +0100
    @@ -1,39 +1,39 @@
     QA output created by 140
     wrote 131072/131072 bytes at offset 0
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    -wrote 65536/65536 bytes at offset 244056064
    +wrote 65536/65536 bytes at offset 1103101952
     XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
    -0e8c0000: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
    ...
    (Run 'diff -u tests/btrfs/140.out /home/fdmanana/git/hub/xfstests/results//btrfs/140.out.bad' to see the entire diff)
Ran: btrfs/140
Failures: btrfs/140
Failed 1 of 1 tests

root 08:04:18 /home/fdmanana/git/hub/xfstests (master)>
root 08:04:27 /home/fdmanana/git/hub/xfstests (master)> diff -u tests/btrfs/140.out /home/fdmanana/git/hub/xfstests/results//btrfs/140.out.bad
--- tests/btrfs/140.out 2017-04-28 07:59:13.069289130 +0100
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/140.out.bad 2017-04-28 08:04:18.209544574 +0100
@@ -1,39 +1,39 @@
 QA output created by 140
 wrote 131072/131072 bytes at offset 0
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-wrote 65536/65536 bytes at offset 244056064
+wrote 65536/65536 bytes at offset 1103101952
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-0e8c0000: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0010: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0020: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0030: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0040: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0050: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0060: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0070: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0080: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0090: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00a0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00b0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00c0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00d0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00e0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c00f0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0100: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0110: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0120: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0130: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0140: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0150: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0160: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0170: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0180: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c0190: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01a0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01b0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01c0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01d0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01e0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-0e8c01f0: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
-read 512/512 bytes at offset 244056064
+41c00000: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
+41c00010: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
+41c00020: aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
Re: [PATCH 3/3] Make max_size consistent with nr
On Fri, 28 Apr 2017 11:13:36 +0200 Christophe de Dinechin wrote:

> Since we memset tmpl, max_size==0. This does not seem consistent with nr = 1.
> In check_extent_refs, we will call:
>
> 	set_extent_dirty(root->fs_info->excluded_extents,
> 			 rec->start,
> 			 rec->start + rec->max_size - 1);
>
> This ends up with BUG_ON(end < start) in insert_state.
>
> Signed-off-by: Christophe de Dinechin
> ---
>  cmds-check.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/cmds-check.c b/cmds-check.c
> index 58e65d6..774e9b6 100644
> --- a/cmds-check.c
> +++ b/cmds-check.c
> @@ -6193,6 +6193,7 @@ static int add_tree_backref(struct cache_tree *extent_cache, u64 bytenr,
>  	tmpl.start = bytenr;
>  	tmpl.nr = 1;
>  	tmpl.metadata = 1;
> +    tmpl.max_size = 1;
>
>  	ret = add_extent_rec_nolookup(extent_cache, &tmpl);
>  	if (ret)

The original code uses Tab characters for indent, but your addition uses spaces. Also same problem in patch 2/3.

--
With respect,
Roman
[PATCH 3/3] Make max_size consistent with nr
Since we memset tmpl, max_size==0. This does not seem consistent with nr = 1.
In check_extent_refs, we will call:

	set_extent_dirty(root->fs_info->excluded_extents,
			 rec->start,
			 rec->start + rec->max_size - 1);

This ends up with BUG_ON(end < start) in insert_state.

Signed-off-by: Christophe de Dinechin
---
 cmds-check.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/cmds-check.c b/cmds-check.c
index 58e65d6..774e9b6 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -6193,6 +6193,7 @@ static int add_tree_backref(struct cache_tree *extent_cache, u64 bytenr,
 	tmpl.start = bytenr;
 	tmpl.nr = 1;
 	tmpl.metadata = 1;
+    tmpl.max_size = 1;
 
 	ret = add_extent_rec_nolookup(extent_cache, &tmpl);
 	if (ret)
--
2.9.3
[PATCH 2/3] Prevent attempt to insert extent record with max_size==0
When this happens, we will trip a BUG_ON(end < start) in insert_state
because in check_extent_refs, we use this max_size expecting it's not zero:

	set_extent_dirty(root->fs_info->excluded_extents,
			 rec->start,
			 rec->start + rec->max_size - 1);

See https://bugzilla.redhat.com/show_bug.cgi?id=1435567 for an example
where this scenario occurs.

Signed-off-by: Christophe de Dinechin
---
 cmds-check.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/cmds-check.c b/cmds-check.c
index 2d3ebc1..58e65d6 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -6029,6 +6029,7 @@ static int add_extent_rec_nolookup(struct cache_tree *extent_cache,
 	struct extent_record *rec;
 	int ret = 0;
 
+    BUG_ON(tmpl->max_size == 0);
 	rec = malloc(sizeof(*rec));
 	if (!rec)
 		return -ENOMEM;
--
2.9.3
[PATCH 1/3] Disambiguate between cases where add_tree_backref fails
See https://bugzilla.redhat.com/show_bug.cgi?id=1435567 for an example
where the message occurs.

Signed-off-by: Christophe de Dinechin
---
 cmds-check.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 17b7efb..2d3ebc1 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -6832,14 +6832,14 @@ static int process_extent_item(struct btrfs_root *root,
 			ret = add_tree_backref(extent_cache, key.objectid,
 					0, offset, 0);
 			if (ret < 0)
-				error("add_tree_backref failed: %s",
+				error("add_tree_backref failed (extent items tree block): %s",
 				      strerror(-ret));
 			break;
 		case BTRFS_SHARED_BLOCK_REF_KEY:
 			ret = add_tree_backref(extent_cache, key.objectid,
 					offset, 0, 0);
 			if (ret < 0)
-				error("add_tree_backref failed: %s",
+				error("add_tree_backref failed (extent items shared block): %s",
 				      strerror(-ret));
 			break;
 		case BTRFS_EXTENT_DATA_REF_KEY:
@@ -7753,7 +7753,7 @@ static int run_next_block(struct btrfs_root *root,
 				ret = add_tree_backref(extent_cache,
 						key.objectid, 0, key.offset, 0);
 				if (ret < 0)
-					error("add_tree_backref failed: %s",
+					error("add_tree_backref failed (leaf tree block): %s",
 					      strerror(-ret));
 				continue;
 			}
@@ -7761,7 +7761,7 @@ static int run_next_block(struct btrfs_root *root,
 				ret = add_tree_backref(extent_cache,
 						key.objectid, key.offset, 0, 0);
 				if (ret < 0)
-					error("add_tree_backref failed: %s",
+					error("add_tree_backref failed (leaf shared block): %s",
 					      strerror(-ret));
 				continue;
 			}
@@ -7866,7 +7866,7 @@ static int run_next_block(struct btrfs_root *root,
 			ret = add_tree_backref(extent_cache, ptr, parent,
 					owner, 1);
 			if (ret < 0) {
-				error("add_tree_backref failed: %s",
+				error("add_tree_backref failed (non-leaf block): %s",
 				      strerror(-ret));
 				continue;
 			}
--
2.9.3
[PATCH] btrfs: check options during subsequent mount
We allow recursive mounts with subvol options such as [1]

[1]
 mount -o rw,compress=lzo /dev/sdc /btrfs1
 mount -o ro,subvol=sv2 /dev/sdc /btrfs2

And except for the btrfs-specific subvol and subvolid options
all other options are just ignored in the subsequent mounts.

In the below example [2] the effective compression is only zlib,
even though there was no error for the option -o compress=lzo.

[2]

 # mount -o compress=zlib /dev/sdc /btrfs1
 #echo $?
 0

 # mount -o compress=lzo /dev/sdc /btrfs
 #echo $?
 0

 #cat /proc/self/mounts
 ::
 /dev/sdc /btrfs1 btrfs rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/ 0 0
 /dev/sdc /btrfs btrfs rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/ 0 0

Further, a random string produces no error either.
-
 # mount -o compress=zlib /dev/sdc /btrfs1
 #echo $?
 0

 # mount -o something /dev/sdc /btrfs
 #echo $?
 0
-

This patch fixes the above issue by checking whether the passed
options are only subvol or subvolid in the subsequent mount.

Signed-off-by: Anand Jain
---
 fs/btrfs/super.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 9530a333d302..e0e542345c38 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -389,6 +389,44 @@ static const match_table_t tokens = {
 	{Opt_err, NULL},
 };
 
+static int parse_recursive_mount_options(char *data)
+{
+	substring_t args[MAX_OPT_ARGS];
+	char *options, *orig;
+	char *p;
+	int ret = 0;
+
+	/*
+	 * This is not a remount thread, but we allow recursive mounts
+	 * with varying RO/RW flag to support subvol-mounts. So error-out
+	 * if any other option being passed in here.
+	 */
+
+	options = kstrdup(data, GFP_NOFS);
+	if (!options)
+		return -ENOMEM;
+
+	orig = options;
+
+	while ((p = strsep(&options, ",")) != NULL) {
+		int token;
+		if (!*p)
+			continue;
+
+		token = match_token(p, tokens, args);
+		switch(token) {
+		case Opt_subvol:
+		case Opt_subvolid:
+			break;
+		default:
+			ret = -EBUSY;
+		}
+	}
+
+	kfree(orig);
+	return ret;
+}
+
 /*
  * Regular mount options parser.  Everything that is needed only when
  * reading in a new superblock is parsed here.
@@ -1611,6 +1649,8 @@ static struct dentry *btrfs_mount(struct file_system_type *fs_type, int flags,
 		free_fs_info(fs_info);
 		if ((flags ^ s->s_flags) & MS_RDONLY)
 			error = -EBUSY;
+		if (parse_recursive_mount_options(data))
+			error = -EBUSY;
 	} else {
 		snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
 		btrfs_sb(s)->bdev_holder = fs_type;
--
2.10.0
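The option filtering that the patch performs can be tried out in userspace. What follows is a minimal sketch under stated assumptions, not the kernel code: it replaces match_token() with a plain prefix comparison against "subvol=" and "subvolid=", replaces kstrdup()/strsep() with standard C calls, and the names `is_subvol_option()` and `check_recursive_mount_options()` are hypothetical.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Return 1 if opt looks like "subvol=..." or "subvolid=...", else 0. */
static int is_subvol_option(const char *opt)
{
	return strncmp(opt, "subvol=", 7) == 0 ||
	       strncmp(opt, "subvolid=", 9) == 0;
}

/*
 * Userspace model of the check: walk a comma-separated option string
 * and return -EBUSY if any option other than subvol=/subvolid= is
 * present. Empty tokens (as in "a,,b") are skipped, as in the patch.
 */
static int check_recursive_mount_options(const char *data)
{
	size_t len = strlen(data);
	char *copy = malloc(len + 1);
	char *p, *next;
	int ret = 0;

	if (!copy)
		return -ENOMEM;
	memcpy(copy, data, len + 1);

	for (p = copy; p; p = next) {
		/* Split off the current token at the next comma, if any. */
		next = strchr(p, ',');
		if (next)
			*next++ = '\0';
		if (!*p)
			continue;
		if (!is_subvol_option(p))
			ret = -EBUSY;
	}

	free(copy);
	return ret;
}
```

With this model, "subvol=sv2" and "subvolid=5,subvol=/" are accepted, while "compress=lzo" and "something" are rejected with -EBUSY, which matches the behaviour the commit message asks for.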
Re: File system corruption, btrfsck abort
> On 28 Apr 2017, at 02:45, Qu Wenruo wrote:
>
> At 04/26/2017 01:50 AM, Christophe de Dinechin wrote:
>> Hi,
>> I’ve been trying to run btrfs as my primary work filesystem for about 3-4
>> months now on Fedora 25 systems. I ran a few times into filesystem
>> corruptions. At least one I attributed to a damaged disk, but the last one
>> is with a brand new 3T disk that reports no SMART errors. Worse yet, in at
>> least three cases, the filesystem corruption caused btrfsck to crash.
>> The last filesystem corruption is documented here:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1444821. The dmesg log is in
>> there.
>
> According to the bugzilla, the btrfs-progs seems to be too old in btrfs
> standard.
> What about using the latest btrfs-progs v4.10.2?

I tried 4.10.1-1 https://bugzilla.redhat.com/show_bug.cgi?id=1435567#c4.

I am currently debugging with a build from the master branch as of Tuesday (commit bd0ab27afbf14370f9f0da1f5f5ecbb0adc654c1), which is 4.10.2.

There was no change in behavior. Runs are split about evenly between list crash and abort.

I added instrumentation and tried a fix, which brings me a tiny bit further, until I hit a message from delete_duplicate_records:

Ok we have overlapping extents that aren't completely covered by each other, this is going to require more careful thought. The extents are [52428800-16384] and [52432896-16384]

> Furthermore for v4.10.2, btrfs check provides a new mode called lowmem.
> You could try "btrfs check --mode=lowmem" to see if such problem can be
> avoided.

I will try that, but what makes you think this is a memory-related condition? The machine has 16G of RAM, isn’t that enough for an fsck?

> For the kernel bug, it seems to be related to wrongly inserted delayed ref,
> but I can totally be wrong.

For now, I’m focusing on the “repair” part as much as I can, because I assume the kernel bug is there anyway, so someone else is bound to hit this problem.

Thanks
Christophe

> Thanks,
> Qu
>> The btrfsck crash is here:
>> https://bugzilla.redhat.com/show_bug.cgi?id=1435567. I have two crash modes:
>> either an abort or a SIGSEGV. I checked that both still happen on master as
>> of today.
>> The cause of the abort is that we call set_extent_dirty from
>> check_extent_refs with rec->max_size == 0. I’ve instrumented to try to see
>> where we set this to 0 (see
>> https://github.com/c3d/btrfs-progs/tree/rhbz1435567), and indeed, we do
>> sometimes see max_size set to 0 in a few locations. My instrumentation shows
>> this:
>> 78655 [1.792241:0x451fe0] MAX_SIZE_ZERO: Add extent rec 0x139eb80 max_size 16384 tmpl 0x7fffd120
>> 78657 [1.792242:0x451cb8] MAX_SIZE_ZERO: Set max size 0 for rec 0x139ec50 from tmpl 0x7fffcf80
>> 78660 [1.792244:0x451fe0] MAX_SIZE_ZERO: Add extent rec 0x139ed50 max_size 16384 tmpl 0x7fffd120
>> I don’t really know what to make of it.
>> The cause of the SIGSEGV is that we try to free a list entry that has its
>> next set to NULL.
>> #0 list_del (entry=0x55db0420) at /usr/src/debug/btrfs-progs-v4.10.1/kernel-lib/list.h:125
>> #1 free_all_extent_backrefs (rec=0x55db0350) at cmds-check.c:5386
>> #2 maybe_free_extent_rec (extent_cache=0x7fffd990, rec=0x55db0350) at cmds-check.c:5417
>> #3 0x555b308f in check_block (flags=, buf=0x7b87cdf0, extent_cache=0x7fffd990, root=0x5587d570) at cmds-check.c:5851
>> #4 run_next_block (root=root@entry=0x5587d570, bits=bits@entry=0x558841
>> I don’t know if the two problems are related, but they seem to be pretty
>> consistent on this specific disk, so I think that we have a good opportunity
>> to improve btrfsck to make it more robust to this specific form of
>> corruption. But I don’t want to haphazardly modify a code I don’t really
>> understand. So if anybody could make a suggestion on what the right strategy
>> should be when we have max_size == 0, or how to avoid it in the first place.
>> I don’t know if this is relevant at all, but all the machines that failed
>> that way were used to run VMs with KVM/QEMU. Disk activity tends to be
>> somewhat intense on occasions, since the VMs running there are part of a
>> personal Jenkins ring that automatically builds various projects. Nominally,
>> there are between three and five guests running (Windows XP, Windows 10,
>> macOS, Fedora25, Ubuntu 16.04).
>> Thanks
>> Christophe de Dinechin

-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body