Re: [PATCH] btrfs: check options during subsequent mount

2017-04-28 Thread Andrei Borzenkov
On 28.04.2017 12:14, Anand Jain wrote:
> We allow recursive mounts with subvol options such as [1]
> 
> [1]
>  mount -o rw,compress=lzo /dev/sdc /btrfs1
>  mount -o ro,subvol=sv2 /dev/sdc /btrfs2
> 
> And except for the btrfs-specific subvol and subvolid options,
> all other options are just ignored in the subsequent mounts.
> 
> In the example below [2] the effective compression is only zlib,
> even though there was no error for the option -o compress=lzo.
> 
> [2]
> 
>  # mount -o compress=zlib /dev/sdc /btrfs1
>  # echo $?
>  0
> 
>  # mount -o compress=lzo /dev/sdc /btrfs
>  # echo $?
>  0
> 
>  # cat /proc/self/mounts
>  ::
>  /dev/sdc /btrfs1 btrfs rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/ 0 0
>  /dev/sdc /btrfs btrfs rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/ 0 0
> 
> 
> Further, a random option string produces no error as well.
> -
>  # mount -o compress=zlib /dev/sdc /btrfs1
>  # echo $?
>  0
> 
>  # mount -o something /dev/sdc /btrfs
>  # echo $?
>  0
> -
> 
> This patch fixes the above issue by checking that the passed
> options are only subvol or subvolid in the subsequent mount.
> 
> Signed-off-by: Anand Jain 
> ---
>  fs/btrfs/super.c | 40 
>  1 file changed, 40 insertions(+)
> 
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 9530a333d302..e0e542345c38 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -389,6 +389,44 @@ static const match_table_t tokens = {
>   {Opt_err, NULL},
>  };
>  
> +static int parse_recursive_mount_options(char *data)
> +{
> + substring_t args[MAX_OPT_ARGS];
> + char *options, *orig;
> + char *p;
> + int ret = 0;
> +
> + /*
> +  * This is not a remount, but we allow recursive mounts
> +  * with a varying RO/RW flag to support subvol mounts. So error
> +  * out if any other option is passed in here.
> +  */
> +
> + options = kstrdup(data, GFP_NOFS);
> + if (!options)
> + return -ENOMEM;
> +
> + orig = options;
> +
> + while ((p = strsep(&options, ",")) != NULL) {
> + int token;
> + if (!*p)
> + continue;
> +
> + token = match_token(p, tokens, args);
> + switch(token) {
> + case Opt_subvol:
> + case Opt_subvolid:
> + break;
> + default:
> + ret = -EBUSY;
> + }
> + }
> +
> + kfree(orig);
> + return ret;
> +}
> +
>  /*
>   * Regular mount options parser.  Everything that is needed only when
>   * reading in a new superblock is parsed here.
> @@ -1611,6 +1649,8 @@ static struct dentry *btrfs_mount(struct file_system_type *fs_type, int flags,
>   free_fs_info(fs_info);
>   if ((flags ^ s->s_flags) & MS_RDONLY)
>   error = -EBUSY;
> + if (parse_recursive_mount_options(data))
> + error = -EBUSY;

But if subvol= was passed, it should not reach this place at all -
btrfs_mount() returns earlier in

if (subvol_name || subvol_objectid != BTRFS_FS_TREE_OBJECTID) {
/* mount_subvol() will free subvol_name. */
return mount_subvol(subvol_name, subvol_objectid, flags,
device_name, data);
}

So the check for subvol here seems redundant.

>   } else {
>   snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
>   btrfs_sb(s)->bdev_holder = fs_type;
> 



Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Duncan
Goffredo Baroncelli posted on Fri, 28 Apr 2017 19:05:21 +0200 as
excerpted:

> After some thinking I adopted a different strategy: I use journald as
> collector, then I forward all the logs to rsyslogd, which uses a "log
> append" format. Journald never writes on the root filesystem, only in
> tmp.

Great minds think alike. =:^)

Only here it's syslog-ng that does the permanent writes.

I just couldn't see journald's crazy (for btrfs) write pattern going to 
permanent storage.

And AFAIK, journald has no pre-write filtering mechanism at all, only 
post-write display-time filtering, so even "log-spam" that I don't want/
need logged gets written to it. If I see something spamming 
continuously (I run git kernels and kde, and do get such spammers 
occasionally) I set up a syslog-ng spam filter to kill it, so it never 
actually gets written to permanent storage at all.

But the tmpfs journals and btrfs traditional logs give me the best of 
both worlds: per-boot journals with all the extra metadata, the last ten 
journal entries when I do systemctl status on a unit, etc., and a 
nice filtered and ordered multi-boot log that I can use traditional text-
based log-administration tools on.
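
(For anyone wanting to reproduce this kind of split, a minimal sketch of the 
journald side, assuming a systemd new enough to read journald.conf.d drop-ins; 
the syslog side is whatever collector the distro ships:)

  # keep journald in tmpfs only and hand every record to the syslog socket
  mkdir -p /etc/systemd/journald.conf.d
  printf '[Journal]\nStorage=volatile\nForwardToSyslog=yes\n' \
      > /etc/systemd/journald.conf.d/volatile.conf
  systemctl restart systemd-journald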

The only part of it I'm not happy with is that journald apparently can't 
keep separate user and system journals when set to temporary only -- 
everything goes to the system journal.  Which eventually means that much 
of the stdout/stderr debugging output that kde-based apps like to spew 
ends up in the system journal and (would be in the) log.  But that's a 
journald "documented bug-feature", and I can and do syslog-ng filter it 
before it actually hits the written system log (or console log display).

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



RE: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Paul Jones
> -Original Message-
> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
> ow...@vger.kernel.org] On Behalf Of Goffredo Baroncelli
> Sent: Saturday, 29 April 2017 3:05 AM
> To: Chris Murphy 
> Cc: Btrfs BTRFS 
> Subject: Re: btrfs, journald logs, fragmentation, and fallocate
> 
> 
> In the past I faced the same problems; I collected some data here
> http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html.
> Unfortunately the journald files are very bad, because first the data is
> written (appended), then the index fields are updated. Unfortunately these
> indexes are located right after the last write, so fragmentation is unavoidable.

Perhaps a better idea for COW filesystems is to store the index in a separate 
file, and/or rewrite the last 1 MB block (or part thereof) of the data file 
every time data is appended? That way the data file will use 1MB extents and 
hopefully avoid ridiculous amounts of metadata. 


Paul.


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Peter Grandi

> [ ... ] these extents are all over the place, they're not
> contiguous at all. 4K here, 4K there, 4K over there, back to
> 4K here next to this one, 4K over there...12K over there, 500K
> unwritten, 4K over there. This seems not so consequential on
> SSD, [ ... ]

Indeed there were recent reports that the 'ssd' mount option
causes that, IIRC by Hans van Kranenburg (around 2017-04-17),
who also noticed issues with the wandering trees in certain
situations (around 2017-04-08).


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Adam Borowski
On Fri, Apr 28, 2017 at 11:41:00AM -0600, Chris Murphy wrote:
> The same behavior happens with NTFS in qcow2 files. They quickly end
> up with 100,000+ extents unless set nocow. It's like the worst case
> scenario.

You should never use qcow2 on btrfs, especially if snapshots are involved.
They both do roughly the same thing, and layering fragmentation upon
fragmentation ɪꜱ ɴᴏᴛ ᴘʀᴇᴛᴛʏ.  Layering syncs is bad, too.

Instead, you can use raw files (preferably sparse unless there's both nocow
and no snapshots).  Btrfs does natively everything you'd gain from qcow2,
and does it better: you can delete the master of a cloned image, deduplicate
them, deduplicate two unrelated images; you can turn on compression, etc.

Once you pay the btrfs performance penalty, you may as well actually use its
features, which make qcow2 redundant and harmful.
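
(A minimal sketch of that, with a made-up path and size; note that chattr +C
only takes effect while the file is still empty:)

  # create an empty file, mark it nocow, then size it (sparse)
  touch /var/lib/libvirt/images/guest.img
  chattr +C /var/lib/libvirt/images/guest.img
  truncate -s 40G /var/lib/libvirt/images/guest.img
  # or fallocate -l 40G instead of truncate if you don't want it sparse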


Meow!
-- 
Don't be racist.  White, amber or black, all beers should be judged based
solely on their merits.  Heck, even if occasionally a cider applies for a
beer's job, why not?
On the other hand, corpo lager is not a race.


Re: File system corruption, btrfsck abort

2017-04-28 Thread Chris Murphy
On Fri, Apr 28, 2017 at 3:10 AM, Christophe de Dinechin
 wrote:

>
> QEMU qcow2. Host is BTRFS. Guests are BTRFS, LVM, Ext4, NTFS (winXP and
> win10) and HFS+ (macOS Sierra). I think I had 7 VMs installed, planned to
> restore another 8 from backups before my previous disk crash. I usually have
> at least 2 running, often as many as 5 (fedora, ubuntu, winXP, win10, macOS)
> to cover my software testing needs.

That is quite a torture test for any file system but more so Btrfs.
How are the qcow2 files being created? What's the qemu-img create
command? In particular I'm wondering if these qcow2 files are cow or
nocow; if they're compressed by Btrfs; and how many fragments they
have with filefrag.
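
(A quick, rough way to gather that, assuming the images live somewhere like
/var/lib/libvirt/images:)

  qemu-img info /var/lib/libvirt/images/guest.qcow2   # format, cluster size, etc.
  lsattr /var/lib/libvirt/images/*.qcow2              # a 'C' flag means nocow
  filefrag /var/lib/libvirt/images/*.qcow2            # extent count per file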

When I was using qcow2 for backing I used

qemu-img create -f qcow2 -o preallocation=falloc,nocow=on,lazy_refcounts=on

But then later I started using fallocated raw files with chattr +C
applied. And these days I'm just using LVM thin volumes. The journaled
file systems in a guest cause a ton of backing file fragmentation
unless nocow is used on Btrfs. I've seen hundreds of thousands of
extents for a single backing file for a Windows guest.

-- 
Chris Murphy


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Chris Murphy
On Fri, Apr 28, 2017 at 1:39 PM, Peter Grandi  
wrote:


> In a particularly demented setup I had to decatastrophize with
> great pain a Zimbra QCOW2 disk image (XFS on NFS on XFS on
> RAID6) containing an ever-growing Maildir email archive that had
> ended up with over a million widely scattered microextents:
>
>   http://www.sabi.co.uk/blog/1101Jan.html?110116#110116

Related Btrfs thread "File system corruption, btrfsck abort" involves
5 concurrently used VMs with guests using ext4, NTFS, HFS+, Btrfs, LVM,
pointing to qcow2 files on Btrfs for backing. And it's resulting in
problems...


-- 
Chris Murphy


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Chris Murphy
On Fri, Apr 28, 2017 at 11:53 AM, Peter Grandi  
wrote:

> Well, depends, but probably the single file: it is more likely
> that the 20,000 fragments will actually be contiguous, and that
> there will be less metadata IO than for 40,000 separate journal
> files.

You can see from the examples I posted that these extents are all over
the place, they're not contiguous at all. 4K here, 4K there, 4K over
there, back to 4K here next to this one, 4K over there...12K over
there, 500K unwritten, 4K over there. This seems not so consequential
on SSD, at least if it impacts performance it's not so bad I care. On
a hard drive, it's totally noticeable. And that's why journald went
with chattr +C by default a few versions ago when on Btrfs. And it
does help *if* the parent is never snapshot, which on a snapshotting
file system can't really be guaranteed. Inadvertent snapshotting could
be inhibited by putting the journals in their own subvolume though.

Anyway, it's difficult to consider Btrfs a general purpose file system
if general purpose workloads like journal files are causing a 
problem like wandering trees. Hence the subject of what to do about it, 
and that may mean short term and long term. I can't speak for systemd
developers but if there's a different way to write to the journals
that'd be better for Btrfs and no worse for ext4 and XFS, it might be
considered.


-- 
Chris Murphy


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Chris Murphy
On Fri, Apr 28, 2017 at 11:46 AM, Peter Grandi  
wrote:

> So there are three layers of silliness here:
>
> * Writing large files slowly to a COW filesystem and
>   snapshotting it frequently.
> * A filesystem that does delayed allocation instead of
>   allocate-ahead, and does not have psychic code.
> * Working around that by using no-COW and preallocation
>   with a fixed size regardless of snapshot frequency.
>
> The primary problem here is that there is no way to have slow
> small writes and frequent snapshots without generating small
> extents: if a file is written at a rate of 1MiB/hour and gets
> snapshot every hour the extent size will not be larger than 1MiB
> *obviously*.

Sure.

But in my example, no snapshotting, and +C is inhibited (i.e. I set
/etc/tmpfiles.d/journal-nocow.conf which stops systemd from the new
behavior of setting +C on journals). That's resulting in a 19000+
fragment journal file. In fact snapshotting does not make it worse
though. If it's nocow, then yes snapshotting makes it worse than
nocow, but no worse than cow.
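
(For anyone wanting to repeat this, the override amounts to a same-named file
in /etc shadowing the stock tmpfiles entry, assuming the distro ships that
entry under /usr/lib/tmpfiles.d; a sketch:)

  # mask the shipped tmpfiles.d entry that sets +C on the journal directory
  echo '# local override: keep journals COW' > /etc/tmpfiles.d/journal-nocow.conf
  lsattr -d /var/log/journal    # check whether the directory still carries 'C'
  chattr -C /var/log/journal    # clearing it only affects files created later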

What I'm trying to get at is that default Btrfs behavior and (previous)
default journald behavior have a misalignment resulting in a lot of
fragmentation; is there a better way around this than merely setting
journals to nocow *and* making sure they stay nocow by preventing
snapshotting? If there's nothing better to be done, then I'll just
re-recommend to systemd folks that the directory containing journals
should be made a subvolume to isolate it from inadvertent
snapshotting. If people want to snapshot it anyway there's nothing we
can do about that.



> Filesystem-level snapshots are not designed to snapshot slowly
> growing files, but to snapshots changing collections of
> files. There are harsh tradeoffs involved. Application-level
> snapshots (also known as log rotations :->) are needed for
> special cases and finer grained policies.
>
> The secondary problem is that a fixed preallocate of 8MiB is
> good only if in between snapshots the file grows by a little
> less than 8MiB or by substantially more.

Just to be clear, none of my own examples involve journals being
snapshot. There are no shared extents for any of those files.

-- 
Chris Murphy


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Peter Grandi
>> The gotcha though is there's a pile of data in the journal
>> that would never make it to rsyslogd. If you use journalctl
>> -o verbose you can see some of this.

> You can send *all the info* to rsyslogd via imjournal
> http://www.rsyslog.com/doc/v8-stable/configuration/modules/imjournal.html
> In my setup all the data are stored in json format in the
> /var/log/cee.log file:
> $ head /var/log/cee.log 2017-04-28T18:41:41.931273+02:00
> venice liblogging-stdlog: @cee: { "PRIORITY": "6", "_BOOT_ID":
> "a86d74bab91f44dc974c76aceb97141f", "_MACHINE_ID": [ ... ]

Ahh the horror the horror, I will never be able to unsee
that. The UNIX way of doing things is truly dead.

>> The same behavior happens with NTFS in qcow2 files. They
>> quickly end up with 100,000+ extents unless set nocow.
>> It's like the worst case scenario.

In a particularly demented setup I had to decatastrophize with
great pain a Zimbra QCOW2 disk image (XFS on NFS on XFS on
RAID6) containing an ever-growing Maildir email archive that had
ended up with over a million widely scattered microextents:

  http://www.sabi.co.uk/blog/1101Jan.html?110116#110116


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Goffredo Baroncelli
On 2017-04-28 19:41, Chris Murphy wrote:
> On Fri, Apr 28, 2017 at 11:05 AM, Goffredo Baroncelli
>  wrote:
> 
>> In the past I faced the same problems; I collected some data here 
>> http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html.
>> Unfortunately the journald files are very bad, because first the data is 
>> written (appended), then the index fields are updated. Unfortunately these 
>> indexes are located right after the last write, so fragmentation is unavoidable.
>>
>> After some thinking I adopted a different strategy: I use journald as 
>> collector, then I forward all the logs to rsyslogd, which uses a "log append" 
>> format. Journald never writes on the root filesystem, only in tmp.
> 
> The gotcha though is there's a pile of data in the journal that would
> never make it to rsyslogd. If you use journalctl -o verbose you can
> see some of this. 

You can send *all the info* to rsyslogd via imjournal

http://www.rsyslog.com/doc/v8-stable/configuration/modules/imjournal.html

In my setup all the data are stored in json format in the /var/log/cee.log file:


$ head  /var/log/cee.log
2017-04-28T18:41:41.931273+02:00 venice liblogging-stdlog: @cee: { "PRIORITY": 
"6", "_BOOT_ID": "a86d74bab91f44dc974c76aceb97141f", "_MACHINE_ID": 
"e84907d099904117b355a99c98378dca", "_HOSTNAME": "venice.bhome", 
"_SYSTEMD_SLICE": "system.slice", "_UID": "0", "_GID": "0", "_CAP_EFFECTIVE": 
"3f", "_TRANSPORT": "syslog", "SYSLOG_FACILITY": "23", 
"SYSLOG_IDENTIFIER": "liblogging-stdlog", "MESSAGE": " [origin 
software=\"rsyslogd\" swVersion=\"8.24.0\" x-pid=\"737\" 
x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed", "_PID": "737", 
"_COMM": "rsyslogd", "_EXE": "\/usr\/sbin\/rsyslogd", "_CMDLINE": 
"\/usr\/sbin\/rsyslogd -n", "_SYSTEMD_CGROUP": 
"\/system.slice\/rsyslog.service", "_SYSTEMD_UNIT": "rsyslog.service", 
"_SYSTEMD_INVOCATION_ID": "18b9a8b27f9143728adef972db7b394c", 
"_SOURCE_REALTIME_TIMESTAMP": "1493397701931255", "msg": "[origin 
software=\"rsyslogd\" swVersion=\"8.24.0\" x-pid=\"737\" 
x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed" }
2017-04-28T18:41:42.058549+02:00 venice liblogging-stdlog: @cee: { "PRIORITY": 
"6", "_BOOT_ID": "a86d74bab91f44dc974c76aceb97141f", "_MACHINE_ID": 
"e84907d099904117b355a99c98378dca", "_HOSTNAME": "venice.bhome", 
"_SYSTEMD_SLICE": "system.slice", "_UID": "0", "_GID": "0", "_CAP_EFFECTIVE": 
"3f", "_TRANSPORT": "syslog", "SYSLOG_FACILITY": "23", 
"SYSLOG_IDENTIFIER": "liblogging-stdlog", "MESSAGE": " [origin 
software=\"rsyslogd\" swVersion=\"8.24.0\" x-pid=\"737\" 
x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed", "_PID": "737", 
"_COMM": "rsyslogd", "_EXE": "\/usr\/sbin\/rsyslogd", "_CMDLINE": 
"\/usr\/sbin\/rsyslogd -n", "_SYSTEMD_CGROUP": 
"\/system.slice\/rsyslog.service", "_SYSTEMD_UNIT": "rsyslog.service", 
"_SYSTEMD_INVOCATION_ID": "18b9a8b27f9143728adef972db7b394c", 
"_SOURCE_REALTIME_TIMESTAMP": "1493397702058441", "msg": "[origin 
software=\"rsyslogd\" swVersion=\"8.24.0\" x-pid=\"737\" 
x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed" }
[]

All the info is stored with the same keys/values as journald uses.

I developed a utility (called clp), which allows querying the log by key, 
filtering by boot number, or by date.

For example, to show all the log entries related to rsyslog:

$ clp log -t full-details _SYSTEMD_CGROUP=/system.slice/rsyslog.service 

2017-04-21 19:12:29.579748 MESSAGE= [origin software="rsyslogd" 
swVersion="8.24.0" x-pid="804" x-info="http://www.rsyslog.com;] rsyslogd was 
HUPed
   PRIORITY=6
   SYSLOG_FACILITY=23
   SYSLOG_IDENTIFIER=liblogging-stdlog
   _BOOT_ID=d77198380c9344248e01166fbd8d60df
   _CAP_EFFECTIVE=3f
   _CMDLINE=/usr/sbin/rsyslogd -n
   _COMM=rsyslogd
   _EXE=/usr/sbin/rsyslogd
   _GID=0
   _HOSTNAME=venice.bhome
   _LOGFILEINITLINE=2017-04-21T19:12:29.579768+02:00 
venice liblogging-stdlog: 
   _LOGFILELINENUMBER=1
   _LOGFILENAME=/var/log/cee.log.7.gz
   _LOGFILETIMESTAMP=1492794749579768
   _MACHINE_ID=e84907d099904117b355a99c98378dca
   _PID=804
   _SOURCE_REALTIME_TIMESTAMP=1492794749579748
   _SYSTEMD_CGROUP=/system.slice/rsyslog.service
   _SYSTEMD_INVOCATION_ID=8f9cb6c871be4158a3ccb374f4323027
   _SYSTEMD_SLICE=system.slice
   _SYSTEMD_UNIT=rsyslog.service
   _TRANSPORT=syslog
   _UID=0
   msg=[origin software="rsyslogd" swVersion="8.24.0" 
x-pid="804" 

Re: [PATCH v2] fstests: regression test for btrfs buffered read's repair

2017-04-28 Thread Liu Bo
On Fri, Apr 28, 2017 at 10:52:12AM +0100, Filipe Manana wrote:
> On Wed, Apr 26, 2017 at 7:09 PM, Liu Bo  wrote:
> > This case tests whether buffered read can repair the bad copy if we
> > have a good copy.
> >
> > Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed
> > and drop checks") introduced the regression.
> >
> > The upstream fix is
> > Btrfs: bring back repair during read
> 
> Same issue as reported for the other new test (btrfs/140), the test
> fails on a kernel with the mentioned patch. Seems like it does wrong
> assumptions somewhere (I haven't investigated, just run the test):

Thanks for running the test. What I've missed is that the chunk start offset
depends on disk size, so I'll add a filesystem size limit to mkfs.

Thanks,

-liubo



[PATCH v3] fstests: regression test for btrfs buffered read's repair

2017-04-28 Thread Liu Bo
This case tests whether buffered read can repair the bad copy if we
have a good copy.

Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed
and drop checks") introduced the regression.

The upstream fix is
Btrfs: bring back repair during read

Signed-off-by: Liu Bo 
---
v2: - Add regression commit and the fix to the description
- Use btrfs inspect-internal dump-tree to get rid of the dependence on
  btrfs-map-logical
- Add comments in several places
- Fix typo, dio->buffered.

v3: - Add 'mkfs -b 1G' to limit filesystem size to 2G in raid1 profile so that
  we get a consistent output.

 tests/btrfs/141 | 169 
 tests/btrfs/141.out |  39 
 tests/btrfs/group   |   1 +
 3 files changed, 209 insertions(+)
 create mode 100755 tests/btrfs/141
 create mode 100644 tests/btrfs/141.out

diff --git a/tests/btrfs/141 b/tests/btrfs/141
new file mode 100755
index 000..c4e08ed
--- /dev/null
+++ b/tests/btrfs/141
@@ -0,0 +1,169 @@
+#! /bin/bash
+# FS QA Test 141
+#
+# Regression test for btrfs buffered read's repair during read.
+#
+# Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed
+# and drop checks") introduced the regression.
+#
+# The upstream fix is
+#  Btrfs: bring back repair during read
+#
+#---
+# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_dev_pool 2
+
+_require_btrfs_command inspect-internal dump-tree
+_require_command "$FILEFRAG_PROG" filefrag
+
+
+# helper to convert 'file offset' to btrfs logical offset
+FILEFRAG_FILTER='
+   if (/blocks? of (\d+) bytes/) {
+   $blocksize = $1;
+   next
+   }
+   ($ext, $logical, $physical, $length) =
+   (/^\s*(\d+):\s+(\d+)..\s+\d+:\s+(\d+)..\s+\d+:\s+(\d+):/)
+   or next;
+   ($flags) = /.*:\s*(\S*)$/;
+   print $physical * $blocksize, "#",
+ $length * $blocksize, "#",
+ $logical * $blocksize, "#",
+ $flags, " "'
+
+# this makes filefrag output script readable by using a perl helper.
+# output is one extent per line, with three numbers separated by '#'
+# the numbers are: physical, length, logical (all in bytes)
+# sample output: "1234#10#5678" -> physical 1234, length 10, logical 5678
+_filter_extents()
+{
+   tee -a $seqres.full | $PERL_PROG -ne "$FILEFRAG_FILTER"
+}
+
+_check_file_extents()
+{
+   cmd="filefrag -v $1"
+   echo "# $cmd" >> $seqres.full
+   out=`$cmd | _filter_extents`
+   if [ -z "$out" ]; then
+   return 1
+   fi
+   echo "after filter: $out" >> $seqres.full
+   echo $out
+   return 0
+}
+
+_check_repair()
+{
+   filter=${1:-cat}
+   dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac | $filter | grep -q -e "csum failed"
+   if [ $? -eq 0 ]; then
+   echo 1
+   else
+   echo 0
+   fi
+}
+
+_get_physical()
+{
+# $1 is logical address
+# print chunk tree and find devid 2 which is $SCRATCH_DEV
+$BTRFS_UTIL_PROG inspect-internal dump-tree -t 3 $SCRATCH_DEV | grep $1 -A 6 | awk '($1 ~ /stripe/ && $3 ~ /devid/ && $4 ~ /1/) { print $6 }'
+}
+
+_scratch_dev_pool_get 2
+# step 1, create a raid1 btrfs which contains one 128k file.
+echo "step 1..mkfs.btrfs" >>$seqres.full
+
+mkfs_opts="-d raid1 -b 1G"
+_scratch_pool_mkfs $mkfs_opts >>$seqres.full 2>&1
+
+# -o nospace_cache makes sure data is written to the start position of the data
+# chunk
+_scratch_mount -o nospace_cache
+
+$XFS_IO_PROG -f -d -c "pwrite -S 0xaa -b 128K 0 128K" "$SCRATCH_MNT/foobar" | _filter_xfs_io
+
+sync
+
+# step 2, corrupt 

[PATCH v2] fstests: regression test for nocsum dio read's repair

2017-04-28 Thread Liu Bo
Commit 2dabb3248453 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
introduced this regression.  It'd cause a 'Segmentation fault' error.

The upstream fix is
Btrfs: fix segment fault when doing dio read

Signed-off-by: Liu Bo 
---
v2: - Add 'mkfs -b 1G' to limit filesystem size to 2G in raid1 profile so that
  we get a consistent output.

 tests/btrfs/142 | 189 
 tests/btrfs/142.out |  39 +++
 tests/btrfs/group   |   1 +
 3 files changed, 229 insertions(+)
 create mode 100755 tests/btrfs/142
 create mode 100644 tests/btrfs/142.out

diff --git a/tests/btrfs/142 b/tests/btrfs/142
new file mode 100755
index 000..94566de
--- /dev/null
+++ b/tests/btrfs/142
@@ -0,0 +1,189 @@
+#! /bin/bash
+# FS QA Test 142
+#
+# Regression test for btrfs DIO read's repair during read without checksum.
+#
+# Commit 2dabb3248453 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
+# introduced this regression.  It'd cause a 'Segmentation fault' error.
+#
+# The upstream fix is
+#  Btrfs: fix segment fault when doing dio read
+#
+#---
+# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_dev_pool 2
+
+_require_btrfs_command inspect-internal dump-tree
+_require_command "$FILEFRAG_PROG" filefrag
+
+# helper to convert 'file offset' to btrfs logical offset
+FILEFRAG_FILTER='
+   if (/blocks? of (\d+) bytes/) {
+   $blocksize = $1;
+   next
+   }
+   ($ext, $logical, $physical, $length) =
+   (/^\s*(\d+):\s+(\d+)..\s+\d+:\s+(\d+)..\s+\d+:\s+(\d+):/)
+   or next;
+   ($flags) = /.*:\s*(\S*)$/;
+   print $physical * $blocksize, "#",
+ $length * $blocksize, "#",
+ $logical * $blocksize, "#",
+ $flags, " "'
+
+# this makes filefrag output script readable by using a perl helper.
+# output is one extent per line, with three numbers separated by '#'
+# the numbers are: physical, length, logical (all in bytes)
+# sample output: "1234#10#5678" -> physical 1234, length 10, logical 5678
+_filter_extents()
+{
+   tee -a $seqres.full | $PERL_PROG -ne "$FILEFRAG_FILTER"
+}
+
+_check_file_extents()
+{
+   cmd="filefrag -v $1"
+   echo "# $cmd" >> $seqres.full
+   out=`$cmd | _filter_extents`
+   if [ -z "$out" ]; then
+   return 1
+   fi
+   echo "after filter: $out" >> $seqres.full
+   echo $out
+   return 0
+}
+
+_check_repair()
+{
+   filter=${1:-cat}
+   dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac | $filter | grep -q -e "direct IO failed"
+   if [ $? -eq 0 ]; then
+   echo 1
+   else
+   echo 0
+   fi
+}
+
+_get_physical()
+{
+# $1 is logical address
+# print chunk tree and find devid 2 which is $SCRATCH_DEV
+$BTRFS_UTIL_PROG inspect-internal dump-tree -t 3 $SCRATCH_DEV | grep $1 -A 6 | awk '($1 ~ /stripe/ && $3 ~ /devid/ && $4 ~ /1/) { print $6 }'
+}
+
+
+SYSFS_BDEV=`_sysfs_dev $SCRATCH_DEV`
+
+start_fail()
+{
+   echo 100 > $DEBUGFS_MNT/fail_make_request/probability
+   echo 1 > $DEBUGFS_MNT/fail_make_request/times
+   echo 0 > $DEBUGFS_MNT/fail_make_request/verbose
+   echo 1 > $SYSFS_BDEV/make-it-fail
+}
+
+stop_fail()
+{
+   echo 0 > $DEBUGFS_MNT/fail_make_request/probability
+   echo 0 > $DEBUGFS_MNT/fail_make_request/times
+   echo 0 > $SYSFS_BDEV/make-it-fail
+}
+
+_scratch_dev_pool_get 2
+# step 1, create a raid1 btrfs which contains one 128k file.
+echo "step 1..mkfs.btrfs" >>$seqres.full
+
+mkfs_opts="-d raid1 -b 1G"
+_scratch_pool_mkfs $mkfs_opts >>$seqres.full 2>&1
+
+# -o 

[PATCH v2] fstests: regression test for nocsum buffered read's repair

2017-04-28 Thread Liu Bo
This is to test whether buffered read retry-repair code is able to work in
raid1 case as expected.

Please note that without checksum, btrfs doesn't know if the data used to
repair is correct, so repair is more of a resync which makes sure that both
copies have the same content.

Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed and drop
checks") introduced the regression.

The upstream fix is
Btrfs: bring back repair during read

Signed-off-by: Liu Bo 
---
v2: - Add 'mkfs -b 1G' to limit filesystem size to 2G in raid1 profile so that
  we get a consistent output.

 tests/btrfs/143 | 197 
 tests/btrfs/143.out |  39 +++
 tests/btrfs/group   |   1 +
 3 files changed, 237 insertions(+)
 create mode 100755 tests/btrfs/143
 create mode 100644 tests/btrfs/143.out

diff --git a/tests/btrfs/143 b/tests/btrfs/143
new file mode 100755
index 000..70f3f9f
--- /dev/null
+++ b/tests/btrfs/143
@@ -0,0 +1,197 @@
+#! /bin/bash
+# FS QA Test 143
+#
+# Regression test for btrfs buffered read's repair during read without checksum.
+#
+# This is to test whether buffered read retry-repair code is able to work in
+# raid1 case as expected.
+#
+# Please note that without checksum, btrfs doesn't know if the data used to
+# repair is correct, so repair is more of a resync which makes sure that both
+# copies have the same content.
+#
+# Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed and drop
+# checks") introduced the regression.
+#
+# The upstream fix is
+#Btrfs: bring back repair during read
+#
+#---
+# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_dev_pool 2
+
+_require_btrfs_command inspect-internal dump-tree
+_require_command "$FILEFRAG_PROG" filefrag
+
+# helper to convert 'file offset' to btrfs logical offset
+FILEFRAG_FILTER='
+   if (/blocks? of (\d+) bytes/) {
+   $blocksize = $1;
+   next
+   }
+   ($ext, $logical, $physical, $length) =
+   (/^\s*(\d+):\s+(\d+)..\s+\d+:\s+(\d+)..\s+\d+:\s+(\d+):/)
+   or next;
+   ($flags) = /.*:\s*(\S*)$/;
+   print $physical * $blocksize, "#",
+ $length * $blocksize, "#",
+ $logical * $blocksize, "#",
+ $flags, " "'
+
+# this makes filefrag output script readable by using a perl helper.
+# output is one extent per line, with three numbers separated by '#'
+# the numbers are: physical, length, logical (all in bytes)
+# sample output: "1234#10#5678" -> physical 1234, length 10, logical 5678
+_filter_extents()
+{
+   tee -a $seqres.full | $PERL_PROG -ne "$FILEFRAG_FILTER"
+}
+
+_check_file_extents()
+{
+   cmd="filefrag -v $1"
+   echo "# $cmd" >> $seqres.full
+   out=`$cmd | _filter_extents`
+   if [ -z "$out" ]; then
+   return 1
+   fi
+   echo "after filter: $out" >> $seqres.full
+   echo $out
+   return 0
+}
+
+_check_repair()
+{
+   filter=${1:-cat}
+   dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac | $filter | grep -q -e "read error corrected"
+   if [ $? -eq 0 ]; then
+   echo 1
+   else
+   echo 0
+   fi
+}
+
+_get_physical()
+{
+# $1 is logical address
+# print chunk tree and find devid 2 which is $SCRATCH_DEV
+$BTRFS_UTIL_PROG inspect-internal dump-tree -t 3 $SCRATCH_DEV | grep $1 -A 6 | awk '($1 ~ /stripe/ && $3 ~ /devid/ && $4 ~ /1/) { print $6 }'
+}
+
+SYSFS_BDEV=`_sysfs_dev $SCRATCH_DEV`
+
+start_fail()
+{
+   echo 100 > $DEBUGFS_MNT/fail_make_request/probability
+   echo 4 > 

[PATCH v3] fstests: regression test for btrfs dio read repair

2017-04-28 Thread Liu Bo
This case tests whether dio read can repair the bad copy if we have
a good copy.

Commit 2dabb3248453 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
introduced the regression.

The upstream fix is
Btrfs: fix invalid dereference in btrfs_retry_endio

Signed-off-by: Liu Bo 
---
v2: - Add regression commit and the fix to the description
- Use btrfs inspect-internal dump-tree to get rid of the dependence on
  btrfs-map-logical
- Add comments in several places

v3: - Add 'mkfs -b 1G' to limit filesystem size to 2G in raid1 profile so that
  we get a consistent output.

 tests/btrfs/140 | 167 
 tests/btrfs/140.out |  39 
 tests/btrfs/group   |   1 +
 3 files changed, 207 insertions(+)
 create mode 100755 tests/btrfs/140
 create mode 100644 tests/btrfs/140.out

diff --git a/tests/btrfs/140 b/tests/btrfs/140
new file mode 100755
index 000..dcd8807
--- /dev/null
+++ b/tests/btrfs/140
@@ -0,0 +1,167 @@
+#! /bin/bash
+# FS QA Test 140
+#
+# Regression test for btrfs DIO read's repair during read.
+#
+# Commit 2dabb3248453 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
+# introduced the regression.
+# The upstream fix is
+#  Btrfs: fix invalid dereference in btrfs_retry_endio
+#
+#---
+# Copyright (c) 2017 Liu Bo.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch_dev_pool 2
+
+_require_btrfs_command inspect-internal dump-tree
+_require_command "$FILEFRAG_PROG" filefrag
+_require_odirect
+
+# helper to convert 'file offset' to btrfs logical offset
+FILEFRAG_FILTER='
+   if (/blocks? of (\d+) bytes/) {
+   $blocksize = $1;
+   next
+   }
+   ($ext, $logical, $physical, $length) =
+   (/^\s*(\d+):\s+(\d+)..\s+\d+:\s+(\d+)..\s+\d+:\s+(\d+):/)
+   or next;
+   ($flags) = /.*:\s*(\S*)$/;
+   print $physical * $blocksize, "#",
+ $length * $blocksize, "#",
+ $logical * $blocksize, "#",
+ $flags, " "'
+
+# this makes filefrag output script readable by using a perl helper.
+# output is one extent per line, with three numbers separated by '#'
+# the numbers are: physical, length, logical (all in bytes)
+# sample output: "1234#10#5678" -> physical 1234, length 10, logical 5678
+_filter_extents()
+{
+   tee -a $seqres.full | $PERL_PROG -ne "$FILEFRAG_FILTER"
+}
+
+_check_file_extents()
+{
+   cmd="filefrag -v $1"
+   echo "# $cmd" >> $seqres.full
+   out=`$cmd | _filter_extents`
+   if [ -z "$out" ]; then
+   return 1
+   fi
+   echo "after filter: $out" >> $seqres.full
+   echo $out
+   return 0
+}
+
+_check_repair()
+{
+   filter=${1:-cat}
+   dmesg | tac | sed -ne "0,\#run fstests $seqnum at $date_time#p" | tac | $filter | grep -q -e "csum failed"
+   if [ $? -eq 0 ]; then
+   echo 1
+   else
+   echo 0
+   fi
+}
+
+_get_physical()
+{
+   # $1 is logical address
+   # print chunk tree and find devid 2 which is $SCRATCH_DEV
+   $BTRFS_UTIL_PROG inspect-internal dump-tree -t 3 $SCRATCH_DEV | grep $1 -A 6 | awk '($1 ~ /stripe/ && $3 ~ /devid/ && $4 ~ /1/) { print $6 }'
+}
+
+_scratch_dev_pool_get 2
+# step 1, create a raid1 btrfs which contains one 128k file.
+echo "step 1..mkfs.btrfs" >>$seqres.full
+
+mkfs_opts="-d raid1 -b 1G"
+_scratch_pool_mkfs $mkfs_opts >>$seqres.full 2>&1
+
+# -o nospace_cache makes sure data is written to the start position of the data
+# chunk
+_scratch_mount -o nospace_cache
+
+$XFS_IO_PROG -f -d -c "pwrite -S 0xaa -b 128K 0 128K" "$SCRATCH_MNT/foobar" | _filter_xfs_io
+
+sync
+
+# step 2, corrupt the first 64k of one copy 

Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Peter Grandi
> [ ... ] And that makes me wonder whether metadata
> fragmentation is happening as a result. But in any case,
> there's a lot of metadata being written for each journal
> update compared to what's being added to the journal file. [
> ... ]

That's the "wandering trees" problem in COW filesystems, and
manifestations of it in Btrfs have also been reported before.
If there is a workload that triggers a lot of "wandering trees"
updates, then a filesystem that has "wandering trees" perhaps
should not be used :-).

> [ ... ] worse, a single file with 20,000 fragments; or 40,000
> separate journal files? *shrug* [ ... ]

Well, depends, but probably the single file: it is more likely
that the 20,000 fragments will actually be contiguous, and that
there will be less metadata IO than for 40,000 separate journal
files.

The deeper "strategic" issue is that storage systems and
filesystems in particular have very anisotropic performance
envelopes, and mismatches between the envelopes of application
and filesystem can be very expensive:
  http://www.sabi.co.uk/blog/15-two.html?151023#151023


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Peter Grandi
> Old news is that systemd-journald journals end up pretty
> heavily fragmented on Btrfs due to COW.

This has been discussed before in detail indeeed here, but also
here: http://www.sabi.co.uk/blog/15-one.html?150203#150203

> While journald uses chattr +C on journal files now, COW still
> happens if the subvolume the journal is in gets snapshot. e.g.
> a week old system.journal has 19000+ extents. [ ... ]  It
> appears to me (see below URLs pointing to example journals)
> that journald fallocates in 8MiB increments but then ends up
> doing 4KiB writes; [ ... ]

So there are three layers of silliness here:

* Writing large files slowly to a COW filesystem and
  snapshotting it frequently.
* A filesystem that does delayed allocation instead of
  allocate-ahead, and does not have psychic code.
* Working around that by using no-COW and preallocation
  with a fixed size regardless of snapshot frequency.

The primary problem here is that there is no way to have slow
small writes and frequent snapshots without generating small
extents: if a file is written at a rate of 1MiB/hour and gets
snapshot every hour the extent size will not be larger than 1MiB
*obviously*.

Filesystem-level snapshots are not designed to snapshot slowly
growing files, but to snapshots changing collections of
files. There are harsh tradeoffs involved. Application-level
snapshots (also known as log rotations :->) are needed for
special cases and finer grained policies.

The secondary problem is that a fixed preallocate of 8MiB is
good only if in between snapshots the file grows by a little
less than 8MiB or by substantially more.


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Chris Murphy
On Fri, Apr 28, 2017 at 11:05 AM, Goffredo Baroncelli
 wrote:

> In the past I faced the same problems; I collected some data here 
> http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html.
> Unfortunately the journald files are very bad, because first the data is 
> written (appended), then the index fields are updated. Unfortunately these 
> indexes are located right after the last write, so fragmentation is unavoidable.
>
> After some thinking I adopted a different strategy: I use journald as 
> collector, then I forward all the logs to rsyslogd, which uses a "log append" 
> format. Journald never writes on the root filesystem, only in tmp.

The gotcha though is there's a pile of data in the journal that would
never make it to rsyslogd. If you use journalctl -o verbose you can
see some of this. There's a bunch of extra metadata in the journal.
And then also filtering based on that metadata is useful rather than
being limited to grep on a syslog file. Which, you know, it's fine for
many use cases. I guess I'm just interested in whether there's an
enhancement that can be done to make journals more compatible with
Btrfs or vice versa. It's not a huge problem anyway.


>
> The thing became interesting when I discovered that searching in a 
> rsyslog file is faster than journalctl (on rotational media). Unfortunately 
> I don't have any data to support this.


Yes on drives all of these scattered extents cause a lot of head
seeking. And I also suspect it's a lot of metadata spread out
everywhere too, to account for all of these extents. That's why they
moved to chattr +C to make them nocow. An idea I had on systemd list
was to automatically make the journal directory a Btrfs subvolume,
similar to how systemd already creates a /var/lib/machines subvolume
for nspawn containers. This prevents the journals from being caught up
in a snapshot of the parent subvolume that typically contains the
journals (root fs). There's no practical use I can think of for
snapshotting logs. You'd really want the logs to always be linear,
contiguous, and never get rolled back. Even if something in the system
does get rolled back, you'd want the logs to show that and continue
on, rather than being rolled back themselves.

So the super simple option would be continue with +C on journals, and
then a separate subvolume to prevent COW from ever happening
inadvertently.
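
(Done by hand today, the conversion is roughly this, assuming /var/log/journal
already exists and journald can be stopped briefly:)

  systemctl stop systemd-journald
  mv /var/log/journal /var/log/journal.old
  btrfs subvolume create /var/log/journal
  cp -a /var/log/journal.old/. /var/log/journal/
  rm -rf /var/log/journal.old
  systemctl start systemd-journald
  # from now on, a snapshot of the parent subvolume no longer includes the journals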

The same behavior happens with NTFS in qcow2 files. They quickly end
up with 100,000+ extents unless set nocow. It's like the worst case
scenario.

-- 
Chris Murphy


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Goffredo Baroncelli
On 2017-04-28 18:16, Chris Murphy wrote:
> Old news is that systemd-journald journals end up pretty heavily
> fragmented on Btrfs due to COW. While journald uses chattr +C on
> journal files now, COW still happens if the subvolume the journal is
> in gets snapshot. e.g. a week old system.journal has 19000+ extents.
> 
> The news is I started a systemd thread.
> 
> This is the start:
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html
> 
> Where it gets interesting, two messages by Andrei Borzenkov: He
> evaluates existing code and does some tests on ext4 and XFS.
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038728.html
> 
> And then the question.
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038735.html
> 
> Given what journald is doing, is what Btrfs is doing expected? Is
> there something it could do better to be more like ext4 and XFS in the
> same situation? Or is it out of scope for Btrfs?

In the past I faced the same problems; I collected some data here 
http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html.
Unfortunately the journald files are very bad, because first the data is 
written (appended), then the index fields are updated. Unfortunately these 
indexes are located right after the last write, so fragmentation is unavoidable.

After some thinking I adopted a different strategy: I use journald as 
collector, then I forward all the logs to rsyslogd, which uses a "log append" 
format. Journald never writes on the root filesystem, only in tmp.

The thing became interesting when I discovered that searching in a rsyslog 
file is faster than journalctl (on rotational media). Unfortunately I don't 
have any data to support this.
However if someone is interested I can share more details.

BR
G.Baroncelli



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Chris Murphy
Old news is that systemd-journald journals end up pretty heavily
fragmented on Btrfs due to COW. While journald uses chattr +C on
journal files now, COW still happens if the subvolume the journal is
in gets snapshot. e.g. a week old system.journal has 19000+ extents.

The news is I started a systemd thread.

This is the start:
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html

Where it gets interesting, two messages by Andrei Borzenkov: He
evaluates existing code and does some tests on ext4 and XFS.
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038728.html

And then the question.
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038735.html

Given what journald is doing, is what Btrfs is doing expected? Is
there something it could do better to be more like ext4 and XFS in the
same situation? Or is it out of scope for Btrfs?

It appears to me (see below URLs pointing to example journals) that
journald fallocates in 8MiB increments but then ends up doing 4KiB
writes; there's a lot of these unused (unwritten) 8MiB extents that
appear in both filefrag and btrfs-debug -f outputs.
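
(The pattern is easy to reproduce by hand with xfs_io; file name and sizes
below are arbitrary. On a COW file each synced 4KiB append tends to become its
own small extent inside the fallocated range:)

  xfs_io -f -c "falloc 0 8m" /mnt/test/fakejournal
  for off in $(seq 0 4096 1048576); do
      xfs_io -c "pwrite -S 0xab $off 4096" -c fsync /mnt/test/fakejournal > /dev/null
  done
  filefrag -v /mnt/test/fakejournal   # many 4KiB extents plus 'unwritten' prealloc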

The +C idea just rearranges the deck chairs; it's not solving the 
underlying problem except in the case where the containing subvolume
is never snapshot. And in the COW case, I'm seeing about 30 metadata
nodes being written out for what amounts to less than a 4KiB journal
append. Each time.

And that makes me wonder whether metadata fragmentation is happening
as a result. But in any case, there's a lot of metadata being written
for each journal update compared to what's being added to the journal
file.

And then that makes me wonder if a better optimization on Btrfs would
be having each write be a separate file. The small updates would have
data inline. Which is worse, a single file with 20,000 fragments; or
40,000 separate journal files? *shrug* At least those individual files
would be subject to compression with +c; whereas right now the
open-ended active journal has not a single compressed extent.
Only once rotated do they get compressed (via defragmentation which
journald does only on Btrfs). Journals contain highly compressible
data.



Anyway, two example journals. The parent directory has chattr +c, both
journals inherited it. The first URL is filefrag -v, the 2nd is
btrfs-debug -f; for each journal.

This is a rotated journal. Upon rotation on Btrfs, journald
defragments the file which ends up compressing it when chattr +c.
https://da.gd/4NKyq
https://da.gd/zEeYW

This is an active system.journal. No compressed extents (the writes I
think are too small).
https://da.gd/cBjX
https://da.gd/YXuI


Extra credit if you've followed this far... The rotated log has piles
of unwritten items in it that are making it fairly inefficient even
with compression. Just using cat to write its contents to a new file,
compression goes from a 1.27 ratio to 5.70. Here are the results
after catting that file:
https://da.gd/rE8KT
https://da.gd/PD5qI
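
(If anyone wants to repeat that comparison, it is just cat plus the same
inspection tools used above; the journal name and machine-id directory are
made up, and the rewritten copy inherits +c from the directory:)

  cd /var/log/journal/<machine-id>/
  cat system@0005.journal > rewritten.journal
  sync
  filefrag -v rewritten.journal      # compressed extents carry the 'encoded' flag
  btrfs-debug -f rewritten.journal   # same per-file stats as the URLs above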



-- 
Chris Murphy


Re: No space left on device when doing "mkdir"

2017-04-28 Thread Gerard Saraber
DMARC is off; here's the output of the allocations. It's working
correctly right now; I'll update when it does it again.

/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/flags:2
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/raid1/used_bytes:3948544
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/raid1/total_bytes:33554432
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/bytes_pinned:0
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/disk_total:67108864
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/bytes_may_use:0
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/bytes_readonly:0
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/bytes_used:3948544
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/bytes_reserved:0
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/disk_used:7897088
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/total_bytes_pinned:0
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/system/total_bytes:33554432
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/flags:4
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/raid1/used_bytes:65864957952
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/raid1/total_bytes:83751862272
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/bytes_pinned:0
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/disk_total:167503724544
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/bytes_may_use:739508224
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/bytes_readonly:0
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/bytes_used:65864957952
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/bytes_reserved:1835008
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/disk_used:131729915904
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/total_bytes_pinned:1884160
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/metadata/total_bytes:83751862272
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/global_rsv_size:536870912
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/flags:1
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/raid1/used_bytes:23029876707328
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/raid1/total_bytes:23175643529216
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/bytes_pinned:0
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/disk_total:46351287058432
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/bytes_may_use:36474880
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/bytes_readonly:1703936
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/bytes_used:23029876707328
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/bytes_reserved:15003648
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/disk_used:46059753414656
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/total_bytes_pinned:0
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/data/total_bytes:23175643529216
/sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/global_rsv_reserved:536870912
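
(For anyone wanting the same dump from another filesystem, something like this
produces it, with the UUID swapped for your own:)

  grep -r . /sys/fs/btrfs/7af2e65c-3935-4e0d-aa63-9ef6be991cb9/allocation/ 2>/dev/null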


On Thu, Apr 27, 2017 at 6:35 PM, Chris Murphy  wrote:
> On Thu, Apr 27, 2017 at 10:46 AM, Gerard Saraber  wrote:
>> After a reboot, I found this in the logs:
>> [  322.510152] BTRFS info (device sdm): The free space cache file
>> (36114966511616) is invalid. skip it
>> [  488.702570] btrfs_printk: 847 callbacks suppressed
>>
>>
>>
>> On Thu, Apr 27, 2017 at 10:18 AM, Gerard Saraber  wrote:
>>> no snapshots and no qgroups, just a straight up large volume.
>>>
>>> shrapnel gerard-store # btrfs fi df /home/exports
>>> Data, RAID1: total=20.93TiB, used=20.86TiB
>>> System, RAID1: total=32.00MiB, used=3.73MiB
>>> Metadata, RAID1: total=79.00GiB, used=61.10GiB
>>> GlobalReserve, single: total=512.00MiB, used=544.00KiB
>>>
>>> shrapnel gerard-store # btrfs filesystem usage /home/exports
>>> Overall:
>>> Device size:  69.13TiB
>>> Device allocated: 42.01TiB
>>> Device unallocated:   27.13TiB
>>> Device missing:  0.00B
>>> Used: 41.84TiB
>>> Free (estimated): 13.63TiB  (min: 13.63TiB)
>>> Data ratio:   2.00
>>> Metadata ratio:   2.00
>>> Global reserve:  512.00MiB  (used: 1.52MiB)
>>>
>>> On Thu, Apr 27, 2017 at 9:07 AM, Roman Mamedov  wrote:
 On Thu, 27 Apr 

Re: [PATCH 1/2 v2] btrfs-progs: fix btrfs send & receive with -e flag

2017-04-28 Thread Nazar Mokrynskyi
Hi,

Sorry for the confusion, I've checked once again and the same issue happens in
all cases.

I didn't notice this because my regular backups are done automatically by a cron
task and the snapshots look fine despite the error, so I incorrectly assumed no
error happened there, but it actually did.

I've clarified this in the last comment on bugzilla.

Sincerely, Nazar Mokrynskyi
github.com/nazar-pc
Tox: 
A9D95C9AA5F7A3ED75D83D0292E22ACE84BA40E912185939414475AF28FD2B2A5C8EF5261249

28.04.17 13:03, Lakshmipathi.G wrote:
> I can take a look. What I'm wondering about is why it fails only in the HDD
> to SSD case. If -ENODATA is returned with this patch it should mean that there
> was no header data. So is the user sure that this doesn't indicate a valid
> error?
>
> Christian

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fstests: regression test for nocsum buffered read's repair

2017-04-28 Thread Filipe Manana
On Wed, Apr 26, 2017 at 9:54 PM, Liu Bo  wrote:
> This is to test whether buffered read retry-repair code is able to work in
> raid1 case as expected.
>
> Please note that without checksums, btrfs doesn't know if the data used to
> repair is correct, so repair is more of a resync which makes sure that both
> copies have the same content.
>
> Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed and 
> drop
> checks") introduced the regression.
>
> The upstream fix is
> Btrfs: bring back repair during read
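
A minimal standalone sketch of the resync-style repair the quoted commit
message describes. Everything here (read_mirror(), write_mirror(), the
in-memory "mirrors") is a hypothetical stand-in, not the btrfs read-repair
code; it only illustrates that without checksums the best that can be done is
to copy whichever mirror still reads back successfully over the one that failed:

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define NR_MIRRORS 2
#define BLK 16

/* Two in-memory "mirrors"; mirror 0 starts out unreadable. */
static char mirror_data[NR_MIRRORS][BLK] = {
    "garbage........",
    "good data......"
};
static bool mirror_ok[NR_MIRRORS] = { false, true };

static bool read_mirror(int m, char *buf)
{
    if (!mirror_ok[m])
        return false;                    /* simulated EIO on this copy */
    memcpy(buf, mirror_data[m], BLK);
    return true;
}

static void write_mirror(int m, const char *buf)
{
    memcpy(mirror_data[m], buf, BLK);
    mirror_ok[m] = true;
}

/*
 * Without checksums there is no way to tell which copy is "correct";
 * the only possible repair is to make both copies identical (a resync)
 * using whichever mirror still reads back successfully.
 */
static bool read_with_resync(int failed, char *buf)
{
    for (int m = 0; m < NR_MIRRORS; m++) {
        if (m == failed || !read_mirror(m, buf))
            continue;
        write_mirror(failed, buf);       /* overwrite the failed copy */
        return true;
    }
    return false;                        /* every mirror failed to read */
}

int main(void)
{
    char buf[BLK];

    if (read_with_resync(0, buf))
        printf("read ok, mirror 0 now holds: %s\n", mirror_data[0]);
    return 0;
}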

Same issue as all the other tests you sent: it fails on a patched kernel
due to a mismatch between the $physical_on_scratch offset and the golden
output:

root 08:22:02 /home/fdmanana/git/hub/xfstests (master)> ./check btrfs/143
FSTYP -- btrfs
PLATFORM  -- Linux/x86_64 debian3 4.10.0-rc8-btrfs-next-40+
MKFS_OPTIONS  -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

btrfs/143 - output mismatch (see
/home/fdmanana/git/hub/xfstests/results//btrfs/143.out.bad)
--- tests/btrfs/143.out 2017-04-28 08:21:59.358432901 +0100
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/143.out.bad
2017-04-28 08:22:15.254446208 +0100
@@ -1,39 +1,39 @@
 QA output created by 143
 wrote 131072/131072 bytes at offset 0
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-wrote 65536/65536 bytes at offset 244056064
+wrote 65536/65536 bytes at offset 1103101952
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-0e8c:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa

...
(Run 'diff -u tests/btrfs/143.out
/home/fdmanana/git/hub/xfstests/results//btrfs/143.out.bad'  to see
the entire diff)
Ran: btrfs/143
Failures: btrfs/143
Failed 1 of 1 tests

root 08:22:16 /home/fdmanana/git/hub/xfstests (master)>
root 08:22:22 /home/fdmanana/git/hub/xfstests (master)> diff -u
tests/btrfs/143.out
/home/fdmanana/git/hub/xfstests/results//btrfs/143.out.bad
--- tests/btrfs/143.out 2017-04-28 08:21:59.358432901 +0100
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/143.out.bad
2017-04-28 08:22:15.254446208 +0100
@@ -1,39 +1,39 @@
 QA output created by 143
 wrote 131072/131072 bytes at offset 0
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-wrote 65536/65536 bytes at offset 244056064
+wrote 65536/65536 bytes at offset 1103101952
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-0e8c:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0010:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0020:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0030:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0040:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0050:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0060:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0070:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0080:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0090:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00a0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00b0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00c0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00d0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00e0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00f0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0100:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0110:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0120:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0130:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0140:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0150:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0160:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0170:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0180:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0190:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01a0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01b0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01c0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01d0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01e0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01f0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-read 512/512 bytes at offset 244056064

Re: [PATCH 1/2 v2] btrfs-progs: fix btrfs send & receive with -e flag

2017-04-28 Thread Lakshmipathi.G
Hi.

Adding the bug reporter, Nazar, to the discussion (as I'm not familiar
with the send/receive feature/code).


Cheers,
Lakshmipathi.G
http://www.giis.co.in http://www.webminal.org

On Fri, Apr 28, 2017 at 3:25 PM, Christian Brauner
 wrote:
>
> Hi,
>
> On Fri, Apr 28, 2017 at 02:55:31PM +0530, Lakshmipathi.G wrote:
> > Seems like user reported an issue with this patch. please check
> > https://bugzilla.kernel.org/show_bug.cgi?id=195597
>
> I can take a look. What I'm wondering about is why it fails only in the HDD
> to SSD case. If -ENODATA is returned with this patch it should mean that there
> was no header data. So is the user sure that this doesn't indicate a valid
> error?
>
> Christian
>
> >
> > 
> > Cheers,
> > Lakshmipathi.G
> >
> >
> > On Tue, Apr 4, 2017 at 1:51 AM, Christian Brauner <
> > christian.brau...@ubuntu.com> wrote:
> > > The old check here tried to ensure that empty streams are not considered
> > valid.
> > > The old check however, will always fail when only one run through the
> > while(1)
> > > loop is needed and honor_end_cmd is set. So this:
> > >
> > > btrfs send /some/subvol | btrfs receive -e /some/
> > >
> > > will consistently fail because -e causes honor_cmd_to be set and
> > > btrfs_read_and_process_send_stream() to correctly return 1. So the
> > command will
> > > be successful but btrfs receive will error out because the send - receive
> > > concluded in one run through the while(1) loop.
> > >
> > > If we want to exclude empty streams we need a way to tell the difference
> > between
> > > btrfs_read_and_process_send_stream() returning 1 because read_buf() did
> > not
> > > detect any data and read_and_process_cmd() returning 1 because
> > honor_end_cmd was
> > > set. Without introducing too many changes the best way to me seems to have
> > > btrfs_read_and_process_send_stream() return -ENODATA in the first case.
> > The rest
> > > stays the same. We can then check for -ENODATA in do_receive() and report
> > a
> > > proper error in this case. This should also be backwards compatible to
> > previous
> > > versions of btrfs receive. They will fail on empty streams because a
> > negative
> > > value is returned. The only thing that they will lack is a nice error
> > message.
> > >
> > > Signed-off-by: Christian Brauner 
> > > ---
> > > Changelog: 2017-04-03
> > > - no changes
> > > ---
> > >  cmds-receive.c | 13 +
> > >  send-stream.c  |  2 +-
> > >  2 files changed, 6 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/cmds-receive.c b/cmds-receive.c
> > > index 6cf22637..b59f00e4 100644
> > > --- a/cmds-receive.c
> > > +++ b/cmds-receive.c
> > > @@ -1091,7 +1091,6 @@ static int do_receive(struct btrfs_receive *rctx,
> > const char *tomnt,
> > > char *dest_dir_full_path;
> > > char root_subvol_path[PATH_MAX];
> > > int end = 0;
> > > -   int count;
> > >
> > > dest_dir_full_path = realpath(tomnt, NULL);
> > > if (!dest_dir_full_path) {
> > > @@ -1186,7 +1185,6 @@ static int do_receive(struct btrfs_receive *rctx,
> > const char *tomnt,
> > > if (ret < 0)
> > > goto out;
> > >
> > > -   count = 0;
> > > while (!end) {
> > > if (rctx->cached_capabilities_len) {
> > > if (g_verbose >= 3)
> > > @@ -1200,16 +1198,15 @@ static int do_receive(struct btrfs_receive *rctx,
> > const char *tomnt,
> > >  rctx,
> > >
> >  rctx->honor_end_cmd,
> > >  max_errors);
> > > -   if (ret < 0)
> > > -   goto out;
> > > -   /* Empty stream is invalid */
> > > -   if (ret && count == 0) {
> > > +   if (ret < 0 && ret == -ENODATA) {
> > > +   /* Empty stream is invalid */
> > > error("empty stream is not considered valid");
> > > ret = -EINVAL;
> > > goto out;
> > > +   } else if (ret < 0) {
> > > +   goto out;
> > > }
> > > -   count++;
> > > -   if (ret)
> > > +   if (ret > 0)
> > > end = 1;
> > >
> > > close_inode_for_write(rctx);
> > > diff --git a/send-stream.c b/send-stream.c
> > > index 5a028cd9..78f2571a 100644
> > > --- a/send-stream.c
> > > +++ b/send-stream.c
> > > @@ -492,7 +492,7 @@ int btrfs_read_and_process_send_stream(int fd,
> > > if (ret < 0)
> > > goto out;
> > > if (ret) {
> > > -   ret = 1;
> > > +   ret = -ENODATA;
> > > goto out;
> > > }
> > >
> > > --
> > > 2.11.0
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > > the body of a message to 

Re: [PATCH] fstests: regression test for nocsum dio read's repair

2017-04-28 Thread Filipe Manana
On Wed, Apr 26, 2017 at 9:54 PM, Liu Bo  wrote:
> Commit 2dabb3248453 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
> introduced this regression.  It'd cause a 'Segmentation fault' error.
>
> The upstream fix is
> Btrfs: fix segment fault when doing dio read

Same issue as the other tests: it fails on a patched kernel. The value
of $physical_on_scratch doesn't match the value in the golden output:

root 08:18:24 /home/fdmanana/git/hub/xfstests (master)> ./check btrfs/142
FSTYP -- btrfs
PLATFORM  -- Linux/x86_64 debian3 4.10.0-rc8-btrfs-next-40+
MKFS_OPTIONS  -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

btrfs/142 - output mismatch (see
/home/fdmanana/git/hub/xfstests/results//btrfs/142.out.bad)
--- tests/btrfs/142.out 2017-04-28 08:18:22.206251115 +0100
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/142.out.bad
2017-04-28 08:18:35.946262617 +0100
@@ -1,39 +1,39 @@
 QA output created by 142
 wrote 131072/131072 bytes at offset 0
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-wrote 65536/65536 bytes at offset 244056064
+wrote 65536/65536 bytes at offset 1103101952
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-0e8c:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa

...
(Run 'diff -u tests/btrfs/142.out
/home/fdmanana/git/hub/xfstests/results//btrfs/142.out.bad'  to see
the entire diff)
Ran: btrfs/142
Failures: btrfs/142
Failed 1 of 1 tests

root 08:18:36 /home/fdmanana/git/hub/xfstests (master)>
root 08:18:38 /home/fdmanana/git/hub/xfstests (master)> diff -u
tests/btrfs/142.out
/home/fdmanana/git/hub/xfstests/results//btrfs/142.out.bad
--- tests/btrfs/142.out 2017-04-28 08:18:22.206251115 +0100
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/142.out.bad
2017-04-28 08:18:35.946262617 +0100
@@ -1,39 +1,39 @@
 QA output created by 142
 wrote 131072/131072 bytes at offset 0
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-wrote 65536/65536 bytes at offset 244056064
+wrote 65536/65536 bytes at offset 1103101952
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-0e8c:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0010:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0020:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0030:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0040:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0050:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0060:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0070:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0080:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0090:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00a0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00b0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00c0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00d0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00e0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00f0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0100:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0110:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0120:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0130:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0140:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0150:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0160:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0170:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0180:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0190:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01a0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01b0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01c0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01d0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01e0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01f0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-read 512/512 bytes at offset 244056064
+41c0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
+41c00010:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
+41c00020:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
+41c00030:  aa aa aa aa aa aa aa aa 

Re: [PATCH 1/2 v2] btrfs-progs: fix btrfs send & receive with -e flag

2017-04-28 Thread Christian Brauner
Hi,

On Fri, Apr 28, 2017 at 02:55:31PM +0530, Lakshmipathi.G wrote:
> Seems like user reported an issue with this patch. please check
> https://bugzilla.kernel.org/show_bug.cgi?id=195597

I can take a look. What I'm wondering about is why it fails only in the HDD
to SSD case. If -ENODATA is returned with this patch it should mean that there
was no header data. So is the user sure that this doesn't indicate a valid
error?

Christian

> 
> 
> Cheers,
> Lakshmipathi.G
> 
> 
> On Tue, Apr 4, 2017 at 1:51 AM, Christian Brauner <
> christian.brau...@ubuntu.com> wrote:
> > The old check here tried to ensure that empty streams are not considered
> valid.
> > The old check however, will always fail when only one run through the
> while(1)
> > loop is needed and honor_end_cmd is set. So this:
> >
> > btrfs send /some/subvol | btrfs receive -e /some/
> >
> > will consistently fail because -e causes honor_cmd_to be set and
> > btrfs_read_and_process_send_stream() to correctly return 1. So the
> command will
> > be successful but btrfs receive will error out because the send - receive
> > concluded in one run through the while(1) loop.
> >
> > If we want to exclude empty streams we need a way to tell the difference
> between
> > btrfs_read_and_process_send_stream() returning 1 because read_buf() did
> not
> > detect any data and read_and_process_cmd() returning 1 because
> honor_end_cmd was
> > set. Without introducing too many changes the best way to me seems to have
> > btrfs_read_and_process_send_stream() return -ENODATA in the first case.
> The rest
> > stays the same. We can then check for -ENODATA in do_receive() and report
> a
> > proper error in this case. This should also be backwards compatible to
> previous
> > versions of btrfs receive. They will fail on empty streams because a
> negative
> > value is returned. The only thing that they will lack is a nice error
> message.
> >
> > Signed-off-by: Christian Brauner 
> > ---
> > Changelog: 2017-04-03
> > - no changes
> > ---
> >  cmds-receive.c | 13 +
> >  send-stream.c  |  2 +-
> >  2 files changed, 6 insertions(+), 9 deletions(-)
> >
> > diff --git a/cmds-receive.c b/cmds-receive.c
> > index 6cf22637..b59f00e4 100644
> > --- a/cmds-receive.c
> > +++ b/cmds-receive.c
> > @@ -1091,7 +1091,6 @@ static int do_receive(struct btrfs_receive *rctx,
> const char *tomnt,
> > char *dest_dir_full_path;
> > char root_subvol_path[PATH_MAX];
> > int end = 0;
> > -   int count;
> >
> > dest_dir_full_path = realpath(tomnt, NULL);
> > if (!dest_dir_full_path) {
> > @@ -1186,7 +1185,6 @@ static int do_receive(struct btrfs_receive *rctx,
> const char *tomnt,
> > if (ret < 0)
> > goto out;
> >
> > -   count = 0;
> > while (!end) {
> > if (rctx->cached_capabilities_len) {
> > if (g_verbose >= 3)
> > @@ -1200,16 +1198,15 @@ static int do_receive(struct btrfs_receive *rctx,
> const char *tomnt,
> >  rctx,
> >
>  rctx->honor_end_cmd,
> >  max_errors);
> > -   if (ret < 0)
> > -   goto out;
> > -   /* Empty stream is invalid */
> > -   if (ret && count == 0) {
> > +   if (ret < 0 && ret == -ENODATA) {
> > +   /* Empty stream is invalid */
> > error("empty stream is not considered valid");
> > ret = -EINVAL;
> > goto out;
> > +   } else if (ret < 0) {
> > +   goto out;
> > }
> > -   count++;
> > -   if (ret)
> > +   if (ret > 0)
> > end = 1;
> >
> > close_inode_for_write(rctx);
> > diff --git a/send-stream.c b/send-stream.c
> > index 5a028cd9..78f2571a 100644
> > --- a/send-stream.c
> > +++ b/send-stream.c
> > @@ -492,7 +492,7 @@ int btrfs_read_and_process_send_stream(int fd,
> > if (ret < 0)
> > goto out;
> > if (ret) {
> > -   ret = 1;
> > +   ret = -ENODATA;
> > goto out;
> > }
> >
> > --
> > 2.11.0
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> > the body of a message to majord...@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] fstests: regression test for btrfs buffered read's repair

2017-04-28 Thread Filipe Manana
On Wed, Apr 26, 2017 at 7:09 PM, Liu Bo  wrote:
> This case tests whether buffered read can repair the bad copy if we
> have a good copy.
>
> Commit 20a7db8ab3f2 ("btrfs: add dummy callback for readpage_io_failed
> and drop checks") introduced the regression.
>
> The upstream fix is
> Btrfs: bring back repair during read

Same issue as reported for the other new test (btrfs/140): the test
fails on a kernel with the mentioned patch. It seems to make wrong
assumptions somewhere (I haven't investigated, just ran the test):

root 08:09:17 /home/fdmanana/git/hub/xfstests (master)> ./check btrfs/141
FSTYP -- btrfs
PLATFORM  -- Linux/x86_64 debian3 4.10.0-rc8-btrfs-next-40+
MKFS_OPTIONS  -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

btrfs/141 - output mismatch (see
/home/fdmanana/git/hub/xfstests/results//btrfs/141.out.bad)
--- tests/btrfs/141.out 2017-04-28 08:09:13.289791597 +0100
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/141.out.bad
2017-04-28 08:09:28.469804304 +0100
@@ -1,39 +1,39 @@
 QA output created by 141
 wrote 131072/131072 bytes at offset 0
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-wrote 65536/65536 bytes at offset 244056064
+wrote 65536/65536 bytes at offset 1103101952
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-0e8c:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa

...
(Run 'diff -u tests/btrfs/141.out
/home/fdmanana/git/hub/xfstests/results//btrfs/141.out.bad'  to see
the entire diff)
Ran: btrfs/141
Failures: btrfs/141
Failed 1 of 1 tests

root 08:09:29 /home/fdmanana/git/hub/xfstests (master)>
root 08:09:31 /home/fdmanana/git/hub/xfstests (master)> diff -u
tests/btrfs/141.out
/home/fdmanana/git/hub/xfstests/results//btrfs/141.out.bad
--- tests/btrfs/141.out 2017-04-28 08:09:13.289791597 +0100
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/141.out.bad
2017-04-28 08:09:28.469804304 +0100
@@ -1,39 +1,39 @@
 QA output created by 141
 wrote 131072/131072 bytes at offset 0
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-wrote 65536/65536 bytes at offset 244056064
+wrote 65536/65536 bytes at offset 1103101952
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-0e8c:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0010:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0020:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0030:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0040:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0050:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0060:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0070:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0080:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0090:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00a0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00b0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00c0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00d0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00e0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00f0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0100:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0110:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0120:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0130:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0140:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0150:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0160:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0170:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0180:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0190:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01a0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01b0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01c0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01d0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01e0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01f0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-read 512/512 bytes at offset 244056064
+41c0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
+41c00010:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  

[PATCH 3/3 v2] Make max_size consistent with nr

2017-04-28 Thread Christophe de Dinechin
Since we memset tmpl, max_size==0. This does not seem consistent with nr = 1.
In check_extent_refs, we will call:

  set_extent_dirty(root->fs_info->excluded_extents,
   rec->start,
   rec->start + rec->max_size - 1);

This ends up with BUG_ON(end < start) in insert_state.
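
A self-contained sketch of the arithmetic that makes max_size == 0 fatal.
insert_state_check() and the start value below are hypothetical stand-ins for
insert_state() and the extent record; the point is only that end lands below
start:

#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t u64;

/* Hypothetical stand-in for insert_state(); only the assertion matters here. */
static void insert_state_check(u64 start, u64 end)
{
    assert(end >= start);               /* mirrors BUG_ON(end < start) */
}

int main(void)
{
    u64 start = 52428800;               /* rec->start, arbitrary example value */
    u64 max_size = 0;                   /* what a memset-only template leaves */

    /* check_extent_refs() builds the range end the same way: */
    u64 end = start + max_size - 1;     /* with max_size == 0, end == start - 1 */

    printf("start=%" PRIu64 " end=%" PRIu64 "\n", start, end);
    insert_state_check(start, end);     /* trips, just like the reported abort */
    return 0;
}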

Signed-off-by: Christophe de Dinechin 
---
 cmds-check.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/cmds-check.c b/cmds-check.c
index c13f900..d5e2966 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -6193,6 +6193,7 @@ static int add_tree_backref(struct cache_tree 
*extent_cache, u64 bytenr,
tmpl.start = bytenr;
tmpl.nr = 1;
tmpl.metadata = 1;
+   tmpl.max_size = 1;
 
ret = add_extent_rec_nolookup(extent_cache, );
if (ret)
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3 v2] Prevent attempt to insert extent record with max_size==0

2017-04-28 Thread Christophe de Dinechin
When this happens, we will trip a BUG_ON(end < start) in insert_state
because in check_extent_refs, we use this max_size expecting it to be non-zero:

  set_extent_dirty(root->fs_info->excluded_extents,
   rec->start,
   rec->start + rec->max_size - 1);

See https://bugzilla.redhat.com/show_bug.cgi?id=1435567
for an example where this scenario occurs.

Signed-off-by: Christophe de Dinechin 
---
 cmds-check.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/cmds-check.c b/cmds-check.c
index 2d3ebc1..c13f900 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -6029,6 +6029,7 @@ static int add_extent_rec_nolookup(struct cache_tree 
*extent_cache,
struct extent_record *rec;
int ret = 0;
 
+   BUG_ON(tmpl->max_size == 0);
rec = malloc(sizeof(*rec));
if (!rec)
return -ENOMEM;
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] fstests: regression test for btrfs dio read repair

2017-04-28 Thread Filipe Manana
On Wed, Apr 26, 2017 at 7:09 PM, Liu Bo  wrote:
> This case tests whether dio read can repair the bad copy if we have
> a good copy.
>
> Commit 2dabb3248453 ("Btrfs: Direct I/O read: Work on sectorsized blocks")
> introduced the regression.
>
> The upstream fix is
> Btrfs: fix invalid dereference in btrfs_retry_endio
>
> Signed-off-by: Liu Bo 

Thanks for doing this.

Just tested this on a kernel with the mentioned fix, and it fails:

root 08:04:11 /home/fdmanana/git/hub/xfstests (master)> ./check btrfs/140
FSTYP -- btrfs
PLATFORM  -- Linux/x86_64 debian3 4.10.0-rc8-btrfs-next-40+
MKFS_OPTIONS  -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1

btrfs/140 - output mismatch (see
/home/fdmanana/git/hub/xfstests/results//btrfs/140.out.bad)
--- tests/btrfs/140.out 2017-04-28 07:59:13.069289130 +0100
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/140.out.bad
2017-04-28 08:04:18.209544574 +0100
@@ -1,39 +1,39 @@
 QA output created by 140
 wrote 131072/131072 bytes at offset 0
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-wrote 65536/65536 bytes at offset 244056064
+wrote 65536/65536 bytes at offset 1103101952
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-0e8c:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa

...
(Run 'diff -u tests/btrfs/140.out
/home/fdmanana/git/hub/xfstests/results//btrfs/140.out.bad'  to see
the entire diff)
Ran: btrfs/140
Failures: btrfs/140
Failed 1 of 1 tests

root 08:04:18 /home/fdmanana/git/hub/xfstests (master)>
root 08:04:27 /home/fdmanana/git/hub/xfstests (master)> diff -u
tests/btrfs/140.out
/home/fdmanana/git/hub/xfstests/results//btrfs/140.out.bad
--- tests/btrfs/140.out 2017-04-28 07:59:13.069289130 +0100
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/140.out.bad
2017-04-28 08:04:18.209544574 +0100
@@ -1,39 +1,39 @@
 QA output created by 140
 wrote 131072/131072 bytes at offset 0
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-wrote 65536/65536 bytes at offset 244056064
+wrote 65536/65536 bytes at offset 1103101952
 XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-0e8c:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0010:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0020:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0030:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0040:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0050:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0060:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0070:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0080:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0090:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00a0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00b0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00c0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00d0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00e0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c00f0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0100:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0110:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0120:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0130:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0140:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0150:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0160:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0170:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0180:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c0190:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01a0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01b0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01c0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01d0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01e0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-0e8c01f0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
-read 512/512 bytes at offset 244056064
+41c0:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
+41c00010:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  
+41c00020:  aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa  

Re: [PATCH 3/3] Make max_size consistent with nr

2017-04-28 Thread Roman Mamedov
On Fri, 28 Apr 2017 11:13:36 +0200
Christophe de Dinechin  wrote:

> Since we memset tmpl, max_size==0. This does not seem consistent with nr = 1.
> In check_extent_refs, we will call:
> 
>   set_extent_dirty(root->fs_info->excluded_extents,
>rec->start,
>rec->start + rec->max_size - 1);
> 
> This ends up with BUG_ON(end < start) in insert_state.
> 
> Signed-off-by: Christophe de Dinechin 
> ---
>  cmds-check.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/cmds-check.c b/cmds-check.c
> index 58e65d6..774e9b6 100644
> --- a/cmds-check.c
> +++ b/cmds-check.c
> @@ -6193,6 +6193,7 @@ static int add_tree_backref(struct cache_tree 
> *extent_cache, u64 bytenr,
>   tmpl.start = bytenr;
>   tmpl.nr = 1;
>   tmpl.metadata = 1;
> +tmpl.max_size = 1;
>  
>   ret = add_extent_rec_nolookup(extent_cache, );
>   if (ret)

The original code uses Tab characters for indent, but your addition uses
spaces. Also same problem in patch 2/3.

-- 
With respect,
Roman
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] Make max_size consistent with nr

2017-04-28 Thread Christophe de Dinechin
Since we memset tmpl, max_size==0. This does not seem consistent with nr = 1.
In check_extent_refs, we will call:

  set_extent_dirty(root->fs_info->excluded_extents,
   rec->start,
   rec->start + rec->max_size - 1);

This ends up with BUG_ON(end < start) in insert_state.

Signed-off-by: Christophe de Dinechin 
---
 cmds-check.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/cmds-check.c b/cmds-check.c
index 58e65d6..774e9b6 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -6193,6 +6193,7 @@ static int add_tree_backref(struct cache_tree 
*extent_cache, u64 bytenr,
tmpl.start = bytenr;
tmpl.nr = 1;
tmpl.metadata = 1;
+tmpl.max_size = 1;
 
ret = add_extent_rec_nolookup(extent_cache, );
if (ret)
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/3] Prevent attempt to insert extent record with max_size==0

2017-04-28 Thread Christophe de Dinechin
When this happens, we will trip a BUG_ON(end < start) in insert_state
because in check_extent_refs, we use this max_size expecting it to be non-zero:

  set_extent_dirty(root->fs_info->excluded_extents,
   rec->start,
   rec->start + rec->max_size - 1);

See https://bugzilla.redhat.com/show_bug.cgi?id=1435567
for an example where this scenario occurs.

Signed-off-by: Christophe de Dinechin 
---
 cmds-check.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/cmds-check.c b/cmds-check.c
index 2d3ebc1..58e65d6 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -6029,6 +6029,7 @@ static int add_extent_rec_nolookup(struct cache_tree 
*extent_cache,
struct extent_record *rec;
int ret = 0;
 
+BUG_ON(tmpl->max_size == 0);
rec = malloc(sizeof(*rec));
if (!rec)
return -ENOMEM;
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/3] Disambiguate between cases where add_tree_backref fails

2017-04-28 Thread Christophe de Dinechin
See https://bugzilla.redhat.com/show_bug.cgi?id=1435567 for an example
where the message occurs.

Signed-off-by: Christophe de Dinechin 
---
 cmds-check.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 17b7efb..2d3ebc1 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -6832,14 +6832,14 @@ static int process_extent_item(struct btrfs_root *root,
ret = add_tree_backref(extent_cache, key.objectid,
0, offset, 0);
if (ret < 0)
-   error("add_tree_backref failed: %s",
+   error("add_tree_backref failed (extent items 
tree block): %s",
  strerror(-ret));
break;
case BTRFS_SHARED_BLOCK_REF_KEY:
ret = add_tree_backref(extent_cache, key.objectid,
offset, 0, 0);
if (ret < 0)
-   error("add_tree_backref failed: %s",
+   error("add_tree_backref failed (extent items 
shared block): %s",
  strerror(-ret));
break;
case BTRFS_EXTENT_DATA_REF_KEY:
@@ -7753,7 +7753,7 @@ static int run_next_block(struct btrfs_root *root,
ret = add_tree_backref(extent_cache,
key.objectid, 0, key.offset, 0);
if (ret < 0)
-   error("add_tree_backref failed: %s",
+   error("add_tree_backref failed (leaf 
tree block): %s",
  strerror(-ret));
continue;
}
@@ -7761,7 +7761,7 @@ static int run_next_block(struct btrfs_root *root,
ret = add_tree_backref(extent_cache,
key.objectid, key.offset, 0, 0);
if (ret < 0)
-   error("add_tree_backref failed: %s",
+   error("add_tree_backref failed (leaf 
shared block): %s",
  strerror(-ret));
continue;
}
@@ -7866,7 +7866,7 @@ static int run_next_block(struct btrfs_root *root,
ret = add_tree_backref(extent_cache, ptr, parent,
owner, 1);
if (ret < 0) {
-   error("add_tree_backref failed: %s",
+   error("add_tree_backref failed (non-leaf 
block): %s",
  strerror(-ret));
continue;
}
-- 
2.9.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] btrfs: check options during subsequent mount

2017-04-28 Thread Anand Jain
We allow recursive mounts with subvol options such as [1]

[1]
 mount -o rw,compress=lzo /dev/sdc /btrfs1
 mount -o ro,subvol=sv2 /dev/sdc /btrfs2

Except for the btrfs-specific subvol and subvolid options,
all other options are just ignored in the subsequent mounts.

In example [2] below, the effective compression is only zlib,
even though there was no error for the option -o compress=lzo.

[2]

 # mount -o compress=zlib /dev/sdc /btrfs1
 #echo $?
 0

 # mount -o compress=lzo /dev/sdc /btrfs
 #echo $?
 0

 #cat /proc/self/mounts
 ::
 /dev/sdc /btrfs1 btrfs 
rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/ 0 0
 /dev/sdc /btrfs btrfs 
rw,relatime,compress=zlib,space_cache,subvolid=5,subvol=/ 0 0


Further, a random string .. produces no error as well.
-
 # mount -o compress=zlib /dev/sdc /btrfs1
 #echo $?
 0

 # mount -o something /dev/sdc /btrfs
 #echo $?
 0
-

This patch fixes the above issue by checking whether the passed
options are only subvol or subvolid in the subsequent mount.

Signed-off-by: Anand Jain 
---
 fs/btrfs/super.c | 40 
 1 file changed, 40 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 9530a333d302..e0e542345c38 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -389,6 +389,44 @@ static const match_table_t tokens = {
{Opt_err, NULL},
 };
 
+static int parse_recursive_mount_options(char *data)
+{
+   substring_t args[MAX_OPT_ARGS];
+   char *options, *orig;
+   char *p;
+   int ret = 0;
+
+   /*
+* This is not a remount thread, but we allow recursive mounts
+* with varying RO/RW flag to support subvol-mounts. So error-out
+* if any other option being passed in here.
+*/
+
+   options = kstrdup(data, GFP_NOFS);
+   if (!options)
+   return -ENOMEM;
+
+   orig = options;
+
+   while ((p = strsep(, ",")) != NULL) {
+   int token;
+   if (!*p)
+   continue;
+
+   token = match_token(p, tokens, args);
+   switch(token) {
+   case Opt_subvol:
+   case Opt_subvolid:
+   break;
+   default:
+   ret = -EBUSY;
+   }
+   }
+
+   kfree(orig);
+   return ret;
+}
+
 /*
  * Regular mount options parser.  Everything that is needed only when
  * reading in a new superblock is parsed here.
@@ -1611,6 +1649,8 @@ static struct dentry *btrfs_mount(struct file_system_type 
*fs_type, int flags,
free_fs_info(fs_info);
if ((flags ^ s->s_flags) & MS_RDONLY)
error = -EBUSY;
+   if (parse_recursive_mount_options(data))
+   error = -EBUSY;
} else {
snprintf(s->s_id, sizeof(s->s_id), "%pg", bdev);
btrfs_sb(s)->bdev_holder = fs_type;
-- 
2.10.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: File system corruption, btrfsck abort

2017-04-28 Thread Christophe de Dinechin

> On 28 Apr 2017, at 02:45, Qu Wenruo  wrote:
> 
> 
> 
> At 04/26/2017 01:50 AM, Christophe de Dinechin wrote:
>> Hi,
>> I've been trying to run btrfs as my primary work filesystem for about 3-4 
>> months now on Fedora 25 systems. I ran into filesystem corruption a few 
>> times. At least one I attributed to a damaged disk, but the last one 
>> is with a brand new 3T disk that reports no SMART errors. Worse yet, in at 
>> least three cases, the filesystem corruption caused btrfsck to crash.
>> The last filesystem corruption is documented here: 
>> https://bugzilla.redhat.com/show_bug.cgi?id=1444821. The dmesg log is in 
>> there.
> 
> According to the bugzilla, the btrfs-progs seems to be too old by btrfs 
> standards.

> What about using the latest btrfs-progs v4.10.2?

I tried 4.10.1-1 https://bugzilla.redhat.com/show_bug.cgi?id=1435567#c4.

I am currently debugging with a build from the master branch as of Tuesday 
(commit bd0ab27afbf14370f9f0da1f5f5ecbb0adc654c1), which is 4.10.2

There was no change in behavior. Runs are split about evenly between list crash 
and abort.

I added instrumentation and tried a fix, which brings me a tiny bit further, 
until I hit a message from delete_duplicate_records:

Ok we have overlapping extents that aren't completely covered by each
other, this is going to require more careful thought.  The extents are
[52428800-16384] and [52432896-16384]
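
Spelling out the arithmetic behind that message: the first record covers
52428800 .. 52428800 + 16384 - 1 = 52445183 and the second covers
52432896 .. 52449279, so the two 16 KiB ranges share 52432896 .. 52445183
(an overlap of 12288 bytes) while neither one fully contains the other.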

> Furthermore for v4.10.2, btrfs check provides a new mode called lowmem.
> You could try "btrfs check --mode=lowmem" to see if such a problem can be 
> avoided.

I will try that, but what makes you think this is a memory-related condition? 
The machine has 16G of RAM, isn’t that enough for an fsck?

> 
> For the kernel bug, it seems to be related to wrongly inserted delayed ref, 
> but I can totally be wrong.

For now, I’m focusing on the “repair” part as much as I can, because I assume 
the kernel bug is there anyway, so someone else is bound to hit this problem.


Thanks
Christophe

> 
> Thanks,
> Qu
>> The btrfsck crash is here: 
>> https://bugzilla.redhat.com/show_bug.cgi?id=1435567. I have two crash modes: 
>> either an abort or a SIGSEGV. I checked that both still happen on master as 
>> of today.
>> The cause of the abort is that we call set_extent_dirty from 
>> check_extent_refs with rec->max_size == 0. I’ve instrumented to try to see 
>> where we set this to 0 (see 
>> https://github.com/c3d/btrfs-progs/tree/rhbz1435567), and indeed, we do 
>> sometimes see max_size set to 0 in a few locations. My instrumentation shows 
>> this:
>> 78655 [1.792241:0x451fe0] MAX_SIZE_ZERO: Add extent rec 0x139eb80 max_size 
>> 16384 tmpl 0x7fffd120
>> 78657 [1.792242:0x451cb8] MAX_SIZE_ZERO: Set max size 0 for rec 0x139ec50 
>> from tmpl 0x7fffcf80
>> 78660 [1.792244:0x451fe0] MAX_SIZE_ZERO: Add extent rec 0x139ed50 max_size 
>> 16384 tmpl 0x7fffd120
>> I don’t really know what to make of it.
>> The cause of the SIGSEGV is that we try to free a list entry that has its 
>> next set to NULL.
>> #0  list_del (entry=0x55db0420) at 
>> /usr/src/debug/btrfs-progs-v4.10.1/kernel-lib/list.h:125
>> #1  free_all_extent_backrefs (rec=0x55db0350) at cmds-check.c:5386
>> #2  maybe_free_extent_rec (extent_cache=0x7fffd990, rec=0x55db0350) 
>> at cmds-check.c:5417
>> #3  0x555b308f in check_block (flags=, 
>> buf=0x7b87cdf0, extent_cache=0x7fffd990, root=0x5587d570) at 
>> cmds-check.c:5851
>> #4  run_next_block (root=root@entry=0x5587d570, 
>> bits=bits@entry=0x558841
>> I don’t know if the two problems are related, but they seem to be pretty 
>> consistent on this specific disk, so I think that we have a good opportunity 
>> to improve btrfsck to make it more robust to this specific form of 
>> corruption. But I don’t want to haphazardly modify code I don’t really 
>> understand. So if anybody could make a suggestion on what the right strategy 
>> should be when we have max_size == 0, or how to avoid it in the first place.
>> I don’t know if this is relevant at all, but all the machines that failed 
>> that way were used to run VMs with KVM/QEMU. Disk activity tends to be 
>> somewhat intense on occasions, since the VMs running there are part of a 
>> personal Jenkins ring that automatically builds various projects. Nominally, 
>> there are between three and five guests running (Windows XP, Windows 10, 
>> macOS, Fedora25, Ubuntu 16.04).
>> Thanks
>> Christophe de Dinechin
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
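
A self-contained sketch of the SIGSEGV mode described in the quoted report
(frame #0 list_del, frame #1 free_all_extent_backrefs). list_del_sketch() is a
simplified stand-in for the kernel-lib/list.h helper; it only shows why an
entry whose next pointer is NULL ends in a NULL dereference:

#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

/* Simplified stand-in for list_del(): unlink the entry from its neighbours. */
static void list_del_sketch(struct list_head *entry)
{
    entry->next->prev = entry->prev;    /* NULL dereference when next == NULL */
    entry->prev->next = entry->next;
}

int main(void)
{
    struct list_head node = { .next = NULL, .prev = &node };

    list_del_sketch(&node);             /* segfaults like the reported crash */
    return 0;
}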

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body