Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-09-20 Thread Zygo Blaxell
On Fri, Sep 21, 2018 at 12:59:31PM +1000, Dave Chinner wrote:
> On Wed, Sep 19, 2018 at 12:12:03AM -0400, Zygo Blaxell wrote:
[...]
> With no DMAPI in the future, people with custom HSM-like interfaces
> based on dmapi are starting to turn to fanotify and friends to
> provide them with the change notifications they require

I had a fanotify-based scanner once, before I noticed btrfs effectively
had timestamps all over its metadata.

fanotify won't tell me which parts of a file were modified (unless it
got that feature in the last few years?).  fanotify was pretty useless
when the only file on the system that was being modified was a 13TB
VM image.  Or even a little 16GB one.  Has to scan the whole file to
find the one new byte.  Even on desktops the poor thing spends most of
its time looping over /var/log/messages.  It was sad.

If fanotify gave me (inode, offset, length) tuples of dirty pages in
cache, I could look them up and use a dedupe_file_range call to replace
the dirty pages with a reference to an existing disk block.  If my
listener can do that fast enough, it's in-band dedupe; if it doesn't,
the data gets flushed to disk as normal, and I fall back to a scan of
the filesystem to clean it up later.

> > > e.g. a soft requirement is that we need to scan the entire fs at
> > > least once a month. 
> > 
> > I have to scan and dedupe multiple times per hour.  OK, the first-ever
> > scan of a non-empty filesystem is allowed to take much longer, but after
> > that, if you have enough spare iops for continuous autodefrag you should
> > also have spare iops for continuous dedupe.
> 
> Yup, but using notifications avoids the for even these scans - you'd
> know exactly what data has changed, when it changed, and know
> exactly that you needed to read to calculate the new hashes.

...if the scanner can keep up with the notifications; otherwise, the
notification receiver has to log them somewhere for the scanner to
catch up.  If there are missed or dropped notifications--or 23 hours a
day we're not listening for notifications because we only have an hour
a day maintenance window--some kind of filesystem scan has to be done
after the fact anyway.

> > > A simple piece-wise per-AG scanning algorithm (like we use in
> > > xfs_repair) could easily work within a 3GB RAM per AG constraint and
> > > would scale very well. We'd only need to scan 30-40 AGs in the hour,
> > > and a single AG at 1GB/s will only take 2 minutes to scan. We can
> > > then do the processing while the next AG gets scanned. If we've got
> > > 10-20GB RAM to use (and who doesn't when they have 1PB of storage?)
> > > then we can scan 5-10AGs at once to keep the IO rate up, and process
> > > them in bulk as we scan more.
> > 
> > How do you match dupe blocks from different AGs if you only keep RAM for
> > the duration of one AG scan?  Do you not dedupe across AG boundaries?
> 
> We could, but do we need too? There's a heap of runtime considerations
> at the filesystem level we need to take into consideration here, and
> there's every chance that too much consolidation creates
> unpredictable bottlenecks in overwrite workloads that need to break
> the sharing (i.e. COW operations).

I'm well aware of that.  I have a bunch of hacks in bees to not be too
efficient lest it push the btrfs reflink bottlenecks too far.

> e.g. An AG contains up to 1TB of data which is more than enough to
> get decent AG-internal dedupe rates. If we've got 1PB of data spread
> across 1000AGs, deduping a million copies of a common data pattern
> spread across the entire filesystem down to one per AG (i.e. 10^6
> copies down to 10^3) still gives a massive space saving.

That's true for 1000+ AG filesystems, but it's a bigger problem for
filesystems of 2-5 AGs, where each AG holds one copy of 20-50% of the
duplicates on the filesystem.

OTOH, a filesystem that small could just be done in one pass with a
larger but still reasonable amount of RAM.

> > What you've described so far means the scope isn't limited anyway.  If the
> > call is used to dedupe two heavily-reflinked extents together (e.g.
> > both duplicate copies are each shared by thousands of snapshots that
> > have been created during the month-long period between dedupe runs),
> > it could always be stuck doing a lot of work updating dst owners.
> > Was there an omitted detail there?
> 
> As I said early in the discussion - if both copies of identical data
> are already shared hundreds or thousands of times each, then it
> makes no sense to dedupe them again. All that does is create huge
> amounts of work updating metadata for very little additional gain.

I've had a user complain about the existing 2560-reflink limit in bees,
because they were starting with 3000 snapshots of their data before they
ran dedupe for the first time, so almost all their data started above
the reflink limit before dedupe, and no dedupes occurred because of that.

Build servers end up with a 3-4 digit number of reflinks to every 

Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-09-20 Thread Dave Chinner
On Wed, Sep 19, 2018 at 12:12:03AM -0400, Zygo Blaxell wrote:
> On Mon, Sep 10, 2018 at 07:06:46PM +1000, Dave Chinner wrote:
> > On Thu, Sep 06, 2018 at 11:53:06PM -0400, Zygo Blaxell wrote:
> > > On Thu, Sep 06, 2018 at 06:38:09PM +1000, Dave Chinner wrote:
> > > > On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> > > > > On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > > > > > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > > > For future development I've abandoned the entire dedupe_file_range
> > > > > approach.  I need to be able to read and dedupe the data blocks of
> > > > > the filesystem directly without having to deal with details like which
> > > > > files those blocks belong to, especially on filesystems with lots of
> > > > > existing deduped blocks and snapshots. 
> > > > 
> > > > IOWs, your desired OOB dedupe algorithm is:
> > > > 
> > > > a) ask the filesystem where all it's file data is
> > > 
> > > Actually, it's "ask the filesystem where all the *new* file data is"
> > > since we don't want to read any unique data twice on subsequent runs.
> > 
> > Sorry, how do you read "unique data" twice? By definition, unique
> > data only occurs once
> 
> ...but once it has been read, we don't want to read it again.  Ever.
> Even better would be to read unique data less than 1.0 times on average.
> 
> > Oh, and you still need to hash the old data so you can find
> > collisions with the new data that got written. Unless, of course,
> > you are keeping your hash tree in a persistent database 
> 
> I do that.

OK, so you're slowly re-inventing the typical HSM implementation
that keep a userspace database to keep track of what is happening in
the filesystem. They do this to make better decisions when moving
less frequently accessed data out to a slower teirs of storage to
keep space in the primary storage available as it fills up. You're
approaching dedupe in essentially the same way, for very similar
reasons.

> One big difference I am noticing in our approaches is latency.  ZFS (and
> in-kernel btrfs dedupe) provides minimal dedupe latency (duplicate
> data occupies disk space for zero time as it is never written to disk
> at all) but it requires more RAM for a given dedupe hit rate than any
> other dedupe implementation I've seen.  What you've written tells me
> XFS saves RAM by partitioning the data and relying on an existing but
> very large source of iops (sharing scrub reads with dedupe), but then
> the dedupe latency is the same as the scrub interval (the worst so far).

That's just the background maintenance implementation. xfs_fsr can
run this way, but it can also be run as a complete immediate scan,
or it can be pointed at a defined set of files to run on. We can
(and probably will) do exactly the same thing with dedupe.

> bees aims to have latency of a few minutes (ideally scanning data while
> it's still dirty in cache, but there's no good userspace API for that)
> though it's obviously not there yet.

Yup, this is pretty much the "I need immediate notification that a
file changed" problem that HSMs face. That's one of the things DMAPI
was used for - XFS has (now unused) dmapi event notification fields in
it's inode structure that were used by DMAPI to configure the
notifications the filesystem was supposed to send userspace

With no DMAPI in the future, people with custom HSM-like interfaces
based on dmapi are starting to turn to fanotify and friends to
provide them with the change notifications they require

> > e.g. a soft requirement is that we need to scan the entire fs at
> > least once a month. 
> 
> I have to scan and dedupe multiple times per hour.  OK, the first-ever
> scan of a non-empty filesystem is allowed to take much longer, but after
> that, if you have enough spare iops for continuous autodefrag you should
> also have spare iops for continuous dedupe.

Yup, but using notifications avoids the for even these scans - you'd
know exactly what data has changed, when it changed, and know
exactly that you needed to read to calculate the new hashes.

> > A simple piece-wise per-AG scanning algorithm (like we use in
> > xfs_repair) could easily work within a 3GB RAM per AG constraint and
> > would scale very well. We'd only need to scan 30-40 AGs in the hour,
> > and a single AG at 1GB/s will only take 2 minutes to scan. We can
> > then do the processing while the next AG gets scanned. If we've got
> > 10-20GB RAM to use (and who doesn't when they have 1PB of storage?)
> > then we can scan 5-10AGs at once to keep the IO rate up, and process
> > them in bulk as we scan more.
> 
> How do you match dupe blocks from different AGs if you only keep RAM for
> the duration of one AG scan?  Do you not dedupe across AG boundaries?

We could, but do we need too? There's a heap of runtime considerations
at the filesystem level we need to take into consideration here, and
there's every chance that too much consolidation 

Re: btrfs problems

2018-09-20 Thread Remi Gauvin
On 2018-09-20 05:35 PM, Adrian Bastholm wrote:
> Thanks a lot for the detailed explanation.
> Aabout "stable hardware/no lying hardware". I'm not running any raid
> hardware, was planning on just software raid. three drives glued
> together with "mkfs.btrfs -d raid5 /dev/sdb /dev/sdc /dev/sdd". Would
> this be a safer bet, or would You recommend running the sausage method
> instead, with "-d single" for safety ? I'm guessing that if one of the
> drives dies the data is completely lost
> Another variant I was considering is running a raid1 mirror on two of
> the drives and maybe a subvolume on the third, for less important
> stuff

In case you were not aware, it's perfectly acceptable with BTRFS to use
Raid 1 over 3 devices.  Even more amazing, regardless of how many
devices you start with, 2, 3, 4, whatever, you can add a single drive to
the array to increase capacity,  (at 50%, of course,, ie, adding a 4TB
drive will give you 2TB usable space, assuming the other drives add up
to at least 4TB to match it.)


<>

Re: btrfs problems

2018-09-20 Thread Chris Murphy
On Thu, Sep 20, 2018 at 3:36 PM Adrian Bastholm  wrote:
>
> Thanks a lot for the detailed explanation.
> Aabout "stable hardware/no lying hardware". I'm not running any raid
> hardware, was planning on just software raid.

Yep. I'm referring to the drives, their firmware, cables, logic board,
its firmware, the power supply, power, etc. Btrfs is by nature
intolerant of corruption. Other file systems are more tolerant because
they don't know about it (although recent versions of XFS and ext4 are
now defaulting to checksummed metadata and journals).


>three drives glued
> together with "mkfs.btrfs -d raid5 /dev/sdb /dev/sdc /dev/sdd". Would
> this be a safer bet, or would You recommend running the sausage method
> instead, with "-d single" for safety ? I'm guessing that if one of the
> drives dies the data is completely lost
> Another variant I was considering is running a raid1 mirror on two of
> the drives and maybe a subvolume on the third, for less important
> stuff

RAID does not substantially reduce the chances of data loss. It's not
anything like a backup. It's an uptime enhancer. If you have backups,
and your primary storage dies, of course you can restore from backup
no problem, but it takes time and while the restore is happening,
you're not online - uptime is killed. If that's a negative, might want
to run RAID so you can keep working during the degraded period, and
instead of a restore you're doing a rebuild. But of course there is a
chance of failure during the degraded period. So you have to have a
backup anyway. At least with Btrfs/ZFS, there is another reason to run
with some replication like raid1 or raid5 and that's so that if
there's corruption or a bad sector, Btrfs doesn't just detect it, it
can fix it up with the good copy.

For what it's worth, make sure the drives have lower SCT ERC time than
the SCSI command timer. This is the same for Btrfs as it is for md and
LVM RAID. The command timer default is 30 seconds, and most drives
have SCT ERC disabled with very high recovery times well over 30
seconds. So either set SCT ERC to something like 70 deciseconds. Or
increase the command timer to something like 120 or 180 (either one is
absurdly high but what you want is for the drive to eventually give up
and report a discrete error message which Btrfs can do something
about, rather than do a SATA link reset in which case Btrfs can't do
anything about it).




-- 
Chris Murphy


Re: btrfs problems

2018-09-20 Thread Adrian Bastholm
Thanks a lot for the detailed explanation.
Aabout "stable hardware/no lying hardware". I'm not running any raid
hardware, was planning on just software raid. three drives glued
together with "mkfs.btrfs -d raid5 /dev/sdb /dev/sdc /dev/sdd". Would
this be a safer bet, or would You recommend running the sausage method
instead, with "-d single" for safety ? I'm guessing that if one of the
drives dies the data is completely lost
Another variant I was considering is running a raid1 mirror on two of
the drives and maybe a subvolume on the third, for less important
stuff

BR Adrian
On Thu, Sep 20, 2018 at 9:39 PM Chris Murphy  wrote:
>
> On Thu, Sep 20, 2018 at 11:23 AM, Adrian Bastholm  wrote:
> > On Mon, Sep 17, 2018 at 2:44 PM Qu Wenruo  wrote:
> >
> >>
> >> Then I strongly recommend to use the latest upstream kernel and progs
> >> for btrfs. (thus using Debian Testing)
> >>
> >> And if anything went wrong, please report asap to the mail list.
> >>
> >> Especially for fs corruption, that's the ghost I'm always chasing for.
> >> So if any corruption happens again (although I hope it won't happen), I
> >> may have a chance to catch it.
> >
> > You got it
> >> >
> >> >> Anyway, enjoy your stable fs even it's not btrfs
> >
> >> > My new stable fs is too rigid. Can't grow it, can't shrink it, can't
> >> > remove vdevs from it , so I'm planning a comeback to BTRFS. I guess
> >> > after the dust settled I realize I like the flexibility of BTRFS.
> >> >
> > I'm back to btrfs.
> >
> >> From the code aspect, the biggest difference is the chunk layout.
> >> Due to the ext* block group usage, each block group header (except some
> >> sparse bg) is always used, thus btrfs can't use them.
> >>
> >> This leads to highly fragmented chunk layout.
> >
> > The only thing I really understood is "highly fragmented" == not good
> > . I might need to google these "chunk" thingies
>
> Chunks are synonyms with block groups. They're like a super extent, or
> extent of extents.
>
> The block group is how Btrfs abstracts the logical address used most
> everywhere in Btrfs land, and device + physical location of extents.
> It's how a file is referenced only by on logical address, and doesn't
> need to know either where the extent is located, or how many copies
> there are. The block group allocation profile is what determines if
> there's one copy, duplicate copies, raid1, 10, 5, 6 copies of a chunk
> and where the copies are located. It's also fundamental to how device
> add, remove, replace, file system resize, and balance all interrelate.
>
>
> >> If your primary concern is to make the fs as stable as possible, then
> >> keep snapshots to a minimal amount, avoid any functionality you won't
> >> use, like qgroup, routinely balance, RAID5/6.
> >
> > So, is RAID5 stable enough ? reading the wiki there's a big fat
> > warning about some parity issues, I read an article about silent
> > corruption (written a while back), and chris says he can't recommend
> > raid56 to mere mortals.
>
> Depends on how you define stable. In recent kernels it's stable on
> stable hardware, i.e. no lying hardware (actually flushes when it
> claims it has), no power failures, and no failed devices. Of course
> it's designed to help protect against a clear loss of a device, but
> there's tons of stuff here that's just not finished including ejecting
> bad devices from the array like md and lvm raids will do. Btrfs will
> just keep trying, through all the failures. There are some patches to
> moderate this but I don't think they're merged yet.
>
> You'd also want to be really familiar with how to handle degraded
> operation, if you're going to depend on it, and how to replace a bad
> device. Last I refreshed my memory on it, it's advised to use "btrfs
> device add" followed by "btrfs device remove" for raid56; whereas
> "btrfs replace" is preferred for all other profiles. I'm not sure if
> the "btrfs replace" issues with parity raid were fixed.
>
> Metadata as raid56 shows a lot more problem reports than metadata
> raid1, so there's something goofy going on in those cases. I'm not
> sure how well understood they are. But other people don't have
> problems with it.
>
> It's worth looking through the archives about some things. Btrfs
> raid56 isn't exactly perfectly COW, there is read-modify-write code
> that means there can be overwrites. I vaguely recall that it's COW in
> the logical layer, but the physical writes can end up being RMW or not
> for sure COW.
>
>
>
> --
> Chris Murphy



-- 
Vänliga hälsningar / Kind regards,
Adrian Bastholm

``I would change the world, but they won't give me the sourcecode``


Re: btrfs problems

2018-09-20 Thread Chris Murphy
On Thu, Sep 20, 2018 at 11:23 AM, Adrian Bastholm  wrote:
> On Mon, Sep 17, 2018 at 2:44 PM Qu Wenruo  wrote:
>
>>
>> Then I strongly recommend to use the latest upstream kernel and progs
>> for btrfs. (thus using Debian Testing)
>>
>> And if anything went wrong, please report asap to the mail list.
>>
>> Especially for fs corruption, that's the ghost I'm always chasing for.
>> So if any corruption happens again (although I hope it won't happen), I
>> may have a chance to catch it.
>
> You got it
>> >
>> >> Anyway, enjoy your stable fs even it's not btrfs
>
>> > My new stable fs is too rigid. Can't grow it, can't shrink it, can't
>> > remove vdevs from it , so I'm planning a comeback to BTRFS. I guess
>> > after the dust settled I realize I like the flexibility of BTRFS.
>> >
> I'm back to btrfs.
>
>> From the code aspect, the biggest difference is the chunk layout.
>> Due to the ext* block group usage, each block group header (except some
>> sparse bg) is always used, thus btrfs can't use them.
>>
>> This leads to highly fragmented chunk layout.
>
> The only thing I really understood is "highly fragmented" == not good
> . I might need to google these "chunk" thingies

Chunks are synonyms with block groups. They're like a super extent, or
extent of extents.

The block group is how Btrfs abstracts the logical address used most
everywhere in Btrfs land, and device + physical location of extents.
It's how a file is referenced only by on logical address, and doesn't
need to know either where the extent is located, or how many copies
there are. The block group allocation profile is what determines if
there's one copy, duplicate copies, raid1, 10, 5, 6 copies of a chunk
and where the copies are located. It's also fundamental to how device
add, remove, replace, file system resize, and balance all interrelate.


>> If your primary concern is to make the fs as stable as possible, then
>> keep snapshots to a minimal amount, avoid any functionality you won't
>> use, like qgroup, routinely balance, RAID5/6.
>
> So, is RAID5 stable enough ? reading the wiki there's a big fat
> warning about some parity issues, I read an article about silent
> corruption (written a while back), and chris says he can't recommend
> raid56 to mere mortals.

Depends on how you define stable. In recent kernels it's stable on
stable hardware, i.e. no lying hardware (actually flushes when it
claims it has), no power failures, and no failed devices. Of course
it's designed to help protect against a clear loss of a device, but
there's tons of stuff here that's just not finished including ejecting
bad devices from the array like md and lvm raids will do. Btrfs will
just keep trying, through all the failures. There are some patches to
moderate this but I don't think they're merged yet.

You'd also want to be really familiar with how to handle degraded
operation, if you're going to depend on it, and how to replace a bad
device. Last I refreshed my memory on it, it's advised to use "btrfs
device add" followed by "btrfs device remove" for raid56; whereas
"btrfs replace" is preferred for all other profiles. I'm not sure if
the "btrfs replace" issues with parity raid were fixed.

Metadata as raid56 shows a lot more problem reports than metadata
raid1, so there's something goofy going on in those cases. I'm not
sure how well understood they are. But other people don't have
problems with it.

It's worth looking through the archives about some things. Btrfs
raid56 isn't exactly perfectly COW, there is read-modify-write code
that means there can be overwrites. I vaguely recall that it's COW in
the logical layer, but the physical writes can end up being RMW or not
for sure COW.



-- 
Chris Murphy


Re: inline extents

2018-09-20 Thread Theodore Y. Ts'o
On Wed, Sep 19, 2018 at 09:18:16PM -0600, Chris Murphy wrote:
> > ext4 has inline data, too, so there's every chance grub will corrupt
> > ext4 filesystems with tit's wonderful new feature. I'm not sure if
> > the ext4 metadata cksums cover the entire inode and inline data, but
> > if they do it's the same problem as btrfs.
> 
> I don't see inline used with a default mkfs, but I do see metadata_csum

The grub environment block is 1024 bytes, so unless you are using a 4k
inode size (the default is 256 bytes), the grub environment block
won't be an inline data file.  So in practice, this won't be a
problem.  So in practice this won't be a problem for ext4 today.

There are ways that grub can query the file system (using FIEMAP or
FS_IOC_GETFLAGS) to see whether or not the file is an inline file.
For xfs grub would also need to make sure the file isn't reflinked
file (which can also be queried using FIEMAP).  That's fine for
inline, since a file won't magically change from stored in a block
versus inline, so grub can check this at grub setup time.  But the
problem with reflink is that a file that was originally using a block
exclusively could later have that block be shared by another file, at
which point it's only safe to write the block using copy-on-write.

> I know some of the file systems have reserve areas for bootloader
> stuff. I'm not sure if that's preferred over bootloaders just getting
> their own partition and controlling it stem to stern however they
> want.

Ext4 had reserved the bootloader inode, and we have defined a way for
a bootloader to access it.  The original intent was that not it just
be fore the environment block, but also the grub stage 1.5 or stage 2
loader.  This would allow grub to just use a fixed list of blocks or
extents, and not need to understand the ext4 extents structures.  We
added way this when, because we anticipated this would be easier for
grub than having it learn how to read ext4 extents structures.
Instead, the bootloader would at setup time use the EXT4_IOC_SWAP_BOOT
ioctl and then use a fixed list of blocks or block extents like LILO
used to do.

Of course, grub exceeded expectations and actually learned how to read
ext4 file systems, and probably we (the ext4 developers) should have
reached out to GRUB folks.

Cheers,

- Ted


Re: [PATCH v8 0/6] Btrfs: implement swap file support

2018-09-20 Thread Omar Sandoval
On Thu, Sep 20, 2018 at 07:22:55PM +0200, David Sterba wrote:
> On Wed, Sep 19, 2018 at 10:02:11PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > Changes from v7 [1]:
> > 
> > - Expanded a few commit messages
> > - Added Johannes' acked-by on patches 1 and 2
> > - Rebased on v4.19-rc4
> 
> I've sent my comments, it's mostly about the usability or small
> enhancements. As you've got acks from MM people, I hope it would be ok
> if I add this series to for-next so we can give it some testing.

That'd be great. Feel free to grab it from my git tree
(https://github.com/osandov/linux/tree/btrfs-swap) if you want the
version with your comments addressed.

> The MM patches would better go separately to 4.20 via the mm tree.  I
> did only build tests so 4.20 target is still feasible but given that
> it's rc4 it's a bit too close. There are some limitations posed by the
> swapfiles so I'd like to have a chance to do some actual tests myself
> and check the usability status.

4.20 would be nice, but I could live with 4.21. I'll just be backporting
it to our internal kernel here anyways ;) Let me know how the tests go
and which way you want to go.

Thanks! It's nice to finally have the end in sight for this series, it's
almost 4 years old, although it's changed quite a bit since
https://lkml.org/lkml/2014/11/17/141.


Re: [PATCH 0/5] rb_first to rb_first_cached conversion

2018-09-20 Thread Liu Bo
On Thu, Sep 20, 2018 at 06:49:59PM +0200, David Sterba wrote:
> On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote:
> > Several structs in btrfs are using rb_first() in a while loop, it'd be
> > more efficient to do this with rb_first_cached() which has the O(1)
> > complexity.
> > 
> > This patch set updates five structs which may have a large rb tree in
> > practice
> > 
> > Liu Bo (5):
> >   Btrfs: href_root: use rb_first_cached
> >   Btrfs: href->ref_tree: use rb_first_cached
> >   Btrfs: delayed_items: use rb_first_cached
> >   Btrfs: extent_map: use rb_first_cached
> >   Btrfs: preftree: use rb_first_cached
> 
> For the record, patches have been merged to misc-next. I've changed the
> subject titles to
> 
> Btrfs: delayed-refs: use rb_first_cached for href_root
> Btrfs: delayed-refs: use rb_first_cached for ref_tree
> Btrfs: delayed-inode: use rb_first_cached for ins_root and del_root
> Btrfs: extent_map: use rb_first_cached
> Btrfs: preftree: use rb_first_cached
> 
> added Holger's tested-by and my reviewed-by.

Thanks so much for the efforts.

thanks,
-liubo


Re: btrfs send hangs after partial transfer and blocks all IO

2018-09-20 Thread Chris Murphy
On Wed, Sep 19, 2018 at 1:41 PM, Jürgen Herrmann  wrote:
> Am 13.9.2018 14:35, schrieb Nikolay Borisov:
>>
>> On 13.09.2018 15:30, Jürgen Herrmann wrote:
>>>
>>> OK, I will install kdump later and perform a dump after the hang.
>>>
>>> One more noob question beforehand: does this dump contain sensitive
>>> information, for example the luks encryption key for the disk etc? A
>>> Google search only brings up one relevant search result which can only
>>> be viewed with a redhat subscription...
>>
>>
>>
>> So a kdump will dump the kernel memory so it's possible that the LUKS
>> encryption keys could be extracted from that image. Bummer, it's
>> understandable why you wouldn't want to upload it :). In this case you'd
>> have to install also the 'crash' utility to open the crashdump and
>> extract the calltrace of the btrfs process. The rough process should be :
>>
>>
>> crash 'path to vm linux' 'path to vmcore file', then once inside the
>> crash utility :
>>
>> set , you can acquire the pid by issuing 'ps'
>> which will give you a ps-like output of all running processes at the
>> time of crash. After the context has been set you can run 'bt' which
>> will give you a backtrace of the send process.
>>
>>
>>
>>>
>>> Best regards,
>>> Jürgen
>>>
>>> Am 13. September 2018 14:02:11 schrieb Nikolay Borisov
>>> :
>>>
 On 13.09.2018 14:50, Jürgen Herrmann wrote:
>
> I was echoing "w" to /proc/sysrq_trigger every 0.5s which did work also
> after the hang because I started the loop before the hang. The dmesg
> output should show the hanging tasks from second 346 on or so. Still
> not
> useful?
>

 So from 346 it's evident that transaction commit is waiting for
 commit_root_sem to be acquired. So something else is holding it and not
 giving the transaction chance to finish committing. Now the only place
 where send acquires this lock is in find_extent_clone around the  call
 to extent_from_logical. The latter basically does an extent tree search
 and doesn't loop so can't possibly deadlock. Furthermore I don't see any
 userspace processes being hung in kernel space.

 Additionally looking at the userspace processes they indicate that
 find_extent_clone has finished and are blocked in send_write_or_clone
 which does the write. But I guess this actually happens before the hang.


 So at this point without looking at the stacktrace of the btrfs send
 process after the hung has occurred I don't think much can be done
>>>
>>>
> I know this is probably not the correct list to ask this question but maybe
> someone of the devs can point me to the right list?
>
> I cannot get kdump to work. The crashkernel is loaded and everything is
> setup for it afaict. I asked a question on this over at stackexchange but no
> answer yet.
> https://unix.stackexchange.com/questions/469838/linux-kdump-does-not-boot-second-kernel-when-kernel-is-crashing
>
> So i did a little digging and added some debug printk() statements to see
> whats going on and it seems that panic() is never called. maybe the second
> stack trace is the reason?
> Screenshot is here: https://t-5.eu/owncloud/index.php/s/OegsikXo4VFLTJN
>
> Could someone please tell me where I can report this problem and get some
> help on this topic?


Try kexec mailing list. They handle kdump.

http://lists.infradead.org/mailman/listinfo/kexec



-- 
Chris Murphy


Re: btrfs problems

2018-09-20 Thread Adrian Bastholm
On Mon, Sep 17, 2018 at 2:44 PM Qu Wenruo  wrote:

>
> Then I strongly recommend to use the latest upstream kernel and progs
> for btrfs. (thus using Debian Testing)
>
> And if anything went wrong, please report asap to the mail list.
>
> Especially for fs corruption, that's the ghost I'm always chasing for.
> So if any corruption happens again (although I hope it won't happen), I
> may have a chance to catch it.

You got it
> >
> >> Anyway, enjoy your stable fs even it's not btrfs

> > My new stable fs is too rigid. Can't grow it, can't shrink it, can't
> > remove vdevs from it , so I'm planning a comeback to BTRFS. I guess
> > after the dust settled I realize I like the flexibility of BTRFS.
> >
I'm back to btrfs.

> From the code aspect, the biggest difference is the chunk layout.
> Due to the ext* block group usage, each block group header (except some
> sparse bg) is always used, thus btrfs can't use them.
>
> This leads to highly fragmented chunk layout.

The only thing I really understood is "highly fragmented" == not good
. I might need to google these "chunk" thingies

> We doesn't have error report about such layout yet, but if you want
> everything to be as stable as possible, I still recommend to use a newly
> created fs.

I guess I'll stick with ext4 on the rootfs

> > Another thing is I'd like to see a "first steps after getting started
> > " section in the wiki. Something like take your first snapshot, back
> > up, how to think when running it - can i just set some cron jobs and
> > forget about it, or does it need constant attention, and stuff like
> > that.
>
> There are projects do such things automatically, like snapper.
>
> If your primary concern is to make the fs as stable as possible, then
> keep snapshots to a minimal amount, avoid any functionality you won't
> use, like qgroup, routinely balance, RAID5/6.

So, is RAID5 stable enough ? reading the wiki there's a big fat
warning about some parity issues, I read an article about silent
corruption (written a while back), and chris says he can't recommend
raid56 to mere mortals.

> And keep the necessary btrfs specific operations to minimal, like
> subvolume/snapshot (and don't keep too many snapshots, say over 20),
> shrink, send/receive.
>
> Thanks,
> Qu
>
> >
> > BR Adrian
> >
> >
>


-- 
Vänliga hälsningar / Kind regards,
Adrian Bastholm

``I would change the world, but they won't give me the sourcecode``


Re: [PATCH v8 6/6] Btrfs: support swap files

2018-09-20 Thread Omar Sandoval
On Thu, Sep 20, 2018 at 07:15:41PM +0200, David Sterba wrote:
> On Wed, Sep 19, 2018 at 10:02:17PM -0700, Omar Sandoval wrote:
> > From: Omar Sandoval 
> > 
> > Btrfs has not allowed swap files since commit 35054394c4b3 ("Btrfs: stop
> > providing a bmap operation to avoid swapfile corruptions"). However, now
> > that the proper restrictions are in place, Btrfs can support swap files
> > through the swap file a_ops, similar to iomap in commit 67482129cdab
> > ("iomap: add a swapfile activation function").
> > 
> > For Btrfs, activation needs to make sure that the file can be used as a
> > swap file, which currently means that it must be fully allocated as
> > nocow with no compression on one device. It must also do the proper
> > tracking so that ioctls will not interfere with the swap file.
> > Deactivation clears this tracking.
> > 
> > Signed-off-by: Omar Sandoval 
> > ---
> >  fs/btrfs/inode.c | 317 +++
> >  1 file changed, 317 insertions(+)
> > 
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 3ea5339603cf..0586285b1d9f 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c

[snip]

> > +static int btrfs_swap_activate(struct swap_info_struct *sis, struct file 
> > *file,
> > +  sector_t *span)
> > +{
> > +   struct inode *inode = file_inode(file);
> > +   struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> > +   struct extent_io_tree *io_tree = _I(inode)->io_tree;
> > +   struct extent_state *cached_state = NULL;
> > +   struct extent_map *em = NULL;
> > +   struct btrfs_device *device = NULL;
> > +   struct btrfs_swap_info bsi = {
> > +   .lowest_ppage = (sector_t)-1ULL,
> > +   };
> > +   int ret = 0;
> > +   u64 isize = inode->i_size;
> > +   u64 start;
> > +
> > +   /*
> > +* If the swap file was just created, make sure delalloc is done. If the
> > +* file changes again after this, the user is doing something stupid and
> > +* we don't really care.
> > +*/
> > +   ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
> > +   if (ret)
> > +   return ret;
> > +
> > +   /*
> > +* The inode is locked, so these flags won't change after we check them.
> > +*/
> > +   if (BTRFS_I(inode)->flags & BTRFS_INODE_COMPRESS) {
> > +   btrfs_info(fs_info, "swapfile must not be compressed");
> > +   return -EINVAL;
> > +   }
> > +   if (!(BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW)) {
> > +   btrfs_info(fs_info, "swapfile must not be copy-on-write");
> > +   return -EINVAL;
> > +   }
> 
> I wonder if we should also explicitly check for the checkums flag, ie.
> that NODATASUM is present. Right now it's bound to NODATACOW, but as
> with other sanity checks, it does not hurt to have it here.
> 
> > +
> > +   /*
> > +* Balance or device remove/replace/resize can move stuff around from
> > +* under us. The EXCL_OP flag makes sure they aren't running/won't run
> > +* concurrently while we are mapping the swap extents, and
> > +* fs_info->swapfile_pins prevents them from running while the swap file
> > +* is active and moving the extents. Note that this also prevents a
> > +* concurrent device add which isn't actually necessary, but it's not
> > +* really worth the trouble to allow it.
> > +*/
> > +   if (test_and_set_bit(BTRFS_FS_EXCL_OP, _info->flags))
> > +   return -EBUSY;
> 
> This could be also accompanied by a message, "why does not my swapfile
> activate?" -> "there's an exclusive operation running". I've checked if
> there are similar messages for the other exclusive ops. There are.

Sounds good. I addressed all of your comments and pushed to
https://github.com/osandov/linux/tree/btrfs-swap. The only thing I
didn't change was the btrfs_info about not being able to relocate an
active swapfile. I think it makes sense as btrfs_info since we already
log every block group we are relocating as info (see
describe_relocation()).


Re: [PATCH v8 0/6] Btrfs: implement swap file support

2018-09-20 Thread David Sterba
On Wed, Sep 19, 2018 at 10:02:11PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> Changes from v7 [1]:
> 
> - Expanded a few commit messages
> - Added Johannes' acked-by on patches 1 and 2
> - Rebased on v4.19-rc4

I've sent my comments, it's mostly about the usability or small
enhancements. As you've got acks from MM people, I hope it would be ok
if I add this series to for-next so we can give it some testing.

The MM patches would better go separately to 4.20 via the mm tree.  I
did only build tests so 4.20 target is still feasible but given that
it's rc4 it's a bit too close. There are some limitations posed by the
swapfiles so I'd like to have a chance to do some actual tests myself
and check the usability status.


Re: [PATCH v8 6/6] Btrfs: support swap files

2018-09-20 Thread David Sterba
On Wed, Sep 19, 2018 at 10:02:17PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> Btrfs has not allowed swap files since commit 35054394c4b3 ("Btrfs: stop
> providing a bmap operation to avoid swapfile corruptions"). However, now
> that the proper restrictions are in place, Btrfs can support swap files
> through the swap file a_ops, similar to iomap in commit 67482129cdab
> ("iomap: add a swapfile activation function").
> 
> For Btrfs, activation needs to make sure that the file can be used as a
> swap file, which currently means that it must be fully allocated as
> nocow with no compression on one device. It must also do the proper
> tracking so that ioctls will not interfere with the swap file.
> Deactivation clears this tracking.
> 
> Signed-off-by: Omar Sandoval 
> ---
>  fs/btrfs/inode.c | 317 +++
>  1 file changed, 317 insertions(+)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 3ea5339603cf..0586285b1d9f 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -27,6 +27,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include "ctree.h"
>  #include "disk-io.h"
> @@ -10488,6 +10489,320 @@ void btrfs_set_range_writeback(struct 
> extent_io_tree *tree, u64 start, u64 end)
>   }
>  }
>  
> +/*
> + * Add an entry indicating a block group or device which is pinned by a
> + * swapfile. Returns 0 on success, 1 if there is already an entry for it, or 
> a
> + * negative errno on failure.
> + */
> +static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr,
> +   bool is_block_group)
> +{
> + struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> + struct btrfs_swapfile_pin *sp, *entry;
> + struct rb_node **p;
> + struct rb_node *parent = NULL;
> +
> + sp = kmalloc(sizeof(*sp), GFP_NOFS);
> + if (!sp)
> + return -ENOMEM;
> + sp->ptr = ptr;
> + sp->inode = inode;
> + sp->is_block_group = is_block_group;
> +
> + spin_lock(_info->swapfile_pins_lock);
> + p = _info->swapfile_pins.rb_node;
> + while (*p) {
> + parent = *p;
> + entry = rb_entry(parent, struct btrfs_swapfile_pin, node);
> + if (sp->ptr < entry->ptr ||
> + (sp->ptr == entry->ptr && sp->inode < entry->inode)) {
> + p = &(*p)->rb_left;
> + } else if (sp->ptr > entry->ptr ||
> +(sp->ptr == entry->ptr && sp->inode > entry->inode)) 
> {
> + p = &(*p)->rb_right;
> + } else {
> + spin_unlock(_info->swapfile_pins_lock);
> + kfree(sp);
> + return 1;
> + }
> + }
> + rb_link_node(>node, parent, p);
> + rb_insert_color(>node, _info->swapfile_pins);
> + spin_unlock(_info->swapfile_pins_lock);
> + return 0;
> +}
> +
> +/* Free all of the entries pinned by this swapfile. */
> +static void btrfs_free_swapfile_pins(struct inode *inode)
> +{
> + struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
> + struct btrfs_swapfile_pin *sp;
> + struct rb_node *node, *next;
> +
> + spin_lock(_info->swapfile_pins_lock);
> + node = rb_first(_info->swapfile_pins);
> + while (node) {
> + next = rb_next(node);
> + sp = rb_entry(node, struct btrfs_swapfile_pin, node);
> + if (sp->inode == inode) {
> + rb_erase(>node, _info->swapfile_pins);
> + if (sp->is_block_group)
> + btrfs_put_block_group(sp->ptr);
> + kfree(sp);
> + }
> + node = next;
> + }
> + spin_unlock(_info->swapfile_pins_lock);
> +}
> +
> +struct btrfs_swap_info {
> + u64 start;
> + u64 block_start;
> + u64 block_len;
> + u64 lowest_ppage;
> + u64 highest_ppage;
> + unsigned long nr_pages;
> + int nr_extents;
> +};
> +
> +static int btrfs_add_swap_extent(struct swap_info_struct *sis,
> +  struct btrfs_swap_info *bsi)
> +{
> + unsigned long nr_pages;
> + u64 first_ppage, first_ppage_reported, next_ppage;
> + int ret;
> +
> + first_ppage = ALIGN(bsi->block_start, PAGE_SIZE) >> PAGE_SHIFT;
> + next_ppage = ALIGN_DOWN(bsi->block_start + bsi->block_len,
> + PAGE_SIZE) >> PAGE_SHIFT;
> +
> + if (first_ppage >= next_ppage)
> + return 0;
> + nr_pages = next_ppage - first_ppage;
> +
> + first_ppage_reported = first_ppage;
> + if (bsi->start == 0)
> + first_ppage_reported++;
> + if (bsi->lowest_ppage > first_ppage_reported)
> + bsi->lowest_ppage = first_ppage_reported;
> + if (bsi->highest_ppage < (next_ppage - 1))
> + bsi->highest_ppage = next_ppage - 1;
> +
> + ret = add_swap_extent(sis, bsi->nr_pages, nr_pages, first_ppage);
> + if 

Re: [PATCH v8 5/6] Btrfs: rename get_chunk_map() and make it non-static

2018-09-20 Thread David Sterba
On Wed, Sep 19, 2018 at 10:02:16PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> The Btrfs swap code is going to need it, so give it a btrfs_ prefix and
> make it non-static.
> 
> Reviewed-by: Nikolay Borisov 
> Signed-off-by: Omar Sandoval 
> ---
>  fs/btrfs/volumes.c | 29 ++---
>  fs/btrfs/volumes.h |  2 ++
>  2 files changed, 20 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index a2761395ed22..fe66b635c023 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -2714,8 +2714,15 @@ static int btrfs_del_sys_chunk(struct btrfs_fs_info 
> *fs_info, u64 chunk_offset)
>   return ret;
>  }
>  
> -static struct extent_map *get_chunk_map(struct btrfs_fs_info *fs_info,
> - u64 logical, u64 length)
> +/**

Minor nit, I'll fix that eventually: the /** is for kernel doc but as the
function is internal to btrfs and there's no documentation generated
from that, it's better to use plain /* . The parameter formatting
according to the kernel doc documentation is fine but does not strictly
need to conform as long as it's human readable.

Otherwise patch is ok.


Re: [PATCH v8 4/6] Btrfs: prevent ioctls from interfering with a swap file

2018-09-20 Thread David Sterba
On Wed, Sep 19, 2018 at 10:02:15PM -0700, Omar Sandoval wrote:
> --- a/fs/btrfs/dev-replace.c
> +++ b/fs/btrfs/dev-replace.c
> @@ -414,6 +414,14 @@ int btrfs_dev_replace_start(struct btrfs_fs_info 
> *fs_info,
>   if (ret)
>   return ret;
>  
> + if (btrfs_pinned_by_swapfile(fs_info, src_device)) {
> + btrfs_info_in_rcu(fs_info,
> +   "cannot replace device %s (devid %llu) due to 
> active swapfile",

Please un-indent the messages so they fit under 80 columns, the
arguments can stay aligned as before.

> +   btrfs_dev_name(src_device),
> +   src_device->devid);
> + return -ETXTBSY;

I think all the error messages before returning ETXTBUSY or EINVAL
should be btrfs_error or btrfs_warning. The info level can be filtered
out and easily lost in the logs. If there's user intention to activate
the swap it needs to be resolved.

> + }
> +
>   ret = btrfs_init_dev_replace_tgtdev(fs_info, tgtdev_name,
>   src_device, _device);
>   if (ret)
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -3626,10 +3634,15 @@ static int __btrfs_balance(struct btrfs_fs_info 
> *fs_info)
>  
>   ret = btrfs_relocate_chunk(fs_info, found_key.offset);
>   mutex_unlock(_info->delete_unused_bgs_mutex);
> - if (ret && ret != -ENOSPC)
> - goto error;
>   if (ret == -ENOSPC) {
>   enospc_errors++;
> + } else if (ret == -ETXTBSY) {
> + btrfs_info(fs_info,
> +"skipping relocation of block group %llu due 
> to active swapfile",

I wonder if this should be visible even on the info level, ie. so it
could be debug. The swap file can be long lived, the balance could be
run several times during that period.

> +found_key.offset);
> + ret = 0;
> + } else if (ret) {
> + goto error;
>   } else {
>   spin_lock(_info->balance_lock);
>   bctl->stat.completed++;


Re: [PATCH 0/5] rb_first to rb_first_cached conversion

2018-09-20 Thread David Sterba
On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote:
> Several structs in btrfs are using rb_first() in a while loop, it'd be
> more efficient to do this with rb_first_cached() which has the O(1)
> complexity.
> 
> This patch set updates five structs which may have a large rb tree in
> practice
> 
> Liu Bo (5):
>   Btrfs: href_root: use rb_first_cached
>   Btrfs: href->ref_tree: use rb_first_cached
>   Btrfs: delayed_items: use rb_first_cached
>   Btrfs: extent_map: use rb_first_cached
>   Btrfs: preftree: use rb_first_cached

For the record, patches have been merged to misc-next. I've changed the
subject titles to

Btrfs: delayed-refs: use rb_first_cached for href_root
Btrfs: delayed-refs: use rb_first_cached for ref_tree
Btrfs: delayed-inode: use rb_first_cached for ins_root and del_root
Btrfs: extent_map: use rb_first_cached
Btrfs: preftree: use rb_first_cached

added Holger's tested-by and my reviewed-by.


Re: cannot mount btrfs as root

2018-09-20 Thread Zbigniew 'zibi' Jarosik
On 20 September 2018 at 08:20, Zbigniew 'zibi' Jarosik  wrote:
> On 19 September 2018 at 15:54, Qu Wenruo  wrote:
>>
>> Does the iniramfs do a "btrfs dev scan" to make populate btrfs devices
>> lists?
>
> This solved problem. Looks like btrfs scans disks to early, before
> bcache initializes. I need to debug initrfamfs scripts.

There is race condition - bcache initalizes slower than expected, so
btrfs cannot find devices, because they are not ready.

rootdelay=10 in grub solved problem temporary, need to write
boot-script that will handle it and wait until bcache devices will be
ready to use.


-- 
Zbigniew 'zibi' Jarosik
cell: +48 667956686
jid: tap...@gmail.com
gg: 2830
http://zibi.nora.pl/


Re: [PATCH 0/5] rb_first to rb_first_cached conversion

2018-09-20 Thread David Sterba
On Fri, Sep 14, 2018 at 12:14:57PM -0700, Liu Bo wrote:
> On Fri, Sep 14, 2018 at 05:11:02PM +0200, David Sterba wrote:
> > On Tue, Sep 11, 2018 at 11:31:49AM -0700, Liu Bo wrote:
> > > On Tue, Sep 11, 2018 at 05:34:03PM +0200, David Sterba wrote:
> > > > On Thu, Aug 23, 2018 at 03:51:48AM +0800, Liu Bo wrote:
> > > > > Several structs in btrfs are using rb_first() in a while loop, it'd be
> > > > > more efficient to do this with rb_first_cached() which has the O(1)
> > > > > complexity.
> > > > 
> > > > We'd like to see some numbers, though just by reading the code I think
> > > > there's going to be a noticeable improvement for some workloads.
> > > > 
> > > > There's a common pattern:
> > > > 
> > > > while(node = rb_first) {
> > > > entry = rb_entry(node)
> > > > next = rb_next(node)
> > > > rb_erase(node)
> > > > cleanup(entry)
> > > > }
> > > > 
> > > > rb_first needs to traverse the tree up to logN depth, rb_erase can
> > > > completely reshuffle the tree. With the caching we'll skip the traversal
> > > > in rb_first. That's a cached memory access vs looped pointer
> > > > dereference trade-off that IMHO has a clear winner.
> > > > 
> > > > As the pattern in many cases traverses the whole tree and removes all
> > > > entries, ideally we could get rid of the rebalancing too. The entries
> > > > would be cleaned up in postorder way, ie. reset the data and kfree in
> > > > the end. This requires that order of the freeing does not matter, which
> > > > might no apply to some trees.
> > > 
> > > The order of freeing does not matter, but we need the tree to return
> > > the correct next node to free, IOW, we have to maintain the rbtree
> > > until the last two nodes.
> > > 
> > > > 
> > > > I did not find suitable rbtree functions or helpers for that,
> > > > rbtree_postorder_for_each_entry_safe does not allow rb_erase and
> > > > rb_erase itself forces the rebalancing internally. But I think we can
> > > > implement that.
> > > > 
> > > > We could possibly use rb_next_postorder that makes sure that all
> > > > children are traversed before the parent so this is at least something
> > > > that could considerd.
> > > > 
> > > 
> > > Qu was inquiring about the same thing, I haven't got time to dig it
> > > further.
> > > 
> > > The question we need to answer is that whether we care about how fast
> > > destruction can make, as some are in critical paths while some are
> > > not.
> > 
> > Yeah, I take __btrfs_return_cluster_to_free_space as an example where
> > the more efficient tree destruction would shorten the time where a lock
> > is held.
> > 
> > In other cases it might complicate the code too much, though the
> > performance is not critical, eg. at umount time. Although, it'd be good
> > to optimize that too now that we know how.
> > 
> > And in the remaining cases it's not possible to avoid rb_erase. I had a
> > closer look at btrfs_cleanup_defrag_inodes and btrfs_put_tree_mod_seq.
> > Based on that I'd like to add this series to for-next, the first node
> > caching is clear enough and safe. We can do the other optimizations
> > later.
> 
> This needs more fine-grained debugging to see what's the cost
> lies on.
> 
> About the perf. number of rb_cached_node, I measured the time spent in
> extent map loop in btrfs_evict_inode(),
> 
> evict_inode_truncate_pages()
> while (!RB_EMPTY_ROOT(_tree->map)) {
>  rb_first();
>rb_erase();
> }
> 
> 
> for a em tree containing 10,000 em,

The maximum depth of the rbtree for this number of nodes is roughly
2 * lg(1) = 2 * 14 = 28. This is not an average case so it's going
to be better in practice but still 10+ pointer dereferences can be
exchanged with a pointer update.

> - with rb_first_cached, the loop takes 4880454 ns,
> - with rb_first, the loop takes 4584148 ns,
> 
> looks like the difference is not that much, ~6%.  After all it's about
> scalability, the more rb tree gets, the more rb_first_cached() wins.

Yeah, there's some difference but I think the cache effects will make
the measurements hard to estimate, so I take the numbers as a sort of
confirmation that there's not going to be a big difference.

The delayed refs and delayed inode can get some speedup from that as
there's usually only rb_first without rb_erase.

I'm going to update the first patch with the analysis we did and
reference the patch from the others so we'll have a starting point for
further optimizations and can verify if the described expected speedups
happen.


Re: Punch hole on full fs

2018-09-20 Thread Qu Wenruo


On 2018/9/20 下午5:04, Anand Jain wrote:
> 
> 
> On 09/20/2018 04:45 PM, Qu Wenruo wrote:
>>
>>
>> On 2018/9/20 下午1:26, anand.j...@oracle.com wrote:
>>>
>>>
>>> Test script [1] tries to punch hole on a full FS and it works fine as
>>> long as the hole size and the offset is aligned with the sectorsize and
>>> the extent, so that it could just drop the relevant extent to create the
>>> hole.
>>>
>>> The reason why this test fails for non aligned hole size and offset is
>>> because, as it rightfully tries to truncate the non aligned extent at
>>> the front and back of the hole, it tries to create new extent which ends
>>> up with ENOSPC.
>>
>> This is what btrfs should do, I don't see any problem though.
>>
>>>
>>> Any idea, how do we solve this?
>>
>> I don't really think this is a problem though.
>>
>> As long as we need to do data COW, such ENOSPC would happen.
> 
> It happens even with '-o max_inline=0 -o nodatacow' so certainly there
> is some problem, but looking at btrfs_punch_hole() it apparently looks
> like there is no way to fix this as sub-sectorsize hole-size depends on
> the truncate which needs to alloc space.

Forgot to mention, in your test script, it only fills data space, so we
should still have enough free metadata space to fulfill fs/extent/root
tree operations.

Although the problem is, btrfs_truncate_block() will call
btrfs_delalloc_reserve_space() even the inode has NODATACOW flag.
(buffered write part does the extra check, although not completely
perfect check).

So in short, you could still fix it false ENOSPC problem by enhance
btrfs_truncate_block() to follow NODATACOW check.

Thanks,
Qu

> 
>>>
>>> xfs is fine.
>>
>> Of course xfs is fine, XFS defaults to do NODATACOW, in contrast to
>> btrfs' default DATACOW.
>>
>> Have you tried to do the same when the file is reflinked?
> 
>  No but I presume it will rightfully fail.
> 
> Thanks, -Anand
> 
>> Thanks,
>> Qu
>>
>>>
>>> [1]
>>> cat ./punch-hole-on-full-fs
>>> 
>>> cleanup()
>>> {
>>>  umount /dev/sdb > /dev/null 2>&1
>>>  btrfs_reload
>>> }
>>>
>>> full_fs_setup()
>>> {
>>>  btrfs-dyndbg disable
>>>  mkfs.$fs -b 200M -fq $mkfs_options /dev/sdb || exit
>>>  mount $mount_options /dev/sdb /mnt/scratch || exit
>>>  dd status=none if=/dev/zero of=/mnt/scratch/filler bs=512 >
>>> /dev/null 2>&1
>>> }
>>>
>>> test()
>>> {
>>>  cleanup
>>>  full_fs_setup
>>>
>>>  btrfs-dyndbg enable
>>>  echo " fallocate -p -o 0 -l $punch_hole_size
>>> /mnt/scratch/filler
>>> "
>>>  fallocate -p -o 0 -l $punch_hole_size /mnt/scratch/filler
>>> }
>>>
>>> fs=btrfs; mkfs_options=""; mount_options="";
>>> [[ $1 ]] || { echo "usage: $0 "; exit; }
>>> punch_hole_size=$1 && test
>>> 
>>>
>>> ./punch-hole-on-full-fs 512
>>> -- fallocate -p -o 0 -l 512 /mnt/scratch/filler --
>>> fallocate: /mnt/scratch/filler: fallocate failed: No space left on
>>> device
>>>
>>> ./punch-hole-on-full-fs 8000
>>> -- fallocate -p -o 0 -l 8000 /mnt/scratch/filler --
>>> fallocate: /mnt/scratch/filler: fallocate failed: No space left on
>>> device
>>>
>>>
>>> Thanks, Anand
>>



signature.asc
Description: OpenPGP digital signature


Re: [PATCH V5 RESEND] Btrfs: enchanse raid1/10 balance heuristic

2018-09-20 Thread Timofey Titovets
чт, 20 сент. 2018 г. в 12:05, Peter Becker :
>
> i like the idea.
> do you have any benchmarks for this change?
>
> the general logic looks good for me.

https://patchwork.kernel.org/patch/10137909/

>
> Tested-by: Dmitrii Tcvetkov 
>
> Benchmark summary (arithmetic mean of 3 runs):
> Mainline Patch
> --
> RAID1 | 18.9 MiB/s | 26.5 MiB/s
> RAID10 | 30.7 MiB/s | 30.7 MiB/s


> fio configuration:
> [global]
> ioengine=libaio
> buffered=0
> direct=1
> bssplit=32k/100
> size=8G
> directory=/mnt/
> iodepth=16
> time_based
> runtime=900
>
> [test-fio]
> rw=randread
>
> All tests were run on 4 HDD btrfs filesystem in a VM with 4 Gb
> of ram on idle host. Full results attached to the email.

Also:
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg71758.html
- - -
So, IIRC its works at least.


Re: Punch hole on full fs

2018-09-20 Thread Qu Wenruo


On 2018/9/20 下午5:04, Anand Jain wrote:
> 
> 
> On 09/20/2018 04:45 PM, Qu Wenruo wrote:
>>
>>
>> On 2018/9/20 下午1:26, anand.j...@oracle.com wrote:
>>>
>>>
>>> Test script [1] tries to punch hole on a full FS and it works fine as
>>> long as the hole size and the offset is aligned with the sectorsize and
>>> the extent, so that it could just drop the relevant extent to create the
>>> hole.
>>>
>>> The reason why this test fails for non aligned hole size and offset is
>>> because, as it rightfully tries to truncate the non aligned extent at
>>> the front and back of the hole, it tries to create new extent which ends
>>> up with ENOSPC.
>>
>> This is what btrfs should do, I don't see any problem though.
>>
>>>
>>> Any idea, how do we solve this?
>>
>> I don't really think this is a problem though.
>>
>> As long as we need to do data COW, such ENOSPC would happen.
> 
> It happens even with '-o max_inline=0 -o nodatacow' so certainly there
> is some problem, but looking at btrfs_punch_hole() it apparently looks
> like there is no way to fix this as sub-sectorsize hole-size depends on
> the truncate which needs to alloc space.

Well, at least in my test, that's not the case.

I did a pretty simpler test, and for NODATACOW inode, the sub-sectorsize
part didn't get COWed, just as what we expect:

# mkfs.btrfs -f $dev
# mount $dev -o nodatacow $dev $mnt
# xfs_io -f -c "pwrite 0 16K" -c sync $mnt/file
# btrfs ins dump-tree -t 5 $dev
# xfs_io -c "fpunch 2K 6K" $mnt/file
# btrfs ins dump-tree -t 5 $dev

The last punch hole did created a 4K hole as expect, but the 0~4K extent
doesn't get COWed at all.

item 6 key (257 EXTENT_DATA 0) itemoff 15813 itemsize 53
generation 6 type 1 (regular)
extent data disk byte 13631488 nr 16384 <<<
extent data offset 0 nr 4096 ram 16384
extent compression 0 (none)
item 7 key (257 EXTENT_DATA 4096) itemoff 15760 itemsize 53
generation 8 type 1 (regular)
extent data disk byte 0 nr 0
extent data offset 0 nr 4096 ram 4096
extent compression 0 (none)
item 8 key (257 EXTENT_DATA 8192) itemoff 15707 itemsize 53
generation 6 type 1 (regular)
extent data disk byte 13631488 nr 16384 <<<
extent data offset 8192 nr 8192 ram 16384
extent compression 0 (none)

So the failure even with nodatacow mount option looks a little
suspicious now.

Thanks,
Qu

> 
>>>
>>> xfs is fine.
>>
>> Of course xfs is fine, XFS defaults to do NODATACOW, in contrast to
>> btrfs' default DATACOW.
>>
>> Have you tried to do the same when the file is reflinked?
> 
>  No but I presume it will rightfully fail.
> 
> Thanks, -Anand
> 
>> Thanks,
>> Qu
>>
>>>
>>> [1]
>>> cat ./punch-hole-on-full-fs
>>> 
>>> cleanup()
>>> {
>>>  umount /dev/sdb > /dev/null 2>&1
>>>  btrfs_reload
>>> }
>>>
>>> full_fs_setup()
>>> {
>>>  btrfs-dyndbg disable
>>>  mkfs.$fs -b 200M -fq $mkfs_options /dev/sdb || exit
>>>  mount $mount_options /dev/sdb /mnt/scratch || exit
>>>  dd status=none if=/dev/zero of=/mnt/scratch/filler bs=512 >
>>> /dev/null 2>&1
>>> }
>>>
>>> test()
>>> {
>>>  cleanup
>>>  full_fs_setup
>>>
>>>  btrfs-dyndbg enable
>>>  echo " fallocate -p -o 0 -l $punch_hole_size
>>> /mnt/scratch/filler
>>> "
>>>  fallocate -p -o 0 -l $punch_hole_size /mnt/scratch/filler
>>> }
>>>
>>> fs=btrfs; mkfs_options=""; mount_options="";
>>> [[ $1 ]] || { echo "usage: $0 "; exit; }
>>> punch_hole_size=$1 && test
>>> 
>>>
>>> ./punch-hole-on-full-fs 512
>>> -- fallocate -p -o 0 -l 512 /mnt/scratch/filler --
>>> fallocate: /mnt/scratch/filler: fallocate failed: No space left on
>>> device
>>>
>>> ./punch-hole-on-full-fs 8000
>>> -- fallocate -p -o 0 -l 8000 /mnt/scratch/filler --
>>> fallocate: /mnt/scratch/filler: fallocate failed: No space left on
>>> device
>>>
>>>
>>> Thanks, Anand
>>



signature.asc
Description: OpenPGP digital signature


Re: [PATCH V5 RESEND] Btrfs: enchanse raid1/10 balance heuristic

2018-09-20 Thread Peter Becker
i like the idea.
do you have any benchmarks for this change?

the general logic looks good for me.


Re: Punch hole on full fs

2018-09-20 Thread Anand Jain




On 09/20/2018 04:45 PM, Qu Wenruo wrote:



On 2018/9/20 下午1:26, anand.j...@oracle.com wrote:



Test script [1] tries to punch hole on a full FS and it works fine as
long as the hole size and the offset is aligned with the sectorsize and
the extent, so that it could just drop the relevant extent to create the
hole.

The reason why this test fails for non aligned hole size and offset is
because, as it rightfully tries to truncate the non aligned extent at
the front and back of the hole, it tries to create new extent which ends
up with ENOSPC.


This is what btrfs should do, I don't see any problem though.



Any idea, how do we solve this?


I don't really think this is a problem though.

As long as we need to do data COW, such ENOSPC would happen.


It happens even with '-o max_inline=0 -o nodatacow' so certainly there 
is some problem, but looking at btrfs_punch_hole() it apparently looks 
like there is no way to fix this as sub-sectorsize hole-size depends on 
the truncate which needs to alloc space.




xfs is fine.


Of course xfs is fine, XFS defaults to do NODATACOW, in contrast to
btrfs' default DATACOW.

Have you tried to do the same when the file is reflinked?


 No but I presume it will rightfully fail.

Thanks, -Anand


Thanks,
Qu



[1]
cat ./punch-hole-on-full-fs

cleanup()
{
 umount /dev/sdb > /dev/null 2>&1
 btrfs_reload
}

full_fs_setup()
{
 btrfs-dyndbg disable
 mkfs.$fs -b 200M -fq $mkfs_options /dev/sdb || exit
 mount $mount_options /dev/sdb /mnt/scratch || exit
 dd status=none if=/dev/zero of=/mnt/scratch/filler bs=512 >
/dev/null 2>&1
}

test()
{
 cleanup
 full_fs_setup

 btrfs-dyndbg enable
 echo " fallocate -p -o 0 -l $punch_hole_size /mnt/scratch/filler
"
 fallocate -p -o 0 -l $punch_hole_size /mnt/scratch/filler
}

fs=btrfs; mkfs_options=""; mount_options="";
[[ $1 ]] || { echo "usage: $0 "; exit; }
punch_hole_size=$1 && test


./punch-hole-on-full-fs 512
-- fallocate -p -o 0 -l 512 /mnt/scratch/filler --
fallocate: /mnt/scratch/filler: fallocate failed: No space left on device

./punch-hole-on-full-fs 8000
-- fallocate -p -o 0 -l 8000 /mnt/scratch/filler --
fallocate: /mnt/scratch/filler: fallocate failed: No space left on device


Thanks, Anand




Re: Punch hole on full fs

2018-09-20 Thread Qu Wenruo


On 2018/9/20 下午1:26, anand.j...@oracle.com wrote:
> 
> 
> Test script [1] tries to punch hole on a full FS and it works fine as
> long as the hole size and the offset is aligned with the sectorsize and
> the extent, so that it could just drop the relevant extent to create the
> hole.
> 
> The reason why this test fails for non aligned hole size and offset is
> because, as it rightfully tries to truncate the non aligned extent at
> the front and back of the hole, it tries to create new extent which ends
> up with ENOSPC.

This is what btrfs should do, I don't see any problem though.

> 
> Any idea, how do we solve this?

I don't really think this is a problem though.

As long as we need to do data COW, such ENOSPC would happen.

> 
> xfs is fine.

Of course xfs is fine, XFS defaults to do NODATACOW, in contrast to
btrfs' default DATACOW.

Have you tried to do the same when the file is reflinked?

Thanks,
Qu

> 
> [1]
> cat ./punch-hole-on-full-fs
> 
> cleanup()
> {
> umount /dev/sdb > /dev/null 2>&1
> btrfs_reload
> }
> 
> full_fs_setup()
> {
> btrfs-dyndbg disable
> mkfs.$fs -b 200M -fq $mkfs_options /dev/sdb || exit
> mount $mount_options /dev/sdb /mnt/scratch || exit
> dd status=none if=/dev/zero of=/mnt/scratch/filler bs=512 >
> /dev/null 2>&1
> }
> 
> test()
> {
> cleanup
> full_fs_setup
> 
> btrfs-dyndbg enable
> echo " fallocate -p -o 0 -l $punch_hole_size /mnt/scratch/filler
> "
> fallocate -p -o 0 -l $punch_hole_size /mnt/scratch/filler
> }
> 
> fs=btrfs; mkfs_options=""; mount_options="";
> [[ $1 ]] || { echo "usage: $0 "; exit; }
> punch_hole_size=$1 && test
> 
> 
> ./punch-hole-on-full-fs 512
> -- fallocate -p -o 0 -l 512 /mnt/scratch/filler --
> fallocate: /mnt/scratch/filler: fallocate failed: No space left on device
> 
> ./punch-hole-on-full-fs 8000
> -- fallocate -p -o 0 -l 8000 /mnt/scratch/filler --
> fallocate: /mnt/scratch/filler: fallocate failed: No space left on device
> 
> 
> Thanks, Anand



signature.asc
Description: OpenPGP digital signature


Re: [RFC PATCH v2 0/4] btrfs-progs: build distinct binaries for specific btrfs subcommands

2018-09-20 Thread Duncan
Axel Burri posted on Thu, 20 Sep 2018 00:02:22 +0200 as excerpted:

> Now not everybody wants to install these with fscaps or setuid, but it
> might also make sense to provide "/usr/bin/btrfs-subvolume-{show,list}",
> as they now work for a regular user. Having both root/user binaries
> concurrently is not an issue (e.g. in gentoo the full-featured btrfs
> command is in "/sbin/").

That's going to be a problem for distros (or users like me with advanced 
layouts, on gentoo too FWIW) that have the bin/sbin merge, where one is a 
symlink to the other.

FWIW I have both the /usr merge (tho reversed for me, so /usr -> . 
instead of having to have /bin and /sbin symlinks to /usr/bin) and the 
bin/sbin merge, along with, since I'm on amd64-nomultilib, the lib/lib64 
merge.  So:

$$ dir -gGd /bin /sbin /usr /lib /lib64
drwxr-xr-x 1 35688 Sep 18 22:56 /bin
lrwxrwxrwx 1 5 Aug  7 00:29 /lib -> lib64
drwxr-xr-x 1 78560 Sep 18 22:56 /lib64
lrwxrwxrwx 1 3 Mar 11  2018 /sbin -> bin
lrwxrwxrwx 1 1 Mar 11  2018 /usr -> .


Of course that last one (/usr -> .) leads to /share and /include hanging 
directly off of / as well, but it works.

But in that scheme /bin, /sbin, /usr/bin and /usr/sbin, are all the same 
dir, so only one executable of a particularly name can exist therein.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: very poor performance / a lot of writes to disk with space_cache (but not with space_cache=v2)

2018-09-20 Thread Duncan
Tomasz Chmielewski posted on Wed, 19 Sep 2018 10:43:18 +0200 as excerpted:

> I have a mysql slave which writes to a RAID-1 btrfs filesystem (with
> 4.17.14 kernel) on 3 x ~1.9 TB SSD disks; filesystem is around 40% full.
> 
> The slave receives around 0.5-1 MB/s of data from the master over the
> network, which is then saved to MySQL's relay log and executed. In ideal
> conditions (i.e. no filesystem overhead) we should expect some 1-3 MB/s
> of data written to disk.
> 
> MySQL directory and files in it are chattr +C (since the directory was
> created, so all files are really +C); there are no snapshots.
> 
> 
> Now, an interesting thing.
> 
> When the filesystem is mounted with these options in fstab:
> 
> defaults,noatime,discard
> 
> 
> We can see a *constant* write of 25-100 MB/s to each disk. The system is
> generally unresponsive and it sometimes takes long seconds for a simple
> command executed in bash to return.
> 
> 
> However, as soon as we remount the filesystem with space_cache=v2 -
> writes drop to just around 3-10 MB/s to each disk. If we remount to
> space_cache - lots of writes, system unresponsive. Again remount to
> space_cache=v2 - low writes, system responsive.
> 
> 
> That's a huuge, 10x overhead! Is it expected? Especially that
> space_cache=v1 is still the default mount option?

The other replies are good but I've not seen this pointed out yet...

Perhaps you are accounting for this already, but you don't /say/ you are, 
while you do mention repeatedly toggling the space-cache options, which 
would trigger it so you /need/ to account for it...

I'm not sure about space_cache=v2 (it's probably more efficient with it 
even if it does have to do it), but I'm quite sure that space_cache=v1 
takes some time after initial mount with it to scan the filesystem and 
actually create the map of available free space that is the space_cache.

Now you said ssds, which should be reasonably fast, but you also say 3-
device btrfs raid1, with each device ~2TB, and the filesystem ~40% full, 
which should be ~2 TB of data, which is likely somewhat fragmented so 
it's likely rather more than 2 TB of data chunks to scan for free space, 
and that's going to take /some/ time even on SSDs!

So if you're toggling settings like that in your tests, be sure to let 
the filesystem rebuild its cache that you just toggled and give it time 
to complete that and quiesce, before you start trying to measure write 
amplification.

Otherwise it's not write-amplification you're measuring, but the churn 
from the filesystem still trying to reset its cache after you toggled it!


Also, while 4.17 is well after the ssd mount option (usually auto-
detected, check /proc/mounts, mount output, or dmesg, to see if the ssd 
mount option is being added) fixes that went in in 4.14, if the 
filesystem has been in use for several kernel cycles and in particular 
before 4.14, with the ssd mount option active, and you've not rebalanced 
since then, you may well still have serious space fragmentation from 
that, which could increase the amount of data in the space_cache map 
rather drastically, thus increasing the time it takes to update the 
space_cache, particularly v1, after toggling it on.

A balance can help correct that, but it might well be easier and should 
result in a better layout to simply blow the filesystem away with a 
mkfs.btrfs and start over.


Meanwhile, as Remi already mentioned, you might want to reconsider nocow 
on btrfs raid1, since nocow defeats checksumming and thus scrub, which 
verifies checksums, simply skips it, and if the two copies get out of 
sync for some reason...

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: btrfs filesystem show takes a long time

2018-09-20 Thread Stefan K
Hi,

> You may try to run the show command under strace to see where it blocks.
any recommendations for strace options?


On Friday, September 14, 2018 1:25:30 PM CEST David Sterba wrote:
> Hi,
> 
> thanks for the report, I've forwarded it to the issue tracker
> https://github.com/kdave/btrfs-progs/issues/148
> 
> The show command uses the information provided by blkid, that presumably
> caches that. The default behaviour of 'fi show' is to skip mount checks,
> so the delays are likely caused by blkid, but that's not the only
> possible reason.
> 
> You may try to run the show command under strace to see where it blocks.
> 



Re: cannot mount btrfs as root

2018-09-20 Thread Zbigniew 'zibi' Jarosik
On 19 September 2018 at 15:54, Qu Wenruo  wrote:
>
> Does the iniramfs do a "btrfs dev scan" to make populate btrfs devices
> lists?

This solved problem. Looks like btrfs scans disks to early, before
bcache initializes. I need to debug initrfamfs scripts.

Thanks a lot for hint!


-- 
Zbigniew 'zibi' Jarosik
cell: +48 667956686
jid: tap...@gmail.com
gg: 2830
http://zibi.nora.pl/