Re: btrfs-cleaner / snapshot performance analysis

2018-02-13 Thread Darrick J. Wong
On Sun, Feb 11, 2018 at 02:40:16PM +0800, Qu Wenruo wrote:
> 
> 
> On 2018年02月10日 00:45, Ellis H. Wilson III wrote:
> > Hi all,
> > 
> > I am trying to better understand how the cleaner kthread (btrfs-cleaner)
> > impacts foreground performance, specifically during snapshot deletion.
> > My experience so far has been that it can be dramatically disruptive to
> > foreground I/O.
> > 
> > Looking through the wiki at kernel.org I have not yet stumbled onto any
> > analysis that would shed light on this specific problem.  I have found
> > numerous complaints about btrfs-cleaner online, especially relating to
> > quotas being enabled.  This has proven thus far less than helpful, as
> > the response tends to be "use less snapshots," or "disable quotas," both
> > of which strike me as intellectually unsatisfying answers, especially
> > the former in a filesystem where snapshots are supposed to be
> > "first-class citizens."
> 
> Yes, snapshots in btrfs really are "first-class citizens".
> Tons of design decisions are biased toward snapshots.
> 
> But one should be clear about one thing:
> Snapshot creation and backref walk (used in qgroup, relocation and
> extent deletion) are in fact two conflicting workloads.
> 
> Btrfs puts snapshot creation on a very high priority, so that it greatly
> degrades the performance of backref walk (used in snapshot deletion,
> relocation and extent exclusive/shared calculation of qgroup).
> 
> Let me explain this problem in detail.
> 
> Just as explained by Peter Grandi, for any snapshot system (or any
> system that supports reflink) there must be a reverse mapping tree, to
> tell which extent is used by whom.
> 
> It's critical for determining whether an extent is shared, so that we
> know whether we need to do CoW.
> 
> There are several different ways to implement it, and this hugely
> affects snapshot creation performance.
> 
> 1) Direct mapping record
>    Just records exactly which extent is used by whom, directly.
>    So when we need to check the owner, we just search the tree ONCE,
>    and we have it.
> 
>    This is simple, and it seems that both the LVM thin-provisioning and
>    traditional LVM targets use it.
>    (Maybe XFS also follows this way?)

Yes, it does.

>    Pros:
>    *FAST* backref walk, which means quick extent deletion and CoW
>    condition check.
> 
> 
>    Cons:
>    *SLOW* snapshot creation.
>    Each snapshot creation needs to insert new owner relationships into
>    the tree, and this modification grows with the size of the snapshot
>    source.

...of course xfs also doesn't support snapshots. :)

--D

> 2) Indirect mapping record
>    Records the upper-level referencer only.
> 
>    To get all the direct owners of an extent, it needs multiple lookups
>    in the reverse mapping tree.
> 
>    And obviously, btrfs is using this method.
> 
>    Pros:
>    *FAST* owner inheritance, which means snapshot creation.
>    (Well, the only advantage I can think of)
> 
>    Cons:
>    *VERY SLOW* backref walk, used by extent deletion, relocation, qgroup
>    and CoW condition check.
>    (That may also be why btrfs defaults to CoW data, so that it can skip
>    the costly backref walk)
> 
> And a more detailed example of the difference between them will be:
> 
> [Basic tree layout]
>             Tree X
>             node A
>            /      \
>       node B      node C
>       /    \      /    \
>   leaf D  leaf E  leaf F  leaf G
> 
> Use the above tree X as the snapshot source.
> 
> [Snapshot creation: Direct mapping]
> Then for direct mapping record, if we are going to create snapshot Y
> then we would get:
> 
>    Tree X      Tree Y
>    node A      node H
>     |   \      /   |
>     |    \    /    |
>     |     \  /     |
>     |      \/      |
>     |      /\      |
>     |     /  \     |
>    node B      node C
>     /  \        /  \
>  leaf D  leaf E  leaf F  leaf G
> 
> We need to create new node H, and update the owner for node B/C/D/E/F/G.
> 
> That's to say, we need to create 1 new node, and update 6 references of
> existing nodes/leaves.
> And this will grow rapidly if the tree is large, but still should be a
> linear increase.
> 
> 
> [Snapshot creation: Indirect mapping]
> If using an indirect mapping tree, then firstly the reverse mapping tree
> doesn't record the exact owner of each leaf/node, but only records
> its parent(s).
> 
> So even when tree X exists alone, without snapshot Y, if we need to know
> the owner of leaf D, we only know that its only parent is node B.
> We then repeat the same query on node B, and so on, until we reach node A
> and know the leaf is owned by tree X.
> 
>    Tree X     ^
>    node A     | Look upward until
>      /        | we reach tree root
>   node B      | to search the owner
>     /         | of a leaf/node
>  leaf D       |

Re: btrfs-cleaner / snapshot performance analysis

2018-02-13 Thread E V
On Mon, Feb 12, 2018 at 10:37 AM, Ellis H. Wilson III wrote:
> On 02/11/2018 01:24 PM, Hans van Kranenburg wrote:
>>
>> Why not just use `btrfs fi du   ` now and then and
>> update your administration with the results? .. Instead of putting the
>> burden of keeping track of all administration during every tiny change
>> all day long?
>
>
> I will look into that if using built-in group capacity functionality proves
> to be truly untenable.  Thanks!
>
>>> CoW is still valuable for us as we're shooting to support on the order
>>> of hundreds of snapshots per subvolume,
>>
>>
>> Hundreds will get you into trouble even without qgroups.
>
>
> I should have been more specific.  We are looking to use up to a few dozen
> snapshots per subvolume, but will have many (tens to hundreds of) discrete
> subvolumes (each with up to a few dozen snapshots) in a BTRFS filesystem.
> If I have it wrong and the scalability issues in BTRFS do not solely apply
> to subvolumes and their snapshot counts, please let me know.
>
> I will note you focused on my tiny desktop filesystem when making some of
> your previous comments -- this is why I didn't want to share specific
> details.  Our filesystem will be RAID0 with six large HDDs (12TB each).
> Reliability concerns do not apply to our situation for technical reasons,
> but if there are capacity scaling issues with BTRFS I should be made aware
> of, I'd be glad to hear them.  I have not seen any in technical
> documentation of such a limit, and experiments so far on 6x6TB arrays have
> not shown any performance problems, so I'm inclined to believe the only
> scaling issue exists with reflinks.  Correct me if I'm wrong.
>
> Thanks,
>
> ellis
>

When testing btrfs on large volumes, especially with metadata-heavy
operations, I'd suggest you match the node size of your mkfs.btrfs
(-n) with the stripe size used in creating your RAID array. Also, use
the ssd_spread mount option as discussed in a previous thread. It makes
a big difference on arrays. It allocates much more space for metadata,
but it greatly reduces fragmentation over time.
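
For illustration, a rough sketch of what that could look like on a six-disk
array, assuming a 64KiB RAID stripe (device names, profiles and the mount
point below are examples, not taken from this thread):

  # metadata node size matched to the array stripe size
  mkfs.btrfs -n 64k -d raid0 -m raid0 /dev/sd[b-g]
  # ssd_spread changes how the allocator picks free space; noatime avoids
  # atime updates cowing metadata
  mount -o ssd_spread,noatime /dev/sdb /mnt/array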


Re: btrfs-cleaner / snapshot performance analysis

2018-02-12 Thread Austin S. Hemmelgarn

On 2018-02-12 11:39, Ellis H. Wilson III wrote:

On 02/12/2018 11:02 AM, Austin S. Hemmelgarn wrote:
BTRFS in general works fine at that scale, dependent of course on the 
level of concurrent access you need to support.  Each tree update 
needs to lock a bunch of things in the tree itself, and having large 
numbers of clients writing to the same set of files concurrently can 
cause lock contention issues because of this, especially if all of 
them are calling fsync() or fdatasync() regularly.  These issues can 
be mitigated by segregating workloads into their own subvolumes (each 
subvolume is a mostly independent filesystem tree), but it sounds like 
you're already doing that, so I don't think that would be an issue for 
you.
Hmm...I'll think harder about this.  There is potential for us to 
artificially divide access to files across subvolumes automatically 
because of the way we are using BTRFS as a backing store for our 
parallel file system.  So far even with around 1000 threads across about 
10 machines accessing BTRFS via our parallel filesystem over the wire 
we've not seen issues, but if we do I have some ways out I've not 
explored yet.  Thanks!
For what it's worth, most of the issues I've personally seen with
parallel performance involved very heavy use of fsync(), or lots of
parallel calls to stat() and statvfs() happening while files are also
being written to, so it may simply be that the way you happen to be doing
things doesn't cause issues.




Re: btrfs-cleaner / snapshot performance analysis

2018-02-12 Thread Ellis H. Wilson III

On 02/12/2018 12:09 PM, Hans van Kranenburg wrote:

You are in the To: of it:

https://www.spinics.net/lists/linux-btrfs/msg74737.html


Apparently MS365 decided my disabling of junk/clutter filter rules some 
year+ ago wasn't wise and re-enabled it.  I wondered why I wasn't seeing 
my own messages back from the list.  Qu's along with all of my responses 
were in spam.  Go figure, MS marking kernel.org mail spam...


This is exactly what I was looking for, and indeed is a fantastic write 
up I'll need to read over a few times to really have it soak in.  Thank 
you very much Qu!


Best,

ellis


Re: btrfs-cleaner / snapshot performance analysis

2018-02-12 Thread Hans van Kranenburg
On 02/12/2018 03:45 PM, Ellis H. Wilson III wrote:
> On 02/11/2018 01:03 PM, Hans van Kranenburg wrote:
>>> 3. I need to look at the code to understand the interplay between
>>> qgroups, snapshots, and foreground I/O performance as there isn't
>>> existing architecture documentation to point me to that covers this
>>
>> Well, the excellent write-up of Qu this morning shows some explanation
>> from the design point of view.
> 
> Sorry, I may have missed this email.  Or perhaps you are referring to a
> wiki or blog post of some kind I'm not following actively?  Either way,
> if you can forward me the link, I'd greatly appreciate it.

You are in the To: of it:

https://www.spinics.net/lists/linux-btrfs/msg74737.html

>> nocow only keeps the cows at a distance as long as you don't start
>> snapshotting (or cp --reflink) those files... If you take a snapshot,
>> then you force btrfs to keep the data around that is referenced by the
>> snapshot. So, that means that every next write will be cowed once again,
>> moo, so small writes will be redirected to a new location, causing
>> fragmentation again. The second and third write can go in the same (new)
>> location as the first new write, but as soon as you snapshot again, this
>> happens again.
> 
> Ah, very interesting.  Thank you for clarifying!

-- 
Hans van Kranenburg


Re: btrfs-cleaner / snapshot performance analysis

2018-02-12 Thread Ellis H. Wilson III

On 02/12/2018 11:02 AM, Austin S. Hemmelgarn wrote:
I will look into that if using built-in group capacity functionality 
proves to be truly untenable.  Thanks!
As a general rule, unless you really need to actively prevent a 
subvolume from exceeding its quota, this will generally be more
reliable and have much less performance impact than using qgroups.


Ok ok :).  I will plan to go this route, but since I'll want to 
benchmark it either way, I'll include qgroups enabled in the benchmark 
and will report back.


With qgroups involved, I really can't say for certain, as I've never 
done much with them myself, but based on my understanding of how it all 
works, I would expect multiple subvolumes with a small number of 
snapshots each to not have as many performance issues as a single 
subvolume with the same total number of snapshots.


Glad to hear that.  That was my expectation as well.

BTRFS in general works fine at that scale, dependent of course on the 
level of concurrent access you need to support.  Each tree update needs 
to lock a bunch of things in the tree itself, and having large numbers 
of clients writing to the same set of files concurrently can cause lock 
contention issues because of this, especially if all of them are calling 
fsync() or fdatasync() regularly.  These issues can be mitigated by 
segregating workloads into their own subvolumes (each subvolume is a 
mostly independent filesystem tree), but it sounds like you're already 
doing that, so I don't think that would be an issue for you.
Hmm...I'll think harder about this.  There is potential for us to 
artificially divide access to files across subvolumes automatically 
because of the way we are using BTRFS as a backing store for our 
parallel file system.  So far even with around 1000 threads across about 
10 machines accessing BTRFS via our parallel filesystem over the wire 
we've not seen issues, but if we do I have some ways out I've not 
explored yet.  Thanks!


Now, there are some other odd theoretical cases that may cause issues 
when dealing with really big filesystems, but they're either really 
specific edge cases (for example, starting with a really small 
filesystem and gradually scaling it up in size as it gets full) or 
happen at scales far larger than what you're talking about (on the order 
of at least double digit petabyte scale).


Yea, our use case will be in the tens of TB to hundreds of TB for the 
foreseeable future, so I'm glad to hear this is relatively standard. 
That was my read of the situation as well.


Thanks!

ellis


Re: btrfs-cleaner / snapshot performance analysis

2018-02-12 Thread Austin S. Hemmelgarn

On 2018-02-12 10:37, Ellis H. Wilson III wrote:

On 02/11/2018 01:24 PM, Hans van Kranenburg wrote:

Why not just use `btrfs fi du   ` now and then and
update your administration with the results? .. Instead of putting the
burden of keeping track of all administration during every tiny change
all day long?


I will look into that if using built-in group capacity functionality 
proves to be truly untenable.  Thanks!
As a general rule, unless you really need to actively prevent a 
subvolume from exceeding its quota, this will generally be more
reliable and have much less performance impact than using qgroups.



CoW is still valuable for us as we're shooting to support on the order
of hundreds of snapshots per subvolume,


Hundreds will get you into trouble even without qgroups.


I should have been more specific.  We are looking to use up to a few 
dozen snapshots per subvolume, but will have many (tens to hundreds of) 
discrete subvolumes (each with up to a few dozen snapshots) in a BTRFS 
filesystem.  If I have it wrong and the scalability issues in BTRFS do 
not solely apply to subvolumes and their snapshot counts, please let me 
know.
The issue isn't so much total number of snapshots as it is how many 
snapshots are sharing data.  If each of your individual subvolumes 
shares no data with any of the others via reflinks (so no deduplication 
across subvolumes, and no copying files around using reflinks or the 
clone ioctl), then I would expect things will be just fine without 
qgroups provided that you're not deleting huge numbers of snapshots at 
the same time.
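
(For clarity, the cross-subvolume sharing described above is created by
operations like the following; a sketch with made-up paths, not something
from this thread:)

  # both subvolumes end up referencing the same extents
  cp --reflink=always /mnt/fs/subvolA/image.raw /mnt/fs/subvolB/image.raw
  # 'Set shared' in the output below grows accordingly
  btrfs filesystem du -s /mnt/fs/subvolA /mnt/fs/subvolB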


With qgroups involved, I really can't say for certain, as I've never 
done much with them myself, but based on my understanding of how it all 
works, I would expect multiple subvolumes with a small number of 
snapshots each to not have as many performance issues as a single 
subvolume with the same total number of snapshots.


I will note you focused on my tiny desktop filesystem when making some 
of your previous comments -- this is why I didn't want to share specific 
details.  Our filesystem will be RAID0 with six large HDDs (12TB each). 
Reliability concerns do not apply to our situation for technical 
reasons, but if there are capacity scaling issues with BTRFS I should be 
made aware of, I'd be glad to hear them.  I have not seen any in 
technical documentation of such a limit, and experiments so far on 6x6TB 
arrays have not shown any performance problems, so I'm inclined to
believe the only scaling issue exists with reflinks.  Correct me if I'm 
wrong.
BTRFS in general works fine at that scale, dependent of course on the 
level of concurrent access you need to support.  Each tree update needs 
to lock a bunch of things in the tree itself, and having large numbers 
of clients writing to the same set of files concurrently can cause lock 
contention issues because of this, especially if all of them are calling 
fsync() or fdatasync() regularly.  These issues can be mitigated by 
segregating workloads into their own subvolumes (each subvolume is a 
mostly independent filesystem tree), but it sounds like you're already 
doing that, so I don't think that would be an issue for you.
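
(A trivial sketch of that segregation, with hypothetical paths:)

  # one subvolume per independent workload instead of a plain directory
  btrfs subvolume create /mnt/fs/workloadA
  btrfs subvolume create /mnt/fs/workloadB
  # each subvolume is its own file tree, so fsync-heavy writers in one
  # contend far less on tree locks with writers in the other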


The only other possibility I can think of is that the performance hit 
from qgroups may scale not just based on the number of snapshots of a 
given subvolume, but also the total size of the subvolume (more data 
means more accounting work), though I'm not certain about that (it's 
just a hunch based on what I do know about qgroups).


Now, there are some other odd theoretical cases that may cause issues 
when dealing with really big filesystems, but they're either really 
specific edge cases (for example, starting with a really small 
filesystem and gradually scaling it up in size as it gets full) or 
happen at scales far larger than what you're talking about (on the order 
of at least double digit petabyte scale).



Re: btrfs-cleaner / snapshot performance analysis

2018-02-12 Thread Ellis H. Wilson III

On 02/11/2018 01:24 PM, Hans van Kranenburg wrote:

Why not just use `btrfs fi du   ` now and then and
update your administration with the results? .. Instead of putting the
burden of keeping track of all administration during every tiny change
all day long?


I will look into that if using built-in group capacity functionality 
proves to be truly untenable.  Thanks!



CoW is still valuable for us as we're shooting to support on the order
of hundreds of snapshots per subvolume,


Hundreds will get you into trouble even without qgroups.


I should have been more specific.  We are looking to use up to a few 
dozen snapshots per subvolume, but will have many (tens to hundreds of) 
discrete subvolumes (each with up to a few dozen snapshots) in a BTRFS 
filesystem.  If I have it wrong and the scalability issues in BTRFS do 
not solely apply to subvolumes and their snapshot counts, please let me 
know.


I will note you focused on my tiny desktop filesystem when making some 
of your previous comments -- this is why I didn't want to share specific 
details.  Our filesystem will be RAID0 with six large HDDs (12TB each). 
Reliability concerns do not apply to our situation for technical 
reasons, but if there are capacity scaling issues with BTRFS I should be 
made aware of, I'd be glad to hear them.  I have not seen any in 
technical documentation of such a limit, and experiments so far on 6x6TB 
arrays have not shown any performance problems, so I'm inclined to
believe the only scaling issue exists with reflinks.  Correct me if I'm 
wrong.


Thanks,

ellis


Re: btrfs-cleaner / snapshot performance analysis

2018-02-12 Thread Ellis H. Wilson III

On 02/11/2018 01:03 PM, Hans van Kranenburg wrote:

3. I need to look at the code to understand the interplay between
qgroups, snapshots, and foreground I/O performance as there isn't
existing architecture documentation to point me to that covers this


Well, the excellent write-up of Qu this morning shows some explanation
from the design point of view.


Sorry, I may have missed this email.  Or perhaps you are referring to a 
wiki or blog post of some kind I'm not following actively?  Either way, 
if you can forward me the link, I'd greatly appreciate it.



nocow only keeps the cows at a distance as long as you don't start
snapshotting (or cp --reflink) those files... If you take a snapshot,
then you force btrfs to keep the data around that is referenced by the
snapshot. So, that means that every next write will be cowed once again,
moo, so small writes will be redirected to a new location, causing
fragmentation again. The second and third write can go in the same (new)
location as the first new write, but as soon as you snapshot again, this
happens again.


Ah, very interesting.  Thank you for clarifying!

Best,

ellis


Re: btrfs-cleaner / snapshot performance analysis

2018-02-11 Thread Hans van Kranenburg
On 02/11/2018 04:59 PM, Ellis H. Wilson III wrote:
> Thanks Tomasz,
> 
> Comments in-line:
> 
> On 02/10/2018 05:05 PM, Tomasz Pala wrote:
>> You won't have anything close to "accurate" in btrfs - quotas don't
>> include space wasted by fragmentation, which happens to allocate from
>> tens to thousands of times (sic!) more space than the files themselves.
>> Not in some worst-case scenarios, but in real life situations...
>> I got 10 MB db-file which was eating 10 GB of space after a week of
>> regular updates - withOUT snapshotting it. All described here.
> 
> The underlying filesystem this is replacing was an in-house developed
> COW filesystem, so we're aware of the difficulties of fragmentation. I'm
> more interested in an approximate space consumed across snapshots when
> considering CoW.  I realize it will be approximate.  Approximate is ok
> for us -- no accounting for snapshot space consumed is not.

If your goal is to have an approximate idea for accounting, and you
don't need to be able to actually enforce limits, and if the filesystems
that you are using are as small as the 40GiB example you gave...

Why not just use `btrfs fi du   ` now and then and
update your administration with the results? .. Instead of putting the
burden of keeping track of all administration during every tiny change
all day long?
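
For example, something as small as a nightly cron job could do that
bookkeeping (a sketch; the paths and log file are made up):

  # record total/exclusive/shared bytes per subvolume once a day
  for sub in /srv/tank/volumes/*; do
      btrfs filesystem du -s "$sub"
  done >> /var/log/btrfs-space-accounting.log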

> Also, I don't see the thread you mentioned.  Perhaps you forgot to
> mention it, or an html link didn't come through properly?
> 
>>> course) or how many subvolumes/snapshots there are.  If I know that
>>> above N snapshots per subvolume performance tanks by M%, I can apply
>>> limits on the use-case in the field, but I am not aware of those kinds
>>> of performance implications yet.
>>
>> This doesn't work like this. It all depends on data that are subject of
>> snapshots, especially how they are updated. How exactly, including write
>> patterns.
>>
>> I think you expect answers that can't be formulated - with fs
>> architecture so
>> advanced as ZFS or btrfs it's behavior can't be analyzed for simple
>> answers like 'keep less than N snapshots'.
> 
> I was using an extremely simple heuristic to drive at what I was looking
> to get out of this.  I should have been more explicit that the example
> was not to be taken literally.
> 
>> This is an exception of easy-answer: btrfs doesn't handle databases with
>> CoW. Period. Doesn't matter if snapshotted or not, ANY database files
>> (systemd-journal, PostgreSQL, sqlite, db) are not handled at all. They
>> slow down the entire system to the speed of a cheap SD card.
> 
> I will keep this in mind, thank you.  We do have a higher level above
> BTRFS that stages data.  I will consider implementing an algorithm to
> add the nocow flag to the file if it has been written to sufficiently to
> indicate it will be a bad fit for the BTRFS COW algorithm.

Adding the nocow attribute to a file only works when it has just been
created and not yet written to, or when setting it on the containing
directory and letting new files inherit it. You can't just turn it on for
existing files that already have content.

https://btrfs.wiki.kernel.org/index.php/FAQ#Can_copy-on-write_be_turned_off_for_data_blocks.3F
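
In practice that usually looks something like this (a sketch; the paths are
examples only):

  mkdir /srv/tank/db
  chattr +C /srv/tank/db        # new files created in here inherit nodatacow
  lsattr -d /srv/tank/db        # should show the 'C' attribute
  # an existing file has to be copied (not reflinked) into the nocow dir:
  cp --reflink=never /srv/tank/old/journal.db /srv/tank/db/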

>> Actually, if you do not use compression and don't need checksums of data
>> blocks, you may want to mount all the btrfs with nocow by default.
>> This way the quotas would be more accurate (no fragmentation _between_
>> snapshots) and you'll have some decent performance with snapshots.
>> If that is all you care.
> 
> CoW is still valuable for us as we're shooting to support on the order
> of hundreds of snapshots per subvolume,

Hundreds will get you into trouble even without qgroups.

> and without it (if BTRFS COW
> works the same as our old COW FS) that's going to be quite expensive to
> keep snapshots around.  So some hybrid solution is required here.

-- 
Hans van Kranenburg


Re: btrfs-cleaner / snapshot performance analysis

2018-02-11 Thread Hans van Kranenburg
On 02/11/2018 05:15 PM, Ellis H. Wilson III wrote:
> Thanks Hans.  Sorry for the top-post, but I'm boiling things down here
> so I don't have a clear line-item to respond to.  The take-aways I see
> here to my original queries are:
> 
> 1. Nobody has done a thorough analysis of the impact of snapshot
> manipulation WITHOUT qgroups enabled on foreground I/O performance
> 2. Nobody has done a thorough analysis of the impact of snapshot
> manipulation WITH qgroups enabled on foreground I/O performance

It's more that there is no simple list of clear-cut answers that apply
to every possible situation and type/pattern of work that you can throw
at a btrfs filesystem.

> 3. I need to look at the code to understand the interplay between
> qgroups, snapshots, and foreground I/O performance as there isn't
> existing architecture documentation to point me to that covers this

Well, the excellent write-up of Qu this morning shows some explanation
from the design point of view.

> 4. I should be cautioned that CoW in BTRFS can exhibit pathological (if
> expected) capacity consumption for very random-write-oriented datasets
> with or without snapshots, and nocow (or in my case transparently
> absorbing and coalescing writes at a higher tier) is my friend.

nocow only keeps the cows at a distance as long as you don't start
snapshotting (or cp --reflink) those files... If you take a snapshot,
then you force btrfs to keep the data around that is referenced by the
snapshot. So, that means that every next write will be cowed once again,
moo, so small writes will be redirected to a new location, causing
fragmentation again. The second and third write can go in the same (new)
location as the first new write, but as soon as you snapshot again, this
happens again.

> 5. I should be cautioned that CoW is broken across snapshots when
> defragmentation is run.
> 
> I will update a test system to the most recent kernel and will perform
> tests to answer #1 and #2.  I will plan to share it when I'm done.  If I
> have time to write-up my findings for #3 I will similarly share that.
> 
> Thanks to all for your input on this issue.

Have fun!

-- 
Hans van Kranenburg


Re: btrfs-cleaner / snapshot performance analysis

2018-02-11 Thread Adam Borowski
On Sun, Feb 11, 2018 at 12:31:42PM +0300, Andrei Borzenkov wrote:
> 11.02.2018 04:02, Hans van Kranenburg пишет:
> >> - /dev/sda6 / btrfs
> >> rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot
> >> 0 0
> > 
> > Note that changes on atime cause writes to metadata, which means cowing
> > metadata blocks and unsharing them from a previous snapshot, only when
> > using the filesystem, not even when changing things (!).
> 
> With relatime, atime is updated only once after a file was changed. So your
> description is not entirely accurate and things should not be that
> dramatic unless files are continuously being changed.

Alas, that's untrue.  relatime updates happen if:
* the file has been written after it was last read, or
* previous atime was older than 24 hours

Thus, you get at least one unshare per inode per day, which is also the most
widespread frequency of both snapshotting and cronjobs.

Fortunately, most uses of atime are gone, thus it's generally safe to
disable it completely.
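
One way to do that, mirroring the mount line quoted earlier in the thread
(a sketch, not a recommendation for any specific setup):

  mount -o remount,noatime /
  # or persistently via fstab, e.g.:
  # /dev/sda6  /  btrfs  noatime,ssd,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot  0 0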


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀ The bill with 3 years prison for mentioning Polish concentration
⣾⠁⢰⠒⠀⣿⡁ camps is back.  What about KL Warschau (operating until 1956)?
⢿⡄⠘⠷⠚⠋⠀ Zgoda?  Łambinowice?  Most ex-German KLs?  If those were "soviet
⠈⠳⣄ puppets", Bereza Kartuska?  Sikorski's camps in UK (thanks Brits!)?


Re: btrfs-cleaner / snapshot performance analysis

2018-02-11 Thread Ellis H. Wilson III
Thanks Hans.  Sorry for the top-post, but I'm boiling things down here 
so I don't have a clear line-item to respond to.  The take-aways I see 
here to my original queries are:


1. Nobody has done a thorough analysis of the impact of snapshot 
manipulation WITHOUT qgroups enabled on foreground I/O performance
2. Nobody has done a thorough analysis of the impact of snapshot 
manipulation WITH qgroups enabled on foreground I/O performance
3. I need to look at the code to understand the interplay between 
qgroups, snapshots, and foreground I/O performance as there isn't 
existing architecture documentation to point me to that covers this
4. I should be cautioned that CoW in BTRFS can exhibit pathological (if 
expected) capacity consumption for very random-write-oriented datasets 
with or without snapshots, and nocow (or in my case transparently 
absorbing and coalescing writes at a higher tier) is my friend.
5. I should be cautioned that CoW is broken across snapshots when 
defragmentation is run.


I will update a test system to the most recent kernel and will perform 
tests to answer #1 and #2.  I will plan to share it when I'm done.  If I 
have time to write-up my findings for #3 I will similarly share that.
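
(For anyone who wants to try something similar, the rough shape of such a
test might be the sketch below; the paths, sizes and fio parameters are
placeholders, not the actual test plan:)

  # steady foreground load
  fio --name=fg --directory=/mnt/test/active --rw=randwrite --bs=4k \
      --size=8g --numjobs=8 --time_based --runtime=900 --group_reporting &
  # build up snapshots while the data keeps changing
  for i in $(seq 1 24); do
      btrfs subvolume snapshot /mnt/test/active /mnt/test/snap-$i
      sleep 30
  done
  # then delete a batch and watch fio latency while btrfs-cleaner runs
  btrfs subvolume delete /mnt/test/snap-*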


Thanks to all for your input on this issue.

ellis


Re: btrfs-cleaner / snapshot performance analysis

2018-02-11 Thread Ellis H. Wilson III

Thanks Tomasz,

Comments in-line:

On 02/10/2018 05:05 PM, Tomasz Pala wrote:

You won't have anything close to "accurate" in btrfs - quotas don't
include space wasted by fragmentation, which happens to allocate from tens
to thousands of times (sic!) more space than the files themselves.
Not in some worst-case scenarios, but in real life situations...
I got 10 MB db-file which was eating 10 GB of space after a week of
regular updates - withOUT snapshotting it. All described here.


The underlying filesystem this is replacing was an in-house developed 
COW filesystem, so we're aware of the difficulties of fragmentation. 
I'm more interested in an approximate space consumed across snapshots 
when considering CoW.  I realize it will be approximate.  Approximate is 
ok for us -- no accounting for snapshot space consumed is not.


Also, I don't see the thread you mentioned.  Perhaps you forgot to 
mention it, or an html link didn't come through properly?



course) or how many subvolumes/snapshots there are.  If I know that
above N snapshots per subvolume performance tanks by M%, I can apply
limits on the use-case in the field, but I am not aware of those kinds
of performance implications yet.


This doesn't work like this. It all depends on data that are subject of
snapshots, especially how they are updated. How exactly, including write
patterns.

I think you expect answers that can't be formulated - with fs architecture so
advanced as ZFS or btrfs it's behavior can't be analyzed for simple
answers like 'keep less than N snapshots'.


I was using an extremely simple heuristic to drive at what I was looking 
to get out of this.  I should have been more explicit that the example 
was not to be taken literally.



This is an exception of easy-answer: btrfs doesn't handle databases with
CoW. Period. Doesn't matter if snapshotted or not, ANY database files
(systemd-journal, PostgreSQL, sqlite, db) are not handled at all. They
slow down the entire system to the speed of a cheap SD card.


I will keep this in mind, thank you.  We do have a higher level above 
BTRFS that stages data.  I will consider implementing an algorithm to 
add the nocow flag to the file if it has been written to sufficiently to 
indicate it will be a bad fit for the BTRFS COW algorithm.



Actually, if you do not use compression and don't need checksums of data
blocks, you may want to mount all the btrfs with nocow by default.
This way the quotas would be more accurate (no fragmentation _between_
snapshots) and you'll have some decent performance with snapshots.
If that is all you care.
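
(For reference, that whole-filesystem variant is the nodatacow mount option,
which, as noted above, also gives up data checksums and compression for
those writes; the device and mount point here are examples:)

  mount -o nodatacow /dev/sdb /mnt/tank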


CoW is still valuable for us as we're shooting to support on the order 
of hundreds of snapshots per subvolume, and without it (if BTRFS COW 
works the same as our old COW FS) that's going to be quite expensive to 
keep snapshots around.  So some hybrid solution is required here.


Best,

ellis


Re: btrfs-cleaner / snapshot performance analysis

2018-02-11 Thread Andrei Borzenkov
11.02.2018 04:02, Hans van Kranenburg пишет:
...
> 
>> - /dev/sda6 / btrfs
>> rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot
>> 0 0
> 
> Note that changes on atime cause writes to metadata, which means cowing
> metadata blocks and unsharing them from a previous snapshot, only when
> using the filesystem, not even when changing things (!).

With relatime, atime is updated only once after a file was changed. So your
description is not entirely accurate and things should not be that
dramatic unless files are continuously being changed.




Re: btrfs-cleaner / snapshot performance analysis

2018-02-10 Thread Qu Wenruo


On 2018年02月10日 00:45, Ellis H. Wilson III wrote:
> Hi all,
> 
> I am trying to better understand how the cleaner kthread (btrfs-cleaner)
> impacts foreground performance, specifically during snapshot deletion.
> My experience so far has been that it can be dramatically disruptive to
> foreground I/O.
> 
> Looking through the wiki at kernel.org I have not yet stumbled onto any
> analysis that would shed light on this specific problem.  I have found
> numerous complaints about btrfs-cleaner online, especially relating to
> quotas being enabled.  This has proven thus far less than helpful, as
> the response tends to be "use less snapshots," or "disable quotas," both
> of which strike me as intellectually unsatisfying answers, especially
> the former in a filesystem where snapshots are supposed to be
> "first-class citizens."

Yes, snapshots in btrfs really are "first-class citizens".
Tons of design decisions are biased toward snapshots.

But one should be clear about one thing:
Snapshot creation and backref walk (used in qgroup, relocation and
extent deletion) are in fact two conflicting workloads.

Btrfs puts snapshot creation on a very high priority, so that it greatly
degrades the performance of backref walk (used in snapshot deletion,
relocation and extent exclusive/shared calculation of qgroup).

Let me explain this problem in detail.

Just as explained by Peter Grandi, for any snapshot system (or any
system that supports reflink) there must be a reverse mapping tree, to
tell which extent is used by whom.

It's critical for determining whether an extent is shared, so that we
know whether we need to do CoW.

There are several different ways to implement it, and this hugely
affects snapshot creation performance.

1) Direct mapping record
   Just records exactly which extent is used by whom, directly.
   So when we need to check the owner, we just search the tree ONCE,
   and we have it.

   This is simple, and it seems that both the LVM thin-provisioning and
   traditional LVM targets use it.
   (Maybe XFS also follows this way?)

   Pros:
   *FAST* backref walk, which means quick extent deletion and CoW
   condition check.


   Cons:
   *SLOW* snapshot creation.
   Each snapshot creation needs to insert new owner relationships into
   the tree, and this modification grows with the size of the snapshot
   source.

2) Indirect mapping record
   Records the upper-level referencer only.

   To get all the direct owners of an extent, it needs multiple lookups
   in the reverse mapping tree.

   And obviously, btrfs is using this method.

   Pros:
   *FAST* owner inheritance, which means snapshot creation.
   (Well, the only advantage I can think of)

   Cons:
   *VERY SLOW* backref walk, used by extent deletion, relocation, qgroup
   and CoW condition check.
   (That may also be why btrfs defaults to CoW data, so that it can skip
   the costly backref walk)

And a more detailed example of the difference between them will be:

[Basic tree layout]
            Tree X
            node A
           /      \
      node B      node C
      /    \      /    \
  leaf D  leaf E  leaf F  leaf G

Use the above tree X as the snapshot source.

[Snapshot creation: Direct mapping]
Then for direct mapping record, if we are going to create snapshot Y
then we would get:

   Tree X      Tree Y
   node A      node H
    |   \      /   |
    |    \    /    |
    |     \  /     |
    |      \/      |
    |      /\      |
    |     /  \     |
   node B      node C
    /  \        /  \
 leaf D  leaf E  leaf F  leaf G

We need to create new node H, and update the owner for node B/C/D/E/F/G.

That's to say, we need to create 1 new node, and update 6 references of
existing nodes/leaves.
And this will grow rapidly if the tree is large, but still should be a
linear increase.


[Snapshot creation: Indirect mapping]
If using an indirect mapping tree, then firstly the reverse mapping tree
doesn't record the exact owner of each leaf/node, but only records
its parent(s).

So even when tree X exists alone, without snapshot Y, if we need to know
the owner of leaf D, we only know that its only parent is node B.
We then repeat the same query on node B, and so on, until we reach node A
and know the leaf is owned by tree X.

   Tree X     ^
   node A     | Look upward until
     /        | we reach tree root
  node B      | to search the owner
    /         | of a leaf/node
 leaf D       |

So even in the best case, to look up the owner of leaf D, we need to do
3 lookups: one for leaf D, one for node B, and one for node A (which is
the end).
Such a lookup gets more and more complex if there are extra branches in
the lookup chain.
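
(As an aside, one way to watch such a backref walk from userspace might be
the btrfs inspect-internal commands; a sketch with a placeholder address and
path, remembering that filefrag prints offsets in filesystem blocks:)

  filefrag -v /mnt/fs/somefile    # on btrfs the 'physical' offsets are btrfs logical addresses
  btrfs inspect-internal logical-resolve -v 137625600 /mnt/fs
  # every path printed required walking back references up to a tree root,
  # i.e. exactly the indirect lookup described above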

But such a complicated design makes one thing easier, and that is snapshot
creation:
Tree X  Tree Y
  

Re: btrfs-cleaner / snapshot performance analysis

2018-02-10 Thread Hans van Kranenburg
Hey,

On 02/10/2018 07:29 PM, Ellis H. Wilson III wrote:
> Thank you very much for your response Hans.  Comments in-line, but I did
> want to handle one miscommunication straight-away:
> 
> I'm a huge fan of BTRFS.  If I came off like I was complaining, my
> sincere apologies.   To be completely transparent we are using BTRFS in
> a very large project at my company, which I am lead architect on, and
> while I have read the academic papers, perused a subset of the source
> code, and been following its development in the background, I now need
> to deeply understand where there might be performance hiccups.

I'd suggest just trying to do what you want to do for real, finding out
what the problems are and then finding out what to do about them, but I
think that's already almost exactly what you've started doing now. :)

If you ask 100 different btrfs users about your specific situation, you
probably get 100 different answers. So, I'll just throw some of my own
thoughts in here, which may or may not make sense for you.

> All of
> our foreground I/O testing with BTRFS in RAID0/RAID1/single across
> different SSDs and HDDs has been stellar, but we haven't dug too far
> into snapshot performance, balancing, and other more background-oriented
> performance.  Hence my interest in finding documentation and analysis I
> can read and grok myself on the implications of snapshot operations on
> foreground I/O if such exists.

> More in-line below:
> 
> On 02/09/2018 03:36 PM, Hans van Kranenburg wrote:
>>> This has proven thus far less than helpful, as
>>> the response tends to be "use less snapshots," or "disable quotas," both
>>> of which strike me as intellectually unsatisfying answers
>>
>> Well, sometimes those answers help. :) "Oh, yes, I disabled qgroups, I
>> didn't even realize I had those, and now the problem is gone."
> 
> I meant less than helpful for me, since for my project I need detailed
> and fairly accurate capacity information per sub-volume, and the
> relationship between qgroups and subvolume performance wasn't being
> spelled out in the responses.  Please correct me if I am wrong about
> needing qgroups enabled to see detailed capacity information
> per-subvolume (including snapshots).

Aha, so you actually want to use qgroups.

>>> the former in a filesystem where snapshots are supposed to be
>>> "first-class citizens."

They are. But if you put extra optional features X, Y and Z on top which
kill your performance, then snapshots are still supposed to be first-class
citizens, but features X, Y and Z might start blurring it a bit.

The problem is that qgroups and quota etc. are still in development, and if
you ask the developers, they are probably honest about the fact that you
cannot just enable that part of the functionality without some expected
and unexpected performance side effects.

>> Throwing complaints around is also not helpful.
> 
> Sorry about this.  It wasn't directed in any way at BTRFS developers,
> but rather some of the suggestions for solution proposed in random
> forums online.
> As mentioned I'm a fan of BTRFS, especially as my
> project requires the snapshots to truly be first-class citizens in that
> they are writable and one can roll-back to them at-will, unlike in ZFS
> and other filesystems.  I was just saying it seemed backwards to suggest
> having less snapshots was a solution in a filesystem where the
> architecture appears to treat them as a core part of the design.

And I was just saying that subvolumes and snapshots are fine, and that
you shouldn't blame them while your problems might be more likely
qgroups/quota related.

>> The "performance implications" are highly dependent on your specific
>> setup, kernel version, etc, so it really makes sense to share:
>>
>> * kernel version
>> * mount options (from /proc/mounts|grep btrfs)
>> * is it ssd? hdd? iscsi lun?
>> * how big is the FS
>> * how many subvolumes/snapshots? (how many snapshots per subvolume)
> 
> I will answer the above, but would like to reiterate my previous comment
> that I still would like to understand the fundamental relationships here
> as in my project kernel version is very likely to change (to more
> recent), along with mount options and underlying device media.  Once
> this project hits the field I will additionally have limited control
> over how large the FS gets (until physical media space is exhausted of
> course) or how many subvolumes/snapshots there are.  If I know that
> above N snapshots per subvolume performance tanks by M%, I can apply
> limits on the use-case in the field, but I am not aware of those kinds
> of performance implications yet.
> 
> My present situation is the following:
> - Fairly default opensuse 42.3.
> - uname -a: Linux betty 4.4.104-39-default #1 SMP Thu Jan 4 08:11:03 UTC
> 2018 (7db1912) x86_64 x86_64 x86_64 GNU/Linux

You're ignoring 2 years of development and performance improvements. I'd
suggest jumping forward to 4.14 to see which part of your problems will
disappear 

Re: btrfs-cleaner / snapshot performance analysis

2018-02-10 Thread Ellis H. Wilson III
Thank you very much for your response Hans.  Comments in-line, but I did 
want to handle one miscommunication straight-away:


I'm a huge fan of BTRFS.  If I came off like I was complaining, my 
sincere apologies.   To be completely transparent we are using BTRFS in 
a very large project at my company, which I am lead architect on, and 
while I have read the academic papers, perused a subset of the source 
code, and been following its development in the background, I now need
to deeply understand where there might be performance hiccups.  All of 
our foreground I/O testing with BTRFS in RAID0/RAID1/single across 
different SSDs and HDDs has been stellar, but we haven't dug too far 
into snapshot performance, balancing, and other more background-oriented 
performance.  Hence my interest in finding documentation and analysis I 
can read and grok myself on the implications of snapshot operations on 
foreground I/O if such exists.  More in-line below:


On 02/09/2018 03:36 PM, Hans van Kranenburg wrote:

This has proven thus far less than helpful, as
the response tends to be "use less snapshots," or "disable quotas," both
of which strike me as intellectually unsatisfying answers


Well, sometimes those answers help. :) "Oh, yes, I disabled qgroups, I
didn't even realize I had those, and now the problem is gone."


I meant less than helpful for me, since for my project I need detailed 
and fairly accurate capacity information per sub-volume, and the 
relationship between qgroups and subvolume performance wasn't being 
spelled out in the responses.  Please correct me if I am wrong about 
needing qgroups enabled to see detailed capacity information 
per-subvolume (including snapshots).
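
(For concreteness, the qgroup route referred to here is roughly the
following; the mount point is an example:)

  btrfs quota enable /mnt/fs
  btrfs quota rescan -w /mnt/fs    # wait for the initial accounting scan
  btrfs qgroup show /mnt/fs        # referenced and exclusive bytes per subvolume/snapshot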



the former in a filesystem where snapshots are supposed to be
"first-class citizens."


Throwing complaints around is also not helpful.


Sorry about this.  It wasn't directed in any way at BTRFS developers, 
but rather some of the suggestions for solution proposed in random 
forums online.  As mentioned I'm a fan of BTRFS, especially as my 
project requires the snapshots to truly be first-class citizens in that 
they are writable and one can roll-back to them at-will, unlike in ZFS 
and other filesystems.  I was just saying it seemed backwards to suggest 
having less snapshots was a solution in a filesystem where the 
architecture appears to treat them as a core part of the design.



The "performance implications" are highly dependent on your specific
setup, kernel version, etc, so it really makes sense to share:

* kernel version
* mount options (from /proc/mounts|grep btrfs)
* is it ssd? hdd? iscsi lun?
* how big is the FS
* how many subvolumes/snapshots? (how many snapshots per subvolume)


I will answer the above, but would like to reiterate my previous comment 
that I still would like to understand the fundamental relationships here 
as in my project kernel version is very likely to change (to more 
recent), along with mount options and underlying device media.  Once 
this project hits the field I will additionally have limited control 
over how large the FS gets (until physical media space is exhausted of 
course) or how many subvolumes/snapshots there are.  If I know that 
above N snapshots per subvolume performance tanks by M%, I can apply 
limits on the use-case in the field, but I am not aware of those kinds 
of performance implications yet.


My present situation is the following:
- Fairly default opensuse 42.3.
- uname -a: Linux betty 4.4.104-39-default #1 SMP Thu Jan 4 08:11:03 UTC 
2018 (7db1912) x86_64 x86_64 x86_64 GNU/Linux
- /dev/sda6 / btrfs 
rw,relatime,ssd,space_cache,subvolid=259,subvol=/@/.snapshots/1/snapshot 0 0
(I have about 10 other btrfs subvolumes, but this is the only one being 
snapshotted)
- At the time of my noticing the slow-down, I had about 24 snapshots, 10 
of which were in the process of being deleted

- Usage output:
~> sudo btrfs filesystem usage /
Overall:
    Device size:          40.00GiB
    Device allocated:     11.54GiB
    Device unallocated:   28.46GiB
    Device missing:          0.00B
    Used:                  7.57GiB
    Free (estimated):     32.28GiB  (min: 32.28GiB)
    Data ratio:               1.00
    Metadata ratio:           1.00
    Global reserve:       28.44MiB  (used: 0.00B)

Data,single: Size:11.01GiB, Used:7.19GiB
   /dev/sda6      11.01GiB

Metadata,single: Size:512.00MiB, Used:395.91MiB
   /dev/sda6     512.00MiB

System,single: Size:32.00MiB, Used:16.00KiB
   /dev/sda6      32.00MiB

Unallocated:
   /dev/sda6      28.46GiB


And what's essential to look at is what your computer is doing while you
are throwing a list of subvolumes into the cleaner.

* is it using 100% cpu?
* is it showing 100% disk read I/O utilization?
* is it showing 100% disk write I/O utilization? (is it writing lots and
lots of data to disk?)
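
(One possible way to capture those numbers while the cleaner is running,
assuming the usual procps/sysstat tools are installed:)

  top -b -n 1 | grep btrfs    # CPU use of the btrfs kernel threads
  iostat -x 1                 # per-device %util and read/write throughput
  pidstat -d 1                # per-task disk I/O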


I noticed the problem when Thunderbird became completely 

Re: btrfs-cleaner / snapshot performance analysis

2018-02-09 Thread Peter Grandi
> I am trying to better understand how the cleaner kthread
> (btrfs-cleaner) impacts foreground performance, specifically
> during snapshot deletion.  My experience so far has been that
> it can be dramatically disruptive to foreground I/O.

That's such a warmly innocent and optimistic question! This post
gives the answer, and to an even more general question:

  http://www.sabi.co.uk/blog/17-one.html?170610#170610

> the response tends to be "use less snapshots," or "disable
> quotas," both of which strike me as intellectually
> unsatisfying answers, especially the former in a filesystem
> where snapshots are supposed to be "first-class citizens."

They are "first class" but not "cost-free".
In particular every extent is linked in a forward map and a
reverse map, and deleting a snapshot involves materializing and
updating a join of the two, which seems to be done with a
classic nested-loop join strategy resulting in N^2 running
time. I suspect that quotas have a similar optimization.