Re: btrfs-cleaner / snapshot performance analysis

2018-02-09 Thread Peter Grandi
> I am trying to better understand how the cleaner kthread
> (btrfs-cleaner) impacts foreground performance, specifically
> during snapshot deletion.  My experience so far has been that
> it can be dramatically disruptive to foreground I/O.

That's such a warmly innocent and optimistic question! This post
gives the answer, to this and to an even more general question:

  http://www.sabi.co.uk/blog/17-one.html?170610#170610

> the response tends to be "use less snapshots," or "disable
> quotas," both of which strike me as intellectually
> unsatisfying answers, especially the former in a filesystem
> where snapshots are supposed to be "first-class citizens."

They are "first class" but not "cost-free".
In particular every extent is linked in a forward map and a
reverse map, and deleting a snapshot involves materializing and
updating a join of the two, which seems to be done with a
classic nested-loop join strategy, resulting in N^2 running
time. I suspect that quota accounting involves a similarly
expensive strategy.


Re: Btrfs reserve metadata problem

2018-01-02 Thread Peter Grandi
> When testing Btrfs with fio 4k random write,

That's an exceptionally narrowly defined workload. Also it is
narrower than that, because it must be without 'fsync' after
each write, or else there would be no accumulation of dirty
blocks in memory at all.

> I found that volume with smaller free space available has
> lower performance.

That's an inappropriate use of "performance"... The speed may be
lower; the performance is another matter.

> It seems that the smaller the free space of volume is, the
> smaller amount of dirty page filesystem could have.

Is this a problem? Consider: all filesystems do less well when
there is less free space (smaller chance of finding spatially
compact allocations), and it is usually good to minimize the
amount of dirty pages anyhow (even if there are reasons to
delay writing them out).
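
For those who want to keep the amount of dirty pages bounded
regardless of the filesystem, the global writeback thresholds
can be inspected and lowered; a minimal sketch (the byte values
are purely illustrative, not recommendations):

  # show the current global writeback thresholds
  sysctl vm.dirty_background_ratio vm.dirty_ratio
  # cap dirty memory in absolute bytes instead of a percentage of RAM
  sysctl -w vm.dirty_background_bytes=$((64*1024*1024))
  sysctl -w vm.dirty_bytes=$((256*1024*1024))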

> [ ... ] btrfs will reserve metadata for every write.  The
> amount to reserve is calculated as follows: nodesize *
> BTRFS_MAX_LEVEL(8) * 2, i.e., it reserves 256KB of metadata.
> The maximum amount of metadata reservation depends on size of
> metadata currently in used and free space within volume(free
> chunk size /16) When metadata reaches the limit, btrfs will
> need to flush the data to release the reservation.

I don't understand here: under POSIX semantics filesystems are
not really allowed to avoid flushing *metadata* to disk for most
operations, that is metadata operations have an implied 'fsync'.
In your case of the "4k random write" with "cow disabled", the
only metadata that should get updated is the last-modified
timestamp, unless the user/application has been so amazingly
stupid as to not preallocate the file, and then they deserve
whatever they get.

> 1. Is there any logic behind the value (free chunk size /16)

>   /*
>* If we have dup, raid1 or raid10 then only half of the free
>* space is actually useable. For raid56, the space info used
>* doesn't include the parity drive, so we don't have to
>* change the math
>*/
>   if (profile & (BTRFS_BLOCK_GROUP_DUP |
>   BTRFS_BLOCK_GROUP_RAID1 |
>   BTRFS_BLOCK_GROUP_RAID10))
>avail >>= 1;

As written there is a plausible logic, but it is quite crude.

>   /*
>* If we aren't flushing all things, let us overcommit up to
>* 1/2th of the space. If we can flush, don't let us overcommit
>* too much, let it overcommit up to 1/8 of the space.
>*/
>   if (flush == BTRFS_RESERVE_FLUSH_ALL)
>avail >>= 3;
>   else
>avail >>= 1;

Presumably overcommitting brings some benefits on other workloads.

In particular other parts of Btrfs don't behave awesomely well
when free space runs out.
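
As an aside, a quick way to see how close a volume is to running
out of unallocated space (the usual trigger for that behaviour)
is something like the following ('/mnt' is just a placeholder
mount point):

  btrfs filesystem usage /mnt   # per-profile allocation and unallocated space
  btrfs filesystem df /mnt      # data/metadata/system chunk usage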

> 2. Is there any way to improve this problem?

Again, is it a problem? More interestingly, if it is a problem,
is a solution available that does not impact other workloads?
It is simply impossible to optimize a filesystem perfectly for
every workload.

I'll try to summarize your report as I understand it:

* If:
  - The workload is "4k random write" (without 'fsync').
  - On a "cow disabled" file.
  - The file is not preallocated.
  - There is not much free space available.
* Then allocation overcommitting results in a higher frequency
  of unrequested metadata flushes, and those metadata flushes
  slow down a specific benchmark.


Re: Unexpected raid1 behaviour

2017-12-19 Thread Peter Grandi
[ ... ]

> The advantage of writing single chunks when degraded, is in
> the case where a missing device returns (is readded,
> intact).  Catching up that device with the first drive, is a
> manual but simple invocation of 'btrfs balance start
> -dconvert=raid1,soft -mconvert=raid1,soft' The alternative is
> a full balance or full scrub. It's pretty tedious for big
> arrays.

That is merely an after-the-fact rationalization for a design
that is at the same time entirely logical and quite broken: that
the intended replication factor is the same as the current
number of members of the volume, so if a volume has (currently)
only one member, then only "single" chunks get created.

A design that would work better for operations would be to have
"profiles" to be a concept entirely independent of number of
members, or perhaps more precisely to have the "desired" profile
of a chunk be distinct from the "actual" profile (dependent on
the actual number of members of a volume) of that chunk, so that
if a volume has only one member, chunks could be created that
have "desired" profile 'raid1' but "actual" profile 'single', or
perhaps more sensibly 'raid1-with-missing-mirror', with checks
that "actual" profile be usable else the volume is not
mountable.

Note: ideally every chunk would have both a static desired
profile and a desired stripe width, and a computed actual
profile and an actual stripe width. Or perhaps the desired
profile and width would be properties of the volume (for each of
the three types of data).

For example in MD RAID it is perfectly legitimate to create a
RAID6 set with "desired" width of 6 and "actual" width of 4 (in
which case it can be activated as degraded) or a RAID5 set with
"desired" width of 5 and actual width of 3 (in which case it
cannot be activated at all until at least another member is
added).
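
For example, a rough mdadm sketch of that (device names are
placeholders; mdadm accepts 'missing' for absent members, up to
the redundancy of the level):

  # "desired" width 6, "actual" width 4: the set starts degraded but usable
  mdadm --create /dev/md0 --level=6 --raid-devices=6 \
    /dev/sdb1 /dev/sdc1 /dev/sdd1 /dev/sde1 missing missing
  # later, add real members and let MD rebuild onto them
  mdadm /dev/md0 --add /dev/sdf1 /dev/sdg1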

The difference with MD RAID is that in MD RAID there is (except
in one case, during conversion) an exact match between the
"desired" stripe width and the number of members, while at least
in principle a Btrfs volume can have any number of chunks of any
profile of any desired stripe size (except that the current
implementation is not so flexible in most profiles).

That would require scanning all chunks to determine whether a
volume is mountable at all or mountable only as degraded, while
MD RAID can just count the members. Apparently recent versions
of the Btrfs 'raid1' profile do just that.


Re: Unexpected raid1 behaviour

2017-12-18 Thread Peter Grandi
>> The fact is, the only cases where this is really an issue is
>> if you've either got intermittently bad hardware, or are
>> dealing with external

> Well, the RAID1+ is all about the failing hardware.

>> storage devices. For the majority of people who are using
>> multi-device setups, the common case is internally connected
>> fixed storage devices with properly working hardware, and for
>> that use case, it works perfectly fine.

> If you're talking about "RAID"-0 or storage pools (volume
> management) that is true. But if you imply, that RAID1+ "works
> perfectly fine as long as hardware works fine" this is
> fundamentally wrong.

I really agree with this; the argument about "properly working
hardware" is utterly ridiculous. I'll add to this: apparently I am
not the first one to discover the "anomalies" in the "RAID"
profiles, but I may have been the first to document some of
them, e.g. the famous issues with the 'raid1' profile. How did I
discover them? Well, I had used Btrfs in single device mode for
a bit, and wanted to try multi-device, and the docs seemed
"strange", so I did tests before trying it out.

The tests were simply on a spare PC with a bunch of old disks:
create two block devices (partitions), put them in 'raid1',
first natively, then by adding a new member to an existing
single-device filesystem, and then 'remove' one, or simply
unplug it (actually 'echo 1 > /sys/block/.../device/delete'). I
wanted to check exactly what happened: resync times, behaviour
and speed when degraded, just ordinary operational tasks.
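
For the record, the whole test fits in a handful of commands; a
rough sketch with placeholder device names and mount point:

  # native two-device 'raid1' volume
  mkfs.btrfs -m raid1 -d raid1 /dev/sdb1 /dev/sdc1
  mount /dev/sdb1 /mnt/test
  # or: start single-device, add a member, then convert the profiles
  #   btrfs device add /dev/sdc1 /mnt/test
  #   btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/test
  # then try removing a member politely (refused below the 'raid1' minimum) ...
  btrfs device remove /dev/sdc1 /mnt/test
  # ... or simulate the disk simply vanishing
  echo 1 > /sys/block/sdc/device/delete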

Well, I found significant problems after less than one hour. I
can't imagine anyone with some experience of hw or sw RAID
(especially hw RAID, as hw RAID firmware is often fantastically
buggy, especially as to RAID operations) who wouldn't have done
the same tests before operational use, and would not have found
the same issues straight away. The only conclusion I can draw is
that whoever designed the "RAID" profiles had zero operational
system administration experience.

> If the hardware needs to work properly for the RAID to work
> properly, noone would need this RAID in the first place.

It is not just that, but some maintenance operations are needed
even if the hardware works properly: for example preventive
maintenance, replacing drives that are becoming too old,
expanding capacity, periodically testing the hardware. Systems
engineers don't just say "it works, let's assume it continues to
work properly, why worry".

My impression is that multi-device and "chunks" were designed in
one way by someone, and someone else did not understand the
intent, and confused them with "RAID", and based the 'raid'
profiles on that confusion. For example the 'raid10' profile
seems the least confused to me, and that is, I think, because the
"RAID" aspect is kept more distinct from the "multi-device"
aspect. But perhaps I am an optimist...

To simplify a longer discussion: to have "RAID" one needs an
explicit design concept of "stripe", which in Btrfs needs to be
quite different from that of "set of member devices" and
"chunks", so that, for example, adding/removing to a "stripe" is
not quite the same thing as adding/removing members to a volume;
plus a distinction between online and offline members, not just
added and removed ones, and well-defined state-machine
transitions (e.g. in response to hardware problems) among all
those, like in MD RAID. But the importance of such distinctions
may not be apparent to everybody.

But I may have read comments in which "block device" (a data
container on some medium), "block device inode" (a descriptor
for that) and "block device name" (a path to a "block device
inode") were hopelessly confused, so I don't hold a lot of
hope. :-(


Re: Unexpected raid1 behaviour

2017-12-18 Thread Peter Grandi
>> I haven't seen that, but I doubt that it is the radical
>> redesign of the multi-device layer of Btrfs that is needed to
>> give it operational semantics similar to those of MD RAID,
>> and that I have vaguely described previously.

> I agree that btrfs volume manager is incomplete in view of
> data center RAS requisites, there are couple of critical
> bugs and inconsistent design between raid profiles, but I
> doubt if it needs a radical redesign.

Well it needs a radical redesign because the original design was
based on an entirely consistent and logical concept that was
quite different from that required for sensible operations, and
then special-case code was added (and keeps being added) to
fix the consequences.

But I suspect that it does not need a radical *recoding*,
because most if not all of the needed code is already there.
All that needs changing, most likely, is the member state-machine;
that's the bit that needs a radical redesign, and it is a
relatively small part of the whole.

The closer the member state-machine design is to the MD RAID one
the better as it is a very workable, proven model.

Sometimes I suspect that the design needs to be changed to also
add a formal notion of "stripe" to the Btrfs internals, where a
"stripe" is a collection of chunks that are "related" (and
something like that is already part of the 'raid10' profile),
but I think that needs not be user-visible.


Re: Unexpected raid1 behaviour

2017-12-17 Thread Peter Grandi
"Duncan"'s reply is slightly optimistic in parts, so some
further information...

[ ... ]

> Basically, at this point btrfs doesn't have "dynamic" device
> handling.  That is, if a device disappears, it doesn't know
> it.

That's just the consequence of what is a completely broken
conceptual model: the current way most multi-device profiles are
designed is that block-devices can only be "added" or "removed",
and cannot be "broken"/"missing". Therefore if IO fails, that is
just one IO failing, not the entire block-device going away.
The time when a block-device is noticed as sort-of missing is
when it is not available for "add"-ing at start.

Put another way, the multi-device design is/was based on the
demented idea that block-devices that are missing are/should be
"remove"d, so that a 2-device volume with a 'raid1' profile
becomes a 1-device volume with a 'single'/'dup' profile, and not
a 2-device volume with a missing block-device and an incomplete
'raid1' profile, even if things have been awkwardly moving in
that direction in recent years.

Note the above is not totally accurate today because various
hacks have been introduced to work around the various issues.

> Thus, if a device disappears, to get it back you really have
> to reboot, or at least unload/reload the btrfs kernel module,
> in ordered to clear the stale device state and have btrfs
> rescan and reassociate devices with the matching filesystems.

IIRC that is not quite accurate: a "missing" device can be
nowadays "replace"d (by "devid") or "remove"d, the latter
possibly implying profile changes:

  https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Using_add_and_delete

Terrible tricks like this also work:

  https://www.spinics.net/lists/linux-btrfs/msg48394.html
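
For instance, with current tools something along these lines
works ('/mnt' and the devid are placeholders; check the real
devid of the missing member first):

  btrfs filesystem show /mnt            # note the devid reported as missing
  btrfs replace start 2 /dev/sdd1 /mnt  # rebuild devid 2 onto a new device
  # or drop the missing member altogether (this may require converting
  # profiles first, e.g. 'btrfs balance start -dconvert=single -mconvert=dup /mnt')
  btrfs device remove missing /mnt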

> Meanwhile, as mentioned above, there's active work on proper
> dynamic btrfs device tracking and management. It may or may
> not be ready for 4.16, but once it goes in, btrfs should
> properly detect a device going away and react accordingly,

I haven't seen that, but I doubt that it is the radical redesign
of the multi-device layer of Btrfs that is needed to give it
operational semantics similar to those of MD RAID, and that I
have vaguely described previously.

> and it should detect a device coming back as a different
> device too.

That is disagreeable because of poor terminology: I guess that
what was intended is that it should be able to detect a previous
member block-device becoming available again as a different
device inode, which currently is very dangerous in some vital
situations.

> Longer term, there's further patches that will provide a
> hot-spare functionality, automatically bringing in a device
> pre-configured as a hot- spare if a device disappears, but
> that of course requires that btrfs properly recognize devices
> disappearing and coming back first, so one thing at a time.

That would be trivial if the complete redesign of block-device
states of the Btrfs multi-device layer happened, adding for
example an "active" flag and an "accessible" flag to describe
new member states.

My guess is that, while logically consistent, the current
multi-device logic is fundamentally broken from an operational
point of view, and needs a complete replacement instead of
fixes.


Re: [PATCH 0/7] retry write on error

2017-12-03 Thread Peter Grandi
> [ ... ] btrfs incorporates disk management which is actually a
> version of md layer, [ ... ]

As far as I know Btrfs has no disk management, and was wisely
designed without any, just like MD: Btrfs volumes and MD sets
can be composed from "block devices", not disks, and block
devices are quite high level abstractions, as they closely mimic
the semantics of a UNIX file, not a physical device.


Re: [PATCH 0/7] retry write on error

2017-11-28 Thread Peter Grandi
>>> If the underlying protocal doesn't support retry and there
>>> are some transient errors happening somewhere in our IO
>>> stack, we'd like to give an extra chance for IO.

>> A limited number of retries may make sense, though I saw some
>> long stalls after retries on bad disks.

Indeed! One of the major issues in actual storage administration
is to find ways to reliably disable most retries, or to shorten
them, both at the block device level and the device level,
because in almost all cases where storage reliability matters
what is important is simply swapping out the failing device
immediately and then examining and possibly refreshing it
offline.

To the point that many device manufacturers deliberately cripple
in cheaper products retry shortening or disabling options to
force long stalls, so that people who care about reliability
more than price will buy the more expensive version that can
disable or shorten retries.

> Seems preferable to avoid issuing retries when the underlying
> transport layer(s) has already done so, but I am not sure
> there is a way to know that at the fs level.

Indeed, and to use a euphemism, a third layer of retries at the
filesystem level is currently a thoroughly imbecilic idea :-),
as whether retries are worth doing is not a filesystem-dependent
issue (but then plugging is done at the block io level when it
is entirely device dependent whether it is worth doing, so there
is famous precedent).

There are excellent reasons why error recovery has in general
not been done at the filesystem level for around 20 years, which
do not need repeating every time. However one of them is that,
where it makes sense, device firmware does retries, and the
block-device layer does retries too, which is often a bad idea;
and where it is not, the block IO level should do that, not the
filesystem.

A large part of the above discussion would not be needed if
Linux kernel "developers" exposed a clear notion of hardware
device and block device state machine and related semantics, or
even knew that it was desirable, but that's an idea that is
only 50 years old, so may not have yet reached popularity :-).


Re: Fixed subject: updatedb does not index separately mounted btrfs subvolumes

2017-11-05 Thread Peter Grandi
>> The issue is that updatedb by default will not index bind
>> mounts, but by default on Fedora and probably other distros,
>> put /home on a subvolume and then mount that subvolume which
>> is in effect a bind mount.

> 

> So the issue isn't /home being btrfs (as you said in the
> subject), but rather, it's /home being an explicitly mounted
> subvolume, since btrfs uses bind-mounts internally for
> subvolume mounts.

That to me seems like rather improper terminology and notes, and
I would consider these to be more appropriate:

* There are entities known as "root directories", and their main
  property is that all inodes reachable from one in the same
  filesystem have the same "device id".
* Each "filesystem" has at least one, and a Btrfs "volume" has
  one for every "subvolume", including the "top subvolume".
* A "root directory" can be "mounted" on a "mount point"
  directory of another "filesystem", which allows navigating
  from one filesystem to another.
* A "mounted" root directory can be identified by the device id
  of '.' being different from that of '..'.
* In Linux a "root directory" can be "mounted" onto several
  "mount point" directories at the same time.
* In Linux a "bind" operation is not a "mount" operation, it is
  in effect a kind of temporary "hard link", one that makes a
  directory aliased to a "bind point" directory.

Looking at this:

  tree#  tail -3 /proc/mounts
  /dev/mapper/sda7 /fs/sda7 btrfs rw,nodiratime,relatime,nossd,nospace_cache,user_subvol_rm_allowed,subvolid=5,subvol=/ 0 0
  /dev/mapper/sda7 /fs/sda7/bind btrfs rw,nodiratime,relatime,nossd,nospace_cache,user_subvol_rm_allowed,subvolid=431,subvol=/= 0 0
  /dev/mapper/sda7 /fs/sda7/bind-tmp btrfs rw,nodiratime,relatime,nossd,nospace_cache,user_subvol_rm_allowed,subvolid=431,subvol=/=/tmp 0 0

  tree#  stat --format "%3D %6i %N" {,/fs,/fs/sda7}/{.,..} /fs/sda7/{=,=/subvol,=/subvol/dir,=/tmp,bind,bind-tmp}/{.,..}
  806      2 ‘/.’
  806      2 ‘/..’
   23  36176 ‘/fs/.’
  806      2 ‘/fs/..’
   26    256 ‘/fs/sda7/.’
   23  36176 ‘/fs/sda7/..’
   27    256 ‘/fs/sda7/=/.’
   26    256 ‘/fs/sda7/=/..’
   2b    256 ‘/fs/sda7/=/subvol/.’
   27    256 ‘/fs/sda7/=/subvol/..’
   2b    258 ‘/fs/sda7/=/subvol/dir/.’
   2b    256 ‘/fs/sda7/=/subvol/dir/..’
   27 344618 ‘/fs/sda7/=/tmp/.’
   27    256 ‘/fs/sda7/=/tmp/..’
   27    256 ‘/fs/sda7/bind/.’
   26    256 ‘/fs/sda7/bind/..’
   27 344618 ‘/fs/sda7/bind-tmp/.’
   26    256 ‘/fs/sda7/bind-tmp/..’

It shows that subvolume root directories are "mount points" and
not "bind points" (note that ‘/fs/sda7/=/subvol’ is not
explicitly mounted, yet its '.' and '..' have different device
ids), and that "bind points" appear as if they were ordinary
directories (an unwise decision I suspect).

Many tools for UNIX-like systems don't cross "mount point"
directories (or follow symbolic links), by default or with an
option, but will cross "bind point" directories as they look
like ordinary directories.

For 'mlocate' the "bind point" directories are a special case,
handled by looking up every directory examined in the list of
"bind point" directories, as per line 381 here:

https://pagure.io/mlocate/blob/master/f/src/bind-mount.c#_381


Re: defragmenting best practice?

2017-11-01 Thread Peter Grandi
> Another one is to find the most fragmented files first or all
> files of at least 1M with with at least say 100 fragments as in:

> find "$HOME" -xdev -type f -size +1M -print0 | xargs -0 filefrag \
> | perl -n -e 'print "$1\0" if (m/(.*): ([0-9]+) extents/ && $1 > 100)' \
> | xargs -0 btrfs fi defrag

That should have "&& $2 > 100".
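
That is, the corrected pipeline reads:

  find "$HOME" -xdev -type f -size +1M -print0 | xargs -0 filefrag \
| perl -n -e 'print "$1\0" if (m/(.*): ([0-9]+) extents/ && $2 > 100)' \
| xargs -0 btrfs fi defrag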


Re: Need help with incremental backup strategy (snapshots, defragmenting & performance)

2017-11-01 Thread Peter Grandi
[ ... ]

> The poor performance has existed from the beginning of using
> BTRFS + KDE + Firefox (almost 2 years ago), at a point when
> very few snapshots had yet been created. A comparison system
> running similar hardware as well as KDE + Firefox (and LVM +
> EXT4) did not have the performance problems. The difference
> has been consistent and significant.

That seems rather unlikely to depend on Btrfs, as I use Firefox
56 + KDE4 + Btrfs without issue on somewhat old/small desktop
and laptop machines, and it is implausible on general grounds.
You haven't provided so far any indication or quantification of
your "speed" problem (which may or may not be a "performance"
issue).

The things to look at are usually disk IO latency and rates,
and system CPU time, while the bad speed is observable (user CPU
time is usually stuck at 100% on any JS-based site, as written
earlier). To look at IO latency and rates the #1 choice is
always 'iostat -dk -zyx 1', and to look at system CPU (and user
CPU) and other interesting details I suggest using 'htop' with
the attached configuration file, to be written to
"$HOME/.config/htop/htoprc".

> Sometimes I have used Snapper settings like this:

> TIMELINE_MIN_AGE="1800"
> TIMELINE_LIMIT_HOURLY="36"
> TIMELINE_LIMIT_DAILY="30"
> TIMELINE_LIMIT_MONTHLY="12"
> TIMELINE_LIMIT_YEARLY="10"

> However, I also have some computers set like this:

> TIMELINE_MIN_AGE="1800"
> TIMELINE_LIMIT_HOURLY="10"
> TIMELINE_LIMIT_DAILY="10"
> TIMELINE_LIMIT_WEEKLY="0"
> TIMELINE_LIMIT_MONTHLY="0"
> TIMELINE_LIMIT_YEARLY="0"

The first seems a bit "aspirational". IIRC "someone" confessed
that the SUSE default of 'TIMELINE_LIMIT_YEARLY="10"' was imposed
by external forces in the SUSE default configuration:
https://github.com/openSUSE/snapper/blob/master/data/default-config

https://wiki.archlinux.org/index.php/Snapper#Set_snapshot_limits
https://lists.opensuse.org/yast-devel/2014-05/msg00036.html

# Beware! This file is rewritten by htop when settings are changed in the interface.
# The parser is also very primitive, and not human-friendly.
fields=0 48 38 39 40 44 62 63 2 46 13 14 1 
sort_key=47
sort_direction=1
hide_threads=1
hide_kernel_threads=1
hide_userland_threads=1
shadow_other_users=0
show_thread_names=1
highlight_base_name=1
highlight_megabytes=1
highlight_threads=1
tree_view=0
header_margin=0
detailed_cpu_time=1
cpu_count_from_zero=1
update_process_names=0
color_scheme=0
delay=15
left_meters=AllCPUs Memory Swap 
left_meter_modes=1 1 1 
right_meters=Tasks LoadAverage Uptime 
right_meter_modes=2 2 2 


Re: defragmenting best practice?

2017-11-01 Thread Peter Grandi
> When defragmenting individual files on a BTRFS filesystem with
> COW, I assume reflinks between that file and all snapshots are
> broken. So if there are 30 snapshots on that volume, that one
> file will suddenly take up 30 times more space... [ ... ]

Defragmentation works by effectively making a copy of the file
contents (simplistic view), so the end result is one copy whose
contents remain shared (reflinked) among the 30 snapshots, and
one defragmented copy for the live file.

> Can you also give an example of using find, as you suggested
> above? [ ... ]

Well, one way is to use 'find' as a filtering replacement for
'defrag' option '-r', as in for example:

  find "$HOME" -xdev '(' -name '*.sqlite' -o -name '*.mk4' ')' \
-type f  -print0 | xargs -0 btrfs fi defrag

Another one is to find the most fragmented files first, or all
files of at least 1M with at least, say, 100 fragments, as in:

  find "$HOME" -xdev -type f -size +1M -print0 | xargs -0 filefrag \
| perl -n -e 'print "$1\0" if (m/(.*): ([0-9]+) extents/ && $1 > 100)' \
| xargs -0 btrfs fi defrag

But there are many 'find' web pages and that is not quite a
Btrfs related topic.

> [ ... ] The easiest way I know to exclude cache from
> BTRFS snapshots is to put it on a separate subvolume. I assumed this
> would make several things related to snapshots more efficient too.

Only slightly.

> Background: I'm not sure why our Firefox performance is so terrible

As I always say, "performance" is not the same as "speed", and
probably your Firefox "performance" is sort of OKish even if the
"speed" is terrile, and neither is likely related to the profile
or the cache being on Btrfs: most JavaScript based sites are
awfully horrible regardless of browser:

  http://www.sabi.co.uk/blog/13-two.html?130817#130817

and if Firefox makes a special contribution it tends to leak
memory in several odd but common cases:

  https://utcc.utoronto.ca/~cks/space/blog/web/FirefoxResignedToLeaks?showcomments

Plus it tends to cache too much, e.g. recently closed tabs.

But Firefox is not special because most web browsers are not
designed to run for a long time without a restart, and
Chromium/Chrome simply have a different set of problem sites.
Maybe the new "Quantum" Firefox 57 will improve matters because
it has a far more restrictive plugin API.

The overall problem is insoluble; hipster UX designers will be
the second against the wall when the revolution comes :-).


Re: defragmenting best practice?

2017-10-31 Thread Peter Grandi
> I'm following up on all the suggestions regarding Firefox performance
> on BTRFS. [ ... ]

I haven't read that yet, so maybe I am missing something, but I
use Firefox with Btrfs all the time and I haven't got issues.

[ ... ]
> 1. BTRFS snapshots have proven to be too useful (and too important to
>our overall IT approach) to forego.
[ ... ]
> 3. We have large amounts of storage space (and can add more), but not
>enough to break all reflinks on all snapshots.

Firefox profiles get fragmented only in the databases contained
in them, and they are tiny, as in dozens of MB. That's usually
irrelevant.

Also, nothing forces you to defragment a whole filesystem; you
can just defragment individual files or directories by using
'find' with it.

My top "$HOME" fragmented files are the aKregator RSS feed
databases, usually a few hundred fragments each, and the
'.sqlite' files for Firefox. Occasionally, like just now, I do
this:

  tree$  sudo filefrag .firefox/default/*.sqlite | sort -t: -k 2n | tail -4
  .firefox/default/cleanup.sqlite: 43 extents found
  .firefox/default/content-prefs.sqlite: 67 extents found
  .firefox/default/formhistory.sqlite: 87 extents found
  .firefox/default/places.sqlite: 3879 extents found

  tree$  sudo btrfs fi defrag .firefox/default/*.sqlite

  tree$  sudo filefrag .firefox/default/*.sqlite | sort -t: -k 2n | tail -4
  .firefox/default/webappsstore.sqlite: 1 extent found
  .firefox/default/favicons.sqlite: 2 extents found
  .firefox/default/kinto.sqlite: 2 extents found
  .firefox/default/places.sqlite: 44 extents found

> 2. Put $HOME/.cache on a separate BTRFS subvolume that is mounted
> nocow -- it will NOT be snapshotted

The cache can be simply deleted, and usually files in it are not
updated in place, so they don't get fragmented, so no worry.

Also, you can declare the '.firefox/default/' directory to be
NOCOW, and that "just works". I haven't even bothered with that.
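
A minimal sketch of that, assuming the profile lives under
'.firefox/default/' as above (note that '+C' on a directory only
affects files created afterwards; existing files have to be
copied, not moved, into it to become NOCOW):

  chattr +C .firefox/default/
  lsattr -d .firefox/default/   # should now show the 'C' attribute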


RE: SLES 11 SP4: can't mount btrfs

2017-10-26 Thread Peter Grandi
>> But it could simply be that you have forgotten to refresh the
>> 'initramfs' with 'mkinitrd' after modifying the '/etc/fstab'.

> I finally managed it. I'm pretty sure having changed
> /boot/grub/menu.lst, but somehow changes got lost/weren't
> saved ?

So the next thing to check would indeed have been that the GRUB2
script had been updated, which you can do with 'grub2-mkconfig'.
Also double check that in '/etc/sysconfig/bootloader' there is a
line 'LOADER_TYPE="grub"' instead of "none".
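
On systems that boot via GRUB2 that amounts to something like
the following (the output path is distribution-dependent, so
treat it as a placeholder):

  grub2-mkconfig -o /boot/grub2/grub.cfg
  grep LOADER_TYPE /etc/sysconfig/bootloader   # should show LOADER_TYPE="grub"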

The system config tools will update the 'initramfs' and the
'menu.lst' automatically only if you make system config changes
through them, but you changed the UUID of '/' "manually", and
this perhaps put the GRUB2 config and the system state out of
sync.

> After entering the new UUID from my Btrfs partition system
> boots.

Alternatively you could have used 'btrfstune -U ... ...' to
change the UUID of the newly created '/' volume to the old one.
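
That is, something like this sketch, with the old UUID as a
placeholder, run from a rescue system with the filesystem
unmounted (a reasonably recent 'btrfs-progs' is needed for '-U'):

  btrfstune -U 12345678-abcd-ef01-2345-6789abcdef01 /dev/vg1/lv_root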



RE: SLES 11 SP4: can't mount btrfs

2017-10-26 Thread Peter Grandi
> I formatted the / partition with Btrfs again and could restore
> the files from a backup.  Everything seems to be there, I can
> mount the Btrfs manually. [ ... ] But SLES finds from where I
> don't know a UUID (see screenshot). This UUID is commented out
> in fstab and replaced by /dev/vg1/lv_root. Using
> /dev/vg1/lv_root I can manually mount my Btrfs without any
> problem. Where does my SLES find that UUID ?

This sounds like a SLES issue, rather than a Btrfs one.

But it could simply be that you have forgotten to refresh the
'initramfs' with 'mkinitrd' after modifying the '/etc/fstab'.


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-19 Thread Peter Grandi
[ ... ]

>> are USB drives really that unreliable [ ... ]
[ ... ]
> There are similar SATA chips too (occasionally JMicron and
> Marvell for example are somewhat less awesome than they could
> be), and practically all Firewire bridge chips of old "lied" a
> lot [ ... ]
> That plus Btrfs is designed to work on top of a "well defined"
> block device abstraction that is assumed to "work correctly"
> (except for data corruption), [ ... ]

When I insist on the reminder that Btrfs is designed to use the
block-device protocol and state machine, rather than USB and
SATA devices, it is because that makes more explicit that the
various layers between the block-device interface and the actual
USB or SATA device can "lie" too, including for example the
Linux page cache, which is just below the block-device layer.
But also the disk scheduler, the SCSI
protocol handler, the USB and SATA drivers and disk drivers, the
PCIe chipset, the USB or SATA host bus adapter, the cable, the
backplane.

This paper reports the results of some testing of "enterprise
grade" storage systems at CERN, and some of the symptoms imply
that "lies" can happen *anywhere*. It is scary. It supports
having data checksumming in the filesystem, a rather extreme
choice.


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-19 Thread Peter Grandi
[ ... ]
>>> Oh please, please a bit less silliness would be welcome here.
>>> In a previous comment on this tedious thread I had written:

> If the block device abstraction layer and lower layers work
> correctly, Btrfs does not have problems of that sort when
> adding new devices; conversely if the block device layer and
> lower layers do not work correctly, no mainline Linux
> filesystem I know can cope with that.

> Note: "work correctly" does not mean "work error-free".

>>> The last line is very important and I added it advisedly.
[ ... ]
>> Filesystems run on top of *block-devices* with a definite
>> interface and a definite state machine, and filesystems in
>> general assume that the block-device works *correctly*.

> They do run on top of USB or SATA devices, otherwise a
> significant majority of systems running Linux and/or BSD
> should not be operating right now.

That would be big news to any Linux/UNIX filesystem developer,
who would have to rush to add SATA and USB protocol and state
machine handling to their implementations, which currently only
support the block-device protocol and state machine.
Please send patches :-)

  Note to some readers: there are filesystems designed to work
  on top of something other than block devices, for example on
  top of the MTD abstraction layer.

> Yes, they don't directly access them, but the block layer
> isn't much more than command translation, scheduling, and
> accounting, so this distinction is meaningless and largely
> irrelevant.

More tedious silliness and grossly ignorant too, because the
protocol and state machine of the block-device layer is
completely different from that of both SATA and USB, and the
mapping of the SATA or USB protocols and state machines onto the
block-device ones is actually a very complex, difficult, and
error prone task, involving mountains of very hairy code. In
particular since the block-device protocol and state machine are
rather simplistic, a lot is lost in translation.

  Note: the SATA handling firmware in disk devices often involves
  *tens of thousands* of lines of code, and "all it does" is
  "just" reading the device and passing the content over the IO
  bus.

Filesystems are designed around that very simplistic protocol
and state machine for good reasons, and sometimes they are
designed around even just a subset; for example most filesystem
designs assume that block-device writes never fail (that is, bad
sector sparing is done by a lower layer), and only some handle
block-device read failures gracefully.

> [ ... ] to refer to a block-device connected via interface 'X'
> as an 'X device' or an 'X storage device'.

More tedious silliness as this is a grossly misleading shorthand
when the point of the discussion is the error recovery protocol
and state machine assumed by filesystem designers. To me it see
that if people use that shorthand in that context, as if it was
not a shorthand, they don't know what they are talking about, or
they are trying to mislead the discussion.

> [ ... ] For an end user, it generally doesn't matter whether a
> given layer reported the error or passed it on (or generated
> it), it matters whether it was corrected or not. [ ... ]

You seem unable or unwilling to appreciate how detected and
undetected errors are fundamentally different, and how layering
of greatly different protocols is a complicated issue highly
relevant to error recovery, so you seem to assume that other end
users are likewise unable or unwilling.

But I am not so dismissive of "end users", and I assume that
there are end users that can eventually understand that Btrfs in
the main is not designed to handle devices that "lie" because
Btrfs actually is designed to use the block-device layer which
is assumed to "work correctly" (except for checksums).


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-19 Thread Peter Grandi
> [ ... ] when writes to a USB device fail due to a temporary
> disconnection, the kernel can actually recognize that a write
> error happened. [ ... ]

Usually, but who knows? Maybe half the transfer gets written;
maybe the data gets written to the wrong address; maybe stuff
gets written but failure is reported; and this happens not just
if the connection dies, but also if it does not.

> are USB drives really that unreliable [ ... ]

Welcome to the "real world", also called "Shenzen" :-).

There aren't that many "USB drives"; as I wrote somewhere, there
are usually USB host bus adapters (on the system side) and USB
IO bus (usually SATA) bridges (on the device side).

They both have to do difficult feats of conversion and signaling,
and in the USB case they are usually designed by a stressed,
overworked engineer in Guangzhou or Taiwan, employed by a no-name
contractor who submitted the lowest bid to a no-name
manufacturer, and was told to do the cheapest design to fabricate
in the shortest possible time. Most of the time they mostly work,
good enough for keyboards and mice, and for photos of cats on USB
sticks; most users just unplug and replug them if they flake
out. BTW my own USB keyboard and mouse and their USB host bus
adapter occasionally crash too, and the cases where my webcam
flakes out are more common than when it does not. USB is a mixed
bag of poorly designed protocols, and complex too, and it is very
easy to do a bad implementation.

There are similar SATA chips too (occasionally JMicron and
Marvell for example are somewhat less awesome than they could
be), and practically all Firewire bridge chips of old "lied" a
lot except a few Oxford Semi ones (the legendary 911 series).
I have even seen lying SAS "enterprise" grade storage
interconnects. I had indeed previously written:

  > If you have concerns about the reliability of specific
  > storage and system configurations you should become or find a
  > system integration and qualification engineer who understand
  > the many subletities of storage devices and device-system
  > interconnects and who would run extensive tests on it;
  > storage and system commissioning is often far from trivial
  > even in seemingly simple cases, due in part to the enormous
  > complexity of interfaces, even when they have few bugs, and
  > test made with one combination often do not have the same
  > results even on apparently similar combinations.

On the #Btrfs IRC channel there is a small group of cynical
helpers, and when someone mentions "strange things happening" one
of them usually immediately asks "USB?" and in most cases the
answer is "how did you know?".

That plus Btrfs is designed to work on top of a "well defined"
block device abstraction that is assumed to "work correctly"
(except for data corruption), and the Linux block device
abstraction and SATA and USB layers beneath it are not designed
to handle devices that "lie" (well, there are blacklists with
workaround for known systematic bugs, but that is partial).


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-19 Thread Peter Grandi
> [ ... ] However, the disappearance of the device doesn't get
> propagated up to the filesystem correctly,

Indeed, sometimes it does, sometimes it does not, in part
because of chipset bugs, in part because the USB protocol
signaling side does not handle errors well even if the chipset
were bug free.

> and that is what causes the biggest issue with BTRFS. Because
> BTRFS just knows writes are suddenly failing for some reason,
> it doesn't try to release the device so that things get
> properly cleaned up in the kernel, and thus when the same
> device reappears (as it will when the disconnect was due to a
> transient bus error, which happens a lot), it shows up as a
> different device node, which gets scanned for filesystems by
> udev, and BTRFS then gets really confused because it now sees
> 3 (or more) devices for a 2 device filesystem.

That's a good description that should be on the wiki.


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-19 Thread Peter Grandi
[ ... ]

>> Oh please, please a bit less silliness would be welcome here.
>> In a previous comment on this tedious thread I had written:

>> > If the block device abstraction layer and lower layers work
>> > correctly, Btrfs does not have problems of that sort when
>> > adding new devices; conversely if the block device layer and
>> > lower layers do not work correctly, no mainline Linux
>> > filesystem I know can cope with that.
>> 
>> > Note: "work correctly" does not mean "work error-free".
>> 
>> The last line is very important and I added it advisedly.

> Even looking at things that way though, Zoltan's assessment
> that reliability is essentially a measure of error rate is
> correct.

It is instead based on a grave confusion between two very
different kinds of "error rate", confusion also partially based
on the ridiculous misunderstanding, which I have already pointed
out, that UNIX filesystems run on top of SATA or USB devices:

> Internal SATA devices absolutely can randomly drop off the bus
> just like many USB storage devices do,

Filesystems run on top of *block devices* with a definite
interface and a definite state machine, and filesystems in
general assume that the block device works *correctly*.

> but it almost never happens (it's a statistical impossibility
> if there are no hardware or firmware issues), so they are more
> reliable in that respect.

What the OP was doing was using "unreliable" both for the case
where the device "lies" and the case where the device does not
"lie" but reports a failure. Both of these are malfunctions in a
wide sense:

  * The [block] device "lies" as to its status or what it has done.
  * The [block] device reports truthfully that an action has failed.

But they are of very different nature and need completely
different handling. Hint: one is an extensional property and the
other is a modal one; there is a huge difference between "this
data is wrong" and "I know that this data is wrong".

The really important "detail" is that filesystems are, as a rule
with very few exceptions, designed to work only if the block
device layer (and those below it) does not "lie" (see "Byzantine
failures" below), that is "works correctly": reports the failure
of every operation that fails and the success of every operation
that succeeds and never gets into an unexpected state.

In particular filesystems designs are nearly always based on the
assumption that there are no undetected errors at the block
device level or below. Then the expected *frequency* of detected
errors influences how much redundancy and what kind of recovery
are desirable, but the frequency of "lies" is assumed to be
zero.

The one case where Btrfs does not assume that the storage layer
works *correctly* is checksumming: it is quite expensive and
makes sense only if the block device is expected to (sometimes)
"lie" about having written the data correctly or having read it
correctly. The role of the checksum is to spot when a block
device "lies" and turn an undetected read error into a detected
one (they could be used also to detect correct writes that are
misreported as having failed).

The crucial difference between SATA and USB is not
that USB chips have higher rates of detected failures (even if
they often do), but that in my experience SATA interfaces from
reputable suppliers don't "lie" (more realistically have
negligible "lie" rates), and USB interfaces (both host bus
adapters and IO bus bridges) "lie" both systematically and
statistically with non negligible rates, and anyhow the USB mass
storage protocol is not very good at error reporting and
handling.

>> The "working incorrectly" general case is the so called
>> "bizantine generals problem" [ ... ]

This is compsci for beginners and someone dealing with storage
issues (and not just storage) should be intimately familiar with the
implications:

  https://en.wikipedia.org/wiki/Byzantine_fault_tolerance

  Byzantine failures are considered the most general and most
  difficult class of failures among the failure modes. The
  so-called fail-stop failure mode occupies the simplest end of
  the spectrum. Whereas fail-stop failure model simply means
  that the only way to fail is a node crash, detected by other
  nodes, Byzantine failures imply no restrictions, which means
  that the failed node can generate arbitrary data, pretending
  to be a correct one, which makes fault tolerance difficult.


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-18 Thread Peter Grandi
> [ ... ] After all, btrfs would just have to discard one copy
> of each chunk. [ ... ]  One more thing that is not clear to me
> is the replication profile of a volume. I see that balance can
> convert chunks between profiles, for example from single to
> raid1, but I don't see how the default profile for new chunks
> can be set or quiered. [ ... ]

My impression is that the design rationale and aims for Btrfs
two-level allocation (in other fields known as a "BIBOP" scheme)
were not fully shared among Btrfs developers, that perhaps it
could have benefited from some further reflection on its
implications, and that its behaviour may have evolved
"opportunistically", maybe without much worrying as to
conceptual integrity. (I am trying to be euphemistic)

So while I am happy with the "Rodeh" core of Btrfs (COW,
subvolumes, checksums), the RAID-profile functionality and
especially the multi-device layer is not something I find
particularly to my taste. (I am trying to be euphemistic)

So when it comes to allocation, RAID-profiles, multiple devices,
I usually expect some random "surprising functionality". (I am
trying to be euphemistic)


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-18 Thread Peter Grandi
>> I forget sometimes that people insist on storing large
>> volumes of data on unreliable storage...

Here obviously "unreliable" is used on the sense of storage that
can work incorrectly, not in the sense of storage that can fail.

> In my opinion the unreliability of the storage is the exact
> reason for wanting to use raid1. And I think any problem one
> encounters with an unreliable disk can likely happen with more
> reliable ones as well, only less frequently, so if I don't
> feel comfortable using raid1 on an unreliable medium then I
> wouldn't trust it on a more reliable one either.

Oh please, please a bit less silliness would be welcome here.
In a previous comment on this tedious thread I had written:

  > If the block device abstraction layer and lower layers work
  > correctly, Btrfs does not have problems of that sort when
  > adding new devices; conversely if the block device layer and
  > lower layers do not work correctly, no mainline Linux
  > filesystem I know can cope with that.

  > Note: "work correctly" does not mean "work error-free".

The last line is very important and I added it advisedly.

You seem to be using "unreliable" in two completely different
meanings, without realizing it, as both "working incorrectly"
and "reporting a failure". They are really very different.

The "working incorrectly" general case is the so called
"bizantine generals problem" and (depending on assumptions) it
is insoluble.

Btrfs has some limited ability to detect (and sometimes recover
from) "working incorrectly" storage layers, but don't expect too
much from that.


Re: Is it safe to use btrfs on top of different types of devices?

2017-10-14 Thread Peter Grandi
> A few years ago I tried to use a RAID1 mdadm array of a SATA
> and a USB disk, which lead to strange error messages and data
> corruption.

That's common: quite a few reports of similar issues in previous
entries in this mailing list and for many other filesystems.

> I did some searching back then and found out that using
> hot-pluggable devices with mdadm is a paved road to data
> corruption.

That's an amazing jump of logic.

> Reading through that old bug again I see that it was
> autoclosed due to old age but still hasn't been addressed:
> https://bugs.launchpad.net/ubuntu/+source/linux/+bug/320638

I suspect that it is very easy to misinterpret what the reported
issue is. However it is an interesting corner case that could
happen with any type of hardware device, not just hot-pluggable,
and one that I will try to remember even if unlikely to occur in
practice. I was only aware (dimly) of something quite similar in
the case of different logical sector sizes.

> I would like to ask whether btrfs may also be prone to data
> corruption issues in this scenario

Btrfs, like (nearly) all UNIX/Linux filesystems, does not run on
top of "devices", but on top of "files" of type "block device".

If the block device abstraction layer and lower layers work
correctly, Btrfs does not have problems of that sort when adding
new devices; conversely if the block device layer and lower
layers do not work correctly, no mainline Linux filesystem I know
can cope with that.

Note: "work correctly" does not mean "work error-free".

> (due to the same underlying issue as the one described in the
> bug above for mdadm), or is btrfs unaffected by the underlying
> issue

"Socratic method" questions:

* What do you think is the underlying issue in that bug report?
  (hint: something to do with host adapters or device bridges)
* Why do you think that bug report is in any way related to your
  issues with "a RAID1 mdadm array of a SATA and a USB disk"?

> and is safe to use with a mix of regular and hot-pluggable
> devices as well?

In my experience Btrfs works very well with a set of block
devices abstracting over both regular and hot-pluggable
devices, as far as that goes.

I personally don't like relying on Btrfs multi-device volumes,
but that has nothing to do with your concerns, but with basic
Btrfs multi-device handling design choices.

If you have concerns about the reliability of specific storage
and system configurations you should become or find a system
integration and qualification engineer who understand the many
subletities of storage devices and device-system interconnects
and who would run extensive tests on it; storage and system
commissioning is often far from trivial even in seemingly simple
cases, due in part to the enormous complexity of interfaces, even
when they have few bugs, and test made with one combination often
do not have the same results even on apparently similar
combinations.

I suspect that you should have asked a completely different set
of questions (XY problem), but the above are I think good answers
to the questions that you have actually asked.


Re: btrfs errors over NFS

2017-10-13 Thread Peter Grandi
>> TL;DR: ran into some btrfs errors and weird behaviour, but
>> things generally seem to work. Just posting some details in
>> case it helps devs or other users. [ ... ] I've run into a
>> btrfs error trying to do a -j8 build of android on a btrfs
>> filesystem exported over NFSv3. [ ... ]

I have an NFS server exporting a Btrfs filesystem, and it is
mostly read-only and low-use, unlike a massive build, but so far
it has worked for me. The issue that was reported a while ago
was that the kernel NFS server does not report checksum
validation failures to clients as errors, but just prints a
warning, so for that reason and a few others I switched to the
Ganesha NFS server.

From your stack traces I noticed that some go pretty deep so
maybe there is an issue with that (but on 'amd64' the kernel
stack is much bigger than it used to be on 'i386'). Another
possibility is that the volume got somewhat damaged for other
reasons (bugs, media errors, ...) and this is having further
consequences.

BTW 'errno 17' is "File exists", so perhaps there is a race
condition over NFS. The bogus files with mode 0 seem to me to be
bogus directory entries with no files linked to them, which
could be again the result of race conditions. The problems with
chunk allocation reported as "WARNING" are unfamiliar to me.


Re: What means "top level" in "btrfs subvolume list" ?

2017-09-30 Thread Peter Grandi
> I am trying to figure out which means "top level" in the
> output of "btrfs sub list"

The terminology (and sometimes the detailed behaviour) of Btrfs
is not extremely consistent, I guess because of permissive
editorship of the design, in a "let 1000 flowers bloom" sort
of fashion, so the exact names do not matter a lot.

> [ ... ] outputs a "top level ID" equal to the "parent ID" (on
> the basis of the code).

You could have used option '-p' and it would have printed out
both "top level ID" and "parent ID" for extra enlightenment.

> But I am still asking which would be the RIGHT "top level id".

But perhaps one of them is now irrelevant, because 'man btrfs
subvolume' says:

  "If -p is given, then parent  is added to the output
  between ID and top level. The parent’s ID may be used at mount
  time via the subvolrootid= option."

and 'man 5 btrfs' says:

  "subvolrootid=objectid
(irrelevant since: 3.2, formally deprecated since: 3.10)
A workaround option from times (pre 3.2) when it was not
possible to mount a subvolume that did not reside directly
under the toplevel subvolume."

> My Hypothesis, it should be the ID of the root subvolume ( or
> 5 if it is not mounted). [ ... ]

Well, a POSIX filesystem typically has a root directory, and it
can be mounted as the system root or any other point. A Btrfs
filesystem has multiple root directories, that are mounted by
default "somewhere" (a design decision that I think was unwise,
but "whatever").

The subvolume containing the mountpoint directory of another
subvolume's root directory is in no way or sense its "parent",
as there is no derivation relationship; root directories are
independent of each other and their mountpoint is (or should be)
a runtime entity.

If there is a "parent" relationship it may be the one between a
snapshot and its origin subvolume (ignoring 'send'/'receive'...).
I have created a few plain and snapshot subvolumes and I get
this rather "confusing" output from version 4.4 of the 'btrfs'
command:

  base# btrfs subvol list -uq -a -p /fs/sda7 | sort -k 6,6n -k 8,8
  ID 257 gen 718 parent 5 top level 5 parent_uuid - uuid 2d7b0606-76d9-f24b-8f75-d20a5c0f3521 path =
  ID 356 gen 719 parent 5 top level 5 parent_uuid - uuid 9d201029-d2bf-2f43-8381-8c19d090483e path sl1
  ID 358 gen 719 parent 5 top level 5 parent_uuid 2d7b0606-76d9-f24b-8f75-d20a5c0f3521 uuid bc0e6a33-b5dc-4d48-b2db-1452b705d227 path sn1
  ID 357 gen 715 parent 356 top level 356 parent_uuid - uuid 2abc6399-956d-894f-836b-32eb5b603654 path /sl1/sl2
  ID 360 gen 718 parent 356 top level 356 parent_uuid 2d7b0606-76d9-f24b-8f75-d20a5c0f3521 uuid ad896822-e9a5-c645-8cfd-0aca7f5a2298 path /sl1/sn3
  ID 361 gen 719 parent 356 top level 356 parent_uuid bc0e6a33-b5dc-4d48-b2db-1452b705d227 uuid 9c1390d2-e485-cb4a-a41b-670248587bfb path /sl1/sn4
  ID 359 gen 717 parent 358 top level 358 parent_uuid 2d7b0606-76d9-f24b-8f75-d20a5c0f3521 uuid 72d4f943-2881-6442-b398-2277be8f2fec path /sn1/sn2

The "confusion" is that for some subvolumes the "parent" is the
same but the "parent_uuid" is different, and vice versa.
IIRC this has already been mentioned in part elsewhere.
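
As a rough illustration only (a sketch that assumes the 4.4-era
output format shown above, and that the origin of each snapshot
is still listed), one could map each snapshot back to the path of
its origin by joining on the UUID fields:

  base# btrfs subvol list -uq /fs/sda7 | awk '
          { uuid = ""; puuid = ""; path = $NF
            for (i = 1; i < NF; i++) {
              if ($i == "uuid")        uuid  = $(i + 1)
              if ($i == "parent_uuid") puuid = $(i + 1)
            }
            bypath[uuid] = path    # path of each subvolume, keyed by its uuid
            origin[path] = puuid   # origin uuid of each subvolume ("-" if none)
          }
          END {
            for (p in origin)
              if (origin[p] != "-")
                printf "%s is a snapshot of %s\n", p, bypath[origin[p]]
          }'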


Re: Btrfs performance with small blocksize on SSD

2017-09-26 Thread Peter Grandi
> i run a few performance tests comparing mdadm, hardware raid
> and the btrfs raid.

Fantastic beginning already! :-)

> I noticed that the performance

I have seen over the years a lot of messages like this where
there is a wanton display of amusing misuses of terminology, of
which the misuse of the word "performance" to mean "speed" is
common, and your results are work-per-time which is a "speed":
http://www.sabi.co.uk/blog/15-two.html?151023#151023

The "tl;dr" is: you and another guy are told to race the 100m to
win a €10,000 prize, but you have to carry a sack with a 50Kg
weight. It takes you a lot longer, as your speed is much lower,
and the other guy gets the prize. Was that because your
performance was much worse? :-)

> for small blocksizes (2k) is very bad on SSD in general and on
> HDD for sequential writing.

Your graphs show pretty decent performance for small-file IO on
Btrfs, depending on conditions, and you are very astutely not
explaining the conditions, even if some can be guessed.

> I wonder about that result, because you say on the wiki that
> btrfs is very effective for small files.

Effectiveness/efficiency are not the same as performance or speed
either. My own simplistic but somewhat meaningful tests show
that Btrfs does relatively well on small files:

  http://www.sabi.co.uk/blog/17-one.html?170302#170302

As to "small files" in general I have read about many attempts
to use filesystems as DBMSes, and I consider them intensely
stupid:

  http://www.sabi.co.uk/blog/anno05-4th.html?051016#051016

> I attached my results from raid 1 random write HDD (rH1), SSD
> (rS1) and from sequential write HDD (sH1), SSD (sS1)

Ah, so it was specifically about small *writes* (and, judging
from the other wording, presumably not about small in-place
updates of large files, but about creating and writing small
files).

It is a very basic beginner level notion that most storage
systems are very anisotropic as to IO size, and also for read
vs. write, and never mind with and without 'fsync'. SSDs without
supercapacitor backed buffers in particular are an issue.

Btrfs has a performance envelope where the speed of small writes
(in particular small in-place updates, but also because of POSIX
small file creation) has been sacrificed for good reasons:

https://btrfs.wiki.kernel.org/index.php/SysadminGuide#Copy_on_Write_.28CoW.29
https://btrfs.wiki.kernel.org/index.php/Gotchas#Fragmentation

Also consider the consequences of the 'max_inline' option for
'mount' and the 'nodesize' option for 'mkfs.btrfs'.
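
For instance (a sketch only; the device name is a placeholder and
the values are illustrations, not recommendations), those knobs
are set like this:

  #  mkfs.btrfs -n 16k /dev/sdX1              # 'nodesize': metadata block size, fixed at mkfs time
  #  mount -o max_inline=2048 /dev/sdX1 /mnt  # max bytes of small-file data inlined in metadata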

> Hopefully you have an explanation for that.

The best explanation seems to me (euphemism alert) quite
extensive "misknowledge" in the message I am responding to.


Re: A user cannot remove his readonly snapshots?!

2017-09-16 Thread Peter Grandi
[ ... ]

> I can delete normal subvolumes but not the readonly snapshots:

It is because of ordinary permissions for both subvolumes and
snapshots:

  tree$  btrfs sub create /fs/sda7/sub
  Create subvolume '/fs/sda7/sub'

  tree$  chmod a-w /fs/sda7/sub
  tree$  btrfs sub del /fs/sda7/sub
  Delete subvolume (no-commit): '/fs/sda7/sub'
  ERROR: cannot delete '/fs/sda7/sub': Permission denied

  tree$  chmod u+w /fs/sda7/sub
  tree$  btrfs sub del /fs/sda7/sub
  Delete subvolume (no-commit): '/fs/sda7/sub'

It is however possible to remove an ordinary read-only
directory, *as long as its parent directory is not read-only
too*:

  tree$  mkdir /fs/sda7/sub
  tree$  chmod a-w /fs/sda7/sub
  tree$  rmdir /fs/sda7/sub; echo $?
  0

IIRC this came up before, and the reason for the difference is
that a subvolume root directory is "special" because its '..'
entry points to itself (inode 256); that is, if it is read-only,
its parent directory (itself) is read-only too.


Re: A user cannot remove his readonly snapshots?!

2017-09-15 Thread Peter Grandi
>  [ ... ] mounted with option user_subvol_rm_allowed [ ... ]
> root can delete this snapshot, but not the user. Why? [ ... ]

Ordinary permissions still apply both to 'create' and 'delete':

  tree$  sudo mkdir /fs/sda7/dir
  tree$  btrfs sub create /fs/sda7/dir/sub
  ERROR: cannot access /fs/sda7/dir/sub: Permission denied

  tree$  sudo chmod a+rwx /fs/sda7/dir
  tree$  btrfs sub create /fs/sda7/dir/sub
  Create subvolume '/fs/sda7/dir/sub'

  tree$  btrfs sub delete /fs/sda7/dir/sub
  Delete subvolume (no-commit): '/fs/sda7/dir/sub'
  ERROR: cannot delete '/fs/sda7/dir/sub': Operation not permitted
  tree$  sudo mount -o remount,user_subvol_rm_allowed /fs/sda7
  tree$  btrfs sub delete /fs/sda7/dir/sub
  Delete subvolume (no-commit): '/fs/sda7/dir/sub'

  tree$  btrfs sub create /fs/sda7/dir/sub
  Create subvolume '/fs/sda7/dir/sub'

  tree$  sudo chmod a-w /fs/sda7/dir
  tree$  btrfs sub delete /fs/sda7/dir/sub
  Delete subvolume (no-commit): '/fs/sda7/dir/sub'
  ERROR: cannot delete '/fs/sda7/dir/sub': Permission denied

  tree$  sudo chmod a+w /fs/sda7/dir
  tree$  btrfs sub delete /fs/sda7/dir/sub
  Delete subvolume (no-commit): '/fs/sda7/dir/sub'
  tree$  sudo rmdir /fs/sda7/dir


Re: defragmenting best practice?

2017-09-15 Thread Peter Grandi
[ ... ]
 Case #1
 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs
 -> qemu cow2 storage -> guest BTRFS filesystem
 SQL table row insertions per second: 1-2

 Case #2
 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs
 -> qemu raw storage -> guest EXT4 filesystem
 SQL table row insertions per second: 10-15
[ ... ]

>> Q 0) what do you think that you measure here?

> Cow's fragmentation impact on SQL write performance.

That's not what you are measuring: you are measuring the impact
on speed of configurations "designed" (perhaps unintentionally)
for maximum flexibility, lowest cost, and complete disregard for
speed.

[ ... ]

> It was quick and dirty task to find, prove and remove
> performance bottleneck at minimal cost.

This is based on the usual confusion between "performance" (the
result of several tradeoffs) and "speed". When you report "row
insertions per second" you are reporting a rate, that is a
"speed", not "performance", which is always multi-dimensional.
http://www.sabi.co.uk/blog/15-two.html?151023#151023

In the cases above speed is low, but I think that, taking into
account flexibility and cost, performance is pretty good.

> AFAIR removing storage cow2 and guest BTRFS storage gave us ~
> 10 times boost.

"Oh doctor, if I stop stabbing my hand with a fork it no longer
hurts, but running while carrying a rucksack full of bricks is
still slower than with a rucksack full of feathers".

[ ... ]


Re: defragmenting best practice?

2017-09-15 Thread Peter Grandi
> Case #1
> 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu cow2 storage
> -> guest BTRFS filesystem
> SQL table row insertions per second: 1-2

"Doctor, if I stab my hand with a fork it hurts a lot: can you
cure that?"

> Case #2
> 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu raw
> storage -> guest EXT4 filesystem
> SQL table row insertions per second: 10-15

"Doctor, I can't run as fast with a backpack full of bricks as
without it: can you cure that?"

:-)


Re: generic name for volume and subvolume root?

2017-09-10 Thread Peter Grandi
> As I am writing some documentation abount creating snapshots:
> Is there a generic name for both volume and subvolume root?

Yes, it is from the UNIX side 'root directory' and from the
Btrfs side 'subvolume'. Like some other things Btrfs, its
terminology is often inconsistent, but "volume" *usually* means
"the set of devices [and contained root directories] with the
same Btrfs 'fsid'".

I think that the top-level subvolume should not be called the
"volume": while there is no reason why a UNIX-like filesystem
should be limited to a single block-device, one of the
fundamental properties of UNIX-like filesystems is that
hard-links are only possible (if at all possible) within a
filesystem, and that 'statfs' returns a different "device id"
per filesystem. Therefore a Btrfs volume is not properly a
filesystem, but potentially a filesystem forest, as it may
contain multiple filesystems each with its own root directory.

> Is there a simple name for directories I can snapshot?

You can only snapshot *root directories*, of which in Btrfs
there are two types: subvolumes (an unfortunate name perhaps) or
snapshots.

In UNIX-like OSes every filesystem has a "root directory" and
some filesystem types like Btrfs, NILFS2, and potentially JFS
can have more than one, and some can even mount more than one
simultaneously.

The root directory mounted as '/' is called the "system root
directory". When unmounted all filesystem root directories have
no names, just an inode number. Conceivably the root inode of a
UNIX-like filesystem could be an inode of any type, but I have
never seen a recent UNIX-like OS able to mount anything other
than a directory-type root inode (Plan 9 is not a UNIX-like OS
:->).

As someone else observed, the word "root" is overloaded in
UNIX-like OS discourse, like the word "filesystem", and that's
unfortunate but can always be resolved verbosely by using the
appropriate qualifier like "root directory", "system root
directory", "'root' user", "uid 0 capabilities", etc.


Re: test if a subvolume is a snapshot?

2017-09-08 Thread Peter Grandi
> How can I test if a subvolume is a snapshot? [ ... ]

This question is based on the assumption that "snapshot" is a
distinct type of subvolume and not just an operation that
creates a subvolume with reflinked contents.

Unfortunately Btrfs does indeed make snapshots a distinct type
of subvolume... In my 4.4 kernel/progs version of Btrfs it seems
that the 'Parent UUID' is that of the source of the snapshot,
and the source of a snapshot somehow comes with a list of all
the snapshots taken from it:

  #  ls /fs/sda7
  =        @170826  @170829  @170901  @170903  @170905  @170907
  @170825  @170828  @170830  @170902  @170904  @170906  lost+found

  #  btrfs subvolume list /fs/sda7
  ID 431 gen 532441 top level 5 path =
  ID 1619 gen 524915 top level 5 path @170825
  ID 1649 gen 524915 top level 5 path @170826
  ID 1651 gen 524915 top level 5 path @170828
  ID 1652 gen 524915 top level 5 path @170829
  ID 1654 gen 524915 top level 5 path @170830
  ID 1655 gen 523316 top level 5 path @170901
  ID 1656 gen 524034 top level 5 path @170902
  ID 1658 gen 525628 top level 5 path @170903
  ID 1659 gen 527121 top level 5 path @170904
  ID 1660 gen 528719 top level 5 path @170905
  ID 1665 gen 530565 top level 5 path @170906
  ID 1666 gen 532217 top level 5 path @170907

  #  btrfs subvolume show /fs/sda7/= | egrep 'UUID|Parent|Top level|Snap|@'
  UUID:   cb99579f-64e5-e94c-b22c-41dcc397c37f
  Parent UUID:-
  Received UUID:  -
  Parent ID:  5
  Top level ID:   5
  Snapshot(s):
  @170825
  @170826
  @170828
  @170829
  @170830
  @170901
  @170902
  @170903
  @170904
  @170905
  @170906
  @170907

  #  btrfs subvolume show /fs/sda7/@170901 | egrep 'UUID|Parent|Top level|Snap|@'
  /fs/sda7/@170901
  Name:   @170901
  UUID:   851f8ef3-c2af-4b46-89af-0193fd4e6fc4
  Parent UUID:cb99579f-64e5-e94c-b22c-41dcc397c37f
  Received UUID:  -
  Parent ID:  5
  Top level ID:   5
  Snapshot(s):

Note that with typical Btrfs consistency "Parent UUID" is that
of the source of the snapshot, while "Parent ID" is that of the
upper level subvolume, and in the "flat" layout for this volume
the snapshot parent is '/fs/sda7/=' and the upper level is
'/fs/sda7' instead.
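
So a crude test (a sketch only, assuming the 4.4-era 'btrfs
subvolume show' output format above; newer progs versions format
these fields a bit differently) could be:

  is_snapshot() {
      # a subvolume whose 'Parent UUID' is anything other than '-'
      # was created as a snapshot of some other subvolume
      test "$(btrfs subvolume show "$1" |
              sed -n 's/^[[:space:]]*Parent UUID:[[:space:]]*//p')" != "-"
  }

  is_snapshot /fs/sda7/@170901 && echo "snapshot" || echo "plain subvolume"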

The different results that you get make me suspect that the
top-level subvolume is "special".


Re: 4.13: No space left with plenty of free space (/home/kernel/COD/linux/fs/btrfs/extent-tree.c:6989 __btrfs_free_extent.isra.62+0xc2c/0xdb0)

2017-09-08 Thread Peter Grandi
[ ... ]
> [233787.921018] Call Trace:
> [233787.921031]  ? btrfs_merge_delayed_refs+0x62/0x550 [btrfs]
> [233787.921039]  __btrfs_run_delayed_refs+0x6f0/0x1380 [btrfs]
> [233787.921047]  btrfs_run_delayed_refs+0x6b/0x250 [btrfs]
> [233787.921054]  btrfs_write_dirty_block_groups+0x158/0x390 [btrfs]
> [233787.921063]  commit_cowonly_roots+0x221/0x2c0 [btrfs]
> [233787.921071]  btrfs_commit_transaction+0x46e/0x8d0 [btrfs]
[ ... ]
> [233787.921191] BTRFS: error (device md2) in 
> btrfs_run_delayed_refs:3009: errno=-28 No space left
> [233789.507669] BTRFS warning (device md2): Skipping commit of aborted 
> transaction.
> [233789.507672] BTRFS: error (device md2) in cleanup_transaction:1873: 
> errno=-28 No space left
[ ... ]

So the numbers that matter are:

> Data,single: Size:12.84TiB, Used:7.13TiB
> /dev/md2   12.84TiB
> Metadata,DUP: Size:79.00GiB, Used:77.87GiB
> /dev/md2  158.00GiB
> Unallocated:
> /dev/md23.31TiB

The metadata allocation is nearly full, so it could be the
usual story with the two-level allocator, that there are no
unallocated chunks left for metadata expansion; but since you
have 3TiB of 'unallocated' space there is no obvious reason why
allocating the metadata to do a new root transaction flush
should abort, so this is about "guessing" which corner case or
bug applies:

* If you are using the 'space_cache' it has a known issue:
https://btrfs.wiki.kernel.org/index.php/Gotchas#Free_space_cache
* Some versions of Btrfs (IIRC around 4.8-4.9) had some other
  allocator bug.
* Maybe some previous issue, hw or sw, had damaged internal
  filesystem structures.

I also notice that your volume's data free space seems to be
extremely fragmented, as suggested by the large difference here:
"Data,single: Size:12.84TiB, Used:7.13TiB".

That may mean that the volume is mounted with 'ssd' and/or has
gone a long time without a 'balance', and conceivably this can
make it easier for the free space cache to fail to find space
(some handwaving here).
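
A common mitigation (a sketch only; the usage thresholds are
arbitrary, the mountpoint is a placeholder, and a balance on a
volume this size can take a very long time) is to compact
partially used chunks back into unallocated space:

  #  btrfs balance start -dusage=50 -musage=50 /mountpoint
  #  btrfs fi usage /mountpoint    # check the 'unallocated' figure afterwards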


Re: speed up big btrfs volumes with ssds

2017-09-04 Thread Peter Grandi
>>> [ ... ] Currently without any ssds i get the best speed with:
>>> - 4x HW Raid 5 with 1GB controller memory of 4TB 3,5" devices
>>> and using btrfs as raid 0 for data and metadata on top of
>>> those 4 raid 5. [ ... ] the write speed is not as good as i
>>> would like - especially for random 8k-16k I/O. [ ... ]

> [ ... ] 64kb data stripe and 16kb parity Btrfs raid0 use 64kb
> as stripe so that can make data access unaligned (or use single
> profile for btrfs) 3. Use btrfs ssd_spread to decrease RMW
> cycles.

This is not as "revolutionary" a scientific discovery as the idea
of a cacheable working set for a small-size random-write workload,
but it still takes a lot of "optimism" to imagine that it is
possible to "decrease RMW cycles" for "random 8k-16k" writes on
64KiB+16KiB RAID5 stripes, whether with 'ssd_spread' or not.

To "decrease RMW cycles" seems indeed to me a better aim than
following the "radical" aim of caching the working set of a
random-small-write workload, but it may be less easy to achieve
than desirable :-). http://www.baarf.dk/


Re: speed up big btrfs volumes with ssds

2017-09-04 Thread Peter Grandi
>> [ ... ] Currently the write speed is not as good as i would
>> like - especially for random 8k-16k I/O. [ ... ]

> [ ... ] So this 60TB is then 20 4TB disks or so and the 4x 1GB
> cache is simply not very helpful I think. The working set
> doesn't fit in it I guess. If there is mostly single or a few
> users of the fs, a single pcie based bcacheing 4 devices can
> work, but for SATA SSD, I would use 1 SSD per HWraid5. [ ... ]

Probably the idea of the cacheable working set of a random small
write workload is a major new scientific discovery. :-)


Re: read-only for no good reason on 4.9.30

2017-09-04 Thread Peter Grandi
> [ ... ] I ran "btrfs balance" and then it started working
> correctly again. It seems that a btrfs filesystem if left
> alone will eventually get fragmented enough that it rejects
> writes [ ... ]

Free space will get fragmented, because Btrfs has a 2-level
allocator scheme (chunks within devices and leaves within
chunks). The issue is "free space" vs. "unallocated chunks".

> Is this a known issue?

https://btrfs.wiki.kernel.org/index.php/Problem_FAQ#I_get_.22No_space_left_on_device.22_errors.2C_but_df_says_I.27ve_got_lots_of_space
https://btrfs.wiki.kernel.org/index.php/FAQ#if_your_device_is_large_.28.3E16GiB.29
https://btrfs.wiki.kernel.org/index.php/FAQ#Understanding_free_space.2C_using_the_new_tools
https://btrfs.wiki.kernel.org/index.php/FAQ#Why_is_free_space_so_complicated.3F
https://btrfs.wiki.kernel.org/index.php/FAQ#What_does_.22balance.22_do.3F

That problem is particularly frequent with the 'ssd' mount
option which probably should never be used:

https://btrfs.wiki.kernel.org/index.php/Gotchas#The_ssd_mount_option
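
For example (a sketch only; the mountpoint is a placeholder), to
check whether the option is in effect and to disable it:

  #  findmnt -no OPTIONS /mnt | tr ',' '\n' | grep -w ssd
  #  mount -o remount,nossd /mnt   # and add 'nossd' to /etc/fstab to make it permanent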


Re: speed up big btrfs volumes with ssds

2017-09-03 Thread Peter Grandi
> [ ... ] - needed volume size is 60TB

I wonder how long that takes to 'scrub', 'balance', 'check',
'subvolume delete', 'find', etc.

> [ ... ] 4x HW Raid 5 with 1GB controller memory of 4TB 3,5"
> devices and using btrfs as raid 0 for data and metadata on top
> of those 4 raid 5. [ ... ]  the write speed is not as good as
> i would like - especially for random 8k-16k I/O. [ ... ]

Also I noticed that the rain is wet and cold - especially if one
walks around for a few hours in a t-shirt, shorts and sandals.
:-)

> My current idea is to use a pcie flash card with bcache on top
> of each raid 5. Is this something which makes sense to speed
> up the write speed.

Well 'bcache' in the role of write buffer allegedly helps turn
unaligned writes into aligned writes, so it might help, but I
wonder how effective that will be in this case, plus it won't
turn low random IOPS-per-TB 4TB devices into high ones. Anyhow
if they are battery-backed the 1GB of HW HBA cache/buffer should
do exactly that, except that again in this case that is rather
optimistic.

But this reminds me of the common story: "Doctor, if I stab
repeatedly my hand with a fork it hurts a lot, how to fix that?"
"Don't do it".
:-)

PS Random writes of 8-16KiB over 60TB might seem like storing
small records/images in small files. That would be "brave".
On a 60TB RAID50 of 20x 4TB disk drives that might mean around
5-10MB/s of random small writes, including both data and
metadata.


Re: number of subvolumes

2017-08-24 Thread Peter Grandi
>> Using hundreds or thousands of snapshots is probably fine
>> mostly.

As I mentioned previously, with a link to the relevant email
describing the details, the real issue is reflinks/backrefs.
Usually subvolumes and snapshots involve them.

> We find that typically apt is very slow on a machine with 50
> or so snapshots and raid10. Slow as in probably 10x slower as
> doing the same update on a machine with 'single' and no
> snapshots.

That seems to indicate using snapshots on a '/' volume to
provide a "rollback machine" like SUSE. Since '/' usually has
many small files and installation of upgraded packages involves
only a small part of them, that usually involves a lot of
reflinks/backrefs.

But that you find that the system has slowed down significantly
in ordinary operations is unusual, because what is slow in
situations with many reflinks/backrefs per extent is not access,
but operations like 'balance' or 'delete'.

Guessing wildly, what you describe seems more like the effect of
low locality (aka high fragmentation), which is often the result
of the 'ssd' option, which should always be explicitly disabled
(even for volumes on flash SSD storage). I would suggest some
use of 'filefrag' to analyze the situation and perhaps use of
'defrag' and 'balance'.
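
Something along these lines (a sketch only; the paths are
placeholders, and mind the caveats about 'defrag' and reflinks
quoted below):

  #  filefrag /var/lib/dpkg/* 2>/dev/null | sort -t: -k2 -rn | head   # worst-fragmented files
  #  btrfs filesystem defragment -v /var/lib/dpkg/status              # defragment one file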

Another possibility is having enabled compression with the
presence of many in-place updates on some files, which can
result also in low locality (high fragmentation).

As usual with Btrfs, there are corner cases to avoid: 'defrag'
should be done before 'balance' and with compression switched
off (IIRC):

https://wiki.archlinux.org/index.php/Btrfs#Defragmentation

  Defragmenting a file which has a COW copy (either a snapshot
  copy or one made with cp --reflink or bcp) plus using the -c
  switch with a compression algorithm may result in two
  unrelated files effectively increasing the disk usage.

https://wiki.debian.org/Btrfs

  Mounting with -o autodefrag will duplicate reflinked or
  snapshotted files when you run a balance. Also, whenever a
  portion of the fs is defragmented with "btrfs filesystem
  defragment" those files will lose their reflinks and the data
  will be "duplicated" with n-copies. The effect of this is that
  volumes that make heavy use of reflinks or snapshots will run
  out of space.

  Additionally, if you have a lot of snapshots or reflinked files,
  please use "-f" to flush data for each file before going to the
  next file.

I prefer dump-and-reload.


Re: number of subvolumes

2017-08-23 Thread Peter Grandi
> This is a vanilla SLES12 installation: [ ... ] Why does SUSE
> ignore this "not too many subvolumes" warning?

As in many cases with Btrfs "it's complicated" because of the
interaction of advanced features among themselves and the chosen
implementation and properties of storage; anisotropy rules.

IIRC the main problem actually is not with "too many subvolumes",
but with too many "reflinks"/"backrefs"; subvolumes, in particular
snapshots, are just the main way to create them:

  https://www.spinics.net/lists/linux-btrfs/msg42808.html

A couple dozen subvolumes without reflinks as in the '/' scheme
used by SUSE are going to be almost always fine.

Then there is a different issue: I remember seeing a post by a SUSE
guy saying that the 10/10/10/10 (hourly/daily/monthly/yearly)
snapshots in the default settings for 'snapper' were a bad idea
because they would create way too many snapshots, but that he was
told to set those defaults that high. I can imagine a cowardly but
plausible reason why "management" would want those defaults.

Some semi-useful links:

* Home page for 'snapper'
  https://snapper.io/
* Announcement of 'snapper'
  https://lizards.opensuse.org/2011/04/01/introducing-snapper/
* Useful maintenance scripts
  https://github.com/kdave/btrfsmaintenance


Re: user snapshots

2017-08-23 Thread Peter Grandi
> So, still: What is the problem with user_subvol_rm_allowed?

As usual, it is complicated: mostly that while subvol creation
is very cheap, subvol deletion can be very expensive. But then
so can creating many snapshots be, as in this:

  https://www.spinics.net/lists/linux-btrfs/msg62760.html

Also, deleting a subvol can delete a lot of stuff
"inadvertently", including things that the user could not delete
using UNIX style permissions. But then many of the Btrfs semantics
feel a bit "arbitrary", in part because they break new ground, in
part because of happenstance.

  
  http://linux-btrfs.vger.kernel.narkive.com/eTtmsQdL/patch-1-2-btrfs-don-t-check-the-permission-of-the-subvolume-which-we-want-to-delete
  http://linux-btrfs.vger.kernel.narkive.com/nR17xtw7/patch-btrfs-allow-subvol-deletion-by-unprivileged-user-with-o-user-subvol-rm-allowed


Re: netapp-alike snapshots?

2017-08-22 Thread Peter Grandi
[ ... ]

 It is beneficial to not have snapshots in-place. With a local
 directory of snapshots, [ ... ]

Indeed and there is a fair description of some options for
subvolume nesting policies here which may be interesting to the
original poster:

  https://btrfs.wiki.kernel.org/index.php/SysadminGuide#Layout

It is unsurprising to me that there are tradeoffs involved in
every choice. I find the "Flat" layout particularly desirable.

>>> Netapp snapshots are invisible for tools doing opendir()/
>>> readdir() One could simulate this with symlinks for the
>>> snapshot directory: store the snapshot elsewhere (not inplace)
>>> and create a symlink to it, in every directory.

More precisely in every subvolume root directory.

>>> My users want the snapshots locally in a .snapshot
>>> subdirectory.

Btrfs snapshots can only be done for a whole subvolume. Subvolumes
and snapshots can be created by users, but too many snapshots (see
below) can cause trouble. For somewhat good reasons subvolumes
including snapshots cannot be deleted by users though unless mount
option 'user_subvol_rm_allowed' is used.

>>> Because Netapp do it this way - for at least 20 years and we
>>> have a multi-PB Netapp storage environment. No chance to change
>>> this.

Send patches :-).

> Not only du works recursivly, but also find and with option
> also ls, grep, etc.

Note also that subvolume root directory inodes are indeed root
directory inodes so they can be 'mount'ed and therefore the
transition from a subvolume into a contained subvolume can be
detected at the mountpoint.

So 'find' has the '-xdev' option and 'du' has the '-x' option,
and similarly nearly all other tools, so perhaps someone expects
that to happen :-).
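
For example (a sketch only; the paths are placeholders), these
stop at subvolume (and other mount) boundaries and so skip the
'.snapshot' trees:

  $  du -x -sh /home/someuser
  $  find /home/someuser -xdev -name '*.core' -mtime +30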

> And it would require a bind mount for EVERY directory. There can
> be hundreds... thousends!

Assumptions that all Btrfs features such as snapshots are
infinitely scalable at no cost may be optimistic:

  
https://btrfs.wiki.kernel.org/index.php/Gotchas#Having_many_subvolumes_can_be_very_slow


Re: finding root filesystem of a subvolume?

2017-08-22 Thread Peter Grandi
[ ... ]

>> There is no fixed relationship between the root directory
>> inode of a subvolume and the root directory inode of any
>> other subvolume or the main volume.

> Actually, there is, because it's inherently rooted in the
> hierarchy of the volume itself. That root inode for the
> subvolume is anchored somewhere under the next higher
> subvolume.

This stupid point relies on ignoring that it is not mandatory to
mount the main volume, and that therefore "There is no fixed
relationship between the root directory inode of a subvolume and
the root directory inode of any other subvolume or the main
volume", because the "root directory inode" of the "main volume"
may not be mounted at all.

This stupid point also relies on ignoring that subvolumes can be
mounted *also* under another directory, even if the main volume
is mounted somewhere else. Suppose that the following applies:

  subvol=5  /local
  subvol=383/local/.backup/home
  subvol=383/mnt/home-backup

and you are given the mountpoint '/mnt/home-backup', how can you
find the main volume mountpoint '/local' from that?

Please explain how '/mnt/home-backup' is indeed "inherently
rooted in the hierarchy of the volume itself", and how there is
always a "fixed relationship between the root directory inode of
a subvolume and the root directory inode of any other subvolume
or the main volume".

[ ... ]

> Again, it does, it's just not inherently exposed to userspace
> unless you mount the top-level subvolume (subvolid=5 and/or
> subvol=/ in mount options).

This extra stupid point is based on ignoring that to "mount the
top-level subvolume" one must already know which one is the
"top-level subvolume", which is begging the question.

[ ... ]


Re: finding root filesystem of a subvolume?

2017-08-22 Thread Peter Grandi
> How do I find the root filesystem of a subvolume?
> Example:
> root@fex:~# df -T 
> Filesystem Type  1K-blocks  Used Available Use% Mounted on
> -  -1073740800 104244552 967773976  10% /local/.backup/home
[ ... ]
> I know, the root filesystem is /local,

That question is based on a misunderstanding and uses the wrong
concepts and terms. In UNIX filesystems a filesystem "root" is a
directory inode with a number that is local to the filesystem
itself; it can be "mounted" anywhere, or left unmounted, and
where it is mounted is a property of the running system, not of
the filesystem "root". Usually UNIX filesystems have a single
"root" directory inode.

In the case of Btrfs the main volume and its subvolumes all have
filesystem "root" directory inodes, which may or may not be
"mounted", anywhere the administrators of the running system
pleases, as a property of the running system. There is no fixed
relationship between the root directory inode of a subvolume and
the root directory inode of any other subvolume or the main
volume.

Note: in Btrfs terminology "volume" seems to mean both the main
volume and the collection of devices where it and subvolumes are
hosted.

> but who can I show it by command?

The system does not keep an explicit record of which Btrfs
"root" directory inode is related to which other Btrfs "root"
directory inode in the same volume, whether mounted or
unmounted.

That relationship has to be discovered by using volume UUIDs,
which are the same for the main subvolume and the other
subvolumes, whether mounted or not, so one has to do:

  * For the indicated mounted subvolume "root" read its UUID.
  * For every mounted filesystem "root", check whether its type
is 'btrfs' and if it is obtain its UUID.
  * If the UUID is the same, and the subvolume id is '5', that's
the main subvolume, and terminate.
  * For every block device which is not mounted, check whether it
has a Btrfs superblock.
  * If the type is 'btrfs' and the volume UUID is the same as
that of the subvolume, list the block device.

In the latter case since the main volume is not mounted the only
way to identify it is to list the block devices that host it.
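
A rough sketch of the mounted-filesystems part of that procedure
(assuming util-linux 'findmnt' and 'btrfs-progs'; the path is a
placeholder):

  uuid=$(btrfs filesystem show /mnt/home-backup | sed -n 's/.*uuid: //p')
  findmnt -t btrfs -n -o TARGET,OPTIONS |
  while read -r mnt opts; do
      case ",$opts," in
      *,subvolid=5,*)    # a top-level subvolume of some Btrfs volume
          [ "$(btrfs filesystem show "$mnt" | sed -n 's/.*uuid: //p')" = "$uuid" ] &&
              echo "main volume mounted at: $mnt" ;;
      esac
  done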


Re: slow btrfs with a single kworker process using 100% CPU

2017-08-16 Thread Peter Grandi
>>> I've one system where a single kworker process is using 100%
>>> CPU sometimes a second process comes up with 100% CPU
>>> [btrfs-transacti]. [ ... ]

>> [ ... ]1413 Snapshots. I'm deleting 50 of them every night. But
>> btrfs-cleaner process isn't running / consuming CPU currently.

Reminder that:

https://btrfs.wiki.kernel.org/index.php/Gotchas#Having_many_subvolumes_can_be_very_slow

"The cost of several operations, including currently balance, device
delete and fs resize, is proportional to the number of subvolumes,
including snapshots, and (slightly super-linearly) the number of
extents in the subvolumes."

>> [ ... ] btrfs is mounted with compress-force=zlib

> Could be similar issue as what I had recently, with the RAID5 and
> 256kb chunk size. please provide more information about your RAID
> setup.

It is similar, but updating compressed files in place can create
this situation even without RAID5 RMW:

https://btrfs.wiki.kernel.org/index.php/Gotchas#Fragmentation

"Files with a lot of random writes can become heavily fragmented
(1+ extents) causing thrashing on HDDs and excessive multi-second
spikes of CPU load on systems with an SSD or large amount a RAM. ...
Symptoms include btrfs-transacti and btrfs-endio-wri taking up a lot
of CPU time (in spikes, possibly triggered by syncs)."


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-16 Thread Peter Grandi
[ ... ]

>>> Snapshots work fine with nodatacow, each block gets CoW'ed
>>> once when it's first written to, and then goes back to being
>>> NOCOW.
>>> The only caveat is that you probably want to defrag either
>>> once everything has been rewritten, or right after the
>>> snapshot.

>> I thought defrag would unshare the reflinks?
 
> Which is exactly why you might want to do it. It will get rid
> of the overhead of the single CoW operation, and it will make
> sure there is minimal fragmentation.
> IOW, when mixing NOCOW and snapshots, you either have to use
> extra space, or you deal with performance issues. Aside from
> that though, it works just fine and has no special issues as
> compared to snapshots without NOCOW.

The above illustrates my guess as to why RHEL 7.4 dropped Btrfs
support, which is:

  * RHEL is sold to managers who want to minimize the cost of
upgrades and sysadm skills.
  * Every time a customer creates a ticket, RH profits fall.
  * RH had adopted 'ext3' because it was an in-place upgrade
from 'ext2' and "just worked", 'ext4' because it was an
in-place upgrade from 'ext3' and was supposed to "just
work", and then was looking at Btrfs as an in-place upgrade
from 'ext4', and presumably also a replacement for MD RAID,
that would "just work".
  * 'ext4' (and XFS before that) already created a few years ago
trouble because of the 'O_PONIES' controversy.
  * Not only does Btrfs still have "challenges" as to multi-device
functionality, and in-place upgrades from 'ext4' have
"challenges" too, but it has many "special cases" that need
skill and discretion to handle, because it tries to cover so
many different cases; and the first thing many a RH customer
would do is create a ticket to ask what to do, or how to
fix a choice already made.

Try to imagine the impact on the RH ticketing system of a switch
from 'ext4' to Btrfs, with explanations like the above, about
NOCOW, defrag, snapshots, balance, reflinks, and the exact order
in which they have to be performed for best results.


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-16 Thread Peter Grandi
[ ... ]

> But I've talked to some friend at the local super computing
> centre and they have rather general issues with CoW at their
> virtualisation cluster.

Amazing news! :-)

> Like SUSE's snapper making many snapshots leading the storage
> images of VMs apparently to explode (in terms of space usage).

Well, this could be an argument that some of your friends are being
"challenged" by running the storage systems of a "super computing
centre" and that they could become "more prepared" about system
administration, for example as to the principle "know which tool to
use for which workload". Or else it could be an argument that they
expect Btrfs to do their job while they watch cat videos from the
intertubes. :-)


Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?

2017-08-16 Thread Peter Grandi
> We use the crcs to catch storage gone wrong, [ ... ]

And that's an opportunistically feasible idea given that current
CPUs can do that in real-time.

> [ ... ] It's possible to protect against all three without COW,
> but all solutions have their own tradeoffs and this is the setup
> we chose. It's easy to trust and easy to debug and at scale that
> really helps.

Indeed all filesystem designs have pathological workloads, and
system administrators and applications developers who are "more
prepared" know which one is best for which workload, or try to
figure it out.

> Some databases also crc, and all drives have correction bits of
> of some kind. There's nothing wrong with crcs happening at lots
> of layers.

Well, there is: in theory checksumming should be end-to-end, that
is entirely application level, so applications that don't need it
don't pay the price, but having it done at other layers can help
the very many applications that don't do it and should do it, and
it is cheap, and can help when troubleshooting exactly where the
problem is. It is an opportunistic thing to do.
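
For instance (a sketch only; the mountpoint is a placeholder), the
Btrfs layer's own view can be queried when narrowing down where
corruption was introduced:

  #  btrfs scrub start -Bd /mnt   # verify all checksums, report per-device statistics
  #  btrfs device stats /mnt      # cumulative read/write/flush/corruption/generation error counters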

> [ ... ] My real goal is to make COW fast enough that we can
> leave it on for the database applications too.  Obviously I
> haven't quite finished that one yet ;) [ ... ]

And this worries me because it portends the usual "marketing" goal
of making Btrfs all things to all workloads, the "OpenStack of
filesystems", with little consideration for complexity,
maintainability, or even sometimes reality.

The reality is that all known storage media have hugely
anisotropic performance envelopes, both as to functionality, cost,
speed, reliability, and there is no way to have an automagic
filesystem that "just works" in all cases, despite the constant
demands for one by "less prepared" storage administrators and
application developers. The reality is also that if one such
filesystem could automagically adapt to cover optimally the
performance envelopes of every possible device and workload, it
would be so complex as to be unmaintainable in practice.

So Btrfs, in its base "Rodeh" functionality, with COW, checksums,
subvolumes, snapshots, *on a single device*, works pretty well and
reliably and it is already very useful, for most workloads. Some
people also like some of its exotic complexities like in-place
compression and defragmentation, but they come at a high cost.

For workloads that inflict lots of small random in-place updates
on storage, like tablespaces for DBMSes etc, perhaps simpler less
featureful storage abstraction layers are more appropriate, from
OCFS2 to simple DM/LVM2 LVs, and Btrfs NOCOW approximates them
well.

BTW as to the specifics of DBMSes and filesystems, there is a
classic paper making eminently reasonable, practical, suggestions
that have been ignored for only 35 years and some:

  %A M. R. Stonebraker
  %T Operating system support for database management
  %J CACM
  %V 24
  %D JUL 1981
  %P 412-418


Re: Btrfs + compression = slow performance and high cpu usage

2017-08-01 Thread Peter Grandi
[ ... ]

> This is the "storage for beginners" version, what happens in
> practice however depends a lot on specific workload profile
> (typical read/write size and latencies and rates), caching and
> queueing algorithms in both Linux and the HA firmware.

To add a bit of slightly more advanced discussion, the main
reason for larger strips ("chunk size") is to avoid the huge
latencies of disk rotation when using unsynchronized disk drives,
as detailed here:

  http://www.sabi.co.uk/blog/12-thr.html?120310#120310

That relates weakly to Btrfs.


Re: Btrfs + compression = slow performance and high cpu usage

2017-08-01 Thread Peter Grandi
>> [ ... ] a "RAID5 with 128KiB writes and a 768KiB stripe
>> size". [ ... ] several back-to-back 128KiB writes [ ... ] get
>> merged by the 3ware firmware only if it has a persistent
>> cache, and maybe your 3ware does not have one,

> KOS: No I don't have persistent cache. Only the 512 Mb cache
> on board of a controller, that is BBU.

If it is a persistent cache, which can be battery-backed (as I
wrote, but it seems that you don't have much time to read
replies), then the size of the write, 128KiB or not, should not
matter much; the write will be reported complete when it hits
the persistent cache (whichever technology it uses), and then
the HA firmware will spill write-cached data to the disks using
the optimal operation width.

Unless the 3ware firmware is really terrible (and depending on
model and vintage it can be amazingly terrible) or the battery
is no longer recharging and then the host adapter switches to
write-through.

That you see very different rates between uncompressed and
compressed writes, where the main difference is the limitation
on the segment size, seems to indicate that compressed writes
involve a lot of RMW, that is sub-stripe updates. As I mentioned
already, it would be interesting to retry 'dd' with different
'bs' values without compression and with 'sync' (or 'direct',
which only makes sense without compression).
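
Something along these lines (a sketch only; the target path and
sizes are placeholders, and 'oflag=sync' is deliberately slow):

  for bs in 64k 128k 256k 768k; do
      rm -f /mnt/sda/testfile
      /usr/bin/time dd if=/dev/zero of=/mnt/sda/testfile bs=$bs count=1000 oflag=sync
      # or 'oflag=direct', on a mount without compression
  done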

> If I had additional SSD caching on the controller I would have
> mentioned it.

So far you had not mentioned the presence of BBU cache either,
which is equivalent, even if in one of your previous message
(which I try to read carefully) there were these lines:

 Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
 Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

So perhaps someone else would have checked long ago the status
of the BBU and whether the "No Write Cache if Bad BBU" case has
happened. If the BBU is still working and the policy is still
"WriteBack" then things are stranger still.

> I was also under impression, that in a situation where mostly
> extra large files will be stored on the massive, the bigger
> strip size would indeed increase the speed, thus I went with
> with the 256 Kb strip size.

That runs counter to this simple story: suppose a program is
doing 64KiB IO:

* For *reads*, there are 4 data drives and the strip size is
  16KiB: the 64KiB will be read in parallel on 4 drives. If the
  strip size is 256KiB then the 64KiB will be read sequentially
  from just one disk, and 4 successive reads will be read
  sequentially from the same drive.

* For *writes* on a parity RAID like RAID5 things are much, much
  more extreme: the 64KiB will be written with 16KiB strips on a
  5-wide RAID5 set in parallel to 5 drives, with 4 stripes being
  updated with RMW. But with 256KiB strips it will partially
  update 5 drives, because the stripe is 1024+256KiB, and it
  needs to do RMW, and four successive 64KiB writes will need to
  do that too, even if only one drive is updated. Usually for
  RAID5 there is an optimization that means that only the
  specific target drive and the parity drive(s) need RMW, but
  it is still very expensive.

This is the "storage for beginners" version, what happens in
practice however depends a lot on specific workload profile
(typical read/write size and latencies and rates), caching and
queueing algorithms in both Linux and the HA firmware.

> Would I be correct in assuming that the RAID strip size of 128
> Kb will be a better choice if one plans to use the BTRFS with
> compression?

That would need to be tested, because of "depends a lot on
specific workload profile, caching and queueing algorithms", but
my expectation is that the lower the better. Given that you have
4 drives giving a 3+1 RAID set, perhaps a 32KiB or 64KiB strip
size, giving a data stripe size of 96KiB or 192KiB, would be
better.


Re: Btrfs + compression = slow performance and high cpu usage

2017-08-01 Thread Peter Grandi
> Peter, I don't think the filefrag is showing the correct
> fragmentation status of the file when the compression is used.

As reported in a previous message, the output of 'filefrag -v'
can be used to see what is going on:

 filefrag /mnt/sde3/testfile 
   /mnt/sde3/testfile: 49287 extents found

 Most of the latter extents are mercifully rather contiguous, their
 size is just limited by the compression code, here is an extract
 from 'filefrag -v' from around the middle:

   24757:  1321888.. 1321919:   11339579..  11339610: 32:   11339594:
   24758:  1321920.. 1321951:   11339597..  11339628: 32:   11339611:
   24759:  1321952.. 1321983:   11339615..  11339646: 32:   11339629:
   24760:  1321984.. 1322015:   11339632..  11339663: 32:   11339647:
   24761:  1322016.. 1322047:   11339649..  11339680: 32:   11339664:
   24762:  1322048.. 1322079:   11339667..  11339698: 32:   11339681:
   24763:  1322080.. 1322111:   11339686..  11339717: 32:   11339699:
   24764:  1322112.. 1322143:   11339703..  11339734: 32:   11339718:
   24765:  1322144.. 1322175:   11339720..  11339751: 32:   11339735:
   24766:  1322176.. 1322207:   11339737..  11339768: 32:   11339752:
   24767:  1322208.. 1322239:   11339754..  11339785: 32:   11339769:
   24768:  1322240.. 1322271:   11339771..  11339802: 32:   11339786:
   24769:  1322272.. 1322303:   11339789..  11339820: 32:   11339803:

 But again this is on a fresh empty Btrfs volume.

As I wrote, "their size is just limited by the compression code"
which results in "128KiB writes". On a "fresh empty Btrfs volume"
the compressed extents limited to 128KiB also happen to be pretty
physically contiguous, but on a more fragmented free space list
they can be more scattered.

As I already wrote, the main issue here seems to be that we are
talking about a "RAID5 with 128KiB writes and a 768KiB stripe
size". On MD RAID5 the slowdown because of RMW seems only to be
around 30-40%, but it looks like several back-to-back 128KiB
writes get merged by the Linux IO subsystem (not sure whether
that's thoroughly legal), and perhaps they get merged by the 3ware
firmware only if it has a persistent cache, and maybe your 3ware
does not have one, but you have kept your counsel as to that.

My impression is that you read the Btrfs documentation and my
replies with a lot less attention than I write them. Some of the
things you have done and said make me think that you did not read
https://btrfs.wiki.kernel.org/index.php/Compression and 'man 5
btrfs', for example:

   "How does compression interact with direct IO or COW?

 Compression does not work with DIO, does work with COW and
 does not work for NOCOW files. If a file is opened in DIO
 mode, it will fall back to buffered IO.

   Are there speed penalties when doing random access to a
   compressed file?

 Yes. The compression processes ranges of a file of maximum
 size 128 KiB and compresses each 4 KiB (or page-sized) block
 separately."

> I am currently defragmenting that mountpoint, ensuring that
> everrything is compressed with zlib.

Defragmenting the used space might help find more contiguous
allocations.

> p.s. any other suggestion that might help with the fragmentation
> and data allocation. Should I try and rebalance the data on the
> drive?

Yes, regularly, as that defragments the unused space.


Re: Btrfs + compression = slow performance and high cpu usage

2017-07-31 Thread Peter Grandi
> [ ... ] It is hard for me to see a speed issue here with
> Btrfs: for comparison I have done a simple test with a both a
> 3+1 MD RAID5 set with a 256KiB chunk size and a single block
> device on "contemporary" 1T/2TB drives, capable of sequential
> transfer rates of 150-190MB/s: [ ... ]

The figures after this are a bit on the low side because I
realized looking at 'vmstat' that the source block device 'sda6'
was being a bottleneck, as the host has only 8GiB instead of the
16GiB I misremembered, and also 'sda' is a relatively slow flash
SSD that reads at most around 220MB/s. So I have redone the
simple tests with a transfer size of 3GB, which ensures that
all reads are from memory cache:

with compression:

  soft#  mount -t btrfs -o commit=10,compress-force=zlib /dev/md/test5 /mnt/test5
  soft#  mount -t btrfs -o commit=10,compress-force=zlib /dev/sdg3 /mnt/sdg3
  soft#  rm -f /mnt/test5/testfile /mnt/sdg3/testfile   


  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile bs=1M count=3000 conv=fsync
  3000+0 records in
  3000+0 records out
  3145728000 bytes (3.1 GB) copied, 15.8869 s, 198 MB/s
  0.00user 2.80system 0:15.88elapsed 17%CPU (0avgtext+0avgdata 3056maxresident)k
  0inputs+6148256outputs (0major+346minor)pagefaults 0swaps

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sdg3/testfile bs=1M count=3000 conv=fsync
  3000+0 records in
  3000+0 records out
  3145728000 bytes (3.1 GB) copied, 16.9663 s, 185 MB/s
  0.00user 2.61system 0:16.96elapsed 15%CPU (0avgtext+0avgdata 3056maxresident)k
  0inputs+6144672outputs (0major+346minor)pagefaults 0swaps

  soft#  btrfs fi df /mnt/test5/ | grep Data
  Data, single: total=3.00GiB, used=2.28GiB
  soft#  btrfs fi df /mnt/sdg3 | grep Data
  Data, single: total=3.00GiB, used=2.28GiB

  soft#  filefrag /mnt/test5/testfile /mnt/sdg3/testfile
  /mnt/test5/testfile: 8811 extents found
  /mnt/sdg3/testfile: 8759 extents found

Slightly weird that with a 3GB size the number of extents is
almost double that for the 10GB, but I guess that depends on
speed.

Then without compression:

  soft#  mount -t btrfs -o commit=10 /dev/md/test5 /mnt/test5
  soft#  mount -t btrfs -o commit=10 /dev/sdg3 /mnt/sdg3
  soft#  rm -f /mnt/test5/testfile /mnt/sdg3/testfile

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile bs=1M count=3000 conv=fsync
  3000+0 records in
  3000+0 records out
  3145728000 bytes (3.1 GB) copied, 8.06841 s, 390 MB/s
  0.00user 3.90system 0:08.80elapsed 44%CPU (0avgtext+0avgdata 2880maxresident)k
  0inputs+6153856outputs (0major+345minor)pagefaults 0swaps

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sdg3/testfile bs=1M count=3000 conv=fsync
  3000+0 records in
  3000+0 records out
  3145728000 bytes (3.1 GB) copied, 30.215 s, 104 MB/s
  0.00user 4.82system 0:30.93elapsed 15%CPU (0avgtext+0avgdata 2888maxresident)k
  0inputs+6152128outputs (0major+347minor)pagefaults 0swaps

  soft#  filefrag /mnt/test5/testfile /mnt/sdg3/testfile

  /mnt/test5/testfile: 5 extents found
  /mnt/sdg3/testfile: 3 extents found

Also added:

  soft#  rm -f /mnt/test5/testfile /mnt/sdg3/testfile   
   

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 bs=128k count=3000 | dd iflag=fullblock of=/mnt/test5/testfile bs=128k oflag=sync
  3000+0 records in
  3000+0 records out
  393216000 bytes (393 MB) copied, 160.315 s, 2.5 MB/s
  0.02user 0.46system 2:40.31elapsed 0%CPU (0avgtext+0avgdata 1992maxresident)k
  0inputs+0outputs (0major+124minor)pagefaults 0swaps
  3000+0 records in
  3000+0 records out
  393216000 bytes (393 MB) copied, 160.365 s, 2.5 MB/s

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 bs=128k count=3000 | dd iflag=fullblock of=/mnt/sdg3/testfile bs=128k oflag=sync
  3000+0 records in
  3000+0 records out
  393216000 bytes (393 MB) copied, 113.51 s, 3.5 MB/s
  0.02user 0.56system 1:53.51elapsed 0%CPU (0avgtext+0avgdata 2156maxresident)k
  0inputs+0outputs (0major+120minor)pagefaults 0swaps
  3000+0 records in
  3000+0 records out
  393216000 bytes (393 MB) copied, 113.544 s, 3.5 MB/s

  soft#  filefrag /mnt/test5/testfile /mnt/sdg3/testfile
   
  /mnt/test5/testfile: 1 extent found
  /mnt/sdg3/testfile: 22 extents found

  soft#  rm -f /mnt/test5/testfile /mnt/sdg3/testfile   
   

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 bs=1M count=1000 | dd 
iflag=fullblock of=/mnt/test5/testfile bs=1M oflag=sync

Re: Btrfs + compression = slow performance and high cpu usage

2017-07-31 Thread Peter Grandi
[ ... ]

> Also added:

Feeling very generous :-) today, adding these too:

  soft#  mkfs.btrfs -mraid10 -draid10 -L test5 /dev/sd{b,c,d,e}3
  [ ... ]
  soft#  mount -t btrfs -o commit=10,compress-force=zlib /dev/sdb3 /mnt/test5

  soft#  rm -f /mnt/test5/testfile
  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile 
bs=1M count=3000 conv=fsync
  3000+0 records in
  3000+0 records out
  3145728000 bytes (3.1 GB) copied, 14.2166 s, 221 MB/s
  0.00user 2.54system 0:14.21elapsed 17%CPU (0avgtext+0avgdata 3056maxresident)k
  0inputs+6144768outputs (0major+346minor)pagefaults 0swaps

  soft#  rm -f /mnt/test5/testfile
  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile 
bs=128k count=3000 conv=fsync   
 
  3000+0 records in
  3000+0 records out
  393216000 bytes (393 MB) copied, 2.05933 s, 191 MB/s
  0.00user 0.32system 0:02.06elapsed 15%CPU (0avgtext+0avgdata 1996maxresident)k
  0inputs+772512outputs (0major+124minor)pagefaults 0swaps

  soft#  rm -f /mnt/test5/testfile
  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 bs=1M count=1000 | dd 
iflag=fullblock of=/mnt/test5/testfile bs=1M oflag=sync 
  
  1000+0 records in
  1000+0 records out
  1048576000 bytes (1.0 GB) copied, 60.6019 s, 17.3 MB/s
  0.01user 1.04system 1:00.60elapsed 1%CPU (0avgtext+0avgdata 2888maxresident)k
  0inputs+0outputs (0major+348minor)pagefaults 0swaps
  1000+0 records in
  1000+0 records out
  1048576000 bytes (1.0 GB) copied, 60.4116 s, 17.4 MB/s

  soft#  rm -f /mnt/test5/testfile
  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 bs=128k count=3000 | dd 
iflag=fullblock of=/mnt/test5/testfile bs=128k oflag=sync
  3000+0 records in
  3000+0 records out
  393216000 bytes (393 MB) copied, 148.04 s, 2.7 MB/s
  0.00user 0.62system 2:28.04elapsed 0%CPU (0avgtext+0avgdata 1996maxresident)k
  0inputs+0outputs (0major+125minor)pagefaults 0swaps
  3000+0 records in
  3000+0 records out
  393216000 bytes (393 MB) copied, 148.083 s, 2.7 MB/s

  soft#  sysctl vm/drop_caches=3
  vm.drop_caches = 3
  soft#  /usr/bin/time dd iflag=fullblock if=/mnt/test5/testfile bs=128k 
count=3000 of=/dev/zero 
  
  3000+0 records in
  3000+0 records out
  393216000 bytes (393 MB) copied, 1.09729 s, 358 MB/s
  0.00user 0.24system 0:01.10elapsed 23%CPU (0avgtext+0avgdata 2164maxresident)k
  459768inputs+0outputs (3major+121minor)pagefaults 0swaps


Re: Btrfs + compression = slow performance and high cpu usage

2017-07-31 Thread Peter Grandi
[ ... ]

> grep 'model name' /proc/cpuinfo | sort -u 
> model name  : Intel(R) Xeon(R) CPU   E5645  @ 2.40GHz

Good, contemporary CPU with all accelerations.

> The sda device is a hardware RAID5 consisting of 4x8TB drives.
[ ... ]
> Strip Size  : 256 KB

So the full RMW data stripe length is 768KiB.
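
That is, with 4 drives one strip per stripe holds parity, so the
data in a full stripe is:

  (4 - 1) x 256KiB = 768KiB

and any write smaller than that (or not aligned to it) implies a
read-modify-write cycle on the host adapter.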

> [ ... ] don't see the previously reported behaviour of one of
> the kworker consuming 100% of the cputime, but the write speed
> difference between the compression ON vs OFF is pretty large.

That's weird; of course 'lzo' is a lot cheaper than 'zlib', but
in my test the much higher CPU time of the latter was spread
across many CPUs, while in your case it wasn't, even if the
E5645 has 6 CPUs and can do 12 threads. That seemed to point to
some high cost of finding free blocks, that is a very fragmented
free list, or something else.

> dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress oflag=direct
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 26.0685 s, 206 MB/s

The results with 'oflag=direct' are not relevant, because Btrfs
behaves "differently" with that.

> mountflags: 
> (rw,relatime,compress-force=zlib,space_cache=v2,subvolid=5,subvol=/)
[ ... ]
> dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress conv=fsync
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 77.4845 s, 69.3 MB/s
> mountflags: 
> (rw,relatime,compress-force=lzo,space_cache=v2,subvolid=5,subvol=/)
[ ... ]
> dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress conv=fsync
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 122.321 s, 43.9 MB/s

That's pretty good for a RAID5 with 128KiB writes and a 768KiB
stripe size, on a 3ware, and it looks like the hw host adapter
does not have a persistent (usually battery-backed) cache. My
guess is that watching transfer rates and latencies with 'iostat
-dk -zyx 1' did not happen.

> mountflags: (rw,relatime,space_cache=v2,subvolid=5,subvol=/)
[ ... ]
> dd if=/dev/sdb  of=./testing count=5120 bs=1M status=progress conv=fsync
> 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 10.1033 s, 531 MB/s

I had mentioned in my previous reply the output of 'filefrag'.
That to me seems relevant here, because of RAID5 RMW and maximum
extent size with Btrfs compression and strip/stripe size.

Perhaps redoing the tests with a 128KiB 'bs' *without*
compression would be interesting, perhaps even with 'oflag=sync'
instead of 'conv=fsync'.
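
For example, starting from the command quoted above (just a
sketch with the same paths, only the block size and flags
changed, and with the volume mounted without 'compress-force'):

  dd if=/dev/sdb of=./testing count=40960 bs=128k status=progress oflag=sync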

It is hard for me to see a speed issue here with Btrfs: for
comparison I have done a simple test with both a 3+1 MD RAID5
set with a 256KiB chunk size and a single block device, on
"contemporary" 1TB/2TB drives capable of sequential transfer
rates of 150-190MB/s:

  soft#  grep -A2 sdb3 /proc/mdstat 
  md127 : active raid5 sde3[4] sdd3[2] sdc3[1] sdb3[0]
729808128 blocks super 1.0 level 5, 256k chunk, algorithm 2 [4/4] []

with compression:

  soft#  mount -t btrfs -o commit=10,compress-force=zlib /dev/md/test5 
/mnt/test5   
  soft#  mount -t btrfs -o commit=10,compress-force=zlib /dev/sdg3 /mnt/sdg3
  soft#  rm -f /mnt/test5/testfile /mnt/sdg3/testfile

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/test5/testfile 
bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 94.3605 s, 111 MB/s
  0.01user 12.59system 1:34.36elapsed 13%CPU (0avgtext+0avgdata 
2932maxresident)k
  13042144inputs+20482144outputs (3major+345minor)pagefaults 0swaps

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sdg3/testfile 
bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 93.5885 s, 112 MB/s
  0.03user 12.35system 1:33.59elapsed 13%CPU (0avgtext+0avgdata 
2940maxresident)k
  13042144inputs+20482400outputs (3major+346minor)pagefaults 0swaps

  soft#  filefrag /mnt/test5/testfile /mnt/sdg3/testfile
  /mnt/test5/testfile: 48945 extents found
  /mnt/sdg3/testfile: 49029 extents found

  soft#  btrfs fi df /mnt/test5/ | grep Data
  Data, single: total=7.00GiB, used=6.55GiB

  soft#  btrfs fi df /mnt/sdg3 | grep Data
  Data, single: total=7.00GiB, used=6.55GiB

  soft#  sysctl vm/drop_caches=3
  vm.drop_caches = 3
  soft#  /usr/bin/time dd iflag=fullblock if=/mnt/test5/testfile bs=1M 
count=10000 of=/dev/zero
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 23.2975 s, 450 MB/s
  0.01user 7.59system 0:23.32elapsed 32%CPU (0avgtext+0avgdata 2932maxresident)k
  13759624inputs+0outputs (3major+344minor)pagefaults 0swaps

  soft#  sysctl vm/drop_caches=3
  vm.drop_caches = 3
  soft#  /usr/bin/time dd iflag=fullblock if=/mnt/sdg3/testfile bs=1M 
count=10000 of=/dev/zero
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 35.0032 s, 300 MB/s
  0.01user 8.46system 0:35.03elapsed 24%CPU (0avgtext+0avgdata 2924maxresident)k
  13750568inputs+0outputs (3major+345minor)pagefaults 0swaps

and 

Re: Btrfs + compression = slow performance and high cpu usage

2017-07-28 Thread Peter Grandi
In addition to my previous "it does not happen here" comment, if
someone is reading this thread, there are some other interesting
details:

> When the compression is turned off, I am able to get the
> maximum 500-600 mb/s write speed on this disk (raid array)
> with minimal cpu usage.

No details on whether it is a parity RAID or not.

> btrfs device usage /mnt/arh-backup1/
> /dev/sda, ID: 2
>Device size:21.83TiB
>Device slack:  0.00B
>Data,single: 9.29TiB
>Metadata,single:46.00GiB
>System,single:  32.00MiB
>Unallocated:12.49TiB

That's exactly 24TB of "Device size", of which around 45% are
used, and the string "backup" may suggest that the content is
backups, which may indicate a very fragmented freespace.
Of course compression does not help with that; in my freshly
created Btrfs volume I get, as expected:

  soft#  umount /mnt/sde3
  soft#  mount -t btrfs -o commit=10 /dev/sde3 /mnt/sde3
 

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sde3/testfile 
bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 103.747 s, 101 MB/s
  0.00user 11.56system 1:44.86elapsed 11%CPU (0avgtext+0avgdata 
3072maxresident)k
  20480672inputs+20498272outputs (1major+349minor)pagefaults 0swaps

  soft#  filefrag /mnt/sde3/testfile 
  /mnt/sde3/testfile: 11 extents found

versus:

  soft#  umount /mnt/sde3   
 
  soft#  mount -t btrfs -o commit=10,compress=lzo,compress-force /dev/sde3 
/mnt/sde3

  soft#  /usr/bin/time dd iflag=fullblock if=/dev/sda6 of=/mnt/sde3/testfile 
bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 109.051 s, 96.2 MB/s
  0.02user 13.03system 1:49.49elapsed 11%CPU (0avgtext+0avgdata 
3068maxresident)k
  20494784inputs+20492320outputs (1major+347minor)pagefaults 0swaps

  soft#  filefrag /mnt/sde3/testfile 
  /mnt/sde3/testfile: 49287 extents found

Most of the latter extents are mercifully rather contiguous;
their size is just limited by the compression code. Here is an
extract from 'filefrag -v' from around the middle:

  24757:  1321888.. 1321919:   11339579..  11339610: 32:   11339594:
  24758:  1321920.. 1321951:   11339597..  11339628: 32:   11339611:
  24759:  1321952.. 1321983:   11339615..  11339646: 32:   11339629:
  24760:  1321984.. 1322015:   11339632..  11339663: 32:   11339647:
  24761:  1322016.. 1322047:   11339649..  11339680: 32:   11339664:
  24762:  1322048.. 1322079:   11339667..  11339698: 32:   11339681:
  24763:  1322080.. 1322111:   11339686..  11339717: 32:   11339699:
  24764:  1322112.. 1322143:   11339703..  11339734: 32:   11339718:
  24765:  1322144.. 1322175:   11339720..  11339751: 32:   11339735:
  24766:  1322176.. 1322207:   11339737..  11339768: 32:   11339752:
  24767:  1322208.. 1322239:   11339754..  11339785: 32:   11339769:
  24768:  1322240.. 1322271:   11339771..  11339802: 32:   11339786:
  24769:  1322272.. 1322303:   11339789..  11339820: 32:   11339803:

But again this is on a fresh empty Btrfs volume.


Re: Btrfs + compression = slow performance and high cpu usage

2017-07-28 Thread Peter Grandi
> I am stuck with a problem of btrfs slow performance when using
> compression. [ ... ]

That to me looks like an issue with speed, not performance, and
in particular with PEBCAK issues.

As to high CPU usage, when you find a way to do both compression
and checksumming without using much CPU time, please send patches
urgently :-).

In your case the increase in CPU time is bizarre. I have the
Ubuntu 4.4 "lts-xenial" kernel and what you report does not
happen here (with a few little changes):

  soft#  grep 'model name' /proc/cpuinfo | sort -u
  model name  : AMD FX(tm)-6100 Six-Core Processor
  soft#  cpufreq-info | grep 'current CPU frequency'
current CPU frequency is 3.30 GHz (asserted by call to hardware).
current CPU frequency is 3.30 GHz (asserted by call to hardware).
current CPU frequency is 3.30 GHz (asserted by call to hardware).
current CPU frequency is 3.30 GHz (asserted by call to hardware).
current CPU frequency is 3.30 GHz (asserted by call to hardware).
current CPU frequency is 3.30 GHz (asserted by call to hardware).

  soft#  lsscsi | grep 'sd[ae]'
  [0:0:0:0]diskATA  HFS256G32MNB-220 3L00  /dev/sda
  [5:0:0:0]diskATA  ST2000DM001-1CH1 CC44  /dev/sde

  soft#  mkfs.btrfs -f /dev/sde3
  [ ... ]
  soft#  mount -t btrfs -o 
discard,autodefrag,compress=lzo,compress-force,commit=10 /dev/sde3 /mnt/sde3

  soft#  df /dev/sda6 /mnt/sde3
  Filesystem 1M-blocks  Used Available Use% Mounted on
  /dev/sda6  90048 76046 14003  85% /
  /dev/sde3     237568    19    235501   1% /mnt/sde3

The above is useful context information that was "amazingly"
omitted from your report.

In dmesg I see (note the "force zlib compression"):

  [327730.917285] BTRFS info (device sde3): turning on discard
  [327730.917294] BTRFS info (device sde3): enabling auto defrag
  [327730.917300] BTRFS info (device sde3): setting 8 feature flag
  [327730.917304] BTRFS info (device sde3): force zlib compression
  [327730.917313] BTRFS info (device sde3): disk space caching is enabled
  [327730.917315] BTRFS: has skinny extents
  [327730.917317] BTRFS: flagging fs with big metadata feature
  [327730.920740] BTRFS: creating UUID tree

and the result is:

  soft#  pv -tpreb /dev/sda6 | time dd iflag=fullblock of=/mnt/sde3/testfile 
bs=1M count=10000 oflag=direct
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 112.845 s, 92.9 MB/s
  0.05user 9.93system 1:53.20elapsed 8%CPU (0avgtext+0avgdata 3016maxresident)k
  120inputs+20496000outputs (1major+346minor)pagefaults 0swaps
  9.77GB 0:01:53 [88.3MB/s] [==>]
  11%

  soft#  btrfs fi df /mnt/sde3/
  Data, single: total=10.01GiB, used=9.77GiB
  System, DUP: total=8.00MiB, used=16.00KiB
  Metadata, DUP: total=1.00GiB, used=11.66MiB
  GlobalReserve, single: total=16.00MiB, used=0.00B

As it was running, system CPU time was under 20% of one CPU:

  top - 18:57:29 up 3 days, 19:27,  4 users,  load average: 5.44, 2.82, 1.45
  Tasks: 325 total,   1 running, 324 sleeping,   0 stopped,   0 zombie
  %Cpu0  :  0.0 us,  2.3 sy,  0.0 ni, 91.3 id,  6.3 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu1  :  0.0 us,  1.3 sy,  0.0 ni, 78.5 id, 20.2 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu2  :  0.3 us,  5.8 sy,  0.0 ni, 81.0 id, 12.5 wa,  0.0 hi,  0.3 si,  0.0 st
  %Cpu3  :  0.3 us,  3.4 sy,  0.0 ni, 91.9 id,  4.4 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu4  :  0.3 us, 10.6 sy,  0.0 ni, 55.4 id, 33.7 wa,  0.0 hi,  0.0 si,  0.0 st
  %Cpu5  :  0.0 us,  0.3 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
  KiB Mem:   8120660 total,  5162236 used,  2958424 free,  4440100 buffers
  KiB Swap:0 total,0 used,0 free.   351848 cached Mem

    PID  PPID USER      PR  NI    VIRT    RES   DATA  %CPU %MEM     TIME+ TTY      COMMAND
  21047 21046 root      20   0    8872   2616   1364  12.9  0.0   0:02.31 pts/3    dd iflag=fullblo+
  21045  3535 root      20   0    7928   1948    460  12.3  0.0   0:00.72 pts/3    pv -tpreb /dev/s+
  21019     2 root      20   0       0      0      0   1.3  0.0   0:42.88 ?        [kworker/u16:1]

Of course "oflag=direct" is a rather "optimistic" option in this
context, so I tried again with something more sensible:

  soft#  pv -tpreb /dev/sda6 | time dd iflag=fullblock of=/mnt/sde3/testfile 
bs=1M count=10000 conv=fsync
  10000+0 records in
  10000+0 records out
  10485760000 bytes (10 GB) copied, 110.523 s, 94.9 MB/s
  0.03user 8.94system 1:50.71elapsed 8%CPU (0avgtext+0avgdata 3024maxresident)k
  136inputs+20499648outputs (1major+348minor)pagefaults 0swaps
  9.77GB 0:01:50 [90.3MB/s] [==>] 11%

  soft#  btrfs fi df /mnt/sde3/
  Data, single: total=7.01GiB, used=6.35GiB
  System, DUP: total=8.00MiB, used=16.00KiB
  Metadata, DUP: total=1.00GiB, used=15.81MiB
  GlobalReserve, 

Re: kernel btrfs file system wedged -- is it toast?

2017-07-21 Thread Peter Grandi
> [ ... ] announce loudly and clearly to any potential users, in
> multiple places (perhaps a key announcement in a few places
> and links to that announcement from many places,

https://btrfs.wiki.kernel.org/index.php/Gotchas#Having_many_subvolumes_can_be_very_slow

> ... DO expect to first have to learn, the hard way, of
> ... whatever special mitigations might apply in ones
> ... particular circumstances, before considering deploying
> ... btrfs into a production environment where this, or other
> ... (what other?) surprising limitations of btrfs may apply.

In computer jargon that is called "being a system engineer".

> All the prominent places that respond to the question of
> whether btrfs is ready for production use (spanning several
> years now) should if possible display this warning.

https://btrfs.wiki.kernel.org/index.php/Status
  "The table below aims to serve as an overview for the stability
  status of the features BTRFS supports. While a feature may be
  functionally safe and reliable, it does not necessarily mean
  that its useful, for example in meeting your performance
  expectations for your specific workload."

> [ ... ] Back in my day, such a performance bug would have made
> the software containing it unreleasable, _especially_ in
> software such as a major file system that is expected to
> provide reliable service, where "reliable" means both
> preserving data integrity and doing so within an order of
> magnitude of a reasonably expected time.

For the past several decades, since perhaps the Manchester MARK I
(or the ZUSE probably), it has been known to "system engineers"
and "programmers" that most features of "hardware" and "software"
have very anisotropic performance envelopes, both as to speed and
usability and reliability, and not all potentially syntactically
valid combinations of features are equally robust and excellent
under every possible workload, and indeed very few are, and it is
part of "being a system engineer" or "being a programmer" to
develop insight and experience, leading to knowledge ideally, as
to what combinations work well and which work badly.

Considering somewhat imprecise car analogies, jet engines on cars
tend not to be wholly desirable, even if syntactically valid, and
using cars to plow fields does not necessarily yield the highest
productivity, even if also syntactically valid.

General introduction about anisotropy and performance:

  http://www.sabi.co.uk/blog/15-two.html?151023#151023

In storage system and filesystems "system engineers" often have
to confront a large number of pathological cases of anisotropy,
some examples I can find quickly in my links:

  http://www.sabi.co.uk/blog/1005May.html?100520#100520
  http://www.sabi.co.uk/blog/16-one.html?160322#160322
  http://www.sabi.co.uk/blog/15-one.html?150203#150203
  http://www.sabi.co.uk/blog/12-fou.html?121218#121218
  http://www.sabi.co.uk/blog/0802feb.html?080210#080210
  http://www.sabi.co.uk/blog/0802feb.html?080216#080216

I try to keep lists of "known pathologies" of various types here:

  http://www.sabi.co.uk/Notes/linuxStor.html
  http://www.sabi.co.uk/Notes/linuxFS.html#fsHints

Even "legendary" 'ext3' has/had pathological (and common) cases:

  https://lwn.net/Articles/328370/
  https://lwn.net/Articles/328363/
  https://bugzilla.kernel.org/show_bug.cgi?id=12309
  https://news.ycombinator.com/item?id=7376750
  https://news.ycombinator.com/item?id=7377315

  https://lkml.org/lkml/2009/4/6/331
the difference between the occasional 5+ second pause and the
occasional 10+ second pause wasn't really all that
interesting. They were both unusable, and both made me kill
the background writer almost immediately

Wisely L Torvalds writes:

  https://lkml.org/lkml/2010/11/10/233
ext3 [f]sync sucks. We know. All filesystems suck.
They just tend to do it in different dimensions.


Re: Exactly what is wrong with RAID5/6

2017-06-21 Thread Peter Grandi
> [ ... ] This will make some filesystems mostly RAID1, negating
> all space savings of RAID5, won't it? [ ... ]

RAID5/RAID6/... don't merely save space, more precisely they
trade lower resilience and a more anisotropic and smaller
performance envelope to gain lower redundancy (= save space).


Re: does using different uid/gid/forceuid/... mount options for different subvolumes work / does fuse.bindfs play nice with btrfs?

2017-06-20 Thread Peter Grandi
> I intend to provide different "views" of the data stored on
> btrfs subvolumes.  e.g. mount a subvolume in location A rw;
> and ro in location B while also overwriting uids, gids, and
> permissions. [ ... ]

That's not how UNIX/Linux permissions and ACLs are supposed to
work; perhaps you should reconsider such a poor idea.

Mount options are provided to map non-UNIX/Linux filesystems
onto the UNIX/Linux permission and ACL model in a crude way.

If you really want, uid/gid/permissions are an inode property,
and Btrfs via "reflinking" allows sharing of file data between
different inodes. So you can for example create a RW snapshot of
a subvolume, change all ids/permissions/ACLs, and the data space
will still be shared. This is not entirely cost-free.
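
A minimal sketch of that (subvolume, mountpoint and user names
are made up here):

  #  btrfs subvolume snapshot /mnt/vol/home /mnt/vol/home-alice
  #  chown -R alice: /mnt/vol/home-alice
  #  btrfs subvolume snapshot -r /mnt/vol/home /mnt/vol/home-ro

The file data extents stay shared; the recursive chown only
makes the second subvolume's inode metadata diverge, which is
the "not entirely cost-free" part.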

If you really really want this it may be doable also with mount
namespaces combined with user namespaces, with any filesystem,
depending on your requirements:

http://man7.org/linux/man-pages/man7/user_namespaces.7.html
http://man7.org/linux/man-pages/man7/mount_namespaces.7.html

As to mounting multiple times on multiple directories, that is
possible with Linux VFS regardless of filesystem type, or by
using 'mount --bind' or a variant.
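
For example (paths made up; note that on older kernels a
read-only bind mount needs a separate remount step):

  #  mount --bind /mnt/vol/subvol /srv/view-rw
  #  mount --bind /mnt/vol/subvol /srv/view-ro
  #  mount -o remount,bind,ro /srv/view-ro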


Re: Struggling with file system slowness

2017-05-04 Thread Peter Grandi
> Trying to peg down why I have one server that has
> btrfs-transacti pegged at 100% CPU for most of the time.

Too little information. Is IO happening at the same time? Is
compression on? Deduplicated? Lots of subvolumes? SSD? What kind
of workload and file size/distribution profile?

Typical causes of high CPU usage are extents (your defragging
did not necessarily work), and 'qgroups', especially with many
subvolumes. It could be the free space cache in some rare cases.

  
https://www.google.ca/search?num=100=images_q=cxpu_epq=btrfs-transaction

To this I'd add that something like this often happens but is
not Btrfs-related, being triggered for example by near-memory
exhaustion in the kernel memory manager.


Re: Ded

2017-05-03 Thread Peter Grandi
> I have a btrfs filesystem mounted at /btrfs_vol/ Every N
> minutes, I run bedup for deduplication of data in /btrfs_vol
> Inside /btrfs_vol, I have several subvolumes (consider this as
> home directories of several users) I have set individual
> qgroup limits for each of these subvolumes. [ ... ]

Let's hope that you have read the warnings about the potential
downsides of deduplication, quota groups, many subvolumes. It is
not as if "syntactically" every feature works with every other
feature without downsides, or as if they all scale up without
limit.


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-29 Thread Peter Grandi
>> [ ... ] these extents are all over the place, they're not
>> contiguous at all. 4K here, 4K there, 4K over there, back to
>> 4K here next to this one, 4K over there...12K over there, 500K
>> unwritten, 4K over there. This seems not so consequential on
>> SSD, [ ... ]

> Indeed there were recent reports that the 'ssd' mount option
> causes that, IIRC by Hans van Kranenburg [ ... ]

The report included news that "sometimes" the 'ssd' option is
automatically switched on at mount even on hard disks. I had
promised to put a summary of the issue on the Btrfs wiki, but
I regret that I haven't yet done that.


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-29 Thread Peter Grandi
> [ ... ] Instead, you can use raw files (preferably sparse unless
> there's both nocow and no snapshots). Btrfs does natively everything
> you'd gain from qcow2, and does it better: you can delete the master
> of a cloned image, deduplicate them, deduplicate two unrelated images;
> you can turn on compression, etc.

Uhm, I understand this argument in the general case (not
specifically as to QCOW2 images), and it has some merit, but it is
"controversial", as there are two counterarguments:

* Application-specific file formats can better match
  application-specific requirements.
* Putting advanced functionality into the filesystem code makes it more
  complex and less robust, and Btrfs is a bit of a major example of the
  consequences. I put compression and deduplication as things that I
  reckon make a filesystem too complex.

As to snapshots, I draw a distinction between filetree snapshots and file
snapshots: the first clones a tree as of the snapshot moment, and it is
a system management feature, while the second provides per-file update
rollback. One sort of implies the other, but using the per-file rollback
*systematically*, that is as a feature an application can rely on, seems

> Once you pay the btrfs performance penalty,

Uhmmm, Btrfs has a small or negative performance penalty as a
general purpose filesystem, and many (more or less well conceived) tests
show it performs up there with the best. The only two real costs I
attribute to it are the huge CPU cost of doing checksumming all the time, but
that's unavoidable if one wants checksumming, and that checksumming
usually requires metadata duplication, that is at least 'dup' profile
for metadata, and that is indeed a bit expensive.

> you may as well actually use its features,

The features that I think Btrfs gives that are worth using are
checksumming, metadata duplication, and filetree snapshots.
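
A sketch of a setup using just those three (device name made up;
'dup' metadata is in any case the default on a single rotating
disk, and checksumming is on by default):

  #  mkfs.btrfs -d single -m dup /dev/sdX9
  #  mount -t btrfs -o noatime /dev/sdX9 /mnt/vol
  #  btrfs subvolume snapshot -r /mnt/vol /mnt/vol/snap-20170429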

> which make qcow2 redundant and harmful.

My impression is that in almost all cases QCOW2 is harmful, because it
trades more IOPS and complexity for less disk space, and disk space is
cheap and IOPS and complexity are expensive, but of course a lot of
people know better :-). My preferred VM setup is a small essentially
read-only non-QCOW2 image for '/' and everything else mounted via NFSv4,
from the VM host itself or a NAS server, but again lots of people know
better and use multi-terabyte-sized QCOW2 images :-).


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Peter Grandi

> [ ... ] these extents are all over the place, they're not
> contiguous at all. 4K here, 4K there, 4K over there, back to
> 4K here next to this one, 4K over there...12K over there, 500K
> unwritten, 4K over there. This seems not so consequential on
> SSD, [ ... ]

Indeed there were recent reports that the 'ssd' mount option
causes that, IIRC by Hans van Kranenburg (around 2017-04-17),
which also noticed issues with the wandering trees in certain
situations (around 2017-04-08).


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Peter Grandi
>> The gotcha though is there's a pile of data in the journal
>> that would never make it to rsyslogd. If you use journalctl
>> -o verbose you can see some of this.

> You can send *all the info* to rsyslogd via imjournal
> http://www.rsyslog.com/doc/v8-stable/configuration/modules/imjournal.html
> In my setup all the data are stored in json format in the
> /var/log/cee.log file:
> $ head /var/log/cee.log 2017-04-28T18:41:41.931273+02:00
> venice liblogging-stdlog: @cee: { "PRIORITY": "6", "_BOOT_ID":
> "a86d74bab91f44dc974c76aceb97141f", "_MACHINE_ID": [ ... ]

Ahh the horror the horror, I will never be able to unsee
that. The UNIX way of doing things is truly dead.

>> The same behavior happens with NTFS in qcow2 files. They
>> quickly end up with 100,000+ extents unless set nocow.
>> It's like the worst case scenario.

In a particularly demented setup I had to decatastrophize with
great pain a Zimbra QCOW2 disk image (XFS on NFS on XFS on
RAID6) containing an ever-growing Maildir email archive that
ended up with over a million widely scattered microextents:

  http://www.sabi.co.uk/blog/1101Jan.html?110116#110116


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Peter Grandi
> [ ... ] And that makes me wonder whether metadata
> fragmentation is happening as a result. But in any case,
> there's a lot of metadata being written for each journal
> update compared to what's being added to the journal file. [
> ... ]

That's the "wandering trees" problem in COW filesystems, and
manifestations of it in Btrfs have also been reported before.
If there is a workload that triggers a lot of "wandering trees"
updates, then a filesystem that has "wandering trees" perhaps
should not be used :-).

> [ ... ] worse, a single file with 20,000 fragments; or 40,000
> separate journal files? *shrug* [ ... ]

Well, depends, but probably the single file: it is more likely
that the 20,000 fragments will actually be contiguous, and that
there will be less metadata IO than for 40,000 separate journal
files.

The deeper "strategic" issue is that storage systems and
filesystems in particular have very anisotropic performance
envelopes, and mismatches between the envelopes of application
and filesystem can be very expensive:
  http://www.sabi.co.uk/blog/15-two.html?151023#151023


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Peter Grandi
> Old news is that systemd-journald journals end up pretty
> heavily fragmented on Btrfs due to COW.

This has been discussed before in detail indeeed here, but also
here: http://www.sabi.co.uk/blog/15-one.html?150203#150203

> While journald uses chattr +C on journal files now, COW still
> happens if the subvolume the journal is in gets snapshot. e.g.
> a week old system.journal has 19000+ extents. [ ... ]  It
> appears to me (see below URLs pointing to example journals)
> that journald fallocated in 8MiB increments but then ends up
> doing 4KiB writes; [ ... ]

So there are three layers of silliness here:

* Writing large files slowly to a COW filesystem and
  snapshotting it frequently.
* A filesystem that does delayed allocation instead of
  allocate-ahead, and does not have psychic code.
* Working around that by using no-COW and preallocation
  with a fixed size regardless of snapshot frequency.

The primary problem here is that there is no way to have slow
small writes and frequent snapshots without generating small
extents: if a file is written at a rate of 1MiB/hour and gets
snapshot every hour the extent size will not be larger than 1MiB
*obviously*.

Filesystem-level snapshots are not designed to snapshot slowly
growing files, but to snapshot changing collections of
files. There are harsh tradeoffs involved. Application-level
snapshots (also known as log rotations :->) are needed for
special cases and finer grained policies.

The secondary problem is that a fixed preallocation of 8MiB is
good only if in between snapshots the file grows by a little
less than 8MiB or by substantially more.


Re: About free space fragmentation, metadata write amplification and (no)ssd

2017-04-08 Thread Peter Grandi
> [ ... ] This post is way too long [ ... ]

Many thanks for your report, it is really useful, especially the
details.

> [ ... ] using rsync with --link-dest to btrfs while still
> using rsync, but with btrfs subvolumes and snapshots [1]. [
> ... ]  Currently there's ~35TiB of data present on the example
> filesystem, with a total of just a bit more than 90,000
> subvolumes, in groups of 32 snapshots per remote host (daily
> for 14 days, weekly for 3 months, monthly for a year), so
> that's about 2800 'groups' of them. Inside are millions and
> millions and millions of files. And the best part is... it
> just works. [ ... ]

That kind of arrangement, with a single large pool and very many
many files and many subdirectories is a worst case scanario for
any filesystem type, so it is amazing-ish that it works well so
far, especially with 90,000 subvolumes. As I mentioned elsewhere
I would rather do a rotation of smaller volumes, to reduce risk,
like "Duncan" also on this mailing list likes to do (perhaps to
the opposite extreme).

As to the 'ssd'/'nossd' issue, that is as described in 'man 5
btrfs' (and I wonder whether 'ssd_spread' was tried too), but it
is not at all obvious why it should impact metadata handling so
much. I'll add a new item to the "gotchas" list.

It is sad that 'ssd' is used by default in your case, and it is
quite perplexing that the "wandering trees" problem (that is
"write amplification") is so large with 64KiB write clusters for
metadata (and 'dup' profile for metadata).

* Probably the metadata and data cluster sizes should be creation
  or mount parameters instead of being implicit in the 'ssd'
  option.
* A cluster size of 2MiB for metadata and/or data presumably
  has some downsides, otherwise it would be the default. I
  wonder whether the downsides are related to barriers...


Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-07 Thread Peter Grandi
[ ... ]
>>> I've got a mostly inactive btrfs filesystem inside a virtual
>>> machine somewhere that shows interesting behaviour: while no
>>> interesting disk activity is going on, btrfs keeps
>>> allocating new chunks, a GiB at a time.
[ ... ]
> Because the allocator keeps walking forward every file that is
> created and then removed leaves a blank spot behind.

That is a typical "log-structured" filesystem behaviour, not
really surprised that Btrfs is doing something like that being
COW. NILFS2 works like that and it requires a compactor (which
does the equivalent of 'balance' and 'defrag'). It is all about
tradeoffs.

With Btrfs I figured out that fairly frequent 'balance' is
really quite important, even with low percent values like
"usage=50", and usually even 'usage=90' does not take a long
time (while the default often takes a long time, I suspect
needlessly).
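
For example something like this, e.g. from cron (just a sketch;
the mountpoint and the exact 'usage' percentages are a matter of
taste):

  #  btrfs balance start -dusage=50 -musage=50 /mnt/vol

which only rewrites data and metadata chunks that are at most
50% full, and therefore usually finishes fairly quickly.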

>> From the exact moment I did mount -o remount,nossd on this
>> filesystem, the problem vanished.

Haha. Indeed. So it switches from "COW" to more like "log
structured" with the 'ssd' option. F2FS can switch like that
too, with some tunables IIRC. Except that modern flash SSDs
already do the "log structured" bit internally, so doing it in
Btrfs does not really help that much.

>> And even I saw some early prototypes inside the codes to
>> allow btrfs do allocation smaller extent than required.
>> (E.g. caller needs 2M extent, but btrfs returns 2 1M extents)

I am surprised that this is not already there, but it is a
terrible fix to a big mistake. The big mistake, which nearly all
filesystem designers make, is to assume that contiguous allocation
must be done by writing contiguous large blocks or extents.

This big mistake was behind the stupid idea of the BSD FFS to
raise the block size from 512B to 4096B plus 512B "tails", and
endless stupid proposals to raise page and block sizes that get
done all the time, and is behind the stupid idea of doing
"delayed allocation", so large extents can be written in one go.

The ancient and tried and obvious idea is to preallocate space
ahead of it being written, so that a file's physical size may be
larger than its logical length; by how much depends on some
adaptive logic, or on hinting from the application (if the file
size is known in advance the whole file can be preallocated).
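
At the application level the usual way to give that hint on
Linux is 'fallocate' (or the posix_fallocate() call); purely as
a sketch, with a made-up path and size:

  #  fallocate --keep-size -l 1G /mnt/vol/slowly-written-file

which reserves 1GiB of (unwritten) extents ahead of the slow
appends without changing the file's logical length, so the
allocator has a chance to keep the file contiguous.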

> [ ... ] So, this is why putting your /var/log, /var/lib/mailman and
> /var/spool on btrfs is a terrible idea. [ ... ]

That is just the old "writing a file slowly" issue, and many if
not most filesystems have this issue:

  http://www.sabi.co.uk/blog/15-one.html?150203#150203

and as that post shows it was already reported for Btrfs here:

  http://kreijack.blogspot.co.uk/2014/06/btrfs-and-systemd-journal.html

> [ ... ] The fun thing is that this might work, but because of
> the pattern we end up with, a large write apparently fails
> (the files downloaded when doing apt-get update by daily cron)
> which causes a new chunk allocation. This is clearly visible
> in the videos. Directly after that, the new chunk gets filled
> with the same pattern, because the extent allocator now
> continues there and next day same thing happens again etc... [
> ... ]

The general problem is that filesystems have a very difficult
job, especially on rotating media, and cannot avoid large
important degenerate corner cases by using any adaptive logic.

Only predictive logic can avoid them, and since psychic code is
not possible yet, "predictive" means hints from applications and
users, and application developers and users are usually not
going to give them, or will give them wrong.

Consider the "slow writing" corner case, common to logging or
downloads, that you mention: the filesystem logic cannot do well
in the general case because it cannot predict how large the
final file will be, or what the rate of writing will be.

However if the applications or users hint the total final size
or at least a suitable allocation size things are going to be
good. But it is already difficult to expect applications to give
absolutely necessary 'fsync's, so explicit file size or access
pattern hints are a bit of an illusion. It is the ancient
'O_PONIES' issue in one of its many forms.

Fortunately it is possible and even easy to do much better
*synthetic* hinting than most libraries and kernels do today:

  http://www.sabi.co.uk/blog/anno05-4th.html?051012d#051012d
  http://www.sabi.co.uk/blog/anno05-4th.html?051011b#051011b
  http://www.sabi.co.uk/blog/anno05-4th.html?051011#051011
  http://www.sabi.co.uk/blog/anno05-4th.html?051010#051010

But that has not happened because it is no developer's itch to
fix. I was instead partially impressed that recently the
'vm_cluster' implementation was "fixed", after only one or two
decades from being first reported:

  http://sabi.co.uk/blog/anno05-3rd.html?050923#050923
  https://lwn.net/Articles/716296/
  https://lkml.org/lkml/2001/1/30/160

And still the author(s) of the fix don't seem to be persuaded by
many decades of 

Re: Do different btrfs volumes compete for CPU?

2017-04-04 Thread Peter Grandi
> [ ... ] I tried to use eSATA and ext4 first, but observed
> silent data corruption and irrecoverable kernel hangs --
> apparently, SATA is not really designed for external use.

SATA works for external use, eSATA works well, but what really
matters is the chipset of the adapter card.

In my experience JMicron is not so good, Marvell a bit better,
best is to use a recent motherboard chipset with a SATA-eSATA
internal cable and bracket.

>> As written that question is meaningless: despite the current
>> mania for "threads"/"threadlets" a filesystem driver is a
>> library, not a set of processes (all those '[btrfs-*]'
>> threadlets are somewhat misguided ways to do background
>> stuff).

> But these threadlets, misguided as the are, do exist, don't
> they?

But that does not change the fact that it is a library and work
is initiated by user requests which are not per-subvolume, but
in effect per-volume.

> I understand that qgroups is very much work in progress, but
> (correct me if I'm wrong) right now it's the only way to
> estimate real usage of subvolume and its snapshots.

It is a way to do so and not a very good way. There is no
obviously good way to define "real usage" in the presence of
hard-links and reflinking, and qgroups use just one way to
define it. A similar problem happens with processes in the
presence of shared pages, multiple mapped shared libraries etc.

> For instance, if I have dozen 1TB subvolumes each having ~50
> snapshots and suddenly run out of space on a 24TB volume, how
> do I find the culprit without qgroups?

It is not clear what "culprit" means here. The problem is that
both hard-links and ref-linking create really significant
ambiguities as to used space. Plus the same problem would happen
with directories instead of subvolumes and hard-links instead of
reflinked snapshots.
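
(For a rough per-subvolume view there is also something like the
sketch below, path made up, which at least reports shared versus
exclusive bytes:

  #  btrfs filesystem du -s /mnt/vol/home-*

but which of the sharers "owns" the shared bytes remains a
matter of definition, so the ambiguity does not go away.)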

> [ ... ] The chip is ASM1142, not Intel/AMD sadly but quite
> popular nevertheless.

ASMedia USB3 chipsets are fairly reliable, at least the card
ones on the system side. The ones on the disk side I don't know
much about. I have seen some ASMedia ones that also seem OK. For
the disks I use a Seagate and a WDC external box from which I
have removed the original disk, as I have noticed that Seagate
and WDC for obvious reasons tend to test and use the more
reliable chipsets. I have also got an external USB3 dock with a
recent ASMedia chipset that also seems good, but I haven't used
it much.


Re: Shrinking a device - performance?

2017-04-01 Thread Peter Grandi
[ ... ]

>>>   $  D='btrfs f2fs gfs2 hfsplus jfs nilfs2 reiserfs udf xfs'
>>>   $  find $D -name '*.ko' | xargs size | sed 's/^  *//;s/ .*\t//g'
  text    filename
>>>   832719  btrfs/btrfs.ko
>>>   237952  f2fs/f2fs.ko
>>>   251805  gfs2/gfs2.ko
>>>   72731   hfsplus/hfsplus.ko
>>>   171623  jfs/jfs.ko
>>>   173540  nilfs2/nilfs2.ko
>>>   214655  reiserfs/reiserfs.ko
>>>   81628   udf/udf.ko
>>>   658637  xfs/xfs.ko

That was Linux AMD64.

> udf is 637K on Mac OS 10.6
> exfat is 75K on Mac OS 10.9
> msdosfs is 79K on Mac OS 10.9
> ntfs is 394K (That must be Paragon's ntfs for Mac)
...
> zfs is 1.7M (10.9)
> spl is 247K (10.9)

Similar on Linux AMD64 but smaller:

  $ size updates/dkms/*.ko | sed 's/^  *//;s/ .*\t//g'
  text    filename
  62005   updates/dkms/spl.ko
  184370  updates/dkms/splat.ko
  3879updates/dkms/zavl.ko
  22688   updates/dkms/zcommon.ko
  1012212 updates/dkms/zfs.ko
  39874   updates/dkms/znvpair.ko
  18321   updates/dkms/zpios.ko
  319224  updates/dkms/zunicode.ko

> If they are somehow comparable even with the differences, 833K
> is not bad for btrfs compared to zfs. I did not look at the
> format of the file; it must be binary, but compression may be
> optional for third party kexts. So the kernel module sizes are
> large for both btrfs and zfs. Given the feature sets of both,
> is that surprising?

Not surprising and indeed I agree with the statement that
appeared earlier that "there are use cases that actually need
them". There are also use cases that need realtime translation
of file content from chinese to spanish, and one could add to
ZFS or Btrfs an extension to detect the language of text files
and invoke via HTTP Google Translate, for example with option
"translate=chinese-spanish" at mount time; or less flexibly
there are many use cases where B-Tree lookup of records in files
is useful, and it would be possible to add that to Btrfs or ZFS,
so that for example 'lseek(4,"Jane Smith",SEEK_KEY)' would be
possible, as in the ancient TSS/370 filesystem design.

But the question is about engineering, where best to implement
those "feature sets": in the kernel or higher levels. There is
no doubt for me that realtime language translation and seeking
by key can be added to a filesystem kernel module, and would
"work". The issue is a crudely technical one: "works" for an
engineer is not a binary state, but a statistical property over
a wide spectrum of cost/benefit tradeoffs.

Adding "feature sets" because "there are use cases that actually
need them" is fine, adding their implementation to the kernel
driver of a filesystem is quite a different proposition, which
may have downsides, as the implementations of those feature sets
may make code more complex and harder to understand and test,
never mind debug, even for the base features. But of course lots
of people know better :-).

But there is more; look again at some compiled code sizes as a
crude proxy for complexity, divided in two groups, both of
robust, full featured designs:

  1012212 updates/dkms/zfs.ko
  832719  btrfs/btrfs.ko
  658637  xfs/xfs.ko

  237952  f2fs/f2fs.ko
  173540  nilfs2/nilfs2.ko
  171623  jfs/jfs.ko
  81628   udf/udf.ko

The code size for JFS or NILFS2 or UDF is roughly 1/4 the code
size for XFS, yet there is little difference in functionality.
Compared to ZFS as to base functionality JFS lacks checksums and
snapshots (in theory it has subvolumes, but they are disabled),
but NILFS2 has snapshots and checksums (but does not verify them
on ordinary reads), and yet the code size is 1/6 that of ZFS.
ZFS has also RAID, but looking at the code size of the Linux MD
RAID modules I see rather smaller numbers. Even so ZFS has a
good reputation for reliability despite its amazing complexity,
but that is also because SUN invested big into massive release
engineering for it, and similarly for XFS.

Therefore my impression is that the filesystems in the first
group have a lot of cool features like compression or dedup
etc. that could have been implemented user-level, and having
them in the kernel is good "for "marketing" purposes, to win
box-ticking competitions".


Re: Do different btrfs volumes compete for CPU?

2017-04-01 Thread Peter Grandi
>> Approximately 16 hours ago I've run a script that deleted
>> >~100 snapshots and started quota rescan on a large
>> USB-connected btrfs volume (5.4 of 22 TB occupied now).

That "USB-connected is a rather bad idea. On the IRC channel
#Btrfs whenever someone reports odd things happening I ask "is
that USB?" and usually it is and then we say "good luck!" :-).

The issues are:

* The USB mass storage protocol is poorly designed in particular
  for error handling.
* The underlying USB protocol is very CPU intensive.
* Most importantly nearly all USB chipsets, both system-side
  and peripheral-side, are breathtakingly buggy, but this does
  not get noticed for most USB devices.

>> Quota rescan only completed just now, with 100% load from
>> [btrfs-transacti] throughout this period,

> [ ... ] are different btrfs volumes independent in terms of
> CPU, or are there some shared workers that can be point of
> contention?

As written that question is meaningless: despite the current
mania for "threads"/"threadlets" a filesystem driver is a
library, not a set of processes (all those '[btrfs-*]'
threadlets are somewhat misguided ways to do background
stuff).

The real problems here are:

* Qgroups are famously system CPU intensive, even if less so
  than in earlier releases, especially with subvolumes, so the
  16 hours CPU is both absurd and expected. I think that qgroups
  are still effectively unusable.
* The scheduler gives excessive priority to kernel threads, so
  they can crowd out user processes. When for whatever reason
  the system CPU percentage rises everything else usually
  suffers.

> BTW, USB adapter used is this one (though storage array only
> supports USB 3.0):
> https://www.asus.com/Motherboard-Accessory/USB_31_TYPEA_CARD/

Only Intel/AMD USB chipsets and a few others are fairly
reliable, and for mass storage only with USB3 with UASP, which
is basically SATA-over-USB (more precisely SCSI-command-set over
USB). Your system-side card seems to be recent enough to do
UASP, but probably the peripheral-side chipset isn't. Things
are so bad with third-party chipsets that even several types of
add-on SATA and SAS cards are too buggy.


Re: Shrinking a device - performance?

2017-03-31 Thread Peter Grandi
> [ ... ] what the signifigance of the xargs size limits of
> btrfs might be. [ ... ] So what does it mean that btrfs has a
> higher xargs size limit than other file systems? [ ... ] Or
> does the lower capacity for argument length for hfsplus
> demonstrate it is the superior file system for avoiding
> breakage? [ ... ]

That confuses me, as my understanding of command argument size
limit is that it is a system, not filesystem, property, and for
example can be obtained with 'getconf _POSIX_ARG_MAX'.

> Personally, I would go back to fossil and venti on Plan 9 for
> an archival data server (using WORM drives),

In an ideal world we would be using Plan 9. Not necessarily with
Fossil and Venti. As to storage/backup/archival, Linux-based
options are not bad, even if the platform is far messier than
Plan 9 (or some other alternatives). BTW I just noticed with a
search that AWS might be offering Plan 9 hosts :-).

> and VAX/VMS cluster for an HA server. [ ... ]

Uhmmm, however nice it was, it was fairly weird. An IA32 or
AMD64 port has been promised however :-).

https://www.theregister.co.uk/2016/10/13/openvms_moves_slowly_towards_x86/


Re: Shrinking a device - performance?

2017-03-31 Thread Peter Grandi
>>> My guess is that very complex risky slow operations like
>>> that are provided by "clever" filesystem developers for
>>> "marketing" purposes, to win box-ticking competitions.

>>> That applies to those system developers who do know better;
>>> I suspect that even some filesystem developers are
>>> "optimistic" as to what they can actually achieve.

>>> There are cases where there really is no other sane
>>> option. Not everyone has the kind of budget needed for
>>> proper HA setups,

>> Thanks for letting me know, that must have never occurred to
>> me, just as it must have never occurred to me that some
>> people expect extremely advanced features that imply
>> big-budget high-IOPS high-reliability storage to be fast and
>> reliable on small-budget storage too :-)

> You're missing my point (or intentionally ignoring it).

In "Thanks for letting me know" I am not missing your point, I
am simply pointing out that I do know that people try to run
high-budget workloads on low-budget storage.

The argument as to whether "very complex risky slow operations"
should be provided in the filesystem itself is a very different
one, and I did not develop it fully. But it is quite "optimistic"
to simply state "there really is no other sane option", even
when for people that don't have "proper HA setups".

Let's start by assuming, for the time being, that "very complex
risky slow operations" are indeed feasible on very reliable high
speed storage layers. Then the questions become:

* Is it really true that "there is no other sane option" to
  running "very complex risky slow operations" even on storage
  that is not "big-budget high-IOPS high-reliability"?

* Is it really true that it is a good idea to run "very complex
  risky slow operations" even on "big-budget high-IOPS
  high-reliability storage"?

> Those types of operations are implemented because there are
> use cases that actually need them, not because some developer
> thought it would be cool. [ ... ]

And this is the really crucial bit; I'll disregard, without
agreeing too much with it (though in part I do), the rest of the
response, as those are less important matters, and this is going
to be longer than a twitter message.

First, I agree that "there are use cases that actually need
them", and I need to explain what I am agreeing to: I believe
that computer systems, "system" in a wide sense, have what I
call "inewvitable functionality", that is functionality that is
not optional, but must be provided *somewhere*: for example
print spooling is "inevitable functionality" as long as there
are multiple users, and spell checking is another example.

The only choice as to "inevitable functionality" is *where* to
provide it. For example spooling can be done between two users by
queuing jobs manually, with one saying "I am going to print now"
and the other waiting until the print is finished, or by
using a spool program that queues jobs on the source system, or
by using a spool program that queues jobs on the target
printer. Spell checking can be done on the fly in the document
processor, batch with a tool, or manually by the document
author. All these are valid implementations of "inevitable
functionality", just with very different performance envelope,
where the "system" includes the users as "peripherals" or
"plugins" :-) in the manual implementations.

There is no dispute from me that multiple devices,
adding/removing block devices, data compression, structural
repair, balancing, growing/shrinking, defragmentation, quota
groups, integrity checking, deduplication, ... are all in the
general case "inevitable functionality", and every non-trivial
storage system *must* implement them.

The big question is *where*: for example when I started using
UNIX the 'fsck' tool was several years away, and when the system
crashed I did, like everybody, filetree integrity checking and
structure recovery myself (with the help of 'ncheck' and
'icheck' and 'adb'), that is 'fsck' was implemented in my head.

In the general case there are four places where such
"inevitable functionality" can be implemented:

* In the filesystem module in the kernel, for example Btrfs
  scrubbing.
* In a tool that uses hook provided by the filesystem module in
  the kernel, for example Btrfs deduplication, 'send'/'receive'.
* In a tool, for example 'btrfsck'.
* In the system administrator.

Consider the "very complex risky slow" operation of
defragmentation; the system administrator can implement it by
dumping and reloading the volume, or a tool ban implement it by
running on the unmounted filesystem, or a tool and the kernel
can implement it by using kernel module hooks, or it can be
provided entirely in the kernel module.

My argument is that providing "very complex risky slow"
maintenance operations as filesystem primitives looks awesomely
convenient, a good way to "win box-ticking competitions" for
"marketing" purposes, but is rather bad idea for several
reasons, of varying strengths:

* Most system 

Re: Shrinking a device - performance?

2017-03-31 Thread Peter Grandi
>> [ ... ] CentOS, Redhat, and Oracle seem to take the position
>> that very large data subvolumes using btrfs should work
>> fine. But I would be curious what the rest of the list thinks
>> about 20 TiB in one volume/subvolume.

> To be sure I'm a biased voice here, as I have multiple
> independent btrfs on multiple partitions here, with no btrfs
> over 100 GiB in size, and that's on ssd so maintenance
> commands normally return in minutes or even seconds,

That's a bit extreme I think, as there are downsides to having
many too-small volumes too.

> not the hours to days or even weeks it takes on multi-TB btrfs
> on spinning rust.

Or months :-).

> But FWIW... 1) Don't put all your data eggs in one basket,
> especially when that basket isn't yet entirely stable and
> mature.

Really good point here.

> A mantra commonly repeated on this list is that btrfs is still
> stabilizing,

My impression is that most 4.x and later versions are very
reliable for "base" functionality, that is excluding
multi-device, compression, qgroups, ... Put another way, what
scratches the Facebook itches works well :-).

> [ ... ] the time/cost/hassle-factor of the backup, and being
> practically prepared to use them, is even *MORE* important
> than it is on fully mature and stable filesystems.

Indeed, or at least *different* filesystems. I backup JFS
filesystems to XFS ones, and Btrfs filesystems to NILFS2 ones,
for example.

> 2) Don't make your filesystems so large that any maintenance
> on them, including both filesystem maintenance like btrfs
> balance/scrub/check/ whatever, and normal backup and restore
> operations, takes impractically long,

As per my preceding post, that's the big deal, but so many
people "know better" :-).

> where "impractically" can be reasonably defined as so long it
> discourages you from doing them in the first place and/or so
> long that it's going to cause unwarranted downtime.

That's the "Very Large DataBase" level of trouble.

> Some years ago, before I started using btrfs and while I was
> using mdraid, I learned this one the hard way. I had a bunch
> of rather large mdraids setup, [ ... ]

I have recently seen another much "funnier" example: people who
"know better" and follow every cool trend decide to consolidate
their server farm on VMs, backed by a storage server with a
largish single pool of storage holding the virtual disk images
of all the server VMs. They look like geniuses until the storage
pool system crashes, and a minimal integrity check on restart
takes two days during which the whole organization is without
access to any email, files, databases, ...

> [ ... ] And there was a good chance it was /not/ active and
> mounted at the time of the crash and thus didn't need
> repaired, saving that time entirely! =:^)

As to that I have switched to using 'autofs' to mount volumes
only on access, using a simple script that turns '/etc/fstab'
into an automounter dynamic map, which means that most of the
time most volumes on my (home) systems are not mounted:

  http://www.sabi.co.uk/blog/anno06-3rd.html?060928#060928
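
Roughly the effect of that script, expressed as an autofs "direct"
map (paths and label purely hypothetical):

  # /etc/auto.master: hand all direct mounts to one map, unmount after idling
  /-      /etc/auto.direct  --timeout=120

  # /etc/auto.direct: one line per volume, derived from /etc/fstab
  /srv/archive  -fstype=btrfs,noatime  :/dev/disk/by-label/archive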

> Eventually I arranged things so I could keep root mounted
> read-only unless I was updating it, and that's still the way I
> run it today.

The ancient way, instead of having '/' RO and '/var' RW, was to
have '/' RW and '/usr' RO (so for example it could be shared
across many systems via NFS etc.), and while both are good
ideas, I prefer the ancient way. But then some people who know
better are moving to merge '/' with '/usr' without understanding
the history and the advantages.

> [ ... ] If it's multiple TBs, chances are it's going to be
> faster to simply blow away and recreate from backup, than it
> is to try to repair... [ ... ]

Or to shrink or defragment or dedup etc., except on very high
IOPS-per-TB storage.

> [ ... ] how much simpler it would have been had they had an
> independent btrfs of say a TB or two for each system they were
> backing up.

That is the general alternative to a single large pool/volume:
sharding/chunking of filetrees, sometimes, like with Lustre or
Ceph etc. with a "metafilesystem" layer on top.

Done manually my suggestion is to do the sharding per-week (or
other suitable period) rather than per-system, in a circular
"crop rotation" scheme. So that once a volume has been filled,
it becomes read-only and can even be unmounted until it needs
to be reused:

  http://www.sabi.co.uk/blog/12-fou.html?121218b#121218b
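
A minimal sketch of one rotation step, with purely hypothetical
labels and paths:

  # this week's volume has filled up: freeze it, optionally unmount it
  mount -o remount,ro /srv/backup/2017-W13
  umount /srv/backup/2017-W13

  # start writing to the next volume in the rotation
  mount LABEL=backup-W14 /srv/backup/2017-W14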

Then there is the problem that "a TB or two" is less easy with
increasing disk capacities, but then I think that disks with a
capacity larger than 1TB are not suitable for ordinary
workloads, and are more for tape-cartridge like usage.

> What would they have done had the btrfs gone bad and needed
> repaired? [ ... ]

In most cases I have seen of designs aimed at achieving the
lowest cost and highest flexibility "low IOPS single pool" at
the expense of scalability and maintainability, the "clever"
designer had been promoted or had 

Re: Shrinking a device - performance?

2017-03-31 Thread Peter Grandi
> Can you try to first dedup the btrfs volume?  This is probably
> out of date, but you could try one of these: [ ... ] Yep,
> that's probably a lot of work. [ ... ] My recollection is that
> btrfs handles deduplication differently than zfs, but both of
> them can be very, very slow

But the big deal there is that dedup is indeed a very expensive
operation, even worse than 'balance'. A balanced, deduped volume
will shrink faster in most cases, but the time taken is simply
moved from shrinking to preparing.
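
For reference, an out-of-band dedup pass is typically something
like the following ('duperemove' being one of the commonly
suggested tools, the path hypothetical), and on a large volume it
can run for a very long time:

  # hash file contents, then submit duplicate ranges to the kernel for dedup
  duperemove -dr --hashfile=/var/tmp/dedup.hash /srv/backup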

> Again, I'm not an expert in btrfs, but in most cases a full
> balance and scrub takes care of any problems on the root
> partition, but that is a relatively small partition.  A full
> balance (without the options) and scrub on 20 TiB must take a
> very long time even with robust hardware, would it not?

There have been reports of several months for volumes of that
size subject to ordinary workload.
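
For scale, even just starting and watching such whole-volume
operations is a long-running affair; a hedged sketch with a
hypothetical mount point:

  # background scrub, checked on later
  btrfs scrub start /mnt/bigvolume
  btrfs scrub status /mnt/bigvolume

  # a full balance rewrites every chunk and takes far longer still
  btrfs balance start --full-balance /mnt/bigvolume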

> CentOS, Redhat, and Oracle seem to take the position that very
> large data subvolumes using btrfs should work fine.

This is a long-standing controversy, and for example there have
been "interesting" debates in the XFS mailing list. Btrfs in
this is not really different from others, with one major
difference in context, that many Btrfs developers work for a
company that relies on large numbers of small servers, to the
point that fixing multidevice issues has not been a priority.

The controversy of large volumes is that while no doubt the
logical structures of recent filesystem types can support single
volumes of many petabytes (or even much larger), and such
volumes have indeed been created and "work"-ish, so they are
unquestionably "syntactically valid", the tradeoffs involved
especially as to maintainability may mean that they don't "work"
well and sustainably so.

The fundamental issue is metadata: while the logical structures,
using 48-64 bit pointers, unquestionably scale "syntactically",
they don't scale pragmatically when considering whole-volume
maintenance like checking, repair, balancing, scrubbing,
indexing (which includes making incremental backups etc.).

Note: large volumes don't have just a speed problem for
whole-volume operations, they also have a memory problem, as
most tools hold an in-memory copy of the metadata. There have been
cases where indexing or repair of a volume requires a lot more
RAM (many hundreds GiB or some TiB of RAM) than the system on
which the volume was being used.

The problem is of course smaller if the large volume contains
mostly large files, and bigger if the volume is stored on low
IOPS-per-TB devices and used on small-memory systems. But even
with large files, where filetree object metadata (inodes etc.)
are relatively few, space metadata must eventually, at least
potentially, resolve down to single sectors, and that can be a
lot of metadata unless both used and free space are very
unfragmented.

The fundamental technological issue is: *data* IO rates, in both
random IOPS and sequential ones, can be scaled "almost" linearly
by parallelizing them using RAID or equivalent, allowing large
volumes to serve scalably large and parallel *data* workloads,
but *metadata* IO rates cannot be easily parallelized, because
metadata structures are graphs, not arrays of bytes like files.

So a large volume on 100 storage devices can serve in parallel a
significant percentage of 100 times the data workload of a small
volume on 1 storage device, but not so much for the metadata
workload.

For example, I have never seen a parallel 'fsck' tool that can
take advantage of 100 storage devices to complete a scan of a
single volume on 100 storage devices in not much longer time
than the scan of a volume on 1 of the storage devices.

> But I would be curious what the rest of the list thinks about
> 20 TiB in one volume/subvolume.

Personally I think that while volumes of many petabytes "work"
syntactically, there are serious maintainability problem (which
I have seen happen at a number of sites) with volumes larger
than 4TB-8TB with any current local filesystem design.

That depends also on number/size of storage devices, and their
nature, that is IOPS, as after all metadata workloads do scale a
bit with number of available IOPS, even if far more slowly than
data workloads.

For example I think that an 8TB volume is not desirable on a
single 8TB disk for ordinary workloads (but then I think that
disks above 1-2TB are just not suitable for ordinary filesystem
workloads), but with lots of smaller/faster disks a 12TB volume
would probably be acceptable, and maybe a number of flash SSDs
might make acceptable even a 20TB volume.

Of course there are lots of people who know better. :-)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Shrinking a device - performance?

2017-03-31 Thread Peter Grandi
>>> The way btrfs is designed I'd actually expect shrinking to
>>> be fast in most cases. [ ... ]

>> The proposed "move whole chunks" implementation helps only if
>> there are enough unallocated chunks "below the line". If regular
>> 'balance' is done on the filesystem there will be some, but that
>> just spreads the cost of the 'balance' across time, it does not
>> by itself make a «risky, difficult, slow operation» any less so,
>> just spreads the risk, difficulty, slowness across time.

> Isn't that too pessimistic?

Maybe, it depends on the workload impacting the volume and how
much it churns the free/unallocated situation.

> Most of my filesystems have 90+% of free space unallocated,
> even those I never run balance on.

That seems quite lucky to me, as that is definitely not my
experience or even my expectation in the general case: in my
laptop and desktop with relatively few updates I have to run
'balance' fairly frequently, and "Knorrie" has produced a nice
tool that produces a graphical map of free vs. unallocated
space, and most examples and users show that quite a bit of
balancing needs to be done.

> For me it wouldn't just spread the cost, it would reduce it
> considerably.

In your case the cost of the implicit or explicit 'balance'
simply does not arise because 'balance' is not necessary, and
then moving whole chunks is indeed cheap. The argument here is
in part whether used space (extents) or allocated space (chunks)
is more fragmented as well as the amount of metadata to update
in either case.
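
For anyone wanting to check their own situation, the split between
allocated and unallocated space is visible with (mount point
hypothetical):

  # compare "Device allocated" with "Device unallocated"/"Free (estimated)"
  btrfs filesystem usage /mnt/volume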
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Shrinking a device - performance?

2017-03-30 Thread Peter Grandi
>> My guess is that very complex risky slow operations like that are
>> provided by "clever" filesystem developers for "marketing" purposes,
>> to win box-ticking competitions. That applies to those system
>> developers who do know better; I suspect that even some filesystem
>> developers are "optimistic" as to what they can actually achieve.

> There are cases where there really is no other sane option. Not
> everyone has the kind of budget needed for proper HA setups,

Thanks for letting me know, that must have never occurred to me, just as
it must have never occurred to me that some people expect extremely
advanced features that imply big-budget high-IOPS high-reliability
storage to be fast and reliable on small-budget storage too :-)

> and if you need maximal uptime and as a result have to reprovision the
> system online, then you pretty much need a filesystem that supports
> online shrinking.

That's a bigger topic than we can address here. The topic used to be
known in one related domain as "Very Large Databases", which were
defined as databases so large and critical that the time needed for
maintenance and backup was too long to take them offline etc.;
that is a topic that has largely vanished from discussion, I guess
because most management just don't want to hear it :-).

> Also, it's not really all that slow on most filesystem, BTRFS is just
> hurt by it's comparatively poor performance, and the COW metadata
> updates that are needed.

Btrfs in realistic situations has pretty good speed *and* performance,
and COW actually helps, as it often results in less head repositioning
than update-in-place. What makes it a bit slower with metadata is having
'dup' by default to recover from especially damaging bitflips in
metadata, but then that does not impact performance, only speed.

>> That feature set is arguably not appropriate for VM images, but
>> lots of people know better :-).

> That depends on a lot of factors.  I have no issues personally running
> small VM images on BTRFS, but I'm also running on decent SSD's
> (>500MB/s read and write speeds), using sparse files, and keeping on
> top of managing them. [ ... ]

Having (relatively) big-budget high-IOPS storage for high-IOPS workloads
helps, that must have never occurred to me either :-).

>> XFS and 'ext4' are essentially equivalent, except for the fixed-size
>> inode table limitation of 'ext4' (and XFS reportedly has finer
>> grained locking). Btrfs is nearly as good as either on most workloads
>> is single-device mode [ ... ]

> No, if you look at actual data, [ ... ]

Well, I have looked at actual data in many published but often poorly
made "benchmarks", and to me they seem quite equivalent indeed,
within somewhat differently shaped performance envelopes, so the
results depend on the testing point within that envelope. I have
done my own simplistic actual data gathering, most recently here:

  http://www.sabi.co.uk/blog/17-one.html?170302#170302
  http://www.sabi.co.uk/blog/17-one.html?170228#170228

and however simplistic, they are fairly informative (and for writes
they point a finger at a layer below the filesystem type).

[ ... ]

>> "Flexibility" in filesystems, especially on rotating disk
>> storage with extremely anisotropic performance envelopes, is
>> very expensive, but of course lots of people know better :-).

> Time is not free,

Your time seems especially and uniquely precious as you "waste"
as little as possible editing your replies into readability.

> and humans generally prefer to minimize the amount of time they have
> to work on things. This is why ZFS is so popular, it handles most
> errors correctly by itself and usually requires very little human
> intervention for maintenance.

That seems to me a pretty illusion, as it does not contain any magical
AI, just pretty ordinary and limited error correction for trivial cases.

> 'Flexibility' in a filesystem costs some time on a regular basis, but
> can save a huge amount of time in the long run.

Like everything else. The difficulty is having flexibility at scale with
challenging workloads. "An engineer can do for a nickel what any
damn fool can do for a dollar" :-).

> To look at it another way, I have a home server system running BTRFS
> on top of LVM. [ ... ]

But usually home servers have "unchallenging" workloads, and it is
relatively easy to overbudget their storage, because the total absolute
cost is "affordable".
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Shrinking a device - performance?

2017-03-30 Thread Peter Grandi
> I’ve glazed over on “Not only that …” … can you make youtube
> video of that :)) [ ... ]  It’s because I’m special :*

Well played again, that's a fairly credible impersonation of a
node.js/mongodb developer :-).

> On a real note thank’s [ ... ] to much of open source stuff is
> based on short comments :/

Yes... In part that's because the "sw engineering" aspect of
programming takes a lot of time that unpaid volunteers sometimes
cannot afford to take; in part, though, I have noticed that some
free sw authors who do get paid to do free sw act as if they had
a policy of obfuscation to protect their turf/jobs.

Regardless, mailing lists, IRC channel logs, wikis, personal
blogs, search engines allow a mosaic of lore to form, which
in part remedies the situation, and here we are :-).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Shrinking a device - performance?

2017-03-30 Thread Peter Grandi
>> As a general consideration, shrinking a large filetree online
>> in-place is an amazingly risky, difficult, slow operation and
>> should be a last desperate resort (as apparently in this case),
>> regardless of the filesystem type, and expecting otherwise is
>> "optimistic".

> The way btrfs is designed I'd actually expect shrinking to be
> fast in most cases. It could probably be done by moving whole
> chunks at near platter speed, [ ... ] It just hasn't been
> implemented yet.

That seems to me a rather "optimistic" argument, as most of the
cost of shrinking is the 'balance' to pack extents into chunks.

As that thread implies, the current implementation in effect
does a "balance" while shrinking, by moving extents from chunks
"above the line" to free space in chunks "below the line".

The proposed "move whole chunks" implementation helps only if
there are enough unallocated chunks "below the line". If regular
'balance' is done on the filesystem there will be some, but that
just spreads the cost of the 'balance' across time, it does not
by itself make a «risky, difficult, slow operation» any less so,
just spreads the risk, difficulty, slowness across time.
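
A hedged sketch of the sequence being discussed, with hypothetical
size and mount point; the filtered balance packs extents into fewer
chunks so the resize finds unallocated space "below the line":

  # repack chunks that are less than half full, then shrink by 2TiB
  btrfs balance start -dusage=50 -musage=50 /mnt/volume
  btrfs filesystem resize -2T /mnt/volume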

More generally one of the downsides of Btrfs is that because of
its two-level (allocated/unallocated chunks, used/free nodes or
blocks) design it requires more than most other designs to do
regular 'balance', which is indeed «risky, difficult, slow».

Compare an even more COW-oriented design like NILFS2, which also
requires, if a bit less often, running its garbage collector,
which is also «risky, difficult, slow». Just like in Btrfs that is a tradeoff that
shrinks the performance envelope in one direction and expands it
in another.

But in the case of Btrfs it shrinks it perhaps a bit more than it
expands it, as the added flexibility of having chunk-based
'profiles' is only very partially taken advantage of.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Shrinking a device - performance?

2017-03-28 Thread Peter Grandi
> I glazed over at “This is going to be long” … :)
>> [ ... ]

Not only that, you also top-posted while quoting it pointlessly
in its entirety, to the whole mailing list. Well played :-).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Shrinking a device - performance?

2017-03-28 Thread Peter Grandi
>  [ ... ] slaps together a large storage system in the cheapest
> and quickest way knowing that while it is mostly empty it will
> seem very fast regardless and therefore to have awesome
> performance, and then the "clever" sysadm disappears surrounded
> by a halo of glory before the storage system gets full workload
> and fills up; [ ... ]

Fortunately or unfortunately Btrfs is particularly suitable for
this technique, as it has an enormous number of checkbox-ticking
awesome-looking features: transparent compression, dynamic
add/remove, online balance/scrub, different sized member devices,
online grow/shrink, online defrag, limitless scalability, online
dedup, arbitrary subvolumes and snapshots, COW and reflinking,
online conversion of RAID profiles, ... and one can use all of
them at the same time, and for the initial period where volume
workload is low and space used not much, it will looks absolutely
fantastic, cheap, flexible, always available, fast, the work of
genius of a very cool sysadm.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Shrinking a device - performance?

2017-03-28 Thread Peter Grandi
> [ ... ] reminded of all the cases where someone left me to
> decatastrophize a storage system built on "optimistic"
> assumptions.

In particular when some "clever" sysadm with a "clever" (or
dumb) manager slaps together a large storage system in the
cheapest and quickest way knowing that while it is mostly empty
it will seem very fast regardless and therefore to have awesome
performance, and then the "clever" sysadm disappears surrounded
by a halo of glory before the storage system gets full workload
and fills up; when that happens usually I get to inherit it.
BTW, the same technique can also be applied to HPC clusters.

>> I intended to shrink a ~22TiB filesystem down to 20TiB. This
>> is still using LVM underneath so that I can’t just remove a
>> device from the filesystem but have to use the resize
>> command.

>> Label: 'backy'  uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4
>> Total devices 1 FS bytes used 18.21TiB
>> devid1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy

Ahh it is indeed a filled up storage system now running a full
workload. At least it wasn't me who inherited it this time. :-)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Shrinking a device - performance?

2017-03-28 Thread Peter Grandi
This is going to be long because I am writing something detailed
hoping pointlessly that someone in the future will find it by
searching the list archives while doing research before setting
up a new storage system, and they will be the kind of person
that tolerates reading messages longer than Twitter. :-).

> I’m currently shrinking a device and it seems that the
> performance of shrink is abysmal.

When I read this kind of statement I am reminded of all the
cases where someone left me to decatastrophize a storage system
built on "optimistic" assumptions. The usual "optimism" is what
I call the "syntactic approach", that is the axiomatic belief
that any syntactically valid combination of features not only
will "work", but very fast too and reliably despite slow cheap
hardware and "unattentive" configuration. Some people call that
the expectation that system developers provide or should provide
an "O_PONIES" option. In particular I get very saddened when
people use "performance" to mean "speed", as the difference
between the two is very great.

As a general consideration, shrinking a large filetree online
in-place is an amazingly risky, difficult, slow operation and
should be a last desperate resort (as apparently in this case),
regardless of the filesystem type, and expecting otherwise is
"optimistic".

My guess is that very complex risky slow operations like that
are provided by "clever" filesystem developers for "marketing"
purposes, to win box-ticking competitions. That applies to those
system developers who do know better; I suspect that even some
filesystem developers are "optimistic" as to what they can
actually achieve.

> I intended to shrink a ~22TiB filesystem down to 20TiB. This is
> still using LVM underneath so that I can’t just remove a device
> from the filesystem but have to use the resize command.

That is actually a very good idea because Btrfs multi-device is
not quite as reliable as DM/LVM2 multi-device.

> Label: 'backy'  uuid: 3d0b7511-4901-4554-96d4-e6f9627ea9a4
>Total devices 1 FS bytes used 18.21TiB
>devid1 size 20.00TiB used 20.71TiB path /dev/mapper/vgsys-backy

Maybe 'balance' should have been used a bit more.

> This has been running since last Thursday, so roughly 3.5days
> now. The “used” number in devid1 has moved about 1TiB in this
> time. The filesystem is seeing regular usage (read and write)
> and when I’m suspending any application traffic I see about
> 1GiB of movement every now and then. Maybe once every 30
> seconds or so. Does this sound fishy or normal to you?

With consistent "optimism" this is a request to assess whether
"performance" of some operations is adequate on a filetree
without telling us either what the filetree contents look like,
what the regular workload is, or what the storage layer looks
like.

Being one of the few system administrators crippled by lack of
psychic powers :-), I rely on guesses and inferences here, and
on having read the whole thread containing some belated details.

From the ~22TB total capacity my guess is that the storage layer
involves rotating hard disks, and from later details the
filesystem contents seem to be heavily reflinked files of
several GB in size, and the workload seems to be backups to those
files from several source hosts. Considering the general level
of "optimism" in the situation my wild guess is that the storage
layer is based on large slow cheap rotating disks in the 4TB-8TB
range, with very low IOPS-per-TB.

> Thanks for that info. The 1min per 1GiB is what I saw too -
> the “it can take longer” wasn’t really explainable to me.

A contemporary rotating disk device can do around 0.5MB/s
transfer rate with small random accesses with barriers, up to
around 80-160MB/s in purely sequential access without barriers.

1GB/min of simultaneous read-write means around 16MB/s reads plus
16MB/s writes, which is fairly good *performance* (even if slow
*speed*) considering that moving extents around, even across
disks, involves quite a bit of randomish same-disk updates of
metadata; because it all depends usually on how much randomish
metadata updates need to be done, on any filesystem type, as those
must be done with barriers.
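
Spelling out the arithmetic behind that estimate:

  1GB moved per minute = 1GB read plus 1GB written per minute
    reads:  ~1000MB / 60s ≈ 16-17MB/s
    writes: ~1000MB / 60s ≈ 16-17MB/s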

> As I’m not using snapshots: would large files (100+gb)

Using 100GB sized VM virtual disks (never mind with COW) seems
very unwise to me to start with, but of course a lot of other
people know better :-). Just like a lot of other people know
better that large single pool storage systems are awesome in
every respect :-): cost, reliability, speed, flexibility,
maintenance, etc.

> with long chains of CoW history (specifically reflink copies)
> also hurt?

Oh yes... They are about one of the worst cases for using
Btrfs. But also very "optimistic" to think that kind of stuff
can work awesomely on *any* filesystem type.

> Something I’d like to verify: does having traffic on the
> volume have the potential to delay this infinitely? [ ... ]
> it’s just slow and we’re looking forward to about 

Re: send snapshot from snapshot incremental

2017-03-26 Thread Peter Grandi
[ ... ]
> BUT if i take a snapshot from the system, and want to transfer
> it to the external HD, i can not set a parent subvolume,
> because there isn't any.

Questions like this are based on an incomplete understanding of
'send' and 'receive'; on IRC, user "darkling" explained it
fairly well:

> When you use -c, you're telling the FS that it can expect to
> find a sent copy of that subvol on the receiving side, and
> that anything shared with it can be sent by reference. OK, so
> with -c on its own, you're telling the FS that "all the data
> in this subvol already exists on the remote".

> So, when you send your subvol, *all* of the subvol's metadata
> is sent, and where that metadata refers to an extent that's
> shared with the -c subvol, the extent data isn't sent, because
> it's known to be on the other end already, and can be shared
> directly from there.

> OK. So, with -p, there's a "base" subvol. The send subvol and
> the -p reference subvol are both snapshots of that base (at
> different times). The -p reference subvol, as with -c, is
> assumed to be on the remote FS. However, because it's known to
> be an earlier version of the same data, you can be more
> efficient in the sending by saying "start from the earlier
> version, and modify it in this way to get the new version"

> So, with -p, not all of the metadata is sent, because you know
> you've already got most of it on the remote in the form of the
> earlier version.

> So -p is "take this thing and apply these differences to it"
> and -c is "build this thing from scratch, but you can share
> some of the data with these sources"

Also here some additional details:

  http://logs.tvrrug.org.uk/logs/%23btrfs/2016-06-29.html#2016-06-29T22:39:59

The requirement for read-only snapshots is because that way it is
reasonably certain that the same stuff is on both the origin and
the target volume.

It may help to compare with RSYNC: it has to scan both the full
origin and target trees, because it cannot be told that there is
a parent tree that is the same on origin and target; but with
option '--link-dest' it can do something similar to 'send -c'.
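
To make the difference concrete, a hedged sketch with hypothetical
subvolume names, all snapshots read-only and the reference ones
already present on the target:

  # '-p': send only the differences from an earlier snapshot of the same base
  btrfs send -p /data/.snaps/2017-03-19 /data/.snaps/2017-03-26 \
    | btrfs receive /backup/data

  # '-c': send full metadata, but share extents with an already-sent subvolume
  btrfs send -c /data/.snaps/other-base /data/.snaps/2017-03-26 \
    | btrfs receive /backup/data

  # the RSYNC analogue of '-c': hardlink unchanged files against a previous copy
  rsync -a --link-dest=/backup/2017-03-19 /data/ /backup/2017-03-26/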
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: backing up a file server with many subvolumes

2017-03-26 Thread Peter Grandi
> [ ... ] In each filesystem subdirectory are incremental
> snapshot subvolumes for that filesystem.  [ ... ] The scheme
> is something like this:

> /backup///

BTW hopefully this does not amount to too many subvolumes in
the '.../backup/' volume, because that can create complications,
where "too many" IIRC is more than a few dozen (even if a low
number of hundreds is still doable).

> I'd like to try to back up (duplicate) the file server
> filesystem containing these snapshot subvolumes for each
> remote machine. The problem is that I don't think I can use
> send/receive to do this. "Btrfs send" requires "read-only"
> snapshots, and snapshots are not recursive as yet.

Why is that a problem? What is a recursive snapshot?

> I think there are too many subvolumes which change too often
> to make doing this without recursion practical.

It is not clear to me how the «incremental snapshot subvolumes
for that filesystem» are made, whether with RSYNC or 'send' and
'receive' itself. It is also not clear to me why those snapshots
«change too often», why would they change at all? Once a backup
is made in whichever way to an «incremental snapshot», why would
that «incremental snapshot» ever change but for being deleted?

There are some tools that rely on the specific abilities of
'send' with options '-p' and '-c' to save a lot of network
bandwidth and target storage space, perhaps you might be
interested in searching for them.

Anyhow I'll repeat here part of an answer to a similar message:
issues like yours usually are based on an incomplete understanding
of 'send' and 'receive'; on IRC, user "darkling" explained it
fairly well:

> When you use -c, you're telling the FS that it can expect to
> find a sent copy of that subvol on the receiving side, and
> that anything shared with it can be sent by reference. OK, so
> with -c on its own, you're telling the FS that "all the data
> in this subvol already exists on the remote".

> So, when you send your subvol, *all* of the subvol's metadata
> is sent, and where that metadata refers to an extent that's
> shared with the -c subvol, the extent data isn't sent, because
> it's known to be on the other end already, and can be shared
> directly from there.

> OK. So, with -p, there's a "base" subvol. The send subvol and
> the -p reference subvol are both snapshots of that base (at
> different times). The -p reference subvol, as with -c, is
> assumed to be on the remote FS. However, because it's known to
> be an earlier version of the same data, you can be more
> efficient in the sending by saying "start from the earlier
> version, and modify it in this way to get the new version"

> So, with -p, not all of the metadata is sent, because you know
> you've already got most of it on the remote in the form of the
> earlier version.

> So -p is "take this thing and apply these differences to it"
> and -c is "build this thing from scratch, but you can share
> some of the data with these sources"

Also here some additional details:

  http://logs.tvrrug.org.uk/logs/%23btrfs/2016-06-29.html#2016-06-29T22:39:59

The requirement for read-only snapshots is because that way it is
reasonably certain that the same stuff is on both the origin and
the target volume.

It may help to compare with RSYNC: it has to scan both the full
origin and target trees, because it cannot be told that there is
a parent tree that is the same on origin and target; but with
option '--link-dest' it can do something similar to 'send -c'.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS Metadata Corruption Prevents Scrub and btrfs check

2017-03-17 Thread Peter Grandi
> How can I attempt to rebuild the metadata, with a treescan or
> otherwise?

Unfortunately I don't know, as far as backrefs are concerned.

>> In general metadata in Btrfs is fairly intricate and metadata
>> block loss is pretty fatal, that's why metadata should most
>> times be redundant as in 'dup' or 'raid1' or similar:

> All the data and metadata on this system is in raid1 or
> raid10, in fact I discovered this issue while trying to change
> my balance form raid1 to raid10.

> johnf@carbon:~$ sudo btrfs fi df /
> Data, RAID10: total=1.13TiB, used=1.12TiB
> Data, RAID1: total=5.17TiB, used=5.16TiB
> System, RAID1: total=32.00MiB, used=864.00KiB
> Metadata, RAID10: total=3.09GiB, used=3.08GiB
> Metadata, RAID1: total=13.00GiB, used=10.16GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B

That's weird because as a rule when there is a checksum error
it is automatically corrected on read if there is a "good copy".
Also because you have both RAID1 and RAID10 data and metadata.
You should have just RAID1 metadata and RAID10 data or both
RAID10. There is probably an "interrupted" 'balance'.

Just had a look at your previous message and it reports, out of
12.56TB, 2 uncorrectable errors. But you have got everything
redundant ('raid1' or 'raid10'), so it looks like somehow those
two blocks are supposed to be copies of each other and are bad:

* 'sdg1' at physical sector 5016524768 volume byte offset 9626194001920
* 'sdh1' at physical sector 5016524768 volume byte offset 9626194001920
* both sectors belong to the tree at byte offset 4804958584832.

Note: BTW I remember someone wrote a guide to decoding Btrfs
'dmesg' lines, but I can't find it anymore, so not sure that
interpretation is entirely correct.

It is a bit "strange" that it is the same sector, as Btrfs
'raid1' profile is not necessarily block-for-block, mirrored
chunks can be in different offsets.

The "strange" symptoms hint not just at disk issues, but also
some past attempts at conversion (I remember a previous message
from you) or recovery have messed up things a bit.

Someone mentioned in some mailing list articles various tools to
print out trees and subtrees and inspect them in general. 'Knorrie'
has written a Python library (and a few inspection tools) with
which it is possible to traverse various Btrfs trees, but I
haven't used it:

 https://github.com/knorrie/python-btrfs/

I'd suggest searching the mailing list for related information.
Also the relevant tree is described here (the one on kernel.org
probably is more up-to-date):

  https://oss.oracle.com/projects/btrfs/dist/documentation/btrfs-backrefs.html
  https://btrfs.wiki.kernel.org/index.php/Btrfs_design#Explicit_Back_References
  https://btrfs.wiki.kernel.org/index.php/Data_Structures
  https://btrfs.wiki.kernel.org/index.php/Trees

You might want to use 'btrfs inspect-internal'.
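
For example, assuming a reasonably recent 'btrfsprogs', something
along these lines (numbers taken from the report above) shows the
tree block at that offset and which files reference a given
logical address:

  # dump the tree block reported at that byte offset (read-only inspection)
  btrfs inspect-internal dump-tree -b 4804958584832 /dev/sdg1

  # map a logical (volume) byte offset back to the files referencing it
  btrfs inspect-internal logical-resolve 9626194001920 /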

Conceivably as the issue seems related to an extent backref
'btrfsck --repair' with '--init-extent-tree' might help, but I
cannot recommend that as I don't know if they are relevant to
your problem and/or safe in your situation. Consider this:

  http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg26816.html

I would use the very most recent version of 'btrfsprogs'.

One possibility I would consider is to move a sufficiently large
subtree of:

  
/home/johnf/personal/projects/openwrt/trunk/build_dir/target-mips_r2_uClibc-0.9.32/hostapd-wpad-mini/hostapd-20110117/hostapd/hostapd.eap_user

into its own directory, create a new subvolume, 'cp --reflink'
everything except that directory into the new subvolume, and
then *perhaps* working on the new subvolume will not touch the
damaged metadata block.
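
A hedged sketch of that workaround (names hypothetical, and
assuming the suspect subtree has already been moved aside as
described):

  # new subvolume next to the damaged one, then reflink-copy the rest into it
  btrfs subvolume create /home-new
  cp -a --reflink=always /home/. /home-new/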
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS Metadata Corruption Prevents Scrub and btrfs check

2017-03-17 Thread Peter Grandi
> Read error at byte 0, while reading 3975 bytes: Input/output error

Bad news. That means that probably the disk is damaged and
further issues may happen.

> corrected errors: 0, uncorrectable errors: 2, unverified errors: 0

Even worse news.

> Incorrect local backref count on 5165855678464 root 259 owner 1732872
> offset 0 found 0 wanted 1 back 0x3ba80f40
> Backref disk bytenr does not match extent record,
> bytenr=5165855678464, ref bytenr=7880454922968236032
> backpointer mismatch on [5165855678464 28672]

"Better" news. In practice a single metadata leaf node is
corrupted in back references. You might be lucky and that might
be rebuildable, but I don't know enough about the somewhat
intricate Btrfs metadata trees to figure that out. Some metadata
is rebuildable from other metadata with a tree scan, some not.

In general metadata in Btrfs is fairly intricate and metadata
block loss is pretty fatal, that's why metadata should most
times be redundant as in 'dup' or 'raid1' or similar:

http://www.sabi.co.uk/blog/16-two.html?160817#160817
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-09 Thread Peter Grandi
>> Consider the common case of a 3-member volume with a 'raid1'
>> target profile: if the sysadm thinks that a drive should be
>> replaced, the goal is to take it out *without* converting every
>> chunk to 'single', because with 2-out-of-3 devices half of the
>> chunks will still be fully mirrored.

>> Also, removing the device to be replaced should really not be
>> the same thing as balancing the chunks, if there is space, to be
>> 'raid1' across remaining drives, because that's a completely
>> different operation.

> There is a command specifically for replacing devices.  It
> operates very differently from the add+delete or delete+add
> sequences. [ ... ]

Perhaps it was not clear that I was talking about removing a
device, as distinct from replacing it, and that I used "removed"
instead of "deleted" deliberately, to avoid the confusion with
the 'delete' command.

In the everyday practice of system administration it often
happens that a device should be removed first, and replaced
later, for example when it is suspected to be faulty, or is
intermittently faulty. The replacement can be done with
'replace' or 'add+delete' or 'delete+add', but that's a
different matter.
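
For reference, hedged sketches of those replacement routes, with
hypothetical device names and mount point:

  # dedicated replace: rebuilds onto the new device directly
  btrfs replace start /dev/sdc /dev/sdg /mnt/volume

  # add-then-delete: grow onto the new device, then migrate chunks off the old one
  btrfs device add /dev/sdg /mnt/volume
  btrfs device delete /dev/sdc /mnt/volume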

Perhaps I should have not have used the generic verb "remove",
but written "make unavailable".

This brings up again the topic of some "confusion" in the design
of the Btrfs multidevice handling logic: at least initially one
could only expand the storage space of a multidevice volume by
'add' of a new device or shrink it by 'delete' of an existing
one, and I think it was not conceived at Btrfs design time that
storage space might be nominally constant while a device (and
the chunks on it) has a state of "available" ("present",
"online", "enabled") or "unavailable" ("absent", "offline",
"disabled"), either because of events or because of system
administrator action.

The 'missing' pseudo-device designator was added later, and
'replace' also later to avoid having to first expand then shrink
(or viceversa) the storage space and the related copying.

My impression is that it would be less "confused" if the Btrfs
device handling logic were changed to allow for the the state of
"member of the multidevice set but not actually available" and
the related consequent state for chunks that ought to be on it;
that probably would be essential to fixing the confusing current
aspects of recovery in a multidevice set. That would be very
useful even if it may require a change in the on-disk format to
distinguish the distinct states of membership and availability
for devices and mark chunks as available or not (chunks of course
being only possible on member devices).

That is, it would also be nice to have the opposite state of "not
member of the multidevice set but actually available to it", that
is a spare device, and related logic.

Note: simply setting '/sys/block/$DEV/device/delete' is not a
good option, because that makes the device unavailable not just
to Btrfs, but also to the whole systems. In the ordinary practice
of system administration it may well be useful to make a device
unavailable to Btrfs but still available to the system, for
example for testing, and anyhow they are logically distinct
states. That also means a member device might well be available
to the system, but marked as "not available" to Btrfs.
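
For clarity, the operation being warned about is this (device
name hypothetical), which detaches the device from the whole
system, not merely from Btrfs:

  # makes /dev/sdg disappear system-wide, not just "unavailable to Btrfs"
  echo 1 > /sys/block/sdg/device/delete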
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-05 Thread Peter Grandi
[ ... on the difference between number of devices and length of
a chunk-stripe ... ]

> Note: possibilities get even more interesting with a 4-device
> volume with 'raid1' profile chunks, and similar case involving
> other profiles than 'raid1'.

Consider for example a 4-device volume with 2 devices abruptly
missing: if 2-length 'raid1' chunk-stripes have been uniformly
laid across devices, then some chunk-stripes will be completely
missing (where both chunks in the stripe were on the 2 missing
devices), some will be 1-length, and some will be 2-length.

What to do when devices are missing?

One possibility is to simply require mount with the 'degraded'
option, by default read-only, but allowing read-write, simply as
a way to ensure the sysadm knows that some metada/data *may* not
be redundant or *may* even be unavailable (if the chunk-stripe
length is less than the minimum to reconstruct the data).

Then attempts to read unavailable metadata or data would return
an error like a checksum violation without redundancy,
dynamically (when the application or 'balance' or 'scrub'
attempt to read the unavailable data).
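
The existing 'degraded' mount option is the relevant knob for the
first part of that; a hedged example with hypothetical names:

  mount -o degraded,ro /dev/sdb /mnt/volume   # cautious, read-only
  mount -o degraded /dev/sdb /mnt/volume      # read-write, at the sysadm's own risk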
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-05 Thread Peter Grandi
>> What makes me think that "unmirrored" 'raid1' profile chunks
>> are "not a thing" is that it is impossible to remove
>> explicitly a member device from a 'raid1' profile volume:
>> first one has to 'convert' to 'single', and then the 'remove'
>> copies back to the remaining devices the 'single' chunks that
>> are on the explicitly 'remove'd device. Which to me seems
>> absurd.

> It is, there should be a way to do this as a single operation.
> [ ... ] The reason this is currently the case though is a
> simple one, 'btrfs device delete' is just a special instance
> of balance [ ... ]  does no profile conversion, but having
> that as an option would actually be _very_ useful from a data
> safety perspective.

That seems to me an even more "confused" opinion: because
removing a device to make it "missing" and removing it
permanently should be very different operations.

Consider the common case of a 3-member volume with a 'raid1'
target profile: if the sysadm thinks that a drive should be
replaced, the goal is to take it out *without* converting every
chunk to 'single', because with 2-out-of-3 devices half of the
chunks will still be fully mirrored.

Also, removing the device to be replaced should really not be
the same thing as balancing the chunks, if there is space, to be
'raid1' across remaining drives, because that's a completely
different operation.

>> Going further in my speculation, I suspect that at the core of
>> the Btrfs multidevice design there is a persistent "confusion"
>> (to use en euphemism) between volumes having a profile, and
>> merely chunks have a profile.

> There generally is.  The profile is entirely a property of the
> chunks (each chunk literally has a bit of metadata that says
> what profile it is), not the volume.  There's some metadata in
> the volume somewhere that says what profile to use for new
> chunks of each type (I think),

That's the "target" profile for the volume.

> but that doesn't dictate what chunk profiles there are on the
> volume. [ ... ]

But as that's the case then the current Btrfs logic for
determining whether a volume is degraded or not is quite
"confused" indeed.

Because suppose there is again the simple case of a 3-device
volume, where all existing chunks have 'raid1' profile and the
volume's target profile is also 'raid1' and one device has gone
offline: the volume cannot be said to be "degraded", unless a
full examination of all chunks is made. Because it can well
happen that in fact *none* of the chunks was mirrored to that
device, for example, however unlikely. And viceversa. Even with
3 devices some chunks may be temporarily "unmirrored" (even if
for brief times hopefully).

The average case is that half of the chunks will be fully
mirrored across the two remaining devices and half will be
"unmirrored".

Now consider re-adding the third device: at that point the
volume has got back all 3 devices, so it is not "degraded", but
50% of the chunks in the volume will still be "unmirrored", even
if eventually they will be mirrored on the newly added device.

Note: possibilities get even more interesting with a 4-device
volume with 'raid1' profile chunks, and similar case involving
other profiles than 'raid1'.

Therefore the current Btrfs logic for deciding whether a volume
is "degraded" seems simply "confused" to me, because whether
there are missing devices and some chunks are "unmirrored" is
not quite the same thing.

The same applies to the current logic that in a 2-device volume
with a device missing new chunks are created as "single" profile
instead of as "unmirrored" 'raid1' profile: another example of
"confusion" between number of devices and chunk profile.

Note: the best that can be said is that a volume has both a
"target chunk profile" (one per data, metadata, system chunks)
and a target number of member devices, and that a volume with a
number of devices below the target *might* be degraded, and that
whether a volume is in fact degraded is not either/or, but given
by the percentage of chunks or stripes that are degraded. This
is especially made clear by the 'raid1' case where the chunk
stripe length is always 2, but the number of target devices can
be greater than 2. Management of devices and management of
stripes are in Btrfs, unlike conventional RAID like Linux MD,
rather different operations needing rather different, if
related, logic.

My impression is that because of "confusion" between number of
devices in a volume and status of chunk profile there are some
"surprising" behaviors in Btrfs, and that will take quite a bit
to fix, most importantly for the Btrfs developer team to clear
among themselves the semantics attaching to both. After 10 years
of development that seems the right thing to do :-).
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-02 Thread Peter Grandi
> [ ... ] Meanwhile, the problem as I understand it is that at
> the first raid1 degraded writable mount, no single-mode chunks
> exist, but without the second device, they are created.  [
> ... ]

That does not make any sense, unless there is a fundamental
mistake in the design of the 'raid1' profile, which this and
other situations make me think is a possibility: that the
category of "mirrored" 'raid1' chunk does not exist in the Btrfs
chunk manager. That is, a chunk is either 'raid1' if it has a
mirror, or if has no mirror it must be 'single'.

If a member device of a 'raid1' profile multidevice volume
disappears there will be "unmirrored" 'raid1' profile chunks and
some code path must recognize them as such, but the logic of the
code does not allow their creation. Question: how does the code
know that a specific 'raid1' chunk is mirrored or not? The chunk
must have a link (member, offset) to its mirror, does it?

What makes me think that "unmirrored" 'raid1' profile chunks are
"not a thing" is that it is impossible to remove explicitly a
member device from a 'raid1' profile volume: first one has to
'convert' to 'single', and then  the 'remove' copies back to the
remaining devices the 'single' chunks that are on the explicitly
'remove'd device. Which to me seems absurd.
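
Spelled out, the sequence just described looks like this (device
and mount point hypothetical; '-f' is needed to reduce metadata
redundancy):

  # step 1: convert every chunk down to 'single', losing all redundancy
  btrfs balance start -f -dconvert=single -mconvert=single /mnt/volume
  # step 2: only now remove the device; its 'single' chunks get copied back
  btrfs device delete /dev/sdc /mnt/volume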

Going further in my speculation, I suspect that at the core of
the Btrfs multidevice design there is a persistent "confusion"
(to use en euphemism) between volumes having a profile, and
merely chunks have a profile.

My additional guess is that the original design concept had
multidevice volumes to be merely containers for chunks of
whichever mixed profiles, so a subvolume could have 'raid1'
profile metadata and 'raid0' profile data, and another could
have 'raid10' profile metadata and data, but since handling this
turned out to be too hard, this was compromised into volumes
having all metadata chunks to have the same profile and all data
of the same profile, which requires special-case handling of
corner cases, like volumes being converted or missing member
devices.

So in the case of 'raid1', a volume with say a 'raid1' data
profile should have all-'raid1' and fully mirrored profile
chunks, and the lack of a member device fails that aim in two
ways.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Low IOOP Performance

2017-02-27 Thread Peter Grandi
[ ... ]

> I have a 6-device test setup at home and I tried various setups
> and I think I got rather better than that.

* 'raid1' profile:

  soft#  btrfs fi df /mnt/sdb5

  Data, RAID1: total=273.00GiB, used=269.94GiB
  System, RAID1: total=32.00MiB, used=56.00KiB
  Metadata, RAID1: total=1.00GiB, used=510.70MiB
  GlobalReserve, single: total=176.00MiB, used=0.00B

  soft#  fio --directory=/mnt/sdb5 --runtime=30 --status-interval=10 blocks-randomish.fio | tail -3
  Run status group 0 (all jobs):
     READ: io=105508KB, aggrb=3506KB/s, minb=266KB/s, maxb=311KB/s, mint=30009msec, maxt=30090msec
    WRITE: io=100944KB, aggrb=3354KB/s, minb=256KB/s, maxb=296KB/s, mint=30009msec, maxt=30090msec

* 'raid10' profile:

  soft#  btrfs fi df /mnt/sdb6
  Data, RAID10: total=276.00GiB, used=272.49GiB
  System, RAID10: total=96.00MiB, used=48.00KiB
  Metadata, RAID10: total=3.00GiB, used=512.06MiB
  GlobalReserve, single: total=176.00MiB, used=0.00B

  soft#  fio --directory=/mnt/sdb6 --runtime=30 --status-interval=10 blocks-randomish.fio | tail -3
  Run status group 0 (all jobs):
     READ: io=89056KB, aggrb=2961KB/s, minb=225KB/s, maxb=271KB/s, mint=30009msec, maxt=30076msec
    WRITE: io=85248KB, aggrb=2834KB/s, minb=212KB/s, maxb=261KB/s, mint=30009msec, maxt=30076msec

* 'single' profile on MD RAID10:

  soft#  btrfs fi df /mnt/md0
  Data, single: total=278.01GiB, used=274.32GiB
  System, single: total=4.00MiB, used=48.00KiB
  Metadata, single: total=2.01GiB, used=615.73MiB
  GlobalReserve, single: total=208.00MiB, used=0.00B

  soft#  grep -A1 md0 /proc/mdstat 
  md0 : active raid10 sdg1[6] sdb1[0] sdd1[2] sdf1[4] sdc1[1] sde1[3]
364904232 blocks super 1.0 8K chunks 2 near-copies [6/6] [UU]

  soft#  fio --directory=/mnt/md0 --runtime=30 --status-interval=10 blocks-randomish.fio | tail -3
  Run status group 0 (all jobs):
     READ: io=160928KB, aggrb=5357KB/s, minb=271KB/s, maxb=615KB/s, mint=30012msec, maxt=30038msec
    WRITE: io=158892KB, aggrb=5289KB/s, minb=261KB/s, maxb=616KB/s, mint=30012msec, maxt=30038msec

That's a range of 700-1300 4KiB random mixed-rw IOPS, quite
reasonable for 6x 1TB 7200RPM SATA drives, each capable of
100-120. It helps that the test file is just 100G, 10% of the
total drive extent, so arm movement is limited.

Not surprising that the much more mature MD RAID has an edge, a
bit stranger that on this the 'raid1' profile seems a bit faster
than the 'raid10' profile.

The much smaller numbers seem to happen to me too (probably some
misfeature of 'fio') with 'buffered=1', and the larger numbers
for ZFSonLinux are "suspicious".

> It seems unlikely to me that you got that with a 10-device
> mirror 'vdev', most likely you configured it as a stripe of 5x
> 2-device mirror vdevs, that is RAID10.

Indeed, I double checked the end of the attached log and that
was the case.

My FIO config file:

  # vim:set ft=ini:

  [global]
  filename=FIO-TEST
  fallocate=keep
  size=100G

  buffered=0
  ioengine=libaio
  io_submit_mode=offload

  iodepth=2
  numjobs=12
  blocksize=4K

  kb_base=1024

  [rand-mixed]

  rw=randrw
  stonewall
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Low IOOP Performance

2017-02-27 Thread Peter Grandi
>>> On Mon, 27 Feb 2017 22:11:29 +, p...@btrfs.list.sabi.co.uk (Peter Grandi) said:

> [ ... ]
>> I have a 6-device test setup at home and I tried various setups
>> and I think I got rather better than that.

[ ... ]

> That's a range of 700-1300 4KiB random mixed-rw IOPS,

Rerun with 1M blocksize:

  soft#  fio --directory=/mnt/sdb5 --runtime=30 --status-interval=10 --blocksize=1M blocks-randomish.fio | tail -3
  Run status group 0 (all jobs):
     READ: io=2646.0MB, aggrb=89372KB/s, minb=7130KB/s, maxb=7776KB/s, mint=30081msec, maxt=30317msec
    WRITE: io=2297.0MB, aggrb=77584KB/s, minb=6082KB/s, maxb=6796KB/s, mint=30081msec, maxt=30317msec

  soft#  fio --directory=/mnt/sdb6 --runtime=30 --status-interval=10 --blocksize=1M blocks-randomish.fio | tail -3
  Run status group 0 (all jobs):
     READ: io=2781.0MB, aggrb=94015KB/s, minb=5932KB/s, maxb=10290KB/s, mint=30121msec, maxt=30290msec
    WRITE: io=2431.0MB, aggrb=82183KB/s, minb=4779KB/s, maxb=9102KB/s, mint=30121msec, maxt=30290msec
  soft#  killall -9 fio
  fio: no process found

  soft#  fio --directory=/mnt/md0 --runtime=30 --status-interval=10 --blocksize=1M blocks-randomish.fio | tail -3
  Run status group 0 (all jobs):
     READ: io=1504.0MB, aggrb=50402KB/s, minb=3931KB/s, maxb=4387KB/s, mint=30343msec, maxt=30556msec
    WRITE: io=1194.0MB, aggrb=40013KB/s, minb=3158KB/s, maxb=3475KB/s, mint=30343msec, maxt=30556msec

Interesting that Btrfs 'single' on MD RAID10 becomes rather
slower (I guess low level of intrinsic parallelism).

For comparison, the same on a JFS on top of MD RAID10:

  soft#  grep -A1 md40 /proc/mdstat 
  md40 : active raid10 sdg4[5] sdd4[2] sdb4[0] sdf4[4] sdc4[1] sde4[3]
486538240 blocks super 1.0 512K chunks 3 near-copies [6/6] [UU]

  soft#  fio --directory=/mnt/md40 --runtime=30 --status-interval=10 --blocksize=4K blocks-randomish.fio | grep -A2 '(all jobs)' | tail -3
  Run status group 0 (all jobs):
     READ: io=31408KB, aggrb=1039KB/s, minb=80KB/s, maxb=90KB/s, mint=30206msec, maxt=30227msec
    WRITE: io=27800KB, aggrb=919KB/s, minb=70KB/s, maxb=81KB/s, mint=30206msec, maxt=30227msec

  soft#  fio --directory=/mnt/md40 --runtime=30 --status-interval=10 --blocksize=1M blocks-randomish.fio | grep -A2 '(all jobs)' | tail -3
  Run status group 0 (all jobs):
     READ: io=2151.0MB, aggrb=72619KB/s, minb=5865KB/s, maxb=6383KB/s, mint=30134msec, maxt=30331msec
    WRITE: io=1772.0MB, aggrb=59824KB/s, minb=4712KB/s, maxb=5365KB/s, mint=30134msec, maxt=30331msec

XFS is usually better at multithreaded workloads within the same
file (rather than across files).


Re: Low IOOP Performance

2017-02-27 Thread Peter Grandi
[ ... ]
> a ten disk raid1 using 7.2k 3 TB SAS drives

Those are really low IOPS-per-TB devices, but a good choice for
SAS, as they will have SCT/ERC.

> and used aio to test IOOP rates. I was surprised to measure
> 215 read and 72 write IOOPs on the clean new filesystem.

For that you really want to use the 'raid10' profile, 'raid1' is
quite different, and has an odd recovery "gotcha". Also so far
'raid1' in Btrfs only reads from one of the two mirrors per
thread.
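
For example (a sketch with placeholder device names; the conversion
can be done in place on a mounted, populated volume):

  # create with the 'raid10' profile from the start
  mkfs.btrfs -d raid10 -m raid10 /dev/sd[b-k]1
  # or convert an existing 'raid1' volume
  btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt/big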

Anyhow the 72 write IOPS look like the IOPS rate of a single
member device, and that's puzzling, as if Btrfs were not spreading
multithreaded writes across a many-device 'raid1' profile volume.

I have a 6-device test setup at home and I tried various setups
and I think I got rather better than that.

> Sequential writes ran as expected at roughly 650 MB/s.

That's a bit too high: on a single similar drive I get around
65MB/s average with relatively large files, so I would expect
around 4-5x that (roughly 260-325MB/s) from a 10-device mirrored
profile, regardless of filesystem type.

I strongly suspect that we have a different notion of "IOPS",
perhaps either logical vs. physical IOPS, or randomish vs.
sequentialish IOPS. I'll have a look at your attachments in more
detail.

> I created a zfs filesystem for comparison on another
> checksumming filesystem using the same layout and measured
> IOOP rates at 4315 read, 1449 write with sync enabled (without
> sync it's clearly just writing to RAM), sequential performance
> was comparable to btrfs.

It seems unlikely to me that you got that with a 10-device
mirror 'vdev', most likely you configured it as a stripe of 5x
2-device mirror vdevs, that is RAID10.


Re: understanding disk space usage

2017-02-08 Thread Peter Grandi
[ ... ]
> The issue isn't total size, it's the difference between total
> size and the amount of data you want to store on it. and how
> well you manage chunk usage. If you're balancing regularly to
> compact chunks that are less than 50% full, [ ... ] BTRFS on
> 16GB disk images before with absolutely zero issues, and have
> a handful of fairly active 8GB BTRFS volumes [ ... ]

Unfortunately balance operations are quite expensive, especially
from inside VMs. On the other hand, if the system is not very
disk-constrained, relatively frequent balances are indeed a good
idea. It is a bit like the advice in the other thread on OLTP to
run frequent data defragmentations, which are also quite expensive.

Both combined are like running the compactor/cleaner on log
structured (another variant of "COW") filesystems like NILFS2:
running that frequently means tighter space use and better
locality, but is quite expensive too.
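
For what it is worth, such a relatively frequent balance need not
rewrite every chunk: a usage-filtered one is much cheaper. A
sketch, with a hypothetical mount point:

  # rewrite only chunks that are at most half full
  btrfs balance start -dusage=50 -musage=50 /srv/vm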

>> [ ... ] My impression is that the Btrfs design trades space
>> for performance and reliability.

> In general, yes, but a more accurate statement would be that
> it offers a trade-off between space and convenience. [ ... ]

It is not quite "convenience", it is overhead: whole-volume
operations like compacting, defragmenting (or fscking) tend to
cost significantly in IOPS and also in transfer rate, and on
flash SSDs they also consume lifetime.

Therefore I personally prefer to keep quite a bit of unused space
in Btrfs or NILFS2: at a minimum around double, 10-20%, rather than
the 5-10% that I think is the minimum advisable with conventional
designs.


Re: understanding disk space usage

2017-02-08 Thread Peter Grandi
>> My system is or seems to be running out of disk space but I
>> can't find out how or why. [ ... ]
>> Filesystem      Size  Used Avail Use% Mounted on
>> /dev/sda3        28G   26G  2.1G  93% /
[ ... ]
> So from chunk level, your fs is already full.  And balance
> won't succeed since there is no unallocated space at all.

To add to this, 28GiB is a bit too small for Btrfs, because at
that size the chunk size is still 1GiB. I have the habit of sizing
partitions to an exact number of GiB, and that means that most of
1GiB will never be used by Btrfs: there is always some small
allocation of less than 1GiB, so eventually just under 1GiB will
be left unallocated, too little for another chunk. Unfortunately
the chunk size is not manually settable.

Example here from 'btrfs fi usage':

Overall:
    Device size:          88.00GiB
    Device allocated:     86.06GiB
    Device unallocated:    1.94GiB
    Device missing:          0.00B
    Used:                 80.11GiB
    Free (estimated):      6.26GiB   (min: 5.30GiB)

That means that I should 'btrfs balance' now: of the 1.94GiB
"unallocated", 0.94GiB will never be allocated, and that leaves
just 1GiB of usable "unallocated" space, which is the minimum for
running 'btrfs balance'. I have just done so and this is the
result:

Overall:
    Device size:          88.00GiB
    Device allocated:     82.03GiB
    Device unallocated:    5.97GiB
    Device missing:          0.00B
    Used:                 80.11GiB
    Free (estimated):      6.26GiB   (min: 3.28GiB)

At some point I had decided to use 'mixed-bg' allocation to reduce
this problem and hopefully improve locality, but that means that
metadata and data have to share the same profile, and I really
want metadata to be 'dup' because of checksumming, while I don't
want data to be 'dup' too.
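
(For reference, a sketch with a placeholder device name; mixed
block groups can only be chosen at mkfs time:)

  # data and metadata share block groups and thus the same profile;
  # mainly meant for small filesystems
  mkfs.btrfs --mixed /dev/sdX3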

> [ ... ] To proceed, add a larger device to current fs, and do
> a balance or just delete the 28G partition then btrfs will
> handle the rest well.

Usually for this I use a USB stick with a 1-3GiB partition, plus a
bit extra because of that extra bit of space that never gets
allocated.

https://btrfs.wiki.kernel.org/index.php/FAQ#How_much_free_space_do_I_have.3F
https://btrfs.wiki.kernel.org/index.php/FAQ#Help.21_Btrfs_claims_I.27m_out_of_space.2C_but_it_looks_like_I_should_have_lots_left.21
marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html

Unfortunately, if it is a single-device volume and metadata is
'dup', then to remove the extra temporary device one has first to
convert the metadata to 'single', and back to 'dup' after removal.
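
A minimal sketch of the whole add-balance-remove sequence described
above, assuming the volume is mounted at / and the temporary USB
partition is /dev/sdX1 (both placeholders):

  btrfs device add /dev/sdX1 /
  btrfs balance start /
  # as noted, 'dup' metadata has to become 'single' before removal;
  # '-f' acknowledges the temporarily reduced metadata redundancy
  btrfs balance start -f -mconvert=single /
  btrfs device delete /dev/sdX1 /
  btrfs balance start -mconvert=dup /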

There are also some additional reasons why space used (rather
than allocated) may be larger than expected, in special but not
wholly infrequent cases. My impression is that the Btrfs design
trades space for performance and reliability.


Re: BTRFS for OLTP Databases

2017-02-07 Thread Peter Grandi
> I have tried BTRFS from Ubuntu 16.04 LTS for write intensive
> OLTP MySQL Workload.

This has a lot of interesting and mostly agreeable information:

https://blog.pgaddict.com/posts/friends-dont-let-friends-use-btrfs-for-oltp

The main target of Btrfs is situations where one wants checksums
and occasional snapshots for backup (rather than rollback), and
where applications do whole-file rewrites or appends.

> It did not go very well ranging from multi-seconds stalls
> where no transactions are completed

That usually is more because of the "clever" design and defaults
of the Linux page cache and block IO subsystem, which are
astutely pessimized for every workload, but especially for
read-modify-write ones, never mind for RMW workloads on
copy-on-write filesystems.

That most OS designs are pessimized for anything like a "write
intensive OLTP" workload is not new: M. Stonebraker complained
about that 35 years ago, and nothing much has changed:

  http://www.sabi.co.uk/blog/anno05-4th.html?051012d#051012d

> to the finally kernel OOPS with "no space left on device"
> error message and filesystem going read only.

That's because Btrfs has a two-level allocator, where space is
allocated first in 1GiB chunks (distinct as to data and metadata)
and then in 16KiB nodes, and this makes it far more likely for free
space fragmentation to occur. Therefore Btrfs has a free space
compactor ('btrfs balance') that must be run the more often the
more updates happen.
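
A sketch of the kind of routine maintenance this implies, with a
hypothetical MySQL data volume mounted at /var/lib/mysql:

  # when "unallocated" space nears zero while the estimated free
  # space is still large, the ENOSPC above is close and a balance
  # is overdue
  btrfs filesystem usage /var/lib/mysql
  # compact mostly-empty chunks back into unallocated space
  btrfs balance start -dusage=60 -musage=60 /var/lib/mysql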

> interested in "free" snapshots which look very attractive

The general problem is that it is pretty much impossible to have
read-modify-write rollbacks for cheap, because the writes in
general are scattered (that is their time coherence is very
different from their spatial coherence). That means either heavy
spatial fragmentation or huge write amplification.

The 'snapshot' type of DM/LVM2 device delivers heavy spatial
fragmentation, while Btrfs strikes a balance between the two.
Another commenter has mentioned the use of 'nodatacow' to keep RMW
from resulting in huge write amplification.
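
(A sketch of how 'nodatacow' is usually applied per directory
rather than volume-wide; the path is a placeholder, the attribute
only takes effect on newly created files, and it also disables data
checksumming and compression for them:)

  chattr +C /var/lib/mysql
  lsattr -d /var/lib/mysql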

> to use for database recovery scenarios allow instant rollback
> to the previous state.

You may be more interested in NILFS2 for that, but there are
significant tradeoffs there too, and NILFS2 requires a free
space compactor too, plus since NILFS2 gives up on short-term
spatial coherence, the compactor also needs to compact data
space.


Re: [Jfs-discussion] benchmark results

2009-12-24 Thread Peter Grandi
 I've had the chance to use a testsystem here and couldn't
 resist

Unfortunately there seems to be an overproduction of rather
meaningless file system benchmarks...

 running a few benchmark programs on them: bonnie++, tiobench,
 dbench and a few generic ones (cp/rm/tar/etc...) on ext{234},
 btrfs, jfs, ufs, xfs, zfs. All with standard mkfs/mount options
 and +noatime for all of them.

 Here are the results, no graphs - sorry: [ ... ]

After having a glance, I suspect that your tests could be
enormously improved, and doing so would reduce the pointlessness of
the results.

A couple of hints:

* In the generic test the 'tar' test bandwidth is exactly the
  same (276.68 MB/s) for nearly all filesystems.

* There are read transfer rates higher than the one reported by
  'hdparm' which is 66.23 MB/sec (comically enough *all* the
  read transfer rates your benchmarks report are higher).

BTW the use of Bonnie++ is also usually a symptom of a poor
understanding of file system benchmarking.

On the plus side, test setup context is provided in the env
directory, which is rare enough to be commendable.

 Short summary, AFAICT:
 - btrfs, ext4 are the overall winners
 - xfs too, but creating/deleting many files was *very* slow

Maybe, and these conclusions are sort of plausible (but I prefer
JFS and XFS for different reasons); however they are not supported
by your results, which seem to me to lack much meaning: what is
being measured is far from clear, and in particular it does not
seem to be file system performance, or anyhow an aspect of
filesystem performance that might relate to common usage.

I think that it is rather better to run a few simple operations
(like the generic test) properly (unlike the generic test), to give
a feel for how well the basic operations of the file system design
are implemented.
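
As a sketch of what "properly" means here (an illustration only,
not the original test): measure cold-cache and include the final
writeback in the timing, with a placeholder tarball and target:

  # drop the page cache so reads actually hit the disks
  sync; echo 3 > /proc/sys/vm/drop_caches
  # include the final 'sync', otherwise mostly RAM speed is measured
  time sh -c 'tar -xf some-source-tree.tar -C /mnt/test && sync'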

Profiling file system performance with a meaningful full-scale
benchmark is a rather difficult task, requiring great intellectual
fortitude and lots of time.

 - if you need only fast but no cool features or
   journaling, ext2 is still a good choice :)

That is however a generally valid conclusion, but with a very,
very important qualification: for freshly loaded filesystems.
Also with several other important qualifications, but freshly
loaded is a pet peeve of mine :-).