Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-29 Thread Peter Grandi
>> [ ... ] these extents are all over the place, they're not
>> contiguous at all. 4K here, 4K there, 4K over there, back to
>> 4K here next to this one, 4K over there...12K over there, 500K
>> unwritten, 4K over there. This seems not so consequential on
>> SSD, [ ... ]

> Indeed there were recent reports that the 'ssd' mount option
> causes that, IIRC by Hans van Kranenburg [ ... ]

The report included news that "sometimes" the 'ssd' option is
automatically switched on at mount even on hard disks. I had
promised to put a summary of the issue on the Btrfs wiki, but
I regret that I haven't yet done that.
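For anyone wanting to check whether their filesystems are affected, the heuristic's result shows up in the active mount options ('/' below is just an example mount point; 'nossd' is the documented override):

```shell
# Show whether the filesystem mounted at '/' currently has the 'ssd'
# option active; btrfs switches it on automatically when the device
# reports itself as non-rotational, which is the behaviour discussed
# above.
opts=$(awk '$2 == "/" { print $4; exit }' /proc/mounts)
case ",$opts," in
    *,ssd,*) echo "ssd heuristics active" ;;
    *)       echo "ssd option not set" ;;
esac
# To override the auto-detection on a disk wrongly flagged as an SSD:
#   mount -o remount,nossd /mountpoint
```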
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-29 Thread Peter Grandi
> [ ... ] Instead, you can use raw files (preferably sparse unless
> there's both nocow and no snapshots). Btrfs does natively everything
> you'd gain from qcow2, and does it better: you can delete the master
> of a cloned image, deduplicate them, deduplicate two unrelated images;
> you can turn on compression, etc.

Uhm, I understand this argument in the general case (not
specifically as to QCOW2 images), and it has some merit, but it is
"controversial", as there are two counterarguments:

* Application-specific file formats can better match application-specific
  requirements.
* Putting advanced functionality into the filesystem code makes it more
  complex and less robust, and Btrfs is a bit of a major example of the
  consequences. I count compression and deduplication among the things that
  I reckon make a filesystem too complex.

As to snapshots, I draw a distinction between filetree snapshots and file
snapshots: the first clones a tree as of the snapshot moment, and is a
system management feature; the second provides per-file update
rollback. One sort of implies the other, but using the per-file rollback
*systematically*, that is, as a feature an application can rely on, seems
a bit dangerous to me.

> Once you pay the btrfs performance penalty,

Uhmmm, Btrfs has a small or even negative performance penalty as a
general-purpose filesystem, and many (more or less well conceived) tests
show it performs up there with the best. The only two real costs I attribute
to it are the large CPU cost of doing checksumming all the time, but
that's unavoidable if one wants checksumming, and that checksumming
usually requires metadata duplication, that is at least the 'dup' profile
for metadata, and that is indeed a bit expensive.

> you may as well actually use its features,

The features that I think Btrfs gives that are worth using are
checksumming, metadata duplication, and filetree snapshots.

> which make qcow2 redundant and harmful.

My impression is that in almost all cases QCOW2 is harmful, because it
trades more IOPS and complexity for less disk space, while disk space is
cheap and IOPS and complexity are expensive, but of course a lot of
people know better :-). My preferred VM setup is a small, essentially
read-only, non-QCOW2 image for '/' and everything else mounted via NFSv4,
from the VM host itself or a NAS server, but again lots of people know
better and use multi-terabyte-sized QCOW2 images :-).


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Duncan
Goffredo Baroncelli posted on Fri, 28 Apr 2017 19:05:21 +0200 as
excerpted:

> After some thinking I adopted a different strategy: I use journald as a
> collector, then forward all the logs to rsyslogd, which uses a "log
> append" format. Journald never writes on the root filesystem, only in
> tmp.

Great minds think alike. =:^)

Only here it's syslog-ng that does the permanent writes.

I just couldn't see journald's crazy (for btrfs) write pattern going to 
permanent storage.

And AFAIK, journald has no pre-write filtering mechanism at all, only 
post-write display-time filtering, so even "log-spam" that I don't want/
need logged gets written to it. If I see something spamming 
continuously (I run git kernels and KDE, and do get such spammers 
occasionally) I set up a syslog-ng spam filter to kill it, so it never 
actually gets written to permanent storage at all.

But the tmpfs journals and btrfs traditional logs give me the best of 
both worlds: per-boot journals with all the extra metadata, the last ten 
journal entries for a unit when I do systemctl status on it, etc., and a 
nice filtered and ordered multi-boot log that I can use traditional text-
based log-administration tools on.

The only part of it I'm not happy with is that journald apparently can't 
keep separate user and system journals when set to temporary-only -- 
everything goes to the system journal.  Which eventually means that much 
of the stdout/stderr debugging spew that KDE-based apps like to emit 
ends up in the system journal and (would be in the) log.  But that's a 
journald "documented bug-feature", and I can and do filter it with 
syslog-ng before it actually hits the written system log (or console log 
display).

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



RE: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Paul Jones
> -Original Message-
> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
> ow...@vger.kernel.org] On Behalf Of Goffredo Baroncelli
> Sent: Saturday, 29 April 2017 3:05 AM
> To: Chris Murphy <li...@colorremedies.com>
> Cc: Btrfs BTRFS <linux-btrfs@vger.kernel.org>
> Subject: Re: btrfs, journald logs, fragmentation, and fallocate
> 
> 
> In the past I faced the same problems; I collected some data here
> http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html.
> Unfortunately the journald files are very bad, because first the data is
> written (appended), then the index fields are updated. Unfortunately these
> indexes are near after the last write . So fragmentation is unavoidable.

Perhaps a better idea for COW filesystems is to store the index in a separate 
file, and/or rewrite the last 1 MiB block (or the part of it written so far) 
of the data file every time data is appended? That way the data file will use 
1 MiB extents and hopefully avoid ridiculous amounts of metadata. 
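A minimal sketch of that rewrite-the-tail scheme (hypothetical, not what journald actually does; the 1 MiB block size and the file names are made up for illustration): every append rewrites the whole containing block, so a COW filesystem sees one large sequential write per append instead of a 4 KiB one.

```shell
log=$(mktemp)
blk=1048576                                  # 1 MiB block size (assumption)

append() {
    printf '%s\n' "$1" >> "$log"             # the small append itself
    size=$(stat -c %s "$log")
    start=$(( (size - 1) / blk * blk ))      # start of the last 1 MiB block
    tail -c +$((start + 1)) "$log" > "$log.tail"
    # Rewrite the whole tail block in place, so the filesystem can lay it
    # out as one extent rather than accumulating 4 KiB fragments.
    dd if="$log.tail" of="$log" bs=$blk seek=$((start / blk)) \
       conv=notrunc status=none
    rm -f "$log.tail"
}

append "first record"
append "second record"
cat "$log"
```

The data is unchanged after each rewrite; only the write pattern presented to the filesystem differs.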


Paul.


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Peter Grandi

> [ ... ] these extents are all over the place, they're not
> contiguous at all. 4K here, 4K there, 4K over there, back to
> 4K here next to this one, 4K over there...12K over there, 500K
> unwritten, 4K over there. This seems not so consequential on
> SSD, [ ... ]

Indeed there were recent reports that the 'ssd' mount option
causes that, IIRC by Hans van Kranenburg (around 2017-04-17),
which also noticed issues with the wandering trees in certain
situations (around 2017-04-08).


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Adam Borowski
On Fri, Apr 28, 2017 at 11:41:00AM -0600, Chris Murphy wrote:
> The same behavior happens with NTFS in qcow2 files. They quickly end
> up with 100,000+ extents unless set nocow. It's like the worst case
> scenario.

You should never use qcow2 on btrfs, especially if snapshots are involved.
They both do roughly the same thing, and layering fragmentation upon
fragmentation ɪꜱ ɴᴏᴛ ᴘʀᴇᴛᴛʏ.  Layering syncs is bad, too.

Instead, you can use raw files (preferably sparse unless there's both nocow
and no snapshots).  Btrfs does natively everything you'd gain from qcow2,
and does it better: you can delete the master of a cloned image, deduplicate
them, deduplicate two unrelated images; you can turn on compression, etc.

Once you pay the btrfs performance penalty, you may as well actually use its
features, which make qcow2 redundant and harmful.


Meow!
-- 
Don't be racist.  White, amber or black, all beers should be judged based
solely on their merits.  Heck, even if occasionally a cider applies for a
beer's job, why not?
On the other hand, corpo lager is not a race.


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Chris Murphy
On Fri, Apr 28, 2017 at 1:39 PM, Peter Grandi  
wrote:


> In a particularly demented setup I had to decatastrophize with
> great pain a Zimbra QCOW2 disk image (XFS on NFS on XFS on
> RAID6) containing an ever-growing Maildir email archive that
> ended up with over a million widely scattered microextents:
>
>   http://www.sabi.co.uk/blog/1101Jan.html?110116#110116

Related Btrfs thread "File system corruption, btrfsck abort" involves
5 concurrently used VMs with guests using ext4, NTFS, HFS+, Btrfs, LVM,
pointing to qcow2 files on Btrfs for backing. And it's resulting in
problems...


-- 
Chris Murphy


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Chris Murphy
On Fri, Apr 28, 2017 at 11:53 AM, Peter Grandi  
wrote:

> Well, depends, but probably the single file: it is more likely
> that the 20,000 fragments will actually be contiguous, and that
> there will be less metadata IO than for 40,000 separate journal
> files.

You can see from the examples I posted that these extents are all over
the place, they're not contiguous at all. 4K here, 4K there, 4K over
there, back to 4K here next to this one, 4K over there...12K over
there, 500K unwritten, 4K over there. This seems not so consequential
on SSD; at least if it impacts performance, it's not so bad that I
care. On a hard drive, it's totally noticeable. And that's why journald
went with chattr +C by default a few versions ago when on Btrfs. And it
does help *if* the parent is never snapshot, which on a snapshotting
file system can't really be guaranteed. Inadvertent snapshotting could
be inhibited by putting the journals in their own subvolume, though.

Anyway, it's difficult to consider Btrfs a general-purpose file system
if other general-purpose workloads, like journal files, are causing a
problem like wandering trees. Hence the subject of what to do about it,
which may mean short term and long term. I can't speak for systemd
developers, but if there's a different way to write to the journals
that'd be better for Btrfs and no worse for ext4 and XFS, it might be
considered.


-- 
Chris Murphy


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Chris Murphy
On Fri, Apr 28, 2017 at 11:46 AM, Peter Grandi  
wrote:

> So there are three layers of silliness here:
>
> * Writing large files slowly to a COW filesystem and
>   snapshotting it frequently.
> * A filesystem that does delayed allocation instead of
>   allocate-ahead, and does not have psychic code.
> * Working around that by using no-COW and preallocation
>   with a fixed size regardless of snapshot frequency.
>
> The primary problem here is that there is no way to have slow
> small writes and frequent snapshots without generating small
> extents: if a file is written at a rate of 1MiB/hour and gets
> snapshot every hour the extent size will not be larger than 1MiB
> *obviously*.

Sure.

But in my example there is no snapshotting, and +C is inhibited (i.e. I
set /etc/tmpfiles.d/journal-nocow.conf, which stops systemd from its new
behavior of setting +C on journals). That's resulting in a 19000+
fragment journal file. In fact snapshotting does not make it worse,
though: if it's nocow, then yes, snapshotting makes it worse than plain
nocow, but no worse than cow.
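For reference, the mechanism being toggled here is a tmpfiles.d entry; as far as I recall (treat the exact path and line as an assumption), systemd ships a stock file roughly like:

```
# /usr/lib/tmpfiles.d/journal-nocow.conf as shipped by systemd (approximate):
H /var/log/journal - - - - +C
```

Dropping an empty file with the same name into /etc/tmpfiles.d/ shadows the stock entry, which is one way to get the no-+C behaviour described above.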

What I'm trying to get at is that default Btrfs behavior and (previous)
default journald behavior have a misalignment resulting in a lot of
fragmentation. Is there a better way around this than merely setting
journals to nocow *and* making sure they stay nocow by preventing
snapshotting? If there's nothing better to be done, then I'll just
re-recommend to the systemd folks that the directory containing journals
should be made a subvolume to isolate it from inadvertent
snapshotting. If people want to snapshot it anyway, there's nothing we
can do about that.



> Filesystem-level snapshots are not designed to snapshot slowly
> growing files, but to snapshot changing collections of
> files. There are harsh tradeoffs involved. Application-level
> snapshots (also known as log rotations :->) are needed for
> special cases and finer-grained policies.
>
> The secondary problem is that a fixed preallocate of 8MiB is
> good only if in between snapshots the file grows by a little
> less than 8MiB or by substantially more.

Just to be clear, none of my own examples involve journals being
snapshot. There are no shared extents for any of those files.

-- 
Chris Murphy


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Peter Grandi
>> The gotcha though is there's a pile of data in the journal
>> that would never make it to rsyslogd. If you use journalctl
>> -o verbose you can see some of this.

> You can send *all the info* to rsyslogd via imjournal
> http://www.rsyslog.com/doc/v8-stable/configuration/modules/imjournal.html
> In my setup all the data are stored in json format in the
> /var/log/cee.log file:
> $ head /var/log/cee.log
> 2017-04-28T18:41:41.931273+02:00
> venice liblogging-stdlog: @cee: { "PRIORITY": "6", "_BOOT_ID":
> "a86d74bab91f44dc974c76aceb97141f", "_MACHINE_ID": [ ... ]

Ahh the horror the horror, I will never be able to unsee
that. The UNIX way of doing things is truly dead.

>> The same behavior happens with NTFS in qcow2 files. They
>> quickly end up with 100,000+ extents unless set nocow.
>> It's like the worst case scenario.

In a particularly demented setup I had to decatastrophize with
great pain a Zimbra QCOW2 disk image (XFS on NFS on XFS on
RAID6) containing an ever-growing Maildir email archive that
ended up with over a million widely scattered microextents:

  http://www.sabi.co.uk/blog/1101Jan.html?110116#110116


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Goffredo Baroncelli
On 2017-04-28 19:41, Chris Murphy wrote:
> On Fri, Apr 28, 2017 at 11:05 AM, Goffredo Baroncelli
>  wrote:
> 
>> In the past I faced the same problems; I collected some data here 
>> http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html.
>> Unfortunately the journald files are very bad, because first the data is 
>> written (appended), then the index fields are updated. Unfortunately these 
>> indexes are located just after the last write, so fragmentation is 
>> unavoidable.
>>
>> After some thinking I adopted a different strategy: I use journald as a 
>> collector, then forward all the logs to rsyslogd, which uses a "log append" 
>> format. Journald never writes on the root filesystem, only in tmp.
> 
> The gotcha though is there's a pile of data in the journal that would
> never make it to rsyslogd. If you use journalctl -o verbose you can
> see some of this. 

You can send *all the info* to rsyslogd via imjournal

http://www.rsyslog.com/doc/v8-stable/configuration/modules/imjournal.html
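A minimal rsyslog configuration in the spirit of this setup might look like the following; the template details are my guess (imjournal stores the journal fields under the `$!` property tree, and `$!all-json` renders that tree as JSON), so treat this as a sketch rather than the actual config behind the output below:

```
module(load="imjournal" StateFile="imjournal.state")

# Emit every message with all journald key/value pairs as @cee-tagged JSON.
template(name="cee" type="list") {
    property(name="timereported" dateFormat="rfc3339")
    constant(value=" ")
    property(name="hostname")
    constant(value=" ")
    property(name="syslogtag")
    constant(value="@cee: ")
    property(name="$!all-json")
    constant(value="\n")
}

*.* action(type="omfile" file="/var/log/cee.log" template="cee")
```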

In my setup all the data are stored in json format in the /var/log/cee.log file:


$ head  /var/log/cee.log
2017-04-28T18:41:41.931273+02:00 venice liblogging-stdlog: @cee: { "PRIORITY": 
"6", "_BOOT_ID": "a86d74bab91f44dc974c76aceb97141f", "_MACHINE_ID": 
"e84907d099904117b355a99c98378dca", "_HOSTNAME": "venice.bhome", 
"_SYSTEMD_SLICE": "system.slice", "_UID": "0", "_GID": "0", "_CAP_EFFECTIVE": 
"3f", "_TRANSPORT": "syslog", "SYSLOG_FACILITY": "23", 
"SYSLOG_IDENTIFIER": "liblogging-stdlog", "MESSAGE": " [origin 
software=\"rsyslogd\" swVersion=\"8.24.0\" x-pid=\"737\" 
x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed", "_PID": "737", 
"_COMM": "rsyslogd", "_EXE": "\/usr\/sbin\/rsyslogd", "_CMDLINE": 
"\/usr\/sbin\/rsyslogd -n", "_SYSTEMD_CGROUP": 
"\/system.slice\/rsyslog.service", "_SYSTEMD_UNIT": "rsyslog.service", 
"_SYSTEMD_INVOCATION_ID": "18b9a8b27f9143728adef972db7b394c", 
"_SOURCE_REALTIME_TIMESTAMP": "1493397701931255", "msg": "[origin 
software=\"rsyslogd\" swVersion=\"8.24.0\" x-pid=\"737\" 
x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed" }
2017-04-28T18:41:42.058549+02:00 venice liblogging-stdlog: @cee: { "PRIORITY": 
"6", "_BOOT_ID": "a86d74bab91f44dc974c76aceb97141f", "_MACHINE_ID": 
"e84907d099904117b355a99c98378dca", "_HOSTNAME": "venice.bhome", 
"_SYSTEMD_SLICE": "system.slice", "_UID": "0", "_GID": "0", "_CAP_EFFECTIVE": 
"3f", "_TRANSPORT": "syslog", "SYSLOG_FACILITY": "23", 
"SYSLOG_IDENTIFIER": "liblogging-stdlog", "MESSAGE": " [origin 
software=\"rsyslogd\" swVersion=\"8.24.0\" x-pid=\"737\" 
x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed", "_PID": "737", 
"_COMM": "rsyslogd", "_EXE": "\/usr\/sbin\/rsyslogd", "_CMDLINE": 
"\/usr\/sbin\/rsyslogd -n", "_SYSTEMD_CGROUP": 
"\/system.slice\/rsyslog.service", "_SYSTEMD_UNIT": "rsyslog.service", 
"_SYSTEMD_INVOCATION_ID": "18b9a8b27f9143728adef972db7b394c", 
"_SOURCE_REALTIME_TIMESTAMP": "1493397702058441", "msg": "[origin 
software=\"rsyslogd\" swVersion=\"8.24.0\" x-pid=\"737\" 
x-info=\"http:\/\/www.rsyslog.com\"] rsyslogd was HUPed" }
[ ... ]

All the info is stored with the same keys/values as journald uses.

I developed a utility (called clp) which allows querying the log by key, 
filtering by boot number, by date.

For example, to show all the log entries related to rsyslog:

$ clp log -t full-details _SYSTEMD_CGROUP=/system.slice/rsyslog.service 

2017-04-21 19:12:29.579748 MESSAGE= [origin software="rsyslogd" 
swVersion="8.24.0" x-pid="804" x-info="http://www.rsyslog.com"] rsyslogd was 
HUPed
   PRIORITY=6
   SYSLOG_FACILITY=23
   SYSLOG_IDENTIFIER=liblogging-stdlog
   _BOOT_ID=d77198380c9344248e01166fbd8d60df
   _CAP_EFFECTIVE=3f
   _CMDLINE=/usr/sbin/rsyslogd -n
   _COMM=rsyslogd
   _EXE=/usr/sbin/rsyslogd
   _GID=0
   _HOSTNAME=venice.bhome
   _LOGFILEINITLINE=2017-04-21T19:12:29.579768+02:00 
venice liblogging-stdlog: 
   _LOGFILELINENUMBER=1
   _LOGFILENAME=/var/log/cee.log.7.gz
   _LOGFILETIMESTAMP=1492794749579768
   _MACHINE_ID=e84907d099904117b355a99c98378dca
   _PID=804
   _SOURCE_REALTIME_TIMESTAMP=1492794749579748
   _SYSTEMD_CGROUP=/system.slice/rsyslog.service
   _SYSTEMD_INVOCATION_ID=8f9cb6c871be4158a3ccb374f4323027
   _SYSTEMD_SLICE=system.slice
   _SYSTEMD_UNIT=rsyslog.service
   _TRANSPORT=syslog
   _UID=0
   msg=[origin software="rsyslogd" swVersion="8.24.0" 
x-pid="804" 

Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Peter Grandi
> [ ... ] And that makes me wonder whether metadata
> fragmentation is happening as a result. But in any case,
> there's a lot of metadata being written for each journal
> update compared to what's being added to the journal file. [
> ... ]

That's the "wandering trees" problem in COW filesystems, and
manifestations of it in Btrfs have also been reported before.
If a workload triggers a lot of "wandering trees"
updates, then perhaps a filesystem that has "wandering trees"
should not be used :-).

> [ ... ] worse, a single file with 20,000 fragments; or 40,000
> separate journal files? *shrug* [ ... ]

Well, depends, but probably the single file: it is more likely
that the 20,000 fragments will actually be contiguous, and that
there will be less metadata IO than for 40,000 separate journal
files.

The deeper "strategic" issue is that storage systems and
filesystems in particular have very anisotropic performance
envelopes, and mismatches between the envelopes of application
and filesystem can be very expensive:
  http://www.sabi.co.uk/blog/15-two.html?151023#151023


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Peter Grandi
> Old news is that systemd-journald journals end up pretty
> heavily fragmented on Btrfs due to COW.

This has been discussed in detail before, indeed here, but also
here: http://www.sabi.co.uk/blog/15-one.html?150203#150203

> While journald uses chattr +C on journal files now, COW still
> happens if the subvolume the journal is in gets snapshot. e.g.
> a week old system.journal has 19000+ extents. [ ... ]  It
> appears to me (see below URLs pointing to example journals)
> that journald fallocated in 8MiB increments but then ends up
> doing 4KiB writes; [ ... ]

So there are three layers of silliness here:

* Writing large files slowly to a COW filesystem and
  snapshotting it frequently.
* A filesystem that does delayed allocation instead of
  allocate-ahead, and does not have psychic code.
* Working around that by using no-COW and preallocation
  with a fixed size regardless of snapshot frequency.

The primary problem here is that there is no way to have slow
small writes and frequent snapshots without generating small
extents: if a file is written at a rate of 1MiB/hour and gets
snapshot every hour the extent size will not be larger than 1MiB
*obviously*.

Filesystem-level snapshots are not designed to snapshot slowly
growing files, but to snapshot changing collections of
files. There are harsh tradeoffs involved. Application-level
snapshots (also known as log rotations :->) are needed for
special cases and finer-grained policies.

The secondary problem is that a fixed preallocate of 8MiB is
good only if in between snapshots the file grows by a little
less than 8MiB or by substantially more.


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Chris Murphy
On Fri, Apr 28, 2017 at 11:05 AM, Goffredo Baroncelli
 wrote:

> In the past I faced the same problems; I collected some data here 
> http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html.
> Unfortunately the journald files are very bad, because first the data is 
> written (appended), then the index fields are updated. Unfortunately these 
> indexes are located just after the last write, so fragmentation is 
> unavoidable.
>
> After some thinking I adopted a different strategy: I use journald as a 
> collector, then forward all the logs to rsyslogd, which uses a "log append" 
> format. Journald never writes on the root filesystem, only in tmp.

The gotcha though is there's a pile of data in the journal that would
never make it to rsyslogd. If you use journalctl -o verbose you can
see some of this. There's a bunch of extra metadata in the journal.
And then also filtering based on that metadata is useful rather than
being limited to grep on a syslog file. Which, you know, is fine for 
many use cases. I guess I'm just interested in whether there's an 
enhancement that can be done to make journals more compatible with
Btrfs or vice versa. It's not a huge problem anyway.


>
> The thing became interesting when I discovered that searching in a 
> rsyslog file is faster than journalctl (on rotational media). Unfortunately 
> I don't have any data to support this.


Yes, on drives all of these scattered extents cause a lot of head
seeking. And I suspect there's also a lot of metadata spread out
everywhere, to account for all of these extents. That's why they
moved to chattr +C to make them nocow. An idea I had on the systemd list
was to automatically make the journal directory a Btrfs subvolume,
similar to how systemd already creates a /var/lib/machines subvolume
for nspawn containers. This prevents the journals from being caught up
in a snapshot of the parent subvolume that typically contains the
journals (root fs). There's no practical use I can think of for
snapshotting logs. You'd really want the logs to always be linear,
contiguous, and never get rolled back. Even if something in the system
does get rolled back, you'd want the logs to show that and continue
on, rather than being rolled back themselves.

So the super simple option would be continue with +C on journals, and
then a separate subvolume to prevent COW from ever happening
inadvertently.
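The steps above can be sketched as follows (hypothetical commands: they need root, /var/log on btrfs, and the standard journal path; the fstype guard just makes the sketch safe to run elsewhere):

```shell
# Make the journal directory its own subvolume so snapshots of the
# parent subvolume never touch it, and mark it nocow so new journal
# files inherit No_COW.
dir=/var/log/journal
fstype=$(stat -f -c %T "$(dirname "$dir")")
if [ "$fstype" = "btrfs" ]; then
    mv "$dir" "$dir.old"
    btrfs subvolume create "$dir"
    chattr +C "$dir"                     # new files inherit No_COW
    cp -a "$dir.old/." "$dir/"           # rewritten copies get fresh extents
    rm -rf "$dir.old"
else
    echo "skipping: $(dirname "$dir") is not btrfs ($fstype)"
fi
```

Note that +C only takes full effect on files created after the attribute is set, which is why the journals are copied into the new subvolume rather than moved.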

The same behavior happens with NTFS in qcow2 files. They quickly end
up with 100,000+ extents unless set nocow. It's like the worst case
scenario.

-- 
Chris Murphy


Re: btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Goffredo Baroncelli
On 2017-04-28 18:16, Chris Murphy wrote:
> Old news is that systemd-journald journals end up pretty heavily
> fragmented on Btrfs due to COW. While journald uses chattr +C on
> journal files now, COW still happens if the subvolume the journal is
> in gets snapshot. e.g. a week old system.journal has 19000+ extents.
> 
> The news is I started a systemd thread.
> 
> This is the start:
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html
> 
> Where it gets interesting, two messages by Andrei Borzenkov: He
> evaluates existing code and does some tests on ext4 and XFS.
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038728.html
> 
> And then the question.
> https://lists.freedesktop.org/archives/systemd-devel/2017-April/038735.html
> 
> Given what journald is doing, is what Btrfs is doing expected? Is
> there something it could do better to be more like ext4 and XFS in the
> same situation? Or is it out of scope for Btrfs?

In the past I faced the same problems; I collected some data here 
http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html.
Unfortunately the journald files are very bad, because first the data is 
written (appended), then the index fields are updated. Unfortunately these 
indexes are located just after the last write, so fragmentation is 
unavoidable.

After some thinking I adopted a different strategy: I use journald as a 
collector, then forward all the logs to rsyslogd, which uses a "log append" 
format. Journald never writes on the root filesystem, only in tmp.

The thing became interesting when I discovered that searching in a rsyslog 
file is faster than journalctl (on rotational media). Unfortunately I don't 
have any data to support this. 
However if someone is interested I can share more details.

BR
G.Baroncelli



-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


btrfs, journald logs, fragmentation, and fallocate

2017-04-28 Thread Chris Murphy
Old news is that systemd-journald journals end up pretty heavily
fragmented on Btrfs due to COW. While journald uses chattr +C on
journal files now, COW still happens if the subvolume the journal is
in gets snapshot. e.g. a week old system.journal has 19000+ extents.

The news is I started a systemd thread.

This is the start:
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html

Where it gets interesting, two messages by Andrei Borzenkov: He
evaluates existing code and does some tests on ext4 and XFS.
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038724.html
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038728.html

And then the question.
https://lists.freedesktop.org/archives/systemd-devel/2017-April/038735.html

Given what journald is doing, is what Btrfs is doing expected? Is
there something it could do better to be more like ext4 and XFS in the
same situation? Or is it out of scope for Btrfs?

It appears to me (see the URLs below pointing to example journals) that
journald fallocates in 8MiB increments but then ends up doing 4KiB
writes; there are a lot of these unused (unwritten) 8MiB extents that
appear in both filefrag and btrfs-debug -f outputs.
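The pattern is easy to reproduce on a scratch file (any filesystem will do for the allocation itself; the 8 MiB and 4 KiB sizes are taken from the description above, and the filefrag extent map is only meaningful where FIEMAP is supported):

```shell
# Preallocate 8 MiB, then do a handful of 4 KiB writes into it, mimicking
# the fallocate-then-small-append pattern described above.
f=$(mktemp)
fallocate -l 8M "$f"                      # 8 MiB of preallocated space
i=0
while [ $i -lt 4 ]; do                    # four small 4 KiB "appends"
    dd if=/dev/urandom of="$f" bs=4k count=1 seek=$i \
       conv=notrunc status=none
    i=$((i + 1))
done
stat -c 'apparent size: %s bytes' "$f"
filefrag -v "$f" 2>/dev/null | head -n 8  # extent map, where supported
```

On btrfs the map shows the written 4 KiB pieces as separate extents alongside the remaining unwritten preallocated space.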

The +C idea just rearranges the deck chairs, it's not solving the
underlying problem except in the case where the containing subvolume
is never snapshot. And in the COW case, I'm seeing about 30 metadata
nodes being written out for what amounts to less than a 4KiB journal
append. Each time.

And that makes me wonder whether metadata fragmentation is happening
as a result. But in any case, there's a lot of metadata being written
for each journal update compared to what's being added to the journal
file.

And then that makes me wonder if a better optimization on Btrfs would
be having each write be a separate file. The small updates would have
data inline. Which is worse, a single file with 20,000 fragments; or
40,000 separate journal files? *shrug* At least those individual files
would be subject to compression with +c; whereas right now the open
endedness of the active journal means it has not a single compressed
extent. Only once rotated do they get compressed (via the
defragmentation that journald does only on Btrfs). Journals contain
highly compressible data.



Anyway, two example journals. The parent directory has chattr +c, both
journals inherited it. The first URL is filefrag -v, the 2nd is
btrfs-debug -f; for each journal.

This is a rotated journal. Upon rotation on Btrfs, journald
defragments the file, which ends up compressing it when chattr +c is set.
https://da.gd/4NKyq
https://da.gd/zEeYW

This is an active system.journal. No compressed extents (the writes I
think are too small).
https://da.gd/cBjX
https://da.gd/YXuI


Extra credit if you've followed this far... The rotated log has piles
of unwritten extents in it that make it fairly inefficient even with
compression. Just by using cat to write its contents to a new file, the
compression ratio goes from 1.27 to 5.70. Here are the results
after catting that file:
https://da.gd/rE8KT
https://da.gd/PD5qI



-- 
Chris Murphy