Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Chris Murphy
On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov  wrote:

> strace -p $(pgrep systemd-journal)
>
> You will not see actual writes as file is memory mapped, but it
> definitely does not do any fsync() every so often.

https://paste.fedoraproject.org/paste/oVT-tsU2sBOdTJaZxGua-15M1UNdIGYhyRLivL9gydE=

That's just a partial, but the complete output captured for a couple
minutes doesn't contain an fsync.

Then I did this for 8 minutes, strace -c -f -p $(pgrep systemd-journal)

https://paste.fedoraproject.org/paste/Uzc2KhkkaqLOU8USLd38B15M1UNdIGYhyRLivL9gydE=

So 6 fsyncs in 8 minutes; more than 1 per 5 minutes, but not nearly as
many as I thought. So maybe as you say it's just memory mapped
activity I'm seeing with stat and filefrag.
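
For anyone repeating this measurement, here is a minimal sketch. It assumes the strace output was saved to a log file; the log path and the helper names count_fsyncs and sync_interval are made up for illustration. It counts fsync()/fdatasync() lines and turns a count into an average interval.

```shell
#!/bin/sh
# Hypothetical helper: count fsync()/fdatasync() calls in a saved strace log.
# Capture the log first, for example (needs root, runs for 8 minutes):
#   timeout 480 strace -f -e trace=fsync,fdatasync -o /tmp/journald.strace \
#       -p "$(pgrep systemd-journal)"
count_fsyncs() {
    # With -f each line may start with a PID, so do not anchor at line start.
    grep -cE '(^|[^a-z_])f(data)?sync\(' "$1"
}

# Average seconds between syncs for a capture of $1 seconds with $2 syncs.
sync_interval() {
    echo $(( $1 / $2 ))
}

sync_interval 480 6    # 6 fsyncs in 8 minutes
```

With the numbers above this prints 80, so roughly one sync every 80 seconds on average.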


-- 
Chris Murphy
___
systemd-devel mailing list
systemd-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/systemd-devel


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Chris Murphy
On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov  wrote:

>> I have no idea if it's fsync or what. How can I tell?
>>
>
> strace -p $(pgrep systemd-journal)
>
> You will not see actual writes as file is memory mapped, but it
> definitely does not do any fsync() every so often.

Also found this.

https://lwn.net/Articles/306046/

Not sure how to enable and use it though.

-- 
Chris Murphy


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Chris Murphy
On Mon, Apr 17, 2017 at 10:05 PM, Andrei Borzenkov  wrote:
> 18.04.2017 06:50, Chris Murphy wrote:

>>> What exactly "changes" mean? Write() syscall?
>>
>> filefrag reported entries increase, it's using FIEMAP.
>>
>
> So far it sounds like btrfs allocates new extent on every write to
> journal file. Each journal record itself is relatively small indeed.

Hence it would be better if there were no fsync, so that Btrfs can
accumulate these writes and flush them in its own commit (30s by
default).

It is likely that the ssd allocation option on these ssd's is a factor
in fragmentation because it's trying to allocate to a unique 2MB
section based on expected erase block size. There's a lot of
discussion going on right now on the Btrfs list whether these
assumptions are still true, and in what cases maybe we should be using
nossd on higher end SSD's and NVMe.

What's for sure though is that with any of these allocators, nocow is
not good for lower end SSDs like SD cards; all that does is ask to
write to the same LBA over and over and over again, for a journal. And
it just increases write amplification unnecessarily. So I'm beginning
to think that on SSDs, it's better if journald did +c rather than +C
on journals. But there's still some researching to do.

I definitely think /var/log/journal/ should be a subvolume
to avoid its contents being snapshotted. Snapshotting does make the
fragmentation problem worse.

I also think the defragmentation feature should be disabled, at least
on SSDs, or should include zlib compression. The write amplification on
SSDs is worse than just leaving the file fragmented.



>
>> Also with stat I see the times (all three) change on the file. If I go
>> to GNOME Terminal and just sudo some command, that itself causes the
>> current system.journal file to get all three times modified. It
>> happens immediately, there's no delay. So if I'm doing something like
>> drm.debug=0x1e, which is spitting a bunch of stuff to dmesg and thus
>> the journal, it's just constantly writing stuff to the journal. This
>> is without anything running journalctl -f or reading the journal.
>>
>>>
>>>> #Storage=auto
>>>> #Compress=yes
>>>> #Seal=yes
>>>> #SplitMode=uid
>>>> #SyncIntervalSec=5m
>>>
>>> This controls how often systemd calls fsync() on currently active
>>> journal file. Do you see fsync() every 3 seconds?
>>
>> I have no idea if it's fsync or what. How can I tell?
>>
>
> strace -p $(pgrep systemd-journal)
>
> You will not see actual writes as file is memory mapped, but it
> definitely does not do any fsync() every so often.
>
> Is it possible that btrfs behavior you observe is specific to memory
> mapped files handling?

Maybe. But even after a reboot I see the same extent entries in the
file. Granted a good deal of these 1 block entries have addresses that
are one after the other so they often make up larger contiguous
extents, but they still have separate entries.


>
>> Also, I don't think these journal files are being compressed.
>>
>> Using the btrfs-progs/btrfs-debugfs script on a few user journal
>> files, I'm seeing massive compression ratios. Maybe I'll try
>> Compress=No and see if there's a change.
>>
>
> Only actual message payload above some threshold (I think 256 or 512
> bytes, not sure) is compressed; everything else is not. For average
> syslog-type messages payload is far too small. This is really only
> interesting when you store core dump or similar.

Interesting I see. Thanks.

I'll try strace and see what's going on.


-- 
Chris Murphy


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Andrei Borzenkov
18.04.2017 06:50, Chris Murphy wrote:
> On Mon, Apr 17, 2017 at 9:42 PM, Andrei Borzenkov  wrote:
>> 17.04.2017 22:49, Chris Murphy wrote:
>>> On Mon, Apr 17, 2017 at 11:27 AM, Andrei Borzenkov wrote:
>>>> 17.04.2017 19:25, Chris Murphy wrote:
>>>>> This explains one system's fragmented journals; but the other system
>>>>> isn't snapshotting journals and I haven't figured out why they're so
>>>>> fragmented. No snapshots, and they are all +C at create time
>>>>> (systemd-journald default on Btrfs). Is it possible to prevent
>>>>> journald from setting +C on /var/log/journal and
>>>>> /var/log/journal/? If I remove them, at next boot they get
>>>>> reset, so any new journals created inherit that.
>>>>>
>>>>
>>>> Yes, should be possible by creating empty
>>>> /etc/tmpfiles.d/journal-nocow.conf.
>>>
>>> OK super.
>>>
>>> How about inhibiting the defragmentation on rotate? I'm suspicious one
>>> of the things I'm seeing is due to ssd optimization mount options, but
>>> I need to see the predefrag state of the files.
>>>
>>> Why do I see so many changes to the journal file, once every 2-5
>>> seconds? This adds 4096 byte blocks to the file each time, and when
>>> cow, that'd explain why there are so many fragments.
>>>
>>
>>
>> What exactly "changes" mean? Write() syscall?
> 
> filefrag reported entries increase, it's using FIEMAP.
> 

So far it sounds like btrfs allocates new extent on every write to
journal file. Each journal record itself is relatively small indeed.

> Also with stat I see the times (all three) change on the file. If I go
> to GNOME Terminal and just sudo some command, that itself causes the
> current system.journal file to get all three times modified. It
> happens immediately, there's no delay. So if I'm doing something like
> drm.debug=0x1e, which is spitting a bunch of stuff to dmesg and thus
> the journal, it's just constantly writing stuff to the journal. This
> is without anything running journalctl -f or reading the journal.
> 
>>
>>> #Storage=auto
>>> #Compress=yes
>>> #Seal=yes
>>> #SplitMode=uid
>>> #SyncIntervalSec=5m
>>
>> This controls how often systemd calls fsync() on currently active
>> journal file. Do you see fsync() every 3 seconds?
> 
> I have no idea if it's fsync or what. How can I tell?
> 

strace -p $(pgrep systemd-journal)

You will not see actual writes as file is memory mapped, but it
definitely does not do any fsync() every so often.

Is it possible that btrfs behavior you observe is specific to memory
mapped files handling?

> Also, I don't think these journal files are being compressed.
> 
> Using the btrfs-progs/btrfs-debugfs script on a few user journal
> files, I'm seeing massive compression ratios. Maybe I'll try
> Compress=No and see if there's a change.
> 

Only actual message payload above some threshold (I think 256 or 512
bytes, not sure) is compressed; everything else is not. For average
syslog-type messages payload is far too small. This is really only
interesting when you store core dump or similar.

> file: 
> user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-00059b73-00054d51b3f442ff.journal
> extents 64 disk size 294912 logical size 8388608 ratio 28.44
> file: 
> user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-0002ec5b-00054d4ebb7114e7.journal
> extents 64 disk size 278528 logical size 8388608 ratio 30.12
> file: 
> user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-06e5-00054c3c32607483.journal
> extents 320 disk size 5206016 logical size 41943040 ratio 8.06
> 



[systemd-devel] feature request: implement macsec interface configuration in systemd-networkd

2017-04-17 Thread george Nopicture
Are there any plans to implement macsec interface configuration in
systemd-networkd? Since it's already in the kernel as a loadable
module, Fedora is missing a patched iproute2 to support macsec and also
lacks automatic interface configuration (I don't know if NM supports it),
preferably from systemd-networkd.



Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Chris Murphy
On Mon, Apr 17, 2017 at 9:42 PM, Andrei Borzenkov  wrote:
> 17.04.2017 22:49, Chris Murphy wrote:
>> On Mon, Apr 17, 2017 at 11:27 AM, Andrei Borzenkov wrote:
>>> 17.04.2017 19:25, Chris Murphy wrote:
>>>> This explains one system's fragmented journals; but the other system
>>>> isn't snapshotting journals and I haven't figured out why they're so
>>>> fragmented. No snapshots, and they are all +C at create time
>>>> (systemd-journald default on Btrfs). Is it possible to prevent
>>>> journald from setting +C on /var/log/journal and
>>>> /var/log/journal/? If I remove them, at next boot they get
>>>> reset, so any new journals created inherit that.
>>>>
>>>
>>> Yes, should be possible by creating empty
>>> /etc/tmpfiles.d/journal-nocow.conf.
>>
>> OK super.
>>
>> How about inhibiting the defragmentation on rotate? I'm suspicious one
>> of the things I'm seeing is due to ssd optimization mount options, but
>> I need to see the predefrag state of the files.
>>
>> Why do I see so many changes to the journal file, once every 2-5
>> seconds? This adds 4096 byte blocks to the file each time, and when
>> cow, that'd explain why there are so many fragments.
>>
>
>
> What exactly "changes" mean? Write() syscall?

filefrag reported entries increase, it's using FIEMAP.

Also with stat I see the times (all three) change on the file. If I go
to GNOME Terminal and just sudo some command, that itself causes the
current system.journal file to get all three times modified. It
happens immediately, there's no delay. So if I'm doing something like
drm.debug=0x1e, which is spitting a bunch of stuff to dmesg and thus
the journal, it's just constantly writing stuff to the journal. This
is without anything running journalctl -f or reading the journal.
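
A rough way to watch this from a shell, as a sketch only: the helper names file_times and watch_times are invented here, and GNU coreutils stat is assumed. It polls a file and prints a line whenever any of the three timestamps changes.

```shell
#!/bin/sh
# Hypothetical helper: print access, modify and change times of a file
# on one line (GNU coreutils stat, nanosecond resolution).
file_times() {
    stat -c '%x | %y | %z' "$1"
}

# Poll a file and report each time any of the three timestamps changes.
# Arguments: path, polling interval in seconds, number of polls.
watch_times() {
    last=$(file_times "$1")
    i=0
    while [ "$i" -lt "$3" ]; do
        sleep "$2"
        now=$(file_times "$1")
        if [ "$now" != "$last" ]; then
            echo "changed: $now"
            last=$now
        fi
        i=$((i + 1))
    done
}

# Example, run as root from inside the journal directory:
#   watch_times system.journal 1 60
```

Polling once a second is coarse, but it is enough to tell a change every few seconds apart from a change every 5 minutes.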

>
>> #Storage=auto
>> #Compress=yes
>> #Seal=yes
>> #SplitMode=uid
>> #SyncIntervalSec=5m
>
> This controls how often systemd calls fsync() on currently active
> journal file. Do you see fsync() every 3 seconds?

I have no idea if it's fsync or what. How can I tell?

Also, I don't think these journal files are being compressed.

Using the btrfs-progs/btrfs-debugfs script on a few user journal
files, I'm seeing massive compression ratios. Maybe I'll try
Compress=No and see if there's a change.

file: 
user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-00059b73-00054d51b3f442ff.journal
extents 64 disk size 294912 logical size 8388608 ratio 28.44
file: 
user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-0002ec5b-00054d4ebb7114e7.journal
extents 64 disk size 278528 logical size 8388608 ratio 30.12
file: 
user-1000@6532e07ad7104b1c94d26a5b0fb2ad6e-06e5-00054c3c32607483.journal
extents 320 disk size 5206016 logical size 41943040 ratio 8.06

-- 
Chris Murphy


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Chris Murphy
On Mon, Apr 17, 2017 at 11:27 AM, Andrei Borzenkov  wrote:
> 17.04.2017 19:25, Chris Murphy wrote:
>> This explains one system's fragmented journals; but the other system
>> isn't snapshotting journals and I haven't figured out why they're so
>> fragmented. No snapshots, and they are all +C at create time
>> (systemd-journald default on Btrfs). Is it possible to prevent
>> journald from setting +C on /var/log/journal and
>> /var/log/journal/? If I remove them, at next boot they get
>> reset, so any new journals created inherit that.
>>
>
> Yes, should be possible by creating empty
> /etc/tmpfiles.d/journal-nocow.conf.

OK super.

How about inhibiting the defragmentation on rotate? I'm suspicious one
of the things I'm seeing is due to ssd optimization mount options, but
I need to see the predefrag state of the files.

Why do I see so many changes to the journal file, once every 2-5
seconds? This adds 4096 byte blocks to the file each time, and when
cow, that'd explain why there are so many fragments.

#Storage=auto
#Compress=yes
#Seal=yes
#SplitMode=uid
#SyncIntervalSec=5m
#RateLimitIntervalSec=30s
#RateLimitBurst=1000

A change every 5m is not what I'm seeing with stat. I have no crit,
emerg, or alert messages happening. Just a bunch of drm debug messages
which are constant. But if the flush should only happen every 5
minutes, I'm confused.


-- 
Chris Murphy


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Chris Murphy
Here's an example rotated log (Btrfs, NVMe, no compression, default
ssd mount option). As you can see it takes up more space on disk than
it contains data, so there's a lot of slack space for some reason,
despite /etc/systemd/journald.conf being unmodified and thus
Compress=Yes.

file: 
system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
extents 41 disk size 143511552 logical size 100663296 ratio 0.70

$ sudo btrfs fi defrag -c \
    system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal

And now:

file: 
system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
extents 768 disk size 21504000 logical size 100663296 ratio 4.68

That's nearly 1/7th the size. The existing defrag without compression
is probably just increasing write amplification on SSDs. If a file is
badly fragmented, just leave it alone.
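
The ratios can be rechecked from the byte counts alone. This is a small sketch; the helper name size_ratio is made up, and it only redoes the division that the btrfs-debugfs summary line reports.

```shell
#!/bin/sh
# Hypothetical helper: ratio of logical size to on-disk size, both in bytes,
# printed with two decimals like the btrfs-debugfs summary line.
size_ratio() {
    awk -v logical="$1" -v disk="$2" 'BEGIN { printf "%.2f\n", logical / disk }'
}

size_ratio 100663296 143511552   # before: disk size larger than logical size
size_ratio 100663296 21504000    # after "btrfs fi defrag -c"
size_ratio 143511552 21504000    # on-disk shrink factor between the two
```

This prints 0.70, 4.68 and 6.67, so the compressed file takes a bit under 1/6.67 of the space the original did.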

This also works on nocow journals with +C set, although I'm not sure
whether this is intended behavior (I thought nocow implies no
compression); so I've asked about that on the Btrfs list.

Chris Murphy


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Andrei Borzenkov
17.04.2017 19:25, Chris Murphy wrote:
> This explains one system's fragmented journals; but the other system
> isn't snapshotting journals and I haven't figured out why they're so
> fragmented. No snapshots, and they are all +C at create time
> (systemd-journald default on Btrfs). Is it possible to prevent
> journald from setting +C on /var/log/journal and
> /var/log/journal/? If I remove them, at next boot they get
> reset, so any new journals created inherit that.
> 

Yes, should be possible by creating empty
/etc/tmpfiles.d/journal-nocow.conf.
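
As a sketch of that suggestion: an empty file under /etc/tmpfiles.d takes precedence over a packaged file of the same name, so tmpfiles should stop re-applying +C at boot. The function name and the root-prefix argument are assumptions added here so it can be tried in a scratch directory first.

```shell
#!/bin/sh
# Sketch: mask the packaged journal-nocow.conf by creating an empty file of
# the same name under /etc/tmpfiles.d. The optional first argument is a
# root prefix (an assumption added here so the function can be tried in a
# scratch directory); leave it empty to act on the real system as root.
mask_journal_nocow() {
    root=${1:-}
    mkdir -p "$root/etc/tmpfiles.d" &&
        : > "$root/etc/tmpfiles.d/journal-nocow.conf"
}

# Try it in a scratch directory first:
scratch=$(mktemp -d)
mask_journal_nocow "$scratch"
ls -l "$scratch/etc/tmpfiles.d/journal-nocow.conf"
```

The +C flag already set on the existing directories would still have to be removed by hand with chattr -C; this only keeps it from coming back.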




Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Chris Murphy
On Mon, Apr 17, 2017 at 3:57 AM, Lennart Poettering
 wrote:

>> I do manual snapshots before software updates, which means new writes
>> to these files are subject to COW, but additional writes to the same
>> extents are overwrites and are not COW because of chattr +C. I've used
>> this same strategy for a long time, since systemd-journald defaults to
>> +C for journal files; but I've not seen them get this fragmented this
>> quickly.
>>
>
> IIRC NOCOW only has an effect if set right after the file is created
> before the first write to it is done. Or in other words, you cannot
> retroactively make a file NOCOW. This means that if you in one way or
> another make a COW copy of a file (through reflinking — implicit or
> not, note that "cp" reflinks by default — or through snapshotting or
> something else) the file is COW and you'll get fragmentation.

Correct.

There are three states for files on Btrfs: cow (normal), nocow (+C),
and a snapshot of a nocow (+C) file which is "cowandthennocow" or
whatever you want to call it. But yes a snapshot of a nocow file does
fragment a ton, but then becomes nocow and won't fragment more.

This explains one system's fragmented journals; but the other system
isn't snapshotting journals and I haven't figured out why they're so
fragmented. No snapshots, and they are all +C at create time
(systemd-journald default on Btrfs). Is it possible to prevent
journald from setting +C on /var/log/journal and
/var/log/journal/? If I remove them, at next boot they get
reset, so any new journals created inherit that.

Anyway, snapshots of journals on Btrfs should be avoided for other
reasons. The autocleaning features (SystemMaxUse=, SystemKeepFree=, as
well as --vacuum-size=) don't work correctly when there are
snapshots of journals. Even when journald deletes journals, their
extents are pinned by snapshots, so they still take up the same space.
Basically journald could get into a situation where it deletes all
journals it sees, but no space is freed up because those journals are
stuck in a snapshot.


> I am not entirely sure what to recommend you. Ultimately whether btrfs
> fragments or not, is probably something you have to discuss with the
> btrfs folks. We do try to make the best of btrfs, by managing the COW
> flag, but this only helps you to a limited degree as
> snapshots/reflinks will fuck things up anyway...

Definitely.

An easy solution would be for journald to create
/var/log/journal/ as a subvolume instead of a directory.
This will make journals immune to snapshots of the containing
subvolume (typically root fs). Of course systemd already makes
subvolumes behind the scenes for other sane reasons like
/var/lib/machines.
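
To see which case a given system is in, a rough check along these lines may help. It is a sketch that assumes GNU coreutils stat and relies on the Btrfs convention that a subvolume root has inode number 256; the function name is made up.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given directory is the root of a Btrfs
# subvolume. Relies on the Btrfs convention that a subvolume root has inode
# number 256 (GNU coreutils stat assumed).
is_btrfs_subvolume() {
    [ -d "$1" ] || return 1
    [ "$(stat -f -c %T "$1")" = "btrfs" ] || return 1
    [ "$(stat -c %i "$1")" = "256" ]
}

if is_btrfs_subvolume /var/log/journal; then
    echo "/var/log/journal is a Btrfs subvolume"
else
    echo "/var/log/journal is not a Btrfs subvolume"
fi
```

Converting an existing directory would mean moving it aside, running btrfs subvolume create, and copying the files back; that part is deliberately left out here.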

Snapshotting logs strikes me as an invalid use case anyway. Anyone
would want logs immune to rollback; rolling them back would defeat
troubleshooting and auditing. Logs should be linear and continuous, not
rolled back. The snapshotting is arguably a mistake, due to lack of user
understanding of the consequences. It is admittedly esoteric.


> We also ask btrfs to defrag the file as soon as we mark it as
> archived... I'd even be willing to extend on that, and defrag the file
> on other events too, for example if it ends up being too heavily
> fragmented. But last time I looked btrfs didn't have any nice API for
> that, that would have a clear focus on a single file only...

The biggest issue with them is they take up a lot of space and very
inconsistently defragment. Depending on kernel version they can become
magnificently larger.

Speaking of which, even with Compress=Yes (default), the journal files
are highly compressible. By copying some to a Btrfs volume with the
compress mount option (which does not force compression; it gives up
easily on already compressed data), I'm finding 4-6x smaller files.
This is the last line for a couple journals, from
btrfs-progs/btrfs-debugfs:

file: 
system@01b44589014542e3b48df31f152c0916-ca2b-00054546539416e8.journal
extents 384 disk size 9691136 logical size 50331648 ratio 5.19
file: 
system@2547b430ecdc441c9cf82569eeb22065-0001-00054c3c31bec567.journal
extents 768 disk size 21504000 logical size 100663296 ratio 4.68

If there is a way to optimize this compression when rotating logs,
read-compress-write, this means defragmentation isn't needed on Btrfs,
and all file systems gain the benefit of much smaller logs.


-- 
Chris Murphy


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Kai Krakow
On Mon, 17 Apr 2017 16:01:48 +0200,
Kai Krakow wrote:

> > We also ask btrfs to defrag the file as soon as we mark it as
> > archived...  
> 
> This makes sense. And I've learned that journal on btrfs works much
> better if you use many small files vs. a few big files. I've currently
> set the journal size limit to 8 MB for that reason which gives me very
> good performance.

Hmm well, just looked, I eventually stopped doing that, probably when
you introduced defragging the archived journals. But I see no journal
file being bigger than 128M which seems to work well.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Kai Krakow
On Mon, 17 Apr 2017 11:57:21 +0200,
Lennart Poettering wrote:

> On Sun, 16.04.17 14:30, Chris Murphy (li...@colorremedies.com) wrote:
> 
> > Hi,
> > 
> > This is on a Fedora 26 workstation (systemd-233-3.fc26.x86_64)
> > that's maybe a couple weeks old and was clean installed. Drive is
> > NVMe.
> > 
> > 
> > # filefrag *
> > system.journal: 9283 extents found
> > user-1000.journal: 3437 extents found
> > # lsattr
> > C-- ./system.journal
> > C-- ./user-1000.journal
> > 
> > I do manual snapshots before software updates, which means new
> > writes to these files are subject to COW, but additional writes to
> > the same extents are overwrites and are not COW because of chattr
> > +C. I've used this same strategy for a long time, since
> > systemd-journald defaults to +C for journal files; but I've not
> > seen them get this fragmented this quickly.
> >  
> 
> IIRC NOCOW only has an effect if set right after the file is created
> before the first write to it is done. Or in other words, you cannot
> retroactively make a file NOCOW. This means that if you in one way or
> another make a COW copy of a file (through reflinking — implicit or
> not, note that "cp" reflinks by default — or through snapshotting or
> something else) the file is COW and you'll get fragmentation.

To mark a file nocow, it has to exist with zero bytes and must never
have been written to. The nocow attribute (chattr +C) will be inherited from
the directory upon creation of a file. So the best way to go is setting
+C on the directory and all future files of the journal would be nocow.
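
A small sketch for checking the flag: the helper name cow_state is made up, and it only classifies a line of lsattr output by looking for C in the attribute field. The sample lines are illustrative, not captured output.

```shell
#!/bin/sh
# Hypothetical helper: classify one line of lsattr output as nocow or cow
# by looking for the C flag in the attribute field (the first word).
cow_state() {
    case "${1%% *}" in
        *C*) echo nocow ;;
        *)   echo cow ;;
    esac
}

cow_state "---------------C-- ./system.journal"
cow_state "------------------ ./example.file"

# On a real Btrfs system (as root), new files inherit the flag:
#   chattr +C /var/log/journal
#   lsattr -d /var/log/journal
```

The match is case sensitive on purpose, since lowercase c is the separate compression attribute.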

You can still do snapshots, nocow doesn't prohibit that and doesn't
make journals cow again. What happens is that btrfs simply unshares
extents as soon as you write to the snapshot. The newly created extent
itself will behave like nocow again. If the extents are big enough,
this shouldn't introduce any serious fragmentation, just waste space.
Btrfs won't split extents upon unsharing them during a write. It may,
however, "replace" only part of the unshared extent thus making three
new: two sharing the old copy, one having the new data. But since
journals are append only, that should be no problem. It's just that the
data is written so slowly that writes almost never get combined into
one single write, resulting in many extents.

> I am not entirely sure what to recommend you. Ultimately whether btrfs
> fragments or not, is probably something you have to discuss with the
> btrfs folks. We do try to make the best of btrfs, by managing the COW
> flag, but this only helps you to a limited degree as
> snapshots/reflinks will fuck things up anyway...

Well, usually you shouldn't have to manage the cow flag at all: Just
set it once for the newly created journal directory and everything is
fine. And even then, people may not want this so they could easily
unset the flag on the directory and rotate the journal.

> We also ask btrfs to defrag the file as soon as we mark it as
> archived...

This makes sense. And I've learned that journal on btrfs works much
better if you use many small files vs. a few big files. I've currently
set the journal size limit to 8 MB for that reason which gives me very
good performance.

> I'd even be willing to extend on that, and defrag the file
> on other events too, for example if it ends up being too heavily
> fragmented.

Since the append behavior of btrfs is so bad wrt journal files, it
should be enough to simply let btrfs defrag the previously written
journal block upon appending to the file: Lennart, I think you are hinting the
OS that the file is going to grow and thus truncate it to 8 MB beyond
the current end of file to continue writing. That would be a good event
to let btrfs defrag the old 8 MB block (and just that, not the complete
file). If this works well, you could maybe skip defragging the complete
file upon rotation which should improve disk io performance during
rotation.

I think the default extent size hint for defragging with btrfs defrag
has been set to 32 MB lately, so it would be enough to maybe do the
above step every 32 MB.

> But last time I looked btrfs didn't have any nice API for
> that, that would have a clear focus on a single file only...

The high number of extents may not be an indicator of fragmentation
when btrfs compression is used. Compressed data is organized in
logical 128k units which are reported as fragments by filefrag; in
reality they are laid out contiguously on disk, so no fragmentation.
It would be interesting to see the blockmap of this.
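
As a quick sanity check of the 128k point, here is the arithmetic as a sketch (the helper name is made up). It computes how many 128 KiB units cover a given logical size, using the logical sizes from the btrfs-debugfs lines quoted elsewhere in this thread.

```shell
#!/bin/sh
# Number of 128 KiB units needed to cover a file of the given logical size
# in bytes (rounded up).
compressed_units() {
    echo $(( ($1 + 131071) / 131072 ))
}

compressed_units 8388608      # 8 MiB user journal
compressed_units 41943040     # 40 MiB user journal
compressed_units 100663296    # 96 MiB system journal
```

This prints 64, 320 and 768, the same extent counts btrfs-debugfs reported for the compressed copies of those files, which fits the idea that those counts reflect compression units and not scattered data.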

-- 
Regards,
Kai

Replies to list-only preferred.




Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?

2017-04-17 Thread Lennart Poettering
On Mon, 10.04.17 20:20, Chris Murphy (li...@colorremedies.com) wrote:

> 4. Systemd for not enforcing limited kill exemption to those running
> from initramfs, i.e. ignore kill exemption if the program is running
> other than initramfs.

Well, we are not the police, and we do kill everything by default,
even though we have this explicit, privileged opt-out of this. If
people misuse it, then I am pretty sure it's on them, not us...

That said, I will subscribe to the request that systemd's shutdown
logic should go the safest way possible, and hence I am fine with
calling the generic FIFREEZE+FITHAW ioctls one after the other, if
that helps, even though I think this is really broken API.

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] Unable to mask /proc using currently available options (InaccessiblePaths...)

2017-04-17 Thread Lennart Poettering
On Wed, 12.04.17 18:27, Timothée Ravier (sios...@gmail.com) wrote:

> Hi,
> 
> I would like to make the /proc directory inaccessible for some services.
> Unfortunately, adding the InaccessiblePaths=/proc option to a service unit
> will not work.

Hmm, what precisely do you intend to make unavailable here? Note that
/proc/self/ is kinda normal process API on Linux, as are some other
files, and a variety of calls (including in glibc defined ones) assume
that /proc is available, at least for read access.

It definitely makes sense to restrict /proc
somewhat. ProtectKernelTunables= will make /proc/sys read-only for
example, and there's work in progress to permit the kernel's hidepid
procfs mount option to be settable per mount point so that we can
expose it per-service in systemd, but I am not sure it is really
desirable to completely disable it — at least at a service level. It
might make sense to restrict it in even more restricted sandboxes
(for example, a web browser might restrict this if it uses per-page
renderer process sandboxes).

That all said, even if I don't see the great benefit of blocking the
entirety of /proc for a service, I'm still willing to merge changes to
make this work, if this helps you.

> With systemd v233, during the filesystem layout setup for the new service, an
> empty directory will be mounted on top of /proc first (in core:namespace.c:
> setup_namespace(): apply_mount()) and then mount points will be turned
> readonly (in core:namespace.c: setup_namespace(): make_read_only()), using
> /proc/mountinfo which is now unavailable. Thus this step will fail.

Maybe we can find a somewhat clean fall-back for this, when /proc is
not around?

Or maybe we slightly alter the logic here, and open
/proc/self/mountinfo before we rearrange the directories, and then
always only read from the already opened fd, and do not refer to the
actual file system anymore? I figure that would mean adding a version
of bind_remount_recursive() that takes a FILE* or so of
/proc/self/mountinfo as additional parameter, and then seeks to the
beginning before reading off it, if you follow what I mean? I think
this approach would be the nicest one.

> With systemd v233, it is possible to work around this issue leaving only a
> single /proc/self/mountinfo file available using this hack:
> 
> $ umask 0277
> $ mkdir -p /.proc/self
> $ touch /.proc/self/mountinfo
> 
> And in the unit:
> 
> BindReadOnlyPaths=/.proc:/proc /proc/self/mountinfo:/.proc/self/mountinfo
> 
> But this is not really pretty.
> 
> I would like your opinion on the following suggestions before writing code:
>   * Should I extend the MountVFSAPI option to support the case where the
> RootImage and RootDirectory options are not set?

How precisely would you alter the effect of MountVFSAPI= here?

>   * Should I add a special HideProc option to support hiding /proc for
> conventional services?

As above, I'd prefer not to add this. I am not against making work
what you want to do, but I am not convinced that adding first class
config options for it would be a good idea, since systemd after all is
a service manager and hence we should focus on making things easy that
match the service usecase, but not more.

Or in other words: making InaccessiblePaths=/proc work sounds
preferable to me.

> As a side note, debug logs in core/namespace.c are non functional. A call to
> log_open() appears to be missing.

Yupp, this is known. But opening fds comes with other issues (in
particular because seccomp and other security systems would need
preparation to permit that), hence currently we just keep the code in
there, and it is normally a NOP, except if you hack around, turn it on
manually, by adding a log_open for your local compilation.

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] more verbose debug info than systemd.log_level=debug?

2017-04-17 Thread Lennart Poettering
On Mon, 10.04.17 19:30, Chris Murphy (li...@colorremedies.com) wrote:

> >> Remember, all of this is because there *is* software that does the wrong
> >> thing, and it *is* possible for software to hang and be unkillable. It
> >> would be good for systemd to do the right thing even in the presence of
> >> that kind of software.
> >
> > Yeah, we do what we can.
> >
> > But I seriously doubt FIFREEZE will make things better. It's just
> > going to make shutdowns hang every now and then.
> 
> My understanding is freeze isn't ignorable, it's expressly for the use
> case when the disk has active processes writing and the fs must be
> made completely consistent, e.g. prior to taking a snapshot. The thaw
> immediately following freeze would prevent any shutdown hang.
> 
> The point of freeze/thaw is it will cause the file system metadata
> that grub depends on to know where the new grub.cfg is located, to get
> committed to disk prior to reboot. If some process is still hanging
> around with an open write, it doesn't really matter.

As mentioned: if you prep a patch that adds FIFREEZE+FITHAW when we
remount stuff read-only, then I'd merge it, even though I think the
kernel APIs for this are really broken, and it would be much
preferable to have a proper API for this, either exposed via the
well-understood sync() syscall, or through a new ioctl, if they must.

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] Why journald has NotifyAccess=all set in the unit file?

2017-04-17 Thread Lennart Poettering
On Tue, 11.04.17 10:18, Michal Sekletar (msekl...@redhat.com) wrote:

> Hi everyone,
> 
> I was asked today about $subject. I quickly skimmed through the
> relevant parts of the code and the current default looks like an
> oversight. I think there are no processes other than journald involved
> in notification handling. I think it would be nice if we drop the setting
> and rely on the default NotifyAccess=main.

Good question. It has been that way since time began, and I couldn't
extract any useful explanation for that from the git history.

Hence, please file a PR that turns this into NotifyAccess=main.

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] what is sd_notify() really for ?

2017-04-17 Thread Reindl Harald



On 17.04.2017 at 00:47, Enrico Weigelt, metux IT consult wrote:
> On 17.04.2017 00:04, Lennart Poettering wrote:
>
> > Please always check the man pages if you have questions regarding a
> > specific systemd interface:
> >
> > https://www.freedesktop.org/software/systemd/man/sd_notify.html
>
> Done so, of course. Unfortunately, it doesn't answer my questions,
> eg. what the service manager actually does w/ that information.


really?

what exactly do you not understand in the descriptions below?

if there are several services depending on each other you don't want to
start the dependent services while your big database still inits and is not
ready for connections - and for "Restart=always" it may not be enough that
your process is just running - hence the watchdog, where the service needs
to say "i am still alive"



READY=1

Tells the service manager that service startup is finished. This is 
only used by systemd if the service definition file has Type=notify set. 
Since there is little value in signaling non-readiness, the only value 
services should send is "READY=1" (i.e. "READY=0" is not defined).


Example 2. Extended Start-up Notification

A service could send the following after completing initialization:

    sd_notifyf(0, "READY=1\n"
                  "STATUS=Processing requests...\n"
                  "MAINPID=%lu",
                  (unsigned long) getpid());

RELOADING=1

Tells the service manager that the service is reloading its 
configuration. This is useful to allow the service manager to track the 
service's internal state, and present it to the user. Note that a 
service that sends this notification must also send a "READY=1" 
notification when it completed reloading its configuration.


STOPPING=1

Tells the service manager that the service is beginning its 
shutdown. This is useful to allow the service manager to track the 
service's internal state, and present it to the user.


WATCHDOG=1

Tells the service manager to update the watchdog timestamp. This is 
the keep-alive ping that services need to issue in regular intervals if 
WatchdogSec= is enabled for it. See systemd.service(5) for information 
how to enable this functionality and sd_watchdog_enabled(3) for the 
details of how the service can check whether the watchdog is enabled.


https://www.freedesktop.org/software/systemd/man/sd_watchdog_enabled.html
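
to make the watchdog part more concrete, here is a rough sketch in C of
the check that sd_watchdog_enabled(3) is documented to perform on the
$WATCHDOG_USEC and $WATCHDOG_PID environment variables. the function
name watchdog_timeout_usec() is made up for this example and this is
not the libsystemd implementation; a real service should simply call
sd_watchdog_enabled().

```c
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper. Returns 1 and stores the watchdog timeout in *usec if
 * the service manager expects keep-alive pings from this process, 0 if it
 * does not, and -EINVAL if the environment variables are malformed. */
int watchdog_timeout_usec(uint64_t *usec) {
    const char *s = getenv("WATCHDOG_USEC");
    if (!s || !*s)
        return 0;                       /* WatchdogSec= is not enabled */
    if (!isdigit((unsigned char) *s))
        return -EINVAL;

    char *end = NULL;
    errno = 0;
    unsigned long long timeout = strtoull(s, &end, 10);
    if (errno != 0 || *end != '\0' || timeout == 0)
        return -EINVAL;

    /* If WATCHDOG_PID is set, only that process is expected to ping. */
    const char *p = getenv("WATCHDOG_PID");
    if (p && *p) {
        if (!isdigit((unsigned char) *p))
            return -EINVAL;
        errno = 0;
        unsigned long long pid = strtoull(p, &end, 10);
        if (errno != 0 || *end != '\0')
            return -EINVAL;
        if ((pid_t) pid != getpid())
            return 0;
    }

    *usec = (uint64_t) timeout;
    return 1;
}
```

a service that gets 1 back would then send "WATCHDOG=1" regularly; the
sd_watchdog_enabled man page recommends doing that every half of the
returned interval.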


Re: [systemd-devel] systemd-nspawn network-interface

2017-04-17 Thread Lennart Poettering
On Thu, 13.04.17 16:08, poma (pomidorabelis...@gmail.com) wrote:

> Hello
> 
> Regaining of the network-interface, as is stated in the manual, ain't happening;
> man 1 systemd-nspawn
> ...
> OPTIONS
> ...
> --network-interface=
>   Assign the specified network interface to the container.
>   This will remove the specified interface from the calling namespace and
>   place it in the container.
>   When the container terminates,
>   it is moved back to the host namespace. [...]
> 
> Given what's actually going on, it should state:
> --network-interface=
>   Assign the specified network interface to the container.
>   This will remove the specified interface from the calling namespace and
>   place it in the container.
>   When the container terminates,
>   considering that the specified interface is not moved back to the host
>   namespace,
>   the specific kernel module needs to be reloaded to move it back to the
>   host namespace. [...]

Upgrade your kernel! This all works correctly on current kernels:
network interfaces will now safely migrate back to the parent
namespace when a network namespace dies.

We usually don't document bugs in other software in systemd, but
instead ask people to run current systemd only in conjunction with
somewhat current kernels.

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] Short way to show messages of executable and unit with `journalctl`

2017-04-17 Thread Lennart Poettering
On Fri, 14.04.17 20:30, Paul Menzel (paulepan...@users.sourceforge.net) wrote:

> Dear systemd folks,
> 
> 
> Is there a shorter way than below to show all messages of an executable
> and a unit?
> 
> ```
> $ journalctl _COMM=sudo + _SYSTEMD_UNIT=NetworkManager.service
> ```
> 
> I would be happy about a command, that involves `-u` so that I don’t
> have to type the suffix `.service`.

This is currently not available. And I am not sure this is a highly
typical usage that warrants an explicit option... Sorry...

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] journal fragmentation on Btrfs

2017-04-17 Thread Lennart Poettering
On Sun, 16.04.17 14:30, Chris Murphy (li...@colorremedies.com) wrote:

> Hi,
> 
> This is on a Fedora 26 workstation (systemd-233-3.fc26.x86_64) that's
> maybe a couple weeks old and was clean installed. Drive is NVMe.
> 
> 
> # filefrag *
> system.journal: 9283 extents found
> user-1000.journal: 3437 extents found
> # lsattr
> C-- ./system.journal
> C-- ./user-1000.journal
> 
> I do manual snapshots before software updates, which means new writes
> to these files are subject to COW, but additional writes to the same
> extents are overwrites and are not COW because of chattr +C. I've used
> this same strategy for a long time, since systemd-journald defaults to
> +C for journal files; but I've not seen them get this fragmented this
> quickly.
>

IIRC NOCOW only has an effect if set right after the file is created
before the first write to it is done. Or in other words, you cannot
retroactively make a file NOCOW. This means that if you in one way or
another make a COW copy of a file (through reflinking — implicit or
not, note that "cp" reflinks by default — or through snapshotting or
something else) the file is COW and you'll get fragmentation.

I am not entirely sure what to recommend to you. Ultimately, whether btrfs
fragments or not is probably something you have to discuss with the
btrfs folks. We do try to make the best of btrfs, by managing the COW
flag, but this only helps you to a limited degree as
snapshots/reflinks will fuck things up anyway...

We also ask btrfs to defrag the file as soon as we mark it as
archived... I'd even be willing to extend that, and defrag the file
on other events too, for example if it ends up being too heavily
fragmented. But last time I looked btrfs didn't have any nice API for
that with a clear focus on a single file only...

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] Early testing for service enablement

2017-04-17 Thread Lennart Poettering
On Thu, 13.04.17 11:58, Martin Wilck (mwi...@suse.com) wrote:

> On Thu, 2017-04-13 at 08:49 +, Mantas Mikulėnas wrote:
> > IIRC, enable/disable/is-enabled are implemented entirely via direct
> > filesystem access. Other than that, systemctl uses a private socket
> > when running as root – it talks DBus but doesn't require dbus-daemon.
> 
> 
> > A bigger problem is that initramfs can't know much about the main
> > system due to having a separate /etc, unless maybe you run `systemctl
> > --root=...`
> 
> This is not a problem for us because in initramfs, we only care whether
> the service is enabled in initramfs itself.
> 
> > Could you elaborate on why you find this checking necessary in the
> > first place? Do your udev rules run some weird stuff?
> 
> It's about multipath. In the udev rule that checks whether or not a
> given device should be treated as a multipath device path, we need to
> figure out whether multipathd.service is enabled. We want to to that
> without connecting to multipathd.socket at that time in the boot
> process, because that would fire up multipathd, and there's strong
> evidence that multipath-enabled systems boot more stably if multipathd
> is started later (after udev settle). Therefore the idea was to obtain
> the information from systemd ("will multipathd.service be started later
> in the boot process?").

That appears questionable to me. Synchronously requesting data from
other services from inside a udev rule like that appears highly
problematic to me, in particular if you sometimes do it and sometimes
not, as that makes things nondeterministic.

Also: instead of checking whether a service unit is enabled before
contacting a specific socket, please make sure that the socket unit is
only enabled if the service is enabled too (i.e. via Also= in the
[Install] section of the service), so that you can directly talk to
the socket, and if the service is not enabled (and hence the socket
isn't either) you will just get an ENOENT/ECONNREFUSED back...
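
On the client side that boils down to something like the following C
sketch. The function names and the socket path used in the note below
are placeholders made up for this example, not multipathd's real
interface; the only thing it shows is treating ENOENT and ECONNREFUSED
as "service not there" rather than as a hard error.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: connect to an AF_UNIX stream socket in the file
 * system. Returns a connected fd, or negative errno on failure. */
int connect_unix(const char *path) {
    struct sockaddr_un sa;
    memset(&sa, 0, sizeof sa);
    sa.sun_family = AF_UNIX;
    if (strlen(path) >= sizeof sa.sun_path)
        return -ENAMETOOLONG;
    strcpy(sa.sun_path, path);

    int fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
    if (fd < 0)
        return -errno;

    if (connect(fd, (struct sockaddr *) &sa, sizeof sa) < 0) {
        int saved = errno;
        close(fd);
        return -saved;
    }
    return fd;
}

/* Returns 1 if something is listening on `path`, 0 if the socket does not
 * exist or nobody listens on it, and negative errno for any other error. */
int service_reachable(const char *path) {
    int fd = connect_unix(path);
    if (fd >= 0) {
        close(fd);
        return 1;
    }
    if (fd == -ENOENT || fd == -ECONNREFUSED)
        return 0;
    return fd;
}
```

A caller would use service_reachable("/run/example.socket") (path made
up) and simply skip the extra handling when it returns 0. Note that a
successful connect to a socket-activated unit is what starts the
service, so this does not by itself address the ordering concern
raised above.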

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] Early testing for service enablement

2017-04-17 Thread Lennart Poettering
On Thu, 13.04.17 12:05, Martin Wilck (mwi...@suse.com) wrote:

> On Thu, 2017-04-13 at 11:45 +0200, Lennart Poettering wrote:
> > On Thu, 13.04.17 08:49, Mantas Mikulėnas (graw...@gmail.com) wrote:
> > 
> > > IIRC, enable/disable/is-enabled are implemented entirely via direct
> > > filesystem access. Other than that, systemctl uses a private socket
> > > when
> > > running as root – it talks DBus but doesn't require dbus-daemon.
> > 
> > Correct, enable/disable/is-enabled can operate without PID 1, but
> > they
> > usually don't unless the tool detects it is being run in a chroot
> > environment.
> > 
> > And yes, systemctl can communicate with PID 1 through a private
> > communication socket that exists as long as PID 1 exists. dbus-daemon
> > is not needed, except when your client is unprivileged.
> 
> If I interpret this answer correctly, you're saying that "systemctl is-
> enabled xyz.service" *should* actually work, even if it's called right
> after PID 1 is started. I'm pretty certain that that wasn't the case
> for me. My client was running from an udev rule and thus not
> unprivileged. That should be considered a bug, then?

Yes, systemctl is-enabled should always work fine regardless of whether
you run it in early or late boot or even the initrd. However, it will
always just return you the state that applies to its current context,
i.e. inside the initrd it will tell you whether the unit is enabled in
the initrd, and on the host whether it is enabled on the host.

Lennart

-- 
Lennart Poettering, Red Hat


Re: [systemd-devel] what is sd_notify() really for ?

2017-04-17 Thread Lennart Poettering
On Mon, 17.04.17 00:47, Enrico Weigelt, metux IT consult 
(enrico.weig...@gr13.net) wrote:

> On 17.04.2017 00:04, Lennart Poettering wrote:
> 
> > Please always check the man pages if you have questions regarding a
> > specific systemd interface:
> > 
> > https://www.freedesktop.org/software/systemd/man/sd_notify.html
> 
> Done so, of course. Unfortunately, it doesn't answer my questions,
> eg. what the service manager actually does w/ that information.

Well, it's used for a variety of things. I figure the most relevant usage
is for the implementation of Type=notify services, which is referenced
from the man page, if you have a look. For details about that option
see:

https://www.freedesktop.org/software/systemd/man/systemd.service.html#Type=

Another major use is for the watchdog logic, i.e. the implementation
of the WatchdogSec= setting, also referenced from sd_notify()'s man
page. For details about this specific setting see:

https://www.freedesktop.org/software/systemd/man/systemd.service.html#WatchdogSec=

And there's more. For example, you can use it to store fds in the
service manager, so that your service may be restarted (or terminated
abnormally) and access to specific sockets, devices, or any other
object that may be referenced with a file descriptor isn't lost.
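
As a rough illustration of what goes over the wire, here is a
hand-rolled C sketch that sends a notification datagram to the socket
named in $NOTIFY_SOCKET and optionally attaches one fd via SCM_RIGHTS.
The function name notify_with_fd() is made up, and this is only an
approximation of what libsystemd does; real services should use
sd_notify() or sd_pid_notify_with_fds() instead.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: send `state` to $NOTIFY_SOCKET, attaching `fd` unless
 * it is -1. Returns 1 if sent, 0 if $NOTIFY_SOCKET is unset, and negative
 * errno on error. */
int notify_with_fd(const char *state, int fd) {
    const char *path = getenv("NOTIFY_SOCKET");
    if (!path || !*path)
        return 0;

    size_t len = strlen(path);
    struct sockaddr_un sa;
    memset(&sa, 0, sizeof sa);
    sa.sun_family = AF_UNIX;
    if ((path[0] != '/' && path[0] != '@') || len >= sizeof sa.sun_path)
        return -EINVAL;
    memcpy(sa.sun_path, path, len);
    if (path[0] == '@')
        sa.sun_path[0] = '\0';          /* abstract namespace socket */
    socklen_t salen = (socklen_t) (offsetof(struct sockaddr_un, sun_path)
                                   + len + (path[0] == '/' ? 1 : 0));

    struct iovec iov = { .iov_base = (void *) state, .iov_len = strlen(state) };
    struct msghdr mh;
    memset(&mh, 0, sizeof mh);
    mh.msg_name = &sa;
    mh.msg_namelen = salen;
    mh.msg_iov = &iov;
    mh.msg_iovlen = 1;

    /* Control buffer for exactly one fd, aligned for struct cmsghdr. */
    union {
        struct cmsghdr align;
        char buf[CMSG_SPACE(sizeof(int))];
    } control;
    if (fd >= 0) {
        memset(&control, 0, sizeof control);
        mh.msg_control = control.buf;
        mh.msg_controllen = sizeof control.buf;
        struct cmsghdr *c = CMSG_FIRSTHDR(&mh);
        c->cmsg_level = SOL_SOCKET;
        c->cmsg_type = SCM_RIGHTS;
        c->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(c), &fd, sizeof(int));
    }

    int s = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0);
    if (s < 0)
        return -errno;
    ssize_t n = sendmsg(s, &mh, MSG_NOSIGNAL);
    int saved = errno;
    close(s);
    return n < 0 ? -saved : 1;
}
```

With that, notify_with_fd("READY=1", -1) is the plain Type=notify case,
and notify_with_fd("FDSTORE=1", some_fd) is the fd store case. For the
latter the unit also needs FileDescriptorStoreMax= set to something
larger than zero, otherwise the manager will not keep the fd.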

If the brief descriptions in the man pages aren't sufficient, I'd
recommend having a look at the sources.

Lennart

-- 
Lennart Poettering, Red Hat