from:"Lennart Poettering"

Re: [PATCH -next 1/5] block: add disk sequence number

2021-04-20 Thread Lennart Poettering

On Di, 16.03.21 14:13, Christoph Hellwig (h...@infradead.org) wrote:

> On Mon, Mar 15, 2021 at 08:18:24PM +, Matthew Wilcox wrote:
> > On Mon, Mar 15, 2021 at 09:02:38PM +0100, Matteo Croce wrote:
> > > From: Matteo Croce 
> > >
> > > Add a sequence number to the disk devices. This number is put in the
> > > uevent so userspace can correlate events when a driver reuses a device,
> > > like the loop one.
> >
> > Should this be documented as monotonically increasing?  I think this
> > is actually a media identifier.  Consider (if you will) a floppy disc.
> > Back when such things were common, it was possible with personal computers
> > of the era to have multiple floppy discs "in play" and be prompted to
> > insert them as needed.  So shouldn't it be possible to support something
> > similar here -- you're really removing the media from the loop device.
> > With a monotonically increasing number, you're always destroying the
> > media when you remove it, but in principle, it should be possible to
> > reinsert the same media and have the same media identifier number.
>
> And we have some decent infrastructure related to media changes,
> grep for disk_events.  I think this needs to plug into that
> infrastructure instead of duplicating it.

I'd argue this makes sense in one way only, i.e. that whenever the
media_change event is seen the seqnum is implicitly bumped.

I am pretty sure, though, that loopback devices shouldn't synthesize
media_change events themselves. There's quite a difference, I
would argue, between a real media change event caused by an external
effect (i.e. humans/hw buttons/sensors) and loop device reuse, which is
exclusively triggered by internal events (i.e. local code). Moreover I
think the loopback subsystem should manage the seqnum on its own,
since it ideally would return the assigned seqnum immediately from the
attachment ioctl, i.e. it shouldn't just be a side-effect of
attachment, but a part of it, if you follow what I mean.

Does that make sense?

Matteo, would it make sense to extend your patch set to bump the
seqnum implicitly on media_change for devices that implement that?

Lennart

Re: [PATCH -next 1/5] block: add disk sequence number

2021-03-25 Thread Lennart Poettering

On Mo, 15.03.21 21:04, Matthew Wilcox (wi...@infradead.org) wrote:

> On Mon, Mar 15, 2021 at 08:18:24PM +, Matthew Wilcox wrote:
> > On Mon, Mar 15, 2021 at 09:02:38PM +0100, Matteo Croce wrote:
> > > From: Matteo Croce 
> > >
> > > Add a sequence number to the disk devices. This number is put in the
> > > uevent so userspace can correlate events when a driver reuses a device,
> > > like the loop one.
> >
> > Should this be documented as monotonically increasing?  I think this
> > is actually a media identifier.  Consider (if you will) a floppy disc.
> > Back when such things were common, it was possible with personal computers
> > of the era to have multiple floppy discs "in play" and be prompted to
> > insert them as needed.  So shouldn't it be possible to support something
> > similar here -- you're really removing the media from the loop device.
> > With a monotonically increasing number, you're always destroying the
> > media when you remove it, but in principle, it should be possible to
> > reinsert the same media and have the same media identifier number.
>
> So ... a lot of devices have UUIDs or similar.  eg:
>
> $ cat /sys/block/nvme0n1/uuid
> e8238fa6-bf53-0001-001b-448b49cec94f
>
> https://linux.die.net/man/8/scsi_id (for scsi)
>
> how about making this way more generic; create an xattr on a file to
> store the uuid (if one doesn't already exist) whenever it's used as the
> base for a loop device.  then sysfs (or whatever) can report the contents
> of that xattr as the unique id.
>
> That can be mostly in userspace -- losetup can create it, and read it.
> It can be passed in as the first two current-reserved __u64 entries in
> loop_config.  The only kernel change should be creating the sysfs
> entry /sys/block/loopN/uuid from those two array entries.

I prefer seqnos over uuids because we can order them when we see a
bunch of uevents for the same loopback device with their seqnos, as
mentioned in that other mail.

But beggars can't be choosers. If we could propagate some uuid from
the loopback setup ioctl into the device so that it appears via
sysfs, that would work for me too, but not as robustly, since we lack
the ordering to detect whether it's worth waiting for more uevents or
whether somebody else has already taken possession of the device.

TLDR: seqnos FTW! But uuids assigned at attachment time are better than
nothing.

Lennart

--
Lennart Poettering, Berlin

Re: [PATCH -next 1/5] block: add disk sequence number

2021-03-25 Thread Lennart Poettering

On Mo, 15.03.21 20:18, Matthew Wilcox (wi...@infradead.org) wrote:

> On Mon, Mar 15, 2021 at 09:02:38PM +0100, Matteo Croce wrote:
> > From: Matteo Croce 
> >
> > Add a sequence number to the disk devices. This number is put in the
> > uevent so userspace can correlate events when a driver reuses a device,
> > like the loop one.
>
> Should this be documented as monotonically increasing?

I think this would be great. My use case for this would be to match up
uevents with loopback block device attachments, because that's
basically impossible right now: you attach a loopback device to a
file, and then wait for the relevant uevents to happen, for all
partitions. But you cannot do this safely right now, since loopback
block devices are heavily reused in many scenarios, so you never know
if a uevent is from the attachment you created yourself, from a
previous one, or even already from the next.

If this were documented as being monotonic, this would be excellent
for this use case: if you know that your own use of a specific loopback
device got seqno x then you know that if you see uevents for seqno < x
it makes sense to wait longer, but when you see seqno > x then you
know it's too late: somehow you lost uevents and should abort.

Hence: for my use case having this strictly monotonic, and thus being
able to *order* attachments across all areas where the seqno appears
would be absolutely excellent and make this as robust as it possibly
could be.

> I think this is actually a media identifier.  Consider (if you will)
> a floppy disc.  Back when such things were common, it was possible
> with personal computers of the era to have multiple floppy discs "in
> play" and be prompted to insert them as needed.  So shouldn't it be
> possible to support something similar here -- you're really removing
> the media from the loop device.  With a monotonically increasing
> number, you're always destroying the media when you remove it, but
> in principle, it should be possible to reinsert the same media and
> have the same media identifier number.

This would be useless for my use case. We don't really care about the
precise file being attached (which is queryable via sysfs anyway), but
we want to match up our use of the device with the uevents it
generates on itself and its descendant partition block devices.

Hence: for my use case I want something that recognizes *attachments*
and not media. If I attach the same media 3 times I want to be able to
discern the three times. And more importantly: if I attach it once and
someone else also once, then I don't want to get confused by that, and
I want to be able to distinguish both attachments.

Moreover, I am not even sure what a media identifier would mean: if you
have one image and then copy it, is that still the same image? In your
model, should that have distinct ids? Or the same, because it is from
the same common original version? And if I then modify one, what
happens then?

Finally, media usually comes with ids anyway, i.e. file systems have
uuids, GPT partition tables have media uuids. The infrastructure for
that already exists. What we really need is something that allows us
to track attachments, not media.

(That said, I think it would make sense to bump the IDs not only on
explicit user-induced reattachments, but also when media is replaced,
i.e. bump it more often than not)

Lennart

--
Lennart Poettering, Berlin

Re: [PATCH -next 1/5] block: add disk sequence number

2021-03-15 Thread Lennart Poettering

On Mo, 15.03.21 21:04, Matthew Wilcox (wi...@infradead.org) wrote:

> On Mon, Mar 15, 2021 at 08:18:24PM +, Matthew Wilcox wrote:
> > On Mon, Mar 15, 2021 at 09:02:38PM +0100, Matteo Croce wrote:
> > > From: Matteo Croce 
> > >
> > > Add a sequence number to the disk devices. This number is put in the
> > > uevent so userspace can correlate events when a driver reuses a device,
> > > like the loop one.
> >
> > Should this be documented as monotonically increasing?  I think this
> > is actually a media identifier.  Consider (if you will) a floppy disc.
> > Back when such things were common, it was possible with personal computers
> > of the era to have multiple floppy discs "in play" and be prompted to
> > insert them as needed.  So shouldn't it be possible to support something
> > similar here -- you're really removing the media from the loop device.
> > With a monotonically increasing number, you're always destroying the
> > media when you remove it, but in principle, it should be possible to
> > reinsert the same media and have the same media identifier number.
>
> So ... a lot of devices have UUIDs or similar.  eg:
>
> $ cat /sys/block/nvme0n1/uuid
> e8238fa6-bf53-0001-001b-448b49cec94f
>
> https://linux.die.net/man/8/scsi_id (for scsi)
>
> how about making this way more generic; create an xattr on a file to
> store the uuid (if one doesn't already exist) whenever it's used as the
> base for a loop device.  then sysfs (or whatever) can report the contents
> of that xattr as the unique id.
>
> That can be mostly in userspace -- losetup can create it, and read it.
> It can be passed in as the first two current-reserved __u64 entries in
> loop_config.  The only kernel change should be creating the sysfs
> entry /sys/block/loopN/uuid from those two array entries.

As a (part-time) maintainer of udev, and thus of one major likely consumer of
this, I'd *really* prefer some concept here that works without
`losetup` needing to be patched, i.e. we have plenty of userspace that
calls LOOP_CONFIGURE or LOOP_SET_FD, not just losetup, and we'd have
to patch it all. In a world of containers in particular it's even
worse: people probably will continue to use old userspaces (mixed with
newer ones) for a very long time (decades!), and those old userspaces
won't fill in the fields for the ioctl.

Hence, for me it would be essential to have an identifier that is
assigned by the kernel, instead of requiring userspace to assign it,
because userspace won't for a long long time.

I'd be OK with a hybrid approach where userspace *can* fill
something in, but doesn't have to in which case the kernel would fill
it in.

That all said, I very much prefer if we'd use a kernel-enforced
"sequence number" or "generation counter" or so for this instead of a
uuid or random cookie or so. Why? because it allows userspace that
monitors things to derive ordering from these ids: when you watch
these events and see a uevent for a device seqno=4711 then you know
that it is from an earlier use than one you see for seqno=8878. UUIDs
can't give you that. That's in particular a nice property since
uevents/netlink are not a reliable transport: messages can get lost
when the socket buffers overrun, or when udev as the uevent broker
gets overloaded. Hence, for a userspace program it's kinda nice to
know whether it's worth waiting for a specific loop device use or if
it's clear that ship has sailed already: i.e. if my own use of a
specific loop device gets seqno 777 then I know it still makes sense
to wait for appropriate uevents as long as I see seqno <= 776. But if
I see seqno >= 778 then I know it's not worth waiting anymore and
one component in the uevent message chain has dropped my messages.
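
The wait-or-give-up rule described above can be sketched in a few lines. This is only an illustration under the assumption that the kernel assigns seqnos strictly monotonically; the function name `classify_uevent` and its string results are hypothetical and not part of the patch set.

```python
def classify_uevent(own_seqno: int, seen_seqno: int) -> str:
    """Decide what to do with a uevent carrying `seen_seqno`.

    `own_seqno` is the seqno assigned to our own attachment.
    Assumes strictly monotonic, kernel-assigned seqnos.
    """
    if seen_seqno < own_seqno:
        # Event belongs to an earlier use of the device: keep waiting.
        return "wait"
    if seen_seqno == own_seqno:
        # Event belongs to our own attachment.
        return "match"
    # Somebody else already reused the device; our events were lost.
    return "abort"


# With our attachment at seqno 777, as in the example above:
for seen in (776, 777, 778):
    print(seen, classify_uevent(777, seen))
```

A UUID or random cookie would only support the "match" branch; the other two branches depend on ordering.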

But of course, beggars can't be choosers. If a seqno/generation
counter concept is not in the cards, I'd be OK with a uuid/random
cookie approach too. And if an approach where the kernel assigns these
seqnos strictly monotonically is not in the cards, then I'd be OK with
an approach where userspace can pick the ids, too. I'll take what I
can get. My primary concern is that we get something to match up
uevents, partition devices and the main block device with, and all of
the suggested approaches could deliver that.

Lennart

--
Lennart Poettering, Berlin

Re: [PATCH 0/5] block: add a sequence number to disks

2021-02-08 Thread Lennart Poettering

On Sa, 06.02.21 01:08, Matteo Croce (mcr...@linux.microsoft.com) wrote:

> From: Matteo Croce 
>
> With this series a monotonically increasing number is added to disks,
> precisely in the genhd struct, and it's exported in sysfs and uevent.
>
> This helps the userspace correlate events for devices that reuse the
> same device, like loop.
>
> The first patch is the core one, the 2..4 expose the information in
> different ways, while the last one increase the sequence number for
> loop devices at every attach.

Patch set looks excellent to me. This would be great to have for the
systemd project, as it would allow us to fix some major races around
loop device allocation that are relatively easily triggered on loaded
systems.

Lennart

Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Lennart Poettering

On Do, 22.10.20 09:29, Szabolcs Nagy (szabolcs.n...@arm.com) wrote:

> > > The dynamic loader has to process the LOAD segments to get to the ELF
> > > note that says to enable BTI.  Maybe we could do a first pass and load
> > > only the segments that cover notes.  But that requires lots of changes
> > > to generic code in the loader.
> >
> > What if the loader always enabled BTI for PROT_EXEC pages, but then when
> > discovering that this was a mistake, mprotect() the pages without BTI? Then
> > both BTI and MDWX would work and the penalty of not getting MDWX would fall
> > to non-BTI programs. What's the expected proportion of BTI enabled code vs.
> > disabled in the future, is it perhaps expected that a distro would enable
> > the flag globally so eventually only a few legacy programs might be
> > unprotected?
>
> i thought mprotect(PROT_EXEC) would get filtered
> with or without bti, is that not the case?

We can adjust the filter in systemd to match any combination of
flags to allow and to deny.

Lennart

--
Lennart Poettering, Berlin

Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Lennart Poettering

On Do, 22.10.20 09:05, Szabolcs Nagy (szabolcs.n...@arm.com) wrote:

> > > Various changes have been suggested, replacing the mprotect with mmap calls
> > > having PROT_BTI set on the original mapping, re-mmapping the segments,
> > > implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set,
> > > and various modification to seccomp to allow particular mprotect cases to
> > > bypass the filters. In each case there seems to be an undesirable attribute
> > > to the solution.
> > >
> > > So, whats the best solution?
> >
> > Did you see Topi's comments on the systemd issue?
> >
> > https://github.com/systemd/systemd/issues/17368#issuecomment-710485532
> >
> > I think I agree with this: it's a bit weird to alter the bits after
> > the fact. Can't glibc set up everything right from the beginning? That
> > would keep both concepts working.
>
> that's hard to do and does not work for the main exe currently
> (which is mmaped by the kernel).
>
> (it's hard to do because to know that the elf module requires
> bti the PT_GNU_PROPERTY notes have to be accessed that are
> often in the executable load segment, so either you mmap that
> or have to read that, but the latter has a lot more failure
> modes, so if i have to get the mmap flags right i'd do a mmap
> and then re-mmap if the flags were not right)

The only other option I see then is to neuter one of the two
mechanisms. We could certainly turn off MDWE on arm in systemd, if
people want that. Or make it a build-time choice, so that distros make
the choice: build everything with BTI xor support MDWE.

(Might make sense for glibc to gracefully fall back to non-BTI mode if
the mprotect() fails though, to make sure BTI-built binaries work
everywhere.)

I figure your interest in ARM system security is bigger than mine. I
am totally fine to turn off MDWE on ARM if that's what the Linux ARM
folks want. I have no horse in the race. Just let me know.

[An acceptable compromise might be to allow
mprotect(PROT_EXEC|PROT_BTI) if MDWE is on, but prohibit
mprotect(PROT_EXEC) without PROT_BTI. Then at least you get one of the
two protections, but not both. I mean, MDWE is not perfect anyway on
non-x86-64 already: on 32bit i386 MDWE protection is not complete, due
to ipc() syscall multiplexing being unmatchable with seccomp. I
personally am happy as long as it works fully on x86-64]
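
The compromise in the bracketed paragraph above amounts to a small decision rule on the `prot` argument of mprotect(). The sketch below only models that rule; a real filter would be expressed as seccomp rules, and the function name `mdwe_allows_mprotect` is hypothetical. The constants are the Linux `PROT_EXEC` value and the arm64-specific `PROT_BTI` value from the kernel headers.

```python
PROT_READ = 0x1
PROT_WRITE = 0x2
PROT_EXEC = 0x4    # Linux <sys/mman.h>
PROT_BTI = 0x10    # arm64-only flag from the arm64 uapi headers


def mdwe_allows_mprotect(prot: int) -> bool:
    """Model of the compromise policy: PROT_EXEC only together with PROT_BTI."""
    if not prot & PROT_EXEC:
        # Not making anything executable: nothing for MDWE to object to.
        return True
    # Executable requests pass only when they also ask for BTI.
    return bool(prot & PROT_BTI)


print(mdwe_allows_mprotect(PROT_READ | PROT_EXEC | PROT_BTI))  # glibc's BTI call
print(mdwe_allows_mprotect(PROT_READ | PROT_EXEC))             # plain PROT_EXEC
```

Under this rule a process can still make pages executable by adding `PROT_BTI`, which is why it provides one of the two protections rather than both.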

Lennart

--
Lennart Poettering, Berlin

Re: [systemd-devel] BTI interaction between seccomp filters in systemd and glibc mprotect calls, causing service failures

2020-10-22 Thread Lennart Poettering

On Mi, 21.10.20 22:44, Jeremy Linton (jeremy.lin...@arm.com) wrote:

> Hi,
>
> There is a problem with glibc+systemd on BTI enabled systems. Systemd
> has a service flag "MemoryDenyWriteExecute" which uses seccomp to deny
> PROT_EXEC changes. Glibc enables BTI only on segments which are marked as
> being BTI compatible by calling mprotect PROT_EXEC|PROT_BTI. That call is
> caught by the seccomp filter, resulting in service failures.
>
> So, at the moment one has to pick either denying PROT_EXEC changes, or BTI.
> This is obviously not desirable.
>
> Various changes have been suggested, replacing the mprotect with mmap calls
> having PROT_BTI set on the original mapping, re-mmapping the segments,
> implying PROT_EXEC on mprotect PROT_BTI calls when VM_EXEC is already set,
> and various modification to seccomp to allow particular mprotect cases to
> bypass the filters. In each case there seems to be an undesirable attribute
> to the solution.
>
> So, whats the best solution?

Did you see Topi's comments on the systemd issue?

https://github.com/systemd/systemd/issues/17368#issuecomment-710485532

I think I agree with this: it's a bit weird to alter the bits after
the fact. Can't glibc set up everything right from the beginning? That
would keep both concepts working.

Lennart

--
Lennart Poettering, Berlin

LOOP_CONFIGURE ioctl doesn't work if lo_offset/lo_sizelimit are set

2020-08-24 Thread Lennart Poettering

Hi!

Even with fe6a8fc5ed2f0081f17375ae2005718522c392c6 the LOOP_CONFIGURE
ioctl doesn't work correctly. It gets confused if the
lo_offset/lo_sizelimit fields are set to non-zero.

In a quick test I ran (on Linux 5.8.3) I call LOOP_CONFIGURE with
.lo_offset=3221204992 and .lo_sizelimit=50331648 and immediately
verify the size of the block device with BLKGETSIZE64. It should of
course return 50331648, but actually returns 3271557120. (the precise
values have no particular relevance, it's just what I happened to use
in my test.) If I instead use LOOP_SET_STATUS64 with the exact same
parameters, everything works correctly. In either case, if I use
LOOP_GET_STATUS64 instead of BLKGETSIZE64 to verify things, everything
looks great.
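
The expected result can be worked out with a small sketch of the usual size rule: subtract the offset from the backing file size, cap the remainder at the size limit, and round down to whole 512-byte sectors. The helper name `expected_loop_size` is hypothetical, and the backing file size used below is an assumption: it takes the value that BLKGETSIZE64 wrongly returned (3271557120) to be the size of the whole file.

```python
SECTOR_SIZE = 512


def expected_loop_size(file_size: int, lo_offset: int, lo_sizelimit: int) -> int:
    """Size in bytes the loop block device should report."""
    size = file_size - lo_offset if lo_offset > 0 else file_size
    if size < 0:
        # Offset beyond the end of the backing file.
        return 0
    if 0 < lo_sizelimit < size:
        # A non-zero size limit caps the device size.
        size = lo_sizelimit
    # Block devices are sized in whole sectors.
    return (size // SECTOR_SIZE) * SECTOR_SIZE


# Assumed backing file size, see the note above.
backing_file_size = 3271557120
print(expected_loop_size(backing_file_size, 3221204992, 50331648))
print(expected_loop_size(backing_file_size, 0, 0))
```

With offset and size limit ignored, the rule yields the full file size, which matches the value reported after LOOP_CONFIGURE.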

My guess is that the new ioctl simply doesn't properly propagate the
size limit into the underlying block device like it should. I didn't
have the time to investigate further though.

Lennart

Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

2020-08-14 Thread Lennart Poettering

On Mi, 12.08.20 11:18, Linus Torvalds (torva...@linux-foundation.org) wrote:

> On Tue, Aug 11, 2020 at 5:05 PM David Howells  wrote:
> >
> > Well, the start of it was my proposal of an fsinfo() system call.
>
> Ugh. Ok, it's that thing.
>
> This all seems *WAY* over-designed - both your fsinfo and Miklos' version.
>
> What's wrong with fstatfs()? All the extra magic metadata seems to not
> really be anything people really care about.
>
> What people are actually asking for seems to be some unique mount ID,
> and we have 16 bytes of spare information in 'struct statfs64'.

statx() exposes a `stx_mnt_id` field nowadays, so that's easy and
quick to get. It's just so inefficient to match that up with
/proc/self/mountinfo afterwards. And it still won't give you any of the fs
capability bits (time granularity, max file size, features, …),
because the kernel doesn't expose that at all right now.

OTOH I'd already be quite happy if struct statfs64 would expose
f_features, f_max_fsize, f_time_granularity, f_charset_case_handling
fields or so.

Lennart

--
Lennart Poettering, Berlin

Re: file metadata via fs API

2020-08-14 Thread Lennart Poettering

On Mi, 12.08.20 12:50, Linus Torvalds (torva...@linux-foundation.org) wrote:

> On Wed, Aug 12, 2020 at 12:34 PM Steven Whitehouse wrote:
> >
> > The point of this is to give us the ability to monitor mounts from
> > userspace.
>
> We haven't had that before, I don't see why it's suddenly such a big deal.
>
> The notification side I understand. Polling /proc files is not the answer.
>
> But the whole "let's design this crazy subsystem for it" seems way
> overkill. I don't see anybody caring that deeply.
>
> It really smells like "do it because we can, not because we must".

With my systemd maintainer hat on (and of other userspace stuff),
there's a couple of things I really want from the kernel because it
would fix real problems for us:

1. We want mount notifications that don't require scanning
   /proc/self/mountinfo entirely again every time things change, over
   and over again, simply because that doesn't scale. We have various
   bugs open about this performance bottleneck that I could point you to,
   but I figure it's easy to see why this currently doesn't scale...

2. We want an unpriv API to query (and maybe set) the fs UUID, like we
   have nowadays for the fs label FS_IOC_[GS]ETFSLABEL

3. We want an API to query time granularity of file systems
   timestamps. Otherwise it's so hard in userspace to reproducibly
   re-generate directory trees. We need to know for example that some
   fs only has 2s granularity (like fat).

4. Similar, we want to know if an fs is case-sensitive for file
   names. Or case-preserving. And which charset it accepts for filenames.

5. We want to know if a file system supports access modes, xattrs,
   file ownership, device nodes, symlinks, hardlinks, fifos, atimes,
   btimes, ACLs and so on. All these things currently can only be
   figured out by changing things and reading back if it worked. Which
   sucks hard of course.

6. We'd like to know the max file size on a file system.

7. Right now it's hard to figure out mount options used for the fs
   backing some file: you can now statx() the file, determine the
   mnt_id by that, and then search that in /proc/self/mountinfo, but
   it's slow, because again we need to scan the whole file until we
   find the entry we need. And that can be huge IRL.

8. Similar: we quite often want to know submounts of a mount. It would
   be great if for that kind of information (i.e. list of mnt_ids
   below some other mnt_id) we wouldn't have to scan the whole of
   /p/s/mi again. In many cases in our code we operate recursively,
   and want to know the mounts below some specific dir, but currently
   pay a performance price for it if the number of file systems on the
   host is huge. This doesn't sound like a biggie, but actually is a
   biggie. In systemd we spend a lot of time scanning /p/s/mi...

9. How are file locks implemented on this fs? Are they local only, and
   orthogonal to remote locks? Are POSIX and BSD locks possibly merged
   at the backend? Do they work at all?
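
Items 7 and 8 both come down to a full linear scan of the mountinfo table. The following sketch shows what that scan looks like today: look up one entry by mount ID, then collect all mounts below it. The sample table is made-up data in the /proc/self/mountinfo format, and the helper names `parse_mountinfo` and `submounts` are hypothetical.

```python
# Made-up sample in /proc/self/mountinfo format (not from a real system).
SAMPLE_MOUNTINFO = """\
22 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw
23 22 0:21 / /proc rw,nosuid shared:2 - proc proc rw
24 22 0:22 / /sys rw,nosuid shared:3 - sysfs sysfs rw
25 24 0:23 / /sys/fs/cgroup rw,nosuid shared:4 - cgroup2 cgroup2 rw
"""


def parse_mountinfo(text):
    """Parse the whole table into {mount_id: entry}; cost grows with table size."""
    entries = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Fixed fields come first; " - " ends the optional fields.
        left, _, right = line.partition(" - ")
        fields = left.split()
        entries[int(fields[0])] = {
            "parent_id": int(fields[1]),
            "mount_point": fields[4],
            "options": fields[5],
            "fstype": right.split()[0],
        }
    return entries


def submounts(entries, mnt_id):
    """Return the IDs of all mounts below `mnt_id`, found via parent IDs."""
    found, stack = [], [mnt_id]
    while stack:
        current = stack.pop()
        for mid, entry in entries.items():
            if entry["parent_id"] == current:
                found.append(mid)
                stack.append(mid)
    return sorted(found)


entries = parse_mountinfo(SAMPLE_MOUNTINFO)
print(entries[25]["mount_point"], entries[25]["options"])
print(submounts(entries, 22))
```

On a live system the text would come from reading /proc/self/mountinfo, and the mount ID from statx()'s `stx_mnt_id`; every lookup still parses the whole file.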

I don't really care too much what an API for this looks like, but let
me just say that I am not a fan of APIs that require allocating an fd
for querying info about an fd. This 'feels' a bit too recursive: if
you expose information about some fd in some magic procfs subdir, or
even in some virtual pseudo-file below the file's path, then this means
we have to allocate a new fd to figure out things about the first fd, and
if we wanted to know the same info for that, we'd theoretically recurse
down. Now of course, most likely IRL we wouldn't actually recurse down,
but it is still smelly. In particular if fd limits are tight. I mean,
I really don't care if you expose non-file-system stuff via the fs, if
that's what you want, but exposing *fs* metainfo in the *fs*, I think
that's just ugly.

I generally detest APIs that have no chance of ever returning multiple
bits of information atomically. Splitting up querying of multiple
attributes into multiple system calls means they couldn't possibly be
determined in a congruent way. I much prefer APIs where we provide a
struct to fill in and do a single syscall, and at least for some
fields we'd know afterwards that the fields were filled in together
and are congruent with each other.
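
The single-call pattern is what statvfs()/fstatfs() already provide for the handful of fields they carry. A minimal sketch using Python's `os.statvfs` wrapper follows; the fields wished for above (time granularity, feature bits, max file size) are not available through it, and the helper name `query_fs_info` is hypothetical.

```python
import os


def query_fs_info(path):
    """Fetch several file system attributes with one statvfs() call."""
    st = os.statvfs(path)
    # All values come from the same struct, filled in by a single call.
    return {
        "block_size": st.f_bsize,
        "max_name_length": st.f_namemax,
        "read_only": bool(st.f_flag & os.ST_RDONLY),
    }


print(query_fs_info("/"))
```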

I am a fan of the statx() system call I must say. If we had something
like this for the file system itself I'd be quite happy, it could tick
off many of the requests I list above.

Hope this is useful,

Lennart

--
Lennart Poettering, Berlin

Re: file metadata via fs API (was: [GIT PULL] Filesystem Information)

2020-08-11 Thread Lennart Poettering

On Di, 11.08.20 20:49, Miklos Szeredi (mik...@szeredi.hu) wrote:

> On Tue, Aug 11, 2020 at 6:05 PM Linus Torvalds wrote:
>
> > and then people do "$(srctree)/". If you haven't seen that kind of
> > pattern where the pathname has two (or sometimes more!) slashes in the
> > middle, you've led a very sheltered life.
>
> Oh, I have.   That's why I opted for triple slashes, since that should
> work most of the time even in those concatenated cases.  And yes, I
> know, most is not always, and this might just be hiding bugs, etc...
> I think the pragmatic approach would be to try this and see how many
> triple slash hits a normal workload gets and if it's reasonably low,
> then hopefully that together with warnings for O_ALT would be enough.

There's no point. Userspace relies on the current meaning of triple
slashes. It really does.

I know many places in systemd where we might end up with a triple
slash. Here's a real-life example: some code wants to access the
cgroup attribute 'cgroup.controllers' of the root cgroup. It thus
generates the right path in the fs for it, which is the concatenation of
"/sys/fs/cgroup/" (because that's where cgroupfs is mounted), of "/"
(i.e. for the root cgroup) and of "/cgroup.controllers" (as that's the
file the attribute is exposed under).

And there you go:

   "/sys/fs/cgroup/" + "/" + "/cgroup.controllers" → "/sys/fs/cgroup///cgroup.controllers"

This is a real-life thing. Don't break this please.
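
The concatenation above is easy to reproduce. The sketch below builds the path the same way and then shows what today's path semantics make of it, using `posixpath.normpath` as a stand-in for how repeated slashes are currently collapsed during lookup. The helper name `cgroup_attribute_path` is hypothetical.

```python
import posixpath


def cgroup_attribute_path(mount_point, cgroup, attribute):
    """Naive concatenation, as generic path-building code tends to do."""
    return mount_point + cgroup + attribute


path = cgroup_attribute_path("/sys/fs/cgroup/", "/", "/cgroup.controllers")
print(path)
# Repeated slashes in the middle of a path currently mean the same as one.
print(posixpath.normpath(path))
```

Code like this relies on both spellings naming the same file, so giving "///" a new meaning would change what it opens.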

Lennart

--
Lennart Poettering, Berlin

[PATCH v2] loop: unset GENHD_FL_NO_PART_SCAN on LOOP_CONFIGURE

2020-08-10 Thread Lennart Poettering

When LOOP_CONFIGURE is used with LO_FLAGS_PARTSCAN we need to propagate
this into the GENHD_FL_NO_PART_SCAN. LOOP_SETSTATUS does this,
LOOP_CONFIGURE doesn't so far. Effect is that setting up a loopback
device with partition scanning doesn't actually work when LOOP_CONFIGURE
is issued, though it works fine with LOOP_SETSTATUS.

Let's correct that and propagate the flag in LOOP_CONFIGURE too.

Fixes: 3448914e8cc5("loop: Add LOOP_CONFIGURE ioctl")

Signed-off-by: Lennart Poettering 
Acked-by: Martijn Coenen 
---
 drivers/block/loop.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d18160146226..2f137d6ce169 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1171,6 +1171,8 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	if (part_shift)
 		lo->lo_flags |= LO_FLAGS_PARTSCAN;
 	partscan = lo->lo_flags & LO_FLAGS_PARTSCAN;
+	if (partscan)
+		lo->lo_disk->flags &= ~GENHD_FL_NO_PART_SCAN;

 	/* Grab the block_device to prevent its destruction after we
 	 * put /dev/loopXX inode. Later in __loop_clr_fd() we bdput(bdev).
--
2.26.2

Re: [PATCH] loop: unset GENHD_FL_NO_PART_SCAN on LOOP_CONFIGURE

2020-08-07 Thread Lennart Poettering

On Fr, 07.08.20 10:53, Martijn Coenen (m...@android.com) wrote:

> Hi Lennart,
>
> Thanks again for the patch, I tested it and it looks good to me. I'll
> also add a test case to LTP for this. Two minor nits on the patch:
>
> On Thu, Aug 6, 2020 at 9:32 AM Lennart Poettering wrote:
> > Let's correct that and propagate the flag in LOOP_SETSTATUS too.
>
> Think you meant LOOP_CONFIGURE.

True!

> Also, could you add a "Fixes" tag, like:
>
> Fixes: 3448914e8cc5("loop: Add LOOP_CONFIGURE ioctl")

Thanks for the review. I'll fix this up and send a v2. Are you OK with
me adding your Ack to the patch? And also, should this get a cc for
stable?

Thanks,

Lennart

--
Lennart Poettering, Berlin

[PATCH] loop: unset GENHD_FL_NO_PART_SCAN on LOOP_CONFIGURE

2020-08-06 Thread Lennart Poettering

When LOOP_CONFIGURE is used with LO_FLAGS_PARTSCAN we need to propagate
this into the GENHD_FL_NO_PART_SCAN. LOOP_SETSTATUS does this,
LOOP_CONFIGURE doesn't so far. Effect is that setting up a loopback
device with partition scanning doesn't actually work when LOOP_CONFIGURE
is issued, though it works fine with LOOP_SETSTATUS.

Let's correct that and propagate the flag in LOOP_SETSTATUS too.

Signed-off-by: Lennart Poettering 
---
 drivers/block/loop.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index d18160146226..2f137d6ce169 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -1171,6 +1171,8 @@ static int loop_configure(struct loop_device *lo, fmode_t mode,
 	if (part_shift)
 		lo->lo_flags |= LO_FLAGS_PARTSCAN;
 	partscan = lo->lo_flags & LO_FLAGS_PARTSCAN;
+	if (partscan)
+		lo->lo_disk->flags &= ~GENHD_FL_NO_PART_SCAN;

 	/* Grab the block_device to prevent its destruction after we
 	 * put /dev/loopXX inode. Later in __loop_clr_fd() we bdput(bdev).
--
2.26.2

Re: Linux 5.3-rc8

2019-09-29 Thread Lennart Poettering

On Fr, 27.09.19 08:58, Linus Torvalds (torva...@linux-foundation.org) wrote:

> On Fri, Sep 27, 2019 at 6:57 AM Lennart Poettering wrote:
> >
> > Doing the random seed in the boot loader is nice for two reasons:
> >
> > 1. It runs very very early, so that the OS can come up with fully
> >    initialized entropy right from the beginning.
>
> Oh, that part I love.
>
> But I don't believe in your second case:
>
> > 2. The boot loader generally has found some disk to read the kernel from,
> >    i.e. has a place where stuff can be stored and which can be updated
> >    (most modern boot loaders can write to disk these days, and so can
> >    EFI). Thus, it can derive a new random seed from a stored seed on disk
> >    and pass it to the OS *AND* update it right away on disk ensuring that
> >    it is never reused again.
>
> No. This is absolutely no different at all from user space doing it
> early with a file.
>
> All the same "golden image" issues exist, and in general the less the
> boot loader writes to disk, the better.
>
> Plus it doesn't actually work anyway in the one situation where people
> _really_ want it - embedded devices, where the kernel image is quite
> possibly in read-only flash that needs major setup for updates.
>
> PLUS.
>
> Your "it can update it right away on disk" is just crazy talk. With
> WHAT? It has no randomness to play with, and it doesn't have time to
> do jitter entropy stuff.

So these two issues are addressed by the logic implemented in sd-boot
(systemd's boot loader) like this:

The old seed is read off the ESP seed file. We then calculate two hash
sums in counter mode from it (SHA256), one we pass to the OS as seed
to initialize the random pool from. The other we use to update the ESP
seed file with. Unless you are capable of breaking SHA256 this means
the seed passed to the OS and the new seed stored on disk are derived
from the same seed but in a way you cannot determine one if you
managed to learn the other. Moreover, on each boot you are guaranteed
to get two new seeds, each time, and you cannot derive the sums used
on previous boots from those. This means we are robust towards
potential seed reuse when turning the system forcibly off during boot.

Now, what's still missing in the above is protection against "golden
image" issues, as you correctly pointed out. To deal with that the
SHA256 sums are not just hashed from the old seed and the counter, but
also include a system specific "system token" (you may also call it
"salt") which is stored in an EFI variable, persistently, which was
created once, during system installation. This hence gives you the
behaviour you are looking for, using the NVRAM like you suggested,
but we don't need to write the EFI vars all the time, as instead we
update the seed file stored in the ESP each time, and updating the ESP
should be safer and less problematic (i.e. if everything is done right
it's a single sector write).

To make this safer, on EFI firmwares that support the RNG protocol we
also include some data derived from that in the hash, just for good
measure. To summarize:

NEWDISKSEED = SHA256(OLDDISKSEED || SYSTEMTOKEN || EFIRNGVAL || "1")
SEEDFORLINUX = SHA256(OLDDISKSEED || SYSTEMTOKEN || EFIRNGVAL || "2")

(and no, this is not a crypto scheme I designed, but something
Dr. Bertram Poettering (my brother, a cryptographer) suggested)
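
To make the two formulas above concrete, here is a minimal sketch in
Python. It is not sd-boot's actual code: the function name
derive_seeds and the plain byte concatenation (no length prefixes or
other framing between the inputs) are assumptions made for the
example.

```python
import hashlib
from typing import Tuple


def derive_seeds(old_disk_seed: bytes, system_token: bytes,
                 efi_rng_val: bytes = b"") -> Tuple[bytes, bytes]:
    """Return (new_disk_seed, seed_for_linux) as in the formulas above.

    Assumption: inputs are simply concatenated; a real implementation
    may frame or encode them differently.
    """
    common = old_disk_seed + system_token + efi_rng_val
    # Same inputs, different trailing counter: the two outputs cannot be
    # derived from each other without breaking SHA256.
    new_disk_seed = hashlib.sha256(common + b"1").digest()
    seed_for_linux = hashlib.sha256(common + b"2").digest()
    return new_disk_seed, seed_for_linux
```

Feeding new_disk_seed back in as old_disk_seed on the next boot gives
the chain of per-boot seeds described above.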

> So all it can do is a really bad job at taking the previous random
> seed, doing some transformation on it, and add a little bit of
> whatever system randomness it can find. None of which is any better
> than what the kernel can do.

Well, the kernel cannot hash and rewrite the old seed file early enough,
it's that simple. It can do that only when /var becomes writable,
i.e. very late during boot, much later than when we need entropy
for. The boot loader, on the other hand, can hash and rewrite the old seed
file even before the kernel initializes, and that's the big benefit!

> End result: you'd need to have the kernel update whatever bootloader
> data later on, and I'm not seeing that happening. Afaik the current
> bootloader interface has no way to specify how to update it when you
> actually have better randomness.

So, you could, but don't have to update the ESP random seed file from
the OS too, every now and then, but the security of the model does not
rely on that.

(And yes, the above doesn't help if you have a fully R/O medium, but
those tend to be embedded devices, and I am much less concerned about
those, the designers really can deal with the RNG seed issues
themselves, and maybe provide some hw to do it; it's the generic user
PCs that we should be concerned about, and for those the above should
generally work)

Lennart

--
Lennart Poettering, Berlin

Re: Linux 5.3-rc8

2019-09-27 Thread Lennart Poettering

On Mi, 18.09.19 13:26, Linus Torvalds (torva...@linux-foundation.org) wrote:

> On Wed, Sep 18, 2019 at 1:15 PM Alexander E. Patrakov
>  wrote:
> >
> > No, this is not the solution, if we take seriously not only getrandom
> > hangs, but also urandom warnings. In some setups (root on LUKS is one of
> > them) they happen early in the initramfs. Therefore "restoring" entropy
> > from the previous boot by a script that runs from the main system is too
> > late. That's why it is suggested to load at least a part of the random
> > seed in the boot loader, and that has not been commonly implemented.
>
> Honestly, I think the bootloader suggestion is naive and silly too.
>
> Yes, we now support it. And no, I don't think people will trust that
> either. And I suspect for good reason: there's really very little
> reason to believe that bootloaders would be any better than any other
> part of the system.
>
> So right now some people trust bootloaders exactly _because_ there
> basically is just one or two that do this, and the people who use them
> are usually the people who wrote them or are at least closely
> associated with them. That will change, and then people will say "why
> would I trust that, when we know of bug Xyz".

Doing the random seed in the boot loader is nice for two reasons:

1. It runs very very early, so that the OS can come up with fully
   initialized entropy right from the beginning.

2. The boot loader generally has found some disk to read the kernel from,
   i.e. has a place where stuff can be stored and which can be updated
   (most modern boot loaders can write to disk these days, and so can
   EFI). Thus, it can derive a new random seed from a stored seed on disk
   and pass it to the OS *AND* update it right away on disk ensuring that
   it is never reused again. The point where the OS kernel comes to an
   equivalent point where it can write to disk is much much later,
   i.e. after the initrd, after the transition to the actual OS, only
   after /var has been remounted writable.

So to me this is not about trust, but about "first place we can read
*AND* write a seed on disk".

i.e. the key to grok here: it's not OK to use a stored seed unless you
can at the same time update it on disk, as only that protects you
from reusing the seed if the system's startup is aborted due to power
failure or such.
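
The "read *AND* write" rule can be sketched as follows. This is an
illustration only, in Python rather than boot loader code; the name
consume_seed, the temp-file-plus-rename strategy and the "1"/"2"
suffixes are assumptions of the example. The point is the ordering:
the replacement seed reaches stable storage before the derived seed
is handed to anyone.

```python
import hashlib
import os
import tempfile


def consume_seed(seed_path: str) -> bytes:
    """Replace the seed on disk first, then return a seed for the OS."""
    with open(seed_path, "rb") as f:
        old_seed = f.read()

    next_disk_seed = hashlib.sha256(old_seed + b"1").digest()
    seed_for_os = hashlib.sha256(old_seed + b"2").digest()

    # Write the replacement next to the old file, flush it to stable
    # storage, then atomically rename it over the old seed.
    directory = os.path.dirname(os.path.abspath(seed_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(next_disk_seed)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, seed_path)
    except BaseException:
        # Best-effort cleanup of the temporary file, then re-raise.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise

    # Only now is the old seed gone from disk, so a power failure from
    # here on cannot lead to the same seed being used twice.
    return seed_for_os
```

The sketch does not fsync the containing directory, which a careful
implementation would probably also want.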

> Adding an EFI variable (or other platform nonvolatile thing), and
> reading (and writing to it) purely from the kernel ends up being one
> of those things where you can then say "ok, if we trust the platform
> AT ALL, we can trust that". Since you can't reasonably do things like
> add EFI variables to your distro image by mistake.

NVRAM backing EFI vars sucks. Nothing you want to update on every
cycle. It's OK to update during OS installation, but during every
single boot? I'd rather not.

Lennart

--
Lennart Poettering, Berlin

Re: Linux 5.3-rc8

2019-09-18 Thread Lennart Poettering

On Mi, 18.09.19 00:10, Martin Steigerwald (mar...@lichtvoll.de) wrote:

> > getrandom() will never "consume entropy" in a way that will block any
> > users of getrandom(). If you don't have enough collected entropy to
> > seed the rng, getrandom() will block. If you do, getrandom() will
> > generate as many numbers as you ask it to, even if no more entropy is
> > ever collected by the system. So it doesn't matter how many clients
> > you have calling getrandom() in the boot process - either there'll be
> > enough entropy available to satisfy all of them, or there'll be too
> > little to satisfy any of them.
>
> Right, but then Systemd would not use getrandom() for initial hashmap/
> UUID stuff since it

Actually things are more complex. In systemd there are four classes of
random values we need:

1. High "cryptographic" quality. There are very few needs for this in
   systemd, as we do very little in this area. It's basically only
   used for generating salt values for hashed passwords, in the
   systemd-firstboot component, which can be used to set the root
   pw. systemd uses synchronous getrandom() for this. It does not use
   RDRAND for this.

2. High "non-cryptographic" quality. This is used for example for
   generating type 4 uuids, i.e. uuids that are supposed to be globally
   unique, but aren't key material. We use RDRAND for this if
   available, falling back to synchronous getrandom(). Type 4 UUIDs
   are frequently needed by systemd, as we assign a uuid to each
   service invocation implicitly, so that people can match logging
   data and such to a specific instance and runtime of a service.

3. Medium quality. This is used for seeding hash tables. These may be
   crap initially, but should not be guessable in the long
   run. /dev/urandom would be perfect for this, but the mentioned log
   message sucks, hence we use RDRAND for this if available, and fall
   back to /dev/urandom if that isn't available, accepting the log
   message.

4. Crap quality. There are only a few uses of this, where rand_r()
   is OK.

Of these four cases, the first two might block boot. Because the first
case is not common you won't see blocking that often though for
them. The second case is very common, but since we use RDRAND you
won't see it on any recent Intel machines.

Or to say this all differently: the hash table seeding and the uuid
case are two distinct cases in systemd, and I am sure they should be.
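
For illustration, the four classes map roughly onto the following
Linux-only Python sketch. It is not systemd code and the function
names are made up for the example. RDRAND is not reachable from plain
Python, so class 2 is shown only through its getrandom() fallback and
class 3 only through its /dev/urandom fallback.

```python
import os
import random


def crypto_quality(n: int) -> bytes:
    # Class 1: synchronous getrandom(); blocks until the pool is initialized.
    return os.getrandom(n)


def high_noncrypto_quality(n: int) -> bytes:
    # Class 2: would try RDRAND first; here only the getrandom() fallback.
    return os.getrandom(n)


def medium_quality(n: int) -> bytes:
    # Class 3: read the /dev/urandom device directly; this never blocks,
    # but may return weak data (and trigger the log message) early in boot.
    with open("/dev/urandom", "rb") as f:
        return f.read(n)


def crap_quality(n: int, seed: int) -> bytes:
    # Class 4: a seeded userspace PRNG, comparable in spirit to rand_r().
    rng = random.Random(seed)
    return bytes(rng.randrange(256) for _ in range(n))
```

Only the first two functions can block, which matches the observation
above about which cases may delay boot.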

Lennart

--
Lennart Poettering, Berlin

Re: Linux 5.3-rc8

2019-09-18 Thread Lennart Poettering

On Di, 17.09.19 23:38, Martin Steigerwald (mar...@lichtvoll.de) wrote:

> (I know that it still with /dev/urandom, so if it is using RDRAND now,
> this may indeed be different, but would it then deplete entropy the CPU
> has available and that by default is fed into the Linux crng as well
> (even without trusting it completely)?)

Neither RDRAND nor /dev/urandom know a concept of "depleting
entropy". That concept does not exist for them. It does exist for
/dev/random, but only crazy people use that. systemd does not.

Lennart

--
Lennart Poettering, Berlin

Re: Linux 5.3-rc8

2019-09-18 Thread Lennart Poettering

On Di, 17.09.19 19:29, Willy Tarreau (w...@1wt.eu) wrote:

> > What do you expect these systems to do though?
> >
> > I mean, think about general purpose distros: they put together live
> > images that are supposed to work on a myriad of similar (as in: same
> > arch) but otherwise very different systems (i.e. VMs that might lack
> > any form of RNG source the same as beefy servers with multiple sources
> > the same as older netbooks with few and crappy sources, ...). They can't
> > know what the specific hw will provide or won't. It's not their
> > incompetence that they build the image like that. It's a common, very
> > common usecase to install a system via SSH, and it's also very common
> > to have very generic images for a large number of varied systems to run
> > on.
>
> I'm totally fine with installing the system via SSH, using a temporary
> SSH key. I do make a strong distinction between the installation phase
> and the final deployment. The SSH key used *for installation* doesn't
> need to be the same as the final one. And very often at the end of the
> installation we'll have produced enough entropy to produce a correct
> key.

That's not how systems are built today though. And I am not sure they
should be. I mean, the majority of systems at this point probably have
some form of hardware (or virtualized) RNG available (even raspi has
one these days!), so generating these keys once at boot is totally
OK. Probably a number of others need just a few seconds to get the
entropy needed, where things are totally OK too. The only problem is
systems that lack any reasonable source of entropy and where
initialization of the pool will take overly long.

I figure we can reduce the number of systems where entropy is scarce
quite a bit if we'd start crediting entropy by default from various hw
rngs we currently don't credit entropy for. For example, the TPM and
older intel/amd chipsets. You currently have to specify
rng_core.default_quality=1000 on the kernel cmdline to make them
credit entropy. I am pretty sure this should be the default now, in a
world where CONFIG_RANDOM_TRUST_CPU=y is set anyway. i.e. why say
RDRAND is fine but those chipsets are not? That makes no sense to me.

I am very sure that crediting entropy to chipset hwrngs is a much
better way to solve the issue on those systems than to just hand out
rubbish randomness.

Lennart

--
Lennart Poettering, Berlin

Re: Linux 5.3-rc8

2019-09-17 Thread Lennart Poettering

On Di, 17.09.19 09:23, Linus Torvalds (torva...@linux-foundation.org) wrote:

> On Tue, Sep 17, 2019 at 9:08 AM Lennart Poettering  
> wrote:
> >
> > Here's what I'd propose:
>
> So I think this is ok, but I have another proposal. Before I post that
> one, though, I just wanted to point out:
>
> > 1) Add GRND_INSECURE to get those users of getrandom() who do not need
> >    high quality entropy off its use (systemd has uses for this, for
> >    seeding hash tables for example), thus reducing the places where
> >    things might block.
>
> I really think that the logic should be the other way around.
>
> The getrandom() users that don't need high quality entropy are the
> ones that don't really think about this, and so _they_ shouldn't be
> the ones that have to explicitly state anything. To those users,
> "random is random". By definition they don't much care, and quite
> possibly they don't even know what "entropy" really means in that
> context.

So I think people nowadays prefer getrandom() over /dev/urandom
primarily because of the noisy logging the kernel does when you use
the latter on a non-initialized pool. If that'd be dropped then I am
pretty sure that the porting from /dev/urandom to getrandom() you see
in various projects (such as gdm/x11) would probably not take place.

In fact, speaking for systemd: the noisy logging in the kernel is the
primary (actually: only) reason that we prefer using RDRAND (if
available) over /dev/urandom if we need "medium quality" random
numbers, for example to seed hash tables and such. If the log message
wasn't there we wouldn't be tempted to bother with RDRAND and would
just use /dev/urandom like we used to for that.

> > 2) Add a kernel log message if a getrandom(0) client hung for 15s or
> >    more, explaining the situation briefly, but not otherwise changing
> >    behaviour.
>
> The problem is that when you have some graphical boot, you'll not even
> see the kernel messages ;(

Well, but as mentioned, there's infrastructure for this, that's why I
suggested changing systemd-random-seed.service.

We can make boot hang in a "sane", discoverable way.

The reason why I think this should also be logged by the kernel is
that people use netconsole and pstore and whatnot, and they should see
this there. If only systemd with its infrastructure brings this to
screen via plymouth, then this wouldn't help people who debug at a much
lower level.

(I mean, there have been requests to add a logic to systemd that
refuses booting — or delays it — if the system has a battery and it is
nearly empty. I am pretty sure adding a clean, discoverable concept of
"uh, i can't boot for a good reason which is this" wouldn't be the
worst of ideas)

Lennart

--
Lennart Poettering, Berlin

Re: Linux 5.3-rc8

2019-09-17 Thread Lennart Poettering

On Di, 17.09.19 21:58, Alexander E. Patrakov (patra...@gmail.com) wrote:

> I am worried that the getrandom delays will be serialized, because processes
> sometimes run one after another. If there are enough chained/dependent
> processes that ask for randomness before it is ready, the end result is
> still a too-big delay, essentially a failed boot.
>
> In other words: your approach of adding delays only makes sense for heavily
> parallelized boot, which may not be the case, especially for embedded
> systems that don't like systemd.

As mentioned elsewhere: once the pool is initialized it's
initialized. This means any pending getrandom() on the whole system
will unblock at the same time, and from then on all getrandom()s will
be non-blocking.

systemd-random-seed.service is nowadays a synchronization point for
exactly the moment where the pool is considered full.
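
A quick way to observe this from userspace, sketched in Python on
Linux: getrandom() with GRND_NONBLOCK fails with EAGAIN (surfaced by
Python as BlockingIOError) while the pool is uninitialized, and
succeeds forever after. The helper name pool_initialized is made up
for the example.

```python
import os


def pool_initialized() -> bool:
    """Return True if getrandom() would no longer block."""
    try:
        # Ask for a single byte without blocking; EAGAIN means the
        # kernel pool has not been initialized yet.
        os.getrandom(1, os.GRND_NONBLOCK)
    except BlockingIOError:
        return False
    return True
```

Once this returns True it will keep returning True for the rest of
the uptime, so there is a single waiting period for the whole system
rather than one per process.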

Lennart

--
Lennart Poettering, Berlin

Re: Linux 5.3-rc8

2019-09-17 Thread Lennart Poettering

On Di, 17.09.19 09:27, Linus Torvalds (torva...@linux-foundation.org) wrote:

> But look at what gnome-shell and gnome-session-b does:
>
> https://lore.kernel.org/linux-ext4/20190912034421.GA2085@darwi-home-pc/
>
> and most of them already set GRND_NONBLOCK, but look at the
> problematic one that actually causes the boot problem:
>
> gnome-session-b-327   4.400620: getrandom(16 bytes, flags = 0)
>
> and here the big clue is: "Hey, it only asks for 128 bits of
> randomness".

I don't think this is a good check to make.

In fact most cryptography folks say taking out more than 256bit is
never going to make sense, and BSD getentropy() is correspondingly
strict: it returns an error if you ask for more than 256 bytes. (and
glibc's getentropy() wrapper around getrandom() enforces the same size
limit btw)

On the BSDs the kernel's getentropy() call is primarily used to seed
their libc's arc4random() every now and then, and userspace is
supposed to use only arc4random(). I am pretty sure we should do the
same on Linux in the long run. i.e. the idea that everyone uses the
kernel syscall directly sounds wrong to me, and designing the syscall
so that everyone calls it is hence wrong too.

On the BSDs getentropy() is hence unconditionally blocking, without
any flags or so, which makes sense since it's not supposed to be
user-facing really so much, but more a basic primitive for low-level
userspace infrastructure only, that is supposed to be wrapped
non-trivially to be useful. (that's at least how I understood their
APIs)

> Does anybody believe that 128 bits of randomness is a good basis for a
> long-term secure key? Even if the key itself contains more than that, if
> you are generating a long-term secure key in this day and age, you had
> better be asking for more than 128 bits of actual unpredictable base
> data. So just based on the size of the request we can determine that
> this is not hugely important.

aes128 is very common today. It's what baseline security is.

I have the suspicion crypto folks would argue that 128…256 is the only
sane range for cryptographic keys...

Lennart

--
Lennart Poettering, Berlin

Re: Linux 5.3-rc8

2019-09-17 Thread Lennart Poettering

On Di, 17.09.19 18:21, Willy Tarreau (w...@1wt.eu) wrote:

> On Tue, Sep 17, 2019 at 05:57:43PM +0200, Lennart Poettering wrote:
> > Note that calling getrandom(0) "too early" is not something people do
> > on purpose. It happens by accident, i.e. because we live in a world
> > where SSH or HTTPS or so is run in the initrd already, and in a world
> > where booting sometimes can be very very fast.
>
> It's not an accident, it's a lack of understanding of the impacts
> from the people who package the systems. Generating an SSH key from
> an initramfs without thinking where the randomness used for this could
> come from is not accidental, it's a lack of experience that will be
> fixed once they start to collect such reports. And those who absolutely
> need their SSH daemon or HTTPS server for a recovery image in initramfs
> can very well feed fake entropy by dumping whatever they want into
> /dev/random to make it possible to build temporary keys for use within
> this single session. At least all supposedly incorrect use will be made
> *on purpose* and will still be possible to match what users need.

What do you expect these systems to do though?

I mean, think about general purpose distros: they put together live
images that are supposed to work on a myriad of similar (as in: same
arch) but otherwise very different systems (i.e. VMs that might lack
any form of RNG source the same as beefy servers with multiple sources
the same as older netbooks with few and crappy sources, …). They can't
know what the specific hw will provide or won't. It's not their
incompetence that they build the image like that. It's a common, very
common usecase to install a system via SSH, and it's also very common
to have very generic images for a large number of varied systems to run
on.

Lennart

--
Lennart Poettering, Berlin

Re: Linux 5.3-rc8

2019-09-17 Thread Lennart Poettering

On Di, 17.09.19 12:30, Ahmed S. Darwish (darwish...@gmail.com) wrote:

> Ideally, systems would be configured with hardware random
> number generators, and/or configured to trust the CPU-provided
> RNG's (CONFIG_RANDOM_TRUST_CPU) or boot-loader provided ones
> (CONFIG_RANDOM_TRUST_BOOTLOADER).  In addition, userspace
> should generate cryptographic keys only as late as possible,
> when they are needed, instead of during early boot.  (For
> non-cryptographic use cases, such as dictionary seeds or MIT
> Magic Cookies, other mechanisms such as /dev/urandom or
> random(3) may be more appropriate.)
>
> Sounds good?

This sounds mean. You make apps pay for something they aren't really
at fault for.

I mean, in the cloud people typically put together images that are
replicated to many systems, and as first thing generate an SSH key, on
the individual system. In fact, most big distros tend to ship SSH that
is precisely set up this way: on first boot the SSH key is
generated. They tend to call getrandom(0) for this right now, and
rightfully so. Now suddenly you kill them because they are doing
everything correctly? Those systems aren't going to be more useful if
they have no SSH key at all than they would be if they would hang at
boot: either way you can't log in.

Here's what I'd propose:

1) Add GRND_INSECURE to get those users of getrandom() who do not need
   high quality entropy off its use (systemd has uses for this, for
   seeding hash tables for example), thus reducing the places where
   things might block.

2) Add a kernel log message if a getrandom(0) client hung for 15s or
   more, explaining the situation briefly, but not otherwise changing
   behaviour.

3) Change systemd-random-seed.service to log to console in the same
   case, blocking boot cleanly and discoverably.

I am not a fan of randomly killing userspace processes that just
happened to be the unlucky ones, to call this first... I see no
benefit in killing stuff over letting boot hang in a discoverable way.
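
Points 2 and 3 amount to "keep waiting, but say so". A minimal
userspace sketch of that behaviour in Python follows; the function
wait_for_pool, its parameters and the log text are all hypothetical,
and the readiness check is passed in as a callable rather than tied to
any particular kernel interface.

```python
import time
from typing import Callable


def wait_for_pool(is_ready: Callable[[], bool],
                  log: Callable[[str], None],
                  warn_after: float = 15.0,
                  poll_interval: float = 0.5,
                  clock: Callable[[], float] = time.monotonic,
                  sleep: Callable[[float], None] = time.sleep) -> float:
    """Block until is_ready() is true; log once after warn_after seconds.

    Returns the number of seconds waited. Nothing is killed or aborted:
    the only visible change is a single explanatory log message.
    """
    start = clock()
    warned = False
    while not is_ready():
        waited = clock() - start
        if not warned and waited >= warn_after:
            log("still waiting for the kernel entropy pool after %.0fs"
                % waited)
            warned = True
        sleep(poll_interval)
    return clock() - start
```

The clock and sleep parameters exist only so the logic can be
exercised without actually waiting 15 seconds.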

Lennart

Re: Linux 5.3-rc8

2019-09-17 Thread Lennart Poettering

On Di, 17.09.19 08:11, Theodore Y. Ts'o (ty...@mit.edu) wrote:

> On Tue, Sep 17, 2019 at 09:33:40AM +0200, Martin Steigerwald wrote:
> > Willy Tarreau - 17.09.19, 07:24:38 CEST:
> > > On Mon, Sep 16, 2019 at 06:46:07PM -0700, Matthew Garrett wrote:
> > > > >Well, the patch actually made getrandom() return an error too, but
> > > > >you seem more interested in the hypotheticals than in arguing
> > > > >actualities.>
> > > > If you want to be safe, terminate the process.
> > >
> > > This is an interesting approach. At least it will cause bug reports in
> > > application using getrandom() in an unreliable way and they will
> > > check for other options. Because one of the issues with systems that
> > > do not finish to boot is that usually the user doesn't know what
> > > process is hanging.
> >
>
> I would be happy with a change which changes getrandom(0) to send a
> kill -9 to the process if it is called too early, with a new flag,
> getrandom(GRND_BLOCK) which blocks until entropy is available.  That
> leaves it up to the application developer to decide what behavior they
> want.

Note that calling getrandom(0) "too early" is not something people do
on purpose. It happens by accident, i.e. because we live in a world
where SSH or HTTPS or so is run in the initrd already, and in a world
where booting sometimes can be very very fast. So even if you write a
program and you think "this stuff should run late, I'll just
getrandom(0)" it might not actually be the case IRL because people
deploy it slightly differently than you initially thought, in a
slightly differently equipped system with other runtime behaviour...

Lennart

Re: Linux 5.3-rc8

2019-09-17 Thread Lennart Poettering

On Mo, 16.09.19 13:21, Theodore Y. Ts'o (ty...@mit.edu) wrote:

> We could create a new flag, GRND_INSECURE, which never blocks.  And
> that that allows us to solve the problem for silly applications that
> are using getrandom(2) for non-cryptographic use cases.  Use cases
> might include Python dictionary seeds, gdm for MIT Magic Cookie, UUID
> generation where best efforts probably is good enough, etc.  The
> answer today is they should just use /dev/urandom, since that exists
> today, and we have to support it for backwards compatibility anyway.
> It sounds like gdm recently switched to getrandom(2), and I suspect
> that it's going to get caught on some hardware configs anyway, even
> without the ext4 optimization patch.  So I suspect gdm will switch
> back to /dev/urandom, and this particular pain point will probably go
> away.

The problem is that reading from /dev/urandom at a point where it's
not initialized yet results in noisy kernel logging on current
kernels. If you want people to use /dev/urandom then the logging needs
to go away, because it scares people, makes them file bug reports and
so on, even though there isn't actually any problem for these specific
purposes.

For that reason I'd prefer GRND_INSECURE I must say, because it
indicates people grokked "I know I might get questionable entropy".

Lennart

Re: [PATCH RFC v2] random: optionally block in getrandom(2) when the CRNG is uninitialized

2019-09-16 Thread Lennart Poettering

On So, 15.09.19 10:32, Linus Torvalds (torva...@linux-foundation.org) wrote:

> [ Added Lennart, who was active in the other thread ]
>
> On Sat, Sep 14, 2019 at 10:22 PM Theodore Y. Ts'o  wrote:
> >
> > Thus, add an optional configuration option which stops getrandom(2)
> > from blocking, but instead returns "best efforts" randomness, which
> > might not be random or secure at all.
>
> So I hate having a config option for something like this.
>
> How about this attached patch instead? It only changes the waiting
> logic, and I'll quote the comment in full, because I think that
> explains not only the rationale, it explains every part of the patch
> (and is most of the patch anyway):
>
>  * We refuse to wait very long for a blocking getrandom().
>  *
>  * The crng may not be ready during boot, but if you ask for
>  * blocking random numbers very early, there is no guarantee
>  * that you'll ever get any timely entropy.
>  *
>  * If you are sure you need entropy and that you can generate
>  * it, you need to ask for non-blocking random state, and then
>  * if that fails you must actively _do_something_ that causes
>  * enough system activity, perhaps asking the user to type
>  * something on the keyboard.

You are requesting a UI change here. Maybe the kernel shouldn't be the
one figuring out UI.

I mean, as I understand you are unhappy with behaviour you saw on
systemd systems; we can certainly improve behaviour of systemd in
userspace alone, i.e. abort the getrandom() after a while in userspace
and log about it using typical userspace logging to the console. I am
not sure why you want to do all that in the kernel, the kernel isn't
great at user interaction, and really shouldn't be.

If all you want is abort the getrandom() after 30s and a friendly
message on screen, by all means, let's add that to systemd, I have
zero problem with that. systemd has infrastructure for pushing that to
the user, the kernel doesn't really have that so nicely.

It appears to me you subscribe too much to an idea that userspace
people are not smart enough and couldn't implement something like
this. Turns out we can though, and there's no need to add logic that
appears to follow the logic of "never trust userspace"...

i.e. why not just consider this all just a feature request for the
systemd-random-seed.service, i.e. the service you saw the issue with
to handle this on its own?

> Hmm? No strange behavior. No odd config variables. A bounded total
> boot-time wait of 30s (which is a completely random number, but I
> claimed it as the "big red button" time).

As mentioned, in systemd's case, updating the random seed on disk
is entirely fine to take 5h or so. I don't really think we really need
to bound this in kernel space.

Lennart

--
Lennart Poettering, Berlin

Re: [PATCH RFC v3] random: getrandom(2): optionally block when CRNG is uninitialized

2019-09-15 Thread Lennart Poettering

On So, 15.09.19 10:17, Ahmed S. Darwish (darwish...@gmail.com) wrote:

> Thus, don't trust user-space on calling getrandom(2) from the right
> context. Never block, by default, and just return data from the
> urandom source if entropy is not yet available. This is an explicit
> decision not to let user-space work around this through busy loops on
> error-codes.
>
> Note: this lowers the quality of random data returned by getrandom(2)
> to the level of randomness returned by /dev/urandom, with all the
> original security implications coming out of that, as discussed in
> problem "3." at the top of this commit log. If this is not desirable,
> offer users a fallback to old behavior, by CONFIG_RANDOM_BLOCK=y, or
> random.getrandom_block=true bootparam.

This is an awful idea. It just means that all crypto that needs
entropy during early boot will now be using weak keys, without even
knowing it.

Yeah, it's a bad situation, but I am very sure that failing loudly in
this case is better than just sticking your head in the sand. Ignoring
the issue without letting userspace know is an exceptionally bad
idea.

We live in a world where people run HTTPS, SSH, and all that stuff in
the initrd already. It's where SSH host keys are generated, and plenty
of session keys. If Linux lets all that stuff run with awful entropy
then you pretend things were secure while they actually aren't. It's much
better to fail loudly in that case, I am sure.

Quite frankly, I don't think this is something to fix in the
kernel. Let the people putting together systems deal with this. Let
them provide a creditable hw rng, and let them pay the price if they
don't.

Lennart

--
Lennart Poettering, Berlin

Re: Linux 5.3-rc8

2019-09-15 Thread Lennart Poettering

On So, 15.09.19 09:27, Ahmed S. Darwish (darwish...@gmail.com) wrote:

> On Sun, Sep 15, 2019 at 08:51:42AM +0200, Lennart Poettering wrote:
> > On Sa, 14.09.19 09:30, Linus Torvalds (torva...@linux-foundation.org) wrote:
> [...]
> >
> > And please don't break /dev/urandom again. The above code is the only
> > way I see how we can make /dev/urandom-derived swap encryption safe,
> > and the only way I can see how we can sanely write a valid random seed
> > to disk after boot.
> >
>
> Any hope in making systemd-random-seed(8) credit that "random seed
> from previous boot" file, through RNDADDENTROPY, *by default*?

No. For two reasons:

a) It's way too late. We shouldn't credit entropy from the disk seed
   if we cannot update the disk seed with a new one at the same time,
   otherwise we might end up crediting the same seed twice on
   subsequent reboots (think: user hard powers off a system after we
   credited but before we updated), in which case there would not be a
   point in doing that at all. Hence, we have to wait until /var is
   writable, but that's relatively late during boot. Long after the
   initrd ran, long after iscsi and what not ran. Long after the
   network stack is up and so on. In a time where people load root
   images from the initrd via HTTPS that's generally too late to be
   useful.

b) Golden images are a problem. There are probably more systems
   running off golden images in the wild, than those not running off
   them. This means: a random seed on disk is only safe to credit if
   it gets purged when the image is distributed to the systems it's
   supposed to be used on, because otherwise these systems will all
   come up with the very same seed, which makes it useless. So, by
   requesting people to explicitly acknowledge that they are aware of
   this problem (and either don't use golden images, or safely wipe
   the seed off the image before shipping it), by setting the env var,
   we protect ourselves from this.

Last time I looked at it most popular distro's live images didn't wipe
the random seed properly before distributing it to users...

This is all documented btw:

https://systemd.io/RANDOM_SEEDS#systemds-support-for-filling-the-kernel-entropy-pool

See point #2.

> I know that by v243, just released 12 days ago, this can be optionally
> done through SYSTEMD_RANDOM_SEED_CREDIT=1. I wonder though if it can
> ever be done by default, just like what the BSDs does... This would
> solve a big part of the current problem.

I think the best approach would be to do this in the boot loader. In
fact systemd does this in its own boot loader (sd-boot): it reads a
seed off the ESP, updates it (via a SHA256 hash of the old one)
and passes that to the OS. PID 1 very early on then credits this to
the kernel's pool (ideally the kernel would just do this on its own
btw). The trick we employ to make this generally safe is that we
persistently store a "system token" as EFI var too, and include it in
the SHA sum. The "system token" is a per-system random blob. It is
created the first time it's needed and a good random source exists,
and then stays on the system, for all future live images to use. This
makes sure that even if sloppily put together live images are used
(which do not reset any random seed) every system will use a different
series of RNG seeds.

This then solves both problems: the golden image problem, and the
early-on problem. But of course only on systems with an ESP. Other
systems should be able to provide similar mechanisms though, it's not
rocket science.

This is also documented here:

https://systemd.io/RANDOM_SEEDS#systemds-support-for-filling-the-kernel-entropy-pool

See point #3...
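
A simplified sketch of the idea, assuming nothing about sd-boot's real
construction: both the seed written back to disk and the seed passed to
the OS are derived by hashing the old seed together with the per-system
token, so two machines booting the same image end up with different
seeds. The labels, the length-prefixing and the function names are
hypothetical.

```python
import hashlib


def _derive(label, system_token, old_seed):
    h = hashlib.sha256()
    for part in (label, system_token, old_seed):
        # Length-prefix each part so the inputs cannot run into each other.
        h.update(len(part).to_bytes(8, "little"))
        h.update(part)
    return h.digest()


def boot_loader_step(old_seed, system_token):
    """Return (seed to store for the next boot, seed to hand to the OS)."""
    next_disk_seed = _derive(b"next-disk-seed", system_token, old_seed)
    seed_for_os = _derive(b"seed-for-os", system_token, old_seed)
    return next_disk_seed, seed_for_os
```

With the same image seed but different system tokens the resulting seeds
differ, and feeding the stored seed back in on the next boot yields yet
another value.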

Ideally other boot loaders (grub, …) would support the same scheme,
but I am not sure the problem set is known to them.

Lennart

--
Lennart Poettering, Berlin

Re: Linux 5.3-rc8

2019-09-15 Thread Lennart Poettering

On So, 15.09.19 09:07, Willy Tarreau (w...@1wt.eu) wrote:

> > That code can finish 5h after boot, it's entirely fine with this
> > specific usecase.
> >
> > Again: we don't delay "the boot" for this. We just delay "writing a
> > new seed to disk" for this. And if that is 5h later, then that's
> > totally fine, because in the meantime it's just one bg process more that
> > hangs around waiting to do what it needs to do.
>
> Didn't you say it could also happen when using encrypted swap ? If so
> I suspect this could happen very early during boot, before any services
> may be started ?

Depends on the deps, and what options are used in /etc/crypttab. If
people hard rely on swap to be enabled for boot to proceed and also
use one-time passwords from /dev/urandom they better provide some form
of hw rng, too. Otherwise the boot will block, yes.

Basically, just add "nofail" to a line in /etc/crypttab, and the entry
will be activated at boot, but we won't delay boot for it. It's going
to be activated as soon as the deps are fulfilled (and thus the pool
initialized), but that may well be 5h after boot, and that's totally
OK as long as nothing else hard depends on it.
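
For illustration, such an entry could look roughly like this. The device
path is a placeholder, not a real name; "swap" and "nofail" are options
documented in crypttab(5):

```
# /etc/crypttab
# <name>  <device>            <key file>     <options>
swap      /dev/example-part   /dev/urandom   swap,nofail
```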

Lennart

--
Lennart Poettering, Berlin

Re: Linux 5.3-rc8

2019-09-15 Thread Lennart Poettering

On So, 15.09.19 09:01, Willy Tarreau (w...@1wt.eu) wrote:

> On Sun, Sep 15, 2019 at 08:56:55AM +0200, Lennart Poettering wrote:
> > There's benefit in being able to wait until the pool is initialized
> > before we update the random seed stored on disk with a new one,
>
> And what exactly makes you think that waiting with arms crossed not
> doing anything else has any chance to make the situation change if
> you already had no such entropy available when reaching that first
> call, especially during early boot ?

That code can finish 5h after boot, it's entirely fine with this
specific usecase.

Again: we don't delay "the boot" for this. We just delay "writing a
new seed to disk" for this. And if that is 5h later, then that's
totally fine, because in the meantime it's just one bg process more that
hangs around waiting to do what it needs to do.

Lennart

--
Lennart Poettering, Berlin

Re: Linux 5.3-rc8

2019-09-15 Thread Lennart Poettering

On Sa, 14.09.19 09:52, Linus Torvalds (torva...@linux-foundation.org) wrote:

> On Sat, Sep 14, 2019 at 9:35 AM Alexander E. Patrakov
>  wrote:
> >
> > Let me repeat: not -EINVAL, please. Please find some other error code,
> > so that the application could sensibly distinguish between this case
> > (low quality entropy is in the buffer) and the "kernel is too dumb" case
> > (and no entropy is in the buffer).
>
> I'm not convinced we want applications to see that difference.
>
> The fact is, every time an application thinks it cares, it has caused
> problems. I can just see systemd saying "ok, the kernel didn't block,
> so I'll just do
>
>     while (getrandom(x) == -ENOENTROPY)
>         sleep(1);
>
> instead. Which is still completely buggy garbage.
>
> The fact is, we can't guarantee entropy in general. It's probably
> there in practice, particularly with user space saving randomness from
> last boot etc, but that kind of data may be real entropy, but the
> kernel cannot *guarantee* that it is.

I am not expecting the kernel to guarantee entropy. I am just expecting
the kernel to not give me garbage knowingly. It's OK if it gives me
garbage unknowingly, but I have a problem if it gives me trash all the
time.

There's benefit in being able to wait until the pool is initialized
before we update the random seed stored on disk with a new one, and
there's benefit in being able to wait until the pool is initialized
before we let cryptsetup read a fresh, one-time key for dm-crypt from
/dev/urandom. I fully understand that any such reporting for
initialization is "best-effort", i.e. to the point where we don't know
anything to the contrary, but at least give userspace that.

Lennart

--
Lennart Poettering, Berlin

Re: Linux 5.3-rc8

2019-09-15 Thread Lennart Poettering

On Sa, 14.09.19 09:30, Linus Torvalds (torva...@linux-foundation.org) wrote:

> > => src/random-seed/random-seed.c:
> >     /*
> >      * Let's make this whole job asynchronous, i.e. let's make
> >      * ourselves a barrier for proper initialization of the
> >      * random pool.
> >      */
> >     k = getrandom(buf, buf_size, GRND_NONBLOCK);
> >     if (k < 0 && errno == EAGAIN && synchronous) {
> >             log_notice("Kernel entropy pool is not initialized yet, "
> >                        "waiting until it is.");
> >
> >             k = getrandom(buf, buf_size, 0); /* retry synchronously */
> >     }
>
> Yeah, the above is yet another example of completely broken garbage.
>
> You can't just wait and block at boot. That is simply 100%
> unacceptable, and always has been, exactly because that may
> potentially mean waiting forever since you didn't do anything that
> actually is likely to add any entropy.

Oh man. Just spend 5min to understand the situation, before claiming
this was garbage or that was garbage. The code above does not block
boot. It blocks startup of services that explicitly order themselves
after the code above. There's only a few services that should do that,
and the main system boots up just fine without waiting for this.

Primary example for stuff that orders itself after the above,
correctly: cryptsetup entries that specify /dev/urandom as password
source (i.e. swap space and stuff, that wants a new key on every
boot). If we don't wait for the initialized pool for cases like that
the password for that swap space is not actually going to be random,
and that defeats its purpose.

Another example: the storing of an updated random seed file on
disk. We should only do that if the seed on disk is actually properly
random, i.e. comes from an initialized pool. Hence we wait for the
pool to be initialized before reading the seed from the pool, and
writing it to disk.
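
The pattern can be sketched outside of C as well. The following is a
rough Python rendering of the same idea, not the systemd code: try a
non-blocking read first, and only if the pool is not initialized yet (and
the caller asked for it) block this one process until it is. The function
name and the injectable getrandom parameter are assumptions added so the
logic can be exercised without an uninitialized pool.

```python
import os

# Linux's GRND_NONBLOCK flag; fall back to its numeric value if this
# Python build does not expose the constant.
GRND_NONBLOCK = getattr(os, "GRND_NONBLOCK", 0x0001)


def read_seed(size, synchronous, getrandom=None):
    """Read `size` random bytes, optionally waiting for pool initialization.

    Only meant for small sizes; large requests may return short reads.
    """
    if getrandom is None:
        getrandom = os.getrandom  # Linux only
    try:
        return getrandom(size, GRND_NONBLOCK)
    except BlockingIOError:
        # EAGAIN: the kernel entropy pool is not initialized yet.
        if not synchronous:
            raise
        # Block this process (and whatever is ordered after it), not boot.
        return getrandom(size, 0)
```

Anything that needs an initialized pool orders itself after the process
doing this; everything else keeps booting.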

I'd argue that doing things like this is not "garbage", like you say,
but *necessary* to make this stuff safe and secure.

And no, other stuff is not delayed for this (but there are bugs of
course, some random services in 3rd party packages that set too
aggressive deps, but that needs to be fixed there, and not in the
kernel).

Anyway, I really don't appreciate your tone, and being sucked into
messy LKML discussions. I generally stay away from LKML, and gah, you
remind me why. Just tone it down, not everything you never bothered to
understand is "garbage".

And please don't break /dev/urandom again. The above code is the only
way I see how we can make /dev/urandom-derived swap encryption safe,
and the only way I can see how we can sanely write a valid random seed
to disk after boot. You guys changed semantics on /dev/urandom all the
time in the past, don't break API again, thank you very much.

Lennart

Re: New kernel interface for sys_tz and timewarp?

2019-08-14 Thread Lennart Poettering

On Mi, 14.08.19 11:32, Alexandre Belloni (alexandre.bell...@bootlin.com) wrote:

> On 14/08/2019 11:09:36+0200, Lennart Poettering wrote:
> > On Mi, 14.08.19 10:31, Arnd Bergmann (a...@arndb.de) wrote:
> >
> > > - glibc stops passing the caller timezone argument to the kernel
> > > - the distro kernel disables CONFIG_RTC_HCTOSYS,
> > >   CONFIG_RTC_SYSTOHC  and CONFIG_GENERIC_CMOS_UPDATE
> >
> > What's the benefit of letting userspace do this? It sounds a lot more
> > fragile to leave this syncing to userspace if the kernel can do this
> > trivially on its own.
> >
>
> It does it trivially and badly:
>
>  - hctosys will always think the RTC is in UTC so if the RTC is in
>    local time, you will anyway have up to 12 hours difference until
>    userspace fixes that.

Sure, but 12h off is not that bad, much better than being 39 years
off. Moreover, it's off only for those who actually dual boot Windows
and make use of the RTC-in-local-time functionality. For them having
the time slightly off during early boot is not great but also not
totally awful, and the whole concept of RTC-in-local-time is not that
great anyway. It's not a reason to penalize everybody else who has the
RTC in UTC, as they should.

>  - the RTC to be used for hctosys and systohc is hardcoded in Kconfig
>    and distros usually leave the default rtc0, but many platforms have a
>    non-functional RTC that ends up being rtc0. I would prefer that to be a
>    userspace configuration change instead of a kernel configuration
>    change

Well, but how do you think userspace would figure out which RTC to use
in a way the kernel couldn't do equally well or better?

On PCs at least it's very clear which RTC driver is the right one. And
if non-PC hardware comes with borked RTC hw then it's probably a good
idea not to compile support for such RTCs into the kernel in the first
place...

I know that there are some environments where RTC devices are compiled
as modules. But that means they are loaded relatively late during the
boot process, i.e. at a time where udevd is started and triggers all
busses, but that's *very* late in most cases, and it would suck
having timestamps in early-boot logs that are 39y off until that
point.

I'd argue that in the vast majority of cases the person building the
kernel for a device knows very well which RTC is connected to the
device they are interested in, and should just build that driver in,
and not bother with userspace complexity, later userspace module
loading or anything like that.

Lennart

--
Lennart Poettering, Berlin

Re: New kernel interface for sys_tz and timewarp?

2019-08-14 Thread Lennart Poettering

On Mi, 14.08.19 10:31, Arnd Bergmann (a...@arndb.de) wrote:

> - glibc stops passing the caller timezone argument to the kernel
> - the distro kernel disables CONFIG_RTC_HCTOSYS,
>   CONFIG_RTC_SYSTOHC  and CONFIG_GENERIC_CMOS_UPDATE

What's the benefit of letting userspace do this? It sounds a lot more
fragile to leave this syncing to userspace if the kernel can do this
trivially on its own.

IIRC there are uses in kernel that use CLOCK_REALTIME already before
userspace starts. e.g. iirc networking generally prefers
CLOCK_REALTIME timestamps over CLOCK_MONOTONIC timestamps
(i.e. SO_TIMESTAMP and friends are still CLOCK_REALTIME only so far,
unless I am missing something). If the kernel comes up with a
CLOCK_REALTIME that starts at 0 this is pretty annoying I
figure... Hence, so far I suggested to distros to continue turning on
the options above, and let the kernel do this on its own without
involving userspace in that.

Lennart

--
Lennart Poettering, Berlin

Re: [RFC] better visibility into kworkers

2018-05-17 Thread Lennart Poettering

On Do, 17.05.18 10:42, Alexey Dobriyan (adobri...@gmail.com) wrote:

> > The kernel APIs for all this aren't really good
> > though, and I always was reluctant to check for PF_KTHREAD, as that
> > flag is neither documented for userspace, nor available in any
> > userspace-accessible headers. However, I wanted to tighten this a bit,
> > and hence we now define the flag in our own code, as it appeared to me
> > otherwise there was no chance to ever make this fully robust.
> 
> PF_KTHREAD was introduced in 2.6.27 and its value appears to be stable
> since then. I think it should be retroactively declared part of ABI
> otherwise people will continue to rediscover that all other means do not
> work.

Yes, I agree. And it's exposed to userspace after all, though not
symbolic, but simply as flags value.

> Empty /proc/*/exe is second best option for systemd and it's only one
> system call.

That doesn't really work for us as readlink() on that requires
CAP_SYS_PTRACE, and we need something that works unprivileged, which
PF_KTHREAD does.
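
As a rough sketch of such an unprivileged check (not systemd's actual
code): the flags word is the ninth field of /proc/<pid>/stat, which is
world-readable. The PF_KTHREAD value below is copied from the kernel's
include/linux/sched.h, since no userspace header provides it; the function
names are made up for this example.

```python
PF_KTHREAD = 0x00200000  # from the kernel's include/linux/sched.h


def stat_line_is_kernel_thread(stat_line):
    """Check the flags field of one /proc/<pid>/stat line for PF_KTHREAD."""
    # comm (field 2) is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split after the *last* ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest: state, ppid, pgrp, session, tty_nr, tpgid, flags, ...
    flags = int(rest[6])
    return bool(flags & PF_KTHREAD)


def pid_is_kernel_thread(pid):
    with open(f"/proc/{pid}/stat") as f:
        return stat_line_is_kernel_thread(f.read())
```

Splitting at the last closing parenthesis matters because a process can
set its own comm to something containing ") " to confuse naive parsers.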

Lennart

-- 
Lennart Poettering, Red Hat

Re: [systemd-devel] [PATCH] firmware: wake all waiters

2017-06-28 Thread Lennart Poettering

On Wed, 28.06.17 09:06, Luis R. Rodriguez (mcg...@kernel.org) wrote:

> On Wed, Jun 28, 2017 at 12:06 AM, Lennart Poettering
> <mzxre...@0pointer.de> wrote:
> > On Wed, 28.06.17 00:24, Luis R. Rodriguez (mcg...@kernel.org) wrote:
> >
> >> > Do you know how systemd developers feel about the issue (CCed)?  Given
> >> > that it seems to dominate in data center OSes now I'm slightly worried
> >> > having to push Big Linux Vendors to package some seemingly
> >> > embedded-centric software just to make advanced NICs run :(
> >>
> >> firmwared was written by a systemd developer :)
> >
> > No it wasn't. I don't know what firmwared is really. Sorry.
> 
> Is Tom Gundersen not a systemd developer?

Not really anymore, and "firmwared" is an effort independent of
systemd, never was part of it, and while I heard Tom was working on
this I was not aware of the project's naming or anything else...

Lennart

-- 
Lennart Poettering, Red Hat

Re: [systemd-devel] [PATCH] firmware: wake all waiters

2017-06-28 Thread Lennart Poettering

On Wed, 28.06.17 00:24, Luis R. Rodriguez (mcg...@kernel.org) wrote:

> > Do you know how systemd developers feel about the issue (CCed)?  Given
> > that it seems to dominate in data center OSes now I'm slightly worried
> > having to push Big Linux Vendors to package some seemingly
> > embedded-centric software just to make advanced NICs run :(
> 
> firmwared was written by a systemd developer :)

No it wasn't. I don't know what firmwared is really. Sorry.

> I think it was first packaged into systemd, and then it was split out to
> help those who want it external.

Certainly not. I'd sure know about that. ;-)

Lennart

-- 
Lennart Poettering, Red Hat

Re: [systemd-devel] [WIP PATCH 0/4] Rework the unreliable LID switch exported by ACPI

2017-06-16 Thread Lennart Poettering

On Fri, 16.06.17 11:06, Bastien Nocera (had...@hadess.net) wrote:

> > Let's consider this case with delay:
> > After resume, gnome-setting-daemon queries SW_LID and got "close".
> > Then it lights up the wrong monitors.
> > Then I believe "open" will be delivered to it several seconds later.
> > Should gnome-setting-daemon light-up correct monitors this time?
> > So it just looks like user programs behave with a delay accordingly because 
> > of the "platform turnaround" delay.
> 
> If you implement it in such a way that GNOME settings daemon behaves weirdly, 
> you'll get my revert request in the mail. Do. Not. Ever. Lie.

Just to mention this:

the reason logind applies the timeout and doesn't immediately react to
lid changes is to be friendly to users, if they quickly close and
reopen the lid. It's not supposed to be a work-around for broken
input drivers.

I am very sure that input drivers shouldn't lie to userspace. If you
don't know the state of the switch, then you don't know it, and should
clarify that to userspace somehow.
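
The kind of holdoff meant here can be sketched as follows. This is an
illustration of the concept only, not logind's implementation; the class
name, the explicit "now" timestamps and the holdoff value are
hypothetical.

```python
class LidHoldoff:
    """Only act on a lid close that has lasted at least `holdoff_seconds`."""

    def __init__(self, holdoff_seconds):
        self.holdoff_seconds = holdoff_seconds
        self._closed_since = None

    def on_lid_event(self, closed, now):
        if closed:
            # Remember when the lid was first seen closed.
            if self._closed_since is None:
                self._closed_since = now
        else:
            # Reopening cancels any pending reaction.
            self._closed_since = None

    def should_act(self, now):
        return (self._closed_since is not None
                and now - self._closed_since >= self.holdoff_seconds)
```

A quick close-and-reopen never triggers anything, which is friendly to
users. It does nothing to repair a driver reporting a wrong state.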

Lennart

-- 
Lennart Poettering, Red Hat

Re: [systemd-devel] [WIP PATCH 0/4] Rework the unreliable LID switch exported by ACPI

2017-06-07 Thread Lennart Poettering

On Thu, 01.06.17 20:46, Benjamin Tissoires (benjamin.tissoi...@redhat.com) wrote:

> Hi,
> 
> Sending this as a WIP as it still needs a few changes, but it mostly works as
> expected (still not fully compliant yet).
> 
> So this is based on Lennart's comment in [1]: if the LID state is not reliable,
> the kernel should not export the LID switch device as long as we are not sure
> about its state.

Ah nice! I (obviously) like this approach.

> Note that systemd currently doesn't sync the state when the input node just
> appears. This is a systemd bug, and it should not be handled by the kernel
> community.

Uh if this is borked, we should indeed fix this in systemd. Is there
already a systemd github bug about this? If not, please create one,
and we'll look into it!

Thanks for working on this,

Lennart

-- 
Lennart Poettering, Red Hat

Re: [PATCH] x86: defconfig: Enable CONFIG_FHANDLE

2014-12-01 Thread Lennart Poettering

On Mon, 01.12.14 14:54, Dave Chinner (da...@fromorbit.com) wrote:

> On Mon, Dec 01, 2014 at 02:03:43AM +0100, Lennart Poettering wrote:
> > On Mon, 01.12.14 01:41, Richard Weinberger (rich...@nod.at) wrote:
> > 
> > > CC'ing systemd folks.
> > > 
> > > Lennart, can you please explain why you need CONFIG_FHANDLE for systemd?
> > > Maybe I'm reading the source horribly wrong.
> > 
> > For two usecases:
> > 
> > a) Being able to detect if something is a mount point. The traditional
> >way to do this is by stat()ing the dir in question and its parent
> >and comparing st_dev. That logic is not able to detect bind mounts
> >however, if destination and the place the mount is at are actually
> >on the same file system... Thus we check the mount id too, if we
> >can get our hands on it.
> 
> So what you really want is the mount id in st_buf.st_dev, not the
> underlying device number. i.e. fstatat(dirfd, path, buf,
> AT_MOUNTID)?

Well, I am not a fan of overloading things, and there might be reasons
why one would want to know both the mount id and the device id at the
same time with one atomic call, but ultimately I don't really care,
and fstatat(AT_MOUNT_ID) would certainly be at least as useful as
name_to_handle_at() is. 

Lennart
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: [PATCH] x86: defconfig: Enable CONFIG_FHANDLE

2014-11-30 Thread Lennart Poettering

On Mon, 01.12.14 01:41, Richard Weinberger (rich...@nod.at) wrote:

> CC'ing systemd folks.
> 
> Lennart, can you please explain why you need CONFIG_FHANDLE for systemd?
> Maybe I'm reading the source horrible wrong.

For two usecases:

a) Being able to detect if something is a mount point. The traditional
   way to do this is by stat()ing the dir in question and its parent
   and comparing st_dev. That logic is not able to detect bind mounts
   however, if destination and the place the mount is at are actually
   on the same file system... Thus we check the mount id too, if we
   can get our hands on it. This actually fixes real-life
   problems. For example time-based recursive clean-up logic in /tmp,
   where it is desirable that the clean up stops at
   submounts. However, we had reports where the clean-up fucked up
   people's home directories because they mounted them for some reason
   into some subdir in /tmp and they had /tmp and /home on the same
   fs.

b) Because we sometimes want to know the mount options used for
   specific file systems. For that you want to correlate
   /proc/self/mountinfo with a path in the fs. You can of course try
   to do path prefix matching and then fuck things up as soon as
   people do weird mounts on top of each other. Or you can use the
   mount id name_to_handle_at() tells you and look it up in
   /proc/self/mountinfo, and everything is clean and reliable.

We have no interest in the actual fhandle data. If you give us some
other syscall to figure out the mount id, we'd be delighted to use it
instead.
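
To make the limitation in a) concrete, here is a small sketch of the
traditional check (the function name is made up for this example). It
compares st_dev of a directory and its parent, and therefore cannot see a
bind mount whose source lives on the same file system:

```python
import os


def looks_like_mount_point(path):
    """Traditional st_dev comparison; blind to same-fs bind mounts."""
    st = os.lstat(path)
    parent = os.lstat(os.path.join(path, os.pardir))
    if st.st_dev != parent.st_dev:
        return True   # crosses onto another file system
    if st.st_ino == parent.st_ino:
        return True   # "/" is its own parent
    # Same device, different inode: reported as "not a mount point", which
    # is also the (wrong) answer for a bind mount from the same fs.
    return False
```

Comparing mount ids instead closes exactly that gap.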

udev uses the logic described in b) to determine if /dev is a devtmpfs
instance. Since devtmpfs has the same fs magic as any other tmpfs we
cannot use the statfs() magic stuff to detect this case.
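
The lookup side of this can be sketched as below, assuming the mount id
has already been obtained (e.g. from name_to_handle_at(), which Python's
standard library does not wrap). The parser follows the field layout of
/proc/self/mountinfo as documented in proc(5); the function names are
hypothetical and this is not udev's code.

```python
def parse_mountinfo(text):
    """Map mount id -> details for text in /proc/self/mountinfo format."""
    mounts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(" ")
        # Fields 0..5 are fixed; optional fields follow until a lone "-".
        sep = fields.index("-", 6)
        mounts[int(fields[0])] = {
            "parent_id": int(fields[1]),
            "mount_point": fields[4],
            "options": fields[5].split(","),
            "fs_type": fields[sep + 1],
            "source": fields[sep + 2],
        }
    return mounts


def is_devtmpfs(mount_id, mounts):
    return mounts[mount_id]["fs_type"] == "devtmpfs"
```

Because the match is on the mount id rather than on path prefixes, mounts
stacked on top of each other do not confuse the lookup.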

Lennart

Re: [PATCH] Hyperv: Trigger DHCP renew after host hibernation

2014-07-21 Thread Lennart Poettering

On Mon, 21.07.14 10:21, Yue Zhang (OSTC DEV) (yue...@microsoft.com) wrote:

> Some network monitoring daemons, like ifplugd, have a deferring mechanism.
> When they detect that the carrier is offline, they don't trigger a DHCP renew
> immediately. Instead they wait for another 5 seconds to check whether the
> carrier is back online. In that case, they avoid renewing the DHCP lease.

ifplugd doesn't renew DHCP leases anyway, one of the scripts it invokes
does.

ifplugd is obsolete software. I wrote it more than 10 years ago, and
haven't really updated it since. It sounds seriously wrong to add
multi-second waits to the kernel just to make this crappy, obsolete
software work.

Please fix this properly, and work with the PM guys, so that we get a
sane interface through which the kernel can notify userspace about
suspends/hibernations triggered from the outside, so that userspace
daemons can subscribe to that and then refresh the DHCP leases on their
own.

Lennart

-- 
Lennart Poettering, Red Hat

Re: [systemd-devel] [PATCH v5 12/14] autoconf: xen: enable explicit preference option for xenstored preference

2014-06-05 Thread Lennart Poettering

On Thu, 05.06.14 20:01, Luis R. Rodriguez (mcg...@suse.com) wrote:

> > Hmm? You should "exec" the real daemon binary at the end, not just fork
> > it off. That way the shell script process is replaced by the daemon
> > binary, which is what you want.
> 
> I tried both just running it and also running exec foo; both presented
> the same issue given that shell exec does not really execve.

Hmmm? Your shell's "exec" command doesn't actually execve()? What are you
using? This doesn't sound very accurate...

Lennart

-- 
Lennart Poettering, Red Hat

Re: [systemd-devel] [PATCH v5 12/14] autoconf: xen: enable explicit preference option for xenstored preference

2014-06-05 Thread Lennart Poettering

On Thu, 05.06.14 02:31, Luis R. Rodriguez (mcg...@suse.com) wrote:

> On Sun, Jun 01, 2014 at 08:15:47AM +0200, Lennart Poettering wrote:
> > On Fri, 30.05.14 01:29, Luis R. Rodriguez (mcg...@suse.com) wrote:
> > 
> > > > I'm cc'ing a few security folks as I'd appreciate review on the ideas here,
> > > > in particular that of a launcher idea on system to replace alternatives on the
> > > > ExecStart= line of a systemd service unit file, alternative ideas are of
> > > > course welcomed. I'm also Cc'ing systemd-devel as this subject was reviewed
> > > > a little while ago with nothing concrete being recommended but instead a few
> > > > options being now archived as possibilities. I'm looking for a bit wider
> > > > review of the approaches and recommendations.
> > > >
> > > > Some general background for non xen folks: old xen requires the launch of
> > > > a daemon which implements support for the xenstore, which is the database
> > > > that xen uses for information about guests / dom0. There are two supported
> > > > daemons, xenstored (C version) and oxenstored (Ocaml version) but they do the
> > > > same thing. Right now old init lets you override which one you pick through
> > > > an environment variable on /etc/{sysconfig,default}/xencommons, the script
> > > > will use the appropriate one there. Systemd doesn't let you use variables on
> > > > the ExecStart line of a service unit file so alternatives are required.
> > > >
> > > > The reason I'm being very careful here is that this could set a precedent and at
> > > > least for the launcher idea it'd require the usage of getenv() and execve(),
> > > > and secure alternatives for these (secure_getenv(), execve_nosecurity())
> > > > have either been merged or suggested before for Linux. The systemd discussion
> > > > is only specific to Linux but if we have a launcher we could consider it for
> > > > other supported OSes. All that said I'd like proper review of the security
> > > > implications of *all* strategies but obviously in particular the launcher
> > > > idea. I want to tread carefully before setting precedents.
> > 
> > You can also just invoke a shell script from ExecStart=. I mean, we try
> > to de-emphasize them in the boot process, but there's nothing wrong with
> > using shell, if you need to parse shell configuration fragments and just
> > want to execute one or another program...
> 
> I tried this and it didn't work given that systemd expects sd_notify()
> to be called from the parent process, in this case the shell script.

Hmm? You should "exec" the real daemon binary at the end, not just fork
it off. That way the shell script process is replaced by the daemon
binary, which is what you want.
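
To illustrate the point about "exec": a minimal sketch, assuming a Linux
system, with Python standing in for the shell wrapper. The function name
pids_before_and_after_exec and the inline WRAPPER program are made up for
this illustration. It shows that a process which calls exec keeps its PID,
so the service manager keeps tracking the very same process once the real
daemon has taken over.

```python
import os
import subprocess
import sys

# Hypothetical wrapper program: it prints its own PID and then replaces
# itself (via exec) with a second program that prints its PID as well.
WRAPPER = r"""
import os, sys
print(os.getpid(), flush=True)  # flush: exec discards unwritten buffers
os.execv(sys.executable,
         [sys.executable, "-c", "import os; print(os.getpid())"])
"""


def pids_before_and_after_exec():
    """Run the wrapper and return the PID seen before and after exec."""
    result = subprocess.run([sys.executable, "-c", WRAPPER],
                            capture_output=True, text=True, check=True)
    before, after = (int(line) for line in result.stdout.split())
    return before, after


if __name__ == "__main__":
    before, after = pids_before_and_after_exec()
    print("same PID after exec:", before == after)
```

A wrapper that merely forks the daemon off would instead leave the wrapper
as the process systemd started, with the daemon running as its child.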

Lennart

-- 
Lennart Poettering, Red Hat

Re: [systemd-devel] [PATCH v5 12/14] autoconf: xen: enable explicit preference option for xenstored preference

2014-06-01 Thread Lennart Poettering

On Fri, 30.05.14 01:29, Luis R. Rodriguez (mcg...@suse.com) wrote:

> I'm cc'ing a few security folks as I'd appreciate review on the ideas here,
> in particular that of a launcher idea on system to replace alternatives on the
> ExecStart= line of a systemd service unit file, alternative ideas are of
> course welcomed. I'm also Cc'ing systemd-devel as this subject was reviewed
> a little while ago with nothing concrete being recommended but instead a few
> options being now archived as possibilities. I'm looking for a bit wider
> review of the approaches and recommendations.
> 
> Some general background for non xen folks: old xen requires the launch of
> a daemon which implements support for the xenstore, which is the database
> that xen uses for information about guests / dom0. There are two supported
> daemons, xenstored (C version) and oxenstored (Ocaml version) but they do the
> same thing. Right now old init lets you override which one you pick through
> an environment variable on /etc/{sysconfig,default}/xencommons, the script
> will use the appropriate one there. Systemd doesn't let you use variables on
> the ExecStart line of a service unit file so alternatives are required.
> 
> The reason I'm being very careful here is that this could set a precedent and at
> least for the launcher idea it'd require the usage of getenv() and execve(),
> and secure alternatives for these (secure_getenv(), execve_nosecurity())
> have either been merged or suggested before for Linux. The systemd discussion
> is only specific to Linux but if we have a launcher we could consider it for
> other supported OSes. All that said I'd like proper review of the security
> implications of *all* strategies but obviously in particular the launcher
> idea. I want to tread carefully before setting precedents.

You can also just invoke a shell script from ExecStart=. I mean, we try
to de-emphasize them in the boot process, but there's nothing wrong with
using shell, if you need to parse shell configuration fragments and just
want to execute one or another program...

That said, I'd certainly make a clean cut and drop support for
/etc/sysconfig from any project I see, earlier rather than later, since
it's just cruft, a bad idea and should really just go away. But then
again, I would also just not do the thing with supporting two
implementations at the same time... 

Lennart

-- 
Lennart Poettering, Red Hat

Re: [systemd-devel] Suspending access to opened/active /dev/nodes during application runtime

2014-03-07 Thread Lennart Poettering

On Fri, 07.03.14 21:51, Lukasz Pawelczyk (hav...@gmail.com) wrote:

> >> Problem:
> >> Has anyone thought about a mechanism to limit/remove an access to a
> >> device during an application runtime? Meaning we have an
> >> application that has an open file descriptor to some /dev/node and
> >> depending on *something* it gains or loses the access to it
> >> gracefully (with or without a notification, but without any fatal
> >> consequences).
> > 
> > logind can mute input devices as sessions are switched, to enable
> > unprivileged X11 and wayland compositors.
> 
> Would you please elaborate on this? Where is this mechanism? How does
> it work without kernel space support? Is there some kernel space
> support I’m not aware of?

There's EVIOCREVOKE for input devices and
DRM_IOCTL_SET_MASTER/DRM_IOCTL_DROP_MASTER for DRM devices. See logind
sources.

> > Before you think about doing something like this, you need to fix the
> > kernel to provide namespaced devices (good luck!)
> 
> Precisely! That’s the generic idea. I’m not for implementing it though
> at this moment. I just wanted to know whether anybody actually thought
> about it or maybe someone is interested in starting such a work, etc.

It's not just about turning on and turning off access to the event
stream. It's mostly about enumeration and probing which doesn't work in
containers, and is particularly messy if you intend to share devices
between containers.

> > logind can do this for you between sessions. But such a container setup
> > will never work without proper device namespacing.
> 
> So how can it do it when there is no kernel support? You mean it could
> be doing this if the support were there?

EVIOCREVOKE and the DRM ioctls are pretty real...

Lennart

-- 
Lennart Poettering, Red Hat

Re: [systemd-devel] Suspending access to opened/active /dev/nodes during application runtime

2014-03-07 Thread Lennart Poettering

On Fri, 07.03.14 19:45, Lukasz Pawelczyk (hav...@gmail.com) wrote:

> Problem:
> Has anyone thought about a mechanism to limit/remove an access to a
> device during an application runtime? Meaning we have an application
> that has an open file descriptor to some /dev/node and depending on
> *something* it gains or loses the access to it gracefully (with or
> without a notification, but without any fatal consequences).

logind can mute input devices as sessions are switched, to enable
unprivileged X11 and wayland compositors.

> Example:
> LXC. Imagine we have 2 separate containers. Both running full operating
> systems. Specifically with 2 X servers. Both running concurrently of

Well, devices are not namespaced on Linux (with the single exception of
network devices). An X server needs device access, hence this doesn't
fly at all.

When you enumerate devices with libudev in a container they will never
be marked as "initialized" and you do not get any udev hotplug events in
containers, and you don't have the host's udev db around, nor would it
make any sense to you if you had. X11 and friends rely on udev
however...

Before you think about doing something like this, you need to fix the
kernel to provide namespaced devices (good luck!)

> course. Both need the same input devices (e.g. we have just one mouse).
> This creates a security problem when we want to have completely separate
> environments. One container is active (being displayed on a monitor and
> controlled with a mouse) while the other container runs evtest
> /dev/input/something and grabs the secret password user typed in the
> other.

logind can do this for you between sessions. But such a container setup
will never work without proper device namespacing.

> Solutions:
> The complete solution would consist of 2 parts:
> - a mechanism that would allow to temporarily "hide" a device from an
> open file descriptor.
> - a mechanism for deciding whether application/process/namespace should
> have an access to a specific device at a specific moment

Well, there's no point in inventing any "mechanisms" like this, as long
as devices are not namespaced in the kernel, so that userspace in
containers can enumerate/probe/identify/... things correctly...

Lennart

-- 
Lennart Poettering, Red Hat

Re: [PATCH 1/1] add StartTimeMonotomic, StartTimeBootTime to per pid in /proc

2014-01-24 Thread Lennart Poettering

On Fri, 24.01.14 12:32, Peter Zijlstra (pet...@infradead.org) wrote:

> > The process starttime is useful for a variety of things, like figuring
> > out creation ordering of processes. Or it is useful to detect PID
> > reuses in a somewhat reliable way. 
> 
> OK, maybe. Changelog should have said so.
> 
> > It is useful information to show the admin in "ps".
> 
> Does the one jiffy rounding really matter there? I doubt it, ps
> typically shows in second granularity.

Well, it's just annoying. Much of userspace uses CLOCK_MONOTONIC
for all its local timestamping needs these days; however, the jiffy
rounding and the fact that "starttime" is based on CLOCK_BOOTTIME make
it hard to compare process timestamps with other timestamps...

> > Profilers like "bootchart" can use this information to
> > plot when precisely specific process got started. From the outside it is
> > often useful to see for how long a specific process has already been
> > running, for accounting needs, and so on.
> 
> Profilers have far better interfaces than /proc to get information
> from.

That is true, but note that at least on Fedora taskstats and things are
actually disabled these days in the kernel, since they slow things down
too much. The /proc interface is certainly much nicer there, since it
relies on the timestamping the kernel does anyway...

> > Note that Dan's patch doesn't add any new timestamp logic to the kernel,
> > it just exposes the existing timestamps in a way to userspace that is
> > more in line with the rest of timestamps exposed. 
> 
> Yeah, Dan was also too lazy to explain the need, and had like 3 typoes
> in the inadequate changelog he had.
> 
> He also fails to explain why he needs the timestamp twice, as do you for
> that matter.

Well, I am mostly interested in the monotonic timestamp. But given
that the kernel keeps the boottime clock value as well, and already
exposes it in a skewed way to userspace it looked like a natural choice
to also expose that time in a clean way, while we are at it...

Lennart

-- 
Lennart Poettering, Red Hat

Re: [PATCH 1/1] add StartTimeMonotomic, StartTimeBootTime to per pid in /proc

2014-01-24 Thread Lennart Poettering

On Wed, 22.01.14 16:53, Peter Zijlstra (pet...@infradead.org) wrote:

> 
> On Tue, Jan 21, 2014 at 07:10:04AM -0800, Dan Ballard wrote:
> > starttime in /proc/$PID/stat is inaccurate by "clock tick" granularity.
> > The kernel keeps better track os this exposes that in /prod/$PID/status
> > as StartTimeMonotonic and StartTimeBootTime
> 
> Why?

Well, the canonical way to expose clocks to userspace these days is with
CLOCK_MONOTONIC, CLOCK_BOOTTIME, and so on. The starttime is currently
exposed in a way that is made inaccurate by the clock tick in
/proc/$PID/stat. Dan's patch simply unfucks that interface.
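
To make the granularity problem concrete, here is a small sketch of what
userspace has to do today, assuming Linux and the field layout documented
in proc(5), where "starttime" is field 22 of /proc/$PID/stat, counted in
clock ticks. The function names are made up for this example.

```python
import os


def clock_tick_resolution():
    """Seconds per clock tick; the best precision 'starttime' can offer."""
    return 1.0 / os.sysconf("SC_CLK_TCK")


def process_start_seconds(pid="self"):
    """Return the process start time in seconds since boot.

    The value is only as precise as one clock tick, which is the
    inaccuracy discussed above.
    """
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The comm field (2) may contain spaces or parentheses, so split only
    # the part after the last ')'. That part starts with field 3 (state).
    fields = stat[stat.rindex(")") + 2:].split()
    ticks = int(fields[19])  # field 22 (starttime) is index 19 from field 3
    return ticks / os.sysconf("SC_CLK_TCK")


if __name__ == "__main__":
    print("tick resolution:", clock_tick_resolution(), "s")
    print("started at:", process_start_seconds(), "s after boot")
```

With the common SC_CLK_TCK value of 100 this yields 10 ms steps, and the
result still has to be converted by hand before it can be compared with
CLOCK_MONOTONIC timestamps taken elsewhere.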

The process starttime is useful for a variety of things, like figuring
out creation ordering of processes. Or it is useful to detect PID
reuses in a somewhat reliable way. It is useful information to show the
admin in "ps". Profilers like "bootchart" can use this information to
plot when precisely specific process got started. From the outside it is
often useful to see for how long a specific process has already been
running, for accounting needs, and so on.

Note that Dan's patch doesn't add any new timestamp logic to the kernel,
it just exposes the existing timestamps in a way to userspace that is
more in line with the rest of timestamps exposed. 

Lennart

Re: [PATCH] acpi/video: Add Lenovo IdeaPad Yoga 13 to acpi video detect blacklist

2013-10-14 Thread Lennart Poettering

On Mon, 14.10.13 15:48, Matthew Garrett (matthew.garr...@nebula.com) wrote:

> On Mon, 2013-10-14 at 17:36 +0200, Lennart Poettering wrote:
> 
> > Sorry, still not getting this. How should this ever work if the intel
> > video driver is compiled as kmod? That means that it isn't clear at all
> > when the kmod is going to be loaded or if it is loaded at all, you
> > cannot delay the registration of the acpi backlight that long, since the
> > time you'd have to wait is basically unbounded...
> 
> See the intel_opregion_present() code in drivers/acpi/video.c. The ACPI
> driver won't bind to Intel hardware until i915 indicates that it should
> do so.

Hmm, OK, so this means that the acpi backlight will check whether i915
hw is around, and not whether the i915 driver is actually loaded? That
would work for me I guess. Thanks.
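
The distinction drawn here (hardware being present versus the driver being loaded) can be observed from userspace as well. The following is only a rough sketch of that distinction, not what intel_opregion_present() does in the kernel: it assumes the usual sysfs layout (`/sys/bus/pci/devices/*/vendor`, `.../class`, `/sys/module/i915`), and the function names are made up for illustration.

```python
from pathlib import Path

INTEL_VENDOR = "0x8086"  # PCI vendor ID of Intel, as printed by sysfs


def intel_gpu_present(pci_root="/sys/bus/pci/devices"):
    """Return True if any Intel display-class PCI device exists.

    This only looks at the hardware, regardless of which driver
    (if any) is bound to it.
    """
    base = Path(pci_root)
    if not base.is_dir():
        return False
    for dev in base.iterdir():
        try:
            vendor = (dev / "vendor").read_text().strip()
            pci_class = (dev / "class").read_text().strip()
        except OSError:
            continue
        # PCI base class 0x03 is "display controller".
        if vendor == INTEL_VENDOR and pci_class.startswith("0x03"):
            return True
    return False


def i915_loaded(module_root="/sys/module"):
    """Return True if the i915 driver is currently loaded or built in."""
    return (Path(module_root) / "i915").is_dir()


if __name__ == "__main__":
    print("Intel GPU hardware present:", intel_gpu_present())
    print("i915 driver loaded:", i915_loaded())
```

On a distro kernel early in boot the first check can already be true while the second is still false, which is the case the question above is about.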

That means on win8 machines with win8 graphics there'll always be a
single backlight device in userspace only?
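
For reference, a minimal sketch of how userspace could check how many backlight devices are exposed and of which kind. It assumes the standard sysfs backlight class, where each device has a `type` attribute ("firmware" for ACPI, "platform" for vendor drivers, "raw" for GPU drivers); the helper name is hypothetical.

```python
from pathlib import Path


def list_backlights(root="/sys/class/backlight"):
    """Return a {device_name: type} mapping for all backlight devices."""
    result = {}
    base = Path(root)
    if not base.is_dir():
        return result
    for dev in sorted(base.iterdir()):
        try:
            result[dev.name] = (dev / "type").read_text().strip()
        except OSError:
            # Device vanished or attribute unreadable; keep the name anyway.
            result[dev.name] = "unknown"
    return result


if __name__ == "__main__":
    for name, kind in list_backlights().items():
        print(f"{name}: {kind}")
```

With only the GPU driver's device registered, this would list a single entry of type "raw".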

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

Re: [PATCH] acpi/video: Add Lenovo IdeaPad Yoga 13 to acpi video detect blacklist

2013-10-14 Thread Lennart Poettering

On Mon, 14.10.13 12:17, Aaron Lu (aaron...@intel.com) wrote:

> > Hmm, regarding your patch series, do you plan to skip the registration
> > of the acpi backlight device if the "raw" device is supported? I mean,
> 
> Yes, that's right.
> 
> > the intel driver could be compiled as a module (and generally is on the
> > popular distros), so at the time the ACPI subsystem wants to register
> > the backlight device and know if a raw backlight device is around it
> > never will be, so what is the point of that? Or am I missing something?
> 
> For systems with Intel i915 GPU, ACPI video will wait for GPU driver to
> run first, see drivers/acpi/video.c acpi_video_init, the actual
> acpi_video_register function is called by i915 driver in
> i915_driver_load due to operation region related stuff. Since all
> problematic systems reported so far has an Intel GPU, I'm doing it this
> way now. If things change, we can enhance it then.

Sorry, still not getting this. How should this ever work if the intel
video driver is compiled as kmod? That means that it isn't clear at all
when the kmod is going to be loaded or if it is loaded at all, you
cannot delay the registration of the acpi backlight that long, since the
time you'd have to wait is basically unbounded...

So, how could this ever work?

AFAICS all popular distros ship the video drivers as kernel modules,
hence trying to avoid registration of the ACPI backlight if the intel
driver is compiled will be an entirely pointless exercise on all
distros?

What am I missing?

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

Re: [PATCH] acpi/video: Add Lenovo IdeaPad Yoga 13 to acpi video detect blacklist

2013-10-13 Thread Lennart Poettering

On Mon, 14.10.13 10:36, Aaron Lu (aaron...@intel.com) wrote:

> > - Backlight control doesn't work via ACPI without acpi_osi="!Windows 2012"
> > - Backlight control doesn't work via ACPI with acpi_osi="!Windows 2012"
> > - Backlight control doesn't work via EC commands from ideapad-laptop.c
> > 
> > The only way backlight handling is supported on the Yoga 13 is via the
> > intel video driver.
> 
> Since I don't have access to the acpidump of this system, my only
> question is, does the firmware has a _OSI("Windows 2012") query in DSDT
> table?

Yes, it appears to do that as part of _OSC.

The disassembled DSDT table is here:

http://0pointer.de/public/yoga13-dsdt.dsl

> > Or in other words: the situation for the Yoga 13 is *unrelated* to the
> > Windows 8 issues, and your patch.
> 
> I think they are related...
> If the firmware is compatible to Windows 8, then my patch will disable
> ACPI video backlight interface to prefer GPU's interface.

OK, that might indeed work.

> > I'll soon send another patch which also blacklists the thing in the
> > ideapad driver, so that only the intel backlight driver is enabled on
> > Yoga 13 systems, at which point everything will work fine.
> 
> Right, that is needed. And if going with my patch, the ideapad driver
> will need to be patched similarly like thinkpad_acpi to add a check of
> acpi_video_backlight_support before it decides to register its own
> backlight interface.

The ideapad driver currently skips registration of the backlight device
if acpi_video_backlight_support() returns true. Is that all you need?

Hmm, regarding your patch series, do you plan to skip the registration
of the acpi backlight device if the "raw" device is supported? I mean,
the intel driver could be compiled as a module (and generally is on the
popular distros), so at the time the ACPI subsystem wants to register
the backlight device and know if a raw backlight device is around it
never will be, so what is the point of that? Or am I missing something?

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

Re: [PATCH] acpi/video: Add Lenovo IdeaPad Yoga 13 to acpi video detect blacklist

2013-10-13 Thread Lennart Poettering

On Mon, 14.10.13 09:11, Aaron Lu (aaron...@intel.com) wrote:

> > Note that this appears unrelated to the Windows 8 backlight issues tracked
> > here:
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=51231
> > https://bugzilla.kernel.org/show_bug.cgi?id=60682
> > 
> > The Yoga's ACPI backlight controls work neither with nor without
> > acpi_osi="!Windows 2012" on the kernel command line. It appears that
> > backlight control via the EC simply is not available at all, regardless
> > whether done via ACPI or via the vendor driver.
> 
> Just a side note, if the firmware of Yoga 13 has a _OSI("Windows 2012")
> query, then it should be solved with the patch proposed here:
> https://lkml.org/lkml/2013/10/11/409, Fix Win8 backlight issue.
> 
> We are still discussing a proper default behaviour in that patchset.

No. 

Did you actually read the commit message of the patch? Please do!

The backlight for the Yoga 13 doesn't work, regardless of what the _OSI
value is. In fact, it doesn't even work by directly accessing the EC via
the ideapad-laptop driver.

So, again:

- Backlight control doesn't work via ACPI without acpi_osi="!Windows 2012"
- Backlight control doesn't work via ACPI with acpi_osi="!Windows 2012"
- Backlight control doesn't work via EC commands from ideapad-laptop.c

The only way backlight handling is supported on the Yoga 13 is via the
intel video driver.

Or in other words: the situation for the Yoga 13 is *unrelated* to the
Windows 8 issues, and your patch.

Hence this patch I posted, which blacklists the backlight control
entirely in the ACPI driver, since the Windows 2012 setting is
irrelevant to it.

I'll soon send another patch which also blacklists the thing in the
ideapad driver, so that only the intel backlight driver is enabled on
Yoga 13 systems, at which point everything will work fine.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

[PATCH] acpi/video: Add Lenovo IdeaPad Yoga 13 to acpi video detect blacklist

2013-10-13 Thread Lennart Poettering

On the Yoga 13 the backlight control doesn't work via ACPI. (And doesn't
work either with the low-level platform driver ideapad_laptop; but
works correctly via the intel video driver).  This patch hence adds the
Yoga 13 to the ACPI video detect blacklist, to make sure the broken ACPI
backlight device is never exposed to userspace.

Note that this appears unrelated to the Windows 8 backlight issues tracked
here:

https://bugzilla.kernel.org/show_bug.cgi?id=51231
https://bugzilla.kernel.org/show_bug.cgi?id=60682

The Yoga's ACPI backlight controls work neither with nor without
acpi_osi="!Windows 2012" on the kernel command line. It appears that
backlight control via the EC simply is not available at all, regardless
of whether it is done via ACPI or via the vendor driver.

Signed-off-by: Lennart Poettering <lenn...@poettering.net>
---
 drivers/acpi/video_detect.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/acpi/video_detect.c b/drivers/acpi/video_detect.c
index 940edbf..a88e8f7 100644
--- a/drivers/acpi/video_detect.c
+++ b/drivers/acpi/video_detect.c
@@ -168,6 +168,14 @@ static struct dmi_system_id video_detect_dmi_table[] = {
                DMI_MATCH(DMI_PRODUCT_NAME, "UL30A"),
                },
        },
+       {
+        .callback = video_detect_force_vendor,
+        .ident = "Lenovo Yoga 13",
+        .matches = {
+               DMI_MATCH(DMI_SYS_VENDOR, "LENOVO"),
+               DMI_MATCH(DMI_PRODUCT_VERSION, "Lenovo IdeaPad Yoga 13"),
+               },
+       },
        { },
 };
 
-- 
1.8.3.1

Lennart
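
To check from userspace whether a given machine would hit the new table entry, something like the following sketch could be used. It assumes the DMI strings are exported under `/sys/class/dmi/id/` and that DMI_MATCH() does a substring comparison; the helper names are made up for illustration and are not part of the patch.

```python
from pathlib import Path

# Same fields and strings as the DMI_MATCH() lines in the patch above.
YOGA13_MATCHES = {
    "sys_vendor": "LENOVO",
    "product_version": "Lenovo IdeaPad Yoga 13",
}


def read_dmi(root="/sys/class/dmi/id"):
    """Read the DMI fields we care about; missing fields become ''."""
    fields = {}
    for name in YOGA13_MATCHES:
        try:
            fields[name] = (Path(root) / name).read_text().strip()
        except OSError:
            fields[name] = ""
    return fields


def matches_yoga13(fields):
    """Return True if every required substring is found in its field."""
    return all(
        substring in fields.get(name, "")
        for name, substring in YOGA13_MATCHES.items()
    )


if __name__ == "__main__":
    dmi = read_dmi()
    print(dmi)
    print("would match the Yoga 13 entry:", matches_yoga13(dmi))
```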

-- 
Lennart Poettering - Red Hat, Inc.

Re: [PATCH] um: change defconfig to stop spawning xterm

2013-07-24 Thread Lennart Poettering

On Tue, 23.07.13 08:57, Al Viro (v...@zeniv.linux.org.uk) wrote:

> 
> On Tue, Jul 23, 2013 at 07:47:07AM +0200, richard -rw- weinberger wrote:
> > Adding Al again, someone dropped him from the CC list...
> 
> FWIW, all this crap stems from the old decision to use major 4 for
> uml consoles.  And it was a bad decision, no arguments here.
> It's also a decision we are years too late to revert.
> 
> a) VT102, let alone the extensions to it, is simply wrong for uml;
> if it's understood by anything, it's on the host userland side.
> xterm(1) has a notion of two-dimensional array of characters on screen,
> organized in logical lines, etc.  So does screen(1).  So does
> drivers/tty/vt/* (i.e. the kernel side of virtual console).  uml
> console does *not* have such a notion - it passes a linear stream
> of octets, sight unseen, to whatever's on the other side of connection.
> Doing an equivalent of drivers/tty/vt/* would mean maintaining such
> a 2D array internally *AND* somehow passing updates to that beast
> to whatever's on the other side.  That could be done (after all,
> libcurses manages), but it won't be compatible with existing setups
> and it should be a separate driver, anyway.  Granted, it would've
> made a whole lot more sense in role of /dev/tty<n>, but it's too late
> for that now.

The UML tty devices are in most regards pretty much like serial TTYs,
where there's also no meta-information available about which terminal
emulation is actually spoken on them, and that's covered pretty much OK
everywhere...

> b) changing the major of /dev/tty<n> on uml will break existing setups.
> Ain't feasible.  We probably can get away with making that controlled
> by kernel option, and it might make sense to try going that way, but
> I'm not entirely convinced it's worth bothering.  Up to uml maintainer...
> IMO if we go that way, we ought to pass the relevant part of config
> (i.e. is it xterm or pts or plain opened file) in the event udev
> gets, so that the userland would have at least a chance of dealing
> with another real problem - selecting TERM value for getty.

Which major/minor you use is irrelevant to userspace. The userspace API
however assumes that /dev/tty[1..63] refers to the tty devices of the
virtual console. As long as you provide some other TTY under that name
than the virtual console TTYs, you simply provide a broken API to
userspace, and hence programs break. systemd does, gpm does, X11 does,
and everything else that interfaces with the VC via VC APIs does too.

Just pick a different name for the TTYs that UML uses, just not
/dev/tty[1..63] and everything is fine. That's what the virtualization
folks did with their hypervisor consoles, and is what we required from
the container folks too.
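
To illustrate why the name matters: userspace code commonly decides purely by device name whether something is a VT. A minimal sketch of such a check follows, using the 1..63 range mentioned above. The helper is hypothetical and not taken from any of the programs listed.

```python
import re

# "tty" followed directly by digits; names like ttyS0 or hvc0 do not match.
_VT_NAME = re.compile(r"^(?:/dev/)?tty([0-9]+)$")


def vt_number(name):
    """Return the VT number if `name` looks like a VT device, else None."""
    match = _VT_NAME.match(name)
    if not match:
        return None
    number = int(match.group(1))
    # tty0 is the alias for the foreground VT, not a VT of its own.
    return number if 1 <= number <= 63 else None


if __name__ == "__main__":
    for candidate in ("/dev/tty1", "/dev/tty0", "/dev/ttyS0", "/dev/hvc0"):
        print(candidate, "->", vt_number(candidate))
```

Anything named /dev/tty1 passes such a check, whether or not the device behind it actually implements the VC interfaces.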

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

Re: [PATCH] um: change defconfig to stop spawning xterm

2013-07-24 Thread Lennart Poettering

On Tue, 23.07.13 07:40, Richard Weinberger (rich...@nod.at) wrote:

> >>>>> UML shouldn't be penalized for not implementing some terminal emulation,
> >>>>> but it should be penalized for doing so under the label of "VT support",
> >>>>> which it simply is not providing.
> >>>>>
> >>>>> They can call their ttys any way they want. If the call them
> >>>>> /dev/tty[1..64] however, then they need to implement the VC
> >>>>> interfaces. All of them.
> >>>
> >>> Lennart, can you please explain us why /dev/tty[1..64] is forced to
> >>> have virtual console support?
> > 
> > /dev/tty[1..64] is the userspace API to the kernel VT subsystem. If you
> > support it you need to match up all /dev/tty[1..64] with a
> > /dev/vcs[1..64] + /dev/vcsa[1..64]. You need to expose a tty that
> > understands TERM=linux and the ioctls listed on console_ioctl(4). You
> > need /dev/tty0 as something that behaves like a symlink to the fg
> > VT. You should also support files like /sys/class/tty/tty0/active with
> > its POLLHUP iface.
> 
> I sightly disagree with you.
> /dev/tty[1..64] is not directly bound to VT.
> You can have systems with CONFIG_VT=n and still have /dev/tty[1..64].
> Linux supports this perfectly.
> UML does not have VT because having virtual consoles makes no sense.
> (Same like on s390)

You are aware that turning off the tty subsystem in the kernel is
something different than turning off the virtual console? Note that the
whole stuff is really confusingly named, as /dev/tty1 is generically named
"tty", even if it actually refers to a virtual console tty and nothing else.

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

Re: [PATCH] um: change defconfig to stop spawning xterm

2013-07-22 Thread Lennart Poettering

On Mon, 22.07.13 16:13, Ramkumar Ramachandra (artag...@gmail.com) wrote:

> 
> [Corrected Lennart's email ID]
> 
> Richard Weinberger wrote:
> > CC'ing Lennart.
> >
> > On 22.07.2013 11:45, Ramkumar Ramachandra wrote:
> >> Ramkumar Ramachandra wrote:
> >>> [1]: 
> >>> http://lists.freedesktop.org/archives/systemd-devel/2013-July/012152.html
> >>
> >> ... and the patches were rejected.  Lennart says that UML providing
> >> /dev/tty* is wrong, and that UML should call them /dev/hvc* (or
> >> something).  Can we do something about the situation?  Can we remove
> >> /dev/tty*, and provide /dev/hvc*?  Will we be breaking existing users?
> >>
> >> Thanks.
> >>
> >> Lennart Poettering wrote:
> >>> UML shouldn't be penalized for not implementing some terminal emulation,
> >>> but it should be penalized for doing so under the label of "VT support",
> >>> which it simply is not providing.
> >>>
> >>> They can call their ttys any way they want. If the call them
> >>> /dev/tty[1..64] however, then they need to implement the VC
> >>> interfaces. All of them.
> >
> > Lennart, can you please explain us why /dev/tty[1..64] is forced to
> > have virtual console support?

/dev/tty[1..64] is the userspace API to the kernel VT subsystem. If you
support it you need to match up all /dev/tty[1..64] with a
/dev/vcs[1..64] + /dev/vcsa[1..64]. You need to expose a tty that
understands TERM=linux and the ioctls listed on console_ioctl(4). You
need /dev/tty0 as something that behaves like a symlink to the fg
VT. You should also support files like /sys/class/tty/tty0/active with
its POLLHUP iface.
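
As a rough illustration of what "implementing the VC interfaces" means for a program on the other side, here is a sketch that probes a tty with one of the ioctls from console_ioctl(4) and reads the active VT from sysfs. The KDGKBTYPE, KB_84 and KB_101 values are the ones from <linux/kd.h>; the function names are hypothetical, and probing via KDGKBTYPE is just one possible heuristic, not the one any particular program is known to use.

```python
import fcntl
import os

KDGKBTYPE = 0x4B33  # <linux/kd.h>: get keyboard type
KB_84 = 0x01
KB_101 = 0x02


def is_virtual_console(path):
    """Return True if `path` answers KDGKBTYPE the way a real VT does."""
    try:
        fd = os.open(path, os.O_RDONLY | os.O_NOCTTY | os.O_NONBLOCK | os.O_CLOEXEC)
    except OSError:
        return False
    try:
        buf = bytearray(1)  # the ioctl writes a single char
        fcntl.ioctl(fd, KDGKBTYPE, buf)
        return buf[0] in (KB_84, KB_101)
    except OSError:
        # ENOTTY and friends: not a VT (serial line, pty, plain file, ...)
        return False
    finally:
        os.close(fd)


def active_vt(path="/sys/class/tty/tty0/active"):
    """Return the name of the foreground VT (e.g. 'tty1'), or None."""
    try:
        with open(path) as f:
            return f.read().strip() or None
    except OSError:
        return None


if __name__ == "__main__":
    print("/dev/tty1 is a VT:", is_virtual_console("/dev/tty1"))
    print("active VT:", active_vt())
```

A tty that is named /dev/tty1 but fails probes like these is the mismatch described above.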

If you expose a very different terminal than a VT as /dev/tty[1..64]
this will confuse a lot of userspace, since userspace, be it
systemd/logind, gpm, X11, openvt, ... all expect that /dev/tty[1..64] is
the VT subsystem where all that functionality is available. And you just
broke the userspace API quite badly.

It's totally fine to register ttys with a different feature set under
some other name you like, but if you name it /dev/tty[1..64] then
userspace expects this to be the real deal.

The various hypervisors understood this and provide their ttys under the
name of /dev/hvc0 and suchlike. UML should probably do something
similar. If you pick a name of your own, you have complete freedom what
you actually implement...

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

Re: [PATCH] um: change defconfig to stop spawning xterm

2013-07-22 Thread Lennart Poettering

On Mon, 22.07.13 16:13, Ramkumar Ramachandra (artag...@gmail.com) wrote:

> 
> [Corrected Lennart's email ID]
> 
> Richard Weinberger wrote:
> > CC'ing Lennart.
> >
> > On 22.07.2013 11:45, Ramkumar Ramachandra wrote:
> >> Ramkumar Ramachandra wrote:
> >>> [1]: 
> >>> http://lists.freedesktop.org/archives/systemd-devel/2013-July/012152.html
> >>
> >> ... and the patches were rejected.  Lennart says that UML providing
> >> /dev/tty* is wrong, and that UML should call them /dev/hvc* (or
> >> something).  Can we do something about the situation?  Can we remove
> >> /dev/tty*, and provide /dev/hvc*?  Will we be breaking existing users?
> >>
> >> Thanks.
> >>
> >> Lennart Poettering wrote:
> >>> UML shouldn't be penalized for not implementing some terminal emulation,
> >>> but it should be penalized for doing so under the label of "VT support",
> >>> which it simply is not providing.
> >>>
> >>> They can call their ttys any way they want. If the call them
> >>> /dev/tty[1..64] however, then they need to implement the VC
> >>> interfaces. All of them.
> >
> > Lennart, can you please explain us why /dev/tty[1..64] is forced to
> > have virtual console support?

/dev/tty[1..64] is the userspace API to the kernel VT subsystem. If you
support it you need to match up all /dev/tty[1..64] with a
/dev/vcs[1..64] + /dev/vcsa[1..64]. You need to expose a tty that
understands TERM=linux and the ioctls listed on console_ioctl(4). You
need /dev/tty0 as something that behaves like a symlink to the fg
VT. You should also support files like /sys/class/tty/tty0/active with
its POLLHUP iface.

If you expose a very different terminal than a VT as /dev/tty[1..64]
this will confuse a lot of userspace, since userspace, be it
systemd/logind, gpm, X11, openvt, ... all expect that /dev/tty[1..64] is
the VT subsystem where all that functionality is available. And you just
broke the userspace API quite badly.

It's totally fine to register ttys with a different feature set under
some other name you like, but if you name it /dev/tty[1..64] then
userspace expects this to be the real deal.

The various hypervisors understood this and provide their ttys under the
name of /dev/hvc0 and suchlike. UML should probably do something
similar. If you pick a name of your own, you have complete freedom what
you actually implement...

Lennart

-- 
Lennart Poettering - Red Hat, Inc.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: cgroup: status-quo and userland efforts

2013-06-30 Thread Lennart Poettering


Heya,

On 29.06.2013 05:05, Tim Hockin wrote:

> Come on, now, Lennart.  You put a lot of words in my mouth.



> > I for sure am not going to make the PID 1 a client of another daemon. That's
> > just wrong. If you have a daemon that is both conceptually the manager of
> > another service and the client of that other service, then that's bad design
> > and you will easily run into deadlocks and such. Just think about it: if you
> > have some external daemon for managing cgroups, and you need cgroups for
> > running external daemons, how are you going to start the external daemon for
> > managing cgroups? Sure, you can hack around this, make that daemon special,
> > and magic, and stuff -- or you can just not do such nonsense. There's no
> > reason to repeat the fuckup that cgroup became in kernelspace a second time,
> > but this time in userspace, with multiple manager daemons all with different
> > and slightly incompatible definitions of what a unit to manage actually is...


> I forgot about the tautology of systemd.  systemd is monolithic.


systemd is certainly not monolithic for almost any definition of that 
term. I am not sure where you are taking that from, and I am not sure I 
want to discuss on that level. This just sounds like FUD you picked up 
somewhere and are repeating carelessly...



> But that's not my point.  It seems pretty easy to make this cgroup
> management (in "native mode") a library that can have either a thin
> veneer of a main() function, while also being usable by systemd.  The
> point is to solve all of the problems ONCE.  I'm trying to make the
> case that systemd itself should be focusing on features and policies
> and awesome APIs.


You know, getting this all right isn't easy. If you want to do things 
properly, then you need to propagate attribute changes between the units 
you manage. You also need something like a scheduler, since a number of 
controllers can only be configured under certain external conditions 
(for example: the blkio or devices controller use major/minor parameters 
for configuring per-device limits. Since major/minor assignments are 
pretty much unpredictable these days -- and users probably want to 
configure things with friendly and stable /dev/disk/by-id/* symlinks 
anyway -- this requires us to wait for devices to show up before we can 
configure the parameters.) Soo... you need a graph of units, where you 
can propagate things, and schedule things based on some execution/event 
queue. And the propagation and scheduling are closely intermingled.
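
A minimal sketch of the "wait for the device, then configure" part, to show why a queue is needed at all. Everything here is illustrative: DeferredLimits, device_numbers and the example path are invented names, and the "major:minor value" line format is an assumption about what a per-device controller file expects. It is not how systemd implements this.

```python
import os
import stat


def device_numbers(path):
    """Return (major, minor) for a block device node, or None if it is not there yet."""
    try:
        st = os.stat(path)  # follows symlinks such as /dev/disk/by-id/*
    except OSError:
        return None
    if not stat.S_ISBLK(st.st_mode):
        return None
    return os.major(st.st_rdev), os.minor(st.st_rdev)


class DeferredLimits:
    """Hold per-device settings until the device has shown up."""

    def __init__(self, resolve=device_numbers):
        self.resolve = resolve  # injectable so the sketch can run without real devices
        self.pending = {}       # device path -> bytes per second
        self.applied = []       # lines ready to be written to a controller file

    def request(self, path, bytes_per_sec):
        self.pending[path] = bytes_per_sec
        self.flush()

    def flush(self):
        """Call again whenever a 'device added' event arrives."""
        for path in list(self.pending):
            numbers = self.resolve(path)
            if numbers is not None:
                major, minor = numbers
                value = self.pending.pop(path)
                self.applied.append(f"{major}:{minor} {value}")
```

Even this toy version needs state that outlives the request and a hook into device events, which is where the propagation and scheduling start to intermingle.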


Now, that's pretty much exactly what systemd actually *is*. It 
implements a graph of units with a scheduler. And if you rip that part 
out of systemd to make this an "easy cgroup management library", then 
you simply turn what systemd is into a library without leaving anything. 
Which is just bogus.


So no, if you say "seems pretty easy to make this cgroup management a 
library" then well, I have to disagree with you.



> > We want to run fewer, simpler things on our systems, we want to reuse as


> Fewer and simpler are not compatible, unless you are losing
> functionality.  Systemd is fewer, but NOT simpler.


Oh, certainly it is. If we'd split up the cgroup fs access into a
separate daemon of some kind, then we'd need some kind of IPC for that,
and so you have more daemons and you have some complex IPC between the
processes. So yeah, the systemd approach is certainly both simpler and
uses fewer daemons than your hypothetical one.



> > much of the code as we can. You don't achieve that by running yet another
> > daemon that does worse what systemd can anyway do simpler, easier and
> > better.


> Considering this is all hypothetical, I find this to be a funny
> debate.  My hypothetical idea is better than your hypothetical idea.


Well, systemd is pretty real, and the code to do the unified cgroup 
management within systemd is pretty complete. systemd is certainly not 
hypothetical.



> > The least you could grant us is to have a look at the final APIs we will
> > have to offer before you already imply that systemd cannot be a valid
> > implementation of any API people could ever agree on.


> Whoah, don't get defensive.  I said nothing of the sort.  The fact of
> the matter is that we do not run systemd, at least in part because of
> the monolithic nature.  That's unlikely to change in this timescale.


Oh, my. I am not sure what makes you think it is monolithic.


> What I said was that it would be a shame if we had to invent our own
> low-level cgroup daemon just because the "upstream" daemon was too
> tightly coupled with systemd.


I have no interest in reimplementing systemd as a library just to make you
happy... I am quite happy with what we already have.



> This is supposed to be collaborative, not combative.


It certainly sounds *very* different in what you are writing.

Lennart

Re: cgroup: status-quo and userland efforts

2013-06-28 Thread Lennart Poettering


On 28.06.2013 20:53, Tim Hockin wrote:


> a single-agent, we should make a kick-ass implementation that is
> flexible and scalable, and full-featured enough to not require
> divergence at the lowest layer of the stack.  Then build systemd on
> top of that. Let systemd offer more features and policies and
> "semantic" APIs.


Well, what if systemd is already kick-ass? I mean, if you have a problem
with systemd, then that's your own problem, but I really don't see why
I should bother.


I for sure am not going to make the PID 1 a client of another daemon. 
That's just wrong. If you have a daemon that is both conceptually the 
manager of another service and the client of that other service, then 
that's bad design and you will easily run into deadlocks and such. Just 
think about it: if you have some external daemon for managing cgroups, 
and you need cgroups for running external daemons, how are you going to 
start the external daemon for managing cgroups? Sure, you can hack 
around this, make that daemon special, and magic, and stuff -- or you 
can just not do such nonsense. There's no reason to repeat the fuckup 
that cgroup became in kernelspace a second time, but this time in
userspace, with multiple manager daemons all with different and slightly
incompatible definitions of what a unit to manage actually is...
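
The bootstrap problem can be shown with a toy dependency check. This is only a sketch: the component names ("pid1", "cgroup-daemon") and the find_cycle helper are invented for illustration and are not taken from any real implementation.

```python
def find_cycle(needs):
    """Return one dependency cycle as a list of names, or None if there is none.

    needs maps each component to the components that must be running first.
    """
    visiting = []   # current path of the depth-first walk
    done = set()    # components already proven cycle-free

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # We walked back onto our own path: that slice is the cycle.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for dep in needs.get(node, ()):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for name in needs:
        cycle = visit(name)
        if cycle:
            return cycle
    return None


# PID 1 needs the cgroup daemon for cgroups; the daemon needs PID 1 to be started.
split_design = {"pid1": ["cgroup-daemon"], "cgroup-daemon": ["pid1"]}
print(find_cycle(split_design))
```

With the manager and the cgroup logic in one process the graph has no such edge pair, so there is nothing to special-case.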


We want to run fewer, simpler things on our systems, we want to reuse as 
much of the code as we can. You don't achieve that by running yet 
another daemon that does worse what systemd can anyway do simpler, 
easier and better.


The least you could grant us is to have a look at the final APIs we will 
have to offer before you already imply that systemd cannot be a valid 
implementation of any API people could ever agree on.


Lennart

Re: [systemd-devel] 2013 Plumber's CFP: Fastboot

2013-05-16 Thread Lennart Poettering

On Wed, 15.05.13 15:01, Mehaffey, John (john_mehaf...@mentor.com) wrote:

> > What if we merge the proposals?
> > 
> > John, are you ok with proposing (some of) these topics in the "Boot
> > and Core OS" track? I could help with the module-related part, too.
> > 
> > 
> > Lucas De Marchi
> 
> Hi Lucas, Lennart,
> I am fine with merging the two boot related sessions.  I did not want to have 
> it be a topic in the automotive microconf, as I believe there is enough 
> material for a couple of hours on fastboot alone.
> 
> It was not clear to me what topics might be in the boot and core os 
> microconf, so I proposed fastboot as separate.

Let's merge this then (the LPC guys want this too).

We could copy the stuff from the fastboot MC wiki page into the
boot/core OS wiki page, or the other way round. Any preference? I'd
still call this "Boot and Core OS MC" as this has the slightly broader
topic, I guess...

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

Re: [systemd-devel] 2013 Plumber's CFP: Fastboot

2013-05-15 Thread Lennart Poettering

On Wed, 15.05.13 11:43, Jeremiah Foster (jeremiah.fos...@pelagicore.com) wrote:

> On Tue, May 14, 2013 at 1:51 AM, Mehaffey, John
> <john_mehaf...@mentor.com> wrote:
> 
> > Hello All,
> >
> 
> Hey John!
> 
> 
> > I am proposing a microconference on fastboot at the Linux Plumber's
> > conference 2013 in New Orleans. The goal is to get to sub 1S boot times for
> > a large (IVI) system using NAND flash. This pushes the state of the art,
> > and will require innovative solutions in many areas of Linux plumbing,
> > including bootloader, kernel init, UBI, and systemd.
> >
> > Note that fastboot improvements will (generally) help all architectures so
> > I am not limiting this to automotive systems.
> >
> > Please visit http://wiki.linuxplumbersconf.org/2013:fastboot for more
> > information or if you want to submit a topic.
> >
> 
> I've linked to your microconference from the automotive microconference:
> http://wiki.linuxplumbersconf.org/2013:automotive
> 
> I'm a bit confused about the LPC format though. John, is it planned to have
> the non-Android fastboot discussion as part of the automotive
> microconference or is this separate (despite its automotive relevance.) I
> ask because it might be nice to have the participants in both
> microconferences and it would be a shame to lose attendees to one or the
> other if they're competing tracks.
> 
> Can someone clue me in to the microconference format as regards LPC?


BTW, there's also this MC we proposed:

http://wiki.linuxplumbersconf.org/2013:boot_and_core_os

which sounds pretty close to fastboot?

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

Re: [systemd-devel] [PATCH 2/2] coredump: Handle programs with spaces in COMM

2013-05-03 Thread Lennart Poettering

On Wed, 01.05.13 18:42, Oleg Nesterov (o...@redhat.com) wrote:

> On 04/30, Colin Walters wrote:
> >
> > On Tue, 2013-04-30 at 19:47 +0200, Zbigniew Jędrzejewski-Szmek wrote:
> > > On Tue, Apr 30, 2013 at 01:12:19PM -0400, Colin Walters wrote:
> > > > This patch makes systemd-coredump handle processes that have
> > > > whitespace in their COMM fields.
> > > >
> > > > fs/coredump.c when given %e (as systemd-coredump uses), will end up
> > > > joining the process arguments into a string (along with the other
> > > > fields), then will split the entire thing up on whitespace, and use
> > > > it as the arguments to the coredump pipe handler.
> > > > ---
> > > That's a workaround for a bug in the kernel. I think it makes sense, but
> > > it'd be nice to fix the kernel too.
> 
> I wouldn't say this is a bug... at least this is expected.
> 
> Sure, it is possible to rewrite the format_corename/argv_split interaction,
> but this is a bit painful and I am not sure it is worth the trouble.

It sounds really wrong to first merge this into one string and then
split it up again. It sounds much more sensible to instead just pass the
string array around all the time. What's the reason to make this one
string first?
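
To illustrate what gets lost, here is a simplified userspace model of the round trip. It is not the kernel code: join_then_split is an invented helper, and the handler path and the values are hypothetical.

```python
def join_then_split(fields):
    """Mimic the lossy round trip: join fields with spaces, then split on whitespace."""
    return " ".join(fields).split()


handler = "/usr/lib/systemd/systemd-coredump"  # hypothetical path, for illustration only
pid = "1234"
comm = "my prog"                               # a COMM that contains a space

as_array = [handler, pid, comm]                # three arguments, boundaries intact
round_tripped = join_then_split(as_array)      # boundaries are lost here

print(len(as_array), as_array)
print(len(round_tripped), round_tripped)
```

The array form keeps "my prog" as one argument; after the join and split the handler sees one argument more than it was meant to, and cannot tell which pieces belonged together.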

Lennart

-- 
Lennart Poettering - Red Hat, Inc.

Re: cgroup: status-quo and userland efforts

2013-04-08 Thread Lennart Poettering


Heya,

On 08.04.2013 15:46, Glauber Costa wrote:

> On 04/06/2013 05:21 AM, Tejun Heo wrote:

> > Hello, guys.


> Hello Tejun, how are you?


> >   Status-quo
> >   ==


> tl;did read;

> This is mostly sensible. There is still one problem that we hadn't yet
> had the bandwidth to tackle that should be added to your official TODO list.

> The cpu cgroup needs a real-time timeslice to accept real time tasks. It
> defaults to 0, meaning that a newly created cpu cgroup cannot accept
> tasks (rt tasks) without the user having to manually configure it.
> As far as I know, this problem hasn't yet been fixed.

> The fix of course, is as trivial as setting a new value instead of 0 as
> a default. The complication lies in determining which value should that be.

> There are many things that we should ask from a controller to implement
> in order to be able to handle fully joint hierarchies. One of them,
> IMHO, is that if you drop a task into a newly created cgroup it should
> run without the user having to do anything for it.


The other big thing we want from the systemd side is saner notifications
when cgroups run empty. Currently we don't get these at all in
containers (since the agent can be only installed once, for the host).
And the way we get this is awful, via kernel-spawned processes. I am
looking for a way to establish a watch on a certain subtree (not
just one directory) and get simple notifications in a race-free way
whenever a cgroup runs empty.
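
For reference, this is roughly what "runs empty" means for a subtree, written as a polling sketch. It assumes each cgroup directory carries a cgroup.procs file listing its PIDs; subtree_is_empty is an invented helper name. Polling like this is inherently racy (a process can come and go between two reads), which is exactly why a proper notification is wanted instead.

```python
import os


def subtree_is_empty(cgroup_dir):
    """True if no cgroup.procs file anywhere under cgroup_dir lists a PID."""
    for dirpath, _dirnames, filenames in os.walk(cgroup_dir):
        if "cgroup.procs" in filenames:
            with open(os.path.join(dirpath, "cgroup.procs")) as f:
                if f.read().strip():
                    return False
    return True
```

Note that the walk has to cover every directory below the watched one; a watch on a single directory would miss children that still hold processes.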


Lennart

Re: [PATCH v6 3/4] cgroup: add xattr support

2012-08-21 Thread Lennart Poettering


Heya,

(sorry for the late reply)

On 16.08.2012 22:00, Tejun Heo wrote:

> On Thu, Aug 16, 2012 at 01:44:56PM -0400, a...@redhat.com wrote:



> > Attaching meta information to services, in an easily discoverable
> > way. For example, in systemd we create one cgroup for each service, and
> > could then store data like the main pid of the specific service as an
> > xattr on the cgroup itself. That way we'd have almost all service state
> > in the cgroupfs, which would make it possible to terminate systemd and
> > later restart it without losing any state information. But there's more:
> > for example, some very peculiar services cannot be terminated on
> > shutdown (i.e. fakeraid DM stuff) and it would be really nice if the
> > services in question could just mark that on their cgroup, by setting an
> > xattr. On the more desktopy side of things there are other
> > possibilities: for example there are plans defining what an application
> > is along the lines of a cgroup (i.e. an app being a collection of
> > processes). With xattrs one could then attach an icon or human readable
> > program name on the cgroup.
> >
> > The key idea is that this would allow attaching runtime meta information
> > to cgroups and everything they model (services, apps, vms), that doesn't
> > need any complex userspace infrastructure, has good access control
> > (i.e. because the file system enforces that anyway, and there's the
> > "trusted." xattr namespace), notifications (inotify), and can easily be
> > shared among applications.




> I'm not against this but unsure whether using kmem is enough for the
> suggested use case.  Lennart, would this suit systemd?  How much
> metadata are we talking about?


Just small things, like values or PIDs. A few hundred bytes or so per
cgroup should be more than sufficient for our needs.
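
As a rough idea of the size involved, a sketch of storing a main PID as an xattr on a cgroup directory. The attribute name and all helper names are hypothetical; it uses the "user." namespace only so it can be tried without privileges, whereas the use case discussed above would more likely use "trusted.". os.setxattr and os.getxattr are Linux-only and need a file system with xattr support.

```python
import os

ATTR = "user.example.main_pid"  # hypothetical attribute name, not a real convention


def encode_pid(pid):
    """Serialize a PID as a short ASCII byte string."""
    return str(int(pid)).encode("ascii")


def decode_pid(raw):
    """Parse the byte string written by encode_pid."""
    return int(raw.decode("ascii"))


def store_main_pid(cgroup_dir, pid):
    """Attach the PID to the cgroup directory as an extended attribute."""
    os.setxattr(cgroup_dir, ATTR, encode_pid(pid))


def load_main_pid(cgroup_dir):
    """Read the PID back from the extended attribute."""
    return decode_pid(os.getxattr(cgroup_dir, ATTR))
```

A value like this is a handful of bytes, so even several such attributes per cgroup stay far below the figure given above.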


Lennart

Re: [rfc][patch] mm: madvise(WILLNEED) for anonymous memory

2007-12-20 Thread Lennart Poettering

On Thu, 20.12.07 14:09, Hugh Dickins ([EMAIL PROTECTED]) wrote:

> > Lennart asked for madvise(WILLNEED) to work on anonymous pages, he plans
> > to use this to pre-fault pages. He currently uses: mlock/munlock for
> > this purpose.
> 
> I certainly agree with this in principle: it just seems an unnecessary
> and surprising restriction to refuse on anonymous vmas; I guess the only
> reason for not adding this was not having anyone asking for it until now.
> Though, does Lennart realize he could use MAP_POPULATE in the mmap?

Not really. First, if the mmap() is hidden somewhere in glibc (e.g. as
part of malloc() or whatever) it's not really possible to do
MAP_POPULATE. Also, I need this for some memory that is allocated
during the whole runtime but only seldom used. Thus I am happy if it
is swapped out, but every time I want to use it I want to make sure it
is paged in before I pass it on to the RT thread. So, there's a
mmap() during startup only, and then, during the whole runtime of my
program I want to page in the memory again and again, with long
intervals in between, but with no call to mmap()/munmap().
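
The usage pattern described above (map once at startup, hint before
each use) can be sketched roughly as follows. This is only an
illustration, not PulseAudio code: the function names are made up, and
it assumes Python 3.8+ on Linux, where mmap objects expose madvise().
Since MADV_WILLNEED is only a hint, the sketch treats a refused hint as
a non-fatal "not supported" result.

```python
import mmap


def make_anonymous_buffer(pages: int) -> mmap.mmap:
    """Map private anonymous memory once, as would happen at startup."""
    return mmap.mmap(-1, pages * mmap.PAGESIZE,
                     flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)


def prefault_hint(buf: mmap.mmap) -> bool:
    """Ask the kernel to page the buffer in again.

    Returns False if the hint is unavailable or the kernel refuses it
    for anonymous memory, True if the call was accepted.
    """
    advice = getattr(mmap, "MADV_WILLNEED", None)
    if advice is None or not hasattr(buf, "madvise"):
        return False
    try:
        buf.madvise(advice)  # covers the whole mapping by default
    except OSError:
        return False
    return True
```

The idea would be to call prefault_hint() each time shortly before
handing the buffer to the RT thread, without any further
mmap()/munmap() calls in between.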

Lennart

-- 
Lennart Poettering                        Red Hat, Inc.
lennart [at] poettering [dot] net ICQ# 11060553
http://0pointer.net/lennart/   GnuPG 0x1A015CC4

Re: [RFC][PATCH] sched: SCHED_FIFO watchdog timer

2007-10-15 Thread Lennart Poettering

On Sun, 14.10.07 00:51, Peter Zijlstra ([EMAIL PROTECTED]) wrote:

> The below patch is an idea proposed by tglx and depends on sched-devel +
> the hrtick patch previously posted.
> 
> The current watchdog action is to demote the task to SCHED_NORMAL,
> however it might be wanted to deliver a signal instead (or have more per
> task configuration state). Which is why I added Lennart to the CC list
> as I gathered he would like something like this for PulseAudio.

Indeed! Having this in the kernel would allow us to enable RT
scheduling for PulseAudio by default without bad effects. I was thinking about
adding some kind of babysitting process to userspace -- but doing this as
an RLIMIT in the kernel strikes me as a much better idea!

I think it would make a lot of sense to make the API very similar to
RLIMIT_CPU, i.e. also send out SIGXCPU and SIGKILL, with the single
difference that RLIMIT_CPU sends out a signal depending on the total
CPU time used by the process, while the new RLIMIT would do so based
on the time the process spent running without sleeping. That would be
a very reasonable extension to the current RLIMIT_CPU model.
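
For reference, the existing RLIMIT_CPU behaviour that the proposal
would mirror can be sketched like this. It only demonstrates the
current model (SIGXCPU once the soft limit on total CPU time is
reached), not the proposed new limit. The function name and the safety
margin are illustrative assumptions; the limit has one-second
granularity, so the signal may arrive slightly before the full budget
is used.

```python
import resource
import signal
import time


def burn_until_sigxcpu(budget_seconds: int = 1, safety_margin: float = 3.0) -> bool:
    """Lower the soft RLIMIT_CPU, spin, and report whether SIGXCPU arrived."""
    fired = False

    def on_sigxcpu(signum, frame):
        nonlocal fired
        fired = True

    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_CPU)
    # RLIMIT_CPU counts the total CPU seconds of the process, so the new
    # soft limit is offset by what has already been consumed.
    new_soft = int(time.process_time()) + budget_seconds
    if old_hard != resource.RLIM_INFINITY and new_soft >= old_hard:
        raise RuntimeError("hard limit too close; reaching it means SIGKILL")

    old_handler = signal.signal(signal.SIGXCPU, on_sigxcpu)
    resource.setrlimit(resource.RLIMIT_CPU, (new_soft, old_hard))
    try:
        deadline = time.process_time() + budget_seconds + safety_margin
        while not fired and time.process_time() < deadline:
            pass  # busy loop: consumes CPU time without ever sleeping
    finally:
        # Restore the previous limit first, then the previous handler.
        resource.setrlimit(resource.RLIMIT_CPU, (old_soft, old_hard))
        signal.signal(signal.SIGXCPU,
                      old_handler if old_handler is not None else signal.SIG_DFL)
    return fired
```

With the hard limit left untouched, only SIGXCPU is delivered here;
SIGKILL would follow only once the hard limit is reached.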

Thank you very much for doing this patch!

Lennart

-- 
Lennart Poettering                        Red Hat, Inc.
lennart [at] poettering [dot] net ICQ# 11060553
http://0pointer.net/lennart/   GnuPG 0x1A015CC4
