Re: [PATCH v2 1/2] init/initramfs.c: allow asynchronous unpacking

2021-03-13 Thread Rasmus Villemoes
On 11/03/2021 19.02, Linus Torvalds wrote:
> On Wed, Mar 10, 2021 at 5:45 PM Rasmus Villemoes
>  wrote:
>>
>> Hm, gcc does elide the test of the return value, but jumps back to a
>> place where it always loads state from its memory location and does the
>> whole switch(). To get it to jump directly to the code implementing the
>> various do_* helpers it seems one needs to avoid that global variable
>> and instead return the next state explicitly. The below boots, but I
>> still can't see any measurable improvement on ppc.
> 
> Ok. That's definitely the right way to do efficient statemachines that
> the compiler can actually generate ok code for, but if you can't
> measure the difference I guess it isn't even worth doing.

Just for good measure, I now got around to test on x86 as well, where I
thought the speculation stuff might make a difference. However, the
indirect calls through the actions[] array don't actually hurt due to
__noinitretpoline, and even removing that from the __init definition, I
only see about 1.5% difference with that state machine patch applied.

So it doesn't seem worth pursuing. I'll send v3 of the async patches
shortly.

Rasmus


Re: [PATCH v2 1/2] init/initramfs.c: allow asynchronous unpacking

2021-03-11 Thread Linus Torvalds
On Wed, Mar 10, 2021 at 5:45 PM Rasmus Villemoes
 wrote:
>
> Hm, gcc does elide the test of the return value, but jumps back to a
> place where it always loads state from its memory location and does the
> whole switch(). To get it to jump directly to the code implementing the
> various do_* helpers it seems one needs to avoid that global variable
> and instead return the next state explicitly. The below boots, but I
> still can't see any measurable improvement on ppc.

Ok. That's definitely the right way to do efficient statemachines that
the compiler can actually generate ok code for, but if you can't
measure the difference I guess it isn't even worth doing.

Thanks for checking, though.

Linus


Re: [PATCH v2 1/2] init/initramfs.c: allow asynchronous unpacking

2021-03-11 Thread Luis Chamberlain
On Tue, Mar 09, 2021 at 02:07:36PM -0800, Linus Torvalds wrote:
> On Tue, Mar 9, 2021 at 1:17 PM Rasmus Villemoes
>  wrote:
> >
> > So add an initramfs_async= kernel parameter, allowing the main init
> > process to proceed to handling device_initcall()s without waiting for
> > populate_rootfs() to finish.
> 
> I like this smaller second version of the patch, but am wondering why
> we even need the parameter.
> 
> It sounds mostly like a "maybe I didn't think of all cases" thing -
> and one that will mean that this code will not see a lot of actual
> test coverage..
> 
> And because of the lack of test coverage, I'd rather reverse the
> meaning, and have the async case on by default (without even the
> Kconfig option), and have the kernel command line purely as a "oops,
> it's buggy, easy to ask people to test if this is what ails them".

If we're going to set this as default it might be good to document on
init.h that components that need content in initramfs need the wait
call.

> What *can* happen early boot outside of firmware loading and usermodehelpers?

*In practice* the only thing I can think of at this time is races with
other rootfs_initcall() calls, granted ordering among these is done at
linker time, but I can't think of a issue with them:

arch/x86/kernel/pci-dma.c:rootfs_initcall(pci_iommu_init);
drivers/iommu/intel/irq_remapping.c:rootfs_initcall(ir_dev_scope_init);
drivers/mfd/sta2x11-mfd.c:rootfs_initcall(sta2x11_mfd_init);
drivers/thunderbolt/nhi.c:rootfs_initcall(nhi_init);

But Cc'ing the maintainers of these components just in case.

  Luis


Re: [PATCH v2 1/2] init/initramfs.c: allow asynchronous unpacking

2021-03-10 Thread Rasmus Villemoes
On 11/03/2021 01.17, Rasmus Villemoes wrote:
> On 09/03/2021 23.16, Linus Torvalds wrote:
>> On Tue, Mar 9, 2021 at 1:17 PM Rasmus Villemoes
>>  wrote:
>>>
>>> So add an initramfs_async= kernel parameter, allowing the main init
>>> process to proceed to handling device_initcall()s without waiting for
>>> populate_rootfs() to finish.
>>
>> Oh, and a completely unrelated second comment about this: some of the
>> initramfs population code seems to be actively written to be slow.
>>
>> For example, I'm not sure why that write_buffer() function uses an
>> array of indirect function pointer actions. Even ignoring the whole
>> "speculation protections make that really slow" issue that came later,
>> it seems to always have been actively (a) slower and (b) more complex.
>>
>> The normal way to do that is with a simple switch() statement, which
>> makes the compiler able to do a much better job. Not just for the
>> state selector - maybe it picks that function pointer approach, but
>> probably these days just direct comparisons - but simply to do things
>> like inline all those "it's used in one place" cases entirely. In
>> fact, at that point, a lot of the state machine changes may end up
>> turning into pure goto's - compilers are sometimes good at that
>> (because state machines have often been very timing-critical).
> 
> FWIW, I tried doing
> 

Hm, gcc does elide the test of the return value, but jumps back to a
place where it always loads state from its memory location and does the
whole switch(). To get it to jump directly to the code implementing the
various do_* helpers it seems one needs to avoid that global variable
and instead return the next state explicitly. The below boots, but I
still can't see any measurable improvement on ppc.

Rasmus

Subject: [PATCH] init/initramfs.c: change state machine implementation

Instead of having write_buffer() rely on the global variable "state",
have each of the do_* helpers return the next state, or the new token
Stop. Also, instead of an array of function pointers, use a switch
statement.

This means all the do_* helpers end up inlined into write_buffer(),
and all the places which return a compile-time constant next state now
compile to a direct jump to that code.

We still need the global variable state for the initial choice within
write_buffer, and we also need to preserve the last non-Stop state
across calls.
---
 init/initramfs.c | 90 
 1 file changed, 45 insertions(+), 45 deletions(-)

diff --git a/init/initramfs.c b/init/initramfs.c
index 1d0fdd05e5e9..ad7e04393acb 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -189,7 +189,8 @@ static __initdata enum state {
GotName,
CopyFile,
GotSymlink,
-   Reset
+   Reset,
+   Stop
 } state, next_state;

 static __initdata char *victim;
@@ -207,17 +208,17 @@ static __initdata char *collected;
 static long remains __initdata;
 static __initdata char *collect;

-static void __init read_into(char *buf, unsigned size, enum state next)
+static int __init read_into(char *buf, unsigned size, enum state next)
 {
if (byte_count >= size) {
collected = victim;
eat(size);
-   state = next;
+   return next;
} else {
collect = collected = buf;
remains = size;
next_state = next;
-   state = Collect;
+   return Collect;
}
 }

@@ -225,8 +226,7 @@ static __initdata char *header_buf, *symlink_buf,
*name_buf;

 static int __init do_start(void)
 {
-   read_into(header_buf, 110, GotHeader);
-   return 0;
+   return read_into(header_buf, 110, GotHeader);
 }

 static int __init do_collect(void)
@@ -238,50 +238,46 @@ static int __init do_collect(void)
eat(n);
collect += n;
if ((remains -= n) != 0)
-   return 1;
-   state = next_state;
-   return 0;
+   return Stop;
+   return next_state;
 }

 static int __init do_header(void)
 {
if (memcmp(collected, "070707", 6)==0) {
error("incorrect cpio method used: use -H newc option");
-   return 1;
+   return Stop;
}
if (memcmp(collected, "070701", 6)) {
error("no cpio magic");
-   return 1;
+   return Stop;
}
parse_header(collected);
next_header = this_header + N_ALIGN(name_len) + body_len;
next_header = (next_header + 3) & ~3;
-   state = SkipIt;
if (name_len <= 0 || name_len > PATH_MAX)
-   return 0;
+   return SkipIt;
if (S_ISLNK(mode)) {
if (body_len > PATH_MAX)
-   return 0;
+   return SkipIt;
collect = collected = symlink_buf;
remains = N_ALIGN(name_len) + body_len;
next_state = GotSymlink;
-   

Re: [PATCH v2 1/2] init/initramfs.c: allow asynchronous unpacking

2021-03-10 Thread Rasmus Villemoes
On 09/03/2021 23.16, Linus Torvalds wrote:
> On Tue, Mar 9, 2021 at 1:17 PM Rasmus Villemoes
>  wrote:
>>
>> So add an initramfs_async= kernel parameter, allowing the main init
>> process to proceed to handling device_initcall()s without waiting for
>> populate_rootfs() to finish.
> 
> Oh, and a completely unrelated second comment about this: some of the
> initramfs population code seems to be actively written to be slow.
> 
> For example, I'm not sure why that write_buffer() function uses an
> array of indirect function pointer actions. Even ignoring the whole
> "speculation protections make that really slow" issue that came later,
> it seems to always have been actively (a) slower and (b) more complex.
> 
> The normal way to do that is with a simple switch() statement, which
> makes the compiler able to do a much better job. Not just for the
> state selector - maybe it picks that function pointer approach, but
> probably these days just direct comparisons - but simply to do things
> like inline all those "it's used in one place" cases entirely. In
> fact, at that point, a lot of the state machine changes may end up
> turning into pure goto's - compilers are sometimes good at that
> (because state machines have often been very timing-critical).

FWIW, I tried doing

--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -401,24 +401,26 @@ static int __init do_symlink(void)
return 0;
 }

-static __initdata int (*actions[])(void) = {
-   [Start] = do_start,
-   [Collect]   = do_collect,
-   [GotHeader] = do_header,
-   [SkipIt]= do_skip,
-   [GotName]   = do_name,
-   [CopyFile]  = do_copy,
-   [GotSymlink]= do_symlink,
-   [Reset] = do_reset,
-};
-
 static long __init write_buffer(char *buf, unsigned long len)
 {
+   int ret;
+
byte_count = len;
victim = buf;

-   while (!actions[state]())
-   ;
+   do {
+   switch (state) {
+   case Start: ret = do_start(); break;
+   case Collect: ret = do_collect(); break;
+   case GotHeader: ret = do_header(); break;
+   case SkipIt: ret = do_skip(); break;
+   case GotName: ret = do_name(); break;
+   case CopyFile: ret = do_copy(); break;
+   case GotSymlink: ret = do_symlink(); break;
+   case Reset: ret = do_reset(); break;
+   }
+   } while (!ret);
+
return len - byte_count;
 }


and yes, all the do_* functions get inlined into write_buffer with some
net space saving:

add/remove: 0/9 grow/shrink: 1/0 up/down: 1696/-2112 (-416)
Function old new   delta
write_buffer 1001796   +1696
actions   32   - -32
do_start  52   - -52
do_reset 112   --112
do_skip  128   --128
do_collect   144   --144
do_symlink   172   --172
do_copy  304   --304
do_header572   --572
do_name  596   --596
Total: Before=5360, After=4944, chg -7.76%

(ppc32 build). But the unpacking still takes the same time. It might be
different on x86.

Rasmus


Re: [PATCH v2 1/2] init/initramfs.c: allow asynchronous unpacking

2021-03-09 Thread Rasmus Villemoes
On 09/03/2021 23.16, Linus Torvalds wrote:
> On Tue, Mar 9, 2021 at 1:17 PM Rasmus Villemoes
>  wrote:
>>
>> So add an initramfs_async= kernel parameter, allowing the main init
>> process to proceed to handling device_initcall()s without waiting for
>> populate_rootfs() to finish.
> 
> Oh, and a completely unrelated second comment about this: some of the
> initramfs population code seems to be actively written to be slow.
> 
> For example, I'm not sure why that write_buffer() function uses an
> array of indirect function pointer actions. Even ignoring the whole
> "speculation protections make that really slow" issue that came later,
> it seems to always have been actively (a) slower and (b) more complex.
> 
[...]
> Is that likely to be a big part of the costs? No. I assume it's the
> decompression and the actual VFS operations. 

Yes, I have been doing some simple measurements, simply by decompressing
the blob in userspace and comparing to the time to that used by
populate_rootfs(). For both the 6M lz4-compressed blob on my ppc target
and the 26M xz-compressed blob on my laptop, the result is that the
decompression itself accounts for the vast majority of the time - and
for ppc in particular, I don't think there's any spectre slowdown.

So I haven't dared looking into changing the unpack implementation since
it doesn't seem it could buy that much.

Rasmus


Re: [PATCH v2 1/2] init/initramfs.c: allow asynchronous unpacking

2021-03-09 Thread Rasmus Villemoes
On 09/03/2021 23.07, Linus Torvalds wrote:
> On Tue, Mar 9, 2021 at 1:17 PM Rasmus Villemoes
>  wrote:
>>
>> So add an initramfs_async= kernel parameter, allowing the main init
>> process to proceed to handling device_initcall()s without waiting for
>> populate_rootfs() to finish.
> 
> I like this smaller second version of the patch, but am wondering why
> we even need the parameter.
> 
> It sounds mostly like a "maybe I didn't think of all cases" thing -

That's exactly what it is.

> and one that will mean that this code will not see a lot of actual
> test coverage..

Yeah, that's probably true.

> And because of the lack of test coverage, I'd rather reverse the
> meaning, and have the async case on by default (without even the
> Kconfig option), and have the kernel command line purely as a "oops,
> it's buggy, easy to ask people to test if this is what ails them".

Well, I wasn't bold enough to make it "default y" by myself, but I can
certainly do that and nuke the config option.

> What *can* happen early boot outside of firmware loading and usermodehelpers?

Well, that was what I tried to get people to tell me when I sent the
first version as RFC, and also before that
(https://lore.kernel.org/lkml/19574912-44b4-c1dc-44c3-67309968d...@rasmusvillemoes.dk/).
That you can't think of anything suggests that I have covered the
important cases - which does leave random drivers that poke around the
filesystem on their own, but (a) it would probably be a good thing to
have this flush those out and (b) there's the command line option to
make it boot anyway.

Rasmus


Re: [PATCH v2 1/2] init/initramfs.c: allow asynchronous unpacking

2021-03-09 Thread Linus Torvalds
On Tue, Mar 9, 2021 at 1:17 PM Rasmus Villemoes
 wrote:
>
> So add an initramfs_async= kernel parameter, allowing the main init
> process to proceed to handling device_initcall()s without waiting for
> populate_rootfs() to finish.

Oh, and a completely unrelated second comment about this: some of the
initramfs population code seems to be actively written to be slow.

For example, I'm not sure why that write_buffer() function uses an
array of indirect function pointer actions. Even ignoring the whole
"speculation protections make that really slow" issue that came later,
it seems to always have been actively (a) slower and (b) more complex.

The normal way to do that is with a simple switch() statement, which
makes the compiler able to do a much better job. Not just for the
state selector - maybe it picks that function pointer approach, but
probably these days just direct comparisons - but simply to do things
like inline all those "it's used in one place" cases entirely. In
fact, at that point, a lot of the state machine changes may end up
turning into pure goto's - compilers are sometimes good at that
(because state machines have often been very timing-critical).

Is that likely to be a big part of the costs? No. I assume it's the
decompression and the actual VFS operations. But when the series is
about how this all takes a long time, and I go "that code really looks
actively pessimally written", maybe it would be a good thing to look
into it?

   Linus


Re: [PATCH v2 1/2] init/initramfs.c: allow asynchronous unpacking

2021-03-09 Thread Linus Torvalds
On Tue, Mar 9, 2021 at 1:17 PM Rasmus Villemoes
 wrote:
>
> So add an initramfs_async= kernel parameter, allowing the main init
> process to proceed to handling device_initcall()s without waiting for
> populate_rootfs() to finish.

I like this smaller second version of the patch, but am wondering why
we even need the parameter.

It sounds mostly like a "maybe I didn't think of all cases" thing -
and one that will mean that this code will not see a lot of actual
test coverage..

And because of the lack of test coverage, I'd rather reverse the
meaning, and have the async case on by default (without even the
Kconfig option), and have the kernel command line purely as a "oops,
it's buggy, easy to ask people to test if this is what ails them".

What *can* happen early boot outside of firmware loading and usermodehelpers?

Hmm?

  Linus


[PATCH v2 1/2] init/initramfs.c: allow asynchronous unpacking

2021-03-09 Thread Rasmus Villemoes
This is primarily motivated by an embedded ppc target, where unpacking
even the rather modest sized initramfs takes 0.6 seconds, which is
long enough that the external watchdog becomes unhappy that it doesn't
get attention soon enough. By doing the initramfs decompression in a
worker thread, we get to do the device_initcalls and hence start
petting the watchdog much sooner.

So add an initramfs_async= kernel parameter, allowing the main init
process to proceed to handling device_initcall()s without waiting for
populate_rootfs() to finish.

Should one of those initcalls need something from the initramfs (say,
a kernel module or a firmware blob), it will simply wait for the
initramfs unpacking to be done before proceeding, which should in
theory make this completely safe to always enable. But if some driver
pokes around in the filesystem directly and not via one of the
official kernel interfaces (i.e. request_firmware*(),
call_usermodehelper*) that theory may not hold - also, I certainly
might have missed a spot when sprinkling wait_for_initramfs().

Normal desktops might benefit as well. On my mostly stock Ubuntu
kernel, my initramfs is a 26M xz-compressed blob, decompressing to
around 126M. That takes almost two seconds:

[0.201454] Trying to unpack rootfs image as initramfs...
[1.976633] Freeing initrd memory: 29416K

Before this patch, or with initramfs_async=0, these lines occur
consecutively in dmesg. With initramfs_async=1, the timestamps on
these two lines is roughly the same as above, but with 172 lines
inbetween - so more than one cpu has been kept busy doing work that
would otherwise only happen after the populate_rootfs() finished.

Signed-off-by: Rasmus Villemoes 
---
 .../admin-guide/kernel-parameters.txt | 12 ++
 drivers/base/firmware_loader/main.c   |  2 +
 include/linux/initrd.h|  2 +
 init/initramfs.c  | 41 ++-
 init/main.c   |  1 +
 kernel/umh.c  |  2 +
 usr/Kconfig   | 10 +
 7 files changed, 69 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 04545725f187..fda9f012c42b 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -1825,6 +1825,18 @@
initcall functions.  Useful for debugging built-in
modules and initcalls.
 
+   initramfs_async= [KNL] Normally, the initramfs image is
+   unpacked synchronously, before most devices
+   are initialized. When the initramfs is huge,
+   or on slow CPUs, this can take a significant
+   amount of time. Setting this option means the
+   unpacking is instead done in a background
+   thread, allowing the main init process to
+   begin calling device_initcall()s while the
+   initramfs is being unpacked.
+   Format: 
+   Default set by CONFIG_INITRAMFS_ASYNC.
+
initrd= [BOOT] Specify the location of the initial ramdisk
 
initrdmem=  [KNL] Specify a physical address and size from which to
diff --git a/drivers/base/firmware_loader/main.c 
b/drivers/base/firmware_loader/main.c
index 78355095e00d..4fdb8219cd08 100644
--- a/drivers/base/firmware_loader/main.c
+++ b/drivers/base/firmware_loader/main.c
@@ -15,6 +15,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -504,6 +505,7 @@ fw_get_filesystem_firmware(struct device *device, struct 
fw_priv *fw_priv,
if (!path)
return -ENOMEM;
 
+   wait_for_initramfs();
for (i = 0; i < ARRAY_SIZE(fw_path); i++) {
size_t file_size = 0;
size_t *file_size_ptr = NULL;
diff --git a/include/linux/initrd.h b/include/linux/initrd.h
index 85c15717af34..1bbe9af48dc3 100644
--- a/include/linux/initrd.h
+++ b/include/linux/initrd.h
@@ -20,8 +20,10 @@ extern void free_initrd_mem(unsigned long, unsigned long);
 
 #ifdef CONFIG_BLK_DEV_INITRD
 extern void __init reserve_initrd_mem(void);
+extern void wait_for_initramfs(void);
 #else
 static inline void __init reserve_initrd_mem(void) {}
+static inline void wait_for_initramfs(void) {}
 #endif
 
 extern phys_addr_t phys_initrd_start;
diff --git a/init/initramfs.c b/init/initramfs.c
index d677e8e717f1..d33bd98481c2 100644
--- a/init/initramfs.c
+++ b/init/initramfs.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -541,6 +542,14 @@ static int __init keepinitrd_setup(char *__unused)
 __setup("keepinitrd", keepinitrd_setup);
 #endif
 
+static bool initramfs_async = IS_ENABLED(CONFIG_INITRAMFS_ASYNC);