Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?

2021-04-16 Thread Luis Chamberlain
On Thu, Apr 08, 2021 at 06:02:24PM +, Luis Chamberlain wrote:
> On Sat, Apr 03, 2021 at 08:25:38PM +, Luis Chamberlain wrote:
> > So creating say 1000 random files in /lib/firmware on a freshly created
> > btrfs partition helps reproduce:
> > 
> > mkfs.btrfs /dev/whatever
> > mount /dev/wahtever /lib/firmware
> > # Put it on /etc/fstab too
> > 
> > Generate 1000 random files on it:
> > 
> > ```
> > for n in {1..1000}; do  
> > 
> > dd if=/dev/urandom of=/lib/firmware/file$( printf %03d "$n" ).bin bs=1 
> > count=$((RANDOM + 1024 ))
> > done  
> > ```
> > 
> > Then reboot, upon reboot do:
> > 
> > modprobe test_firmware
> > echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test
> > systemctl suspend
> > 
> > If its a guest wake it up:
> > 
> > virsh dompmwakeup domidofguest
> 
> This happens because:
> 
> btrfs_lookup() --> ... -->
>   
> btrfs_search_slot() --> read_block_for_search() -->   
>   
>   --> read_tree_block() --> btree_read_extent_buffer_pages() -->  
>   
>   --> submit_one_bio() --> btrfs_submit_metadata_bio() -->
>   
>   --> btrfsic_submit_bio() --> submit_bio()
>   --> this completes and then
>   --> wait_on_page_locked() on the first page
>   --> never returns   
>   
> 
> I also managed to reproduce this easily with XFS as well, so this is not
> a btrfs thing as I suspected. It does not happen with ext4 though.
> However I think that's just by chance, it should still be prone to the
> same issue.
> 
> Either way, I'm dusting off my patches for fs freeze as I believe that
> should fix this problem. I am not sure if we want a stop gap hack like
> the one I posted in the meantime... I don't think so. I rather fix this
> well with the series I'll post for fs freeze. Give me a bit of time and
> I'll CC you on the patches.

Low and behold, as I suspectd, my old VFS fsfreeze / thaw patch series
this. However I should note that I needed to add remove also the
WQ_FREEZABLE from fs as well, which was missing in my patch series, and
which Jan Kara had pointed out.

However, the VFS freeze work is quite a bit of work which we are still
discussing, so in the meantime, I think we have no other option but
to put the stop-gap patch I suggested with the usermode helper lock.
I will just modify it a bit more.

I'll also post my fs freeze work now again, but I'll note that it
still requires some more work to address everything which we have
discussed in the community. I'll post the patches as I think others
may be interested in the progress of that.

  Luis


Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?

2021-04-08 Thread Luis Chamberlain
On Sat, Apr 03, 2021 at 08:25:38PM +, Luis Chamberlain wrote:
> So creating say 1000 random files in /lib/firmware on a freshly created
> btrfs partition helps reproduce:
> 
> mkfs.btrfs /dev/whatever
> mount /dev/wahtever /lib/firmware
> # Put it on /etc/fstab too
> 
> Generate 1000 random files on it:
> 
> ```
> for n in {1..1000}; do
>   
> dd if=/dev/urandom of=/lib/firmware/file$( printf %03d "$n" ).bin bs=1 
> count=$((RANDOM + 1024 ))
> done  
> ```
> 
> Then reboot, upon reboot do:
> 
> modprobe test_firmware
> echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test
> systemctl suspend
> 
> If its a guest wake it up:
> 
> virsh dompmwakeup domidofguest

This happens because:

btrfs_lookup() --> ... -->  
btrfs_search_slot() --> read_block_for_search() --> 
--> read_tree_block() --> btree_read_extent_buffer_pages() -->  
  
--> submit_one_bio() --> btrfs_submit_metadata_bio() -->
  
--> btrfsic_submit_bio() --> submit_bio()
--> this completes and then
--> wait_on_page_locked() on the first page
--> never returns   
  

I also managed to reproduce this easily with XFS as well, so this is not
a btrfs thing as I suspected. It does not happen with ext4 though.
However I think that's just by chance, it should still be prone to the
same issue.

Either way, I'm dusting off my patches for fs freeze as I believe that
should fix this problem. I am not sure if we want a stop gap hack like
the one I posted in the meantime... I don't think so. I rather fix this
well with the series I'll post for fs freeze. Give me a bit of time and
I'll CC you on the patches.

  Luis


Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?

2021-04-05 Thread Lukas Middendorf




On 03/04/2021 23:04, Luis Chamberlain wrote:

OK this fixes it but this just shows that likely the thaw'ing allows
a race to take place which we didn't expect. I'll do some more digging
for a proper fix.


I can indeed confirm that this fixes the stall. This however does not
seem to be the (complete) solution. Instead I now get a kernel crash
message (see below) for every firmware location tried to read during
resume. This might be intentional for debugging purposes during testing.
This is identical for ext4 and btrfs.

If the firmware file can not be found during caching on suspend, the
reads are still attempted again during resume. This leads to multiple of
those crash messages (for different firmware locations) during resume if
the firmware file is not present, even for drivers properly requesting
caching.

So if this patch is to go in (those crashes would really help with
getting the si2168 patches in…), I think you have to make sure that even for
non-existent firmware files, no read is ever attempted on resume. Which
means to set up the caching even if the initial request_firmware()
failed and to store knowledge about failed caching attempts to not retry
these reads on resume.

Lukas

[ cut here ]
WARNING: CPU: 0 PID: 662 at fs/kernel_read_file.c:161 
kernel_read_file_from_path_initns+0x11c/0x140
Modules linked in: test_firmware nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib 
nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat 
nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables 
nfnetlink intel_rapl_msr intel_rapl_common kvm_amd vmwgfx ccp kvm ttm 
drm_kms_helper snd_pcm joydev snd_timer snd e1000 irqbypass cec soundcore 
vboxguest i2c_piix4 pcspkr drm fuse zram ip_tables crct10dif_pclmul 
crc32_pclmul crc32c_intel ghash_clmulni_intel serio_raw ata_generic pata_acpi 
video
CPU: 0 PID: 662 Comm: systemd-sleep Not tainted 5.12.0-rc5+ #2
Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
RIP: 0010:kernel_read_file_from_path_initns+0x11c/0x140
Code: ff ff 4c 89 e7 89 44 24 10 e8 50 07 fc ff e8 fb 1c d8 ff 44 8b 44 24 10 48 83 
c4 28 44 89 c0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 80 3d b4 10 7a 01 00 75 
e3 e9 5b 5e 83 00 e8 cf 1c d8 ff 45
RSP: 0018:a096c0b9fb90 EFLAGS: 00010286
RAX: fff5 RBX:  RCX: bd85d688
RDX: bd85d688 RSI: 0297 RDI: bd85d680
RBP: a096c0b9fbe0 R08: fff5 R09: bd85d688
R10:  R11:  R12: 8976ca811000
R13: a096c0b9fc00 R14: 7fff R15: 0001
FS:  7f4718de3b40() GS:8979cfc0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 5594afeca4f8 CR3: 000106a62000 CR4: 000506f0
Call Trace:
 ? snprintf+0x39/0x40
 fw_get_filesystem_firmware+0xe2/0x270
 _request_firmware+0x21e/0x500
 request_firmware+0x32/0x50
 test_firmware_resume.cold+0x4e/0xb2 [test_firmware]
 ? platform_pm_suspend+0x40/0x40
 dpm_run_callback+0x4c/0x120
 device_resume+0xa7/0x200
 dpm_resume+0xce/0x2c0
 dpm_resume_end+0xd/0x20
 suspend_devices_and_enter+0x195/0x750
 pm_suspend.cold+0x329/0x374
 state_store+0x71/0xd0
 kernfs_fop_write_iter+0x11c/0x1b0
 new_sync_write+0x108/0x180
 vfs_write+0x1b8/0x270
 ksys_write+0x4f/0xc0
 do_syscall_64+0x33/0x40
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f4719a527a7
Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 
25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 
83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:7ffee4d18e18 EFLAGS: 0246 ORIG_RAX: 0001
RAX: ffda RBX: 0004 RCX: 7f4719a527a7
RDX: 0004 RSI: 7ffee4d18f00 RDI: 0004
RBP: 7ffee4d18f00 R08: 5626b555c710 R09: 7f4719ae84e0
R10: 7f4719ae83e0 R11: 0246 R12: 0004
R13: 5626b5558650 R14: 0004 R15: 7f4719b25700
---[ end trace 7f7ef0dc067dd714 ]---
Trying to do direct read when not available
test_firmware test_firmware: loading 
/lib/firmware/updates/5.12.0-rc5+/test-firmware.bin failed with error -11
[ cut here ]


Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?

2021-04-03 Thread Lukas Middendorf



Great to hear that you now succeeded in reproducing the problem.

On 03/04/2021 22:25, Luis Chamberlain wrote:

On Sat, Apr 03, 2021 at 12:24:07PM +0200, Lukas Middendorf wrote:

One further thing I noticed which might be problematic in rare cases:
According to the kernel debug messages, the firmware-loader does not attempt
to cache the firmware during suspend,


Correct, the goal with this test was to purposely *not* call for the
firmware prior to suspend, and instead *race* (do the wrong thing) at
resume. It shouldn't really stall... fail yes, but stall, that seems
fishy.


I understood that.

if the previous call to
request_firmware() has failed (file not present; call made during previous
resume). In my opinion it should attempt to cache the firmware on suspend
even in this case


Yes, that is the *proper* way to do use the firmware API, but we want to
reproduce the stall, and so we have to recreate the issue you reported
by doing something bad.


Should firmware_request_cache() always be called, or only instead of 
request_firmware() ? In case request_firmware is called on its own, and 
the firmware file is not present, it might still race on the next resume.
request_firmware seems to be meant to include the cache request for the 
next suspend, but it apparently *does not* if firmware loading fails due 
to a missing file. I think this is something that should either be 
changed or properly documented in the API documentation.




Did you also try to create a random test-firmware.bin (I used 1M from
/dev/urandom) instead of an empty /lib/firmware ?


No, right now I want to to just focus on fixing the stall you saw.


This is a second nuance of the stall I saw: The firmware file (in this 
case test-firmware.bin) is present but not in cache.


If you run

ls -lR /lib/firmware > /dev/null

before

systemctl suspend

your reproduction steps will very likely not stall.
If you in addition run

dd if=/dev/urandom of=/lib/firmware/test-firmware.bin bs=1K count=64

during the preparation of /lib/firmware (in addition to or instead of 
your for loop), then it will always stall (as long as the file content 
is not read before suspend).


One stall is happening while the directory content is being listed (to 
find that the file is not present), the second stall is happening while 
actually trying to read the file. Those two stalls likely have the same 
root cause, but it might be different. I just want to make sure you have 
covered all cases.




You might be better off just reposting your
patches with the respective Reviewed-by tags and pestering your
maintainer.


I will try to be a little bit more insistent this time. Is "just repost" the
usual way to handle if patches are ignored?


You can repost, v2 just add my reviewed-by. Those patches are indeed
correct as they are calling for the firmware prior to resume. Maybe
clarify that.


OK, will do.


But indeed there is also another issue which you reported which needs to
be fixed, and for that thanks so much for you patience! I'll be looking
into this!


Also thank you for your effort.

Lukas





Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?

2021-04-03 Thread Luis Chamberlain
On Sat, Apr 03, 2021 at 08:25:38PM +, Luis Chamberlain wrote:
> So creating say 1000 random files in /lib/firmware on a freshly created
> btrfs partition helps reproduce:
> 
> mkfs.btrfs /dev/whatever
> mount /dev/wahtever /lib/firmware
> # Put it on /etc/fstab too
> 
> Generate 1000 random files on it:
> 
> ```
> for n in {1..1000}; do
>   
> dd if=/dev/urandom of=/lib/firmware/file$( printf %03d "$n" ).bin bs=1 
> count=$((RANDOM + 1024 ))
> done  
> ```
> 
> Then reboot, upon reboot do:
> 
> modprobe test_firmware
> echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test
> systemctl suspend

OK this fixes it but this just shows that likely the thaw'ing allows
a race to take place which we didn't expect. I'll do some more digging
for a proper fix.

diff --git a/fs/kernel_read_file.c b/fs/kernel_read_file.c
index 90d255fbdd9b..b9c37eefab35 100644
--- a/fs/kernel_read_file.c
+++ b/fs/kernel_read_file.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /**
  * kernel_read_file() - read file contents into a kernel buffer
@@ -156,17 +157,25 @@ int kernel_read_file_from_path_initns(const char *path, 
loff_t offset,
if (!path || !*path)
return -EINVAL;
 
+   ret = usermodehelper_read_trylock();
+   if (WARN_ON(ret)) {
+   pr_warn_once("Trying to do direct read when not available");
+   return ret;
+   }
task_lock(&init_task);
get_fs_root(init_task.fs, &root);
task_unlock(&init_task);
 
file = file_open_root(root.dentry, root.mnt, path, O_RDONLY, 0);
path_put(&root);
-   if (IS_ERR(file))
+   if (IS_ERR(file)) {
+   usermodehelper_read_unlock();
return PTR_ERR(file);
+   }
 
ret = kernel_read_file(file, offset, buf, buf_size, file_size, id);
fput(file);
+   usermodehelper_read_unlock();
return ret;
 }
 EXPORT_SYMBOL_GPL(kernel_read_file_from_path_initns);


Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?

2021-04-03 Thread Luis Chamberlain
On Sat, Apr 03, 2021 at 12:24:07PM +0200, Lukas Middendorf wrote:
> 
> On 02/04/2021 20:02, Luis Chamberlain wrote:
> > No sorry, I dropped the ball on this but I managed to now spawn up the
> > virtual guests where I was doing development to reproduce this. Give
> > me some time and I will zero in on this now.
> > 
> > For now what I have is the following to test this, I next will work
> > on the userspace part.
> 
> I can report that your patch for test_firmware works (applied to current
> master from linus d93a0d4; I get some offsets and had to adjust some
> whitespace for it to apply). With that module (and resume test enabled) I
> get freezes at resume in the same cases that would also cause problems with
> si2168.

Beautiful!

> I'm testing this on bare metal F34 beta with / on a btrfs. I'm using nvidia
> driver again to make sure the system does not otherwise use any firmware
> from /usr/lib/firmware (confirmed with kernel debug messages for
> firmware_loader). My si2168 is not plugged in.

OK I was not able to reproduce but the details below also clarify why,
so I now was able to reproduce! So I don't need to specifically test
with this distro, but others may do so.

> I tested it with a normal population of files in /usr/lib/firmware , without
> test-firmware.bin and also with a random 1MiB file in place. I tested after
> a reboot so it does not do caching.
> With /usr/lib/firmware on a separate ext4 partition I can confirm with dmesg
> that the test_firmware suspend test actually works (does not freeze).
> With /usr/lib/firmware on btrfs it fails in both cases (with and without the
> firmware files).
> With caching (first suspend with the ext4 partition mounted, then a second
> suspend without) it does not freeze even with the firmware on btrfs.

OK thanks for confirming!

> One further thing I noticed which might be problematic in rare cases:
> According to the kernel debug messages, the firmware-loader does not attempt
> to cache the firmware during suspend, 

Correct, the goal with this test was to purposely *not* call for the
firmware prior to suspend, and instead *race* (do the wrong thing) at
resume. It shouldn't really stall... fail yes, but stall, that seems
fishy.

> if the previous call to
> request_firmware() has failed (file not present; call made during previous
> resume). In my opinion it should attempt to cache the firmware on suspend
> even in this case 

Yes, that is the *proper* way to do use the firmware API, but we want to
reproduce the stall, and so we have to recreate the issue you reported
by doing something bad.

> (If I remember correctly, firmware_request_cache also
> works without the file being present). In case some low-memory condition has
> caused the file system cache to lose the information about the file being
> non-present (or the file has been written after the initial attempt and is
> no longer in the file system cache), this might lead to freezes even for
> well-behaved drivers in case they reattempt to do request_firmware() on
> resume.
> If the firmware is found during resume, it is cached on further suspends.

Right.

> Given how long it took me to narrow down this problem in this (for me)
> reliably reproducible case, something like this happening at random would be
> almost impossible to debug/locate and might actually happen frequently in
> the wild.

OK thanks for the report!

> On 03/04/2021 00:19, Luis Chamberlain wrote:
> > Lukas, can you share your /etc/fstab ?
> 
> This is the core part (everything else is unmounted), I shortened the UUIDs.
> The ext4 mount of course is also unmounted when I want it to fail, with it
> in place it reliably not-fails.
> 
> UUID=<1> /   btrfs   subvol=linux1 0 0
> UUID=<2> /usr/lib/firmware   ext4defaults 1 1
> UUID=<1> /home/lukas btrfs   subvol=linux1-f34-lukas 0 0
> tmpfs /home/lukas/.cache tmpfs size=16g,gid=lukas,uid=lukas,mode=700 0 0
> 
> > Also, how long do you stay in the boot before you try to suspend?
> 
> During my reproduction sessions usually only shortly 1-5min, but I think I
> have seen this also after a slightly longer time. I can try to let it sit
> for longer if you think that is important.

No, I was looking to see how easy it is to reproduce right after a
fresh boot, now that I can reproduce I confirm it happens right away.

> On 03/04/2021 00:58, Luis Chamberlain wrote:
> > On Fri, Apr 2, 2021 at 3:19 PM Luis Chamberlain  wrote:
> > > 
> > > Lukas, can you share your /etc/fstab ? Also, how long do you stay in
> > > the boot before you try to suspend?
> > 
> > OK I cannot reproduce the issue with the modified patch I sent to
> > test_firmware, which if you enable config_enable_resume_test will
> > trigger a request_firmware() on resume, thus trying to mimic the race
> > you note. To test this you can simply use a loopback filesystem for
> > your /lib/firmware and create a btrfs filesystem for it, and then run:
> > 
> > echo 1 > /sys/devices/virtual/misc/tes

Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?

2021-04-03 Thread Lukas Middendorf

Hi Luis,

I now succeeded in reproducing this in VirtualBox running a F34 minimal 
installation with / on btrfs (with default firmware files).
I can send you the virtual disk file if you want it. With the kernel 
sourcecode and the compiled and installed kernel it currently is 21.8GiB.



On 03/04/2021 12:24, Lukas Middendorf wrote:

On 03/04/2021 00:58, Luis Chamberlain wrote:
Can you provide kernel logs for where you are seeing things get stuck at? 


I have dumped the output of the serial console with all outputs cranked 
to the max.
In [1] is the output with test_firmware loaded and suspend test running. 
This leads to a freeze. However it is not completely dead. While it does 
not visually react, at least from time to time there still are some 
messages from the kernel and the cursor is still blinking.



In [2] is a (successful) suspend and resume cycle with test_firmware not 
loaded.


Lukas

[1] https://gist.github.com/midluk/05716f714f778d26dc02771245df0ac9
[2] https://gist.github.com/midluk/a03eb953b6cf097688b8be2e0cd387fd


Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?

2021-04-03 Thread Lukas Middendorf



On 02/04/2021 20:02, Luis Chamberlain wrote:

No sorry, I dropped the ball on this but I managed to now spawn up the
virtual guests where I was doing development to reproduce this. Give
me some time and I will zero in on this now.

For now what I have is the following to test this, I next will work
on the userspace part.


I can report that your patch for test_firmware works (applied to current 
master from linus d93a0d4; I get some offsets and had to adjust some 
whitespace for it to apply). With that module (and resume test enabled) 
I get freezes at resume in the same cases that would also cause problems 
with si2168.


I'm testing this on bare metal F34 beta with / on a btrfs. I'm using 
nvidia driver again to make sure the system does not otherwise use any 
firmware from /usr/lib/firmware (confirmed with kernel debug messages 
for firmware_loader). My si2168 is not plugged in.


I tested it with a normal population of files in /usr/lib/firmware , 
without test-firmware.bin and also with a random 1MiB file in place. I 
tested after a reboot so it does not do caching.
With /usr/lib/firmware on a separate ext4 partition I can confirm with 
dmesg that the test_firmware suspend test actually works (does not freeze).
With /usr/lib/firmware on btrfs it fails in both cases (with and without 
the firmware files).
With caching (first suspend with the ext4 partition mounted, then a 
second suspend without) it does not freeze even with the firmware on btrfs.



One further thing I noticed which might be problematic in rare cases:
According to the kernel debug messages, the firmware-loader does not 
attempt to cache the firmware during suspend, if the previous call to 
request_firmware() has failed (file not present; call made during 
previous resume). In my opinion it should attempt to cache the firmware 
on suspend even in this case (If I remember correctly, 
firmware_request_cache also works without the file being present). In 
case some low-memory condition has caused the file system cache to lose 
the information about the file being non-present (or the file has been 
written after the initial attempt and is no longer in the file system 
cache), this might lead to freezes even for well-behaved drivers in case 
they reattempt to do request_firmware() on resume.

If the firmware is found during resume, it is cached on further suspends.
Given how long it took me to narrow down this problem in this (for me) 
reliably reproducible case, something like this happening at random 
would be almost impossible to debug/locate and might actually happen 
frequently in the wild.



On 03/04/2021 00:19, Luis Chamberlain wrote:

Lukas, can you share your /etc/fstab ?


This is the core part (everything else is unmounted), I shortened the 
UUIDs. The ext4 mount of course is also unmounted when I want it to 
fail, with it in place it reliably not-fails.


UUID=<1> /   btrfs   subvol=linux1 0 0
UUID=<2> /usr/lib/firmware   ext4defaults 1 1
UUID=<1> /home/lukas btrfs   subvol=linux1-f34-lukas 0 0
tmpfs /home/lukas/.cache tmpfs size=16g,gid=lukas,uid=lukas,mode=700 0 0


Also, how long do you stay in the boot before you try to suspend?


During my reproduction sessions usually only shortly 1-5min, but I think 
I have seen this also after a slightly longer time. I can try to let it 
sit for longer if you think that is important.


On 03/04/2021 00:58, Luis Chamberlain wrote:

On Fri, Apr 2, 2021 at 3:19 PM Luis Chamberlain  wrote:


Lukas, can you share your /etc/fstab ? Also, how long do you stay in
the boot before you try to suspend?


OK I cannot reproduce the issue with the modified patch I sent to
test_firmware, which if you enable config_enable_resume_test will
trigger a request_firmware() on resume, thus trying to mimic the race
you note. To test this you can simply use a loopback filesystem for
your /lib/firmware and create a btrfs filesystem for it, and then run:

echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test

systemctl suspend

Then resume. You should see "resume test" print on dmesg. I keep my
/lib/firmware/ empty and still, nothing.


Did you also try to create a random test-firmware.bin (I used 1M from 
/dev/urandom) instead of an empty /lib/firmware ?
If the directory is completely empty, it also does not freeze for me. If 
the directory is empty, any attempt to access its content can likely be 
directly served from cache, even if the actual directory has never been 
accessed before, as long as /lib (which is a symlink to /usr/lib on 
fedora) has been accessed (which will likely always be true). So I have 
to further add to my previous findings that "firmware directory is not 
completely empty" is a further prerequisite for it to fail.


Can you provide kernel logs for where you are seeing things get stuck at? 


The log does not have any entries from resume. For the attempts where it 
freezes the last entry in journalctl is

systemd-sleep[5050]: 

Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?

2021-04-02 Thread Luis Chamberlain
On Fri, Apr 2, 2021 at 3:19 PM Luis Chamberlain  wrote:
>
> Lukas, can you share your /etc/fstab ? Also, how long do you stay in
> the boot before you try to suspend?

OK I cannot reproduce the issue with the modified patch I sent to
test_firmware, which if you enable config_enable_resume_test will
trigger a request_firmware() on resume, thus trying to mimic the race
you note. To test this you can simply use a loopback filesystem for
your /lib/firmware and create a btrfs filesystem for it, and then run:

echo 1 > /sys/devices/virtual/misc/test_firmware/config_enable_resume_test

systemctl suspend

Then resume. You should see "resume test" print on dmesg. I keep my
/lib/firmware/ empty and still, nothing. Can you provide kernel logs
for where you are seeing things get stuck at? Note that I had
mentioned the races on suspend/resume do exist for any journaling
filesystem, but this typically happens if you are doing ongoing
writes. I suppose you are *not* doing writes and your filesystem is
idle.

As such without kernel logs I cannot be sure what the issue is, but at
this point after the initial testing I've done I don't suspect this is
a firmware API issue. You might be better off just reposting your
patches with the respective Reviewed-by tags and pestering your
maintainer.

PS. If you want to test this on a guest you bring the guest back up
with virsh dompmwakeup.

  Luis


Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?

2021-04-02 Thread Luis Chamberlain
Lukas, can you share your /etc/fstab ? Also, how long do you stay in
the boot before you try to suspend?

 Luis


Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?

2021-04-02 Thread Luis Chamberlain
On Thu, Apr 01, 2021 at 04:59:47PM +0200, Lukas Middendorf wrote:
> Hello Luis,
> 
> 
> On 18/08/2020 16:37, Luis Chamberlain wrote:
> > On Tue, Aug 18, 2020 at 12:04:51AM +0200, Lukas Middendorf wrote:
> > > On 17/08/2020 17:20, Luis Chamberlain wrote:
> > > > This helps, thanks so much, now we'll have to write a reproducer, thanks
> > > > for the report!!
> > > 
> > > Will you do it yourself or do you expect me to do anything for this?
> > 
> > I meant to imply that we'd do this, now that we understand the problem. 
> > Thanks
> > for your report!
> 
> any news on this issue? Did you succeed with reproducing this at your end?

No sorry, I dropped the ball on this but I managed to now spawn up the
virtual guests where I was doing development to reproduce this. Give me
some time and I will zero in on this now.

For now what I have is the following to test this, I next will work
on the userspace part.

diff --git a/lib/test_firmware.c b/lib/test_firmware.c
index 9fee2b93a8d1..f9e67fc4145a 100644
--- a/lib/test_firmware.c
+++ b/lib/test_firmware.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -33,6 +34,8 @@
 static DEFINE_MUTEX(test_fw_mutex);
 static const struct firmware *test_firmware;
 
+static struct platform_device *pdev;
+
 struct test_batched_req {
u8 idx;
int rc;
@@ -53,6 +56,9 @@ struct test_batched_req {
  * @sync_direct: when the sync trigger is used if this is true
  * request_firmware_direct() will be used instead.
  * @send_uevent: whether or not to send a uevent for async requests
+ * @enable_resume_test: if @senable_resume is true this will enable a test to
+ * issue a request_firmware() upon resume. This is useful to test resume
+ * after suspend filesystem races.
  * @num_requests: number of requests to try per test case. This is trigger
  * specific.
  * @reqs: stores all requests information
@@ -91,6 +97,7 @@ struct test_config {
bool into_buf;
bool sync_direct;
bool send_uevent;
+   bool enable_resume_test;
u8 num_requests;
u8 read_fw_idx;
 
@@ -184,6 +191,7 @@ static int __test_firmware_config_init(void)
test_fw_config->send_uevent = true;
test_fw_config->into_buf = false;
test_fw_config->sync_direct = false;
+   test_fw_config->enable_resume_test = false;
test_fw_config->req_firmware = request_firmware;
test_fw_config->test_result = 0;
test_fw_config->reqs = NULL;
@@ -257,6 +265,9 @@ static ssize_t config_show(struct device *dev,
len += scnprintf(buf+len, PAGE_SIZE - len,
"sync_direct:\t\t%s\n",
test_fw_config->sync_direct ? "true" : "false");
+   len += scnprintf(buf+len, PAGE_SIZE - len,
+   "enable_resume_test:\t\t%s\n",
+   test_fw_config->enable_resume_test ? "true" : "false");
len += scnprintf(buf+len, PAGE_SIZE - len,
"read_fw_idx:\t%u\n", test_fw_config->read_fw_idx);
 
@@ -422,6 +433,22 @@ static ssize_t config_sync_direct_show(struct device *dev,
 }
 static DEVICE_ATTR_RW(config_sync_direct);
 
+static ssize_t config_enable_resume_test_store(struct device *dev,
+  struct device_attribute *attr,
+  const char *buf, size_t count)
+{
+   return test_dev_config_update_bool(buf, count,
+  &test_fw_config->enable_resume_test);
+}
+
+static ssize_t config_enable_resume_test_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+   return test_dev_config_show_bool(buf, 
test_fw_config->enable_resume_test);
+}
+static DEVICE_ATTR_RW(config_enable_resume_test);
+
 static ssize_t config_send_uevent_store(struct device *dev,
struct device_attribute *attr,
const char *buf, size_t count)
@@ -929,6 +956,7 @@ static struct attribute *test_dev_attrs[] = {
TEST_FW_DEV_ATTR(config_into_buf),
TEST_FW_DEV_ATTR(config_sync_direct),
TEST_FW_DEV_ATTR(config_send_uevent),
+   TEST_FW_DEV_ATTR(config_enable_resume_test),
TEST_FW_DEV_ATTR(config_read_fw_idx),
 
/* These don't use the config at all - they could be ported! */
@@ -958,6 +986,81 @@ static struct miscdevice test_fw_misc_device = {
.groups = test_dev_groups,
 };
 
+static int __maybe_unused test_firmware_suspend(struct device *dev)
+{
+   return 0;
+}
+
+
+static int __maybe_unused test_firmware_resume(struct device *dev)
+{
+   int rc;
+
+   if (!test_fw_config->enable_resume_test)
+   return 0;
+
+   pr_info("resume test, loading '%s'\n", test_fw_config->name);
+
+   mutex_lock(&test_fw_mutex);
+   release_firmware(test_

Re: Is request_firmware() really safe to call in resume callback when /usr/lib/firmware is on btrfs?

2021-04-01 Thread Lukas Middendorf

Hello Luis,


On 18/08/2020 16:37, Luis Chamberlain wrote:

On Tue, Aug 18, 2020 at 12:04:51AM +0200, Lukas Middendorf wrote:

On 17/08/2020 17:20, Luis Chamberlain wrote:

This helps, thanks so much, now we'll have to write a reproducer, thanks
for the report!!


Will you do it yourself or do you expect me to do anything for this?


I meant to imply that we'd do this, now that we understand the problem. Thanks
for your report!


any news on this issue? Did you succeed with reproducing this at your end?

I retested this with 5.12-rc5 and everything still seems unchanged. I 
still get freezes on resume under the same circumstances described 
previously if /usr/lib/firmware is on btrfs. My patches to si2168.c were 
not merged.
The documentation of request_firmware() seems unchanged and does not 
reflect the fact that prerequisites exist for making it safe to call 
during resume callbacks.


Lukas