Re: Linux 5.10.25
Jamie Heilman wrote: > Greg Kroah-Hartman wrote: > > On Sat, Mar 20, 2021 at 09:31:55PM +0000, Jamie Heilman wrote: > > > [ugh, resent with the lkml headers unbroken, sorry about the dupe] > > > > > > Greg Kroah-Hartman wrote: > > > > J. Bruce Fields (2): > > > > Revert "nfsd4: remove check_conflicting_opens warning" > > > > Revert "nfsd4: a client's own opens needn't prevent delegations" > > > > > > Hrm, just got this when I udpated my nfs server (32bit Via EPIA system) > > > from 5.10.20 to 5.10.25: > > > > > > [ 49.225914] NFSD: Using UMH upcall client tracking operations. > > > [ 49.231919] NFSD: starting 90-second grace period (net f036) > > > [ 50.036973] [ cut here ] > > > [ 50.041771] WARNING: CPU: 0 PID: 2284 at fs/nfsd/nfs4state.c:4968 > > > nfsd4_process_open2+0xf9c/0x1170 [nfsd] > > > [ 50.051434] Modules linked in: md5 cpufreq_conservative > > > cpufreq_userspace cpufreq_powersave cpufreq_ondemand autofs4 quota_v2 > > > quota_tree nfsd auth_rpcgss nfs lockd grace nfs_ssc fscache sunrpc > > > xt_mark cls_fw sch_htb iptable_nat xt_nat nf_nat ipt_REJECT > > > nf_reject_ipv4 xt_tcpudp xt_multiport iptable_mangle xt_state > > > xt_conntrack nf_conntrack nf_defrag_ipv4 nf_log_ipv4 nf_log_common xt_LOG > > > xt_limit iptable_filter ip_tables x_tables nhpoly1305 chacha_generic > > > libchacha adiantum libpoly1305 dm_crypt dm_mod snd_hda_codec_via > > > snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec > > > snd_hwdep snd_hda_core snd_pcm snd_timer snd via_rhine psmouse soundcore > > > mii via_agp sg via_velocity evdev agpgart > > > [ 50.113386] CPU: 0 PID: 2284 Comm: nfsd Tainted: GT > > > 5.10.25 #1 > > > [ 50.120669] Hardware name: To Be Filled By O.E.M. To Be Filled By > > > O.E.M./To be filled by O.E.M., BIOS 080014 06/01/2009 > > > [ 50.131652] EIP: nfsd4_process_open2+0xf9c/0x1170 [nfsd] > > > [ 50.137036] Code: 04 88 45 a4 88 07 8b 45 a0 8d 78 49 8b 45 84 8d 70 > > > 01 e9 2b f8 ff ff c7 45 9c 00 00 00 00 31 ff bb 00 00 27 67 e9 04 f6 ff > > > ff <0f> 0b e9 a0 f5 ff ff 0f b6 d4 0f a3 15 94 3f 23 f8 0f 83 b1 fd ff > > > [ 50.155866] EAX: EBX: 82da25b0 ECX: 830c0920 EDX: > > > [ 50.162187] ESI: 82da25b0 EDI: 82da55a0 EBP: 8310be68 ESP: 8310bddc > > > [ 50.168507] DS: 007b ES: 007b FS: GS: 00e0 SS: 0068 EFLAGS: > > > 00010246 > > > [ 50.175338] CR0: 80050033 CR2: 00551e50 CR3: 03222000 CR4: 06b0 > > > [ 50.181654] Call Trace: > > > [ 50.184165] ? inode_permission+0x17/0xc0 > > > [ 50.188289] nfsd4_open+0x429/0x910 [nfsd] > > > [ 50.192483] ? nfsd4_encode_operation+0x185/0x1e0 [nfsd] > > > [ 50.197900] ? nfsd4_rename+0x1a0/0x1a0 [nfsd] > > > [ 50.202439] nfsd4_proc_compound+0x457/0x6c0 [nfsd] > > > [ 50.207419] nfsd_dispatch+0xdc/0x1a0 [nfsd] > > > [ 50.211816] svc_process_common+0x38a/0x650 [sunrpc] > > > [ 50.216880] ? svc_xprt_do_enqueue+0xd7/0xe0 [sunrpc] > > > [ 50.222017] ? svc_xprt_received+0x5d/0xf0 [sunrpc] > > > [ 50.227000] ? nfsd_svc+0x300/0x300 [nfsd] > > > [ 50.231190] svc_process+0xa9/0xf0 [sunrpc] > > > [ 50.235468] nfsd+0xcd/0x120 [nfsd] > > > [ 50.239025] kthread+0xe1/0x100 > > > [ 50.242259] ? nfsd_destroy+0x50/0x50 [nfsd] > > > [ 50.246588] ? kthread_create_on_node+0x30/0x30 > > > [ 50.251165] ret_from_fork+0x1c/0x28 > > > [ 50.254789] ---[ end trace 171bde4774bc9795 ]--- > > > > > > Can't readily reproduce it though, so likely a race condition or > > > something that requires more state buildup than I have after a few > > > minutes of uptime. Kernel config at > > > http://audible.transient.net/~jamie/k/nfsd.config-5.10.25 in case you > > > think this worth more investigation. > > > > Do you also have this issue in Linus's tree and the latest 5.11.y > > release? > > Haven't tried it, but like I said, I couldn't even reproduce it again > with 5.10.25, or booting between 5.10.20 to 5.10.25 again ... I'll > give the others a shot and see if I can repro it there. Yeah, can't repro it with 5.11.8 or 5.12.0-rc3-00281-g1d4345eb51a1 either, seems to have been a spurious thing. -- Jamie Heilman http://audible.transient.net/~jamie/
Re: Linux 5.10.25
Greg Kroah-Hartman wrote: > On Sat, Mar 20, 2021 at 09:31:55PM +0000, Jamie Heilman wrote: > > [ugh, resent with the lkml headers unbroken, sorry about the dupe] > > > > Greg Kroah-Hartman wrote: > > > J. Bruce Fields (2): > > > Revert "nfsd4: remove check_conflicting_opens warning" > > > Revert "nfsd4: a client's own opens needn't prevent delegations" > > > > Hrm, just got this when I udpated my nfs server (32bit Via EPIA system) > > from 5.10.20 to 5.10.25: > > > > [ 49.225914] NFSD: Using UMH upcall client tracking operations. > > [ 49.231919] NFSD: starting 90-second grace period (net f036) > > [ 50.036973] [ cut here ] > > [ 50.041771] WARNING: CPU: 0 PID: 2284 at fs/nfsd/nfs4state.c:4968 > > nfsd4_process_open2+0xf9c/0x1170 [nfsd] > > [ 50.051434] Modules linked in: md5 cpufreq_conservative > > cpufreq_userspace cpufreq_powersave cpufreq_ondemand autofs4 quota_v2 > > quota_tree nfsd auth_rpcgss nfs lockd grace nfs_ssc fscache sunrpc xt_mark > > cls_fw sch_htb iptable_nat xt_nat nf_nat ipt_REJECT nf_reject_ipv4 > > xt_tcpudp xt_multiport iptable_mangle xt_state xt_conntrack nf_conntrack > > nf_defrag_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit iptable_filter > > ip_tables x_tables nhpoly1305 chacha_generic libchacha adiantum libpoly1305 > > dm_crypt dm_mod snd_hda_codec_via snd_hda_codec_generic snd_hda_intel > > snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd > > via_rhine psmouse soundcore mii via_agp sg via_velocity evdev agpgart > > [ 50.113386] CPU: 0 PID: 2284 Comm: nfsd Tainted: GT > > 5.10.25 #1 > > [ 50.120669] Hardware name: To Be Filled By O.E.M. To Be Filled By > > O.E.M./To be filled by O.E.M., BIOS 080014 06/01/2009 > > [ 50.131652] EIP: nfsd4_process_open2+0xf9c/0x1170 [nfsd] > > [ 50.137036] Code: 04 88 45 a4 88 07 8b 45 a0 8d 78 49 8b 45 84 8d 70 01 > > e9 2b f8 ff ff c7 45 9c 00 00 00 00 31 ff bb 00 00 27 67 e9 04 f6 ff ff > > <0f> 0b e9 a0 f5 ff ff 0f b6 d4 0f a3 15 94 3f 23 f8 0f 83 b1 fd ff > > [ 50.155866] EAX: EBX: 82da25b0 ECX: 830c0920 EDX: > > [ 50.162187] ESI: 82da25b0 EDI: 82da55a0 EBP: 8310be68 ESP: 8310bddc > > [ 50.168507] DS: 007b ES: 007b FS: GS: 00e0 SS: 0068 EFLAGS: 00010246 > > [ 50.175338] CR0: 80050033 CR2: 00551e50 CR3: 03222000 CR4: 06b0 > > [ 50.181654] Call Trace: > > [ 50.184165] ? inode_permission+0x17/0xc0 > > [ 50.188289] nfsd4_open+0x429/0x910 [nfsd] > > [ 50.192483] ? nfsd4_encode_operation+0x185/0x1e0 [nfsd] > > [ 50.197900] ? nfsd4_rename+0x1a0/0x1a0 [nfsd] > > [ 50.202439] nfsd4_proc_compound+0x457/0x6c0 [nfsd] > > [ 50.207419] nfsd_dispatch+0xdc/0x1a0 [nfsd] > > [ 50.211816] svc_process_common+0x38a/0x650 [sunrpc] > > [ 50.216880] ? svc_xprt_do_enqueue+0xd7/0xe0 [sunrpc] > > [ 50.222017] ? svc_xprt_received+0x5d/0xf0 [sunrpc] > > [ 50.227000] ? nfsd_svc+0x300/0x300 [nfsd] > > [ 50.231190] svc_process+0xa9/0xf0 [sunrpc] > > [ 50.235468] nfsd+0xcd/0x120 [nfsd] > > [ 50.239025] kthread+0xe1/0x100 > > [ 50.242259] ? nfsd_destroy+0x50/0x50 [nfsd] > > [ 50.246588] ? kthread_create_on_node+0x30/0x30 > > [ 50.251165] ret_from_fork+0x1c/0x28 > > [ 50.254789] ---[ end trace 171bde4774bc9795 ]--- > > > > Can't readily reproduce it though, so likely a race condition or > > something that requires more state buildup than I have after a few > > minutes of uptime. Kernel config at > > http://audible.transient.net/~jamie/k/nfsd.config-5.10.25 in case you > > think this worth more investigation. > > Do you also have this issue in Linus's tree and the latest 5.11.y > release? Haven't tried it, but like I said, I couldn't even reproduce it again with 5.10.25, or booting between 5.10.20 to 5.10.25 again ... I'll give the others a shot and see if I can repro it there. -- Jamie Heilman http://audible.transient.net/~jamie/
Re: Linux 5.10.25
[ugh, resent with the lkml headers unbroken, sorry about the dupe] Greg Kroah-Hartman wrote: > J. Bruce Fields (2): > Revert "nfsd4: remove check_conflicting_opens warning" > Revert "nfsd4: a client's own opens needn't prevent delegations" Hrm, just got this when I udpated my nfs server (32bit Via EPIA system) from 5.10.20 to 5.10.25: [ 49.225914] NFSD: Using UMH upcall client tracking operations. [ 49.231919] NFSD: starting 90-second grace period (net f036) [ 50.036973] [ cut here ] [ 50.041771] WARNING: CPU: 0 PID: 2284 at fs/nfsd/nfs4state.c:4968 nfsd4_process_open2+0xf9c/0x1170 [nfsd] [ 50.051434] Modules linked in: md5 cpufreq_conservative cpufreq_userspace cpufreq_powersave cpufreq_ondemand autofs4 quota_v2 quota_tree nfsd auth_rpcgss nfs lockd grace nfs_ssc fscache sunrpc xt_mark cls_fw sch_htb iptable_nat xt_nat nf_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport iptable_mangle xt_state xt_conntrack nf_conntrack nf_defrag_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit iptable_filter ip_tables x_tables nhpoly1305 chacha_generic libchacha adiantum libpoly1305 dm_crypt dm_mod snd_hda_codec_via snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd via_rhine psmouse soundcore mii via_agp sg via_velocity evdev agpgart [ 50.113386] CPU: 0 PID: 2284 Comm: nfsd Tainted: GT 5.10.25 #1 [ 50.120669] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To be filled by O.E.M., BIOS 080014 06/01/2009 [ 50.131652] EIP: nfsd4_process_open2+0xf9c/0x1170 [nfsd] [ 50.137036] Code: 04 88 45 a4 88 07 8b 45 a0 8d 78 49 8b 45 84 8d 70 01 e9 2b f8 ff ff c7 45 9c 00 00 00 00 31 ff bb 00 00 27 67 e9 04 f6 ff ff <0f> 0b e9 a0 f5 ff ff 0f b6 d4 0f a3 15 94 3f 23 f8 0f 83 b1 fd ff [ 50.155866] EAX: EBX: 82da25b0 ECX: 830c0920 EDX: [ 50.162187] ESI: 82da25b0 EDI: 82da55a0 EBP: 8310be68 ESP: 8310bddc [ 50.168507] DS: 007b ES: 007b FS: GS: 00e0 SS: 0068 EFLAGS: 00010246 [ 50.175338] CR0: 80050033 CR2: 00551e50 CR3: 03222000 CR4: 06b0 [ 50.181654] Call Trace: [ 50.184165] ? inode_permission+0x17/0xc0 [ 50.188289] nfsd4_open+0x429/0x910 [nfsd] [ 50.192483] ? nfsd4_encode_operation+0x185/0x1e0 [nfsd] [ 50.197900] ? nfsd4_rename+0x1a0/0x1a0 [nfsd] [ 50.202439] nfsd4_proc_compound+0x457/0x6c0 [nfsd] [ 50.207419] nfsd_dispatch+0xdc/0x1a0 [nfsd] [ 50.211816] svc_process_common+0x38a/0x650 [sunrpc] [ 50.216880] ? svc_xprt_do_enqueue+0xd7/0xe0 [sunrpc] [ 50.222017] ? svc_xprt_received+0x5d/0xf0 [sunrpc] [ 50.227000] ? nfsd_svc+0x300/0x300 [nfsd] [ 50.231190] svc_process+0xa9/0xf0 [sunrpc] [ 50.235468] nfsd+0xcd/0x120 [nfsd] [ 50.239025] kthread+0xe1/0x100 [ 50.242259] ? nfsd_destroy+0x50/0x50 [nfsd] [ 50.246588] ? kthread_create_on_node+0x30/0x30 [ 50.251165] ret_from_fork+0x1c/0x28 [ 50.254789] ---[ end trace 171bde4774bc9795 ]--- Can't readily reproduce it though, so likely a race condition or something that requires more state buildup than I have after a few minutes of uptime. Kernel config at http://audible.transient.net/~jamie/k/nfsd.config-5.10.25 in case you think this worth more investigation. -- Jamie Heilman http://audible.transient.net/~jamie/
Re: [kbuild-all] Re: [PATCH] gcov: fail build on gcov_info size mismatch
Linus Torvalds wrote: > On Sun, Mar 14, 2021 at 7:32 PM Rong Chen wrote: > > > > It can be reproduced with '-a' option in dash: > > Oh, ok. That kind of explains it. > > 'dash' is trash. Please somebody make a bug report. > > > $ a="!" > > $ [ "$a" = ".size" ] > > $ [ "$a" = ".size" -a "$b" = ".LPBX0," ] > > sh: 2: [: =: unexpected operator > > This is 100% a dash bug. There is no question what-so-ever about it. > This is not some kind of "POSIX is ambiguous", or "the handling of > '-a' is complicated". > > It's simply just that dash is buggy. > > > While dash supports most uses of the -a and -o options, they have > > very confusing semantics even in bash and are best avoided. > > No, they have perfectly sane semantics in bash, and in POSIX. > > See > > https://pubs.opengroup.org/onlinepubs/9699919799/utilities/test.html > > and there is absolutely zero question that bash does this correctly, > and dash does not. > > But yes, it seems to be easy to work around, but still - could some > Ubuntu person please open a bug report on dash? fwiw, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=850202
Re: [Nouveau] nouveau regression post v5.8, still present in v5.10
Jamie Heilman wrote: > Jamie Heilman wrote: > > Karol Herbst wrote: > > > fyi, there is a patch which solves a maybe related issue on your GPU, > > > mind giving it a try before we dig further? > > > https://gitlab.freedesktop.org/drm/nouveau/-/issues/14#note_767791 > > > > So, I tried that. Turns out, I can still trigger a problem. Is it > > the same problem? Maybe? I also tried applying the patch from > > > > ca386aa7155a ("drm/nouveau/kms/nv50-gp1xx: add WAR for EVO push buffer HW > > bug") > > to 5.8.0-rc6-01516-g0a96099691c8 and very interestingly, it changed > > the mode of failure to same thing I saw with 5.10.9 patched with the patch > > from that bug report. In both cases I get this in the log: > > > > kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16 > > kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16 > > kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16 > > kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16 > > ... > > and so on > > > > In one incident my monitor would't even wake up anymore after this. > > > > > > I'm trying to repo it now on an unpatched 5.8.0-rc6-01515-gae09163ac27c > > right now, as running glxgears does seem to help reproduce problems > > faster which is nice, I'm just not entirely sure it's the same set of > > problems; hopefully that version is free from issues, but we'll > > see... > > Ugh, well I can crash 5.8.0-rc6-01515-gae09163ac27c and 5.8.18 in > basically the same way running glxgears and a xset dpms force off > loop. So I'm starting to think it's not the same thing, and that > problem has been latent from before I started having periodic issues. > > I should note that my exact testing technique for the above was to run > 4 copies of glxgears and the xset force dpms off loop at the same > time. Really looks more like it triggers a resource starvation issue > maybe. The crash is also worse, particularly if I don't do anything > about it right away as my workstation eventually falls off the network > and I'm forced to power cycle it; the crashes I was chasing after > wouldn't do quite that much violence, normally I could still log in, > rebuild a kernel, and shut things down cleanly. > > More than one bug here I suspect. OK, I went back and bisected again while patching known issues to get a better idea what was causing the problem I've been having and I'm confident it was the bug which Bastian Beranek's patch (now in mainline) addressed. My original bisection got confused by the EVO push buffer HW bug which was fixed in ca386aa7155a54. Once I bisected with the patch from ca386aa7155a54 applied, my bisection landed on f844eb485eb05 and Bastian Beranek's patch fixed that right up. 'course I remain mildly concerned I can crash the kernel with little more than glxgears and xset ... but the original stability problem I reported I can safely say has been fixed. If I can figure out the nature of what I suspect is unrecoverable resource starvation, I'll start a new thread for that. -- Jamie Heilman http://audible.transient.net/~jamie/
Re: bisected regression in v5.11-rc1 snd-usb-audio
Takashi Iwai wrote: > On Sat, 23 Jan 2021 12:08:02 +0100, > Jamie Heilman wrote: > > > > > If the problem is still seen with the very latest Linus tree and the > > > previous patch, please enable the dyndbg, e.g. pass dydbg=+p option to > > > snd-usb-audio module, i.e. reload like > > >modprobe snd-usb-audio dydbg=+p > > > or boot with > > >snd_usb_audio.dyndbg=+p > > > boot option, retest, and give the kernel messages. > > > > OK, updated to 5.11.0-rc4-00202-gfe75a21824e7, repatched & rebuilt with: > > CONFIG_SND_DEBUG=y > > CONFIG_SND_CTL_VALIDATION=y > > CONFIG_DYNAMIC_DEBUG=y > > CONFIG_DYNAMIC_DEBUG_CORE=y > > > > and added dyndbg=+p to my snd-usb-audio options, dmesg after an > > attempted aplay execution is attached. > (snip) > > [ 108.941521] WARNING: CPU: 0 PID: 2776 at sound/core/pcm_lib.c:1148 > > snd_pcm_hw_rule_add+0xd3/0x160 [snd_pcm] > > OK, this must be the culprit of the weird behavior. > I totally overlooked that the max dependencies are three because of > the terminator. > > Could you try the patch below in addition? Yep, this works now, thanks! FWIW, here's what the debug output looks like when its working, and I start jack now: usb 1-1.1.2: 1:1: found sync_ep=0x81, iface=1, alt=1, implicit_fb=0 usb 1-1.1.2: 1:1: add audio endpoint 0x1 usb 1-1.1.2: Creating new data endpoint #1 usb 1-1.1.2: Creating new sync endpoint #81 usb 1-1.1.2: 1:1 Set sample rate 192000, clock 40 usb 1-1.1.2: [10] FU [M2Tech Audio 2.0 Output Playback Switch] ch = 2, val = 0/1/1 usb 1-1.1.2: [10] FU [M2Tech Audio 2.0 Output Playback Switch] ch = 1, val = 0/1/1 usb 1-1.1.2: RANGE setting not yet supported usb 1-1.1.2: [10] FU [M2Tech Audio 2.0 Output Playback Volume] ch = 2, val = -32512/0/256 usb 1-1.1.2: RANGE setting not yet supported usb 1-1.1.2: [10] FU [M2Tech Audio 2.0 Output Playback Volume] ch = 1, val = -32512/0/256 usb 1-1.1.2: Open EP 0x1, iface=1:1, idx=0 usb 1-1.1.2: channels=2, rate=96000, format=S32_LE, period_bytes=8192, periods=3, implicit_fb=0 usb 1-1.1.2: Open EP 0x81, iface=1:1, idx=1 usb 1-1.1.2: channels=2, rate=96000, format=S32_LE, period_bytes=8192, periods=3, implicit_fb=0 usb 1-1.1.2: Setting usb interface 1:0 for EP 0x1 usb 1-1.1.2: 1:1 Set sample rate 96000, clock 40 usb 1-1.1.2: Setting params for data EP 0x1, pipe 0x8600 usb 1-1.1.2: Set up 12 URBS, ret=0 usb 1-1.1.2: Setting usb interface 1:1 for EP 0x1 usb 1-1.1.2: Setting params for sync EP 0x81, pipe 0x8680 usb 1-1.1.2: Set up 4 URBS, ret=0 usb 1-1.1.2: Starting data EP 0x1 (running 0) usb 1-1.1.2: 12 URBs submitted for EP 0x1 usb 1-1.1.2: Starting sync EP 0x81 (running 0) usb 1-1.1.2: 4 URBs submitted for EP 0x81 usb 1-1.1.2: 1:1 Start Playback PCM > --- a/include/sound/pcm.h > +++ b/include/sound/pcm.h > @@ -229,7 +229,7 @@ typedef int (*snd_pcm_hw_rule_func_t)(struct > snd_pcm_hw_params *params, > struct snd_pcm_hw_rule { > unsigned int cond; > int var; > - int deps[4]; > + int deps[5]; > > snd_pcm_hw_rule_func_t func; > void *private; -- Jamie Heilman http://audible.transient.net/~jamie/
Re: bisected regression in v5.11-rc1 snd-usb-audio
Takashi Iwai wrote: > On Fri, 22 Jan 2021 11:03:48 +0100, > Jamie Heilman wrote: > > > > Takashi Iwai wrote: > > > You seem hitting a firmware bug, and it doesn't look like the only > > > case. Interestingly, the backport of 5.11 USB-audio stuff on 5.3 > > > kernel on openSUSE Leap 15.2 caused a similar bug on Steinberg device, > > > while it worked with 5.11-rc. So I thought this specific with the > > > older kernel (in USB core changes or such), but my guess seems wrong, > > > and the bug looks remaining in 5.11 for some cases. > > > > > > Below is the fix patch. Please give it a try. > > > > > > > > > thanks, > > > > > > Takashi > > > > > > -- 8< -- > > > From: Takashi Iwai > > > Subject: [PATCH] ALSA: usb-audio: workaround for iface reset issue > > > > > > The recently introduced sample rate validation code seems causing a > > > problem on some devices; namely, after performing this, the bus gets > > > screwed and it influences even on other USB devices. > > > As a quick workaround, perform it only for the necessary devices; > > > currently MOTU devices are known to need the valid altset checks, so > > > filter out other devices. > > > > > > Fixes: 93db51d06b32 ("ALSA: usb-audio: Check valid altsetting at parsing > > > rates for UAC2/3") > > > Reported-by: Jamie Heilman > > > BugLink: https://bugzilla.suse.com/show_bug.cgi?id=1178203 > > > Signed-off-by: Takashi Iwai > > > --- > > > sound/usb/format.c | 11 +++ > > > 1 file changed, 11 insertions(+) > > > > > > diff --git a/sound/usb/format.c b/sound/usb/format.c > > > index 9ebc5d202c87..e6ff317a6785 100644 > > > --- a/sound/usb/format.c > > > +++ b/sound/usb/format.c > > > @@ -466,6 +466,17 @@ static int validate_sample_rate_table_v2v3(struct > > > snd_usb_audio *chip, > > > unsigned int nr_rates; > > > int i, err; > > > > > > + /* performing the rate verification may lead to unexpected USB bus > > > + * behavior afterwards by some unknown reason. Do this only for the > > > + * known devices. > > > + */ > > > + switch (USB_ID_VENDOR(chip->usb_id)) { > > > + case 0x07fd: /* MOTU */ > > > + break; > > > + default: > > > + return 0; /* don't perform the validation as default */ > > > + } > > > + > > > table = kcalloc(fp->nr_rates, sizeof(*table), GFP_KERNEL); > > > if (!table) > > > return -ENOMEM; > > > > Well, with this patch, while I can get it to seemingly initialize > > without errors by replugging, I can't actually get it to work. Jack > > says: > > > > Starting jack server... > > JACK server starting in realtime mode with priority 10 > > self-connect-mode is "Don't restrict self connect requests" > > Acquired audio card Audio1 > > creating alsa driver ... > > hw:CARD=hiFace2,DEV=0|-|1024|3|96000|0|0|nomon|swmeter|-|32bit > > ERROR: ALSA: Cannot open PCM device alsa_pcm for playback. Falling back to > > capture-only mode > > Released audio card Audio1 > > ERROR: Cannot initialize driver > > ERROR: JackServer::Open failed with -1 > > ERROR: Failed to open server > > Which kernel did you test? There have been a few more USB-audio fixes > that landed to Linus tree yesterday, so you'd need to test with it (in > addition to the previous patch). Sorry, thought that'd be clear from the alsa-info output, but yes, the tip of Linus's tree at the time: 5.11.0-rc4-00164-g83d09ad4b950 > > here's what aplay does when it tries: > > > > aplay: main:830: audio open error: Invalid argument > > > > strace of that says: > > openat(AT_FDCWD, "/dev/snd/pcmC1D0p", O_RDWR|O_NONBLOCK|O_CLOEXEC) = -1 > > EINVAL (Invalid argument) > > Hm, it's weird that open returns -EINVAL, not the later ioctl... > > > I dug up a different USB sound device (bithead) and plugged it in, it > > still works just fine with the same commands. > > > > The only difference I see in the alsa-info output is that > > > > control.7 { > > iface CARD > > name 'Keep Interface' > > value false > > comment { > > access 'read write' > > type BOOLEAN > > count 1 > >
Re: [Nouveau] nouveau regression post v5.8, still present in v5.10
Jamie Heilman wrote: > Karol Herbst wrote: > > fyi, there is a patch which solves a maybe related issue on your GPU, > > mind giving it a try before we dig further? > > https://gitlab.freedesktop.org/drm/nouveau/-/issues/14#note_767791 > > So, I tried that. Turns out, I can still trigger a problem. Is it > the same problem? Maybe? I also tried applying the patch from > > ca386aa7155a ("drm/nouveau/kms/nv50-gp1xx: add WAR for EVO push buffer HW > bug") > to 5.8.0-rc6-01516-g0a96099691c8 and very interestingly, it changed > the mode of failure to same thing I saw with 5.10.9 patched with the patch > from that bug report. In both cases I get this in the log: > > kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16 > kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16 > kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16 > kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16 > ... > and so on > > In one incident my monitor would't even wake up anymore after this. > > > I'm trying to repo it now on an unpatched 5.8.0-rc6-01515-gae09163ac27c > right now, as running glxgears does seem to help reproduce problems > faster which is nice, I'm just not entirely sure it's the same set of > problems; hopefully that version is free from issues, but we'll > see... Ugh, well I can crash 5.8.0-rc6-01515-gae09163ac27c and 5.8.18 in basically the same way running glxgears and a xset dpms force off loop. So I'm starting to think it's not the same thing, and that problem has been latent from before I started having periodic issues. I should note that my exact testing technique for the above was to run 4 copies of glxgears and the xset force dpms off loop at the same time. Really looks more like it triggers a resource starvation issue maybe. The crash is also worse, particularly if I don't do anything about it right away as my workstation eventually falls off the network and I'm forced to power cycle it; the crashes I was chasing after wouldn't do quite that much violence, normally I could still log in, rebuild a kernel, and shut things down cleanly. More than one bug here I suspect. > > On Thu, Jan 21, 2021 at 3:33 AM Jamie Heilman > > wrote: > > > > > > Karol Herbst wrote: > > > > On Wed, Jan 6, 2021 at 4:25 AM Jamie Heilman > > > > wrote: > > > > > > > > > > Jamie Heilman wrote: > > > > > > Jamie Heilman wrote: > > > > > > > Karol Herbst wrote: > > > > > > > > do you think you'd be able to do a kernel bisect in order to > > > > > > > > pinpoint > > > > > > > > the actual commit causing it? Thanks > > > > > > > > > > > > > > No. I can't reproduce it reliably. I if I could, bisection > > > > > > > wouldn't > > > > > > > be a problem but as I can't and as it can take weeks for the > > > > > > > problem > > > > > > > to occur there's essentially no chance. I know it regressed > > > > > > > roughly > > > > > > > in 5.8-rc1 only because that's what I was running when the first > > > > > > > event > > > > > > > occured. > > > > > > > > > > > > er, 5.9.0-rc1 rather > > > > > > > > > > Actually ... I've found a way to reproduce this in hours intead of > > > > > weeks, so I think I may be able to bisect it after all, it's something > > > > > of a brute force approach and its probably doing horrible things to > > > > > the backlight in my poor old monitor, but just running this: > > > > > > > > > > #!/bin/sh > > > > > sleep 5 > > > > > while ! dmesg | tail | grep -q nouveau > > > > > do > > > > > xset dpms force off > > > > > sleep 65 > > > > > xdotool mousemove 1024 1024 mousemove restore > > > > > sleep 10 > > > > > done > > > > > > > > > > Does manage to trip the issue sooner than it would otherwise happen > > > > > with natural usage. Given that this is my primary workstation and I > > > > > sort of need it functional during waking hours, it'll take me a bit, > > > > > but I'll update folks when I have the error more dialed in. > > > > > > > > > > > > > huh interesting. Kind of feels like a random thing still. But I think > > > > in general you'd spend way
Re: [Nouveau] nouveau regression post v5.8, still present in v5.10
Karol Herbst wrote: > fyi, there is a patch which solves a maybe related issue on your GPU, > mind giving it a try before we dig further? > https://gitlab.freedesktop.org/drm/nouveau/-/issues/14#note_767791 So, I tried that. Turns out, I can still trigger a problem. Is it the same problem? Maybe? I also tried applying the patch from ca386aa7155a ("drm/nouveau/kms/nv50-gp1xx: add WAR for EVO push buffer HW bug") to 5.8.0-rc6-01516-g0a96099691c8 and very interestingly, it changed the mode of failure to same thing I saw with 5.10.9 patched with the patch from that bug report. In both cases I get this in the log: kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16 kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16 kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16 kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16 ... and so on In one incident my monitor would't even wake up anymore after this. I'm trying to repo it now on an unpatched 5.8.0-rc6-01515-gae09163ac27c right now, as running glxgears does seem to help reproduce problems faster which is nice, I'm just not entirely sure it's the same set of problems; hopefully that version is free from issues, but we'll see... > On Thu, Jan 21, 2021 at 3:33 AM Jamie Heilman > wrote: > > > > Karol Herbst wrote: > > > On Wed, Jan 6, 2021 at 4:25 AM Jamie Heilman > > > wrote: > > > > > > > > Jamie Heilman wrote: > > > > > Jamie Heilman wrote: > > > > > > Karol Herbst wrote: > > > > > > > do you think you'd be able to do a kernel bisect in order to > > > > > > > pinpoint > > > > > > > the actual commit causing it? Thanks > > > > > > > > > > > > No. I can't reproduce it reliably. I if I could, bisection > > > > > > wouldn't > > > > > > be a problem but as I can't and as it can take weeks for the problem > > > > > > to occur there's essentially no chance. I know it regressed roughly > > > > > > in 5.8-rc1 only because that's what I was running when the first > > > > > > event > > > > > > occured. > > > > > > > > > > er, 5.9.0-rc1 rather > > > > > > > > Actually ... I've found a way to reproduce this in hours intead of > > > > weeks, so I think I may be able to bisect it after all, it's something > > > > of a brute force approach and its probably doing horrible things to > > > > the backlight in my poor old monitor, but just running this: > > > > > > > > #!/bin/sh > > > > sleep 5 > > > > while ! dmesg | tail | grep -q nouveau > > > > do > > > > xset dpms force off > > > > sleep 65 > > > > xdotool mousemove 1024 1024 mousemove restore > > > > sleep 10 > > > > done > > > > > > > > Does manage to trip the issue sooner than it would otherwise happen > > > > with natural usage. Given that this is my primary workstation and I > > > > sort of need it functional during waking hours, it'll take me a bit, > > > > but I'll update folks when I have the error more dialed in. > > > > > > > > > > huh interesting. Kind of feels like a random thing still. But I think > > > in general you'd spend way too much time on this if you can't > > > reproduce within seconds/minutes and then it might not point out the > > > actual issue, because randomly the issue didn't appear and stuff. > > > > > > maybe you can tune it to have shorter pauses or something? I'd really > > > try to bring down the time per cycle. > > > > Well I'm confident enough, at this point, to say this bisects to > > 0a96099691c8 ("drm/nouveau/kms/nv50-: implement proper push buffer control > > logic") > > > > Now... I wish I could say the bisection was straightforward and > > simple, but it wasn't thanks to still not having a reproducer really > > dialed in. The above script doesn't work unless I've got some normal > > usage around it. It certainly triggers the issue sooner than it > > otherwise would, but by itself it isn't enough. I modified it > > somewhat in the hopes of capturing the rough idea of how many > > itterations it would take to trigger the problem by using: > > > > #!/bin/sh > > I=0 > > trap 'echo;echo $I' 0 > > trap 'exit' INT > > sleep 5 > > while ! dmesg | tail | grep -q nouveau > > do > > I=$(($I + 1)) > > xse
Re: bisected regression in v5.11-rc1 snd-usb-audio
Takashi Iwai wrote: > You seem hitting a firmware bug, and it doesn't look like the only > case. Interestingly, the backport of 5.11 USB-audio stuff on 5.3 > kernel on openSUSE Leap 15.2 caused a similar bug on Steinberg device, > while it worked with 5.11-rc. So I thought this specific with the > older kernel (in USB core changes or such), but my guess seems wrong, > and the bug looks remaining in 5.11 for some cases. > > Below is the fix patch. Please give it a try. > > > thanks, > > Takashi > > -- 8< -- > From: Takashi Iwai > Subject: [PATCH] ALSA: usb-audio: workaround for iface reset issue > > The recently introduced sample rate validation code seems causing a > problem on some devices; namely, after performing this, the bus gets > screwed and it influences even on other USB devices. > As a quick workaround, perform it only for the necessary devices; > currently MOTU devices are known to need the valid altset checks, so > filter out other devices. > > Fixes: 93db51d06b32 ("ALSA: usb-audio: Check valid altsetting at parsing > rates for UAC2/3") > Reported-by: Jamie Heilman > BugLink: https://bugzilla.suse.com/show_bug.cgi?id=1178203 > Signed-off-by: Takashi Iwai > --- > sound/usb/format.c | 11 +++ > 1 file changed, 11 insertions(+) > > diff --git a/sound/usb/format.c b/sound/usb/format.c > index 9ebc5d202c87..e6ff317a6785 100644 > --- a/sound/usb/format.c > +++ b/sound/usb/format.c > @@ -466,6 +466,17 @@ static int validate_sample_rate_table_v2v3(struct > snd_usb_audio *chip, > unsigned int nr_rates; > int i, err; > > + /* performing the rate verification may lead to unexpected USB bus > + * behavior afterwards by some unknown reason. Do this only for the > + * known devices. > + */ > + switch (USB_ID_VENDOR(chip->usb_id)) { > + case 0x07fd: /* MOTU */ > + break; > + default: > + return 0; /* don't perform the validation as default */ > + } > + > table = kcalloc(fp->nr_rates, sizeof(*table), GFP_KERNEL); > if (!table) > return -ENOMEM; Well, with this patch, while I can get it to seemingly initialize without errors by replugging, I can't actually get it to work. Jack says: Starting jack server... JACK server starting in realtime mode with priority 10 self-connect-mode is "Don't restrict self connect requests" Acquired audio card Audio1 creating alsa driver ... hw:CARD=hiFace2,DEV=0|-|1024|3|96000|0|0|nomon|swmeter|-|32bit ERROR: ALSA: Cannot open PCM device alsa_pcm for playback. Falling back to capture-only mode Released audio card Audio1 ERROR: Cannot initialize driver ERROR: JackServer::Open failed with -1 ERROR: Failed to open server here's what aplay does when it tries: aplay: main:830: audio open error: Invalid argument strace of that says: openat(AT_FDCWD, "/dev/snd/pcmC1D0p", O_RDWR|O_NONBLOCK|O_CLOEXEC) = -1 EINVAL (Invalid argument) I dug up a different USB sound device (bithead) and plugged it in, it still works just fine with the same commands. The only difference I see in the alsa-info output is that control.7 { iface CARD name 'Keep Interface' value false comment { access 'read write' type BOOLEAN count 1 } } has gone missing. I've attached it incase you can spot something I didn't (obviously the cards are in a different order which makes things somewhat tedious for a direct comparison). Apart from the device number on the usb bus, the lsusb -vvv output is the same as it was in with 5.10.9. -- Jamie Heilman http://audible.transient.net/~jamie/ upload=true=true= !! !!ALSA Information Script v 0.4.65 !! !!Script ran on: Fri Jan 22 09:51:41 UTC 2021 !!Linux Distribution !!-- Debian GNU/Linux bullseye/sid \n \l PRETTY_NAME="Debian GNU/Linux bullseye/sid" NAME="Debian GNU/Linux" ID=debian HOME_URL="https://www.debian.org/; SUPPORT_URL="https://www.debian.org/support; BUG_REPORT_URL="https://bugs.debian.org/; !!DMI Information !!--- Manufacturer: Dell Inc. Product Name: Precision WorkStation T3400 Product Version: Firmware Version: A14 System SKU: Board Vendor: Dell Inc. Board Name:0TP412 !!ACPI Device Status Information !!--- /sys/bus/acpi/devices/PNP0401:00/status 15 /sys/bus/acpi/devices/PNP0501:00/status 15 /sys/bus/acpi/devices/PNP0700:00/status 15 /sys/bus/acpi/devices/PNP0C0F:00/status 9 /sys/bus/acpi/devices/PNP0C0F:01/status
bisected regression in v5.11-rc1 snd-usb-audio
I've bisected a failure in support for the M2Tech USB HiFace Two 192kHz Digital Audio Interface device (read as: a reclocked USB S/PDIF) to the following ... commit 93db51d06b32227319dae2ac289029ccf1b33181 Author: Takashi Iwai Date: Mon Nov 23 09:53:09 2020 +0100 ALSA: usb-audio: Check valid altsetting at parsing rates for UAC2/3 This has always been a somewhat finicky device, but my life is a silent void without it, as it is currently a vital part of getting audio to my mixer (now, if I could get USB Audio on my Rane MP2015 actually working right I'd very happily take that approach instead[1], but I digress). It has been known to require replugging to initialize properly at times, but until now, it's always worked fine eventually. I've attached the dmesg from a working 5.10.9 kernel, the alsa-info output, and the lsusb -vvv output for this device when it's functioning. (Note, that alsa-info claims jack isn't running... that's not actually true though, maybe because I'm using jack 2's jackdbus subsystem, but jack is infact running, though it's probably not relevant to the issue at hand[2].) When I boot 93db51d06b32 or later I get lot of errors in the dmesg like: usb 1-1.1.2: New USB device found, idVendor=249c, idProduct=930b, bcdDevice= 2.13 usb 1-1.1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3 usb 1-1.1.2: Product: M2Tech USB Audio 2.0 usb 1-1.1.2: Manufacturer: M2Tech usb 1-1.1.2: SerialNumber: usb 1-1.1.2: cannot get ctl value: req = 0x83, wValue = 0x201, wIndex = 0xa00, type = 4 usb 1-1.1.2: 10:0: cannot get min/max values for control 2 (id 10) usb 1-1.1.2: cannot get ctl value: req = 0x83, wValue = 0x200, wIndex = 0xa00, type = 4 usb 1-1.1.2: 10:0: cannot get min/max values for control 2 (id 10) usbcore: registered new interface driver snd-usb-audio usb 1-1.1-port2: disabled by hub (EMI?), re-enabling... usb 1-1.1.2: USB disconnect, device number 6 usb 1-1.1.2: new full-speed USB device number 7 using ehci-pci usb 1-1.1.2: device descriptor read/64, error -32 usb 1-1.1.2: device descriptor read/64, error -32 usb 1-1.1.2: new full-speed USB device number 8 using ehci-pci usb 1-1.1.2: device descriptor read/64, error -32 usb 1-1.1.2: device descriptor read/64, error -32 usb 1-1.1-port2: attempt power cycle usb 1-1.1.2: new full-speed USB device number 9 using ehci-pci usb 1-1.1.2: device not accepting address 9, error -32 usb 1-1.1.2: new full-speed USB device number 10 using ehci-pci usb 1-1.1.2: device not accepting address 10, error -32 usb 1-1.1-port2: unable to enumerate USB device and obviously the device doesn't work at all. Sometimes there's more "cannot get ctl value:" noise than other times, but the above is a clean boot after being in a working state (it tends to require replugging to get back to a being usable again after this, once I've booted a working kernel, possibly becuase its hanging off a hub in my monitor, not the most elegant of setups, I know---but none of this changes even if I plug it directly into my workstation's USB ports, I tried that). I'm happy to try any patches, or provide more details, just ask. -- Jamie Heilman http://audible.transient.net/~jamie/ [1] it's never been quite right, if you ever want to try to figure out why, I'd be happy to tackle that separately [2] just in case, jack_control dp output: --- get driver parameters (type:isset:default:value) device: ALSA device name (str:set:hw:0:hw:CARD=hiFace2,DEV=0) capture: Provide capture ports. Optionally set device (str:notset:none:none) playback: Provide playback ports. Optionally set device (str:set:none:hw:CARD=hiFace2,DEV=0) rate: Sample rate (uint:set:48000:96000) period: Frames per period (uint:notset:1024:1024) nperiods: Number of periods of playback latency (uint:set:2:3) hwmon: Hardware monitoring, if available (bool:notset:False:False) hwmeter: Hardware metering, if available (bool:notset:False:False) duplex: Provide both capture and playback ports (bool:notset:True:True) softmode: Soft-mode, no xrun handling (bool:notset:False:False) monitor: Provide monitor ports for the output (bool:notset:False:False) dither: Dithering mode (char:notset:n:n) inchannels: Number of capture channels (defaults to hardware max) (uint:notset:0:0) outchannels: Number of playback channels (defaults to hardware max) (uint:notset:0:0) shorts: Try 16-bit samples before 32-bit (bool:notset:False:False) input-latency: Extra input latency (frames) (uint:notset:0:0) output-latency: Extra output latency (frames) (uint:notset:0:0) midi-driver: ALSA MIDI driver (str:notset:none:none) [0.00] Linux version 5.10.9 (jamie@cucamonga) (gcc (Debian 10.2.1-3) 10.2.1 20201224, GNU ld (GNU Binutils for Debi
Re: [Nouveau] nouveau regression post v5.8, still present in v5.10
Karol Herbst wrote: > On Wed, Jan 6, 2021 at 4:25 AM Jamie Heilman > wrote: > > > > Jamie Heilman wrote: > > > Jamie Heilman wrote: > > > > Karol Herbst wrote: > > > > > do you think you'd be able to do a kernel bisect in order to pinpoint > > > > > the actual commit causing it? Thanks > > > > > > > > No. I can't reproduce it reliably. I if I could, bisection wouldn't > > > > be a problem but as I can't and as it can take weeks for the problem > > > > to occur there's essentially no chance. I know it regressed roughly > > > > in 5.8-rc1 only because that's what I was running when the first event > > > > occured. > > > > > > er, 5.9.0-rc1 rather > > > > Actually ... I've found a way to reproduce this in hours intead of > > weeks, so I think I may be able to bisect it after all, it's something > > of a brute force approach and its probably doing horrible things to > > the backlight in my poor old monitor, but just running this: > > > > #!/bin/sh > > sleep 5 > > while ! dmesg | tail | grep -q nouveau > > do > > xset dpms force off > > sleep 65 > > xdotool mousemove 1024 1024 mousemove restore > > sleep 10 > > done > > > > Does manage to trip the issue sooner than it would otherwise happen > > with natural usage. Given that this is my primary workstation and I > > sort of need it functional during waking hours, it'll take me a bit, > > but I'll update folks when I have the error more dialed in. > > > > huh interesting. Kind of feels like a random thing still. But I think > in general you'd spend way too much time on this if you can't > reproduce within seconds/minutes and then it might not point out the > actual issue, because randomly the issue didn't appear and stuff. > > maybe you can tune it to have shorter pauses or something? I'd really > try to bring down the time per cycle. Well I'm confident enough, at this point, to say this bisects to 0a96099691c8 ("drm/nouveau/kms/nv50-: implement proper push buffer control logic") Now... I wish I could say the bisection was straightforward and simple, but it wasn't thanks to still not having a reproducer really dialed in. The above script doesn't work unless I've got some normal usage around it. It certainly triggers the issue sooner than it otherwise would, but by itself it isn't enough. I modified it somewhat in the hopes of capturing the rough idea of how many itterations it would take to trigger the problem by using: #!/bin/sh I=0 trap 'echo;echo $I' 0 trap 'exit' INT sleep 5 while ! dmesg | tail | grep -q nouveau do I=$(($I + 1)) xset dpms force off sleep 32 xdotool mousemove 1 12 mousemove restore sleep 28 done but ultimately that didn't really pan out the way I'd hoped. I don't think the itterations have all that much to do with the condition in the end. I wanted to try applying ca386aa7155a ("drm/nouveau/kms/nv50-gp1xx: add WAR for EVO push buffer HW bug") on top of the first bad commit becuase I sort of felt like when I was running -rc versions that things got a bit less chaotic after that commit landed, but it was just a gut feeling and I wanted to see if I could support it with metrics---but no, I can't really get consistent metrics even without that commit so I gave up, and decided to report what I've got so far. > > I'm using git bisect start -- drivers/gpu/drm include/drm include/video > > in an effort to make this go a bit quicker, let me know if you think > > that's a bad idea or I should add other paths. > > > > > > > On Sun, Dec 27, 2020 at 8:16 PM Jamie Heilman > > > > > wrote: > > > > > > > > > > > > Something between v5.8 and v5.9 has resulted in periodically losing > > > > > > video. > > > > > > Unfortunately, I can't reliably reproduce it, it seems to happen > > > > > > every > > > > > > once in a long while---I can go weeks without an occurance, but it > > > > > > always seems to happen after my workstation has been idle long > > > > > > enough > > > > > > to screen blank and put the monitor to sleep. I'm using a single > > > > > > display (Dell 2405FPW) connected via DVI, running X (Xorg 1.20.x > > > > > > from > > > > > > Debian sid). I don't really do anything fancy, xterms, a browser or > > > > > > two, play the occasional video, but like I said, I can't reliably > > > > > > reproduce this. I've had it happen about 11 times since August. > >
Re: [Nouveau] nouveau regression post v5.8, still present in v5.10
Jamie Heilman wrote: > Jamie Heilman wrote: > > Karol Herbst wrote: > > > do you think you'd be able to do a kernel bisect in order to pinpoint > > > the actual commit causing it? Thanks > > > > No. I can't reproduce it reliably. I if I could, bisection wouldn't > > be a problem but as I can't and as it can take weeks for the problem > > to occur there's essentially no chance. I know it regressed roughly > > in 5.8-rc1 only because that's what I was running when the first event > > occured. > > er, 5.9.0-rc1 rather Actually ... I've found a way to reproduce this in hours intead of weeks, so I think I may be able to bisect it after all, it's something of a brute force approach and its probably doing horrible things to the backlight in my poor old monitor, but just running this: #!/bin/sh sleep 5 while ! dmesg | tail | grep -q nouveau do xset dpms force off sleep 65 xdotool mousemove 1024 1024 mousemove restore sleep 10 done Does manage to trip the issue sooner than it would otherwise happen with natural usage. Given that this is my primary workstation and I sort of need it functional during waking hours, it'll take me a bit, but I'll update folks when I have the error more dialed in. I'm using git bisect start -- drivers/gpu/drm include/drm include/video in an effort to make this go a bit quicker, let me know if you think that's a bad idea or I should add other paths. > > > On Sun, Dec 27, 2020 at 8:16 PM Jamie Heilman > > > wrote: > > > > > > > > Something between v5.8 and v5.9 has resulted in periodically losing > > > > video. > > > > Unfortunately, I can't reliably reproduce it, it seems to happen every > > > > once in a long while---I can go weeks without an occurance, but it > > > > always seems to happen after my workstation has been idle long enough > > > > to screen blank and put the monitor to sleep. I'm using a single > > > > display (Dell 2405FPW) connected via DVI, running X (Xorg 1.20.x from > > > > Debian sid). I don't really do anything fancy, xterms, a browser or > > > > two, play the occasional video, but like I said, I can't reliably > > > > reproduce this. I've had it happen about 11 times since August. > > > > > > > > lspci -vv output is: > > > > > > > > 01:00.0 VGA compatible controller: NVIDIA Corporation G86 [Quadro NVS > > > > 290] (rev a1) (prog-if 00 [VGA controller]) > > > > Subsystem: NVIDIA Corporation G86 [Quadro NVS 290] > > > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- > > > > ParErr- Stepping- SERR- FastB2B- DisINTx+ > > > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- > > > > SERR- > > > Latency: 0, Cache Line Size: 64 bytes > > > > Interrupt: pin A routed to IRQ 28 > > > > Region 0: Memory at fc00 (32-bit, non-prefetchable) > > > > [size=16M] > > > > Region 1: Memory at d000 (64-bit, prefetchable) [size=256M] > > > > Region 3: Memory at fa00 (64-bit, non-prefetchable) > > > > [size=32M] > > > > Region 5: I/O ports at dc80 [size=128] > > > > Expansion ROM at 000c [disabled] [size=128K] > > > > Capabilities: [60] Power Management version 2 > > > > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA > > > > PME(D0-,D1-,D2-,D3hot-,D3cold-) > > > > Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- > > > > Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+ > > > > Address: fee01004 Data: 4023 > > > > Capabilities: [78] Express (v1) Endpoint, MSI 00 > > > > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s > > > > <512ns, L1 <4us > > > > ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- > > > > SlotPowerLimit 25.000W > > > > DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq- > > > > RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ > > > > MaxPayload 128 bytes, MaxReadReq 512 bytes > > > > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- > > > > AuxPwr- TransPend- > > > > LnkCap: Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, > > > > Exit Latency L0s <512ns, L1 <4us > > > > ClockPM- Surprise- L
Re: [Nouveau] nouveau regression post v5.8, still present in v5.10
Jamie Heilman wrote: > Karol Herbst wrote: > > do you think you'd be able to do a kernel bisect in order to pinpoint > > the actual commit causing it? Thanks > > No. I can't reproduce it reliably. I if I could, bisection wouldn't > be a problem but as I can't and as it can take weeks for the problem > to occur there's essentially no chance. I know it regressed roughly > in 5.8-rc1 only because that's what I was running when the first event > occured. er, 5.9.0-rc1 rather > > On Sun, Dec 27, 2020 at 8:16 PM Jamie Heilman > > wrote: > > > > > > Something between v5.8 and v5.9 has resulted in periodically losing video. > > > Unfortunately, I can't reliably reproduce it, it seems to happen every > > > once in a long while---I can go weeks without an occurance, but it > > > always seems to happen after my workstation has been idle long enough > > > to screen blank and put the monitor to sleep. I'm using a single > > > display (Dell 2405FPW) connected via DVI, running X (Xorg 1.20.x from > > > Debian sid). I don't really do anything fancy, xterms, a browser or > > > two, play the occasional video, but like I said, I can't reliably > > > reproduce this. I've had it happen about 11 times since August. > > > > > > lspci -vv output is: > > > > > > 01:00.0 VGA compatible controller: NVIDIA Corporation G86 [Quadro NVS > > > 290] (rev a1) (prog-if 00 [VGA controller]) > > > Subsystem: NVIDIA Corporation G86 [Quadro NVS 290] > > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- > > > ParErr- Stepping- SERR- FastB2B- DisINTx+ > > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- > > > SERR- > > Latency: 0, Cache Line Size: 64 bytes > > > Interrupt: pin A routed to IRQ 28 > > > Region 0: Memory at fc00 (32-bit, non-prefetchable) [size=16M] > > > Region 1: Memory at d000 (64-bit, prefetchable) [size=256M] > > > Region 3: Memory at fa00 (64-bit, non-prefetchable) [size=32M] > > > Region 5: I/O ports at dc80 [size=128] > > > Expansion ROM at 000c [disabled] [size=128K] > > > Capabilities: [60] Power Management version 2 > > > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA > > > PME(D0-,D1-,D2-,D3hot-,D3cold-) > > > Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- > > > Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+ > > > Address: fee01004 Data: 4023 > > > Capabilities: [78] Express (v1) Endpoint, MSI 00 > > > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s > > > <512ns, L1 <4us > > > ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- > > > SlotPowerLimit 25.000W > > > DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq- > > > RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ > > > MaxPayload 128 bytes, MaxReadReq 512 bytes > > > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- > > > TransPend- > > > LnkCap: Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, > > > Exit Latency L0s <512ns, L1 <4us > > > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp- > > > LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+ > > > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- > > > LnkSta: Speed 2.5GT/s (ok), Width x16 (ok) > > > TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- > > > Capabilities: [100 v1] Virtual Channel > > > Caps: LPEVC=0 RefClk=100ns PATEntryBits=1 > > > Arb:Fixed- WRR32- WRR64- WRR128- > > > Ctrl: ArbSelect=Fixed > > > Status: InProgress- > > > VC0:Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans- > > > Arb:Fixed- WRR32- WRR64- WRR128- TWRR128- > > > WRR256- > > > Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01 > > > Status: NegoPending- InProgress- > > > Capabilities: [128 v1] Power Budgeting > > > Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 > > > Len=024 > > > Kernel driver in use: nouveau > > &g
Re: [Nouveau] nouveau regression post v5.8, still present in v5.10
Karol Herbst wrote: > do you think you'd be able to do a kernel bisect in order to pinpoint > the actual commit causing it? Thanks No. I can't reproduce it reliably. I if I could, bisection wouldn't be a problem but as I can't and as it can take weeks for the problem to occur there's essentially no chance. I know it regressed roughly in 5.8-rc1 only because that's what I was running when the first event occured. > On Sun, Dec 27, 2020 at 8:16 PM Jamie Heilman > wrote: > > > > Something between v5.8 and v5.9 has resulted in periodically losing video. > > Unfortunately, I can't reliably reproduce it, it seems to happen every > > once in a long while---I can go weeks without an occurance, but it > > always seems to happen after my workstation has been idle long enough > > to screen blank and put the monitor to sleep. I'm using a single > > display (Dell 2405FPW) connected via DVI, running X (Xorg 1.20.x from > > Debian sid). I don't really do anything fancy, xterms, a browser or > > two, play the occasional video, but like I said, I can't reliably > > reproduce this. I've had it happen about 11 times since August. > > > > lspci -vv output is: > > > > 01:00.0 VGA compatible controller: NVIDIA Corporation G86 [Quadro NVS 290] > > (rev a1) (prog-if 00 [VGA controller]) > > Subsystem: NVIDIA Corporation G86 [Quadro NVS 290] > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- > > Stepping- SERR- FastB2B- DisINTx+ > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- > > SERR- > Latency: 0, Cache Line Size: 64 bytes > > Interrupt: pin A routed to IRQ 28 > > Region 0: Memory at fc00 (32-bit, non-prefetchable) [size=16M] > > Region 1: Memory at d000 (64-bit, prefetchable) [size=256M] > > Region 3: Memory at fa00 (64-bit, non-prefetchable) [size=32M] > > Region 5: I/O ports at dc80 [size=128] > > Expansion ROM at 000c [disabled] [size=128K] > > Capabilities: [60] Power Management version 2 > > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA > > PME(D0-,D1-,D2-,D3hot-,D3cold-) > > Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME- > > Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+ > > Address: fee01004 Data: 4023 > > Capabilities: [78] Express (v1) Endpoint, MSI 00 > > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s > > <512ns, L1 <4us > > ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- > > SlotPowerLimit 25.000W > > DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq- > > RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ > > MaxPayload 128 bytes, MaxReadReq 512 bytes > > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- > > TransPend- > > LnkCap: Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, > > Exit Latency L0s <512ns, L1 <4us > > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp- > > LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+ > > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- > > LnkSta: Speed 2.5GT/s (ok), Width x16 (ok) > > TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt- > > Capabilities: [100 v1] Virtual Channel > > Caps: LPEVC=0 RefClk=100ns PATEntryBits=1 > > Arb:Fixed- WRR32- WRR64- WRR128- > > Ctrl: ArbSelect=Fixed > > Status: InProgress- > > VC0:Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans- > > Arb:Fixed- WRR32- WRR64- WRR128- TWRR128- > > WRR256- > > Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01 > > Status: NegoPending- InProgress- > > Capabilities: [128 v1] Power Budgeting > > Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 > > Len=024 > > Kernel driver in use: nouveau > > > > The last time this happened, this is what got logged: > > > > nouveau :01:00.0: disp: ERROR 5 [INVALID_STATE] 06 [] chid 1 mthd 0080 > > data 0001 > > nouveau :01:00.0: disp: Base 1: > > nouveau :01:00.0: disp:0084: > > nouveau :01:00.0: disp:0088: > > nouveau :01:00.0: disp:008c: > >
nouveau regression post v5.8, still present in v5.10
sp:080c: 7804 nouveau :01:00.0: disp:0810: cf00 nouveau :01:00.0: disp: Base 1 - Image 1: nouveau :01:00.0: disp:0c00: 9500 nouveau :01:00.0: disp:0c04: nouveau :01:00.0: disp:0c08: 04b00780 nouveau :01:00.0: disp:0c0c: 7804 nouveau :01:00.0: disp:0c10: cf00 nouveau :01:00.0: DRM: core notifier timeout nouveau :01:00.0: DRM: base-0: timeout I've got logs of all of this, if they help I can collect them. The timeout message are consistent the error messages a little less so. If there's more debugging I can do when this happens, I'd love to know what it is. kernel config: http://audible.transient.net/~jamie/k/nouveau.config-5.10.0 dmesg at boot: http://audible.transient.net/~jamie/k/nouveau.dmesg -- Jamie Heilman http://audible.transient.net/~jamie/
Re: [PATCH] kvm: x86/mmu: Fix is_tdp_mmu_check when using PAE
Ben Gardon wrote: > When PAE is in use, the root_hpa will not have a shadow page assoicated > with it. In this case the kernel will crash with a NULL pointer > dereference. Add checks to ensure is_tdp_mmu_root works as intended even > when using PAE. This seems to work in my amd64 case as well. (https://marc.info/?l=linux-kernel=160494962201032=2) > Tested: compiles > > Fixes: 02c00b3a2f7e ("kvm: x86/mmu: Allocate and free TDP MMU roots") > Reported-by: Zdenek Kaspar > Signed-off-by: Ben Gardon > --- > arch/x86/kvm/mmu/tdp_mmu.c | 10 ++ > 1 file changed, 10 insertions(+) > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index 27e381c9da6c..13013f4d98ad 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -49,8 +49,18 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa) > { > struct kvm_mmu_page *sp; > > + if (WARN_ON(!VALID_PAGE(hpa))) > + return false; > + > sp = to_shadow_page(hpa); > > + /* > + * If this VM is being run with PAE, the TDP MMU will not be enabled > + * and the root HPA will not have a shadow page associated with it. > + */ > + if (!sp) > + return false; > + > return sp->tdp_mmu_page && sp->root_count; > } > > -- > 2.29.2.222.g5d2a92d10f8-goog > -- Jamie Heilman http://audible.transient.net/~jamie/
null pointer deref in kvm
or,id=monitor,mode=control -rtc base=utc -no-shutdown -boot strict=on \ -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 \ -drive file=/dev/S/ruth,if=none,id=drive-virtio-disk0,format=raw,cache=none \ -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \ -fsdev local,security_model=mapped,id=fsdev-fs0,path=/var/local/downloads \ -device virtio-9p-pci,id=fs0,fsdev=fsdev-fs0,mount_tag=downloads,bus=pci.0,addr=0x3 \ -netdev tap,fd=23,id=hostnet0,vhost=on,vhostfd=24 \ -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:9a:0c:07,bus=pci.0,addr=0x4 \ -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 \ -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 \ -object rng-random,id=rng0,filename=/dev/random \ -device virtio-rng-pci,rng=rng0,max-bytes=1024,period=3000,bus=pci.0,addr=0x2 \ -msg timestamp=on Let me know if you need anything else. -- Jamie Heilman http://audible.transient.net/~jamie/ # # Automatically generated file; DO NOT EDIT. # Linux/x86 5.10.0-rc2 Kernel Configuration # CONFIG_CC_VERSION_TEXT="gcc (Debian 10.2.0-16) 10.2.0" CONFIG_CC_IS_GCC=y CONFIG_GCC_VERSION=100200 CONFIG_LD_VERSION=23501 CONFIG_CLANG_VERSION=0 CONFIG_CC_CAN_LINK=y CONFIG_CC_CAN_LINK_STATIC=y CONFIG_CC_HAS_ASM_GOTO=y CONFIG_CC_HAS_ASM_INLINE=y CONFIG_IRQ_WORK=y CONFIG_BUILDTIME_TABLE_SORT=y CONFIG_THREAD_INFO_IN_TASK=y # # General setup # CONFIG_INIT_ENV_ARG_LIMIT=32 # CONFIG_COMPILE_TEST is not set CONFIG_LOCALVERSION="" CONFIG_LOCALVERSION_AUTO=y CONFIG_BUILD_SALT="" CONFIG_HAVE_KERNEL_GZIP=y CONFIG_HAVE_KERNEL_BZIP2=y CONFIG_HAVE_KERNEL_LZMA=y CONFIG_HAVE_KERNEL_XZ=y CONFIG_HAVE_KERNEL_LZO=y CONFIG_HAVE_KERNEL_LZ4=y CONFIG_HAVE_KERNEL_ZSTD=y # CONFIG_KERNEL_GZIP is not set # CONFIG_KERNEL_BZIP2 is not set # CONFIG_KERNEL_LZMA is not set CONFIG_KERNEL_XZ=y # CONFIG_KERNEL_LZO is not set # CONFIG_KERNEL_LZ4 is not set # CONFIG_KERNEL_ZSTD is not set CONFIG_DEFAULT_INIT="" CONFIG_DEFAULT_HOSTNAME="cucamonga" CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y CONFIG_POSIX_MQUEUE_SYSCTL=y # CONFIG_WATCH_QUEUE is not set CONFIG_CROSS_MEMORY_ATTACH=y # CONFIG_USELIB is not set # CONFIG_AUDIT is not set CONFIG_HAVE_ARCH_AUDITSYSCALL=y # # IRQ subsystem # CONFIG_GENERIC_IRQ_PROBE=y CONFIG_GENERIC_IRQ_SHOW=y CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y CONFIG_GENERIC_PENDING_IRQ=y CONFIG_GENERIC_IRQ_MIGRATION=y CONFIG_HARDIRQS_SW_RESEND=y CONFIG_IRQ_DOMAIN=y CONFIG_IRQ_DOMAIN_HIERARCHY=y CONFIG_GENERIC_MSI_IRQ=y CONFIG_GENERIC_MSI_IRQ_DOMAIN=y CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y CONFIG_GENERIC_IRQ_RESERVATION_MODE=y CONFIG_IRQ_FORCED_THREADING=y CONFIG_SPARSE_IRQ=y # CONFIG_GENERIC_IRQ_DEBUGFS is not set # end of IRQ subsystem CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_ARCH_CLOCKSOURCE_INIT=y CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y CONFIG_GENERIC_TIME_VSYSCALL=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK=y CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y # # Timers subsystem # CONFIG_TICK_ONESHOT=y CONFIG_NO_HZ_COMMON=y # CONFIG_HZ_PERIODIC is not set CONFIG_NO_HZ_IDLE=y # CONFIG_NO_HZ_FULL is not set # CONFIG_NO_HZ is not set CONFIG_HIGH_RES_TIMERS=y # end of Timers subsystem # CONFIG_PREEMPT_NONE is not set # CONFIG_PREEMPT_VOLUNTARY is not set CONFIG_PREEMPT=y CONFIG_PREEMPT_COUNT=y CONFIG_PREEMPTION=y # # CPU/Task time and stats accounting # CONFIG_TICK_CPU_ACCOUNTING=y # CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set # CONFIG_IRQ_TIME_ACCOUNTING is not set CONFIG_BSD_PROCESS_ACCT=y CONFIG_BSD_PROCESS_ACCT_V3=y CONFIG_TASKSTATS=y CONFIG_TASK_DELAY_ACCT=y CONFIG_TASK_XACCT=y CONFIG_TASK_IO_ACCOUNTING=y # CONFIG_PSI is not set # end of CPU/Task time and stats accounting CONFIG_CPU_ISOLATION=y # # RCU Subsystem # CONFIG_TREE_RCU=y CONFIG_PREEMPT_RCU=y # CONFIG_RCU_EXPERT is not set CONFIG_SRCU=y CONFIG_TREE_SRCU=y CONFIG_TASKS_RCU_GENERIC=y CONFIG_TASKS_RCU=y CONFIG_TASKS_TRACE_RCU=y CONFIG_RCU_STALL_COMMON=y CONFIG_RCU_NEED_SEGCBLIST=y # end of RCU Subsystem CONFIG_IKCONFIG=m CONFIG_IKCONFIG_PROC=y # CONFIG_IKHEADERS is not set CONFIG_LOG_BUF_SHIFT=17 CONFIG_LOG_CPU_MAX_BUF_SHIFT=12 CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13 CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y # # Scheduler features # # CONFIG_UCLAMP_TASK is not set # end of Scheduler features CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y CONFIG_CC_HAS_INT128=y CONFIG_ARCH_SUPPORTS_INT128=y CONFIG_CGROUPS=y # CONFIG_MEMCG is not set CONFIG_BLK_CGROUP=y CONFIG_CGROUP_SCHED=y CONFIG_FAIR_GROUP_SCHED=y CONFIG_CFS_BANDWIDTH=y # CONFIG_RT_GROUP_SCHED is not set # CONFIG_CGROUP_PIDS is not set # CONFIG_CGROUP_RDMA is not set # CONFIG_CGROUP_FREEZER is not set # CONFIG_CPUSETS
Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)
Tadeusz Struk wrote: > On 05/03/2016 07:26 PM, Jamie Heilman wrote: > >>> Alrighty, presumably relevant bits: > >>> > > > >>> > > X.509: Cert Issuer: Build time autogenerated kernel key > >>> > > X.509: Cert Subject: Build time autogenerated kernel key > >>> > > X.509: Cert Key Algo: rsa > >>> > > X.509: Cert Valid period: 1461826791-4615426791 > >>> > > X.509: Cert Signature: rsa + sha512 > >>> > > X.509: ==>x509_check_signature() > >>> > > X.509: ==>x509_get_sig_params() > >>> > > X.509: <==x509_get_sig_params() = 0 > >>> > > PKEY: ==>public_key_verify_signature() > >>> > > X.509: Cert Verification: -2 > >> > > >> > Hmmm... Okay, the only ways out of public_key_verify_signature() without > >> > printing a leaving message are for snprintf() to overrun (which would > >> > return > >> > error -22) or for crypto_alloc_akcipher() to have failed; everything > >> > else must > >> > go through the kleave() at the pr_devel() at the bottom of the function. > >> > > >> > Can you stick: > >> > > >> > pr_devel("ALGO: %s\n", alg_name); > >> > > >> > immediately before this line: > >> > > >> > tfm = crypto_alloc_akcipher(alg_name, 0, 0); > >> > > >> > and try it again? > > PKEY: ALGO: pkcs1pad(rsa,sha512) > > I think the problem is that pkcs1pad template needs CRYPTO_MANAGER, but > your configuration doesn't enable CRYPTO_MANAGER. Could you try this > please: > > diff --git a/crypto/Kconfig b/crypto/Kconfig > index 93a1fdc..1d33beb 100644 > --- a/crypto/Kconfig > +++ b/crypto/Kconfig > @@ -96,6 +96,7 @@ config CRYPTO_AKCIPHER > config CRYPTO_RSA > tristate "RSA algorithm" > select CRYPTO_AKCIPHER > + select CRYPTO_MANAGER > select MPILIB > select ASN1 > help Yep, that does indeed make everything work again. -- Jamie Heilman http://audible.transient.net/~jamie/
Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)
Tadeusz Struk wrote: > On 05/03/2016 07:26 PM, Jamie Heilman wrote: > >>> Alrighty, presumably relevant bits: > >>> > > > >>> > > X.509: Cert Issuer: Build time autogenerated kernel key > >>> > > X.509: Cert Subject: Build time autogenerated kernel key > >>> > > X.509: Cert Key Algo: rsa > >>> > > X.509: Cert Valid period: 1461826791-4615426791 > >>> > > X.509: Cert Signature: rsa + sha512 > >>> > > X.509: ==>x509_check_signature() > >>> > > X.509: ==>x509_get_sig_params() > >>> > > X.509: <==x509_get_sig_params() = 0 > >>> > > PKEY: ==>public_key_verify_signature() > >>> > > X.509: Cert Verification: -2 > >> > > >> > Hmmm... Okay, the only ways out of public_key_verify_signature() without > >> > printing a leaving message are for snprintf() to overrun (which would > >> > return > >> > error -22) or for crypto_alloc_akcipher() to have failed; everything > >> > else must > >> > go through the kleave() at the pr_devel() at the bottom of the function. > >> > > >> > Can you stick: > >> > > >> > pr_devel("ALGO: %s\n", alg_name); > >> > > >> > immediately before this line: > >> > > >> > tfm = crypto_alloc_akcipher(alg_name, 0, 0); > >> > > >> > and try it again? > > PKEY: ALGO: pkcs1pad(rsa,sha512) > > I think the problem is that pkcs1pad template needs CRYPTO_MANAGER, but > your configuration doesn't enable CRYPTO_MANAGER. Could you try this > please: > > diff --git a/crypto/Kconfig b/crypto/Kconfig > index 93a1fdc..1d33beb 100644 > --- a/crypto/Kconfig > +++ b/crypto/Kconfig > @@ -96,6 +96,7 @@ config CRYPTO_AKCIPHER > config CRYPTO_RSA > tristate "RSA algorithm" > select CRYPTO_AKCIPHER > + select CRYPTO_MANAGER > select MPILIB > select ASN1 > help Yep, that does indeed make everything work again. -- Jamie Heilman http://audible.transient.net/~jamie/
Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)
David Howells wrote: > (cc'ing Tadeusz as he did the pkcs1 padding function) > > Jamie Heilman <ja...@audible.transient.net> wrote: > > > > > Problem loading in-kernel X.509 certificate (-2) > > > > > > ENOENT? Hmmm... The only place that is generated is in the crypto layer. > > > That suggests missing crypto of some sort. > > > > > > The attached patch enables some debugging in some relevant files if you > > > can > > > try applying it to your kernel. > > > > Alrighty, presumably relevant bits: > > > > X.509: Cert Issuer: Build time autogenerated kernel key > > X.509: Cert Subject: Build time autogenerated kernel key > > X.509: Cert Key Algo: rsa > > X.509: Cert Valid period: 1461826791-4615426791 > > X.509: Cert Signature: rsa + sha512 > > X.509: ==>x509_check_signature() > > X.509: ==>x509_get_sig_params() > > X.509: <==x509_get_sig_params() = 0 > > PKEY: ==>public_key_verify_signature() > > X.509: Cert Verification: -2 > > Hmmm... Okay, the only ways out of public_key_verify_signature() without > printing a leaving message are for snprintf() to overrun (which would return > error -22) or for crypto_alloc_akcipher() to have failed; everything else must > go through the kleave() at the pr_devel() at the bottom of the function. > > Can you stick: > > pr_devel("ALGO: %s\n", alg_name); > > immediately before this line: > > tfm = crypto_alloc_akcipher(alg_name, 0, 0); > > and try it again? PKEY: ALGO: pkcs1pad(rsa,sha512) -- Jamie Heilman http://audible.transient.net/~jamie/
Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)
David Howells wrote: > (cc'ing Tadeusz as he did the pkcs1 padding function) > > Jamie Heilman wrote: > > > > > Problem loading in-kernel X.509 certificate (-2) > > > > > > ENOENT? Hmmm... The only place that is generated is in the crypto layer. > > > That suggests missing crypto of some sort. > > > > > > The attached patch enables some debugging in some relevant files if you > > > can > > > try applying it to your kernel. > > > > Alrighty, presumably relevant bits: > > > > X.509: Cert Issuer: Build time autogenerated kernel key > > X.509: Cert Subject: Build time autogenerated kernel key > > X.509: Cert Key Algo: rsa > > X.509: Cert Valid period: 1461826791-4615426791 > > X.509: Cert Signature: rsa + sha512 > > X.509: ==>x509_check_signature() > > X.509: ==>x509_get_sig_params() > > X.509: <==x509_get_sig_params() = 0 > > PKEY: ==>public_key_verify_signature() > > X.509: Cert Verification: -2 > > Hmmm... Okay, the only ways out of public_key_verify_signature() without > printing a leaving message are for snprintf() to overrun (which would return > error -22) or for crypto_alloc_akcipher() to have failed; everything else must > go through the kleave() at the pr_devel() at the bottom of the function. > > Can you stick: > > pr_devel("ALGO: %s\n", alg_name); > > immediately before this line: > > tfm = crypto_alloc_akcipher(alg_name, 0, 0); > > and try it again? PKEY: ALGO: pkcs1pad(rsa,sha512) -- Jamie Heilman http://audible.transient.net/~jamie/
Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)
Tadeusz Struk wrote: > Could you please check if this one fixes the problem for you: > https://patchwork.kernel.org/patch/8766361/ That patch is already present by 4.6-rc5 and the problem still exists, so no, it doesn't. -- Jamie Heilman http://audible.transient.net/~jamie/
Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)
Tadeusz Struk wrote: > Could you please check if this one fixes the problem for you: > https://patchwork.kernel.org/patch/8766361/ That patch is already present by 4.6-rc5 and the problem still exists, so no, it doesn't. -- Jamie Heilman http://audible.transient.net/~jamie/
Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)
Herbert Xu wrote: > On Sat, Apr 30, 2016 at 08:32:48AM +0000, Jamie Heilman wrote: > > I usually build my kernels to require module signatures and use > > automatic signing. As of v4.6-rc1 I'm getting this on boot: > > > > Problem loading in-kernel X.509 certificate (-2) > > > > I bisected that to commit d43de6c780a84def056afaf4fb3e66bdaa1efc00 > > (akcipher: Move the RSA DER encoding check to the crypto layer) > > > > For some reason after this commit my system keyring always ends up > > empty. I use the deb-pkg make target. My kernel config can > > be found at > > http://audible.transient.net/~jamie/k/modsign.config-4.6.0-rc5-guest > > > > Let me know if you need anything else. > > David, any ideas on this problem? If we can't get a quick fix > on this then we'll have to revert. > > Jamie, does reverting this patch by itself resolve the problem? Haven't tried that, given the nature of the change I just assumed it would break too much, but I'll give it a shot this evening after work. -- Jamie Heilman http://audible.transient.net/~jamie/
Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)
Herbert Xu wrote: > On Sat, Apr 30, 2016 at 08:32:48AM +0000, Jamie Heilman wrote: > > I usually build my kernels to require module signatures and use > > automatic signing. As of v4.6-rc1 I'm getting this on boot: > > > > Problem loading in-kernel X.509 certificate (-2) > > > > I bisected that to commit d43de6c780a84def056afaf4fb3e66bdaa1efc00 > > (akcipher: Move the RSA DER encoding check to the crypto layer) > > > > For some reason after this commit my system keyring always ends up > > empty. I use the deb-pkg make target. My kernel config can > > be found at > > http://audible.transient.net/~jamie/k/modsign.config-4.6.0-rc5-guest > > > > Let me know if you need anything else. > > David, any ideas on this problem? If we can't get a quick fix > on this then we'll have to revert. > > Jamie, does reverting this patch by itself resolve the problem? Haven't tried that, given the nature of the change I just assumed it would break too much, but I'll give it a shot this evening after work. -- Jamie Heilman http://audible.transient.net/~jamie/
Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)
David Howells wrote: > > Problem loading in-kernel X.509 certificate (-2) > > ENOENT? Hmmm... The only place that is generated is in the crypto layer. > That suggests missing crypto of some sort. > > The attached patch enables some debugging in some relevant files if you can > try applying it to your kernel. Alrighty, presumably relevant bits: X.509: Cert Issuer: Build time autogenerated kernel key X.509: Cert Subject: Build time autogenerated kernel key X.509: Cert Key Algo: rsa X.509: Cert Valid period: 1461826791-4615426791 X.509: Cert Signature: rsa + sha512 X.509: ==>x509_check_signature() X.509: ==>x509_get_sig_params() X.509: <==x509_get_sig_params() = 0 PKEY: ==>public_key_verify_signature() X.509: Cert Verification: -2 Problem loading in-kernel X.509 certificate (-2) ... PKCS7: ==> pkcs7_verify() PKCS7: ==> pkcs7_verify_one(,1) PKCS7: ==> pkcs7_digest(,1,sha512) PKCS7: MsgDigest = [0b f2 1f 7e f0 37 12 e6] PKCS7: <== pkcs7_digest() = 0 PKCS7: ==> pkcs7_find_key(1) PKCS7: Sig 1: Issuing X.509 cert not found (#008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579) PKCS7: <== pkcs7_verify() = 0 X.509: Look up: "ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579" X.509: Request for key 'ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579' err -11 PKCS7: ==> pkcs7_verify() PKCS7: ==> pkcs7_verify_one(,1) PKCS7: ==> pkcs7_digest(,1,sha512) PKCS7: MsgDigest = [5b b5 bb 52 28 05 ba 55] PKCS7: <== pkcs7_digest() = 0 PKCS7: ==> pkcs7_find_key(1) PKCS7: Sig 1: Issuing X.509 cert not found (#008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579) PKCS7: <== pkcs7_verify() = 0 X.509: Look up: "ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579" X.509: Request for key 'ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579' err -11 PKCS7: ==> pkcs7_verify() PKCS7: ==> pkcs7_verify_one(,1) PKCS7: ==> pkcs7_digest(,1,sha512) PKCS7: MsgDigest = [94 a4 59 31 7f a9 d0 3a] PKCS7: <== pkcs7_digest() = 0 PKCS7: ==> pkcs7_find_key(1) PKCS7: Sig 1: Issuing X.509 cert not found (#008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579) PKCS7: <== pkcs7_verify() = 0 X.509: Look up: "ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579" X.509: Request for key 'ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579' err -11 full dmesg at http://audible.transient.net/~jamie/k/modsign.dmesg-debugging -- Jamie Heilman http://audible.transient.net/~jamie/
Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)
David Howells wrote: > > Problem loading in-kernel X.509 certificate (-2) > > ENOENT? Hmmm... The only place that is generated is in the crypto layer. > That suggests missing crypto of some sort. > > The attached patch enables some debugging in some relevant files if you can > try applying it to your kernel. Alrighty, presumably relevant bits: X.509: Cert Issuer: Build time autogenerated kernel key X.509: Cert Subject: Build time autogenerated kernel key X.509: Cert Key Algo: rsa X.509: Cert Valid period: 1461826791-4615426791 X.509: Cert Signature: rsa + sha512 X.509: ==>x509_check_signature() X.509: ==>x509_get_sig_params() X.509: <==x509_get_sig_params() = 0 PKEY: ==>public_key_verify_signature() X.509: Cert Verification: -2 Problem loading in-kernel X.509 certificate (-2) ... PKCS7: ==> pkcs7_verify() PKCS7: ==> pkcs7_verify_one(,1) PKCS7: ==> pkcs7_digest(,1,sha512) PKCS7: MsgDigest = [0b f2 1f 7e f0 37 12 e6] PKCS7: <== pkcs7_digest() = 0 PKCS7: ==> pkcs7_find_key(1) PKCS7: Sig 1: Issuing X.509 cert not found (#008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579) PKCS7: <== pkcs7_verify() = 0 X.509: Look up: "ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579" X.509: Request for key 'ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579' err -11 PKCS7: ==> pkcs7_verify() PKCS7: ==> pkcs7_verify_one(,1) PKCS7: ==> pkcs7_digest(,1,sha512) PKCS7: MsgDigest = [5b b5 bb 52 28 05 ba 55] PKCS7: <== pkcs7_digest() = 0 PKCS7: ==> pkcs7_find_key(1) PKCS7: Sig 1: Issuing X.509 cert not found (#008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579) PKCS7: <== pkcs7_verify() = 0 X.509: Look up: "ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579" X.509: Request for key 'ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579' err -11 PKCS7: ==> pkcs7_verify() PKCS7: ==> pkcs7_verify_one(,1) PKCS7: ==> pkcs7_digest(,1,sha512) PKCS7: MsgDigest = [94 a4 59 31 7f a9 d0 3a] PKCS7: <== pkcs7_digest() = 0 PKCS7: ==> pkcs7_find_key(1) PKCS7: Sig 1: Issuing X.509 cert not found (#008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579) PKCS7: <== pkcs7_verify() = 0 X.509: Look up: "ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579" X.509: Request for key 'ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579' err -11 full dmesg at http://audible.transient.net/~jamie/k/modsign.dmesg-debugging -- Jamie Heilman http://audible.transient.net/~jamie/
v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)
I usually build my kernels to require module signatures and use automatic signing. As of v4.6-rc1 I'm getting this on boot: Problem loading in-kernel X.509 certificate (-2) I bisected that to commit d43de6c780a84def056afaf4fb3e66bdaa1efc00 (akcipher: Move the RSA DER encoding check to the crypto layer) For some reason after this commit my system keyring always ends up empty. I use the deb-pkg make target. My kernel config can be found at http://audible.transient.net/~jamie/k/modsign.config-4.6.0-rc5-guest Let me know if you need anything else. -- Jamie Heilman http://audible.transient.net/~jamie/
v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)
I usually build my kernels to require module signatures and use automatic signing. As of v4.6-rc1 I'm getting this on boot: Problem loading in-kernel X.509 certificate (-2) I bisected that to commit d43de6c780a84def056afaf4fb3e66bdaa1efc00 (akcipher: Move the RSA DER encoding check to the crypto layer) For some reason after this commit my system keyring always ends up empty. I use the deb-pkg make target. My kernel config can be found at http://audible.transient.net/~jamie/k/modsign.config-4.6.0-rc5-guest Let me know if you need anything else. -- Jamie Heilman http://audible.transient.net/~jamie/
Re: regression bisected; KVM: entry failed, hardware error 0x80000021
Chen, Tiejun wrote: > On 2014/12/24 19:02, Jamie Heilman wrote: > >Chen, Tiejun wrote: > >>On 2014/12/23 15:26, Jamie Heilman wrote: > >>>Chen, Tiejun wrote: > >>>>On 2014/12/23 9:50, Chen, Tiejun wrote: > >>>>>On 2014/12/22 17:23, Jamie Heilman wrote: > >>>>>>KVM internal error. Suberror: 1 > >>>>>>emulation failure > >>>>>>EAX=000de494 EBX= ECX= EDX=0cfd > >>>>>>ESI=0059 EDI= EBP= ESP=6fb4 > >>>>>>EIP=000f15c1 EFL=00010016 [AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0 > >>>>>>ES =0010 00c09300 DPL=0 DS [-WA] > >>>>>>CS =0008 00c09b00 DPL=0 CS32 [-RA] > >>>>>>SS =0010 00c09300 DPL=0 DS [-WA] > >>>>>>DS =0010 00c09300 DPL=0 DS [-WA] > >>>>>>FS =0010 00c09300 DPL=0 DS [-WA] > >>>>>>GS =0010 00c09300 DPL=0 DS [-WA] > >>>>>>LDT= 8200 DPL=0 LDT > >>>>>>TR = 8b00 DPL=0 TSS32-busy > >>>>>>GDT= 000f6be8 0037 > >>>>>>IDT= 000f6c26 > >>>>>>CR0=6011 CR2= CR3= CR4= > >>>>>>DR0= DR1= DR2= > >>>>>>DR3= > >>>>>>DR6=0ff0 DR7=0400 > >>>>>>EFER= > >>>>>>Code=e8 ae fc ff ff 89 f2 a8 10 89 d8 75 0a b9 41 15 ff ff ff d1 <5b> > >>>>>>5e c3 5b 5e e9 76 ff ff ff b0 11 e6 20 e6 a0 b0 08 e6 21 b0 70 e6 a1 > >>>>>>b0 04 e6 21 b0 02 > >>>>>> > >>>>>>FWIW, I get the same thing with 34a1cd60d17 reverted. Maybe there are > >>>>>>two bugs, maybe there's more to this first one. I can repro this > >>>>> > >>>>>So if my understanding is correct, this is probably another bug. And > >>>>>especially, I already saw the same log in another thread, "Cleaning up > >>>>>the KVM clock". Maybe you can continue to `git bisect` to locate that > >>>>>bad commit. > >>>>> > >>>> > >>>>Looks just now Andy found that commit, > >>>>0e60b0799fedc495a5c57dbd669de3c10d72edd2 "kvm: change memslot sorting rule > >>>>from size to GFN", maybe you can try to revert this to try yours again. > >>> > >>>That doesn't revert cleanly for me, and I don't have much time to > >>>fiddle with it until the 24th---so checked out the commit before it > >>>(d4ae84a0), applied your patch, built, and yes, everything works fine > >>>at that point. I'll probably have time for another full bisection > >>>later, assuming things aren't ironed out already by then. > > > >3.18.0-rc3-00120-gd4ae84a0 + vmx reorder msr writes patch = OK > >3.18.0-rc3-00121-g0e60b07 + vmx reorder msr writes patch = emulation failure > > > >So that certainly points to 0e60b0799fedc495a5c57dbd669de3c10d72edd2 > >as well. > > > >>Could you try this to fix your last error? > > > >Running qemu-system-x86_64 -machine pc,accel=kvm -nodefaults works, > >my real (headless) kvm guests work, but this new patch makes running > >"qemu-system-x86_64 -machine pc,accel=kvm" fail again, this time with > > Are you sure? From my test based on 3.19-rc1 that it owns top commit, > > aa39477b5692611b91ac9455ae588738852b3f60 > > just plus my previous patch, "kvm: x86: vmx: reorder some msr writing" > > I already can execute such a command successfully, > > qemu-system-x86_64 -machine pc,accel=kvm -m 2048 -smp 2 -hda ubuntu.img > > And your log below seems not to relate mem_slot issue we're discussing, I > guess you need to update qemu as well. Yes, I'm sure. > But I also found my new patch just work out Andy's next case, its really > bringing a new issue in !next case. So I tried to refine that patch again as > follows, This latest patch (again, after fixing all the whitespace so it actually applies), does the trick. Both "qemu-system-x86_64 -machine pc,accel=kvm" and "qemu-system-x86_64 -machine pc,accel=kvm -nodefaults" work for me now without any of the aforementioned warnings from the host. > Signed-off-by: Tiejun Chen >
Re: regression bisected; KVM: entry failed, hardware error 0x80000021
Chen, Tiejun wrote: On 2014/12/24 19:02, Jamie Heilman wrote: Chen, Tiejun wrote: On 2014/12/23 15:26, Jamie Heilman wrote: Chen, Tiejun wrote: On 2014/12/23 9:50, Chen, Tiejun wrote: On 2014/12/22 17:23, Jamie Heilman wrote: KVM internal error. Suberror: 1 emulation failure EAX=000de494 EBX= ECX= EDX=0cfd ESI=0059 EDI= EBP= ESP=6fb4 EIP=000f15c1 EFL=00010016 [AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES =0010 00c09300 DPL=0 DS [-WA] CS =0008 00c09b00 DPL=0 CS32 [-RA] SS =0010 00c09300 DPL=0 DS [-WA] DS =0010 00c09300 DPL=0 DS [-WA] FS =0010 00c09300 DPL=0 DS [-WA] GS =0010 00c09300 DPL=0 DS [-WA] LDT= 8200 DPL=0 LDT TR = 8b00 DPL=0 TSS32-busy GDT= 000f6be8 0037 IDT= 000f6c26 CR0=6011 CR2= CR3= CR4= DR0= DR1= DR2= DR3= DR6=0ff0 DR7=0400 EFER= Code=e8 ae fc ff ff 89 f2 a8 10 89 d8 75 0a b9 41 15 ff ff ff d1 5b 5e c3 5b 5e e9 76 ff ff ff b0 11 e6 20 e6 a0 b0 08 e6 21 b0 70 e6 a1 b0 04 e6 21 b0 02 FWIW, I get the same thing with 34a1cd60d17 reverted. Maybe there are two bugs, maybe there's more to this first one. I can repro this So if my understanding is correct, this is probably another bug. And especially, I already saw the same log in another thread, Cleaning up the KVM clock. Maybe you can continue to `git bisect` to locate that bad commit. Looks just now Andy found that commit, 0e60b0799fedc495a5c57dbd669de3c10d72edd2 kvm: change memslot sorting rule from size to GFN, maybe you can try to revert this to try yours again. That doesn't revert cleanly for me, and I don't have much time to fiddle with it until the 24th---so checked out the commit before it (d4ae84a0), applied your patch, built, and yes, everything works fine at that point. I'll probably have time for another full bisection later, assuming things aren't ironed out already by then. 3.18.0-rc3-00120-gd4ae84a0 + vmx reorder msr writes patch = OK 3.18.0-rc3-00121-g0e60b07 + vmx reorder msr writes patch = emulation failure So that certainly points to 0e60b0799fedc495a5c57dbd669de3c10d72edd2 as well. Could you try this to fix your last error? Running qemu-system-x86_64 -machine pc,accel=kvm -nodefaults works, my real (headless) kvm guests work, but this new patch makes running qemu-system-x86_64 -machine pc,accel=kvm fail again, this time with Are you sure? From my test based on 3.19-rc1 that it owns top commit, aa39477b5692611b91ac9455ae588738852b3f60 just plus my previous patch, kvm: x86: vmx: reorder some msr writing I already can execute such a command successfully, qemu-system-x86_64 -machine pc,accel=kvm -m 2048 -smp 2 -hda ubuntu.img And your log below seems not to relate mem_slot issue we're discussing, I guess you need to update qemu as well. Yes, I'm sure. But I also found my new patch just work out Andy's next case, its really bringing a new issue in !next case. So I tried to refine that patch again as follows, This latest patch (again, after fixing all the whitespace so it actually applies), does the trick. Both qemu-system-x86_64 -machine pc,accel=kvm and qemu-system-x86_64 -machine pc,accel=kvm -nodefaults work for me now without any of the aforementioned warnings from the host. Signed-off-by: Tiejun Chen tiejun.c...@intel.com --- virt/kvm/kvm_main.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index f528343..910bc48 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -672,6 +672,7 @@ static void update_memslots(struct kvm_memslots *slots, WARN_ON(mslots[i].id != id); if (!new-npages) { new-base_gfn = 0; + new-flags = 0; if (mslots[i].npages) slots-used_slots--; } else { @@ -688,7 +689,9 @@ static void update_memslots(struct kvm_memslots *slots, i++; } while (i 0 - new-base_gfn mslots[i - 1].base_gfn) { + ((new-base_gfn mslots[i - 1].base_gfn) || +(!new-base_gfn + !mslots[i - 1].base_gfn !mslots[i - 1].npages))) { mslots[i] = mslots[i - 1]; slots-id_to_index[mslots[i].id] = i; i--; Tiejun errors in the host to the tune of: [ cut here ] WARNING: CPU: 1 PID: 3901 at arch/x86/kvm/x86.c:6575 kvm_arch_vcpu_ioctl_run+0xd63/0xe5b [kvm]() Modules linked in: nfsv4 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative autofs4 fan nfsd auth_rpcgss nfs lockd grace fscache sunrpc
Re: regression bisected; KVM: entry failed, hardware error 0x80000021
Chen, Tiejun wrote: > On 2014/12/23 15:26, Jamie Heilman wrote: > >Chen, Tiejun wrote: > >>On 2014/12/23 9:50, Chen, Tiejun wrote: > >>>On 2014/12/22 17:23, Jamie Heilman wrote: > >>>>KVM internal error. Suberror: 1 > >>>>emulation failure > >>>>EAX=000de494 EBX= ECX= EDX=0cfd > >>>>ESI=0059 EDI= EBP= ESP=6fb4 > >>>>EIP=000f15c1 EFL=00010016 [AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0 > >>>>ES =0010 00c09300 DPL=0 DS [-WA] > >>>>CS =0008 00c09b00 DPL=0 CS32 [-RA] > >>>>SS =0010 00c09300 DPL=0 DS [-WA] > >>>>DS =0010 00c09300 DPL=0 DS [-WA] > >>>>FS =0010 00c09300 DPL=0 DS [-WA] > >>>>GS =0010 00c09300 DPL=0 DS [-WA] > >>>>LDT= 8200 DPL=0 LDT > >>>>TR = 8b00 DPL=0 TSS32-busy > >>>>GDT= 000f6be8 0037 > >>>>IDT= 000f6c26 > >>>>CR0=6011 CR2= CR3= CR4= > >>>>DR0= DR1= DR2= > >>>>DR3= > >>>>DR6=0ff0 DR7=0400 > >>>>EFER= > >>>>Code=e8 ae fc ff ff 89 f2 a8 10 89 d8 75 0a b9 41 15 ff ff ff d1 <5b> > >>>>5e c3 5b 5e e9 76 ff ff ff b0 11 e6 20 e6 a0 b0 08 e6 21 b0 70 e6 a1 > >>>>b0 04 e6 21 b0 02 > >>>> > >>>>FWIW, I get the same thing with 34a1cd60d17 reverted. Maybe there are > >>>>two bugs, maybe there's more to this first one. I can repro this > >>> > >>>So if my understanding is correct, this is probably another bug. And > >>>especially, I already saw the same log in another thread, "Cleaning up > >>>the KVM clock". Maybe you can continue to `git bisect` to locate that > >>>bad commit. > >>> > >> > >>Looks just now Andy found that commit, > >>0e60b0799fedc495a5c57dbd669de3c10d72edd2 "kvm: change memslot sorting rule > >>from size to GFN", maybe you can try to revert this to try yours again. > > > >That doesn't revert cleanly for me, and I don't have much time to > >fiddle with it until the 24th---so checked out the commit before it > >(d4ae84a0), applied your patch, built, and yes, everything works fine > >at that point. I'll probably have time for another full bisection > >later, assuming things aren't ironed out already by then. 3.18.0-rc3-00120-gd4ae84a0 + vmx reorder msr writes patch = OK 3.18.0-rc3-00121-g0e60b07 + vmx reorder msr writes patch = emulation failure So that certainly points to 0e60b0799fedc495a5c57dbd669de3c10d72edd2 as well. > Could you try this to fix your last error? Running qemu-system-x86_64 -machine pc,accel=kvm -nodefaults works, my real (headless) kvm guests work, but this new patch makes running "qemu-system-x86_64 -machine pc,accel=kvm" fail again, this time with errors in the host to the tune of: [ cut here ] WARNING: CPU: 1 PID: 3901 at arch/x86/kvm/x86.c:6575 kvm_arch_vcpu_ioctl_run+0xd63/0xe5b [kvm]() Modules linked in: nfsv4 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative autofs4 fan nfsd auth_rpcgss nfs lockd grace fscache sunrpc bridge stp llc vhost_net tun vhost macvtap macvlan fuse cbc dm_crypt usb_storage snd_hda_codec_analog snd_hda_codec_generic kvm_intel kvm tg3 ptp pps_core sr_mod snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer snd sg dcdbas cdrom psmouse soundcore floppy evdev xfs dm_mod raid1 md_mod CPU: 1 PID: 3901 Comm: qemu-system-x86 Not tainted 3.19.0-rc1-00011-g53262d1-dirty #1 Hardware name: Dell Inc. Precision WorkStation T3400 /0TP412, BIOS A14 04/30/2012 7e052328 8800c25ffcf8 813defbe 8800c25ffd38 8103b517 8800c25ffd28 a019bdec 8800caf1d000 8800c2774800 Call Trace: [] dump_stack+0x4c/0x6e [] warn_slowpath_common+0x97/0xb1 [] ? kvm_arch_vcpu_ioctl_run+0xd63/0xe5b [kvm] [] warn_slowpath_null+0x15/0x17 [] kvm_arch_vcpu_ioctl_run+0xd63/0xe5b [kvm] [] ? vmcs_load+0x20/0x62 [kvm_intel] [] ? vmx_vcpu_load+0x140/0x16a [kvm_intel] [] ? kvm_arch_vcpu_load+0x15c/0x161 [kvm] [] kvm_vcpu_ioctl+0x189/0x4bd [kvm] [] ? do_sigtimedwait+0x12f/0x189 [] do_vfs_ioctl+0x370/0x436 [] ? __fget+0x67/0x72 [] SyS_ioctl+0x3f/0x5e [] system_call_fastpath+0x12/0x17 ---[ end trace 46abac932fb3b4a1 ]--- -
Re: regression bisected; KVM: entry failed, hardware error 0x80000021
Chen, Tiejun wrote: On 2014/12/23 15:26, Jamie Heilman wrote: Chen, Tiejun wrote: On 2014/12/23 9:50, Chen, Tiejun wrote: On 2014/12/22 17:23, Jamie Heilman wrote: KVM internal error. Suberror: 1 emulation failure EAX=000de494 EBX= ECX= EDX=0cfd ESI=0059 EDI= EBP= ESP=6fb4 EIP=000f15c1 EFL=00010016 [AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES =0010 00c09300 DPL=0 DS [-WA] CS =0008 00c09b00 DPL=0 CS32 [-RA] SS =0010 00c09300 DPL=0 DS [-WA] DS =0010 00c09300 DPL=0 DS [-WA] FS =0010 00c09300 DPL=0 DS [-WA] GS =0010 00c09300 DPL=0 DS [-WA] LDT= 8200 DPL=0 LDT TR = 8b00 DPL=0 TSS32-busy GDT= 000f6be8 0037 IDT= 000f6c26 CR0=6011 CR2= CR3= CR4= DR0= DR1= DR2= DR3= DR6=0ff0 DR7=0400 EFER= Code=e8 ae fc ff ff 89 f2 a8 10 89 d8 75 0a b9 41 15 ff ff ff d1 5b 5e c3 5b 5e e9 76 ff ff ff b0 11 e6 20 e6 a0 b0 08 e6 21 b0 70 e6 a1 b0 04 e6 21 b0 02 FWIW, I get the same thing with 34a1cd60d17 reverted. Maybe there are two bugs, maybe there's more to this first one. I can repro this So if my understanding is correct, this is probably another bug. And especially, I already saw the same log in another thread, Cleaning up the KVM clock. Maybe you can continue to `git bisect` to locate that bad commit. Looks just now Andy found that commit, 0e60b0799fedc495a5c57dbd669de3c10d72edd2 kvm: change memslot sorting rule from size to GFN, maybe you can try to revert this to try yours again. That doesn't revert cleanly for me, and I don't have much time to fiddle with it until the 24th---so checked out the commit before it (d4ae84a0), applied your patch, built, and yes, everything works fine at that point. I'll probably have time for another full bisection later, assuming things aren't ironed out already by then. 3.18.0-rc3-00120-gd4ae84a0 + vmx reorder msr writes patch = OK 3.18.0-rc3-00121-g0e60b07 + vmx reorder msr writes patch = emulation failure So that certainly points to 0e60b0799fedc495a5c57dbd669de3c10d72edd2 as well. Could you try this to fix your last error? Running qemu-system-x86_64 -machine pc,accel=kvm -nodefaults works, my real (headless) kvm guests work, but this new patch makes running qemu-system-x86_64 -machine pc,accel=kvm fail again, this time with errors in the host to the tune of: [ cut here ] WARNING: CPU: 1 PID: 3901 at arch/x86/kvm/x86.c:6575 kvm_arch_vcpu_ioctl_run+0xd63/0xe5b [kvm]() Modules linked in: nfsv4 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative autofs4 fan nfsd auth_rpcgss nfs lockd grace fscache sunrpc bridge stp llc vhost_net tun vhost macvtap macvlan fuse cbc dm_crypt usb_storage snd_hda_codec_analog snd_hda_codec_generic kvm_intel kvm tg3 ptp pps_core sr_mod snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer snd sg dcdbas cdrom psmouse soundcore floppy evdev xfs dm_mod raid1 md_mod CPU: 1 PID: 3901 Comm: qemu-system-x86 Not tainted 3.19.0-rc1-00011-g53262d1-dirty #1 Hardware name: Dell Inc. Precision WorkStation T3400 /0TP412, BIOS A14 04/30/2012 7e052328 8800c25ffcf8 813defbe 8800c25ffd38 8103b517 8800c25ffd28 a019bdec 8800caf1d000 8800c2774800 Call Trace: [813defbe] dump_stack+0x4c/0x6e [8103b517] warn_slowpath_common+0x97/0xb1 [a019bdec] ? kvm_arch_vcpu_ioctl_run+0xd63/0xe5b [kvm] [8103b60b] warn_slowpath_null+0x15/0x17 [a019bdec] kvm_arch_vcpu_ioctl_run+0xd63/0xe5b [kvm] [a02308b9] ? vmcs_load+0x20/0x62 [kvm_intel] [a0231e03] ? vmx_vcpu_load+0x140/0x16a [kvm_intel] [a0196ba3] ? kvm_arch_vcpu_load+0x15c/0x161 [kvm] [a018d8b1] kvm_vcpu_ioctl+0x189/0x4bd [kvm] [8104647a] ? do_sigtimedwait+0x12f/0x189 [810ea316] do_vfs_ioctl+0x370/0x436 [810f24f2] ? __fget+0x67/0x72 [810ea41b] SyS_ioctl+0x3f/0x5e [813e34d2] system_call_fastpath+0x12/0x17 ---[ end trace 46abac932fb3b4a1 ]--- [ cut here ] WARNING: CPU: 1 PID: 3901 at arch/x86/kvm/x86.c:6575 kvm_arch_vcpu_ioctl_run+0xd63/0xe5b [kvm]() Modules linked in: nfsv4 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand cpufreq_conservative autofs4 fan nfsd auth_rpcgss nfs lockd grace fscache sunrpc bridge stp llc vhost_net tun vhost macvtap macvlan fuse cbc dm_crypt usb_storage snd_hda_codec_analog snd_hda_codec_generic kvm_intel kvm tg3 ptp pps_core sr_mod snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_timer snd sg dcdbas cdrom psmouse soundcore
Re: regression bisected; KVM: entry failed, hardware error 0x80000021
Chen, Tiejun wrote: > On 2014/12/23 9:50, Chen, Tiejun wrote: > >On 2014/12/22 17:23, Jamie Heilman wrote: > >>Chen, Tiejun wrote: > >>>On 2014/12/21 20:46, Jamie Heilman wrote: > >>>>With v3.19-rc1 when I run qemu-system-x86_64 -machine pc,accel=kvm I > >>>>get: > >>>> > >>>>KVM: entry failed, hardware error 0x8021 > >>> > >>>Looks some MSR writing issues such a failed entry. > >>> > >>>>If you're running a guest on an Intel machine without unrestricted mode > >>>>support, the failure can be most likely due to the guest entering an > >>>>invalid > >>>>state for Intel VT. For example, the guest maybe running in big real > >>>>mode > >>>>which is not supported on less recent Intel processors. > >>>> > >>>>EAX= EBX= ECX= EDX=0663 > >>>>ESI= EDI= EBP= ESP= > >>>>EIP=e05b EFL=00010002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0 > >>>>ES = 9300 > >>>>CS =f000 000f 9b00 > >>>>SS = 9300 > >>>>DS = 9300 > >>>>FS = 9300 > >>>>GS = 9300 > >>>>LDT= 8200 > >>>>TR = 8b00 > >>>>GDT= > >>>>IDT= > >>>>CR0=6010 CR2= CR3= CR4= > >>>>DR0= DR1= DR2= > >>>>DR3= > >>>>DR6=0ff0 DR7=0400 > >>>>EFER= > >>> > >>>And I don't see any obvious wrong as well. Any valuable info from dmesg? > >> > >>With the simple qemu command above, on 3.18.1 I see: > >> > >>kern.info: kvm: zapping shadow pages for mmio generation wraparound > >> > >>when I fire up a full guest that's actually useful I get: > >> > >>kern.info: kvm: zapping shadow pages for mmio generation wraparound > >>kern.err: kvm [4073]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x > >> > >>On 3.18.0-rc3-00042-g34a1cd6 nothing appears in the dmesg, just the > >>message I mention above to stderr. Same thing with a stock > >>3.19.0-rc1. Once I apply your patch the simple test command produces > >>the same zapping shadow pages messages as 3.18.1, and a test guest of > >>a Debian Jessie image (w/stock distro kernel) produces the same thing > >>with disabled perfctr wrmsr message. However, it doesn't look like > > > >Sorry I'm not sure if I understood current status. Looks 3.19-rc1 & my > >patch just fix that error above, > > > >KVM: entry failed, hardware error 0x8021 > >... > > > >Right? > > > >>I'm entirely out of the woods, because one of my other guest VMs with a > >>custom kernel that works great under 3.18.1 now fails to run. Nothing > >>in dmesg, but here's the stderr: > > > >But even you revert 34a1cd60d17 or just apply my patch, something else > >introduced between 3.18.1 and 3.19-rc1 led this error below, right? > > > >> > >>KVM internal error. Suberror: 1 > >>emulation failure > >>EAX=000de494 EBX= ECX= EDX=0cfd > >>ESI=0059 EDI= EBP= ESP=6fb4 > >>EIP=000f15c1 EFL=00010016 [AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0 > >>ES =0010 00c09300 DPL=0 DS [-WA] > >>CS =0008 00c09b00 DPL=0 CS32 [-RA] > >>SS =0010 00c09300 DPL=0 DS [-WA] > >>DS =0010 00c09300 DPL=0 DS [-WA] > >>FS =0010 00c09300 DPL=0 DS [-WA] > >>GS =0010 00c09300 DPL=0 DS [-WA] > >>LDT= 8200 DPL=0 LDT > >>TR = 8b00 DPL=0 TSS32-busy > >>GDT= 000f6be8 0037 > >>IDT= 000f6c26 > >>CR0=6011 CR2= CR3= CR4= > >>DR0= DR1= DR2= > >>DR3= > >>DR6=0ff0 DR7=0400 > >>EFER= > >>Code=e8 ae fc ff ff 89 f2 a8 10 89 d8 75 0a b9 41 15 ff ff ff d1 <5b> > >>5e c3 5
Re: regression bisected; KVM: entry failed, hardware error 0x80000021
Chen, Tiejun wrote: > On 2014/12/21 20:46, Jamie Heilman wrote: > >With v3.19-rc1 when I run qemu-system-x86_64 -machine pc,accel=kvm I > >get: > > > >KVM: entry failed, hardware error 0x8021 > > Looks some MSR writing issues such a failed entry. > > >If you're running a guest on an Intel machine without unrestricted mode > >support, the failure can be most likely due to the guest entering an invalid > >state for Intel VT. For example, the guest maybe running in big real mode > >which is not supported on less recent Intel processors. > > > >EAX= EBX= ECX= EDX=0663 > >ESI= EDI= EBP= ESP= > >EIP=e05b EFL=00010002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0 > >ES = 9300 > >CS =f000 000f 9b00 > >SS = 9300 > >DS = 9300 > >FS = 9300 > >GS = 9300 > >LDT= 8200 > >TR = 8b00 > >GDT= > >IDT= > >CR0=6010 CR2= CR3= CR4= > >DR0= DR1= DR2= > >DR3= > >DR6=0ff0 DR7=0400 > >EFER= > > And I don't see any obvious wrong as well. Any valuable info from dmesg? With the simple qemu command above, on 3.18.1 I see: kern.info: kvm: zapping shadow pages for mmio generation wraparound when I fire up a full guest that's actually useful I get: kern.info: kvm: zapping shadow pages for mmio generation wraparound kern.err: kvm [4073]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x On 3.18.0-rc3-00042-g34a1cd6 nothing appears in the dmesg, just the message I mention above to stderr. Same thing with a stock 3.19.0-rc1. Once I apply your patch the simple test command produces the same zapping shadow pages messages as 3.18.1, and a test guest of a Debian Jessie image (w/stock distro kernel) produces the same thing with disabled perfctr wrmsr message. However, it doesn't look like I'm entirely out of the woods, because one of my other guest VMs with a custom kernel that works great under 3.18.1 now fails to run. Nothing in dmesg, but here's the stderr: KVM internal error. Suberror: 1 emulation failure EAX=000de494 EBX= ECX= EDX=0cfd ESI=0059 EDI= EBP= ESP=6fb4 EIP=000f15c1 EFL=00010016 [AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES =0010 00c09300 DPL=0 DS [-WA] CS =0008 00c09b00 DPL=0 CS32 [-RA] SS =0010 00c09300 DPL=0 DS [-WA] DS =0010 00c09300 DPL=0 DS [-WA] FS =0010 00c09300 DPL=0 DS [-WA] GS =0010 00c09300 DPL=0 DS [-WA] LDT= 8200 DPL=0 LDT TR = 8b00 DPL=0 TSS32-busy GDT= 000f6be8 0037 IDT= 000f6c26 CR0=6011 CR2= CR3= CR4= DR0= DR1= DR2= DR3= DR6=0ff0 DR7=0400 EFER= Code=e8 ae fc ff ff 89 f2 a8 10 89 d8 75 0a b9 41 15 ff ff ff d1 <5b> 5e c3 5b 5e e9 76 ff ff ff b0 11 e6 20 e6 a0 b0 08 e6 21 b0 70 e6 a1 b0 04 e6 21 b0 02 FWIW, I get the same thing with 34a1cd60d17 reverted. Maybe there are two bugs, maybe there's more to this first one. I can repro this error with the command: qemu-system-x86_64 -machine pc,accel=kvm -nodefaults > >This is with QEMU emulator version 2.1.2 (Debian 1:2.1+dfsg-11), > >Copyright (c) 2003-2008 Fabrice Bellard > > > >The host system is: > > > >cpu family : 6 > >model : 23 > >model name : Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz > >stepping: 10 > >microcode : 0xa0b > >... > >flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca > >cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm > >constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 > >monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm > >dtherm tpr_shadow vnmi flexpriority > > > >I bisected this back to: > > > >commit 34a1cd60d17f62c1f077c1478a6c2ca8c3d17af4 > >Author: Tiejun Chen > >Date: Tue Oct 28 10:14:48 2014 +0800 > > > > kvm: x86: vmx: move some vmx setting from vmx_init() to hardware_setup() > > > > Instead of vmx_init(), actually it would make reasonable sense to do > > anything specific to vmx hardware setting in > > vmx_x86_ops->hardwa
Re: regression bisected; KVM: entry failed, hardware error 0x80000021
Chen, Tiejun wrote: On 2014/12/23 9:50, Chen, Tiejun wrote: On 2014/12/22 17:23, Jamie Heilman wrote: Chen, Tiejun wrote: On 2014/12/21 20:46, Jamie Heilman wrote: With v3.19-rc1 when I run qemu-system-x86_64 -machine pc,accel=kvm I get: KVM: entry failed, hardware error 0x8021 Looks some MSR writing issues such a failed entry. If you're running a guest on an Intel machine without unrestricted mode support, the failure can be most likely due to the guest entering an invalid state for Intel VT. For example, the guest maybe running in big real mode which is not supported on less recent Intel processors. EAX= EBX= ECX= EDX=0663 ESI= EDI= EBP= ESP= EIP=e05b EFL=00010002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES = 9300 CS =f000 000f 9b00 SS = 9300 DS = 9300 FS = 9300 GS = 9300 LDT= 8200 TR = 8b00 GDT= IDT= CR0=6010 CR2= CR3= CR4= DR0= DR1= DR2= DR3= DR6=0ff0 DR7=0400 EFER= And I don't see any obvious wrong as well. Any valuable info from dmesg? With the simple qemu command above, on 3.18.1 I see: kern.info: kvm: zapping shadow pages for mmio generation wraparound when I fire up a full guest that's actually useful I get: kern.info: kvm: zapping shadow pages for mmio generation wraparound kern.err: kvm [4073]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x On 3.18.0-rc3-00042-g34a1cd6 nothing appears in the dmesg, just the message I mention above to stderr. Same thing with a stock 3.19.0-rc1. Once I apply your patch the simple test command produces the same zapping shadow pages messages as 3.18.1, and a test guest of a Debian Jessie image (w/stock distro kernel) produces the same thing with disabled perfctr wrmsr message. However, it doesn't look like Sorry I'm not sure if I understood current status. Looks 3.19-rc1 my patch just fix that error above, KVM: entry failed, hardware error 0x8021 ... Right? I'm entirely out of the woods, because one of my other guest VMs with a custom kernel that works great under 3.18.1 now fails to run. Nothing in dmesg, but here's the stderr: But even you revert 34a1cd60d17 or just apply my patch, something else introduced between 3.18.1 and 3.19-rc1 led this error below, right? KVM internal error. Suberror: 1 emulation failure EAX=000de494 EBX= ECX= EDX=0cfd ESI=0059 EDI= EBP= ESP=6fb4 EIP=000f15c1 EFL=00010016 [AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES =0010 00c09300 DPL=0 DS [-WA] CS =0008 00c09b00 DPL=0 CS32 [-RA] SS =0010 00c09300 DPL=0 DS [-WA] DS =0010 00c09300 DPL=0 DS [-WA] FS =0010 00c09300 DPL=0 DS [-WA] GS =0010 00c09300 DPL=0 DS [-WA] LDT= 8200 DPL=0 LDT TR = 8b00 DPL=0 TSS32-busy GDT= 000f6be8 0037 IDT= 000f6c26 CR0=6011 CR2= CR3= CR4= DR0= DR1= DR2= DR3= DR6=0ff0 DR7=0400 EFER= Code=e8 ae fc ff ff 89 f2 a8 10 89 d8 75 0a b9 41 15 ff ff ff d1 5b 5e c3 5b 5e e9 76 ff ff ff b0 11 e6 20 e6 a0 b0 08 e6 21 b0 70 e6 a1 b0 04 e6 21 b0 02 FWIW, I get the same thing with 34a1cd60d17 reverted. Maybe there are two bugs, maybe there's more to this first one. I can repro this So if my understanding is correct, this is probably another bug. And especially, I already saw the same log in another thread, Cleaning up the KVM clock. Maybe you can continue to `git bisect` to locate that bad commit. Looks just now Andy found that commit, 0e60b0799fedc495a5c57dbd669de3c10d72edd2 kvm: change memslot sorting rule from size to GFN, maybe you can try to revert this to try yours again. That doesn't revert cleanly for me, and I don't have much time to fiddle with it until the 24th---so checked out the commit before it (d4ae84a0), applied your patch, built, and yes, everything works fine at that point. I'll probably have time for another full bisection later, assuming things aren't ironed out already by then. -- Jamie Heilman http://audible.transient.net/~jamie/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: regression bisected; KVM: entry failed, hardware error 0x80000021
Chen, Tiejun wrote: On 2014/12/21 20:46, Jamie Heilman wrote: With v3.19-rc1 when I run qemu-system-x86_64 -machine pc,accel=kvm I get: KVM: entry failed, hardware error 0x8021 Looks some MSR writing issues such a failed entry. If you're running a guest on an Intel machine without unrestricted mode support, the failure can be most likely due to the guest entering an invalid state for Intel VT. For example, the guest maybe running in big real mode which is not supported on less recent Intel processors. EAX= EBX= ECX= EDX=0663 ESI= EDI= EBP= ESP= EIP=e05b EFL=00010002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES = 9300 CS =f000 000f 9b00 SS = 9300 DS = 9300 FS = 9300 GS = 9300 LDT= 8200 TR = 8b00 GDT= IDT= CR0=6010 CR2= CR3= CR4= DR0= DR1= DR2= DR3= DR6=0ff0 DR7=0400 EFER= And I don't see any obvious wrong as well. Any valuable info from dmesg? With the simple qemu command above, on 3.18.1 I see: kern.info: kvm: zapping shadow pages for mmio generation wraparound when I fire up a full guest that's actually useful I get: kern.info: kvm: zapping shadow pages for mmio generation wraparound kern.err: kvm [4073]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x On 3.18.0-rc3-00042-g34a1cd6 nothing appears in the dmesg, just the message I mention above to stderr. Same thing with a stock 3.19.0-rc1. Once I apply your patch the simple test command produces the same zapping shadow pages messages as 3.18.1, and a test guest of a Debian Jessie image (w/stock distro kernel) produces the same thing with disabled perfctr wrmsr message. However, it doesn't look like I'm entirely out of the woods, because one of my other guest VMs with a custom kernel that works great under 3.18.1 now fails to run. Nothing in dmesg, but here's the stderr: KVM internal error. Suberror: 1 emulation failure EAX=000de494 EBX= ECX= EDX=0cfd ESI=0059 EDI= EBP= ESP=6fb4 EIP=000f15c1 EFL=00010016 [AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES =0010 00c09300 DPL=0 DS [-WA] CS =0008 00c09b00 DPL=0 CS32 [-RA] SS =0010 00c09300 DPL=0 DS [-WA] DS =0010 00c09300 DPL=0 DS [-WA] FS =0010 00c09300 DPL=0 DS [-WA] GS =0010 00c09300 DPL=0 DS [-WA] LDT= 8200 DPL=0 LDT TR = 8b00 DPL=0 TSS32-busy GDT= 000f6be8 0037 IDT= 000f6c26 CR0=6011 CR2= CR3= CR4= DR0= DR1= DR2= DR3= DR6=0ff0 DR7=0400 EFER= Code=e8 ae fc ff ff 89 f2 a8 10 89 d8 75 0a b9 41 15 ff ff ff d1 5b 5e c3 5b 5e e9 76 ff ff ff b0 11 e6 20 e6 a0 b0 08 e6 21 b0 70 e6 a1 b0 04 e6 21 b0 02 FWIW, I get the same thing with 34a1cd60d17 reverted. Maybe there are two bugs, maybe there's more to this first one. I can repro this error with the command: qemu-system-x86_64 -machine pc,accel=kvm -nodefaults This is with QEMU emulator version 2.1.2 (Debian 1:2.1+dfsg-11), Copyright (c) 2003-2008 Fabrice Bellard The host system is: cpu family : 6 model : 23 model name : Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz stepping: 10 microcode : 0xa0b ... flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm tpr_shadow vnmi flexpriority I bisected this back to: commit 34a1cd60d17f62c1f077c1478a6c2ca8c3d17af4 Author: Tiejun Chen tiejun.c...@intel.com Date: Tue Oct 28 10:14:48 2014 +0800 kvm: x86: vmx: move some vmx setting from vmx_init() to hardware_setup() Instead of vmx_init(), actually it would make reasonable sense to do anything specific to vmx hardware setting in vmx_x86_ops-hardware_setup(). This commit just reorders something but some MSR writing depend on previous status. Could you try this? I unmangled the expanded tabs, and applied this: diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index feb852b..96c84a8 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -5840,49 +5840,6 @@ static __init int hardware_setup(void) memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE); memset(vmx_msr_bitmap_longmode
regression bisected; KVM: entry failed, hardware error 0x80000021
With v3.19-rc1 when I run qemu-system-x86_64 -machine pc,accel=kvm I get: KVM: entry failed, hardware error 0x8021 If you're running a guest on an Intel machine without unrestricted mode support, the failure can be most likely due to the guest entering an invalid state for Intel VT. For example, the guest maybe running in big real mode which is not supported on less recent Intel processors. EAX= EBX= ECX= EDX=0663 ESI= EDI= EBP= ESP= EIP=e05b EFL=00010002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES = 9300 CS =f000 000f 9b00 SS = 9300 DS = 9300 FS = 9300 GS = 9300 LDT= 8200 TR = 8b00 GDT= IDT= CR0=6010 CR2= CR3= CR4= DR0= DR1= DR2= DR3= DR6=0ff0 DR7=0400 EFER= Code=85 00 87 00 89 00 8b 00 00 00 86 00 88 00 8a 00 8c 00 00 90 <2e> 66 83 3e 30 6c 00 0f 85 e7 f2 31 c0 8e d0 66 bc 00 70 00 00 66 ba 31 2e 0f 00 e9 45 f1 This is with QEMU emulator version 2.1.2 (Debian 1:2.1+dfsg-11), Copyright (c) 2003-2008 Fabrice Bellard The host system is: cpu family : 6 model : 23 model name : Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz stepping: 10 microcode : 0xa0b ... flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm tpr_shadow vnmi flexpriority I bisected this back to: commit 34a1cd60d17f62c1f077c1478a6c2ca8c3d17af4 Author: Tiejun Chen Date: Tue Oct 28 10:14:48 2014 +0800 kvm: x86: vmx: move some vmx setting from vmx_init() to hardware_setup() Instead of vmx_init(), actually it would make reasonable sense to do anything specific to vmx hardware setting in vmx_x86_ops->hardware_setup(). Signed-off-by: Tiejun Chen Signed-off-by: Paolo Bonzini reverting this commit appears to fix the issue, in so far that I can run kvm again without the aforementioned error, but I haven't tested it thoroughly beyond that. Let me know if you need more information or testing. -- Jamie Heilman http://audible.transient.net/~jamie/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
regression bisected; KVM: entry failed, hardware error 0x80000021
With v3.19-rc1 when I run qemu-system-x86_64 -machine pc,accel=kvm I get: KVM: entry failed, hardware error 0x8021 If you're running a guest on an Intel machine without unrestricted mode support, the failure can be most likely due to the guest entering an invalid state for Intel VT. For example, the guest maybe running in big real mode which is not supported on less recent Intel processors. EAX= EBX= ECX= EDX=0663 ESI= EDI= EBP= ESP= EIP=e05b EFL=00010002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES = 9300 CS =f000 000f 9b00 SS = 9300 DS = 9300 FS = 9300 GS = 9300 LDT= 8200 TR = 8b00 GDT= IDT= CR0=6010 CR2= CR3= CR4= DR0= DR1= DR2= DR3= DR6=0ff0 DR7=0400 EFER= Code=85 00 87 00 89 00 8b 00 00 00 86 00 88 00 8a 00 8c 00 00 90 2e 66 83 3e 30 6c 00 0f 85 e7 f2 31 c0 8e d0 66 bc 00 70 00 00 66 ba 31 2e 0f 00 e9 45 f1 This is with QEMU emulator version 2.1.2 (Debian 1:2.1+dfsg-11), Copyright (c) 2003-2008 Fabrice Bellard The host system is: cpu family : 6 model : 23 model name : Intel(R) Core(TM)2 Duo CPU E8400 @ 3.00GHz stepping: 10 microcode : 0xa0b ... flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm tpr_shadow vnmi flexpriority I bisected this back to: commit 34a1cd60d17f62c1f077c1478a6c2ca8c3d17af4 Author: Tiejun Chen tiejun.c...@intel.com Date: Tue Oct 28 10:14:48 2014 +0800 kvm: x86: vmx: move some vmx setting from vmx_init() to hardware_setup() Instead of vmx_init(), actually it would make reasonable sense to do anything specific to vmx hardware setting in vmx_x86_ops-hardware_setup(). Signed-off-by: Tiejun Chen tiejun.c...@intel.com Signed-off-by: Paolo Bonzini pbonz...@redhat.com reverting this commit appears to fix the issue, in so far that I can run kvm again without the aforementioned error, but I haven't tested it thoroughly beyond that. Let me know if you need more information or testing. -- Jamie Heilman http://audible.transient.net/~jamie/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [3.7-rc5/rc6 regression] "drm/nvc0/disp: fix regression in vblank semaphore release" broke nouveau driver and mplayer
Mikael Pettersson wrote: > mplayer worked fine on my Dell Latitude E6510 (nVidia GT218 [NVS > 3100M] graphics) up to and including kernel 3.7-rc4. However, with > 3.7-rc5 or -rc6, any attempt to run mplayer just blanks the screen, > shows some stray white pixels in the upper left corner, kills the X > server, and spews the following errors from the kernel: ... > > The error is 100% repeatable. I see the same problem on my workstation with a G86 [Quadro NVS 290] (rev a1) running Debian (sid) with: xserver-xorg-core 2:1.12.4-3 libdrm-nouveau1a:amd64 2.4.33-3 xserver-xorg-video-nouveau 1:1.0.1-3 and mplayer2 2.0-600-g95e81df w/the xv video output driver > git bisect identified the following culprit: > > 11d92561c81be2f4a7af37f035e1af294b960abe is the first bad commit I bisected to the same commit as well. -- Jamie Heilman http://audible.transient.net/~jamie/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [3.7-rc5/rc6 regression] drm/nvc0/disp: fix regression in vblank semaphore release broke nouveau driver and mplayer
Mikael Pettersson wrote: mplayer worked fine on my Dell Latitude E6510 (nVidia GT218 [NVS 3100M] graphics) up to and including kernel 3.7-rc4. However, with 3.7-rc5 or -rc6, any attempt to run mplayer just blanks the screen, shows some stray white pixels in the upper left corner, kills the X server, and spews the following errors from the kernel: ... The error is 100% repeatable. I see the same problem on my workstation with a G86 [Quadro NVS 290] (rev a1) running Debian (sid) with: xserver-xorg-core 2:1.12.4-3 libdrm-nouveau1a:amd64 2.4.33-3 xserver-xorg-video-nouveau 1:1.0.1-3 and mplayer2 2.0-600-g95e81df w/the xv video output driver git bisect identified the following culprit: 11d92561c81be2f4a7af37f035e1af294b960abe is the first bad commit I bisected to the same commit as well. -- Jamie Heilman http://audible.transient.net/~jamie/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: v3.5 nfsd4 regression; utime sometimes takes 40+ seconds to return
J. Bruce Fields wrote: > On Thu, Aug 16, 2012 at 04:42:08PM -0400, J. Bruce Fields wrote: > > Hm, weird. In the good case the cb_recall's done with auth_unix, in the > > bad case with auth_null. OK, that should be enough to go on > > > > Thanks for digging into this! > > This should fix it--could you confirm? Yep, looks like that fixes it. > commit e950bebdac1f17121f972728489cdba43734d56d > Author: J. Bruce Fields > Date: Thu Aug 16 17:01:21 2012 -0400 > > nfsd4: fix security flavor of NFSv4.0 callback > > Commit d5497fc693a446ce9100fcf4117c3f795ddfd0d2 "nfsd4: move rq_flavor > into svc_cred" forgot to remove cl_flavor from the client, leaving two > places (cl_flavor and cl_cred.cr_flavor) for the flavor to be stored. > After that patch, the latter was the one that was updated, but the > former was the one that the callback used. > > Symptoms were a long delay on utime(). This is because the utime() > generated a setattr which recalled a delegation, but the cb_recall was > ignored by the client because it had the wrong security flavor. > > Cc: sta...@vger.kernel.org > Reported-by: Jamie Heilman > Signed-off-by: J. Bruce Fields > > diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c > index cbaf4f8..4c7bd35 100644 > --- a/fs/nfsd/nfs4callback.c > +++ b/fs/nfsd/nfs4callback.c > @@ -651,12 +651,12 @@ static int setup_callback_client(struct nfs4_client > *clp, struct nfs4_cb_conn *c > > if (clp->cl_minorversion == 0) { > if (!clp->cl_cred.cr_principal && > - (clp->cl_flavor >= RPC_AUTH_GSS_KRB5)) > + (clp->cl_cred.cr_flavor >= RPC_AUTH_GSS_KRB5)) > return -EINVAL; > args.client_name = clp->cl_cred.cr_principal; > args.prognumber = conn->cb_prog, > args.protocol = XPRT_TRANSPORT_TCP; > - args.authflavor = clp->cl_flavor; > + args.authflavor = clp->cl_cred.cr_flavor; > clp->cl_cb_ident = conn->cb_ident; > } else { > if (!conn->cb_xprt) > diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h > index 167d7d8..9db0bb5 100644 > --- a/fs/nfsd/state.h > +++ b/fs/nfsd/state.h > @@ -231,7 +231,6 @@ struct nfs4_client { > nfs4_verifier cl_verifier;/* generated by client */ > time_t cl_time;/* time of last lease renewal */ > struct sockaddr_storage cl_addr;/* client ipaddress */ > - u32 cl_flavor; /* setclientid pseudoflavor */ > struct svc_cred cl_cred;/* setclientid principal */ > clientid_t cl_clientid;/* generated by server */ > nfs4_verifier cl_confirm; /* generated by server */ > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- Jamie Heilman http://audible.transient.net/~jamie/ -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: v3.5 nfsd4 regression; utime sometimes takes 40+ seconds to return
J. Bruce Fields wrote: > On Wed, Aug 15, 2012 at 01:58:54PM +0000, Jamie Heilman wrote: > > Jamie Heilman wrote: > > > I'll try to get full rcpdebug traces on client and server as the delay > > > is occuring in the hopes that helps pin things down, and post them > > > separately. > > > > OK, here are the logs from client and server, where a run of my test program > > under strace -T resulted in: > > > > utime("utime-test.c", [2012/08/15-13:35:20, 2012/08/15-13:22:04]) = 0 > > <0.150815> > > open("utime-test.c", O_RDONLY) = 3 <0.242635> > > close(3)= 0 <0.147768> > > stat("utime-test.c", {st_mode=S_IFREG|0644, st_size=696, ...}) = 0 > > <0.002772> > > utime("utime-test.c", [2012/08/15-13:35:20, 2012/08/15-13:22:04]) = 0 > > <71.878058> > > > > The client is a 64-bit v3.4.8 kernel, the server is 32-bit, v3.5.1 + > > the two sunprc patches that will be in v3.5.2. > > > > (The client's system clock is a touch faster than the server's, but > > these logs start at the same instant.) > > Thanks for all the details. > > What's probably happening is that the client is returning a delegation > with the open. The setattr then breaks that delegation; you can see it > getting 10008 (NFS4ERR_DELAY) replies while the server waits for the > delegation to be returned. But for some reason the callback to break > the delegation isn't working. ("NFSD: warning: no callback path to > client 192.168.2.42/192.168.2.4 tcp UNIX 0: error -110" (110 is > ETIMEDOUT).) So instead you wait for the delegation to time out and get > forcibly revoked. > > The reproducer might be more reliable if you did two opens. I made the change... not entirely sure it helped, but I think I've bisected this reliably anyway. It came down to: d5497fc693a446ce9100fcf4117c3f795ddfd0d2 is the first bad commit commit d5497fc693a446ce9100fcf4117c3f795ddfd0d2 Author: J. Bruce Fields Date: Mon May 14 22:06:49 2012 -0400 nfsd4: move rq_flavor into svc_cred Move the rq_flavor into struct svc_cred, and use it in setclientid and exchange_id comparisons as well. Signed-off-by: J. Bruce Fields :04 04 165568aae0fc47067863ea8ca911494033cbf2ce 54244f0084f58df4844b09d56e8e6bad0e19d913 M fs :04 04 bae5a74a85bda2aa2b155b0bdca1ea0028502164 207751ce30c8f49c68fb0a85d0772cb0fa426ae1 M include :04 04 ef940cf0c558fe65a21a710e6c95b2cefb2996fd 459ac143acf5cb2b3b7f59df29c417cedbb23c4b M net FWIW reverting this commit does seem to fix the problem. > It'd be worth looking at the traffic in wireshark. You should see > setattr, open, close, setattr, a DELAY reply to the setattr, a > CB_RECALL, and then a DELEGRETURN that gets a succesful reply. But for > some reason the DELEGRETURN isn't getting through in your case, I'm not > sure why. I can't reproduce that. You'll need to start wireshark > before you mount to make sure it knows how to parse the callbacks. Captures from the server showing the delay (running 3.5.2) and the previous behavior (running 3.4.9) are attached. -- Jamie Heilman http://audible.transient.net/~jamie/ nfsd4-delay.pcap.gz Description: Binary data nfsd4-ok.pcap.gz Description: Binary data
Re: v3.5 nfsd4 regression; utime sometimes takes 40+ seconds to return
J. Bruce Fields wrote: On Wed, Aug 15, 2012 at 01:58:54PM +, Jamie Heilman wrote: Jamie Heilman wrote: I'll try to get full rcpdebug traces on client and server as the delay is occuring in the hopes that helps pin things down, and post them separately. OK, here are the logs from client and server, where a run of my test program under strace -T resulted in: utime(utime-test.c, [2012/08/15-13:35:20, 2012/08/15-13:22:04]) = 0 0.150815 open(utime-test.c, O_RDONLY) = 3 0.242635 close(3)= 0 0.147768 stat(utime-test.c, {st_mode=S_IFREG|0644, st_size=696, ...}) = 0 0.002772 utime(utime-test.c, [2012/08/15-13:35:20, 2012/08/15-13:22:04]) = 0 71.878058 The client is a 64-bit v3.4.8 kernel, the server is 32-bit, v3.5.1 + the two sunprc patches that will be in v3.5.2. (The client's system clock is a touch faster than the server's, but these logs start at the same instant.) Thanks for all the details. What's probably happening is that the client is returning a delegation with the open. The setattr then breaks that delegation; you can see it getting 10008 (NFS4ERR_DELAY) replies while the server waits for the delegation to be returned. But for some reason the callback to break the delegation isn't working. (NFSD: warning: no callback path to client 192.168.2.42/192.168.2.4 tcp UNIX 0: error -110 (110 is ETIMEDOUT).) So instead you wait for the delegation to time out and get forcibly revoked. The reproducer might be more reliable if you did two opens. I made the change... not entirely sure it helped, but I think I've bisected this reliably anyway. It came down to: d5497fc693a446ce9100fcf4117c3f795ddfd0d2 is the first bad commit commit d5497fc693a446ce9100fcf4117c3f795ddfd0d2 Author: J. Bruce Fields bfie...@redhat.com Date: Mon May 14 22:06:49 2012 -0400 nfsd4: move rq_flavor into svc_cred Move the rq_flavor into struct svc_cred, and use it in setclientid and exchange_id comparisons as well. Signed-off-by: J. Bruce Fields bfie...@redhat.com :04 04 165568aae0fc47067863ea8ca911494033cbf2ce 54244f0084f58df4844b09d56e8e6bad0e19d913 M fs :04 04 bae5a74a85bda2aa2b155b0bdca1ea0028502164 207751ce30c8f49c68fb0a85d0772cb0fa426ae1 M include :04 04 ef940cf0c558fe65a21a710e6c95b2cefb2996fd 459ac143acf5cb2b3b7f59df29c417cedbb23c4b M net FWIW reverting this commit does seem to fix the problem. It'd be worth looking at the traffic in wireshark. You should see setattr, open, close, setattr, a DELAY reply to the setattr, a CB_RECALL, and then a DELEGRETURN that gets a succesful reply. But for some reason the DELEGRETURN isn't getting through in your case, I'm not sure why. I can't reproduce that. You'll need to start wireshark before you mount to make sure it knows how to parse the callbacks. Captures from the server showing the delay (running 3.5.2) and the previous behavior (running 3.4.9) are attached. -- Jamie Heilman http://audible.transient.net/~jamie/ nfsd4-delay.pcap.gz Description: Binary data nfsd4-ok.pcap.gz Description: Binary data
Re: v3.5 nfsd4 regression; utime sometimes takes 40+ seconds to return
J. Bruce Fields wrote: On Thu, Aug 16, 2012 at 04:42:08PM -0400, J. Bruce Fields wrote: Hm, weird. In the good case the cb_recall's done with auth_unix, in the bad case with auth_null. OK, that should be enough to go on Thanks for digging into this! This should fix it--could you confirm? Yep, looks like that fixes it. commit e950bebdac1f17121f972728489cdba43734d56d Author: J. Bruce Fields bfie...@redhat.com Date: Thu Aug 16 17:01:21 2012 -0400 nfsd4: fix security flavor of NFSv4.0 callback Commit d5497fc693a446ce9100fcf4117c3f795ddfd0d2 nfsd4: move rq_flavor into svc_cred forgot to remove cl_flavor from the client, leaving two places (cl_flavor and cl_cred.cr_flavor) for the flavor to be stored. After that patch, the latter was the one that was updated, but the former was the one that the callback used. Symptoms were a long delay on utime(). This is because the utime() generated a setattr which recalled a delegation, but the cb_recall was ignored by the client because it had the wrong security flavor. Cc: sta...@vger.kernel.org Reported-by: Jamie Heilman ja...@audible.transient.net Signed-off-by: J. Bruce Fields bfie...@redhat.com diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c index cbaf4f8..4c7bd35 100644 --- a/fs/nfsd/nfs4callback.c +++ b/fs/nfsd/nfs4callback.c @@ -651,12 +651,12 @@ static int setup_callback_client(struct nfs4_client *clp, struct nfs4_cb_conn *c if (clp-cl_minorversion == 0) { if (!clp-cl_cred.cr_principal - (clp-cl_flavor = RPC_AUTH_GSS_KRB5)) + (clp-cl_cred.cr_flavor = RPC_AUTH_GSS_KRB5)) return -EINVAL; args.client_name = clp-cl_cred.cr_principal; args.prognumber = conn-cb_prog, args.protocol = XPRT_TRANSPORT_TCP; - args.authflavor = clp-cl_flavor; + args.authflavor = clp-cl_cred.cr_flavor; clp-cl_cb_ident = conn-cb_ident; } else { if (!conn-cb_xprt) diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h index 167d7d8..9db0bb5 100644 --- a/fs/nfsd/state.h +++ b/fs/nfsd/state.h @@ -231,7 +231,6 @@ struct nfs4_client { nfs4_verifier cl_verifier;/* generated by client */ time_t cl_time;/* time of last lease renewal */ struct sockaddr_storage cl_addr;/* client ipaddress */ - u32 cl_flavor; /* setclientid pseudoflavor */ struct svc_cred cl_cred;/* setclientid principal */ clientid_t cl_clientid;/* generated by server */ nfs4_verifier cl_confirm; /* generated by server */ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/ -- Jamie Heilman http://audible.transient.net/~jamie/ -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: v3.5 nfsd4 regression; utime sometimes takes 40+ seconds to return
Jamie Heilman wrote: > I'll try to get full rcpdebug traces on client and server as the delay > is occuring in the hopes that helps pin things down, and post them > separately. OK, here are the logs from client and server, where a run of my test program under strace -T resulted in: utime("utime-test.c", [2012/08/15-13:35:20, 2012/08/15-13:22:04]) = 0 <0.150815> open("utime-test.c", O_RDONLY) = 3 <0.242635> close(3)= 0 <0.147768> stat("utime-test.c", {st_mode=S_IFREG|0644, st_size=696, ...}) = 0 <0.002772> utime("utime-test.c", [2012/08/15-13:35:20, 2012/08/15-13:22:04]) = 0 <71.878058> The client is a 64-bit v3.4.8 kernel, the server is 32-bit, v3.5.1 + the two sunprc patches that will be in v3.5.2. (The client's system clock is a touch faster than the server's, but these logs start at the same instant.) -- Jamie Heilman http://audible.transient.net/~jamie/ server.gz Description: Binary data client.gz Description: Binary data
v3.5 nfsd4 regression; utime sometimes takes 40+ seconds to return
I really wish I could have nailed this down better, but I've had a hard time reliably reproducing the problem during bisection, and I haven't seen anyone report a similar sounding problem. Here's what I've seen: since 3.5 I've been having spurious delays on my nfs clients, noticable particularly when I open an mbox file in mutt over an nfs v4 mount from a v3.5 or later server. The servers I've reproduced this on are all uni-proc 32-bit systems... but then I haven't tried SMP or 64-bit systems yet, it may or may not exist there. When the delay occurs, it's quite noticable. I've never seen one that takes less than 40 seconds to "unstick." I wrote a quick and dirty reproduction tool, based on the syscalls mutt was doing that triggered the problem, attached to this message. To use it, compile the file as utime-test on an exported volume, then execute with (cd /some/mount/point && strace -T ./utime-test) from a nfs4 client. For whatever, reason I frequently find the second call to utime takes an irritatingly long time to return and I see something like: utime("utime-test.c", [2012/08/14-22:47:21, 2012/08/14-17:25:21]) = 0 <70.510913> in the strace output. I've reproduced this on Debian Squeeze / nfs-utils 1.2.2 based servers (legacy idmapper, no user-space nfsidmap), as well as Debian Wheezy / nfs-utils 1.2.6 (uses keyutils upcalls) servers, so I doubt it's a user-space related issue... Attempts to bisect have been muddled, I'll keep trying in the interim, but the best I've been able to pin things down is that issue was probably introduced in the 419f4319495043a9507ac3e616be9ca60af09744 merge. I can't repo on a kernel based on fb21affa49204acd409328415b49bfe90136653c. (I say based on, because I have to apply the patch from http://marc.info/?l=linux-nfs=133950479803025 or face additional problems.) I'll try to get full rcpdebug traces on client and server as the delay is occuring in the hopes that helps pin things down, and post them separately. -- Jamie Heilman http://audible.transient.net/~jamie/ #include #include #include #include #include #include #include #define F_PATH "utime-test.c" int main() { struct stat sb; struct utimbuf ub; int fd; if (stat(F_PATH, ) == -1) { perror(NULL); return -1; } ub.modtime = sb.st_mtime; ub.actime = time(NULL); if (utime(F_PATH, ) == -1) { perror(NULL); return -2; } if ((fd = open(F_PATH, O_RDONLY)) == -1) { perror(NULL); return -3; } close(fd); if (stat(F_PATH, ) == -1) { perror(NULL); return -1; } ub.modtime = sb.st_mtime; ub.actime = time(NULL); if (utime(F_PATH, ) == -1) { perror(NULL); return -2; } return 0; }
v3.5 nfsd4 regression; utime sometimes takes 40+ seconds to return
I really wish I could have nailed this down better, but I've had a hard time reliably reproducing the problem during bisection, and I haven't seen anyone report a similar sounding problem. Here's what I've seen: since 3.5 I've been having spurious delays on my nfs clients, noticable particularly when I open an mbox file in mutt over an nfs v4 mount from a v3.5 or later server. The servers I've reproduced this on are all uni-proc 32-bit systems... but then I haven't tried SMP or 64-bit systems yet, it may or may not exist there. When the delay occurs, it's quite noticable. I've never seen one that takes less than 40 seconds to unstick. I wrote a quick and dirty reproduction tool, based on the syscalls mutt was doing that triggered the problem, attached to this message. To use it, compile the file as utime-test on an exported volume, then execute with (cd /some/mount/point strace -T ./utime-test) from a nfs4 client. For whatever, reason I frequently find the second call to utime takes an irritatingly long time to return and I see something like: utime(utime-test.c, [2012/08/14-22:47:21, 2012/08/14-17:25:21]) = 0 70.510913 in the strace output. I've reproduced this on Debian Squeeze / nfs-utils 1.2.2 based servers (legacy idmapper, no user-space nfsidmap), as well as Debian Wheezy / nfs-utils 1.2.6 (uses keyutils upcalls) servers, so I doubt it's a user-space related issue... Attempts to bisect have been muddled, I'll keep trying in the interim, but the best I've been able to pin things down is that issue was probably introduced in the 419f4319495043a9507ac3e616be9ca60af09744 merge. I can't repo on a kernel based on fb21affa49204acd409328415b49bfe90136653c. (I say based on, because I have to apply the patch from http://marc.info/?l=linux-nfsm=133950479803025 or face additional problems.) I'll try to get full rcpdebug traces on client and server as the delay is occuring in the hopes that helps pin things down, and post them separately. -- Jamie Heilman http://audible.transient.net/~jamie/ #include sys/types.h #include sys/stat.h #include fcntl.h #include unistd.h #include time.h #include utime.h #include stdio.h #define F_PATH utime-test.c int main() { struct stat sb; struct utimbuf ub; int fd; if (stat(F_PATH, sb) == -1) { perror(NULL); return -1; } ub.modtime = sb.st_mtime; ub.actime = time(NULL); if (utime(F_PATH, ub) == -1) { perror(NULL); return -2; } if ((fd = open(F_PATH, O_RDONLY)) == -1) { perror(NULL); return -3; } close(fd); if (stat(F_PATH, sb) == -1) { perror(NULL); return -1; } ub.modtime = sb.st_mtime; ub.actime = time(NULL); if (utime(F_PATH, ub) == -1) { perror(NULL); return -2; } return 0; }
Re: v3.5 nfsd4 regression; utime sometimes takes 40+ seconds to return
Jamie Heilman wrote: I'll try to get full rcpdebug traces on client and server as the delay is occuring in the hopes that helps pin things down, and post them separately. OK, here are the logs from client and server, where a run of my test program under strace -T resulted in: utime(utime-test.c, [2012/08/15-13:35:20, 2012/08/15-13:22:04]) = 0 0.150815 open(utime-test.c, O_RDONLY) = 3 0.242635 close(3)= 0 0.147768 stat(utime-test.c, {st_mode=S_IFREG|0644, st_size=696, ...}) = 0 0.002772 utime(utime-test.c, [2012/08/15-13:35:20, 2012/08/15-13:22:04]) = 0 71.878058 The client is a 64-bit v3.4.8 kernel, the server is 32-bit, v3.5.1 + the two sunprc patches that will be in v3.5.2. (The client's system clock is a touch faster than the server's, but these logs start at the same instant.) -- Jamie Heilman http://audible.transient.net/~jamie/ server.gz Description: Binary data client.gz Description: Binary data