Re: Linux 5.10.25

2021-03-21 Thread Jamie Heilman
Jamie Heilman wrote:
> Greg Kroah-Hartman wrote:
> > On Sat, Mar 20, 2021 at 09:31:55PM +0000, Jamie Heilman wrote:
> > > [ugh, resent with the lkml headers unbroken, sorry about the dupe]
> > > 
> > > Greg Kroah-Hartman wrote:
> > > > J. Bruce Fields (2):
> > > >   Revert "nfsd4: remove check_conflicting_opens warning"
> > > >   Revert "nfsd4: a client's own opens needn't prevent delegations"
> > > 
> > > Hrm, just got this when I udpated my nfs server (32bit Via EPIA system)
> > > from 5.10.20 to 5.10.25:
> > > 
> > > [   49.225914] NFSD: Using UMH upcall client tracking operations.
> > > [   49.231919] NFSD: starting 90-second grace period (net f036)
> > > [   50.036973] [ cut here ]
> > > [   50.041771] WARNING: CPU: 0 PID: 2284 at fs/nfsd/nfs4state.c:4968 
> > > nfsd4_process_open2+0xf9c/0x1170 [nfsd]
> > > [   50.051434] Modules linked in: md5 cpufreq_conservative 
> > > cpufreq_userspace cpufreq_powersave cpufreq_ondemand autofs4 quota_v2 
> > > quota_tree nfsd auth_rpcgss nfs lockd grace nfs_ssc fscache sunrpc 
> > > xt_mark cls_fw sch_htb iptable_nat xt_nat nf_nat ipt_REJECT 
> > > nf_reject_ipv4 xt_tcpudp xt_multiport iptable_mangle xt_state 
> > > xt_conntrack nf_conntrack nf_defrag_ipv4 nf_log_ipv4 nf_log_common xt_LOG 
> > > xt_limit iptable_filter ip_tables x_tables nhpoly1305 chacha_generic 
> > > libchacha adiantum libpoly1305 dm_crypt dm_mod snd_hda_codec_via 
> > > snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec 
> > > snd_hwdep snd_hda_core snd_pcm snd_timer snd via_rhine psmouse soundcore 
> > > mii via_agp sg via_velocity evdev agpgart
> > > [   50.113386] CPU: 0 PID: 2284 Comm: nfsd Tainted: GT 
> > > 5.10.25 #1
> > > [   50.120669] Hardware name: To Be Filled By O.E.M. To Be Filled By 
> > > O.E.M./To be filled by O.E.M., BIOS 080014  06/01/2009
> > > [   50.131652] EIP: nfsd4_process_open2+0xf9c/0x1170 [nfsd]
> > > [   50.137036] Code: 04 88 45 a4 88 07 8b 45 a0 8d 78 49 8b 45 84 8d 70 
> > > 01 e9 2b f8 ff ff c7 45 9c 00 00 00 00 31 ff bb 00 00 27 67 e9 04 f6 ff 
> > > ff <0f> 0b e9 a0 f5 ff ff 0f b6 d4 0f a3 15 94 3f 23 f8 0f 83 b1 fd ff
> > > [   50.155866] EAX:  EBX: 82da25b0 ECX: 830c0920 EDX: 
> > > [   50.162187] ESI: 82da25b0 EDI: 82da55a0 EBP: 8310be68 ESP: 8310bddc
> > > [   50.168507] DS: 007b ES: 007b FS:  GS: 00e0 SS: 0068 EFLAGS: 
> > > 00010246
> > > [   50.175338] CR0: 80050033 CR2: 00551e50 CR3: 03222000 CR4: 06b0
> > > [   50.181654] Call Trace:
> > > [   50.184165]  ? inode_permission+0x17/0xc0
> > > [   50.188289]  nfsd4_open+0x429/0x910 [nfsd]
> > > [   50.192483]  ? nfsd4_encode_operation+0x185/0x1e0 [nfsd]
> > > [   50.197900]  ? nfsd4_rename+0x1a0/0x1a0 [nfsd]
> > > [   50.202439]  nfsd4_proc_compound+0x457/0x6c0 [nfsd]
> > > [   50.207419]  nfsd_dispatch+0xdc/0x1a0 [nfsd]
> > > [   50.211816]  svc_process_common+0x38a/0x650 [sunrpc]
> > > [   50.216880]  ? svc_xprt_do_enqueue+0xd7/0xe0 [sunrpc]
> > > [   50.222017]  ? svc_xprt_received+0x5d/0xf0 [sunrpc]
> > > [   50.227000]  ? nfsd_svc+0x300/0x300 [nfsd]
> > > [   50.231190]  svc_process+0xa9/0xf0 [sunrpc]
> > > [   50.235468]  nfsd+0xcd/0x120 [nfsd]
> > > [   50.239025]  kthread+0xe1/0x100
> > > [   50.242259]  ? nfsd_destroy+0x50/0x50 [nfsd]
> > > [   50.246588]  ? kthread_create_on_node+0x30/0x30
> > > [   50.251165]  ret_from_fork+0x1c/0x28
> > > [   50.254789] ---[ end trace 171bde4774bc9795 ]---
> > > 
> > > Can't readily reproduce it though, so likely a race condition or
> > > something that requires more state buildup than I have after a few
> > > minutes of uptime.  Kernel config at
> > > http://audible.transient.net/~jamie/k/nfsd.config-5.10.25 in case you
> > > think this worth more investigation.
> > 
> > Do you also have this issue in Linus's tree and the latest 5.11.y
> > release?
> 
> Haven't tried it, but like I said, I couldn't even reproduce it again
> with 5.10.25, or booting between 5.10.20 to 5.10.25 again ... I'll
> give the others a shot and see if I can repro it there.

Yeah, can't repro it with 5.11.8 or 5.12.0-rc3-00281-g1d4345eb51a1
either, seems to have been a spurious thing.

-- 
Jamie Heilman http://audible.transient.net/~jamie/


Re: Linux 5.10.25

2021-03-21 Thread Jamie Heilman
Greg Kroah-Hartman wrote:
> On Sat, Mar 20, 2021 at 09:31:55PM +0000, Jamie Heilman wrote:
> > [ugh, resent with the lkml headers unbroken, sorry about the dupe]
> > 
> > Greg Kroah-Hartman wrote:
> > > J. Bruce Fields (2):
> > >   Revert "nfsd4: remove check_conflicting_opens warning"
> > >   Revert "nfsd4: a client's own opens needn't prevent delegations"
> > 
> > Hrm, just got this when I udpated my nfs server (32bit Via EPIA system)
> > from 5.10.20 to 5.10.25:
> > 
> > [   49.225914] NFSD: Using UMH upcall client tracking operations.
> > [   49.231919] NFSD: starting 90-second grace period (net f036)
> > [   50.036973] [ cut here ]
> > [   50.041771] WARNING: CPU: 0 PID: 2284 at fs/nfsd/nfs4state.c:4968 
> > nfsd4_process_open2+0xf9c/0x1170 [nfsd]
> > [   50.051434] Modules linked in: md5 cpufreq_conservative 
> > cpufreq_userspace cpufreq_powersave cpufreq_ondemand autofs4 quota_v2 
> > quota_tree nfsd auth_rpcgss nfs lockd grace nfs_ssc fscache sunrpc xt_mark 
> > cls_fw sch_htb iptable_nat xt_nat nf_nat ipt_REJECT nf_reject_ipv4 
> > xt_tcpudp xt_multiport iptable_mangle xt_state xt_conntrack nf_conntrack 
> > nf_defrag_ipv4 nf_log_ipv4 nf_log_common xt_LOG xt_limit iptable_filter 
> > ip_tables x_tables nhpoly1305 chacha_generic libchacha adiantum libpoly1305 
> > dm_crypt dm_mod snd_hda_codec_via snd_hda_codec_generic snd_hda_intel 
> > snd_intel_dspcfg snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd 
> > via_rhine psmouse soundcore mii via_agp sg via_velocity evdev agpgart
> > [   50.113386] CPU: 0 PID: 2284 Comm: nfsd Tainted: GT 
> > 5.10.25 #1
> > [   50.120669] Hardware name: To Be Filled By O.E.M. To Be Filled By 
> > O.E.M./To be filled by O.E.M., BIOS 080014  06/01/2009
> > [   50.131652] EIP: nfsd4_process_open2+0xf9c/0x1170 [nfsd]
> > [   50.137036] Code: 04 88 45 a4 88 07 8b 45 a0 8d 78 49 8b 45 84 8d 70 01 
> > e9 2b f8 ff ff c7 45 9c 00 00 00 00 31 ff bb 00 00 27 67 e9 04 f6 ff ff 
> > <0f> 0b e9 a0 f5 ff ff 0f b6 d4 0f a3 15 94 3f 23 f8 0f 83 b1 fd ff
> > [   50.155866] EAX:  EBX: 82da25b0 ECX: 830c0920 EDX: 
> > [   50.162187] ESI: 82da25b0 EDI: 82da55a0 EBP: 8310be68 ESP: 8310bddc
> > [   50.168507] DS: 007b ES: 007b FS:  GS: 00e0 SS: 0068 EFLAGS: 00010246
> > [   50.175338] CR0: 80050033 CR2: 00551e50 CR3: 03222000 CR4: 06b0
> > [   50.181654] Call Trace:
> > [   50.184165]  ? inode_permission+0x17/0xc0
> > [   50.188289]  nfsd4_open+0x429/0x910 [nfsd]
> > [   50.192483]  ? nfsd4_encode_operation+0x185/0x1e0 [nfsd]
> > [   50.197900]  ? nfsd4_rename+0x1a0/0x1a0 [nfsd]
> > [   50.202439]  nfsd4_proc_compound+0x457/0x6c0 [nfsd]
> > [   50.207419]  nfsd_dispatch+0xdc/0x1a0 [nfsd]
> > [   50.211816]  svc_process_common+0x38a/0x650 [sunrpc]
> > [   50.216880]  ? svc_xprt_do_enqueue+0xd7/0xe0 [sunrpc]
> > [   50.222017]  ? svc_xprt_received+0x5d/0xf0 [sunrpc]
> > [   50.227000]  ? nfsd_svc+0x300/0x300 [nfsd]
> > [   50.231190]  svc_process+0xa9/0xf0 [sunrpc]
> > [   50.235468]  nfsd+0xcd/0x120 [nfsd]
> > [   50.239025]  kthread+0xe1/0x100
> > [   50.242259]  ? nfsd_destroy+0x50/0x50 [nfsd]
> > [   50.246588]  ? kthread_create_on_node+0x30/0x30
> > [   50.251165]  ret_from_fork+0x1c/0x28
> > [   50.254789] ---[ end trace 171bde4774bc9795 ]---
> > 
> > Can't readily reproduce it though, so likely a race condition or
> > something that requires more state buildup than I have after a few
> > minutes of uptime.  Kernel config at
> > http://audible.transient.net/~jamie/k/nfsd.config-5.10.25 in case you
> > think this worth more investigation.
> 
> Do you also have this issue in Linus's tree and the latest 5.11.y
> release?

Haven't tried it, but like I said, I couldn't even reproduce it again
with 5.10.25, or booting between 5.10.20 to 5.10.25 again ... I'll
give the others a shot and see if I can repro it there.

-- 
Jamie Heilman http://audible.transient.net/~jamie/


Re: Linux 5.10.25

2021-03-20 Thread Jamie Heilman
[ugh, resent with the lkml headers unbroken, sorry about the dupe]

Greg Kroah-Hartman wrote:
> J. Bruce Fields (2):
>   Revert "nfsd4: remove check_conflicting_opens warning"
>   Revert "nfsd4: a client's own opens needn't prevent delegations"

Hrm, just got this when I udpated my nfs server (32bit Via EPIA system)
from 5.10.20 to 5.10.25:

[   49.225914] NFSD: Using UMH upcall client tracking operations.
[   49.231919] NFSD: starting 90-second grace period (net f036)
[   50.036973] [ cut here ]
[   50.041771] WARNING: CPU: 0 PID: 2284 at fs/nfsd/nfs4state.c:4968 
nfsd4_process_open2+0xf9c/0x1170 [nfsd]
[   50.051434] Modules linked in: md5 cpufreq_conservative cpufreq_userspace 
cpufreq_powersave cpufreq_ondemand autofs4 quota_v2 quota_tree nfsd auth_rpcgss 
nfs lockd grace nfs_ssc fscache sunrpc xt_mark cls_fw sch_htb iptable_nat 
xt_nat nf_nat ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_multiport iptable_mangle 
xt_state xt_conntrack nf_conntrack nf_defrag_ipv4 nf_log_ipv4 nf_log_common 
xt_LOG xt_limit iptable_filter ip_tables x_tables nhpoly1305 chacha_generic 
libchacha adiantum libpoly1305 dm_crypt dm_mod snd_hda_codec_via 
snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hwdep 
snd_hda_core snd_pcm snd_timer snd via_rhine psmouse soundcore mii via_agp sg 
via_velocity evdev agpgart
[   50.113386] CPU: 0 PID: 2284 Comm: nfsd Tainted: GT 5.10.25 
#1
[   50.120669] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./To 
be filled by O.E.M., BIOS 080014  06/01/2009
[   50.131652] EIP: nfsd4_process_open2+0xf9c/0x1170 [nfsd]
[   50.137036] Code: 04 88 45 a4 88 07 8b 45 a0 8d 78 49 8b 45 84 8d 70 01 e9 
2b f8 ff ff c7 45 9c 00 00 00 00 31 ff bb 00 00 27 67 e9 04 f6 ff ff <0f> 0b e9 
a0 f5 ff ff 0f b6 d4 0f a3 15 94 3f 23 f8 0f 83 b1 fd ff
[   50.155866] EAX:  EBX: 82da25b0 ECX: 830c0920 EDX: 
[   50.162187] ESI: 82da25b0 EDI: 82da55a0 EBP: 8310be68 ESP: 8310bddc
[   50.168507] DS: 007b ES: 007b FS:  GS: 00e0 SS: 0068 EFLAGS: 00010246
[   50.175338] CR0: 80050033 CR2: 00551e50 CR3: 03222000 CR4: 06b0
[   50.181654] Call Trace:
[   50.184165]  ? inode_permission+0x17/0xc0
[   50.188289]  nfsd4_open+0x429/0x910 [nfsd]
[   50.192483]  ? nfsd4_encode_operation+0x185/0x1e0 [nfsd]
[   50.197900]  ? nfsd4_rename+0x1a0/0x1a0 [nfsd]
[   50.202439]  nfsd4_proc_compound+0x457/0x6c0 [nfsd]
[   50.207419]  nfsd_dispatch+0xdc/0x1a0 [nfsd]
[   50.211816]  svc_process_common+0x38a/0x650 [sunrpc]
[   50.216880]  ? svc_xprt_do_enqueue+0xd7/0xe0 [sunrpc]
[   50.222017]  ? svc_xprt_received+0x5d/0xf0 [sunrpc]
[   50.227000]  ? nfsd_svc+0x300/0x300 [nfsd]
[   50.231190]  svc_process+0xa9/0xf0 [sunrpc]
[   50.235468]  nfsd+0xcd/0x120 [nfsd]
[   50.239025]  kthread+0xe1/0x100
[   50.242259]  ? nfsd_destroy+0x50/0x50 [nfsd]
[   50.246588]  ? kthread_create_on_node+0x30/0x30
[   50.251165]  ret_from_fork+0x1c/0x28
[   50.254789] ---[ end trace 171bde4774bc9795 ]---

Can't readily reproduce it though, so likely a race condition or
something that requires more state buildup than I have after a few
minutes of uptime.  Kernel config at
http://audible.transient.net/~jamie/k/nfsd.config-5.10.25 in case you
think this worth more investigation.

-- 
Jamie Heilman http://audible.transient.net/~jamie/


Re: [kbuild-all] Re: [PATCH] gcov: fail build on gcov_info size mismatch

2021-03-15 Thread Jamie Heilman
Linus Torvalds wrote:
> On Sun, Mar 14, 2021 at 7:32 PM Rong Chen  wrote:
> >
> > It can be reproduced with '-a' option in dash:
> 
> Oh, ok. That kind of explains it.
> 
> 'dash' is trash. Please somebody make a bug report.
> 
> >  $ a="!"
> >  $ [ "$a" = ".size" ]
> >  $ [ "$a" = ".size" -a "$b" = ".LPBX0," ]
> >  sh: 2: [: =: unexpected operator
> 
> This is 100% a dash bug. There is no question what-so-ever about it.
> This is not some kind of "POSIX is ambiguous", or "the handling of
> '-a' is complicated".
> 
> It's simply just that dash is buggy.
> 
> >  While dash supports most uses of the -a and -o options, they have
> > very confusing semantics even in bash and are best avoided.
> 
> No, they have perfectly sane semantics in bash, and in POSIX.
> 
> See
> 
>  https://pubs.opengroup.org/onlinepubs/9699919799/utilities/test.html
> 
> and there is absolutely zero question that bash does this correctly,
> and dash does not.
> 
> But yes, it seems to be easy to work around, but still - could some
> Ubuntu person please open a bug report on dash?

fwiw, https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=850202


Re: [Nouveau] nouveau regression post v5.8, still present in v5.10

2021-02-02 Thread Jamie Heilman
Jamie Heilman wrote:
> Jamie Heilman wrote:
> > Karol Herbst wrote:
> > > fyi, there is a patch which solves a maybe related issue on your GPU,
> > > mind giving it a try before we dig further?
> > > https://gitlab.freedesktop.org/drm/nouveau/-/issues/14#note_767791
> > 
> > So, I tried that.  Turns out, I can still trigger a problem.  Is it
> > the same problem?  Maybe?  I also tried applying the patch from 
> > 
> > ca386aa7155a ("drm/nouveau/kms/nv50-gp1xx: add WAR for EVO push buffer HW 
> > bug")
> > to 5.8.0-rc6-01516-g0a96099691c8 and very interestingly, it changed
> > the mode of failure to same thing I saw with 5.10.9 patched with the patch
> > from that bug report.  In both cases I get this in the log:
> > 
> > kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16
> > kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16
> > kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16
> > kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16
> > ...
> > and so on
> > 
> > In one incident my monitor would't even wake up anymore after this.
> > 
> > 
> > I'm trying to repo it now on an unpatched 5.8.0-rc6-01515-gae09163ac27c
> > right now, as running glxgears does seem to help reproduce problems
> > faster which is nice, I'm just not entirely sure it's the same set of
> > problems; hopefully that version is free from issues, but we'll
> > see...
> 
> Ugh, well I can crash 5.8.0-rc6-01515-gae09163ac27c and 5.8.18 in
> basically the same way running glxgears and a xset dpms force off
> loop.  So I'm starting to think it's not the same thing, and that
> problem has been latent from before I started having periodic issues.
> 
> I should note that my exact testing technique for the above was to run
> 4 copies of glxgears and the xset force dpms off loop at the same
> time.  Really looks more like it triggers a resource starvation issue
> maybe.  The crash is also worse, particularly if I don't do anything
> about it right away as my workstation eventually falls off the network
> and I'm forced to power cycle it; the crashes I was chasing after
> wouldn't do quite that much violence, normally I could still log in,
> rebuild a kernel, and shut things down cleanly.
> 
> More than one bug here I suspect.

OK, I went back and bisected again while patching known issues to get
a better idea what was causing the problem I've been having and I'm
confident it was the bug which Bastian Beranek's patch (now in
mainline) addressed.  My original bisection got confused by the
EVO push buffer HW bug which was fixed in ca386aa7155a54.  Once I
bisected with the patch from ca386aa7155a54 applied, my bisection
landed on f844eb485eb05 and Bastian Beranek's patch fixed that right
up.

'course I remain mildly concerned I can crash the kernel with little
more than glxgears and xset ... but the original stability problem I
reported I can safely say has been fixed.  If I can figure out the
nature of what I suspect is unrecoverable resource starvation, I'll
start a new thread for that.


-- 
Jamie Heilman http://audible.transient.net/~jamie/


Re: bisected regression in v5.11-rc1 snd-usb-audio

2021-01-23 Thread Jamie Heilman
Takashi Iwai wrote:
> On Sat, 23 Jan 2021 12:08:02 +0100,
> Jamie Heilman wrote:
> > 
> > > If the problem is still seen with the very latest Linus tree and the
> > > previous patch, please enable the dyndbg, e.g. pass dydbg=+p option to
> > > snd-usb-audio module, i.e. reload like
> > >modprobe snd-usb-audio dydbg=+p
> > > or boot with
> > >snd_usb_audio.dyndbg=+p
> > > boot option, retest, and give the kernel messages.
> > 
> > OK, updated to 5.11.0-rc4-00202-gfe75a21824e7, repatched & rebuilt with:
> > CONFIG_SND_DEBUG=y
> > CONFIG_SND_CTL_VALIDATION=y
> > CONFIG_DYNAMIC_DEBUG=y
> > CONFIG_DYNAMIC_DEBUG_CORE=y
> > 
> > and added dyndbg=+p to my snd-usb-audio options, dmesg after an
> > attempted aplay execution is attached.
> (snip)
> > [  108.941521] WARNING: CPU: 0 PID: 2776 at sound/core/pcm_lib.c:1148 
> > snd_pcm_hw_rule_add+0xd3/0x160 [snd_pcm]
> 
> OK, this must be the culprit of the weird behavior.
> I totally overlooked that the max dependencies are three because of
> the terminator.
> 
> Could you try the patch below in addition?

Yep, this works now, thanks!  FWIW, here's what the debug output looks
like when its working, and I start jack now:

usb 1-1.1.2: 1:1: found sync_ep=0x81, iface=1, alt=1, implicit_fb=0
usb 1-1.1.2: 1:1: add audio endpoint 0x1
usb 1-1.1.2: Creating new data endpoint #1
usb 1-1.1.2: Creating new sync endpoint #81
usb 1-1.1.2: 1:1 Set sample rate 192000, clock 40
usb 1-1.1.2: [10] FU [M2Tech Audio 2.0 Output Playback Switch] ch = 2, val = 
0/1/1
usb 1-1.1.2: [10] FU [M2Tech Audio 2.0 Output Playback Switch] ch = 1, val = 
0/1/1
usb 1-1.1.2: RANGE setting not yet supported
usb 1-1.1.2: [10] FU [M2Tech Audio 2.0 Output Playback Volume] ch = 2, val = 
-32512/0/256
usb 1-1.1.2: RANGE setting not yet supported
usb 1-1.1.2: [10] FU [M2Tech Audio 2.0 Output Playback Volume] ch = 1, val = 
-32512/0/256
usb 1-1.1.2: Open EP 0x1, iface=1:1, idx=0
usb 1-1.1.2:   channels=2, rate=96000, format=S32_LE, period_bytes=8192, 
periods=3, implicit_fb=0
usb 1-1.1.2: Open EP 0x81, iface=1:1, idx=1
usb 1-1.1.2:   channels=2, rate=96000, format=S32_LE, period_bytes=8192, 
periods=3, implicit_fb=0
usb 1-1.1.2: Setting usb interface 1:0 for EP 0x1
usb 1-1.1.2: 1:1 Set sample rate 96000, clock 40
usb 1-1.1.2: Setting params for data EP 0x1, pipe 0x8600
usb 1-1.1.2: Set up 12 URBS, ret=0
usb 1-1.1.2: Setting usb interface 1:1 for EP 0x1
usb 1-1.1.2: Setting params for sync EP 0x81, pipe 0x8680
usb 1-1.1.2: Set up 4 URBS, ret=0
usb 1-1.1.2: Starting data EP 0x1 (running 0)
usb 1-1.1.2: 12 URBs submitted for EP 0x1
usb 1-1.1.2: Starting sync EP 0x81 (running 0)
usb 1-1.1.2: 4 URBs submitted for EP 0x81
usb 1-1.1.2: 1:1 Start Playback PCM


> --- a/include/sound/pcm.h
> +++ b/include/sound/pcm.h
> @@ -229,7 +229,7 @@ typedef int (*snd_pcm_hw_rule_func_t)(struct 
> snd_pcm_hw_params *params,
>  struct snd_pcm_hw_rule {
>   unsigned int cond;
>   int var;
> - int deps[4];
> + int deps[5];
>  
>   snd_pcm_hw_rule_func_t func;
>   void *private;

-- 
Jamie Heilman http://audible.transient.net/~jamie/


Re: bisected regression in v5.11-rc1 snd-usb-audio

2021-01-23 Thread Jamie Heilman
Takashi Iwai wrote:
> On Fri, 22 Jan 2021 11:03:48 +0100,
> Jamie Heilman wrote:
> > 
> > Takashi Iwai wrote:
> > > You seem hitting a firmware bug, and it doesn't look like the only
> > > case.  Interestingly, the backport of 5.11 USB-audio stuff on 5.3
> > > kernel on openSUSE Leap 15.2 caused a similar bug on Steinberg device,
> > > while it worked with 5.11-rc.  So I thought this specific with the
> > > older kernel (in USB core changes or such), but my guess seems wrong,
> > > and the bug looks remaining in 5.11 for some cases.
> > > 
> > > Below is the fix patch.  Please give it a try.
> > > 
> > > 
> > > thanks,
> > > 
> > > Takashi
> > > 
> > > -- 8< --
> > > From: Takashi Iwai 
> > > Subject: [PATCH] ALSA: usb-audio: workaround for iface reset issue
> > > 
> > > The recently introduced sample rate validation code seems causing a
> > > problem on some devices; namely, after performing this, the bus gets
> > > screwed and it influences even on other USB devices.
> > > As a quick workaround, perform it only for the necessary devices;
> > > currently MOTU devices are known to need the valid altset checks, so
> > > filter out other devices.
> > > 
> > > Fixes: 93db51d06b32 ("ALSA: usb-audio: Check valid altsetting at parsing 
> > > rates for UAC2/3")
> > > Reported-by: Jamie Heilman 
> > > BugLink: https://bugzilla.suse.com/show_bug.cgi?id=1178203
> > > Signed-off-by: Takashi Iwai 
> > > ---
> > >  sound/usb/format.c | 11 +++
> > >  1 file changed, 11 insertions(+)
> > > 
> > > diff --git a/sound/usb/format.c b/sound/usb/format.c
> > > index 9ebc5d202c87..e6ff317a6785 100644
> > > --- a/sound/usb/format.c
> > > +++ b/sound/usb/format.c
> > > @@ -466,6 +466,17 @@ static int validate_sample_rate_table_v2v3(struct 
> > > snd_usb_audio *chip,
> > >   unsigned int nr_rates;
> > >   int i, err;
> > >  
> > > + /* performing the rate verification may lead to unexpected USB bus
> > > +  * behavior afterwards by some unknown reason.  Do this only for the
> > > +  * known devices.
> > > +  */
> > > + switch (USB_ID_VENDOR(chip->usb_id)) {
> > > + case 0x07fd: /* MOTU */
> > > + break;
> > > + default:
> > > + return 0; /* don't perform the validation as default */
> > > + }
> > > +
> > >   table = kcalloc(fp->nr_rates, sizeof(*table), GFP_KERNEL);
> > >   if (!table)
> > >   return -ENOMEM;
> > 
> > Well, with this patch, while I can get it to seemingly initialize
> > without errors by replugging, I can't actually get it to work.  Jack
> > says:
> > 
> > Starting jack server...
> > JACK server starting in realtime mode with priority 10
> > self-connect-mode is "Don't restrict self connect requests"
> > Acquired audio card Audio1
> > creating alsa driver ... 
> > hw:CARD=hiFace2,DEV=0|-|1024|3|96000|0|0|nomon|swmeter|-|32bit
> > ERROR: ALSA: Cannot open PCM device alsa_pcm for playback. Falling back to 
> > capture-only mode
> > Released audio card Audio1
> > ERROR: Cannot initialize driver
> > ERROR: JackServer::Open failed with -1
> > ERROR: Failed to open server
> 
> Which kernel did you test?  There have been a few more USB-audio fixes
> that landed to Linus tree yesterday, so you'd need to test with it (in
> addition to the previous patch).

Sorry, thought that'd be clear from the alsa-info output, but yes, the
tip of Linus's tree at the time: 5.11.0-rc4-00164-g83d09ad4b950

> > here's what aplay does when it tries:
> > 
> > aplay: main:830: audio open error: Invalid argument
> > 
> > strace of that says:
> > openat(AT_FDCWD, "/dev/snd/pcmC1D0p", O_RDWR|O_NONBLOCK|O_CLOEXEC) = -1 
> > EINVAL (Invalid argument)
> 
> Hm, it's weird that open returns -EINVAL, not the later ioctl...
> 
> > I dug up a different USB sound device (bithead) and plugged it in, it
> > still works just fine with the same commands.
> > 
> > The only difference I see in the alsa-info output is that
> > 
> > control.7 {
> > iface CARD
> > name 'Keep Interface'
> > value false
> > comment {
> > access 'read write'
> > type BOOLEAN
> > count 1
> >   

Re: [Nouveau] nouveau regression post v5.8, still present in v5.10

2021-01-22 Thread Jamie Heilman
Jamie Heilman wrote:
> Karol Herbst wrote:
> > fyi, there is a patch which solves a maybe related issue on your GPU,
> > mind giving it a try before we dig further?
> > https://gitlab.freedesktop.org/drm/nouveau/-/issues/14#note_767791
> 
> So, I tried that.  Turns out, I can still trigger a problem.  Is it
> the same problem?  Maybe?  I also tried applying the patch from 
> 
> ca386aa7155a ("drm/nouveau/kms/nv50-gp1xx: add WAR for EVO push buffer HW 
> bug")
> to 5.8.0-rc6-01516-g0a96099691c8 and very interestingly, it changed
> the mode of failure to same thing I saw with 5.10.9 patched with the patch
> from that bug report.  In both cases I get this in the log:
> 
> kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16
> kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16
> kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16
> kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16
> ...
> and so on
> 
> In one incident my monitor would't even wake up anymore after this.
> 
> 
> I'm trying to repo it now on an unpatched 5.8.0-rc6-01515-gae09163ac27c
> right now, as running glxgears does seem to help reproduce problems
> faster which is nice, I'm just not entirely sure it's the same set of
> problems; hopefully that version is free from issues, but we'll
> see...

Ugh, well I can crash 5.8.0-rc6-01515-gae09163ac27c and 5.8.18 in
basically the same way running glxgears and a xset dpms force off
loop.  So I'm starting to think it's not the same thing, and that
problem has been latent from before I started having periodic issues.

I should note that my exact testing technique for the above was to run
4 copies of glxgears and the xset force dpms off loop at the same
time.  Really looks more like it triggers a resource starvation issue
maybe.  The crash is also worse, particularly if I don't do anything
about it right away as my workstation eventually falls off the network
and I'm forced to power cycle it; the crashes I was chasing after
wouldn't do quite that much violence, normally I could still log in,
rebuild a kernel, and shut things down cleanly.

More than one bug here I suspect.

> > On Thu, Jan 21, 2021 at 3:33 AM Jamie Heilman
> >  wrote:
> > >
> > > Karol Herbst wrote:
> > > > On Wed, Jan 6, 2021 at 4:25 AM Jamie Heilman
> > > >  wrote:
> > > > >
> > > > > Jamie Heilman wrote:
> > > > > > Jamie Heilman wrote:
> > > > > > > Karol Herbst wrote:
> > > > > > > > do you think you'd be able to do a kernel bisect in order to 
> > > > > > > > pinpoint
> > > > > > > > the actual commit causing it? Thanks
> > > > > > >
> > > > > > > No.  I can't reproduce it reliably.  I if I could, bisection 
> > > > > > > wouldn't
> > > > > > > be a problem but as I can't and as it can take weeks for the 
> > > > > > > problem
> > > > > > > to occur there's essentially no chance.  I know it regressed 
> > > > > > > roughly
> > > > > > > in 5.8-rc1 only because that's what I was running when the first 
> > > > > > > event
> > > > > > > occured.
> > > > > >
> > > > > > er, 5.9.0-rc1 rather
> > > > >
> > > > > Actually ... I've found a way to reproduce this in hours intead of
> > > > > weeks, so I think I may be able to bisect it after all, it's something
> > > > > of a brute force approach and its probably doing horrible things to
> > > > > the backlight in my poor old monitor, but just running this:
> > > > >
> > > > > #!/bin/sh
> > > > > sleep 5
> > > > > while ! dmesg | tail | grep -q nouveau
> > > > > do
> > > > > xset dpms force off
> > > > > sleep 65
> > > > > xdotool mousemove 1024 1024 mousemove restore
> > > > > sleep 10
> > > > > done
> > > > >
> > > > > Does manage to trip the issue sooner than it would otherwise happen
> > > > > with natural usage.  Given that this is my primary workstation and I
> > > > > sort of need it functional during waking hours, it'll take me a bit,
> > > > > but I'll update folks when I have the error more dialed in.
> > > > >
> > > >
> > > > huh interesting. Kind of feels like a random thing still. But I think
> > > > in general you'd spend way 

Re: [Nouveau] nouveau regression post v5.8, still present in v5.10

2021-01-22 Thread Jamie Heilman
Karol Herbst wrote:
> fyi, there is a patch which solves a maybe related issue on your GPU,
> mind giving it a try before we dig further?
> https://gitlab.freedesktop.org/drm/nouveau/-/issues/14#note_767791

So, I tried that.  Turns out, I can still trigger a problem.  Is it
the same problem?  Maybe?  I also tried applying the patch from 

ca386aa7155a ("drm/nouveau/kms/nv50-gp1xx: add WAR for EVO push buffer HW bug")
to 5.8.0-rc6-01516-g0a96099691c8 and very interestingly, it changed
the mode of failure to same thing I saw with 5.10.9 patched with the patch
from that bug report.  In both cases I get this in the log:

kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16
kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16
kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16
kern.err: nouveau :01:00.0: Xorg[2243]: nv50cal_space: -16
...
and so on

In one incident my monitor would't even wake up anymore after this.


I'm trying to repo it now on an unpatched 5.8.0-rc6-01515-gae09163ac27c
right now, as running glxgears does seem to help reproduce problems
faster which is nice, I'm just not entirely sure it's the same set of
problems; hopefully that version is free from issues, but we'll
see...



> On Thu, Jan 21, 2021 at 3:33 AM Jamie Heilman
>  wrote:
> >
> > Karol Herbst wrote:
> > > On Wed, Jan 6, 2021 at 4:25 AM Jamie Heilman
> > >  wrote:
> > > >
> > > > Jamie Heilman wrote:
> > > > > Jamie Heilman wrote:
> > > > > > Karol Herbst wrote:
> > > > > > > do you think you'd be able to do a kernel bisect in order to 
> > > > > > > pinpoint
> > > > > > > the actual commit causing it? Thanks
> > > > > >
> > > > > > No.  I can't reproduce it reliably.  I if I could, bisection 
> > > > > > wouldn't
> > > > > > be a problem but as I can't and as it can take weeks for the problem
> > > > > > to occur there's essentially no chance.  I know it regressed roughly
> > > > > > in 5.8-rc1 only because that's what I was running when the first 
> > > > > > event
> > > > > > occured.
> > > > >
> > > > > er, 5.9.0-rc1 rather
> > > >
> > > > Actually ... I've found a way to reproduce this in hours intead of
> > > > weeks, so I think I may be able to bisect it after all, it's something
> > > > of a brute force approach and its probably doing horrible things to
> > > > the backlight in my poor old monitor, but just running this:
> > > >
> > > > #!/bin/sh
> > > > sleep 5
> > > > while ! dmesg | tail | grep -q nouveau
> > > > do
> > > > xset dpms force off
> > > > sleep 65
> > > > xdotool mousemove 1024 1024 mousemove restore
> > > > sleep 10
> > > > done
> > > >
> > > > Does manage to trip the issue sooner than it would otherwise happen
> > > > with natural usage.  Given that this is my primary workstation and I
> > > > sort of need it functional during waking hours, it'll take me a bit,
> > > > but I'll update folks when I have the error more dialed in.
> > > >
> > >
> > > huh interesting. Kind of feels like a random thing still. But I think
> > > in general you'd spend way too much time on this if you can't
> > > reproduce within seconds/minutes and then it might not point out the
> > > actual issue, because randomly the issue didn't appear and stuff.
> > >
> > > maybe you can tune it to have shorter pauses or something? I'd really
> > > try to bring down the time per cycle.
> >
> > Well I'm confident enough, at this point, to say this bisects to
> > 0a96099691c8 ("drm/nouveau/kms/nv50-: implement proper push buffer control 
> > logic")
> >
> > Now... I wish I could say the bisection was straightforward and
> > simple, but it wasn't thanks to still not having a reproducer really
> > dialed in.  The above script doesn't work unless I've got some normal
> > usage around it.  It certainly triggers the issue sooner than it
> > otherwise would, but by itself it isn't enough.  I modified it
> > somewhat in the hopes of capturing the rough idea of how many
> > itterations it would take to trigger the problem by using:
> >
> > #!/bin/sh
> > I=0
> > trap 'echo;echo $I' 0
> > trap 'exit' INT
> > sleep 5
> > while ! dmesg | tail | grep -q nouveau
> > do
> > I=$(($I + 1))
> > xse

Re: bisected regression in v5.11-rc1 snd-usb-audio

2021-01-22 Thread Jamie Heilman
Takashi Iwai wrote:
> You seem hitting a firmware bug, and it doesn't look like the only
> case.  Interestingly, the backport of 5.11 USB-audio stuff on 5.3
> kernel on openSUSE Leap 15.2 caused a similar bug on Steinberg device,
> while it worked with 5.11-rc.  So I thought this specific with the
> older kernel (in USB core changes or such), but my guess seems wrong,
> and the bug looks remaining in 5.11 for some cases.
> 
> Below is the fix patch.  Please give it a try.
> 
> 
> thanks,
> 
> Takashi
> 
> -- 8< --
> From: Takashi Iwai 
> Subject: [PATCH] ALSA: usb-audio: workaround for iface reset issue
> 
> The recently introduced sample rate validation code seems causing a
> problem on some devices; namely, after performing this, the bus gets
> screwed and it influences even on other USB devices.
> As a quick workaround, perform it only for the necessary devices;
> currently MOTU devices are known to need the valid altset checks, so
> filter out other devices.
> 
> Fixes: 93db51d06b32 ("ALSA: usb-audio: Check valid altsetting at parsing 
> rates for UAC2/3")
> Reported-by: Jamie Heilman 
> BugLink: https://bugzilla.suse.com/show_bug.cgi?id=1178203
> Signed-off-by: Takashi Iwai 
> ---
>  sound/usb/format.c | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/sound/usb/format.c b/sound/usb/format.c
> index 9ebc5d202c87..e6ff317a6785 100644
> --- a/sound/usb/format.c
> +++ b/sound/usb/format.c
> @@ -466,6 +466,17 @@ static int validate_sample_rate_table_v2v3(struct 
> snd_usb_audio *chip,
>   unsigned int nr_rates;
>   int i, err;
>  
> + /* performing the rate verification may lead to unexpected USB bus
> +  * behavior afterwards by some unknown reason.  Do this only for the
> +  * known devices.
> +  */
> + switch (USB_ID_VENDOR(chip->usb_id)) {
> + case 0x07fd: /* MOTU */
> + break;
> + default:
> + return 0; /* don't perform the validation as default */
> + }
> +
>   table = kcalloc(fp->nr_rates, sizeof(*table), GFP_KERNEL);
>   if (!table)
>   return -ENOMEM;

Well, with this patch, while I can get it to seemingly initialize
without errors by replugging, I can't actually get it to work.  Jack
says:

Starting jack server...
JACK server starting in realtime mode with priority 10
self-connect-mode is "Don't restrict self connect requests"
Acquired audio card Audio1
creating alsa driver ... 
hw:CARD=hiFace2,DEV=0|-|1024|3|96000|0|0|nomon|swmeter|-|32bit
ERROR: ALSA: Cannot open PCM device alsa_pcm for playback. Falling back to 
capture-only mode
Released audio card Audio1
ERROR: Cannot initialize driver
ERROR: JackServer::Open failed with -1
ERROR: Failed to open server

here's what aplay does when it tries:

aplay: main:830: audio open error: Invalid argument

strace of that says:
openat(AT_FDCWD, "/dev/snd/pcmC1D0p", O_RDWR|O_NONBLOCK|O_CLOEXEC) = -1 EINVAL 
(Invalid argument)

I dug up a different USB sound device (bithead) and plugged it in, it
still works just fine with the same commands.

The only difference I see in the alsa-info output is that

control.7 {
iface CARD
name 'Keep Interface'
value false
comment {
access 'read write'
type BOOLEAN
count 1
}
}

has gone missing.  I've attached it incase you can spot something I
didn't (obviously the cards are in a different order which makes
things somewhat tedious for a direct comparison).  Apart from the
device number on the usb bus, the lsusb -vvv output is the same as
it was in with 5.10.9.

-- 
Jamie Heilman http://audible.transient.net/~jamie/
upload=true=true=
!!
!!ALSA Information Script v 0.4.65
!!

!!Script ran on: Fri Jan 22 09:51:41 UTC 2021


!!Linux Distribution
!!--

Debian GNU/Linux bullseye/sid \n \l PRETTY_NAME="Debian GNU/Linux bullseye/sid" 
NAME="Debian GNU/Linux" ID=debian HOME_URL="https://www.debian.org/; 
SUPPORT_URL="https://www.debian.org/support; 
BUG_REPORT_URL="https://bugs.debian.org/;


!!DMI Information
!!---

Manufacturer:  Dell Inc.
Product Name:  Precision WorkStation T3400  
Product Version:   
Firmware Version:  A14
System SKU:
Board Vendor:  Dell Inc.
Board Name:0TP412


!!ACPI Device Status Information
!!---

/sys/bus/acpi/devices/PNP0401:00/status  15
/sys/bus/acpi/devices/PNP0501:00/status  15
/sys/bus/acpi/devices/PNP0700:00/status  15
/sys/bus/acpi/devices/PNP0C0F:00/status  9
/sys/bus/acpi/devices/PNP0C0F:01/status   

bisected regression in v5.11-rc1 snd-usb-audio

2021-01-22 Thread Jamie Heilman
I've bisected a failure in support for the M2Tech USB HiFace Two
192kHz Digital Audio Interface device (read as: a reclocked USB
S/PDIF) to the following ...

commit 93db51d06b32227319dae2ac289029ccf1b33181
Author: Takashi Iwai 
Date:   Mon Nov 23 09:53:09 2020 +0100

ALSA: usb-audio: Check valid altsetting at parsing rates for UAC2/3

This has always been a somewhat finicky device, but my life is a
silent void without it, as it is currently a vital part of getting
audio to my mixer (now, if I could get USB Audio on my Rane MP2015
actually working right I'd very happily take that approach instead[1],
but I digress).  It has been known to require replugging to initialize
properly at times, but until now, it's always worked fine eventually.

I've attached the dmesg from a working 5.10.9 kernel, the alsa-info
output, and the lsusb -vvv output for this device when it's
functioning.  (Note, that alsa-info claims jack isn't running...
that's not actually true though, maybe because I'm using jack 2's
jackdbus subsystem, but jack is infact running, though it's probably
not relevant to the issue at hand[2].) 

When I boot 93db51d06b32 or later I get lot of errors in the dmesg
like:

 usb 1-1.1.2: New USB device found, idVendor=249c, idProduct=930b, bcdDevice= 
2.13
 usb 1-1.1.2: New USB device strings: Mfr=1, Product=2, SerialNumber=3
 usb 1-1.1.2: Product: M2Tech USB Audio 2.0
 usb 1-1.1.2: Manufacturer: M2Tech 
 usb 1-1.1.2: SerialNumber: 
 usb 1-1.1.2: cannot get ctl value: req = 0x83, wValue = 0x201, wIndex = 0xa00, 
type = 4
 usb 1-1.1.2: 10:0: cannot get min/max values for control 2 (id 10)
 usb 1-1.1.2: cannot get ctl value: req = 0x83, wValue = 0x200, wIndex = 0xa00, 
type = 4
 usb 1-1.1.2: 10:0: cannot get min/max values for control 2 (id 10)
 usbcore: registered new interface driver snd-usb-audio
 usb 1-1.1-port2: disabled by hub (EMI?), re-enabling...
 usb 1-1.1.2: USB disconnect, device number 6
 usb 1-1.1.2: new full-speed USB device number 7 using ehci-pci
 usb 1-1.1.2: device descriptor read/64, error -32
 usb 1-1.1.2: device descriptor read/64, error -32
 usb 1-1.1.2: new full-speed USB device number 8 using ehci-pci
 usb 1-1.1.2: device descriptor read/64, error -32
 usb 1-1.1.2: device descriptor read/64, error -32
 usb 1-1.1-port2: attempt power cycle
 usb 1-1.1.2: new full-speed USB device number 9 using ehci-pci
 usb 1-1.1.2: device not accepting address 9, error -32
 usb 1-1.1.2: new full-speed USB device number 10 using ehci-pci
 usb 1-1.1.2: device not accepting address 10, error -32
 usb 1-1.1-port2: unable to enumerate USB device

and obviously the device doesn't work at all.  Sometimes there's more
"cannot get ctl value:" noise than other times, but the above is a
clean boot after being in a working state (it tends to require
replugging to get back to a being usable again after this, once I've
booted a working kernel, possibly becuase its hanging off a hub in my
monitor, not the most elegant of setups, I know---but none of this
changes even if I plug it directly into my workstation's USB ports, I
tried that).

I'm happy to try any patches, or provide more details, just ask.

-- 
Jamie Heilman http://audible.transient.net/~jamie/

[1] it's never been quite right, if you ever want to try to figure out
why, I'd be happy to tackle that separately

[2] just in case, jack_control dp output:
--- get driver parameters (type:isset:default:value)
  device: ALSA device name (str:set:hw:0:hw:CARD=hiFace2,DEV=0)
 capture: Provide capture ports.  Optionally set device 
(str:notset:none:none)
playback: Provide playback ports.  Optionally set device 
(str:set:none:hw:CARD=hiFace2,DEV=0)
rate: Sample rate (uint:set:48000:96000)
  period: Frames per period (uint:notset:1024:1024)
nperiods: Number of periods of playback latency (uint:set:2:3)
   hwmon: Hardware monitoring, if available 
(bool:notset:False:False)
 hwmeter: Hardware metering, if available (bool:notset:False:False)
  duplex: Provide both capture and playback ports 
(bool:notset:True:True)
softmode: Soft-mode, no xrun handling (bool:notset:False:False)
 monitor: Provide monitor ports for the output 
(bool:notset:False:False)
  dither: Dithering mode (char:notset:n:n)
  inchannels: Number of capture channels (defaults to hardware max) 
(uint:notset:0:0)
 outchannels: Number of playback channels (defaults to hardware max) 
(uint:notset:0:0)
  shorts: Try 16-bit samples before 32-bit (bool:notset:False:False)
   input-latency: Extra input latency (frames) (uint:notset:0:0)
  output-latency: Extra output latency (frames) (uint:notset:0:0)
 midi-driver: ALSA MIDI driver (str:notset:none:none)

[0.00] Linux version 5.10.9 (jamie@cucamonga) (gcc (Debian 10.2.1-3) 
10.2.1 20201224, GNU ld (GNU Binutils for Debi

Re: [Nouveau] nouveau regression post v5.8, still present in v5.10

2021-01-20 Thread Jamie Heilman
Karol Herbst wrote:
> On Wed, Jan 6, 2021 at 4:25 AM Jamie Heilman
>  wrote:
> >
> > Jamie Heilman wrote:
> > > Jamie Heilman wrote:
> > > > Karol Herbst wrote:
> > > > > do you think you'd be able to do a kernel bisect in order to pinpoint
> > > > > the actual commit causing it? Thanks
> > > >
> > > > No.  I can't reproduce it reliably.  I if I could, bisection wouldn't
> > > > be a problem but as I can't and as it can take weeks for the problem
> > > > to occur there's essentially no chance.  I know it regressed roughly
> > > > in 5.8-rc1 only because that's what I was running when the first event
> > > > occured.
> > >
> > > er, 5.9.0-rc1 rather
> >
> > Actually ... I've found a way to reproduce this in hours intead of
> > weeks, so I think I may be able to bisect it after all, it's something
> > of a brute force approach and its probably doing horrible things to
> > the backlight in my poor old monitor, but just running this:
> >
> > #!/bin/sh
> > sleep 5
> > while ! dmesg | tail | grep -q nouveau
> > do
> > xset dpms force off
> > sleep 65
> > xdotool mousemove 1024 1024 mousemove restore
> > sleep 10
> > done
> >
> > Does manage to trip the issue sooner than it would otherwise happen
> > with natural usage.  Given that this is my primary workstation and I
> > sort of need it functional during waking hours, it'll take me a bit,
> > but I'll update folks when I have the error more dialed in.
> >
> 
> huh interesting. Kind of feels like a random thing still. But I think
> in general you'd spend way too much time on this if you can't
> reproduce within seconds/minutes and then it might not point out the
> actual issue, because randomly the issue didn't appear and stuff.
> 
> maybe you can tune it to have shorter pauses or something? I'd really
> try to bring down the time per cycle.

Well I'm confident enough, at this point, to say this bisects to
0a96099691c8 ("drm/nouveau/kms/nv50-: implement proper push buffer control 
logic")

Now... I wish I could say the bisection was straightforward and
simple, but it wasn't thanks to still not having a reproducer really
dialed in.  The above script doesn't work unless I've got some normal
usage around it.  It certainly triggers the issue sooner than it
otherwise would, but by itself it isn't enough.  I modified it
somewhat in the hopes of capturing the rough idea of how many
itterations it would take to trigger the problem by using:

#!/bin/sh
I=0
trap 'echo;echo $I' 0
trap 'exit' INT
sleep 5
while ! dmesg | tail | grep -q nouveau
do
I=$(($I + 1))
xset dpms force off
sleep 32
xdotool mousemove 1 12 mousemove restore
sleep 28
done

but ultimately that didn't really pan out the way I'd hoped.  I don't
think the itterations have all that much to do with the condition in
the end.  I wanted to try applying ca386aa7155a
("drm/nouveau/kms/nv50-gp1xx: add WAR for EVO push buffer HW bug") on
top of the first bad commit becuase I sort of felt like when I was
running -rc versions that things got a bit less chaotic after that
commit landed, but it was just a gut feeling and I wanted to see if I
could support it with metrics---but no, I can't really get consistent
metrics even without that commit so I gave up, and decided to report
what I've got so far.



> > I'm using git bisect start -- drivers/gpu/drm include/drm include/video
> > in an effort to make this go a bit quicker, let me know if you think
> > that's a bad idea or I should add other paths.
> >
> > > > > On Sun, Dec 27, 2020 at 8:16 PM Jamie Heilman
> > > > >  wrote:
> > > > > >
> > > > > > Something between v5.8 and v5.9 has resulted in periodically losing 
> > > > > > video.
> > > > > > Unfortunately, I can't reliably reproduce it, it seems to happen 
> > > > > > every
> > > > > > once in a long while---I can go weeks without an occurance, but it
> > > > > > always seems to happen after my workstation has been idle long 
> > > > > > enough
> > > > > > to screen blank and put the monitor to sleep.  I'm using a single
> > > > > > display (Dell 2405FPW) connected via DVI, running X (Xorg 1.20.x 
> > > > > > from
> > > > > > Debian sid).  I don't really do anything fancy, xterms, a browser or
> > > > > > two, play the occasional video, but like I said, I can't reliably
> > > > > > reproduce this.  I've had it happen about 11 times since August.
> >

Re: [Nouveau] nouveau regression post v5.8, still present in v5.10

2021-01-05 Thread Jamie Heilman
Jamie Heilman wrote:
> Jamie Heilman wrote:
> > Karol Herbst wrote:
> > > do you think you'd be able to do a kernel bisect in order to pinpoint
> > > the actual commit causing it? Thanks
> > 
> > No.  I can't reproduce it reliably.  I if I could, bisection wouldn't
> > be a problem but as I can't and as it can take weeks for the problem
> > to occur there's essentially no chance.  I know it regressed roughly
> > in 5.8-rc1 only because that's what I was running when the first event
> > occured.
> 
> er, 5.9.0-rc1 rather

Actually ... I've found a way to reproduce this in hours intead of
weeks, so I think I may be able to bisect it after all, it's something
of a brute force approach and its probably doing horrible things to
the backlight in my poor old monitor, but just running this:

#!/bin/sh
sleep 5
while ! dmesg | tail | grep -q nouveau
do
xset dpms force off
sleep 65
xdotool mousemove 1024 1024 mousemove restore
sleep 10
done

Does manage to trip the issue sooner than it would otherwise happen
with natural usage.  Given that this is my primary workstation and I
sort of need it functional during waking hours, it'll take me a bit,
but I'll update folks when I have the error more dialed in.

I'm using git bisect start -- drivers/gpu/drm include/drm include/video
in an effort to make this go a bit quicker, let me know if you think
that's a bad idea or I should add other paths.

> > > On Sun, Dec 27, 2020 at 8:16 PM Jamie Heilman
> > >  wrote:
> > > >
> > > > Something between v5.8 and v5.9 has resulted in periodically losing 
> > > > video.
> > > > Unfortunately, I can't reliably reproduce it, it seems to happen every
> > > > once in a long while---I can go weeks without an occurance, but it
> > > > always seems to happen after my workstation has been idle long enough
> > > > to screen blank and put the monitor to sleep.  I'm using a single
> > > > display (Dell 2405FPW) connected via DVI, running X (Xorg 1.20.x from
> > > > Debian sid).  I don't really do anything fancy, xterms, a browser or
> > > > two, play the occasional video, but like I said, I can't reliably
> > > > reproduce this.  I've had it happen about 11 times since August.
> > > >
> > > > lspci -vv output is:
> > > >
> > > > 01:00.0 VGA compatible controller: NVIDIA Corporation G86 [Quadro NVS 
> > > > 290] (rev a1) (prog-if 00 [VGA controller])
> > > > Subsystem: NVIDIA Corporation G86 [Quadro NVS 290]
> > > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- 
> > > > ParErr- Stepping- SERR- FastB2B- DisINTx+
> > > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
> > > > SERR-  > > > Latency: 0, Cache Line Size: 64 bytes
> > > > Interrupt: pin A routed to IRQ 28
> > > > Region 0: Memory at fc00 (32-bit, non-prefetchable) 
> > > > [size=16M]
> > > > Region 1: Memory at d000 (64-bit, prefetchable) [size=256M]
> > > > Region 3: Memory at fa00 (64-bit, non-prefetchable) 
> > > > [size=32M]
> > > > Region 5: I/O ports at dc80 [size=128]
> > > > Expansion ROM at 000c [disabled] [size=128K]
> > > > Capabilities: [60] Power Management version 2
> > > > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA 
> > > > PME(D0-,D1-,D2-,D3hot-,D3cold-)
> > > > Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
> > > > Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
> > > > Address: fee01004  Data: 4023
> > > > Capabilities: [78] Express (v1) Endpoint, MSI 00
> > > > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s 
> > > > <512ns, L1 <4us
> > > > ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- 
> > > > SlotPowerLimit 25.000W
> > > > DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq-
> > > > RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
> > > > MaxPayload 128 bytes, MaxReadReq 512 bytes
> > > > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- 
> > > > AuxPwr- TransPend-
> > > > LnkCap: Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, 
> > > > Exit Latency L0s <512ns, L1 <4us
> > > > ClockPM- Surprise- L

Re: [Nouveau] nouveau regression post v5.8, still present in v5.10

2021-01-04 Thread Jamie Heilman
Jamie Heilman wrote:
> Karol Herbst wrote:
> > do you think you'd be able to do a kernel bisect in order to pinpoint
> > the actual commit causing it? Thanks
> 
> No.  I can't reproduce it reliably.  I if I could, bisection wouldn't
> be a problem but as I can't and as it can take weeks for the problem
> to occur there's essentially no chance.  I know it regressed roughly
> in 5.8-rc1 only because that's what I was running when the first event
> occured.

er, 5.9.0-rc1 rather

> > On Sun, Dec 27, 2020 at 8:16 PM Jamie Heilman
> >  wrote:
> > >
> > > Something between v5.8 and v5.9 has resulted in periodically losing video.
> > > Unfortunately, I can't reliably reproduce it, it seems to happen every
> > > once in a long while---I can go weeks without an occurance, but it
> > > always seems to happen after my workstation has been idle long enough
> > > to screen blank and put the monitor to sleep.  I'm using a single
> > > display (Dell 2405FPW) connected via DVI, running X (Xorg 1.20.x from
> > > Debian sid).  I don't really do anything fancy, xterms, a browser or
> > > two, play the occasional video, but like I said, I can't reliably
> > > reproduce this.  I've had it happen about 11 times since August.
> > >
> > > lspci -vv output is:
> > >
> > > 01:00.0 VGA compatible controller: NVIDIA Corporation G86 [Quadro NVS 
> > > 290] (rev a1) (prog-if 00 [VGA controller])
> > > Subsystem: NVIDIA Corporation G86 [Quadro NVS 290]
> > > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- 
> > > ParErr- Stepping- SERR- FastB2B- DisINTx+
> > > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
> > > SERR-  > > Latency: 0, Cache Line Size: 64 bytes
> > > Interrupt: pin A routed to IRQ 28
> > > Region 0: Memory at fc00 (32-bit, non-prefetchable) [size=16M]
> > > Region 1: Memory at d000 (64-bit, prefetchable) [size=256M]
> > > Region 3: Memory at fa00 (64-bit, non-prefetchable) [size=32M]
> > > Region 5: I/O ports at dc80 [size=128]
> > > Expansion ROM at 000c [disabled] [size=128K]
> > > Capabilities: [60] Power Management version 2
> > > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA 
> > > PME(D0-,D1-,D2-,D3hot-,D3cold-)
> > > Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
> > > Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
> > > Address: fee01004  Data: 4023
> > > Capabilities: [78] Express (v1) Endpoint, MSI 00
> > > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s 
> > > <512ns, L1 <4us
> > > ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- 
> > > SlotPowerLimit 25.000W
> > > DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq-
> > > RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
> > > MaxPayload 128 bytes, MaxReadReq 512 bytes
> > > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- 
> > > TransPend-
> > > LnkCap: Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, 
> > > Exit Latency L0s <512ns, L1 <4us
> > > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
> > > LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
> > > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > > LnkSta: Speed 2.5GT/s (ok), Width x16 (ok)
> > > TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > > Capabilities: [100 v1] Virtual Channel
> > > Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
> > > Arb:Fixed- WRR32- WRR64- WRR128-
> > > Ctrl:   ArbSelect=Fixed
> > > Status: InProgress-
> > > VC0:Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
> > > Arb:Fixed- WRR32- WRR64- WRR128- TWRR128- 
> > > WRR256-
> > > Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
> > > Status: NegoPending- InProgress-
> > > Capabilities: [128 v1] Power Budgeting 
> > > Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 
> > > Len=024 
> > > Kernel driver in use: nouveau
> > &g

Re: [Nouveau] nouveau regression post v5.8, still present in v5.10

2021-01-04 Thread Jamie Heilman
Karol Herbst wrote:
> do you think you'd be able to do a kernel bisect in order to pinpoint
> the actual commit causing it? Thanks

No.  I can't reproduce it reliably.  I if I could, bisection wouldn't
be a problem but as I can't and as it can take weeks for the problem
to occur there's essentially no chance.  I know it regressed roughly
in 5.8-rc1 only because that's what I was running when the first event
occured.

> On Sun, Dec 27, 2020 at 8:16 PM Jamie Heilman
>  wrote:
> >
> > Something between v5.8 and v5.9 has resulted in periodically losing video.
> > Unfortunately, I can't reliably reproduce it, it seems to happen every
> > once in a long while---I can go weeks without an occurance, but it
> > always seems to happen after my workstation has been idle long enough
> > to screen blank and put the monitor to sleep.  I'm using a single
> > display (Dell 2405FPW) connected via DVI, running X (Xorg 1.20.x from
> > Debian sid).  I don't really do anything fancy, xterms, a browser or
> > two, play the occasional video, but like I said, I can't reliably
> > reproduce this.  I've had it happen about 11 times since August.
> >
> > lspci -vv output is:
> >
> > 01:00.0 VGA compatible controller: NVIDIA Corporation G86 [Quadro NVS 290] 
> > (rev a1) (prog-if 00 [VGA controller])
> > Subsystem: NVIDIA Corporation G86 [Quadro NVS 290]
> > Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- 
> > Stepping- SERR- FastB2B- DisINTx+
> > Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
> > SERR-  > Latency: 0, Cache Line Size: 64 bytes
> > Interrupt: pin A routed to IRQ 28
> > Region 0: Memory at fc00 (32-bit, non-prefetchable) [size=16M]
> > Region 1: Memory at d000 (64-bit, prefetchable) [size=256M]
> > Region 3: Memory at fa00 (64-bit, non-prefetchable) [size=32M]
> > Region 5: I/O ports at dc80 [size=128]
> > Expansion ROM at 000c [disabled] [size=128K]
> > Capabilities: [60] Power Management version 2
> > Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA 
> > PME(D0-,D1-,D2-,D3hot-,D3cold-)
> > Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
> > Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
> > Address: fee01004  Data: 4023
> > Capabilities: [78] Express (v1) Endpoint, MSI 00
> > DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s 
> > <512ns, L1 <4us
> > ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset- 
> > SlotPowerLimit 25.000W
> > DevCtl: CorrErr- NonFatalErr+ FatalErr+ UnsupReq-
> > RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
> > MaxPayload 128 bytes, MaxReadReq 512 bytes
> > DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- 
> > TransPend-
> > LnkCap: Port #0, Speed 2.5GT/s, Width x16, ASPM L0s L1, 
> > Exit Latency L0s <512ns, L1 <4us
> > ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
> > LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
> > ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > LnkSta: Speed 2.5GT/s (ok), Width x16 (ok)
> > TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> > Capabilities: [100 v1] Virtual Channel
> > Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
> > Arb:Fixed- WRR32- WRR64- WRR128-
> > Ctrl:   ArbSelect=Fixed
> > Status: InProgress-
> > VC0:Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
> > Arb:Fixed- WRR32- WRR64- WRR128- TWRR128- 
> > WRR256-
> > Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=01
> > Status: NegoPending- InProgress-
> > Capabilities: [128 v1] Power Budgeting 
> > Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 
> > Len=024 
> > Kernel driver in use: nouveau
> >
> > The last time this happened, this is what got logged:
> >
> > nouveau :01:00.0: disp: ERROR 5 [INVALID_STATE] 06 [] chid 1 mthd 0080 
> > data 0001
> > nouveau :01:00.0: disp: Base 1:
> > nouveau :01:00.0: disp:0084: 
> > nouveau :01:00.0: disp:0088: 
> > nouveau :01:00.0: disp:008c: 
> > 

nouveau regression post v5.8, still present in v5.10

2020-12-24 Thread Jamie Heilman
sp:080c: 7804  
nouveau :01:00.0: disp:0810: cf00  
nouveau :01:00.0: disp: Base 1 - Image 1:
nouveau :01:00.0: disp:0c00: 9500  
nouveau :01:00.0: disp:0c04:   
nouveau :01:00.0: disp:0c08: 04b00780  
nouveau :01:00.0: disp:0c0c: 7804  
nouveau :01:00.0: disp:0c10: cf00  
nouveau :01:00.0: DRM: core notifier timeout
nouveau :01:00.0: DRM: base-0: timeout

I've got logs of all of this, if they help I can collect them.  The
timeout message are consistent the error messages a little less so.

If there's more debugging I can do when this happens, I'd love to know
what it is.

kernel config: http://audible.transient.net/~jamie/k/nouveau.config-5.10.0
dmesg at boot: http://audible.transient.net/~jamie/k/nouveau.dmesg

-- 
Jamie Heilman http://audible.transient.net/~jamie/


Re: [PATCH] kvm: x86/mmu: Fix is_tdp_mmu_check when using PAE

2020-11-11 Thread Jamie Heilman
Ben Gardon wrote:
> When PAE is in use, the root_hpa will not have a shadow page assoicated
> with it. In this case the kernel will crash with a NULL pointer
> dereference. Add checks to ensure is_tdp_mmu_root works as intended even
> when using PAE.

This seems to work in my amd64 case as well.
(https://marc.info/?l=linux-kernel=160494962201032=2)

> Tested: compiles
> 
> Fixes: 02c00b3a2f7e ("kvm: x86/mmu: Allocate and free TDP MMU roots")
> Reported-by: Zdenek Kaspar 
> Signed-off-by: Ben Gardon 
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 10 ++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 27e381c9da6c..13013f4d98ad 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -49,8 +49,18 @@ bool is_tdp_mmu_root(struct kvm *kvm, hpa_t hpa)
>  {
>   struct kvm_mmu_page *sp;
>  
> + if (WARN_ON(!VALID_PAGE(hpa)))
> + return false;
> +
>   sp = to_shadow_page(hpa);
>  
> + /*
> +  * If this VM is being run with PAE, the TDP MMU will not be enabled
> +  * and the root HPA will not have a shadow page associated with it.
> +  */
> + if (!sp)
> + return false;
> +
>   return sp->tdp_mmu_page && sp->root_count;
>  }
>  
> -- 
> 2.29.2.222.g5d2a92d10f8-goog
> 

-- 
Jamie Heilman http://audible.transient.net/~jamie/


null pointer deref in kvm

2020-11-09 Thread Jamie Heilman
or,id=monitor,mode=control -rtc base=utc
-no-shutdown -boot strict=on \
-device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x5 \
-drive file=/dev/S/ruth,if=none,id=drive-virtio-disk0,format=raw,cache=none 
\
-device 
virtio-blk-pci,scsi=off,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
 \
-fsdev local,security_model=mapped,id=fsdev-fs0,path=/var/local/downloads \
-device 
virtio-9p-pci,id=fs0,fsdev=fsdev-fs0,mount_tag=downloads,bus=pci.0,addr=0x3 \
-netdev tap,fd=23,id=hostnet0,vhost=on,vhostfd=24 \
-device 
virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:9a:0c:07,bus=pci.0,addr=0x4 
\
-chardev pty,id=charserial0 -device 
isa-serial,chardev=charserial0,id=serial0 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x7 \
-object rng-random,id=rng0,filename=/dev/random \
-device 
virtio-rng-pci,rng=rng0,max-bytes=1024,period=3000,bus=pci.0,addr=0x2 \
-msg timestamp=on

Let me know if you need anything else.

-- 
Jamie Heilman http://audible.transient.net/~jamie/
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 5.10.0-rc2 Kernel Configuration
#
CONFIG_CC_VERSION_TEXT="gcc (Debian 10.2.0-16) 10.2.0"
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=100200
CONFIG_LD_VERSION=23501
CONFIG_CLANG_VERSION=0
CONFIG_CC_CAN_LINK=y
CONFIG_CC_CAN_LINK_STATIC=y
CONFIG_CC_HAS_ASM_GOTO=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_HAVE_KERNEL_ZSTD=y
# CONFIG_KERNEL_GZIP is not set
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
CONFIG_KERNEL_XZ=y
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
# CONFIG_KERNEL_ZSTD is not set
CONFIG_DEFAULT_INIT=""
CONFIG_DEFAULT_HOSTNAME="cucamonga"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_WATCH_QUEUE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
# CONFIG_USELIB is not set
# CONFIG_AUDIT is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
# end of IRQ subsystem

CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_INIT=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK=y
CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
# CONFIG_NO_HZ is not set
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem

# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_PREEMPTION=y

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
# CONFIG_PSI is not set
# end of CPU/Task time and stats accounting

CONFIG_CPU_ISOLATION=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
CONFIG_PREEMPT_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RCU=y
CONFIG_TASKS_TRACE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# end of RCU Subsystem

CONFIG_IKCONFIG=m
CONFIG_IKCONFIG_PROC=y
# CONFIG_IKHEADERS is not set
CONFIG_LOG_BUF_SHIFT=17
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y

#
# Scheduler features
#
# CONFIG_UCLAMP_TASK is not set
# end of Scheduler features

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_CC_HAS_INT128=y
CONFIG_ARCH_SUPPORTS_INT128=y
CONFIG_CGROUPS=y
# CONFIG_MEMCG is not set
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
# CONFIG_RT_GROUP_SCHED is not set
# CONFIG_CGROUP_PIDS is not set
# CONFIG_CGROUP_RDMA is not set
# CONFIG_CGROUP_FREEZER is not set
# CONFIG_CPUSETS 

Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)

2016-05-03 Thread Jamie Heilman
Tadeusz Struk wrote:
> On 05/03/2016 07:26 PM, Jamie Heilman wrote:
> >>> Alrighty, presumably relevant bits:
> >>> > >
> >>> > > X.509: Cert Issuer: Build time autogenerated kernel key
> >>> > > X.509: Cert Subject: Build time autogenerated kernel key
> >>> > > X.509: Cert Key Algo: rsa
> >>> > > X.509: Cert Valid period: 1461826791-4615426791
> >>> > > X.509: Cert Signature: rsa + sha512
> >>> > > X.509: ==>x509_check_signature()
> >>> > > X.509: ==>x509_get_sig_params()
> >>> > > X.509: <==x509_get_sig_params() = 0
> >>> > > PKEY: ==>public_key_verify_signature()
> >>> > > X.509: Cert Verification: -2
> >> > 
> >> > Hmmm...  Okay, the only ways out of public_key_verify_signature() without
> >> > printing a leaving message are for snprintf() to overrun (which would 
> >> > return
> >> > error -22) or for crypto_alloc_akcipher() to have failed; everything 
> >> > else must
> >> > go through the kleave() at the pr_devel() at the bottom of the function.
> >> > 
> >> > Can you stick:
> >> > 
> >> >  pr_devel("ALGO: %s\n", alg_name);
> >> > 
> >> > immediately before this line:
> >> > 
> >> >  tfm = crypto_alloc_akcipher(alg_name, 0, 0);
> >> > 
> >> > and try it again?
> > PKEY: ALGO: pkcs1pad(rsa,sha512)
> 
> I think the problem is that pkcs1pad template needs CRYPTO_MANAGER, but
> your configuration doesn't enable CRYPTO_MANAGER. Could you try this
> please:
> 
> diff --git a/crypto/Kconfig b/crypto/Kconfig
> index 93a1fdc..1d33beb 100644
> --- a/crypto/Kconfig
> +++ b/crypto/Kconfig
> @@ -96,6 +96,7 @@ config CRYPTO_AKCIPHER
>  config CRYPTO_RSA
>   tristate "RSA algorithm"
>   select CRYPTO_AKCIPHER
> + select CRYPTO_MANAGER
>   select MPILIB
>   select ASN1
>   help

Yep, that does indeed make everything work again.

-- 
Jamie Heilman http://audible.transient.net/~jamie/


Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)

2016-05-03 Thread Jamie Heilman
Tadeusz Struk wrote:
> On 05/03/2016 07:26 PM, Jamie Heilman wrote:
> >>> Alrighty, presumably relevant bits:
> >>> > >
> >>> > > X.509: Cert Issuer: Build time autogenerated kernel key
> >>> > > X.509: Cert Subject: Build time autogenerated kernel key
> >>> > > X.509: Cert Key Algo: rsa
> >>> > > X.509: Cert Valid period: 1461826791-4615426791
> >>> > > X.509: Cert Signature: rsa + sha512
> >>> > > X.509: ==>x509_check_signature()
> >>> > > X.509: ==>x509_get_sig_params()
> >>> > > X.509: <==x509_get_sig_params() = 0
> >>> > > PKEY: ==>public_key_verify_signature()
> >>> > > X.509: Cert Verification: -2
> >> > 
> >> > Hmmm...  Okay, the only ways out of public_key_verify_signature() without
> >> > printing a leaving message are for snprintf() to overrun (which would 
> >> > return
> >> > error -22) or for crypto_alloc_akcipher() to have failed; everything 
> >> > else must
> >> > go through the kleave() at the pr_devel() at the bottom of the function.
> >> > 
> >> > Can you stick:
> >> > 
> >> >  pr_devel("ALGO: %s\n", alg_name);
> >> > 
> >> > immediately before this line:
> >> > 
> >> >  tfm = crypto_alloc_akcipher(alg_name, 0, 0);
> >> > 
> >> > and try it again?
> > PKEY: ALGO: pkcs1pad(rsa,sha512)
> 
> I think the problem is that pkcs1pad template needs CRYPTO_MANAGER, but
> your configuration doesn't enable CRYPTO_MANAGER. Could you try this
> please:
> 
> diff --git a/crypto/Kconfig b/crypto/Kconfig
> index 93a1fdc..1d33beb 100644
> --- a/crypto/Kconfig
> +++ b/crypto/Kconfig
> @@ -96,6 +96,7 @@ config CRYPTO_AKCIPHER
>  config CRYPTO_RSA
>   tristate "RSA algorithm"
>   select CRYPTO_AKCIPHER
> + select CRYPTO_MANAGER
>   select MPILIB
>   select ASN1
>   help

Yep, that does indeed make everything work again.

-- 
Jamie Heilman http://audible.transient.net/~jamie/


Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)

2016-05-03 Thread Jamie Heilman
David Howells wrote:
> (cc'ing Tadeusz as he did the pkcs1 padding function)
> 
> Jamie Heilman <ja...@audible.transient.net> wrote:
> 
> > > > Problem loading in-kernel X.509 certificate (-2)
> > > 
> > > ENOENT?  Hmmm...  The only place that is generated is in the crypto layer.
> > > That suggests missing crypto of some sort.
> > > 
> > > The attached patch enables some debugging in some relevant files if you 
> > > can
> > > try applying it to your kernel.
> > 
> > Alrighty, presumably relevant bits:
> >
> > X.509: Cert Issuer: Build time autogenerated kernel key
> > X.509: Cert Subject: Build time autogenerated kernel key
> > X.509: Cert Key Algo: rsa
> > X.509: Cert Valid period: 1461826791-4615426791
> > X.509: Cert Signature: rsa + sha512
> > X.509: ==>x509_check_signature()
> > X.509: ==>x509_get_sig_params()
> > X.509: <==x509_get_sig_params() = 0
> > PKEY: ==>public_key_verify_signature()
> > X.509: Cert Verification: -2
> 
> Hmmm...  Okay, the only ways out of public_key_verify_signature() without
> printing a leaving message are for snprintf() to overrun (which would return
> error -22) or for crypto_alloc_akcipher() to have failed; everything else must
> go through the kleave() at the pr_devel() at the bottom of the function.
> 
> Can you stick:
> 
>   pr_devel("ALGO: %s\n", alg_name);
> 
> immediately before this line:
> 
>   tfm = crypto_alloc_akcipher(alg_name, 0, 0);
> 
> and try it again?

PKEY: ALGO: pkcs1pad(rsa,sha512)

-- 
Jamie Heilman http://audible.transient.net/~jamie/


Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)

2016-05-03 Thread Jamie Heilman
David Howells wrote:
> (cc'ing Tadeusz as he did the pkcs1 padding function)
> 
> Jamie Heilman  wrote:
> 
> > > > Problem loading in-kernel X.509 certificate (-2)
> > > 
> > > ENOENT?  Hmmm...  The only place that is generated is in the crypto layer.
> > > That suggests missing crypto of some sort.
> > > 
> > > The attached patch enables some debugging in some relevant files if you 
> > > can
> > > try applying it to your kernel.
> > 
> > Alrighty, presumably relevant bits:
> >
> > X.509: Cert Issuer: Build time autogenerated kernel key
> > X.509: Cert Subject: Build time autogenerated kernel key
> > X.509: Cert Key Algo: rsa
> > X.509: Cert Valid period: 1461826791-4615426791
> > X.509: Cert Signature: rsa + sha512
> > X.509: ==>x509_check_signature()
> > X.509: ==>x509_get_sig_params()
> > X.509: <==x509_get_sig_params() = 0
> > PKEY: ==>public_key_verify_signature()
> > X.509: Cert Verification: -2
> 
> Hmmm...  Okay, the only ways out of public_key_verify_signature() without
> printing a leaving message are for snprintf() to overrun (which would return
> error -22) or for crypto_alloc_akcipher() to have failed; everything else must
> go through the kleave() at the pr_devel() at the bottom of the function.
> 
> Can you stick:
> 
>   pr_devel("ALGO: %s\n", alg_name);
> 
> immediately before this line:
> 
>   tfm = crypto_alloc_akcipher(alg_name, 0, 0);
> 
> and try it again?

PKEY: ALGO: pkcs1pad(rsa,sha512)

-- 
Jamie Heilman http://audible.transient.net/~jamie/


Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)

2016-05-03 Thread Jamie Heilman
Tadeusz Struk wrote:
> Could you please check if this one fixes the problem for you:
> https://patchwork.kernel.org/patch/8766361/

That patch is already present by 4.6-rc5 and the problem still exists,
so no, it doesn't.

-- 
Jamie Heilman http://audible.transient.net/~jamie/


Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)

2016-05-03 Thread Jamie Heilman
Tadeusz Struk wrote:
> Could you please check if this one fixes the problem for you:
> https://patchwork.kernel.org/patch/8766361/

That patch is already present by 4.6-rc5 and the problem still exists,
so no, it doesn't.

-- 
Jamie Heilman http://audible.transient.net/~jamie/


Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)

2016-05-03 Thread Jamie Heilman
Herbert Xu wrote:
> On Sat, Apr 30, 2016 at 08:32:48AM +0000, Jamie Heilman wrote:
> > I usually build my kernels to require module signatures and use
> > automatic signing.  As of v4.6-rc1 I'm getting this on boot:
> > 
> > Problem loading in-kernel X.509 certificate (-2)
> > 
> > I bisected that to commit d43de6c780a84def056afaf4fb3e66bdaa1efc00
> > (akcipher: Move the RSA DER encoding check to the crypto layer)
> > 
> > For some reason after this commit my system keyring always ends up
> > empty.  I use the deb-pkg make target.  My kernel config can
> > be found at
> > http://audible.transient.net/~jamie/k/modsign.config-4.6.0-rc5-guest
> > 
> > Let me know if you need anything else.
> 
> David, any ideas on this problem? If we can't get a quick fix
> on this then we'll have to revert.
> 
> Jamie, does reverting this patch by itself resolve the problem?

Haven't tried that, given the nature of the change I just assumed
it would break too much, but I'll give it a shot this evening after
work.

-- 
Jamie Heilman http://audible.transient.net/~jamie/


Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)

2016-05-03 Thread Jamie Heilman
Herbert Xu wrote:
> On Sat, Apr 30, 2016 at 08:32:48AM +0000, Jamie Heilman wrote:
> > I usually build my kernels to require module signatures and use
> > automatic signing.  As of v4.6-rc1 I'm getting this on boot:
> > 
> > Problem loading in-kernel X.509 certificate (-2)
> > 
> > I bisected that to commit d43de6c780a84def056afaf4fb3e66bdaa1efc00
> > (akcipher: Move the RSA DER encoding check to the crypto layer)
> > 
> > For some reason after this commit my system keyring always ends up
> > empty.  I use the deb-pkg make target.  My kernel config can
> > be found at
> > http://audible.transient.net/~jamie/k/modsign.config-4.6.0-rc5-guest
> > 
> > Let me know if you need anything else.
> 
> David, any ideas on this problem? If we can't get a quick fix
> on this then we'll have to revert.
> 
> Jamie, does reverting this patch by itself resolve the problem?

Haven't tried that, given the nature of the change I just assumed
it would break too much, but I'll give it a shot this evening after
work.

-- 
Jamie Heilman http://audible.transient.net/~jamie/


Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)

2016-05-03 Thread Jamie Heilman
David Howells wrote:
> > Problem loading in-kernel X.509 certificate (-2)
> 
> ENOENT?  Hmmm...  The only place that is generated is in the crypto layer.
> That suggests missing crypto of some sort.
> 
> The attached patch enables some debugging in some relevant files if you can
> try applying it to your kernel.

Alrighty, presumably relevant bits:

X.509: Cert Issuer: Build time autogenerated kernel key
X.509: Cert Subject: Build time autogenerated kernel key
X.509: Cert Key Algo: rsa
X.509: Cert Valid period: 1461826791-4615426791
X.509: Cert Signature: rsa + sha512
X.509: ==>x509_check_signature()
X.509: ==>x509_get_sig_params()
X.509: <==x509_get_sig_params() = 0
PKEY: ==>public_key_verify_signature()
X.509: Cert Verification: -2
Problem loading in-kernel X.509 certificate (-2)
...
PKCS7: ==> pkcs7_verify()
PKCS7: ==> pkcs7_verify_one(,1)
PKCS7: ==> pkcs7_digest(,1,sha512)
PKCS7: MsgDigest = [0b f2 1f 7e f0 37 12 e6]
PKCS7: <== pkcs7_digest() = 0
PKCS7: ==> pkcs7_find_key(1)
PKCS7: Sig 1: Issuing X.509 cert not found 
(#008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579)
PKCS7: <== pkcs7_verify() = 0
X.509: Look up: 
"ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579"
X.509: Request for key 
'ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579'
 err -11
PKCS7: ==> pkcs7_verify()
PKCS7: ==> pkcs7_verify_one(,1)
PKCS7: ==> pkcs7_digest(,1,sha512)
PKCS7: MsgDigest = [5b b5 bb 52 28 05 ba 55]
PKCS7: <== pkcs7_digest() = 0
PKCS7: ==> pkcs7_find_key(1)
PKCS7: Sig 1: Issuing X.509 cert not found 
(#008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579)
PKCS7: <== pkcs7_verify() = 0
X.509: Look up: 
"ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579"
X.509: Request for key 
'ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579'
 err -11
PKCS7: ==> pkcs7_verify()
PKCS7: ==> pkcs7_verify_one(,1)
PKCS7: ==> pkcs7_digest(,1,sha512)
PKCS7: MsgDigest = [94 a4 59 31 7f a9 d0 3a]
PKCS7: <== pkcs7_digest() = 0
PKCS7: ==> pkcs7_find_key(1)
PKCS7: Sig 1: Issuing X.509 cert not found 
(#008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579)
PKCS7: <== pkcs7_verify() = 0
X.509: Look up: 
"ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579"
X.509: Request for key 
'ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579'
 err -11

full dmesg at http://audible.transient.net/~jamie/k/modsign.dmesg-debugging


-- 
Jamie Heilman http://audible.transient.net/~jamie/


Re: v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)

2016-05-03 Thread Jamie Heilman
David Howells wrote:
> > Problem loading in-kernel X.509 certificate (-2)
> 
> ENOENT?  Hmmm...  The only place that is generated is in the crypto layer.
> That suggests missing crypto of some sort.
> 
> The attached patch enables some debugging in some relevant files if you can
> try applying it to your kernel.

Alrighty, presumably relevant bits:

X.509: Cert Issuer: Build time autogenerated kernel key
X.509: Cert Subject: Build time autogenerated kernel key
X.509: Cert Key Algo: rsa
X.509: Cert Valid period: 1461826791-4615426791
X.509: Cert Signature: rsa + sha512
X.509: ==>x509_check_signature()
X.509: ==>x509_get_sig_params()
X.509: <==x509_get_sig_params() = 0
PKEY: ==>public_key_verify_signature()
X.509: Cert Verification: -2
Problem loading in-kernel X.509 certificate (-2)
...
PKCS7: ==> pkcs7_verify()
PKCS7: ==> pkcs7_verify_one(,1)
PKCS7: ==> pkcs7_digest(,1,sha512)
PKCS7: MsgDigest = [0b f2 1f 7e f0 37 12 e6]
PKCS7: <== pkcs7_digest() = 0
PKCS7: ==> pkcs7_find_key(1)
PKCS7: Sig 1: Issuing X.509 cert not found 
(#008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579)
PKCS7: <== pkcs7_verify() = 0
X.509: Look up: 
"ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579"
X.509: Request for key 
'ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579'
 err -11
PKCS7: ==> pkcs7_verify()
PKCS7: ==> pkcs7_verify_one(,1)
PKCS7: ==> pkcs7_digest(,1,sha512)
PKCS7: MsgDigest = [5b b5 bb 52 28 05 ba 55]
PKCS7: <== pkcs7_digest() = 0
PKCS7: ==> pkcs7_find_key(1)
PKCS7: Sig 1: Issuing X.509 cert not found 
(#008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579)
PKCS7: <== pkcs7_verify() = 0
X.509: Look up: 
"ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579"
X.509: Request for key 
'ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579'
 err -11
PKCS7: ==> pkcs7_verify()
PKCS7: ==> pkcs7_verify_one(,1)
PKCS7: ==> pkcs7_digest(,1,sha512)
PKCS7: MsgDigest = [94 a4 59 31 7f a9 d0 3a]
PKCS7: <== pkcs7_digest() = 0
PKCS7: ==> pkcs7_find_key(1)
PKCS7: Sig 1: Issuing X.509 cert not found 
(#008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579)
PKCS7: <== pkcs7_verify() = 0
X.509: Look up: 
"ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579"
X.509: Request for key 
'ex:008a32081403f1709a312c302a06035504030c234275696c642074696d65206175746f67656e657261746564206b65726e656c206b6579'
 err -11

full dmesg at http://audible.transient.net/~jamie/k/modsign.dmesg-debugging


-- 
Jamie Heilman http://audible.transient.net/~jamie/


v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)

2016-04-30 Thread Jamie Heilman
I usually build my kernels to require module signatures and use
automatic signing.  As of v4.6-rc1 I'm getting this on boot:

Problem loading in-kernel X.509 certificate (-2)

I bisected that to commit d43de6c780a84def056afaf4fb3e66bdaa1efc00
(akcipher: Move the RSA DER encoding check to the crypto layer)

For some reason after this commit my system keyring always ends up
empty.  I use the deb-pkg make target.  My kernel config can
be found at
http://audible.transient.net/~jamie/k/modsign.config-4.6.0-rc5-guest

Let me know if you need anything else.

-- 
Jamie Heilman http://audible.transient.net/~jamie/


v4.6-rc1 regression bisected, Problem loading in-kernel X.509 certificate (-2)

2016-04-30 Thread Jamie Heilman
I usually build my kernels to require module signatures and use
automatic signing.  As of v4.6-rc1 I'm getting this on boot:

Problem loading in-kernel X.509 certificate (-2)

I bisected that to commit d43de6c780a84def056afaf4fb3e66bdaa1efc00
(akcipher: Move the RSA DER encoding check to the crypto layer)

For some reason after this commit my system keyring always ends up
empty.  I use the deb-pkg make target.  My kernel config can
be found at
http://audible.transient.net/~jamie/k/modsign.config-4.6.0-rc5-guest

Let me know if you need anything else.

-- 
Jamie Heilman http://audible.transient.net/~jamie/


Re: regression bisected; KVM: entry failed, hardware error 0x80000021

2014-12-25 Thread Jamie Heilman
Chen, Tiejun wrote:
> On 2014/12/24 19:02, Jamie Heilman wrote:
> >Chen, Tiejun wrote:
> >>On 2014/12/23 15:26, Jamie Heilman wrote:
> >>>Chen, Tiejun wrote:
> >>>>On 2014/12/23 9:50, Chen, Tiejun wrote:
> >>>>>On 2014/12/22 17:23, Jamie Heilman wrote:
> >>>>>>KVM internal error. Suberror: 1
> >>>>>>emulation failure
> >>>>>>EAX=000de494 EBX= ECX= EDX=0cfd
> >>>>>>ESI=0059 EDI= EBP= ESP=6fb4
> >>>>>>EIP=000f15c1 EFL=00010016 [AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> >>>>>>ES =0010   00c09300 DPL=0 DS   [-WA]
> >>>>>>CS =0008   00c09b00 DPL=0 CS32 [-RA]
> >>>>>>SS =0010   00c09300 DPL=0 DS   [-WA]
> >>>>>>DS =0010   00c09300 DPL=0 DS   [-WA]
> >>>>>>FS =0010   00c09300 DPL=0 DS   [-WA]
> >>>>>>GS =0010   00c09300 DPL=0 DS   [-WA]
> >>>>>>LDT=   8200 DPL=0 LDT
> >>>>>>TR =   8b00 DPL=0 TSS32-busy
> >>>>>>GDT= 000f6be8 0037
> >>>>>>IDT= 000f6c26 
> >>>>>>CR0=6011 CR2= CR3= CR4=
> >>>>>>DR0= DR1= DR2=
> >>>>>>DR3=
> >>>>>>DR6=0ff0 DR7=0400
> >>>>>>EFER=
> >>>>>>Code=e8 ae fc ff ff 89 f2 a8 10 89 d8 75 0a b9 41 15 ff ff ff d1 <5b>
> >>>>>>5e c3 5b 5e e9 76 ff ff ff b0 11 e6 20 e6 a0 b0 08 e6 21 b0 70 e6 a1
> >>>>>>b0 04 e6 21 b0 02
> >>>>>>
> >>>>>>FWIW, I get the same thing with 34a1cd60d17 reverted.  Maybe there are
> >>>>>>two bugs, maybe there's more to this first one.  I can repro this
> >>>>>
> >>>>>So if my understanding is correct, this is probably another bug. And
> >>>>>especially, I already saw the same log in another thread, "Cleaning up
> >>>>>the KVM clock". Maybe you can continue to `git bisect` to locate that
> >>>>>bad commit.
> >>>>>
> >>>>
> >>>>Looks just now Andy found that commit,
> >>>>0e60b0799fedc495a5c57dbd669de3c10d72edd2 "kvm: change memslot sorting rule
> >>>>from size to GFN", maybe you can try to revert this to try yours again.
> >>>
> >>>That doesn't revert cleanly for me, and I don't have much time to
> >>>fiddle with it until the 24th---so checked out the commit before it
> >>>(d4ae84a0), applied your patch, built, and yes, everything works fine
> >>>at that point.  I'll probably have time for another full bisection
> >>>later, assuming things aren't ironed out already by then.
> >
> >3.18.0-rc3-00120-gd4ae84a0 + vmx reorder msr writes patch = OK
> >3.18.0-rc3-00121-g0e60b07 + vmx reorder msr writes patch = emulation failure
> >
> >So that certainly points to 0e60b0799fedc495a5c57dbd669de3c10d72edd2
> >as well.
> >
> >>Could you try this to fix your last error?
> >
> >Running qemu-system-x86_64 -machine pc,accel=kvm -nodefaults works,
> >my real (headless) kvm guests work, but this new patch makes running
> >"qemu-system-x86_64 -machine pc,accel=kvm" fail again, this time with
> 
> Are you sure? From my test based on 3.19-rc1 that it owns top commit,
> 
> aa39477b5692611b91ac9455ae588738852b3f60
> 
> just plus my previous patch, "kvm: x86: vmx: reorder some msr writing"
> 
> I already can execute such a command successfully,
> 
> qemu-system-x86_64 -machine pc,accel=kvm -m 2048 -smp 2 -hda ubuntu.img
> 
> And your log below seems not to relate mem_slot issue we're discussing, I
> guess you need to update qemu as well.

Yes, I'm sure.

> But I also found my new patch just work out Andy's next case, its really
> bringing a new issue in !next case. So I tried to refine that patch again as
> follows,

This latest patch (again, after fixing all the whitespace so it actually
applies), does the trick.  Both
"qemu-system-x86_64 -machine pc,accel=kvm" and
"qemu-system-x86_64 -machine pc,accel=kvm -nodefaults" work for me
now without any of the aforementioned warnings from the host.


> Signed-off-by: Tiejun Chen 
>

Re: regression bisected; KVM: entry failed, hardware error 0x80000021

2014-12-25 Thread Jamie Heilman
Chen, Tiejun wrote:
 On 2014/12/24 19:02, Jamie Heilman wrote:
 Chen, Tiejun wrote:
 On 2014/12/23 15:26, Jamie Heilman wrote:
 Chen, Tiejun wrote:
 On 2014/12/23 9:50, Chen, Tiejun wrote:
 On 2014/12/22 17:23, Jamie Heilman wrote:
 KVM internal error. Suberror: 1
 emulation failure
 EAX=000de494 EBX= ECX= EDX=0cfd
 ESI=0059 EDI= EBP= ESP=6fb4
 EIP=000f15c1 EFL=00010016 [AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =0010   00c09300 DPL=0 DS   [-WA]
 CS =0008   00c09b00 DPL=0 CS32 [-RA]
 SS =0010   00c09300 DPL=0 DS   [-WA]
 DS =0010   00c09300 DPL=0 DS   [-WA]
 FS =0010   00c09300 DPL=0 DS   [-WA]
 GS =0010   00c09300 DPL=0 DS   [-WA]
 LDT=   8200 DPL=0 LDT
 TR =   8b00 DPL=0 TSS32-busy
 GDT= 000f6be8 0037
 IDT= 000f6c26 
 CR0=6011 CR2= CR3= CR4=
 DR0= DR1= DR2=
 DR3=
 DR6=0ff0 DR7=0400
 EFER=
 Code=e8 ae fc ff ff 89 f2 a8 10 89 d8 75 0a b9 41 15 ff ff ff d1 5b
 5e c3 5b 5e e9 76 ff ff ff b0 11 e6 20 e6 a0 b0 08 e6 21 b0 70 e6 a1
 b0 04 e6 21 b0 02
 
 FWIW, I get the same thing with 34a1cd60d17 reverted.  Maybe there are
 two bugs, maybe there's more to this first one.  I can repro this
 
 So if my understanding is correct, this is probably another bug. And
 especially, I already saw the same log in another thread, Cleaning up
 the KVM clock. Maybe you can continue to `git bisect` to locate that
 bad commit.
 
 
 Looks just now Andy found that commit,
 0e60b0799fedc495a5c57dbd669de3c10d72edd2 kvm: change memslot sorting rule
 from size to GFN, maybe you can try to revert this to try yours again.
 
 That doesn't revert cleanly for me, and I don't have much time to
 fiddle with it until the 24th---so checked out the commit before it
 (d4ae84a0), applied your patch, built, and yes, everything works fine
 at that point.  I'll probably have time for another full bisection
 later, assuming things aren't ironed out already by then.
 
 3.18.0-rc3-00120-gd4ae84a0 + vmx reorder msr writes patch = OK
 3.18.0-rc3-00121-g0e60b07 + vmx reorder msr writes patch = emulation failure
 
 So that certainly points to 0e60b0799fedc495a5c57dbd669de3c10d72edd2
 as well.
 
 Could you try this to fix your last error?
 
 Running qemu-system-x86_64 -machine pc,accel=kvm -nodefaults works,
 my real (headless) kvm guests work, but this new patch makes running
 qemu-system-x86_64 -machine pc,accel=kvm fail again, this time with
 
 Are you sure? From my test based on 3.19-rc1 that it owns top commit,
 
 aa39477b5692611b91ac9455ae588738852b3f60
 
 just plus my previous patch, kvm: x86: vmx: reorder some msr writing
 
 I already can execute such a command successfully,
 
 qemu-system-x86_64 -machine pc,accel=kvm -m 2048 -smp 2 -hda ubuntu.img
 
 And your log below seems not to relate mem_slot issue we're discussing, I
 guess you need to update qemu as well.

Yes, I'm sure.

 But I also found my new patch just work out Andy's next case, its really
 bringing a new issue in !next case. So I tried to refine that patch again as
 follows,

This latest patch (again, after fixing all the whitespace so it actually
applies), does the trick.  Both
qemu-system-x86_64 -machine pc,accel=kvm and
qemu-system-x86_64 -machine pc,accel=kvm -nodefaults work for me
now without any of the aforementioned warnings from the host.


 Signed-off-by: Tiejun Chen tiejun.c...@intel.com
 ---
  virt/kvm/kvm_main.c | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)
 
 diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
 index f528343..910bc48 100644
 --- a/virt/kvm/kvm_main.c
 +++ b/virt/kvm/kvm_main.c
 @@ -672,6 +672,7 @@ static void update_memslots(struct kvm_memslots *slots,
 WARN_ON(mslots[i].id != id);
 if (!new-npages) {
 new-base_gfn = 0;
 +   new-flags = 0;
 if (mslots[i].npages)
 slots-used_slots--;
 } else {
 @@ -688,7 +689,9 @@ static void update_memslots(struct kvm_memslots *slots,
 i++;
 }
 while (i  0 
 -  new-base_gfn  mslots[i - 1].base_gfn) {
 +  ((new-base_gfn  mslots[i - 1].base_gfn) ||
 +(!new-base_gfn 
 + !mslots[i - 1].base_gfn  !mslots[i - 1].npages))) {
 mslots[i] = mslots[i - 1];
 slots-id_to_index[mslots[i].id] = i;
 i--;
 
 
 
 Tiejun
 
 errors in the host to the tune of:
 
 [ cut here ]
 WARNING: CPU: 1 PID: 3901 at arch/x86/kvm/x86.c:6575 
 kvm_arch_vcpu_ioctl_run+0xd63/0xe5b [kvm]()
 Modules linked in: nfsv4 cpufreq_userspace cpufreq_stats cpufreq_powersave 
 cpufreq_ondemand cpufreq_conservative autofs4 fan nfsd auth_rpcgss nfs lockd 
 grace fscache sunrpc

Re: regression bisected; KVM: entry failed, hardware error 0x80000021

2014-12-24 Thread Jamie Heilman
Chen, Tiejun wrote:
> On 2014/12/23 15:26, Jamie Heilman wrote:
> >Chen, Tiejun wrote:
> >>On 2014/12/23 9:50, Chen, Tiejun wrote:
> >>>On 2014/12/22 17:23, Jamie Heilman wrote:
> >>>>KVM internal error. Suberror: 1
> >>>>emulation failure
> >>>>EAX=000de494 EBX= ECX= EDX=0cfd
> >>>>ESI=0059 EDI= EBP= ESP=6fb4
> >>>>EIP=000f15c1 EFL=00010016 [AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> >>>>ES =0010   00c09300 DPL=0 DS   [-WA]
> >>>>CS =0008   00c09b00 DPL=0 CS32 [-RA]
> >>>>SS =0010   00c09300 DPL=0 DS   [-WA]
> >>>>DS =0010   00c09300 DPL=0 DS   [-WA]
> >>>>FS =0010   00c09300 DPL=0 DS   [-WA]
> >>>>GS =0010   00c09300 DPL=0 DS   [-WA]
> >>>>LDT=   8200 DPL=0 LDT
> >>>>TR =   8b00 DPL=0 TSS32-busy
> >>>>GDT= 000f6be8 0037
> >>>>IDT= 000f6c26 
> >>>>CR0=6011 CR2= CR3= CR4=
> >>>>DR0= DR1= DR2=
> >>>>DR3=
> >>>>DR6=0ff0 DR7=0400
> >>>>EFER=
> >>>>Code=e8 ae fc ff ff 89 f2 a8 10 89 d8 75 0a b9 41 15 ff ff ff d1 <5b>
> >>>>5e c3 5b 5e e9 76 ff ff ff b0 11 e6 20 e6 a0 b0 08 e6 21 b0 70 e6 a1
> >>>>b0 04 e6 21 b0 02
> >>>>
> >>>>FWIW, I get the same thing with 34a1cd60d17 reverted.  Maybe there are
> >>>>two bugs, maybe there's more to this first one.  I can repro this
> >>>
> >>>So if my understanding is correct, this is probably another bug. And
> >>>especially, I already saw the same log in another thread, "Cleaning up
> >>>the KVM clock". Maybe you can continue to `git bisect` to locate that
> >>>bad commit.
> >>>
> >>
> >>Looks just now Andy found that commit,
> >>0e60b0799fedc495a5c57dbd669de3c10d72edd2 "kvm: change memslot sorting rule
> >>from size to GFN", maybe you can try to revert this to try yours again.
> >
> >That doesn't revert cleanly for me, and I don't have much time to
> >fiddle with it until the 24th---so checked out the commit before it
> >(d4ae84a0), applied your patch, built, and yes, everything works fine
> >at that point.  I'll probably have time for another full bisection
> >later, assuming things aren't ironed out already by then.

3.18.0-rc3-00120-gd4ae84a0 + vmx reorder msr writes patch = OK
3.18.0-rc3-00121-g0e60b07 + vmx reorder msr writes patch = emulation failure

So that certainly points to 0e60b0799fedc495a5c57dbd669de3c10d72edd2
as well.

> Could you try this to fix your last error?

Running qemu-system-x86_64 -machine pc,accel=kvm -nodefaults works,
my real (headless) kvm guests work, but this new patch makes running
"qemu-system-x86_64 -machine pc,accel=kvm" fail again, this time with
errors in the host to the tune of:

[ cut here ]
WARNING: CPU: 1 PID: 3901 at arch/x86/kvm/x86.c:6575 
kvm_arch_vcpu_ioctl_run+0xd63/0xe5b [kvm]()
Modules linked in: nfsv4 cpufreq_userspace cpufreq_stats cpufreq_powersave 
cpufreq_ondemand cpufreq_conservative autofs4 fan nfsd auth_rpcgss nfs lockd 
grace fscache sunrpc bridge stp llc vhost_net tun vhost macvtap macvlan fuse 
cbc dm_crypt usb_storage snd_hda_codec_analog snd_hda_codec_generic kvm_intel 
kvm tg3 ptp pps_core sr_mod snd_hda_intel snd_hda_controller snd_hda_codec 
snd_hwdep snd_pcm snd_timer snd sg dcdbas cdrom psmouse soundcore floppy evdev 
xfs dm_mod raid1 md_mod
CPU: 1 PID: 3901 Comm: qemu-system-x86 Not tainted 
3.19.0-rc1-00011-g53262d1-dirty #1
Hardware name: Dell Inc. Precision WorkStation T3400  /0TP412, BIOS A14 
04/30/2012
  7e052328 8800c25ffcf8 813defbe
   8800c25ffd38 8103b517
 8800c25ffd28 a019bdec 8800caf1d000 8800c2774800
Call Trace:
 [] dump_stack+0x4c/0x6e
 [] warn_slowpath_common+0x97/0xb1
 [] ? kvm_arch_vcpu_ioctl_run+0xd63/0xe5b [kvm]
 [] warn_slowpath_null+0x15/0x17
 [] kvm_arch_vcpu_ioctl_run+0xd63/0xe5b [kvm]
 [] ? vmcs_load+0x20/0x62 [kvm_intel]
 [] ? vmx_vcpu_load+0x140/0x16a [kvm_intel]
 [] ? kvm_arch_vcpu_load+0x15c/0x161 [kvm]
 [] kvm_vcpu_ioctl+0x189/0x4bd [kvm]
 [] ? do_sigtimedwait+0x12f/0x189
 [] do_vfs_ioctl+0x370/0x436
 [] ? __fget+0x67/0x72
 [] SyS_ioctl+0x3f/0x5e
 [] system_call_fastpath+0x12/0x17
---[ end trace 46abac932fb3b4a1 ]---
-

Re: regression bisected; KVM: entry failed, hardware error 0x80000021

2014-12-24 Thread Jamie Heilman
Chen, Tiejun wrote:
 On 2014/12/23 15:26, Jamie Heilman wrote:
 Chen, Tiejun wrote:
 On 2014/12/23 9:50, Chen, Tiejun wrote:
 On 2014/12/22 17:23, Jamie Heilman wrote:
 KVM internal error. Suberror: 1
 emulation failure
 EAX=000de494 EBX= ECX= EDX=0cfd
 ESI=0059 EDI= EBP= ESP=6fb4
 EIP=000f15c1 EFL=00010016 [AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =0010   00c09300 DPL=0 DS   [-WA]
 CS =0008   00c09b00 DPL=0 CS32 [-RA]
 SS =0010   00c09300 DPL=0 DS   [-WA]
 DS =0010   00c09300 DPL=0 DS   [-WA]
 FS =0010   00c09300 DPL=0 DS   [-WA]
 GS =0010   00c09300 DPL=0 DS   [-WA]
 LDT=   8200 DPL=0 LDT
 TR =   8b00 DPL=0 TSS32-busy
 GDT= 000f6be8 0037
 IDT= 000f6c26 
 CR0=6011 CR2= CR3= CR4=
 DR0= DR1= DR2=
 DR3=
 DR6=0ff0 DR7=0400
 EFER=
 Code=e8 ae fc ff ff 89 f2 a8 10 89 d8 75 0a b9 41 15 ff ff ff d1 5b
 5e c3 5b 5e e9 76 ff ff ff b0 11 e6 20 e6 a0 b0 08 e6 21 b0 70 e6 a1
 b0 04 e6 21 b0 02
 
 FWIW, I get the same thing with 34a1cd60d17 reverted.  Maybe there are
 two bugs, maybe there's more to this first one.  I can repro this
 
 So if my understanding is correct, this is probably another bug. And
 especially, I already saw the same log in another thread, Cleaning up
 the KVM clock. Maybe you can continue to `git bisect` to locate that
 bad commit.
 
 
 Looks just now Andy found that commit,
 0e60b0799fedc495a5c57dbd669de3c10d72edd2 kvm: change memslot sorting rule
 from size to GFN, maybe you can try to revert this to try yours again.
 
 That doesn't revert cleanly for me, and I don't have much time to
 fiddle with it until the 24th---so checked out the commit before it
 (d4ae84a0), applied your patch, built, and yes, everything works fine
 at that point.  I'll probably have time for another full bisection
 later, assuming things aren't ironed out already by then.

3.18.0-rc3-00120-gd4ae84a0 + vmx reorder msr writes patch = OK
3.18.0-rc3-00121-g0e60b07 + vmx reorder msr writes patch = emulation failure

So that certainly points to 0e60b0799fedc495a5c57dbd669de3c10d72edd2
as well.

 Could you try this to fix your last error?

Running qemu-system-x86_64 -machine pc,accel=kvm -nodefaults works,
my real (headless) kvm guests work, but this new patch makes running
qemu-system-x86_64 -machine pc,accel=kvm fail again, this time with
errors in the host to the tune of:

[ cut here ]
WARNING: CPU: 1 PID: 3901 at arch/x86/kvm/x86.c:6575 
kvm_arch_vcpu_ioctl_run+0xd63/0xe5b [kvm]()
Modules linked in: nfsv4 cpufreq_userspace cpufreq_stats cpufreq_powersave 
cpufreq_ondemand cpufreq_conservative autofs4 fan nfsd auth_rpcgss nfs lockd 
grace fscache sunrpc bridge stp llc vhost_net tun vhost macvtap macvlan fuse 
cbc dm_crypt usb_storage snd_hda_codec_analog snd_hda_codec_generic kvm_intel 
kvm tg3 ptp pps_core sr_mod snd_hda_intel snd_hda_controller snd_hda_codec 
snd_hwdep snd_pcm snd_timer snd sg dcdbas cdrom psmouse soundcore floppy evdev 
xfs dm_mod raid1 md_mod
CPU: 1 PID: 3901 Comm: qemu-system-x86 Not tainted 
3.19.0-rc1-00011-g53262d1-dirty #1
Hardware name: Dell Inc. Precision WorkStation T3400  /0TP412, BIOS A14 
04/30/2012
  7e052328 8800c25ffcf8 813defbe
   8800c25ffd38 8103b517
 8800c25ffd28 a019bdec 8800caf1d000 8800c2774800
Call Trace:
 [813defbe] dump_stack+0x4c/0x6e
 [8103b517] warn_slowpath_common+0x97/0xb1
 [a019bdec] ? kvm_arch_vcpu_ioctl_run+0xd63/0xe5b [kvm]
 [8103b60b] warn_slowpath_null+0x15/0x17
 [a019bdec] kvm_arch_vcpu_ioctl_run+0xd63/0xe5b [kvm]
 [a02308b9] ? vmcs_load+0x20/0x62 [kvm_intel]
 [a0231e03] ? vmx_vcpu_load+0x140/0x16a [kvm_intel]
 [a0196ba3] ? kvm_arch_vcpu_load+0x15c/0x161 [kvm]
 [a018d8b1] kvm_vcpu_ioctl+0x189/0x4bd [kvm]
 [8104647a] ? do_sigtimedwait+0x12f/0x189
 [810ea316] do_vfs_ioctl+0x370/0x436
 [810f24f2] ? __fget+0x67/0x72
 [810ea41b] SyS_ioctl+0x3f/0x5e
 [813e34d2] system_call_fastpath+0x12/0x17
---[ end trace 46abac932fb3b4a1 ]---
[ cut here ]
WARNING: CPU: 1 PID: 3901 at arch/x86/kvm/x86.c:6575 
kvm_arch_vcpu_ioctl_run+0xd63/0xe5b [kvm]()
Modules linked in: nfsv4 cpufreq_userspace cpufreq_stats cpufreq_powersave 
cpufreq_ondemand cpufreq_conservative autofs4 fan nfsd auth_rpcgss nfs lockd 
grace fscache sunrpc bridge stp llc vhost_net tun vhost macvtap macvlan fuse 
cbc dm_crypt usb_storage snd_hda_codec_analog snd_hda_codec_generic kvm_intel 
kvm tg3 ptp pps_core sr_mod snd_hda_intel snd_hda_controller snd_hda_codec 
snd_hwdep snd_pcm snd_timer snd sg dcdbas cdrom psmouse soundcore

Re: regression bisected; KVM: entry failed, hardware error 0x80000021

2014-12-22 Thread Jamie Heilman
Chen, Tiejun wrote:
> On 2014/12/23 9:50, Chen, Tiejun wrote:
> >On 2014/12/22 17:23, Jamie Heilman wrote:
> >>Chen, Tiejun wrote:
> >>>On 2014/12/21 20:46, Jamie Heilman wrote:
> >>>>With v3.19-rc1 when I run qemu-system-x86_64 -machine pc,accel=kvm I
> >>>>get:
> >>>>
> >>>>KVM: entry failed, hardware error 0x8021
> >>>
> >>>Looks some MSR writing issues such a failed entry.
> >>>
> >>>>If you're running a guest on an Intel machine without unrestricted mode
> >>>>support, the failure can be most likely due to the guest entering an
> >>>>invalid
> >>>>state for Intel VT. For example, the guest maybe running in big real
> >>>>mode
> >>>>which is not supported on less recent Intel processors.
> >>>>
> >>>>EAX= EBX= ECX= EDX=0663
> >>>>ESI= EDI= EBP= ESP=
> >>>>EIP=e05b EFL=00010002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
> >>>>ES =   9300
> >>>>CS =f000 000f  9b00
> >>>>SS =   9300
> >>>>DS =   9300
> >>>>FS =   9300
> >>>>GS =   9300
> >>>>LDT=   8200
> >>>>TR =   8b00
> >>>>GDT=  
> >>>>IDT=  
> >>>>CR0=6010 CR2= CR3= CR4=
> >>>>DR0= DR1= DR2=
> >>>>DR3=
> >>>>DR6=0ff0 DR7=0400
> >>>>EFER=
> >>>
> >>>And I don't see any obvious wrong as well. Any valuable info from dmesg?
> >>
> >>With the simple qemu command above, on 3.18.1 I see:
> >>
> >>kern.info: kvm: zapping shadow pages for mmio generation wraparound
> >>
> >>when I fire up a full guest that's actually useful I get:
> >>
> >>kern.info: kvm: zapping shadow pages for mmio generation wraparound
> >>kern.err: kvm [4073]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
> >>
> >>On 3.18.0-rc3-00042-g34a1cd6 nothing appears in the dmesg, just the
> >>message I mention above to stderr.  Same thing with a stock
> >>3.19.0-rc1.  Once I apply your patch the simple test command produces
> >>the same zapping shadow pages messages as 3.18.1, and a test guest of
> >>a Debian Jessie image (w/stock distro kernel) produces the same thing
> >>with disabled perfctr wrmsr message.  However, it doesn't look like
> >
> >Sorry I'm not sure if I understood current status. Looks 3.19-rc1 & my
> >patch just fix that error above,
> >
> >KVM: entry failed, hardware error 0x8021
> >...
> >
> >Right?
> >
> >>I'm entirely out of the woods, because one of my other guest VMs with a
> >>custom kernel that works great under 3.18.1 now fails to run.  Nothing
> >>in dmesg, but here's the stderr:
> >
> >But even you revert 34a1cd60d17 or just apply my patch, something else
> >introduced between 3.18.1 and 3.19-rc1 led this error below, right?
> >
> >>
> >>KVM internal error. Suberror: 1
> >>emulation failure
> >>EAX=000de494 EBX= ECX= EDX=0cfd
> >>ESI=0059 EDI= EBP= ESP=6fb4
> >>EIP=000f15c1 EFL=00010016 [AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0
> >>ES =0010   00c09300 DPL=0 DS   [-WA]
> >>CS =0008   00c09b00 DPL=0 CS32 [-RA]
> >>SS =0010   00c09300 DPL=0 DS   [-WA]
> >>DS =0010   00c09300 DPL=0 DS   [-WA]
> >>FS =0010   00c09300 DPL=0 DS   [-WA]
> >>GS =0010   00c09300 DPL=0 DS   [-WA]
> >>LDT=   8200 DPL=0 LDT
> >>TR =   8b00 DPL=0 TSS32-busy
> >>GDT= 000f6be8 0037
> >>IDT= 000f6c26 
> >>CR0=6011 CR2= CR3= CR4=
> >>DR0= DR1= DR2=
> >>DR3=
> >>DR6=0ff0 DR7=0400
> >>EFER=
> >>Code=e8 ae fc ff ff 89 f2 a8 10 89 d8 75 0a b9 41 15 ff ff ff d1 <5b>
> >>5e c3 5

Re: regression bisected; KVM: entry failed, hardware error 0x80000021

2014-12-22 Thread Jamie Heilman
Chen, Tiejun wrote:
> On 2014/12/21 20:46, Jamie Heilman wrote:
> >With v3.19-rc1 when I run qemu-system-x86_64 -machine pc,accel=kvm I
> >get:
> >
> >KVM: entry failed, hardware error 0x8021
> 
> Looks some MSR writing issues such a failed entry.
> 
> >If you're running a guest on an Intel machine without unrestricted mode
> >support, the failure can be most likely due to the guest entering an invalid
> >state for Intel VT. For example, the guest maybe running in big real mode
> >which is not supported on less recent Intel processors.
> >
> >EAX= EBX= ECX= EDX=0663
> >ESI= EDI= EBP= ESP=
> >EIP=e05b EFL=00010002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
> >ES =   9300
> >CS =f000 000f  9b00
> >SS =   9300
> >DS =   9300
> >FS =   9300
> >GS =   9300
> >LDT=   8200
> >TR =   8b00
> >GDT=  
> >IDT=  
> >CR0=6010 CR2= CR3= CR4=
> >DR0= DR1= DR2= 
> >DR3=
> >DR6=0ff0 DR7=0400
> >EFER=
> 
> And I don't see any obvious wrong as well. Any valuable info from dmesg?

With the simple qemu command above, on 3.18.1 I see:

kern.info: kvm: zapping shadow pages for mmio generation wraparound

when I fire up a full guest that's actually useful I get:

kern.info: kvm: zapping shadow pages for mmio generation wraparound
kern.err: kvm [4073]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x

On 3.18.0-rc3-00042-g34a1cd6 nothing appears in the dmesg, just the
message I mention above to stderr.  Same thing with a stock
3.19.0-rc1.  Once I apply your patch the simple test command produces
the same zapping shadow pages messages as 3.18.1, and a test guest of
a Debian Jessie image (w/stock distro kernel) produces the same thing
with disabled perfctr wrmsr message.  However, it doesn't look like
I'm entirely out of the woods, because one of my other guest VMs with a
custom kernel that works great under 3.18.1 now fails to run.  Nothing
in dmesg, but here's the stderr:

KVM internal error. Suberror: 1
emulation failure
EAX=000de494 EBX= ECX= EDX=0cfd
ESI=0059 EDI= EBP= ESP=6fb4
EIP=000f15c1 EFL=00010016 [AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010   00c09300 DPL=0 DS   [-WA]
CS =0008   00c09b00 DPL=0 CS32 [-RA]
SS =0010   00c09300 DPL=0 DS   [-WA]
DS =0010   00c09300 DPL=0 DS   [-WA]
FS =0010   00c09300 DPL=0 DS   [-WA]
GS =0010   00c09300 DPL=0 DS   [-WA]
LDT=   8200 DPL=0 LDT
TR =   8b00 DPL=0 TSS32-busy
GDT= 000f6be8 0037
IDT= 000f6c26 
CR0=6011 CR2= CR3= CR4=
DR0= DR1= DR2= 
DR3= 
DR6=0ff0 DR7=0400
EFER=
Code=e8 ae fc ff ff 89 f2 a8 10 89 d8 75 0a b9 41 15 ff ff ff d1 <5b> 5e c3 5b 
5e e9 76 ff ff ff b0 11 e6 20 e6 a0 b0 08 e6 21 b0 70 e6 a1 b0 04 e6 21 b0 02

FWIW, I get the same thing with 34a1cd60d17 reverted.  Maybe there are
two bugs, maybe there's more to this first one.  I can repro this
error with the command: qemu-system-x86_64 -machine pc,accel=kvm -nodefaults

> >This is with QEMU emulator version 2.1.2 (Debian 1:2.1+dfsg-11),
> >Copyright (c) 2003-2008 Fabrice Bellard
> >
> >The host system is:
> >
> >cpu family  : 6
> >model   : 23
> >model name  : Intel(R) Core(TM)2 Duo CPU E8400  @ 3.00GHz
> >stepping: 10
> >microcode   : 0xa0b
> >...
> >flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
> >cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm 
> >constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 
> >monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm 
> >dtherm tpr_shadow vnmi flexpriority
> >
> >I bisected this back to:
> >
> >commit 34a1cd60d17f62c1f077c1478a6c2ca8c3d17af4
> >Author: Tiejun Chen 
> >Date:   Tue Oct 28 10:14:48 2014 +0800
> >
> > kvm: x86: vmx: move some vmx setting from vmx_init() to hardware_setup()
> >
> > Instead of vmx_init(), actually it would make reasonable sense to do
> > anything specific to vmx hardware setting in 
> > vmx_x86_ops->hardwa

Re: regression bisected; KVM: entry failed, hardware error 0x80000021

2014-12-22 Thread Jamie Heilman
Chen, Tiejun wrote:
 On 2014/12/23 9:50, Chen, Tiejun wrote:
 On 2014/12/22 17:23, Jamie Heilman wrote:
 Chen, Tiejun wrote:
 On 2014/12/21 20:46, Jamie Heilman wrote:
 With v3.19-rc1 when I run qemu-system-x86_64 -machine pc,accel=kvm I
 get:
 
 KVM: entry failed, hardware error 0x8021
 
 Looks some MSR writing issues such a failed entry.
 
 If you're running a guest on an Intel machine without unrestricted mode
 support, the failure can be most likely due to the guest entering an
 invalid
 state for Intel VT. For example, the guest maybe running in big real
 mode
 which is not supported on less recent Intel processors.
 
 EAX= EBX= ECX= EDX=0663
 ESI= EDI= EBP= ESP=
 EIP=e05b EFL=00010002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =   9300
 CS =f000 000f  9b00
 SS =   9300
 DS =   9300
 FS =   9300
 GS =   9300
 LDT=   8200
 TR =   8b00
 GDT=  
 IDT=  
 CR0=6010 CR2= CR3= CR4=
 DR0= DR1= DR2=
 DR3=
 DR6=0ff0 DR7=0400
 EFER=
 
 And I don't see any obvious wrong as well. Any valuable info from dmesg?
 
 With the simple qemu command above, on 3.18.1 I see:
 
 kern.info: kvm: zapping shadow pages for mmio generation wraparound
 
 when I fire up a full guest that's actually useful I get:
 
 kern.info: kvm: zapping shadow pages for mmio generation wraparound
 kern.err: kvm [4073]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x
 
 On 3.18.0-rc3-00042-g34a1cd6 nothing appears in the dmesg, just the
 message I mention above to stderr.  Same thing with a stock
 3.19.0-rc1.  Once I apply your patch the simple test command produces
 the same zapping shadow pages messages as 3.18.1, and a test guest of
 a Debian Jessie image (w/stock distro kernel) produces the same thing
 with disabled perfctr wrmsr message.  However, it doesn't look like
 
 Sorry I'm not sure if I understood current status. Looks 3.19-rc1  my
 patch just fix that error above,
 
 KVM: entry failed, hardware error 0x8021
 ...
 
 Right?
 
 I'm entirely out of the woods, because one of my other guest VMs with a
 custom kernel that works great under 3.18.1 now fails to run.  Nothing
 in dmesg, but here's the stderr:
 
 But even you revert 34a1cd60d17 or just apply my patch, something else
 introduced between 3.18.1 and 3.19-rc1 led this error below, right?
 
 
 KVM internal error. Suberror: 1
 emulation failure
 EAX=000de494 EBX= ECX= EDX=0cfd
 ESI=0059 EDI= EBP= ESP=6fb4
 EIP=000f15c1 EFL=00010016 [AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =0010   00c09300 DPL=0 DS   [-WA]
 CS =0008   00c09b00 DPL=0 CS32 [-RA]
 SS =0010   00c09300 DPL=0 DS   [-WA]
 DS =0010   00c09300 DPL=0 DS   [-WA]
 FS =0010   00c09300 DPL=0 DS   [-WA]
 GS =0010   00c09300 DPL=0 DS   [-WA]
 LDT=   8200 DPL=0 LDT
 TR =   8b00 DPL=0 TSS32-busy
 GDT= 000f6be8 0037
 IDT= 000f6c26 
 CR0=6011 CR2= CR3= CR4=
 DR0= DR1= DR2=
 DR3=
 DR6=0ff0 DR7=0400
 EFER=
 Code=e8 ae fc ff ff 89 f2 a8 10 89 d8 75 0a b9 41 15 ff ff ff d1 5b
 5e c3 5b 5e e9 76 ff ff ff b0 11 e6 20 e6 a0 b0 08 e6 21 b0 70 e6 a1
 b0 04 e6 21 b0 02
 
 FWIW, I get the same thing with 34a1cd60d17 reverted.  Maybe there are
 two bugs, maybe there's more to this first one.  I can repro this
 
 So if my understanding is correct, this is probably another bug. And
 especially, I already saw the same log in another thread, Cleaning up
 the KVM clock. Maybe you can continue to `git bisect` to locate that
 bad commit.
 
 
 Looks just now Andy found that commit,
 0e60b0799fedc495a5c57dbd669de3c10d72edd2 kvm: change memslot sorting rule
 from size to GFN, maybe you can try to revert this to try yours again.

That doesn't revert cleanly for me, and I don't have much time to
fiddle with it until the 24th---so checked out the commit before it
(d4ae84a0), applied your patch, built, and yes, everything works fine
at that point.  I'll probably have time for another full bisection
later, assuming things aren't ironed out already by then.

-- 
Jamie Heilman http://audible.transient.net/~jamie/
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: regression bisected; KVM: entry failed, hardware error 0x80000021

2014-12-22 Thread Jamie Heilman
Chen, Tiejun wrote:
 On 2014/12/21 20:46, Jamie Heilman wrote:
 With v3.19-rc1 when I run qemu-system-x86_64 -machine pc,accel=kvm I
 get:
 
 KVM: entry failed, hardware error 0x8021
 
 Looks some MSR writing issues such a failed entry.
 
 If you're running a guest on an Intel machine without unrestricted mode
 support, the failure can be most likely due to the guest entering an invalid
 state for Intel VT. For example, the guest maybe running in big real mode
 which is not supported on less recent Intel processors.
 
 EAX= EBX= ECX= EDX=0663
 ESI= EDI= EBP= ESP=
 EIP=e05b EFL=00010002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
 ES =   9300
 CS =f000 000f  9b00
 SS =   9300
 DS =   9300
 FS =   9300
 GS =   9300
 LDT=   8200
 TR =   8b00
 GDT=  
 IDT=  
 CR0=6010 CR2= CR3= CR4=
 DR0= DR1= DR2= 
 DR3=
 DR6=0ff0 DR7=0400
 EFER=
 
 And I don't see any obvious wrong as well. Any valuable info from dmesg?

With the simple qemu command above, on 3.18.1 I see:

kern.info: kvm: zapping shadow pages for mmio generation wraparound

when I fire up a full guest that's actually useful I get:

kern.info: kvm: zapping shadow pages for mmio generation wraparound
kern.err: kvm [4073]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x

On 3.18.0-rc3-00042-g34a1cd6 nothing appears in the dmesg, just the
message I mention above to stderr.  Same thing with a stock
3.19.0-rc1.  Once I apply your patch the simple test command produces
the same zapping shadow pages messages as 3.18.1, and a test guest of
a Debian Jessie image (w/stock distro kernel) produces the same thing
with disabled perfctr wrmsr message.  However, it doesn't look like
I'm entirely out of the woods, because one of my other guest VMs with a
custom kernel that works great under 3.18.1 now fails to run.  Nothing
in dmesg, but here's the stderr:

KVM internal error. Suberror: 1
emulation failure
EAX=000de494 EBX= ECX= EDX=0cfd
ESI=0059 EDI= EBP= ESP=6fb4
EIP=000f15c1 EFL=00010016 [AP-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =0010   00c09300 DPL=0 DS   [-WA]
CS =0008   00c09b00 DPL=0 CS32 [-RA]
SS =0010   00c09300 DPL=0 DS   [-WA]
DS =0010   00c09300 DPL=0 DS   [-WA]
FS =0010   00c09300 DPL=0 DS   [-WA]
GS =0010   00c09300 DPL=0 DS   [-WA]
LDT=   8200 DPL=0 LDT
TR =   8b00 DPL=0 TSS32-busy
GDT= 000f6be8 0037
IDT= 000f6c26 
CR0=6011 CR2= CR3= CR4=
DR0= DR1= DR2= 
DR3= 
DR6=0ff0 DR7=0400
EFER=
Code=e8 ae fc ff ff 89 f2 a8 10 89 d8 75 0a b9 41 15 ff ff ff d1 5b 5e c3 5b 
5e e9 76 ff ff ff b0 11 e6 20 e6 a0 b0 08 e6 21 b0 70 e6 a1 b0 04 e6 21 b0 02

FWIW, I get the same thing with 34a1cd60d17 reverted.  Maybe there are
two bugs, maybe there's more to this first one.  I can repro this
error with the command: qemu-system-x86_64 -machine pc,accel=kvm -nodefaults

 This is with QEMU emulator version 2.1.2 (Debian 1:2.1+dfsg-11),
 Copyright (c) 2003-2008 Fabrice Bellard
 
 The host system is:
 
 cpu family  : 6
 model   : 23
 model name  : Intel(R) Core(TM)2 Duo CPU E8400  @ 3.00GHz
 stepping: 10
 microcode   : 0xa0b
 ...
 flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca 
 cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm 
 constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 
 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm 
 dtherm tpr_shadow vnmi flexpriority
 
 I bisected this back to:
 
 commit 34a1cd60d17f62c1f077c1478a6c2ca8c3d17af4
 Author: Tiejun Chen tiejun.c...@intel.com
 Date:   Tue Oct 28 10:14:48 2014 +0800
 
  kvm: x86: vmx: move some vmx setting from vmx_init() to hardware_setup()
 
  Instead of vmx_init(), actually it would make reasonable sense to do
  anything specific to vmx hardware setting in 
  vmx_x86_ops-hardware_setup().
 
 
 This commit just reorders something but some MSR writing depend on previous
 status.
 
 Could you try this?

I unmangled the expanded tabs, and applied this:
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index feb852b..96c84a8 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -5840,49 +5840,6 @@ static __init int hardware_setup(void)
memset(vmx_msr_bitmap_legacy, 0xff, PAGE_SIZE);
memset(vmx_msr_bitmap_longmode

regression bisected; KVM: entry failed, hardware error 0x80000021

2014-12-21 Thread Jamie Heilman
With v3.19-rc1 when I run qemu-system-x86_64 -machine pc,accel=kvm I
get:

KVM: entry failed, hardware error 0x8021

If you're running a guest on an Intel machine without unrestricted mode
support, the failure can be most likely due to the guest entering an invalid
state for Intel VT. For example, the guest maybe running in big real mode
which is not supported on less recent Intel processors.

EAX= EBX= ECX= EDX=0663
ESI= EDI= EBP= ESP=
EIP=e05b EFL=00010002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =   9300
CS =f000 000f  9b00
SS =   9300
DS =   9300
FS =   9300
GS =   9300
LDT=   8200
TR =   8b00
GDT=  
IDT=  
CR0=6010 CR2= CR3= CR4=
DR0= DR1= DR2= 
DR3= 
DR6=0ff0 DR7=0400
EFER=
Code=85 00 87 00 89 00 8b 00 00 00 86 00 88 00 8a 00 8c 00 00 90 <2e> 66 83 3e 
30 6c 00 0f 85 e7 f2 31 c0 8e d0 66 bc 00 70 00 00 66 ba 31 2e 0f 00 e9 45 f1

This is with QEMU emulator version 2.1.2 (Debian 1:2.1+dfsg-11),
Copyright (c) 2003-2008 Fabrice Bellard

The host system is:

cpu family  : 6
model   : 23
model name  : Intel(R) Core(TM)2 Duo CPU E8400  @ 3.00GHz
stepping: 10
microcode   : 0xa0b
...
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm 
constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor 
ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm 
tpr_shadow vnmi flexpriority

I bisected this back to:

commit 34a1cd60d17f62c1f077c1478a6c2ca8c3d17af4
Author: Tiejun Chen 
Date:   Tue Oct 28 10:14:48 2014 +0800

kvm: x86: vmx: move some vmx setting from vmx_init() to hardware_setup()

Instead of vmx_init(), actually it would make reasonable sense to do
anything specific to vmx hardware setting in vmx_x86_ops->hardware_setup().

Signed-off-by: Tiejun Chen 
Signed-off-by: Paolo Bonzini 

reverting this commit appears to fix the issue, in so far that I can
run kvm again without the aforementioned error, but I haven't tested
it thoroughly beyond that.  Let me know if you need more information
or testing.

-- 
Jamie Heilman http://audible.transient.net/~jamie/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


regression bisected; KVM: entry failed, hardware error 0x80000021

2014-12-21 Thread Jamie Heilman
With v3.19-rc1 when I run qemu-system-x86_64 -machine pc,accel=kvm I
get:

KVM: entry failed, hardware error 0x8021

If you're running a guest on an Intel machine without unrestricted mode
support, the failure can be most likely due to the guest entering an invalid
state for Intel VT. For example, the guest maybe running in big real mode
which is not supported on less recent Intel processors.

EAX= EBX= ECX= EDX=0663
ESI= EDI= EBP= ESP=
EIP=e05b EFL=00010002 [---] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =   9300
CS =f000 000f  9b00
SS =   9300
DS =   9300
FS =   9300
GS =   9300
LDT=   8200
TR =   8b00
GDT=  
IDT=  
CR0=6010 CR2= CR3= CR4=
DR0= DR1= DR2= 
DR3= 
DR6=0ff0 DR7=0400
EFER=
Code=85 00 87 00 89 00 8b 00 00 00 86 00 88 00 8a 00 8c 00 00 90 2e 66 83 3e 
30 6c 00 0f 85 e7 f2 31 c0 8e d0 66 bc 00 70 00 00 66 ba 31 2e 0f 00 e9 45 f1

This is with QEMU emulator version 2.1.2 (Debian 1:2.1+dfsg-11),
Copyright (c) 2003-2008 Fabrice Bellard

The host system is:

cpu family  : 6
model   : 23
model name  : Intel(R) Core(TM)2 Duo CPU E8400  @ 3.00GHz
stepping: 10
microcode   : 0xa0b
...
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov 
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm 
constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor 
ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm 
tpr_shadow vnmi flexpriority

I bisected this back to:

commit 34a1cd60d17f62c1f077c1478a6c2ca8c3d17af4
Author: Tiejun Chen tiejun.c...@intel.com
Date:   Tue Oct 28 10:14:48 2014 +0800

kvm: x86: vmx: move some vmx setting from vmx_init() to hardware_setup()

Instead of vmx_init(), actually it would make reasonable sense to do
anything specific to vmx hardware setting in vmx_x86_ops-hardware_setup().

Signed-off-by: Tiejun Chen tiejun.c...@intel.com
Signed-off-by: Paolo Bonzini pbonz...@redhat.com

reverting this commit appears to fix the issue, in so far that I can
run kvm again without the aforementioned error, but I haven't tested
it thoroughly beyond that.  Let me know if you need more information
or testing.

-- 
Jamie Heilman http://audible.transient.net/~jamie/
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [3.7-rc5/rc6 regression] "drm/nvc0/disp: fix regression in vblank semaphore release" broke nouveau driver and mplayer

2012-11-17 Thread Jamie Heilman
Mikael Pettersson wrote:
> mplayer worked fine on my Dell Latitude E6510 (nVidia GT218 [NVS
> 3100M] graphics) up to and including kernel 3.7-rc4. However, with
> 3.7-rc5 or -rc6, any attempt to run mplayer just blanks the screen,
> shows some stray white pixels in the upper left corner, kills the X
> server, and spews the following errors from the kernel:
...
> 
> The error is 100% repeatable.

I see the same problem on my workstation with a G86 [Quadro NVS 290] (rev a1)
running Debian (sid) with:
xserver-xorg-core 2:1.12.4-3
libdrm-nouveau1a:amd64 2.4.33-3
xserver-xorg-video-nouveau 1:1.0.1-3
and mplayer2 2.0-600-g95e81df w/the xv video output driver

> git bisect identified the following culprit:
> 
> 11d92561c81be2f4a7af37f035e1af294b960abe is the first bad commit

I bisected to the same commit as well.


-- 
Jamie Heilman http://audible.transient.net/~jamie/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [3.7-rc5/rc6 regression] drm/nvc0/disp: fix regression in vblank semaphore release broke nouveau driver and mplayer

2012-11-17 Thread Jamie Heilman
Mikael Pettersson wrote:
 mplayer worked fine on my Dell Latitude E6510 (nVidia GT218 [NVS
 3100M] graphics) up to and including kernel 3.7-rc4. However, with
 3.7-rc5 or -rc6, any attempt to run mplayer just blanks the screen,
 shows some stray white pixels in the upper left corner, kills the X
 server, and spews the following errors from the kernel:
...
 
 The error is 100% repeatable.

I see the same problem on my workstation with a G86 [Quadro NVS 290] (rev a1)
running Debian (sid) with:
xserver-xorg-core 2:1.12.4-3
libdrm-nouveau1a:amd64 2.4.33-3
xserver-xorg-video-nouveau 1:1.0.1-3
and mplayer2 2.0-600-g95e81df w/the xv video output driver

 git bisect identified the following culprit:
 
 11d92561c81be2f4a7af37f035e1af294b960abe is the first bad commit

I bisected to the same commit as well.


-- 
Jamie Heilman http://audible.transient.net/~jamie/
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: v3.5 nfsd4 regression; utime sometimes takes 40+ seconds to return

2012-08-16 Thread Jamie Heilman
J. Bruce Fields wrote:
> On Thu, Aug 16, 2012 at 04:42:08PM -0400, J. Bruce Fields wrote:
> > Hm, weird.  In the good case the cb_recall's done with auth_unix, in the
> > bad case with auth_null.  OK, that should be enough to go on
> > 
> > Thanks for digging into this!
> 
> This should fix it--could you confirm?

Yep, looks like that fixes it.

> commit e950bebdac1f17121f972728489cdba43734d56d
> Author: J. Bruce Fields 
> Date:   Thu Aug 16 17:01:21 2012 -0400
> 
> nfsd4: fix security flavor of NFSv4.0 callback
> 
> Commit d5497fc693a446ce9100fcf4117c3f795ddfd0d2 "nfsd4: move rq_flavor
> into svc_cred" forgot to remove cl_flavor from the client, leaving two
> places (cl_flavor and cl_cred.cr_flavor) for the flavor to be stored.
> After that patch, the latter was the one that was updated, but the
> former was the one that the callback used.
> 
> Symptoms were a long delay on utime().  This is because the utime()
> generated a setattr which recalled a delegation, but the cb_recall was
> ignored by the client because it had the wrong security flavor.
> 
> Cc: sta...@vger.kernel.org
> Reported-by: Jamie Heilman 
> Signed-off-by: J. Bruce Fields 
> 
> diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
> index cbaf4f8..4c7bd35 100644
> --- a/fs/nfsd/nfs4callback.c
> +++ b/fs/nfsd/nfs4callback.c
> @@ -651,12 +651,12 @@ static int setup_callback_client(struct nfs4_client 
> *clp, struct nfs4_cb_conn *c
>  
>   if (clp->cl_minorversion == 0) {
>   if (!clp->cl_cred.cr_principal &&
> - (clp->cl_flavor >= RPC_AUTH_GSS_KRB5))
> + (clp->cl_cred.cr_flavor >= RPC_AUTH_GSS_KRB5))
>   return -EINVAL;
>   args.client_name = clp->cl_cred.cr_principal;
>   args.prognumber = conn->cb_prog,
>   args.protocol = XPRT_TRANSPORT_TCP;
> - args.authflavor = clp->cl_flavor;
> + args.authflavor = clp->cl_cred.cr_flavor;
>   clp->cl_cb_ident = conn->cb_ident;
>   } else {
>   if (!conn->cb_xprt)
> diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
> index 167d7d8..9db0bb5 100644
> --- a/fs/nfsd/state.h
> +++ b/fs/nfsd/state.h
> @@ -231,7 +231,6 @@ struct nfs4_client {
>   nfs4_verifier   cl_verifier;/* generated by client */
>   time_t  cl_time;/* time of last lease renewal */
>   struct sockaddr_storage cl_addr;/* client ipaddress */
> - u32 cl_flavor;  /* setclientid pseudoflavor */
>   struct svc_cred cl_cred;/* setclientid principal */
>   clientid_t  cl_clientid;/* generated by server */
>   nfs4_verifier   cl_confirm; /* generated by server */
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Jamie Heilman http://audible.transient.net/~jamie/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: v3.5 nfsd4 regression; utime sometimes takes 40+ seconds to return

2012-08-16 Thread Jamie Heilman
J. Bruce Fields wrote:
> On Wed, Aug 15, 2012 at 01:58:54PM +0000, Jamie Heilman wrote:
> > Jamie Heilman wrote:
> > > I'll try to get full rcpdebug traces on client and server as the delay
> > > is occuring in the hopes that helps pin things down, and post them
> > > separately.
> > 
> > OK, here are the logs from client and server, where a run of my test program
> > under strace -T resulted in:
> > 
> > utime("utime-test.c", [2012/08/15-13:35:20, 2012/08/15-13:22:04]) = 0 
> > <0.150815>
> > open("utime-test.c", O_RDONLY)  = 3 <0.242635>
> > close(3)= 0 <0.147768>
> > stat("utime-test.c", {st_mode=S_IFREG|0644, st_size=696, ...}) = 0 
> > <0.002772>
> > utime("utime-test.c", [2012/08/15-13:35:20, 2012/08/15-13:22:04]) = 0 
> > <71.878058>
> > 
> > The client is a 64-bit v3.4.8 kernel, the server is 32-bit, v3.5.1 +
> > the two sunprc patches that will be in v3.5.2.
> > 
> > (The client's system clock is a touch faster than the server's, but
> > these logs start at the same instant.)
> 
> Thanks for all the details.
> 
> What's probably happening is that the client is returning a delegation
> with the open.  The setattr then breaks that delegation; you can see it
> getting 10008 (NFS4ERR_DELAY) replies while the server waits for the
> delegation to be returned.  But for some reason the callback to break
> the delegation isn't working.  ("NFSD: warning: no callback path to
> client 192.168.2.42/192.168.2.4 tcp UNIX 0: error -110" (110 is
> ETIMEDOUT).)  So instead you wait for the delegation to time out and get
> forcibly revoked.
> 
> The reproducer might be more reliable if you did two opens.

I made the change... not entirely sure it helped, but I think I've
bisected this reliably anyway.  It came down to:

d5497fc693a446ce9100fcf4117c3f795ddfd0d2 is the first bad commit
commit d5497fc693a446ce9100fcf4117c3f795ddfd0d2
Author: J. Bruce Fields 
Date:   Mon May 14 22:06:49 2012 -0400

nfsd4: move rq_flavor into svc_cred

Move the rq_flavor into struct svc_cred, and use it in setclientid and
exchange_id comparisons as well.

Signed-off-by: J. Bruce Fields 

:04 04 165568aae0fc47067863ea8ca911494033cbf2ce 
54244f0084f58df4844b09d56e8e6bad0e19d913 M  fs
:04 04 bae5a74a85bda2aa2b155b0bdca1ea0028502164 
207751ce30c8f49c68fb0a85d0772cb0fa426ae1 M  include
:04 04 ef940cf0c558fe65a21a710e6c95b2cefb2996fd 
459ac143acf5cb2b3b7f59df29c417cedbb23c4b M  net

FWIW reverting this commit does seem to fix the problem.

> It'd be worth looking at the traffic in wireshark.  You should see
> setattr, open, close, setattr, a DELAY reply to the setattr, a
> CB_RECALL, and then a DELEGRETURN that gets a succesful reply.  But for
> some reason the DELEGRETURN isn't getting through in your case, I'm not
> sure why.  I can't reproduce that.  You'll need to start wireshark
> before you mount to make sure it knows how to parse the callbacks.

Captures from the server showing the delay (running 3.5.2) and the
previous behavior (running 3.4.9) are attached.

-- 
Jamie Heilman http://audible.transient.net/~jamie/


nfsd4-delay.pcap.gz
Description: Binary data


nfsd4-ok.pcap.gz
Description: Binary data


Re: v3.5 nfsd4 regression; utime sometimes takes 40+ seconds to return

2012-08-16 Thread Jamie Heilman
J. Bruce Fields wrote:
 On Wed, Aug 15, 2012 at 01:58:54PM +, Jamie Heilman wrote:
  Jamie Heilman wrote:
   I'll try to get full rcpdebug traces on client and server as the delay
   is occuring in the hopes that helps pin things down, and post them
   separately.
  
  OK, here are the logs from client and server, where a run of my test program
  under strace -T resulted in:
  
  utime(utime-test.c, [2012/08/15-13:35:20, 2012/08/15-13:22:04]) = 0 
  0.150815
  open(utime-test.c, O_RDONLY)  = 3 0.242635
  close(3)= 0 0.147768
  stat(utime-test.c, {st_mode=S_IFREG|0644, st_size=696, ...}) = 0 
  0.002772
  utime(utime-test.c, [2012/08/15-13:35:20, 2012/08/15-13:22:04]) = 0 
  71.878058
  
  The client is a 64-bit v3.4.8 kernel, the server is 32-bit, v3.5.1 +
  the two sunprc patches that will be in v3.5.2.
  
  (The client's system clock is a touch faster than the server's, but
  these logs start at the same instant.)
 
 Thanks for all the details.
 
 What's probably happening is that the client is returning a delegation
 with the open.  The setattr then breaks that delegation; you can see it
 getting 10008 (NFS4ERR_DELAY) replies while the server waits for the
 delegation to be returned.  But for some reason the callback to break
 the delegation isn't working.  (NFSD: warning: no callback path to
 client 192.168.2.42/192.168.2.4 tcp UNIX 0: error -110 (110 is
 ETIMEDOUT).)  So instead you wait for the delegation to time out and get
 forcibly revoked.
 
 The reproducer might be more reliable if you did two opens.

I made the change... not entirely sure it helped, but I think I've
bisected this reliably anyway.  It came down to:

d5497fc693a446ce9100fcf4117c3f795ddfd0d2 is the first bad commit
commit d5497fc693a446ce9100fcf4117c3f795ddfd0d2
Author: J. Bruce Fields bfie...@redhat.com
Date:   Mon May 14 22:06:49 2012 -0400

nfsd4: move rq_flavor into svc_cred

Move the rq_flavor into struct svc_cred, and use it in setclientid and
exchange_id comparisons as well.

Signed-off-by: J. Bruce Fields bfie...@redhat.com

:04 04 165568aae0fc47067863ea8ca911494033cbf2ce 
54244f0084f58df4844b09d56e8e6bad0e19d913 M  fs
:04 04 bae5a74a85bda2aa2b155b0bdca1ea0028502164 
207751ce30c8f49c68fb0a85d0772cb0fa426ae1 M  include
:04 04 ef940cf0c558fe65a21a710e6c95b2cefb2996fd 
459ac143acf5cb2b3b7f59df29c417cedbb23c4b M  net

FWIW reverting this commit does seem to fix the problem.

 It'd be worth looking at the traffic in wireshark.  You should see
 setattr, open, close, setattr, a DELAY reply to the setattr, a
 CB_RECALL, and then a DELEGRETURN that gets a succesful reply.  But for
 some reason the DELEGRETURN isn't getting through in your case, I'm not
 sure why.  I can't reproduce that.  You'll need to start wireshark
 before you mount to make sure it knows how to parse the callbacks.

Captures from the server showing the delay (running 3.5.2) and the
previous behavior (running 3.4.9) are attached.

-- 
Jamie Heilman http://audible.transient.net/~jamie/


nfsd4-delay.pcap.gz
Description: Binary data


nfsd4-ok.pcap.gz
Description: Binary data


Re: v3.5 nfsd4 regression; utime sometimes takes 40+ seconds to return

2012-08-16 Thread Jamie Heilman
J. Bruce Fields wrote:
 On Thu, Aug 16, 2012 at 04:42:08PM -0400, J. Bruce Fields wrote:
  Hm, weird.  In the good case the cb_recall's done with auth_unix, in the
  bad case with auth_null.  OK, that should be enough to go on
  
  Thanks for digging into this!
 
 This should fix it--could you confirm?

Yep, looks like that fixes it.

 commit e950bebdac1f17121f972728489cdba43734d56d
 Author: J. Bruce Fields bfie...@redhat.com
 Date:   Thu Aug 16 17:01:21 2012 -0400
 
 nfsd4: fix security flavor of NFSv4.0 callback
 
 Commit d5497fc693a446ce9100fcf4117c3f795ddfd0d2 nfsd4: move rq_flavor
 into svc_cred forgot to remove cl_flavor from the client, leaving two
 places (cl_flavor and cl_cred.cr_flavor) for the flavor to be stored.
 After that patch, the latter was the one that was updated, but the
 former was the one that the callback used.
 
 Symptoms were a long delay on utime().  This is because the utime()
 generated a setattr which recalled a delegation, but the cb_recall was
 ignored by the client because it had the wrong security flavor.
 
 Cc: sta...@vger.kernel.org
 Reported-by: Jamie Heilman ja...@audible.transient.net
 Signed-off-by: J. Bruce Fields bfie...@redhat.com
 
 diff --git a/fs/nfsd/nfs4callback.c b/fs/nfsd/nfs4callback.c
 index cbaf4f8..4c7bd35 100644
 --- a/fs/nfsd/nfs4callback.c
 +++ b/fs/nfsd/nfs4callback.c
 @@ -651,12 +651,12 @@ static int setup_callback_client(struct nfs4_client 
 *clp, struct nfs4_cb_conn *c
  
   if (clp-cl_minorversion == 0) {
   if (!clp-cl_cred.cr_principal 
 - (clp-cl_flavor = RPC_AUTH_GSS_KRB5))
 + (clp-cl_cred.cr_flavor = RPC_AUTH_GSS_KRB5))
   return -EINVAL;
   args.client_name = clp-cl_cred.cr_principal;
   args.prognumber = conn-cb_prog,
   args.protocol = XPRT_TRANSPORT_TCP;
 - args.authflavor = clp-cl_flavor;
 + args.authflavor = clp-cl_cred.cr_flavor;
   clp-cl_cb_ident = conn-cb_ident;
   } else {
   if (!conn-cb_xprt)
 diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
 index 167d7d8..9db0bb5 100644
 --- a/fs/nfsd/state.h
 +++ b/fs/nfsd/state.h
 @@ -231,7 +231,6 @@ struct nfs4_client {
   nfs4_verifier   cl_verifier;/* generated by client */
   time_t  cl_time;/* time of last lease renewal */
   struct sockaddr_storage cl_addr;/* client ipaddress */
 - u32 cl_flavor;  /* setclientid pseudoflavor */
   struct svc_cred cl_cred;/* setclientid principal */
   clientid_t  cl_clientid;/* generated by server */
   nfs4_verifier   cl_confirm; /* generated by server */
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/

-- 
Jamie Heilman http://audible.transient.net/~jamie/
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: v3.5 nfsd4 regression; utime sometimes takes 40+ seconds to return

2012-08-15 Thread Jamie Heilman
Jamie Heilman wrote:
> I'll try to get full rcpdebug traces on client and server as the delay
> is occuring in the hopes that helps pin things down, and post them
> separately.

OK, here are the logs from client and server, where a run of my test program
under strace -T resulted in:

utime("utime-test.c", [2012/08/15-13:35:20, 2012/08/15-13:22:04]) = 0 <0.150815>
open("utime-test.c", O_RDONLY)  = 3 <0.242635>
close(3)= 0 <0.147768>
stat("utime-test.c", {st_mode=S_IFREG|0644, st_size=696, ...}) = 0 <0.002772>
utime("utime-test.c", [2012/08/15-13:35:20, 2012/08/15-13:22:04]) = 0 
<71.878058>

The client is a 64-bit v3.4.8 kernel, the server is 32-bit, v3.5.1 +
the two sunprc patches that will be in v3.5.2.

(The client's system clock is a touch faster than the server's, but
these logs start at the same instant.)


-- 
Jamie Heilman http://audible.transient.net/~jamie/


server.gz
Description: Binary data


client.gz
Description: Binary data


v3.5 nfsd4 regression; utime sometimes takes 40+ seconds to return

2012-08-15 Thread Jamie Heilman
I really wish I could have nailed this down better, but I've had a
hard time reliably reproducing the problem during bisection, and I
haven't seen anyone report a similar sounding problem.  Here's what
I've seen: since 3.5 I've been having spurious delays on my nfs
clients, noticable particularly when I open an mbox file in mutt over
an nfs v4 mount from a v3.5 or later server.  The servers I've
reproduced this on are all uni-proc 32-bit systems... but then I
haven't tried SMP or 64-bit systems yet, it may or may not exist
there.  When the delay occurs, it's quite noticable.  I've never seen
one that takes less than 40 seconds to "unstick."  I wrote a quick and
dirty reproduction tool, based on the syscalls mutt was doing that
triggered the problem, attached to this message.  To use it, compile
the file as utime-test on an exported volume, then execute with (cd
/some/mount/point && strace -T ./utime-test) from a nfs4 client.

For whatever, reason I frequently find the second call to utime takes
an irritatingly long time to return and I see something like:
utime("utime-test.c", [2012/08/14-22:47:21, 2012/08/14-17:25:21]) = 0 
<70.510913>
in the strace output.

I've reproduced this on Debian Squeeze / nfs-utils 1.2.2 based servers
(legacy idmapper, no user-space nfsidmap), as well as Debian Wheezy /
nfs-utils 1.2.6 (uses keyutils upcalls) servers, so I doubt it's a
user-space related issue...  Attempts to bisect have been muddled,
I'll keep trying in the interim, but the best I've been able to pin
things down is that issue was probably introduced in the
419f4319495043a9507ac3e616be9ca60af09744 merge.  I can't repo on a
kernel based on fb21affa49204acd409328415b49bfe90136653c.  (I say
based on, because I have to apply the patch from
http://marc.info/?l=linux-nfs=133950479803025 or face additional
problems.)

I'll try to get full rcpdebug traces on client and server as the delay
is occuring in the hopes that helps pin things down, and post them
separately.

-- 
Jamie Heilman http://audible.transient.net/~jamie/
#include 
#include 
#include 
#include 
#include 
#include 
#include 

#define F_PATH "utime-test.c"

int main()
{
	struct stat sb;
	struct utimbuf ub;
	int fd;

	if (stat(F_PATH, ) == -1) {
		perror(NULL);
		return -1;
	}
	ub.modtime = sb.st_mtime;
	ub.actime = time(NULL);
	if (utime(F_PATH, ) == -1) {
		perror(NULL);
		return -2;
	}
	if ((fd = open(F_PATH, O_RDONLY)) == -1) {
		perror(NULL);
		return -3;
	}
	close(fd);
	if (stat(F_PATH, ) == -1) {
		perror(NULL);
		return -1;
	}
	ub.modtime = sb.st_mtime;
	ub.actime = time(NULL);
	if (utime(F_PATH, ) == -1) {
		perror(NULL);
		return -2;
	}

	return 0;
}


v3.5 nfsd4 regression; utime sometimes takes 40+ seconds to return

2012-08-15 Thread Jamie Heilman
I really wish I could have nailed this down better, but I've had a
hard time reliably reproducing the problem during bisection, and I
haven't seen anyone report a similar sounding problem.  Here's what
I've seen: since 3.5 I've been having spurious delays on my nfs
clients, noticable particularly when I open an mbox file in mutt over
an nfs v4 mount from a v3.5 or later server.  The servers I've
reproduced this on are all uni-proc 32-bit systems... but then I
haven't tried SMP or 64-bit systems yet, it may or may not exist
there.  When the delay occurs, it's quite noticable.  I've never seen
one that takes less than 40 seconds to unstick.  I wrote a quick and
dirty reproduction tool, based on the syscalls mutt was doing that
triggered the problem, attached to this message.  To use it, compile
the file as utime-test on an exported volume, then execute with (cd
/some/mount/point  strace -T ./utime-test) from a nfs4 client.

For whatever, reason I frequently find the second call to utime takes
an irritatingly long time to return and I see something like:
utime(utime-test.c, [2012/08/14-22:47:21, 2012/08/14-17:25:21]) = 0 
70.510913
in the strace output.

I've reproduced this on Debian Squeeze / nfs-utils 1.2.2 based servers
(legacy idmapper, no user-space nfsidmap), as well as Debian Wheezy /
nfs-utils 1.2.6 (uses keyutils upcalls) servers, so I doubt it's a
user-space related issue...  Attempts to bisect have been muddled,
I'll keep trying in the interim, but the best I've been able to pin
things down is that issue was probably introduced in the
419f4319495043a9507ac3e616be9ca60af09744 merge.  I can't repo on a
kernel based on fb21affa49204acd409328415b49bfe90136653c.  (I say
based on, because I have to apply the patch from
http://marc.info/?l=linux-nfsm=133950479803025 or face additional
problems.)

I'll try to get full rcpdebug traces on client and server as the delay
is occuring in the hopes that helps pin things down, and post them
separately.

-- 
Jamie Heilman http://audible.transient.net/~jamie/
#include sys/types.h
#include sys/stat.h
#include fcntl.h
#include unistd.h
#include time.h
#include utime.h
#include stdio.h

#define F_PATH utime-test.c

int main()
{
	struct stat sb;
	struct utimbuf ub;
	int fd;

	if (stat(F_PATH, sb) == -1) {
		perror(NULL);
		return -1;
	}
	ub.modtime = sb.st_mtime;
	ub.actime = time(NULL);
	if (utime(F_PATH, ub) == -1) {
		perror(NULL);
		return -2;
	}
	if ((fd = open(F_PATH, O_RDONLY)) == -1) {
		perror(NULL);
		return -3;
	}
	close(fd);
	if (stat(F_PATH, sb) == -1) {
		perror(NULL);
		return -1;
	}
	ub.modtime = sb.st_mtime;
	ub.actime = time(NULL);
	if (utime(F_PATH, ub) == -1) {
		perror(NULL);
		return -2;
	}

	return 0;
}


Re: v3.5 nfsd4 regression; utime sometimes takes 40+ seconds to return

2012-08-15 Thread Jamie Heilman
Jamie Heilman wrote:
 I'll try to get full rcpdebug traces on client and server as the delay
 is occuring in the hopes that helps pin things down, and post them
 separately.

OK, here are the logs from client and server, where a run of my test program
under strace -T resulted in:

utime(utime-test.c, [2012/08/15-13:35:20, 2012/08/15-13:22:04]) = 0 0.150815
open(utime-test.c, O_RDONLY)  = 3 0.242635
close(3)= 0 0.147768
stat(utime-test.c, {st_mode=S_IFREG|0644, st_size=696, ...}) = 0 0.002772
utime(utime-test.c, [2012/08/15-13:35:20, 2012/08/15-13:22:04]) = 0 
71.878058

The client is a 64-bit v3.4.8 kernel, the server is 32-bit, v3.5.1 +
the two sunprc patches that will be in v3.5.2.

(The client's system clock is a touch faster than the server's, but
these logs start at the same instant.)


-- 
Jamie Heilman http://audible.transient.net/~jamie/


server.gz
Description: Binary data


client.gz
Description: Binary data