> Date: Fri, 24 Feb 2023 23:21:35 -0800 > From: Jeff Frasca <that...@jeff-frasca.name> > > Ok, first off, the FP code I've run into is in the Display > Core code, specifically in: > sys/external/bsd/drm2/dist/drm/amd/display/dc/calcs/amdgpu_dcn_calcs.c > It's all SIMD code operating on xmmN registers. To get to > this codepath, I needed to have CONFIG_DRM_AMD_DC set during > compilation. I've attached a diff that adds this to files.amdgpu. > > A typical backtrace printed out by ddb is: > breakpoint() > vpanic() > panic() > fputrap() > Xtrap16() > dcn10_create_resource_pool() > [...] > > (I had to type it manually from a picture snapped on my > phone, so, no offsets, if any of those are of interest, > let me know.)
Probably not, except perhaps the one in dcn10_create_resource_pool to confirm that it is where you think it is, in dcn_bw_update_from_pplib. > There's a missing call from the backtrace that (I think) > gets eaten by the trap jump: dcn_bw_update_from_pplib(). > (It's in amdgpu_dcn_calcs.c) That's via dcn10_resource_construct, I assume? (which is a single-use static that presumably gets compiled away) > The actual trap number that's getting generated is 19 > rather than the 16 implied by the call to Xtrap16 (but > I suspect y'all understand that quirk better than I do.) The logic looks like this: IDTVEC(trap16) ZTRAP_NJ(T_ARITHTRAP) .Ldo_fputrap: ... call _C_LABEL(fputrap) jmp .Lalltraps_checkusr IDTVEC_END(trap16) IDTVEC(trap19) ZTRAP_NJ(T_XMM) jmp .Ldo_fputrap IDTVEC_END(trap19) So the return address of fputrap will always live in the Xtrap16 symbol, not the Xtrap19 one, even if it gets there by trap 19. > dcn_bw_update_from_pplib() dutifully calls the macro > DC_FP_START(), which I believe Taylor wired up to call > fpu_kern_enter(), which seems like it should do the right > thing. However, the x86 fpu_kern_enter() only appears to > save registers and mask off the x87 FP trap flag in CR0. > > The instruction that's causing the trap in this case is > the very first FP instruction in the function, and it's > tripping the precision exception (MXCSR is set to 0x20 > when printed out in fputrap() by a debug printf I added > in my local build; this is also where I'm getting the > trap number 19 rather than 16). This looks like a mistake on my part. It's possible that we never noticed with the crypto code because it largely doesn't deal in floating-point exceptions, and that's all that we've use the FP/SIMD unit for in the kernel so far. But we should have set the MXCSR (and FPSW/FPCW, if that matters) to a reliable state. And we need to do that anyway for crypto on CPUs with the MCDT bug (https://gnats.netbsd.org/57230). Since we're definitely not in a position to handle floating-point exception traps in the kernel, I just committed a change to set MXCSR to 0x1fbf (all exception status bits set, denormals-are-zero disabled, all exception trap mask bits set, round-to-nearest/ties-to-even, flush-to-zero disabled). Does that change anything? > Unless I add another function call to DC_FP_START() that > masks all the non-fatal FP traps in MXCSR. I tried > setting it to 0x00001d00 and 0x00009d40. The former just > masks the non-fatal traps and the latter tries to set "do > sane things with edge cases" flags. (If I try to mask > MXCSR in fpu_kern_enter(), then some of the crypto code > breaks.) Can you be more specific about the crypto code breaking? Does it still break with the change I just committed? (I verified all the self-tests at boot run under qemu before committing, of course, but it's possible I broke something on real hardware.) https://mail-index.netbsd.org/source-changes/2023/02/25/msg143547.html