Re: AMDGPU: Floating Point traps in Display Core code

Taylor R Campbell Sat, 25 Feb 2023 06:18:23 -0800

> Date: Fri, 24 Feb 2023 23:21:35 -0800
> From: Jeff Frasca <that...@jeff-frasca.name>
> 
> Ok, first off, the FP code I've run into is in the Display
> Core code, specifically in:
>   sys/external/bsd/drm2/dist/drm/amd/display/dc/calcs/amdgpu_dcn_calcs.c
> It's all SIMD code operating on xmmN registers.  To get to
> this codepath, I needed to have CONFIG_DRM_AMD_DC set during
> compilation.  I've attached a diff that adds this to files.amdgpu.
> 
> A typical backtrace printed out by ddb is:
> breakpoint()
> vpanic()
> panic()
> fputrap()
> Xtrap16()
> dcn10_create_resource_pool()
> [...]
> 
> (I had to type it manually from a picture snapped on my
> phone, so, no offsets, if any of those are of interest,
> let me know.)


Probably not, except perhaps the one in dcn10_create_resource_pool to
confirm that it is where you think it is, in dcn_bw_update_from_pplib.

> There's a missing call from the backtrace that (I think)
> gets eaten by the trap jump: dcn_bw_update_from_pplib().
> (It's in amdgpu_dcn_calcs.c)

That's via dcn10_resource_construct, I assume?  (It's a single-use
static, so it presumably gets inlined and has no frame of its own.)

> The actual trap number that's getting generated is 19
> rather than the 16 implied by the call to Xtrap16 (but
> I suspect y'all understand that quirk better than I do.)

The logic looks like this:

IDTVEC(trap16)
        ZTRAP_NJ(T_ARITHTRAP)
.Ldo_fputrap:
...
        call    _C_LABEL(fputrap)
        jmp     .Lalltraps_checkusr
IDTVEC_END(trap16)

IDTVEC(trap19)
        ZTRAP_NJ(T_XMM)
        jmp     .Ldo_fputrap
IDTVEC_END(trap19)

So the return address of fputrap will always live in the Xtrap16
symbol, not the Xtrap19 one, even if it gets there by trap 19.

> dcn_bw_update_from_pplib() dutifully calls the macro
> DC_FP_START(), which I believe Taylor wired up to call
> fpu_kern_enter(), which seems like it should do the right
> thing.  However, the x86 fpu_kern_enter() only appears to
> save registers and mask off the x87 FP trap flag in CR0.
> 
> The instruction that's causing the trap in this case is
> the very first FP instruction in the function, and it's
> tripping the precision exception (MXCSR is set to 0x20
> when printed out in fputrap() by a debug printf I added
> in my local build; this is also where I'm getting the
> trap number 19 rather than 16).

This looks like a mistake on my part.  It's possible that we never
noticed with the crypto code because it largely doesn't deal in
floating-point exceptions, and that's all that we've used the FP/SIMD
unit for in the kernel so far.

But we should have set the MXCSR (and FPSW/FPCW, if that matters) to a
reliable state.  And we need to do that anyway for crypto on CPUs with
the MCDT bug (https://gnats.netbsd.org/57230).

Since we're definitely not in a position to handle floating-point
exception traps in the kernel, I just committed a change to set MXCSR
to 0x1fbf (all exception status bits set, denormals-are-zero disabled,
all exception trap mask bits set, round-to-nearest/ties-to-even,
flush-to-zero disabled).
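
For anyone who wants to check the bit arithmetic, here is a minimal
sketch of a decoder for MXCSR values.  The bit positions are the
architectural ones; the names (mxcsr_decode, mxcsr_all_masked, the
MXCSR_* macros) are made up for illustration and are not the ones
used in the NetBSD tree.

```c
#include <stdbool.h>
#include <stdint.h>

/* Architectural MXCSR layout; macro names are illustrative only. */
#define MXCSR_FLAGS_MASK 0x003fu /* bits 0-5: IE DE ZE OE UE PE (sticky status) */
#define MXCSR_PE         0x0020u /* bit 5: precision (inexact) status flag */
#define MXCSR_DAZ        0x0040u /* bit 6: denormals-are-zero */
#define MXCSR_MASKS_MASK 0x1f80u /* bits 7-12: IM DM ZM OM UM PM (trap masks) */
#define MXCSR_RC_MASK    0x6000u /* bits 13-14: rounding control, 0 = nearest-even */
#define MXCSR_FTZ        0x8000u /* bit 15: flush-to-zero */

struct mxcsr_fields {
	unsigned flags; /* status flags, shifted down to bits 0-5 */
	bool daz;
	unsigned masks; /* trap masks, shifted down to bits 0-5 */
	unsigned rc;    /* rounding control, 0..3 */
	bool ftz;
};

/* Split a raw MXCSR value into its fields. */
static struct mxcsr_fields
mxcsr_decode(uint32_t v)
{
	struct mxcsr_fields f;

	f.flags = v & MXCSR_FLAGS_MASK;
	f.daz = (v & MXCSR_DAZ) != 0;
	f.masks = (v & MXCSR_MASKS_MASK) >> 7;
	f.rc = (v & MXCSR_RC_MASK) >> 13;
	f.ftz = (v & MXCSR_FTZ) != 0;
	return f;
}

/* True if every SIMD floating-point exception is masked (cannot trap). */
static bool
mxcsr_all_masked(uint32_t v)
{
	return (v & MXCSR_MASKS_MASK) == MXCSR_MASKS_MASK;
}
```

With this, mxcsr_decode(0x1fbf) gives flags 0x3f, masks 0x3f, rc 0,
and DAZ and FTZ both clear, matching the description above, while
0x20 decodes to the precision flag alone with no trap masks set.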

Does that change anything?

> Unless I add another function call to DC_FP_START() that
> masks all the non-fatal FP traps in MXCSR.  I tried
> setting it to 0x00001d00 and 0x00009d40.  The former just
> masks the non-fatal traps and the latter tries to set "do
> sane things with edge cases" flags.  (If I try to mask
> MXCSR in fpu_kern_enter(), then some of the crypto code
> breaks.)

Can you be more specific about the crypto code breaking?  Does it
still break with the change I just committed?  (I verified all the
self-tests at boot run under qemu before committing, of course, but
it's possible I broke something on real hardware.)

https://mail-index.netbsd.org/source-changes/2023/02/25/msg143547.html
