On 08/10/2025 1:28 pm, Jan Beulich wrote:
> On 04.10.2025 00:53, Andrew Cooper wrote:
>> Under FRED, entry_from_pv() handles everything.  To start with, implement
>> exception handling in the same manner as entry_from_xen(), although we can
>> unconditionally enable interrupts after the async/fatal events.
>>
>> After entry_from_pv() returns, test_all_events() needs to run to perform
>> exception and interrupt injection.  Split entry_FRED_R3() into two and
>> introduce eretu_exit_to_guest() as the latter half, coming unilaterally from
>> restore_all_guest().
>>
>> For all of this, there is a slightly complicated relationship with CONFIG_PV.
>> entry_FRED_R3() must exist irrespective of CONFIG_PV, because it's the
>> entrypoint registered with hardware.  For simplicity, entry_from_pv() is
>> always called, but it collapses into fatal_trap() in the !PV case.
>>
>> Signed-off-by: Andrew Cooper <[email protected]>
> Reviewed-by: Jan Beulich <[email protected]>

Thanks.

>
> Nevertheless ...
>
>> --- a/xen/arch/x86/traps.c
>> +++ b/xen/arch/x86/traps.c
>> @@ -2266,9 +2266,82 @@ void asmlinkage check_ist_exit(const struct cpu_user_regs *regs, bool ist_exit)
>>  
>>  void asmlinkage entry_from_pv(struct cpu_user_regs *regs)
>>  {
>> +    struct fred_info *fi = cpu_regs_fred_info(regs);
>> +    uint8_t type = regs->fred_ss.type;
>> +    uint8_t vec = regs->fred_ss.vector;
>> +
>>      /* Copy fred_ss.vector into entry_vector as IDT delivery would have done. */
>> -    regs->entry_vector = regs->fred_ss.vector;
>> +    regs->entry_vector = vec;
>> +
>> +    if ( !IS_ENABLED(CONFIG_PV) )
>> +        goto fatal;
>> +
>> +    /*
>> +     * First, handle the asynchronous or fatal events.  These are either
>> +     * unrelated to the interrupted context, or may not have valid context
>> +     * recorded, and all have special rules on how/whether to re-enable IRQs.
>> +     */
>> +    switch ( type )
>> +    {
>> +    case X86_ET_EXT_INTR:
>> +        return do_IRQ(regs);
>> +
>> +    case X86_ET_NMI:
>> +        return do_nmi(regs);
>> +
>> +    case X86_ET_HW_EXC:
>> +        switch ( vec )
>> +        {
>> +        case X86_EXC_DF: return do_double_fault(regs);
>> +        case X86_EXC_MC: return do_machine_check(regs);
>> +        }
>> +        break;
>> +    }
>> +
>> +    /*
>> +     * With the asynchronous events handled, what remains are the synchronous
>> +     * ones.  PV guest context always had interrupts enabled.
>> +     */
>> +    local_irq_enable();
>> +
>> +    switch ( type )
>> +    {
>> +    case X86_ET_HW_EXC:
>> +    case X86_ET_PRIV_SW_EXC:
>> +    case X86_ET_SW_EXC:
>> +        switch ( vec )
>> +        {
>> +        case X86_EXC_PF:  handle_PF(regs, fi->edata); break;
>> +        case X86_EXC_GP:  do_general_protection(regs); break;
>> +        case X86_EXC_UD:  do_invalid_op(regs); break;
>> +        case X86_EXC_NM:  do_device_not_available(regs); break;
>> +        case X86_EXC_BP:  do_int3(regs); break;
>> +        case X86_EXC_DB:  handle_DB(regs, fi->edata); break;
>> +        case X86_EXC_CP:  do_entry_CP(regs); break;
>> +
>> +        case X86_EXC_DE:
>> +        case X86_EXC_OF:
>> +        case X86_EXC_BR:
>> +        case X86_EXC_NP:
>> +        case X86_EXC_SS:
>> +        case X86_EXC_MF:
>> +        case X86_EXC_AC:
>> +        case X86_EXC_XM:
>> +            do_trap(regs);
>> +            break;
>>  
>> +        default:
>> +            goto fatal;
>> +        }
>> +        break;
>> +
>> +    default:
>> +        goto fatal;
>> +    }
>> +
>> +    return;
>> +
>> + fatal:
>>      fatal_trap(regs, false);
>>  }
> ... I'm still somewhat bothered by this almost entirely duplicating the
> other entry function, i.e. I continue to wonder if we wouldn't be better
> off by eliminating that duplication (say by way of an always_inline
> helper with a suitable extra parameter).

They are not sufficiently similar.

By the end of this series alone, they differ by IS_ENABLED(CONFIG_PV),
the condition for enabling local interrupts, the ERETU fixup, and the
SYSCALL/SYSENTER handling.

NMI handling is still an open question (deferred for now, because it
functions, albeit inefficiently) and adds a further difference.


>
>> --- a/xen/arch/x86/x86_64/entry.S
>> +++ b/xen/arch/x86/x86_64/entry.S
>> @@ -63,7 +63,7 @@ UNLIKELY_END(syscall_no_callback)
>>          /* Conditionally clear DF */
>>          and   %esi, UREGS_eflags(%rsp)
>>  /* %rbx: struct vcpu */
>> -test_all_events:
>> +LABEL(test_all_events, 0)
>>          ASSERT_NOT_IN_ATOMIC
>>          cli                             # tests must not race interrupts
>>  /*test_softirqs:*/
>> @@ -152,6 +152,8 @@ END(switch_to_kernel)
>>  FUNC_LOCAL(restore_all_guest)
>>          ASSERT_INTERRUPTS_DISABLED
>>  
>> +        ALTERNATIVE "", "jmp eretu_exit_to_guest", X86_FEATURE_XEN_FRED
>> +
>>          /* Stash guest SPEC_CTRL value while we can read struct vcpu. */
>>          mov VCPU_arch_msrs(%rbx), %rdx
> I also continue to wonder if we wouldn't do a tiny bit better by using
>
>         ALTERNATIVE "mov VCPU_arch_msrs(%rbx), %rdx", \
>                     "jmp eretu_exit_to_guest", \
>                     X86_FEATURE_XEN_FRED
>
> Or by converting the few jumps to restore_all_guest to alternatives
> (duplicating the ASSERT_INTERRUPTS_DISABLED there).

I'm quite firmly against this.

Sure, we could save a 5-byte nop, but the cost of doing so is merging
two unrelated pieces of logic into a construct that exists explicitly to
signal two related things.

The added complexity in following the logic is not worth saving a 5-byte nop.

~Andrew
