On 11/02/2019 14:23, Jan Beulich wrote:
>>>> On 11.02.19 at 06:40, <chao....@intel.com> wrote:
>> On Fri, Feb 08, 2019 at 09:29:32AM -0700, Jan Beulich wrote:
>>>>>> On 28.01.19 at 08:06, <chao....@intel.com> wrote:
>>>> +    /*
>>>> +     * Initiate an update on all processors which don't have an online 
>>>> sibling
>>>> +     * thread with a lower thread id. Other sibling threads just await the
>>>> +     * completion of microcode update.
>>>> +     */
>>>> +    if ( cpu == cpumask_first(per_cpu(cpu_sibling_mask, cpu)) )
>>>> +        ret = microcode_update_cpu();
>>>> +    /*
>>>> +     * Increase the wait timeout to a safe value here since we're 
>>>> serializing
>>>> +     * the microcode update and that could take a while on a large number 
>>>> of
>>>> +     * CPUs. And that is fine as the *actual* timeout will be determined 
>>>> by
>>>> +     * the last CPU finished updating and thus cut short
>>>> +     */
>>>> +    if ( wait_for_cpus(&cpu_out, MICROCODE_DEFAULT_TIMEOUT_US * nr_cores) 
>>>> )
>>>> +        panic("Timeout when finishing updating microcode");
>>>
>>> While I expect this to go away again in the next patch, I'd still like to
>>> see this improved, in particular in case the patch here goes in
>>> independently of the next one. After all on a system with 100 cores
>>> the timeout totals to a whopping 3 seconds.
>>
>> To be clear, the timeout remains the same in the next patch due to
>> the serial print clause in apply_microcode().
>>
>>>
>>> Generally the time needed to wait scales by the number of CPUs still
>>> in need of doing the update. And if a timeout is really to occur, it's
>>> perhaps because of one bad core or socket, not because nothing
>>> works at all. Hence it would seem both nice and possible to scale the
>>> "remaining time to wait" by the (known) number of remaining
>>> processors to respond.
>>
>> Basically, I think the benefit is we can recognize the failure earlier
>> if no core called in in a given interval (i.e. 30ms), and trigger a
>> panic. Considering for such case, even with this optimization, the
>> system needs reboot, which generally takes several minutes, what's the
>> value of this optimization?
> 
> Hmm, on one hand this is a fair point you make. Otoh, why do
> you add any timeout at all, if we say we're hosed anyway if the
> timeout expires? You could then as well log a message (say
> once a second) about how many (or which) CPUs still didn't
> respond. The admin can then still reboot the system if desired.

That's not a data center friendly approach.

The ability to do microcode update in an online system might by
risky, but in case of failure requiring access to the console or
power settings of the system isn't nice.

I think doing a panic() after some timeout is a sensible way to
handle a failure.

In case you'd like having a way to wait longer: we could allow the
"noreboot" parameter to be modified at runtime and do the panic only
if opt_noreboot isn't set.


Juergen

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Reply via email to