Re: [Xen-devel] [PATCH v2] x86/microcode: Synchronize late microcode loading

2018-05-02 Thread Jan Beulich
>>> On 01.05.18 at 10:15,  wrote:
> On Mon, Apr 30, 2018 at 09:25:26AM -0600, Jan Beulich wrote:
> On 25.04.18 at 13:46,  wrote:
>>> +static int do_microcode_update(void *_info)
>>> +{
>>> +struct microcode_info *info = _info;
>>> +unsigned int cpu = smp_processor_id();
>>> +int ret;
>>> +
>>> +ret = wait_for_cpus(&info->cpu_in, MICROCODE_DEFAULT_TIMEOUT);
>>> +if ( ret )
>>> +return ret;
>>> +/*
>>> + * Logical threads which set the first bit in cpu_sibling_mask can do
>>> + * the update. Other sibling threads just await the completion of
>>> + * microcode update.
>>> + */
>>> +if ( cpumask_test_and_set_cpu(
>>> +cpumask_first(per_cpu(cpu_sibling_mask, cpu)), &info->cpus) )
>>> +ret = microcode_update_cpu(info->buffer, info->buffer_size);
>>
>>Isn't the condition inverted (i.e. missing a ! )?
> 
> Yes.
> 
>>
>>Also I take it that you've confirmed that loading ucode in parallel on multiple
>>cores of the same socket is not a problem? The comment in the last hunk
>>suggests otherwise.
> 
> No. In microcode_update_cpu(), microcode_mutex makes the update
> sequential.

Oh, right, of course.
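For clarity on the inversion discussed above: cpumask_test_and_set_cpu() returns the bit's *old* value, so only the first sibling to arrive sees 0 and should perform the update, which is why the missing `!` matters. A minimal stand-alone sketch of that semantics (test_and_set_bit() here is a hypothetical single-word stand-in for the Xen cpumask primitive, not the real API):

```c
#include <stdbool.h>

/* Hypothetical stand-in for cpumask_test_and_set_cpu(): sets the bit and
 * returns its previous value, mirroring the Xen primitive's semantics. */
static bool test_and_set_bit(unsigned long *mask, unsigned int bit)
{
    bool old = *mask & (1UL << bit);

    *mask |= 1UL << bit;
    return old;
}

/* With the corrected '!' the first thread of a core (old bit == 0) does
 * the update; its HT siblings (old bit == 1) only wait for completion. */
static bool should_do_update(unsigned long *cpus, unsigned int first_sibling)
{
    return !test_and_set_bit(cpus, first_sibling);
}
```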

>>> +
>>> +return ret;
>>>  }
>>
>>You're losing this return value (once for every CPU making it into this
>>function).
> 
> I don't understand this remark. This function is called by
> stop_machine_run(). stop_machine_run() could return error if
any cpu failed during update. We don't care about the specific CPU or
how many CPUs failed to do the update.

Then please check your stop_machine_run() invocation again, in particular
what happens with that function's return value.

Jan



_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xenproject.org
https://lists.xenproject.org/mailman/listinfo/xen-devel

Re: [Xen-devel] [PATCH v2] x86/microcode: Synchronize late microcode loading

2018-05-01 Thread Chao Gao
On Mon, Apr 30, 2018 at 09:25:26AM -0600, Jan Beulich wrote:
 On 25.04.18 at 13:46,  wrote:
>> @@ -281,24 +288,56 @@ static int microcode_update_cpu(const void *buf, size_t size)
>>  return err;
>>  }
>>  
>> -static long do_microcode_update(void *_info)
>> +/* Wait for all CPUs to rendezvous with a timeout (us) */
>> +static int wait_for_cpus(atomic_t *cnt, int timeout)
>
>unsigned int
>
>> +static int do_microcode_update(void *_info)
>> +{
>> +struct microcode_info *info = _info;
>> +unsigned int cpu = smp_processor_id();
>> +int ret;
>> +
>> +ret = wait_for_cpus(&info->cpu_in, MICROCODE_DEFAULT_TIMEOUT);
>> +if ( ret )
>> +return ret;
>> +/*
>> + * Logical threads which set the first bit in cpu_sibling_mask can do
>> + * the update. Other sibling threads just await the completion of
>> + * microcode update.
>> + */
>> +if ( cpumask_test_and_set_cpu(
>> +cpumask_first(per_cpu(cpu_sibling_mask, cpu)), &info->cpus) )
>> +ret = microcode_update_cpu(info->buffer, info->buffer_size);
>
>Isn't the condition inverted (i.e. missing a ! )?

Yes.

>
>Also I take it that you've confirmed that loading ucode in parallel on multiple
>cores of the same socket is not a problem? The comment in the last hunk
>suggests otherwise.

No. In microcode_update_cpu(), microcode_mutex makes the update
sequential.

>
>> +/*
>> + * Increase the wait timeout to a safe value here since we're serializing
>> + * the microcode update and that could take a while on a large number of
>> + * CPUs. And that is fine as the *actual* timeout will be determined by
>> + * the last CPU finished updating and thus cut short
>> + */
>> +if ( wait_for_cpus(&info->cpu_out, MICROCODE_DEFAULT_TIMEOUT *
>> +   num_online_cpus()) )
>> +panic("Timeout when finishing updating microcode");
>
>A 3s timeout (as an example for a system with 100 CPU threads) is still
>absurdly high to me, but considering you panic() anyway if you hit the
>timeout the question mainly is whether there's a slim chance for this to
>complete a brief moment before the timeout expires. If all goes well,
>you won't come close to even 1s, but as said before - there may be
>guests running, and they may become utterly confused if they don't
>get any time within a second or more.
>
>With you no longer doing things sequentially I don't, however, see why

No. It is still sequential. And only one thread in a core will try to
acquire the lock -- microcode_mutex.

>you need to scale the timeout by CPU count.

Maybe by the number of cores. But I didn't find an existing variable to
count cores.
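For reference, the rendezvous helper under discussion can be sketched as below; total_cpus stands in for num_online_cpus(), and the per-iteration microsecond wait (udelay(1) in the hypervisor) is elided, so this is an assumption-laden sketch rather than the patch's actual body:

```c
#include <stdatomic.h>
#include <errno.h>

/* Sketch of the rendezvous: each CPU increments the counter on entry and
 * spins until every expected CPU has arrived or the timeout (in us,
 * assuming one polling step per microsecond) expires. */
static int wait_for_cpus(atomic_int *cnt, unsigned int timeout,
                         int total_cpus)
{
    atomic_fetch_add(cnt, 1);
    while ( atomic_load(cnt) != total_cpus )
    {
        if ( !timeout-- )
            return -EBUSY;   /* rendezvous incomplete within the timeout */
        /* udelay(1) here in the real hypervisor */
    }
    return 0;
}
```

Note that the caller panics on a non-zero return only for the cpu_out rendezvous, since at that point some CPUs may already run new microcode while others do not.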

>
>> +
>> +return ret;
>>  }
>
>You're losing this return value (once for every CPU making it into this
>function).

I don't understand this remark. This function is called by
stop_machine_run(). stop_machine_run() could return error if
any cpu failed during update. We don't care about the specific CPU or
how many CPUs failed to do the update.

Thanks
Chao

>
>> @@ -318,26 +357,52 @@ int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
>>  
>>  ret = copy_from_guest(info->buffer, buf, len);
>>  if ( ret != 0 )
>> -{
>> -xfree(info);
>> -return ret;
>> -}
>> +goto free;
>>  
>>  info->buffer_size = len;
>> -info->error = 0;
>> -info->cpu = cpumask_first(&cpu_online_map);
>> +
>> +/* cpu_online_map must not change during update */
>> +if ( !get_cpu_maps() )
>> +{
>> +ret = -EBUSY;
>> +goto free;
>> +}
>>  
>>  if ( microcode_ops->start_update )
>>  {
>>  ret = microcode_ops->start_update();
>>  if ( ret != 0 )
>> -{
>> -xfree(info);
>> -return ret;
>> -}
>> +goto put;
>>  }
>>  
>> -return continue_hypercall_on_cpu(info->cpu, do_microcode_update, info);
>> +cpumask_empty(&info->cpus);
>
>DYM cpumask_clear()?
>
>> +atomic_set(&info->cpu_in, 0);
>> +atomic_set(&info->cpu_out, 0);
>> +
>> +/*
>> + * We intend to disable interrupt for long time, which may lead to
>> + * watchdog timeout.
>> + */
>> +watchdog_disable();
>> +/*
>> + * Late loading dance. Why the heavy-handed stop_machine effort?
>> + *
>> + * -HT siblings must be idle and not execute other code while the other
>> + *  sibling is loading microcode in order to avoid any negative
>> + *  interactions cause by the loading.
>> + *
>> + * -In addition, microcode update on the cores must be serialized until
>> + *  this requirement can be relaxed in the feature. Right now, this is
>> + *  conservative and good.
>
>This is the comment I've referred to above.
>
>Jan
>


Re: [Xen-devel] [PATCH v2] x86/microcode: Synchronize late microcode loading

2018-04-30 Thread Jan Beulich
>>> On 25.04.18 at 13:46,  wrote:
> @@ -281,24 +288,56 @@ static int microcode_update_cpu(const void *buf, size_t size)
>  return err;
>  }
>  
> -static long do_microcode_update(void *_info)
> +/* Wait for all CPUs to rendezvous with a timeout (us) */
> +static int wait_for_cpus(atomic_t *cnt, int timeout)

unsigned int

> +static int do_microcode_update(void *_info)
> +{
> +struct microcode_info *info = _info;
> +unsigned int cpu = smp_processor_id();
> +int ret;
> +
> +ret = wait_for_cpus(&info->cpu_in, MICROCODE_DEFAULT_TIMEOUT);
> +if ( ret )
> +return ret;
> +/*
> + * Logical threads which set the first bit in cpu_sibling_mask can do
> + * the update. Other sibling threads just await the completion of
> + * microcode update.
> + */
> +if ( cpumask_test_and_set_cpu(
> +cpumask_first(per_cpu(cpu_sibling_mask, cpu)), &info->cpus) )
> +ret = microcode_update_cpu(info->buffer, info->buffer_size);

Isn't the condition inverted (i.e. missing a ! )?

Also I take it that you've confirmed that loading ucode in parallel on multiple
cores of the same socket is not a problem? The comment in the last hunk
suggests otherwise.

> +/*
> + * Increase the wait timeout to a safe value here since we're serializing
> + * the microcode update and that could take a while on a large number of
> + * CPUs. And that is fine as the *actual* timeout will be determined by
> + * the last CPU finished updating and thus cut short
> + */
> +if ( wait_for_cpus(&info->cpu_out, MICROCODE_DEFAULT_TIMEOUT *
> +   num_online_cpus()) )
> +panic("Timeout when finishing updating microcode");

A 3s timeout (as an example for a system with 100 CPU threads) is still
absurdly high to me, but considering you panic() anyway if you hit the
timeout the question mainly is whether there's a slim chance for this to
complete a brief moment before the timeout expires. If all goes well,
you won't come close to even 1s, but as said before - there may be
guests running, and they may become utterly confused if they don't
get any time within a second or more.

With you no longer doing things sequentially I don't, however, see why
you need to scale the timeout by CPU count.
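Jan's arithmetic can be made concrete. Assuming a per-CPU default of 30,000us (an inferred value, consistent with his "3s for 100 CPU threads" example, not taken from the patch itself), scaling by the online thread count gives:

```c
/* MICROCODE_DEFAULT_TIMEOUT_US is an assumed value chosen so that
 * 100 threads * 30ms = 3s, matching the figure quoted in the review. */
#define MICROCODE_DEFAULT_TIMEOUT_US 30000u

/* Total wait budget for the cpu_out rendezvous when the timeout is
 * scaled linearly by the number of online threads. */
static unsigned int scaled_timeout_us(unsigned int online_threads)
{
    return MICROCODE_DEFAULT_TIMEOUT_US * online_threads;
}
```

Since the rendezvous exits as soon as the last CPU finishes, the scaled value is only an upper bound; the objection is that a multi-second upper bound is still long enough to confuse running guests.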

> +
> +return ret;
>  }

You're losing this return value (once for every CPU making it into this
function).

> @@ -318,26 +357,52 @@ int microcode_update(XEN_GUEST_HANDLE_PARAM(const_void) buf, unsigned long len)
>  
>  ret = copy_from_guest(info->buffer, buf, len);
>  if ( ret != 0 )
> -{
> -xfree(info);
> -return ret;
> -}
> +goto free;
>  
>  info->buffer_size = len;
> -info->error = 0;
> -info->cpu = cpumask_first(&cpu_online_map);
> +
> +/* cpu_online_map must not change during update */
> +if ( !get_cpu_maps() )
> +{
> +ret = -EBUSY;
> +goto free;
> +}
>  
>  if ( microcode_ops->start_update )
>  {
>  ret = microcode_ops->start_update();
>  if ( ret != 0 )
> -{
> -xfree(info);
> -return ret;
> -}
> +goto put;
>  }
>  
> -return continue_hypercall_on_cpu(info->cpu, do_microcode_update, info);
> +cpumask_empty(&info->cpus);

DYM cpumask_clear()?

> +atomic_set(&info->cpu_in, 0);
> +atomic_set(&info->cpu_out, 0);
> +
> +/*
> + * We intend to disable interrupt for long time, which may lead to
> + * watchdog timeout.
> + */
> +watchdog_disable();
> +/*
> + * Late loading dance. Why the heavy-handed stop_machine effort?
> + *
> + * -HT siblings must be idle and not execute other code while the other
> + *  sibling is loading microcode in order to avoid any negative
> + *  interactions cause by the loading.
> + *
> + * -In addition, microcode update on the cores must be serialized until
> + *  this requirement can be relaxed in the feature. Right now, this is
> + *  conservative and good.

This is the comment I've referred to above.

Jan

