On 12/11/2024 10:45 am, Jan Beulich wrote: > On 07.11.2024 13:21, Andrew Cooper wrote: >> Fold microcode_update_cpu() into its single remaining caller and simplify the >> logic by removing the patch != NULL path with microcode_mutex held. >> >> Explain why we bother grabbing the microcode revision even if we can't load >> microcode. >> >> Furthermore, delete the -EIO path. An error updating microcode on AP boot or >> S3 resume is certainly bad, but freeing the cache is about the worst possible >> action we can take in response; it prevents subsequent APs from taking an >> update they might have accepted. > I'm afraid I disagree here, but I also disagree with the present error > handling. > -EIO indicates the patch didn't apply. Why would there be any hope that any > other CPU would accept it?
-EIO is "something went wrong". On modern systems this can include "checksum didn't match because there's a bad SRAM cell". This is literally one of the failures leading to the introduction of In-Field-Scan. Individual cores really can fail in a way which won't be the same elsewhere in the system. > Keeping what's cached might be an option, but then followed by cleaning the > cache unless at least one CPU actually accepted the ucode. We already have that behaviour. We cache speculatively on boot, even if the BSP doesn't need to load, because APs might need to. This really is the best we can do. The only other time the cache gets modified is after a late-load attempt which reported success. There are still a lot of partial-failure error cases to handle less badly, but that needs yet more untangling before it can be addressed adequately. ~Andrew
