Re: [Xen-devel] [BUG] EDAC infomation partially missing
On Tue, May 16, 2017 at 03:54:37AM -0600, Jan Beulich wrote:
> >>> On 16.05.17 at 05:47, wrote:
> > I suspect the only paravirtualization needed is to
> > map the physical address of the soft|hard errors to which VM's memory
> > range was effected. What this effects is which VM should panic in case
> > of hard errors.
>
> Which in turn obviously requires hypervisor interaction. It's not really
> clear to me whether perhaps the driver would better live in the
> hypervisor in the first place for that reason.
>
> And there's a second piece of paravirtualization needed: The driver
> doesn't distinguish physical and machine address spaces, yet the
> addresses reported by hardware are machine ones and hence would
> generally need translation to physical ones in order to assign
> Dom0-local meaning to them (or to determine that the address belongs
> to another VM or the hypervisor).

Merely reporting the machine address to Dom0 is already high value since
it lets you attribute the failure to a memory module. Without that you
may have a VM or whole machine randomly crash for a completely unknown
reason.

-- 
(\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/)
\BS (| ehem+sig...@m5p.com PGP 87145445 |) /
\_CS\ | _ -O #include O- _ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445

___
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
Re: [Xen-devel] [BUG] EDAC infomation partially missing
On 16/05/17 10:54, Jan Beulich wrote:
> >>> On 16.05.17 at 05:47, wrote:
>> On Mon, May 15, 2017 at 02:02:53AM -0600, Jan Beulich wrote:
>>> >>> On 14.05.17 at 00:36, wrote:
>>>> I haven't yet done as much experimentation as Andreas Pflug has, but I
>>>> can confirm I'm also running into this bug with Xen 4.4.1.
>>>>
>>>> I've only tried Linux kernel 3.16.43, but as Dom0:
>>>>
>>>> EDAC MC: Ver: 3.0.0
>>>> AMD64 EDAC driver v3.4.0
>>>> EDAC amd64: DRAM ECC enabled.
>>>> EDAC amd64: NB MCE bank disabled, set MSR 0x017b[4] on node 0 to enable.
>>>> EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
>>>> AMD64 EDAC driver v3.4.0
>>>> EDAC amd64: DRAM ECC enabled.
>>>> EDAC amd64: NB MCE bank disabled, set MSR 0x017b[4] on node 0 to enable.
>>>> EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
>>> Afaict the driver as is simply can't work in a Xen Dom0; it needs
>>> enabling (read: para-virtualizing). I'm actually glad to see it doesn't
>>> load (the worse alternative would be for it to load and then do the
>>> wrong thing or give you a false sense of safety of your data).
>> I'm unsure of how to evaluate the situation. Since ECC is enabled in the
>> BIOS, data should be safe whether or not the EDAC driver loads. I
>> /suspect/ the EDAC driver failing to load merely means reporting of ECC
>> errors won't happen.
> "Merely" being relative here: The missing reports mean a false feeling
> of safety, as they may be early indications of later double-bit errors.
>
>> I suspect the only paravirtualization needed is to
>> map the physical address of the soft|hard errors to which VM's memory
>> range was affected. What this affects is which VM should panic in case
>> of hard errors.
> Which in turn obviously requires hypervisor interaction. It's not really
> clear to me whether perhaps the driver would better live in the
> hypervisor in the first place for that reason.

The driver should probably live directly in Xen; it needs to program a
number of northbridge and CPU registers, including interrupt information.

For the reporting side of things, it looks like it would require vMCE to
pass on fault information to guests.

~Andrew
Re: [Xen-devel] [BUG] EDAC infomation partially missing
>>> On 16.05.17 at 05:47, wrote:
> On Mon, May 15, 2017 at 02:02:53AM -0600, Jan Beulich wrote:
>> >>> On 14.05.17 at 00:36, wrote:
>> > I haven't yet done as much experimentation as Andreas Pflug has, but I
>> > can confirm I'm also running into this bug with Xen 4.4.1.
>> >
>> > I've only tried Linux kernel 3.16.43, but as Dom0:
>> >
>> > EDAC MC: Ver: 3.0.0
>> > AMD64 EDAC driver v3.4.0
>> > EDAC amd64: DRAM ECC enabled.
>> > EDAC amd64: NB MCE bank disabled, set MSR 0x017b[4] on node 0 to enable.
>> > EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
>> > AMD64 EDAC driver v3.4.0
>> > EDAC amd64: DRAM ECC enabled.
>> > EDAC amd64: NB MCE bank disabled, set MSR 0x017b[4] on node 0 to enable.
>> > EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
>>
>> Afaict the driver as is simply can't work in a Xen Dom0; it needs
>> enabling (read: para-virtualizing). I'm actually glad to see it doesn't
>> load (the worse alternative would be for it to load and then do the
>> wrong thing or give you a false sense of safety of your data).
>
> I'm unsure of how to evaluate the situation. Since ECC is enabled in the
> BIOS, data should be safe whether or not the EDAC driver loads. I
> /suspect/ the EDAC driver failing to load merely means reporting of ECC
> errors won't happen.

"Merely" being relative here: The missing reports mean a false feeling
of safety, as they may be early indications of later double-bit errors.

> I suspect the only paravirtualization needed is to
> map the physical address of the soft|hard errors to which VM's memory
> range was affected. What this affects is which VM should panic in case
> of hard errors.

Which in turn obviously requires hypervisor interaction. It's not really
clear to me whether perhaps the driver would better live in the
hypervisor in the first place for that reason.

And there's a second piece of paravirtualization needed: The driver
doesn't distinguish physical and machine address spaces, yet the
addresses reported by hardware are machine ones and hence would
generally need translation to physical ones in order to assign
Dom0-local meaning to them (or to determine that the address belongs
to another VM or the hypervisor).

Jan
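
The translation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Extent` table, its field names, and the owner labels are hypothetical and do not correspond to any Xen interface; a real implementation would live in the hypervisor and consult its own frame ownership data. The sketch shows the two questions that have to be answered for a reported machine address: who owns the frame, and what address the owner knows it by.

```python
from bisect import bisect_right
from typing import NamedTuple, Optional, Sequence, Tuple

PAGE_SHIFT = 12  # 4 KiB pages, as on x86


class Extent(NamedTuple):
    """A contiguous run of machine frames with one owner (hypothetical layout)."""
    first_mfn: int  # first machine frame number of the run
    count: int      # number of frames in the run
    owner: str      # illustrative label, e.g. "Xen", "Dom0", "DomU-1"
    first_pfn: int  # owner-local (physical) frame backed by first_mfn


def locate(extents: Sequence[Extent], machine_addr: int) -> Optional[Tuple[str, int]]:
    """Map a machine address to (owner, owner-local physical address).

    `extents` must be sorted by first_mfn and must not overlap.
    Returns None when no extent covers the address.
    """
    mfn = machine_addr >> PAGE_SHIFT
    starts = [e.first_mfn for e in extents]
    i = bisect_right(starts, mfn) - 1  # last extent starting at or before mfn
    if i < 0:
        return None
    e = extents[i]
    if mfn >= e.first_mfn + e.count:
        return None  # address falls in a hole between extents
    pfn = e.first_pfn + (mfn - e.first_mfn)
    offset = machine_addr & ((1 << PAGE_SHIFT) - 1)
    return e.owner, (pfn << PAGE_SHIFT) | offset
```

With such a table, an address the hardware reports can either be given Dom0-local meaning or be recognised as belonging to another VM or to the hypervisor, which is the distinction the unmodified driver cannot make.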
Re: [Xen-devel] [BUG] EDAC infomation partially missing
On Mon, May 15, 2017 at 02:02:53AM -0600, Jan Beulich wrote:
> >>> On 14.05.17 at 00:36, wrote:
> > I haven't yet done as much experimentation as Andreas Pflug has, but I
> > can confirm I'm also running into this bug with Xen 4.4.1.
> >
> > I've only tried Linux kernel 3.16.43, but as Dom0:
> >
> > EDAC MC: Ver: 3.0.0
> > AMD64 EDAC driver v3.4.0
> > EDAC amd64: DRAM ECC enabled.
> > EDAC amd64: NB MCE bank disabled, set MSR 0x017b[4] on node 0 to enable.
> > EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
> > AMD64 EDAC driver v3.4.0
> > EDAC amd64: DRAM ECC enabled.
> > EDAC amd64: NB MCE bank disabled, set MSR 0x017b[4] on node 0 to enable.
> > EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
>
> Afaict the driver as is simply can't work in a Xen Dom0; it needs
> enabling (read: para-virtualizing). I'm actually glad to see it doesn't
> load (the worse alternative would be for it to load and then do the
> wrong thing or give you a false sense of safety of your data).

I'm unsure of how to evaluate the situation. Since ECC is enabled in the
BIOS, data should be safe whether or not the EDAC driver loads. I
/suspect/ the EDAC driver failing to load merely means reporting of ECC
errors won't happen.

I suspect the only paravirtualization needed is to map the physical
address of the soft|hard errors to which VM's memory range was affected.
What this affects is which VM should panic in case of hard errors.

Depending upon the environment there may or may not be cause to report
soft errors anywhere besides Dom0. In most cases a soft error will at
worst trigger a desire to replace the memory module, but not trigger a
panic for the affected VM. It is only once a hard error occurs that it
is urgent to warn the affected VM and cause a panic; in this case it may
also be desirable to first alert Dom0 anyway.

As such I'm inclined to think force-enabling ECC EDAC monitoring in Dom0
is the best approach for now. As long as a hard error doesn't occur in
Dom0's address range, Dom0 is in the best position to deal with the
situation. The worst case is a hard error occurring in Xen's address
range, since that will mean all VMs on the machine are likely to be
toast.

I think this should be a fairly high priority for Xen since ECC memory
is a feature very common on systems running with a hypervisor.

-- 
(\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/)
\BS (| ehem+sig...@m5p.com PGP 87145445 |) /
\_CS\ | _ -O #include O- _ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
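
The handling proposed in this message can be written down as a small decision function. The sketch below is only a restatement of the prose in Python: the `Severity` names, the owner labels, and the action strings are invented for illustration and are not part of Xen or of the EDAC driver.

```python
from enum import Enum
from typing import List


class Severity(Enum):
    CORRECTED = "corrected"      # "soft" error: ECC repaired the data
    UNCORRECTED = "uncorrected"  # "hard" error: data is lost


def plan_response(severity: Severity, owner: str) -> List[str]:
    """Return the ordered actions for an error in memory owned by `owner`.

    `owner` is an illustrative label such as "Xen", "Dom0" or "DomU-1".
    """
    # Dom0 is always told first, so the failing module can be identified.
    actions = ["report to Dom0"]
    if severity is Severity.CORRECTED:
        return actions  # a soft error is no reason to panic anyone
    if owner == "Xen":
        # Hypervisor memory is corrupt: every VM on the host is suspect.
        actions.append("panic host")
    else:
        actions.append(f"panic {owner}")
    return actions
```

The ordering matters more than the labels: Dom0 is alerted before the affected VM is brought down, and only a hard error in Xen's own range escalates to the whole machine.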
Re: [Xen-devel] [BUG] EDAC infomation partially missing
>>> On 14.05.17 at 00:36, wrote:
> I haven't yet done as much experimentation as Andreas Pflug has, but I
> can confirm I'm also running into this bug with Xen 4.4.1.
>
> I've only tried Linux kernel 3.16.43, but as Dom0:
>
> EDAC MC: Ver: 3.0.0
> AMD64 EDAC driver v3.4.0
> EDAC amd64: DRAM ECC enabled.
> EDAC amd64: NB MCE bank disabled, set MSR 0x017b[4] on node 0 to enable.
> EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
> AMD64 EDAC driver v3.4.0
> EDAC amd64: DRAM ECC enabled.
> EDAC amd64: NB MCE bank disabled, set MSR 0x017b[4] on node 0 to enable.
> EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.

Afaict the driver as is simply can't work in a Xen Dom0; it needs
enabling (read: para-virtualizing). I'm actually glad to see it doesn't
load (the worse alternative would be for it to load and then do the
wrong thing or give you a false sense of safety of your data).

Jan
Re: [Xen-devel] [BUG] EDAC infomation partially missing
I haven't yet done as much experimentation as Andreas Pflug has, but I
can confirm I'm also running into this bug with Xen 4.4.1.

I've only tried Linux kernel 3.16.43, but as Dom0:

EDAC MC: Ver: 3.0.0
AMD64 EDAC driver v3.4.0
EDAC amd64: DRAM ECC enabled.
EDAC amd64: NB MCE bank disabled, set MSR 0x017b[4] on node 0 to enable.
EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
AMD64 EDAC driver v3.4.0
EDAC amd64: DRAM ECC enabled.
EDAC amd64: NB MCE bank disabled, set MSR 0x017b[4] on node 0 to enable.
EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.

Whereas directly booting:

EDAC MC: Ver: 3.0.0
AMD64 EDAC driver v3.4.0
EDAC amd64: DRAM ECC enabled.
EDAC amd64: F10h detected (node 0).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 0MB 3: 0MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 4096MB 1: 4096MB
EDAC amd64: MC: 2: 0MB 3: 0MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x4 syndromes.
EDAC amd64: MCT channel count: 2
EDAC amd64: CS0: Unbuffered DDR3 RAM
EDAC amd64: CS1: Unbuffered DDR3 RAM
EDAC MC0: Giving out device to module amd64_edac controller F10h: DEV :00:18.2 (INTERRUPT)
EDAC PCI0: Giving out device to module amd64_edac controller EDAC PCI controller: DEV :00:18.2 (POLLED)

I have not tried force-enabling ECC checking.

Since I place high value on my data, I rate this as a rather important
bug.

-- 
(\___(\___(\__ --=> 8-) EHM <=-- __/)___/)___/)
\BS (| ehem+sig...@m5p.com PGP 87145445 |) /
\_CS\ | _ -O #include O- _ | / _/
8A19\___\_|_/58D2 7E3D DDF4 7BA6 <-PGP-> 41D1 B375 37D0 8714\_|_/___/5445
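
Comparing such excerpts by eye gets tedious once several Xen and kernel versions are involved. The following Python sketch condenses a dmesg excerpt into the three facts that differ between the good and bad boots. It is an assumption-laden helper, not a general EDAC log parser: the regular expressions only match the message wording quoted in this thread, and the function and key names are made up for the example.

```python
import re
from typing import Dict, List

# Patterns follow the amd64_edac messages quoted in this thread only.
NODE_REFUSED = re.compile(r"NB MCE bank disabled, set MSR 0x017b\[4\] on node (\d+)")
NODE_PROBED = re.compile(r"F[0-9A-Fa-f]+h detected \(node (\d+)\)")
MC_REGISTERED = re.compile(r"EDAC MC(\d+): Giving out device")


def summarize(dmesg_text: str) -> Dict[str, List[int]]:
    """Collect which nodes were refused or probed and which MCs were registered."""
    refused, probed, registered = set(), set(), set()
    buckets = (
        (NODE_REFUSED, refused),
        (NODE_PROBED, probed),
        (MC_REGISTERED, registered),
    )
    for line in dmesg_text.splitlines():
        for pattern, bucket in buckets:
            match = pattern.search(line)
            if match:
                bucket.add(int(match.group(1)))
    return {
        "refused_nodes": sorted(refused),
        "probed_nodes": sorted(probed),
        "registered_mcs": sorted(registered),
    }
```

Feeding it the output of `dmesg` from each boot and comparing the resulting dictionaries shows at a glance which node the driver gave up on.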
Re: [Xen-devel] [BUG] EDAC infomation partially missing
On 21.01.16 at 17:41, Jan Beulich wrote:
> >>> On 20.01.16 at 16:01, wrote:
>> Initially reported to Debian
>> (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=810964), redirected here:
>>
>> With AMD Opteron 6xxx processors, half of the memory controllers are
>> missing from /sys/devices/system/edac/mc
>> Checked with single 6120 (dual memory controller) and twin 6344 (2x dual
>> MC), other dual-module CPUs might be affected too.
>>
>> Booting plain Linux (3.2, 3.16, 4.1, 4.3), all memory controllers are
>> listed under /sys/devices/system/edac/mc as expected. The same happens
>> when Xen 4.1 is used: all MCs present.
>>
>> Starting with Xen 4.4 (Debian Jessie), only mc1 (on the single CPU
>> machine) or mc2/mc3 (dual CPU machine) are present, although the full
>> system memory is accessible. Checked versions were 4.1.4 (Debian
>> Wheezy), 4.4.1 (Jessie) and 4.6.0 (Sid)
> As already indicated by Ian in that bug, you should supply us with
> full kernel and hypervisor logs for both the good and bad cases
> (ideally with the same kernel version used in both runs, so that we
> can exclude kernel behavior differences).

Here are some dmesg excerpts, all taken with Linux 4.1.3.

When booting with Xen 4.1.4:

AMD64 EDAC driver v3.4.0
EDAC amd64: DRAM ECC enabled.
EDAC amd64: F10h detected (node 0).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC MC0: Giving out device to module amd64_edac controller F10h: DEV :00:18.2 (INTERRUPT)
EDAC amd64: DRAM ECC enabled.
EDAC amd64: F10h detected (node 1).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC MC1: Giving out device to module amd64_edac controller F10h: DEV :00:19.2 (INTERRUPT)

When booting with Xen 4.4.1:

AMD64 EDAC driver v3.4.0
EDAC amd64: DRAM ECC enabled.
EDAC amd64: NB MCE bank disabled, set MSR 0x017b[4] on node 0 to enable.
EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
 Either enable ECC checking or force module loading by setting 'ecc_enable_override'.
 (Note that use of the override may cause unknown side effects.)
EDAC amd64: DRAM ECC enabled.
EDAC amd64: F10h detected (node 1).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC MC1: Giving out device to module amd64_edac controller F10h: DEV :00:19.2 (INTERRUPT)

Apparently Xen 4.4 doesn't report the BIOS flag correctly. I added
ecc_enable_override=1 to amd64_edac_mod, and then I get:

EDAC MC: Ver: 3.0.0
AMD64 EDAC driver v3.4.0
EDAC amd64: DRAM ECC enabled.
EDAC amd64: NB MCE bank disabled, set MSR 0x017b[4] on node 0 to enable.
EDAC amd64: ECC disabled in the BIOS or no ECC capability, module will not load.
EDAC amd64: Forcing ECC on!
EDAC amd64: F10h detected (node 0).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC MC0: Giving out device to module amd64_edac controller F10h: DEV :00:18.2 (INTERRUPT)
EDAC amd64: DRAM ECC enabled.
EDAC amd64: F10h detected (node 1).
EDAC MC: DCT0 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC MC: DCT1 chip selects:
EDAC amd64: MC: 0: 0MB 1: 0MB
EDAC amd64: MC: 2: 2048MB 3: 2048MB
EDAC amd64: MC: 4: 0MB 5: 0MB
EDAC amd64: MC: 6: 0MB 7: 0MB
EDAC amd64: using x8 syndromes.
EDAC amd64: MCT channel count: 2
EDAC MC1: Giving out device to module amd64_edac controller F10h: DEV :00:19.2 (INTERRUPT)

This restored both MCs, so the BIOS flag seems
Re: [Xen-devel] [BUG] EDAC infomation partially missing
>>> On 22.01.16 at 10:09, wrote:
> When booting with Xen 4.4.1:
>
> AMD64 EDAC driver v3.4.0
> EDAC amd64: DRAM ECC enabled.
> EDAC amd64: NB MCE bank disabled, set MSR 0x017b[4] on node 0 to enable.

I wonder how valid this message is. We actually write this MSR with
all ones during boot.

However, considering involved functions like
nb_mce_bank_enabled_on_node() or node_to_amd_nb() taking
node IDs as inputs, and considering that PV guests (including
Dom0) don't have a topology matching that of the host, I doubt
very much that this driver is even remotely prepared to run
under Xen. It working on Xen 4.1.x would then be by pure
accident.

Jan
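
For anyone wanting to check what Dom0 actually sees in that register, here is a minimal Python sketch. It assumes the Linux `msr` driver is loaded and that the script runs as root; MSR 0x17B is IA32_MCG_CTL, and the bit index 4 is taken from the "MSR 0x017b[4]" wording of the driver message. The function names are made up for the example. Note that under Xen the value a Dom0 vCPU reads need not be what Xen wrote to the physical register, and a vCPU does not correspond to a fixed host node, which is exactly the topology problem raised above.

```python
import os
import struct

MSR_IA32_MCG_CTL = 0x17B  # the register named in the driver message
NB_MCE_BANK_BIT = 4       # bit index from "MSR 0x017b[4]"


def nb_mce_bank_enabled(mcg_ctl: int) -> bool:
    """Return True if the northbridge MCE bank bit is set in an MCG_CTL value."""
    return bool((mcg_ctl >> NB_MCE_BANK_BIT) & 1)


def read_msr(cpu: int, msr: int) -> int:
    """Read one MSR through /dev/cpu/<cpu>/msr (needs root and the msr module).

    The msr device returns the 64-bit register value when 8 bytes are read
    at a file offset equal to the MSR number.
    """
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)
```

A call such as `nb_mce_bank_enabled(read_msr(0, MSR_IA32_MCG_CTL))` then reports the bit as seen from CPU 0 of whatever environment the script runs in.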
Re: [Xen-devel] [BUG] EDAC infomation partially missing
On 22.01.16 at 11:40, Jan Beulich wrote:
> >>> On 22.01.16 at 10:09, wrote:
>> When booting with Xen 4.4.1:
>>
>> AMD64 EDAC driver v3.4.0
>> EDAC amd64: DRAM ECC enabled.
>> EDAC amd64: NB MCE bank disabled, set MSR 0x017b[4] on node 0 to enable.
> I wonder how valid his message is. We actually write this MSR with
> all ones during boot.
>
> However, considering involved functions like
> nb_mce_bank_enabled_on_node() or node_to_amd_nb() taking
> node IDs as inputs, and considering that PV guests (including
> Dom0) don't have a topology matching that of the host, I doubt
> very much that this driver is even remotely prepared to run
> under Xen. It working on Xen 4.1.x would then be by pure
> accident.

The dmesg is identical with or without Xen 4.1, so I'd guess it does
work if flags are detected correctly.

Regards
Andreas
Re: [Xen-devel] [BUG] EDAC infomation partially missing
>>> On 20.01.16 at 16:01, wrote:
> Initially reported to Debian
> (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=810964), redirected here:
>
> With AMD Opteron 6xxx processors, half of the memory controllers are
> missing from /sys/devices/system/edac/mc
> Checked with single 6120 (dual memory controller) and twin 6344 (2x dual
> MC), other dual-module CPUs might be affected too.
>
> Booting plain Linux (3.2, 3.16, 4.1, 4.3), all memory controllers are
> listed under /sys/devices/system/edac/mc as expected. The same happens
> when Xen 4.1 is used: all MCs present.
>
> Starting with Xen 4.4 (Debian Jessie), only mc1 (on the single CPU
> machine) or mc2/mc3 (dual CPU machine) are present, although the full
> system memory is accessible. Checked versions were 4.1.4 (Debian
> Wheezy), 4.4.1 (Jessie) and 4.6.0 (Sid)

As already indicated by Ian in that bug, you should supply us with
full kernel and hypervisor logs for both the good and bad cases
(ideally with the same kernel version used in both runs, so that we
can exclude kernel behavior differences).

Jan
[Xen-devel] [BUG] EDAC infomation partially missing
Initially reported to Debian
(http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=810964), redirected here:

With AMD Opteron 6xxx processors, half of the memory controllers are
missing from /sys/devices/system/edac/mc.
Checked with a single 6120 (dual memory controller) and twin 6344 (2x dual
MC); other dual-module CPUs might be affected too.

Booting plain Linux (3.2, 3.16, 4.1, 4.3), all memory controllers are
listed under /sys/devices/system/edac/mc as expected. The same happens
when Xen 4.1 is used: all MCs present.

Starting with Xen 4.4 (Debian Jessie), only mc1 (on the single CPU
machine) or mc2/mc3 (dual CPU machine) are present, although the full
system memory is accessible. Checked versions were 4.1.4 (Debian
Wheezy), 4.4.1 (Jessie) and 4.6.0 (Sid).
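
The check described in this report is easy to script when several boots need comparing. The Python sketch below lists the `mcN` entries under the EDAC sysfs directory named above; the function name and the `root` parameter are made up for the example, and the parameter exists only so the function can be pointed at a copy of the tree.

```python
import os
import re
from typing import List


def list_memory_controllers(root: str = "/sys/devices/system/edac/mc") -> List[str]:
    """Return the mcN entries found under `root`, ordered by controller number.

    Returns an empty list when the directory does not exist, for example
    because no EDAC driver is loaded.
    """
    try:
        entries = os.listdir(root)
    except FileNotFoundError:
        return []
    pattern = re.compile(r"mc(\d+)$")
    numbers = sorted(
        int(match.group(1))
        for match in (pattern.match(name) for name in entries)
        if match
    )
    return [f"mc{n}" for n in numbers]
```

On the single-CPU machine described here, one would expect `['mc0', 'mc1']` on plain Linux and only `['mc1']` under Xen 4.4, going by the report above.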