[PATCH 2nd try] Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!
On Thu, 14 Dec 2006, Muli Ben-Yehuda wrote:

> The rest looks good. Please resend and I'll add my Acked-by.

Thanks a lot for your comments and suggestions. Here's my 2nd try:

===

From: Karsten Weiss <[EMAIL PROTECTED]>

$ diffstat ~/iommu-patch_v2.patch
 Documentation/kernel-parameters.txt   |    3
 Documentation/x86_64/boot-options.txt |  104 +++---
 arch/x86_64/Kconfig                   |   10 ++-
 arch/x86_64/kernel/pci-dma.c          |   28 +
 4 files changed, 87 insertions(+), 58 deletions(-)

Patch description:

- add SWIOTLB config help text
- mention Documentation/x86_64/boot-options.txt in
  Documentation/kernel-parameters.txt
- remove the duplication of the iommu kernel parameter documentation.
- Better explanation of some of the iommu kernel parameter options.
- "32MB<

---

--- linux-2.6.19/arch/x86_64/kernel/pci-dma.c.original	2006-12-14 11:15:38.348598021 +0100
+++ linux-2.6.19/arch/x86_64/kernel/pci-dma.c	2006-12-14 12:14:48.176967312 +0100
@@ -223,30 +223,10 @@
 }
 EXPORT_SYMBOL(dma_set_mask);

-/* iommu=[size][,noagp][,off][,force][,noforce][,leak][,memaper[=order]][,merge]
-   [,forcesac][,fullflush][,nomerge][,biomerge]
-   size         set size of iommu (in bytes)
-   noagp        don't initialize the AGP driver and use full aperture.
-   off          don't use the IOMMU
-   leak         turn on simple iommu leak tracing (only when CONFIG_IOMMU_LEAK is on)
-   memaper[=order] allocate an own aperture over RAM with size 32MB^order.
-   noforce      don't force IOMMU usage. Default.
-   force        Force IOMMU.
-   merge        Do lazy merging. This may improve performance on some block devices.
-                Implies force (experimental)
-   biomerge     Do merging at the BIO layer. This is more efficient than merge,
-                but should be only done with very big IOMMUs. Implies merge,force.
-   nomerge      Don't do SG merging.
-   forcesac     For SAC mode for masks <40bits (experimental)
-   fullflush    Flush IOMMU on each allocation (default)
-   nofullflush  Don't use IOMMU fullflush
-   allowed      overwrite iommu off workarounds for specific chipsets.
-   soft         Use software bounce buffering (default for Intel machines)
-   noaperture   Don't touch the aperture for AGP.
-   allowdac     Allow DMA >4GB
-   nodac        Forbid DMA >4GB
-   panic        Force panic when IOMMU overflows
-*/
+/*
+ * See <Documentation/x86_64/boot-options.txt> for the iommu kernel
+ * parameter documentation.
+ */
 __init int iommu_setup(char *p)
 {
 	iommu_merge = 1;

--- linux-2.6.19/arch/x86_64/Kconfig.original	2006-12-14 11:37:35.832142506 +0100
+++ linux-2.6.19/arch/x86_64/Kconfig	2006-12-14 14:01:24.009193996 +0100
@@ -431,8 +431,8 @@
 	  on systems with more than 3GB. This is usually needed for USB,
 	  sound, many IDE/SATA chipsets and some other devices.
 	  Provides a driver for the AMD Athlon64/Opteron/Turion/Sempron GART
-	  based IOMMU and a software bounce buffer based IOMMU used on Intel
-	  systems and as fallback.
+	  based hardware IOMMU and a software bounce buffer based IOMMU used
+	  on Intel systems and as fallback.
 	  The code is only active when needed (enough memory and limited
 	  device) unless CONFIG_IOMMU_DEBUG or iommu=force is specified
 	  too.
@@ -458,6 +458,12 @@
 # need this always selected by IOMMU for the VIA workaround
 config SWIOTLB
 	bool
+	help
+	  Support for software bounce buffers used on x86-64 systems
+	  which don't have a hardware IOMMU (e.g. the current generation
+	  of Intel's x86-64 CPUs). Using this, PCI devices which can only
+	  access 32 bits of memory can be used on systems with more than
+	  3 GB of memory. If unsure, say Y.
 config X86_MCE
 	bool "Machine check support" if EMBEDDED

--- linux-2.6.19/Documentation/x86_64/boot-options.txt.original	2006-12-14 11:11:32.099300994 +0100
+++ linux-2.6.19/Documentation/x86_64/boot-options.txt	2006-12-14 14:14:55.869560532 +0100
@@ -180,39 +180,79 @@
   pci=lastbus=NUMBER	Scan up to NUMBER busses, no matter what the mptable says.
   pci=noacpi		Don't use ACPI to set up PCI interrupt routing.

-IOMMU
+IOMMU (input/output memory management unit)

- iommu=[size][,noagp][,off][,force][,noforce][,leak][,memaper[=order]][,merge]
-	[,forcesac][,fullflush][,nomerge][,noaperture]
-   size         set size of iommu (in bytes)
-   noagp        don't initialize the AGP driver and use full aperture.
-   off          don't use the IOMMU
-   leak         turn on simple iommu leak tracing (only when CONFIG_IOMMU_LEAK is on)
-   memaper[=order] allocate an own aperture over RAM with size 32MB^order.
-   noforce      don't force IOMMU usage. Default.
-   force        Force IOMMU.
-   merge        Do SG merging. Implies force (experimental)
-   nomerge      Don't do SG merging.
-   forcesac     For SAC mode for masks <40bits (experimental)
-   fullflush    Flush IOMMU on each allocation (default)
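Since both versions of the patch point readers at iommu_setup() for the
actual option handling, a small stand-alone sketch of the comma-separated
keyword parsing that function performs may help. This is illustrative
only: the flag variables below are simplified stand-ins, and the real
2.6.19 function recognizes many more options than shown here.

/*
 * Simplified sketch of the iommu= option parsing pattern used by
 * iommu_setup() in arch/x86_64/kernel/pci-dma.c (illustration only;
 * the real function sets many more flags and handles more keywords).
 */
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int iommu_off, force_iommu, swiotlb_force;
static long iommu_size;

static int iommu_setup(char *p)
{
        while (*p) {
                if (!strncmp(p, "off", 3))
                        iommu_off = 1;
                else if (!strncmp(p, "noforce", 7))
                        force_iommu = 0;
                else if (!strncmp(p, "force", 5))
                        force_iommu = 1;
                else if (!strncmp(p, "soft", 4))
                        swiotlb_force = 1;      /* use bounce buffering */
                else if (isdigit((unsigned char)*p))
                        iommu_size = strtol(p, &p, 0);  /* leading size */

                /* skip to the character after the next comma */
                p = strpbrk(p, ",");
                if (!p)
                        break;
                p++;
        }
        return 0;
}

int main(void)
{
        char arg[] = "soft,noforce";    /* e.g. from iommu=soft,noforce */

        iommu_setup(arg);
        printf("off=%d force=%d soft=%d size=%ld\n",
               iommu_off, force_iommu, swiotlb_force, iommu_size);
        return 0;
}

The real parser works the same way: match a known keyword at the current
position, set the corresponding global flag, then jump past the next
comma, which is why the options can be chained as in iommu=soft,noforce.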
[PATCH] Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!
On Thu, 14 Dec 2006, Muli Ben-Yehuda wrote:

> On Wed, Dec 13, 2006 at 09:34:16PM +0100, Karsten Weiss wrote:
>
> > BTW: It would be really great if this area of the kernel got some
> > more and better documentation. The information in
> > linux-2.6/Documentation/x86_64/boot_options.txt is very terse. I had
> > to read the code to get a *rough* idea what all the "iommu=" options
> > actually do and how they interact.
>
> Patches happily accepted :-)

Well, you asked for it. :-) So here's my little contribution. Please
*double* *check*!

(BTW: I would like to know what "DAC" and "SAC" mean in this context.)

===

From: Karsten Weiss <[EMAIL PROTECTED]>

Patch summary:

- Better explanation of some of the iommu kernel parameter options.
- "32MB<

---

--- linux-2.6.19/arch/x86_64/kernel/pci-dma.c.original	2006-12-14 11:15:38.348598021 +0100
+++ linux-2.6.19/arch/x86_64/kernel/pci-dma.c	2006-12-14 12:14:48.176967312 +0100
@@ -223,30 +223,10 @@
 }
 EXPORT_SYMBOL(dma_set_mask);

-/* iommu=[size][,noagp][,off][,force][,noforce][,leak][,memaper[=order]][,merge]
-   [,forcesac][,fullflush][,nomerge][,biomerge]
-   size         set size of iommu (in bytes)
-   noagp        don't initialize the AGP driver and use full aperture.
-   off          don't use the IOMMU
-   leak         turn on simple iommu leak tracing (only when CONFIG_IOMMU_LEAK is on)
-   memaper[=order] allocate an own aperture over RAM with size 32MB^order.
-   noforce      don't force IOMMU usage. Default.
-   force        Force IOMMU.
-   merge        Do lazy merging. This may improve performance on some block devices.
-                Implies force (experimental)
-   biomerge     Do merging at the BIO layer. This is more efficient than merge,
-                but should be only done with very big IOMMUs. Implies merge,force.
-   nomerge      Don't do SG merging.
-   forcesac     For SAC mode for masks <40bits (experimental)
-   fullflush    Flush IOMMU on each allocation (default)
-   nofullflush  Don't use IOMMU fullflush
-   allowed      overwrite iommu off workarounds for specific chipsets.
-   soft         Use software bounce buffering (default for Intel machines)
-   noaperture   Don't touch the aperture for AGP.
-   allowdac     Allow DMA >4GB
-   nodac        Forbid DMA >4GB
-   panic        Force panic when IOMMU overflows
-*/
+/*
+ * See <Documentation/x86_64/boot-options.txt> for the iommu kernel
+ * parameter documentation.
+ */
 __init int iommu_setup(char *p)
 {
 	iommu_merge = 1;

--- linux-2.6.19/arch/x86_64/Kconfig.original	2006-12-14 11:37:35.832142506 +0100
+++ linux-2.6.19/arch/x86_64/Kconfig	2006-12-14 11:47:24.346056710 +0100
@@ -431,8 +431,8 @@
 	  on systems with more than 3GB. This is usually needed for USB,
 	  sound, many IDE/SATA chipsets and some other devices.
 	  Provides a driver for the AMD Athlon64/Opteron/Turion/Sempron GART
-	  based IOMMU and a software bounce buffer based IOMMU used on Intel
-	  systems and as fallback.
+	  based hardware IOMMU and a software bounce buffer based IOMMU used
+	  on Intel systems and as fallback.
 	  The code is only active when needed (enough memory and limited
 	  device) unless CONFIG_IOMMU_DEBUG or iommu=force is specified
 	  too.
@@ -458,6 +458,11 @@
 # need this always selected by IOMMU for the VIA workaround
 config SWIOTLB
 	bool
+	help
+	  Support for a software bounce buffer based IOMMU used on Intel
+	  systems which don't have a hardware IOMMU. Using this code,
+	  PCI devices which only support 32-bit memory access can be
+	  used on systems with more than 3 GB.
 config X86_MCE
 	bool "Machine check support" if EMBEDDED

--- linux-2.6.19/Documentation/x86_64/boot-options.txt.original	2006-12-14 11:11:32.099300994 +0100
+++ linux-2.6.19/Documentation/x86_64/boot-options.txt	2006-12-14 12:10:24.028009890 +0100
@@ -180,35 +180,66 @@
   pci=lastbus=NUMBER	Scan up to NUMBER busses, no matter what the mptable says.
   pci=noacpi		Don't use ACPI to set up PCI interrupt routing.

-IOMMU
+IOMMU (input/output memory management unit)
+
+ Currently four x86_64 PCI DMA mapping implementations exist:
+
+ 1. <arch/x86_64/kernel/pci-nommu.c>: use no hardware/software IOMMU at all
+    (e.g. because you have < 3 GB memory).
+    Kernel boot message: "PCI-DMA: Disabling IOMMU"
+
+ 2. <arch/x86_64/kernel/pci-gart.c>: AMD GART based hardware IOMMU.
+    Kernel boot message: "PCI-DMA: using GART IOMMU"
+
+ 3. <arch/x86_64/kernel/pci-swiotlb.c>: Software IOMMU implementation. Used
+    e.g. if there is no hardware IOMMU in the system and it is needed because
+    you have > 3 GB memory or told the kernel to use it (iommu=soft).
+    Kernel boot message: "PCI-DMA: Using software bounce buffering
+    for IO (SWIOTLB)"
+
+ 4. <arch/x86_64/kernel/pci-calgary.c>: IBM Calgary hardware IOMMU. Used in
+    IBM pSeries and xSeries servers. This hardware IOMMU supports DMA address
+    mapping with memory protection, etc.
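The swiotlb implementation documented above (and forced by the
iommu=soft workaround discussed throughout this thread) is built on the
bounce-buffer idea. The following toy sketch shows only the concept, not
the kernel's actual swiotlb code; the pool size is a made-up illustration
value, and the reachability check is done on bus/physical addresses in
the real kernel, not on pointers as here.

/*
 * Toy illustration of software bounce buffering (the SWIOTLB concept
 * only -- NOT the kernel's implementation). A device that can address
 * only 32 bits of memory gets a copy of the data in a low "bounce"
 * buffer, and the result is copied back afterwards.
 */
#include <stdint.h>
#include <string.h>

#define DMA_32BIT_LIMIT 0xffffffffULL

/* Pretend this pool was reserved below 4 GB at boot (assumption). */
static unsigned char bounce_pool[4096];

static int device_can_reach(const void *addr)
{
        /* sketch: the kernel checks physical, not virtual, addresses */
        return (uintptr_t)addr <= DMA_32BIT_LIMIT;
}

/* Map a buffer before DMA: bounce it into low memory if necessary. */
static void *map_for_dma(void *buf, size_t len)
{
        if (device_can_reach(buf))
                return buf;              /* no bounce needed */
        memcpy(bounce_pool, buf, len);   /* copy into DMA-reachable memory */
        return bounce_pool;
}

/* After DMA from the device, copy results back to the real buffer. */
static void unmap_after_dma(void *buf, void *dma_buf, size_t len)
{
        if (dma_buf != buf)
                memcpy(buf, dma_buf, len);
}

The real swiotlb reserves its pool (64 MB by default in this kernel
generation) below 4 GB at boot and hands out chunks of it per mapping.
The extra memcpy is the cost that makes it slower than the GART, but it
bypasses the GART hardware path entirely, which is why it sidesteps the
corruption reported in this thread.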
Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!
On Wed, 13 Dec 2006, Chris Wedgwood wrote:

> > Any ideas why iommu=disabled in the bios does not solve the issue?
>
> The kernel will still use the IOMMU if the BIOS doesn't set it up if
> it can, check your dmesg for IOMMU strings, there might be something
> printed to this effect.

FWIW: As far as I understand the Linux kernel code (I am no kernel
developer, so please correct me if I am wrong), the PCI DMA mapping code
is abstracted by struct dma_mapping_ops, i.e. there are currently four
possible implementations for x86_64 (see linux-2.6/arch/x86_64/kernel/):

1. pci-nommu.c : no IOMMU at all (e.g. because you have < 4 GB memory)
   Kernel boot message: "PCI-DMA: Disabling IOMMU."

2. pci-gart.c : (AMD) hardware IOMMU.
   Kernel boot message: "PCI-DMA: using GART IOMMU"
   (this message first appeared in 2.6.16)

3. pci-swiotlb.c : software IOMMU (used e.g. if there is no hw iommu)
   Kernel boot message: "PCI-DMA: Using software bounce buffering
   for IO (SWIOTLB)"

4. pci-calgary.c : Calgary HW-IOMMU from IBM; used in pSeries servers.
   This HW-IOMMU supports DMA address mapping with memory protection,
   etc.
   Kernel boot message: "PCI-DMA: Using Calgary IOMMU" (since 2.6.18!)

What all this means is that you can use "dmesg | grep ^PCI-DMA:" to see
which implementation your kernel is currently using.

As far as our problem machines are concerned, the "PCI-DMA: using GART
IOMMU" case is broken (data corruption). But both "PCI-DMA: Disabling
IOMMU" (triggered with mem=2g) and "PCI-DMA: Using software bounce
buffering for IO (SWIOTLB)" (triggered with iommu=soft) are stable.

BTW: It would be really great if this area of the kernel got some more
and better documentation. The information in
linux-2.6/Documentation/x86_64/boot_options.txt is very terse. I had to
read the code to get a *rough* idea what all the "iommu=" options
actually do and how they interact.

> > 1) And does this now mean that there's an error in the hardware
> > (chipset or CPU/memcontroller)?
>
> My guess is it's a kernel bug, I don't know for certain. Perhaps we
> should start making a more comprehensive list of affected kernels &
> CPUs?

BTW: Did someone already open an official bug at
http://bugzilla.kernel.org ?

Best regards,
Karsten

-- 
__creating IT solutions
Dipl.-Inf. Karsten Weiss        science + computing ag
phone: +49 7071 9457 452        Hagellocher Weg 73
teamline: +49 7071 9457 681     72070 Tuebingen, Germany
email: [EMAIL PROTECTED]        www.science-computing.de
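To make the abstraction described above concrete: each of the four files
fills in a table of function pointers, and the rest of the kernel calls
through a single pointer chosen at boot. The sketch below is a
simplified, stand-alone rendering of that pattern -- the real struct
dma_mapping_ops of this era has a much longer field list (map_sg,
alloc_coherent, dma_supported, ...), and the nommu callbacks here are
reduced to their essence.

/*
 * Minimal sketch of the dma_mapping_ops idea (simplified; NOT the
 * kernel's actual definitions). Each PCI-DMA implementation fills in
 * a table of function pointers, and a single global pointer selects
 * one implementation at boot time.
 */
#include <stddef.h>

typedef unsigned long dma_addr_t;
struct device;                       /* opaque in this sketch */

struct dma_mapping_ops {
        const char *name;
        dma_addr_t (*map_single)(struct device *dev, void *ptr,
                                 size_t size, int direction);
        void (*unmap_single)(struct device *dev, dma_addr_t addr,
                             size_t size, int direction);
};

/* e.g. pci-nommu.c: DMA address == memory address, no translation */
static dma_addr_t nommu_map_single(struct device *dev, void *ptr,
                                   size_t size, int direction)
{
        return (dma_addr_t)ptr;      /* identity mapping, sketch only */
}

static void nommu_unmap_single(struct device *dev, dma_addr_t addr,
                               size_t size, int direction)
{
        /* nothing to undo */
}

static const struct dma_mapping_ops nommu_ops = {
        .name         = "nommu",
        .map_single   = nommu_map_single,
        .unmap_single = nommu_unmap_single,
};

/* Chosen once at boot (GART, SWIOTLB, Calgary or nommu)... */
static const struct dma_mapping_ops *dma_ops = &nommu_ops;

/* ...and used by every PCI DMA mapping call afterwards: */
dma_addr_t dma_map_single(struct device *dev, void *ptr,
                          size_t size, int direction)
{
        return dma_ops->map_single(dev, ptr, size, direction);
}

This design is why the iommu=soft workaround is possible at all: the
boot parameter simply makes the kernel point dma_ops at the swiotlb
table instead of the GART one, without any driver changes.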
Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!
On Wed, 13 Dec 2006, Erik Andersen wrote:

> On Mon Dec 11, 2006 at 10:24:02AM +0100, Karsten Weiss wrote:
> > Last week we did some more testing with the following result:
> >
> > We could not reproduce the data corruption anymore if we boot the
> > machines with the kernel parameter "iommu=soft", i.e. if we use
> > software bounce buffering instead of the hw-iommu. (As mentioned
> > before, booting with mem=2g works fine, too, because this disables
> > the iommu altogether.)
> >
> > I.e. on these systems the data corruption only happens if the
> > hw-iommu (PCI-GART) of the Opteron CPUs is in use.
> >
> > Christoph, Erik, Chris: I would appreciate if you would test and
> > hopefully confirm this workaround, too.
>
> What did you set the BIOS to when testing this setting?
> Memory Hole enabled? IOMMU enabled?

"Memory hole mapping" was set to "hardware". With "disabled" we only see
3 of our 4 GB of memory.

Best regards,
Karsten

-- 
__creating IT solutions
Dipl.-Inf. Karsten Weiss        science + computing ag
phone: +49 7071 9457 452        Hagellocher Weg 73
teamline: +49 7071 9457 681     72070 Tuebingen, Germany
email: [EMAIL PROTECTED]        www.science-computing.de
Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!
On Wed, 13 Dec 2006, Christoph Anton Mitterer wrote:

> > Christoph, I will carefully re-read your entire posting and the
> > included links on Monday and will also try the memory hole setting.
>
> And did you get out anything new?

As I already mentioned, the kernel parameter "iommu=soft" fixes the data
corruption for me. We saw no more data corruption during a test on 48
machines over the last weekend. Chris Wedgwood already confirmed that
this setting fixed the data corruption for him, too.

Of course, the big question "Why does the hardware iommu *not* work on
those machines?" still remains.

I have also tried setting "memory hole mapping" to "disabled" instead of
"hardware" on some of the machines and this *seems* to work stable, too.
However, I only tested it on about a dozen machines, because this BIOS
setting costs us 1 GB of memory (and iommu=soft does not).

BTW: Maybe I should also mention that other machine types (e.g. the HP
xw9300 dual Opteron workstations) which also use an NVIDIA chipset and
Opterons never had this problem as far as I know.

Best regards,
Karsten

-- 
Dipl.-Inf. Karsten Weiss - http://www.machineroom.de/knweiss
Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!
On Sat, 2 Dec 2006, Karsten Weiss wrote:

> On Sat, 2 Dec 2006, Christoph Anton Mitterer wrote:
>
> > I found a severe bug mainly by fortune because it occurs very rarely.
> > My test looks like the following: I have about 30GB of testing data on
>
> This sounds very familiar! One of the Linux compute clusters I
> administer at work is a 336 node system consisting of the following
> components:
>
> * 2x Dual-Core AMD Opteron 275
> * Tyan S2891 mainboard
> * Hitachi HDS728080PLA380 harddisk
> * 4 GB RAM (some nodes have 8 GB) - intensively tested with memtest86+
> * SUSE 9.3 x86_64 (kernel 2.6.11.4-21.14-smp) - but I've also tried
>   e.g. the latest openSUSE 10.2 RC1+ kernel 2.6.18.2-33, which makes
>   no difference.
>
> We are running LS-Dyna on these machines and discovered a testcase
> which shows a similar data corruption. So I can confirm that the
> problem is real and not a hardware defect of a single machine!

Last week we did some more testing with the following result:

We could not reproduce the data corruption anymore if we boot the
machines with the kernel parameter "iommu=soft", i.e. if we use software
bounce buffering instead of the hw-iommu. (As mentioned before, booting
with mem=2g works fine, too, because this disables the iommu
altogether.)

I.e. on these systems the data corruption only happens if the hw-iommu
(PCI-GART) of the Opteron CPUs is in use.

Christoph, Erik, Chris: I would appreciate if you would test and
hopefully confirm this workaround, too.

Best regards,
Karsten

-- 
__creating IT solutions
Dipl.-Inf. Karsten Weiss        science + computing ag
phone: +49 7071 9457 452        Hagellocher Weg 73
teamline: +49 7071 9457 681     72070 Tuebingen, Germany
email: [EMAIL PROTECTED]        www.science-computing.de
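Anyone confirming the workaround should first verify that the parameter
actually reached the kernel before re-running a corruption test. A
trivial sketch of such a check follows; it only inspects /proc/cmdline
(standard on Linux) and should be combined with
"dmesg | grep ^PCI-DMA:" to see which mapping implementation was really
selected.

/* Quick check that the running kernel was booted with iommu=soft. */
#include <stdio.h>
#include <string.h>

int main(void)
{
        char cmdline[4096];
        FILE *f = fopen("/proc/cmdline", "r");

        if (!f || !fgets(cmdline, sizeof(cmdline), f)) {
                perror("/proc/cmdline");
                return 2;
        }
        fclose(f);

        if (strstr(cmdline, "iommu=soft")) {
                puts("iommu=soft is set");
                return 0;
        }
        puts("iommu=soft is NOT set (also check 'dmesg | grep ^PCI-DMA:')");
        return 1;
}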
Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!
Hello Christoph!

On Sat, 2 Dec 2006, Christoph Anton Mitterer wrote:

> I found a severe bug mainly by fortune because it occurs very rarely.
> My test looks like the following: I have about 30GB of testing data on

This sounds very familiar! One of the Linux compute clusters I
administer at work is a 336 node system consisting of the following
components:

* 2x Dual-Core AMD Opteron 275
* Tyan S2891 mainboard
* Hitachi HDS728080PLA380 harddisk
* 4 GB RAM (some nodes have 8 GB) - intensively tested with memtest86+
* SUSE 9.3 x86_64 (kernel 2.6.11.4-21.14-smp) - but I've also tried e.g.
  the latest openSUSE 10.2 RC1+ kernel 2.6.18.2-33, which makes no
  difference.

We are running LS-Dyna on these machines and discovered a testcase which
shows a similar data corruption. So I can confirm that the problem is
real and not a hardware defect of a single machine!

Here's a diff of a corrupted and a good file written during our testcase
("-" == corrupted file, "+" == good file):

...
009f2ff0  67 2a 4c c4 6d 9d 34 44  ad e6 3c 45 05 9a 4d c4  |g*L.m.4D..

From our testing I can also tell that the data corruption does *not*
appear at all when we are booting the nodes with mem=2G. However, when
we are using all the 4 GB the data corruption shows up - but not every
time and thus not on all nodes. Sometimes a node runs for hours without
any problem. That's why we are testing on 32 nodes in parallel most of
the time. I have the impression that it has something to do with the
physical memory layout of the running processes.

Please also notice that this is a silent data corruption, i.e. there are
no error or warning messages in the kernel log or the mce log at all.

Christoph, I will carefully re-read your entire posting and the included
links on Monday and will also try the memory hole setting.

If somebody has an explanation for this problem I can offer some of our
compute nodes + time for testing, because we really want to get this
fixed as soon as possible.

Best regards,
Karsten

-- 
Dipl.-Inf. Karsten Weiss - http://www.machineroom.de/knweiss
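The testcase described above is, at its core, a write-then-verify loop
over a large file. Below is a minimal sketch of that pattern -- not the
actual LS-Dyna testcase; the file name, size, and fill pattern are
made-up illustration values. To force the verify pass through the
disk-DMA path rather than the page cache, it should run after a reboot
or after dropping the caches.

/*
 * Toy write-then-verify check in the spirit of the testcase described
 * above (NOT the actual LS-Dyna testcase; file name, size and pattern
 * are illustration values only).
 */
#include <stdio.h>
#include <stdlib.h>

#define CHUNK (1 << 20)          /* 1 MB per write  */
#define CHUNKS 128               /* 128 MB in total */

int main(void)
{
        static unsigned char buf[CHUNK], back[CHUNK];
        const char *path = "testfile.bin";   /* hypothetical file name */
        FILE *f;
        size_t i, j;

        /* Fill the buffer with a deterministic pattern. */
        for (j = 0; j < CHUNK; j++)
                buf[j] = (unsigned char)(j * 7 + 13);

        f = fopen(path, "wb");
        if (!f) { perror("fopen"); return 1; }
        for (i = 0; i < CHUNKS; i++)
                if (fwrite(buf, 1, CHUNK, f) != CHUNK) {
                        perror("fwrite");
                        return 1;
                }
        fclose(f);

        /* Verify pass: compare every byte and report mismatch offsets. */
        f = fopen(path, "rb");
        if (!f) { perror("fopen"); return 1; }
        for (i = 0; i < CHUNKS; i++) {
                if (fread(back, 1, CHUNK, f) != CHUNK) {
                        perror("fread");
                        return 1;
                }
                for (j = 0; j < CHUNK; j++)
                        if (back[j] != buf[j])
                                printf("corruption at offset %zu: got %02x, want %02x\n",
                                       i * (size_t)CHUNK + j, back[j], buf[j]);
        }
        fclose(f);
        return 0;
}

Since the corruption in this thread shows up as stretches of wrong bytes
like the hexdump line above, a byte-wise compare that prints the
mismatch offsets is exactly the kind of check that exposes it.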