[PATCH 2nd try] Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!

2006-12-14 Thread Karsten Weiss
On Thu, 14 Dec 2006, Muli Ben-Yehuda wrote:

> The rest looks good. Please resend and I'll add my Acked-by.

Thanks a lot for your comments and suggestions. Here's my 2nd try:

===

From: Karsten Weiss <[EMAIL PROTECTED]>

$ diffstat ~/iommu-patch_v2.patch
 Documentation/kernel-parameters.txt   |    3 +
 Documentation/x86_64/boot-options.txt |  104 +++---
 arch/x86_64/Kconfig                   |   10 ++-
 arch/x86_64/kernel/pci-dma.c          |   28 +------
 4 files changed, 87 insertions(+), 58 deletions(-)

Patch description:

- add SWIOTLB config help text
- mention Documentation/x86_64/boot-options.txt in
  Documentation/kernel-parameters.txt
- remove the duplication of the iommu kernel parameter documentation
- better explanation of some of the iommu kernel parameter options
- "32MB<<order" instead of "32MB^order"

---

--- linux-2.6.19/arch/x86_64/kernel/pci-dma.c.original  2006-12-14 11:15:38.348598021 +0100
+++ linux-2.6.19/arch/x86_64/kernel/pci-dma.c   2006-12-14 12:14:48.176967312 +0100
@@ -223,30 +223,10 @@
 }
 EXPORT_SYMBOL(dma_set_mask);
 
-/* iommu=[size][,noagp][,off][,force][,noforce][,leak][,memaper[=order]][,merge]
- [,forcesac][,fullflush][,nomerge][,biomerge]
-   size  set size of iommu (in bytes)
-   noagp don't initialize the AGP driver and use full aperture.
-   off   don't use the IOMMU
-   leak  turn on simple iommu leak tracing (only when CONFIG_IOMMU_LEAK is on)
-   memaper[=order] allocate an own aperture over RAM with size 32MB^order.
-   noforce don't force IOMMU usage. Default.
-   force  Force IOMMU.
-   merge  Do lazy merging. This may improve performance on some block devices.
-  Implies force (experimental)
-   biomerge Do merging at the BIO layer. This is more efficient than merge,
-but should be only done with very big IOMMUs. Implies merge,force.
-   nomerge Don't do SG merging.
-   forcesac For SAC mode for masks <40bits  (experimental)
-   fullflush Flush IOMMU on each allocation (default)
-   nofullflush Don't use IOMMU fullflush
-   allowed  overwrite iommu off workarounds for specific chipsets.
-   soft Use software bounce buffering (default for Intel machines)
-   noaperture Don't touch the aperture for AGP.
-   allowdac Allow DMA >4GB
-   nodacForbid DMA >4GB
-   panicForce panic when IOMMU overflows
-*/
+/*
+ * See <Documentation/x86_64/boot-options.txt> for the iommu kernel parameter
+ * documentation.
+ */
 __init int iommu_setup(char *p)
 {
iommu_merge = 1;
--- linux-2.6.19/arch/x86_64/Kconfig.original   2006-12-14 11:37:35.832142506 +0100
+++ linux-2.6.19/arch/x86_64/Kconfig    2006-12-14 14:01:24.009193996 +0100
@@ -431,8 +431,8 @@
  on systems with more than 3GB. This is usually needed for USB,
  sound, many IDE/SATA chipsets and some other devices.
  Provides a driver for the AMD Athlon64/Opteron/Turion/Sempron GART
- based IOMMU and a software bounce buffer based IOMMU used on Intel
- systems and as fallback.
+ based hardware IOMMU and a software bounce buffer based IOMMU used
+ on Intel systems and as fallback.
  The code is only active when needed (enough memory and limited
  device) unless CONFIG_IOMMU_DEBUG or iommu=force is specified
  too.
@@ -458,6 +458,12 @@
 # need this always selected by IOMMU for the VIA workaround
 config SWIOTLB
bool
+   help
+ Support for software bounce buffers used on x86-64 systems
+ which don't have a hardware IOMMU (e.g. the current generation
+ of Intel's x86-64 CPUs). Using this, PCI devices which can only
+ access 32 bits of memory can be used on systems with more than
+ 3 GB of memory. If unsure, say Y.
 
 config X86_MCE
bool "Machine check support" if EMBEDDED
--- linux-2.6.19/Documentation/x86_64/boot-options.txt.original 2006-12-14 11:11:32.099300994 +0100
+++ linux-2.6.19/Documentation/x86_64/boot-options.txt  2006-12-14 14:14:55.869560532 +0100
@@ -180,39 +180,79 @@
   pci=lastbus=NUMBER  Scan upto NUMBER busses, no matter what the mptable says.
   pci=noacpi   Don't use ACPI to set up PCI interrupt routing.
 
-IOMMU
+IOMMU (input/output memory management unit)
 
- iommu=[size][,noagp][,off][,force][,noforce][,leak][,memaper[=order]][,merge]
- [,forcesac][,fullflush][,nomerge][,noaperture]
-   size  set size of iommu (in bytes)
-   noagp don't initialize the AGP driver and use full aperture.
-   off   don't use the IOMMU
-   leak  turn on simple iommu leak tracing (only when CONFIG_IOMMU_LEAK is on)
-   memaper[=order] allocate an own aperture over RAM with size 32MB^order.
-   noforce don't force IOMMU usage. Default.
-   force  Force IOMMU.
-   merge  Do SG merging. Implies force (experimental)
-   nomerge Don't do SG merging.
-   forcesac For SAC mode for masks <40bits  (experimental)
-   fullflush Flush IOMMU on each allocation (default)

[PATCH] Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!

2006-12-14 Thread Karsten Weiss
On Thu, 14 Dec 2006, Muli Ben-Yehuda wrote:

> On Wed, Dec 13, 2006 at 09:34:16PM +0100, Karsten Weiss wrote:
> 
> > BTW: It would be really great if this area of the kernel would get some 
> > more and better documentation. The information at 
> > linux-2.6/Documentation/x86_64/boot_options.txt is very terse. I had to 
> > read the code to get a *rough* idea what all the "iommu=" options 
> > actually do and how they interact.
> 
> Patches happily accepted :-)

Well, you asked for it. :-) So here's my little contribution. Please 
*double* *check*!

(BTW: I would like to know what "DAC" and "SAC" mean in this context)

===

From: Karsten Weiss <[EMAIL PROTECTED]>

Patch summary:

- Better explanation of some of the iommu kernel parameter options.
- "32MB<<order" instead of "32MB^order"

---

--- linux-2.6.19/arch/x86_64/kernel/pci-dma.c.original  2006-12-14 11:15:38.348598021 +0100
+++ linux-2.6.19/arch/x86_64/kernel/pci-dma.c   2006-12-14 12:14:48.176967312 +0100
@@ -223,30 +223,10 @@
 }
 EXPORT_SYMBOL(dma_set_mask);
 
-/* iommu=[size][,noagp][,off][,force][,noforce][,leak][,memaper[=order]][,merge]
- [,forcesac][,fullflush][,nomerge][,biomerge]
-   size  set size of iommu (in bytes)
-   noagp don't initialize the AGP driver and use full aperture.
-   off   don't use the IOMMU
-   leak  turn on simple iommu leak tracing (only when CONFIG_IOMMU_LEAK is on)
-   memaper[=order] allocate an own aperture over RAM with size 32MB^order.
-   noforce don't force IOMMU usage. Default.
-   force  Force IOMMU.
-   merge  Do lazy merging. This may improve performance on some block devices.
-  Implies force (experimental)
-   biomerge Do merging at the BIO layer. This is more efficient than merge,
-but should be only done with very big IOMMUs. Implies merge,force.
-   nomerge Don't do SG merging.
-   forcesac For SAC mode for masks <40bits  (experimental)
-   fullflush Flush IOMMU on each allocation (default)
-   nofullflush Don't use IOMMU fullflush
-   allowed  overwrite iommu off workarounds for specific chipsets.
-   soft Use software bounce buffering (default for Intel machines)
-   noaperture Don't touch the aperture for AGP.
-   allowdac Allow DMA >4GB
-   nodacForbid DMA >4GB
-   panicForce panic when IOMMU overflows
-*/
+/*
+ * See <Documentation/x86_64/boot-options.txt> for the iommu kernel parameter
+ * documentation.
+ */
 __init int iommu_setup(char *p)
 {
iommu_merge = 1;
--- linux-2.6.19/arch/x86_64/Kconfig.original   2006-12-14 11:37:35.832142506 +0100
+++ linux-2.6.19/arch/x86_64/Kconfig    2006-12-14 11:47:24.346056710 +0100
@@ -431,8 +431,8 @@
  on systems with more than 3GB. This is usually needed for USB,
  sound, many IDE/SATA chipsets and some other devices.
  Provides a driver for the AMD Athlon64/Opteron/Turion/Sempron GART
- based IOMMU and a software bounce buffer based IOMMU used on Intel
- systems and as fallback.
+ based hardware IOMMU and a software bounce buffer based IOMMU used
+ on Intel systems and as fallback.
  The code is only active when needed (enough memory and limited
  device) unless CONFIG_IOMMU_DEBUG or iommu=force is specified
  too.
@@ -458,6 +458,11 @@
 # need this always selected by IOMMU for the VIA workaround
 config SWIOTLB
bool
+   help
+ Support for a software bounce buffer based IOMMU used on Intel
+ systems which don't have a hardware IOMMU. Using this code,
+ PCI devices that can only access 32 bits of memory can be
+ used on systems with more than 3 GB of memory.
 
 config X86_MCE
bool "Machine check support" if EMBEDDED
--- linux-2.6.19/Documentation/x86_64/boot-options.txt.original 2006-12-14 11:11:32.099300994 +0100
+++ linux-2.6.19/Documentation/x86_64/boot-options.txt  2006-12-14 12:10:24.028009890 +0100
@@ -180,35 +180,66 @@
   pci=lastbus=NUMBER  Scan upto NUMBER busses, no matter what the mptable says.
   pci=noacpi   Don't use ACPI to set up PCI interrupt routing.
 
-IOMMU
+IOMMU (input/output memory management unit)
+
+ Currently four x86_64 PCI DMA mapping implementations exist:
+
+   1. <arch/x86_64/kernel/pci-nommu.c> : use no hardware/software IOMMU
+  at all (e.g. because you have < 3 GB memory).
+  Kernel boot message: "PCI-DMA: Disabling IOMMU"
+
+   2. <arch/x86_64/kernel/pci-gart.c> : AMD GART based hardware IOMMU.
+  Kernel boot message: "PCI-DMA: using GART IOMMU"
+
+   3. <arch/x86_64/kernel/pci-swiotlb.c> : Software IOMMU implementation.
+  Used e.g. if there is no hardware IOMMU in the system and it is needed
+  because you have >3GB memory or told the kernel to use it (iommu=soft).
+  Kernel boot message: "PCI-DMA: Using software bounce buffering
+  for IO (SWIOTLB)"
+
+   4. <arch/x86_64/kernel/pci-calgary.c> : IBM Calgary hardware IOMMU.
+  Used in IBM pSeries and xSeries servers. This hardware IOMMU supports
+  DMA address mapping with memory protection, etc.
+  Kernel boot message: "PCI-DMA: Using Calgary IOMMU"

Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!

2006-12-13 Thread Karsten Weiss
On Wed, 13 Dec 2006, Chris Wedgwood wrote:

> > Any ideas why iommu=disabled in the bios does not solve the issue?
> 
> The kernel will still use the IOMMU if the BIOS doesn't set it up if
> it can, check your dmesg for IOMMU strings, there might be something
> printed to this effect.

FWIW: As far as I understand the linux kernel code (I am no kernel 
developer so please correct me if I am wrong) the PCI dma mapping code is 
abstracted by struct dma_mapping_ops, i.e. there are currently four 
possible implementations for x86_64 (see linux-2.6/arch/x86_64/kernel/ 
and the simplified sketch after the list below):

1. pci-nommu.c : no IOMMU at all (e.g. because you have < 4 GB memory)
   Kernel boot message: "PCI-DMA: Disabling IOMMU."

2. pci-gart.c : (AMD) Hardware-IOMMU.
   Kernel boot message: "PCI-DMA: using GART IOMMU" (this message
   first appeared in 2.6.16)

3. pci-swiotlb.c : Software-IOMMU (used e.g. if there is no hw iommu)
   Kernel boot message: "PCI-DMA: Using software bounce buffering 
   for IO (SWIOTLB)"

4. pci-calgary.c : Calgary HW-IOMMU from IBM; used in pSeries servers. 
   This HW-IOMMU supports dma address mapping with memory protection,
   etc.
   Kernel boot message: "PCI-DMA: Using Calgary IOMMU" (since 2.6.18!)
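
To illustrate the abstraction, here is a simplified sketch (*not* the
exact 2.6.19 definition - the real struct has more hooks): each
implementation fills in a table of function pointers and the kernel
dispatches all PCI DMA mapping calls through it.

struct dma_mapping_ops {
        dma_addr_t (*map_single)(struct device *hwdev, void *ptr,
                                 size_t size, int direction);
        void (*unmap_single)(struct device *hwdev, dma_addr_t addr,
                             size_t size, int direction);
        int (*map_sg)(struct device *hwdev, struct scatterlist *sg,
                      int nents, int direction);
        void (*unmap_sg)(struct device *hwdev, struct scatterlist *sg,
                         int nents, int direction);
        int (*dma_supported)(struct device *hwdev, u64 mask);
        /* ... more hooks (alloc_coherent, mapping_error, ...) ... */
};

/* pci-nommu.c, pci-gart.c, pci-swiotlb.c and pci-calgary.c each
 * provide one instance; exactly one is selected at boot time: */
extern struct dma_mapping_ops *dma_ops;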

What all this means is that you can use "dmesg|grep ^PCI-DMA:" to see 
which implementation your kernel is currently using.
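
For example, a node that selected the hardware GART IOMMU prints the
message quoted above:

  $ dmesg | grep ^PCI-DMA:
  PCI-DMA: using GART IOMMU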

As far as our problem machines are concerned the "PCI-DMA: using GART 
IOMMU" case is broken (data corruption). But both "PCI-DMA: Disabling 
IOMMU" (triggered with mem=2g) and "PCI-DMA: Using software bounce buffering 
for IO (SWIOTLB)" (triggered with iommu=soft) are stable.
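
In case "software bounce buffering" is unclear, the idea is roughly
the following (illustrative pseudo-C only - the helper names are made
up and the real lib/swiotlb.c is much more involved). Buffers the
device cannot reach are substituted by copies in low memory:

dma_addr_t bounce_map_single(void *buf, size_t size, int dir)
{
        /* If the buffer is already below the device's DMA limit
         * (e.g. 4GB for a 32-bit device), just use it directly. */
        if (virt_to_phys(buf) + size <= DEVICE_DMA_LIMIT)
                return virt_to_phys(buf);

        /* Otherwise "bounce" through a preallocated low buffer. */
        void *bounce = alloc_from_low_pool(size);   /* made-up helper */
        if (dir == DMA_TO_DEVICE)
                memcpy(bounce, buf, size);  /* copy out before the DMA */
        remember_mapping(bounce, buf, size);        /* made-up helper: for
                                                       the copy back on unmap */
        return virt_to_phys(bounce);
}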

BTW: It would be really great if this area of the kernel got some 
more and better documentation. The information in 
linux-2.6/Documentation/x86_64/boot-options.txt is very terse. I had to 
read the code to get a *rough* idea what all the "iommu=" options 
actually do and how they interact.
 
> > 1) And does this now mean that there's an error in the hardware
> > (chipset or CPU/memcontroller)?
> 
> My guess is it's a kernel bug, I don't know for certain.  Perhaps we
> should start making a more comprehensive list of affected kernels &
> CPUs?

BTW: Did someone already open an official bug at 
http://bugzilla.kernel.org?

Best regards,
Karsten

-- 
__creating IT solutions
Dipl.-Inf. Karsten Weiss           science + computing ag
phone:    +49 7071 9457 452        Hagellocher Weg 73
teamline: +49 7071 9457 681        72070 Tuebingen, Germany
email:    [EMAIL PROTECTED]        www.science-computing.de



Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!

2006-12-13 Thread Karsten Weiss
On Wed, 13 Dec 2006, Erik Andersen wrote:

> On Mon Dec 11, 2006 at 10:24:02AM +0100, Karsten Weiss wrote:
> > Last week we did some more testing with the following result:
> > 
> > We could not reproduce the data corruption anymore if we boot the machines 
> > with the kernel parameter "iommu=soft" i.e. if we use software bounce 
> > buffering instead of the hw-iommu. (As mentioned before, booting with 
> > mem=2g works fine, too, because this disables the iommu altogether.)
> > 
> > I.e. on these systems the data corruption only happens if the hw-iommu 
> > (PCI-GART) of the Opteron CPUs is in use.
> > 
> > Christoph, Erik, Chris: I would appreciate if you would test and hopefully 
> > confirm this workaround, too.
> 
> What did you set the BIOS to when testing this setting?
> Memory Hole enabled?  IOMMU enabled?

"Memory hole mapping" was set to "hardware". With "disabled" we only
see 3 of our 4 GB memory.

Best regards,
Karsten

-- 
__creating IT solutions
Dipl.-Inf. Karsten Weiss           science + computing ag
phone:    +49 7071 9457 452        Hagellocher Weg 73
teamline: +49 7071 9457 681        72070 Tuebingen, Germany
email:    [EMAIL PROTECTED]        www.science-computing.de



Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!

2006-12-13 Thread Karsten Weiss

On Wed, 13 Dec 2006, Christoph Anton Mitterer wrote:


> > Christoph, I will carefully re-read your entire posting and the
> > included links on Monday and will also try the memory hole
> > setting.
>
> And did you get out anything new?

As I already mentioned the kernel parameter "iommu=soft" fixes
the data corruption for me. We saw no more data corruption
during a test on 48 machines over the last weekend. Chris
Wedgwood already confirmed that this setting fixed the data
corruption for him, too.

Of course, the big question "Why does the hardware iommu *not*
work on those machines?" still remains.

I have also tried setting "memory hole mapping" to "disabled"
instead of "hardware" on some of the machines and this *seems*
to be stable, too. However, I only tested it on about a
dozen machines because this BIOS setting costs us 1 GB of memory
(and iommu=soft does not).

BTW: Maybe I should also mention that other machine types
(e.g. the HP xw9300 dual Opteron workstations) which also use an
NVIDIA chipset and Opterons never had this problem as far as I
know.

Best regards,
Karsten

--
Dipl.-Inf. Karsten Weiss - http://www.machineroom.de/knweiss


Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!

2006-12-11 Thread Karsten Weiss
On Sat, 2 Dec 2006, Karsten Weiss wrote:

> On Sat, 2 Dec 2006, Christoph Anton Mitterer wrote:
> 
> > I found a severe bug mainly by fortune because it occurs very rarely.
> > My test looks like the following: I have about 30GB of testing data on
> 
> This sounds very familiar! One of the Linux compute clusters I
> administer at work is a 336 node system consisting of the
> following components:
> 
> * 2x Dual-Core AMD Opteron 275
> * Tyan S2891 mainboard
> * Hitachi HDS728080PLA380 harddisk
> * 4 GB RAM (some nodes have 8 GB) - intensively tested with
>   memtest86+
> * SUSE 9.3 x86_64 (kernel 2.6.11.4-21.14-smp) - But I've also
>   e.g. tried the latest openSUSE 10.2 RC1+ kernel 2.6.18.2-33 which
>   makes no difference.
> 
> We are running LS-Dyna on these machines and discovered a
> testcase which shows a similar data corruption. So I can
> confirm that the problem is real and not a hardware defect
> of a single machine!

Last week we did some more testing with the following result:

We could not reproduce the data corruption anymore if we boot the machines 
with the kernel parameter "iommu=soft" i.e. if we use software bounce 
buffering instead of the hw-iommu. (As mentioned before, booting with 
mem=2g works fine, too, because this disables the iommu altogether.)
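
(In case somebody wants to try the workaround: with GRUB you simply
append the option to the kernel line of your boot entry, e.g. - kernel
path and root device here are examples only:

  title Linux
        kernel /boot/vmlinuz root=/dev/sda2 iommu=soft

With LILO the equivalent is an append="iommu=soft" line in lilo.conf.)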

I.e. on these systems the data corruption only happens if the hw-iommu 
(PCI-GART) of the Opteron CPUs is in use.

Christoph, Erik, Chris: I would appreciate if you would test and hopefully 
confirm this workaround, too.

Best regards,
Karsten

-- 
__creating IT solutions
Dipl.-Inf. Karsten Weiss           science + computing ag
phone:    +49 7071 9457 452        Hagellocher Weg 73
teamline: +49 7071 9457 681        72070 Tuebingen, Germany
email:    [EMAIL PROTECTED]        www.science-computing.de



Re: data corruption with nvidia chipsets and IDE/SATA drives // memory hole mapping related bug?!

2006-12-02 Thread Karsten Weiss

Hello Christoph!

On Sat, 2 Dec 2006, Christoph Anton Mitterer wrote:


> I found a severe bug mainly by fortune because it occurs very rarely.
> My test looks like the following: I have about 30GB of testing data on

This sounds very familiar! One of the Linux compute clusters I
administer at work is a 336 node system consisting of the
following components:

* 2x Dual-Core AMD Opteron 275
* Tyan S2891 mainboard
* Hitachi HDS728080PLA380 harddisk
* 4 GB RAM (some nodes have 8 GB) - intensively tested with
  memtest86+
* SUSE 9.3 x86_64 (kernel 2.6.11.4-21.14-smp) - But I've also
  e.g. tried the latest openSUSE 10.2 RC1+ kernel 2.6.18.2-33 which
  makes no difference.

We are running LS-Dyna on these machines and discovered a
testcase which shows a similar data corruption. So I can
confirm that the problem is real and not a hardware defect
of a single machine!

Here's a diff of a corrupted and a good file written during our
testcase:

("-" == corrupted file, "+" == good file)
...
 009f2ff0  67 2a 4c c4 6d 9d 34 44  ad e6 3c 45 05 9a 4d c4  |g*L.m.4D..<E..M.|
...

From our testing I can also tell that the data corruption does
*not* appear at all when we are booting the nodes with mem=2G.
However, when we are using all the 4GB the data corruption
shows up - but not every time and thus not on all nodes.
Sometimes a node runs for hours without any problem. That's why
we are testing on 32 nodes in parallel most of the time. I have
the impression that it has something to do with the physical memory
layout of the running processes.
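
BTW: A diff like the one above can be generated e.g. like this
(file names are examples only):

  $ hexdump -C corrupted.file > corrupted.hex
  $ hexdump -C good.file      > good.hex
  $ diff -u corrupted.hex good.hex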

Please also notice that this is a silent data corruption. I.e.
there are no error or warning messages in the kernel log or the
mce log at all.

Christoph, I will carefully re-read your entire posting and the
included links on Monday and will also try the memory hole
setting.

If somebody has an explanation for this problem I can offer
some of our compute nodes+time for testing because we really
want to get this fixed as soon as possible.

Best regards,
Karsten

--
Dipl.-Inf. Karsten Weiss - http://www.machineroom.de/knweiss