Re: ECC and DMA to/from disk controllers
* Alan Cox ([EMAIL PROTECTED]) [20070910 14:54]:

Alan,

Thanks for your interest (and Bruce, for posting).

> - The ECC level on the drive processors and memory cache vary
> by vendor. Good luck getting any information on this although
> maybe if you are Cern sized they will talk

Do you have any contacts? We're in contact directly with the system integrators only, not the drive manufacturers.

> The next usual mess is network transfers. [...]

All our data is based on system-local probes (i.e. no network involved).

> Type III wrong block on PATA fits with the fact the block number
> isn't protected and also the limits on the cache quality of
> drives/drive firmware bugs.

Thanks, it's new information. I was planning to extend fsprobe with locality information inside the buffers so that we can catch this as it is happening.

> For drivers/ide there are *lots* of problems with error handling
> so that might be implicated (would want to do old v new ide
> tests on the same h/w which would be very intriguing).

We tried to “force” these corruptions out from their hiding places on targeted systems, but we failed miserably. Currently we can't reproduce the issue at will, even on the affected systems.

> Stale data from disk cache I've seen reported, also offsets from
> FIFO hardware bugs (The LOTR render farm hit the latter and had
> to avoid UDMA to avoid a hardware bug)

That's interesting, I'll think about how to expose this. Currently a single pass writes data only once, so I don't think any chunk can live hours long in the drives' cache.

> Chunks of zero sounds like caches again, would be interesting to
> know what hardware changes occurred at the point they began to
> pop up and what software.

They seem to be popping more frequently on ARECA-based boxes. The “software” is a running target as we gradually upgrade the computer center.

> We also see chipset bugs under high contention some of which
> are explained and worked around (VIA ones in the past), others
> we see are clear correlations - eg between Nvidia chipsets and
> Silicon Image SATA controllers.

Most of our workhorses are 3ware controllers, the CPU nodes usually have Intel SATA chips.

The fsprobe utility we run in the background on practically all our boxes is available at http://cern.ch/Peter.Kelemen/fsprobe/ . We have it deployed on several thousand machines to gather data. I know that some other HEP institutes looked at it, but I have no information on who's running it on how many boxes, let alone what it found. I would be very much interested in whatever findings people have.

Peter

--
     .+'''+.         .+'''+.         .+'''+.         .+'''+.         .+''
 Kelemen Péter     /       \       /       \     [EMAIL PROTECTED]
.+'         `+...+'         `+...+'         `+...+'         `+...+'

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
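The idea of putting locality information inside the probe buffers can be illustrated with a small sketch. This is not fsprobe's actual code or on-disk layout: the block size, the header layout, and the names `make_block` and `classify_block` are all assumptions made for illustration. Each block is filled with a repeated (pass id, block index) tag, so a block that comes back intact but carries the wrong tag can be told apart from random bit damage.

```python
import struct

BLOCK_SIZE = 4096                 # assumed block size, not taken from fsprobe
HEADER = struct.Struct("<QQ")     # hypothetical tag: (pass id, block index)


def make_block(pass_id: int, index: int, size: int = BLOCK_SIZE) -> bytes:
    """Fill a block with its own (pass id, block index) tag, repeated."""
    return HEADER.pack(pass_id, index) * (size // HEADER.size)


def classify_block(data: bytes, pass_id: int, index: int) -> str:
    """Compare a block read back from disk with what was written there."""
    if data == make_block(pass_id, index, len(data)):
        return "ok"
    if data.count(0) == len(data):
        return "zeroed"           # a chunk of zeros
    found_pass, found_index = HEADER.unpack_from(data, 0)
    if data == make_block(found_pass, found_index, len(data)):
        # The block is internally consistent but is not the one we expected.
        return "stale" if found_pass != pass_id else "misplaced"
    return "corrupt"              # damaged contents, e.g. flipped bits


if __name__ == "__main__":
    print(classify_block(make_block(7, 4), pass_id=7, index=3))
```

With such a tag, a wrong-block return (type III in the paper's terms) shows up as "misplaced", and data left over from an earlier pass shows up as "stale".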
Re: ECC and DMA to/from disk controllers
Alan, Robert, Dick,

Thank you all for the informed and helpful responses!

Alan, I'll pass your comments on to Peter Kelemen. Not sure if he follows LKML. I think he'll be interested in your characterization of the error types. I'll point him to the thread.

(I think Peter and his collaborators are fairly aware of the undetected error rates in standard Ethernet TCP/IP traffic, which as I recall is about one undetected single-bit error per 4TB transferred. I am pretty sure they have ruled this out since they have checksums computed after any network transfers.)

Robert, Dick, if I have understood correctly, in response to my specific question, RAID controllers on PCI cards will DMA data into memory over a PCI bus using one parity bit per 32 data bits for protection. This does provide some protection against errors in the data transfer, but much less protection than typical RAM ECC, which has one ECC byte for each eight data bytes. As I recall, many older motherboards disabled parity on the PCI bus, so even this protection may be inactive in many cases.

From a few minutes of on-line research, I have the impression that PCI-e has better ECC protection against address/data errors than PCI, but I am not certain.

Thanks again!

Cheers,
Bruce
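The difference between the two schemes can be made concrete. One parity bit per 32 data bits is about 3% overhead (1/32), while one ECC byte per eight data bytes is 12.5% (8/64). A single parity bit detects any odd number of flipped bits but misses any even number, and it can never correct anything. The sketch below is a simplified model of one parity bit over a 32-bit word, not the exact coverage defined by the PCI specification; the function names are made up for illustration.

```python
def parity32(word: int) -> int:
    """Even-parity bit over the low 32 bits of a word."""
    return bin(word & 0xFFFFFFFF).count("1") & 1


def flip_bits(word: int, positions: list[int]) -> int:
    """Return the word with the given bit positions inverted."""
    for position in positions:
        word ^= 1 << position
    return word


def parity_detects(word: int, positions: list[int]) -> bool:
    """True if flipping these bits changes the parity bit."""
    return parity32(word) != parity32(flip_bits(word, positions))


if __name__ == "__main__":
    word = 0xDEADBEEF
    print("single-bit error detected:", parity_detects(word, [5]))
    print("double-bit error detected:", parity_detects(word, [5, 17]))
```

The second call reports that the double-bit error goes unnoticed, which is the case a SECDED-style memory ECC is designed to catch.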
Re: ECC and DMA to/from disk controllers
Bruce Allen wrote:
> Dear LKML,
>
> Apologies in advance for potential mis-use of LKML, but I don't know where
> else to ask.
>
> An ongoing study on datasets of several Petabytes have shown that there
> can be 'silent data corruption' at rates much larger than one might
> naively expect from the expected error rates in RAID arrays and the
> expected probability of single bit uncorrected errors in hard disks.
>
> The origin of this data corruption is still unknown. See for example
> http://cern.ch/Peter.Kelemen/talk/2007/kelemen-2007-C5-Silent_Corruptions.pdf
>
> In thinking about this, I began to wonder about the following. Suppose
> that a (possibly RAID) disk controller correctly reads data from disk and
> has correct data in the controller memory and buffers. However when that
> data is DMA'd into system memory some errors occur (cosmic rays,
> electrical noise, etc). Am I correct that these errors would NOT be
> detected, even on a 'reliable' server with ECC memory? In other words the
> ECC bits would be calculated in server memory based on incorrect data from
> the disk.

It depends where the data got corrupted. Normally transfers over the PCI or PCI Express bus are protected by parity (or CRC or something, I assume on PCI-E) so errors there would get detected. This is quite rare unless the motherboard or expansion card is faulty or badly designed with timing problems.

However, it's conceivable that data could get corrupted inside the controller, or inside the chipset. This seems quite rare however, except in the presence of design flaws (like some VIA southbridges that had nasty problems with losing data if PCI bus masters kept the CPU off the PCI bus too long, which we have to work around).

> The alternative is that disk controllers (or at least ones that are meant
> to be reliable) DMA both the data AND the ECC byte into system memory.
> So that if an error occurs in this transfer, then it would most likely be
> picked up and corrected by the ECC mechanism. But I don't think that
> 'this is how it works'. Could someone knowledgable please confirm or
> contradict?

I don't know any controller that works in this way. This would greatly increase CPU overhead since the CPU would need to perform this CRC calculation.

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
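For comparison, the software equivalent of carrying check bits along with the data is an application-level checksum stored next to each record and verified by the CPU after the read. The sketch below is only an illustration of that trade-off, not something any controller or driver in this thread is known to do; `seal` and `unseal` are hypothetical names, and CRC-32 from Python's `zlib` module stands in for whatever check code a real system would pick.

```python
import struct
import zlib

CRC = struct.Struct("<I")  # 4-byte little-endian CRC-32 trailer (assumed layout)


def seal(payload: bytes) -> bytes:
    """Append a CRC-32 of the payload before it is written out."""
    return payload + CRC.pack(zlib.crc32(payload))


def unseal(record: bytes) -> bytes:
    """Verify the trailer after reading back; raise if the data changed."""
    if len(record) < CRC.size:
        raise ValueError("record too short to hold a checksum")
    payload = record[:-CRC.size]
    (stored,) = CRC.unpack(record[-CRC.size:])
    if zlib.crc32(payload) != stored:
        raise ValueError("checksum mismatch: data corrupted in transit or at rest")
    return payload


if __name__ == "__main__":
    record = seal(b"event data")
    print(unseal(record))
```

The cost is the one described above: every byte has to pass through the CPU once more on each read and write.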
Re: ECC and DMA to/from disk controllers
On Mon, 10 Sep 2007, Bruce Allen wrote:

> Dear LKML,
>
> Apologies in advance for potential mis-use of LKML, but I don't know where
> else to ask.
>
> An ongoing study on datasets of several Petabytes have shown that there
> can be 'silent data corruption' at rates much larger than one might
> naively expect from the expected error rates in RAID arrays and the
> expected probability of single bit uncorrected errors in hard disks.
>
> The origin of this data corruption is still unknown. See for example
> http://cern.ch/Peter.Kelemen/talk/2007/kelemen-2007-C5-Silent_Corruptions.pdf
>
> In thinking about this, I began to wonder about the following. Suppose
> that a (possibly RAID) disk controller correctly reads data from disk and
> has correct data in the controller memory and buffers. However when that
> data is DMA'd into system memory some errors occur (cosmic rays,
> electrical noise, etc). Am I correct that these errors would NOT be
> detected, even on a 'reliable' server with ECC memory? In other words the
> ECC bits would be calculated in server memory based on incorrect data from
> the disk.
>
> The alternative is that disk controllers (or at least ones that are meant
> to be reliable) DMA both the data AND the ECC byte into system memory.
> So that if an error occurs in this transfer, then it would most likely be
> picked up and corrected by the ECC mechanism. But I don't think that
> 'this is how it works'. Could someone knowledgable please confirm or
> contradict?
>
> Cheers,
> Bruce
> -

In a typical system, there are usually hardware data transfer paths that are not under the protection of any ECC mechanism. One example is "bus mastering" DMA itself. If the bus-interface state-machine is improperly designed (read: timing problems), data transfer may be unreliable. Of course serial-ATA, SCSI, and other external buses have a modicum of protection, but early IDE did not. Many file-systems have been corrupted by incorrect cables, bad motherboard or chip designs, or by using UDMA when the hardware won't reliably work.

That said, the reliability of data transfer buses is pretty good because, unlike RAM, they don't need to store data for long periods of time. A bit upset due to a nuclear event is highly unlikely on a bus where something is driving the bus, keeping the data valid, during the time that something else is reading the bus. Nuclear events generally upset RAM because the data are stored in very small charges and femtoamperes of spurious current can alter logic states.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.22.1 on an i686 machine (5588.30 BogoMips).
My book : http://www.AbominableFirebug.com/
_

The information transmitted in this message is confidential and may be privileged. Any review, retransmission, dissemination, or other use of this information by persons or entities other than the intended recipient is prohibited. If you are not the intended recipient, please notify Analogic Corporation immediately - by replying to this message or by sending an email to [EMAIL PROTECTED] - and destroy all copies of this information, including any attachments, without reading or disclosing them.

Thank you.
Re: ECC and DMA to/from disk controllers
> In thinking about this, I began to wonder about the following. Suppose
> that a (possibly RAID) disk controller correctly reads data from disk and
> has correct data in the controller memory and buffers. However when that
> data is DMA'd into system memory some errors occur (cosmic rays,
> electrical noise, etc). Am I correct that these errors would NOT be
> detected, even on a 'reliable' server with ECC memory? In other words the
> ECC bits would be calculated in server memory based on incorrect data from
> the disk.

Architecture specific.

> The alternative is that disk controllers (or at least ones that are meant
> to be reliable) DMA both the data AND the ECC byte into system memory.
> So that if an error occurs in this transfer, then it would most likely be
> picked up and corrected by the ECC mechanism. But I don't think that
> 'this is how it works'. Could someone knowledgable please confirm or
> contradict?

It's almost entirely device specific at every level. Some general information and comment, however:

- Drives normally do error correction and shouldn't be fooled very often by bad bits.
- The ECC level on the drive processors and memory cache varies by vendor. Good luck getting any information on this, although maybe if you are Cern sized they will talk.

After the drive we cross the cable. For SATA this is pretty good, and UDMA data transfer is CRC protected. For PATA the data is protected but not the command block, so on PATA there is a minute chance you send the CRC protected block to the wrong place.

Once it's crossing the PCI bus and main memory and CPU cache, it's entirely down to the system you are running what is protected and how much. Note that a lot of systems won't report ECC errors unless you ask.

If you have hardware RAID controllers it's all vendor specific, including CPU cache etc. on the card.

The next usual mess is network transfers. The TCP checksum strength is questionable for such workloads but the ethernet one is pretty good. Unfortunately lots of high performance people use checksum offload, which removes much of the end to end protection and leads to problems with iffy cards and the like. This is well studied and known to be very problematic, but in the market speed sells, not correctness.

From the paper, type II sounds like slab might be a candidate kernel side, but also CPU bugs, as near OOM we will be paging hard and any L2 cache page out/page table race from software or hardware would fit what it describes, especially the transient nature.

Type III wrong block on PATA fits with the fact the block number isn't protected and also the limits on the cache quality of drives/drive firmware bugs.

For drivers/ide there are *lots* of problems with error handling so that might be implicated (would want to do old v new ide tests on the same h/w which would be very intriguing).

Stale data from disk cache I've seen reported, also offsets from FIFO hardware bugs (the LOTR render farm hit the latter and had to avoid UDMA to avoid a hardware bug).

Chunks of zero sounds like caches again; it would be interesting to know what hardware changes occurred at the point they began to pop up, and what software.

We also see chipset bugs under high contention, some of which are explained and worked around (VIA ones in the past); others we see are clear correlations - eg between Nvidia chipsets and Silicon Image SATA controllers.
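One reason the TCP checksum is considered weak is that it is a 16-bit ones' complement sum, which does not depend on the order of the 16-bit words. The sketch below illustrates this with an arbitrary made-up payload: swapping two words leaves the Internet checksum unchanged, while a CRC-32 of the kind Ethernet uses for its frame check (here `zlib.crc32`) notices the change. The function name `internet_checksum` is chosen for illustration, and the code checksums a bare payload rather than a real TCP segment with its pseudo-header.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement sum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


if __name__ == "__main__":
    original = bytes.fromhex("1234abcd0001fffe")
    swapped = bytes.fromhex("abcd12340001fffe")   # first two words exchanged
    print("checksum unchanged:",
          internet_checksum(original) == internet_checksum(swapped))
    print("CRC-32 unchanged:  ",
          zlib.crc32(original) == zlib.crc32(swapped))
```

Misordered or misplaced data of this kind is therefore only caught if an end-to-end check stronger than the TCP checksum is applied to the payload.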