Re: [PATCH] eepro100.c, kernel 2.4.1
Alan Cox, Thu Feb 08 2001 - 02:42:52 EST:
> > It's the printk that gets it wrong, although that's harmless.
> > Intel's documentation states that the bug does NOT exist if the
> > bits 0 and 1 in eeprom[3] are 1. Thus, the workaround is correct,
> > the printk is wrong.
>
> So why does it fix the problem for him. His report and your reply don't
> make sense viewed together

Wish I'd seen this patch about a month and a half before. I had borrowed
two machines from IBM Denmark for evaluation and their motherboard mounted
eepro100 cards (forget which exact chip version it was) didn't quite work
with the driver in the standard RH 6.2.

On boot up it said something about the Receiver lock up bug (only one of
the two messages, I think) and then it locked up anyway half an hour and a
couple of hundred ethernet packets later. I didn't have time to look really
closely at the source code at the time :/

Just another data point indicating that the current receiver lock up
enabling code isn't good enough on newish chips.

-Peter
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] eepro100.c, kernel 2.4.1
On Thu, Feb 08, 2001 at 03:53:10AM -0800, Ion Badulescu wrote:
> Still, there should be something before these suppressed messages started.

No, sorry, but absolutely nothing since the boot.

> It goes like this:
>
> bit0 = 1 means the workaround may be omitted when operating at 10 Mbit
> bit1 = 1 means the workaround may be omitted when operating at 100 Mbit
>
> So the workaround needs to be activated when at least one bit is zero, and
> may be omitted when both bits are 1. That's exactly what the original code
> does.

Ah ok.

> "Yesterday, a brick fell upon my head while I was walking on the street.
> Today, I put my hat on before leaving home, and no brick fell on my head
> anymore. So the hat must have helped!"

You're absolutely right. I still don't know if activating the workaround
helped, it just seemed to help.

> Please read the code if you don't believe me.

I read it, but I don't have the Intel docs, so I miss the information you
have.

Thank you for spending time for this problem.

--
Augustin Vidovic http://www.vidovic.org/augustin/
"Nous sommes tous quelque chose de naissance, musicien ou assassin,
mais il faut apprendre le maniement de la harpe ou du couteau."
Re: [PATCH] eepro100.c, kernel 2.4.1
On Thu, 8 Feb 2001, Augustin Vidovic wrote:
> This suppression of thousands of lines was described as a DOS-protection
> in the docs I read.

Still, there should be something before these suppressed messages started.

> With my patch, the test becomes (eeprom[3] & 0x03), which is not null
> for every possible non-null value of the two lower bits :
>
> bit1  bit0  [bit1,bit0]&[1,1]
>  0     0     00
>  0     1     01
>  1     0     10
>  1     1     11
>
> Whereas the other test is more restrictive, because it excludes the "11"
> from the results.
> The old cards still get the workaround enabled with this wider test.

No, they don't. It goes like this:

bit0 = 1 means the workaround may be omitted when operating at 10 Mbit
bit1 = 1 means the workaround may be omitted when operating at 100 Mbit

So the workaround needs to be activated when at least one bit is zero, and
may be omitted when both bits are 1. That's exactly what the original code
does.

> > So your patch did not do you any good. Case closed, as far as the work-around
> > is concerned.
>
> To the contrary, it seems to do a lot of good, because the NET subsystem
> does not send any more panic messages to the kernel, and the cluster has
> not melted down again so far.

"Yesterday, a brick fell upon my head while I was walking on the street.
Today, I put my hat on before leaving home, and no brick fell on my head
anymore. So the hat must have helped!"

Please read the code if you don't believe me.

Ion

--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
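[Archive note] The bit semantics described in the message above can be put side by side with the two competing tests. The following is only a minimal sketch in plain C, not code from eepro100.c: the macro names, function names and the use of uint16_t are assumptions made for illustration, and the patched test is reduced to 0/1 so the two results can be compared directly.

```c
#include <assert.h>
#include <stdint.h>

/* Bit meanings of eeprom[3] as stated in the thread.
 * All names below are illustrative, not taken from the driver. */
#define RX_FIX_OMIT_10MBIT  0x01 /* bit0 = 1: workaround may be omitted at 10 Mbit  */
#define RX_FIX_OMIT_100MBIT 0x02 /* bit1 = 1: workaround may be omitted at 100 Mbit */
#define RX_FIX_MASK (RX_FIX_OMIT_10MBIT | RX_FIX_OMIT_100MBIT)

/* Original driver test: sp->rx_bug = (eeprom[3] & 0x03) == 3 ? 0 : 1; */
int rx_bug_original(uint16_t eeprom3)
{
    return (eeprom3 & RX_FIX_MASK) == RX_FIX_MASK ? 0 : 1;
}

/* Proposed patch: sp->rx_bug = eeprom[3] & 0x03; (normalized to 0/1 here) */
int rx_bug_patched(uint16_t eeprom3)
{
    return (eeprom3 & RX_FIX_MASK) != 0;
}

/* Walk the four combinations of bit1/bit0 from the quoted table. */
void check_truth_table(void)
{
    /* At least one bit clear: the original test enables the workaround. */
    assert(rx_bug_original(0x00) == 1);
    assert(rx_bug_original(0x01) == 1);
    assert(rx_bug_original(0x02) == 1);
    /* Both bits set: the workaround may be omitted. */
    assert(rx_bug_original(0x03) == 0);

    /* The patched test gives the opposite answer on 00 and 11. */
    assert(rx_bug_patched(0x00) == 0);
    assert(rx_bug_patched(0x03) == 1);
}
```

Under these assumptions the two tests agree on 01 and 10 and differ on 00 and 11, which is the disagreement the two posters are arguing about.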
Re: [PATCH] eepro100.c, kernel 2.4.1
On Thu, Feb 08, 2001 at 03:26:51AM -0800, Ion Badulescu wrote:
> syslogd does not suppress messages, it suppresses *identical* messages.
> So what was the *first* message logged by syslogd, the one followed by
> "last message repeated XXX times"?

It's not "last message repeated XXX times", it's :

...
Jan 30 00:01:18 XXX kernel: NET: 8298 messages suppressed.
Jan 30 00:01:24 XXX kernel: NET: 2929 messages suppressed.
Jan 30 00:01:38 XXX kernel: NET: 1225 messages suppressed.
Jan 30 00:01:43 XXX kernel: NET: 4397 messages suppressed.
Jan 30 00:01:48 XXX kernel: NET: 2342 messages suppressed.
... (ad nauseam)

This suppression of thousands of lines was described as a DOS-protection
in the docs I read.

> Umm, no. With your patch, both the diagnostic and the activation are wrong,
> whereas before only the diagnostic was wrong.

With my patch, the test becomes (eeprom[3] & 0x03), which is not null
for every possible non-null value of the two lower bits :

bit1  bit0  [bit1,bit0]&[1,1]
 0     0     00
 0     1     01
 1     0     10
 1     1     11

Whereas the other test is more restrictive, because it excludes the "11"
from the results.
The old cards still get the workaround enabled with this wider test.

> > Now, I do not get _any_ message in the logs, which means that the network
> > cards activity is closer to normality than before the patch.
>
> So your patch did not do you any good. Case closed, as far as the work-around
> is concerned.

To the contrary, it seems to do a lot of good, because the NET subsystem
does not send any more panic messages to the kernel, and the cluster has
not melted down again so far.

> If you post the original log messages, we might be able to find the real
> bug...

Sorry, I can't, as they were suppressed (as you can see in the example I
copy-pasted before in this mail), and now I don't get any other one.

> [and please don't drop the Cc:]

Ok, if you insist.

--
Augustin Vidovic http://www.vidovic.org/augustin/
"Nous sommes tous quelque chose de naissance, musicien ou assassin,
mais il faut apprendre le maniement de la harpe ou du couteau."
Re: [PATCH] eepro100.c, kernel 2.4.1
On Thu, 8 Feb 2001 20:15:39 +0900, Augustin Vidovic <[EMAIL PROTECTED]> wrote:
>> So what _were_ those messages? Can you post them?
>
> No I can't because they were suppressed by the syslogd (DOS protection), only
> their number being reported (several thousands every few seconds).

syslogd does not suppress messages, it suppresses *identical* messages.
So what was the *first* message logged by syslogd, the one followed by
"last message repeated XXX times"?

>> Well, your patch disables the work-around exactly for those (really old) cards
>> that actually need it and enables it for those that don't need it.
>
> No, because the test used for the activation is now the same as the one used
> for the diagnostic, which means that every card which is diagnosed to have the
> bug gets the workaround activated.

Umm, no. With your patch, both the diagnostic and the activation are wrong,
whereas before only the diagnostic was wrong.

>> "eth0: Sending a multicast list set command from a timer routine."
>>
>> If you find such messages, the work-around really did something. Otherwise,
>> it's the placebo effect...
>
> Now, I do not get _any_ message in the logs, which means that the network
> cards activity is closer to normality than before the patch.

So your patch did not do you any good. Case closed, as far as the work-around
is concerned.

If you post the original log messages, we might be able to find the real
bug...

[and please don't drop the Cc:]

Ion

--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
Re: [PATCH] eepro100.c, kernel 2.4.1
On Thu, Feb 08, 2001 at 03:00:10AM -0800, Ion Badulescu wrote:
> > At the same time, the /var/log/messages receives thousands of messages from the
> > NET: subsystem.
>
> So what _were_ those messages? Can you post them?

No I can't because they were suppressed by the syslogd (DOS protection), only
their number being reported (several thousands every few seconds).

> > Since the dmesg of the kernel tells about a work-around for such a bug, I was
> > assuming that the work around was activated, but I had a doubt and after
> > looking at the source, I discovered that it wasn't.
>
> Well, your patch disables the work-around exactly for those (really old) cards
> that actually need it and enables it for those that don't need it.

No, because the test used for the activation is now the same as the one used
for the diagnostic, which means that every card which is diagnosed to have the
bug gets the workaround activated.

> > Now, as Ion says, maybe it is not the "receiver lock-up bug" itself which is
> > worked-around, frankly I don't know.
>
> "eth0: Sending a multicast list set command from a timer routine."
>
> If you find such messages, the work-around really did something. Otherwise,
> it's the placebo effect...

Now, I do not get _any_ message in the logs, which means that the network
cards activity is closer to normality than before the patch.

--
Augustin Vidovic http://www.vidovic.org/augustin/
"Nous sommes tous quelque chose de naissance, musicien ou assassin,
mais il faut apprendre le maniement de la harpe ou du couteau."
Re: [PATCH] eepro100.c, kernel 2.4.1
On Thu, 8 Feb 2001 19:41:56 +0900, Augustin Vidovic <[EMAIL PROTECTED]> wrote:
> You can see a kind of sudden blackout which lasts about 3 hours, and then the
> situation resumes to normality.
>
> At the same time, the /var/log/messages receives thousands of messages from the
> NET: subsystem.

So what _were_ those messages? Can you post them?

> Since the dmesg of the kernel tells about a work-around for such a bug, I was
> assuming that the work around was activated, but I had a doubt and after
> looking at the source, I discovered that it wasn't.

Well, your patch disables the work-around exactly for those (really old) cards
that actually need it and enables it for those that don't need it.

> Now, as Ion says, maybe it is not the "receiver lock-up bug" itself which is
> worked-around, frankly I don't know.

There is a very simple way to tell. Check your logs for messages like:

"eth0: Sending a multicast list set command from a timer routine."

If you find such messages, the work-around really did something. Otherwise,
it's the placebo effect...

Ion

--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
Re: [PATCH] eepro100.c, kernel 2.4.1
On Wed, Feb 07, 2001 at 11:59:05PM -0800, Ion Badulescu wrote:
> I don't think it fixes *this* bug. However, the bug workaround effectively
> reinitializes the chip, so it might serve as a generic 'reset and try
> again' kind of workaround. In that case, we might as well enable it
> unconditionally... but I don't see it as a good solution. It's a stop-gap
> measure at best.
>
> We need to find out what exactly happens. Until he tells us more about how
> his boxes "were failing before", there really isn't much we can diagnose.

Ok, then let's go into a bit more details.

First, the part of the dmesg concerning the network interfaces:

eepro100.c:v1.09j-t 9/29/99 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html
eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <[EMAIL PROTECTED]> and others
PCI: Found IRQ 5 for device 00:0c.0
PCI: The same IRQ used for device 00:0d.0
eth0: PCI device 8086:1229, 00:D0:B7:00:BE:00, IRQ 5.
  Receiver lock-up bug exists -- enabling work-around.
  Board assembly 00-000, Physical connectors present: RJ45
  Primary interface chip i82555 PHY #1.
  General self-test: passed.
  Serial sub-system self-test: passed.
  Internal registers self-test: passed.
  ROM checksum self-test: passed (0x04f4518b).
  Receiver lock-up workaround activated.
PCI: Found IRQ 5 for device 00:0d.0
PCI: The same IRQ used for device 00:0c.0
eth1: PCI device 8086:1229, 00:D0:B7:00:BE:01, IRQ 5.
  Receiver lock-up bug exists -- enabling work-around.
  Board assembly 00-000, Physical connectors present: RJ45
  Primary interface chip i82555 PHY #1.
  General self-test: passed.
  Serial sub-system self-test: passed.
  Internal registers self-test: passed.
  ROM checksum self-test: passed (0x04f4518b).
  Receiver lock-up workaround activated.

Please note: the "Receiver lock-up workaround activated." message is printed
now only since I applied my patch. Before, only the "enabling work-around."
part appeared, which is a bit tricky.

Second, attached to this mail is an mrtg graph png.

Beware that the timeline goes from right to left. This covers the past week.
Every day the big peak is the midnight "masturbation rush" when nearly
everyone connects at the same time to browse pr0n sites.

You'll notice that the midnight peak is castrated suddenly last friday.
This accident happened 3 times the previous week. Kind of frustrating.

You can see a kind of sudden blackout which lasts about 3 hours, and then the
situation resumes to normality.

At the same time, the /var/log/messages receives thousands of messages from the
NET: subsystem.

A rather long research on the various mailing lists and newsgroups about
networking shows that this behavior is shown the same way on systems using a
bugged Intel EtherExpress Pro 100 network card.

Since the dmesg of the kernel tells about a work-around for such a bug, I was
assuming that the work around was activated, but I had a doubt and after
looking at the source, I discovered that it wasn't.

On saturday I patched the kernels, and since the midnight peaks are no longer
broken, there is no more desperate messages from the NET subsystem in the
logs, so maybe the problem has been fixed.

Now, as Ion says, maybe it is not the "receiver lock-up bug" itself which is
worked-around, frankly I don't know.

--
Augustin Vidovic http://www.vidovic.org/augustin/
"Nous sommes tous quelque chose de naissance, musicien ou assassin,
mais il faut apprendre le maniement de la harpe ou du couteau."

[Attachment: mrtg.png]
Re: [PATCH] eepro100.c, kernel 2.4.1
On Thu, 8 Feb 2001, Alan Cox wrote:
> > It's the printk that gets it wrong, although that's harmless.
> > Intel's documentation states that the bug does NOT exist if the
> > bits 0 and 1 in eeprom[3] are 1. Thus, the workaround is correct,
> > the printk is wrong.
>
> So why does it fix the problem for him. His report and your reply don't
> make sense viewed together

I don't think it fixes *this* bug. However, the bug workaround effectively
reinitializes the chip, so it might serve as a generic 'reset and try
again' kind of workaround. In that case, we might as well enable it
unconditionally... but I don't see it as a good solution. It's a stop-gap
measure at best.

We need to find out what exactly happens. Until he tells us more about how
his boxes "were failing before", there really isn't much we can diagnose.

I happen to also have an Intel ISP1100 box here, and I know what's inside --
i82559 C-step chips which definitely don't have this bug. The bug is an
i82557-only bug; what makes things confusing is Intel's idea of giving
multiple chips the same PCI id. They can be identified via the PCI rev:

i82557 step A-C: rev 1-3
i82558 step A-B: rev 4-5
i82559 step A-C: rev 6-8

Ion

--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.
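[Archive note] The revision table in the message above translates directly into a lookup. The following is a minimal sketch only, assuming the revision byte has already been read from PCI configuration space; the enum, the function names and the "unknown" fallback are hypothetical and are not part of eepro100.c.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative chip identifiers; these names are not from the driver. */
enum speedo_chip { CHIP_UNKNOWN, CHIP_I82557, CHIP_I82558, CHIP_I82559 };

/* Map a PCI revision to a chip, using the ranges quoted in the thread:
 * i82557 rev 1-3, i82558 rev 4-5, i82559 rev 6-8. */
enum speedo_chip chip_from_pci_rev(uint8_t rev)
{
    if (rev >= 1 && rev <= 3)
        return CHIP_I82557;
    if (rev >= 4 && rev <= 5)
        return CHIP_I82558;
    if (rev >= 6 && rev <= 8)
        return CHIP_I82559;
    return CHIP_UNKNOWN; /* revisions not covered by the thread */
}

/* Per the message above, the receiver lock-up bug is i82557-only. */
int chip_may_have_rx_lockup(uint8_t rev)
{
    return chip_from_pci_rev(rev) == CHIP_I82557;
}

/* Spot-check the boundaries of each revision range. */
void check_rev_table(void)
{
    assert(chip_from_pci_rev(1) == CHIP_I82557);
    assert(chip_from_pci_rev(3) == CHIP_I82557);
    assert(chip_from_pci_rev(4) == CHIP_I82558);
    assert(chip_from_pci_rev(5) == CHIP_I82558);
    assert(chip_from_pci_rev(6) == CHIP_I82559);
    assert(chip_from_pci_rev(8) == CHIP_I82559);
}
```

A check of this kind would separate the chips that share PCI id 8086:1229; whether the driver should gate the workaround on it is not settled in this thread.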
Re: [PATCH] eepro100.c, kernel 2.4.1
> It's the printk that gets it wrong, although that's harmless.
> Intel's documentation states that the bug does NOT exist if the
> bits 0 and 1 in eeprom[3] are 1. Thus, the workaround is correct,
> the printk is wrong.

So why does it fix the problem for him. His report and your reply don't
make sense viewed together
Re: [PATCH] eepro100.c, kernel 2.4.1
On Wed, Feb 07, 2001 at 11:23:01PM -0800, Ion Badulescu wrote:
> Intel's documentation states that the bug does NOT exist if the
> bits 0 and 1 in eeprom[3] are 1. Thus, the workaround is correct,
> the printk is wrong.

I wonder if it's not Intel's documentation which is wrong : it seems that
the bug showed up also with the network cards used in my boxes, and the
patch I proposed seemed to fix that problem.

--
Augustin Vidovic http://www.vidovic.org/augustin/
"Nous sommes tous quelque chose de naissance, musicien ou assassin,
mais il faut apprendre le maniement de la harpe ou du couteau."
Re: [PATCH] eepro100.c, kernel 2.4.1
On Thu, 8 Feb 2001 14:53:55 +0900, Augustin Vidovic <[EMAIL PROTECTED]> wrote:
> --- linux-2.4.1/drivers/net/eepro100.c	Sun Jan 28 03:40:14 2001
> +++ linux-2.4.1-vido1/drivers/net/eepro100.c	Thu Feb 8 14:08:49 2001
> @@ -815,7 +815,7 @@
>
>  	sp->phy[0] = eeprom[6];
>  	sp->phy[1] = eeprom[7];
> -	sp->rx_bug = (eeprom[3] & 0x03) == 3 ? 0 : 1;
> +	sp->rx_bug = eeprom[3] & 0x03;
>
>  	if (sp->rx_bug)
>  		printk(KERN_INFO " Receiver lock-up workaround activated.\n");

This patch is wrong, please DON'T apply it.

It's the printk that gets it wrong, although that's harmless.
Intel's documentation states that the bug does NOT exist if the
bits 0 and 1 in eeprom[3] are 1. Thus, the workaround is correct,
the printk is wrong.

The correct patch for 2.4.1 is attached. 2.2.18 needs something similar,
the same patch can be applied with some fuzz.

Thanks,
Ion

--
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.

--- /usr/src/local/linux-2.4.vanilla/drivers/net/eepro100.c	Wed Feb 7 15:45:16 2001
+++ linux-2.4/drivers/net/eepro100.c	Wed Feb 7 23:07:29 2001
@@ -725,7 +725,7 @@
 		/* The self-test results must be paragraph aligned. */
 		volatile s32 *self_test_results;
 		int boguscnt = 16000;	/* Timeout for set-test. */
-		if (eeprom[3] & 0x03)
+		if ((eeprom[3] & 0x03) != 0x03)
 			printk(KERN_INFO " Receiver lock-up bug exists -- enabling"
 				   " work-around.\n");
 		printk(KERN_INFO " Board assembly %4.4x%2.2x-%3.3d, Physical"
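[Archive note] The effect of the corrected printk condition can be checked against the unchanged activation test by enumerating the four bit patterns. This is a minimal sketch in plain C, not driver code: the function names and the mismatch counter are hypothetical, and only the three boolean expressions are taken from the patches quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* printk condition before the fix: if (eeprom[3] & 0x03) */
int diag_before_fix(uint16_t eeprom3)
{
    return (eeprom3 & 0x03) != 0;
}

/* printk condition after the fix: if ((eeprom[3] & 0x03) != 0x03) */
int diag_after_fix(uint16_t eeprom3)
{
    return (eeprom3 & 0x03) != 0x03;
}

/* Unchanged activation test: (eeprom[3] & 0x03) == 3 ? 0 : 1 */
int rx_bug_activation(uint16_t eeprom3)
{
    return (eeprom3 & 0x03) == 3 ? 0 : 1;
}

/* Count the bit patterns 00..11 on which a diagnostic test
 * disagrees with the activation test. */
int count_mismatches(int (*diag)(uint16_t))
{
    int mismatches = 0;
    uint16_t bits;

    for (bits = 0; bits <= 0x03; bits++)
        if (diag(bits) != rx_bug_activation(bits))
            mismatches++;
    return mismatches;
}

/* The old diagnostic is wrong on 00 and 11; the fixed one always agrees. */
void check_diagnostics(void)
{
    assert(count_mismatches(diag_before_fix) == 2);
    assert(count_mismatches(diag_after_fix) == 0);
}
```

Under these assumptions the fix leaves the activation logic alone and only makes the "bug exists" message appear on exactly the cards for which the workaround is enabled.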
[PATCH] eepro100.c, kernel 2.4.1
Patch for drivers/net/eepro100.c in kernel 2.4.1 (and before).

For some of the bugged Intel EtherExpress Pro 100 network cards,
although the driver diagnoses the receiver lock-up bug, the workaround
is not enabled.

It appears that the test for the diagnostic and the test for the
workaround activation are different. I assumed the diagnostic test is
OK and I changed the workaround activation test.

I had several Intel ISP1100 boxes with the bug diagnosed but the
workaround not enabled. After the patch, the workaround is activated
and the boxes seem to be alright even under very high network traffic
(they were failing before, due to the card bug, I think).

Attached is the tarball of the patch, which I believe conforms to the
list FAQ guidelines. Since the patch is only one line, I also include
it in the body of this message as plain text.

--- linux-2.4.1/drivers/net/eepro100.c	Sun Jan 28 03:40:14 2001
+++ linux-2.4.1-vido1/drivers/net/eepro100.c	Thu Feb 8 14:08:49 2001
@@ -815,7 +815,7 @@

 	sp->phy[0] = eeprom[6];
 	sp->phy[1] = eeprom[7];
-	sp->rx_bug = (eeprom[3] & 0x03) == 3 ? 0 : 1;
+	sp->rx_bug = eeprom[3] & 0x03;

 	if (sp->rx_bug)
 		printk(KERN_INFO " Receiver lock-up workaround activated.\n");

I don't understand why the tests for the diagnostic and for the
workaround activation were different. Maybe a simple bug, but maybe
there was an obscure reason. If someone knows...

-- 
Augustin Vidovic http://www.vidovic.org/augustin/
"We are all something by birth, musician or assassin, but one must
learn the handling of the harp or of the knife."

patch-eepro100-vido1.tar
Re: [PATCH] eepro100.c, kernel 2.4.1
> It's the printk that gets it wrong, although that's harmless. Intel's
> documentation states that the bug does NOT exist if the bits 0 and 1
> in eeprom[3] are 1. Thus, the workaround is correct, the printk is wrong.

So why does it fix the problem for him. His report and your reply don't
make sense viewed together
Re: [PATCH] eepro100.c, kernel 2.4.1
On Thu, 8 Feb 2001, Alan Cox wrote:
> > It's the printk that gets it wrong, although that's harmless. Intel's
> > documentation states that the bug does NOT exist if the bits 0 and 1
> > in eeprom[3] are 1. Thus, the workaround is correct, the printk is wrong.
>
> So why does it fix the problem for him. His report and your reply don't
> make sense viewed together

I don't think it fixes *this* bug. However, the bug workaround
effectively reinitializes the chip, so it might serve as a generic
'reset and try again' kind of workaround. In that case, we might as
well enable it unconditionally... but I don't see it as a good
solution. It's a stop-gap measure at best.

We need to find out what exactly happens. Until he tells us more about
how his boxes "were failing before", there really isn't much we can
diagnose. I happen to also have an Intel ISP1100 box here, and I know
what's inside -- i82559 C-step chips which definitely don't have this
bug.

The bug is an i82557-only bug; what makes things confusing is Intel's
idea of giving multiple chips the same PCI id. They can be identified
via the PCI rev:

i82557 step A-C: rev 1-3
i82558 step A-B: rev 4-5
i82559 step A-C: rev 6-8

Ion

-- 
It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.