Re: [PATCH] eepro100.c, kernel 2.4.1

2001-02-09 Thread Peter Lund

Alan Cox, Thu Feb 08 2001 - 02:42:52 EST:

> > It's the printk that gets it wrong, although that's harmless. 
> > Intel's documentation states that the bug does NOT exist if the 
> > bits 0 and 1 in eeprom[3] are 1. Thus, the workaround is correct, 
> > the printk is wrong. 
> 
> So why does it fix the problem for him. His report and your reply don't 
> make sense viewed together 

Wish I'd seen this patch about a month and a half before.  I had borrowed two
machines from IBM Denmark for evaluation and their motherboard mounted eepro100
cards (forget which exact chip version it was) didn't quite work with the driver
in the standard RH 6.2.

On boot up it said something about the Receiver lock up bug (only one of the two
messages, I think) and then it locked up anyway half an hour and a couple of
hundred ethernet packets later.  I didn't have time to look really closely at
the source code at the time :/

Just another data point indicating that the current receiver lock up enabling
code isn't good enough on newish chips.

-Peter
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/



Re: [PATCH] eepro100.c, kernel 2.4.1

2001-02-08 Thread Augustin Vidovic

On Thu, Feb 08, 2001 at 03:53:10AM -0800, Ion Badulescu wrote:
> Still, there should be something before these suppressed messages started.

No, sorry, but absolutely nothing since the boot.

> It goes like this:
> 
> bit0 = 1 means the workaround may be omitted when operating at 10 Mbit
> bit1 = 1 means the workaround may be omitted when operating at 100 Mbit
> 
> So the workaround needs to be activated when at least one bit is zero, and 
> may be omitted when both bits are 1. That's exactly what the original code 
> does.

Ah ok.

> "Yesterday, a brick fell upon my head while I was walking on the street. 
> Today, I put my hat on before leaving home, and no brick fell on my head 
> anymore. So the hat must have helped!"

You're absolutely right. I still don't know if activating the workaround
helped, it just seemed to help.

> Please read the code if you don't believe me.

I read it, but I don't have the Intel docs, so I miss the information you
have.

Thank you for spending time for this problem.

-- 
Augustin Vidovic   http://www.vidovic.org/augustin/
"Nous sommes tous quelque chose de naissance, musicien ou assassin,
 mais il faut apprendre le maniement de la harpe ou du couteau."



Re: [PATCH] eepro100.c, kernel 2.4.1

2001-02-08 Thread Ion Badulescu

On Thu, 8 Feb 2001, Augustin Vidovic wrote:

> This suppression of thousands of lines was described as a DOS-protection
> in the docs I read.

Still, there should be something before these suppressed messages started.

> With my patch, the test becomes (eeprom[3] & 0x03), which is not null
> for every possible non-null value of the two lower bits :
> 
>   bit1  bit0  [bit1,bit0]&[1,1]
>   0     0     00
>   0     1     01
>   1     0     10
>   1     1     11
> 
> Whereas the other test is more restrictive, because it excludes the "11"
> from the results.
> The old cards still get the workaround enabled with this wider test.

No, they don't.

It goes like this:

bit0 = 1 means the workaround may be omitted when operating at 10 Mbit
bit1 = 1 means the workaround may be omitted when operating at 100 Mbit

So the workaround needs to be activated when at least one bit is zero, and 
may be omitted when both bits are 1. That's exactly what the original code 
does.
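
The rule above can be checked by enumerating the two low bits of eeprom[3]. The following is only a sketch: the two expressions are copied from the diff quoted in this thread, while the function names are made up for illustration and do not exist in eepro100.c.

```c
/* Activation test as it stands in 2.4.1 (the "-" line of the proposed
 * patch): the workaround is skipped only when both bits are set. */
int rx_bug_original(unsigned int eeprom3)
{
    return (eeprom3 & 0x03) == 3 ? 0 : 1;
}

/* Activation test from the proposed patch (the "+" line): nonzero
 * whenever at least one of the two bits is set. */
int rx_bug_patched(unsigned int eeprom3)
{
    return eeprom3 & 0x03;
}
```

For the value 00 (workaround required at both speeds) the original test enables the workaround and the patched test disables it. For 11 (workaround unnecessary at both speeds) it is the other way round. The two tests only agree for 01 and 10.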

> > So your patch did not do you any good. Case closed, as far as the work-around
> > is concerned.
> 
> To the contrary, it seems to do a lot of good, because the NET subsystem
> does not send any more panic messages to the kernel, and the cluster has
> not melted down again so far.

"Yesterday, a brick fell upon my head while I was walking on the street. 
Today, I put my hat on before leaving home, and no brick fell on my head 
anymore. So the hat must have helped!"

Please read the code if you don't believe me.


Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.




Re: [PATCH] eepro100.c, kernel 2.4.1

2001-02-08 Thread Augustin Vidovic

On Thu, Feb 08, 2001 at 03:26:51AM -0800, Ion Badulescu wrote:
> syslogd does not suppress messages, it suppresses *identical* messages.
> So what was the *first* message logged by syslogd, the one followed by
> "last message repeated XXX times"?

It's not "last message repeated XXX times", it's :
...
Jan 30 00:01:18 XXX kernel: NET: 8298 messages suppressed. 
Jan 30 00:01:24 XXX kernel: NET: 2929 messages suppressed. 
Jan 30 00:01:38 XXX kernel: NET: 1225 messages suppressed. 
Jan 30 00:01:43 XXX kernel: NET: 4397 messages suppressed. 
Jan 30 00:01:48 XXX kernel: NET: 2342 messages suppressed. 
...
(ad nauseam)

This suppression of thousands of lines was described as a DOS-protection
in the docs I read.

> Umm, no. With your patch, both the diagnostic and the activation are wrong,
> whereas before only the diagnostic was wrong.

With my patch, the test becomes (eeprom[3] & 0x03), which is not null
for every possible non-null value of the two lower bits :

bit1  bit0  [bit1,bit0]&[1,1]
0     0     00
0     1     01
1     0     10
1     1     11

Whereas the other test is more restrictive, because it excludes the "11"
from the results.
The old cards still get the workaround enabled with this wider test.

> > Now, I do not get _any_ message in the logs, which means that the network
> > cards activity is closer to normality than before the patch.
> 
> So your patch did not do you any good. Case closed, as far as the work-around
> is concerned.

To the contrary, it seems to do a lot of good, because the NET subsystem
does not send any more panic messages to the kernel, and the cluster has
not melted down again so far.

> If you post the original log messages, we might be able to find the real
> bug...

Sorry, I can't, as they were suppressed (as you can see in the example
I copy-pasted before in this mail), and now I don't get any other one.

> [and please don't drop the Cc:]

Ok, if you insist.

-- 
Augustin Vidovic   http://www.vidovic.org/augustin/
"Nous sommes tous quelque chose de naissance, musicien ou assassin,
 mais il faut apprendre le maniement de la harpe ou du couteau."



Re: [PATCH] eepro100.c, kernel 2.4.1

2001-02-08 Thread Ion Badulescu

On Thu, 8 Feb 2001 20:15:39 +0900, Augustin Vidovic <[EMAIL PROTECTED]> wrote:

>> So what _were_ those messages? Can you post them?
> 
> No I can't because they were suppressed by the syslogd (DOS protection), only
> their number being reported (several thousands every few seconds).

syslogd does not suppress messages, it suppresses *identical* messages.
So what was the *first* message logged by syslogd, the one followed by
"last message repeated XXX times"?

>> Well, your patch disables the work-around exactly for those (really old) cards
>> that actually need it and enables it for those that don't need it.
> 
> No, because the test used for the activation is now the same as the one used
> for the diagnostic, which means that every card which is diagnosed to have the
> bug gets the workaround activated.

Umm, no. With your patch, both the diagnostic and the activation are wrong,
whereas before only the diagnostic was wrong.

>> eth0: Sending a multicast list set command from a timer routine."
>> 
>> If you find such messages, the work-around really did something. Otherwise,
>> it's the placebo effect...
> 
> Now, I do not get _any_ message in the logs, which means that the network
> cards activity is closer to normality than before the patch.

So your patch did not do you any good. Case closed, as far as the work-around
is concerned.

If you post the original log messages, we might be able to find the real
bug...

[and please don't drop the Cc:]

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.



Re: [PATCH] eepro100.c, kernel 2.4.1

2001-02-08 Thread Augustin Vidovic

On Thu, Feb 08, 2001 at 03:00:10AM -0800, Ion Badulescu wrote:
> > At the same time, the /var/log/messages receives thousands of messages from the
> > NET: subsystem.
> 
> So what _were_ those messages? Can you post them?

No I can't because they were suppressed by the syslogd (DOS protection), only
their number being reported (several thousands every few seconds).

> > Since the dmesg of the kernel tells about a work-around for such a bug, I was
> > assuming that the work around was activated, but I had a doubt and after
> > looking at the source, I discovered that it wasn't.
> 
> Well, your patch disables the work-around exactly for those (really old) cards
> that actually need it and enables it for those that don't need it.

No, because the test used for the activation is now the same as the one used
for the diagnostic, which means that every card which is diagnosed to have the
bug gets the workaround activated.

> > Now, as Ion says, maybe it is not the "receiver lock-up bug" itself which is
> > worked-around, frankly I don't know.
> 
> eth0: Sending a multicast list set command from a timer routine."
> 
> If you find such messages, the work-around really did something. Otherwise,
> it's the placebo effect...

Now, I do not get _any_ message in the logs, which means that the network
cards activity is closer to normality than before the patch.

-- 
Augustin Vidovic   http://www.vidovic.org/augustin/
"Nous sommes tous quelque chose de naissance, musicien ou assassin,
 mais il faut apprendre le maniement de la harpe ou du couteau."



Re: [PATCH] eepro100.c, kernel 2.4.1

2001-02-08 Thread Ion Badulescu

On Thu, 8 Feb 2001 19:41:56 +0900, Augustin Vidovic <[EMAIL PROTECTED]> wrote:

> You can see a kind of sudden blackout which lasts about 3 hours, and then the
> situation resumes to normality.
> 
> At the same time, the /var/log/messages receives thousands of messages from the
> NET: subsystem.

So what _were_ those messages? Can you post them?

> Since the dmesg of the kernel tells about a work-around for such a bug, I was
> assuming that the work around was activated, but I had a doubt and after
> looking at the source, I discovered that it wasn't.

Well, your patch disables the work-around exactly for those (really old) cards
that actually need it and enables it for those that don't need it.

> Now, as Ion says, maybe it is not the "receiver lock-up bug" itself which is
> worked-around, frankly I don't know.

There is a very simple way to tell. Check your logs for messages like:

eth0: Sending a multicast list set command from a timer routine."

If you find such messages, the work-around really did something. Otherwise,
it's the placebo effect...
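
One way to run that check mechanically is sketched below. It assumes the log is plain text with one message per line and that the driver text appears exactly as quoted above. The function and macro names are hypothetical, and lines longer than the buffer are read in pieces, so a message split across two pieces would be missed.

```c
#include <stdio.h>
#include <string.h>

/* Text taken from the message quoted in this mail, without the
 * interface name so that eth0, eth1, ... all match. */
#define WORKAROUND_MSG \
    "Sending a multicast list set command from a timer routine"

/* Returns 1 if the log line shows the work-around firing, else 0. */
int line_shows_workaround(const char *line)
{
    return strstr(line, WORKAROUND_MSG) != NULL;
}

/* Counts matching lines in an already opened log stream. */
long count_workaround_lines(FILE *log)
{
    char buf[1024];
    long hits = 0;

    while (fgets(buf, sizeof buf, log) != NULL)
        if (line_shows_workaround(buf))
            hits++;
    return hits;
}
```

A count of zero over the period since the patched kernel was booted would suggest the work-around never actually ran.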


Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.



Re: [PATCH] eepro100.c, kernel 2.4.1

2001-02-08 Thread Augustin Vidovic

On Wed, Feb 07, 2001 at 11:59:05PM -0800, Ion Badulescu wrote:
> I don't think it fixes *this* bug. However, the bug workaround effectively
> reinitializes the chip, so it might serve as a generic 'reset and try
> again' kind of workaround. In that case, we might as well enable it
> unconditionally... but I don't see it as a good solution. It's a stop-gap 
> measure at best.
> 
> We need to find out what exactly happens. Until he tells us more about how 
> his boxes "were failing before", there really isn't much we can diagnose.


Ok, then let's go into a bit more details.

First, the part of the dmesg concerning the network interfaces:

eepro100.c:v1.09j-t 9/29/99 Donald Becker http://cesdis.gsfc.nasa.gov/linux/drivers/eepro100.html
eepro100.c: $Revision: 1.36 $ 2000/11/17 Modified by Andrey V. Savochkin <[EMAIL PROTECTED]> and others
PCI: Found IRQ 5 for device 00:0c.0
PCI: The same IRQ used for device 00:0d.0
eth0: PCI device 8086:1229, 00:D0:B7:00:BE:00, IRQ 5.
  Receiver lock-up bug exists -- enabling work-around.
  Board assembly 00-000, Physical connectors present: RJ45
  Primary interface chip i82555 PHY #1.
  General self-test: passed.
  Serial sub-system self-test: passed.
  Internal registers self-test: passed.
  ROM checksum self-test: passed (0x04f4518b).
  Receiver lock-up workaround activated.
PCI: Found IRQ 5 for device 00:0d.0
PCI: The same IRQ used for device 00:0c.0
eth1: PCI device 8086:1229, 00:D0:B7:00:BE:01, IRQ 5.
  Receiver lock-up bug exists -- enabling work-around.
  Board assembly 00-000, Physical connectors present: RJ45
  Primary interface chip i82555 PHY #1.
  General self-test: passed.
  Serial sub-system self-test: passed.
  Internal registers self-test: passed.
  ROM checksum self-test: passed (0x04f4518b).
  Receiver lock-up workaround activated.


Please note: the "Receiver lock-up workaround activated." message has been printed
only since I applied my patch. Before that, only the "enabling work-around." part
appeared, which is a bit misleading.

Second, attached to this mail is an mrtg graph png. Beware that the timeline goes
from right to left. This covers the past week. Every day the big peak is the
midnight "masturbation rush" when nearly everyone connects at the same time to
browse pr0n sites. You'll notice that the midnight peak is castrated suddenly
last friday. This accident happened 3 times the previous week. Kind of frustrating.

You can see a kind of sudden blackout which lasts about 3 hours, and then the
situation resumes to normality.

At the same time, the /var/log/messages receives thousands of messages from the
NET: subsystem.

A rather long search through the various networking mailing lists and newsgroups
shows that systems using a buggy Intel EtherExpress Pro 100 network card behave
in the same way.

Since the dmesg of the kernel tells about a work-around for such a bug, I was assuming
that the work around was activated, but I had a doubt and after looking at the source,
I discovered that it wasn't.

On Saturday I patched the kernels. Since then the midnight peaks are no longer
broken and there are no more desperate messages from the NET subsystem in the logs,
so maybe the problem has been fixed.

Now, as Ion says, maybe it is not the "receiver lock-up bug" itself which is
worked-around, frankly I don't know.


-- 
Augustin Vidovic   http://www.vidovic.org/augustin/
"Nous sommes tous quelque chose de naissance, musicien ou assassin,
 mais il faut apprendre le maniement de la harpe ou du couteau."

 mrtg.png


Re: [PATCH] eepro100.c, kernel 2.4.1

2001-02-07 Thread Ion Badulescu

On Thu, 8 Feb 2001, Alan Cox wrote:

> > It's the printk that gets it wrong, although that's harmless.
> > Intel's documentation states that the bug does NOT exist if the
> > bits 0 and 1 in eeprom[3] are 1. Thus, the workaround is correct,
> > the printk is wrong.
> 
> So why does it fix the problem for him. His report and your reply don't
> make sense viewed together

I don't think it fixes *this* bug. However, the bug workaround effectively
reinitializes the chip, so it might serve as a generic 'reset and try
again' kind of workaround. In that case, we might as well enable it
unconditionally... but I don't see it as a good solution. It's a stop-gap 
measure at best.

We need to find out what exactly happens. Until he tells us more about how 
his boxes "were failing before", there really isn't much we can diagnose.

I happen to also have an Intel ISP1100 box here, and I know what's inside 
-- i82559 C-step chips which definitely don't have this bug. The bug is an 
i82557-only bug; what makes things confusing is Intel's idea of giving 
multiple chips the same PCI id. They can be identified via the PCI rev:

i82557 step A-C: rev 1-3
i82558 step A-B: rev 4-5
i82559 step A-C: rev 6-8
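
As a rough sketch, the table above could be turned into a lookup like the one below. The helper names are hypothetical and not part of eepro100.c. The revision ranges are taken only from the list in this mail, and revisions outside it are reported as unknown rather than guessed.

```c
#include <stddef.h>

/* Hypothetical helper: map the PCI revision ID of an 8086:1229 device
 * to the chip family, using only the ranges listed in this mail. */
const char *speedo_chip_from_rev(unsigned int rev)
{
    if (rev >= 1 && rev <= 3)
        return "i82557";
    if (rev >= 4 && rev <= 5)
        return "i82558";
    if (rev >= 6 && rev <= 8)
        return "i82559";
    return NULL; /* revision not covered by the table */
}

/* Per this mail, the receiver lock-up bug is i82557-only, so only
 * revisions 1-3 can be affected. */
int rev_may_have_rx_bug(unsigned int rev)
{
    return rev >= 1 && rev <= 3;
}
```

On a running system the revision shows up in the lspci output for the device, so a board reporting rev 08, for example, would be an i82559 C-step by this table.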

Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.





Re: [PATCH] eepro100.c, kernel 2.4.1

2001-02-07 Thread Alan Cox

> It's the printk that gets it wrong, although that's harmless.
> Intel's documentation states that the bug does NOT exist if the
> bits 0 and 1 in eeprom[3] are 1. Thus, the workaround is correct,
> the printk is wrong.

So why does it fix the problem for him. His report and your reply don't
make sense viewed together




Re: [PATCH] eepro100.c, kernel 2.4.1

2001-02-07 Thread Augustin Vidovic

On Wed, Feb 07, 2001 at 11:23:01PM -0800, Ion Badulescu wrote:
> Intel's documentation states that the bug does NOT exist if the
> bits 0 and 1 in eeprom[3] are 1. Thus, the workaround is correct,
> the printk is wrong.

I wonder if it's not Intel's documentation which is wrong : it seems
that the bug showed up also with the network cards used in my boxes,
and the patch I proposed seemed to fix that problem.

-- 
Augustin Vidovic   http://www.vidovic.org/augustin/
"Nous sommes tous quelque chose de naissance, musicien ou assassin,
 mais il faut apprendre le maniement de la harpe ou du couteau."



Re: [PATCH] eepro100.c, kernel 2.4.1

2001-02-07 Thread Ion Badulescu

On Thu, 8 Feb 2001 14:53:55 +0900, Augustin Vidovic <[EMAIL PROTECTED]> wrote:

> --- linux-2.4.1/drivers/net/eepro100.c	Sun Jan 28 03:40:14 2001
> +++ linux-2.4.1-vido1/drivers/net/eepro100.c	Thu Feb  8 14:08:49 2001
> @@ -815,7 +815,7 @@
>  
>  	sp->phy[0] = eeprom[6];
>  	sp->phy[1] = eeprom[7];
> -	sp->rx_bug = (eeprom[3] & 0x03) == 3 ? 0 : 1;
> +	sp->rx_bug = eeprom[3] & 0x03;
>  
>  	if (sp->rx_bug)
>  		printk(KERN_INFO "  Receiver lock-up workaround activated.\n");

This patch is wrong, please DON'T apply it.

It's the printk that gets it wrong, although that's harmless.
Intel's documentation states that the bug does NOT exist if the
bits 0 and 1 in eeprom[3] are 1. Thus, the workaround is correct,
the printk is wrong.

The correct patch for 2.4.1 is attached. 2.2.18 needs something
similar, the same patch can be applied with some fuzz.
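
To make the disagreement concrete, here is a minimal sketch of the two
tests discussed above, lifted out of the driver into standalone
functions. The function names are hypothetical; only the expressions on
eeprom[3] come from the patches in this thread.

```c
#include <stdint.h>

/* Banner test as it stood in 2.4.1: fires when either low bit is set. */
static int old_banner_test(uint16_t eeprom3)
{
	return (eeprom3 & 0x03) != 0;
}

/*
 * Workaround test in 2.4.1, which is also the corrected banner test:
 * the bug is assumed present unless bits 0 and 1 are both set.
 */
static int rx_bug_test(uint16_t eeprom3)
{
	return (eeprom3 & 0x03) != 0x03;
}

/* Nonzero when the banner and the workaround would disagree. */
static int tests_disagree(uint16_t eeprom3)
{
	return old_banner_test(eeprom3) != rx_bug_test(eeprom3);
}
```

The two tests agree when exactly one of the bits is set and disagree
when the low bits are 00 (workaround on, no banner) or 11 (banner
printed, workaround off). The second case matches the symptom reported
in the original posting.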

Thanks,
Ion

-- 
  It is better to keep your mouth shut and be thought a fool,
than to open it and remove all doubt.

--- /usr/src/local/linux-2.4.vanilla/drivers/net/eepro100.c	Wed Feb  7 15:45:16 2001
+++ linux-2.4/drivers/net/eepro100.c	Wed Feb  7 23:07:29 2001
@@ -725,7 +725,7 @@
 		/* The self-test results must be paragraph aligned. */
 		volatile s32 *self_test_results;
 		int boguscnt = 16000;	/* Timeout for set-test. */
-		if (eeprom[3] & 0x03)
+		if ((eeprom[3] & 0x03) != 0x03)
 			printk(KERN_INFO "  Receiver lock-up bug exists -- enabling"
 				   " work-around.\n");
 		printk(KERN_INFO "  Board assembly %4.4x%2.2x-%3.3d, Physical"



[PATCH] eepro100.c, kernel 2.4.1

2001-02-07 Thread Augustin Vidovic

Patch for drivers/net/eepro100.c in kernel 2.4.1 (and earlier).
For some of the buggy Intel EtherExpress Pro 100 network cards,
the driver diagnoses the receiver lock-up bug but does not enable the
workaround. It appears that the test for the diagnostic and the test
for the workaround activation are different. I assumed the diagnostic
test is OK and changed the workaround activation test. I had several
Intel ISP1100 boxes with the bug diagnosed but the workaround not enabled;
after the patch, the workaround is activated and the boxes seem to
be all right even under very high network traffic (they were failing
before, due to the card bug, I think).

Attached is a tarball of the patch, which I believe conforms to the list
FAQ guidelines. Since the patch is only one line, I also include it
in the body of this message as plain text.

--- linux-2.4.1/drivers/net/eepro100.c	Sun Jan 28 03:40:14 2001
+++ linux-2.4.1-vido1/drivers/net/eepro100.c	Thu Feb  8 14:08:49 2001
@@ -815,7 +815,7 @@
 
 	sp->phy[0] = eeprom[6];
 	sp->phy[1] = eeprom[7];
-	sp->rx_bug = (eeprom[3] & 0x03) == 3 ? 0 : 1;
+	sp->rx_bug = eeprom[3] & 0x03;
 
 	if (sp->rx_bug)
 		printk(KERN_INFO "  Receiver lock-up workaround activated.\n");

I don't understand why the tests for the diagnostic and for the
workaround activation were different. Maybe a simple bug, but maybe there
was an obscure reason. If someone knows...

-- 
Augustin Vidovic   http://www.vidovic.org/augustin/
"Nous sommes tous quelque chose de naissance, musicien ou assassin,
 mais il faut apprendre le maniement de la harpe ou du couteau."

 patch-eepro100-vido1.tar

