Re: 24 lost ticks with 2.6.20.10 kernel

2007-05-02 Thread Kok, Auke

Michel Lespinasse wrote:
I've had good results with 2.6.21.1 (even running tickless :)) on these 
NICs. Have you tried that yet?


Not yet. Coming up... I'd prefer not to rely on new kernels at this
point though - but I can certainly try it just to report on current status.


I just checked, and the fix I was referring to earlier didn't make it into 
2.6.21-final. You can get 2.6.21-git1 from kernel.org, which has the fix. See


http://www.kernel.org/pub/linux/kernel/v2.6/snapshots/patch-2.6.21-git1.log

for the changelog and commit IDs for the patch:

commit c58b8e4a25a1ba347a0e5d21984c97bd296f1691
Merge: afc2e82... f50393f...
Author: Linus Torvalds <[EMAIL PROTECTED]>
Date:   Fri Apr 27 10:14:53 2007 -0700

Merge branch 'e1000-fixes' of 
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6


* 'e1000-fixes' of 
master.kernel.org:/pub/scm/linux/kernel/git/jgarzik/netdev-2.6:

  e1000: FIX: Stop raw interrupts disabled nag from RT
  e1000: FIX: firmware handover bits
  e1000: FIX: be ready for incoming irq at pci_request_irq

Please give this a try and see if it resolves your issues.

hth,

Auke


Re: 24 lost ticks with 2.6.20.10 kernel

2007-05-02 Thread Kok, Auke

Michel Lespinasse wrote:

On my system, every e1000_watchdog() invocation calls e1000_read_phy_reg()
twice: first near the top of e1000_check_for_link() within the
e1000_media_type_copper && hw->get_link_status condition, then within
e1000_update_stats() to read and update the idle_errors statistic.
Each call results in a 100ms delay. The second call is enclosed within
a spin_lock_irqsave()..spin_unlock_irqrestore() section, so it results
in 100ms of lost ticks too.


Unfortunately we need the spinlock here. The irqsave may no longer be needed, 
since we recently modified the watchdog to run as a task (out of interrupt 
context), but I'm not 100% sure, and that code hasn't made it upstream yet 
(it's sitting in -mm if you're interested).
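
For reference, the construct in question is roughly this (a sketch from
memory, not the exact upstream code; in the real driver the lock may be
taken inside e1000_update_stats() itself):

	unsigned long flags;

	/* stats_lock protects the statistics registers; taking it with
	 * irqsave turns off local interrupts for the whole stats read */
	spin_lock_irqsave(&adapter->stats_lock, flags);
	e1000_update_stats(adapter);	/* includes the failing PHY read */
	spin_unlock_irqrestore(&adapter->stats_lock, flags);

With the 100ms e1000_read_phy_reg() timeout happening inside that section,
the local CPU takes no timer interrupts for the duration.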



Now I have no idea how to fix that, but it does seem like it must be an
initialisation issue. It might be a matter of telling the firmware
"management engine" to keep its paws off the adapter, I don't know.
If you want me to add logging within the init functions, let me know.


Please don't; see below.


The other operations - like all the E1000_READ_REG() calls within
e1000_update_stats() - seem to take negligible time compared to the
two failing e1000_read_phy_reg() calls.

I've had good results with 2.6.21.1 (even running tickless :)) on these 
NICs. Have you tried that yet?


Not yet. Coming up... I'd prefer not to rely on new kernels at this
point though - but I can certainly try it just to report on current status.


I currently suspect that (on this NIC) you're being bitten by an initialization 
bug that was fixed by later patches that made it into 2.6.21. The best thing for 
you to try is to run 2.6.21 in the same configuration and see if that fixes it. 
It has to do with a patch I sent to fix the firmware takeover bits at startup, 
something that was definitely broken in 2.6.19 and probably 2.6.20.


Auke


Re: 24 lost ticks with 2.6.20.10 kernel

2007-05-02 Thread Andi Kleen
Michel Lespinasse <[EMAIL PROTECTED]> writes:
> 
> running with report_lost_ticks, I see the following:
> 
> May  1 12:58:57 server kernel: time.c: Lost 24 timer tick(s)! rip 
> _spin_unlock_irqrestore+0x8/0x9)
> May  1 12:58:59 server kernel: time.c: Lost 24 timer tick(s)! rip 
> _spin_unlock_irqrestore+0x8/0x9)
> May  1 12:59:01 server kernel: time.c: Lost 25 timer tick(s)! rip 
> _spin_unlock_irqrestore+0x8/0x9)
> May  1 12:59:03 server kernel: time.c: Lost 24 timer tick(s)! rip 
> _spin_unlock_irqrestore+0x8/0x9)

This means some code disabled interrupts using a spinlock for far too long.
Can you add a dump_stack() in arch/x86_64/kernel/time.c:handle_lost_ticks()?
That should tell us the culprit.
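
A minimal sketch of the idea (not the actual function body; in 2.6.20 the
function takes the lost-tick count, and I believe a pt_regs pointer too):

	static noinline void handle_lost_ticks(int lost, struct pt_regs *regs)
	{
		printk(KERN_WARNING "time.c: Lost %d timer tick(s)!\n", lost);
		dump_stack();	/* temporary: show who held interrupts off */
	}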

-Andi


Re: 24 lost ticks with 2.6.20.10 kernel

2007-05-02 Thread Michel Lespinasse
On Tue, May 01, 2007 at 03:08:48PM -0700, Kok, Auke wrote:
> Michel Lespinasse wrote:
> >(I've added the E1000 maintainers to the thread as I found the issue
> >seems to go away after I compile out that driver. For reference, I was
> >trying to figure out why I lose exactly 24 ticks about every two
> >seconds, as shown with report_lost_ticks. This is with a DQ965GF
> >motherboard with onboard E1000).
> 
> that's perfectly likely. The main issue is that we read the hardware stats 
> every two seconds, and that can consume quite some time. It's strange that 
> you are losing that many ticks IMHO, but losing one or two might very well 
> be expected.

Ah, interesting.

So, I dived a bit deeper and added printks all over the place to
trace what's going on. Here is what I'm seeing.

For some reason (I suppose something wrong in the hardware init code?)
e1000_get_software_flag() does not succeed. It just loops PHY_CFG_TIMEOUT=100
times, busy-waiting 1ms each time with mdelay(1), and then returns (after
100ms) with a return code of -E1000_ERR_CONFIG.

Because of that, e1000_read_phy_reg() does not succeed either:
it calls e1000_swfw_sync_acquire() -> e1000_get_software_flag() and then
returns -E1000_ERR_SWFW_SYNC.
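
For reference, the failing loop is shaped roughly like this (paraphrased
from my reading of the driver; register and flag names approximate):

	s32 e1000_get_software_flag(struct e1000_hw *hw)
	{
		u32 extcnf_ctrl;
		u32 timeout = PHY_CFG_TIMEOUT;	/* 100 */

		while (timeout) {
			/* try to claim the software flag from the firmware */
			extcnf_ctrl = E1000_READ_REG(hw, EXTCNF_CTRL);
			extcnf_ctrl |= E1000_EXTCNF_CTRL_SWFLAG;
			E1000_WRITE_REG(hw, EXTCNF_CTRL, extcnf_ctrl);
			if (E1000_READ_REG(hw, EXTCNF_CTRL) &
			    E1000_EXTCNF_CTRL_SWFLAG)
				return E1000_SUCCESS;	/* flag granted */
			mdelay(1);	/* busy-wait; this is not a sleep */
			timeout--;
		}
		return -E1000_ERR_CONFIG;	/* 100 x 1ms = the 100ms delay */
	}

On my box the flag is never granted, so the full 100ms is burned every time.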

On my system, every e1000_watchdog() invocation calls e1000_read_phy_reg()
twice: first near the top of e1000_check_for_link() within the
e1000_media_type_copper && hw->get_link_status condition, then within
e1000_update_stats() to read and update the idle_errors statistic.
Each call results in a 100ms delay. The second call is enclosed within
a spin_lock_irqsave()..spin_unlock_irqrestore() section, so it results
in 100ms of lost ticks too (at HZ=250, 100ms is exactly 25 ticks, which
matches the 24-25 lost ticks in my logs).

Now I have no idea how to fix that, but it does seem like it must be an
initialisation issue. It might be a matter of telling the firmware
"management engine" to keep its paws off the adapter, I don't know.
If you want me to add logging within the init functions, let me know.

The other operations - like all the E1000_READ_REG() calls within
e1000_update_stats() - seem to take negligible time compared to the
two failing e1000_read_phy_reg() calls.

> I've had good results with 2.6.21.1 (even running tickless :)) on these 
> NICs. Have you tried that yet?

Not yet. Coming up... I'd prefer not to rely on new kernels at this
point though - but I can certainly try it just to report on current status.

-- 
Michel "Walken" Lespinasse
"Bill Gates is a monocle and a Persian cat away from being the villain
in a James Bond movie." -- Dennis Miller


Re: 24 lost ticks with 2.6.20.10 kernel

2007-05-01 Thread Lee Revell

On 5/1/07, Kok, Auke <[EMAIL PROTECTED]> wrote:

Michel Lespinasse wrote:
> (I've added the E1000 maintainers to the thread as I found the issue
> seems to go away after I compile out that driver. For reference, I was
> trying to figure out why I lose exactly 24 ticks about every two
> seconds, as shown with report_lost_ticks. This is with a DQ965GF
> motherboard with onboard E1000).

that's perfectly likely. The main issue is that we read the hardware stats every
two seconds, and that can consume quite some time. It's strange that you are
losing that many ticks IMHO, but losing one or two might very well be expected.



Why do you have to disable all interrupts?  Shouldn't it be sufficient
to disable interrupts on the e1000 only?
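
I.e., instead of turning off local interrupts around the whole stats read,
mask just this adapter's interrupt causes, something like (a sketch only,
using the driver's interrupt mask set/clear registers):

	/* mask all interrupt causes on this NIC only */
	E1000_WRITE_REG(hw, IMC, ~0);
	E1000_WRITE_FLUSH(hw);

	/* ... read the stats registers ... */

	/* unmask the causes the driver normally uses */
	E1000_WRITE_REG(hw, IMS, IMS_ENABLE_MASK);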

Lee


Re: 24 lost ticks with 2.6.20.10 kernel

2007-05-01 Thread Kok, Auke

Chuck Ebbert wrote:

Kok, Auke wrote:

Michel Lespinasse wrote:

(I've added the E1000 maintainers to the thread as I found the issue
seems to go away after I compile out that driver. For reference, I was
trying to figure out why I lose exactly 24 ticks about every two
seconds, as shown with report_lost_ticks. This is with a DQ965GF
motherboard with onboard E1000).

that's perfectly likely. The main issue is that we read the hardware
stats every two seconds, and that can consume quite some time. It's
strange that you are losing that many ticks IMHO, but losing one or two
might very well be expected.

We've been playing with all sorts of solutions to this problem and
haven't come up with a way to reduce the load of reading the HW
stats, and it remains the most likely culprit, although I don't rule
out the cleanup routines just yet. This could very well be exacerbated
at 100mbit speeds as well; I never looked at that.

I've had good results with 2.6.21.1 (even running tickless :)) on these
NICs. Have you tried that yet?


Maybe this could fix it in 2.6.20? (went into 2.6.21)


well, that hasn't got anything to do with stats, but is part of the clean_tx/rx 
codepath. I personally don't get any lost ticks so I can't reproduce this, but 
that was why I was hinting that you could try it for us ;)


codewise, the patch below makes our cleanup routine spend _more_ time, instead 
of less, which is why I think it's neither the cause nor the fix.


Auke



--

Gitweb: 
http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=46fcc86dd71d70211e965102fb69414c90381880
Commit: 46fcc86dd71d70211e965102fb69414c90381880
Parent: 2b858bd02ffca71391161f5709588fc70da79531
Author: Linus Torvalds <[EMAIL PROTECTED]>
AuthorDate: Thu Apr 19 18:21:01 2007 -0700
Committer:  Linus Torvalds <[EMAIL PROTECTED]>
CommitDate: Thu Apr 19 18:21:01 2007 -0700

Revert "e1000: fix NAPI performance on 4-port adapters"

This reverts commit 60cba200f11b6f90f35634c5cd608773ae3721b7.  It's been

linked to lockups of the e1000 hardware, see for example

	https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=229603

but it's likely that the commit itself is not really introducing the

bug, but just allowing an unrelated problem to rear its ugly head (ie
one current working theory is that the code exposes us to a hardware
race condition by decreasing the amount of time we spend in each NAPI
poll cycle).

We'll revert it until root cause is known.  Intel has a repeatable

reproduction on two different machines and bus traces of the hardware
doing something bad.




Re: 24 lost ticks with 2.6.20.10 kernel

2007-05-01 Thread Chuck Ebbert
Kok, Auke wrote:
> Michel Lespinasse wrote:
>> (I've added the E1000 maintainers to the thread as I found the issue
>> seems to go away after I compile out that driver. For reference, I was
>> trying to figure out why I lose exactly 24 ticks about every two
>> seconds, as shown with report_lost_ticks. This is with a DQ965GF
>> motherboard with onboard E1000).
> 
> that's perfectly likely. The main issue is that we read the hardware
> stats every two seconds, and that can consume quite some time. It's
> strange that you are losing that many ticks IMHO, but losing one or two
> might very well be expected.
> 
> We've been playing with all sorts of solutions to this problem and
> haven't come up with a way to reduce the load of reading the HW
> stats, and it remains the most likely culprit, although I don't rule
> out the cleanup routines just yet. This could very well be exacerbated
> at 100mbit speeds as well; I never looked at that.
> 
> I've had good results with 2.6.21.1 (even running tickless :)) on these
> NICs. Have you tried that yet?

Maybe this could fix it in 2.6.20? (went into 2.6.21)

--

Gitweb: 
http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=46fcc86dd71d70211e965102fb69414c90381880
Commit: 46fcc86dd71d70211e965102fb69414c90381880
Parent: 2b858bd02ffca71391161f5709588fc70da79531
Author: Linus Torvalds <[EMAIL PROTECTED]>
AuthorDate: Thu Apr 19 18:21:01 2007 -0700
Committer:  Linus Torvalds <[EMAIL PROTECTED]>
CommitDate: Thu Apr 19 18:21:01 2007 -0700

Revert "e1000: fix NAPI performance on 4-port adapters"

This reverts commit 60cba200f11b6f90f35634c5cd608773ae3721b7.  It's been
linked to lockups of the e1000 hardware, see for example

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=229603

but it's likely that the commit itself is not really introducing the
bug, but just allowing an unrelated problem to rear its ugly head (ie
one current working theory is that the code exposes us to a hardware
race condition by decreasing the amount of time we spend in each NAPI
poll cycle).

We'll revert it until root cause is known.  Intel has a repeatable
reproduction on two different machines and bus traces of the hardware
doing something bad.




Re: 24 lost ticks with 2.6.20.10 kernel

2007-05-01 Thread Kok, Auke

Michel Lespinasse wrote:

(I've added the E1000 maintainers to the thread as I found the issue
seems to go away after I compile out that driver. For reference, I was
trying to figure out why I lose exactly 24 ticks about every two
seconds, as shown with report_lost_ticks. This is with a DQ965GF
motherboard with onboard E1000).


that's perfectly likely. The main issue is that we read the hardware stats every 
two seconds, and that can consume quite some time. It's strange that you are 
losing that many ticks IMHO, but losing one or two might very well be expected.


We've been playing with all sorts of solutions to this problem and haven't come 
up with a way to reduce the load of reading the HW stats, and it remains 
the most likely culprit, although I don't rule out the cleanup routines just 
yet. This could very well be exacerbated at 100mbit speeds as well; I never 
looked at that.
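
For context, the stats read happens from the driver's watchdog, which re-arms 
itself every two seconds -- roughly (a 2.6.20-era sketch, not the exact code):

	/* at the end of e1000_watchdog(): refresh stats, then re-arm */
	e1000_update_stats(adapter);
	mod_timer(&adapter->watchdog_timer, jiffies + 2 * HZ);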


I've had good results with 2.6.21.1 (even running tickless :)) on these NICs. 
Have you tried that yet?


Auke


Re: 24 lost ticks with 2.6.20.10 kernel

2007-05-01 Thread Michel Lespinasse
(I've added the E1000 maintainers to the thread as I found the issue
seems to go away after I compile out that driver. For reference, I was
trying to figure out why I lose exactly 24 ticks about every two
seconds, as shown with report_lost_ticks. This is with a DQ965GF
motherboard with onboard E1000).

On Tue, May 01, 2007 at 11:34:28AM -0400, Chuck Ebbert wrote:
> Michel Lespinasse wrote:
> > running with report_lost_ticks, I see the following:
> > 
> > May  1 12:58:57 server kernel: time.c: Lost 24 timer tick(s)! rip 
> > _spin_unlock_irqrestore+0x8/0x9)
> > May  1 12:58:59 server kernel: time.c: Lost 24 timer tick(s)! rip 
> > _spin_unlock_irqrestore+0x8/0x9)

> Try disabling CONFIG_CPU_FREQ?

OK, so I tried a few things.

Disabling CONFIG_CPU_FREQ does not fix the issue.

Running with 2.6.16.49 did fix the issue, but my onboard E1000 is not
detected then (I do see the 'Intel Pro/1000 network driver' message
at boot time, but ifconfig -a does not show an ethernet interface).

Going back to 2.6.20.10 with no E1000 driver (so really no
network devices at all): aah, lost ticks are gone.
I can even re-enable CONFIG_CPU_FREQ.

With 2.6.20.10, E1000 compiled in, without NAPI: I do see the lost ticks.

With 2.6.20.10, modular E1000 with NAPI: I did not get any lost ticks.
However, the network did not work correctly after boot-up.
The e1000 module was loaded, the routes were set etc... but I could
not ping anything. The switch lights seemed to indicate no packets
were coming out. After ifdown eth0 and ifup eth0 again, the network
came up fine, but at 100 megabit speed (full duplex) instead of gigabit.
A few log messages come up:
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
e1000: eth0: e1000_watchdog: NIC Link is Down
e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
e1000: eth0: e1000_watchdog: 10/100 speed: disabling TSO
Which is slightly strange since the box is connected to a gigabit switch.

The default kernel from Debian etch (2.6.18-based, with modular e1000)
connects at gigabit speed but loses ticks on this machine.

> > One last thing: I have another box which is fairly similar, with a
> > DG965RY motherboard. That other box does *not* seem to lose any ticks,
> > running the same kernel that works so poorly on the DQ965GF board.
> > Does that point to a hardware/bios issue then ?
> 
> Is it running the exact same kernel, with the same cpufreq settings at
> runtime (governor, controller, etc?)

I have not been running 2.6.20.10 on the DG965RY yet. However, the kernel
I use on the DG965RY (2.6.20.7 with statically compiled E1000 driver),
which does not lose ticks there, does lose ticks when I move it to the DQ965GF.

If it makes any difference, the DQ965GF is running with BIOS version 5882
dated 4/13/2007. The DG965RY runs with BIOS version 1676 (same date).
I have not changed any settings in the 'management engine' BIOS menu
(actually I have not entered it, since it'd require me to set a password,
and I thought I'd avoid features I don't fully understand :)

Anything else I should try?

Thanks,

-- 
Michel "Walken" Lespinasse
"Bill Gates is a monocle and a Persian cat away from being the villain
in a James Bond movie." -- Dennis Miller


Re: 24 lost ticks with 2.6.20.10 kernel

2007-05-01 Thread Chuck Ebbert
Michel Lespinasse wrote:
> 
> I'm having an issue with lost ticks, running Linux 2.6.20.10 on an
> Intel DQ965GF motherboard. For some reason this occurs with clock-like
> regularity, always exactly 24 lost ticks, about every two seconds.
> This is running with HZ=250 ticks, and the small pause every two seconds
> is also noticeable when using the console.
> 
> 
> running with report_lost_ticks, I see the following:
> 
> May  1 12:58:57 server kernel: time.c: Lost 24 timer tick(s)! rip 
> _spin_unlock_irqrestore+0x8/0x9)
> May  1 12:58:59 server kernel: time.c: Lost 24 timer tick(s)! rip 
> _spin_unlock_irqrestore+0x8/0x9)
> May  1 12:59:01 server kernel: time.c: Lost 25 timer tick(s)! rip 
> _spin_unlock_irqrestore+0x8/0x9)
> May  1 12:59:03 server kernel: time.c: Lost 24 timer tick(s)! rip 
> _spin_unlock_irqrestore+0x8/0x9)
> May  1 12:59:07 server last message repeated 2 times
> May  1 12:59:10 server kernel: time.c: Lost 24 timer tick(s)! rip 
> handle_IRQ_event+0x19/0x55)
> May  1 12:59:12 server kernel: time.c: Lost 24 timer tick(s)! rip 
> handle_IRQ_event+0x19/0x55)
> May  1 12:59:14 server kernel: time.c: Lost 24 timer tick(s)! rip 
> _spin_unlock_irqrestore+0x8/0x9)
> May  1 12:59:16 server kernel: time.c: Lost 24 timer tick(s)! rip 
> _spin_unlock_irqrestore+0x8/0x9)
> May  1 12:59:18 server kernel: time.c: Lost 24 timer tick(s)! rip 
> handle_IRQ_event+0x19/0x55)
> May  1 12:59:20 server kernel: time.c: Lost 24 timer tick(s)! rip 
> _spin_unlock_irqrestore+0x8/0x9)
> May  1 12:59:52 server last message repeated 15 times
> May  1 13:00:51 server last message repeated 28 times
> May  1 13:01:47 server last message repeated 27 times
> May  1 13:01:50 server kernel: time.c: Lost 25 timer tick(s)! rip 
> _spin_unlock_irqrestore+0x8/0x9)
> May  1 13:01:56 server last message repeated 3 times
> May  1 13:01:58 server kernel: time.c: Lost 24 timer tick(s)! rip 
> _spin_unlock_irqrestore+0x8/0x9)
> 
> 
> Since the rip often mentions _spin_unlock_irqrestore, I figured maybe
> someone sleeps for 25 ticks with the interrupts disabled. I tried
> running with CONFIG_DEBUG_SPINLOCK_SLEEP but this did not uncover anything.
> 
> 
> I also noticed this code in arch/x86_64/kernel/time.c. I'm not sure if it
> has anything to do with my issue, but it might since I get lost_count==24...
> I could not figure out what this code is trying to do though.
> 
> #ifdef CONFIG_CPU_FREQ
> /* In some cases the CPU can change frequency without us noticing
>Give cpufreq a change to catch up. */
> if ((lost_count+1) % 25 == 0)
> cpufreq_delayed_get();
> #endif
> 
> 
> I'm not sure where to go from here. Additional information about my system:
> 
 
Try disabling CONFIG_CPU_FREQ?


> One last thing: I have another box which is fairly similar, with a
> DG965RY motherboard. That other box does *not* seem to lose any ticks,
> running the same kernel that works so poorly on the DQ965GF board.
> Does that point to a hardware/bios issue then ?

Is it running the exact same kernel, with the same cpufreq settings at
runtime (governor, controller, etc.)?

