Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-15 Thread Panu Matilainen


On Thu, 12 Oct 2000, Keith Owens wrote:

> On Thu, 12 Oct 2000 12:56:09 +0100 (BST), 
> Tigran Aivazian <[EMAIL PROTECTED]> wrote:
> >one correction -- it was "down and up the interface" that did the trick
> >and not deleting the 64M mtrr entry. I.e. the eepro100 problem is better
> >formulated as "when highmem is enabled one or both eepro100 interfaces
> >sometimes do not work from boot but downing/upping the interface usually
> >helps". When highmem is disabled, so far, _both_ eepro100 interfaces
> >_always_ work on boot.
> 
> That may only be coincidence.  We have intermittent problems with
> eepro100 under 2.4.0-testx, both ix86 and ia64.  The symptoms are "card
> reports no resources" messages; down and up the interface and it
> usually works.

Might be related or not, but I've had nothing but problems with eepro100
until it's forced to use 100/FD (100 Mbit, full duplex). Symptoms are
either no network at all (driver complaining "card reports no resources")
or impossibly slow and erratic network connections (like "ypcat foo"
hanging for a second a few times along the way).

- Panu -

> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [EMAIL PROTECTED]
> Please read the FAQ at http://www.tux.org/lkml/
> 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-13 Thread Rik van Riel


On Fri, 13 Oct 2000, Richard Guenther wrote:
> On Fri, 13 Oct 2000, Rik van Riel wrote:
> > On Thu, 12 Oct 2000, Richard Guenther wrote:
> > 
> > > I reported this BUG a few days ago but got no response - it happens
> > > on UP with only 32M ram, too (see below). Also note the second
> > > BUG at vmscan.c:538, which I believe I never saw reported again.
> > 
> > > > Oct 11 16:05:26 hilbert36 kernel: kernel BUG at page_alloc.c:221!
> > > [snipped]
> > 
> > Did you get the bug with or without VMware ?
> > [it seems vmware is doing something strange ;)]
> 
> I don't have VMware - at least it would be no fun on 32M and an
> old P100 I suspect... :)

OK, I'll look into this a bit more...

[but I'm leaving for a conference in Miami in 3 hours, so
there's little chance of hearing anything back from me in
the next few days.]

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-13 Thread Richard Guenther


On Fri, 13 Oct 2000, Rik van Riel wrote:

> On Thu, 12 Oct 2000, Richard Guenther wrote:
> 
> > I reported this BUG a few days ago but got no response - it happens
> > on UP with only 32M ram, too (see below). Also note the second
> > BUG at vmscan.c:538, which I believe I never saw reported again.
> 
> > > Oct 11 16:05:26 hilbert36 kernel: kernel BUG at page_alloc.c:221!
> > [snipped]
> 
> Did you get the bug with or without VMware ?
> [it seems vmware is doing something strange ;)]

I don't have VMware - at least it would be no fun on 32M and an old
P100 I suspect... :)

> The second bug is almost certainly a direct
> consequence of the kernel continuing after
> the first one happened...

Yeah, that was my thought, too - but who knows...

Richard.

--
Richard Guenther <[EMAIL PROTECTED]>
WWW: http://www.anatom.uni-tuebingen.de/~richi/
The GLAME Project: http://www.glame.de/


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-13 Thread Rik van Riel


On Thu, 12 Oct 2000, Richard Guenther wrote:

> I reported this BUG a few days ago but got no response - it happens
> on UP with only 32M ram, too (see below). Also note the second
> BUG at vmscan.c:538, which I believe I never saw reported again.

> > Oct 11 16:05:26 hilbert36 kernel: kernel BUG at page_alloc.c:221!
> [snipped]

Did you get the bug with or without VMware ?
[it seems vmware is doing something strange ;)]

The second bug is almost certainly a direct
consequence of the kernel continuing after
the first one happened...

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/


eepro100 problem [was: Re: test10-pre1 problems on 4-way SuperServer8050]

2000-10-12 Thread Andrey Savochkin


Hi,

On Thu, Oct 12, 2000 at 02:19:27PM +0100, Tigran Aivazian wrote:
> Having done a few more reboots I got more info -- one of the eepro100
> interfaces is dead only in 4 out of 5 cases. So, sometimes, doing ifdown eth0
> ; ifup eth0 does help.

Tigran, please check whether you have any driver messages, in particular
"card reports no resources".
There is a known problem which fits the symptoms you describe.
Dragan Stancevic <[EMAIL PROTECTED]> was going to look at Intel's errata
about this matter.

> 
> So, the latest status: all 6G of RAM work fast but the onboard eepro100
> interface, often, doesn't work. This starts to look like eepro100-driver
> related so I copied Andrey Savochkin. Btw, one of my colleagues also
> reported a similar situation on his quad Xeon with 6G RAM whereby one of
> the eepro100 interfaces was dead until one restarts it.
> 
> Starting to fiddle with eepro100.c now...

Best regards
Andrey

Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread Cort Dougan


} Hi,
} 
} > How?  If you compile with egcs-2.91.66 without frame pointers on ix86 then
} > __builtin_return_address() yields garbage.  Does anybody have a generic
} > solution to this problem, other than "compile with frame pointers"?  Or is
} > it fixed in newer versions of gcc?
} 
} Are you sure? I just tried it with 2.91.66 and it works. With
} -fomit-frame-pointer only __builtin_return_address(0) works, but that is
} true for any version.

I've found, with several versions of gcc, that leaf functions will give
bad results (sometimes resulting in a bad access fault) with calls to
__builtin_return_address(0).  The workaround in the kernel and RTLinux is
to make a call so that your function isn't a leaf.

Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread Roman Zippel


Hi,

> How?  If you compile with egcs-2.91.66 without frame pointers on ix86 then
> __builtin_return_address() yields garbage.  Does anybody have a generic
> solution to this problem, other than "compile with frame pointers"?  Or is
> it fixed in newer versions of gcc?

Are you sure? I just tried it with 2.91.66 and it works. With
-fomit-frame-pointer only __builtin_return_address(0) works, but that is
true for any version.

bye, Roman


Re: IRQ affinity vs. MTRRs, was Re: 36 bit MTRRs, Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread David Wragg

Boszormenyi Zoltan <[EMAIL PROTECTED]> writes:
> The idea is that when it is sure that _only one_ (or some) CPU will access
> a PCI card's mmio area then only that CPU's (those CPUs') MTRRs needs to
> contain an entry for that area.
>
> Although there are (must be) common MTRR entries for the main memory
> and the commonly accessed mmio register areas.
> 
> The idea came because fiddling with MTRRs quickly revealed that
> only 8 variable ones exist.

I see.  I think there is a more straightforward solution: PAT does the
same thing as MTRRs, but has no such "number of ranges" limitation ---
it lets you set the memory type on a page-by-page basis.  If the
number of MTRRs becomes a problem (anyone know how many the P4 has?),
then the real solution is to implement PAT support.

IIRC, only the PPro, the first PII model (Klamath?), and the first
Celeron model have MTRR but not PAT (Athlon has PAT, but /proc/cpuinfo
misreports it as "fcmov", at least in 2.2.14; Xeons always had PAT).

David

[success!] Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread Tigran Aivazian


On Thu, 12 Oct 2000, Tigran Aivazian wrote:

> On 12 Oct 2000, David Wragg wrote:
> > Ok.  I'll wait for feedback from Tigran, and if I don't get anything
> > negative I'll submit to Linus.  The 2.2 version of my patch fixes
> > problems for other people (VA Linux have included it in their kernel
> > for a while with no problems that have been reported back to me), and
> > it's silly that it isn't in 2.4testX.  I should have addressed this a
> > while ago, but I have my own distractions from kernel hacking.
> > 
> > Later on, you can send a mtrr.c maintenance patch, if you like.
> > 
> > I've just caught up on this whole thread, and I don't have any
> > objections in principle to Zoltan's patch being used instead of mine,
> > though I'd like to take a look at it first.
> 
> David, sorry I didn't know that your patch is fundamentally different from
> Zoltan's. I will now re-test with your patch and see if it makes my
> eepro100 "instabilities" go away.
> 
> The performance problems went away as I said earlier, by fiddling with
> cache settings in the BIOS. (with and without Zoltan's patch my machine is
> now as fast as it can be)
> 

hmmm, very interesting... It looks like your patch fixed all the remaining
problems. I.e. not only my 6G is now fast (it was without your patch) but
all the eepro100 interfaces now _always_ (tried 4 reboots) come up
functioning.

Your patch is now a permanent part of my tree, thank you! :)

Thanks,
Tigran


Re: IRQ affinity vs. MTRRs, was Re: 36 bit MTRRs, Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread David Wragg

Boszormenyi Zoltan <[EMAIL PROTECTED]> writes:
> I came up with an idea. The MTRRs are per-cpu things.
> Ingo Molnar's IRQ affinity code helps binding certain
> IRQ sources to certain CPUs.

They are implemented as per-cpu things but the Intel manuals say that
all cpus should have the same MTRR settings.  They also give
pseudo-code for how to update them on an SMP system, which mtrr.c
follows.

If the BIOS has set them up differently at boot time, mtrr.c will
complain and copy the MTRR settings of CPU0 to the others.

Regards,
David Wragg


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread Tigran Aivazian


On 12 Oct 2000, David Wragg wrote:
> Ok.  I'll wait for feedback from Tigran, and if I don't get anything
> negative I'll submit to Linus.  The 2.2 version of my patch fixes
> problems for other people (VA Linux have included it in their kernel
> for a while with no problems that have been reported back to me), and
> it's silly that it isn't in 2.4testX.  I should have addressed this a
> while ago, but I have my own distractions from kernel hacking.
> 
> Later on, you can send a mtrr.c maintenance patch, if you like.
> 
> I've just caught up on this whole thread, and I don't have any
> objections in principle to Zoltan's patch being used instead of mine,
> though I'd like to take a look at it first.

David, sorry I didn't know that your patch is fundamentally different from
Zoltan's. I will now re-test with your patch and see if it makes my
eepro100 "instabilities" go away.

The performance problems went away as I said earlier, by fiddling with
cache settings in the BIOS. (with and without Zoltan's patch my machine is
now as fast as it can be)

Regards,
Tigran


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread David Wragg

Richard Gooch <[EMAIL PROTECTED]> writes:
> David Wragg writes:
> > mtrr.c is broken for machines with >=4GB of memory (or less than 4GB,
> > if the chipset reserves an address range below 4GB for PCI).
> > 
> > The patch against 2.4.0-test9 to fix this is below.
> > 
> > Richard: Is there a reason you haven't passed this on to Linus, or do
> > you want me to do it?
> 
> Partly because I haven't had time to look at it, partly because I'm
> not sure if it's needed (why, exactly?)

Because mtrr.c throws away the top 4 bits of 36-bit physical
addresses, it gives misleading /proc/mtrr output on machines with
>=4GB of memory, which I think requires a fix on its own.  But worse,
if it tries to make MTRR changes on such a machine, you can get bogus
MTRR settings. This can ruin a machine's performance (if real memory
ends up write combined or uncached) or give hardware instabilities (if
a device's MMIO area gets the wrong memory type).
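
To make the failure mode concrete, here is a minimal sketch in Python
(not the actual mtrr.c code) of what losing the top 4 bits does to an
address. The helper name truncate_to_32 and the example region (2 GiB
starting at 4 GiB) are assumptions chosen purely for illustration.

```python
PHYS_BITS = 36   # physical address width discussed in this thread
KEPT_BITS = 32   # what survives once the top 4 bits are thrown away


def truncate_to_32(addr):
    """Return addr with everything above bit 31 discarded."""
    return addr & ((1 << KEPT_BITS) - 1)


GIB = 1 << 30
base = 4 * GIB   # illustrative region start (assumed, not from a real log)
size = 2 * GIB   # illustrative region length (assumed)

assert base + size <= (1 << PHYS_BITS)   # the region fits in 36 bits

print(hex(base), "->", hex(truncate_to_32(base)))
print(hex(base + size), "->", hex(truncate_to_32(base + size)))
```

With these assumed values, a region that really starts at 4 GiB appears
to start at address 0, i.e. on top of ordinary low memory, which is the
kind of bogus setting described above.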

So far, this probably hasn't bitten too many people, since relatively
few Linux x86 users have >=4GB memory, and /proc/mtrr hasn't usually
been altered without explicit intervention.  But with XFree86-4
finally "out there" and more kernel drivers using MTRRs, this can only
get worse.

(Whether Tigran's performance problems are actually down to the mtrr.c
issue, I don't know.  It's not worth hypothesizing until we have
accurate /proc/mtrr output.)

When I checked the 2.2 version of my patch, it didn't involve a
significant increase in code size.

> and partly because I've
> recently moved house and (STILL!) don't have IP access at home (not
> even dialup) so I can't really look at stuff yet 

Ok.  I'll wait for feedback from Tigran, and if I don't get anything
negative I'll submit to Linus.  The 2.2 version of my patch fixes
problems for other people (VA Linux have included it in their kernel
for a while with no problems that have been reported back to me), and
it's silly that it isn't in 2.4testX.  I should have addressed this a
while ago, but I have my own distractions from kernel hacking.

Later on, you can send a mtrr.c maintenance patch, if you like.

I've just caught up on this whole thread, and I don't have any
objections in principle to Zoltan's patch being used instead of mine,
though I'd like to take a look at it first.

Regards,
David

[fixed (well, it works)]Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread Tigran Aivazian


Hello,

Ok, I despaired a bit about mtrrs on the Linux side and went into BIOS and
started playing with the cache settings there. The change that fixed the
problem was to disable all "area CXXX-> : cached". Now, I have a really
fast quad Xeon with 6G RAM and a consistently failing eepro100
interface. Downing/upping the interface does not help. I suppose in this
state it is easier to debug because everything else is fully functional --
let's just find out why this particular eepro100 doesn't work.

Kernel compiles in 54-60 seconds -- very impressive (I am talking about
full make -j4 bzImage after make clean)

Now, this is with and without Zoltan's big-mtrr patch, just verified a
minute ago.

Regards,
Tigran


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread Keith Owens


On Thu, 12 Oct 2000 12:56:09 +0100 (BST), 
Tigran Aivazian <[EMAIL PROTECTED]> wrote:
>one correction -- it was "down and up the interface" that did the trick
>and not deleting the 64M mtrr entry. I.e. the eepro100 problem is better
>formulated as "when highmem is enabled one or both eepro100 interfaces
>sometimes do not work from boot but downing/upping the interface usually
>helps". When highmem is disabled, so far, _both_ eepro100 interfaces
>_always_ work on boot.

That may only be coincidence.  We have intermittent problems with
eepro100 under 2.4.0-testx, both ix86 and ia64.  The symptoms are "card
reports no resources" messages; down and up the interface and it
usually works.


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread Tigran Aivazian


On Thu, 12 Oct 2000, Tigran Aivazian wrote:

> On Thu, 12 Oct 2000, Tigran Aivazian wrote:
> 
> > On Wed, 11 Oct 2000, Linus Torvalds wrote:
> > > What happens if MTRR support is entirely disabled?
> > 
> > If MTRR support is disabled then both eepro100 interfaces work fine but
> > the system is still 40x slower. This is the entire bootlog of
> > 2.4.0-test10-pre1 + lspci-vvx + /proc/interrupts + /proc/iomem + ifconfig
> > output
> 
> one more finding -- deleting the strange 64M mtrr entry enabled the second
> eepro100 interface!
> 
> # cat /proc/mtrr
> reg00: base=0x001 (4096MB), size=2048MB: write-combining, count=1
> reg02: base=0xfc00 (4032MB), size=  64MB: uncachable, count=1
> # 
> # echo "disable=2" > /proc/mtrr 
> # cat /proc/mtrr
> reg00: base=0x001 (4096MB), size=2048MB: write-combining, count=1
> 
> (now down and up the interface and it works. Both eepro100 work)

one correction -- it was "down and up the interface" that did the trick
and not deleting the 64M mtrr entry. I.e. the eepro100 problem is better
formulated as "when highmem is enabled one or both eepro100 interfaces
sometimes do not work from boot but downing/upping the interface usually
helps". When highmem is disabled, so far, _both_ eepro100 interfaces
_always_ work on boot.

Regards,
Tigran


Re: 36 bit MTRRs, Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread Tigran Aivazian


On Thu, 12 Oct 2000, Boszormenyi Zoltan wrote:

> On Thu, 12 Oct 2000, Boszormenyi Zoltan wrote:
> 
> > echo "base=0 size=0x1 type=write-back" >/proc/mtrr 
> > echo "base=0x1 size=0x8000 type=write-back" >/proc/mtrr
> > echo "base=0xfe00 size=0x80 type=write-combining" >/proc/mtrr
> > echo "base=0xfde0 size=0x10 type=uncached" >/proc/mtrr
> > echo "base=0xfe80 size=0x10 type=uncached" >/proc/mtrr
> > echo "base=0xfe9ed000 size=0x1000 type=uncached" >/proc/mtrr
> > echo "base=0xfe9ee000 size=0x2000 type=uncached" >/proc/mtrr
> > echo "base=0xfeafe000 size=0x2000 type=uncached" >/proc/mtrr
> 
> Sorry, use 'uncachable' instead of 'uncached'. :-(

ok, doing it from the bottom up was fine (didn't lockup) but reaching the
last (first in your list) entry was refused by mtrr:

mtrr: 0x0,0x1 overlaps existing 0xfeafe000,0x2000

# cat /proc/mtrr
reg00: base=0xfeafe000 (4074MB), size=   0kB: uncachable, count=1
reg01: base=0xfe9ee000 (4073MB), size=   0kB: uncachable, count=1
reg02: base=0xfe9ed000 (4073MB), size=   0kB: uncachable, count=1
reg03: base=0xfe80 (4072MB), size=   1MB: uncachable, count=1
reg04: base=0xfde0 (4062MB), size=   1MB: uncachable, count=1
reg05: base=0xfe00 (4064MB), size=   8MB: write-combining, count=1
reg06: base=0x001 (4096MB), size=2048MB: write-back, count=1

and machine is still slow. So, what is the correct way to cover the 6G by
some mtrrs? I will now try to disable or change strategy of L2 caching in
BIOS and see if it makes things worse.
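
For reference, a refusal like the "overlaps existing" message above comes
down to an interval-overlap test. The sketch below is Python, not the
kernel's check (mtrr.c applies further rules about enclosure and memory
types that are not modelled here). The function name ranges_overlap and
the 4 GiB size given to the base=0 region are assumptions for
illustration.

```python
def ranges_overlap(base_a, size_a, base_b, size_b):
    """True if [base_a, base_a+size_a) and [base_b, base_b+size_b) share a byte."""
    return base_a < base_b + size_b and base_b < base_a + size_a


# Existing uncachable entry, base and size as in the error message above.
existing = (0xfeafe000, 0x2000)
# Hypothetical new region: 4 GiB starting at 0 (size assumed for illustration).
new = (0x0, 1 << 32)

print(ranges_overlap(*new, *existing))            # the large region covers it
print(ranges_overlap(0x0, 0x100000, *existing))   # a 1 MiB region at 0 does not
```

Any write-back region large enough to span low memory up to 4 GiB will
hit the small uncachable entries already sitting just below 4 GiB, so the
order in which entries are added matters.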

Regards,
Tigran


Re: IRQ affinity vs. MTRRs, was Re: 36 bit MTRRs, Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread Gábor Lénárt


On Thu, Oct 12, 2000 at 12:12:19PM +0200, Boszormenyi Zoltan wrote:
> I came up with an idea. The MTRRs are per-cpu things.
> Ingo Molnar's IRQ affinity code helps binding certain
> IRQ sources to certain CPUs.
> 
> What if the MTRR driver allows per-CPU settings, maybe only on
> uncached areas? Of course the real memory should be cached in
> every CPU to avoid slowdowns. So that if you set that eth0's
> IRQ will be handled by CPU1, the MTRRs of CPU1 will be set
> accordingly, and the other CPUs will not care about eth0,
> so they do not need eth0's MTRR settings.

A little question. Why do we want to bind the irq of eth0 to a single CPU?
imho it will cause a slowdown in some situations. Why don't we leave it to
the scheduler to select a CPU for processing the IRQ?

- Gabor

Re: 36 bit MTRRs, Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread Tigran Aivazian

> On Thu, 12 Oct 2000, Boszormenyi Zoltan wrote:
> 
> > echo "base=0 size=0x1 type=write-back" >/proc/mtrr 

this line immediately locks up the machine. But I want to understand where
you got base=0 and size=0x1 from. Shouldn't it be
base=0x10 and size=0xfccf according to this entry from e820:

BIOS-e820: fccf @ 0010 (usable)

Regards,
Tigran


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread Keith Owens

On Thu, 12 Oct 2000 10:45:11 +0100 (BST), 
Tigran Aivazian <[EMAIL PROTECTED]> wrote:
>It would be nice if /proc/mtrr showed eip of
>the caller who set up the entry :)

How?  If you compile with egcs-2.91.66 without frame pointers on ix86 then
__builtin_return_address() yields garbage.  Does anybody have a generic
solution to this problem, other than "compile with frame pointers"?  Or is
it fixed in newer versions of gcc?


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread Tigran Aivazian

On Thu, 12 Oct 2000, Tigran Aivazian wrote:

> On Wed, 11 Oct 2000, Linus Torvalds wrote:
> > What happens if MTRR support is entirely disabled?
> 
> If MTRR support is disabled then both eepro100 interfaces work fine but
> the system is still 40x slower. This is the entire bootlog of
> 2.4.0-test10-pre1 + lspci-vvx + /proc/interrupts + /proc/iomem + ifconfig
> output

one more finding -- deleting the strange 64M mtrr entry enabled the second
eepro100 interface!

# cat /proc/mtrr
reg00: base=0x001 (4096MB), size=2048MB: write-combining, count=1
reg02: base=0xfc00 (4032MB), size=  64MB: uncachable, count=1
# 
# echo "disable=2" > /proc/mtrr 
# cat /proc/mtrr
reg00: base=0x001 (4096MB), size=2048MB: write-combining, count=1

(now down and up the interface and it works. Both eepro100 work)

but the machine is still intolerably slow. Where did this 64M entry come
from? (I don't have agp or drm support enabled or anything like that, I
don't even have an agp bus!) It would be nice if /proc/mtrr showed eip of
the caller who set up the entry :)

Regards,
Tigran


Re: 36 bit MTRRs, Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread Tigran Aivazian


On Thu, 12 Oct 2000, Boszormenyi Zoltan wrote:
> Look at the e820 map in the boot log, mark those areas
> as write-back and tell me what happens.

Here is e820 map:

 BIOS-e820: 0009fc00 @  (usable)
 BIOS-e820: 0400 @ 0009fc00 (reserved)
 BIOS-e820: 0002 @ 000e (reserved)
 BIOS-e820: fccf @ 0010 (usable)
 BIOS-e820: f000 @ fcdf (ACPI data)
 BIOS-e820: 1000 @ fcdff000 (ACPI NVS)
 BIOS-e820: 1000 @ fec0 (reserved)
 BIOS-e820: 1000 @ fee0 (reserved)
 BIOS-e820: 0008 @ fff8 (reserved)
 BIOS-e820: 8000 @ 0001 (usable)

I can easily setup the mtrr entry for the top 2G:

 BIOS-e820: 8000 @ 0001 (usable)

# cat /proc/mtrr
reg00: base=0x001 (4096MB), size=2048MB: write-combining, count=1
reg02: base=0xfc00 (4032MB), size=  64MB: uncachable, count=1

but trying to do the same for the low 4G:

BIOS-e820: fccf @ 0010 (usable)

mtrr complains:

# echo "base=0x10 size=0xfccf type=write-combining" > /proc/mtrr
mtrr: base(0x10) is not aligned on a size(0xfccf) boundary

suggestions?
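
One possible way around the alignment complaint is to split the range
into several blocks, each a power of two in size and aligned on its own
size, and register them one by one. A rough Python sketch follows.
split_aligned is a hypothetical helper, and the 1 MiB base and 7 MiB
length are made-up values, not the ones from the e820 map above.

```python
def split_aligned(base, size):
    """Split [base, base+size) into power-of-two blocks aligned on their own size."""
    blocks = []
    end = base + size
    while base < end:
        remaining = end - base
        # Largest power of two that still fits in what is left.
        fit = 1 << (remaining.bit_length() - 1)
        # Largest power of two that base is aligned on (no limit when base is 0).
        align = base & -base if base else fit
        block = min(fit, align)
        blocks.append((base, block))
        base += block
    return blocks


MIB = 1 << 20
# Illustrative values only: 7 MiB starting at 1 MiB.
for start, length in split_aligned(1 * MIB, 7 * MIB):
    print(f"base={start:#x} size={length:#x}")
```

For the assumed range this yields three blocks of 1, 2 and 4 MiB. Each
block would need its own variable MTRR, and only a handful of those
exist, so a badly aligned range can use them up quickly.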

Regards,
Tigran


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread Markus


Hi,

has someone looked at the Xeon errata already? Perhaps one can find the problem there.
Just in case.
G16 seems to have something to do with it ... But there are others also. I'll boot
linux and look into the sources ...

Cheers Markus

Tigran Aivazian wrote:

> On Wed, 11 Oct 2000, Linus Torvalds wrote:
> > What happens if MTRR support is entirely disabled?
>
> If MTRR support is disabled then both eepro100 interfaces work fine but
> the system is still 40x slower. This is the entire bootlog of
> 2.4.0-test10-pre1 + lspci-vvx + /proc/interrupts + /proc/iomem + ifconfig
> output
>
> Two currently active ideas (from Mark, Linus and Zoltan):
>
> a) one needs to use big-mtrr patch from Zoltan, look at e820 map and
> manually set up mtrrs to cover all 6G.
>
> b) this is an L2 cache-tag issue and there is just not enough bits in the
> tag to cover such high addresses so nothing will help, save removing the
> extra 2G or so out of the machine (or using them as MTD devices : I
> hope this is _not_ the case...
>
> another idea (in parallel) is that eepro100 stops working because its PCI
> memory space is marked as cacheable.
>
> All should become clear soon -- I will spend the whole day on this, slowly
> trying to understand what's going on.
>
> Regards,
> Tigran
>
> Linux version 2.4.0-test9 (root@hilbert) (gcc version egcs-2.91.66 19990314/Linux 
>(egcs-1.1.2 release)) #15 SMP Wed Oct 11 19:23:15 BST 2000
> BIOS-provided physical RAM map:
>  BIOS-e820: 0009fc00 @  (usable)
>  BIOS-e820: 0400 @ 0009fc00 (reserved)
>  BIOS-e820: 0002 @ 000e (reserved)
>  BIOS-e820: fccf @ 0010 (usable)
>  BIOS-e820: f000 @ fcdf (ACPI data)
>  BIOS-e820: 1000 @ fcdff000 (ACPI NVS)
>  BIOS-e820: 1000 @ fec0 (reserved)
>  BIOS-e820: 1000 @ fee0 (reserved)
>  BIOS-e820: 0008 @ fff8 (reserved)
>  BIOS-e820: 8000 @ 0001 (usable)
> 5248MB HIGHMEM available.
> Scan SMP from c000 for 1024 bytes.
> Scan SMP from c009fc00 for 1024 bytes.
> Scan SMP from c00f for 65536 bytes.
> found SMP MP-table at 000fb4d0
> hm, page 000fb000 reserved twice.
> hm, page 000fc000 reserved twice.
> hm, page 000f5000 reserved twice.
> hm, page 000f6000 reserved twice.
> On node 0 totalpages: 1572864
> zone(0): 4096 pages.
> zone(1): 225280 pages.
> zone(2): 1343488 pages.
> Intel MultiProcessor Specification v1.1
> Virtual Wire compatibility mode.
> OEM ID: AMI  Product ID: CNB20HE  APIC at: 0xFEE0
> Processor #0 Pentium(tm) Pro APIC version 17
> Floating point unit present.
> Machine Exception supported.
> 64 bit compare & exchange supported.
> Internal APIC present.
> Bootup CPU
> Processor #1 Pentium(tm) Pro APIC version 17
> Floating point unit present.
> Machine Exception supported.
> 64 bit compare & exchange supported.
> Internal APIC present.
> Processor #2 Pentium(tm) Pro APIC version 17
> Floating point unit present.
> Machine Exception supported.
> 64 bit compare & exchange supported.
> Internal APIC present.
> Processor #3 Pentium(tm) Pro APIC version 17
> Floating point unit present.
> Machine Exception supported.
> 64 bit compare & exchange supported.
> Internal APIC present.
> Bus #0 is PCI
> Bus #1 is PCI
> Bus #2 is PCI
> Bus #3 is ISA
> I/O APIC #4 Version 17 at 0xFEC0.
> I/O APIC #5 Version 17 at 0xFEC01000.
> Int: type 0, pol 3, trig 3, bus 0, IRQ 04, APIC ID 5, APIC INT 0a
> Int: type 0, pol 3, trig 3, bus 0, IRQ 08, APIC ID 5, APIC INT 0b
> Int: type 0, pol 3, trig 3, bus 0, IRQ 0c, APIC ID 5, APIC INT 0f
> Int: type 0, pol 3, trig 3, bus 0, IRQ 3c, APIC ID 4, APIC INT 0a
> Int: type 0, pol 3, trig 3, bus 1, IRQ 15, APIC ID 5, APIC INT 01
> Int: type 0, pol 3, trig 3, bus 1, IRQ 14, APIC ID 5, APIC INT 00
> Int: type 3, pol 1, trig 1, bus 3, IRQ 00, APIC ID 4, APIC INT 00
> Int: type 0, pol 1, trig 1, bus 3, IRQ 01, APIC ID 4, APIC INT 01
> Int: type 0, pol 1, trig 1, bus 3, IRQ 00, APIC ID 4, APIC INT 02
> Int: type 0, pol 1, trig 1, bus 3, IRQ 03, APIC ID 4, APIC INT 03
> Int: type 0, pol 1, trig 1, bus 3, IRQ 04, APIC ID 4, APIC INT 04
> Int: type 0, pol 1, trig 1, bus 3, IRQ 06, APIC ID 4, APIC INT 06
> Int: type 0, pol 1, trig 1, bus 3, IRQ 07, APIC ID 4, APIC INT 07
> Int: type 0, pol 1, trig 1, bus 3, IRQ 08, APIC ID 4, APIC INT 08
> Int: type 0, pol 1, trig 1, bus 3, IRQ 0c, APIC ID 4, APIC INT 0c
> Int: type 0, pol 1, trig 1, bus 3, IRQ 0d, APIC ID 4, APIC INT 0d
> Int: type 0, pol 1, trig 1, bus 3, IRQ 0e, APIC ID 4, APIC INT 0e
> Int: type 0, pol 1, trig 1, bus 3, IRQ 0f, APIC ID 4, APIC INT 0f
> Lint: type 3, pol 1, trig 1, bus 3, IRQ 00, APIC ID ff, APIC LINT 00
> Lint: type 1, pol 1, trig 1, bus 0, IRQ 00, APIC ID ff, APIC LINT 01
> Processors: 4
> mapped APIC to e000 (fee0)
> mapped IOAPI

Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread Richard Guenther


Hi!

I reported this BUG a few days ago but got no response - it happens
on UP with only 32M ram, too (see below). Also note the second
BUG at vmscan.c:538, which I believe I never saw reported again.

Richard.

On Wed, 11 Oct 2000, Tigran Aivazian wrote:

> On Wed, 11 Oct 2000, Rik van Riel wrote:
> > Could you send me the backtrace of one of the cases where
> > you hit the bug ?
> 
> here you are:
> 
> Oct 11 16:05:26 hilbert36 kernel: kernel BUG at page_alloc.c:221!
[snipped]

--
Richard Guenther <[EMAIL PROTECTED]>
WWW: http://www.anatom.uni-tuebingen.de/~richi/
The GLAME Project: http://www.glame.de/

Oct  7 11:50:47 localhost kernel: kernel BUG at page_alloc.c:91!
Oct  7 11:50:47 localhost kernel: invalid operand: 
Oct  7 11:50:47 localhost kernel: CPU:0
Oct  7 11:50:47 localhost kernel: EIP:0010:[__free_pages_ok+73/892]
Oct  7 11:50:47 localhost kernel: EFLAGS: 00010286
Oct  7 11:50:47 localhost kernel: eax: 001f   ebx: c1002a90   ecx: c10a4000   edx: 

Oct  7 11:50:47 localhost kernel: esi: c1002aac   edi:    ebp: 002c   esp: 
c10a5f64
Oct  7 11:50:47 localhost kernel: ds: 0018   es: 0018   ss: 0018
Oct  7 11:50:47 localhost kernel: Process kswapd (pid: 2, stackpage=c10a5000)
Oct  7 11:50:47 localhost kernel: Stack: c01d4877 c01d4a65 005b c1002a90 c1002aac 
00ce 002c 00ce 
Oct  7 11:50:47 localhost kernel:002b  0003 c0126042 c01278cb 
c0126229  0004 
Oct  7 11:50:47 localhost kernel:   0004  
 c0126870 0004 
Oct  7 11:50:47 localhost kernel: Call Trace: [tvecs+8671/55752] [tvecs+9165/55752] 
[page_launder+674/1888] [__free_pages+19/20] [page_launder+1161/1888] 
[do_try_to_free_pages+52/128] [tvecs+7999/55752] 
Oct  7 11:50:47 localhost kernel:[kswapd+115/288] [kernel_thread+40/56] 
Oct  7 11:50:47 localhost kernel: Code: 0f 0b 83 c4 0c 89 f6 89 da 2b 15 f8 89 26 c0 
89 d0 c1 e0 04 

Oct  7 11:50:51 localhost kernel: kernel BUG at vmscan.c:538!
Oct  7 11:50:51 localhost kernel: invalid operand: 
Oct  7 11:50:51 localhost kernel: CPU:0
Oct  7 11:50:51 localhost kernel: EIP:0010:[reclaim_page+897/980]
Oct  7 11:50:51 localhost kernel: EFLAGS: 00010282
Oct  7 11:50:51 localhost kernel: eax: 001c   ebx: c1002aac   ecx: c1636000   edx: 
0010
Oct  7 11:50:51 localhost kernel: esi: c1002a90   edi:    ebp: 0040   esp: 
c1637e3c
Oct  7 11:50:51 localhost kernel: ds: 0018   es: 0018   ss: 0018
Oct  7 11:50:51 localhost kernel: Process cc1 (pid: 2614, stackpage=c1637000)
Oct  7 11:50:51 localhost kernel: Stack: c01d4277 c01d4456 021a c020bb20 c020bdb4 
  c0127548 
Oct  7 11:50:51 localhost kernel:c020bb20  c020bdb8 0001  
c0127702 c020bdac  
Oct  7 11:50:51 localhost kernel: 0001 1000 c03a7d60 0001 
c04fe080 0007a746 0005 
Oct  7 11:50:51 localhost kernel: Call Trace: [tvecs+7135/55752] [tvecs+7614/55752] 
[__alloc_pages_limit+124/172] [__alloc_pages+394/756] [do_anonymous_page+57/160] 
[do_no_page+48/192] [handle_mm_fault+232/340] 
Oct  7 11:50:51 localhost kernel:[do_page_fault+299/976] 
[merge_segments+324/364] [do_brk+267/316] [sys_brk+180/216] [error_code+44/64] 
Oct  7 11:50:51 localhost kernel: Code: 0f 0b 83 c4 0c 31 c0 0f b3 46 18 8d 4e 28 8d 
46 2c 39 46 2c 

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/

Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread Tigran Aivazian


On Thu, 12 Oct 2000, Matti Aarnio wrote:
> > CPU0: Intel Pentium III (Cascades) stepping 01
> > CPU1: Intel Pentium III (Cascades) stepping 01
> > CPU2: Intel Pentium III (Cascades) stepping 01
> > CPU3: Intel Pentium III (Cascades) stepping 01
> > Total of 4 processors activated (5606.60 BogoMIPS).
> 
>   Hmm.. More marketing names, what is "Cascades" in the scale
>   of "cheap bastards" versus "all bells and whistles" ?
>   (Celeron vs. XEON, that is.)

It is a Xeon 700MHz with 1M cache, at least we paid for it as such! :)

here is a sample from /proc/cpuinfo

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 10
model name  : Pentium III (Cascades)
stepping: 1
cpu MHz : 701.000611
cache size  : 1024 KB
fdiv_bug: no
hlt_bug : no
sep_bug : no
f00f_bug: no
coma_bug: no
fpu : yes
fpu_exception   : yes
cpuid level : 2
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca
cmov pat pse36 mmx fxsr xmm
bogomips: 1399.19

the other 3 look the same.

Tigran


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread Matti Aarnio


On Thu, Oct 12, 2000 at 09:21:00AM +0100, Tigran Aivazian wrote:
> If MTRR support is disabled then both eepro100 interfaces work fine but
> the system is still 40x slower. This is the entire bootlog of
> 2.4.0-test10-pre1 + lspci-vvx + /proc/interrupts + /proc/iomem + ifconfig
> output
>
> Two currently active ideas (from Mark, Linus and Zoltan):
> 
> b) this is an L2 cache-tag issue and there are just not enough bits in the
> tag to cover such high addresses so nothing will help, save removing the
> extra 2G or so out of the machine (or using them as MTD devices : I
> hope this is _not_ the case...

Reminds me of the difference between Celeron and XEON
variants of Pentium II -- Celerons can cache only the low
4 GB of address space, XEONs can cache the whole 36 bits.

(Probably other differences exist also, but that is the primary
 one concerning memory cacheability --> apparent speed.)

> CPU0: Intel Pentium III (Cascades) stepping 01
> CPU1: Intel Pentium III (Cascades) stepping 01
> CPU2: Intel Pentium III (Cascades) stepping 01
> CPU3: Intel Pentium III (Cascades) stepping 01
> Total of 4 processors activated (5606.60 BogoMIPS).

Hmm.. More marketing names, what is "Cascades" in the scale
of "cheap bastards" versus "all bells and whistles" ?
(Celeron vs. XEON, that is.)

/Matti Aarnio

Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-12 Thread Tigran Aivazian


On Wed, 11 Oct 2000, Linus Torvalds wrote:
> What happens if MTRR support is entirely disabled?

If MTRR support is disabled then both eepro100 interfaces work fine but
the system is still 40x slower. This is the entire bootlog of
2.4.0-test10-pre1 + lspci-vvx + /proc/interrupts + /proc/iomem + ifconfig
output

Two currently active ideas (from Mark, Linus and Zoltan):

a) one needs to use big-mtrr patch from Zoltan, look at e820 map and
manually set up mtrrs to cover all 6G.

b) this is an L2 cache-tag issue and there are just not enough bits in the
tag to cover such high addresses so nothing will help, save removing the
extra 2G or so out of the machine (or using them as MTD devices : I
hope this is _not_ the case...

another idea (in parallel) is that eepro100 stops working because its PCI
memory space is marked as cacheable.

All should become clear soon -- I will spend the whole day on this, slowly
trying to understand what's going on.

Regards,
Tigran

Linux version 2.4.0-test9 (root@hilbert) (gcc version egcs-2.91.66 19990314/Linux 
(egcs-1.1.2 release)) #15 SMP Wed Oct 11 19:23:15 BST 2000
BIOS-provided physical RAM map:
 BIOS-e820: 0009fc00 @  (usable)
 BIOS-e820: 0400 @ 0009fc00 (reserved)
 BIOS-e820: 0002 @ 000e (reserved)
 BIOS-e820: fccf @ 0010 (usable)
 BIOS-e820: f000 @ fcdf (ACPI data)
 BIOS-e820: 1000 @ fcdff000 (ACPI NVS)
 BIOS-e820: 1000 @ fec0 (reserved)
 BIOS-e820: 1000 @ fee0 (reserved)
 BIOS-e820: 0008 @ fff8 (reserved)
 BIOS-e820: 8000 @ 0001 (usable)
5248MB HIGHMEM available.
Scan SMP from c000 for 1024 bytes.
Scan SMP from c009fc00 for 1024 bytes.
Scan SMP from c00f for 65536 bytes.
found SMP MP-table at 000fb4d0
hm, page 000fb000 reserved twice.
hm, page 000fc000 reserved twice.
hm, page 000f5000 reserved twice.
hm, page 000f6000 reserved twice.
On node 0 totalpages: 1572864
zone(0): 4096 pages.
zone(1): 225280 pages.
zone(2): 1343488 pages.
Intel MultiProcessor Specification v1.1
Virtual Wire compatibility mode.
OEM ID: AMI  Product ID: CNB20HE  APIC at: 0xFEE0
Processor #0 Pentium(tm) Pro APIC version 17
Floating point unit present.
Machine Exception supported.
64 bit compare & exchange supported.
Internal APIC present.
Bootup CPU
Processor #1 Pentium(tm) Pro APIC version 17
Floating point unit present.
Machine Exception supported.
64 bit compare & exchange supported.
Internal APIC present.
Processor #2 Pentium(tm) Pro APIC version 17
Floating point unit present.
Machine Exception supported.
64 bit compare & exchange supported.
Internal APIC present.
Processor #3 Pentium(tm) Pro APIC version 17
Floating point unit present.
Machine Exception supported.
64 bit compare & exchange supported.
Internal APIC present.
Bus #0 is PCI   
Bus #1 is PCI   
Bus #2 is PCI   
Bus #3 is ISA   
I/O APIC #4 Version 17 at 0xFEC0.
I/O APIC #5 Version 17 at 0xFEC01000.
Int: type 0, pol 3, trig 3, bus 0, IRQ 04, APIC ID 5, APIC INT 0a
Int: type 0, pol 3, trig 3, bus 0, IRQ 08, APIC ID 5, APIC INT 0b
Int: type 0, pol 3, trig 3, bus 0, IRQ 0c, APIC ID 5, APIC INT 0f
Int: type 0, pol 3, trig 3, bus 0, IRQ 3c, APIC ID 4, APIC INT 0a
Int: type 0, pol 3, trig 3, bus 1, IRQ 15, APIC ID 5, APIC INT 01
Int: type 0, pol 3, trig 3, bus 1, IRQ 14, APIC ID 5, APIC INT 00
Int: type 3, pol 1, trig 1, bus 3, IRQ 00, APIC ID 4, APIC INT 00
Int: type 0, pol 1, trig 1, bus 3, IRQ 01, APIC ID 4, APIC INT 01
Int: type 0, pol 1, trig 1, bus 3, IRQ 00, APIC ID 4, APIC INT 02
Int: type 0, pol 1, trig 1, bus 3, IRQ 03, APIC ID 4, APIC INT 03
Int: type 0, pol 1, trig 1, bus 3, IRQ 04, APIC ID 4, APIC INT 04
Int: type 0, pol 1, trig 1, bus 3, IRQ 06, APIC ID 4, APIC INT 06
Int: type 0, pol 1, trig 1, bus 3, IRQ 07, APIC ID 4, APIC INT 07
Int: type 0, pol 1, trig 1, bus 3, IRQ 08, APIC ID 4, APIC INT 08
Int: type 0, pol 1, trig 1, bus 3, IRQ 0c, APIC ID 4, APIC INT 0c
Int: type 0, pol 1, trig 1, bus 3, IRQ 0d, APIC ID 4, APIC INT 0d
Int: type 0, pol 1, trig 1, bus 3, IRQ 0e, APIC ID 4, APIC INT 0e
Int: type 0, pol 1, trig 1, bus 3, IRQ 0f, APIC ID 4, APIC INT 0f
Lint: type 3, pol 1, trig 1, bus 3, IRQ 00, APIC ID ff, APIC LINT 00
Lint: type 1, pol 1, trig 1, bus 0, IRQ 00, APIC ID ff, APIC LINT 01
Processors: 4
mapped APIC to e000 (fee0)
mapped IOAPIC to d000 (fec0)
mapped IOAPIC to c000 (fec01000)
Kernel command line: auto BOOT_IMAGE=240-test10 ro root=805 
BOOT_FILE=/boot/vmlinuz-2.4.0-test10 console=ttyS0,9600 console=tty0
Initializing CPU#0
Detected 701.611 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 1399.19 BogoMIPS
Memory: 6132848k/6291456k available (1531k kernel code, 106956k reserved, 88k data, 
188k init, 5322688k highmem)
Dentry-cache hash table ent

Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Richard Gooch


David Wragg writes:
> Tigran Aivazian <[EMAIL PROTECTED]> writes:
> > b) it detects all memory correctly but creates a write-back mtrr only for
> > the first 2G, is this normal?
> 
> mtrr.c is broken for machines with >=4GB of memory (or less than 4GB,
> if the chipset reserves an address range below 4GB for PCI).
> 
> The patch against 2.4.0-test9 to fix this is below.
> 
> Richard: Is there a reason you haven't passed this on to Linus, or do
> you want me to do it?

Partly because I haven't had time to look at it, partly because I'm
not sure if it's needed (why, exactly?), and partly because I've
recently moved house and (STILL!) don't have IP access at home (not
even dialup) so I can't really look at stuff yet :-( :-( :-(

BTW: I'm away at conferences for the next two weeks, so don't expect
fast responses.

Regards,

Richard
Permanent: [EMAIL PROTECTED]
Current:   [EMAIL PROTECTED]

Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Linus Torvalds




On Wed, 11 Oct 2000, Rik van Riel wrote:

> On Wed, 11 Oct 2000, Tigran Aivazian wrote:
> > > On Wed, 11 Oct 2000, Rik van Riel wrote:
> > > > Could you send me the backtrace of one of the cases where
> > > > you hit the bug ?
> > 
> > just to add -- I was following Alan Cox's suggestion of
> > incrementing "mem=N" and finding the value where the system
> > stops working normally. It was ok as high as "mem=3096M" but
> > then I realized that I was also using Zoltan's big-mtrr patch at
> > the same time so I will retest the whole thing without it...
> > tomorrow.
> > 
> > Just to clarify - the problem _does_ show up without Zoltan's
> > patch but my "mem=N" tests were done with it so those findings
> > are not really proving much. I need to redo them with vanilla
> > kernel.
> 
> Interesting, so up to 3GB works just fine with the new
> VM and above that you can trigger all kinds of funny
> errors ?

I bet that the performance thing at least is due to MTRR issues.

Basically, if Tigran ends up using memory that is non-cached, a 30-40
times performance degradation is not just explainable, it's expected.

Also, the eepro100 will not work correctly if its PCI space is set to be
cacheable.

What happens if MTRR support is entirely disabled? Make it print out what
the BIOS set up, nothing more. 

Linus


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Ben LaHaise

On Wed, 11 Oct 2000, Tigran Aivazian wrote:

> it works fine then. Kernel compiles in 68 seconds as it should. Shall I
> keep incrementing mem= to see what happens next...

I suspect fixing the mtrrs on the machine will fix this problem, as a
38-40 times slowdown on a machine that isn't swapping is most likely a
lack of memory caching (as Rik pointed out 38-40 times is right on the
nose for the difference in speed between the cache and main memory).

-ben


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Rik van Riel


On Wed, 11 Oct 2000, Tigran Aivazian wrote:
> > On Wed, 11 Oct 2000, Rik van Riel wrote:
> > > Could you send me the backtrace of one of the cases where
> > > you hit the bug ?
> 
> just to add -- I was following Alan Cox's suggestion of
> incrementing "mem=N" and finding the value where the system
> stops working normally. It was ok as high as "mem=3096M" but
> then I realized that I was also using Zoltan's big-mtrr patch at
> the same time so I will retest the whole thing without it...
> tomorrow.
> 
> Just to clarify - the problem _does_ show up without Zoltan's
> patch but my "mem=N" tests were done with it so those findings
> are not really proving much. I need to redo them with vanilla
> kernel.

Interesting, so up to 3GB works just fine with the new
VM and above that you can trigger all kinds of funny
errors ?

Could it be that the kernel fills most of low memory
with kernel data structures to manage high memory, so
that it doesn't have enough low memory left to do the
bookkeeping for the eepro card, etc???

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread David Wragg


Tigran Aivazian <[EMAIL PROTECTED]> writes:
> b) it detects all memory correctly but creates a write-back mtrr only for
> the first 2G, is this normal?

mtrr.c is broken for machines with >=4GB of memory (or less than 4GB,
if the chipset reserves an address range below 4GB for PCI).

The patch against 2.4.0-test9 to fix this is below.

Richard: Is there a reason you haven't passed this on to Linus, or do
you want me to do it?


Dave



diff -rua linux-2.4.0test9/arch/i386/kernel/mtrr.c linux-2.4.0test9.mod/arch/i386/kernel/mtrr.c
--- linux-2.4.0test9/arch/i386/kernel/mtrr.c	Wed Oct 11 19:54:56 2000
+++ linux-2.4.0test9.mod/arch/i386/kernel/mtrr.c	Wed Oct 11 20:48:26 2000
@@ -503,9 +503,9 @@
 static void intel_get_mtrr (unsigned int reg, unsigned long *base,
unsigned long *size, mtrr_type *type)
 {
-unsigned long dummy, mask_lo, base_lo;
+unsigned long mask_lo, mask_hi, base_lo, base_hi;
 
-rdmsr (MTRRphysMask_MSR(reg), mask_lo, dummy);
+rdmsr (MTRRphysMask_MSR(reg), mask_lo, mask_hi);
 if ( (mask_lo & 0x800) == 0 )
 {
/*  Invalid (i.e. free) range  */
@@ -515,20 +515,17 @@
return;
 }
 
-rdmsr(MTRRphysBase_MSR(reg), base_lo, dummy);
+rdmsr(MTRRphysBase_MSR(reg), base_lo, base_hi);
 
-/* We ignore the extra address bits (32-35). If someone wants to
-   run x86 Linux on a machine with >4GB memory, this will be the
-   least of their problems. */
+/* Work out the shifted address mask. */
+mask_lo = 0xff00 | mask_hi << (32 - PAGE_SHIFT)
+   | mask_lo >> PAGE_SHIFT;
 
-/* Clean up mask_lo so it gives the real address mask. */
-mask_lo = (mask_lo & 0xf000UL);
 /* This works correctly if size is a power of two, i.e. a
contiguous range. */
-*size = ~(mask_lo - 1);
-
-*base = (base_lo & 0xf000UL);
-*type = (base_lo & 0xff);
+*size = -mask_lo;
+*base = base_hi << (32 - PAGE_SHIFT) | base_lo >> PAGE_SHIFT;
+*type = base_lo & 0xff;
 }   /*  End Function intel_get_mtrr  */
 
 static void cyrix_get_arr (unsigned int reg, unsigned long *base,
@@ -553,13 +550,13 @@
 /* Enable interrupts if it was enabled previously */
 __restore_flags (flags);
 shift = ((unsigned char *) base)[1] & 0x0f;
-*base &= 0xf000UL;
+*base >>= PAGE_SHIFT;
 
 /* Power of two, at least 4K on ARR0-ARR6, 256K on ARR7
  * Note: shift==0xf means 4G, this is unsupported.
  */
 if (shift)
-  *size = (reg < 7 ? 0x800UL : 0x2UL) << shift;
+  *size = (reg < 7 ? 0x1UL : 0x40UL) << (shift - 1);
 else
   *size = 0;
 
@@ -596,7 +593,7 @@
 /*  Upper dword is region 1, lower is region 0  */
 if (reg == 1) low = high;
 /*  The base masks off on the right alignment  */
-*base = low & 0xFFFE;
+*base = (low & 0xFFFE) >> PAGE_SHIFT;
 *type = 0;
 if (low & 1) *type = MTRR_TYPE_UNCACHABLE;
 if (low & 2) *type = MTRR_TYPE_WRCOMB;
@@ -621,7 +618,7 @@
  * *128K   ...
  */
 low = (~low) & 0x1FFFC;
-*size = (low + 4) << 15;
+*size = (low + 4) << (15 - PAGE_SHIFT);
 return;
 }   /*  End Function amd_get_mtrr  */
 
@@ -634,8 +631,8 @@
 static void centaur_get_mcr (unsigned int reg, unsigned long *base,
 unsigned long *size, mtrr_type *type)
 {
-*base = centaur_mcr[reg].high & 0xf000;
-*size = (~(centaur_mcr[reg].low & 0xf000))+1;
+*base = centaur_mcr[reg].high >> PAGE_SHIFT;
+*size = -(centaur_mcr[reg].low & 0xf000) >> PAGE_SHIFT;
 *type = MTRR_TYPE_WRCOMB;  /*  If it is there, it is write-combining  */
 }   /*  End Function centaur_get_mcr  */
 
@@ -665,8 +662,10 @@
 }
 else
 {
-   wrmsr (MTRRphysBase_MSR (reg), base | type, 0);
-   wrmsr (MTRRphysMask_MSR (reg), ~(size - 1) | 0x800, 0);
+   wrmsr (MTRRphysBase_MSR (reg), base << PAGE_SHIFT | type,
+  (base & 0xf0) >> (32 - PAGE_SHIFT));
+   wrmsr (MTRRphysMask_MSR (reg), -size << PAGE_SHIFT | 0x800,
+  (-size & 0xf0) >> (32 - PAGE_SHIFT));
 }
 if (do_safe) set_mtrr_done (&ctxt);
 }   /*  End Function intel_set_mtrr_up  */
@@ -680,7 +679,9 @@
 arr = CX86_ARR_BASE + (reg << 1) + reg; /* avoid multiplication by 3 */
 
 /* count down from 32M (ARR0-ARR6) or from 2G (ARR7) */
-size >>= (reg < 7 ? 12 : 18);
+if (reg >= 7)
+   size >>= 6;
+
 size &= 0x7fff; /* make sure arr_size <= 14 */
 for(arr_size = 0; size; arr_size++, size >>= 1);
 
@@ -705,6 +706,7 @@
 }
 
 if (do_safe) set_mtrr_prepare (&ctxt);
+base <<= PAGE_SHIFT;
 setCx86(arr,((unsigned char *) &base)[3]);
 setCx86(arr+1,  ((unsigned char *) &base)[2]);
 setCx86(arr+2, (((unsigned char *) &base)[1]) | arr_size);
@@ -724,34 +726,36 @@
 [RETURNS] Nothing.
 */
 {
-u32 low, high;
+u32 regs[2];
 struct set_mtrr_context ctxt;
 
 if (do_safe) set_mtrr_prepare (&ctxt);
 /*

Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Rik van Riel


On Wed, 11 Oct 2000, Tigran Aivazian wrote:
> On Wed, 11 Oct 2000, Rik van Riel wrote:
> > Could you send me the backtrace of one of the cases where
> > you hit the bug ?
> 
> here you are:

> Oct 11 16:05:26 hilbert36 kernel: kernel BUG at page_alloc.c:221!
> Oct 11 16:05:27 hilbert36 kernel: Call Trace: [tvecs+9181/112440] 
>[tvecs+9707/112440] [__alloc_pages+225/740] [filemap_nopage+240/1120] 
>[do_no_page+93/440] [] []
> Oct 11 16:05:27 hilbert36 kernel:[] [] [] 
>[handle_mm_fault+944/1388] [do_page_fault+0/1008] [unmap_fixup+99/316] 
>[do_page_fault+323/1008] [do_page_fault+0/1008]
> Oct 11 16:05:27 hilbert36 kernel:[timer_bh+56/700] [bh_action+78/176] 
>[tasklet_hi_action+81/124] [do_softirq+90/136] [do_IRQ+218/236] [error_code+52/60]

Ughhh, of course ... this comes into __alloc_pages(), which
finds a page with some flags set on the free list ... this
backtrace - of course - isn't helpful in debugging the thing, 
sorry for wasting your time...

[off to find other ways of finding this problem ... note that
__free_pages_ok() does the SAME BUG() check before putting the
page on the free list]

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Tigran Aivazian

> On Wed, 11 Oct 2000, Rik van Riel wrote:
> > Could you send me the backtrace of one of the cases where
> > you hit the bug ?

just to add -- I was following Alan Cox's suggestion of incrementing
"mem=N" and finding the value where the system stops working normally. It
was ok as high as "mem=3096M" but then I realized that I was also using
Zoltan's big-mtrr patch at the same time so I will retest the whole thing
without it... tomorrow.

Just to clarify - the problem _does_ show up without Zoltan's patch but my
"mem=N" tests were done with it so those findings are not really proving
much. I need to redo them with vanilla kernel.

Regards,
Tigran


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Tigran Aivazian


On Wed, 11 Oct 2000, Rik van Riel wrote:
> Could you send me the backtrace of one of the cases where
> you hit the bug ?

here you are:

Oct 11 16:05:26 hilbert36 kernel: kernel BUG at page_alloc.c:221!
Oct 11 16:05:26 hilbert36 kernel: invalid operand: 
Oct 11 16:05:26 hilbert36 kernel: CPU:2
Oct 11 16:05:26 hilbert36 kernel: EIP:0010:[rmqueue+590/636]
Oct 11 16:05:26 hilbert36 kernel: EFLAGS: 00010292
Oct 11 16:05:26 hilbert36 kernel: eax: 0020   ebx: c75e0a4c   ecx: c027ff68   edx: 
0026
Oct 11 16:05:26 hilbert36 kernel: esi: 0001   edi: c02811d0   ebp:    esp: 
f7205d84
Oct 11 16:05:26 hilbert36 kernel: ds: 0018   es: 0018   ss: 0018
Oct 11 16:05:26 hilbert36 kernel: Process head (pid: 582, stackpage=f7205000)
Oct 11 16:05:26 hilbert36 kernel: Stack: c023a365 c023a573 00dd c02811d0 0112 
c0281430  c02811f8
Oct 11 16:05:26 hilbert36 kernel:0014789c 0014789f 0286  c02811d0 
c0133309 c532a204 0112
Oct 11 16:05:26 hilbert36 kernel:0001 f7562500 c75da270 0015 0001 
c028142c c012a944 f7554460
Oct 11 16:05:27 hilbert36 kernel: Call Trace: [tvecs+9181/112440] [tvecs+9707/112440] 
[__alloc_pages+225/740] [filemap_nopage+240/1120] [do_no_page+93/440] [] 
[]
Oct 11 16:05:27 hilbert36 kernel:[] [] [] 
[handle_mm_fault+944/1388] [do_page_fault+0/1008] [unmap_fixup+99/316] 
[do_page_fault+323/1008] [do_page_fault+0/1008]
Oct 11 16:05:27 hilbert36 kernel:[timer_bh+56/700] [bh_action+78/176] 
[tasklet_hi_action+81/124] [do_softirq+90/136] [do_IRQ+218/236] [error_code+52/60]
Oct 11 16:05:27 hilbert36 kernel: Code: 0f 0b 83 c4 0c 90 89 d8 eb 1c 45 83 c6 0c 83 
fd 09 0f 86 c6



Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Rik van Riel


On Wed, 11 Oct 2000, Tigran Aivazian wrote:
> On Wed, 11 Oct 2000, Rik van Riel wrote:
> > On Wed, 11 Oct 2000, Tigran Aivazian wrote:
> > > On Wed, 11 Oct 2000, Mark Hemment wrote:
> > > > On Wed, 11 Oct 2000, Tigran Aivazian wrote:
> > > >  
> > > > > a) one of the eepro100 interfaces (the onboard one on the S2QR6 mb) is
> > > > > malfunctioning, interrupts are generated but no traffic gets through (YES,
> > > > > I did plug it in correctly, this time, and I repeat 2.2.16 works!)
> > > > 
> > > >   I saw this the other week on our two-way Dell under a reasonably heavy
> > > > load - but with 3c59x.c driver, the eepro100s survived!
> > > >   Either NIC (had two Tornados) could go this way after anything from 1
> > > > to 36 hours of load.  They would end up running in "poll" mode off the
> > > > transmit watchdog timer.
> > > >   Swapped them for a dual-port eepro100 and no more problems.
> > > 
> > > I disabled eepro100 support completely and the problem is still
> > > there. What I also noticed is that with highmem-PAE enabled I
> > > get BUG in page_alloc.c at line 221 so it is probably a VM
> > > problem recently introduced (hence cc'd Rik).
> > 
> > Can you trigger this bug /without/ PAE ?
> 
> no, I can't.

I wonder if PAE somehow messes with the locking semantics
of the page table things, because the test on line 221 of
page_alloc.c depends on the fact that locking works.

[in fact, that test is there exactly to verify that nothing
went wrong with the locking and we're not re-using a page
that's already in use]

Could you send me the backtrace of one of the cases where
you hit the bug ?

> > > I will continue to narrow down by removing some things (like
> > > mtrr) from the equation. Rik, the problem is that when one
> > > enables PAE (or just highmem-4G) support on a 4-way 6G RAM
> > > machine, it becomes 38-40 times slower.
> > 
> > 38-40 times slower in what kind of benchmark ?
> 
> compiling the kernel, specifically. But even a simple thing like
> "time ps" shows about 0.9 seconds real time when it should show
> something like 0.021 seconds. Everything becomes unbearably
> slow, make xconfig takes 4-5 minutes to startup, the shutdown
> becomes impossible so I do sysrq-B after a few syncs etc. Also,
> as I said, one of the eepro100 interfaces becomes dead. I
> believe this _is_ the same problem even if it really seems it is
> not.

OUCH ... this shouldn't happen and to be honest I don't
have an explanation for how this /could/ happen (or even
what it would have to do with the new VM)...

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/



Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Tigran Aivazian


On Wed, 11 Oct 2000, Rik van Riel wrote:

> On Wed, 11 Oct 2000, Tigran Aivazian wrote:
> > On Wed, 11 Oct 2000, Mark Hemment wrote:
> > > On Wed, 11 Oct 2000, Tigran Aivazian wrote:
> > >  
> > > > a) one of the eepro100 interfaces (the onboard one on the S2QR6 mb) is
> > > > malfunctioning, interrupts are generated but no traffic gets through (YES,
> > > > I did plug it in correctly, this time, and I repeat 2.2.16 works!)
> > > 
> > >   I saw this the other week on our two-way Dell under a reasonably heavy
> > > load - but with 3c59x.c driver, the eepro100s survived!
> > >   Either NIC (had two Tornados) could go this way after anything from 1
> > > to 36 hours of load.  They would end up running in "poll" mode off the
> > > transmit watchdog timer.
> > >   Swapped them for a dual-port eepro100 and no more problems.
> > 
> > I disabled eepro100 support completely and the problem is still
> > there. What I also noticed is that with highmem-PAE enabled I
> > get BUG in page_alloc.c at line 221 so it is probably a VM
> > problem recently introduced (hence cc'd Rik).
> 
> Can you trigger this bug /without/ PAE ?
> 

no, I can't.

> I've been stress-testing my dual-cpu test machines (one with
> 64MB and one with 1GB ram) very very heavily for the last 4
> days and haven't encountered any bug whatsoever ...
> 
> Btw, what compiler are you using ?

kgcc on red hat 6.9 which is really egcs-2.91.66

> 
> > I will continue to narrow down by removing some things (like
> > mtrr) from the equation. Rik, the problem is that when one
> > enables PAE (or just highmem-4G) support on a 4-way 6G RAM
> > machine, it becomes 38-40 times slower.
> 
> 38-40 times slower in what kind of benchmark ?

compiling the kernel, specifically. But even a simple thing like "time
ps" shows about 0.9 seconds real time when it should show something like
0.021 seconds. Everything becomes unbearably slow, make xconfig takes 4-5
minutes to startup, the shutdown becomes impossible so I do sysrq-B after
a few syncs etc. Also, as I said, one of the eepro100 interfaces becomes
dead. I believe this _is_ the same problem even if it really seems it is
not.

Regards,
Tigran


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Rik van Riel


On Wed, 11 Oct 2000, Tigran Aivazian wrote:
> On Wed, 11 Oct 2000, Mark Hemment wrote:
> > On Wed, 11 Oct 2000, Tigran Aivazian wrote:
> >  
> > > a) one of the eepro100 interfaces (the onboard one on the S2QR6 mb) is
> > > malfunctioning, interrupts are generated but no traffic gets through (YES,
> > > I did plug it in correctly, this time, and I repeat 2.2.16 works!)
> > 
> >   I saw this the other week on our two-way Dell under a reasonably heavy
> > load - but with 3c59x.c driver, the eepro100s survived!
> >   Either NIC (had two Tornados) could go this way after anything from 1
> > to 36 hours of load.  They would end up running in "poll" mode off the
> > transmit watchdog timer.
> >   Swapped them for a dual-port eepro100 and no more problems.
> 
> I disabled eepro100 support completely and the problem is still
> there. What I also noticed is that with highmem-PAE enabled I
> get BUG in page_alloc.c at line 221 so it is probably a VM
> problem recently introduced (hence cc'd Rik).

Can you trigger this bug /without/ PAE ?

I've been stress-testing my dual-cpu test machines (one with
64MB and one with 1GB ram) very very heavily for the last 4
days and haven't encountered any bug whatsoever ...

Btw, what compiler are you using ?

> I will continue to narrow down by removing some things (like
> mtrr) from the equation. Rik, the problem is that when one
> enables PAE (or just highmem-4G) support on a 4-way 6G RAM
> machine, it becomes 38-40 times slower.

38-40 times slower in what kind of benchmark ?

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
   -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/   http://www.surriel.com/


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Alan Cox


> On Wed, 11 Oct 2000, Alan Cox wrote:
> 
> > > I will continue to narrow down by removing some things (like mtrr) from
> > > the equation. Rik, the problem is that when one enables PAE (or just
> > > highmem-4G) support on a 4-way 6G RAM machine, it becomes 38-40 times slower.
> > 
> > What happens if you boot a PAE kernel with mem=512M on that box ?
> > 
> 
> it works fine then. Kernel compiles in 68 seconds as it should. Shall I
> keep incrementing mem= to see what happens next...

The threshold is probably >1Gig, but see where it actually is, that's the point
where the other zones and memory juggling kick in


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Tigran Aivazian


On Wed, 11 Oct 2000, Alan Cox wrote:

> > I will continue to narrow down by removing some things (like mtrr) from
> > the equation. Rik, the problem is that when one enables PAE (or just
> > highmem-4G) support on a 4-way 6G RAM machine, it becomes 38-40 times slower.
> 
> What happens if you boot a PAE kernel with mem=512M on that box ?
> 

it works fine then. Kernel compiles in 68 seconds as it should. Shall I
keep incrementing mem= to see what happens next...

Regards,
Tigran


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Alan Cox


> I will continue to narrow down by removing some things (like mtrr) from
> the equation. Rik, the problem is that when one enables PAE (or just
> highmem-4G) support on a 4-way 6G RAM machine becomes 38-40 times slower.

What happens if you boot a PAE kernel with mem=512M on that box ?


Re: 36 bit MTRRs, Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Tigran Aivazian

On Wed, 11 Oct 2000, Boszormenyi Zoltan wrote:

> On Wed, 11 Oct 2000, Tigran Aivazian wrote:
> > I will continue to narrow down by removing some things (like mtrr) from
> > the equation. Rik, the problem is that when one enables PAE (or just
> > highmem-4G) support on a 4-way 6G RAM machine becomes 38-40 times slower.
> 
> Will you please try this patch? This is almost the same as I
> sent to you before, it is just against 2.4.0-test10-pre1 and
> it lacks the corrections to e.g. the frame buffer drivers.

Hi Zoltan,

I have tried your patch and although it works:

# cat /proc/mtrr 
reg00: base=0x00000000 (   0MB), size=4096MB: write-back, count=1
reg01: base=0x00100000000 (4096MB), size=2048MB: write-back, count=1
reg02: base=0xfc000000 (4032MB), size=  64MB: uncachable, count=1

unfortunately, it doesn't solve the problem. The machine is still
unbearably slow (up to 40x slower!) and one of the eepro100 interfaces is
still not working.
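
As a cross-check of the /proc/mtrr listing above, the MB figures in parentheses can be turned back into byte addresses and the write-back coverage summed. This is a minimal sketch: the region tuples are copied by hand from the listing, and the helper names are made up for illustration.

```python
MB = 1 << 20  # bytes per megabyte as used by /proc/mtrr


def base_hex(base_mb: int) -> str:
    """Byte address, in hex, of a region that starts at base_mb megabytes."""
    return hex(base_mb * MB)


def covered_mb(regions, mem_type: str) -> int:
    """Total size in MB of all regions of the given type."""
    return sum(size for _base, size, kind in regions if kind == mem_type)


# (base in MB, size in MB, type), taken from the listing above
regions = [
    (0, 4096, "write-back"),
    (4096, 2048, "write-back"),
    (4032, 64, "uncachable"),
]

print(base_hex(4032))                     # 0xfc000000
print(base_hex(4096))                     # 0x100000000
print(covered_mb(regions, "write-back"))  # 6144, i.e. all 6G is write-back
```

So with the patch the write-back registers span the full 6G, which fits the observation that the MTRR setup by itself is not the cause of the slowdown.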

Another interesting idea was suggested by Mark Hemment - to switch 

memlist_add_head() -> memlist_add_tail()

in expand()/__free_pages_ok() (see mm/page_alloc.c) and it did make a
difference -- both eepro100 started to work fine but the machine remained
just as slow as before.
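
To illustrate what the head/tail switch changes, here is a toy model of a free list. It is not the kernel's memlist code, only a sketch of the ordering effect: adding freed pages at the head hands out the most recently freed page first (LIFO), adding at the tail hands out the oldest one first (FIFO). Why that made the second eepro100 work has not been established in this thread.

```python
from collections import deque


def simulate(add_to_head: bool, frees, allocs: int):
    """Free pages in the given order, then allocate `allocs` pages.

    Allocation always takes from the head of the list; only the side on
    which freed pages are inserted varies.
    """
    free_list = deque()
    for page in frees:
        if add_to_head:
            free_list.appendleft(page)   # analogous to memlist_add_head()
        else:
            free_list.append(page)       # analogous to memlist_add_tail()
    return [free_list.popleft() for _ in range(allocs)]


print(simulate(True, [1, 2, 3], 2))    # [3, 2]: newest freed pages reused first
print(simulate(False, [1, 2, 3], 2))   # [1, 2]: oldest freed pages reused first
```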

So, the problem is complex but I am told that Rik and others are aware
that at present there is no working support for highmem. Strangely, my
desktop dual PIII550 with 1G RAM works just fine with highmem... nice and
fast, no problems whatsoever and it is filled up with all kinds of
devices, from soundcard to bttv848, 3dfx, eepro100, ne2k, 8139 etc
etc. Everything just works.

Regards,
Tigran


36 bit MTRRs, Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Boszormenyi Zoltan

On Wed, 11 Oct 2000, Tigran Aivazian wrote:
> I will continue to narrow down by removing some things (like mtrr) from
> the equation. Rik, the problem is that when one enables PAE (or just
> highmem-4G) support on a 4-way 6G RAM machine becomes 38-40 times slower.

Will you please try this patch? This is almost the same as I
sent to you before, it is just against 2.4.0-test10-pre1 and
it lacks the corrections to e.g. the frame buffer drivers.

I am now running test10-pre1 with this patch and:

-
[root@localhost /root]# cd /proc
[root@localhost /proc]# cat mtrr
reg00: base=0x00000000 (   0MB), size= 128MB: write-back, count=1
reg05: base=0xe2000000 (3616MB), size=  32MB: write-combining, count=1
[root@localhost /proc]# echo "base=0x200000000 size=0x100000000 type=write-combining" >mtrr
[root@localhost /proc]# cat mtrr
reg00: base=0x00000000 (   0MB), size= 128MB: write-back, count=1
reg01: base=0x00200000000 (8192MB), size=4096MB: write-combining, count=1
reg05: base=0xe2000000 (3616MB), size=  32MB: write-combining, count=1
--

This is on a dual P-III machine with 128 MB memory.
If it causes problems on Athlons then change the line

#define AMD_OR_MASK (0xf000UL)

to

#define AMD_OR_MASK (INTEL_OR_MASK)

and recompile and tell me what happens. Also, I would like to hear
reports from non-Intel (Cyrix, etc.) and older AMD machines.
I do not have my Cyrix 6x86MX anymore but this scheme worked on that.

Regards,
Zoltan Boszormenyi

 mtrrpage-new.diff.gz

Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Tigran Aivazian

Hi Mark,

On Wed, 11 Oct 2000, Mark Hemment wrote:

> Hi Tigran,
> 
> On Wed, 11 Oct 2000, Tigran Aivazian wrote:
>  
> > a) one of the eepro100 interfaces (the onboard one on the S2QR6 mb) is
> > malfunctioning, interrupts are generated but no traffic gets through (YES,
> > I did plug it in correctly, this time, and I repeat 2.2.16 works!)
> 
>   I saw this the other week on our two-way Dell under a reasonably heavy
> load - but with 3c59x.c driver, the eepro100s survived!
>   Either NIC (had two Tornados) could go this way after anything from 1
> to 36 hours of load.  They would end up running in "poll" mode off the
> transmit watchdog timer.
>   Swapped them for a dual-port eepro100 and no more problems.

I disabled eepro100 support completely and the problem is still
there. What I also noticed is that with highmem-PAE enabled I get BUG in
page_alloc.c at line 221 so it is probably a VM problem recently
introduced (hence cc'd Rik).

I will continue to narrow down by removing some things (like mtrr) from
the equation. Rik, the problem is that when one enables PAE (or just
highmem-4G) support on a 4-way 6G RAM machine becomes 38-40 times slower.

Regards,
Tigran


Re: [more findings!] Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Tigran Aivazian


OK, confirmed -- it is _not_ PAE-related. Just using plain highmem (4G)
support causes all these problems -- the machine becomes 38-40 times
slower overall and one of the eepro100 cards stops working.

I will try Zoltan's ideas on 64bit mtrrs but any more ideas are welcome...

Thanks,
Tigran

On Wed, 11 Oct 2000, Tigran Aivazian wrote:

> Amazing, disabling highmem altogether (not just PAE) i.e. being able to
> use only low 896M of RAM got rid of _both_ the eepro100 and slowness
> problems!
> 
> The system is now very fast (kernel compile in 61 seconds!) and all
> eepro100 interfaces work fine. I will now test with plain highmem (4G) but
> no PAE... and see what happens
> 
>  On Wed, 11 Oct 2000, Tigran Aivazian wrote:
> 
> > Hi,
> > 
> > I have installed 2.4.0-test10-pre1 on a 4-way Xeon 700MHz 6G RAM machine
> > and observe various problems, not present in
> > 2.2.16-(redhat69's-number-17).
> > 
> > a) one of the eepro100 interfaces (the onboard one on the S2QR6 mb) is
> > malfunctioning, interrupts are generated but no traffic gets through (YES,
> > I did plug it in correctly, this time, and I repeat 2.2.16 works!)
> > 
> > b) it detects all memory correctly but creates a write-back mtrr only for
> > the first 2G, is this normal?
> > 
> > # cat /proc/meminfo /proc/mtrr
> > >         total:    used:    free:  shared: buffers:  cached:
> > > Mem:  1985175552 107397120 1877778432        0  6864896 60833792
> > > Swap: 1891770368        0 1891770368
> > > MemTotal:      6132952 kB
> > > MemFree:       6028072 kB
> > > MemShared:           0 kB
> > > Buffers:          6704 kB
> > > Cached:          59408 kB
> > > Active:           9884 kB
> > > Inact_dirty:     56228 kB
> > > Inact_clean:         0 kB
> > > Inact_target:       96 kB
> > > HighTotal:     5322688 kB
> > > HighFree:      5247736 kB
> > > LowTotal:       810264 kB
> > > LowFree:        780336 kB
> > > SwapTotal:     1847432 kB
> > > SwapFree:      1847432 kB
> > > reg01: base=0x00000000 (   0MB), size=2048MB: write-back, count=1
> > > reg02: base=0xfc000000 (4032MB), size=  64MB: uncachable, count=1
> > 
> > > c) /proc/meminfo shows the number of bytes incorrectly. The B() macro of
> > fs/proc/proc_misc.c looks fine but perhaps the %8lu format specifier
> > should be extended to %16lu? (we should care about correctness more than
> > about binary compatibility with apps that may parse /proc/meminfo file)
> > 
> > d) the system is incredibly slow. It took only 1 minute 20 seconds to
> > compile the kernel (make -j4 bzImage, with mem=512M because the e820
> > (or whatever is in 2.2.x, I don't care) algorithm didn't work so I
> > gave it at least "some memory" to work with) on 2.2.16 and it took about
> > > an hour to compile on 2.4.0-test10. I expected 50 seconds or so. Must
> > be something to do with caching? I enabled PAE of course. It is probably
> > something simple to fix as I expect this machine to be the fastest in the
> > world (for this price :)
> > 
> > I will slowly go through all of these problems, starting with the simplest
> > c)
> > 
> > Regards,
> > Tigran
> > 
> > -
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to [EMAIL PROTECTED]
> > Please read the FAQ at http://www.tux.org/lkml/
> > 
> 
> 


Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Mark Hemment

Hi Tigran,

On Wed, 11 Oct 2000, Tigran Aivazian wrote:

> a) one of the eepro100 interfaces (the onboard one on the S2QR6 mb) is
> malfunctioning, interrupts are generated but no traffic gets through (YES,
> I did plug it in correctly, this time, and I repeat 2.2.16 works!)

  I saw this the other week on our two-way Dell under a reasonably heavy
load - but with 3c59x.c driver, the eepro100s survived!
  Either NIC (had two Tornados) could go this way after anything from 1
to 36 hours of load.  They would end up running in "poll" mode off the
transmit watchdog timer.
  Swapped them for a dual-port eepro100 and no more problems.

Mark


[more findings!] Re: test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Tigran Aivazian


Amazing, disabling highmem altogether (not just PAE) i.e. being able to
use only low 896M of RAM got rid of _both_ the eepro100 and slowness
problems!

The system is now very fast (kernel compile in 61 seconds!) and all
eepro100 interfaces work fine. I will now test with plain highmem (4G) but
no PAE... and see what happens

 On Wed, 11 Oct 2000, Tigran Aivazian wrote:

> Hi,
> 
> I have installed 2.4.0-test10-pre1 on a 4-way Xeon 700MHz 6G RAM machine
> and observe various problems, not present in
> 2.2.16-(redhat69's-number-17).
> 
> a) one of the eepro100 interfaces (the onboard one on the S2QR6 mb) is
> malfunctioning, interrupts are generated but no traffic gets through (YES,
> I did plug it in correctly, this time, and I repeat 2.2.16 works!)
> 
> b) it detects all memory correctly but creates a write-back mtrr only for
> the first 2G, is this normal?
> 
> # cat /proc/meminfo /proc/mtrr
>         total:    used:    free:  shared: buffers:  cached:
> Mem:  1985175552 107397120 1877778432        0  6864896 60833792
> Swap: 1891770368        0 1891770368
> MemTotal:      6132952 kB
> MemFree:       6028072 kB
> MemShared:           0 kB
> Buffers:          6704 kB
> Cached:          59408 kB
> Active:           9884 kB
> Inact_dirty:     56228 kB
> Inact_clean:         0 kB
> Inact_target:       96 kB
> HighTotal:     5322688 kB
> HighFree:      5247736 kB
> LowTotal:       810264 kB
> LowFree:        780336 kB
> SwapTotal:     1847432 kB
> SwapFree:      1847432 kB
> reg01: base=0x00000000 (   0MB), size=2048MB: write-back, count=1
> reg02: base=0xfc000000 (4032MB), size=  64MB: uncachable, count=1
> 
> c) /proc/meminfo shows the number of bytes incorrectly. The B() macro of
> fs/proc/proc_misc.c looks fine but perhaps the %8lu format specifier
> should be extended to %16lu? (we should care about correctness more than
> about binary compatibility with apps that may parse /proc/meminfo file)
> 
> d) the system is incredibly slow. It took only 1 minute 20 seconds to
> compile the kernel (make -j4 bzImage, with mem=512M because the e820
> (or whatever is in 2.2.x, I don't care) algorithm didn't work so I
> gave it at least "some memory" to work with) on 2.2.16 and it took about
> an hour to compile on 2.4.0-test10. I expected 50 seconds or so. Must
> be something to do with caching? I enabled PAE of course. It is probably
> something simple to fix as I expect this machine to be the fastest in the
> world (for this price :)
> 
> I will slowly go through all of these problems, starting with the simplest
> c)
> 
> Regards,
> Tigran
> 
> 


test10-pre1 problems on 4-way SuperServer8050

2000-10-11 Thread Tigran Aivazian


Hi,

I have installed 2.4.0-test10-pre1 on a 4-way Xeon 700MHz 6G RAM machine
and observe various problems, not present in
2.2.16-(redhat69's-number-17).

a) one of the eepro100 interfaces (the onboard one on the S2QR6 mb) is
malfunctioning, interrupts are generated but no traffic gets through (YES,
I did plug it in correctly, this time, and I repeat 2.2.16 works!)

b) it detects all memory correctly but creates a write-back mtrr only for
the first 2G, is this normal?

# cat /proc/meminfo /proc/mtrr
        total:    used:    free:  shared: buffers:  cached:
Mem:  1985175552 107397120 1877778432        0  6864896 60833792
Swap: 1891770368        0 1891770368
MemTotal:      6132952 kB
MemFree:       6028072 kB
MemShared:           0 kB
Buffers:          6704 kB
Cached:          59408 kB
Active:           9884 kB
Inact_dirty:     56228 kB
Inact_clean:         0 kB
Inact_target:       96 kB
HighTotal:     5322688 kB
HighFree:      5247736 kB
LowTotal:       810264 kB
LowFree:        780336 kB
SwapTotal:     1847432 kB
SwapFree:      1847432 kB
reg01: base=0x00000000 (   0MB), size=2048MB: write-back, count=1
reg02: base=0xfc000000 (4032MB), size=  64MB: uncachable, count=1

c) /proc/meminfo shows the number of bytes incorrectly. The B() macro of
fs/proc/proc_misc.c looks fine but perhaps the %8lu format specifier
should be extended to %16lu? (we should care about correctness more than
about binary compatibility with apps that may parse /proc/meminfo file)

d) the system is incredibly slow. It took only 1 minute 20 seconds to
compile the kernel (make -j4 bzImage, with mem=512M because the e820
(or whatever is in 2.2.x, I don't care) algorithm didn't work so I
gave it at least "some memory" to work with) on 2.2.16 and it took about
an hour to compile on 2.4.0-test10. I expected 50 seconds or so. Must
be something to do with caching? I enabled PAE of course. It is probably
something simple to fix as I expect this machine to be the fastest in the
world (for this price :)

I will slowly go through all of these problems, starting with the simplest
c)
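
On item c), the wrong byte counts are consistent with the value being truncated to 32 bits somewhere before it is printed; if so, a wider format specifier alone would not help. The sketch below only reproduces the arithmetic from the kB figures in the listing above. The 32-bit truncation is an assumption, not something confirmed by reading fs/proc/proc_misc.c.

```python
MASK32 = 0xFFFFFFFF  # assumption: the byte count ends up in a 32-bit unsigned long


def bytes_wrapped_32(kb: int) -> int:
    """Convert a kB figure to bytes, truncated to 32 bits."""
    return (kb * 1024) & MASK32


mem_total_kb = 6132952   # MemTotal from the listing
swap_total_kb = 1847432  # SwapTotal from the listing

print(mem_total_kb * 1024)              # 6280142848, the true byte count
print(bytes_wrapped_32(mem_total_kb))   # 1985175552, the "Mem: total" shown
print(bytes_wrapped_32(swap_total_kb))  # 1891770368, fits in 32 bits, so correct
```

The swap total stays below 4G and is therefore reported correctly, while the 6G memory total wraps.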

Regards,
Tigran

