Re: [Beowulf] Barcelona hardware error: how to detect

2008-06-12 Thread Bill Broadley
Yes, what is currently shipping from AMD are B3 revision processors. The TLB problem is fixed. There are other, less critical problems with B3, however. Specifically, power-related compatibility issues with various motherboards due to (according to the motherboard manufacturers) AMD

Re: [Beowulf] Barcelona hardware error: how to detect

2008-06-10 Thread Jason Clinton
On Thu, Jun 5, 2008 at 1:09 PM, Mikhail Kuzminsky [EMAIL PROTECTED] wrote: In message from Mark Hahn [EMAIL PROTECTED] (Thu, 5 Jun 2008 13:55:01 -0400 (EDT)): I'm mystified by this: B2 was broken, so using it without the bios workaround is just a mistake or masochism. the workaround _did_

Re: [Beowulf] Barcelona hardware error: how to detect

2008-06-10 Thread Jason Clinton
On Thu, Jun 5, 2008 at 11:39 AM, Mikhail Kuzminsky [EMAIL PROTECTED] wrote: In message from Mark Hahn [EMAIL PROTECTED] (Thu, 5 Jun 2008 11:57:28 -0400 (EDT)): To be more exact, Rev. B2 of Opteron 2350 - is it for CPU stepping w/error or w/o error ? AMD, like Intel, does a reasonable job

Re: [Beowulf] Barcelona hardware error: how to detect

2008-06-10 Thread Chris Samuel
- Jason Clinton [EMAIL PROTECTED] wrote: The kernel patch is very extensive and, last I heard, under NDA. AMD post the patches publicly to the x86-64 discuss list. The most recent ones covered 2.6.24 and 2.6.25 and were sent out in April.

Re: [Beowulf] Barcelona hardware error: how to detect

2008-06-06 Thread Chris Samuel
- Mark Hahn [EMAIL PROTECTED] wrote: the kernel patch was publicly distributed in dec 07. it appears to add some kernel logic to avoid the specific L3 TLB states which don't behave correctly. the bios-level workaround is different, and appears to disable the L3 TLB - I don't know

Re: [Beowulf] Barcelona hardware error: how to detect

2008-06-06 Thread Chris Samuel
- Mikhail Kuzminsky [EMAIL PROTECTED] wrote: Yes, this AMD errata document says that in B3 revision the error will be fixed. I heard that new CPUs w/o TLB+L3 error are shipped now, but are these CPUs really B3, or might they be some newer revision ? They certainly do exist, we've got 94

[Beowulf] Barcelona hardware error: how to detect

2008-06-05 Thread Mikhail Kuzminsky
How is it possible to detect whether a particular AMD Barcelona CPU has, or doesn't have, the known hardware error ? To be more exact: is Rev. B2 of the Opteron 2350 a CPU stepping with the error or without it ? Mikhail Kuzminsky Computer Assistance to Chemical Research Center Zelinsky Inst. of

Re: [Beowulf] Barcelona hardware error: how to detect

2008-06-05 Thread Mark Hahn
To be more exact, Rev. B2 of Opteron 2350 - is it for CPU stepping w/error or w/o error ? AMD, like Intel, does a reasonable job of disclosing such info: http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/41322.PDF the well-known problem is erratum 298, I think, and fixed
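
For readers who want to check a node directly, here is a minimal sketch (not from the thread) that reads the family/model/stepping fields from /proc/cpuinfo on Linux and maps them to a revision name. The mapping table is an assumption: it supposes that Barcelona reports cpu family 16 (10h), model 2, with stepping 2 for B2 and stepping 3 for B3. Verify that against the revision guide linked above before trusting it. The function names are hypothetical.

```python
# Assumed mapping for family 10h (decimal 16), model 2 parts: stepping -> revision.
# This table is an assumption; confirm it against the AMD revision guide.
ASSUMED_REVISIONS = {1: "B1", 2: "B2", 3: "B3", 10: "BA"}


def parse_cpuinfo(text):
    """Return one (family, model, stepping) tuple per logical CPU."""
    cpus = []
    # /proc/cpuinfo separates logical CPUs with a blank line.
    for block in text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
        try:
            cpus.append((int(fields["cpu family"]),
                         int(fields["model"]),
                         int(fields["stepping"])))
        except (KeyError, ValueError):
            # Skip blocks that lack the fields (non-x86 layouts, for example).
            continue
    return cpus


def classify(family, model, stepping):
    """Describe a CPU using the assumed stepping table above."""
    if family != 16 or model != 2:
        return "not a family 10h model 2 part"
    revision = ASSUMED_REVISIONS.get(stepping)
    if revision is None:
        return "unknown stepping %d" % stepping
    if revision == "B3":
        return "B3 (erratum 298 reported fixed)"
    return "%s (pre-B3, check the revision guide for erratum 298)" % revision


def report(path="/proc/cpuinfo"):
    """Print one line per distinct (family, model, stepping) found on this host."""
    with open(path) as handle:
        for cpu in sorted(set(parse_cpuinfo(handle.read()))):
            print(cpu, classify(*cpu))
```

Calling report() on a node prints one line per distinct CPU signature. The same three fields are also visible with a plain grep on /proc/cpuinfo.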

Re: [Beowulf] Barcelona hardware error: how to detect

2008-06-05 Thread Mikhail Kuzminsky
In message from Mark Hahn [EMAIL PROTECTED] (Thu, 5 Jun 2008 11:57:28 -0400 (EDT)): To be more exact, Rev. B2 of Opteron 2350 - is it for CPU stepping w/error or w/o error ? AMD, like Intel, does a reasonable job of disclosing such info:

Re: [Beowulf] Barcelona hardware error: how to detect

2008-06-05 Thread Mark Hahn
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/41322.PDF the well-known problem is erratum 298, I think, and fixed in B3. Yes, this AMD errata document says that in B3 revision the error will be fixed. I believe the absence of 'x' in the B3 column of the table on p

Re: [Beowulf] Barcelona hardware error: how to detect

2008-06-05 Thread Mikhail Kuzminsky
In message from Mark Hahn [EMAIL PROTECTED] (Thu, 5 Jun 2008 13:30:57 -0400 (EDT)): http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/41322.PDF the well-known problem is erratum 298, I think, and fixed in B3. Yes, this AMD errata document says that in B3 revision the

Re: [Beowulf] Barcelona hardware error: how to detect

2008-06-05 Thread Mikhail Kuzminsky
In message from Mark Hahn [EMAIL PROTECTED] (Thu, 5 Jun 2008 13:55:01 -0400 (EDT)): I believe the absence of 'x' in the B3 column of the table on p 15 means that it _is_ fixed in B3. I received just now some preliminary data about Gaussian-03 run problems w/B2 and about absence of this

Re: [Beowulf] Barcelona hardware error: how to detect

2008-06-05 Thread Mikhail Kuzminsky
In message from Jason Clinton [EMAIL PROTECTED] (Thu, 5 Jun 2008 13:16:33 -0500): On Thu, Jun 5, 2008 at 1:09 PM, Mikhail Kuzminsky [EMAIL PROTECTED] wrote: In message from Mark Hahn [EMAIL PROTECTED] (Thu, 5 Jun 2008 13:55:01 -0400 (EDT)): I'm mystified by this: B2 was broken, so using it

Re: [Beowulf] Barcelona hardware error: how to detect

2008-06-05 Thread Greg Lindahl
On Thu, Jun 05, 2008 at 10:09:58PM +0400, Mikhail Kuzminsky wrote: This was interesting for me also, because I have no information how this hardware problem may be affected in the real life. I have 4 chips with the bug, in 2 servers. I see about 1 lockup per month with my workload, which

Re: [Beowulf] Barcelona hardware error: how to detect

2008-06-05 Thread Mark Hahn
The kernel patch is very extensive and, last I heard, under NDA. AMD has the kernel patch was publicly distributed in dec 07. it appears to add some kernel logic to avoid the specific L3 TLB states which don't behave correctly. the bios-level workaround is different, and appears to disable