Re: OT: Hard Disk Problems (was: Re: Dealing with Seagate's problematic 7200.11 firmware.)

2009-01-30 Thread Sieve-X
Dieter openbsd at sopwith.solgatos.com writes:

 
 Sigh.  I could easily go on a major rant here, but it wouldn't do us
 any good.  Anyone have information or ideas that could get us closer
 to a solution?
 

The event log counter can be written every once in a while, for example when
S.M.A.R.T. automatic off-line data collection (e.g. every 4h) is enabled. It is
enabled by default, and the collected data may include a list of the last
errors, temperatures, SER and others.

It appears to be theoretically possible to verify whether a particular
drive is affected by the problem using S.M.A.R.T. information (i.e. attributes,
logs, etc.). That might also serve as a workaround if there is a way to change
the event counter to a safe value (i.e. not 320 or a multiple of 320 + x*256),
but we need more details (e.g. the specific data pattern), which Seagate
released under NDA to some partners/vendors. It might be possible to find out
by comparing the S.M.A.R.T. information (including some of the
vendor logs like 0xa1) from affected and non-affected drives that match the
basic requirements (7200.11/ES2.1/DiamondMax 22, both new and old firmware).
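
To make the "safe value" idea concrete, here is a minimal sketch of the check in Python. It assumes the risky values are exactly 320 + x*256 for x = 0, 1, 2, ...; that is my reading of the description, not confirmed firmware behaviour, and the function name is made up for illustration.

```python
def is_risky_count(count: int, base: int = 320, period: int = 256) -> bool:
    """Return True if count equals base + x*period for some integer x >= 0.

    The defaults (320 and 256) come from the thread; treating them as the
    complete rule is an assumption.
    """
    return count >= base and (count - base) % period == 0


for c in (64, 319, 320, 321, 576, 832):
    print(c, is_risky_count(c))
```

Note the `count >= base` guard: without it a value like 64 (which is 320 - 256) would be flagged as well.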

Log Directory Supported (this one is from an affected model)

SMART Log Directory Logging Version 1 [multi-sector log support]
Log at address 0x00 has 001 sectors [Log Directory]
Log at address 0x01 has 001 sectors [Summary SMART error log]
Log at address 0x02 has 005 sectors [Comprehensive SMART error log]
Log at address 0x03 has 005 sectors [Extended Comprehensive SMART error log]
Log at address 0x06 has 001 sectors [SMART self-test log]
Log at address 0x07 has 001 sectors [Extended self-test log]
Log at address 0x09 has 001 sectors [Selective self-test log]
Log at address 0x10 has 001 sectors [Reserved log]
Log at address 0x11 has 001 sectors [Reserved log]
Log at address 0x21 has 001 sectors [Write stream error log]
Log at address 0x22 has 001 sectors [Read stream error log]
Log at address 0x80 has 016 sectors [Host vendor specific log]
Log at address 0x81 has 016 sectors [Host vendor specific log]
Log at address 0x82 has 016 sectors [Host vendor specific log]
Log at address 0x83 has 016 sectors [Host vendor specific log]
Log at address 0x84 has 016 sectors [Host vendor specific log]
Log at address 0x85 has 016 sectors [Host vendor specific log]
Log at address 0x86 has 016 sectors [Host vendor specific log]
Log at address 0x87 has 016 sectors [Host vendor specific log]
Log at address 0x88 has 016 sectors [Host vendor specific log]
Log at address 0x89 has 016 sectors [Host vendor specific log]
Log at address 0x8a has 016 sectors [Host vendor specific log]
Log at address 0x8b has 016 sectors [Host vendor specific log]
Log at address 0x8c has 016 sectors [Host vendor specific log]
Log at address 0x8d has 016 sectors [Host vendor specific log]
Log at address 0x8e has 016 sectors [Host vendor specific log]
Log at address 0x8f has 016 sectors [Host vendor specific log]
Log at address 0x90 has 016 sectors [Host vendor specific log]
Log at address 0x91 has 016 sectors [Host vendor specific log]
Log at address 0x92 has 016 sectors [Host vendor specific log]
Log at address 0x93 has 016 sectors [Host vendor specific log]
Log at address 0x94 has 016 sectors [Host vendor specific log]
Log at address 0x95 has 016 sectors [Host vendor specific log]
Log at address 0x96 has 016 sectors [Host vendor specific log]
Log at address 0x97 has 016 sectors [Host vendor specific log]
Log at address 0x98 has 016 sectors [Host vendor specific log]
Log at address 0x99 has 016 sectors [Host vendor specific log]
Log at address 0x9a has 016 sectors [Host vendor specific log]
Log at address 0x9b has 016 sectors [Host vendor specific log]
Log at address 0x9c has 016 sectors [Host vendor specific log]
Log at address 0x9d has 016 sectors [Host vendor specific log]
Log at address 0x9e has 016 sectors [Host vendor specific log]
Log at address 0x9f has 016 sectors [Host vendor specific log]
Log at address 0xa1 has 020 sectors [Device vendor specific log]
Log at address 0xa8 has 020 sectors [Device vendor specific log]
Log at address 0xa9 has 001 sectors [Device vendor specific log]
Log at address 0xe0 has 001 sectors [Reserved log]
Log at address 0xe1 has 001 sectors [Reserved log]
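
For anyone who wants to compare log directories between affected and non-affected drives, a small sketch like the following could turn a listing in the format above into a table keyed by log address. The function and variable names are hypothetical; only the line format is taken from the output shown.

```python
import re

# Matches lines such as:
#   Log at address 0xa1 has 020 sectors [Device vendor specific log]
LOG_LINE = re.compile(r"Log at address (0x[0-9a-fA-F]{2}) has (\d+) sectors \[(.+)\]")


def parse_log_directory(text):
    """Map log address -> (sector count, description)."""
    logs = {}
    for line in text.splitlines():
        m = LOG_LINE.search(line)
        if m:
            logs[int(m.group(1), 16)] = (int(m.group(2)), m.group(3))
    return logs


sample = """\
Log at address 0x00 has 001 sectors [Log Directory]
Log at address 0xa1 has 020 sectors [Device vendor specific log]
Log at address 0xa8 has 020 sectors [Device vendor specific log]
Log at address 0xa9 has 001 sectors [Device vendor specific log]
"""

logs = parse_log_directory(sample)
# Keep only the device vendor logs, which are the interesting ones here.
vendor = {addr: n for addr, (n, desc) in logs.items()
          if desc == "Device vendor specific log"}
print({hex(addr): n for addr, n in vendor.items()})
```

Running the same parser on listings from two drives and diffing the resulting dicts would show whether the vendor logs differ in size or presence.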



Re: OT: Hard Disk Problems (was: Re: Dealing with Seagate's problematic 7200.11 firmware.)

2009-01-30 Thread Stuart Henderson
please, this is way off topic. could you try and find a better list to
chat about this on...



Re: Dealing with Seagate's problematic 7200.11 firmware.

2009-01-29 Thread Dieter
Has anyone looked into disassembling the firmware?



Re: OT: Hard Disk Problems (was: Re: Dealing with Seagate's problematic 7200.11 firmware.)

2009-01-28 Thread Toni Mueller
Hi,

On Tue, 27.01.2009 at 21:37:28 +, Dieter open...@sopwith.solgatos.com 
wrote:
 Toni writes:
  positives and false negatives. After deciding that the results were
  far too unreliable, the page was pulled.
 
 That too.  For one thing people were entering the serial numbers
 using lower case letters and getting false negatives.

this is a joke, right?

 There is a reason I want to look into zeroing out the magic area as
 an alternative to risking updating the firmware.  :-(

Understood... I'm looking for a different vendor, too. :-|

 the power fails.  So not a great workaround, but better than nothing,

Right.

 As I understand it, updating the firmware on some mainboards IS risky.

It may well be that some combinations don't work, but at some point,
I'd say that this should fall into the category of you get what you
pay for. IOW, I can't imagine that doing this kind of stuff right
would cost more than, say, $1 for a drive, and $5 for a motherboard,
and I think that everyone should be prepared to add, say, $50 to a
small server to get these things, ie, (much) less broken designs, imho.
But the bigger problem is that currently there appears to be no way to
add $50, or even $500, to a server, to get these things right because
there seems to be no vendor who offers such stuff.

   There is supposed to be some document that explains all this,
   with enough details to create a fix.  If anyone finds this
   document I need a copy please.
  
  Me too!
 
 Sounds like you are on good terms with your dealer.  Can your dealer get
 you a copy?

LOL. I can ask him, but don't expect too much...


Kind regards,
--Toni++



Re: OT: Hard Disk Problems (was: Re: Dealing with Seagate's problematic 7200.11 firmware.)

2009-01-28 Thread Dieter
   positives and false negatives. After deciding that the results were
   far too unreliable, the page was pulled.
  
  That too.  For one thing people were entering the serial numbers
  using lower case letters and getting false negatives.
 
 this is a joke, right?

As far as I can tell it is not a joke.  The people entering the
serial numbers might have been wintel users and thus not too bright.
Seagate's quality control dept is clearly missing in action lately.

  As I understand it, updating the firmware on some mainboards IS risky.
 
 It may well be that some combinations don't work, but at some point,
 I'd say that this should fall into the category of you get what you
 pay for. IOW, I can't imagine that doing this kind of stuff right
 would cost more than, say, $1 for a drive, and $5 for a motherboard,
 and I think that everyone should be prepared to add, say, $50 to a
 small server to get these things, ie, (much) less broken designs, imho.
 But the bigger problem is that currently there appears to be no way to
 add $50, or even $500, to a server, to get these things right because
 there seems to be no vendor who offers such stuff.

The idiots in charge of most companies don't care about quality control.

Sigh.  I could easily go on a major rant here, but it wouldn't do us
any good.  Anyone have information or ideas that could get us closer
to a solution?



Re: Dealing with Seagate's problematic 7200.11 firmware.

2009-01-27 Thread Toni Mueller
Hi,

On Mon, 26.01.2009 at 15:39:36 +0100, Raimo Niskanen 
raimo+open...@erix.ericsson.se wrote:
 How can I know if I have a suspicious drive?

you won't, imho, until Seagate delivers usable data on this issue.
Their statements so far have been a long way from trust-inspiring,
imho.

My best bet is currently to wait for a definite statement from my dealer,
who also carries the burden of providing warranty to me (so I hope
he'll think twice before saying something he doesn't at least believe).


In the meantime, I've opted to not power down or reboot any machine
until I have definite answers, which turns out to be quite a
nuisance!



-- 
Kind regards,
--Toni++



Re: OT: Hard Disk Problems (was: Re: Dealing with Seagate's problematic 7200.11 firmware.)

2009-01-27 Thread Toni Mueller
Hi,

On Mon, 26.01.2009 at 17:08:51 +, Dieter open...@sopwith.solgatos.com 
wrote:
 It is easy to set up a slashdot account.  Or you can post as anonymous
 coward.

yes, but I don't want to set up a /. account right now, and posting as
AC wouldn't likely solve the problem.

 that he has another slashdot account that isn't anonymous.  Problem I
 have is I can't find a way to send him a PM (private message).  Most web

This is exactly the point.

 forums have a facility for sending other users a PM.  We can post a reply
 to the thread, but he would have to read the thread again to see it.
 Any slashdot wizards out there have an idea?

Post to the thread and offer one's own email address (maybe
time-limited or so), and hope for the best... not exactly a silver
bullet, but maybe better than nothing.

 It isn't even just FLOSS.  Any non-x86 machine is out of luck.
 Proprietary Unix is out of luck.  Anything embedded is out of luck.
 Even Mac is probably out of luck.  And if the reboot to run the
 firmware installer bricks the drive(s) even wintel is out of luck.

Yes, and smartmontools claims to run on all platforms you mentioned
(except Mac OS 9). I.e., they even run on Windows and/or together with
Cygwin. Therefore, I think that this is a strategic point from where
the problem could be solved for a really broad range of systems, and in
one go.

 I don't understand the common corporate policy of keeping everything
 secret.  All they are doing is hurting their previously loyal customers.
 It didn't used to be this way.

Oh... over here, we have a saying: Sea gate, oder sie geht net.
(meaning: it works, or it doesn't - it's a pun on the pronunciation
of Seagate). Yes, many people, me included, thought they had
reformed...

 Supposedly there was a broken test machine that didn't zero out some
 special area after writing a test pattern.  So only drives that were
 tested on that machine are at risk.

I'd like to not speculate about the cause of the problem any longer,
but instead devise a plan to acquire the required knowledge to beef up
smartmontools to solve the problem. I could only take such claims
about the causes on faith, and at present Seagate has destroyed about as
much trust as they possibly could, at least with me. So, except for the
hard-core technical data, they're out of the loop as far as I'm
concerned.

 If we can find out what area
 this is (I assume it isn't in the normal space used for user storage)
 and how to zero it (if not already zero) there is no need to update
 the firmware.

I'd rather say that the (ring) buffer has some external counter, also
stored somewhere, which needs to be adjusted. I'd not bet that simply
zeroing the area(s) will do.

 Good question.  Seagate has some web page that supposedly will tell you,
 but of course it is broken and doesn't work with all browsers.

At some time, they had a page where you could enter your model and
serial number, but reportedly this page delivered a lot of false
positives and false negatives. After deciding that the results were
far too unreliable, the page was pulled.

 Toni reports that ES and ES.2 may be affected.

This I took from a Seagate web page. Stuart Henderson has posted the
link, and I had the same link in my email which I received from
Seagate, so, I'd say, the link is genuine (despite the contents of
the page being almost worthless, imho).

 From what I've read it sounds like the counter must be exactly 320 AND some
 location must have a test pattern rather than zero when you init (power up
 or reboot) the drive.  From Maxtorman's description, the log is circular,
 so it will eventually wrap around to 320 again.

My dealer, who claimed that he also had information directly from
Seagate, told me that the buffer was 256 entries long (makes a lot of
sense, imho), but nevermind. We need hard facts, preferably in the
form of photocopies of internal design papers or so, not speculations.

 So keeping the counter away from 320 is an okay short term workaround,

This would require periodically checking the log position and e.g.
resetting it to zero at shutdown, to be on the safe side.
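
A periodic check could report how much headroom is left before the next risky value. The sketch below assumes that the counter goes up by one per logged event and that the risky values are 320 + x*256; both are assumptions drawn from this thread, and reading the actual counter is not covered (nobody here knows how to do that yet).

```python
def events_until_risky(count: int, base: int = 320, period: int = 256) -> int:
    """Number of further log entries before count reaches base + x*period.

    Returns 0 if count is already on a risky value.
    """
    if count <= base:
        return base - count
    return (base - count) % period


for c in (0, 300, 320, 321, 575, 576):
    print(c, events_until_risky(c))
```

A monitoring script might refuse a planned shutdown (or at least warn) when the returned headroom is small.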

 but long term we want to either zero out the magic location or update the
 firmware.

We want to have updated firmware and the ability to update firmware for
all drives, also from other manufacturers. Updating firmware for a
drive shouldn't be any more complicated or risky than updating the BIOS
on the motherboard.

 There is supposed to be some document that explains all this,
 with enough details to create a fix.  If anyone finds this
 document I need a copy please.

Me too!

 If you have one or more of the suspect drives, if it running,
 try to keep it running and don't reboot.  If it is powered down
 leave it powered down if possible until this all gets sorted out.

Yes... but that still doesn't help you in the face of a system's crash.
What to do then? No need to answer this one...


-- 
Kind regards,
--Toni++



Re: OT: Hard Disk Problems (was: Re: Dealing with Seagate's problematic 7200.11 firmware.)

2009-01-27 Thread Toni Mueller
Hi,

On Mon, 26.01.2009 at 17:08:51 +, Dieter open...@sopwith.solgatos.com 
wrote:
 Your suggestion of smartmontools is helpful, thank you.

thanks - I have just sent an email to them, esp. after seeing that
there are people from big name companies involved, who could procure
at least some of the required documentation inhouse.


-- 
Kind regards,
--Toni++



Re: Dealing with Seagate's problematic 7200.11 firmware.

2009-01-27 Thread Sieve-X
Dieter openbsd at sopwith.solgatos.com writes:

 
 Recovering from Seagate's problematic 7200.11 firmware.
 
 Most of you have read about the problems with Seagate's
 7200.11 disks.  For those of you that haven't, the firmware
 on many of these drives is buggy, and can brick the drive
 when powering up or rebooting the system.  Thus far,
 Seagate's response has been less than wonderful.  We need
 a FLOSS solution.
 
 Goals:
 
   1) Ability to read the number of log entries.
 
   2) Ability to change the number of log entries.

As far as I know, the drive's internal event counter can only be accessed
or changed at the firmware level (i.e. serial/PC-3000). Maybe disabling
S.M.A.R.T. automatic off-line data collection (and/or attribute
autosave) with smartctl could somehow prevent the internal event log
from reaching the magic value (320 or 320+x*256), because it does save
data to the reserved drive area (in case of errors it even includes POH).

POH = Power On Hours
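
As a sketch of what that would look like in practice, the following Python snippet only builds the two smartctl command lines and prints them; it does not run anything. The `-o off` (automatic offline testing) and `-S off` (attribute autosave) options are standard smartctl switches; the device path and `-d ata` are simply borrowed from elsewhere in this thread, and whether this actually slows down the internal event counter is unverified.

```python
import shlex


def smart_quiet_commands(device, dev_type="ata"):
    """Build (but do not run) smartctl commands that switch off
    automatic offline data collection and attribute autosave."""
    base = ["smartctl", "-d", dev_type]
    return [
        base + ["-o", "off", device],  # automatic offline testing off
        base + ["-S", "off", device],  # attribute autosave off
    ]


for cmd in smart_quiet_commands("/dev/rwd1c"):
    print(shlex.join(cmd))
```

To actually apply the settings one would pass each list to `subprocess.run` as root, after checking the current state with `smartctl -c`.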

 
   3) Ability to install new firmware from Unix.


Drive firmware flashing at the (S)ATA interface level could be done
on UNIX, but doing so from a mounted file system (to avoid a reboot)
and/or without a controller reset might have catastrophic results (I would
risk saying it's even more critical than updating the system BIOS, because
there are more variables, i.e. different controllers, RAID, etc.).

 We need for this to work with any flavor of Unix,
 on any CPU arch, without reboot or power cycle.
 We need for this to work on one drive without affecting
 other drives.
 
 I don't expect to be able to write FLOSS firmware for the drives, so
 this isn't listed as a goal.  If you think you can, please feel free.

I also think the firmware should be open-source with a portable (any arch)
update tool. This would allow many improvements and a much more reliable
bug tracking/testing process (i.e. there are many firmware bugs, like the NCQ
stuttering issue with some versions, self-test log holes, etc.).

Writing FLOSS firmware would require some degree of cooperation from Seagate.

 
 The problem:
 
 IF the drive is powered down when there are 320 entries in this journal
 or log, then when it is powered back up, the drive errors out on init and
 won't boot properly - to the point that it won't even report its
 information to the BIOS.
 
   Maxtorman, slashdot discussion [2]
 
 If Maxtorman is correct, then once the drive has been operating awhile,
 we have a 1 in 320 chance that the circular log is at entry 320.  We want
 to be able to find out how many log entries the disk currently has, and
 we want to be able to change the number of log entries away from 320,
 while we wait for Seagate to get its act together and release firmware
 that works properly.  Since Seagate's solution will require attaching
 the drive to an x86 system and booting a FreeDOS ISO from CD, if the log
 is at 320 that boot will brick the drive.
 
 There are other firmware problems with the 7200.11 series, but this is
 the biggie.
 
 Once Seagate releases working firmware, we want to be able to install
 it from Unix, on any CPU arch.  Seagate's release can only install
 on x86 using FreeDOS.
 
 *ATA Commands that may be useful:
 
 command name                  command code in hex   page [1]   pdf page [1]
 Read Log Ext                  0x2F                  27         33
 S.M.A.R.T. Read Log Sector    0xB0 / 0xD5           28,34      34,40
 S.M.A.R.T. Write Log Sector   0xB0 / 0xD6           28,34      34,40
 Write Log Extended            0x3F                  28         34
 Download Microcode            0x92                  27         33
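
For anyone starting to script against these, the command table above can be written down as data. This is only a sketch: the class and helper names are invented, and the "0xB0 / 0xD5" style entries are represented as command 0xB0 plus a feature-register value, as in the ATA SMART feature set.

```python
from typing import NamedTuple, Optional


class AtaCommand(NamedTuple):
    name: str
    command: int                   # value for the command register
    feature: Optional[int] = None  # feature register value, if one applies


ATA_COMMANDS = [
    AtaCommand("Read Log Ext", 0x2F),
    AtaCommand("S.M.A.R.T. Read Log Sector", 0xB0, 0xD5),
    AtaCommand("S.M.A.R.T. Write Log Sector", 0xB0, 0xD6),
    AtaCommand("Write Log Extended", 0x3F),
    AtaCommand("Download Microcode", 0x92),
]


def describe(cmd: AtaCommand) -> str:
    """Render one table entry as text."""
    text = f"{cmd.name}: command 0x{cmd.command:02X}"
    if cmd.feature is not None:
        text += f", feature 0x{cmd.feature:02X}"
    return text


for cmd in ATA_COMMANDS:
    print(describe(cmd))
```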
 
 Questions:
 
   Is Maxtorman correct about the 320 log entries?
 
   Are the commands listed above the ones we need?
   What is the difference between the Log Extended
   and the S.M.A.R.T. Log Sector?
   Is Microcode the same as firmware?  (Seagate uses
   the term firmware elsewhere in the manual, but I don't
   find any sort of write firmware command.)
 
   Where can we get more detailed info about these
   commands and how to use them?

Maxtorman is right about the 320, but it's a bit more complicated. Here
is the detailed description of the failure root cause (no NDA pets were hurt):

The firmware issue is that the end boundary of the event log circular
buffer (320) was set incorrectly. During Event Log initialization, the
boundary condition that defines the end of the Event Log is off by one.
During power up, if the Event Log counter is at entry 320, or a multiple
of (320 + x*256), and if a particular data pattern (dependent on the type
of tester used during the drive manufacturing test process) had been present
in the reserved-area system tracks when the drive's reserved-area file 
system was created during manufacturing, firmware will increment the Event
Log pointer past the end of the event log data structure. This error is
detected and results in an Assert Failure, which causes the drive to
hang as a 
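
The class of bug described here (an end boundary that is off by one) can be illustrated with a toy circular log in Python. This is not the drive firmware: the 320-entry size is the only number taken from the description, the manufacturing data pattern condition is left out, and all names are invented.

```python
LOG_SIZE = 320  # toy log: valid entry indices are 0..319


def advance(index: int, buggy: bool) -> int:
    """Move a circular-log index forward by one entry."""
    nxt = index + 1
    # The buggy variant places the end boundary one entry too far out.
    limit = LOG_SIZE + 1 if buggy else LOG_SIZE
    if nxt >= limit:
        nxt = 0
    return nxt


def check(index: int) -> None:
    """Stand-in for the firmware's sanity check on the log pointer."""
    if not 0 <= index < LOG_SIZE:
        raise AssertionError(f"log pointer {index} is past the end of the log")


for buggy in (False, True):
    label = "buggy" if buggy else "fixed"
    idx = advance(LOG_SIZE - 1, buggy)
    try:
        check(idx)
        print(label, "-> index", idx, "ok")
    except AssertionError as err:
        print(label, "->", err)
```

With the correct boundary the index wraps back to 0; with the boundary one too far, the index lands on 320, which is outside the log and trips the check.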

Re: OT: Hard Disk Problems (was: Re: Dealing with Seagate's problematic 7200.11 firmware.)

2009-01-27 Thread Sieve-X
Nenhum_de_Nos matheus at eternamente.info writes:

  where you read that from ?
 
  I have a couple of 750GB ES.2 and now I'm worried !
 
 
http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931NewLang=en
 
 thanks, yet OT, but I also heard of new firmwares being worse than old
 ones, from seagate first try to fix things. anyone already updated some
 ES.2 and all went fine ?
 
 thanks,
 
 matheus
 

I updated some ST3500320NS to SN06C and everything went fine. If you
have the SAS version of ES2.1 it's not affected (does not need update). 



Re: OT: Hard Disk Problems (was: Re: Dealing with Seagate's problematic 7200.11 firmware.)

2009-01-27 Thread Dieter
Toni writes:

  If we can find out what area
  this is (I assume it isn't in the normal space used for user storage)
  and how to zero it (if not already zero) there is no need to update
  the firmware.
 
 I'd rather say that the (ring) buffer has some external counter, also
 stored somewhere, which needs to be adjusted. I'd not bet that simply
 zeroing the area(s) will do.

From what I've read, zeroing the area should make the disk safe.
Depending on what it takes to zero the area, this might be either
more or less safe than updating the firmware.

  Good question.  Seagate has some web page that supposedly will tell you,
  but of course it is broken and doesn't work with all browsers.
 
 At some time, they had a page where you could enter your model and
 serial number, but reportedly this page delivered a lot of false
 positives and false negatives. After deciding that the results were
 far too unreliable, the page was pulled.

That too.  For one thing people were entering the serial numbers
using lower case letters and getting false negatives.
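
The lower-case problem is the kind of thing a lookup should absorb on its own. A minimal sketch, assuming serial numbers compare equal once whitespace is trimmed and letters are upper-cased (the serials below are made-up placeholders, not real affected units):

```python
def normalize_serial(raw: str) -> str:
    """Trim whitespace and upper-case a serial number before comparing."""
    return raw.strip().upper()


def is_listed(raw: str, listed) -> bool:
    """Case-insensitive membership test against a list of serials."""
    return normalize_serial(raw) in {normalize_serial(s) for s in listed}


# Placeholder data for illustration only.
listed_serials = ["EXAMPLE001", "EXAMPLE002"]

print(is_listed(" example001 ", listed_serials))
print(is_listed("example999", listed_serials))
```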

Does Seagate test anything?  Their firmware is buggy, their test
equipment is buggy, their web site doesn't work, their model and serial
number checker program is buggy.  They released a firmware installer
program that bricks drives.

There is a reason I want to look into zeroing out the magic area as
an alternative to risking updating the firmware.  :-(

  So keeping the counter away from 320 is an okay short term workaround,
 
 This would require to periodically check the log position and eg. reset
 it to zero at shutdown, to be on the safe side.

Yes.  Depending on what events make the counter increment, it might be
possible for the counter to go from 0 to 320 in a short time, and then
the power fails.  So not a great workaround, but better than nothing,
and if we get info on the counter before getting info on a proper fix
(either zeroing the magic area or updating the firmware) we could use
it as a workaround until we get a proper fix.

  but long term we want to either zero out the magic location or update the
  firmware.
 
 We want to have updated firmware and the ability to update firmware for
 all drives, also from other manufacturers. Updating firmware for a
 drive shouldn't be any more complicated or risky than updating the BIOS
 on the motherboard.

As I understand it, updating the firmware on some mainboards IS risky.
Some have a fail safe that allows trying again, others have two
areas that can be written, but some have neither of these and risk getting
bricked.

  There is supposed to be some document that explains all this,
  with enough details to create a fix.  If anyone finds this
  document I need a copy please.
 
 Me too!

Sounds like you are on good terms with your dealer.  Can your dealer get
you a copy?

--

Sieve-X writes:

 As far as I know the drive internal event counter can only be accessed
 or changed from firmware level (ie. serial/pc-3000).

By serial you mean using the TTL to RS-232 thing?  For most of us that
would require taking the drive out of whatever case it is mounted in
to access the pins.

What is pc-3000?

 Drive firmware flashing from (S)ATA interface level could be done 
 on UNIX but doing so from a mounted file-system (to avoid a reboot)

I was worried that device drivers might not have a facility to
pass through whatever magic commands we need.  But yeah, if the
root partition is on an affected drive that might be a problem.
It would be nice to know if rebooting is safe or not.



Re: OT: Hard Disk Problems (was: Re: Dealing with Seagate's problematic 7200.11 firmware.)

2009-01-26 Thread Toni Mueller
Hi,

On Sun, 25.01.2009 at 16:27:14 +, Dieter open...@sopwith.solgatos.com 
wrote:
 I wrote:
  You wrote:
 Is Maxtorman correct about the 320 log entries?
  My dealer told me a similar story, but I don't know where he had it
  from.
 
 I guess the next step is to find out if Maxtorman is correct about this
 320 log entries stuff, and if the SMART log entries as reported by
 smartmontools is the log to worry about, or if there is some other log.

I don't have an account on /., and also feel incapable of actually
working on this problem, but someone who has and can, could probably
try to nag maxtorman about improving smartmontools to the point that
they do the right thing, or try to get him to connect one to somebody
else who can verify the issue and/or provide more technical details.

If he can find a way to almost-anonymously post to /., he might be able
to give some hints to the smartmontools guys, too. Then, we only need
them to integrate everything and make a new release.

Personally, I'd say that it'd be best if Seagate themselves would grab
the opportunity to partially make good on the issue, but I heavily
doubt that they understand, or want to understand, what FLOSS is
about.


Kind regards,
--Toni++



Re: Dealing with Seagate's problematic 7200.11 firmware.

2009-01-26 Thread RedShift

Dieter wrote:

Recovering from Seagate's problematic 7200.11 firmware.

Most of you have read about the problems with Seagate's
7200.11 disks.  For those of you that haven't, the firmware
on many of these drives is buggy, and can brick the drive
when powering up or rebooting the system.  Thus far,
Seagate's response has been less than wonderful.  We need
a FLOSS solution.

Goals:

1) Ability to read the number of log entries.

2) Ability to change the number of log entries.

3) Ability to install new firmware from Unix.

We need for this to work with any flavor of Unix,
on any CPU arch, without reboot or power cycle.
We need for this to work on one drive without affecting
other drives.

I don't expect to be able to write FLOSS firmware for the drives, so
this isn't listed as a goal.  If you think you can, please feel free.

The problem:

IF the drive is powered down when there are 320 entries in this journal
or log, then when it is powered back up, the drive errors out on init and
won't boot properly - to the point that it won't even report its
information to the BIOS.

Maxtorman, slashdot discussion [2]



Just a hypothetical situation, since we do not have the source code of the
firmware: isn't it possible that some kind of mathematical operation is
performed on the number of log entries, causing some kind of infinite loop or a
division that leads to/by 0 that the software/hardware is unable to handle?
That could mean this problem could also manifest itself on, for example,
multiples of 320, so just putting the counter on 321 may just be delaying the
inevitable. And what happens if the counter overflows and reaches 320 again?

Glenn



If Maxtorman is correct, then once the drive has been operating awhile,
we have a 1 in 320 chance that the circular log is at entry 320.  We want
to be able to find out how many log entries the disk currently has, and
we want to be able to change the number of log entries away from 320,
while we wait for Seagate to get its act together and release firmware
that works properly.  Since Seagate's solution will require attaching
the drive to an x86 system and booting a FreeDOS ISO from CD, if the log
is at 320 that boot will brick the drive.

There are other firmware problems with the 7200.11 series, but this is
the biggie.

Once Seagate releases working firmware, we want to be able to install
it from Unix, on any CPU arch.  Seagate's release can only install
on x86 using FreeDOS.

*ATA Commands that may be useful:

command name                  command code in hex   page [1]   pdf page [1]
Read Log Ext                  0x2F                  27         33
S.M.A.R.T. Read Log Sector    0xB0 / 0xD5           28,34      34,40
S.M.A.R.T. Write Log Sector   0xB0 / 0xD6           28,34      34,40
Write Log Extended            0x3F                  28         34
Download Microcode            0x92                  27         33

Questions:

Is Maxtorman correct about the 320 log entries?

Are the commands listed above the ones we need?
What is the difference between the Log Extended
and the S.M.A.R.T. Log Sector?
Is Microcode the same as firmware?  (Seagate uses
the term firmware elsewhere in the manual, but I don't
find any sort of write firmware command.)

Where can we get more detailed info about these
commands and how to use them?

References:

[1] Seagate Barracuda 7200.11 Serial ATA Product Manual rev C  August 2008
http://www.seagate.com/staticfiles/support/disc/manuals/desktop/Barracuda%207200.11/100507013c.pdf

[2] http://it.slashdot.org/article.pl?sid=09/01/21/0052236




Re: Dealing with Seagate's problematic 7200.11 firmware.

2009-01-26 Thread Raimo Niskanen
On Fri, Jan 23, 2009 at 09:28:34PM +, Dieter wrote:
 Recovering from Seagate's problematic 7200.11 firmware.
 
 Most of you have read about the problems with Seagate's
 7200.11 disks.  For those of you that haven't, the firmware
 on many of these drives is buggy, and can brick the drive
 when powering up or rebooting the system.  Thus far,

How can I know if I have a suspicious drive?

E.g.

# smartctl -i -d ata /dev/rwd1c
smartctl version 5.33 [i386-unknown-openbsd4.1] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     ST3808110AS
Serial Number:    5LRA2E2J
Firmware Version: 3.AJJ
User Capacity:    80,026,361,856 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Jan 26 15:31:45 2009 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled


Google for ST3808110AS gives me Barracuda 7200.9 SATA 80-GB Hard Drive,
so I guess this one is not suspicious, but I have more disks,
in other servers. What if I find a 7200.10, 7200.11, ES or ES.2,
is that enough for me to suspect it?
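
With many disks in several servers it may help to collect the model and firmware strings automatically. The sketch below only parses the information section of `smartctl -i` output; it makes no claim about which models or firmware versions are affected, since that list is exactly what is missing. The function name is invented and the sample values are the ones from the output above.

```python
def parse_identity(text: str) -> dict:
    """Extract 'Key: value' pairs from smartctl -i output.

    If a key appears more than once, the first value is kept.
    """
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            info.setdefault(key.strip(), value.strip())
    return info


sample = """\
=== START OF INFORMATION SECTION ===
Device Model:     ST3808110AS
Serial Number:    5LRA2E2J
Firmware Version: 3.AJJ
"""

info = parse_identity(sample)
print(info["Device Model"], info["Firmware Version"])
```

The resulting model/firmware pairs could then be checked by hand against whatever list Seagate or a dealer eventually provides.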



 Seagate's response has been less than wonderful.  We need
 a FLOSS solution.
 
 Goals:
 
   1) Ability to read the number of log entries.
 
   2) Ability to change the number of log entries.
 
   3) Ability to install new firmware from Unix.
 
 We need for this to work with any flavor of Unix,
 on any CPU arch, without reboot or power cycle.
 We need for this to work on one drive without affecting
 other drives.
 
 I don't expect to be able to write FLOSS firmware for the drives, so
 this isn't listed as a goal.  If you think you can, please feel free.
 
 The problem:
 
 IF the drive is powered down when there are 320 entries in this journal
 or log, then when it is powered back up, the drive errors out on init and
 won't boot properly - to the point that it won't even report its
 information to the BIOS.
 
   Maxtorman, slashdot discussion [2]
 
 If Maxtorman is correct, then once the drive has been operating awhile,
 we have a 1 in 320 chance that the circular log is at entry 320.  We want
 to be able to find out how many log entries the disk currently has, and
 we want to be able to change the number of log entries away from 320,
 while we wait for Seagate to get its act together and release firmware
 that works properly.  Since Seagate's solution will require attaching
 the drive to an x86 system and booting a FreeDOS ISO from CD, if the log
 is at 320 that boot will brick the drive.
 
 There are other firmware problems with the 7200.11 series, but this is
 the biggie.
 
 Once Seagate releases working firmware, we want to be able to install
 it from Unix, on any CPU arch.  Seagate's release can only install
 on x86 using FreeDOS.
 
 *ATA Commands that may be useful:
 
 command name                  command code in hex   page [1]   pdf page [1]
 Read Log Ext                  0x2F                  27         33
 S.M.A.R.T. Read Log Sector    0xB0 / 0xD5           28,34      34,40
 S.M.A.R.T. Write Log Sector   0xB0 / 0xD6           28,34      34,40
 Write Log Extended            0x3F                  28         34
 Download Microcode            0x92                  27         33
 
 Questions:
 
   Is Maxtorman correct about the 320 log entries?
 
   Are the commands listed above the ones we need?
   What is the difference between the Log Extended
   and the S.M.A.R.T. Log Sector?
   Is Microcode the same as firmware?  (Seagate uses
   the term firmware elsewhere in the manual, but I don't
   find any sort of write firmware command.)
 
   Where can we get more detailed info about these
   commands and how to use them?
 
 References:
 
 [1] Seagate Barracuda 7200.11 Serial ATA Product Manual rev C  August 2008
 http://www.seagate.com/staticfiles/support/disc/manuals/desktop/Barracuda%207200.11/100507013c.pdf
 
 [2] http://it.slashdot.org/article.pl?sid=09/01/21/0052236

-- 

/ Raimo Niskanen, Erlang/OTP, Ericsson AB



Re: OT: Hard Disk Problems (was: Re: Dealing with Seagate's problematic 7200.11 firmware.)

2009-01-26 Thread Nenhum_de_Nos
On Sun, January 25, 2009 16:01, Toni Mueller wrote:
 Hi,

 On Fri, 23.01.2009 at 21:28:34 +, Dieter
 open...@sopwith.solgatos.com wrote:
 Recovering from Seagate's problematic 7200.11 firmware.


 first off, several other product lines are affected, too. In
 particular, the popular ES and ES.2 server grade disks are also
 affected, to the best of my knowledge. Seagate only admits to problems
 with ES.2 drives, not ES drives, though.

Where did you read that?

I have a couple of 750GB ES.2 drives and now I'm worried!

matheus

-- 
We will call you cygnus,
The God of balance you shall be



Re: OT: Hard Disk Problems (was: Re: Dealing with Seagate's problematic 7200.11 firmware.)

2009-01-26 Thread Stuart Henderson
On 2009-01-26, Nenhum_de_Nos math...@eternamente.info wrote:
 On Sun, January 25, 2009 16:01, Toni Mueller wrote:
 Hi,

 On Fri, 23.01.2009 at 21:28:34 +, Dieter
 open...@sopwith.solgatos.com wrote:
 Recovering from Seagate's problematic 7200.11 firmware.


 first off, several other product lines are affected, too. In
 particular, the popular ES and ES.2 server grade disks are also
 affected, to the best of my knowledge. Seagate only admits to problems
 with ES.2 drives, not ES drives, though.

 Where did you read that?

 I have a couple of 750GB ES.2 drives and now I'm worried!

http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931&NewLang=en



Re: OT: Hard Disk Problems (was: Re: Dealing with Seagate's problematic 7200.11 firmware.)

2009-01-26 Thread Nenhum_de_Nos
On Mon, January 26, 2009 18:48, Stuart Henderson wrote:
 On 2009-01-26, Nenhum_de_Nos math...@eternamente.info wrote:
 On Sun, January 25, 2009 16:01, Toni Mueller wrote:
 Hi,

 On Fri, 23.01.2009 at 21:28:34 +, Dieter
 open...@sopwith.solgatos.com wrote:
 Recovering from Seagate's problematic 7200.11 firmware.


 first off, several other product lines are affected, too. In
 particular, the popular ES and ES.2 server grade disks are also
 affected, to the best of my knowledge. Seagate only admits to problems
 with ES.2 drives, not ES drives, though.

 Where did you read that?

 I have a couple of 750GB ES.2 drives and now I'm worried!

 http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931&NewLang=en

Thanks. Still OT, but I also heard that the new firmware from Seagate's
first attempt at a fix was worse than the old one. Has anyone already
updated an ES.2 with everything going fine?

thanks,

matheus

-- 
We will call you cygnus,
The God of balance you shall be



Re: OT: Hard Disk Problems (was: Re: Dealing with Seagate's problematic 7200.11 firmware.)

2009-01-26 Thread Dieter
Disk families affected:

Barracuda 7200.11, Barracuda ES.2 (SATA), DiamondMax 22, FreeAgent Desk,
Maxtor OneTouch 4, Pipeline HD, Pipeline HD Pro, SV35.3, SV35.4

Barracuda ES.2 SAS drive is not affected

All drives with a date of manufacture January 12, 2009 and later are
not affected by this issue

This condition was introduced by a firmware issue that sets the drive event
log to an invalid location causing the drive to become inaccessible.

The firmware issue is that the end boundary of the event log circular
buffer (320) was set incorrectly. During Event Log initialization,
the boundary condition that defines the end of the Event Log is off
by one. During power up, if the Event Log counter is at entry 320,
or a multiple of (320 + x*256), and if a particular data pattern
(dependent on the type of tester used during the drive manufacturing
test process) had been present in the reserved-area system tracks
when the drive's reserved-area file system was created during
manufacturing, firmware will increment the Event Log pointer past
the end of the event log data structure. This error is detected and
results in an Assert Failure, which causes the drive to hang as a
failsafe measure. When the drive enters failsafe, further updates to
the counter become impossible and the condition will remain through
subsequent power cycles. The problem only arises if a power cycle
initialization occurs when the Event Log is at 320 or some multiple
of 256 thereafter.
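
A minimal sketch of that condition as I read it, in Python. It assumes
the risky values are exactly 320 and 320 plus a whole multiple of 256,
which is only one reading of the wording above. The function names are
made up for illustration, and it says nothing about the second
requirement (the test pattern in the reserved area), which we cannot
check from here.

```python
EVENT_LOG_END = 320   # end boundary quoted in Seagate's description
STEP = 256            # spacing of later risky values, per the same text


def is_risky_count(count: int) -> bool:
    """True if count is 320, or 320 plus a whole multiple of 256."""
    return count >= EVENT_LOG_END and (count - EVENT_LOG_END) % STEP == 0


def next_risky_count(count: int) -> int:
    """Smallest risky value strictly greater than count."""
    if count < EVENT_LOG_END:
        return EVENT_LOG_END
    steps_passed = (count - EVENT_LOG_END) // STEP
    return EVENT_LOG_END + (steps_passed + 1) * STEP


if __name__ == "__main__":
    for value in (319, 320, 321, 576, 640):
        print(value, is_risky_count(value), next_risky_count(value))
```

If this reading is right, a counter sitting at 321 is safe only until
it reaches 576, so moving the counter is a stopgap and not a fix.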



Seagate says only on power up, but I'm pretty sure I have seen stories
of rebooting causing bricking.  Might be unrelated, but to play it safe
I will continue to avoid reboots.

So, we have confirmation of the number 320, and a formula for event counts
past 320.  We still need to find out if this Event Log counter is the
error count reported by smartmontools, or some other counter.

Ideally, I would like to find out how to read this reserved-area system
track, and how to set it to a safe value (I have seen zero, but this is
not confirmed).  If we can do this we don't need to update the firmware.

And we still want to find out how to update the firmware from Unix.



Re: OT: Hard Disk Problems (was: Re: Dealing with Seagate's problematic 7200.11 firmware.)

2009-01-26 Thread Dieter
Toni writes:

Is Maxtorman correct about the 320 log entries?
   My dealer told me a similar story, but I don't know where he had it
   from.
  
  I guess the next step is to find out if Maxtorman is correct about this
  320 log entries stuff, and if the SMART log entries as reported by
  smartmontools is the log to worry about, or if there is some other log.
 
 I don't have an account on /., and also feel incapable of actually
 working on this problem, but someone who has and can, could probably
 try to nag maxtorman about improving smartmontools to the point that
 they do the right thing, or try to get him to connect one to somebody
 else who can verify the issue and/or provide more technical details.
 
 If he can find a way to almost-anonymously post to /., he might be able
 to give some hints to the smartmontools guys, too. Then, we only need
 them to integrate everything and make a new release.

It is easy to set up a slashdot account.  Or you can post as anonymous
coward.  He set up the Maxtorman account to post anonymously; he mentioned
that he has another slashdot account that isn't anonymous.  The problem I
have is that I can't find a way to send him a PM (private message).  Most web
forums have a facility for sending other users a PM.  We can post a reply
to the thread, but he would have to read the thread again to see it.
Any slashdot wizards out there have an idea?

Your suggestion of smartmontools is helpful, thank you.

 Personally, I'd say that it'd be best if Seagate themselves would grab
 the opportunity to partially make good on the issue, but I heavily
 doubt that they understand, or want to understand, what's it about
 with FLOSS.

It isn't even just FLOSS.  Any non-x86 machine is out of luck.
Proprietary Unix is out of luck.  Anything embedded is out of luck.
Even Mac is probably out of luck.  And if the reboot to run the
firmware installer bricks the drive(s) even wintel is out of luck.

I don't understand the common corporate policy of keeping everything
secret.  All they are doing is hurting their previously loyal customers.
It didn't used to be this way.

Supposedly there was a broken test machine that didn't zero out some
special area after writing a test pattern.  So only drives that were
tested on that machine are at risk.  If we can find out what area
this is (I assume it isn't in the normal space used for user storage)
and how to zero it (if not already zero) there is no need to update
the firmware.

--
Raimo writes:

 How can I know if I have a suspicious drive?

Good question.  Seagate has some web page that supposedly will tell you,
but of course it is broken and doesn't work with all browsers.

 Google for ST3808110AS gives me Barracuda 7200.9 SATA 80-GB Hard Drive,
 so I guess this one is not suspicious, but I have more disks,
 in other servers. What if I find a 7200.10, 7200.11, ES or ES.2,
 is that enough for me to suspect it?

I haven't read anything about problems with 7200.10 or earlier.
Toni reports that ES and ES.2 may be affected.

--
Glenn writes:

 Just a hypothetical situation, since we do not have the source code of
 the firmware: isn't it possible some kind of mathematical operation
 is occurring on the number of log entries causing some kind of infinite
 loop to occur or a division that leads to/by 0 that the software/hardware
 is unable to handle? That could mean this problem could also manifest
 itself on for example multiples of 320, so just putting the counter on
 321 may just be delaying the inevitable. And what happens if the counter
 overflows and reaches 320 again?

From what I've read it sounds like the counter must be exactly 320 AND some
location must have a test pattern rather than zero when you init (power up
or reboot) the drive.  From Maxtorman's description, the log is circular,
so it will eventually wrap around to 320 again.

So keeping the counter away from 320 is an okay short term workaround,
but long term we want to either zero out the magic location or update the
firmware.

--
matheus writes:

 but I also heard that the new firmware from Seagate's first attempt
 at a fix was worse than the old one.

What I read is that the firmware itself was ok but the installer
program would brick a previously working drive.  But it didn't
brick it as badly as the firmware bug, you can still update the
firmware again once you get a proper update program.

===

There is supposed to be some document that explains all this,
with enough details to create a fix.  If anyone finds this
document I need a copy please.

If you have one or more of the suspect drives and it is running,
try to keep it running and don't reboot.  If it is powered down,
leave it powered down if possible until this all gets sorted out.



OT: Hard Disk Problems (was: Re: Dealing with Seagate's problematic 7200.11 firmware.)

2009-01-25 Thread Toni Mueller
Hi,

On Fri, 23.01.2009 at 21:28:34 +, Dieter open...@sopwith.solgatos.com 
wrote:
 Recovering from Seagate's problematic 7200.11 firmware.


first off, several other product lines are affected, too. In
particular, the popular ES and ES.2 server grade disks are also
affected, to the best of my knowledge. Seagate only admits to problems
with ES.2 drives, not ES drives, though.


 Seagate's response has been less than wonderful.  We need
 a FLOSS solution.

Right.

 We need for this to work with any flavor of Unix,

We need to do this from within a running system.

 We need for this to work on one drive without affecting
 other drives.

My first idea is that smartmontools probably provide much of the
required framework already, and could possibly be extended to work with
this situation, too.

 If Maxtorman is correct, then once the drive has been operating awhile,

Seagate sent me the following link

http://seagate.custkb.com/seagate/crm/selfservice/search.jsp?DocId=207931

which imho contributes to the impression of a less-than-stellar
response by stating "Based on the low risk as determined by an analysis
of actual field return data, Seagate believes that the affected drives
can be used as is." (current as of _now_).

 that works properly.  Since Seagate's solution will require attaching
 the drive to an x86 system and booting a FreeDOS ISO from CD, if the log
 is at 320 that boot will brick the drive.

As far as I understood, the firmware has a sort of a boot loader which
reads the actual firmware from the drive, and also writes new firmware
to the drive. This leads me to suspect that writing a modified boot
loader firmware which does not contain such log entry reading or
writing, could bypass the 'brickedness' caused by the broken firmware
which is actually on the platters (ie, which is what the boot loader
needs to load to begin with). So, if a modified boot loader would eg.
abstain from loading the firmware on the drive, the corruption of said
firmware on the drive would not occur, thus not blocking the remainder
of the hardware. However, if, and how, such a new boot loader could be
placed into the ???ROMs of the drive, I really don't know.

 Once Seagate releases working firmware, we want to be able to install
 it from Unix, on any CPU arch.  Seagate's release can only install
 on x86 using FreeDOS.

- smartmontools come to mind.

   Is Maxtorman correct about the 320 log entries?

My dealer told me a similar story, but I don't know where he had it
from.



Kind regards,
--Toni++



Re: OT: Hard Disk Problems (was: Re: Dealing with Seagate's problematic 7200.11 firmware.)

2009-01-25 Thread Dieter
  Recovering from Seagate's problematic 7200.11 firmware.
 
 
 first off, several other product lines are affected, too. In
 particular, the popular ES and ES.2 server grade disks are also
 affected, to the best of my knowledge. Seagate only admits to problems
 with ES.2 drives, not ES drives, though.

Word is the Maxtor Diamond Max 21 line is also affected.

 We need to do this from within a running system.

Yes.

 My first idea is that smartmontools probably provide much of the
 required framework already, and could possibly be extended to work with
 this situation, too.

Thanks.  I downloaded smartmontools, fixed a couple of ILP vs LP64 bugs,
and it appears to provide the number of SMART log entries.

  Is Maxtorman correct about the 320 log entries?
 
 My dealer told me a similar story, but I don't know where he had it
 from.

I guess the next step is to find out if Maxtorman is correct about this
320 log entries stuff, and if the SMART log entries as reported by
smartmontools is the log to worry about, or if there is some other log.
E.g. see the Read Log Ext and Write Log Extended commands I posted
yesterday.  I don't know if these use the same log as the SMART commands
or if this is something different.
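
For anyone who wants to track the number smartmontools reports, here is
a rough Python sketch that pulls the error count out of the text printed
by "smartctl -l error".  It assumes the output contains either an
"ATA Error Count: N" line or "No Errors Logged"; the function name is
made up, and whether this count has anything to do with the Event Log
counter in the firmware bug is exactly the open question.

```python
import re


def ata_error_count(smartctl_text: str):
    """Return the SMART error-log count found in the text, or None.

    Assumes the text came from `smartctl -l error <device>`.
    """
    if "No Errors Logged" in smartctl_text:
        return 0
    match = re.search(r"ATA Error Count:\s*(\d+)", smartctl_text)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    # Hypothetical sample text, not captured from a real drive.
    sample = (
        "SMART Error Log Version: 1\n"
        "ATA Error Count: 319 (device log contains only the most "
        "recent five errors)\n"
    )
    print(ata_error_count(sample))
```

Feed it the saved output of smartctl for one drive at a time, so that
nothing touches the other drives.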



Dealing with Seagate's problematic 7200.11 firmware.

2009-01-23 Thread Dieter
Recovering from Seagate's problematic 7200.11 firmware.

Most of you have read about the problems with Seagate's
7200.11 disks.  For those of you that haven't, the firmware
on many of these drives is buggy, and can brick the drive
when powering up or rebooting the system.  Thus far,
Seagate's response has been less than wonderful.  We need
a FLOSS solution.

Goals:

1) Ability to read the number of log entries.

2) Ability to change the number of log entries.

3) Ability to install new firmware from Unix.

We need for this to work with any flavor of Unix,
on any CPU arch, without reboot or power cycle.
We need for this to work on one drive without affecting
other drives.

I don't expect to be able to write FLOSS firmware for the drives, so
this isn't listed as a goal.  If you think you can, please feel free.

The problem:

IF the drive is powered down when there are 320 entries in this journal
or log, then when it is powered back up, the drive errors out on init and
won't boot properly - to the point that it won't even report its
information to the BIOS.

Maxtorman, slashdot discussion [2]

If Maxtorman is correct, then once the drive has been operating awhile,
we have a 1 in 320 chance that the circular log is at entry 320.  We want
to be able to find out how many log entries the disk currently has, and
we want to be able to change the number of log entries away from 320,
while we wait for Seagate to get its act together and release firmware
that works properly.  Since Seagate's solution will require attaching
the drive to an x86 system and booting a FreeDOS ISO from CD, if the log
is at 320 that boot will brick the drive.

There are other firmware problems with the 7200.11 series, but this is
the biggie.

Once Seagate releases working firmware, we want to be able to install
it from Unix, on any CPU arch.  Seagate's release can only install
on x86 using FreeDOS.

*ATA Commands that may be useful:

command name                  command code in hex   page [1]   pdf page [1]
Read Log Ext                  0x2F                  27         33
S.M.A.R.T. Read Log Sector    0xB0 / 0xD5           28,34      34,40
S.M.A.R.T. Write Log Sector   0xB0 / 0xD6           28,34      34,40
Write Log Extended            0x3F                  28         34
Download Microcode            0x92                  27         33
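
To make the table concrete, here is a small Python sketch of the
register values a S.M.A.R.T. Read Log Sector command would carry.  The
command and feature codes are the ones in the table.  The 0x4F / 0xC2
signature bytes and the placement of the log address in LBA Low come
from the ATA standard as I understand it, not from the Seagate manual,
so treat them as assumptions to verify.  The Taskfile type and the
function name are made up, and nothing here talks to a drive.

```python
from dataclasses import dataclass

# Command codes from the table above.
ATA_CMD_READ_LOG_EXT = 0x2F
ATA_CMD_WRITE_LOG_EXT = 0x3F
ATA_CMD_DOWNLOAD_MICROCODE = 0x92
ATA_CMD_SMART = 0xB0

# SMART subcommands, passed in the Features register.
SMART_READ_LOG = 0xD5
SMART_WRITE_LOG = 0xD6

# SMART signature bytes (assumption: taken from the ATA standard).
SMART_LBA_MID = 0x4F
SMART_LBA_HIGH = 0xC2


@dataclass(frozen=True)
class Taskfile:
    """Hypothetical container for the registers of one ATA command."""
    command: int
    feature: int
    sector_count: int
    lba_low: int
    lba_mid: int
    lba_high: int


def smart_read_log(log_address: int, sectors: int = 1) -> Taskfile:
    """Build the registers for reading `sectors` sectors of one log."""
    if not 0 <= log_address <= 0xFF:
        raise ValueError("log address must fit in one byte")
    if not 1 <= sectors <= 0xFF:
        raise ValueError("sector count must be between 1 and 255")
    return Taskfile(ATA_CMD_SMART, SMART_READ_LOG, sectors,
                    log_address, SMART_LBA_MID, SMART_LBA_HIGH)


if __name__ == "__main__":
    # Log address 0x01 is used here purely as an example value.
    print(smart_read_log(0x01))
```

Actually sending such a command needs an OS-specific passthrough
interface, which is the part smartmontools already handles.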

Questions:

Is Maxtorman correct about the 320 log entries?

Are the commands listed above the ones we need?
What is the difference between the Log Extended
and the S.M.A.R.T. Log Sector?
Is Microcode the same as firmware?  (Seagate uses
the term firmware elsewhere in the manual, but I don't
find any sort of write firmware command.)

Where can we get more detailed info about these
commands and how to use them?

References:

[1] Seagate Barracuda 7200.11 Serial ATA Product Manual rev C  August 2008
http://www.seagate.com/staticfiles/support/disc/manuals/desktop/Barracuda%207200.11/100507013c.pdf

[2] http://it.slashdot.org/article.pl?sid=09/01/21/0052236