SCSI + camcontrol

2007-05-04 Thread Grant Peel
Hello all,

A few weeks back I turned on mod_gzip in Apache, and as a result the /tmp
directory filled up with .wrk files, causing the root filesystem to fill to
capacity. When we noticed what was happening on May 1, we had no choice but to
cold boot the machine, as it was for all practical purposes locked up.

In the security run output for May 1 and May 3, I am seeing the SCSI errors
below.

FreeBSD 4.7 (yes, we are going to upgrade soon; we are migrating to a newly
set up machine)
Apache 1.3.26
We do have complete dumps (from May 1)
The machine is a vintage 2003 Dell SC1400
HD = 1 Fujitsu SCSI drive that has never had problems before

Questions:

Do the errors below TRULY indicate pending doom?

Can camcontrol be used to squash the errors?

Should fsck be used to fix them?

Are these errors (the text below) formatted by the FreeBSD kernel, or are
they shown as reported by the HD subsystem? I.e., where can I go to read what
the errors actually mean?

Thanks all,

-Grant

May 3:

May  3 03:59:13 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 4 21 7a df 0 0 80 0
May  3 03:59:13 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:4217b55 asc:11,1
May  3 03:59:14 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f
May  3 03:59:16 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 4 21 7a ef 0 0 70 0
May  3 03:59:18 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:4217b55 asc:11,1
May  3 03:59:18 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f
May  3 03:59:20 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 4 21 7a ff 0 0 60 0
May  3 03:59:21 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:4217b55 asc:11,1
May  3 03:59:22 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f
May  3 03:59:24 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 4 21 7b f 0 0 50 0
May  3 03:59:27 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:4217b55 asc:11,1
May  3 03:59:28 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f
May  3 03:59:29 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 4 21 7b 1f 0 0 40 0
May  3 03:59:29 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:4217b55 asc:11,1
May  3 03:59:29 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f
May  3 03:59:32 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 4 21 7b 2f 0 0 30 0
May  3 03:59:33 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:4217b55 asc:11,1
May  3 03:59:35 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f
May  3 03:59:36 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 4 21 7b 3f 0 0 20 0
May  3 03:59:36 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:4217b55 asc:11,1
May  3 03:59:36 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f
May  3 03:59:42 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 4 21 7b 4f 0 0 10 0
May  3 03:59:42 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:4217b55 asc:11,1
May  3 03:59:43 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f
May  3 03:59:45 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 4 21 7b 4f 0 0 10 0
May  3 03:59:47 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:4217b55 asc:11,1
May  3 03:59:48 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f
May  3 03:59:49 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 4 21 7b 4f 0 0 10 0
May  3 03:59:49 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:4217b55 asc:11,1
May  3 03:59:49 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f

May 1:

May  1 03:29:28 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 3 ab d5 c1 0 0 e 0
May  1 03:29:31 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:3abd5c1 asc:11,1
May  1 03:29:31 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f
May  1 03:29:32 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 3 ab d5 c1 0 0 1 0
May  1 03:29:32 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:3abd5c1 asc:11,1
May  1 03:29:32 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f

___
freebsd-questions@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-questions
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: SCSI + camcontrol

2007-05-04 Thread Dan Nelson
In the last episode (May 04), Grant Peel said:
> A few weeks back I turned on mod_gzip in Apache, and as a result
> the /tmp directory filled up with .wrk files, causing the root
> filesystem to fill to capacity. When we noticed what was happening
> on May 1, we had no choice but to cold boot the machine, as it was
> for all practical purposes locked up.
>
> In the security run output for May 1 and May 3, I am seeing the SCSI
> errors below.
>
> FreeBSD 4.7 (yes, we are going to upgrade soon; we are migrating to a
> newly set up machine)
> Apache 1.3.26
> We do have complete dumps (from May 1)
> The machine is a vintage 2003 Dell SC1400
> HD = 1 Fujitsu SCSI drive that has never had problems before
>
> Questions:
>
> Do the errors below TRULY indicate pending doom?
>
> Can camcontrol be used to squash the errors?
>
> Should fsck be used to fix them?
>
> Are these errors (the text below) formatted by the FreeBSD kernel,
> or are they shown as reported by the HD subsystem? I.e., where can I
> go to read what the errors actually mean?

Those are errors reported by the drive:
 
> May  3 03:59:13 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 4 21 7a df 0 0 80 0
> May  3 03:59:13 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:4217b55 asc:11,1
> May  3 03:59:14 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f
>
> May  1 03:29:28 excelsior /kernel: (da0:ahc0:0:1:0): READ(10). CDB: 28 0 3 ab d5 c1 0 0 e 0
> May  1 03:29:31 excelsior /kernel: (da0:ahc0:0:1:0): MEDIUM ERROR info:3abd5c1 asc:11,1
> May  1 03:29:31 excelsior /kernel: (da0:ahc0:0:1:0): Read retries exhausted sks:80,3f

The drive tried to read the indicated block numbers (0x4217b55 and
0x3abd5c1) and couldn't, even after multiple retries.  If it had been
able to recover the data after retrying, it would have reallocated the
block to a spare sector.
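
For anyone who wants to check that reading against the log, here is a
rough Python sketch (not part of the original exchange) that decodes the
READ(10) CDB bytes the kernel printed and confirms that the block in the
"info:" field falls inside the request.  The function names are made up
for illustration, and the 512-byte sector size is an assumption about
this drive, not something the log states.

```python
SECTOR_SIZE = 512  # assumption: the log does not state the drive's block size


def parse_read10(cdb_text):
    """Decode a READ(10) CDB printed as space-separated hex bytes.

    Returns (starting_lba, transfer_length_in_blocks).
    """
    cdb = [int(tok, 16) for tok in cdb_text.split()]
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    # Bytes 2-5 hold the starting logical block address, big-endian.
    lba = int.from_bytes(bytes(cdb[2:6]), "big")
    # Bytes 7-8 hold the transfer length in blocks, big-endian.
    length = int.from_bytes(bytes(cdb[7:9]), "big")
    return lba, length


def covers(cdb_text, failed_lba):
    """True if failed_lba lies inside the range the CDB asked for."""
    lba, length = parse_read10(cdb_text)
    return lba <= failed_lba < lba + length


# First request from each day, paired with the block from its "info:" field.
samples = [
    ("28 0 4 21 7a df 0 0 80 0", 0x4217B55),
    ("28 0 3 ab d5 c1 0 0 e 0", 0x3ABD5C1),
]

for cdb_text, failed in samples:
    lba, length = parse_read10(cdb_text)
    print(
        f"request: lba=0x{lba:x} blocks={length}  "
        f"failed block 0x{failed:x} in range: {covers(cdb_text, failed)}  "
        f"byte offset on disk: {failed * SECTOR_SIZE}"
    )
```

Run against the May 3 lines, each of the shrinking requests still
includes block 0x4217b55, which would fit a single unreadable sector
being hit repeatedly, though the log alone can't prove that.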

There isn't an easy way to map a raw block number to a filename, but if
you can determine that the files belonging to the blocks were old, your
drive is probably still okay, and you happened to trip over some weak
spots on the disk that lost their data over time.  If they were
recently-generated files, then I'd start worrying about getting that
new system up as soon as possible.

One thing to try would be dd if=/dev/da0 of=/dev/null bs=64k, and see
how many more errors get generated.  Installing smartmontools and
comparing the output of smartctl -a /dev/da0 before and after will
also tell you how many ECC recoveries and rereads were done.
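
If you'd rather collect the failing offsets than watch the console, a
small Python loop can do roughly the same front-to-back read.  This is
only a sketch under assumptions: the function name, the 64k chunk size
(borrowed from the dd command above), and the max_errors cutoff are all
illustrative, and it has not been tried on a FreeBSD 4.x box.

```python
import os
import sys


def scan_for_read_errors(path, block_size=64 * 1024, max_errors=100):
    """Read `path` front to back and return byte offsets of failed chunks.

    A chunk that raises OSError is recorded and skipped.  The scan stops
    at end of file/device or once `max_errors` failures have been seen.
    """
    bad_offsets = []
    fd = os.open(path, os.O_RDONLY)
    try:
        offset = 0
        while len(bad_offsets) < max_errors:
            try:
                chunk = os.pread(fd, block_size, offset)
            except OSError:
                # Could not read this chunk: note it and move past it.
                bad_offsets.append(offset)
                offset += block_size
                continue
            if not chunk:
                break  # end of file or device
            offset += len(chunk)
    finally:
        os.close(fd)
    return bad_offsets


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python scan.py /dev/da0   (needs read access to the device)
    for off in scan_for_read_errors(sys.argv[1]):
        print(f"read error in chunk at byte offset {off}")
```

Note that dd itself normally stops at the first read error unless you
add conv=noerror, so a loop like this (or that option) is what lets the
scan continue past a bad spot.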

-- 
Dan Nelson
[EMAIL PROTECTED]