wd0 read timeouts - how to proceed?

2010-12-24 Thread Webcharge
Must be the holiday season *sigh* my OpenBSD server is suddenly 
giving the occasional read-timeout on the /var slice of the main harddisk:


---
wd0(pciide0:0:0): timeout
type: ata
c_bcount: 65536
c_skip: 0
wd0g: device timeout reading fsbn 17002464 of 17002464-17002591 (wd0 bn 
67334928; cn 66800 tn 8 sn 24), retrying

wd0: soft error (corrected)
---

Is this the actual disk or the controller/other hardware? Either way it 
needs a fix.


My problem is this is a live system that is not close by. I would very 
much prefer to 'fix' this remotely to buy some time to replace the 
machine completely.
I do have offsite backups of essential data but not a spare system in 
the rack at this very moment.

Not to mention I would like to avoid spending X-mas alone in the datacenter.

There is a second harddisk installed, with OpenBSD-formatted slices, but 
of different proportions. This (larger) disk is unused, so its data and 
layout may be wiped, and it seems like a smart idea to at least copy the 
data over to it.


Can I just copy /var (wd0g) to /var2 (wd1i) and remount, or should I 
proceed otherwise? Or would copying and remounting /var simply not work 
on a live system?


Or, possibly, I could 'clone' the whole wd0 disk to wd1 and use that 
instead of wd0?
I understood you need to boot into single-user mode for this [1] and/or 
have identical disks [2], or is there another (remote-safe) way?


Any advice is highly appreciated!

Thanks, and happy holidays,

Matt

[1] http://unixsadm.blogspot.com/2007/08/cloning-disk-in-openbsd.html
[2] http://monkey.org/openbsd/archive/tech/0112/msg00079.html



Re: wd0 read timeouts - how to proceed?

2010-12-24 Thread Joachim Schipper
On Fri, Dec 24, 2010 at 11:00:48AM +0100, Webcharge wrote:
 Must be the holiday season *sigh* my OpenBSD server is suddenly
 giving the occasional read-timeout on the /var slice of the main
 harddisk:

 There is a second harddisk installed, with OpenBSD formatted slices,
 but of different proportions. This (larger) disk is unused, so data
 / layout may be wiped,
 so it seems like a smart idea to copy the data at least (I do have
 offsite backups of essential data but not a spare system in the rack
 at this very moment)
 
 Can I just copy /var (wd0g)  to /var2 (wd1i) and remount or should
 I proceed otherwise or would copy/remounting /var simply not work on
 a live system?

If the system is quiet, you can try 'sync; sync; dd ...; fsck', but
something like 'tar cpf - | tar xpf -' is more likely to get you a
somewhat consistent view. Change /etc/fstab and reboot (you *can* try
mounting the new /var over the old one, but you'll want to play with
fstat -n to see which processes are still accessing the old /var.)

Of course, this isn't guaranteed to work. In particular, if something is
actually writing to /var, your view won't be consistent. Even more in
particular, don't try this with running databases.

Joachim
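
A minimal sketch of the tar pipeline suggested above, written out in full. The function name copy_tree is hypothetical, and /var and /var2 are the mount points from the original question. It assumes the target filesystem is already mounted and that nothing is writing to the source while the copy runs.

```shell
#!/bin/sh
# Hypothetical helper: copy a directory tree with a tar pipeline,
# preserving permissions (the 'p' flag). Not guaranteed to give a
# consistent copy if files change while it runs.
copy_tree() {
    src=$1
    dst=$2
    if [ ! -d "$src" ] || [ ! -d "$dst" ]; then
        echo "usage: copy_tree SRC_DIR DST_DIR" >&2
        return 1
    fi
    # Archive relative to SRC and unpack relative to DST, so no
    # absolute paths end up in the stream.
    (cd "$src" && tar cpf - .) | (cd "$dst" && tar xpf -)
}

# Example from the thread (run as root): copy_tree /var /var2
```

Read errors from the failing disk would show up as tar diagnostics on stderr, so it is worth capturing that output when running this remotely.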



Re: wd0 read timeouts - how to proceed?

2010-12-24 Thread Vadim Zhukov
2010/12/24 Joachim Schipper joac...@joachimschipper.nl:
 something like 'tar cpf - | tar xpf -' is more likely to get you a
 somewhat consistent view.

POSIX pax(1) with -rw options should work slightly faster (and it's
already faster to type ;) ).

--
  WBR,
  Vadim Zhukov
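
For comparison, a sketch of the same copy using pax in copy mode (-rw). The function name copy_tree_pax is hypothetical, and the -pe option (preserve everything) is an addition here, not something stated in the message above. The destination has to be an absolute path because the function changes into the source directory first.

```shell
#!/bin/sh
# Hypothetical helper: copy a directory tree with pax in copy mode.
# DST_DIR must be an absolute path to an existing directory.
copy_tree_pax() {
    src=$1
    dst=$2
    if ! command -v pax >/dev/null 2>&1; then
        echo "pax not found" >&2
        return 127
    fi
    # -rw: read and write together (copy mode)
    # -pe: preserve owner, mode and timestamps where permitted
    (cd "$src" && pax -rw -pe . "$dst")
}

# Example from the thread (run as root): copy_tree_pax /var /var2
```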



Re: wd0 read timeouts - how to proceed?

2010-12-24 Thread Chris Smith
On Fri, Dec 24, 2010 at 5:00 AM, Webcharge webcha...@gmx.net wrote:
 Is this the actual disk or the controller/other hardware?

If the hardware is SMART-aware, installing smartmontools and running
smartctl may give you a clue.



Re: wd0 read timeouts - how to proceed?

2010-12-24 Thread Gabriel Linder

On 12/24/10 17:09, Chris Smith wrote:
 On Fri, Dec 24, 2010 at 5:00 AM, Webcharge webcha...@gmx.net wrote:
  Is this the actual disk or the controller/other hardware?

 If the hardware is smart aware installing smartmontools and running
 smartctl may give you a clue.

atactl(8) works just fine.
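
For reference, a sketch of the SMART queries that atactl(8) offers in the base system. smart_check is a hypothetical wrapper name and wd0 is the disk from the original report. It assumes the drive supports SMART and that the commands are run as root.

```shell
#!/bin/sh
# Hypothetical wrapper around atactl(8) SMART subcommands.
# Defaults to wd0, the disk reporting timeouts in this thread.
smart_check() {
    dev=${1:-wd0}
    # Turn SMART on, ask for the overall health verdict, then list
    # the attribute values and their thresholds.
    atactl "$dev" smartenable &&
    atactl "$dev" smartstatus &&
    atactl "$dev" readattr
}

# Example: smart_check wd0
```

A clean SMART report would point the suspicion toward the cable or controller rather than the disk, though SMART does not catch every failing drive.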