On Thu, Jan 21, 2010 at 03:55:59PM +0100, Matthias Appel wrote:
> I have a serious issue with my zpool.
Yes. You need to figure out what the root cause of the issue is.

> My zpool consists of 4 vdevs which are assembled into 2 mirrors.
>
> One of these mirrors got degraded because of too many errors on each
> vdev of the mirror.
> Yes, both vdevs of the mirror got degraded.

Yes, but note they're still active, just in a degraded state. You
should be able to read the data, with care and with some luck that the
errors don't align to take out both copies of a particular chunk of
data.

> According to Murphy's law I don't have a backup as well (I have a
> backup, which was made several months ago, and some backups spread
> across several disks).
> Both of these backups are not the best, so I want to access my data
> on the zpool so I can make a backup and replace the OpenSolaris
> server.

If you have them, you may be able to combine them with the current
data in the pool to reconstruct what you need.

> As the two faulted vdevs are connected to different controllers, I
> assume that the problem is located on the server, not on the hard
> disks/controllers. One of the faulted hard disks was replaced some
> weeks ago due to CRC errors, so I assume the server is bad, not the
> disks/cables/controllers.

Yeah, especially since these are checksum errors. Prime suspects would
be bad memory and an inadequate power supply. Bad motherboard or PCI,
dust, temperature and others follow after that.

> My state is as follows:
>
>         NAME          STATE     READ WRITE CKSUM
>         performance   DEGRADED     0     0     8
>           mirror      DEGRADED     0     0    16
>             c1t1d0    DEGRADED     0     0    23  too many errors
>             c2d0      DEGRADED     0     0    24  too many errors
>           mirror      ONLINE       0     0     0
>             c1t2d0    ONLINE       0     0     7
>             c3d0      ONLINE       0     0     7

Note that you have cksum errors on both mirror vdevs, even if the
second one has not yet been marked degraded. It's curious that the
mirror pairs are showing basically the same number of errors; this
suggests that maybe the corruption occurred in the original data as it
was written, and both copies are indeed bad.

I would guess from this that c2/c3 are cmdk (IDE, or older SATA, or
not in AHCI mode), which are not hot-plug, and that's why you don't
see them in cfgadm. Those aren't ideal controllers for a
performance-oriented pool anyway - another reason to start thinking
about new hardware options.

> Is there a possibility to force the two degraded vdevs online, so I
> can fully access the zpool and do a backup?

They are online, but the rate of errors is a concern. I would stop
writing to this pool to avoid further damage, if you haven't already.
You probably want to set the pool's failmode property to "continue" to
maximise the amount of data you can get off (rough commands are
sketched below).

The nice thing about zfs is, in general, you know if you're getting a
good backup off the pool. If you have bad memory, though, things can
get corrupted once they're out of zfs's hands.

I would recommend zfs send | zfs recv as the method of making the
backup, rather than some other tool that might not notice corruption
through memory. If you get errors in the backup stream, don't panic -
they may be introduced after the data is read from disk, and a retry
later might not hit the same spot. If they're in the source data on
disk, you will need to switch to a file-by-file copy to read
everything you can around those errors, once you find where they are.

The first thing I would do is shut down the box and run memtest86+ for
24h or so. Look into your options for replacement parts or a
replacement box while that runs.

If you can get the disks out and into another known-good server, that
might be a good idea, but take care of them. Note that you don't
absolutely have to have all 4 disks online in that server - if you
have space to copy the disks with dd one at a time to some other
storage, that would work too.

Look at your FMA logs for other clues and reports of errors. You may
want to scrub the pool, to see what data has been damaged, but I
wouldn't do that until you have resolved the root cause, lest you
"repair" good data with bad.
> I wanted to ask first, before doing any stupid things and losing the
> whole pool.

Wise, if you have the luxury of time to wait for advice. Let us know
what you find and we can make more suggestions.

-- 
Dan.