Re: Likelihood of read error, recover device failure raid10
On Sunday, August 14, 2016 10:20:39 AM CEST you wrote:
> On Sat, Aug 13, 2016 at 9:39 AM, Wolfgang Mader
> <wolfgang_ma...@brain-frog.de> wrote:
> > Hi,
> >
> > I have two questions
> >
> > 1) Layout of raid10 in btrfs
> > btrfs pools all devices and then stripes and mirrors across this pool. Is
> > it therefore correct that a raid10 layout consisting of 4 devices
> > a,b,c,d is _not_
> >
> >            raid0
> >              |
> >       |-------------|
> >     |a| |b|       |c| |d|
> >      raid1         raid1
> >
> > Rather, there is no clear distinction at the device level between two
> > devices which form a raid1 set and are then paired by raid0; instead,
> > each bit is mirrored across two different devices. Is this correct?
>
> All of the profiles apply to block groups (chunks), and that includes
> raid10. They only incidentally apply to devices since of course block
> groups end up on those devices, but which stripe ends up on which
> device is not consistent, and that ends up making Btrfs raid10 pretty
> much only able to survive a single device loss.
>
> I don't know if this is really thoroughly understood. I just did a
> test and I kinda wonder if the reason for this inconsistent assignment
> is a difference between the initial stripe->devid pairing at mkfs time,
> compared to subsequent pairings done by kernel code.
> For example, I get this from mkfs:
>
>     item 4 key (FIRST_CHUNK_TREE CHUNK_ITEM 20971520) itemoff 15715 itemsize 176
>         chunk length 16777216 owner 2 stripe_len 65536
>         type SYSTEM|RAID10 num_stripes 4
>             stripe 0 devid 4 offset 1048576
>             dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
>             stripe 1 devid 3 offset 1048576
>             dev uuid: af95126a-e674-425c-af01-2599d66d9d06
>             stripe 2 devid 2 offset 1048576
>             dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
>             stripe 3 devid 1 offset 20971520
>             dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
>     item 5 key (FIRST_CHUNK_TREE CHUNK_ITEM 37748736) itemoff 15539 itemsize 176
>         chunk length 2147483648 owner 2 stripe_len 65536
>         type METADATA|RAID10 num_stripes 4
>             stripe 0 devid 4 offset 9437184
>             dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
>             stripe 1 devid 3 offset 9437184
>             dev uuid: af95126a-e674-425c-af01-2599d66d9d06
>             stripe 2 devid 2 offset 9437184
>             dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
>             stripe 3 devid 1 offset 29360128
>             dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
>     item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 2185232384) itemoff 15363 itemsize 176
>         chunk length 2147483648 owner 2 stripe_len 65536
>         type DATA|RAID10 num_stripes 4
>             stripe 0 devid 4 offset 1083179008
>             dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
>             stripe 1 devid 3 offset 1083179008
>             dev uuid: af95126a-e674-425c-af01-2599d66d9d06
>             stripe 2 devid 2 offset 1083179008
>             dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
>             stripe 3 devid 1 offset 1103101952
>             dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
>
> Here you can see every chunk type has the same stripe to devid
> pairing. But once the kernel starts to allocate more data chunks, the
> pairing is different from mkfs, yet always (so far) consistent for
> each additional kernel-allocated chunk.
>     item 7 key (FIRST_CHUNK_TREE CHUNK_ITEM 4332716032) itemoff 15187 itemsize 176
>         chunk length 2147483648 owner 2 stripe_len 65536
>         type DATA|RAID10 num_stripes 4
>             stripe 0 devid 2 offset 2156920832
>             dev uuid: 1c3038ca-2615-414e-9383-d326b942f647
>             stripe 1 devid 3 offset 2156920832
>             dev uuid: af95126a-e674-425c-af01-2599d66d9d06
>             stripe 2 devid 4 offset 2156920832
>             dev uuid: 736ba7b3-f21f-4643-8a59-9869b3526a82
>             stripe 3 devid 1 offset 2176843776
>             dev uuid: 969a95d3-d76d-44dc-9364-9d1f6e449a74
>
> This volume now has about a dozen chunks created by kernel code, and
> the stripe X to devid Y mapping is identical. Using dd and hexdump,
> I'm finding that stripe 0 and 1 are mirrored pairs; they contain
> identical information. And stripe 2 and 3 are mirrored pairs. And the
> raid0 striping happens across 01 and 23 such that odd-numbered 64KiB
> (default) stripe elements go on 01, and even-numbered stripe elements
> go on 23. If the stripe
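To check the stripe-to-devid pairing on one's own filesystem, the mapping can be extracted from chunk-tree dump output of the shape quoted above. A minimal sketch (the awk field positions assume lines formatted like the excerpt; here it is exercised on a saved excerpt via a heredoc rather than a live `btrfs inspect-internal dump-tree -t chunk <dev>` run):

```shell
#!/bin/sh
# Sketch: list the stripe -> devid pairing per chunk from
# dump-tree style output (field positions assume the format above).
stripe_map() {
    awk '/CHUNK_ITEM/ { gsub(/\)/, "", $6); chunk = $6 }
         /stripe [0-9]+ devid/ { printf "chunk %s stripe %s -> devid %s\n", chunk, $2, $4 }'
}

# Exercise it on a saved excerpt from the dump quoted above:
stripe_map <<'EOF'
item 6 key (FIRST_CHUNK_TREE CHUNK_ITEM 2185232384) itemoff 15363 itemsize 176
    stripe 0 devid 4 offset 1083179008
    stripe 1 devid 3 offset 1083179008
    stripe 2 devid 2 offset 1083179008
    stripe 3 devid 1 offset 1103101952
EOF
```

Against a live filesystem one would pipe the dump-tree output into `stripe_map` instead of the heredoc; a change of pairing between mkfs-created and kernel-created chunks then shows up directly in the listing.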
Likelihood of read error, recover device failure raid10
Hi,

I have two questions.

1) Layout of raid10 in btrfs
btrfs pools all devices and then stripes and mirrors across this pool. Is it therefore correct that a raid10 layout consisting of 4 devices a,b,c,d is _not_

           raid0
             |
      |-------------|
    |a| |b|       |c| |d|
     raid1         raid1

Rather, there is no clear distinction at the device level between two devices which form a raid1 set and are then paired by raid0; instead, each bit is mirrored across two different devices. Is this correct?

2) Recover raid10 from a failed disk
Raid10 inherits its redundancy from the raid1 scheme. If I build a raid10 from n devices, each bit is mirrored across two devices. Therefore, in order to restore a raid10 from a single failed device, I need to read the amount of data worth this device from the remaining n-1 devices. In case the amount of data on the failed disk is on the order of the number of bits for which I can expect an unrecoverable read error from a device, I will most likely not be able to recover from the disk failure. Is this conclusion correct, or am I missing something here?

Thanks,
Wolfgang
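The worry in question (2) can be made quantitative. For a per-bit unrecoverable-read-error (URE) rate BER and D bytes read during the rebuild, P(at least one URE) = 1 - (1 - BER)^(8D) ≈ 1 - exp(-8·D·BER). A small sketch (the 1 TB rebuild size and the 1e-14 spec-sheet rate are illustrative assumptions, not figures from this thread):

```shell
#!/bin/sh
# Hedged sketch: probability of hitting at least one unrecoverable read
# error while reading `bytes` of data, given a per-bit error rate `ber`.
# Uses the approximation P = 1 - exp(-bits * ber), valid since ber << 1.
ure_prob() {
    awk -v bytes="$1" -v ber="$2" 'BEGIN {
        printf "%.3f\n", 1 - exp(-bytes * 8 * ber)
    }'
}

# Illustrative numbers (assumptions): rebuild reads 1 TB, URE rate 1e-14/bit.
ure_prob 1000000000000 1e-14   # prints 0.077
```

So under these assumed numbers a 1 TB rebuild has roughly an 8% chance of hitting at least one URE; the conclusion in the question kicks in once the data to be read approaches 1/BER bits.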
Concurrent write access
Hi,

I have a btrfs raid10 which is connected to a server hosting multiple virtual machines. Does btrfs support connecting the same subvolumes of the same raid to multiple virtual machines for concurrent read and write? The situation would be the same as, say, mounting user homes from the same nfs share on different machines.

Thanks,
Wolfgang
Re: Concurrent write access
On Thursday 09 July 2015 22:06:09 Hugo Mills wrote:
> On Thu, Jul 09, 2015 at 11:34:40PM +0200, Wolfgang Mader wrote:
> > Hi,
> >
> > I have a btrfs raid10 which is connected to a server hosting multiple
> > virtual machines. Does btrfs support connecting the same subvolumes of
> > the same raid to multiple virtual machines for concurrent read and
> > write? The situation would be the same as, say, mounting user homes
> > from the same nfs share on different machines.
>
> It'll depend on the protocol you use to make the subvolumes visible
> within the VMs. btrfs subvolumes aren't block devices, so that rules
> out most of the usual approaches. However, there are two methods I've
> used which I can confirm will work well: NFS and 9p.
>
> NFS will work as a root filesystem, and will work with any host/guest,
> as long as there's a network connection between the two.
>
> 9p is, at least in theory, faster (particularly with virtio), but won't
> let you boot with the 9p device as your root FS. You'll need
> virtualiser support if you want to run a virtio 9p -- I know qemu/kvm
> supports this; I don't know if anything else supports it.

Thanks for the overview. It is qemu/kvm in fact, so this is an option. Right now, however, I connect the discs as virtual discs and not the file system, but only to one virtual machine.

Best,
Wolfgang

> You can probably use Samba/CIFS as well. It'll be slower than the
> virtualised 9p, and not be able to host a root filesystem. I haven't
> tried this one, because Samba and I get on like a house on fire(*).
>
> Hugo.
>
> (*) Screaming, shouting, people running away, emergency services.
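For the qemu/kvm case Hugo mentions, a virtio-9p export looks roughly like the following sketch. The path `/pools/dataPool/vmshare` and the mount tag `hostshare` are made-up placeholder names, and exact flags can vary between qemu versions; treat this as an outline, not a verified invocation:

```shell
# Host side: export a subvolume into the guest via virtio-9p
# (path, ids, and tag are placeholders).
qemu-system-x86_64 ... \
  -fsdev local,id=fsdev0,path=/pools/dataPool/vmshare,security_model=mapped-xattr \
  -device virtio-9p-pci,fsdev=fsdev0,mount_tag=hostshare

# Guest side: mount the shared tree by its tag.
mount -t 9p -o trans=virtio,version=9p2000.L hostshare /mnt/share
```

Because the host's btrfs is the only party actually touching the block devices, several guests can mount the same export concurrently, which is exactly what makes 9p (or NFS) safe where handing the same virtual disk to two VMs is not.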
Re: How to get the devid of a missing device
On Monday, April 27, 2015 02:11:05 AM Duncan wrote:
> Wolfgang Mader posted on Sun, 26 Apr 2015 20:39:34 +0200 as excerpted:
> > Hello,
> >
> > I have a raid10 with one device missing. I would like to use btrfs
> > replace to replace it. However, I am unsure on how to obtain the devid
> > of the missing device.
>
> The devid is if the device is still active in the filesystem. If it's
> missing...
>
>     btrfs device delete missing
>
> That, along with a bunch of other likely helpful information, is
> covered on the wiki: https://btrfs.wiki.kernel.org
>
> Specifically for that (reassemble from the wrap, too lazy to fiddle
> with it on my end): https://btrfs.wiki.kernel.org/index.php/Using_Btrfs_with_Multiple_Devices#Replacing_failed_devices
>
> But since it was a four-device raid10 and four devices is the minimum
> for that, I don't believe it'll let you delete until you device add,
> first.

Thanks for your answer. I know the stuff of that page, but since it is possible to use btrfs replace on a missing device I wanted to try that approach.

> Of course that means probably some hours for the add, then some more
> hours for the delete missing, during which you obviously hope not to
> lose another device.
>
> Of course the sysadmin's backup rule of thumb applies -- if you don't
> have a backup, by definition, loss of the data isn't a big deal, or
> you'd have it backed up. (And the corollary, it's not a backup until
> it's tested, applies as well.) So you shouldn't have to worry about
> loss of a second device during that time, because it's either backed up
> or the data isn't worth the trouble to backup and thus loss of it with
> the loss of a second device isn't a big deal.

Well, of course, there is a backup on a different machine. It's in the same room, but who has the luxury of off-site backups in a home-use setting! :)

> That page doesn't seem to cover direct replacement, probably because
> the replace command is new and it hasn't been updated.
>
> But AFAIK replace doesn't work with a missing device anyway; it's the
> fast way to replace a still listed device, so you don't have to add and
> then delete, but the device has to still be there in order to use that
> shortcut. (You could try using missing with the -r option tho, just in
> case it works now. The wiki/manpage is a bit vague on that point.)
>
> > Btw, the file system is too old for skinny metadata and extended inode
> > refs. If I do a btrfs replace or a btrfs device add, must I myself
> > ensure that the new features are not enabled for the new device which
> > is to be added?
>
> I don't believe there's any way to set that manually even if you wanted
> to -- you don't use mkfs on it and the add/replace would overwrite
> existing if you did. The new device should just take on the attributes
> of the filesystem you're adding it to.

Good to know. Thanks
Re: How to get the devid of a missing device
On Monday, April 27, 2015 12:48:07 PM Anand Jain wrote:
> On 04/27/2015 02:39 AM, Wolfgang Mader wrote:
> > Hello,
> >
> > I have a raid10 with one device missing. I would like to use btrfs
> > replace to replace it. However, I am unsure on how to obtain the devid
> > of the missing device. Having the filesystem mounted in degraded mode
> > under mnt, btrfs fs
>
> At the user end there is no way, unless you want to use gdb and dump
> the fs_uuids and check.
>
> Submitted these patches as of now to obtain it from the logs. Will help
> in the situation when the device is missing at the time of mount.
>
>     Btrfs: check error before reporting missing device and add uuid
>     Btrfs: log when missing device is created
>
> For the long term we have the sysfs interface; patches are in the ML if
> you want to test.
>
> Good luck.
> Anand

Great. Thank you for patching this in.

Best,
Wolfgang
How to get the devid of a missing device
Hello,

I have a raid10 with one device missing. I would like to use btrfs replace to replace it. However, I am unsure on how to obtain the devid of the missing device. Having the filesystem mounted in degraded mode under /mnt, btrfs fi show /mnt returns

    sudo btrfs filesystem show /mnt
    Label: 'dataPool'  uuid: b5f082e2-2ce0-4f91-b54b-c2d26185a635
            Total devices 4 FS bytes used 665.88GiB
            devid    2 size 931.51GiB used 336.03GiB path /dev/sdb
            devid    4 size 931.51GiB used 336.03GiB path /dev/sdd
            devid    5 size 931.51GiB used 336.03GiB path /dev/sde
            *** Some devices missing

but does not mention the devid of the missing device. The device stats are

    sudo btrfs device stats /mnt
    [/dev/sdb].write_io_errs   0
    [/dev/sdb].read_io_errs    0
    [/dev/sdb].flush_io_errs   0
    [/dev/sdb].corruption_errs 0
    [/dev/sdb].generation_errs 0
    [(null)].write_io_errs     1448
    [(null)].read_io_errs      0
    [(null)].flush_io_errs     0
    [(null)].corruption_errs   0
    [(null)].generation_errs   0
    [/dev/sdd].write_io_errs   0
    [/dev/sdd].read_io_errs    0
    [/dev/sdd].flush_io_errs   0
    [/dev/sdd].corruption_errs 0
    [/dev/sdd].generation_errs 0
    [/dev/sde].write_io_errs   0
    [/dev/sde].read_io_errs    0
    [/dev/sde].flush_io_errs   0
    [/dev/sde].corruption_errs 0
    [/dev/sde].generation_errs 0

Btw, the file system is too old for skinny metadata and extended inode refs. If I do a btrfs replace or a btrfs device add, must I myself ensure that the new features are not enabled for the new device which is to be added?

Thanks!
Wolfgang
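As noted elsewhere in this thread, there is no authoritative user-level way to get the missing devid, but the devids still present at least narrow it down. A sketch that just extracts and sorts them (exercised here on a saved `btrfs filesystem show` excerpt via a heredoc instead of a live run):

```shell
#!/bin/sh
# Sketch: list the devids still present in `btrfs filesystem show` output.
# With "Total devices 4" and devids 2, 4, 5 present, the missing device
# has one of the absent ids -- only a hint, since devids need not be
# contiguous after earlier adds and deletes.
present_devids() {
    awk '$1 == "devid" { print $2 }' | sort -n
}

present_devids <<'EOF'
Label: 'dataPool'  uuid: b5f082e2-2ce0-4f91-b54b-c2d26185a635
        Total devices 4 FS bytes used 665.88GiB
        devid    2 size 931.51GiB used 336.03GiB path /dev/sdb
        devid    4 size 931.51GiB used 336.03GiB path /dev/sdd
        devid    5 size 931.51GiB used 336.03GiB path /dev/sde
        *** Some devices missing
EOF
```

Against a live mount one would run `btrfs filesystem show /mnt | present_devids`; the gap in the sorted list is a candidate for the missing devid, nothing more.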
Scrub status: no stats available
Dear list,

I am running btrfs on Arch Linux ARM (Linux 3.14.2, Btrfs v3.14.1). I can run scrub without errors, but I never get stats from scrub status. What I get is

    btrfs scrub status /pools/dataPool
    scrub status for b5f082e2-2ce0-4f91-b54b-c2d26185a635
            no stats available
            total bytes scrubbed: 694.13GiB with 0 errors

Please mind the line "no stats available". Where can I start digging?

Thank you,
Wolfgang
Re: Understanding btrfs and backups
Duncan, thank you for this comprehensive post. Really helpful as always!

[...]
> As for restoring, since a snapshot is a copy of the filesystem as it
> existed at that point, and the method btrfs exposes for accessing them
> is to mount that specific snapshot, to restore an individual file from
> a snapshot, you simply mount the snapshot you want somewhere and copy
> the file as it existed in that snapshot over top of your current
> version (which will have presumably already been mounted elsewhere,
> before you mounted the snapshot to retrieve the file from), then
> unmount the snapshot and go about your day. =:^)

Please, how do I list mounted snapshots only?

[...]
> Since a snapshot is an image of the filesystem as it was at that
> particular point in time, and btrfs by nature copies blocks elsewhere
> when they are modified, all (well, not all as there's metadata like
> file owner, permissions and group, too, but that's handled the same
> way) the snapshot does is map what blocks composed each file at the
> time the snapshot was taken.

Is it correct that e.g. ownership is recorded separately from the data itself, so if I changed the owner of all my files, the respective snapshot would only store the old owner information?

[...]
> The first time you do this, there's no existing copy at the other end,
> so btrfs send sends a full copy and btrfs receive writes it out. After
> that, the receive side has a snapshot identical to the one created on
> the send side, and further btrfs send/receives to the same set simply
> duplicate the differences between the reference and the new snapshot
> from the send end to the receive end. As with local snapshots, old
> ones can be deleted on both the send and receive ends, as long as at
> least one common reference snapshot is maintained on both ends, so
> diffs taken against the send side reference can be applied to an
> appropriately identical receive side reference, thereby updating the
> receive side to match the new read-only snapshot on the send side.

Is the receiving side a complete file system in its own right? If so, I only need to maintain one common reference in order to apply the received snapshot, right? If I would in any way get the send and receive side out of sync, such that they do not share a common reference any more, only the send/receive would fail, but I would still have the complete filesystem on the receiving side, and could copy it all over (cp, rsync) to the send side in case of a disaster on the send side. Is this correct?

Thank you!

Best,
Wolfgang

--
Wolfgang Mader
wolfgang.ma...@fdm.uni-freiburg.de
Telefon: +49 (761) 203-7710
Institute of Physics
Hermann-Herder Str. 3, 79104 Freiburg, Germany
Office: 207
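The incremental workflow Duncan describes, written out as a command sketch (the paths and snapshot names are made up for illustration; `-p` hands `btrfs send` the common parent that must exist as a read-only snapshot on both sides):

```shell
# Initial full transfer (one time): no parent snapshot exists yet.
btrfs subvolume snapshot -r /data /data/.snap/base
btrfs send /data/.snap/base | btrfs receive /backup

# Later, incrementally: send only the difference against the common reference.
btrfs subvolume snapshot -r /data /data/.snap/new
btrfs send -p /data/.snap/base /data/.snap/new | btrfs receive /backup
```

If the common reference is ever lost on either side, the next transfer has to fall back to a full send, but the receive side remains an ordinary, fully readable btrfs filesystem throughout, which is the crux of the disaster-recovery question above.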
Re: Read i/o errs and disk replacement
On Tuesday 18 February 2014 15:02:51 Chris Murphy wrote:
> On Feb 18, 2014, at 2:33 PM, Wolfgang Mader <wolfgang_ma...@brain-frog.de> wrote:
> > Feb 18 13:14:09 deck kernel: ata2.00: failed command: READ DMA
> > Feb 18 13:14:09 deck kernel: ata2.00: cmd c8/00:08:60:f2:30/00:00:00:00:00/e0 tag 0 dma 4096 in
> >                              res 51/04:08:60:f2:30/00:00:00:00:00/e0 Emask 0x1 (device error)
> > Feb 18 13:14:09 deck kernel: ata2.00: status: { DRDY ERR }
> > Feb 18 13:14:09 deck kernel: ata2.00: error: { ABRT }
> > Feb 18 13:14:09 deck kernel: ata2.15: hard resetting link
> > Feb 18 13:14:14 deck kernel: ata2.15: link is slow to respond, please be patient (ready=0)
> > Feb 18 13:14:19 deck kernel: ata2.15: SRST failed (errno=-16)
> > Feb 18 13:14:19 deck kernel: ata2.15: hard resetting link
> > Feb 18 13:14:24 deck kernel: ata2.15: link is slow to respond, please be patient (ready=0)
> > Feb 18 13:14:29 deck kernel: ata2.15: SATA link up 3.0 Gbps (SStatus 123 SControl F300)
> > Feb 18 13:14:30 deck kernel: ata2.01: hard resetting link
> > Feb 18 13:14:31 deck kernel: ata2.02: hard resetting link
> > Feb 18 13:14:31 deck kernel: ata2.03: hard resetting link
> > Feb 18 13:14:32 deck kernel: ata2.04: hard resetting link
> > Feb 18 13:14:32 deck kernel: ata2.05: hard resetting link
> > Feb 18 13:14:33 deck kernel: ata2.06: hard resetting link
> > Feb 18 13:14:34 deck kernel: ata2.07: hard resetting link
> > Feb 18 13:14:34 deck kernel: ata2.00: configured for UDMA/133
> > Feb 18 13:14:34 deck kernel: ata2.01: configured for UDMA/133
> > Feb 18 13:14:35 deck kernel: ata2.02: configured for UDMA/133
> > Feb 18 13:14:35 deck kernel: ata2.03: configured for UDMA/133
> > Feb 18 13:14:35 deck kernel: ata2.04: configured for UDMA/133
> > Feb 18 13:14:35 deck kernel: ata2.05: configured for UDMA/133
> > Feb 18 13:14:35 deck kernel: ata2.06: configured for UDMA/133
> > Feb 18 13:14:35 deck kernel: ata2.07: configured for UDMA/133
> > Feb 18 13:14:35 deck kernel: ata2: EH complete
>
> Two things.
>
> The full dmesg includes useful information separate from the error
> messages, including the model drive to ata device mapping, and why
> there's a failed read to ata2.00 yet there's a reset in sequence for
> ata2.01, 2.02, 2.03 and so on. So the entire dmesg would be useful.

You want it, you get it. I attach the full output to this mail, and hope that the list allows for attachments. Here is a short summary. The hd enclosure is a Sharkoon 8-Bay which is connected via e-sata. The hds are mostly Seagate Barracuda ES.2 (ST31000NSSUN1.0T) and one from Hitachi (HUA7210SASUN1.0T). All of them are server-grade hds and have a device timeout of 30. The hds are sata 1.0, such that they do not feature native command queuing. I generally have a low read/write performance, around 10MB/sec. I tested the enclosure with a sata 2 disk and got way better performance, around 70MB/sec. This is why I sticked with the Sharkoon. If the bad performance is due to bad configuration, I would be happy to fix it. :-)

Now to the dmesg output. The SRST error you spotted in the logs points to a wrong jumper setting concerning master, slave, cable select. My hds do not have a jumper for this setting. The order of the hds is ata2.00->sda, ata2.01->sdc, etc. The port multiplier through which all those devices are accessed is ata2.15. I rebooted the system two times to see if the read error count goes up. Didn't happen; it still sits at 2.

> In any case the actual problem might not be discoverable due to the
> hard resetting. I'm not finding any useful translation, in a 5 minute
> search, for SRST. But it makes me suspicious of a configuration
> problem, like maybe an unnecessary jumper setting on a drive or with
> the enclosure itself. So I'd check for that.
>
> Also, what model drives are being used? If they are consumer drives,
> they almost certainly have long error recoveries over 30 minutes. And
> if the drive is trying to honor the read request for more than 30
> seconds, the default SCSI block layer will time out and produce
> messages like what we see here. So you probably need to change the
> SCSI block layer timeout. To set the command timer to something else
> use:
>
>     echo <value> > /sys/block/<device>/device/timeout
>
> Where <value> is e.g. 121; since many consumer drives time out at 120
> seconds this means the kernel will wait 121 seconds before starting
> its error handling (which includes resetting the drive and then the
> bus).
>
> > ---end---
> >
> > This output is repeated several times and then ends in this read error
> >
> > [Tue Feb 18 13:15:48 2014] btrfs: bdev /dev/sdb errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
> > [Tue Feb 18 13:15:48 2014] ata2: EH complete
> > [Tue Feb 18 13:15:48 2014] btrfs read error corrected: ino 1 off 29184540672 (dev /dev/sdb sector 3207776)
>
> Well, that reads like Btrfs knows what sector had a read problem,
> without corruption being the cause, and corrected it. So the question
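Applying the timeout advice above to every SATA disk at once can be wrapped in a small helper. This is a sketch: the `base` argument is an artificial parameter (real usage would pass `/sys/block`) that exists only so the function can be exercised against a scratch directory instead of a live sysfs; run as root for the real thing.

```shell
#!/bin/sh
# Sketch: write VALUE into every <base>/sd*/device/timeout file.
# With base=/sys/block this raises the SCSI command timer for all
# sd devices, as suggested above.
set_timeouts() {
    base=$1
    value=$2
    for f in "$base"/sd*/device/timeout; do
        # Skip the unexpanded glob if no sd* devices exist under base.
        [ -e "$f" ] && echo "$value" > "$f"
    done
}

# Real usage (as root): set_timeouts /sys/block 121
```

Note the setting does not survive a reboot, so it is typically placed in a boot-time script or udev rule.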
Read i/o errs and disk replacement
Hi all,

well, I hit the first incidence where I really have to work with my btrfs setup. To get things straight I want to double-check here to not screw things up right from the start. We are talking about a home server. There is no time or user pressure involved, and there are backups, too.

Software
--------
Linux 3.13.3
Btrfs v3.12

Hardware
--------
5 1T hard drives configured to be a raid10 for both data and metadata
Data, RAID10: total=282.00GiB, used=273.33GiB
System, RAID10: total=64.00MiB, used=36.00KiB
Metadata, RAID10: total=1.00GiB, used=660.48MiB

Error
-----
This is not btrfs' fault but due to an hd error. I saw in the system logs

    btrfs: bdev /dev/sdb errs: wr 0, rd 2, flush 0, corrupt 0, gen 0

and a subsequent check on btrfs showed

    [/dev/sdb].write_io_errs   0
    [/dev/sdb].read_io_errs    2
    [/dev/sdb].flush_io_errs   0
    [/dev/sdb].corruption_errs 0
    [/dev/sdb].generation_errs 0

So, I have a read error on sdb.

Questions
---------
1) Do I have to take action immediately (shut down the system, umount the file system)? Can I even ignore the error? Unfortunately, I can not access SMART information through the sata interface of the enclosure which hosts the hds.

2) I can only replace the disk, not add a new one and then swap over. There is no space left in the disk enclosure I am using. I also can not guarantee that if I remove sdb and start the system up again, all the other disks are named the same as they are now, and that the newly added disk will be named sdb again. Is this an issue?

3) I know that btrfs can handle disks of different sizes. Is there a downside if I go for a 3T disk and add it to the 1T disks? Is there e.g. more stuff saved on the 3T disk, and if this one fails I lose redundancy? Is a soft transition to 3T where I replace every dying 1T disk with a 3T disk advisable?

Proposed solution for the current issue
---------------------------------------
1) Delete the faulted drive using btrfs device delete /dev/sdb /path/to/pool
2) Format the new disk with mkfs.btrfs
3) Add the new disk to the filesystem using btrfs device add /dev/newdiskname /path/to/pool
4) Balance the file system: btrfs fs balance /path/to/pool

Is this the proper way to deal with the situation?

Thank you for your advice.

Best,
Wolfgang
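For comparison, the `btrfs replace` command collapses the delete/add/balance sequence into one operation, copying directly onto the new disk; a hedged sketch (device names are placeholders, and whether the failing disk can stay attached alongside the new one depends on the enclosure):

```shell
# Sketch: one-shot replacement of a failing device (placeholder names).
# -r reads from the source device only when no good mirror copy exists,
# which limits stress on the failing disk.
btrfs replace start -r /dev/sdb /dev/sdnew /path/to/pool

# Progress can be watched while the copy runs:
btrfs replace status /path/to/pool

# If the new disk is larger (the 3T case in question 3), grow the
# filesystem onto it afterwards, giving the new device's devid:
btrfs filesystem resize <devid>:max /path/to/pool
```

Note also that btrfs tracks devices by devid and filesystem UUID, not by kernel names, so sdb being called something else after a reboot is not by itself a problem.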
Re: Read i/o errs and disk replacement
On Tuesday 18 February 2014 11:48:49 Chris Murphy wrote:
> On Feb 18, 2014, at 6:19 AM, Wolfgang Mader <wolfgang_ma...@brain-frog.de> wrote:
> > Hi all,
> >
> > well, I hit the first incidence where I really have to work with my
> > btrfs setup. To get things straight I want to double-check here to not
> > screw things up right from the start. We are talking about a home
> > server. There is no time or user pressure involved, and there are
> > backups, too.
> >
> > Software
> > --------
> > Linux 3.13.3
> > Btrfs v3.12
> >
> > Hardware
> > --------
> > 5 1T hard drives configured to be a raid10 for both data and metadata
> > Data, RAID10: total=282.00GiB, used=273.33GiB
> > System, RAID10: total=64.00MiB, used=36.00KiB
> > Metadata, RAID10: total=1.00GiB, used=660.48MiB
> >
> > Error
> > -----
> > This is not btrfs' fault but due to an hd error. I saw in the system logs
> >
> >     btrfs: bdev /dev/sdb errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
> >
> > and a subsequent check on btrfs showed
> >
> >     [/dev/sdb].write_io_errs   0
> >     [/dev/sdb].read_io_errs    2
> >     [/dev/sdb].flush_io_errs   0
> >     [/dev/sdb].corruption_errs 0
> >     [/dev/sdb].generation_errs 0
> >
> > So, I have a read error on sdb.
> >
> > Questions
> > ---------
> > 1) Do I have to take action immediately (shut down the system, umount
> > the file system)? Can I even ignore the error? Unfortunately, I can
> > not access SMART information through the sata interface of the
> > enclosure which hosts the hds.
>
> A full dmesg should be sufficient to determine if this is due to the
> drive reporting a read error, in which case Btrfs is expected to get a
> copy of the missing data from a mirror, send it up to the application
> layer without error, and then write it to the LBAs of the device(s)
> that reported the original read error. It is kinda important to make
> sure that there wasn't a device reset, but an explicit read error. If
> the drive merely hangs while in recovery, upon reset any way of knowing
> what sectors were slow or bad is lost.

Thank you for your quick response.

The first read error occurred during system start-up when the raid is activated for the first time

    [Tue Feb 18 13:02:08 2014] btrfs: use lzo compression
    [Tue Feb 18 13:02:08 2014] btrfs: disk space caching is enabled
    [Tue Feb 18 13:02:09 2014] btrfs: bdev /dev/sdb errs: wr 0, rd 1, flush 0, corrupt 0, gen 0

and then dmesg is silent for the next 10 minutes. The second read error happens while the device is in use and is preceded by

    ---start---
    Feb 18 13:14:09 deck kernel: ata2.15: exception Emask 0x1 SAct 0x0 SErr 0x0 action 0x6
    Feb 18 13:14:09 deck kernel: ata2.15: edma_err_cause=0084 pp_flags=0001, dev error, EDMA self-disable
    Feb 18 13:14:09 deck kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
    Feb 18 13:14:09 deck kernel: ata2.00: failed command: READ DMA
    Feb 18 13:14:09 deck kernel: ata2.00: cmd c8/00:08:60:f2:30/00:00:00:00:00/e0 tag 0 dma 4096 in
                                 res 51/04:08:60:f2:30/00:00:00:00:00/e0 Emask 0x1 (device error)
    Feb 18 13:14:09 deck kernel: ata2.00: status: { DRDY ERR }
    Feb 18 13:14:09 deck kernel: ata2.00: error: { ABRT }
    Feb 18 13:14:09 deck kernel: ata2.15: hard resetting link
    Feb 18 13:14:14 deck kernel: ata2.15: link is slow to respond, please be patient (ready=0)
    Feb 18 13:14:19 deck kernel: ata2.15: SRST failed (errno=-16)
    Feb 18 13:14:19 deck kernel: ata2.15: hard resetting link
    Feb 18 13:14:24 deck kernel: ata2.15: link is slow to respond, please be patient (ready=0)
    Feb 18 13:14:29 deck kernel: ata2.15: SATA link up 3.0 Gbps (SStatus 123 SControl F300)
    Feb 18 13:14:30 deck kernel: ata2.01: hard resetting link
    Feb 18 13:14:31 deck kernel: ata2.02: hard resetting link
    Feb 18 13:14:31 deck kernel: ata2.03: hard resetting link
    Feb 18 13:14:32 deck kernel: ata2.04: hard resetting link
    Feb 18 13:14:32 deck kernel: ata2.05: hard resetting link
    Feb 18 13:14:33 deck kernel: ata2.06: hard resetting link
    Feb 18 13:14:34 deck kernel: ata2.07: hard resetting link
    Feb 18 13:14:34 deck kernel: ata2.00: configured for UDMA/133
    Feb 18 13:14:34 deck kernel: ata2.01: configured for UDMA/133
    Feb 18 13:14:35 deck kernel: ata2.02: configured for UDMA/133
    Feb 18 13:14:35 deck kernel: ata2.03: configured for UDMA/133
    Feb 18 13:14:35 deck kernel: ata2.04: configured for UDMA/133
    Feb 18 13:14:35 deck kernel: ata2.05: configured for UDMA/133
    Feb 18 13:14:35 deck kernel: ata2.06: configured for UDMA/133
    Feb 18 13:14:35 deck kernel: ata2.07: configured for UDMA/133
    Feb 18 13:14:35 deck kernel: ata2: EH complete
    ---end---

This output is repeated several times and then ends in this read error

    [Tue Feb 18 13:15:48 2014] btrfs: bdev /dev/sdb errs: wr 0, rd 2, flush 0, corrupt 0, gen 0
    [Tue Feb 18 13:15:48 2014] ata2: EH complete
    [Tue Feb 18 13:15:48 2014] btrfs read error corrected: ino 1 off 29184540672 (dev /dev/sdb sector 3207776)

This might have to do with the fact that my hds