Re: defragmenting best practice?
On Tue, Oct 31, 2017 at 05:47:54PM -0400, Dave wrote:
> I'm following up on all the suggestions regarding Firefox performance
> on BTRFS.
>
> 5. Firefox profile sync has not worked well for us in the past, so we
> don't use it.
> 6. Our machines generally have plenty of RAM, so we could put the
> Firefox cache (and maybe profile) into RAM using a technique such as
> https://wiki.archlinux.org/index.php/Firefox/Profile_on_RAM. However,
> profile persistence is important.
>
> 4. Put the Firefox cache in RAM
>
> 5. If needed, consider putting the Firefox profile in RAM

Have you looked into profile-sync-daemon?
https://wiki.archlinux.org/index.php/profile-sync-daemon

It basically does the "keep the profile in RAM but also sync it to HDD" part for you. I've used it for years, and it works quite well.

--Sean
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
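[Editor's note] For readers new to profile-sync-daemon, a minimal setup sketch follows. This is a dry run: the commands are collected and printed rather than executed, since they need a real desktop session. The package and unit names are as on Arch Linux and may differ on other distros.

```shell
# Dry run: collect the commands instead of executing them, then print them.
CMDS=""
run() { CMDS="$CMDS+ $*
"; }

run pacman -S profile-sync-daemon              # install (Arch; use your distro's package)
run psd                                        # first run writes ~/.config/psd/psd.conf
run "${EDITOR:-nano}" ~/.config/psd/psd.conf   # e.g. set BROWSERS="firefox"
run systemctl --user enable --now psd.service  # profile lives in tmpfs, resynced to disk
printf '%s' "$CMDS"
```

Once the user service is running, the on-disk profile is only touched at resync time, which is the whole point for btrfs fragmentation.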
Re: defragmenting best practice?
On September 19, 2017 11:38:13 PM PDT, Dave wrote:
>> On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
>
> Here's my scenario. Some months ago I built an over-the-top powerful
> desktop computer / workstation and I was looking forward to really
> fantastic performance improvements over my 6 year old Ubuntu machine.
> I installed Arch Linux on BTRFS on the new computer (on an SSD). To my
> shock, it was no faster than my old machine. I focused a lot on
> Firefox performance because I use Firefox a lot and that was one of
> the applications in which I was most looking forward to better
> performance.
>
> What would you guys do in this situation?

Check out profile-sync-daemon:
https://wiki.archlinux.org/index.php/profile-sync-daemon

It keeps the active profile files in a ramfs, periodically syncing them back to disk. It works quite well on my 7 year old netbook.

--Sean
Re: BTRFS converted from EXT4 becomes read-only after reboot
On Mon, May 08, 2017 at 12:41:11PM -0400, Austin S. Hemmelgarn wrote:
> Send/receive is not likely to transfer the problem unless it has something
> to do with how things are reflinked. Receive operates by recreating the
> sent subvolume from userspace using regular commands and the clone ioctls,
> so it won't replicate any low-level structural issues in the filesystem
> unless they directly involve the way extents are being shared (or are a side
> effect of that). On top of that, if there is an issue on the sending side,
> send itself will probably not send that data, so it's actually only
> marginally more dangerous than using something like rsync to copy the data.

True, but my goal was to eliminate as many btrfs variables as I could.

To answer the original question: I used rsync to copy the data and attributes (something like rsync -aHXp --numeric-ids) from a live CD to an external hard drive (formatted ext4), then ran mkfs.btrfs on the original partition, then re-ran the rsync in the opposite direction. It worked quite well for me, and the problem hasn't resurfaced.

--Sean
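[Editor's note] Spelled out, the rebuild procedure Sean describes looks roughly like the sketch below. It is a dry run: the commands are collected and printed rather than executed, since they would reformat a disk. The device names and mount points are hypothetical, and everything should be done from a live CD with the filesystem unmounted from its normal location.

```shell
# Dry run: collect the commands instead of executing them, then print them.
CMDS=""
run() { CMDS="$CMDS+ $*
"; }

run mount /dev/sdXn /mnt/old          # the ailing btrfs filesystem (hypothetical device)
run mount /dev/sdYn /mnt/scratch      # external drive, formatted ext4
run rsync -aHXp --numeric-ids /mnt/old/ /mnt/scratch/copy/
run umount /mnt/old
run mkfs.btrfs -f /dev/sdXn           # recreate the filesystem from scratch
run mount /dev/sdXn /mnt/old
run rsync -aHXp --numeric-ids /mnt/scratch/copy/ /mnt/old/
printf '%s' "$CMDS"
```

The -aHXp combination preserves hard links, xattrs, and permissions, and --numeric-ids avoids UID/GID remapping between the live environment and the installed system.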
Re: BTRFS converted from EXT4 becomes read-only after reboot
On May 8, 2017 11:28:42 AM EDT, Sanidhya Solanki wrote:
> On Mon, 8 May 2017 10:16:44 -0400, Alexandru Guzu wrote:
>
>> Sean, how would you approach the copy of the data back and forth if
>> the OS is on it? Would a send-receive and then back work?
>
> You could use a Live-USB and then just dd it to remote or attached
> storage, if you want to be absolutely sure the data is not affected.

I would not suggest either of those. Send/receive might work, but since we don't know the source of the problem, it risks transferring the problem. dd would not solve the problem at all, since we're trying to rebuild the partition, not clone it.

--Sean
Re: BTRFS converted from EXT4 becomes read-only after reboot
On May 3, 2017 4:28:11 PM EDT, Alexandru Guzu wrote:
> Hi all,
>
> In a VirtualBox VM, I converted an EXT4 fs to BTRFS that is now running
> on Ubuntu 16.04 (kernel 4.4.0-72). I was able to use the system for
> several weeks. I even did kernel updates, compression, and deduplication
> without issues.
>
> Since today, a little while after booting (usually when I start
> opening applications), the FS becomes read-only. IMO, the only thing
> that might have changed that could affect this is a kernel upgrade.
> However, since the FS becomes RO, I cannot easily do a kernel update
> (I guess I'd have to chroot and do it).
>
> The stack trace is the following:
>
> [ 88.836057] [ cut here ]
> [ 88.836082] WARNING: CPU: 0 PID: 25 at /build/linux-wXdoVv/linux-4.4.0/fs/btrfs/inode.c:2931 btrfs_finish_ordered_io+0x63b/0x650 [btrfs]()
> [ 88.836083] BTRFS: Transaction aborted (error -95)
> [ 88.836084] Modules linked in: nvram msr zram lz4_compress vboxsf(OE) joydev crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel snd_intel8x0 aes_x86_64 snd_ac97_codec lrw gf128mul glue_helper ablk_helper cryptd ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq vboxvideo(OE) snd_seq_device snd_timer ttm input_leds drm_kms_helper serio_raw i2c_piix4 snd drm fb_sys_fops syscopyarea sysfillrect sysimgblt soundcore 8250_fintek vboxguest(OE) mac_hid parport_pc ppdev lp parport autofs4 btrfs xor raid6_pq hid_generic usbhid hid psmouse ahci libahci fjes video pata_acpi
> [ 88.836116] CPU: 0 PID: 25 Comm: kworker/u2:1 Tainted: G OE 4.4.0-72-generic #93-Ubuntu
> [ 88.836117] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
> [ 88.836130] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
> [ 88.836132] 0286 618d3a00 88007cabbc78 813f82b3
> [ 88.836134] 88007cabbcc0 c01912a8 88007cabbcb0 81081302
> [ 88.836135] 880058bf01b0 8800355f2800 88007b9e64e0 88007c66f098
> [ 88.836137] Call Trace:
> [ 88.836142] [] dump_stack+0x63/0x90
> [ 88.836145] [] warn_slowpath_common+0x82/0xc0
> [ 88.836147] [] warn_slowpath_fmt+0x5c/0x80
> [ 88.836159] [] ? unpin_extent_cache+0x8f/0xe0 [btrfs]
> [ 88.836167] [] ? btrfs_free_path+0x26/0x30 [btrfs]
> [ 88.836178] [] btrfs_finish_ordered_io+0x63b/0x650 [btrfs]
> [ 88.836188] [] finish_ordered_fn+0x15/0x20 [btrfs]
> [ 88.836200] [] btrfs_scrubparity_helper+0xca/0x2f0 [btrfs]
> [ 88.836202] [] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20
> [ 88.836214] [] btrfs_endio_write_helper+0xe/0x10 [btrfs]
> [ 88.836217] [] process_one_work+0x165/0x480
> [ 88.836219] [] worker_thread+0x4b/0x4c0
> [ 88.836220] [] ? process_one_work+0x480/0x480
> [ 88.836222] [] ? process_one_work+0x480/0x480
> [ 88.836224] [] kthread+0xd8/0xf0
> [ 88.836225] [] ? kthread_create_on_node+0x1e0/0x1e0
> [ 88.836229] [] ret_from_fork+0x3f/0x70
> [ 88.836230] [] ? kthread_create_on_node+0x1e0/0x1e0
> [ 88.836232] ---[ end trace f4b8dbb54f0db139 ]---
> [ 88.836234] BTRFS: error (device sda1) in btrfs_finish_ordered_io:2931: errno=-95 unknown
> [ 88.836236] BTRFS info (device sda1): forced readonly
> [ 88.836392] pending csums is 118784
>
> If I reboot with the Live CD version, I can run `scrub` and `check`
> without any issues. Also, the FS stays read-write.
>
> Is this a known issue? Could it be due to the conversion, or can this
> happen on any system at any time? This is a test VM that I can
> re-make, but I would like to know if this can happen on a production
> system.
>
> Regards,
> Alex.

Just FYI, I ran into basically this same issue a while back. My solution was to copy all the data off it, re-run mkfs.btrfs, then copy all the data back. I found that none of the existing data was damaged, so I think the actual issue is something left over from the conversion that confuses the commit logic.
--Sean
Re: Is btrfs-convert able to deal with sparse files in a ext4 filesystem?
On Sat, Apr 01, 2017 at 11:48:50AM +0200, Kai Herlemann wrote:
> Hi,
> I have on my ext4 filesystem some sparse files, mostly images from
> ext4 filesystems.
> Is btrfs-convert (4.9.1) able to deal with sparse files, or can that
> cause any problems?

From personal experience, I would recommend not using btrfs-convert on ext4 partitions. I attempted it on a /home partition on one of my machines, and while it did succeed in converting, the fs it produced had weird issues that caused transaction failures and thus semi-frequent remount-ro. btrfs check, scrub, and balance were all unable to repair the damage. I ended up recreating the partition from a backup. As far as I know, there were no sparse files on this partition, either.

Just my one data point, for whatever it's worth.

--Sean
Re: Cross-subvolume rename behavior
On Thu, Mar 23, 2017 at 07:23:40AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-03-23 06:09, Hugo Mills wrote:
>> Direct rename (using rename(2)) isn't possible across subvols,
>> which is what the EXDEV result indicates. The solution is exactly what
>> mv does, which is reflink-and-delete (which is cheaper than
>> copy-and-delete, because no data is moved). In theory, you probably
>> could implement rename across subvolumes in the FS, but it would just
>> be moving the exact same operations from userspace to kernel space.
>
> Doing so, though, would have the advantage that it theoretically could be
> made (almost) atomic like a normal rename is, whereas the fallback in mv
> is absolutely not atomic.
>
>> I think that the solution here is for the sshfs stack to be fixed
>> so that it passes the EXDEV up to the mv command properly, and passes
>> the subsequent server-side copy (reflink) back down correctly.
>
> This would be wonderful in theory, but it can't pass down the reflink,
> because the SFTP protocol (which is what sshfs uses) doesn't even have the
> concept of reflinks, so implementing this completely would require a
> revision to the SFTP protocol, which I don't see as likely to happen.

Thanks for the insights, Hugo and Austin. That's more or less what I was expecting, so I guess my next stop will be with the OpenSSH folks. Though I was looking over the SFTP protocol, and none of the SSH_FX_* errors seem suitable for this purpose. This might require a workaround in sshfs...

--Sean
Cross-subvolume rename behavior
Hello, all.

I'm currently tracking down the source of some strange behavior in my setup. I recognize that this isn't strictly a btrfs issue, but I figured I'd start at the bottom of the stack and work my way up.

I have a server with a btrfs filesystem on it that I remotely access on several systems via an sshfs mount. For the most part this works perfectly, but I just discovered that moving files between subvolumes on that mount fails with a confusing "Operation not permitted" error.

After doing some digging, it turns out it's not actually a permissions error. If I do the same operation locally on the server, it succeeds, but an strace of the mv reveals that the rename() syscall returns EXDEV. The mv util takes this as a sign to fall back on the copy-and-delete routine, so the move succeeds. Unfortunately, it seems that somewhere in sshfs, sftp, or fuse, the EXDEV is getting turned into a generic failure, which mv apparently interprets as "permission denied".

So my question for the btrfs devs: is rename()-ing across subvolumes not feasible, or is this simply a case of no one having implemented it yet?

Thanks for any insights,

--Sean
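[Editor's note] The fallback that mv performs on EXDEV can be reproduced by hand. The sketch below runs on any filesystem: cp --reflink=auto makes a cheap clone when source and destination are on the same btrfs filesystem, and silently degrades to an ordinary copy otherwise. The paths are just a throwaway demo.

```shell
# Emulate mv's reflink-and-delete (or copy-and-delete) fallback explicitly.
src=$(mktemp /tmp/exdev-demo-src.XXXXXX)
dst=/tmp/exdev-demo-dst.$$
echo "some data" > "$src"

# mv first tries rename(2); across subvolumes that fails with EXDEV,
# at which point it clones/copies the file and removes the original:
cp --reflink=auto --preserve=all "$src" "$dst" && rm "$src"

cat "$dst"    # -> some data
```

The same two-step sequence is what would have to happen server-side for the sshfs case, which is why the error translation matters: the client's mv never gets the EXDEV hint that tells it to do this.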
Re: Incremental send robustness question
On October 14, 2016 12:43:03 AM EDT, Duncan <1i5t5.dun...@cox.net> wrote:
> I see the specific questions have been answered, and alternatives
> explored in one direction, but I've another alternative, in a different
> direction, to suggest.
>
> First a disclaimer. I'm a btrfs user/sysadmin and regular on the list,
> but I'm not a dev, and my own use-case doesn't involve send/receive, so
> what I know regarding send/receive is from the list and manpages, not
> personal experience. With that in mind...
>
> It's worth noting that send/receive are subvolume-specific -- a send
> won't continue down into a subvolume.
>
> Also note that in addition to -p/parent, there's -c/clone-src. The
> latter is more flexible than the super-strict parent option, at the
> expense of a fatter send-stream, as additional metadata is sent that
> specifies which clone the instructions are relative to.
>
> It should be possible to use the combination of these two facts to split
> and recombine your send stream in a firewall-timeout-friendly manner, as
> long as no individual file is so big that sending it alone exceeds the
> timeout.
>
> 1) Start by taking a read-only snapshot of your intended source
> subvolume, so you have an unchanging reference.
>
> 2) Take multiple writable snapshots of it, and selectively delete
> subdirs (and files if necessary) from each writable snapshot, trimming
> each one to a size that should pass the firewall without interruption,
> so that the combination of all these smaller subvolumes contains the
> content of the single larger one.
>
> 3) Take read-only snapshots of each of these smaller snapshots, suitable
> for sending.
>
> 4) Do a non-incremental send of each of these smaller snapshots to the
> remote.
>
> If it's practical to keep the subvolume divisions, you can simply split
> the working tree into subvolumes and send those individually instead of
> doing the snapshot splitting above, in which case you can then use
> -p/parent on each as you were trying to do on the original, and you can
> stop here.
>
> If you need/prefer the single subvolume, continue...
>
> 5) Do an incremental send of the original full snapshot, using multiple
> -c options to list each of the smaller snapshots. Since all the data
> has already been transferred in the smaller snapshot sends, this send
> should be all metadata, no actual data. It'll simply be combining the
> individual reference subvolumes into a single larger subvolume once
> again.
>
> 6) Once you have the single larger subvolume on the receive side, you
> can delete the smaller snapshots, as you now have a copy of the larger
> subvolume on each side to do further incremental sends of the working
> copy against.
>
> 7) I believe the first incremental send of the full working copy against
> the original larger snapshot will still have to use -c, while
> incremental sends based on that first one will be able to use the
> stricter but slimmer send-stream -p, with each one then using the
> previous one as the parent. However, I'm not sure on that. It may be
> that you have to continue using the fatter send-stream -c each time.
>
> Again, I don't have send/receive experience of my own, so hopefully
> someone who does can reply, either confirming that this should work (and
> whether or not -p can be used after the initial setup) or explaining why
> the idea won't work. But at this point, based on my own understanding,
> it seems perfectly workable to me. =:^)

I was considering doing something like this, but the simple solution of "just bring the disk over" won out. If that hadn't been possible, I might have done something like that, and I'm still mulling over possible solutions to similar / related problems.

I think the biggest solution would be support for partial / resumable receives. That'll probably go on my ever-growing list of things to possibly look into when I happen upon some free time. It sounds quite complicated, though...

--Sean
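[Editor's note] Duncan's steps 1-5 can be sketched as follows. This is a dry run: the commands are collected and printed rather than executed, since they need a real btrfs filesystem and root privileges. The subvolume names, the two-way split, and the "backup" host are all hypothetical.

```shell
# Dry run: collect the commands instead of executing them, then print them.
CMDS=""
run() { CMDS="$CMDS+ $*
"; }

# 1) read-only reference snapshot of the source subvolume
run btrfs subvolume snapshot -r /data /data/ref

# 2-4) writable snapshots, trimmed down, made read-only, and sent individually
for part in part-a part-b; do
    run btrfs subvolume snapshot /data/ref "/data/$part"
    run rm -rf "/data/$part/<everything belonging to the other part>"
    run btrfs property set "/data/$part" ro true
    run btrfs send "/data/$part" \| ssh backup btrfs receive /backup
done

# 5) metadata-only send of the full snapshot, naming the pieces as clone sources
run btrfs send -c /data/part-a -c /data/part-b /data/ref \| ssh backup btrfs receive /backup
printf '%s' "$CMDS"
```

Each per-part send stays under the firewall's timeout, and the final -c send carries only metadata because every extent already exists on the receive side in one of the clone sources.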
Re: Incremental send robustness question
On Thu, Oct 13, 2016 at 01:14:51AM +0200, Hans van Kranenburg wrote:
> On 10/13/2016 12:29 AM, Sean Greenslade wrote:
>> Hi, all. I have a question about a backup plan I have involving
>> send/receive. As far as I can tell, there's no way to resume a send
>> that has been interrupted. In this case, my interruption comes from an
>> overbearing firewall that doesn't like long-lived connections. I'm
>> trying to do the initial (non-incremental) sync of the first snapshot
>> from my main server to my backup endpoint. The snapshot is ~900 GiB,
>> and the internet link is 25 Mbps, so this'll be going for quite a long
>> time.
>
> You can't resume an interrupted send. You'll have to remove the target
> subvolume on the destination and start again.
>
> Pipe the send into a local file, and then use any tool that can reliably
> resume interrupted transfers to get it to the other side.
>
> Or, if faster, put it on a disk and drive there with your car. :)

I may just end up doing that. Hugo's response gave me some crazy ideas involving a custom build of split that waits for a command after each output file fills, which would of course require an equally weird build of cat that would stall the pipe indefinitely until all the files showed up. Driving the HDD over would probably be a little simpler. =P

>> And while we're at it, what are the failure modes for incremental
>> sends? Will it throw an error if the parents don't match, or will
>> there just be silent failures?
>
> Create a list of possibilities, create some test filesystems, try it.

I may just do that, presuming I can find the spare time. Given that I'm building a backup solution around this tech, it would definitely bolster my confidence in it if I knew what its failure modes looked like.

Thanks to everyone for your fast responses.

--Sean
Incremental send robustness question
Hi, all. I have a question about a backup plan I have involving send/receive.

As far as I can tell, there's no way to resume a send that has been interrupted. In this case, my interruption comes from an overbearing firewall that doesn't like long-lived connections. I'm trying to do the initial (non-incremental) sync of the first snapshot from my main server to my backup endpoint. The snapshot is ~900 GiB and the internet link is 25 Mbps, so this'll be going for quite a long time.

What I would like to do is "fake" the first snapshot transfer by rsync-ing the files over. So my question is this: if I rsync a subvolume (with the -a option to make all file times, permissions, ownerships, etc. the same), is that good enough to then be used as a parent for future incremental sends?

And while we're at it, what are the failure modes for incremental sends? Will it throw an error if the parents don't match, or will there just be silent failures? I would imagine receive would barf if it was told to reference a file that didn't exist, but what if the referenced file is there but contains different data? Are there checks for this sort of thing, or is it always assumed that the parent subvols are identical and, if they're not, you're in undefined-behavior land?

Thanks,

--Sean
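[Editor's note] For context, the ordinary incremental cycle the question builds on looks like the sketch below. This is a dry run: the commands are collected and printed rather than executed, and the paths and "backup" host are hypothetical. Note that, as far as I can tell, receive identifies the -p parent by matching it against a snapshot that was itself created by a previous receive, so a plain rsync copy is unlikely to be accepted as a stand-in parent.

```shell
# Dry run: collect the commands instead of executing them, then print them.
CMDS=""
run() { CMDS="$CMDS+ $*
"; }

# initial full (non-incremental) send
run btrfs subvolume snapshot -r /data /data/snap-1
run btrfs send /data/snap-1 \| ssh backup btrfs receive /backup

# later: incremental send of the differences against the shared parent
run btrfs subvolume snapshot -r /data /data/snap-2
run btrfs send -p /data/snap-1 /data/snap-2 \| ssh backup btrfs receive /backup
printf '%s' "$CMDS"
```

The key invariant is that snap-1 must exist, unmodified, on both ends before the -p send, which is exactly the property the "fake it with rsync" plan would have to reproduce.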
Re: Is stability a joke? (wiki updated)
On Mon, Sep 19, 2016 at 12:08:55AM -0400, Zygo Blaxell wrote:
> At the end of the day, I'm not sure fsck really matters. If the filesystem
> is getting corrupted enough that both copies of metadata are broken,
> there's something fundamentally wrong with that setup (hardware bugs,
> software bugs, bad RAM, etc.), and it's just going to keep slowly eating
> more data until the underlying problem is fixed, and there's no guarantee
> that a repair is going to restore data correctly. If we exclude broken
> hardware, the only thing btrfs check is going to repair is btrfs kernel
> bugs... and in that case, why would we expect btrfs check to have fewer
> bugs than the filesystem itself?

I see btrfs check as having a very useful role: fixing known problems introduced by previous versions of the kernel / progs. In my ext conversion thread, I seem to have discovered a problem introduced by convert, balance, or defrag. The data and metadata seem to be OK; however, the filesystem cannot be written to without btrfs falling over. If this was caused by some edge-case data in the btrfs partition, it makes a lot more sense to have btrfs check repair it than it does to modify the kernel code to work around this and possibly many other bugs.

The upshot is that since (potentially all of) the data is intact, a functional btrfs check would save me the hassle of restoring from backup.

--Sean
Re: Post ext3 conversion problems
On Mon, Sep 19, 2016 at 02:30:28PM +0800, Qu Wenruo wrote:
> All chunks are completely converted to DUP; there are no small chunks,
> and all are at their maximum chunk size.
> So at the chunk level, nothing related to convert yet.
>
> But in the extent tree, I found several extents that are heavily
> referred to, like extent 158173081600 or 183996522496.
>
> If you're not using out-of-band dedupe, then it's quite possible that's
> the remaining structure of convert.

I never ran any sort of dedup on this partition.

> Not quite sure if it's related to the bug, but did you do the
> balance/defrag operation just after removing the ext_save subvolume?

That's quite possible. I did it in a live boot, so I don't have the bash history to check. I checked just now using "btrfs subvol list -d", and there's nothing listed. I ran a full balance after that, but the problem remains. So whatever the problem is, it can survive a full balance after the ext_save subvol is completely deleted.

--Sean
Re: Post ext3 conversion problems
On Mon, Sep 19, 2016 at 10:20:37AM +0800, Qu Wenruo wrote:
> -95 is -EOPNOTSUPP.
>
> Not a common errno in btrfs.
>
> Most EOPNOTSUPP cases are related to discard and botched fallocate/drop
> extents.
>
> Then, are you using the discard mount option?

I did indeed have the discard mount option enabled. I tried booting with discard disabled, but the same problem appeared.

> Normally a btrfs-debug-tree would help in most cases, but this time it
> seems to be a runtime bug rather than on-disk metadata corruption.
>
> What I can see here is that, with all your operations, your fs should be
> a normal btrfs rather than a converted one.
>
> To confirm my idea, would you please upload the following things, if
> your filesystem is not too large?
>
> # btrfs-debug-tree -t extent
> # btrfs-debug-tree -t chunk
> # btrfs-debug-tree -t dev
>
> There is no file/dir name/data contained in the dump, just chunk/extent
> allocation info, so you can upload them at ease.
>
>> Not a mess, I think it's a good bug report. I think Qu and David know
>> more about the latest iteration of the convert code. If you can wait
>> until next week at least to see if they have questions, that'd be best.
>> If you need to get access to the computer sooner rather than later, I
>> suggest btrfs-image -c9 -t4 -s to make a filename-sanitized copy of the
>> filesystem metadata for them to look at, just in case. They might be
>> able to figure out the problem just from the stack trace, but better
>> to have the image before blowing away the file system, just in case
>> they want it.
>
> Yes, a btrfs-image dump would be the best,
> although sanitizing may take a long time and the output may be too large.

I had posted a btrfs-image before. It was run with a single -s flag:
http://phead.us/tmp/sgreenslade_home_sanitized_2016-09-16.btrfs

Here's the debug tree data:
http://phead.us/tmp/wheatley_chunk_2016-09-18.dump.gz
http://phead.us/tmp/wheatley_extent_2016-09-18.dump.gz
http://phead.us/tmp/wheatley_dev_2016-09-18.dump.gz

Thanks,

--Sean
Re: Post ext3 conversion problems
On Fri, Sep 16, 2016 at 07:27:58PM -0700, Liu Bo wrote:
> Interesting, seems that we get errors from
>
> btrfs_finish_ordered_io
> insert_reserved_file_extent
> __btrfs_drop_extents
>
> And splitting an inline extent throws -95.

Heh, you beat me to the draw. I was just coming to the same conclusion myself from poking at the source code. What's interesting is that it seems to be a quite explicit thing:

    if (extent_type == BTRFS_FILE_EXTENT_INLINE) {
        ret = -EOPNOTSUPP;
        break;
    }

So now the question is: why is this happening? Clearly the presence of inline extents isn't an issue by itself, since another one of my btrfs /home partitions has plenty of them.

I added some debug prints to my kernel to catch the inode that tripped the error; inode 140345 triggered the transaction abort. Here's the relevant chunk (with filenames scrubbed) from btrfs-debug-tree:

leaf 175131459584 items 51 free space 7227 generation 118521 owner 5
fs uuid 1d9ee7c7-d13a-4c3c-b730-256c70841c5b
chunk uuid b67a1a82-ff22-48b5-af1b-9d5f85ebee25
	item 0 key (140343 INODE_ITEM 0) itemoff 16123 itemsize 160
		inode generation 1 transid 1 size 180 nbytes 0 block group 0
		mode 40755 links 1 uid 1000 gid 1000 rdev 0 flags 0x0(none)
	item 1 key (140343 INODE_REF 131327) itemoff 16107 itemsize 16
		inode ref index 199 namelen 6 name:
	item 2 key (140343 DIR_ITEM 1073386496) itemoff 16072 itemsize 35
		location key (142600 INODE_ITEM 0) type SYMLINK namelen 5 datalen 0 name:
	item 3 key (140343 DIR_ITEM 1148422723) itemoff 16037 itemsize 35
		location key (142601 INODE_ITEM 0) type SYMLINK namelen 5 datalen 0 name:
	item 4 key (140343 DIR_ITEM 2415965623) itemoff 16004 itemsize 33
		location key (131550 INODE_ITEM 0) type SYMLINK namelen 3 datalen 0 name:
	item 5 key (140343 DIR_ITEM 2448077466) itemoff 15965 itemsize 39
		location key (140565 INODE_ITEM 0) type FILE namelen 9 datalen 0 name:
	item 6 key (140343 DIR_ITEM 2566671093) itemoff 15930 itemsize 35
		location key (140564 INODE_ITEM 0) type SYMLINK namelen 5 datalen 0 name:
	item 7 key (140343 DIR_ITEM 3391512089) itemoff 15873 itemsize 57
		location key (142599 INODE_ITEM 0) type FILE namelen 27 datalen 0 name:
	item 8 key (140343 DIR_ITEM 3621719155) itemoff 15838 itemsize 35
		location key (131627 INODE_ITEM 0) type SYMLINK namelen 5 datalen 0 name:
	item 9 key (140343 DIR_ITEM 3701680574) itemoff 15798 itemsize 40
		location key (142603 INODE_ITEM 0) type FIFO namelen 10 datalen 0 name:
	item 10 key (140343 DIR_ITEM 3816117430) itemoff 15763 itemsize 35
		location key (140563 INODE_ITEM 0) type SYMLINK namelen 5 datalen 0 name:
	item 11 key (140343 DIR_ITEM 4214885080) itemoff 15729 itemsize 34
		location key (131544 INODE_ITEM 0) type SYMLINK namelen 4 datalen 0 name:
	item 12 key (140343 DIR_ITEM 4253409616) itemoff 15687 itemsize 42
		location key (140352 INODE_ITEM 0) type FILE namelen 12 datalen 0 name:
	item 13 key (140343 DIR_INDEX 2) itemoff 15653 itemsize 34
		location key (131544 INODE_ITEM 0) type SYMLINK namelen 4 datalen 0 name:
	item 14 key (140343 DIR_INDEX 3) itemoff 15620 itemsize 33
		location key (131550 INODE_ITEM 0) type SYMLINK namelen 3 datalen 0 name:
	item 15 key (140343 DIR_INDEX 4) itemoff 15585 itemsize 35
		location key (131627 INODE_ITEM 0) type SYMLINK namelen 5 datalen 0 name:
	item 16 key (140343 DIR_INDEX 5) itemoff 15543 itemsize 42
		location key (140352 INODE_ITEM 0) type FILE namelen 12 datalen 0 name:
	item 17 key (140343 DIR_INDEX 6) itemoff 15508 itemsize 35
		location key (140563 INODE_ITEM 0) type SYMLINK namelen 5 datalen 0 name:
	item 18 key (140343 DIR_INDEX 7) itemoff 15473 itemsize 35
		location key (140564 INODE_ITEM 0) type SYMLINK namelen 5 datalen 0 name:
	item 19 key (140343 DIR_INDEX 8) itemoff 15434 itemsize 39
		location key (140565 INODE_ITEM 0) type FILE namelen 9 datalen 0 name:
	item 20 key (140343 DIR_INDEX 9) itemoff 15377 itemsize 57
		location key (142599 INODE_ITEM 0) type FILE namelen 27 datalen 0 name:
	item 21 key (140343 DIR_INDEX 10) itemoff 15342 itemsize 35
		location key (142600 INODE_ITEM 0) type SYMLINK namelen 5 datalen 0 name:
	item 22 key (140343 DIR_INDEX
Re: Post ext3 conversion problems
On Fri, Sep 16, 2016 at 05:45:59PM -0600, Chris Murphy wrote:
> On Fri, Sep 16, 2016 at 5:25 PM, Sean Greenslade
> <s...@seangreenslade.com> wrote:
>
>> In the mean time, is there any way to make the kernel more verbose
>> about btrfs errors? It would be nice to see, for example, what was in
>> the transaction that failed, or at least what files / metadata it was
>> touching.
>
> No idea. Maybe one of the compile time options:
>
> CONFIG_BTRFS_FS_CHECK_INTEGRITY=y
> (this also requires mount options, either check_int or check_int_data)
> CONFIG_BTRFS_FS_RUN_SANITY_TESTS
> CONFIG_BTRFS_DEBUG=y
> https://patchwork.kernel.org/patch/846462/
> CONFIG_BTRFS_ASSERT=y
>
> Actually, even before that, maybe if you did a 'btrfs-debug-tree
> /dev/sdX', that might explode in the vicinity of the problem. Thing is,
> btrfs check doesn't see anything wrong with the metadata, so chances
> are debug-tree won't either.

Hmm, I'll probably have a go at compiling the latest mainline kernel with CONFIG_BTRFS_DEBUG enabled. It certainly can't hurt to try.

And as you suspected, btrfs-debug-tree didn't explode / error out on me. I didn't thoroughly inspect the output (as I have very little understanding of the btrfs internals), but it all seemed OK.

--Sean
Re: Post ext3 conversion problems
On Fri, Sep 16, 2016 at 02:23:44PM -0600, Chris Murphy wrote:
> Not a mess, I think it's a good bug report. I think Qu and David know
> more about the latest iteration of the convert code. If you can wait
> until next week at least to see if they have questions, that'd be best.
> If you need to get access to the computer sooner rather than later, I
> suggest btrfs-image -c9 -t4 -s to make a filename-sanitized copy of the
> filesystem metadata for them to look at, just in case. They might be
> able to figure out the problem just from the stack trace, but better
> to have the image before blowing away the file system, just in case
> they want it.

I can hang on to the system in its current state; I don't particularly need this machine fully operational. Just to be proactive, I ran btrfs-image as follows:

    btrfs-image -c9 -t4 -s -w /dev/sda2 dumpfile

http://phead.us/tmp/sgreenslade_home_sanitized_2016-09-16.btrfs

In the mean time, is there any way to make the kernel more verbose about btrfs errors? It would be nice to see, for example, what was in the transaction that failed, or at least what files / metadata it was touching.

--Sean
Post ext3 conversion problems
Hi, all. I've been playing around with an old laptop of mine, and I figured I'd use it as a learning / bugfinding opportunity. Its /home partition was originally ext3. I have a full partition image of this drive as a backup, so I can do (and have done) potentially destructive things. The system disk is a ~6 year old SSD.

To start, I rebooted to a livedisk (Arch, kernel 4.7.2 w/ progs 4.7.1) and ran a simple btrfs-convert on it. After patching up the fstab and rebooting, everything seemed fine. I deleted the recovery subvol, ran a full balance, ran a full defrag, and rebooted again. I then decided to try (as an experiment) using DUP mode for data and metadata. I ran that balance without issue, then started using the machine. Sometime later, I got the following remount ro:

[ 7316.764235] [ cut here ]
[ 7316.764292] WARNING: CPU: 2 PID: 14196 at fs/btrfs/inode.c:2954 btrfs_finish_ordered_io+0x6bc/0x6d0 [btrfs]
[ 7316.764297] BTRFS: Transaction aborted (error -95)
[ 7316.764301] Modules linked in: fuse sha256_ssse3 sha256_generic hmac drbg ansi_cprng ctr ccm joydev mousedev uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core videodev media crc32c_generic iTCO_wdt btrfs iTCO_vendor_support arc4 xor ath9k raid6_pq ath9k_common ath9k_hw ath mac80211 snd_hda_codec_realtek snd_hda_codec_generic psmouse input_leds coretemp snd_hda_intel led_class pcspkr snd_hda_codec cfg80211 snd_hwdep snd_hda_core snd_pcm lpc_ich snd_timer atl1c rfkill snd soundcore shpchp intel_agp wmi thermal fjes battery evdev ac tpm_tis mac_hid tpm sch_fq_codel vboxnetflt(O) vboxnetadp(O) pci_stub vboxpci(O) vboxdrv(O) loop sg acpi_cpufreq ip_tables x_tables ext4 crc16 jbd2 mbcache sd_mod serio_raw atkbd libps2 ahci libahci uhci_hcd libata scsi_mod ehci_pci ehci_hcd usbcore
[ 7316.764434] usb_common i8042 serio i915 video button intel_gtt i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm
[ 7316.764462] CPU: 2 PID: 14196 Comm: kworker/u8:11 Tainted: G O 4.7.3-5-ck #1
[ 7316.764467] Hardware name: ASUSTeK Computer INC. 1015PEM/1015PE, BIOS 0903 11/08/2010
[ 7316.764507] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[ 7316.764513] 0286 6101f47d 8800230dbc78 812f0215
[ 7316.764522] 8800230dbcc8 8800230dbcb8 8107ae6f
[ 7316.764530] 0b8a0035 88007791afa8 8800751d9000 880014101d40
[ 7316.764538] Call Trace:
[ 7316.764551] [] dump_stack+0x63/0x8e
[ 7316.764560] [] __warn+0xcf/0xf0
[ 7316.764567] [] warn_slowpath_fmt+0x61/0x80
[ 7316.764605] [] ? unpin_extent_cache+0xa2/0xf0 [btrfs]
[ 7316.764640] [] ? btrfs_free_path+0x26/0x30 [btrfs]
[ 7316.764677] [] btrfs_finish_ordered_io+0x6bc/0x6d0 [btrfs]
[ 7316.764715] [] finish_ordered_fn+0x15/0x20 [btrfs]
[ 7316.764753] [] btrfs_scrubparity_helper+0x7e/0x360 [btrfs]
[ 7316.764791] [] btrfs_endio_write_helper+0xe/0x10 [btrfs]
[ 7316.764799] [] process_one_work+0x1ed/0x490
[ 7316.764806] [] worker_thread+0x49/0x500
[ 7316.764813] [] ? process_one_work+0x490/0x490
[ 7316.764820] [] kthread+0xda/0xf0
[ 7316.764830] [] ret_from_fork+0x1f/0x40
[ 7316.764838] [] ? kthread_worker_fn+0x170/0x170
[ 7316.764843] ---[ end trace 90f54effc5e294b0 ]---
[ 7316.764851] BTRFS: error (device sda2) in btrfs_finish_ordered_io:2954: errno=-95 unknown
[ 7316.764859] BTRFS info (device sda2): forced readonly
[ 7316.765396] pending csums is 9437184

After seeing this, I decided to attempt a repair (confident that I could restore from backup if it failed). At the time, I was unaware of the issues with progs 4.7.1, so when I ran the check and saw all the "incorrect backrefs" messages, I figured that was my problem and ran the --repair. Of course, this didn't make the messages go away on subsequent checks, so I looked further and found this bug:

https://bugzilla.kernel.org/show_bug.cgi?id=155791

I updated progs to 4.7.2 and re-ran the --repair (I didn't save any of the logs from these, unfortunately). The repair seemed to work (I also used --init-extent-tree), as current checks don't report any errors. The system boots and mounts the FS just fine. I can read from it all day, and scrubs complete without failure, but just using the system for a while will eventually trigger the same "Transaction aborted (error -95)" error.

I realize this is something of a mess, and that I was less than methodical with my actions so far. Given that I have a full backup that can be restored if need be (and I certainly could try running the convert again), what is my best course of action?

Thanks,

--Sean
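A side note on decoding that abort code: errno 95 on Linux is EOPNOTSUPP ("Operation not supported"), so "Transaction aborted (error -95)" means some operation in the ordered-io path was refused as unsupported. A minimal sketch of looking up such a code (assuming a Linux Python environment; not btrfs-specific):

```python
import errno
import os

# Kernel messages like "Transaction aborted (error -95)" carry a
# negated errno value; the positive code can be decoded to a symbolic
# name and a human-readable string.
code = 95
assert code == errno.EOPNOTSUPP   # "Operation not supported" on Linux
print(os.strerror(code))
```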
Re: does btrfs-receive use/compare the checksums from the btrfs-send side?
On Sun, Aug 28, 2016 at 10:25:32PM +0200, Christoph Anton Mitterer wrote:
> On Sun, 2016-08-28 at 22:19 +0200, Adam Borowski wrote:
> > Transports over which you're likely to send a filesystem stream
> > already protect against corruption.
> Well... in some cases,... but not always... just consider a plain old
> netcat...

Netcat uses TCP by default, so there is error correction and a
guaranteed-correct stream transfer there.

--Sean
btrfs fi usage bug during shrink
Hi, all. I was resizing (shrinking) a btrfs partition, and figured I'd check in on how it was going with "btrfs fi usage." It was quite startling:

$ sudo btrfs fi usage /mnt/
Overall:
    Device size:                 370.00GiB
    Device allocated:            372.03GiB
    Device unallocated:           16.00EiB
    Device missing:                  0.00B
    Used:                        360.56GiB
    Free (estimated):                0.00B    (min: 8.00EiB)
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:              224.00MiB    (used: 0.00B)

Data,single: Size:370.02GiB, Used:359.31GiB
   /dev/mapper/c1       370.02GiB

Metadata,DUP: Size:1.00GiB, Used:639.22MiB
   /dev/mapper/c1         2.00GiB

System,DUP: Size:8.00MiB, Used:64.00KiB
   /dev/mapper/c1        16.00MiB

Unallocated:
   /dev/mapper/c1        16.00EiB

It's reasonably obvious what's going on here. The overall size has been set to the final size, and the worker is now going through balancing all the chunks that are out of bounds. I feel like "fi usage" should probably have some logic to detect this situation and report something more sensible. Thankfully, it's only transient, and the output returns to normal once the resize completes.

Thanks,

--Sean
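My guess from the numbers (I haven't checked the tool's source) is that "unallocated" is computed as device size minus allocated bytes, which goes negative mid-shrink and wraps around as an unsigned 64-bit value; 2^64 bytes is exactly 16 EiB, so a small negative difference displays as roughly 16.00EiB:

```python
# Sketch of the suspected unsigned-64-bit wraparound, using the GiB
# figures from the "fi usage" output above.
GIB = 2 ** 30
EIB = 2 ** 60

size = 370 * GIB                 # post-shrink device size
allocated = int(372.03 * GIB)    # chunks not yet relocated back in bounds

# C-style unsigned arithmetic wraps modulo 2^64.
unallocated = (size - allocated) % 2 ** 64
print(unallocated / EIB)         # just under 16.0, shown as "16.00EiB"
```

The "(min: 8.00EiB)" figure is suspicious in the same way: 2^63 bytes is exactly 8 EiB.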
Re: RAID1 disk upgrade method
On Thu, Jan 28, 2016 at 01:47:36PM -0500, Sean Greenslade wrote:
> OK, I just misunderstood how that syntax worked. All seems good now.
> I'll try to play around with some dummy configurations this weekend to
> see if I can reproduce the post-replace mount bug.

So I finally got some time to play with this, and I am entirely unable
to reproduce these errors with virtual loop disks. I'm going to chalk
these errors up to transient SATA nastiness, since that's happened on
this system before. Either way, there was no data loss during this
entire operation, so besides a few extra unplanned reboots, things went
extremely well.

Excellent work on btrfs, devs, and thanks to everyone who chimed in to
help me.

--Sean
Re: raid1 vs raid5
On Tue, Jan 05, 2016 at 05:24:31PM +0100, Psalle wrote:
> Hello all and excuse me if this is a silly question. I looked around
> in the wiki and list archives but couldn't find any in-depth
> discussion about this:
>
> I just realized that, since raid1 in btrfs is special (meaning only
> two copies in different devices), the effect in terms of resilience
> achieved with raid1 and raid5 are the same: you can lose one drive and
> not lose data.
>
> So!, presuming that raid5 were at the same level of maturity, what
> would be the pros/cons of each mode?

This is true for "classic" RAID: assume you have 3x 1TB disks. RAID1
will give you 1.5TB, whereas RAID5 will give you 2TB.

RAID1 = 1/2 total disk space (assuming equally-sized disks)
RAID5 = (N-1) * single disk space (same assumption)

> As a corollary, I guess that if raid1 is considered a good compromise,
> then functional equivalents to raid6 and beyond could simply be
> implemented as "storing n copies in different devices", dropping any
> complex parity computations and making this mode entirely generic.

This is akin to what has been mentioned on the list earlier as "N-way
mirroring," and I agree that it will be very nice to have once
implemented. However, it is not the same as RAID5/6, since the parity
schemes are used to get more usable storage than simple mirroring would
allow for. Thus, the main pro of RAID5/6 is more usable storage, and
the main con is more computational complexity (and thus higher CPU
requirements, slower access times, more fragile error states, etc.)

> Since this seems pretty obvious, I'd welcome your insights on what are
> the things I'm missing, since it doesn't exist (and it isn't planned
> to be this way, AFAIK). I can foresee consistency difficulties, but
> that seems hardly insurmountable if its being done for raid1?

Fixing an inconsistency in RAID1 is much easier than in RAID5/6. No
math, just checking csums. Fixing an inconsistency in RAID5/6 involves
busting out the parity math. This is why repairing RAID5/6 only became
possible in btrfs relatively recently. Generating the parity data was
relatively easy, but rebuilding missing data with it was a more
difficult task.

--Sean
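The capacity formulas above, expressed as a quick sketch (classic RAID semantics with equally-sized disks; the function names are just for illustration):

```python
def raid1_usable(disks_tb):
    # Two copies of everything: half the total space is usable.
    return sum(disks_tb) / 2

def raid5_usable(disks_tb):
    # One disk's worth of space goes to parity.
    return (len(disks_tb) - 1) * min(disks_tb)

disks = [1.0, 1.0, 1.0]        # 3x 1TB
print(raid1_usable(disks))     # 1.5
print(raid5_usable(disks))     # 2.0
```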
Re: Potential to loose data in case of disk failure
On Wed, Nov 11, 2015 at 11:30:57AM -0600, Jim Murphy wrote: > Hi all, > > What am I missing or misunderstanding? I have a newly > purchased laptop I want/need to multi boot different OSs > on. As a result after partitioning I have ended up with two > partitions on each of the two internal drives(sda3, sda8, > sdb3 and sdb8). FWIW, sda3 and sdb3 are the same size > and sda8 and sdb8 are the same size. As an end result > I want one btrfs raid1 filesystem. For lack of better terms, > sda3 and sda8 "concatenated" together, sdb3 and sdb8 > "concatenated" together and then mirroring "sda" to "sdb" > using only btrfs. So far have found no use-case to cover > this. > > If I create a raid1 btrfs volume using all 4 "devices" as I > understand it I would loose data if I were to loose a drive > because two mirror possibilities would be: > > sda3 mirrored to sda8 > sdb3 mirrored to sdb8 > > Is what I want to do possible without using MD-RAID and/or > LVM? If so would someone point me to the documentation > I missed. For whatever reason, I don't want to believe that > this can't be done. I want to believe that the code in btrfs > is smart enough to know that sda3 and sda8 are on the same > drive and would not try to mirror data between them except in > a test setup. I hope I just missed some documentation, > somewhere. > > Thanks in advance for your help. And last but not least, > thanks to all for your work on btrfs. > > Jim That's a pretty unusual setup, so I'm not surprised there's no quick and easy answer. The best solution in my opinion would be to shuffle your partitions around and combine sda3 and sda8 into a single partition. There's generally no reason to present btrfs with two different partitions on the same disk. If there's something that prevents you from doing that, you may be able to use RAID10 or RAID6 somehow. I'm not really sure, though, so I'll defer to others on the list for implementation details. 
--Sean
Re: BTRFS as image store for KVM?
On Thu, Sep 17, 2015 at 07:56:08PM +0200, Gert Menke wrote:
> MD+LVM is very close to what I want, but md has no way to cope with
> silent data corruption. So if I'd want to use a guest filesystem that
> has no checksums either, I'm out of luck.
> I'm honestly a bit confused here - isn't checksumming one of the most
> obvious things to want in a software RAID setup? Is it a feature that
> might appear in the future? Maybe I should talk to the md guys...

MD is emulating hardware RAID. In hardware RAID, you are doing work at
the block level. Block-level RAID has no understanding of the
filesystem(s) running on top of it. Therefore it would have to checksum
groups of blocks, and store those checksums on the physical disks
somewhere, perhaps by keeping some portion of the drive for itself. But
then this is not very efficient, since it is maintaining checksums for
data that may be useless (blocks the FS is not currently using). So
then you might make the RAID filesystem aware... and you now have BTRFS
RAID.

Simply put, the block level is probably not an appropriate place for
checksumming to occur. BTRFS can make checksumming work much more
effectively and efficiently by doing it at the filesystem level.

--Sean
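To put rough numbers on the overhead argument (the figures here are illustrative assumptions, not from any real product): with a 4-byte checksum per 4 KiB block, a block layer has to cover the whole device, while a filesystem only covers the extents it actually uses.

```python
CSUM_BYTES = 4      # e.g. a crc32c per block (illustrative)
BLOCK = 4096

def csum_overhead(bytes_covered):
    # Bytes of checksum needed to cover a given amount of data.
    return bytes_covered // BLOCK * CSUM_BYTES

device = 1 * 2 ** 40     # a 1 TiB device, checksummed block-by-block
used = 200 * 2 ** 30     # only 200 GiB of it actually holds data

print(csum_overhead(device) // 2 ** 20)  # 1024 MiB at the block level
print(csum_overhead(used) // 2 ** 20)    # 200 MiB at the fs level
```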
Re: Metadata about to fill up, how to make it bigger next time?
On Wed, Mar 25, 2015 at 04:45:09PM -0700, Anand Patil wrote:
> Hi everyone,
>
> When I run btrfs fi df /path/to/fs, I see:
>
> Data, single: total=53.01GiB, used=51.79GiB
> System, DUP: total=32.00MiB, used=16.00KiB
> Metadata, DUP: total=16.00GiB, used=14.72GiB
>
> My most pressing question is, does that metadata line really mean that
> the filesystem is going to become unusable soon?

No. If you total up your allocations, you will notice that you have
~69GiB allocated, far shy of your total 1000GiB available. BTRFS
allocates in chunks as needed, so once it needs more space for
metadata, it will allocate some. You only get into trouble if there is
no free space to allocate from, and you are far from having that issue.

--Sean
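To put numbers on that (my arithmetic from the df output; note the DUP profile stores two copies, so those chunks occupy twice their logical size on disk):

```python
# Chunk totals from the "btrfs fi df" output, in GiB.
data = 53.01            # single, ratio 1
metadata = 16.00        # DUP, ratio 2
system = 32 / 1024      # 32 MiB, DUP, ratio 2

logical = data + metadata + system       # the ~69GiB figure
raw = data + (metadata + system) * 2     # actual disk space consumed

print(round(logical), round(raw))        # roughly 69 and 85
```

Either way of counting lands far below the 1000GiB total, so there is plenty of unallocated space left to draw from.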
Re: Is it necessary to balance a btrfs raid1 array?
On Wed, Sep 10, 2014 at 08:43:25PM +0200, Goffredo Baroncelli wrote:
> May be that I am missing something obvious, however I have to ask
> which would be the purpose to balance a two disks RAID1 system. The
> balance command should move the data between the disks in order to
> avoid some disk full and other empty; but this assume that there is a
> not symmetrical uses of the disks. Which is not the case for a
> RAID1/two disks system.

Balancing is not necessarily about data distribution between two disks.
You can balance a single-disk BTRFS partition. It's more about
balancing how the data / metadata chunks are allocated and used. It
also (during a re-write of a chunk) honors the RAID rules of that chunk
type.

> Regarding scrub, pay attention that some (consumer) disks are
> guarantee for a (not recoverable) error rate less than 1/10^14 [1] bit
> reads. 10^14 bit are something like 10TB. This means that if you read
> your system 5 times, you may got an error bit.
> I suppose that these are very conservative number, so the likelihood
> of an undetected error is (I hope) lower. But also I am inclined to
> think these number are evaluated in an ideal case (in term of
> temperature, voltage, vibration); this means that the true might be
> worse.
> So if you compare these numbers with your average throughput, you can
> estimate which is the likelihood of an error. Pay attention that a
> scrub job means read all your data: If you have 1T of data, and you
> performs a scrub each week, in three months you reach the 10^14 bit
> reads. This explains the interest in higher redundancy level (raid 6
> or more).
>
> G.Baroncelli

I think there is a bit of misunderstanding here. Those disk error rates
are latent media errors. They're a function of the production quality
of the platters and the amount of time the data rests on the drive.
Reads do not affect this, and in fact can actually help reduce the
error rate.

When a hard drive does a read, it also reads the CRC values for the
sector that it just read. If they match, the drive passes the data on
as good. If not, it attempts error correction on it. If it can correct
the error, it will return the corrected data and (hopefully) re-write
the data on the disk to "fix" the error "permanently." I use quotes
because this could mean that that zone of media is damaged, and it will
probably error again. The disk will eventually re-allocate a sector
that repeatedly returns bad data. This is what you want to happen. So
doing reads, especially across the entire media surface, is a great way
to make the disk perform these sector checks.

But sometimes the disk cannot correct the error. Then the controller
(if it is well-behaved) will return a read error, or sometimes just
bunk data. If the BTRFS scrub sees bad data, it will detect it with its
checksums, and if in a RAID configuration, be able to locate a good
copy of the data to restore.

Long story short: reads don't cause media errors, and scrubs help
detect errors early.

--Sean
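For what it's worth, Goffredo's arithmetic above checks out. A quick sanity check of the quoted 1/10^14 unrecoverable-read-error rate (the figures follow the numbers in his mail, not a specific datasheet):

```python
URE_BITS = 1e14        # "less than 1 error per 10^14 bits read"

# How much data is 10^14 bits?
tb_per_error = URE_BITS / 8 / 1e12
print(tb_per_error)    # 12.5 TB, i.e. "something like 10TB"

# Scrubbing 1 TB of data once a week:
bits_per_scrub = 1e12 * 8
weeks = URE_BITS / bits_per_scrub
print(weeks)           # 12.5 weeks, i.e. about three months
```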
Re: Is it necessary to balance a btrfs raid1 array?
On Thu, Sep 11, 2014 at 12:28:56AM +0200, Goffredo Baroncelli wrote:
> The WD datasheet says something different. It reports "Non-recoverable
> read errors per bits read less than 1/10^14". They express the number
> of error in terms of number of bit reading. You instead are saying
> that the error depends by the disk age. These two sentence are very
> different. ( and of course all these values depend also by the product
> quality).

I'm not certain how those specs are determined. I was basing my
statements on knowledge of how read errors occur in rotating media.

> I think that there is two source of error:
> - a platter/disk degradation (due to ageing, wearing...), which may
>   require a sector relocation
> - other sources of error which are not permanent and that may be
>   corrected by a 2nd read
> I don't have any idea about which one is bigger (even I suspect the
> second).

They are both the same, generally. If the sector is damaged (e.g.
manufacturing fault), then it can do several things. It can always
return bad data, which will result in a reallocation. It can also
partially fail. For example, accept the data, but slowly lose it over
some period of time. It's still due to bad media, but if you were to
read it quickly enough, you may be able to catch it before it goes bad.
If the drive catches (and re-writes) it, then it may have staved off
losing that data that time around.

> > So doing reads, especially across the entire media surface, is a
> > great way to make the disk perform these sector checks. But
> > sometimes the disk cannot correct the error.
>
> I read this as: the error rate is greater than 1/10^14, but the CRC
> and some multiple reading and sector remapping lower the error rate
> below 1/10^14. If behind this there are a dumb drive which returns an
> error as soon as the CRC doesn't match, or a smart drive which retries
> several time until it got a good value doesn't matter: the error rate
> is still 1/10^14.

Yes, the error rate is almost entirely determined by the manufacturing
of the physical media. Controllers can attempt to work around that, but
they won't go searching for media defects on their own (at least, I've
never seen a drive that does.)

> > Long story short, reads don't cause media errors, and scrubs help
> > detect errors early.
>
> Nobody told that a reading cause a media error; however assuming (this
> is how I read the WD datasheet) the error rate constant, if you
> increase the number of reading then you have more errors.
> May be that I was not clear, however I didn't want to say that
> scrubbing reduces the life of disk, I wanted to point out that the
> size of the disk and the error rate are becoming comparable.

I know that wasn't your implication, but I wanted to be sure that
things weren't misinterpreted. I'll clarify: disks have latent errors.
Nothing you can do will change this, and the number of reads you do
will not affect the error rate of the media. It _will_ affect how often
those errors are detected, however. And with btrfs, this is a Good
Thing(TM). If errors are found, they can be corrected by either the
disk controller itself (on the block level) or the filesystem on its
level.

Scrub your disks, folks. A scrubbed disk is a happy disk.

--Sean
Re: Is it necessary to balance a btrfs raid1 array?
On Wed, Sep 10, 2014 at 11:51:19PM -0400, Zygo Blaxell wrote:
> This is a complex topic.

I agree, and I make no claim to be an expert in any of this.

> Some disks have bugs in their firmware, and some of those bugs make
> the data sheets and most of this discussion entirely moot. The
> firmware is gonna do what the firmware's gonna do.

Agreed. That's why I like the fact that btrfs provides another layer of
error checking / correction.

> It's a bad idea to try to rewrite a fading sector in some cases. If
> the drive is located in a climate-controlled data center then it
> should be OK; however, there are multiple causes of read failure and
> some of them will also cause writes to damage adjacent data on the
> disk. Spinning disks stop being able to position their heads properly
> around -10C or so, a fact that will be familiar to anyone who's tried
> to use a laptop outside in winter. Maybe someone dropped the computer,
> and the read errors are due to the heads vibrating with the shock--a
> read retry a few milliseconds later would be OK, but a rewrite
> (without a delay, so the heads are still vibrating from the shock)
> would just wipe out some nearby data with no possibility of recovery.

Of course, the drive can't always know what's going on outside. It just
tries its best (we hope).

> Most of the reallocations I've observed in the field happen when a
> sector is written, not read.

Very true. I believe what happens is that a sector is marked for
re-allocation when the read fails, and a write to that sector will
trigger the actual reallocation. Hence the "pending reallocations"
SMART attribute.

> Most disks can search for defects on their own, but the host has to
> issue a SMART command to initiate such a search. They will also track
> defect rates and log recent error details (with varying degrees of
> bugginess).

And again, it's up to the questionable firmware's discretion as to how
that search is done / how thorough it is. And it has to be triggered by
the user / script. I don't consider that to really be "on its own," as
btrfs scrub requires the same level of input / scripting.

> smartmontools is your friend. It's not a replacement for btrfs scrub,
> but it collects occasionally useful complementary information about
> the health of the drive.

I can't find the link, but there was a study done that showed an
alarmingly high percentage of disk failures presented no SMART errors
before failing.

> There used to be a firmware feature for drives to test themselves
> whenever they are spinning and idle for four continuous hours, but
> most modern disks will power themselves down if they are idle for much
> less time...and who has a disk that's idle for four hours at a time
> anyway? ;)

My backup destination is touched once a day. It averages about 20 hours
a day idle. Though it probably doesn't need to be testing itself 80% of
the time. That would be a mite excessive =P

> > Scrub your disks, folks. A scrubbed disk is a happy disk.
>
> Seconded. Also remember that not all storage errors are due to disk
> failure. There's a lot of RAM, high-speed signalling, and wire between
> the host CPU and a disk platter. SMART self-tests won't detect
> failures in those, but scrubs will.

But we'll save the ECC RAM discussion for another day, perhaps.

--Sean