Re: btrfs partition converted from ext4 becomes read-only minutes after booting: WARNING: CPU: 2 PID: 2777 at ../fs/btrfs/super.c:260 __btrfs_abort_transaction+0x4b/0x120
On Fri, Jun 12, 2015 at 03:19:06PM +0300, Robert Munteanu wrote: Hi, Note to others: kernel 4.0.4 Reply to you: I tried ext4 to btrfs once a year ago and it severely mangled my filesystem. I looked at it as a cool feature/hack that may have worked some time ago, but that no one really uses anymore, and that may not work right at this point. Unless you hear back from a developer interested in debugging/fixing this, I would assume that this feature is broken and dead. Marc I have converted my root ext4 partition to btrfs. I used an USB stick to boot and used btrfs-convert. I also did a balance and defrag ( in that order ) , both when the fs was mounted. After logging in to KDE I quickly get a read-only filesystem. I've pasted the backtrace below Jun 11 23:13:08 mars kernel: WARNING: CPU: 2 PID: 2777 at ../fs/btrfs/super.c:260 __btrfs_abort_transaction+0x4b/0x120 [btrfs]() Jun 11 23:13:08 mars kernel: BTRFS: Transaction aborted (error -95) Jun 11 23:13:08 mars kernel: Modules linked in: bnep bluetooth rfkill fuse vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) af_packet nf_log_ipv6 xt_pkttype nf_log_ip v4 nf_log_common xt_LOG xt_limit ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT iptable_raw xt_CT iptable_filter ip6table_mangle nf_con ntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables xfs libcrc32c snd_hda _codec_hdmi raid1 md_mod gpio_ich ppdev iTCO_wdt iTCO_vendor_support coretemp snd_hda_codec_realtek snd_hda_codec_generic kvm_intel snd_hda_intel dm_mod kvm snd_hda_co ntroller snd_hda_codec snd_hwdep serio_raw pcspkr snd_pcm i2c_i801 snd_seq joydev snd_seq_device snd_timer snd 8250_fintek parport_pc parport acpi_cpufreq lpc_ich Jun 11 23:13:08 mars kernel: soundcore mfd_core shpchp processor ata_generic btrfs hid_logitech_hidpp xor raid6_pq sr_mod cdrom nvidia_uvm(PO) nvidia(PO) firewire_ohc i firewire_core crc_itu_t uas usb_storage 
r8169 mii pata_jmicron hid_logitech_dj drm button sg Jun 11 23:13:08 mars kernel: CPU: 2 PID: 2777 Comm: kworker/u8:0 Tainted: P O4.0.4-3-desktop #1 Jun 11 23:13:08 mars kernel: Hardware name: Gigabyte Technology Co., Ltd. EP35-DS4/EP35-DS4, BIOS F6d 01/08/2009 Jun 11 23:13:08 mars kernel: Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs] Jun 11 23:13:08 mars kernel: a0a92832 8167c4aa 880128513ca8 Jun 11 23:13:08 mars kernel: 81063bb1 880031929d28 880221e71800 ffa1 Jun 11 23:13:08 mars kernel: a0a914e0 0b50 81063c2a a0a95928 Jun 11 23:13:08 mars kernel: Call Trace: Jun 11 23:13:08 mars kernel: [8100574c] dump_trace+0x8c/0x340 Jun 11 23:13:08 mars kernel: [81005aa3] show_stack_log_lvl+0xa3/0x190 Jun 11 23:13:08 mars kernel: [81007201] show_stack+0x21/0x50 Jun 11 23:13:08 mars kernel: [8167c4aa] dump_stack+0x47/0x67 Jun 11 23:13:08 mars kernel: [81063bb1] warn_slowpath_common+0x81/0xb0 Jun 11 23:13:08 mars kernel: [81063c2a] warn_slowpath_fmt+0x4a/0x50 Jun 11 23:13:08 mars kernel: [a09e598b] __btrfs_abort_transaction+0x4b/0x120 [btrfs] Jun 11 23:13:08 mars kernel: [a0a1d18a] btrfs_finish_ordered_io+0x5aa/0x620 [btrfs] Jun 11 23:13:08 mars kernel: [a0a43253] normal_work_helper+0xc3/0x320 [btrfs] Jun 11 23:13:08 mars kernel: [8107bcf2] process_one_work+0x142/0x420 Jun 11 23:13:08 mars kernel: [8107c0e4] worker_thread+0x114/0x460 Jun 11 23:13:08 mars kernel: [81081261] kthread+0xc1/0xe0 Jun 11 23:13:08 mars kernel: [81682d58] ret_from_fork+0x58/0x90 Jun 11 23:13:08 mars kernel: ---[ end trace 4c4eb7d6e98afa91 ]--- Jun 11 23:13:08 mars kernel: BTRFS: error (device sda1) in btrfs_finish_ordered_io:2896: errno=-95 unknown Jun 11 23:13:08 mars kernel: BTRFS info (device sda1): forced readonly Some diagnostic info: - btrfs scrub reports no errors - on the host machine I'm running btrfs v4.0+20150429 and kernel 4.0.4-3-desktop - on the live medium, used to run btrfs-convert, I was running btrfs v4.0+20150429 and kernel 4.0.3-1-default # btrfs fi show Label: none 
uuid: 54dea125-74cd-4bb2-86a2-f7bc645b76cf
Total devices 1 FS bytes used 90.22GiB
devid 1 size 223.57GiB used 92.03GiB path /dev/sda1
btrfs-progs v4.0+20150429

# btrfs fi df /
Data, single: total=89.00GiB, used=88.17GiB
System, single: total=32.00MiB, used=16.00KiB
Metadata, single: total=3.00GiB, used=2.05GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Is there a way out? I still have the old ext4 image and can revert, but I'm keeping the btrfs one for now, in case I can extract some useful debugging information from it.

Thanks, Robert
-- http://robert.muntea.nu/
Re: How do I make 'btrfs scrub' report errors via email?
On Sat, Jun 13, 2015 at 10:48:35PM +0900, crocket wrote:

I can check the result of 'btrfs scrub' later, but I don't want to take time to actually check it. Does anyone know how to make 'btrfs scrub' report errors via email? It seems google doesn't know.

See the bottom of: http://marc.merlins.org/perso/btrfs/2014-03.html#Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair

You can remove shlock from the script if you don't have it (or use another lock script).

Marc
-- A mouse is a device used to point at the xterm you want to type in - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
-- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
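For readers who would rather not adapt the linked shell script, the same idea can be sketched in Python. This is only an illustration under stated assumptions, not the script Marc refers to: it assumes that `btrfs scrub start -B` stays in the foreground and exits non-zero when the scrub fails or finds uncorrectable errors (as the btrfs-progs manual describes; worth verifying on your version), and the names `build_scrub_report`, `scrub_and_report` and the `root@localhost` address are hypothetical. Corrected errors do not necessarily change the exit status, so the captured output may still be worth reading.

```python
import subprocess
from email.message import EmailMessage
from typing import Optional


def build_scrub_report(mountpoint: str, returncode: int, output: str,
                       recipient: str = "root@localhost") -> Optional[EmailMessage]:
    """Return a mail to send, or None when the scrub exited cleanly.

    `recipient` is a placeholder address (an assumption for this sketch).
    """
    if returncode == 0:
        return None
    msg = EmailMessage()
    msg["Subject"] = f"btrfs scrub problem on {mountpoint} (exit status {returncode})"
    msg["From"] = recipient
    msg["To"] = recipient
    msg.set_content(output or "(no output captured)")
    return msg


def scrub_and_report(mountpoint: str) -> Optional[EmailMessage]:
    """Run a foreground scrub and build a report if it did not exit cleanly."""
    proc = subprocess.run(["btrfs", "scrub", "start", "-B", mountpoint],
                          capture_output=True, text=True)
    return build_scrub_report(mountpoint, proc.returncode,
                              proc.stdout + proc.stderr)
```

From a cron job or timer, a non-None result could be handed to `smtplib.SMTP("localhost").send_message(msg)`, assuming a local mail relay is listening.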
Re: [systemd-devel] [survey] BTRFS_IOC_DEVICES_READY return status
On 2015-06-15 19:38, Lennart Poettering wrote:

On Mon, 15.06.15 19:23, Goffredo Baroncelli (kreij...@inwind.it) wrote:

On 2015-06-15 12:46, Lennart Poettering wrote:

On Sat, 13.06.15 17:09, Goffredo Baroncelli (kreij...@libero.it) wrote:

Further, the problem will be more intense in this example: if you use dd and copy device A to device B. After you mount device A, by just providing device B in the above two commands you could let the kernel update the device path; again, all the IO (since the device is mounted) is still going to device A (not B), but /proc/self/mounts and 'btrfs fi show' show it as device B (not A). It's a bug, and very tricky to fix.

In the past [*] I proposed a mount.btrfs helper. I tried to move the logic outside the kernel. I think that the problem is that we try to manage all these cases from a device point of view: when a device appears, we register the device and we try to mount the filesystem... This works very well when there is a 1-volume filesystem. For the other cases there is a mess between the different layers:
- kernel
- udev/systemd
- initrd logic

My attempt followed a different idea: the mount helper waits for the devices if needed, or, where appropriate, mounts the filesystem in degraded mode. All devices are passed as mount arguments (--device=/dev/sdX); there is no device registration: this avoids all these problems.

Hmm, no. /bin/mount should not block for devices. That's generally incompatible with how the tool is used, and in particular from systemd. We would not make use of such a scheme in systemd. /bin/mount should always be short-running.

Apart from systemd, what are these incompatibilities?

Well, /bin/mount is not a daemon, and it should not be one.

My helper is not a daemon; you were correct the first time: it blocks until all needed/enough devices have appeared. Anyway this should not be different from mounting an nfs filesystem. Even in this case the mount helper blocks until the connection has happened. The block time is not negligible, even though not as long as a device timeout ...

I am pretty sure that if such automatic degraded mounting should be supported, then this should be done with some background storage daemon that alters the effect of the READY ioctl somehow after the timeout, and then retriggers the devices so that systemd takes note. (or, alternatively: such a scheme could even be implemented all in kernel, based on some configurable kernel setting...)

I recognize that this solution provides the maximum compatibility with the current implementation. However it seems too complex to me. Retriggering a device seems to me more a workaround than a solution.

Well, it's not really ugly. I mean, if the state or properties of a device change, then udev should update its information about it, and that's done via a retrigger. We do that all the time already, for example when an existing loopback device gets a backing file assigned or removed. I am pretty sure that the loopback case is very close to what you want to do here, hence retriggering (either from the kernel side, or from userspace) appears like an OK thing to do.

What seems strange to me is that in this case the devices haven't changed their status. How is this problem managed in the md/dm raid cases?

Could a generator do this job? I.e. this generator (or storage daemon) waits until all (or enough) devices have appeared, then it creates a .mount unit: do you think that it is doable?

systemd generators are a way to extend the systemd unit dep tree with units. They are very short-running, and are executed only very, very early at boot. They cannot wait for anything, they don't have access to devices, and they are not run when devices appear.

Lennart

-- gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: btrfs partition converted from ext4 becomes read-only minutes after booting: WARNING: CPU: 2 PID: 2777 at ../fs/btrfs/super.c:260 __btrfs_abort_transaction+0x4b/0x120
On Wed, 17 Jun 2015 10:46:30 -0700, Marc MERLIN m...@merlins.org wrote:

I tried ext4 to btrfs once a year ago and it severely mangled my filesystem. I looked at it as a cool feature/hack that may have worked some time ago, but that no one really uses anymore, and that may not work right at this point.

Just another data point: when I switched to btrfs in the middle of last year I used btrfs-convert on two file systems (an SSD and my backup partition on a USB 3.0 HDD), and it worked in both cases (i.e., no data loss). I did see some strange balance issues (see the ML archives), but IIRC nothing really serious.

-- Marc Joliet
-- People who think they know everything really annoy those of us who know we don't - Bjarne Stroustrup
Re: btrfs partition converted from ext4 becomes read-only minutes after booting: WARNING: CPU: 2 PID: 2777 at ../fs/btrfs/super.c:260 __btrfs_abort_transaction+0x4b/0x120
On 6/12/15 8:19 AM, Robert Munteanu wrote:

Hi, I have converted my root ext4 partition to btrfs. I used a USB stick to boot and used btrfs-convert. I also did a balance and defrag (in that order), both when the fs was mounted. After logging in to KDE I quickly get a read-only filesystem. I've pasted the backtrace below:

Jun 11 23:13:08 mars kernel: WARNING: CPU: 2 PID: 2777 at ../fs/btrfs/super.c:260 __btrfs_abort_transaction+0x4b/0x120 [btrfs]()
Jun 11 23:13:08 mars kernel: BTRFS: Transaction aborted (error -95)
Jun 11 23:13:08 mars kernel: Modules linked in: bnep bluetooth rfkill fuse vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) af_packet nf_log_ipv6 xt_pkttype nf_log_ipv4 nf_log_common xt_LOG xt_limit ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT iptable_raw xt_CT iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables xfs libcrc32c snd_hda_codec_hdmi raid1 md_mod gpio_ich ppdev iTCO_wdt iTCO_vendor_support coretemp snd_hda_codec_realtek snd_hda_codec_generic kvm_intel snd_hda_intel dm_mod kvm snd_hda_controller snd_hda_codec snd_hwdep serio_raw pcspkr snd_pcm i2c_i801 snd_seq joydev snd_seq_device snd_timer snd 8250_fintek parport_pc parport acpi_cpufreq lpc_ich
Jun 11 23:13:08 mars kernel: soundcore mfd_core shpchp processor ata_generic btrfs hid_logitech_hidpp xor raid6_pq sr_mod cdrom nvidia_uvm(PO) nvidia(PO) firewire_ohci firewire_core crc_itu_t uas usb_storage r8169 mii pata_jmicron hid_logitech_dj drm button sg
Jun 11 23:13:08 mars kernel: CPU: 2 PID: 2777 Comm: kworker/u8:0 Tainted: P O 4.0.4-3-desktop #1

openSUSE Tumbleweed, I take it? We still actively support btrfs-convert through SLES, so we're invested in ensuring it continues working properly. I'd be interested in seeing images of both filesystems to investigate and to see if we can reproduce it. Errno -95 is -EOPNOTSUPP, which is kind of strange to see. I can see a few possible places it would get passed up with a trace like that, but being able to reproduce it would be extremely helpful.

-Jeff

Jun 11 23:13:08 mars kernel: Hardware name: Gigabyte Technology Co., Ltd. EP35-DS4/EP35-DS4, BIOS F6d 01/08/2009
Jun 11 23:13:08 mars kernel: Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
Jun 11 23:13:08 mars kernel: a0a92832 8167c4aa 880128513ca8
Jun 11 23:13:08 mars kernel: 81063bb1 880031929d28 880221e71800 ffa1
Jun 11 23:13:08 mars kernel: a0a914e0 0b50 81063c2a a0a95928
Jun 11 23:13:08 mars kernel: Call Trace:
Jun 11 23:13:08 mars kernel: [8100574c] dump_trace+0x8c/0x340
Jun 11 23:13:08 mars kernel: [81005aa3] show_stack_log_lvl+0xa3/0x190
Jun 11 23:13:08 mars kernel: [81007201] show_stack+0x21/0x50
Jun 11 23:13:08 mars kernel: [8167c4aa] dump_stack+0x47/0x67
Jun 11 23:13:08 mars kernel: [81063bb1] warn_slowpath_common+0x81/0xb0
Jun 11 23:13:08 mars kernel: [81063c2a] warn_slowpath_fmt+0x4a/0x50
Jun 11 23:13:08 mars kernel: [a09e598b] __btrfs_abort_transaction+0x4b/0x120 [btrfs]
Jun 11 23:13:08 mars kernel: [a0a1d18a] btrfs_finish_ordered_io+0x5aa/0x620 [btrfs]
Jun 11 23:13:08 mars kernel: [a0a43253] normal_work_helper+0xc3/0x320 [btrfs]
Jun 11 23:13:08 mars kernel: [8107bcf2] process_one_work+0x142/0x420
Jun 11 23:13:08 mars kernel: [8107c0e4] worker_thread+0x114/0x460
Jun 11 23:13:08 mars kernel: [81081261] kthread+0xc1/0xe0
Jun 11 23:13:08 mars kernel: [81682d58] ret_from_fork+0x58/0x90
Jun 11 23:13:08 mars kernel: ---[ end trace 4c4eb7d6e98afa91 ]---
Jun 11 23:13:08 mars kernel: BTRFS: error (device sda1) in btrfs_finish_ordered_io:2896: errno=-95 unknown
Jun 11 23:13:08 mars kernel: BTRFS info (device sda1): forced readonly

Some diagnostic info:
- btrfs scrub reports no errors
- on the host machine I'm running btrfs v4.0+20150429 and kernel 4.0.4-3-desktop
- on the live medium, used to run btrfs-convert, I was running btrfs v4.0+20150429 and kernel 4.0.3-1-default

# btrfs fi show
Label: none uuid: 54dea125-74cd-4bb2-86a2-f7bc645b76cf
Total devices 1 FS bytes used 90.22GiB
devid 1 size 223.57GiB used 92.03GiB path /dev/sda1
btrfs-progs v4.0+20150429

# btrfs fi df /
Data, single: total=89.00GiB, used=88.17GiB
System, single: total=32.00MiB, used=16.00KiB
Metadata, single: total=3.00GiB, used=2.05GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Is there a way out? I still have the old ext4 image and can revert, but I'm keeping the btrfs one for now, in case I can extract some useful debugging information from it.

Thanks, Robert

-- Jeff Mahoney
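The log prints the failure only as a number ("errno=-95 unknown"), and Jeff's reply identifies it as -EOPNOTSUPP. A small helper can do that lookup when reading such logs. The sketch below is illustrative only: the table is a hand-picked excerpt of the generic Linux errno numbers, and the function name `decode_errno` is an assumption of this example, not part of any btrfs tool.

```python
import re
from typing import Optional

# Hand-picked excerpt of generic Linux errno numbers (asm-generic/errno*.h).
LINUX_ERRNO = {
    5: "EIO",
    12: "ENOMEM",
    28: "ENOSPC",
    30: "EROFS",
    95: "EOPNOTSUPP",
}

# Matches both spellings seen in the trace: "errno=-95" and "(error -95)".
_ERRNO_RE = re.compile(r"errno=-(\d+)|\(error -(\d+)\)")


def decode_errno(log_line: str) -> Optional[str]:
    """Return the symbolic errno found in a kernel log line, if any."""
    match = _ERRNO_RE.search(log_line)
    if match is None:
        return None
    number = int(match.group(1) or match.group(2))
    return LINUX_ERRNO.get(number, f"unknown ({number})")
```

Feeding it the two BTRFS lines from the trace yields `EOPNOTSUPP` in both cases; lines without an errno return `None`.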
Re: btrfs differential receive has become excrutiatingly slow on one machine
On Wed, May 13, 2015 at 12:35:20PM +0100, Filipe David Manana wrote:

It's a broad question, but how can I diagnose btrfs send being so slow without taking the risk of killing my connection? (if there is no good answer on this one, I can try another sync later with -vvv and strace if you'd like)

Try to see if it's a problem at the sending side or at the receiving side. Redirect send's output to a file, see how much it takes. Then run receive with that file as input and see how long it takes. You can also use 'perf record -ag' while doing both, it might give some useful information.

Hi Filipe, sorry this took a while, I've been away and then had my server hardware die, but things are back up and I'm now trying to sync a 100GB btrfs diff from my laptop to my server.

I've confirmed that btrfs send is fast (I sent it to a file). Then scp of the diff is fast too (45 min for 100GB over wireless). But the restore is slow. I see files going by quickly, and then it hangs and restarts.

You requested strace -T in the past. I'm showing an excerpt of system calls that take more than 1 second. When I see this, I get worried:

truncate(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/merlin-change/Maildir.google/lists2/new/1432663866_0.19916.legolas,U=427356,FMD5=7e806062200fb6d33546530d24aac86c:2,, 21043) = 0 19.335333

Or this:

unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/media/tuners/mt2266.mod.dwo) = 0 28.298224
unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/merlin-change/Maildir.google/INBOX/cur/1432061846_0.2789.legolas,U=381014,FMD5=7e33429f656f1e6e9d79b29c3f82c57e:2,) = 0 45.084068

19 seconds for a truncate, or 28 or 45 seconds for an unlink, cannot be right of course.

It's btrfs over dmcrypt over swraid5 (5 drives).
Filesystem is only half full: gargamel:~# btrfs fi show /mnt/btrfs_pool2/ Label: 'dshelf2' uuid: d4a51178-c1e6-4219-95ab-5c5864695bfd Total devices 1 FS bytes used 4.34TiB devid1 size 7.28TiB used 4.43TiB path /dev/mapper/dshelf2 gargamel:~# btrfs fi df /mnt/btrfs_pool2/ Data, single: total=4.28TiB, used=4.28TiB System, DUP: total=8.00MiB, used=496.00KiB System, single: total=4.00MiB, used=0.00B Metadata, DUP: total=77.50GiB, used=68.05GiB Metadata, single: total=8.00MiB, used=0.00B GlobalReserve, single: total=512.00MiB, used=192.00KiB More system calls: gargamel:~# grep ' [1-9]' /mnt/btrfs_pool2/strace ioctl(3, BTRFS_IOC_SNAP_CREATE_V2, 0x7ffca6be3a40) = 0 11.430413 link(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/www/Pix/new/tmp01/dsc05964.jpg, /mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/www/Pix/albums/Outings/Dinners/Misc/20150228_Alexander_Patisserie.jpg) = 0 1.572029 ) = 93 1.195424 truncate(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/merlin-change/.config/google-chrome-beta/Profile 1/Storage/ext/beknehfpfkghjoafdifaflglpjkojoco/def/Cookies, 7168) = 0 53.618366 unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/iio/dac/.ad7303.ko.cmd) = 0 64.146415 unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/iio/dac/.mcp4725.ko.cmd) = 0 10.403782 unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/iio/dac/ad5360.mod.dwo) = 0 2.180297 unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/iio/dac/ad5360.mod.o) = 0 1.278088 
unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/md/persistent-data/built-in.o) = 0 3.706420 unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/media/tuners/mt2266.mod.dwo) = 0 28.298224 unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/media/usb/dvb-usb-v2/.tmp_af9035.dwo) = 0 7.626091 unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/net/wimax/i2400m/i2400m.ko) = 0 25.749138 unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/merlin-change/.config/google-chrome-unstable/Default/Cache/f_002221) = 0 1.751792
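The filtering Marc does with grep can also be sketched as a short script that ranks the slow calls. This is a hedged illustration, not part of the original report: it assumes the usual `strace -T` layout in which the elapsed time is appended in angle brackets, e.g. `unlink("/path") = 0 <28.298224>` (the quotes and brackets appear to have been lost in this archive copy), and the function name `slow_calls` is hypothetical.

```python
import re
from typing import Iterable, List, Tuple

# syscall(args) = return-value [errno text] <seconds>
_STRACE_T = re.compile(
    r"^(?P<call>\w+)\(.*\)\s+=\s+(?P<ret>-?\d+).*<(?P<secs>\d+\.\d+)>\s*$"
)


def slow_calls(lines: Iterable[str], threshold: float = 1.0) -> List[Tuple[str, float]]:
    """Return (syscall, seconds) pairs at or above `threshold`, slowest first."""
    found = []
    for line in lines:
        match = _STRACE_T.match(line.strip())
        if match and float(match.group("secs")) >= threshold:
            found.append((match.group("call"), float(match.group("secs"))))
    return sorted(found, key=lambda item: item[1], reverse=True)
```

Run over a saved trace (for example `slow_calls(open("strace.out"))`, with the file name being a placeholder), it gives a quick view of which system calls dominate the stalls.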
Re: btrfs differential receive has become excrutiatingly slow on one machine
On Wed, Jun 17, 2015 at 10:58:05AM -0700, Marc MERLIN wrote:

You requested strace -T in the past. I'm showing an excerpt of system calls that take more than 1 second. When I see this, I get worried:

truncate(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/merlin-change/Maildir.google/lists2/new/1432663866_0.19916.legolas,U=427356,FMD5=7e806062200fb6d33546530d24aac86c:2,, 21043) = 0 19.335333

Or this:

unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/media/tuners/mt2266.mod.dwo) = 0 28.298224
unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/merlin-change/Maildir.google/INBOX/cur/1432061846_0.2789.legolas,U=381014,FMD5=7e33429f656f1e6e9d79b29c3f82c57e:2,) = 0 45.084068

19 seconds for a truncate, or 28 or 45 seconds for an unlink, cannot be right of course.

Interesting. The restore only took 2.5h. It's still too long but not as bad as I thought.

But now I think I understand what's going on: the frequent pauses of a few seconds to 30s or more totally destroy the TCP flow, causing the sender to stop, restart sending slowly, and ramp up the speed, only to be stopped again. No wonder, then, that it can take 12h or more when I have send talk to receive over the network.

So now the question is why the receive pauses for so long, and pseudo-randomly. Is there anything I can provide on that filesystem that would help?

Thanks, Marc
[PATCH] check: check so offset is not bigger then the leaf
This could crash before because of dangerous dangling offset of pointer.

Signed-off-by: Robert Marklund robbelibob...@gmail.com
---
 cmds-check.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index 778f141..da36758 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -8906,6 +8906,16 @@ static int build_roots_info_cache(struct btrfs_fs_info *info)
 			goto next;
 
 		ei = btrfs_item_ptr(leaf, slot, struct btrfs_extent_item);
+
+		if ((long long)ei > info->extent_root->leafsize) {
+			fprintf(stderr, "Bad leaf = %p, slot = %d\n", leaf, slot);
+			fprintf(stderr, "item ptr = %p\n", ei);
+			fprintf(stderr, "objectid = %llx\n", found_key.objectid);
+			fprintf(stderr, "type = %x\n", found_key.type);
+			fprintf(stderr, "offset = %llx\n", found_key.offset);
+			goto next;
+		}
+
 		flags = btrfs_extent_flags(leaf, ei);
 
 		if (found_key.type == BTRFS_EXTENT_ITEM_KEY
-- 
2.1.0
btrfs-progs 4.1-rc1: btrfstune -u reporting incorrect current fsid?
Hi,

I've done a quick test on changing the UUID of a btrfs. It worked, but btrfstune -u didn't print the same current uuid that btrfs fi sh does. It also uppercases the UUID, whereas btrfs fi sh and blkid don't.

Thanks, Mike

# btrfs filesystem show /dev/sdb1 | fgrep uuid
Label: none uuid: b2813976-4d8b-4976-9d59-cbfbd588399c
# ~fedora/programming/c/btrfs-progs-unstable/btrfstune -f -u /dev/sdb1
Current fsid: ---00B0-8F12937F
New fsid: D294F3F3-F2B7-4407-B83A-DE5A4F8CBAB1
Set superblock flag CHANGING_FSID
Change fsid in extents
Change fsid on devices
Clear superblock flag CHANGING_FSID
Fsid change finished
# btrfs filesystem show /dev/sdb1 | fgrep uuid
Label: none uuid: d294f3f3-f2b7-4407-b83a-de5a4f8cbab1
# blkid | fgrep sdb1
/dev/sdb1: UUID=d294f3f3-f2b7-4407-b83a-de5a4f8cbab1 UUID_SUB=70065403-5ec1-462c-93a4-26cff8b6aea2 TYPE=btrfs PARTUUID=b309c48c-486f-4882-896c-34d4d0aeb529
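Of the two observations in this report, the upper-case output is cosmetic while the mismatching "Current fsid" looks like the real problem. A script comparing the tools' output can make that distinction by parsing the strings as UUIDs instead of comparing text. The following is a minimal sketch using Python's standard `uuid` module; the helper names `is_valid_fsid` and `same_fsid` are assumptions of this example.

```python
import uuid


def is_valid_fsid(text: str) -> bool:
    """True when `text` parses as a UUID, regardless of letter case."""
    try:
        uuid.UUID(text)
    except ValueError:
        return False
    return True


def same_fsid(a: str, b: str) -> bool:
    """Compare two fsid strings by value rather than by spelling."""
    return uuid.UUID(a) == uuid.UUID(b)
```

With the values above, the "New fsid" line and the later `btrfs filesystem show` output compare equal despite the case difference, whereas the "Current fsid" string, as it appears in this archive copy, does not parse as a UUID at all.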
Re: [systemd-devel] [survey] BTRFS_IOC_DEVICES_READY return status
On Wed, 17.06.15 21:10, Goffredo Baroncelli (kreij...@libero.it) wrote:

Well, /bin/mount is not a daemon, and it should not be one.

My helper is not a daemon; you were correct the first time: it blocks until all needed/enough devices have appeared. Anyway this should not be different from mounting an nfs filesystem. Even in this case the mount helper blocks until the connection has happened. The block time is not negligible, even though not as long as a device timeout ...

Well, the mount tool doesn't wait for the network to be configured or so. It just waits for a response from the server. That's quite a difference.

Well, it's not really ugly. I mean, if the state or properties of a device change, then udev should update its information about it, and that's done via a retrigger. We do that all the time already, for example when an existing loopback device gets a backing file assigned or removed. I am pretty sure that the loopback case is very close to what you want to do here, hence retriggering (either from the kernel side, or from userspace) appears like an OK thing to do.

What seems strange to me is that in this case the devices haven't changed their status. How is this problem managed in the md/dm raid cases?

md has a daemon mdmon to my knowledge.

Lennart
-- Lennart Poettering, Red Hat
Re: [RFC PATCH v2 2/2] Btrfs: improve fsync for nocow file
On Wed, Jun 17, 2015 at 05:58:36PM +0200, David Sterba wrote:

On Wed, Jun 17, 2015 at 03:54:32PM +0800, Liu Bo wrote:

If we're overwriting an allocated file without changing timestamp and inode version, and the file is with NODATACOW, we don't have any metadata to commit, thus we can just flush the data device cache and go forward.

However, if there's any change on extents' disk bytenr, inode size, timestamp or inode version, we need to go through the normal btrfs_log_inode path.

Test:
1. sysbench test of 1 file + 1 thread + bs=4k + size=40k + synchronous I/O mode + randomwrite + fsync_on_each_write,
2. loop device associated with tmpfs file
3.
  - For btrfs, -o nodatacow and -o noi_version option
  - For ext4 and xfs, no extra mount options

Results:
- btrfs:
  w/o: ~30Mb/sec
  w: ~131Mb/sec
- other filesystems: (both don't enable i_version by default)
  ext4: 203Mb/sec
  xfs: 212Mb/sec

Signed-off-by: Liu Bo bo.li@oracle.com
---
v2: Catch errors from data writeback and skip barrier if necessary.

 fs/btrfs/btrfs_inode.h |2 +
 fs/btrfs/disk-io.c |2 +-
 fs/btrfs/disk-io.h |1 +
 fs/btrfs/file.c| 54 +++
 fs/btrfs/inode.c |3 ++
 5 files changed, 56 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index de5e4f2..f7b99b6 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -44,6 +44,8 @@
 #define BTRFS_INODE_IN_DELALLOC_LIST 9
 #define BTRFS_INODE_READDIO_NEED_LOCK 10
 #define BTRFS_INODE_HAS_PROPS 11
+#define BTRFS_INODE_NOTIMESTAMP 12
+#define BTRFS_INODE_NOISIZE 13

It's not clear what the flags mean and that they're related to syncing under some conditions. What do you think about BTRFS_ILOG_NOTIMESTAMP and BTRFS_ILOG_NOISIZE?

 /*
  * The following 3 bits are meant only for the btree inode.
  * When any of them is set, it means an error happened while writing an

--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -60,6 +60,7 @@ void close_ctree(struct btrfs_root *root);
 int write_ctree_super(struct btrfs_trans_handle *trans,
 		      struct btrfs_root *root, int max_mirrors);
 struct buffer_head *btrfs_read_dev_super(struct block_device *bdev);
+int barrier_all_devices(struct btrfs_fs_info *info);
 int btrfs_commit_super(struct btrfs_root *root);
 struct extent_buffer *btrfs_find_tree_block(struct btrfs_root *root,
 					    u64 bytenr);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index faa7d39..180a3e1 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -523,8 +523,12 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
 	 * the disk i_size. There is no need to log the inode
 	 * at this time.
 	 */
-	if (end_pos > isize)
+	if (end_pos > isize) {
 		i_size_write(inode, end_pos);
+		clear_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
+	} else {
+		set_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
+	}
 	return 0;
 }
 
@@ -1715,19 +1719,33 @@ out:
 static void update_time_for_write(struct inode *inode)
 {
 	struct timespec now;
+	int sync_it = 0;
 
-	if (IS_NOCMTIME(inode))
+	if (IS_NOCMTIME(inode)) {
+		set_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
 		return;
+	}
 
 	now = current_fs_time(inode->i_sb);
-	if (!timespec_equal(&inode->i_mtime, &now))
+	if (!timespec_equal(&inode->i_mtime, &now)) {
 		inode->i_mtime = now;
+		sync_it = S_MTIME;
+	}
 
-	if (!timespec_equal(&inode->i_ctime, &now))
+	if (!timespec_equal(&inode->i_ctime, &now)) {
 		inode->i_ctime = now;
+		sync_it |= S_CTIME;
+	}
 
-	if (IS_I_VERSION(inode))
+	if (IS_I_VERSION(inode)) {
 		inode_inc_iversion(inode);
+		sync_it |= S_VERSION;
+	}
+
+	if (!sync_it)
+		set_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
+	else
+		clear_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
 }
 
 static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
@@ -1983,6 +2001,32 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 		goto out;
 	}
 
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) {
+		if (test_and_clear_bit(BTRFS_INODE_NOTIMESTAMP,
+				       &BTRFS_I(inode)->runtime_flags) &&
+		    test_and_clear_bit(BTRFS_INODE_NOISIZE,
+
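Because the quoted diff breaks off before the new fsync branch is complete, the rule stated in the commit message can be restated as a toy model. This is a sketch of the described condition only, not the kernel code: the `InodeState` fields and the `fsync_path` name are hypothetical, and the real patch tracks the same facts through runtime flag bits.

```python
from dataclasses import dataclass


@dataclass
class InodeState:
    """Hypothetical summary of what changed since the last fsync."""
    nodatacow: bool
    timestamp_or_version_changed: bool
    isize_changed: bool
    extent_location_changed: bool = False


def fsync_path(state: InodeState) -> str:
    """Pick the fsync strategy described in the commit message.

    "flush-only": no metadata to commit, just flush the device cache.
    "log-inode":  fall back to the normal btrfs_log_inode path.
    """
    nothing_to_log = not (state.timestamp_or_version_changed
                          or state.isize_changed
                          or state.extent_location_changed)
    if state.nodatacow and nothing_to_log:
        return "flush-only"
    return "log-inode"
```

An in-place overwrite of a NODATACOW file with unchanged timestamps and size takes the cheap path; any other combination falls back to logging the inode, which matches the two cases the commit message distinguishes.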
Re: Automatic balance after mkfs?
Austin S Hemmelgarn wrote on 2015/06/16 09:21 -0400:

On 2015-06-16 09:13, Holger Hoffstätte wrote:

Forking from the other thread..

On Tue, 16 Jun 2015 12:25:45 +, Hugo Mills wrote:

Yes. It's an artefact of the way that mkfs works. If you run a balance on those chunks, they'll go away. (btrfs balance start -dusage=0 -musage=0 /mountpoint)

Since I had to explain this very same thing to a new btrfs-using friend just yesterday I wondered if it might not make sense for mkfs to issue a general balance after creating the fs? It should be simple enough (just issue the balance ioctl?) and not have any negative side effects. Just doing such a post-mkfs cleanup automatically would certainly reduce the number of times we have to explain this. :) Any reasons why we couldn't/shouldn't do this?

Following the same line of thinking, is there any reason we couldn't just rewrite mkfs to get rid of this legacy behavior?

Compared to the more complex auto balance, rewriting mkfs is a much better idea. The original mkfs seems easy for developers, but bad for users. I like the idea and I'll try to implement it if I have spare time.

Thanks. Qu
Re: trim not working and irreparable errors from btrfsck
Austin S Hemmelgarn posted on Wed, 17 Jun 2015 13:17:22 -0400 as excerpted:

On 2015-06-17 11:40, Christian wrote:

On 06/17/2015 11:28 AM, Chris Murphy wrote:

However, fstrim still gives me 0 B (0 bytes) trimmed, so that may be another problem. Is there a way to check if trim works?

That sounds like maybe your SSD is blacklisted for trim, is all I can think of. So trim shouldn't be the cause of the problem if it's being blacklisted. The recent problems appear to be around newer SSDs that support queued trim and newer kernels that issue queued trim. There have been some patches related to trim to the kernel, but the existence of blacklisting and claims of bugs in firmware make it difficult to test and isolate.

http://techreport.com/news/28473/some-samsung-ssds-may-suffer-from-a-buggy-trim-implementation

This is an Intel SSD in a Lenovo Thinkpad X1 Carbon. Trim worked until a few weeks ago and still works for my small ext4 boot partition (just ran it to check). I will keep looking for a solution. Thanks!

I'm seeing the same issue here, but with a Crucial brand SSD. Somewhat interestingly, I don't see any issues like this with BTRFS on top of LVM's thin-provisioning volumes, or with any other filesystems, so I think it has something to do with how BTRFS is reporting unused space or how it is submitting the discard requests.

FWIW, there's a current btrfs patch in progress that relates to problems with btrfs trim. But while I do have SSDs, I purposefully overprovisioned them by nearly 100% (IOW I partitioned only about 55%, the rest is entirely unused), so trim isn't as critical here as it is for many. I don't use the discard mount option, and have a systemd timer job set up to automate my fstrims and don't worry about the output too much, so I haven't been following the patch progress /that/ closely.

But I do know that recent kernel btrfs trims (either fstrim or discard mount option triggered) haven't been working as originally intended due to some bug, and this patch is supposed to fix it. I'd thus conclude that you're very likely hitting this known issue, and that either for 4.1 or 4.2 (again, I'm not following progress that closely, and don't remember for sure if it's in 4.1, although I've been running the rcs since rc6 or so), the problem should be fixed as that patch gets into mainline.

Anyone wishing to investigate further can of course check the list (and/or possibly the kernel's git log) for discard/trim related patches and follow the progress once found.

... Actually, just checked myself. Looks like the patches were first posted on March 30 @ 15:12:17 -0400 or so (that's the time for one of them). There's one for the discard mount option, and another for FITRIM (which may or may not be a typo for FSTRIM, I'm not actually sure). Jeff Mahoney je...@suse.com author. That should be enough to find the threads.

And I don't see the patches in the late 4.1-rc I'm running, so either my git log search foo is bad or it'll be (at least) 4.2.

-- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: [systemd-devel] [survey] BTRFS_IOC_DEVICES_READY return status
On Wed, 17 Jun 2015 23:02:02 +0200, Lennart Poettering lenn...@poettering.net wrote:
On Wed, 17.06.15 21:10, Goffredo Baroncelli (kreij...@libero.it) wrote:
Well, /bin/mount is not a daemon, and it should not be one.
My helper is not a daemon; you were correct the first time: it blocks until all needed/enough devices have appeared. Anyway this should not be different from mounting an nfs filesystem. Even in this case the mount helper blocks until the connection happens. The block time is not negligible, even though not as long as a device timeout ...
Well, the mount tool doesn't wait for the network to be configured or so. It just waits for a response from the server. That's quite a difference.
Well, it's not really ugly. I mean, if the state or properties of a device change, then udev should update its information about it, and that's done via a retrigger. We do that all the time already, for example when an existing loopback device gets a backing file assigned or removed. I am pretty sure that loopback case is very close to what you want to do here, hence retriggering (either from the kernel side, or from userspace), appears like an OK thing to do.
What seems strange to me is that in this case the devices haven't changed their status. How is this problem managed in the md/dm raid cases?
md has a daemon mdmon to my knowledge.
No, mdmon does something different. What mdadm does is start a timer when the RAID is complete enough to be started in degraded mode. If notifications for the missing devices appear after that, the RAID is started normally. If no notification appears before the timer expires, the RAID is started in degraded mode.
ACTION=="add|change", IMPORT{program}="BINDIR/mdadm --incremental --export $devnode --offroot ${DEVLINKS}"
ACTION=="add|change", ENV{MD_STARTED}=="*unsafe*", ENV{MD_FOREIGN}=="no", ENV{SYSTEMD_WANTS}+="mdadm-last-resort@$env{MD_DEVICE}.timer"
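The timer behaviour described above for mdadm can be sketched as a small decision function. This is only an illustrative model of the description in this mail, not mdadm's code: the function name `start_mode`, its parameters, and the 30-second timeout in the usage note are all hypothetical.

```python
def start_mode(arrival_times, total, minimum, timeout):
    """Decide how an array would be started, following the description above.

    arrival_times: times (in seconds) at which member devices show up.
    total:         number of devices in the complete array.
    minimum:       number of devices needed to run in degraded mode.
    timeout:       length of the "last resort" timer (hypothetical value).
    Returns (mode, time) where mode is "normal", "degraded" or "not started".
    """
    deadline = None
    for count, t in enumerate(sorted(arrival_times), start=1):
        # The timer fired before this device showed up: degraded start.
        if deadline is not None and t > deadline:
            return ("degraded", deadline)
        # Every member is present: start normally, the timer is moot.
        if count >= total:
            return ("normal", t)
        # Enough members for a degraded start: arm the timer once.
        if count >= minimum and deadline is None:
            deadline = t + timeout
    if deadline is not None:
        return ("degraded", deadline)
    return ("not started", None)
```

For a three-device array that can run on two, `start_mode([0, 1, 2], 3, 2, 30)` gives a normal start at t=2, while `start_mode([0, 1, 100], 3, 2, 30)` gives a degraded start when the timer expires at t=31.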
RE: trim not working and irreparable errors from btrfsck
-----Original Message-----
From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Christian
Sent: Thursday, 18 June 2015 12:34 AM
To: linux-btrfs@vger.kernel.org
Subject: Re: trim not working and irreparable errors from btrfsck
On 06/17/2015 10:22 AM, Chris Murphy wrote:
On Wed, Jun 17, 2015 at 6:56 AM, Christian Dysthe cdys...@gmail.com wrote:
Hi, Sorry for asking more about this. I'm not a developer but trying to learn. In my case I get several errors like this one:
root 2625 inode 353819 errors 400, nbytes wrong
Is it inode 353819 I should focus on and what is the number after root, in this case 2625?
I'm going to guess it's tree root 2625, which is the same thing as fs tree, which is the same thing as subvolume. Each subvolume has its own inodes. So on a given Btrfs volume, an inode number can exist more than once, but in separate subvolumes. When you use btrfs inspect inode it will list all files with that inode number, but only the one in subvol ID 2625 is what you care about deleting and replacing.
Thanks! Deleting the file for that inode took care of it. No more errors. Restored it from a backup. However, fstrim still gives me 0 B (0 bytes) trimmed, so that may be another problem. Is there a way to check if trim works?
I've got the same problem. I've got 2 SSDs with 2 partitions in RAID1, fstrim always works on the 2nd partition but not the first. There are no errors on either filesystem that I know of, but the first one is root so I can't take it offline to run btrfs check.
Paul.
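Since the same inode number can appear in several subvolumes, it helps to keep the root number and the inode number together when collecting such errors. A minimal sketch, assuming the error lines look exactly like the one quoted above; the helper name `parse_check_error` and the returned field names are made up for illustration.

```python
import re

# Matches lines of the form quoted in the thread:
#   root 2625 inode 353819 errors 400, nbytes wrong
_CHECK_ERROR = re.compile(r"root (\d+) inode (\d+) errors (\w+), (.+)")


def parse_check_error(line):
    """Return the (root, inode) pair and error text from one line, or None."""
    match = _CHECK_ERROR.search(line)
    if match is None:
        return None
    return {
        "root": int(match.group(1)),    # subvolume (fs tree) ID
        "inode": int(match.group(2)),   # inode number inside that subvolume
        "errors": match.group(3),       # error bits, kept as printed
        "detail": match.group(4).strip(),
    }
```

Keying any later lookup on the pair `(root, inode)` rather than on the inode alone avoids touching a file with the same inode number in a different subvolume.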
Re: [PATCH] fstests: generic test for fsync after adding hard link to a file
On Wed, Jun 17, 2015 at 12:52:16PM +0100, fdman...@kernel.org wrote:
From: Filipe Manana fdman...@suse.com

This test is motivated by an issue found in btrfs. It tests that after syncing the filesystem, adding a hard link to a file, syncing the filesystem again, doing a write to the file that increases its size and then doing a fsync against that file, durably persists the data written to the file. That is, after log/journal replay, the data is available.

The btrfs issue is fixed by the commit titled:
  Btrfs: fix fsync data loss after append write

Signed-off-by: Filipe Manana fdman...@suse.com

Looks good to me. Tested on ext2/3/4 xfs and btrfs, btrfs fails the test, and _notrun on ext2, as expected.

Reviewed-by: Eryu Guan eg...@redhat.com

---
 tests/generic/090     | 108 ++
 tests/generic/090.out |  17
 tests/generic/group   |   1 +
 3 files changed, 126 insertions(+)
 create mode 100755 tests/generic/090
 create mode 100644 tests/generic/090.out

diff --git a/tests/generic/090 b/tests/generic/090
new file mode 100755
index 000..a1f2b89
--- /dev/null
+++ b/tests/generic/090
@@ -0,0 +1,108 @@
+#! /bin/bash
+# FS QA Test No. 090
+#
+# Test that after syncing the filesystem, adding a hard link to a file,
+# syncing the filesystem again, doing a write to the file that increases
+# its size and then doing a fsync against that file, durably persists the
+# data written to the file. That is, after log/journal replay, the data
+# is available.
+#
+# This test is motivated by a bug found in btrfs.
+#
+#-----------------------------------------------------------------------
+# Copyright (C) 2015 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana fdman...@suse.com
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+
+_cleanup()
+{
+	_cleanup_flakey
+	rm -f $tmp.*
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_need_to_be_root
+_require_scratch
+_require_dm_flakey
+_require_metadata_journaling $SCRATCH_DEV
+
+rm -f $seqres.full
+
+_scratch_mkfs >> $seqres.full 2>&1
+_init_flakey
+_mount_flakey
+
+# Create the test file with some initial data and then fsync it.
+# The fsync here is only needed to trigger the issue in btrfs, as it causes
+# the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
+$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
+		-c "fsync" \
+		$SCRATCH_MNT/foo | _filter_xfs_io
+sync
+
+# Add a hard link to our file.
+# On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
+# which is a necessary condition to trigger the issue.
+ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar
+
+# Sync the filesystem to force a commit of the current btrfs transaction, this
+# is a necessary condition to trigger the bug on btrfs.
+sync
+
+# Now append more data to our file, increasing its size, and fsync the file.
+# In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
+# write path did not update the inode item in the btree nor the delayed inode
+# item (in memory structure) in the current transaction (created by the fsync
+# handler), the fsync did not record the inode's new i_size in the fsync
+# log/journal. This made the data unavailable after the fsync log/journal is
+# replayed.
+$XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \
+		-c "fsync" \
+		$SCRATCH_MNT/foo | _filter_xfs_io
+
+echo "File content after fsync and before crash:"
+od -t x1 $SCRATCH_MNT/foo
+
+# Simulate a crash/power loss.
+_load_flakey_table $FLAKEY_DROP_WRITES
+_unmount_flakey
+
+# Allow writes again and mount. This makes the fs replay its fsync log.
+_load_flakey_table $FLAKEY_ALLOW_WRITES
+_mount_flakey
+
+echo "File content after crash and log replay:"
+od -t x1 $SCRATCH_MNT/foo
+
+status=0
+exit
diff --git a/tests/generic/090.out b/tests/generic/090.out
new
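For readers who want to see the sequence of operations without the fstests harness, the steps in the test above can be sketched in Python. This is only a rough illustration: it does not simulate the crash and log replay (the test uses dm-flakey for that), so it cannot reproduce the btrfs bug by itself. The function name `append_after_hard_link` is hypothetical.

```python
import os
import tempfile


def append_after_hard_link(directory):
    """Run the write/fsync/link/sync/append/fsync sequence in `directory`.

    Returns (size, link count, file content read back through the hard link).
    """
    foo = os.path.join(directory, "foo")
    bar = os.path.join(directory, "bar")

    # Initial 32K of 0xaa, fsynced, then a full sync.
    with open(foo, "wb") as f:
        f.write(b"\xaa" * 32768)
        f.flush()
        os.fsync(f.fileno())
    os.sync()

    # Add the hard link, then sync again to commit the transaction.
    os.link(foo, bar)
    os.sync()

    # Append 32K of 0xbb, growing the file, and fsync only the file.
    with open(foo, "r+b") as f:
        f.seek(32768)
        f.write(b"\xbb" * 32768)
        f.flush()
        os.fsync(f.fileno())

    st = os.stat(foo)
    with open(bar, "rb") as f:
        data = f.read()
    return st.st_size, st.st_nlink, data


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        size, nlink, _ = append_after_hard_link(d)
        print(size, nlink)
```

After a real power loss at this point, the expectation stated in the patch is that the file still has all 64K after log replay.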
Re: trim not working and irreparable errors from btrfsck
And you might as well just attach a full dmesg to the bug report too. Who knows, there might be something buried in there that's useful. The cleanest approach is to reboot and then reproduce all of this with fstrim on each volume, then capture the dmesg to a file. Then do the strace fstrim for each volume. But a bunch of kernel message clutter may be better than no messages at all.
Chris Murphy
Re: RAID10 Balancing Request for Comments and Advices
Hugo Mills posted on Wed, 17 Jun 2015 13:27:36 + as excerpted:
Yes, on this 80% full 6x4TB RAID10 -dusage=15 took 2 seconds and relocated 0 out of 3026 chunks. Out of curiosity, I had to use -dusage=90 to have it relocate only 1 chunk and it took less than 30 seconds. So I put a -dusage=25 in the weekly cron just before the scrub.
In most cases, all you need to do is clean up one data chunk to give the metadata enough space to work in. Instead of manually iterating through several values of usage= until you get a useful response, you can use limit=n to stop after n successful block group relocations.
Thanks, Hugo. It wasn't previously clear to me what the practical usage for the (relatively new) limit= filter was. Very useful explanation. =:^)
-- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
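To make the difference between the usage= and limit= filters concrete, here is a toy model of how they pick block groups. It is a simplification written for this digest, not the kernel's balance code: the function `select_block_groups`, the tuple layout, and the rule that usage=0 matches only completely empty block groups are assumptions of the sketch.

```python
def select_block_groups(block_groups, usage=None, limit=None):
    """Return the names of block groups a balance would pick in this model.

    block_groups: list of (name, used_bytes, size_bytes) tuples.
    usage:        only pick block groups filled below this percentage;
                  0 is treated as "completely empty" (assumption).
    limit:        stop after this many block groups have been picked.
    """
    selected = []
    for name, used, size in block_groups:
        if usage is not None:
            threshold = 1 if usage == 0 else size * usage // 100
            if used >= threshold:
                continue  # too full for the usage filter
        selected.append(name)
        if limit is not None and len(selected) >= limit:
            break  # limit reached, leave the rest alone
    return selected
```

With block groups filled to 0%, about 10%, about 88% and about 20%, usage=25 picks three of them, while usage=90 combined with limit=1 stops after the first match, which is the shortcut Hugo describes instead of trying one usage= value after another.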
Re: How do I make 'btrfs scrub' report errors via email?
On Thu, Jun 18, 2015 at 09:56:09AM +0900, crocket wrote: I think that's not going to report only errors. Outside of saying how long the scrub took, that's all it does. If you're not quite happy with the output, grep -v is your friend :) Marc -- A mouse is a device used to point at the xterm you want to type in - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
Re: [PATCH RFC] btrfs: csum: Introduce partial csum for tree block.
Ping? New comments?
Thanks,
Qu
Qu Wenruo wrote on 2015/06/16 10:39 +0800:
Qu Wenruo wrote on 2015/06/16 09:22 +0800:
David Sterba wrote on 2015/06/15 15:15 +0200:
On Mon, Jun 15, 2015 at 04:02:49PM +0800, Qu Wenruo wrote:
In the following case of corruption, RAID1 or DUP will fail to recover it (using 16K as leafsize):
          0    4K    8K    12K    16K
Mirror 0: |-OK--|ERROR|-OK-|
Mirror 1: |OK---|--Error-|
Since the CRC32 stored in the header is calculated for the whole leaf, both will fail the CRC32 check. But the corruption is in different positions; in fact, if we know where the corruption is (no need to be so accurate), we can recover the tree block by using the correct part. In the above example, we can just use the correct 0~12K from mirror 1 and then 12K~16K from mirror 0.
If the mirror 0 copy is intact, you can use it entirely. Your improvement could help if each mirror is partially broken but we can find good copies of all 4k blocks among all mirrors. The natural question is how often this happens, whether it's worth adding the code complexity, and what the estimated speed drop is. I think the conditions are very rare and that we could add minimal code to attempt to build the metadata block from the available copies without the separate block checksums.
This is an immediate idea so I could have missed something:
* if a metadata-block checksum mismatches, do a direct comparison of the metadata-blocks in all available mirrors
* if they match and checksums match, no help, it's damaged
* if there's a good copy (ie the original checksum or data were corrupted), use it
* otherwise attempt to rebuild the metadata block from what's available
* by direct comparisons of the 4k blocks, find the first where the metadataA and mirror1 blocks mismatch, offset N
* try to compute the checksum from metadataA[0..N-1] + mirror1 block N + rest of metadataA
* if it's ok, use it
* if not: the block N is corrupted in mirror1 (we've skipped it in metadataA) then repeat with metadataA[0..N] + mirror1[N+1..end]
The method sounds good, but the code will be even more complex than mine. Also, due to the nature of CRC32 and the RAID1/DUP case, things get quite complex, as in the following case using your method.
         0    4K   8K   12K  16K
Mirror 0 |    | X  |    |    |
Mirror 1 |    |    | X  |    |
If using your method:
[0~4K] The CRC32 of 0~4K is the same. No problem.
[4~8K] The CRC32 of 0~8K differs. Now we know there is something wrong with 4~8K, but we still don't know which is the good copy. We can continue, keeping 2 different CRC32 seeds for the later checksum: one seed using the 4~8K from mirror 0 and one from mirror 1. Note, the crc32 is for the whole tree block, so until we calculate all the tree block, we can know which one is correct.
Sorry here, can -> can't
Thanks, Qu
[8~12K] Now the crc32 mismatches again. We still don't know which part is correct. We can still continue by recording different CRC32 seeds for them. But don't forget the previous 2 seeds we recorded. So we record 4 crc32 seeds for 0~12K: (Mi0, Mi0) (Mi0, Mi1) (Mi1, Mi0) (Mi1, Mi1).
[12~16K] Continue calculating the crc32 with the above 4 seeds. Finally, we find that the crc32 matching the tree block is the one using the combination (Mi1, Mi0). And we can restore the tree block.
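The combinatorial search Qu walks through above can be sketched as a brute-force loop over the 4K blocks that differ between the two mirrors. This is a simplified illustration, not either author's code: it uses `zlib.crc32` as a stand-in checksum (btrfs uses a different CRC32 variant), recomputes the whole checksum per candidate instead of carrying seeds forward, and the function names are hypothetical.

```python
import itertools
import zlib

BLOCK = 4096  # size of the sub-blocks compared between mirrors


def split_blocks(buf, size=BLOCK):
    """Cut a tree block into fixed-size pieces."""
    return [buf[i:i + size] for i in range(0, len(buf), size)]


def rebuild_from_mirrors(mirror0, mirror1, expected_crc, size=BLOCK):
    """Try every mix of differing sub-blocks until the whole-block CRC matches.

    Returns (rebuilt block or None, number of combinations tried).
    """
    b0 = split_blocks(mirror0, size)
    b1 = split_blocks(mirror1, size)
    differing = [i for i, (x, y) in enumerate(zip(b0, b1)) if x != y]
    attempts = 0
    # 2 ** len(differing) candidates: each differing block comes from
    # mirror 0 (pick 0) or mirror 1 (pick 1).
    for choice in itertools.product((0, 1), repeat=len(differing)):
        attempts += 1
        blocks = list(b0)
        for idx, pick in zip(differing, choice):
            if pick == 1:
                blocks[idx] = b1[idx]
        candidate = b"".join(blocks)
        if zlib.crc32(candidate) == expected_crc:
            return candidate, attempts
    return None, attempts
```

With the corruption pattern from the diagram (mirror 0 bad in 4~8K, mirror 1 bad in 8~12K) the loop succeeds on the (Mi1, Mi0) combination, and the number of candidates doubles with every additional differing block.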
Yes, with this behavior, we can restore the tree block even if 3 parts are corrupted, but in that case, we need to try 8 times. So, I don't consider this easier to implement than my idea.
[ROOT CAUSE] The root cause of the complexity is:
1) The checksum algorithm depends on previous input (seed). Almost all checksum algorithms depend on previous input, and you won't know whether the data is correct or not until all data is calculated.
2) Only one extra duplication for RAID1/DUP. We don't have an extra duplicate to judge which block is correct, as there are only two mirrors.
So my partial checksum solves the 2 root causes at the same time. With partial checksums, we don't depend on previous data to calculate the checksum. And we have an extra reference if the mirrors differ from each other: we use the checksum to judge which copy is correct.
That's a rough idea that I hope will cover most of the cases when it happens. With some more exhaustive attempts to rebuild the metadata block we can try to repair 2 damaged blocks. As this is completely independent, we can test it separately, and also add it as a rescue feature to the userspace tools.
Yes, this corruption case may be minor enough, since even corruption in one mirror is rare enough. So I didn't introduce a new CRC32 checksum, but use the extra 32-4 bytes to store the partial CRC32 to keep the backward compatibility.
The above would work with any checksums, without the need to store the per-block checksums, which becomes impossible with stronger algorithms.
[FURTHER CSUM DESIGN] As you mentioned in later
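For comparison, the partial-checksum idea can be sketched as follows: with one checksum per sub-block, each piece is judged on its own and no combinations need to be tried. This is a hedged illustration only. It uses `zlib.crc32` as a stand-in, assumes one checksum per 4K piece, and does not model where or how the RFC patch actually stores the partial checksums; the function names are hypothetical.

```python
import zlib

BLOCK = 4096  # assumed granularity of the partial checksums


def partial_csums(buf, size=BLOCK):
    """One checksum per sub-block, computed independently of the others."""
    return [zlib.crc32(buf[i:i + size]) for i in range(0, len(buf), size)]


def repair_with_partial_csums(mirrors, csums, size=BLOCK):
    """Pick, for every sub-block, the first mirror copy matching its checksum.

    Returns the rebuilt block, or None if some sub-block is bad in all mirrors.
    """
    pieces = []
    for n, want in enumerate(csums):
        for mirror in mirrors:
            piece = mirror[n * size:(n + 1) * size]
            if zlib.crc32(piece) == want:
                pieces.append(piece)
                break
        else:
            return None  # no mirror has a good copy of this sub-block
    return b"".join(pieces)
```

Here the work grows linearly with the number of sub-blocks, whereas the whole-block approach needs up to 2^k attempts for k differing sub-blocks.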
Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION
On Wed, Jun 17, 2015 at 07:01:18PM +0200, David Sterba wrote:
On Wed, Jun 17, 2015 at 11:52:36PM +0800, Liu Bo wrote:
On Wed, Jun 17, 2015 at 05:33:06PM +0200, David Sterba wrote:
On Wed, Jun 17, 2015 at 03:54:31PM +0800, Liu Bo wrote:
MS_I_VERSION is enabled by default for btrfs, this adds an alternative option to toggle it off.
There's an existing generic iversion/noiversion mount option pair, no need to add an extra one to btrfs.
I know, it doesn't work though.
Sigh, I see, btrfs forces the MS_I_VERSION flag, 0c4d2d95d06e920e0c61707e62c7fffc9c57f63a. I read 'enabled by default' as that there's a standard way to override the defaults. So the right way is not to do that, but this will break everything that relies on that behaviour at the moment. This means adding the exception to the upper layers, either VFS or 'mount', which is not very likely to happen.
The generic options do not reach the filesystem specific callbacks, so we can't check it. Ext4 also makes its own i_version option, so I think we can do the same thing until more filesystems require to do it in a generic way. The performance benefit with no_iversion is obvious for fsync related workloads since we would avoid some expensive log commits.
Thanks,
-liubo
Re: [PATCH] fstests: generic test for fsync after adding xattr to a file
On Wed, Jun 17, 2015 at 12:52:47PM +0100, fdman...@kernel.org wrote:
From: Filipe Manana fdman...@suse.com

This test is motivated by an issue found in btrfs. It tests that after syncing the filesystem, adding a xattr to a file, syncing the filesystem again, writing to the file and then doing a fsync against that file, the xattr still exists after a power failure. That is, after the fsync log/journal is replayed, the xattr still exists and with the correct value.

The btrfs issue is fixed by the patch titled:
  Btrfs: fix fsync xattr loss in the fast fsync path

If I read the above patch correctly, the issue to be tested here was introduced by commit 4f764e515361 (Btrfs: remove deleted xattrs on fsync log replay), which is in mainline kernel since v4.1-rc1, so the test should fail on my v4.1-rc5 kernel, but I didn't see it fail. Is it reproducible every time? Or did I miss something?

Signed-off-by: Filipe Manana fdman...@suse.com
---
 tests/generic/094     | 112 ++
 tests/generic/094.out |  29 +
 tests/generic/group   |   1 +
 3 files changed, 142 insertions(+)
 create mode 100755 tests/generic/094
 create mode 100644 tests/generic/094.out

diff --git a/tests/generic/094 b/tests/generic/094
new file mode 100755
index 000..1c6d113
--- /dev/null
+++ b/tests/generic/094
@@ -0,0 +1,112 @@
+#! /bin/bash
+# FS QA Test No. 094
+#
+# Test that after syncing the filesystem, adding a xattr to a file, syncing
+# the filesystem again, writing to the file and then doing a fsync against that
+# file, the xattr still exists after a power failure. That is, after the fsync
+# log/journal is replayed, the xattr still exists and with the correct value.
+#
+# This test is motivated by a bug found in btrfs.
+#
+#-----------------------------------------------------------------------
+# Copyright (C) 2015 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana fdman...@suse.com
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+
+_cleanup()
+{
+	_cleanup_flakey
+	rm -f $tmp.*
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+. ./common/attr
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_need_to_be_root
+_require_scratch
+_require_dm_flakey
+_require_attrs
+_require_metadata_journaling $SCRATCH_DEV
+
+rm -f $seqres.full
+
+_scratch_mkfs >> $seqres.full 2>&1
+_init_flakey
+_mount_flakey
+
+# Create the test file with some initial data and make sure everything is
+# durably persisted. We do fsync before calling sync to make sure that if the
+# filesystem is btrfs, we get the flag BTRFS_INODE_NEEDS_FULL_SYNC cleared
+# from the btrfs inode - a condition necessary to trigger the issue in btrfs.
+$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
+		-c "fsync" \
+		$SCRATCH_MNT/foo | _filter_xfs_io
+sync
+
+# Add a xattr to our file.
+$SETFATTR_PROG -n user.attr -v somevalue $SCRATCH_MNT/foo
+
+# Sync the filesystem to force a commit of the current btrfs transaction, this
+# is a necessary condition to trigger the bug on btrfs.
+sync
+
+# Now update our file's data and fsync the file.
+# After a successful fsync, if the fsync log/journal is replayed we expect to
+# see the xattr named user.attr with a value of somevalue (and the updated
+# file data of course). Btrfs used to remove the xattr when it replayed the
+# fsync log/journal.
+$XFS_IO_PROG -c "pwrite -S 0xbb 8K 16K" \
+		-c "fsync" \
+		$SCRATCH_MNT/foo | _filter_xfs_io
+
+echo "File content after fsync and before crash:"
+od -t x1 $SCRATCH_MNT/foo
+
+echo "File xattrs after fsync and before crash:"
+$GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foo | _filter_scratch
+
+# Simulate a crash/power loss.
+_load_flakey_table $FLAKEY_DROP_WRITES
+_unmount_flakey
+
+# Allow writes again and mount. This makes the fs replay its fsync log.
+_load_flakey_table $FLAKEY_ALLOW_WRITES
+_mount_flakey
+
+echo
Re: trim not working and irreparable errors from btrfsck
File a new bug at bugzilla.kernel.org describing this problem. Include make/model of all involved SSDs, which you can get from smartctl or hdparm. And then do a strace fstrim on the working and non-working volumes, saving the output to separate files and attaching them to the bug report. And then it's probably best to create a new list thread to post the URL for the bug, since this thread is really about two problems that may not be related. Chris Murphy
Re: [PATCH 5/5] Btrfs: incremental send, fix rmdir not send utimes
Hi Filipe,
I've found that the following case is the main cause of this error; its fs tree, as shown by btrfs-debug-tree, is below.
file tree key (459 ROOT_ITEM 20487)
node 132988928 level 1 items 3 free 490 generation 20487 owner 459
fs uuid b451ae42-3b03-4003-b0a4-45dce324557f
chunk uuid d8831db3-2e42-4b32-9a5c-3efdf50d36bc
    key (256 INODE_ITEM 0) block 132710400 (8100) gen 20486
    key (264 INODE_ITEM 0) block 130695168 (7977) gen 20480
    key (266 XATTR_ITEM 952319794) block 126042112 (7693) gen 20464
leaf 132710400 items 166 free space 3639 generation 20486 owner 455
fs uuid b451ae42-3b03-4003-b0a4-45dce324557f
chunk uuid d8831db3-2e42-4b32-9a5c-3efdf50d36bc
    item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
        inode generation 20425 transid 20442 size 32 block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0 flags 0x0
    item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
        inode ref index 0 namelen 2 name: ..
    ...
    item 165 key (262 XATTR_ITEM 1100961104) itemoff 7789 itemsize 39
        location key (0 UNKNOWN.0 0) type XATTR
        namelen 8 datalen 1 name: user.a78
        data a binary 61
leaf 130695168 items 133 free space 7332 generation 20480 owner 455
fs uuid b451ae42-3b03-4003-b0a4-45dce324557f
chunk uuid d8831db3-2e42-4b32-9a5c-3efdf50d36bc
    item 0 key (264 INODE_ITEM 0) itemoff 16123 itemsize 160
        inode generation 20428 transid 20434 size 10 block group 0 mode 40755 links 1 uid 0 gid 0 rdev 0 flags 0x0
    item 1 key (264 INODE_REF 256) itemoff 16112 itemsize 11
        inode ref index 11 namelen 1 name: c
    ...
We can see that inode 262 is right at the end of the leaf. Then send_utime() will use btrfs_search_slot() to find an appropriate place to put 262, which is at the back of 262. However, that place is uninitialized on disk. Suppose we read atime tv_sec:576469548413222912, tv_nsec:1919251317 and then send it out. The receiving side will get EINVAL since tv_nsec:1919251317 is greater than 999,999,999.
Thanks.
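The EINVAL mentioned above follows from the usual range rule for timestamps: the nanosecond field must lie between 0 and 999,999,999. A minimal sketch of that check, using the values quoted in this mail; the helper name `valid_timespec` is hypothetical, and the special nanosecond values some system calls accept (such as "now" or "omit" markers) are deliberately not modelled.

```python
NSEC_PER_SEC = 1_000_000_000


def valid_timespec(tv_sec, tv_nsec):
    """Return True if tv_nsec is a legal nanosecond count for a timestamp.

    tv_sec is accepted as-is here; only the nanosecond range is checked.
    """
    return 0 <= tv_nsec < NSEC_PER_SEC


# Values read from the uninitialized slot described above.
garbage = (576469548413222912, 1919251317)
print(valid_timespec(*garbage))
```

A sender-side check like this would only hide the symptom; the point made in the mail is that the slot being read is not a valid item in the first place.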
Robbie Ko

2015-06-10 18:06 GMT+08:00 Robbie Ko robbi...@synology.com:
Hi Filipe,
2015-06-09 18:36 GMT+08:00 Filipe David Manana fdman...@gmail.com:
On Tue, Jun 9, 2015 at 11:04 AM, Robbie Ko robbi...@synology.com wrote:
Hi Filipe,
2015-06-08 22:00 GMT+08:00 Filipe David Manana fdman...@gmail.com:
On Mon, Jun 8, 2015 at 4:44 AM, Robbie Ko robbi...@synology.com wrote:
Hi Filipe,
Hi Robbie,
I've fixed don't send utimes for non-existing directory with another solution. In apply_dir_move(), the old parent dir. and new parent dir. will be updated after the current dir. has moved. And there's only one entry in old parent dir. (e.g. entry with smallest ino) will be tagged with rmdir_ino to prevent its parent dir. is deleted but updated.
Can't parse this phrase. What do you mean by tagging an entry with rmdir_ino? rmdir_ino corresponds to the number of an inode that wasn't deleted when it was processed because there was some inode with a lower number that is a child of the directory in the parent snapshot and had its rename/move operation delayed (it happens after the directory we want to delete is processed).
Right, my "tagged with rmdir_ino" has the same meaning as you explained here.
However, if we process rename for another entry not tagged with rmdir_ino first, its old parent dir. which is deleted will be updated according to apply_dir_move(). Therefore, I think we should check the existence of the dir. before we're going to update its utime. The patch is pasted in the following link, could you give me some comment? https://friendpaste.com/h8tZqOS9iAUpp2DvgGI2k
Looks better. However I still don't understand your explanation, and just tried the example in your commit message:
Parent snapshot:
| a/ (ino 259)
| | c (ino 264)
| b/ (ino 260)
| | d (ino 265)
| del/ (ino 263)
  | item1/ (ino 261)
  | item2/ (ino 262)
Send snapshot:
| a/ (ino 259)
| b/ (ino 260)
| c/ (ino 2)
| | item2 (ino 259)
| d/ (ino 257)
| | item1/ (ino 258)
So it's confusing after looking at it.
First the send snapshot mentions inode number 2, which doesn't exist in the parent snapshot - I assume you meant inode number 264. Then, the send snapshot has two inodes with number 259. Is item2 in the send snapshot supposed to be inode 262?
Your guess is right. And I corrected it as follows.
# Parent snapshot:
#
# | a/ (ino 259)
# | | c (ino 264)
# |
# | b/ (ino 260)
# | | d (ino 265)
# |
# | del/ (ino 263)
#   | item1/ (ino 261)
#   | item2/ (ino 262)
# Send snapshot:
#
# | a/ (ino 259)
# | b/ (ino 260)
# | c/ (ino 264)
# | | item2/ (ino 262)
# |
# | d/ (ino 265)
Re: BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)
Marc MERLIN posted on Wed, 17 Jun 2015 09:19:36 -0700 as excerpted:
On Wed, Jun 17, 2015 at 01:51:26PM +, Duncan wrote:
Also, if my actual data got corrupted, am I correct that btrfs will detect the checksum failure and give me a different error message of a read error that cannot be corrected? I'll do a scrub later, for now I have to wait 20 hours for the raid rebuild first.
Yes again.
Great, thanks for confirming. Makes me happy to know that checksums and metadata DUP are helping me out here. With ext4 I'd have been worse off for sure.
One thing I'd strongly recommend. Once the rebuild is complete and you do the scrub, there may well be both read/corrected errors, and unverified errors. AFAIK, the unverified errors are a result of bad metadata blocks, so missing checksums for what they covered. So once you
I'm slightly confused here. If I have metadata DUP and checksums, how can metadata blocks be unverified? Data blocks being unverified, I understand, it would mean the data or checksum is bad, but I expect that's a different error message I haven't seen yet.
Backing up a bit to better explain what I'm seeing here...
What I'm getting here, when the sectors go unreadable on the (slowly) failing SSD, is actually a SATA level timeout, which btrfs (correctly) interprets as a read error. But it wouldn't really matter whether it was a read error or a corruption error, btrfs would respond the same -- because both data and metadata are btrfs raid1 here, it would fetch and verify the other copy of the block from the raid1 mirror device, and assuming it verified (which it should since the other device is still in great condition, zero relocations), rewrite it over the one it couldn't read.
Back on the failing device, the rewrite triggers a sector relocation, and assuming it doesn't fall in the bad area too, that block is now clean. (If it does fall in the defective area, I simply have to repeat the scrub another time or two, until there are no more errors.)
But, and this is what I was trying to explain earlier but skipped a step I figured was more obvious than it apparently was, btrfs works with trees, including a metadata tree. So each block of metadata that has checksums covering actual data, is in turn itself checksummed by a metadata block one step closer to the metadata root block, multiple levels deep. I should mention here that this is my non-coder understanding. If a dev says it works differently...
It's these multiple metadata levels and the chained checksums for them, that I was referencing. Suppose it's a metadata block that fails, not a data block. That metadata block will be checksummed, and will in turn contain checksums for other blocks, which might be either data blocks, or other metadata blocks, a level closer to the data (and further from the root) than the failed block.
Because the metadata block was failed (either checksum failure or read error, shouldn't matter at this point), whatever checksums it contained, whether for data, or for other metadata blocks, will be unverified. If the affected metadata block is close to the root of the tree, the effect could in theory domino thru to several further levels. These checksum unverified blocks (because the block containing the checksums failed) will show up as unverified errors, and whatever that checksum was supposed to cover, whether other metadata blocks or data blocks, won't be checked in that scrub round, because the level above it can't be verified.
Given a checksum-verified raid1 copy on the mirror device, the original failed block will be rewritten. But if it's metadata, whatever checksums it in turn contained will still not be verified in that scrub round. Again, these show up as unverified errors.
By running scrub repeatedly, however, now that the first error has been fixed by the rewrite from the good copy, the checksums it contained can now in turn be checked. If they all verify, great.
If not, another rewrite will be triggered, fixing them, but if those checksums were in turn for other metadata blocks, now /those/ will need to be checked and will show up as unverified. So depending on where the bad metadata block was located in the metadata tree, a second, third, possibly even fourth, scrub may be needed, in order to correct all the errors at all levels of the metadata tree, thereby fixing in turn each level of unverified errors exposed as the level above it (closer to root) was fixed.

Of course, if your scrub listed all corrected (metadata since it's DUP in your case) or uncorrectable (data since it's single in your case, or metadata with both copies bad) errors, no unverified errors, then at least in theory, a second scrub shouldn't find any further errors to correct. Only if you see unverified errors should it be necessary to repeat the scrub, but then you might need to repeat it several times as each run will
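
The repeated-scrub behaviour described in this message can be illustrated with a small toy model. This is only a sketch of the explanation given above (which its author flags as a non-coder understanding), not btrfs scrub code: the `Block` class, the tree layout and the function names are all hypothetical, and a good mirror copy is assumed to exist for every failed block.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One block in a toy metadata tree (hypothetical model, not a btrfs structure)."""
    name: str
    bad: bool = False          # this copy fails its read or its checksum
    children: list = field(default_factory=list)


def scrub_once(root):
    """Run one scrub pass; return (corrected, unverified) lists of block names."""
    corrected, unverified = [], []

    def skip(block):
        # The checksum for this block lived in a parent that failed,
        # so the block cannot be checked in this pass.
        unverified.append(block.name)
        for child in block.children:
            skip(child)

    def check(block):
        if block.bad:
            # Assumption: a verified mirror copy exists, so the block is rewritten.
            block.bad = False
            corrected.append(block.name)
            for child in block.children:
                skip(child)
        else:
            for child in block.children:
                check(child)

    check(root)
    return corrected, unverified


def scrub_until_clean(root, max_passes=10):
    """Repeat scrub passes until one reports no errors; return per-pass results."""
    history = []
    for _ in range(max_passes):
        result = scrub_once(root)
        history.append(result)
        if result == ([], []):
            break
    return history


def build_tree():
    # root -> node_a (bad) -> node_b (bad) -> data_1, plus a healthy sibling leaf.
    data_1 = Block("data_1")
    node_b = Block("node_b", bad=True, children=[data_1])
    node_a = Block("node_a", bad=True, children=[node_b])
    leaf_c = Block("leaf_c")
    return Block("root", children=[node_a, leaf_c])


if __name__ == "__main__":
    for n, (corrected, unverified) in enumerate(scrub_until_clean(build_tree()), 1):
        print(f"pass {n}: corrected={corrected} unverified={unverified}")
```

In this model two bad blocks stacked on one path need three passes: each pass repairs one level and reports everything beneath it as unverified, and the final pass comes back clean.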
Re: trim not working and irreparable errors from btrfsck
On 2015-06-17 11:40, Christian wrote:
> On 06/17/2015 11:28 AM, Chris Murphy wrote:
>>> However, fstrim still gives me 0 B (0 bytes) trimmed, so that may be another problem. Is there a way to check if trim works?
>>
>> That sounds like maybe your SSD is blacklisted for trim, is all I can think of. So trim shouldn't be the cause of the problem if it's being blacklisted. The recent problems appear to be around newer SSDs that support queued trim and newer kernels that issue queued trim. There have been some patches related to trim to the kernel, but the existence of blacklisting and claims of bugs in firmware make it difficult to test and isolate.
>>
>> http://techreport.com/news/28473/some-samsung-ssds-may-suffer-from-a-buggy-trim-implementation
>
> This is an Intel SSD in a Lenovo Thinkpad X1 Carbon. Trim worked until a few weeks ago and still works for my small ext4 boot partition (just ran it to check). I will keep looking for a solution. Thanks!

I'm seeing the same issue here, but with a Crucial brand SSD. Somewhat interestingly, I don't see any issues like this with BTRFS on top of LVM's thin-provisioning volumes, or with any other filesystems, so I think it has something to do with how BTRFS is reporting unused space or how it is submitting the discard requests.
Re: [PATCH] Btrfs: fix typo in the error log
On Mon, Jun 15, 2015 at 10:02:23PM +0800, Anand Jain wrote:
> ---
>  fs/btrfs/disk-io.c | 2 +-
>  fs/btrfs/super.c   | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 57d48f8..5147cb7 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -2896,7 +2896,7 @@ retry_root_backup:
>  	     fs_info->num_tolerated_disk_barrier_failures &&
>  	    !(sb->s_flags & MS_RDONLY)) {
>  		printk(KERN_WARNING "BTRFS: "
> -			"too many missing devices, writeable mount is not allowed\n");
> +			"too many missing devices, writable mount is not allowed\n");

I see both forms are accepted:
http://dictionary.reference.com/browse/writeable
http://dictionary.reference.com/browse/writable

Though not here:
http://www.merriam-webster.com/dictionary/writable

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Replacing a drive from a RAID 1 array
On 2015-06-16 12:58, Hugo Mills wrote:
> On Tue, Jun 16, 2015 at 06:43:23PM +0200, Arnaud Kapp wrote:
>> Hello,
>>
>> Consider the following situation: I have a RAID 1 array with 4 drives. I want to replace one of the drives with a new one with greater capacity. However, let's say I only have 4 HDD slots, so I cannot plug in the new drive, add it to the array, then remove the other one.
>>
>> Is there a *safe* way to change drives in this situation? I'd bet that booting with 3 drives, adding the new one, then removing the old, non-connected one would work. However, is there something that could go wrong in this situation?
>
> The main thing that could go wrong with that is a disk failure. If you have the SATA ports available, I'd consider operating the machine with the case open and one of the drives bare and resting on something stable and insulating for the time it takes to do a btrfs replace operation.

This would be my first suggestion also; although, if you only have 4 SATA ports, you might want to invest in a SATA add-in card (if you go this way, look for one with an ASmedia chipset, those are the best I've seen as far as reliability for add-on controllers).

> If that's not an option, then a good-quality external USB case with a short cable directly attached to one of the USB ports on the motherboard would be a reasonable solution (with the proviso that some USB connections are just plain unstable and throw errors, which can cause problems with the filesystem code, typically requiring a reboot, and a restart of the process).

If you decide to go with this option and are using an Intel system, avoid using USB3.0 ports, as a number of Intel's chipsets have known bugs with their USB3 hardware that will likely cause serious issues. If your system has an eSATA port however, try to use that instead of USB, it will almost certainly be faster and more reliable.

> You might also consider using either NBD or iSCSI to present one of the disks (I'd probably use the outgoing one) over the network from another machine with more slots in it, but that's going to end up with horrible performance during the migration.

The other possibility WRT this is ATAoE, which generally gets better performance than NBD or iSCSI but has the caveat that both systems have to be on the same network link (i.e., no gateways between them). If you do decide to use ATAoE, look into a program called 'vblade' (most distros have it in a package with the same name).
[PATCH] Btrfs-progs: add feature to get minimum size for resizing a fs/device
From: Filipe Manana fdman...@suse.com

Currently there is no way for a user to know what is the minimum size a device of a btrfs filesystem can be resized to. Sometimes the value of total allocated space (sum of all allocated chunks/device extents), which can be parsed from 'btrfs filesystem show' and 'btrfs filesystem usage', works as the minimum size, but sometimes it does not, namely when device extents have to be relocated to holes (unallocated space) within the new size of the device (the total allocated space sum).

This change adds the ability to reliably compute such minimum value and extends 'btrfs filesystem resize' with the following syntax to get such value:

   btrfs filesystem resize [devid:]get_min_size

Signed-off-by: Filipe Manana fdman...@suse.com
---
 Documentation/btrfs-filesystem.asciidoc |   4 +-
 Makefile.in                             |   8 +-
 cmds-filesystem.c                       | 219 +++-
 ctree.h                                 |   3 +
 tests/shrink-min-size-tests.sh          |  72 +++
 5 files changed, 302 insertions(+), 4 deletions(-)
 create mode 100755 tests/shrink-min-size-tests.sh

diff --git a/Documentation/btrfs-filesystem.asciidoc b/Documentation/btrfs-filesystem.asciidoc
index f1c35b6..45f8cf7 100644
--- a/Documentation/btrfs-filesystem.asciidoc
+++ b/Documentation/btrfs-filesystem.asciidoc
@@ -88,7 +88,7 @@ If a newlabel optional argument is passed, the label is changed.
 NOTE: the maximum allowable length shall be less than 256 chars
 
 // Some wording are extracted by the resize2fs man page
-*resize* [devid:][+/-]size[kKmMgGtTpPeE]|[devid:]max path::
+*resize* [devid:][+/-]size[kKmMgGtTpPeE]|[devid:]max|[devid:]get_min_size path::
 Resize a mounted filesystem identified by directory path. A particular device
 can be resized by specifying a devid.
 +
@@ -108,6 +108,8 @@ KiB, MiB, GiB, TiB, PiB, or EiB, respectively. Case does not matter.
 +
 If \'max' is passed, the filesystem will occupy all available space on the
 device devid.
+If \'get_min_size' is passed, return the minimum size the device can be
+shrunk to, without performing any resize operation.
 +
 The resize command does not manipulate the size of underlying
 partition. If you wish to enlarge/reduce a filesystem, you must make sure you
diff --git a/Makefile.in b/Makefile.in
index 860a390..202c51e 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -46,7 +46,7 @@ libbtrfs_objects = send-stream.o send-utils.o rbtree.o btrfs-list.o crc32c.o \
 libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \
 	       crc32c.h list.h kerncompat.h radix-tree.h extent-cache.h \
 	       extent_io.h ioctl.h ctree.h btrfsck.h version.h
-TESTS = fsck-tests.sh convert-tests.sh
+TESTS = fsck-tests.sh convert-tests.sh shrink-min-size-tests.sh
 
 prefix ?= @prefix@
 exec_prefix = @exec_prefix@
@@ -161,6 +161,10 @@ $(BUILDDIRS):
 	@echo Making all in $(patsubst build-%,%,$@)
 	$(Q)$(MAKE) $(MAKEOPTS) -C $(patsubst build-%,%,$@)
 
+test-shrink-min-size: btrfs mkfs.btrfs
+	@echo [TEST] shrink-min-size-tests.sh
+	$(Q)bash tests/shrink-min-size-tests.sh
+
 test-convert: btrfs btrfs-convert
 	@echo [TEST] convert-tests.sh
 	$(Q)bash tests/convert-tests.sh
@@ -169,7 +173,7 @@ test-fsck: btrfs btrfs-image btrfs-corrupt-block btrfs-debug-tree mkfs.btrfs
 	@echo [TEST] fsck-tests.sh
 	$(Q)bash tests/fsck-tests.sh
 
-test: test-fsck test-convert
+test: test-fsck test-convert test-shrink-min-size
 
 #
 # NOTE: For static compiles, you need to have all the required libs
diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index b93bb33..13b5bc5 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -1220,14 +1220,228 @@ static int cmd_defrag(int argc, char **argv)
 }
 
 static const char * const cmd_resize_usage[] = {
-	"btrfs filesystem resize [devid:][+/-]newsize[kKmMgGtTpPeE]|[devid:]max path",
+	"btrfs filesystem resize [devid:][+/-]newsize[kKmMgGtTpPeE]|[devid:]max|[devid:]get_min_size path",
 	"Resize a filesystem",
 	"If 'max' is passed, the filesystem will occupy all available space",
 	"on the device 'devid'.",
+	"If 'get_min_size' is passed, return the minimum size the device can",
+	"be shrunk to.",
 	"[kK] means KiB, which denotes 1KiB = 1024B, 1MiB = 1024KiB, etc.",
 	NULL
 };
 
+struct dev_extent_elem {
+	u64 start;
+	/* inclusive end */
+	u64 end;
+	struct list_head list;
+};
+
+static int add_dev_extent(struct list_head *list,
+			  const u64 start, const u64 end,
+			  const int append)
+{
+	struct dev_extent_elem *e;
+
+	e = malloc(sizeof(*e));
+	if (!e)
+		return -ENOMEM;
+
+	e->start = start;
+	e->end = end;
+
+	if (append)
+		list_add_tail(&e->list, list);
+	else
+		list_add(&e->list, list);
+
+	return
[PATCH] fstests: generic test for fsync after adding xattr to a file
From: Filipe Manana fdman...@suse.com

This test is motivated by an issue found in btrfs. It tests that after syncing the filesystem, adding a xattr to a file, syncing the filesystem again, writing to the file and then doing a fsync against that file, the xattr still exists after a power failure. That is, after the fsync log/journal is replayed, the xattr still exists and with the correct value.

The btrfs issue is fixed by the patch titled:

  Btrfs: fix fsync xattr loss in the fast fsync path

Signed-off-by: Filipe Manana fdman...@suse.com
---
 tests/generic/094     | 112 ++
 tests/generic/094.out |  29 +
 tests/generic/group   |   1 +
 3 files changed, 142 insertions(+)
 create mode 100755 tests/generic/094
 create mode 100644 tests/generic/094.out

diff --git a/tests/generic/094 b/tests/generic/094
new file mode 100755
index 0000000..1c6d113
--- /dev/null
+++ b/tests/generic/094
@@ -0,0 +1,112 @@
+#! /bin/bash
+# FS QA Test No. 094
+#
+# Test that after syncing the filesystem, adding a xattr to a file, syncing
+# the filesystem again, writing to the file and then doing a fsync against that
+# file, the xattr still exists after a power failure. That is, after the fsync
+# log/journal is replayed, the xattr still exists and with the correct value.
+#
+# This test is motivated by a bug found in btrfs.
+#
+#---
+# Copyright (C) 2015 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana fdman...@suse.com
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+
+_cleanup()
+{
+	_cleanup_flakey
+	rm -f $tmp.*
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+. ./common/attr
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_need_to_be_root
+_require_scratch
+_require_dm_flakey
+_require_attrs
+_require_metadata_journaling $SCRATCH_DEV
+
+rm -f $seqres.full
+
+_scratch_mkfs >> $seqres.full 2>&1
+_init_flakey
+_mount_flakey
+
+# Create the test file with some initial data and make sure everything is
+# durably persisted. We do fsync before calling sync to make sure that if the
+# filesystem is btrfs, we get the flag BTRFS_INODE_NEEDS_FULL_SYNC cleared
+# from the btrfs inode - a condition necessary to trigger the issue in btrfs.
+$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
+		-c "fsync" \
+		$SCRATCH_MNT/foo | _filter_xfs_io
+sync
+
+# Add a xattr to our file.
+$SETFATTR_PROG -n user.attr -v somevalue $SCRATCH_MNT/foo
+
+# Sync the filesystem to force a commit of the current btrfs transaction, this
+# is a necessary condition to trigger the bug on btrfs.
+sync
+
+# Now update our file's data and fsync the file.
+# After a successful fsync, if the fsync log/journal is replayed we expect to
+# see the xattr named user.attr with a value of somevalue (and the updated
+# file data of course). Btrfs used to remove the xattr when it replayed the
+# fsync log/journal.
+$XFS_IO_PROG -c "pwrite -S 0xbb 8K 16K" \
+		-c "fsync" \
+		$SCRATCH_MNT/foo | _filter_xfs_io
+
+echo "File content after fsync and before crash:"
+od -t x1 $SCRATCH_MNT/foo
+
+echo "File xattrs after fsync and before crash:"
+$GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foo | _filter_scratch
+
+# Simulate a crash/power loss.
+_load_flakey_table $FLAKEY_DROP_WRITES
+_unmount_flakey
+
+# Allow writes again and mount. This makes the fs replay its fsync log.
+_load_flakey_table $FLAKEY_ALLOW_WRITES
+_mount_flakey
+
+echo "File content after crash and log replay:"
+od -t x1 $SCRATCH_MNT/foo
+
+echo "File xattrs after crash and log replay:"
+$GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foo | _filter_scratch
+
+status=0
+exit
diff --git a/tests/generic/094.out b/tests/generic/094.out
new file mode 100644
index 0000000..2e5e0fa
--- /dev/null
+++ b/tests/generic/094.out
@@ -0,0 +1,29 @@
+QA output created by 094
+wrote 32768/32768 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset
Re: I_ERR_FILE_EXTENT_DISCOUNT when there are no file extent holes in inode
Przemysław Pawełczyk wrote on 2015/06/16 14:19 +0200:
> On Tue, Jun 16, 2015 at 9:54 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:
>> Przemysław Pawełczyk wrote on 2015/06/14 21:38 +0200:
>>> I wanted to move and resize the /home btrfs partition of my debian jessie v8.1 (w/ btrfs-tools v3.17) in a virtual machine using gparted 0.22.0 after booting from the latest SysRescCD 4.5.3 (w/ btrfs-progs v3.19.1). GParted does an fs check first, to make sure that everything is fine, but it wasn't. There were the following errors from `btrfsck`:
>>>
>>> checking fs roots
>>> root 5 inode 1521611 errors 100, file extent discount
>>> Found file extent holes:
>>>         start: 12288, len:4096
>>> root 5 inode 1521634 errors 100, file extent discount
>>> Found file extent holes:
>>> root 5 inode 1521645 errors 100, file extent discount
>>> Found file extent holes:
>>>         start: 8192, len:4096
>>> root 5 inode 1521647 errors 100, file extent discount
>>> Found file extent holes:
>>>         start: 8192, len:8192
>>>         start: 20480, len:4096
>>> root 5 inode 1521648 errors 100, file extent discount
>>> Found file extent holes:
>>> root 5 inode 1521649 errors 100, file extent discount
>>> Found file extent holes:
>>> ...
>>>
>>> As you can see, not every inode w/ the file extent discount error flag has file extent holes. I'm not sure of the exact definition of this error flag, so I cannot tell myself how (ab?)normal it is. I was using this partition almost daily for almost a year (back then it was debian testing when installed) and besides occasional VirtualBox hangups (during rsync from USB), I had no problems at all.
>>>
>>> Qu Wenruo's discount file extent hole repairing function landed in btrfs-progs v3.19, so I couldn't use debian's old btrfsck to improve the situation, but sysresccd's one was recent enough (and I was already booted into it), so I went with its `btrfsck --repair`. I got many 'Fixed discount file extents for inode' messages, but the next `btrfsck` run still reported file extent discount errors. Looking closely there was some improvement, because 2 inodes were no longer reported (only one within the visible part of the log dump below):
>>>
>>> checking fs roots
>>> root 5 inode 1521634 errors 100, file extent discount
>>> Found file extent holes:
>>> root 5 inode 1521645 errors 100, file extent discount
>>> Found file extent holes:
>>> root 5 inode 1521647 errors 100, file extent discount
>>> Found file extent holes:
>>> root 5 inode 1521648 errors 100, file extent discount
>>> Found file extent holes:
>>> root 5 inode 1521649 errors 100, file extent discount
>>> Found file extent holes:
>>> ...
>>>
>>> I cloned btrfs-progs.git with the latest stable v4.0.1, and executed a self-built `btrfsck --repair` from my debian, hoping that maybe there were some improvements in that department. Sadly no, I got many 'Fixed discount file extents for inode' messages, but the next `btrfsck` revealed the same old file extent discount errors.
>>>
>>> It looked like the error flag is simply not cleared, so I finally looked into the code. When I found repair_inode_discount_extent() in cmds-check.c, I thought I'd found the bug. I_ERR_FILE_EXTENT_DISCOUNT is cleared only within the while (node) loop, so if there are no file extent holes, it won't be cleared. So I moved
>>>
>>>         if (RB_EMPTY_ROOT(&rec->holes))
>>>                 rec->errors &= ~I_ERR_FILE_EXTENT_DISCOUNT;
>>
>> Thanks a lot for pointing out the problem. I'll try to fix it soon.
>
> I would send a patch separately if I was convinced that it fixes the real problem, but as you can read in the rest of the mail, I am not. It may seem like a slight optimization (checking things once instead of repeatedly), but it also masks the error flag (i.e. clears it) for cases that are not really fixed in the function, and only the next btrfsck run will set this file extent discount error flag again (in case of these holeless inodes having extent_end < isize), so I think that it needs to be postponed until repair_inode_discount_extent() is smart enough to thoroughly fix the inode's extents deficiency.
>
>> Also, welcome aboard to btrfs development! :)
>
> Thank you, but I don't plan to truly dive into btrfs (at least yet). :) I just hoped I could work my problem out myself and even if not, I could at least provide a more detailed report than "File extent discount errors are not fixed by btrfsck."
>
>>> after the while loop. It must have helped clearing the error flag during `btrfsck --repair`, but rerunning `btrfsck` revealed that there are still the same file extent discount errors, so apparently they were set again in some other code path.
>>>
>>> I added a debug printf to verify that RB_EMPTY_ROOT(&rec->holes) was not false (i.e. 0) and another one in maybe_free_inode_rec() after the conditions that lead to setting the I_ERR_FILE_EXTENT_DISCOUNT error flag, to see the values that met the conditions:
>>>
>>>         if (rec->nlink > 0 && !no_holes &&
>>>             (rec->extent_end < rec->isize ||
>>>              first_extent_gap(&rec->holes) < rec->isize))
>>>
>>> Rerunning `btrfsck` gave me this new info:
>>>
>>> Checking filesystem on /dev/sda7
>>> UUID: 8b889e4c-5dba-43e3-a116-e13874bfb311
>>> !Set discount file extents for inode1521634 (nlink=1 extent_end=0 isize=1408
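
The condition quoted from maybe_free_inode_rec() can be rendered as a small standalone sketch to show why an inode with an empty hole list is still flagged. This is an illustration only, not btrfs-progs code: the function names mirror the C identifiers in the thread, the `U64_MAX` sentinel for "no holes" is an assumption about first_extent_gap(), and all numeric values are illustrative (the isize in the debug line above is cut off).

```python
U64_MAX = 2**64 - 1


def first_extent_gap(holes):
    """Start of the first hole, or U64_MAX when there are none (assumed sentinel)."""
    return min(start for start, _length in holes) if holes else U64_MAX


def has_extent_discount(nlink, no_holes, extent_end, isize, holes):
    """Python rendering of the condition quoted from maybe_free_inode_rec()."""
    return (nlink > 0 and not no_holes and
            (extent_end < isize or first_extent_gap(holes) < isize))


if __name__ == "__main__":
    # Inode with a recorded hole inside the file size: flagged via the hole.
    print(has_extent_discount(1, False, 16384, 16384, [(12288, 4096)]))

    # Inode with no recorded holes, but extents ending before isize:
    # still flagged, through the extent_end < isize branch.
    print(has_extent_discount(1, False, 0, 1408, []))
```

Under these assumptions, clearing the flag whenever the hole tree is empty does not make the second inode pass the next check, which matches the observation that the errors reappear on the following `btrfsck` run.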
BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)
I had a few power offs due to a faulty power supply, and my mdadm raid5 got into fail mode after 2 drives got kicked out since their sequence numbers didn't match due to the abrupt power offs.

I brought the swraid5 back up by force assembling it with 4 drives (one was really only a few sequence numbers behind), and it's doing a full parity rebuild on the 5th drive that was farther behind.

So I can understand how I may have had a few blocks that are in a bad state. I'm getting a few (not many) of those messages in syslog.

BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)

Filesystem looks like this:

Label: 'btrfs_pool1'  uuid: 6358304a-2234-4243-b02d-4944c9af47d7
	Total devices 1 FS bytes used 8.29TiB
	devid    1 size 14.55TiB used 8.32TiB path /dev/mapper/dshelf1

gargamel:~# btrfs fi df /mnt/btrfs_pool1
Data, single: total=8.29TiB, used=8.28TiB
System, DUP: total=8.00MiB, used=920.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=14.00GiB, used=10.58GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B

Kernel 3.19.8.

Just to make sure I understand, do those messages in syslog mean that my metadata got corrupted a bit, but because I have 2 copies, btrfs can fix the bad copy by using the good one?

Also, if my actual data got corrupted, am I correct that btrfs will detect the checksum failure and give me a different error message of a read error that cannot be corrected?

I'll do a scrub later, for now I have to wait 20 hours for the raid rebuild first.

Thanks,
Marc
--
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
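
The two cases asked about here (a damaged block with a second copy versus a damaged block with only one copy) can be sketched as follows. This is a toy illustration, not btrfs code: `read_block` and `csum` are hypothetical names, and `zlib.crc32` is used only as a stand-in checksum (btrfs uses crc32c, which is a different polynomial).

```python
import zlib


def csum(data):
    # Stand-in checksum; btrfs itself uses crc32c, not zlib's crc32.
    return zlib.crc32(bytes(data))


def read_block(copies, expected):
    """Return (data, repaired_indexes); raise IOError if no copy verifies.

    copies   -- list of bytearrays, one per stored copy (2 for DUP, 1 for single)
    expected -- checksum recorded elsewhere for this block
    """
    good = None
    for copy in copies:
        if csum(copy) == expected:
            good = bytes(copy)
            break
    if good is None:
        raise IOError("checksum mismatch on every copy: uncorrectable")

    repaired = []
    for i, copy in enumerate(copies):
        if csum(copy) != expected:
            copies[i] = bytearray(good)   # overwrite the bad copy with the good one
            repaired.append(i)
    return good, repaired


if __name__ == "__main__":
    payload = b"metadata block"
    expected = csum(payload)

    damaged = bytearray(payload)
    damaged[0] ^= 0x01                    # flip one bit to simulate corruption

    # DUP-style layout: first copy damaged, second copy intact.
    dup = [bytearray(damaged), bytearray(payload)]
    print(read_block(dup, expected))

    # single-style layout: the only copy is damaged.
    try:
        read_block([bytearray(damaged)], expected)
    except IOError as exc:
        print("error:", exc)
```

With two copies the bad one is overwritten from the good one, which corresponds to the "read error corrected" case; with a single copy there is nothing to repair from, so the read fails instead.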
[PATCH] Btrfs: fix fsync xattr loss in the fast fsync path
From: Filipe Manana fdman...@suse.com

After commit 4f764e515361 (Btrfs: remove deleted xattrs on fsync log replay), we can end up in a situation where during log replay we end up deleting xattrs that were never deleted when their file was last fsynced.

This happens in the fast fsync path (flag BTRFS_INODE_NEEDS_FULL_SYNC is not set in the inode) if the inode has the flag BTRFS_INODE_COPY_EVERYTHING set, the xattr was added in a past transaction and the leaf where the xattr is located was not updated (COWed or created) in the current transaction. In this scenario the xattr item never ends up in the log tree, which at log replay time makes the replay code delete the xattr from the fs/subvol tree, as it thinks that the xattr was deleted prior to the last fsync.

Fix this by using a new item key type that represents xattrs to be deleted at log replay time. This key type is only used in the log tree. By using this explicit item we can continue to log only xattrs that were added (or modified) in the current transaction instead of all xattrs, while still keeping the intention of commit 4f764e515361 (Btrfs: remove deleted xattrs on fsync log replay).

This issue is reproducible with the following test case for fstests:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  here=`pwd`
  tmp=/tmp/$$
  status=1	# failure is the default!

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }
  trap "_cleanup; exit \$status" 0 1 2 3 15

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey
  . ./common/attr

  # real QA test starts here
  _supported_fs generic
  _supported_os Linux
  _need_to_be_root
  _require_scratch
  _require_dm_flakey
  _require_attrs
  _require_metadata_journaling $SCRATCH_DEV

  _crash_and_mount()
  {
      # Simulate a crash/power loss.
      _load_flakey_table $FLAKEY_DROP_WRITES
      _unmount_flakey
      # Allow writes again and mount. This makes the fs replay its fsync log.
      _load_flakey_table $FLAKEY_ALLOW_WRITES
      _mount_flakey
  }

  rm -f $seqres.full

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create the test file with some initial data and make sure everything is
  # durably persisted. We do fsync before calling sync to make sure that if the
  # filesystem is btrfs, we get the flag BTRFS_INODE_NEEDS_FULL_SYNC cleared
  # from the btrfs inode - a condition necessary to trigger the issue in btrfs.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
                  -c "fsync" \
                  $SCRATCH_MNT/foo | _filter_xfs_io
  sync

  # Add a xattr to our file.
  $SETFATTR_PROG -n user.attr -v somevalue $SCRATCH_MNT/foo

  # Sync the filesystem to force a commit of the current btrfs transaction, this
  # is a necessary condition to trigger the bug on btrfs.
  sync

  # Now update our file's data and fsync the file.
  # After a successful fsync, if the fsync log/journal is replayed we expect to
  # see the xattr named user.attr with a value of somevalue (and the updated
  # file data of course). Btrfs used to remove the xattr when it replayed the
  # fsync log/journal.
  $XFS_IO_PROG -c "pwrite -S 0xbb 8K 16K" \
               -c "fsync" \
               $SCRATCH_MNT/foo | _filter_xfs_io

  echo "File content after fsync and before crash:"
  od -t x1 $SCRATCH_MNT/foo

  echo "File xattrs after fsync and before crash:"
  $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foo | _filter_scratch

  _crash_and_mount

  echo "File content after crash and log replay:"
  od -t x1 $SCRATCH_MNT/foo

  echo "File xattrs after crash and log replay:"
  $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foo | _filter_scratch

  status=0
  exit

The expected golden output for this test:

  wrote 32768/32768 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 16384/16384 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after fsync and before crash:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0060000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0100000
  File xattrs after fsync and before crash:
  # file: SCRATCH_MNT/foo
  user.attr="somevalue"

  File content after crash and log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0060000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0100000
  File xattrs after crash and log replay:
  # file: SCRATCH_MNT/foo
  user.attr="somevalue"

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/ctree.h    |  15 ++
 fs/btrfs/dir-item.c |  71 ++-
 fs/btrfs/inode.c    |   5 +-
 fs/btrfs/tree-log.c | 138 +---
 fs/btrfs/tree-log.h |  10
 fs/btrfs/xattr.c    |  10
 6 files changed, 198
[PATCH] Btrfs: fix fsync data loss after append write
From: Filipe Manana fdman...@suse.com

If we do an append write to a file (which increases its inode's i_size) that does not have the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode, and the previous transaction added a new hard link to the file, which sets the flag BTRFS_INODE_COPY_EVERYTHING in the file's inode, and then fsync the file, the inode's new i_size isn't logged. This has the consequence that after the fsync log is replayed, the file size remains what it was before the append write operation, which means users/applications will not be able to read the data that was successfully fsync'ed before.

This happens because neither the inode item nor the delayed inode get their i_size updated when the append write is made - doing so would require starting a transaction in the buffered write path, something that we do not do intentionally for performance reasons.

Fix this by making sure that when the flag BTRFS_INODE_COPY_EVERYTHING is set the inode is logged with its current i_size (log the in-memory inode into the log tree).

This issue is not a recent regression and is easy to reproduce with the following test case for fstests:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo "QA output created by $seq"

  here=`pwd`
  tmp=/tmp/$$
  status=1	# failure is the default!

  _cleanup()
  {
      _cleanup_flakey
      rm -f $tmp.*
  }
  trap "_cleanup; exit \$status" 0 1 2 3 15

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _supported_fs generic
  _supported_os Linux
  _need_to_be_root
  _require_scratch
  _require_dm_flakey
  _require_metadata_journaling $SCRATCH_DEV

  _crash_and_mount()
  {
      # Simulate a crash/power loss.
      _load_flakey_table $FLAKEY_DROP_WRITES
      _unmount_flakey
      # Allow writes again and mount. This makes the fs replay its fsync log.
      _load_flakey_table $FLAKEY_ALLOW_WRITES
      _mount_flakey
  }

  rm -f $seqres.full

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create the test file with some initial data and then fsync it.
  # The fsync here is only needed to trigger the issue in btrfs, as it causes
  # the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
                  -c "fsync" \
                  $SCRATCH_MNT/foo | _filter_xfs_io
  sync

  # Add a hard link to our file.
  # On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
  # which is a necessary condition to trigger the issue.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar

  # Sync the filesystem to force a commit of the current btrfs transaction, this
  # is a necessary condition to trigger the bug on btrfs.
  sync

  # Now append more data to our file, increasing its size, and fsync the file.
  # In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
  # write path did not update the inode item in the btree nor the delayed inode
  # item (in memory structure) in the current transaction (created by the fsync
  # handler), the fsync did not record the inode's new i_size in the fsync
  # log/journal. This made the data unavailable after the fsync log/journal is
  # replayed.
  $XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \
               -c "fsync" \
               $SCRATCH_MNT/foo | _filter_xfs_io

  echo "File content after fsync and before crash:"
  od -t x1 $SCRATCH_MNT/foo

  _crash_and_mount

  echo "File content after crash and log replay:"
  od -t x1 $SCRATCH_MNT/foo

  status=0
  exit

The expected file output before and after the crash/power failure expects the appended data to be available, which is:

  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0100000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0200000

Cc: sta...@vger.kernel.org
Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/tree-log.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index d049683..4920fce 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4161,6 +4161,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 	u64 ino = btrfs_ino(inode);
 	struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
 	u64 logged_isize = 0;
+	bool need_log_inode_item = true;
 
 	path = btrfs_alloc_path();
 	if (!path)
@@ -4269,11 +4270,6 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
 		} else {
 			if (inode_only == LOG_INODE_ALL)
 				fast_search = true;
-			ret = log_inode_item(trans, log, dst_path, inode);
-			if (ret) {
-				err = ret;
-				goto out_unlock;
-			}
 			goto log_extents;
 		}
Btrfs progs release 4.1-rc1
Hi,

unusual load of changes. Among the small UI enhancements, the mkfs output rework is worth mentioning separately. It's based on Goffredo's patches but I've tweaked the output:

Current: https://git.kernel.org/cgit/linux/kernel/git/kdave/btrfs-progs.git/commit/?id=60447579c41a10069b67d502b317bf57519acdd3
Original proposal: https://git.kernel.org/cgit/linux/kernel/git/kdave/btrfs-progs.git/commit/?id=3afd86683b7324c9fec94ca2c35b64ac72ab523f

The ETA for 4.1 release is next week, please give it a test and report bugs.

* bugfixes
  - fsck.btrfs: no bash-isms
  - bugzilla 97171: invalid memory access (with tests)
  - receive:
    - cloning works with --chroot
    - capabilities not lost
  - mkfs: do not try to register bare file images
  - option --help accepted by the standalone utilities

* enhancements
  - corrupt block: ability to remove csums
  - mkfs:
    - warn if metadata redundancy is lower than for data
    - options to make the output quiet (only errors)
    - mixed case names of raid profiles accepted
    - rework the output:
      - more compact, 'key: value' format
  - subvol:
    - show:
      - print received uuid
      - update the output
    - sync:
      - grab all deleted ids and print them as they're removed, previous implementation only checked if there are any to be deleted - change in command semantics
  - scrub: print timestamps in days HMS format
  - receive:
    - can specify mount point, do not rely on /proc
    - can work inside subvolumes
  - send:
    - new option to send stream without data (NO_FILE_DATA)
  - convert:
    - specify incompat features on the new fs
  - qgroup:
    - show: distinguish no limits and 0 limit value
    - limit: ability to clear the limit
  - help for 'btrfs' is shorter, 1st level command overview
  - debug tree: print key names according to their C name

* new
  - rescue zero-log
  - btrfstune:
    - rewrite uuid on a filesystem image
    - new option to turn on NO_HOLES incompat feature

* deprecated
  - standalone btrfs-zero-log

* other
  - testing framework updates
    - uuid rewrite test
    - btrfstune feature setting test
    - zero-log tests
    - more testing image formats
  - manual page updates
  - ioctl.h synced with current kernel uapi version
  - convert: preparatory works for more filesystems (reiserfs pending)
  - use static buffers for path handling where possible
  - add new helpers for send utils that check memory allocations, switch all users, deprecate old helpers
  - Makefile: fix build dependency generation

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

Anand Jain (2):
      btrfs-progs: add info about list-all to the help
      btrfs-progs: use function is_block_device() instead

Dimitri John Ledkov (1):
      btrfs-progs: fsck.btrfs: Fix bashism and bad getopts processing

Dongsheng Yang (4):
      btrfs-progs: qgroup: show 'none' when we did not limit it on this qgroup
      btrfs-progs: qgroup: allow user to clear some limitation on qgroup.
      btrfs-progs: qgroup limit: error out if input value is negative
      btrfs-progs: qgroup limit: add a check for invalid input of 'T/G/M/K'

Emil Karlson (1):
      btrfs-progs: use openat for process_clone in receive

Goffredo Baroncelli (4):
      btrfs-progs: add strdup in btrfs_add_to_fsid() to track the device path
      btrfs-progs: return the fsid from make_btrfs()
      btrfs-progs: mkfs: track sizes of created block groups
      btrfs-progs: mkfs: print the summary

Jeff Mahoney (8):
      btrfs-progs: convert: clean up blk_iterate_data handling wrt record_file_blocks
      btrfs-progs: convert: remove unused fs argument from block_iterate_proc
      btrfs-progs: convert: remove unused inode_key in copy_single_inode
      btrfs-progs: convert: rename ext2_root to image_root
      btrfs-progs: compat: define DIV_ROUND_UP if not already defined
      btrfs-progs: convert: fix typo in btrfs_insert_dir_item call
      btrfs-progs: convert: factor out adding dirent into convert_insert_dirent
      btrfs-progs: convert: factor out block iteration callback

Josef Bacik (3):
      Btrfs-progs: corrupt-block: add the ability to remove csums
      btrfs-progs: specify mountpoint for
recieve btrfs-progs: make receive work inside of subvolumes Qu Wenruo (6): btrfs-progs: Enhance read_tree_block to avoid memory corruption btrfs-progs: btrfstune: rework change_uuid btrfs-progs: btrfstune: add ability to restore unfinished fsid change btrfs-progs: btrfstune: add '-U' and '-u' option to change fsid btrfs-progs: Documentation: uuid change btrfs-progs: btrfstune: fix a bug which makes unfinished fsid change unrecoverable Sam Tygier (1): btrfs-progs: mkfs: check metadata redundancy David Sterba (73): btrfs-progs: tests: log the test name in results file btrfs-progs: tests: support more
[PATCH] fstests: generic test for fsync after adding hard link to a file
From: Filipe Manana <fdman...@suse.com>

This test is motivated by an issue found in btrfs.
It tests that after syncing the filesystem, adding a hard link to a file,
syncing the filesystem again, doing a write to the file that increases
its size and then doing a fsync against that file, durably persists the
data written to the file. That is, after log/journal replay, the data
is available.

The btrfs issue is fixed by the commit titled:

    "Btrfs: fix fsync data loss after append write"

Signed-off-by: Filipe Manana <fdman...@suse.com>
---
 tests/generic/090     | 108 ++
 tests/generic/090.out |  17
 tests/generic/group   |   1 +
 3 files changed, 126 insertions(+)
 create mode 100755 tests/generic/090
 create mode 100644 tests/generic/090.out

diff --git a/tests/generic/090 b/tests/generic/090
new file mode 100755
index 000..a1f2b89
--- /dev/null
+++ b/tests/generic/090
@@ -0,0 +1,108 @@
+#! /bin/bash
+# FS QA Test No. 090
+#
+# Test that after syncing the filesystem, adding a hard link to a file,
+# syncing the filesystem again, doing a write to the file that increases
+# its size and then doing a fsync against that file, durably persists the
+# data written to the file. That is, after log/journal replay, the data
+# is available.
+#
+# This test is motivated by a bug found in btrfs.
+#
+#-----------------------------------------------------------------------
+# Copyright (C) 2015 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana <fdman...@suse.com>
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+
+_cleanup()
+{
+	_cleanup_flakey
+	rm -f $tmp.*
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_need_to_be_root
+_require_scratch
+_require_dm_flakey
+_require_metadata_journaling $SCRATCH_DEV
+
+rm -f $seqres.full
+
+_scratch_mkfs >> $seqres.full 2>&1
+_init_flakey
+_mount_flakey
+
+# Create the test file with some initial data and then fsync it.
+# The fsync here is only needed to trigger the issue in btrfs, as it causes the
+# the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
+$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
+		-c "fsync" \
+		$SCRATCH_MNT/foo | _filter_xfs_io
+sync
+
+# Add a hard link to our file.
+# On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
+# which is a necessary condition to trigger the issue.
+ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar
+
+# Sync the filesystem to force a commit of the current btrfs transaction, this
+# is a necessary condition to trigger the bug on btrfs.
+sync
+
+# Now append more data to our file, increasing its size, and fsync the file.
+# In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
+# write path did not update the inode item in the btree nor the delayed inode
+# item (in memory structure) in the current transaction (created by the fsync
+# handler), the fsync did not record the inode's new i_size in the fsync
+# log/journal. This made the data unavailable after the fsync log/journal is
+# replayed.
+$XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \
+		-c "fsync" \
+		$SCRATCH_MNT/foo | _filter_xfs_io
+
+echo "File content after fsync and before crash:"
+od -t x1 $SCRATCH_MNT/foo
+
+# Simulate a crash/power loss.
+_load_flakey_table $FLAKEY_DROP_WRITES
+_unmount_flakey
+
+# Allow writes again and mount. This makes the fs replay its fsync log.
+_load_flakey_table $FLAKEY_ALLOW_WRITES
+_mount_flakey
+
+echo "File content after crash and log replay:"
+od -t x1 $SCRATCH_MNT/foo
+
+status=0
+exit
diff --git a/tests/generic/090.out b/tests/generic/090.out
new file mode 100644
index 000..4a4423a
--- /dev/null
+++ b/tests/generic/090.out
@@ -0,0 +1,17 @@
+QA output created by 090
+wrote 32768/32768 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 32768/32768 bytes at offset 32768
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+File content after fsync
[RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION
MS_I_VERSION is enabled by default for btrfs, this adds an alternative
option to toggle it off.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/super.c |    7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 05fef19..e610e3e 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -324,7 +324,7 @@ enum {
 	Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_rescan_uuid_tree,
 	Opt_commit_interval, Opt_barrier, Opt_nodefrag, Opt_nodiscard,
 	Opt_noenospc_debug, Opt_noflushoncommit, Opt_acl, Opt_datacow,
-	Opt_datasum, Opt_treelog, Opt_noinode_cache,
+	Opt_datasum, Opt_treelog, Opt_noinode_cache, Opt_noi_version,
 	Opt_err,
 };
 
@@ -351,6 +351,7 @@ static match_table_t tokens = {
 	{Opt_nossd, "nossd"},
 	{Opt_acl, "acl"},
 	{Opt_noacl, "noacl"},
+	{Opt_noi_version, "noi_version"},
 	{Opt_notreelog, "notreelog"},
 	{Opt_treelog, "treelog"},
 	{Opt_flushoncommit, "flushoncommit"},
@@ -593,6 +594,10 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 		case Opt_noacl:
 			root->fs_info->sb->s_flags &= ~MS_POSIXACL;
 			break;
+		case Opt_noi_version:
+			root->fs_info->sb->s_flags &= ~MS_I_VERSION;
+			btrfs_info(root->fs_info, "disable i_version");
+			break;
 		case Opt_notreelog:
 			btrfs_set_and_info(root, NOTREELOG,
 					   "disabling tree log");
-- 
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
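
For readers who want to see the option handling from the patch above in isolation, here is a minimal user-space sketch. It is not the kernel code: the flag bits, the table, and every `demo_*` / `DEMO_*` name are hypothetical stand-ins, and unknown options are silently ignored here whereas the kernel parser rejects them. It only shows the same idea, a name-to-flag table plus a bit-clear per matched token.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the superblock flag bits (not kernel values). */
#define DEMO_POSIXACL  (1UL << 0)
#define DEMO_I_VERSION (1UL << 1)

/* Option name -> flag bit to clear, loosely mirroring a token table. */
struct demo_opt {
	const char *name;
	unsigned long clear_mask;
};

static const struct demo_opt demo_opts[] = {
	{ "noacl",       DEMO_POSIXACL  },
	{ "noi_version", DEMO_I_VERSION },
};

/*
 * Walk a comma-separated option string and clear the flag that belongs
 * to each recognised option. Returns the updated flags.
 */
static unsigned long demo_parse_options(unsigned long flags, const char *options)
{
	assert(options != NULL);

	while (*options) {
		size_t n = strcspn(options, ",");	/* length of this token */
		size_t i;

		for (i = 0; i < sizeof(demo_opts) / sizeof(demo_opts[0]); i++) {
			if (strlen(demo_opts[i].name) == n &&
			    strncmp(options, demo_opts[i].name, n) == 0)
				flags &= ~demo_opts[i].clear_mask;
		}

		options += n;
		if (*options == ',')
			options++;	/* skip the separator */
	}
	return flags;
}
```

Calling `demo_parse_options(DEMO_POSIXACL | DEMO_I_VERSION, "noi_version")` leaves only `DEMO_POSIXACL` set, which is the user-space analogue of mounting with `-o noi_version`.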
[RFC PATCH v2 2/2] Btrfs: improve fsync for nocow file
If we're overwriting an allocated file without changing timestamp and inode version, and the file is with NODATACOW, we don't have any metadata to commit, thus we can just flush the data device cache and go forward. However, if there's have any change on extents' disk bytenr, inode size, timestamp or inode version, we need to go through the normal btrfs_log_inode path. Test: 1. sysbench test of 1 file + 1 thread + bs=4k + size=40k + synchronous I/O mode + randomwrite + fsync_on_each_write, 2. loop device associated with tmpfs file 3. - For btrfs, -o nodatacow and -o noi_version option - For ext4 and xfs, no extra mount options Results: - btrfs: w/o: ~30Mb/sec w: ~131Mb/sec - other filesystems: (both don't enable i_version by default) ext4: 203Mb/sec xfs: 212Mb/sec Signed-off-by: Liu Bo bo.li@oracle.com --- v2: Catch errors from data writeback and skip barrier if necessary. fs/btrfs/btrfs_inode.h |2 + fs/btrfs/disk-io.c |2 +- fs/btrfs/disk-io.h |1 + fs/btrfs/file.c| 54 +++ fs/btrfs/inode.c |3 ++ 5 files changed, 56 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h index de5e4f2..f7b99b6 100644 --- a/fs/btrfs/btrfs_inode.h +++ b/fs/btrfs/btrfs_inode.h @@ -44,6 +44,8 @@ #define BTRFS_INODE_IN_DELALLOC_LIST 9 #define BTRFS_INODE_READDIO_NEED_LOCK 10 #define BTRFS_INODE_HAS_PROPS 11 +#define BTRFS_INODE_NOTIMESTAMP12 +#define BTRFS_INODE_NOISIZE13 /* * The following 3 bits are meant only for the btree inode. 
* When any of them is set, it means an error happened while writing an diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 639f266..8a41df1 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3299,7 +3299,7 @@ static int write_dev_flush(struct btrfs_device *device, int wait) * send an empty flush down to each device in parallel, * then wait for them */ -static int barrier_all_devices(struct btrfs_fs_info *info) +int barrier_all_devices(struct btrfs_fs_info *info) { struct list_head *head; struct btrfs_device *dev; diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h index 27d44c0..bea982c 100644 --- a/fs/btrfs/disk-io.h +++ b/fs/btrfs/disk-io.h @@ -60,6 +60,7 @@ void close_ctree(struct btrfs_root *root); int write_ctree_super(struct btrfs_trans_handle *trans, struct btrfs_root *root, int max_mirrors); struct buffer_head *btrfs_read_dev_super(struct block_device *bdev); +int barrier_all_devices(struct btrfs_fs_info *info); int btrfs_commit_super(struct btrfs_root *root); struct extent_buffer *btrfs_find_tree_block(struct btrfs_root *root, u64 bytenr); diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index faa7d39..180a3e1 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -523,8 +523,12 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode, * the disk i_size. There is no need to log the inode * at this time. 
*/ - if (end_pos isize) + if (end_pos isize) { i_size_write(inode, end_pos); + clear_bit(BTRFS_INODE_NOISIZE, BTRFS_I(inode)-runtime_flags); + } else { + set_bit(BTRFS_INODE_NOISIZE, BTRFS_I(inode)-runtime_flags); + } return 0; } @@ -1715,19 +1719,33 @@ out: static void update_time_for_write(struct inode *inode) { struct timespec now; + int sync_it = 0; - if (IS_NOCMTIME(inode)) + if (IS_NOCMTIME(inode)) { + set_bit(BTRFS_INODE_NOTIMESTAMP, BTRFS_I(inode)-runtime_flags); return; + } now = current_fs_time(inode-i_sb); - if (!timespec_equal(inode-i_mtime, now)) + if (!timespec_equal(inode-i_mtime, now)) { inode-i_mtime = now; + sync_it = S_MTIME; + } - if (!timespec_equal(inode-i_ctime, now)) + if (!timespec_equal(inode-i_ctime, now)) { inode-i_ctime = now; + sync_it |= S_CTIME; + } - if (IS_I_VERSION(inode)) + if (IS_I_VERSION(inode)) { inode_inc_iversion(inode); + sync_it |= S_VERSION; + } + + if (!sync_it) + set_bit(BTRFS_INODE_NOTIMESTAMP, BTRFS_I(inode)-runtime_flags); + else + clear_bit(BTRFS_INODE_NOTIMESTAMP, BTRFS_I(inode)-runtime_flags); } static ssize_t btrfs_file_write_iter(struct kiocb *iocb, @@ -1983,6 +2001,32 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) goto out; } + if (BTRFS_I(inode)-flags BTRFS_INODE_NODATACOW) { + if
Re: RAID1: system stability
Update: I tried removing the disk the 'right' way:

# echo 1 > /sys/block/sdf/device/delete

That went fine: the system does not crash immediately on a 'sync' call and
can work for some time without problems. But after a certain command, which
I can reproduce with:

# apt-get update

the test system (the one on which I deleted one of the raid1 btrfs devices)
gets a kernel crash. I got the following dmesg:

Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Modules linked in: 8021q garp mrp stp llc binfmt_misc gpio_ich coretemp kvm_intel lpc_ich ipmi_ssif kvm amdkfd amd_iommu_v2 serio_raw radeon ttm i5000_edac drm_kms_helper drm edac_core i2c_algo_bit i5k_amb ioatdma dca shpchp 8250_fintek joydev mac_hid ipmi_si ipmi_msghandler bonding autofs4 btrfs xor raid6_pq ses enclosure hid_generic psmouse usbhid hid mptsas mptscsih e1000e mptbase scsi_transport_sas ptp pps_core
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: CPU: 3 PID: 99 Comm: kworker/u16:4 Not tainted 4.0.4-040004-generic #201505171336
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Hardware name: Intel S5000VSA/S5000VSA, BIOS S5000.86B.15.00.0101.110920101604 11/09/2010
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: task: 88009ab31400 ti: 88009ab4 task.ti: 88009ab4
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RIP: 0010:[c0477d50] [c0477d50] repair_io_failure+0x1c0/0x200 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RSP: 0018:88009ab43bb8 EFLAGS: 00010206
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RAX: RBX: 88009b1d3f30 RCX: 88009b53f9c0
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RDX: 88044902f400 RSI: RDI: 88009b53f9c0
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RBP: 88009ab43c18 R08: R09:
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: R10: 880448c1b090 R11: R12: 3907
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: R13: 880439599e68 R14: 1000 R15: 88009a86
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: FS: () GS:88045fcc() knlGS:
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: CS: 0010 DS: ES: CR0: 8005003b
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: CR2: 7f640a27e675 CR3: 98b4b000 CR4: 000407e0
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Stack:
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: 9a860de0 ea0002644380 0003d2ee8000
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: 8000 88009b53f9c0 88009ab43c18 88009b1d3f30
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: 88044c44a3c0 88009b0c1190 88009a86
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Call Trace:
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [c0477f30] clean_io_failure+0x1a0/0x1b0 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [c0478218] end_bio_extent_readpage+0x2d8/0x3d0 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [8137b2c3] bio_endio+0x53/0xa0
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [8137b322] bio_endio_nodec+0x12/0x20
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [c044efb8] end_workqueue_fn+0x48/0x60 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [c0488b2e] normal_work_helper+0x7e/0x1b0 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [c0488d32] btrfs_endio_helper+0x12/0x20 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [81092204] process_one_work+0x144/0x490
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [81092c6e] worker_thread+0x11e/0x450
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [81092b50] ? create_worker+0x1f0/0x1f0
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [81098999] kthread+0xc9/0xe0
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [810988d0] ? flush_kthread_worker+0x90/0x90
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [817f08d8] ret_from_fork+0x58/0x90
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: [810988d0] ? flush_kthread_worker+0x90/0x90
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Code: 44 00 00 4c 89 ef e8 b0 34 f0 c0 31 f6 4c 89 e7 e8 06 05 01 00 ba fb ff ff ff e9 c7 fe ff ff ba fb ff ff ff e9 bd fe ff ff 0f 0b 0f 0b 49 8b 4c 24 30 48 8b b3 58 fe ff ff 48 83 c1 10 48 85 f6
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RIP [c0477d50] repair_io_failure+0x1c0/0x200 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RSP 88009ab43bb8
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: ---[ end trace 0361c6fdca5f7ee2 ]---

---

Another test case: I deleted the device:

# echo 1 > /sys/block/sdf/device/delete

Then I reinserted the device (physically removed it from the server and
inserted it again). The server detected it as sdg, which is fine, but the
kernel crashed with the following stack trace:

---
Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: kernel BUG at
Re: [PATCH 3/7] btrfs: skip superblocks during discard
On Mon, Jun 15, 2015 at 2:41 PM, <je...@suse.com> wrote:
> From: Jeff Mahoney <je...@suse.com>
>
> Btrfs doesn't track superblocks with extent records so there is nothing
> persistent on-disk to indicate that those blocks are in use. We track
> the superblocks in memory to ensure they don't get used by removing them
> from the free space cache when we load a block group from disk. Prior
> to 47ab2a6c6a (Btrfs: remove empty block groups automatically), that
> was fine since the block group would never be reclaimed so the superblock
> was always safe. Once we started removing the empty block groups, we
> were protected by the fact that discards weren't being properly issued
> for unused space either via FITRIM or -odiscard. The block groups were
> still being released, but the blocks remained on disk.
>
> In order to properly discard unused block groups, we need to filter out
> the superblocks from the discard range. Superblocks are located at fixed
> locations on each device, so it makes sense to filter them out in
> btrfs_issue_discard, which is used by both -odiscard and FITRIM.
>
> Signed-off-by: Jeff Mahoney <je...@suse.com>

Reviewed-by: Filipe Manana <fdman...@suse.com>
Tested-by: Filipe Manana <fdman...@suse.com>

> ---
>  fs/btrfs/extent-tree.c | 59 ++
>  1 file changed, 55 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index cf9cefd..1e44b93 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1884,10 +1884,12 @@ static int remove_extent_backref(struct btrfs_trans_handle *trans,
>  	return ret;
>  }
>
> +#define in_range(b, first, len)	((b) >= (first) && (b) < (first) + (len))
>  static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
>  			       u64 *discarded_bytes)
>  {
> -	int ret = 0;
> +	int j, ret = 0;
> +	u64 bytes_left, end;
>  	u64 aligned_start = ALIGN(start, 1 << 9);
>
>  	if (WARN_ON(start != aligned_start)) {
> @@ -1897,11 +1899,60 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
>  	}
>
>  	*discarded_bytes = 0;
> -	if (len) {
> -		ret = blkdev_issue_discard(bdev, start >> 9, len >> 9,
> +
> +	if (!len)
> +		return 0;
> +
> +	end = start + len;
> +	bytes_left = len;
> +
> +	/* Skip any superblocks on this device. */
> +	for (j = 0; j < BTRFS_SUPER_MIRROR_MAX; j++) {
> +		u64 sb_start = btrfs_sb_offset(j);
> +		u64 sb_end = sb_start + BTRFS_SUPER_INFO_SIZE;
> +		u64 size = sb_start - start;
> +
> +		if (!in_range(sb_start, start, bytes_left) &&
> +		    !in_range(sb_end, start, bytes_left) &&
> +		    !in_range(start, sb_start, BTRFS_SUPER_INFO_SIZE))
> +			continue;
> +
> +		/*
> +		 * Superblock spans beginning of range.  Adjust start and
> +		 * try again.
> +		 */
> +		if (sb_start <= start) {
> +			start += sb_end - start;
> +			if (start > end) {
> +				bytes_left = 0;
> +				break;
> +			}
> +			bytes_left = end - start;
> +			continue;
> +		}
> +
> +		if (size) {
> +			ret = blkdev_issue_discard(bdev, start >> 9, size >> 9,
> +						   GFP_NOFS, 0);
> +			if (!ret)
> +				*discarded_bytes += size;
> +			else if (ret != -EOPNOTSUPP)
> +				return ret;
> +		}
> +
> +		start = sb_end;
> +		if (start > end) {
> +			bytes_left = 0;
> +			break;
> +		}
> +		bytes_left = end - start;
> +	}
> +
> +	if (bytes_left) {
> +		ret = blkdev_issue_discard(bdev, start >> 9, bytes_left >> 9,
>  					   GFP_NOFS, 0);
>  		if (!ret)
> -			*discarded_bytes = len;
> +			*discarded_bytes += bytes_left;
>  	}
>  	return ret;
>  }
> --
> 2.4.3

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
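
The range-splitting idea in the patch above can be tried outside the kernel. The following is a simplified user-space sketch, not the patch's code: it uses a plain interval-overlap test instead of the `in_range()` checks, collects the surviving sub-ranges into an array instead of issuing discards, and all `demo_*` / `DEMO_*` names are hypothetical. The superblock offsets (64 KiB, 64 MiB, 256 GiB) and the 4 KiB size reflect my understanding of the btrfs layout and should be treated as assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout constants; hypothetical names, not the kernel macros. */
#define DEMO_SB_MIRROR_MAX 3
#define DEMO_SB_SIZE       4096ULL

struct demo_range {
	uint64_t start;
	uint64_t len;
};

/* Mirror 0 at 64 KiB, mirror 1 at 64 MiB, mirror 2 at 256 GiB. */
static uint64_t demo_sb_offset(int mirror)
{
	return mirror ? (16ULL * 1024) << (12 * mirror) : 64ULL * 1024;
}

/*
 * Split [start, start + len) into the pieces that do not overlap any
 * superblock copy. Writes at most DEMO_SB_MIRROR_MAX + 1 entries to out
 * and returns how many were written.
 */
static size_t demo_split_discard(uint64_t start, uint64_t len,
				 struct demo_range *out)
{
	uint64_t end = start + len;
	size_t n = 0;
	int j;

	assert(out != NULL);

	/* Mirrors are visited in ascending offset order. */
	for (j = 0; j < DEMO_SB_MIRROR_MAX && start < end; j++) {
		uint64_t sb_start = demo_sb_offset(j);
		uint64_t sb_end = sb_start + DEMO_SB_SIZE;

		if (sb_end <= start || sb_start >= end)
			continue;	/* no overlap with what is left */

		if (sb_start > start) {	/* keep the part before the superblock */
			out[n].start = start;
			out[n].len = sb_start - start;
			n++;
		}

		start = sb_end;		/* resume right after the superblock */
	}

	if (start < end) {		/* tail after the last overlapping copy */
		out[n].start = start;
		out[n].len = end - start;
		n++;
	}

	return n;
}
```

For a request covering the first 1 MiB of a device, this yields two pieces, one on each side of the primary superblock at 64 KiB. A request that lies entirely inside a superblock yields none.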
Re: [PATCH 7/7] btrfs: cleanup, stop casting for extent_map-lookup everywhere
On Mon, Jun 15, 2015 at 2:41 PM, je...@suse.com wrote: From: Jeff Mahoney je...@suse.com Overloading extent_map-bdev to struct map_lookup * might have started out as a means to an end, but it's a pattern that's used all over the place now. Let's get rid of the casting and just add a union instead. Signed-off-by: Jeff Mahoney je...@suse.com Reviewed-by: Filipe Manana fdman...@suse.com Tested-by: Filipe Manana fdman...@suse.com --- fs/btrfs/dev-replace.c | 2 +- fs/btrfs/extent_map.c | 2 +- fs/btrfs/extent_map.h | 10 +- fs/btrfs/scrub.c | 2 +- fs/btrfs/volumes.c | 24 5 files changed, 24 insertions(+), 16 deletions(-) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 0573848..2ad3289 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -613,7 +613,7 @@ static void btrfs_dev_replace_update_device_in_mapping_tree( em = lookup_extent_mapping(em_tree, start, (u64)-1); if (!em) break; - map = (struct map_lookup *)em-bdev; + map = em-map_lookup; for (i = 0; i map-num_stripes; i++) if (srcdev == map-stripes[i].dev) map-stripes[i].dev = tgtdev; diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c index 6a98bdd..84fb56d 100644 --- a/fs/btrfs/extent_map.c +++ b/fs/btrfs/extent_map.c @@ -76,7 +76,7 @@ void free_extent_map(struct extent_map *em) WARN_ON(extent_map_in_tree(em)); WARN_ON(!list_empty(em-list)); if (test_bit(EXTENT_FLAG_FS_MAPPING, em-flags)) - kfree(em-bdev); + kfree(em-map_lookup); kmem_cache_free(extent_map_cache, em); } } diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h index b2991fd..eb8b8fa 100644 --- a/fs/btrfs/extent_map.h +++ b/fs/btrfs/extent_map.h @@ -32,7 +32,15 @@ struct extent_map { u64 block_len; u64 generation; unsigned long flags; - struct block_device *bdev; + union { + struct block_device *bdev; + + /* +* used for chunk mappings +* flags EXTENT_FLAG_FS_MAPPING must be set +*/ + struct map_lookup *map_lookup; + }; atomic_t refs; unsigned int compress_type; struct list_head list; diff --git 
a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index ab58115..19f7241d 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -3339,7 +3339,7 @@ static noinline_for_stack int scrub_chunk(struct scrub_ctx *sctx, if (!em) return -EINVAL; - map = (struct map_lookup *)em-bdev; + map = em-map_lookup; if (em-start != chunk_offset) goto out; diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 7fdde31..9f48ae5 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1068,7 +1068,7 @@ again: struct map_lookup *map; int i; - map = (struct map_lookup *)em-bdev; + map = em-map_lookup; for (i = 0; i map-num_stripes; i++) { if (map-stripes[i].dev != device) continue; @@ -2622,7 +2622,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans, free_extent_map(em); return -EINVAL; } - map = (struct map_lookup *)em-bdev; + map = em-map_lookup; for (i = 0; i map-num_stripes; i++) { struct btrfs_device *device = map-stripes[i].dev; @@ -4465,7 +4465,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans, goto error; } set_bit(EXTENT_FLAG_FS_MAPPING, em-flags); - em-bdev = (struct block_device *)map; + em-map_lookup = map; em-start = start; em-len = num_bytes; em-block_start = 0; @@ -4560,7 +4560,7 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans, return -EINVAL; } - map = (struct map_lookup *)em-bdev; + map = em-map_lookup; item_size = btrfs_chunk_item_size(map-num_stripes); stripe_size = em-orig_block_len; @@ -4702,7 +4702,7 @@ int btrfs_chunk_readonly(struct btrfs_root *root, u64 chunk_offset) if (!em) return 1; - map = (struct map_lookup *)em-bdev; + map = em-map_lookup; for (i = 0; i map-num_stripes; i++) { if (map-stripes[i].dev-missing) { miss_ndevs++; @@ -4782,7 +4782,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len) return 1; } - map = (struct map_lookup *)em-bdev; + map =
Re: [PATCH 4/7] btrfs: iterate over unused chunk space in FITRIM
On Mon, Jun 15, 2015 at 2:41 PM, je...@suse.com wrote: From: Jeff Mahoney je...@suse.com Since we now clean up block groups automatically as they become empty, iterating over block groups is no longer sufficient to discard unused space. This patch iterates over the unused chunk space and discards any regions that are unallocated, regardless of whether they were ever used. This is a change for btrfs but is consistent with other file systems. We do this in a transactionless manner since the discard process can take a substantial amount of time and a transaction would need to be started before the acquisition of the device list lock. That would mean a transaction would be held open across /all/ of the discards collectively. In order to prevent other threads from allocating or freeing chunks, we hold the chunks lock across the search and discard calls. We release it between searches to allow the file system to perform more-or-less normally. Since the running transaction can commit and disappear while we're using the transaction pointer, we take a reference to it and release it after the search. This is safe since it would happen normally at the end of the transaction commit after any locks are released anyway. We also take the commit_root_sem to protect against a transaction starting and committing while we're running. Signed-off-by: Jeff Mahoney je...@suse.com Reviewed-by: Filipe Manana fdman...@suse.com Tested-by: Filipe Manana fdman...@suse.com Side note, this still doesn't apply cleanly on latest integration branch (integration-4.2), results in warnings about casting pointer from different type (btrfs_trans_handle to btrfs_transaction) at btrfs_shrink_device(). 
--- fs/btrfs/extent-tree.c | 101 + fs/btrfs/volumes.c | 60 ++--- fs/btrfs/volumes.h | 3 ++ 3 files changed, 141 insertions(+), 23 deletions(-) diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 1e44b93..24b48df 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -10143,10 +10143,99 @@ int btrfs_error_unpin_extent_range(struct btrfs_root *root, u64 start, u64 end) return unpin_extent_range(root, start, end, false); } +/* + * It used to be that old block groups would be left around forever. + * Iterating over them would be enough to trim unused space. Since we + * now automatically remove them, we also need to iterate over unallocated + * space. + * + * We don't want a transaction for this since the discard may take a + * substantial amount of time. We don't require that a transaction be + * running, but we do need to take a running transaction into account + * to ensure that we're not discarding chunks that were released in + * the current transaction. + * + * Holding the chunks lock will prevent other threads from allocating + * or releasing chunks, but it won't prevent a running transaction + * from committing and releasing the memory that the pending chunks + * list head uses. For that, we need to take a reference to the + * transaction. + */ +static int btrfs_trim_free_extents(struct btrfs_device *device, + u64 minlen, u64 *trimmed) +{ + u64 start = 0, len = 0; + int ret; + + *trimmed = 0; + + /* Not writeable = nothing to do. */ + if (!device-writeable) + return 0; + + /* No free space = nothing to do. 
*/ + if (device-total_bytes = device-bytes_used) + return 0; + + ret = 0; + + while (1) { + struct btrfs_fs_info *fs_info = device-dev_root-fs_info; + struct btrfs_transaction *trans; + u64 bytes; + + ret = mutex_lock_interruptible(fs_info-chunk_mutex); + if (ret) + return ret; + + down_read(fs_info-commit_root_sem); + + spin_lock(fs_info-trans_lock); + trans = fs_info-running_transaction; + if (trans) + atomic_inc(trans-use_count); + spin_unlock(fs_info-trans_lock); + + ret = find_free_dev_extent_start(trans, device, minlen, start, +start, len); + if (trans) + btrfs_put_transaction(trans); + + if (ret) { + up_read(fs_info-commit_root_sem); + mutex_unlock(fs_info-chunk_mutex); + if (ret == -ENOSPC) + ret = 0; + break; + } + + ret = btrfs_issue_discard(device-bdev, start, len, bytes); + up_read(fs_info-commit_root_sem); + mutex_unlock(fs_info-chunk_mutex); + + if (ret) + break; + +
Re: [PATCH 5/7] btrfs: explictly delete unused block groups in close_ctree and ro-remount
On Mon, Jun 15, 2015 at 2:41 PM, <je...@suse.com> wrote:
> From: Jeff Mahoney <je...@suse.com>
>
> The cleaner thread may already be sleeping by the time we enter
> close_ctree. If that's the case, we'll skip removing any unused
> block groups queued for removal, even during a normal umount.
> They'll be cleaned up automatically at next mount, but users
> expect a umount to be a clean synchronization point, especially
> when used on thin-provisioned storage with -odiscard. We also
> explicitly remove unused block groups in the ro-remount path
> for the same reason.
>
> Signed-off-by: Jeff Mahoney <je...@suse.com>

Reviewed-by: Filipe Manana <fdman...@suse.com>
Tested-by: Filipe Manana <fdman...@suse.com>

> ---
>  fs/btrfs/disk-io.c |  9 +
>  fs/btrfs/super.c   | 11 +++
>  2 files changed, 20 insertions(+)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 2ef9a4b..2e47fef 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -3710,6 +3710,15 @@ void close_ctree(struct btrfs_root *root)
>  	cancel_work_sync(&fs_info->async_reclaim_work);
>
>  	if (!(fs_info->sb->s_flags & MS_RDONLY)) {
> +		/*
> +		 * If the cleaner thread is stopped and there are
> +		 * block groups queued for removal, the deletion will be
> +		 * skipped when we quit the cleaner thread.
> +		 */
> +		mutex_lock(&root->fs_info->cleaner_mutex);
> +		btrfs_delete_unused_bgs(root->fs_info);
> +		mutex_unlock(&root->fs_info->cleaner_mutex);
> +
>  		ret = btrfs_commit_super(root);
>  		if (ret)
>  			btrfs_err(fs_info, "commit super ret %d", ret);
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index 9e66f5e..2ccd8d4 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1539,6 +1539,17 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
>
>  		sb->s_flags |= MS_RDONLY;
>
> +		/*
> +		 * Setting MS_RDONLY will put the cleaner thread to
> +		 * sleep at the next loop if it's already active.
> +		 * If it's already asleep, we'll leave unused block
> +		 * groups on disk until we're mounted read-write again
> +		 * unless we clean them up here.
> +		 */
> +		mutex_lock(&root->fs_info->cleaner_mutex);
> +		btrfs_delete_unused_bgs(fs_info);
> +		mutex_unlock(&root->fs_info->cleaner_mutex);
> +
>  		btrfs_dev_replace_suspend_for_unmount(fs_info);
>  		btrfs_scrub_cancel(fs_info);
>  		btrfs_pause_balance(fs_info);
> --
> 2.4.3

--
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
Re: BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)
On Wed, Jun 17, 2015 at 12:16:54AM -0700, Marc MERLIN wrote:
> I had a few power offs due to a faulty power supply, and my mdadm raid5
> got into fail mode after 2 drives got kicked out since their sequence
> numbers didn't match due to the abrupt power offs.
>
> I brought the swraid5 back up by force assembling it with 4 drives (one
> was really only a few sequence numbers behind), and it's doing a full
> parity rebuild on the 5th drive that was farther behind.
>
> So I can understand how I may have had a few blocks that are in a bad
> state.
> I'm getting a few (not many) of those messages in syslog.
> BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)
>
> Filesystem looks like this:
> Label: 'btrfs_pool1'  uuid: 6358304a-2234-4243-b02d-4944c9af47d7
>         Total devices 1 FS bytes used 8.29TiB
>         devid    1 size 14.55TiB used 8.32TiB path /dev/mapper/dshelf1
>
> gargamel:~# btrfs fi df /mnt/btrfs_pool1
> Data, single: total=8.29TiB, used=8.28TiB
> System, DUP: total=8.00MiB, used=920.00KiB
> System, single: total=4.00MiB, used=0.00B
> Metadata, DUP: total=14.00GiB, used=10.58GiB
> Metadata, single: total=8.00MiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B
>
> Kernel 3.19.8.
>
> Just to make sure I understand, do those messages in syslog mean that my
> metadata got corrupted a bit, but because I have 2 copies, btrfs can fix
> the bad copy by using the good one?
>
> Also, if my actual data got corrupted, am I correct that btrfs will
> detect the checksum failure and give me a different error message of a
> read error that cannot be corrected?

   Yes, that's my reading of the situation. Note that the 3.19 kernel
is the earliest I would expect this to be able to happen, as it's the
first kernel that actually had the full set of parity RAID repair code
in it.

> I'll do a scrub later, for now I have to wait 20 hours for the raid
> rebuild first.

   You'll probably find that the rebuild is equivalent to a scrub
anyway.

   Hugo.

--
Hugo Mills             | If you're not part of the solution, you're part of
hugo@... carfax.org.uk | the precipitate.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |
Re: BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)
Hugo Mills wrote (ao):
> On Wed, Jun 17, 2015 at 12:16:54AM -0700, Marc MERLIN wrote:
> > I had a few power offs due to a faulty power supply, and my mdadm raid5
> > got into fail mode after 2 drives got kicked out since their sequence
> > numbers didn't match due to the abrupt power offs.
>
> > gargamel:~# btrfs fi df /mnt/btrfs_pool1
> > Data, single: total=8.29TiB, used=8.28TiB
> > System, DUP: total=8.00MiB, used=920.00KiB
> > System, single: total=4.00MiB, used=0.00B
> > Metadata, DUP: total=14.00GiB, used=10.58GiB
> > Metadata, single: total=8.00MiB, used=0.00B
> > GlobalReserve, single: total=512.00MiB, used=0.00B
>
> > I'll do a scrub later, for now I have to wait 20 hours for the raid
> > rebuild first.
>
>    You'll probably find that the rebuild is equivalent to a scrub
> anyway.

He has mdadm raid, which is rebuilding. This is obviously not equivalent
to a btrfs scrub.

	Sander
Re: BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)
On Wed, Jun 17, 2015 at 12:58:35PM +0200, Sander wrote:
> Hugo Mills wrote (ao):
> > On Wed, Jun 17, 2015 at 12:16:54AM -0700, Marc MERLIN wrote:
> > > I had a few power offs due to a faulty power supply, and my mdadm raid5
> > > got into fail mode after 2 drives got kicked out since their sequence
> > > numbers didn't match due to the abrupt power offs.
> >
> > > gargamel:~# btrfs fi df /mnt/btrfs_pool1
> > > Data, single: total=8.29TiB, used=8.28TiB
> > > System, DUP: total=8.00MiB, used=920.00KiB
> > > System, single: total=4.00MiB, used=0.00B
> > > Metadata, DUP: total=14.00GiB, used=10.58GiB
> > > Metadata, single: total=8.00MiB, used=0.00B
> > > GlobalReserve, single: total=512.00MiB, used=0.00B
> >
> > > I'll do a scrub later, for now I have to wait 20 hours for the raid
> > > rebuild first.
> >
> >    You'll probably find that the rebuild is equivalent to a scrub
> > anyway.
>
> He has mdadm raid, which is rebuilding. This is obviously not equivalent
> to a btrfs scrub.

   Ah, thanks for the correction. Note to self: read more carefully
before replying.

   Hugo.

--
Hugo Mills             | If you're not part of the solution, you're part of
hugo@... carfax.org.uk | the precipitate.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |
[PATCH 2/2] Btrfs: fix warning of bytes_may_use
While running generic/019, dmesg got several warnings from
btrfs_free_reserved_data_space().

Test generic/019 produces some disk failures, so submitting dio will get
errors, in which case btrfs_direct_IO() goes to the error handling path
and frees bytes_may_use. The problem is that bytes_may_use has already
been freed during get_block().

This adds a runtime flag to show whether we've gone through get_block();
if so, don't do the cleanup work.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/btrfs_inode.h |  2 ++
 fs/btrfs/inode.c       | 16 +++++++++++++---
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 0ef5cc1..81220b2 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -44,6 +44,8 @@
 #define BTRFS_INODE_IN_DELALLOC_LIST		9
 #define BTRFS_INODE_READDIO_NEED_LOCK		10
 #define BTRFS_INODE_HAS_PROPS			11
+/* DIO is ready to submit */
+#define BTRFS_INODE_DIO_READY			12
 /*
  * The following 3 bits are meant only for the btree inode.
  * When any of them is set, it means an error happened while writing an
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7bf150a..438b56f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7530,6 +7530,7 @@ unlock:
 		current->journal_info = outstanding_extents;
 		btrfs_free_reserved_data_space(inode, len);
+		set_bit(BTRFS_INODE_DIO_READY, &BTRFS_I(inode)->runtime_flags);
 	}

 	/*
@@ -8311,9 +8312,18 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 				   btrfs_submit_direct, flags);
 	if (iov_iter_rw(iter) == WRITE) {
 		current->journal_info = NULL;
-		if (ret < 0 && ret != -EIOCBQUEUED)
-			btrfs_delalloc_release_space(inode, count);
-		else if (ret >= 0 && (size_t)ret < count)
+		if (ret < 0 && ret != -EIOCBQUEUED) {
+			/*
+			 * If the error comes from submitting stage,
+			 * btrfs_get_blocks_direct() has free'd data space,
+			 * and metadata space will be handled by
+			 * finish_ordered_fn, don't do that again to make
+			 * sure bytes_may_use is correct.
+			 */
+			if (!test_and_clear_bit(BTRFS_INODE_DIO_READY,
+				     &BTRFS_I(inode)->runtime_flags))
+				btrfs_delalloc_release_space(inode, count);
+		} else if (ret >= 0 && (size_t)ret < count)
 			btrfs_delalloc_release_space(inode,
 						     count - (size_t)ret);
 	}
--
2.1.0
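The flag-guarded cleanup described in this patch can be modelled in userspace. The following is only a minimal sketch under stated assumptions: `struct dio_state`, `get_block_release()` and `error_path_release()` are hypothetical names that do not exist in the kernel, a C11 `atomic_exchange()` stands in for `test_and_clear_bit()`, and a single byte counter stands in for the real data and metadata accounting.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-inode runtime flag and space counter. */
struct dio_state {
	atomic_bool dio_ready;	/* set once the get_block step released the space */
	size_t bytes_may_use;	/* outstanding reservation, in bytes */
};

/* get_block step: release the reservation and remember that we did so. */
static void get_block_release(struct dio_state *s, size_t len)
{
	s->bytes_may_use -= len;
	atomic_store(&s->dio_ready, true);
}

/*
 * Error path: release only if the get_block step has not already done it.
 * atomic_exchange() returns the old value and clears the flag in one step,
 * which is the role test_and_clear_bit() plays in the patch.
 */
static void error_path_release(struct dio_state *s, size_t len)
{
	if (!atomic_exchange(&s->dio_ready, false))
		s->bytes_may_use -= len;
}
```

Calling `get_block_release()` followed by `error_path_release()` with the same length subtracts it once, not twice, and leaves the flag cleared for the next request.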
[PATCH 1/2] Btrfs: fix hang when failing to submit bio of directIO
The hang is uncovered by generic/019.

btrfs_endio_direct_write() skips the finish_ordered_fn part when it hits
an error, so the ordered extents that were added never get processed,
which blocks processes that are waiting for them via
btrfs_start_ordered_extent().

This fixes the above, and meanwhile finish_ordered_fn will do the space
accounting work.

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/inode.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8bb0136..7bf150a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7855,8 +7855,6 @@ static void btrfs_endio_direct_write(struct bio *bio, int err)
 	struct bio *dio_bio;
 	int ret;

-	if (err)
-		goto out_done;
 again:
 	ret = btrfs_dec_test_first_ordered_pending(inode, &ordered,
 						   &ordered_offset,
@@ -7879,7 +7877,6 @@ out_test:
 		ordered = NULL;
 		goto again;
 	}
-out_done:
 	dio_bio = dip->dio_bio;

 	kfree(dip);
--
2.1.0
Re: [PATCH 1/7] btrfs: make btrfs_issue_discard return bytes discarded
On Mon, Jun 15, 2015 at 2:41 PM, je...@suse.com wrote:
> From: Jeff Mahoney <je...@suse.com>
>
> Initially this will just be the length argument passed to it,
> but the following patches will adjust that to reflect re-alignment
> and skipped blocks.
>
> Signed-off-by: Jeff Mahoney <je...@suse.com>

Reviewed-by: Filipe Manana <fdman...@suse.com>
Tested-by: Filipe Manana <fdman...@suse.com>

> ---
>  fs/btrfs/extent-tree.c | 19 ++++++++++++++-----
>  1 file changed, 14 insertions(+), 5 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 0ec3acd..da1145d 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1884,10 +1884,17 @@ static int remove_extent_backref(struct btrfs_trans_handle *trans,
>  	return ret;
>  }
>
> -static int btrfs_issue_discard(struct block_device *bdev,
> -				u64 start, u64 len)
> +static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
> +			       u64 *discarded_bytes)
>  {
> -	return blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
> +	int ret = 0;
> +
> +	*discarded_bytes = 0;
> +	ret = blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
> +	if (!ret)
> +		*discarded_bytes = len;
> +
> +	return ret;
>  }
>
>  int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
> @@ -1908,14 +1915,16 @@ int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
>
>
>  		for (i = 0; i < bbio->num_stripes; i++, stripe++) {
> +			u64 bytes;
>  			if (!stripe->dev->can_discard)
>  				continue;
>
>  			ret = btrfs_issue_discard(stripe->dev->bdev,
>  						  stripe->physical,
> -						  stripe->length);
> +						  stripe->length,
> +						  &bytes);
>  			if (!ret)
> -				discarded_bytes += stripe->length;
> +				discarded_bytes += bytes;
>  			else if (ret != -EOPNOTSUPP)
>  				break; /* Logic errors or -ENOMEM, or -EIO but I don't know how that could happen JDM */
> --
> 2.4.3

--
Filipe David Manana,

Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men.
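The point of the out-parameter in this patch is that the caller sums what each device reports as discarded instead of assuming the full stripe length. Below is a small userspace sketch of that accounting, not kernel code: `struct stripe`, its `fake_result` field, `issue_discard()` and `discard_stripes()` are hypothetical, and treating -EOPNOTSUPP as "skip this device and carry on" is an assumption about the intent of the loop shown in the diff.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stripe descriptor; fake_result models the device's answer. */
struct stripe {
	uint64_t length;
	int can_discard;
	int fake_result;	/* 0 on success, negative errno otherwise */
};

/* Reports the bytes actually discarded through the out-parameter. */
static int issue_discard(const struct stripe *s, uint64_t *discarded_bytes)
{
	*discarded_bytes = 0;
	if (s->fake_result == 0)
		*discarded_bytes = s->length;
	return s->fake_result;
}

/* Sums the reported bytes; stops at the first error other than -EOPNOTSUPP. */
static int discard_stripes(const struct stripe *stripes, size_t n,
			   uint64_t *total)
{
	int ret = 0;

	*total = 0;
	for (size_t i = 0; i < n; i++) {
		uint64_t bytes;

		if (!stripes[i].can_discard)
			continue;

		ret = issue_discard(&stripes[i], &bytes);
		if (ret == -EOPNOTSUPP) {
			ret = 0;	/* assumption: unsupported is not fatal */
			continue;
		}
		if (ret)
			break;
		*total += bytes;
	}
	return ret;
}
```

Once a later patch makes the callee trim or skip part of a range, only `issue_discard()` has to change; the caller's total stays accurate.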
Re: [PATCH 2/7] btrfs: btrfs_issue_discard ensure offset/length are aligned to sector boundaries
On Mon, Jun 15, 2015 at 2:41 PM, je...@suse.com wrote:
> From: Jeff Mahoney <je...@suse.com>
>
> It's possible, though unexpected, to pass unaligned offsets and lengths
> to btrfs_issue_discard. We then shift the offset/length values to sector
> units. If an unaligned offset has been passed, it will result in the
> entire sector being discarded, possibly losing data. An unaligned
> length is safe but we'll end up returning an inaccurate number of
> discarded bytes.
>
> This patch aligns the offset to the 512B boundary, adjusts the length,
> and warns, since we shouldn't be discarding on an offset that isn't
> aligned with our sector size.
>
> Signed-off-by: Jeff Mahoney <je...@suse.com>

Reviewed-by: Filipe Manana <fdman...@suse.com>
Tested-by: Filipe Manana <fdman...@suse.com>

> ---
>  fs/btrfs/extent-tree.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index da1145d..cf9cefd 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -1888,12 +1888,21 @@ static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
>  			       u64 *discarded_bytes)
>  {
>  	int ret = 0;
> +	u64 aligned_start = ALIGN(start, 1 << 9);
>
> -	*discarded_bytes = 0;
> -	ret = blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
> -	if (!ret)
> -		*discarded_bytes = len;
> +	if (WARN_ON(start != aligned_start)) {
> +		len -= aligned_start - start;
> +		len = round_down(len, 1 << 9);
> +		start = aligned_start;
> +	}
>
> +	*discarded_bytes = 0;
> +	if (len) {
> +		ret = blkdev_issue_discard(bdev, start >> 9, len >> 9,
> +					   GFP_NOFS, 0);
> +		if (!ret)
> +			*discarded_bytes = len;
> +	}
>  	return ret;
>  }
> --
> 2.4.3

--
Filipe David Manana,

Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men.
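The alignment arithmetic in this patch is easy to check by hand. The sketch below is a userspace illustration only: `align_up()`, `round_down_to()` and `clamp_to_sectors()` are hypothetical helpers standing in for the kernel's `ALIGN()` and `round_down()` macros, and the guard against a length shorter than the skipped prefix is an addition of this sketch, not something the patch contains.

```c
#include <stdint.h>

#define SECTOR_SIZE 512ULL	/* 1 << 9, as in the patch */

/* Round x up or down to a multiple of a; a must be a power of two. */
static uint64_t align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

static uint64_t round_down_to(uint64_t x, uint64_t a)
{
	return x & ~(a - 1);
}

/*
 * If *start is not sector aligned, move it up to the next boundary and
 * shrink *len so the range never covers bytes outside the original one.
 * An aligned start is left untouched, as in the patch.
 */
static void clamp_to_sectors(uint64_t *start, uint64_t *len)
{
	uint64_t aligned_start = align_up(*start, SECTOR_SIZE);

	if (*start != aligned_start) {
		uint64_t skipped = aligned_start - *start;

		/* Assumption of this sketch: avoid unsigned underflow. */
		*len = (*len > skipped) ? *len - skipped : 0;
		*len = round_down_to(*len, SECTOR_SIZE);
		*start = aligned_start;
	}
}
```

As a worked case, start=1000 and len=4096 give aligned_start=1024, so 24 bytes are skipped; the remaining 4072 bytes round down to 3584 (seven sectors), and the discard covers 1024..4607, entirely inside the requested range.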
Re: [PATCH 6/7] btrfs: add missing discards when unpinning extents with -o discard
On Mon, Jun 15, 2015 at 2:41 PM, je...@suse.com wrote: From: Jeff Mahoney je...@suse.com When we clear the dirty bits in btrfs_delete_unused_bgs for extents in the empty block group, it results in btrfs_finish_extent_commit being unable to discard the freed extents. The block group removal patch added an alternate path to forget extents other than btrfs_finish_extent_commit. As a result, any extents that would be freed when the block group is removed aren't discarded. In my test run, with a large copy of mixed sized files followed by removal, it left nearly 2/3 of extents undiscarded. To clean up the block groups, we add the removed block group onto a list that will be discarded after transaction commit. Signed-off-by: Jeff Mahoney je...@suse.com Reviewed-by: Filipe Manana fdman...@suse.com Tested-by: Filipe Manana fdman...@suse.com --- fs/btrfs/ctree.h| 3 ++ fs/btrfs/extent-tree.c | 68 ++--- fs/btrfs/free-space-cache.c | 57 + fs/btrfs/super.c| 2 +- fs/btrfs/transaction.c | 2 ++ fs/btrfs/transaction.h | 2 ++ 6 files changed, 105 insertions(+), 29 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 6f364e1..780acf1 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -3438,6 +3438,8 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, struct btrfs_root *root, u64 group_start, struct extent_map *em); void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info); +void btrfs_get_block_group_trimming(struct btrfs_block_group_cache *cache); +void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *cache); void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans, struct btrfs_root *root); u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data); @@ -4068,6 +4070,7 @@ __printf(5, 6) void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function, unsigned int line, int errno, const char *fmt, ...); +const char *btrfs_decode_error(int errno); void __btrfs_abort_transaction(struct btrfs_trans_handle 
*trans, struct btrfs_root *root, const char *function, diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 24b48df..3598440 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -6103,20 +6103,19 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans, struct btrfs_root *root) { struct btrfs_fs_info *fs_info = root-fs_info; + struct btrfs_block_group_cache *block_group, *tmp; + struct list_head *deleted_bgs; struct extent_io_tree *unpin; u64 start; u64 end; int ret; - if (trans-aborted) - return 0; - if (fs_info-pinned_extents == fs_info-freed_extents[0]) unpin = fs_info-freed_extents[1]; else unpin = fs_info-freed_extents[0]; - while (1) { + while (!trans-aborted) { mutex_lock(fs_info-unused_bg_unpin_mutex); ret = find_first_extent_bit(unpin, 0, start, end, EXTENT_DIRTY, NULL); @@ -6135,6 +6134,34 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans, cond_resched(); } + /* +* Transaction is finished. We don't need the lock anymore. We +* do need to clean up the block groups in case of a transaction +* abort. +*/ + deleted_bgs = trans-transaction-deleted_bgs; + list_for_each_entry_safe(block_group, tmp, deleted_bgs, bg_list) { + u64 trimmed = 0; + + ret = -EROFS; + if (!trans-aborted) + ret = btrfs_discard_extent(root, + block_group-key.objectid, + block_group-key.offset, + trimmed); + + list_del_init(block_group-bg_list); + btrfs_put_block_group_trimming(block_group); + btrfs_put_block_group(block_group); + + if (ret) { + const char *errstr = btrfs_decode_error(ret); + btrfs_warn(fs_info, + Discard failed while removing blockgroup: errno=%d %s\n, + ret, errstr); + } + } + return 0; } @@ -9914,6 +9941,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle *trans, * currently running transaction might finish and a new one start, * allowing for new block groups to
[PATCH v5.3 24/27] Btrfs: free the stale device
From: Anand Jain <anand.j...@oracle.com>

When btrfs on a device is overwritten with a new btrfs (mkfs),
the old btrfs instance in the kernel becomes stale. With this
patch, if the kernel finds that a device has been overwritten,
it deletes the stale fsid/uuid.

Signed-off-by: Anand Jain <anand.j...@oracle.com>
---
v5.2->v5.3: removed unnecessary return in a void function
v5.1->v5.2: accepts David's review comments with thanks,
            especially the important rcu part
v5->v5.1: since this deals with only devices in unmounted state,
          don't try to remove device link

 fs/btrfs/volumes.c | 61 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 61 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e50005f..39d4d48 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -445,6 +445,61 @@ static void pending_bios_fn(struct btrfs_work *work)
 	run_scheduled_bios(device);
 }

+
+void btrfs_free_stale_device(struct btrfs_device *cur_dev)
+{
+	struct btrfs_fs_devices *fs_devs;
+	struct btrfs_device *dev;
+
+	if (!cur_dev->name)
+		return;
+
+	list_for_each_entry(fs_devs, &fs_uuids, list) {
+		int del = 1;
+
+		if (fs_devs->opened)
+			continue;
+		if (fs_devs->seeding)
+			continue;
+
+		list_for_each_entry(dev, &fs_devs->devices, dev_list) {
+
+			if (dev == cur_dev)
+				continue;
+			if (!dev->name)
+				continue;
+
+			/*
+			 * Todo: This won't be enough. What if the same device
+			 * comes back (with new uuid and) with its mapper path?
+			 * But for now, this does help as mostly an admin will
+			 * either use mapper or non mapper path throughout.
+			 */
+			rcu_read_lock();
+			del = strcmp(rcu_str_deref(dev->name),
+						rcu_str_deref(cur_dev->name));
+			rcu_read_unlock();
+			if (!del)
+				break;
+		}
+
+		if (!del) {
+			/* delete the stale device */
+			if (fs_devs->num_devices == 1) {
+				btrfs_sysfs_remove_fsid(fs_devs);
+				list_del(&fs_devs->list);
+				free_fs_devices(fs_devs);
+			} else {
+				fs_devs->num_devices--;
+				list_del(&dev->dev_list);
+				rcu_string_free(dev->name);
+				kfree(dev);
+			}
+			break;
+		}
+	}
+}
+
 /*
  * Add new device to list of registered devices
  *
@@ -560,6 +615,12 @@ static noinline int device_list_add(const char *path,
 	if (!fs_devices->opened)
 		device->generation = found_transid;

+	/*
+	 * if there is new btrfs on an already registered device,
+	 * then remove the stale device entry.
+	 */
+	btrfs_free_stale_device(device);
+
 	*fs_devices_ret = fs_devices;

 	return ret;
--
2.4.1
Re: RAID10 Balancing Request for Comments and Advices
On Jun 17, 2015, at 9:27 AM, Hugo Mills <h...@carfax.org.uk> wrote:
> On Wed, Jun 17, 2015 at 09:13:08AM -0400, Vincent Olivier wrote:
> > On Jun 16, 2015, at 8:14 PM, Chris Murphy <li...@colorremedies.com> wrote:
> > > On Tue, Jun 16, 2015 at 5:58 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> > > > On a current kernel unlike older ones, btrfs actually automates
> > > > entirely empty chunk reclaim, so this problem doesn't occur anything
> > > > close to near as often as it used to. However, it's still possible to
> > > > have mostly but not entirely empty chunks that btrfs won't
> > > > automatically reclaim. A balance can be used to rewrite and combine
> > > > these mostly empty chunks, reclaiming the space saved. This is what
> > > > Hugo was recommending.
> > >
> > > Yes, as little as a -dusage=5 (data chunks that are 5% or less full)
> > > can clear the problem and is very fast, seconds. Possibly a bit
> > > longer, many seconds to single-digit minutes, is -dusage=15. I haven't
> > > done a full balance in forever.
> >
> > Yes, on this 80% full 6x4TB RAID10, -dusage=15 took 2 seconds and
> > relocated "0 out of 3026 chunks".
> >
> > Out of curiosity, I had to use -dusage=90 to have it relocate only 1
> > chunk and it took less than 30 seconds.
> >
> > So I put a -dusage=25 in the weekly cron just before the scrub.
>
>    In most cases, all you need to do is clean up one data chunk to
> give the metadata enough space to work in. Instead of manually
> iterating through several values of usage= until you get a useful
> response, you can use limit=n to stop after n successful block group
> relocations.

Nice! Will do that instead! Thanks.
Re: RAID10 Balancing Request for Comments and Advices
On Jun 16, 2015, at 8:14 PM, Chris Murphy <li...@colorremedies.com> wrote:
> On Tue, Jun 16, 2015 at 5:58 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> > On a current kernel unlike older ones, btrfs actually automates
> > entirely empty chunk reclaim, so this problem doesn't occur anything
> > close to near as often as it used to. However, it's still possible to
> > have mostly but not entirely empty chunks that btrfs won't
> > automatically reclaim. A balance can be used to rewrite and combine
> > these mostly empty chunks, reclaiming the space saved. This is what
> > Hugo was recommending.
>
> Yes, as little as a -dusage=5 (data chunks that are 5% or less full)
> can clear the problem and is very fast, seconds. Possibly a bit
> longer, many seconds to single-digit minutes, is -dusage=15. I haven't
> done a full balance in forever.

Yes, on this 80% full 6x4TB RAID10, -dusage=15 took 2 seconds and
relocated "0 out of 3026 chunks".

Out of curiosity, I had to use -dusage=90 to have it relocate only 1
chunk and it took less than 30 seconds.

So I put a -dusage=25 in the weekly cron just before the scrub.

FYI.

Thanks for your help.
Re: [PATCH 5/7] btrfs: explictly delete unused block groups in close_ctree and ro-remount
On Wed, Jun 17, 2015 at 11:04 AM, Filipe David Manana <fdman...@gmail.com> wrote:
> On Mon, Jun 15, 2015 at 2:41 PM, je...@suse.com wrote:
> > From: Jeff Mahoney <je...@suse.com>
> >
> > The cleaner thread may already be sleeping by the time we enter
> > close_ctree. If that's the case, we'll skip removing any unused
> > block groups queued for removal, even during a normal umount.
> > They'll be cleaned up automatically at next mount, but users
> > expect a umount to be a clean synchronization point, especially
> > when used on thin-provisioned storage with -odiscard. We also
> > explicitly remove unused block groups in the ro-remount path
> > for the same reason.
> >
> > Signed-off-by: Jeff Mahoney <je...@suse.com>
>
> Reviewed-by: Filipe Manana <fdman...@suse.com>
> Tested-by: Filipe Manana <fdman...@suse.com>
>
> > ---
> >  fs/btrfs/disk-io.c |  9 +++++++++
> >  fs/btrfs/super.c   | 11 +++++++++++
> >  2 files changed, 20 insertions(+)
> >
> > diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> > index 2ef9a4b..2e47fef 100644
> > --- a/fs/btrfs/disk-io.c
> > +++ b/fs/btrfs/disk-io.c
> > @@ -3710,6 +3710,15 @@ void close_ctree(struct btrfs_root *root)
> >  	cancel_work_sync(&fs_info->async_reclaim_work);
> >
> >  	if (!(fs_info->sb->s_flags & MS_RDONLY)) {
> > +		/*
> > +		 * If the cleaner thread is stopped and there are
> > +		 * block groups queued for removal, the deletion will be
> > +		 * skipped when we quit the cleaner thread.
> > +		 */
> > +		mutex_lock(&root->fs_info->cleaner_mutex);
> > +		btrfs_delete_unused_bgs(root->fs_info);
> > +		mutex_unlock(&root->fs_info->cleaner_mutex);
> > +
> >  		ret = btrfs_commit_super(root);
> >  		if (ret)
> >  			btrfs_err(fs_info, "commit super ret %d", ret);
> > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> > index 9e66f5e..2ccd8d4 100644
> > --- a/fs/btrfs/super.c
> > +++ b/fs/btrfs/super.c
> > @@ -1539,6 +1539,17 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
> >
> >  		sb->s_flags |= MS_RDONLY;
> >
> > +		/*
> > +		 * Setting MS_RDONLY will put the cleaner thread to
> > +		 * sleep at the next loop if it's already active.
> > +		 * If it's already asleep, we'll leave unused block
> > +		 * groups on disk until we're mounted read-write again
> > +		 * unless we clean them up here.
> > +		 */
> > +		mutex_lock(&root->fs_info->cleaner_mutex);
> > +		btrfs_delete_unused_bgs(fs_info);
> > +		mutex_unlock(&root->fs_info->cleaner_mutex);

So actually, this allows for a deadlock after the patch I sent out last week:

https://patchwork.kernel.org/patch/6586811/

In that patch delete_unused_bgs is no longer called under the
cleaner_mutex, and calling it under that mutex here will cause a
deadlock with relocation.

Even without that patch, I don't think you need to use this mutex
anyway: no 2 tasks running this function can get the same bg from the
fs_info->unused_bgs list.

thanks

> > +
> >  		btrfs_dev_replace_suspend_for_unmount(fs_info);
> >  		btrfs_scrub_cancel(fs_info);
> >  		btrfs_pause_balance(fs_info);
> > --
> > 2.4.3
>
> --
> Filipe David Manana,
>
> Reasonable men adapt themselves to the world.
> Unreasonable men adapt the world to themselves.
> That's why all progress depends on unreasonable men.

--
Filipe David Manana,

Reasonable men adapt themselves to the world.
Unreasonable men adapt the world to themselves.
That's why all progress depends on unreasonable men.
Re: RAID10 Balancing Request for Comments and Advices
On Wed, Jun 17, 2015 at 09:13:08AM -0400, Vincent Olivier wrote:
> On Jun 16, 2015, at 8:14 PM, Chris Murphy <li...@colorremedies.com> wrote:
> > On Tue, Jun 16, 2015 at 5:58 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> > > On a current kernel unlike older ones, btrfs actually automates
> > > entirely empty chunk reclaim, so this problem doesn't occur anything
> > > close to near as often as it used to. However, it's still possible to
> > > have mostly but not entirely empty chunks that btrfs won't
> > > automatically reclaim. A balance can be used to rewrite and combine
> > > these mostly empty chunks, reclaiming the space saved. This is what
> > > Hugo was recommending.
> >
> > Yes, as little as a -dusage=5 (data chunks that are 5% or less full)
> > can clear the problem and is very fast, seconds. Possibly a bit
> > longer, many seconds to single-digit minutes, is -dusage=15. I haven't
> > done a full balance in forever.
>
> Yes, on this 80% full 6x4TB RAID10, -dusage=15 took 2 seconds and
> relocated "0 out of 3026 chunks".
>
> Out of curiosity, I had to use -dusage=90 to have it relocate only 1
> chunk and it took less than 30 seconds.
>
> So I put a -dusage=25 in the weekly cron just before the scrub.

   In most cases, all you need to do is clean up one data chunk to
give the metadata enough space to work in. Instead of manually
iterating through several values of usage= until you get a useful
response, you can use limit=n to stop after n successful block group
relocations.

   Hugo.

--
Hugo Mills             | Alert status mauve ocelot: Slight chance of
hugo@... carfax.org.uk | brimstone. Be prepared to make a nice cup of tea.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |
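To make the interaction between the usage= and limit= filters concrete, here is a toy model of the selection step. It is a sketch under loud assumptions: `struct chunk` and `select_chunks()` are hypothetical, the "at most usage percent full" comparison follows the description given in this thread rather than the kernel's exact threshold and rounding, and nothing here relocates any data.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical chunk descriptor: allocated size and bytes in use. */
struct chunk {
	uint64_t total;
	uint64_t used;
};

/*
 * Pick chunks that are at most usage_pct percent full, stopping once
 * `limit` chunks have been picked (limit == 0 means no limit).
 * Indices of the picked chunks are written to `picked`, which must have
 * room for n entries. Returns the number of chunks picked.
 */
static size_t select_chunks(const struct chunk *chunks, size_t n,
			    unsigned usage_pct, size_t limit, size_t *picked)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (limit && count == limit)
			break;
		/* Integer form of used / total <= usage_pct / 100. */
		if (chunks[i].used * 100 <= chunks[i].total * usage_pct)
			picked[count++] = i;
	}
	return count;
}
```

With a limit of 1, the threshold can be set generously and the selection still stops after the first matching chunk, which is why a limit saves stepping through usage=5, 15, 25 and so on by hand.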
Re: RAID10 Balancing Request for Comments and Advices
On Jun 16, 2015, at 7:58 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Vincent Olivier posted on Tue, 16 Jun 2015 09:34:29 -0400 as excerpted:
> > On Jun 16, 2015, at 8:25 AM, Hugo Mills <h...@carfax.org.uk> wrote:
> > > On Tue, Jun 16, 2015 at 08:09:17AM -0400, Vincent Olivier wrote:
> > > > My first question is this: is it normal to have "single" blocks?
> > > > Why not only RAID10? I don't remember the exact mkfs options I used
> > > > but I certainly didn't ask for "single" so this is unexpected.
> > >
> > > Yes. It's an artefact of the way that mkfs works. If you run a
> > > balance on those chunks, they'll go away. (btrfs balance start
> > > -dusage=0 -musage=0 /mountpoint)
> >
> > Thanks! I did and it did go away, except for the "GlobalReserve,
> > single: total=512.00MiB, used=0.00B". But I suppose this is a permanent
> > fixture, right?
>
> Yes. GlobalReserve is for short-term btrfs-internal use, reserved for
> times when btrfs needs to (temporarily) allocate some space in order to
> free space, etc. It's always single, and you'll rarely see anything but
> 0 used except perhaps in the middle of a balance or something.

Got it. Thanks.

Is there any way to put that on another device, say, an SSD? I am
thinking of backing up this RAID10 on a 2x8TB device-managed SMR RAID1
and I want to minimize random write operations (noatime et al.). I will
maybe start a new thread for that, but first, is there something
substantial I can read about btrfs+SMR? Or should I avoid SMR+btrfs?

> > > For maintenance, I would suggest running a scrub regularly, to
> > > check for various forms of bitrot. Typical frequencies for a scrub are
> > > once a week or once a month -- opinions vary (as do runtimes).
> >
> > Yes. I cronned it weekly for now. Takes about 5 hours. Is it
> > automatically corrected on RAID10 since a copy of it exists within the
> > filesystem? What happens for RAID0?
>
> For raid10 (and the raid1 I use), yes, it's corrected, from the other
> existing copy, assuming it's good, tho if there are metadata checksum
> errors, there may be corresponding unverified checksums as well, where
> the verification couldn't be done because the metadata containing the
> checksums was bad. Thus, if there are errors found and corrected, and
> you see unverified errors as well, rerun the scrub, so the newly
> corrected metadata can now be used to verify the previously unverified
> errors.

OK then, rule of thumb: re-run the scrub on "unverified checksum
error(s)". I have yet to see checksum errors but will keep it in mind.

> I'm presently getting a lot of experience with this as one of the ssds
> in my raid1 is gradually failing and rewriting sectors. Generally what
> happens is that the ssd will take too long, triggering a SATA reset (30
> second timeout), and btrfs will call that an error. The scrub then
> rewrites the bad copy on the unreliable device with the good copy from
> the more reliable device, with the write triggering a sector relocation
> on the bad device. The newly written copy then checks out good, but if
> it was metadata, it very likely contained checksums for several other
> blocks, which couldn't be verified because the block containing their
> checksums was itself bad. Typically I'll see dozens to a couple hundred
> unverified errors for every bad metadata block rewritten in this way.
> Rerunning the scrub then either verifies or fixes the previously
> unverified blocks, tho sometimes one of those in turn ends up bad and if
> it's a metadata block, I may end up rerunning the scrub another time or
> two, until everything checks out.
>
FWIW, on the bad device, smartctl -A reports (excerpted): ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE 5 Reallocated_Sector_Ct 0x0032 098 098 036Old_age Always - 259 182 Erase_Fail_Count_Total 0x0032 100 100 000Old_age Always - 132 While on the paired good device: 5 Reallocated_Sector_Ct 0x0032 253 253 036Old_age Always - 0 182 Erase_Fail_Count_Total 0x0032 253 253 000Old_age Always - 0 Meanwhile, smartctl -H has already warned once that the device is failing, tho it went back to passing status again, but as of now it's saying failing, again. The attribute that actually registers as failing, again from the bad device, followed by the good, is: 1 Raw_Read_Error_Rate 0x000f 001 001 006Pre-fail Always FAILING_NOW 3081 1 Raw_Read_Error_Rate 0x000f 160 159 006Pre-fail Always - 41 When it's not actually reporting failing, the FAILING_NOW status is replaced with IN_THE_PAST. 250 Read_Error_Retry_Rate is the other attribute of interest, with values of 100 current and worst for both devices, threshold 0, but a raw value of 2488 for the good device and over 17,000,000 for the failing device. But with the cooked value never moving from 100 and with no real guidance on how to interpret the raw values, while
Re: BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)
Marc MERLIN posted on Wed, 17 Jun 2015 00:16:54 -0700 as excerpted:

I had a few power offs due to a faulty power supply, and my mdadm raid5 got into fail mode after 2 drives got kicked out since their sequence numbers didn't match due to the abrupt power offs.

I brought the swraid5 back up by force assembling it with 4 drives (one was really only a few sequence numbers behind), and it's doing a full parity rebuild on the 5th drive that was farther behind.

So I can understand how I may have had a few blocks that are in a bad state. I'm getting a few (not many) of those messages in syslog.

BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)

Filesystem looks like this:

Label: 'btrfs_pool1'  uuid: 6358304a-2234-4243-b02d-4944c9af47d7
	Total devices 1 FS bytes used 8.29TiB
	devid 1 size 14.55TiB used 8.32TiB path /dev/mapper/dshelf1

gargamel:~# btrfs fi df /mnt/btrfs_pool1
Data, single: total=8.29TiB, used=8.28TiB
System, DUP: total=8.00MiB, used=920.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=14.00GiB, used=10.58GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B

Kernel 3.19.8.

Just to make sure I understand, do those messages in syslog mean that my metadata got corrupted a bit, but because I have 2 copies, btrfs can fix the bad copy by using the good one?

Yes. Despite the confusion between btrfs raid5 and mdraid5, Hugo was correct there. It's just the 3.19 kernel bit that he got wrong, since he was thinking btrfs raid. Btrfs dup mode should be good going back many kernels.

Also, if my actual data got corrupted, am I correct that btrfs will detect the checksum failure and give me a different error message of a read error that cannot be corrected? I'll do a scrub later, for now I have to wait 20 hours for the raid rebuild first.

Yes again.
As I mentioned in a different thread a few hours ago, I have an SSD that is slowly going bad, relocating sectors, etc (200-some relocated at this point, by raw value, that attribute dropped to 100 cooked value on the first relocation and is now at 98, with a threshold of 36, so I figure it should be good for a few thousand relocations if I let it go that far). But it's in a btrfs raid1 with a reliable (no relocations yet) paired-ssd and I've been able to scrub-fix the errors so far, plus I have things backed up and a replacement ready to insert when I decide it's time, so I'm able to watch in more or less morbid fascination as the thing slowly dies, a sector at a time. The interesting thing is that with btrfs' checksumming and data integrity feature, I can continue to use the drive in raid1 even tho it's definitely bad enough to be all but unusable with ordinary filesystems. Anyway, as a result of that, I'm getting lots of experience with scrubs and corrected errors. One thing I'd strongly recommend. Once the rebuild is complete and you do the scrub, there may well be both read/corrected errors, and unverified errors. AFAIK, the unverified errors are a result of bad metadata blocks, so missing checksums for what they covered. So once you finish the first scrub and have corrected most of the metadata block errors, do another scrub. The idea is to repeat until you have no more unverified errors, they're either all corrected (if dup metadata) or all uncorrectable (the single data). That's what I'm doing here, with both data and metadata as raid1 and thus correctable, tho in some instances the device is triggering a new relocation on the second and occasionally (once?) third scrub, so that's causing me to have to do more scrubs than I would if the problem were entirely in the past, as it sounds like yours is, or will-be once the mdraid rebuild is done, anyway. -- Duncan - List replies preferred. No HTML msgs. 
Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
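The "few thousand relocations" estimate in the message above can be checked with a short worked calculation. This is a sketch under a loud assumption: that the cooked value keeps falling linearly with the raw reallocation count, which SMART does not guarantee. The function name `remaining_reallocations` is hypothetical; the inputs are the figures quoted in the thread (raw count 259, cooked value fallen from 100 to 98, threshold 36).

```c
/*
 * Linear estimate of how many more reallocations fit before the cooked
 * value reaches the threshold. ASSUMPTION: the cooked value drops in
 * proportion to the raw count; real firmware may not behave this way.
 * Returns -1 when no estimate is possible (the cooked value has not
 * moved yet, or it is already at or below the threshold).
 */
static long remaining_reallocations(long raw, int start, int current, int thresh)
{
	int dropped = start - current;    /* cooked points used so far */
	int headroom = current - thresh;  /* cooked points left */

	if (dropped <= 0 || headroom <= 0)
		return -1;
	return raw * headroom / dropped;
}
```

With the thread's numbers this gives 259 * (98 - 36) / (100 - 98) = 8029, which agrees with the rough "few thousand" figure.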
Re: [PATCH 5/7] btrfs: explictly delete unused block groups in close_ctree and ro-remount
On 6/17/15 10:32 AM, Jeff Mahoney wrote:
On 6/17/15 9:24 AM, Filipe David Manana wrote:
On Wed, Jun 17, 2015 at 11:04 AM, Filipe David Manana <fdman...@gmail.com> wrote:
On Mon, Jun 15, 2015 at 2:41 PM, <je...@suse.com> wrote:
From: Jeff Mahoney <je...@suse.com>

The cleaner thread may already be sleeping by the time we enter close_ctree. If that's the case, we'll skip removing any unused block groups queued for removal, even during a normal umount. They'll be cleaned up automatically at next mount, but users expect a umount to be a clean synchronization point, especially when used on thin-provisioned storage with -odiscard.

We also explicitly remove unused block groups in the ro-remount path for the same reason.

Signed-off-by: Jeff Mahoney <je...@suse.com>
Reviewed-by: Filipe Manana <fdman...@suse.com>
Tested-by: Filipe Manana <fdman...@suse.com>
---
 fs/btrfs/disk-io.c |  9 +
 fs/btrfs/super.c   | 11 +++
 2 files changed, 20 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2ef9a4b..2e47fef 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3710,6 +3710,15 @@ void close_ctree(struct btrfs_root *root)
 	cancel_work_sync(&fs_info->async_reclaim_work);

 	if (!(fs_info->sb->s_flags & MS_RDONLY)) {
+		/*
+		 * If the cleaner thread is stopped and there are
+		 * block groups queued for removal, the deletion will be
+		 * skipped when we quit the cleaner thread.
+		 */
+		mutex_lock(&root->fs_info->cleaner_mutex);
+		btrfs_delete_unused_bgs(root->fs_info);
+		mutex_unlock(&root->fs_info->cleaner_mutex);
+
 		ret = btrfs_commit_super(root);
 		if (ret)
 			btrfs_err(fs_info, "commit super ret %d", ret);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 9e66f5e..2ccd8d4 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1539,6 +1539,17 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 		sb->s_flags |= MS_RDONLY;

+		/*
+		 * Setting MS_RDONLY will put the cleaner thread to
+		 * sleep at the next loop if it's already active.
+		 * If it's already asleep, we'll leave unused block
+		 * groups on disk until we're mounted read-write again
+		 * unless we clean them up here.
+		 */
+		mutex_lock(&root->fs_info->cleaner_mutex);
+		btrfs_delete_unused_bgs(fs_info);
+		mutex_unlock(&root->fs_info->cleaner_mutex);

So actually, this allows for a deadlock after the patch I sent out last week:

https://patchwork.kernel.org/patch/6586811/

In that patch delete_unused_bgs is no longer called under the cleaner_mutex, and making it so, will cause a deadlock with/ru relocation.

Even without that patch, I don't think you need using this mutex anyway - no 2 tasks running this function can get the same bg from the fs_info->unused_bgs list.

I was hitting crashes during umount when xfstests would do remount-ro and umount in quick succession. I can go back and confirm this, but I believe I was encountering a race between the cleaner thread and umount after being set read-only. It didn't trigger all the time. My hypothesis is that if the cleaner thread was running and had a lot of work to do, it could start before set MS_RDONLY and still be performing work through the remount and into the umount. Ro-remount would have set MS_RDONLY so we skip the btrfs_super_commit in close_ctree and then blow up afterwards.

Taking the cleaner mutex means we either wait until the cleaner thread has finished or we put it to sleep on the next loop before it does anything. In either case, it's safe. It could just has easily been:

	mutex_lock(&root->fs_info->cleaner_mutex);
	mutex_unlock(&root->fs_info->cleaner_mutex);
	btrfs_delete_unused_bgs(fs_info);

I think it actually was in a previous version I was testing. It probably should go back to that version so that we don't end up confusing it with the new mutex you introduced in your patch.

It looks like your:

[PATCH] Btrfs: fix crash on close_ctree() if cleaner starts new transaction

would also fix this in a more general case. We can drop taking the cleaner mutex here.

-Jeff

--
Jeff Mahoney
SUSE Labs
Re: [PATCH 0/5] Rework/enhance btrfs-map-logical
On Wed, Jun 17, 2015 at 03:48:59PM +0800, Qu Wenruo wrote:
Although btrfs-map-logical is mainly a tool for developers, but even as a developer, I feel quite frustrated about the bug I found:

Thanks, I'll queue it for 4.1.
Re: Btrfs stable updates for 4.0
On Thu, Jun 11, 2015 at 06:06:32PM +0200, David Sterba wrote:
Hi,

please queue the following patches to 4.0 stable. There are fixes for user visible bugs and one usability regression with RAID1 -> single conversion during balance.

One of the patches does not apply cleanly to 4.0.5, there's a minor conflict in patch context (153c35b60c72de9fae06c8e2c8b2c47d79d4, the last one). A reply to this mail is the adapted version or you can pull/cherry-pick from

git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-stable-4.0

Subjects:

Btrfs: send, add missing check for dead clone root
Btrfs: send, don't leave without decrementing clone root's send_progress
btrfs: incorrect handling for fiemap_fill_next_extent return
btrfs: cleanup orphans while looking up default subvolume
Btrfs: fix range cloning when same inode used as source and destination
Btrfs: fix uninit variable in clone ioctl
Btrfs: fix regression in raid level conversion

Commits:

5cc2b17e80cf5770f2e585c2d90fd8af1b901258 # 3.14+
2f1f465ae6da244099af55c066e5355abd8ff620 # 3.14+
26e726afe01c1c82072cf23a5ed89ce25f39d9f2 # 3.10+
727b9784b6085c99c2f836bf4fcc2848dc9cf904 # 3.14+

Thanks, I'm applying the above commits to 3.16 kernel as well.

Cheers,
--
Luís

df858e76723ace61342b118aa4302bd09de4e386 # 4.0+
de249e66a73d69281cd812087979c6fae552 # 4.0+
153c35b60c72de9fae06c8e2c8b2c47d79d4 # 4.0+

Thanks.

--
To unsubscribe from this list: send the line unsubscribe stable in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/7] btrfs: explictly delete unused block groups in close_ctree and ro-remount
On 6/17/15 9:24 AM, Filipe David Manana wrote:
On Wed, Jun 17, 2015 at 11:04 AM, Filipe David Manana <fdman...@gmail.com> wrote:
On Mon, Jun 15, 2015 at 2:41 PM, <je...@suse.com> wrote:
From: Jeff Mahoney <je...@suse.com>

The cleaner thread may already be sleeping by the time we enter close_ctree. If that's the case, we'll skip removing any unused block groups queued for removal, even during a normal umount. They'll be cleaned up automatically at next mount, but users expect a umount to be a clean synchronization point, especially when used on thin-provisioned storage with -odiscard.

We also explicitly remove unused block groups in the ro-remount path for the same reason.

Signed-off-by: Jeff Mahoney <je...@suse.com>
Reviewed-by: Filipe Manana <fdman...@suse.com>
Tested-by: Filipe Manana <fdman...@suse.com>
---
 fs/btrfs/disk-io.c |  9 +
 fs/btrfs/super.c   | 11 +++
 2 files changed, 20 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2ef9a4b..2e47fef 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3710,6 +3710,15 @@ void close_ctree(struct btrfs_root *root)
 	cancel_work_sync(&fs_info->async_reclaim_work);

 	if (!(fs_info->sb->s_flags & MS_RDONLY)) {
+		/*
+		 * If the cleaner thread is stopped and there are
+		 * block groups queued for removal, the deletion will be
+		 * skipped when we quit the cleaner thread.
+		 */
+		mutex_lock(&root->fs_info->cleaner_mutex);
+		btrfs_delete_unused_bgs(root->fs_info);
+		mutex_unlock(&root->fs_info->cleaner_mutex);
+
 		ret = btrfs_commit_super(root);
 		if (ret)
 			btrfs_err(fs_info, "commit super ret %d", ret);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 9e66f5e..2ccd8d4 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1539,6 +1539,17 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 		sb->s_flags |= MS_RDONLY;

+		/*
+		 * Setting MS_RDONLY will put the cleaner thread to
+		 * sleep at the next loop if it's already active.
+		 * If it's already asleep, we'll leave unused block
+		 * groups on disk until we're mounted read-write again
+		 * unless we clean them up here.
+		 */
+		mutex_lock(&root->fs_info->cleaner_mutex);
+		btrfs_delete_unused_bgs(fs_info);
+		mutex_unlock(&root->fs_info->cleaner_mutex);

So actually, this allows for a deadlock after the patch I sent out last week:

https://patchwork.kernel.org/patch/6586811/

In that patch delete_unused_bgs is no longer called under the cleaner_mutex, and making it so, will cause a deadlock with/ru relocation.

Even without that patch, I don't think you need using this mutex anyway - no 2 tasks running this function can get the same bg from the fs_info->unused_bgs list.

I was hitting crashes during umount when xfstests would do remount-ro and umount in quick succession. I can go back and confirm this, but I believe I was encountering a race between the cleaner thread and umount after being set read-only. It didn't trigger all the time. My hypothesis is that if the cleaner thread was running and had a lot of work to do, it could start before set MS_RDONLY and still be performing work through the remount and into the umount. Ro-remount would have set MS_RDONLY so we skip the btrfs_super_commit in close_ctree and then blow up afterwards.

Taking the cleaner mutex means we either wait until the cleaner thread has finished or we put it to sleep on the next loop before it does anything. In either case, it's safe. It could just has easily been:

	mutex_lock(&root->fs_info->cleaner_mutex);
	mutex_unlock(&root->fs_info->cleaner_mutex);
	btrfs_delete_unused_bgs(fs_info);

I think it actually was in a previous version I was testing. It probably should go back to that version so that we don't end up confusing it with the new mutex you introduced in your patch.

-Jeff

--
Jeff Mahoney
SUSE Labs
Re: [PATCH 5/7] btrfs: explictly delete unused block groups in close_ctree and ro-remount
On Wed, Jun 17, 2015 at 3:36 PM, Jeff Mahoney <je...@suse.com> wrote:
On 6/17/15 10:32 AM, Jeff Mahoney wrote:
On 6/17/15 9:24 AM, Filipe David Manana wrote:
On Wed, Jun 17, 2015 at 11:04 AM, Filipe David Manana <fdman...@gmail.com> wrote:
On Mon, Jun 15, 2015 at 2:41 PM, <je...@suse.com> wrote:
From: Jeff Mahoney <je...@suse.com>

The cleaner thread may already be sleeping by the time we enter close_ctree. If that's the case, we'll skip removing any unused block groups queued for removal, even during a normal umount. They'll be cleaned up automatically at next mount, but users expect a umount to be a clean synchronization point, especially when used on thin-provisioned storage with -odiscard.

We also explicitly remove unused block groups in the ro-remount path for the same reason.

Signed-off-by: Jeff Mahoney <je...@suse.com>
Reviewed-by: Filipe Manana <fdman...@suse.com>
Tested-by: Filipe Manana <fdman...@suse.com>
---
 fs/btrfs/disk-io.c |  9 +
 fs/btrfs/super.c   | 11 +++
 2 files changed, 20 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 2ef9a4b..2e47fef 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3710,6 +3710,15 @@ void close_ctree(struct btrfs_root *root)
 	cancel_work_sync(&fs_info->async_reclaim_work);

 	if (!(fs_info->sb->s_flags & MS_RDONLY)) {
+		/*
+		 * If the cleaner thread is stopped and there are
+		 * block groups queued for removal, the deletion will be
+		 * skipped when we quit the cleaner thread.
+		 */
+		mutex_lock(&root->fs_info->cleaner_mutex);
+		btrfs_delete_unused_bgs(root->fs_info);
+		mutex_unlock(&root->fs_info->cleaner_mutex);
+
 		ret = btrfs_commit_super(root);
 		if (ret)
 			btrfs_err(fs_info, "commit super ret %d", ret);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 9e66f5e..2ccd8d4 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1539,6 +1539,17 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 		sb->s_flags |= MS_RDONLY;

+		/*
+		 * Setting MS_RDONLY will put the cleaner thread to
+		 * sleep at the next loop if it's already active.
+		 * If it's already asleep, we'll leave unused block
+		 * groups on disk until we're mounted read-write again
+		 * unless we clean them up here.
+		 */
+		mutex_lock(&root->fs_info->cleaner_mutex);
+		btrfs_delete_unused_bgs(fs_info);
+		mutex_unlock(&root->fs_info->cleaner_mutex);

So actually, this allows for a deadlock after the patch I sent out last week:

https://patchwork.kernel.org/patch/6586811/

In that patch delete_unused_bgs is no longer called under the cleaner_mutex, and making it so, will cause a deadlock with/ru relocation.

Even without that patch, I don't think you need using this mutex anyway - no 2 tasks running this function can get the same bg from the fs_info->unused_bgs list.

I was hitting crashes during umount when xfstests would do remount-ro and umount in quick succession. I can go back and confirm this, but I believe I was encountering a race between the cleaner thread and umount after being set read-only. It didn't trigger all the time. My hypothesis is that if the cleaner thread was running and had a lot of work to do, it could start before set MS_RDONLY and still be performing work through the remount and into the umount. Ro-remount would have set MS_RDONLY so we skip the btrfs_super_commit in close_ctree and then blow up afterwards.

Taking the cleaner mutex means we either wait until the cleaner thread has finished or we put it to sleep on the next loop before it does anything. In either case, it's safe. It could just has easily been:

	mutex_lock(&root->fs_info->cleaner_mutex);
	mutex_unlock(&root->fs_info->cleaner_mutex);
	btrfs_delete_unused_bgs(fs_info);

I think it actually was in a previous version I was testing. It probably should go back to that version so that we don't end up confusing it with the new mutex you introduced in your patch.

It looks like your:

[PATCH] Btrfs: fix crash on close_ctree() if cleaner starts new transaction

would also fix this in a more general case. We can drop taking the cleaner mutex here.

Cool, thanks Jeff.

-Jeff

--
Jeff Mahoney
SUSE Labs
Re: trim not working and irreparable errors from btrfsck
On Wed, Jun 17, 2015 at 6:56 AM, Christian Dysthe <cdys...@gmail.com> wrote:
Hi,

Sorry for asking more about this. I'm not a developer but trying to learn. In my case I get several errors like this one:

root 2625 inode 353819 errors 400, nbytes wrong

Is it inode 353819 I should focus on and what is the number after root, in this case 2625?

I'm going to guess it's tree root 2625, which is the same thing as fs tree, which is the same thing as subvolume. Each subvolume has its own inodes. So on a given Btrfs volume, an inode number can exist more than once, but in separate subvolumes. When you use btrfs inspect inode it will list all files with that inode number, but only the one in subvol ID 2625 is what you care about deleting and replacing.

--
Chris Murphy
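The point that an inode number is only unique within one subvolume can be illustrated with a lookup keyed on both values. This is a minimal sketch with hypothetical names (`struct file_ref`, `find_file`); it is not how btrfs-progs resolves inodes, it only shows why the root ID from the error message has to be part of the match.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record: a file identified by (subvolume root ID, inode number). */
struct file_ref {
	uint64_t root;     /* subvolume (fs tree) ID, e.g. 2625 */
	uint64_t ino;      /* inode number within that subvolume */
	const char *path;  /* illustrative path only */
};

/* Return the entry matching both root and inode, or NULL if there is none. */
static const struct file_ref *find_file(const struct file_ref *refs, size_t n,
					 uint64_t root, uint64_t ino)
{
	for (size_t i = 0; i < n; i++) {
		if (refs[i].root == root && refs[i].ino == ino)
			return &refs[i];
	}
	return NULL;
}
```

Matching on the inode number alone could return a same-numbered file from another subvolume, which is the mistake the reply warns about.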
Re: trim not working and irreparable errors from btrfsck
On 06/17/2015 10:22 AM, Chris Murphy wrote:
On Wed, Jun 17, 2015 at 6:56 AM, Christian Dysthe <cdys...@gmail.com> wrote:
Hi,

Sorry for asking more about this. I'm not a developer but trying to learn. In my case I get several errors like this one:

root 2625 inode 353819 errors 400, nbytes wrong

Is it inode 353819 I should focus on and what is the number after root, in this case 2625?

I'm going to guess it's tree root 2625, which is the same thing as fs tree, which is the same thing as subvolume. Each subvolume has its own inodes. So on a given Btrfs volume, an inode number can exist more than once, but in separate subvolumes. When you use btrfs inspect inode it will list all files with that inode number, but only the one in subvol ID 2625 is what you care about deleting and replacing.

Thanks! Deleting the file for that inode took care of it. No more errors. Restored it from a backup. However, fstrim still gives me 0 B (0 bytes) trimmed, so that may be another problem. Is there a way to check if trim works?

--
//Christian
[PATCH 0/5] Rework/enhance btrfs-map-logical
Although btrfs-map-logical is mainly a tool for developers, but even as a developer, I feel quite frustrated about the bug I found:

1) Assert if pass bytenr of a tree root
The most annoying one.
---
$ btrfs-map-logical -l 29425664 /dev/sda6
mirror 1 logical 29425664 physical 37814272 device /dev/sda6
mirror 2 logical 29425664 physical 556096 device /dev/sda6
extent_io.c:582: free_extent_buffer: Assertion `eb->refs < 0` failed.
btrfs-map-logical[0x41c464]
btrfs-map-logical(free_extent_buffer+0xc0)[0x41cf10]
btrfs-map-logical(btrfs_release_all_roots+0x59)[0x40e649]
btrfs-map-logical(close_ctree+0x1aa)[0x40f51a]
btrfs-map-logical(main+0x387)[0x4077c7]
/usr/lib/libc.so.6(__libc_start_main+0xf0)[0x7f1e7f619790]
btrfs-map-logical(_start+0x29)[0x4078f9]
---

2) Strange logical offset and non-exist mapping
---
$ btrfs-map-logical -l 1 -b 8192 /dev/sda6
mirror 1 logical 1 physical 1 device /dev/sda6
mirror 1 logical 4097 physical 4097 device /dev/sda6
---
There is no extents in that range normally.
Despite that, for the first mirror, it's OK to start from 1 as I passed an unaligned bytenr. But why the 2nd non-exist mapping is also unaligned?

For all the fix, see the commit message of the 5th patch.

Qu Wenruo (5):
  btrfs-progs: Export read_extent_data function.
  btrfs-progs: map-logical: Introduce map_one_extent function.
  Btrfs-progs: map-logical: Introduce print_mapping_info function.
  Btrfs-progs: map-logical: Introduce write_extent_content function.
  btrfs-progs: map-logical: Rework map-logical logics.

 btrfs-map-logical.c | 273 +++-
 cmds-check.c        |  34 ---
 disk-io.c           |  34 +++
 disk-io.h           |   2 +
 4 files changed, 243 insertions(+), 100 deletions(-)

--
2.4.3
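The unaligned second mapping (1, then 4097) is what you get when a walk starts at an unaligned bytenr and advances in fixed strides. The sketch below only illustrates that arithmetic; it assumes a 4096-byte sectorsize (consistent with the 1 to 4097 step in the output) and uses hypothetical helper names. It is not the fix the series applies, which is described in the 5th patch.

```c
#include <stdint.h>

/* Round bytenr down to a multiple of sectorsize (must be a power of two). */
static uint64_t round_down_sector(uint64_t bytenr, uint64_t sectorsize)
{
	return bytenr & ~(sectorsize - 1);
}

/* Start of the n-th step when walking from start in sectorsize strides. */
static uint64_t nth_step(uint64_t start, uint64_t sectorsize, unsigned int n)
{
	return start + (uint64_t)n * sectorsize;
}
```

Starting at 1, step 1 lands on 4097; rounding the start down first makes every later step fall on a sector boundary.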
[PATCH 1/5] btrfs-progs: Export read_extent_data function.
Export it for later btrfs-map-logical cleanup.

Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
---
 cmds-check.c | 34 --
 disk-io.c    | 34 ++
 disk-io.h    |  2 ++
 3 files changed, 36 insertions(+), 34 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index db121b1..778f141 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -5235,40 +5235,6 @@ static int check_space_cache(struct btrfs_root *root)
 	return error ? -EINVAL : 0;
 }

-static int read_extent_data(struct btrfs_root *root, char *data,
-			u64 logical, u64 *len, int mirror)
-{
-	u64 offset = 0;
-	struct btrfs_multi_bio *multi = NULL;
-	struct btrfs_fs_info *info = root->fs_info;
-	struct btrfs_device *device;
-	int ret = 0;
-	u64 max_len = *len;
-
-	ret = btrfs_map_block(&info->mapping_tree, READ, logical, len,
-			      &multi, mirror, NULL);
-	if (ret) {
-		fprintf(stderr, "Couldn't map the block %llu\n",
-				logical + offset);
-		goto err;
-	}
-	device = multi->stripes[0].dev;
-
-	if (device->fd == 0)
-		goto err;
-	if (*len > max_len)
-		*len = max_len;
-
-	ret = pread64(device->fd, data, *len, multi->stripes[0].physical);
-	if (ret != *len)
-		ret = -EIO;
-	else
-		ret = 0;
-err:
-	kfree(multi);
-	return ret;
-}
-
 static int check_extent_csums(struct btrfs_root *root, u64 bytenr,
 			u64 num_bytes, unsigned long leaf_offset,
 			struct extent_buffer *eb) {
diff --git a/disk-io.c b/disk-io.c
index 2a7feb0..720fee4 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -340,6 +340,40 @@ struct extent_buffer *read_tree_block(struct btrfs_root *root, u64 bytenr,
 	return ERR_PTR(ret);
 }

+int read_extent_data(struct btrfs_root *root, char *data,
+			u64 logical, u64 *len, int mirror)
+{
+	u64 offset = 0;
+	struct btrfs_multi_bio *multi = NULL;
+	struct btrfs_fs_info *info = root->fs_info;
+	struct btrfs_device *device;
+	int ret = 0;
+	u64 max_len = *len;
+
+	ret = btrfs_map_block(&info->mapping_tree, READ, logical, len,
+			      &multi, mirror, NULL);
+	if (ret) {
+		fprintf(stderr, "Couldn't map the block %llu\n",
+				logical + offset);
+		goto err;
+	}
+	device = multi->stripes[0].dev;
+
+	if (device->fd == 0)
+		goto err;
+	if (*len > max_len)
+		*len = max_len;
+
+	ret = pread64(device->fd, data, *len, multi->stripes[0].physical);
+	if (ret != *len)
+		ret = -EIO;
+	else
+		ret = 0;
+err:
+	kfree(multi);
+	return ret;
+}
+
 int write_and_map_eb(struct btrfs_trans_handle *trans,
 		     struct btrfs_root *root,
 		     struct extent_buffer *eb)
diff --git a/disk-io.h b/disk-io.h
index 62eb566..87e1cd9 100644
--- a/disk-io.h
+++ b/disk-io.h
@@ -66,6 +66,8 @@ struct btrfs_device;
 int read_whole_eb(struct btrfs_fs_info *info, struct extent_buffer *eb, int mirror);
 struct extent_buffer *read_tree_block(struct btrfs_root *root, u64 bytenr,
 				      u32 blocksize, u64 parent_transid);
+int read_extent_data(struct btrfs_root *root, char *data, u64 logical,
+		     u64 *len, int mirror);
 void readahead_tree_block(struct btrfs_root *root, u64 bytenr, u32 blocksize,
 			  u64 parent_transid);
 struct extent_buffer *btrfs_find_create_tree_block(struct btrfs_root *root,
--
2.4.3
[PATCH 3/5] Btrfs-progs: map-logical: Introduce print_mapping_info function.
The new function will print the mapping info of given range [logical, logical+len).

Note, caller must ensure the range are completely inside an extent. Or btrfs_map_block can return -ENOENT.

Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
---
 btrfs-map-logical.c | 63 +
 1 file changed, 63 insertions(+)

diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c
index 8442779..22ece82 100644
--- a/btrfs-map-logical.c
+++ b/btrfs-map-logical.c
@@ -93,6 +93,69 @@ out:
 	return ret;
 }

+static int __print_mapping_info(struct btrfs_fs_info *fs_info, u64 logical,
+				u64 len, int mirror_num)
+{
+	struct btrfs_multi_bio *multi = NULL;
+	u64 cur_offset = 0;
+	u64 cur_len;
+	int ret = 0;
+
+	while (cur_offset < len) {
+		struct btrfs_device *device;
+		int i;
+
+		cur_len = len - cur_offset;
+		ret = btrfs_map_block(&fs_info->mapping_tree, READ,
+				logical + cur_offset, &cur_len,
+				&multi, mirror_num, NULL);
+		if (ret) {
+			fprintf(info_file,
+				"Error: fails to map mirror%d logical %llu: %s\n",
+				mirror_num, logical, strerror(-ret));
+			return ret;
+		}
+		for (i = 0; i < multi->num_stripes; i++) {
+			device = multi->stripes[i].dev;
+			fprintf(info_file,
+				"mirror %d logical %Lu physical %Lu device %s\n",
+				mirror_num, logical + cur_offset,
+				multi->stripes[0].physical,
+				device->name);
+		}
+		kfree(multi);
+		multi = NULL;
+		cur_offset += cur_len;
+	}
+	return ret;
+}
+
+/*
+ * Logical and len is the exact value of a extent.
+ * And offset is the offset inside the extent. It's only used for case
+ * where user only want to print part of the extent.
+ *
+ * Caller *MUST* ensure the range [logical,logical+len) are in one extent.
+ * Or we can encounter the following case, causing a -ENOENT error:
+ * |-given parameter--|
+ *		|-- Extent A -|
+ */
+static int print_mapping_info(struct btrfs_fs_info *fs_info, u64 logical,
+			      u64 len)
+{
+	int num_copies;
+	int mirror_num;
+	int ret = 0;
+
+	num_copies = btrfs_num_copies(&fs_info->mapping_tree, logical, len);
+	for (mirror_num = 1; mirror_num <= num_copies; mirror_num++) {
+		ret = __print_mapping_info(fs_info, logical, len, mirror_num);
+		if (ret < 0)
+			return ret;
+	}
+	return ret;
+}
+
 static struct extent_buffer * debug_read_block(struct btrfs_root *root,
 		u64 bytenr, u32 blocksize, u64 copy)
 {
--
2.4.3
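The precondition stated in the patch comment, that [logical, logical+len) must lie inside a single extent, can be expressed as a small predicate. This is a hedged sketch and not part of the patch: `range_in_extent` is a hypothetical helper, and the extent bounds are assumed to come from an extent-tree lookup done by the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if [logical, logical + len) lies entirely inside
 * [ext_start, ext_start + ext_len). Empty ranges are rejected.
 */
static bool range_in_extent(uint64_t logical, uint64_t len,
			    uint64_t ext_start, uint64_t ext_len)
{
	if (len == 0 || ext_len == 0)
		return false;
	if (logical < ext_start || len > ext_len)
		return false;
	/* Compare offsets rather than end addresses to avoid u64 overflow. */
	return logical - ext_start <= ext_len - len;
}
```

A caller could run this check before calling a mapping routine and report a clear error for the partly uncovered case instead of hitting -ENOENT.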
[PATCH 4/5] Btrfs-progs: map-logical: Introduce write_extent_content function.
This function will write extent content info desired file.

Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
---
 btrfs-map-logical.c | 34 ++
 1 file changed, 34 insertions(+)

diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c
index 22ece82..1ee101c 100644
--- a/btrfs-map-logical.c
+++ b/btrfs-map-logical.c
@@ -30,6 +30,8 @@
 #include "list.h"
 #include "utils.h"

+#define BUFFER_SIZE (64 * 1024)
+
 /* we write the mirror info to stdout unless they are dumping the data
  * to stdout
  * */
@@ -156,6 +158,38 @@ static int print_mapping_info(struct btrfs_fs_info *fs_info, u64 logical,
 	return ret;
 }

+/* Same requisition as print_mapping_info function */
+static int write_extent_content(struct btrfs_fs_info *fs_info, int out_fd,
+				u64 logical, u64 length, int mirror)
+{
+	char buffer[BUFFER_SIZE];
+	u64 cur_offset = 0;
+	u64 cur_len;
+	int ret = 0;
+
+	while (cur_offset < length) {
+		cur_len = min_t(u64, length - cur_offset, BUFFER_SIZE);
+		ret = read_extent_data(fs_info->tree_root, buffer,
+				       logical + cur_offset, &cur_len, mirror);
+		if (ret < 0) {
+			fprintf(stderr,
+				"Failed to read extent at [%llu, %llu]: %s\n",
+				logical, logical + length, strerror(-ret));
+			return ret;
+		}
+		ret = write(out_fd, buffer, cur_len);
+		if (ret < 0 || ret != cur_len) {
+			if (ret > 0)
+				ret = -EINTR;
+			fprintf(stderr, "output file write failed: %s\n",
+				strerror(-ret));
+			return ret;
+		}
+		cur_offset += cur_len;
+	}
+	return ret;
+}
+
 static struct extent_buffer * debug_read_block(struct btrfs_root *root,
 		u64 bytenr, u32 blocksize, u64 copy)
 {
--
2.4.3
[PATCH 5/5] btrfs-progs: map-logical: Rework map-logical logics.
[[BUG]]
The original map-logical has the following problems:

1) Assert if we pass any tree root bytenr.
The problem is easy to trigger; here the number 29622272 is the bytenr of
the tree root:

 # btrfs-map-logical -l 29622272 /dev/sda6
 mirror 1 logical 29622272 physical 38010880 device /dev/sda6
 mirror 2 logical 29622272 physical 752704 device /dev/sda6
 extent_io.c:582: free_extent_buffer: Assertion `eb->refs < 0` failed.
 btrfs-map-logical[0x41c464]
 btrfs-map-logical(free_extent_buffer+0xc0)[0x41cf10]
 btrfs-map-logical(btrfs_release_all_roots+0x59)[0x40e649]
 btrfs-map-logical(close_ctree+0x1aa)[0x40f51a]
 btrfs-map-logical(main+0x387)[0x4077c7]
 /usr/lib/libc.so.6(__libc_start_main+0xf0)[0x7f80a5562790]
 btrfs-map-logical(_start+0x29)[0x4078f9]

The problem is that btrfs-map-logical always uses sectorsize as the
default block size to call alloc_extent_buffer.
And when it fails to find the block with the same size, it will free the
extent buffer in an incorrect way (free and create a new one with
refs == 1).

2) Will return a map result for a non-existent extent.

 # btrfs-map-logical -l 1 -b 123456 /dev/sda6
 mirror 1 logical 1 physical 1 device /dev/sda6
 mirror 1 logical 4097 physical 4097 device /dev/sda6
 mirror 1 logical 8193 physical 8193 device /dev/sda6
 ...

Normally, before bytenr 12582912, there should be no extent as that's the
mkfs time temp metadata/system chunk.
But map-logical will still map them out.

Not to mention the 1 offset among all results.

[[FIX]]
This patch will rework the whole map logical by the following methods:

1) Always do things inside an extent
Even under the following case, map logical will only return the covered
range in existing extents.

	|-- range given ---|
 |-Extent A-| |-Extent B-| |---Extent C-|
 Result:
	|--| |--| |--|

So with this patch, we will search the extent tree to ensure all
operations are inside an extent before we do some stupid things.

2) No direct call on alloc_extent_buffer function.
That low-level function shouldn't be called at such a high level.
It's only designed for low-level tree operation.

So in this patch we will only use safe high level functions to avoid such
problems.

[[RESULT]]
With this patch, no assert will be triggered and non-existent extents are
handled better.

 # btrfs-map-logical -l 29622272 /dev/sda6
 mirror 1 logical 29622272 physical 38010880 device /dev/sda6
 mirror 2 logical 29622272 physical 752704 device /dev/sda6

 # btrfs-map-logical -l 1 -b 123456 /dev/sda6
 No extent found at range [1,123457)

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 btrfs-map-logical.c | 148
 1 file changed, 67 insertions(+), 81 deletions(-)

diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c
index 1ee101c..a88e56e 100644
--- a/btrfs-map-logical.c
+++ b/btrfs-map-logical.c
@@ -190,69 +190,6 @@ static int write_extent_content(struct btrfs_fs_info *fs_info, int out_fd,
 	return ret;
 }
 
-static struct extent_buffer * debug_read_block(struct btrfs_root *root,
-		u64 bytenr, u32 blocksize, u64 copy)
-{
-	int ret;
-	struct extent_buffer *eb;
-	u64 length;
-	struct btrfs_multi_bio *multi = NULL;
-	struct btrfs_device *device;
-	int num_copies;
-	int mirror_num = 1;
-
-	eb = btrfs_find_create_tree_block(root, bytenr, blocksize);
-	if (!eb)
-		return NULL;
-
-	length = blocksize;
-	while (1) {
-		ret = btrfs_map_block(&root->fs_info->mapping_tree, READ,
-				      eb->start, &length, &multi,
-				      mirror_num, NULL);
-		if (ret) {
-			fprintf(info_file,
-				"Error: fails to map mirror%d logical %llu: %s\n",
-				mirror_num, (unsigned long long)eb->start,
-				strerror(-ret));
-			free_extent_buffer(eb);
-			return NULL;
-		}
-		device = multi->stripes[0].dev;
-		eb->fd = device->fd;
-		device->total_ios++;
-		eb->dev_bytenr = multi->stripes[0].physical;
-
-		fprintf(info_file, "mirror %d logical %Lu physical %Lu "
-			"device %s\n", mirror_num, (unsigned long long)bytenr,
-			(unsigned long long)eb->dev_bytenr, device->name);
-		kfree(multi);
-
-		if (!copy || mirror_num == copy) {
-			ret = read_extent_from_disk(eb, 0, eb->len);
-			if (ret) {
-				fprintf(info_file,
-					"Error: failed to read extent: mirror %d logical %llu: %s\n",
-					mirror_num, (unsigned long long)eb->start,
-					strerror(-ret));
-				free_extent_buffer(eb);
-
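
Point 1 of the [[FIX]] section in patch 5/5 (only report the part of the
requested range that lies inside an existing extent) can be sketched as a
plain interval intersection. This is only an illustration and not the
code from the patch: struct demo_extent and demo_clamp_to_extent() are
hypothetical names, and real extents would come from an extent tree
search rather than from a struct filled in by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an extent covering [start, start + len). */
struct demo_extent {
	uint64_t start;
	uint64_t len;
};

/*
 * Clamp the query [logical, logical + len) to a single extent.
 * Returns 1 and fills out_start/out_len when the two ranges overlap,
 * returns 0 when the query does not touch the extent at all.
 */
static int demo_clamp_to_extent(const struct demo_extent *ext,
				uint64_t logical, uint64_t len,
				uint64_t *out_start, uint64_t *out_len)
{
	uint64_t q_end = logical + len;
	uint64_t e_end = ext->start + ext->len;
	uint64_t lo = logical > ext->start ? logical : ext->start;
	uint64_t hi = q_end < e_end ? q_end : e_end;

	assert(out_start != NULL && out_len != NULL);
	if (lo >= hi)
		return 0;	/* no overlap: nothing to map */
	*out_start = lo;
	*out_len = hi - lo;
	return 1;
}
```

Calling this once per extent found in the requested range would give the
three clamped pieces drawn in the diagram, and a range that overlaps no
extent yields nothing to print, which matches the "No extent found"
result shown in [[RESULT]].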
Re: BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)
On Wed, Jun 17, 2015 at 1:16 AM, Marc MERLIN m...@merlins.org wrote:
> So I can understand how I may have had a few blocks that are in a bad
> state.
> I'm getting a few (not many) of those messages in syslog.
> BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)

I think more information is needed at the time of this entry, maybe
the previous 20 entries or so. That Btrfs thinks there is a read error
is different than when it thinks there's a checksum error. For example
when I willfully corrupt one sector that I know contains metadata,
then read the file or do a scrub:

[48466.824770] BTRFS: checksum error at logical 20971520 on dev /dev/sdb, sector 57344: metadata leaf (level 0) in tree 3
[48466.829900] BTRFS: checksum error at logical 20971520 on dev /dev/sdb, sector 57344: metadata leaf (level 0) in tree 3
[48466.834944] BTRFS: bdev /dev/sdb errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[48466.853589] BTRFS: fixed up error at logical 20971520 on dev /dev/sdb

I'd expect in your case you have a bdev line that reads rd 1, corrupt 0.

> Just to make sure I understand, do those messages in syslog mean that my
> metadata got corrupted a bit, but because I have 2 copies, btrfs can fix
> the bad copy by using the good one?

Probably but not enough information has been given to conclude that.

> Also, if my actual data got corrupted, am I correct that btrfs will
> detect the checksum failure and give me a different error message of a
> read error that cannot be corrected?

Yes, it looks like this when the file is directly read:

[ 1457.231316] BTRFS warning (device sdb): csum failed ino 257 off 0 csum 3703877302 expected csum 1978138932
[ 1457.231842] BTRFS warning (device sdb): csum failed ino 257 off 0 csum 3703877302 expected csum 1978138932

It looks like this during a scrub:

[ 1540.865520] BTRFS: checksum error at logical 12845056 on dev /dev/sdb, sector 25088, root 5, inode 257, offset 0, length 4096, links 1 (path: grub2-install)
[ 1540.865534] BTRFS: bdev /dev/sdb errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[ 1540.866944] BTRFS: unable to fixup (regular) error at logical 12845056 on dev /dev/sdb

--
Chris Murphy
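
To make the distinction in the reply above between a read error (rd) and
a checksum error (corrupt) easier to check across a long log, the
counters on the "bdev ... errs:" line can be pulled out with a few lines
of C. This is a minimal sketch that assumes the line format stays exactly
as quoted in this message; struct demo_dev_errs and
demo_parse_bdev_errs() are made-up names, not part of btrfs-progs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for the five counters printed on a bdev errs line. */
struct demo_dev_errs {
	unsigned long long wr, rd, flush, corrupt, gen;
};

/*
 * Parse a kernel log line of the form quoted above:
 *   BTRFS: bdev /dev/sdb errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
 * Returns 1 if all five counters were found, 0 otherwise.
 */
static int demo_parse_bdev_errs(const char *line, struct demo_dev_errs *out)
{
	const char *p;

	assert(line != NULL && out != NULL);
	p = strstr(line, "errs: wr ");
	if (!p)
		return 0;
	return sscanf(p,
		      "errs: wr %llu, rd %llu, flush %llu, corrupt %llu, gen %llu",
		      &out->wr, &out->rd, &out->flush, &out->corrupt,
		      &out->gen) == 5;
}
```

A non-zero rd with corrupt at 0 would correspond to the "read error"
case, and a non-zero corrupt to the checksum case shown in the scrub
output.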
Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION
On Wed, Jun 17, 2015 at 03:54:31PM +0800, Liu Bo wrote:
> MS_I_VERSION is enabled by default for btrfs, this adds an alternative
> option to toggle it off.

There's an existing generic iversion/noiversion mount option pair, no
need to extra add it to btrfs.
Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION
On Wed, Jun 17, 2015 at 11:52:36PM +0800, Liu Bo wrote:
> On Wed, Jun 17, 2015 at 05:33:06PM +0200, David Sterba wrote:
> > On Wed, Jun 17, 2015 at 03:54:31PM +0800, Liu Bo wrote:
> > > MS_I_VERSION is enabled by default for btrfs, this adds an alternative
> > > option to toggle it off.
> >
> > There's an existing generic iversion/noiversion mount option pair, no
> > need to extra add it to btrfs.
>
> I know, it doesn't work though.

Sigh, I see, btrfs forces MS_I_VERSION flag,
0c4d2d95d06e920e0c61707e62c7fffc9c57f63a. I read 'enabled by default' as
that there's a standard way to override the defaults.

So the right way is not to do that but this will break everything that
relies on that behaviour at the moment. This means to add the exception
to the upper layers, either VFS or 'mount', which is not very likely to
happen.

The generic options do not reach the filesystem specific callbacks, so
we can't check it.
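
Why a generic noiversion cannot win against a filesystem that forces the
flag can be shown with a toy model. This is not btrfs code:
DEMO_MS_I_VERSION and demo_fill_super() are hypothetical, and the only
thing taken from the thread is that the filesystem sets the flag itself
after the generic mount flags have been decided.

```c
#include <assert.h>

/* Hypothetical flag bit standing in for MS_I_VERSION. */
#define DEMO_MS_I_VERSION (1UL << 23)

/*
 * Hypothetical fill_super step: the filesystem ORs the flag in, so
 * whatever the mount layer decided about iversion/noiversion is lost.
 */
static unsigned long demo_fill_super(unsigned long mount_flags)
{
	unsigned long s_flags = mount_flags;

	s_flags |= DEMO_MS_I_VERSION;
	/* Postcondition: the flag is set no matter what was requested. */
	assert((s_flags & DEMO_MS_I_VERSION) != 0);
	return s_flags;
}
```

In this model a caller that passes flags without the bit (the noiversion
case) and one that passes them with it get the same result, which is the
behaviour described above as "forces MS_I_VERSION".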
Re: trim not working and irreparable errors from btrfsck
On Wed, Jun 17, 2015 at 8:33 AM, Christian cdys...@gmail.com wrote:
> On 06/17/2015 10:22 AM, Chris Murphy wrote:
> > On Wed, Jun 17, 2015 at 6:56 AM, Christian Dysthe cdys...@gmail.com wrote:
> > > Hi,
> > >
> > > Sorry for asking more about this. I'm not a developer but trying to
> > > learn. In my case I get several errors like this one:
> > >
> > > root 2625 inode 353819 errors 400, nbytes wrong
> > >
> > > Is it inode 353819 I should focus on and what is the number after
> > > root, in this case 2625?
> >
> > I'm going to guess it's tree root 2625, which is the same thing as fs
> > tree, which is the same thing as subvolume. Each subvolume has its own
> > inodes. So on a given Btrfs volume, an inode number can exist more
> > than once, but in separate subvolumes. When you use btrfs inspect
> > inode it will list all files with that inode number, but only the one
> > in subvol ID 2625 is what you care about deleting and replacing.
>
> Thanks! Deleting the file for that inode took care of it. No more errors.
> Restored it from a backup.
>
> However, fstrim still gives me 0 B (0 bytes) trimmed, so that may be
> another problem. Is there a way to check if trim works?

That sounds like maybe your SSD is blacklisted for trim, is all I can
think of. So trim shouldn't be the cause of the problem if it's being
blacklisted. The recent problems appear to be around newer SSDs that
support queue trim and newer kernels that issue queued trim. There
have been some patches related to trim to the kernel, but the
existence of blacklisting and claims of bugs in firmware make it
difficult to test and isolate.

http://techreport.com/news/28473/some-samsung-ssds-may-suffer-from-a-buggy-trim-implementation

--
Chris Murphy
Re: trim not working and irreparable errors from btrfsck
On 06/17/2015 11:28 AM, Chris Murphy wrote:
> > However, fstrim still gives me 0 B (0 bytes) trimmed, so that may be
> > another problem. Is there a way to check if trim works?
>
> That sounds like maybe your SSD is blacklisted for trim, is all I can
> think of. So trim shouldn't be the cause of the problem if it's being
> blacklisted. The recent problems appear to be around newer SSDs that
> support queue trim and newer kernels that issue queued trim. There
> have been some patches related to trim to the kernel, but the
> existence of blacklisting and claims of bugs in firmware make it
> difficult to test and isolate.
>
> http://techreport.com/news/28473/some-samsung-ssds-may-suffer-from-a-buggy-trim-implementation

This is an Intel SSD in a Lenovo Thinkpad X1 Carbon. Trim worked until a
few weeks ago and still works for my small ext4 boot partition (just ran
it to check). I will keep looking for a solution. Thanks!

--
//Christian
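
On the question of how to check whether trim is available at all: one
partial check, sketched here under assumptions, is to read the block
queue attribute discard_max_bytes for the device from sysfs. A value of 0
is generally taken to mean the kernel does not expose discard for that
device. The helper names below are hypothetical, the device name in the
comment is only an example, and this says nothing about whether queued
trim is blacklisted or whether a particular filesystem actually issues
discards.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Read one unsigned decimal value from a file such as
 * /sys/block/sda/queue/discard_max_bytes (device name is an example).
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int demo_read_u64_file(const char *path, unsigned long long *val)
{
	FILE *f;
	int ok;

	assert(path != NULL && val != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%llu", val) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/* Returns 1 if discard is advertised, 0 if it is not, -1 if unknown. */
static int demo_discard_advertised(const char *discard_max_bytes_path)
{
	unsigned long long max_bytes;

	if (demo_read_u64_file(discard_max_bytes_path, &max_bytes) != 0)
		return -1;
	return max_bytes > 0;
}
```

Since trim reportedly still works on the ext4 partition of the same SSD,
this check would be expected to report that discard is advertised; in
that case the 0 bytes result would have to come from somewhere above the
block layer.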
Re: BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)
On Wed, Jun 17, 2015 at 01:51:26PM +, Duncan wrote:
> > Also, if my actual data got corrupted, am I correct that btrfs will
> > detect the checksum failure and give me a different error message of a
> > read error that cannot be corrected?
> >
> > I'll do a scrub later, for now I have to wait 20 hours for the raid
> > rebuild first.
>
> Yes again.

Great, thanks for confirming.
Makes me happy to know that checksums and metadata DUP are helping me
out here. With ext4 I'd have been worse off for sure.

> One thing I'd strongly recommend. Once the rebuild is complete and you do
> the scrub, there may well be both read/corrected errors, and unverified
> errors. AFAIK, the unverified errors are a result of bad metadata blocks,
> so missing checksums for what they covered. So once you

I'm slightly confused here. If I have metadata DUP and checksums, how
can metadata blocks be unverified?
Data blocks being unverified, I understand, it would mean the data or
checksum is bad, but I expect that's a different error message I haven't
seen yet.
(I use sec.pl which Emails me any btrfs kernel log line that's not
whitelisted as being known/OK)

On Wed, Jun 17, 2015 at 08:58:18AM -0600, Chris Murphy wrote:
> I think more information is needed at the time of this entry, maybe
> the previous 20 entries or so. That Btrfs thinks there is a read error
> is different than when it thinks there's a checksum error. For example
> when I willfully corrupt one sector that I know contains metadata,
> then read the file or do a scrub:
>
> [48466.824770] BTRFS: checksum error at logical 20971520 on dev /dev/sdb, sector 57344: metadata leaf (level 0) in tree 3
> [48466.829900] BTRFS: checksum error at logical 20971520 on dev /dev/sdb, sector 57344: metadata leaf (level 0) in tree 3
> [48466.834944] BTRFS: bdev /dev/sdb errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
> [48466.853589] BTRFS: fixed up error at logical 20971520 on dev /dev/sdb
>
> I'd expect in your case you have a bdev line that reads rd 1, corrupt 0.

Thankfully I have 0 checksum errors.

I looked at all the warnings I got last night, and none are repeated, so
it looks like it's just minor corruption I got from re-adding a raid
drive that wasn't 100% in sync with the others, and btrfs nicely fixing
things for me.
Nice!

BTRFS: read error corrected: ino 1 off 205565952 (dev /dev/mapper/dshelf1 sector 417880)
BTRFS: read error corrected: ino 1 off 216420352 (dev /dev/mapper/dshelf1 sector 439080)
BTRFS: read error corrected: ino 1 off 216424448 (dev /dev/mapper/dshelf1 sector 439088)
BTRFS: read error corrected: ino 1 off 216473600 (dev /dev/mapper/dshelf1 sector 439184)
BTRFS: read error corrected: ino 1 off 226656256 (dev /dev/mapper/dshelf1 sector 459072)
BTRFS: read error corrected: ino 1 off 226693120 (dev /dev/mapper/dshelf1 sector 459144)
BTRFS: read error corrected: ino 1 off 226729984 (dev /dev/mapper/dshelf1 sector 459216)
BTRFS: read error corrected: ino 1 off 226734080 (dev /dev/mapper/dshelf1 sector 459224)
BTRFS: read error corrected: ino 1 off 226742272 (dev /dev/mapper/dshelf1 sector 459240)
BTRFS: read error corrected: ino 1 off 226758656 (dev /dev/mapper/dshelf1 sector 459272)
BTRFS: read error corrected: ino 1 off 226783232 (dev /dev/mapper/dshelf1 sector 459320)
BTRFS: read error corrected: ino 1 off 226811904 (dev /dev/mapper/dshelf1 sector 459376)
BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)
BTRFS: read error corrected: ino 1 off 226865152 (dev /dev/mapper/dshelf1 sector 459480)
BTRFS: read error corrected: ino 1 off 226975744 (dev /dev/mapper/dshelf1 sector 459696)
BTRFS: read error corrected: ino 1 off 228589568 (dev /dev/mapper/dshelf1 sector 462848)
BTRFS: read error corrected: ino 1 off 228601856 (dev /dev/mapper/dshelf1 sector 462872)
BTRFS: read error corrected: ino 1 off 228610048 (dev /dev/mapper/dshelf1 sector 462888)
BTRFS: read error corrected: ino 1 off 228614144 (dev /dev/mapper/dshelf1 sector 462896)
BTRFS: read error corrected: ino 1 off 228618240 (dev /dev/mapper/dshelf1 sector 462904)
BTRFS: read error corrected: ino 1 off 228626432 (dev /dev/mapper/dshelf1 sector 462920)
BTRFS: read error corrected: ino 1 off 228630528 (dev /dev/mapper/dshelf1 sector 462928)
BTRFS: read error corrected: ino 1 off 228638720 (dev /dev/mapper/dshelf1 sector 462944)
BTRFS: read error corrected: ino 1 off 228642816 (dev /dev/mapper/dshelf1 sector 462952)
BTRFS: read error corrected: ino 1 off 228646912 (dev /dev/mapper/dshelf1 sector 462960)
BTRFS: read error corrected: ino 1 off 228651008 (dev /dev/mapper/dshelf1 sector 462968)
BTRFS: read error corrected: ino 1 off 228708352 (dev /dev/mapper/dshelf1 sector 463080)
BTRFS: read error corrected: ino 1 off 228712448 (dev /dev/mapper/dshelf1 sector 463088)
BTRFS: read error corrected: ino 1 off 228716544 (dev /dev/mapper/dshelf1 sector 463096)
BTRFS: read error corrected: ino 1 off 228720640 (dev /dev/mapper/dshelf1 sector 463104)
BTRFS: read error
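
The "none are repeated" observation in this message can be checked
mechanically instead of by eye. The following is a rough sketch that
assumes every line has the "read error corrected ... sector N)" shape
shown in the log; demo_parse_sector() and demo_count_repeats() are
hypothetical helpers, and the quadratic scan is only meant for a few
dozen lines.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Extract the sector from a "read error corrected" line; 1 on success. */
static int demo_parse_sector(const char *line, unsigned long long *sector)
{
	const char *p;

	assert(line != NULL && sector != NULL);
	if (!strstr(line, "read error corrected"))
		return 0;
	p = strstr(line, " sector ");
	if (!p)
		return 0;
	return sscanf(p, " sector %llu", sector) == 1;
}

/* Count lines whose sector already appeared on an earlier line. */
static size_t demo_count_repeats(const char *const *lines, size_t n)
{
	size_t i, j, repeats = 0;
	unsigned long long si, sj;

	for (i = 0; i < n; i++) {
		if (!demo_parse_sector(lines[i], &si))
			continue;	/* skip lines of another kind */
		for (j = 0; j < i; j++) {
			if (demo_parse_sector(lines[j], &sj) && sj == si) {
				repeats++;
				break;
			}
		}
	}
	return repeats;
}
```

A result of 0 over the whole log would agree with the reading that each
block was corrected once and has not come back as an error since.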
Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION
On Wed, Jun 17, 2015 at 05:33:06PM +0200, David Sterba wrote:
> On Wed, Jun 17, 2015 at 03:54:31PM +0800, Liu Bo wrote:
> > MS_I_VERSION is enabled by default for btrfs, this adds an alternative
> > option to toggle it off.
>
> There's an existing generic iversion/noiversion mount option pair, no
> need to extra add it to btrfs.

I know, it doesn't work though.

Thanks,

-liubo
Re: [RFC PATCH v2 2/2] Btrfs: improve fsync for nocow file
On Wed, Jun 17, 2015 at 03:54:32PM +0800, Liu Bo wrote:
> If we're overwriting an allocated file without changing timestamp
> and inode version, and the file is with NODATACOW, we don't have any
> metadata to commit, thus we can just flush the data device cache and
> go forward.
>
> However, if there is any change on extents' disk bytenr, inode size,
> timestamp or inode version, we need to go through the normal
> btrfs_log_inode path.
>
> Test:
> 
> 1. sysbench test of
>    1 file + 1 thread + bs=4k + size=40k + synchronous I/O mode + randomwrite +
>    fsync_on_each_write,
> 2. loop device associated with tmpfs file
> 3.
>   - For btrfs, -o nodatacow and -o noi_version option
>   - For ext4 and xfs, no extra mount options
> 
>
> Results:
> 
> - btrfs:
>   w/o: ~30Mb/sec
>   w:   ~131Mb/sec
>
> - other filesystems: (both don't enable i_version by default)
>   ext4: 203Mb/sec
>   xfs:  212Mb/sec
> 
>
> Signed-off-by: Liu Bo bo.li@oracle.com
> ---
> v2: Catch errors from data writeback and skip barrier if necessary.
>
>  fs/btrfs/btrfs_inode.h |    2 +
>  fs/btrfs/disk-io.c     |    2 +-
>  fs/btrfs/disk-io.h     |    1 +
>  fs/btrfs/file.c        |   54 +++
>  fs/btrfs/inode.c       |    3 ++
>  5 files changed, 56 insertions(+), 6 deletions(-)
>
> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> index de5e4f2..f7b99b6 100644
> --- a/fs/btrfs/btrfs_inode.h
> +++ b/fs/btrfs/btrfs_inode.h
> @@ -44,6 +44,8 @@
>  #define BTRFS_INODE_IN_DELALLOC_LIST		9
>  #define BTRFS_INODE_READDIO_NEED_LOCK		10
>  #define BTRFS_INODE_HAS_PROPS			11
> +#define BTRFS_INODE_NOTIMESTAMP			12
> +#define BTRFS_INODE_NOISIZE			13

It's not clear what the flags mean and that they're related to syncing
under some conditions.

>  /*
>   * The following 3 bits are meant only for the btree inode.
>   * When any of them is set, it means an error happened while writing an
> --- a/fs/btrfs/disk-io.h
> +++ b/fs/btrfs/disk-io.h
> @@ -60,6 +60,7 @@ void close_ctree(struct btrfs_root *root);
>  int write_ctree_super(struct btrfs_trans_handle *trans,
>  		      struct btrfs_root *root, int max_mirrors);
>  struct buffer_head *btrfs_read_dev_super(struct block_device *bdev);
> +int barrier_all_devices(struct btrfs_fs_info *info);
>  int btrfs_commit_super(struct btrfs_root *root);
>  struct extent_buffer *btrfs_find_tree_block(struct btrfs_root *root,
>  					    u64 bytenr);
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index faa7d39..180a3e1 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -523,8 +523,12 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
>  	 * the disk i_size.  There is no need to log the inode
>  	 * at this time.
>  	 */
> -	if (end_pos > isize)
> +	if (end_pos > isize) {
>  		i_size_write(inode, end_pos);
> +		clear_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
> +	} else {
> +		set_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
> +	}
>  	return 0;
>  }
>  
> @@ -1715,19 +1719,33 @@ out:
>  static void update_time_for_write(struct inode *inode)
>  {
>  	struct timespec now;
> +	int sync_it = 0;
>  
> -	if (IS_NOCMTIME(inode))
> +	if (IS_NOCMTIME(inode)) {
> +		set_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
>  		return;
> +	}
>  
>  	now = current_fs_time(inode->i_sb);
> -	if (!timespec_equal(&inode->i_mtime, &now))
> +	if (!timespec_equal(&inode->i_mtime, &now)) {
>  		inode->i_mtime = now;
> +		sync_it = S_MTIME;
> +	}
>  
> -	if (!timespec_equal(&inode->i_ctime, &now))
> +	if (!timespec_equal(&inode->i_ctime, &now)) {
>  		inode->i_ctime = now;
> +		sync_it |= S_CTIME;
> +	}
>  
> -	if (IS_I_VERSION(inode))
> +	if (IS_I_VERSION(inode)) {
>  		inode_inc_iversion(inode);
> +		sync_it |= S_VERSION;
> +	}
> +
> +	if (!sync_it)
> +		set_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
> +	else
> +		clear_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
>  }
>  
>  static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
> @@ -1983,6 +2001,32 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
>  		goto out;
>  	}
>  
> +	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) {
> +		if (test_and_clear_bit(BTRFS_INODE_NOTIMESTAMP,
> +				&BTRFS_I(inode)->runtime_flags) &&
> +		    test_and_clear_bit(BTRFS_INODE_NOISIZE,
> +				&BTRFS_I(inode)->runtime_flags)) {
> +
> +			/* make sure data is on disk and catch error */
> +			if (!full_sync)
> +				ret =