Re: btrfs partition converted from ext4 becomes read-only minutes after booting: WARNING: CPU: 2 PID: 2777 at ../fs/btrfs/super.c:260 __btrfs_abort_transaction+0x4b/0x120

2015-06-17 Thread Marc MERLIN
On Fri, Jun 12, 2015 at 03:19:06PM +0300, Robert Munteanu wrote:
 Hi,
 
Note to others: kernel 4.0.4

Reply to you:
I tried ext4 to btrfs once a year ago and it severely mangled my
filesystem.
I looked at it as a cool feature/hack that may have worked some time ago, but
that no one really uses anymore, and that may not work right at this
point.

Unless you hear back from a developer interested in debugging/fixing
this, I would assume that this feature is broken and dead.

Marc

 I have converted my root ext4 partition to btrfs. I used a USB stick
 to boot and used btrfs-convert.
 
 I also did a balance and defrag ( in that order ) , both when the fs
 was mounted.
 
 After logging in to KDE I quickly get a read-only filesystem. I've
 pasted the backtrace below
 
 Jun 11 23:13:08 mars kernel: WARNING: CPU: 2 PID: 2777 at
 ../fs/btrfs/super.c:260 __btrfs_abort_transaction+0x4b/0x120 [btrfs]()
 Jun 11 23:13:08 mars kernel: BTRFS: Transaction aborted (error -95)
 Jun 11 23:13:08 mars kernel: Modules linked in: bnep bluetooth rfkill
 fuse vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) af_packet
 nf_log_ipv6 xt_pkttype nf_log_ipv4 nf_log_common xt_LOG xt_limit
 ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw
 ipt_REJECT iptable_raw xt_CT iptable_filter ip6table_mangle
 nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4
 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter
 ip6_tables x_tables xfs libcrc32c snd_hda_codec_hdmi raid1 md_mod
 gpio_ich ppdev iTCO_wdt iTCO_vendor_support coretemp
 snd_hda_codec_realtek snd_hda_codec_generic kvm_intel snd_hda_intel
 dm_mod kvm snd_hda_controller snd_hda_codec snd_hwdep serio_raw pcspkr
 snd_pcm i2c_i801 snd_seq joydev snd_seq_device snd_timer snd
 8250_fintek parport_pc parport acpi_cpufreq lpc_ich
 Jun 11 23:13:08 mars kernel:  soundcore mfd_core shpchp processor
 ata_generic btrfs hid_logitech_hidpp xor raid6_pq sr_mod cdrom
 nvidia_uvm(PO) nvidia(PO) firewire_ohci firewire_core crc_itu_t uas
 usb_storage r8169 mii pata_jmicron hid_logitech_dj drm button sg
 Jun 11 23:13:08 mars kernel: CPU: 2 PID: 2777 Comm: kworker/u8:0
 Tainted: P O 4.0.4-3-desktop #1
 Jun 11 23:13:08 mars kernel: Hardware name: Gigabyte Technology Co.,
 Ltd. EP35-DS4/EP35-DS4, BIOS F6d 01/08/2009
 Jun 11 23:13:08 mars kernel: Workqueue: btrfs-endio-write
 btrfs_endio_write_helper [btrfs]
 Jun 11 23:13:08 mars kernel:   a0a92832
 8167c4aa 880128513ca8
 Jun 11 23:13:08 mars kernel:  81063bb1 880031929d28
 880221e71800 ffa1
 Jun 11 23:13:08 mars kernel:  a0a914e0 0b50
 81063c2a a0a95928
 Jun 11 23:13:08 mars kernel: Call Trace:
 Jun 11 23:13:08 mars kernel:  [8100574c] dump_trace+0x8c/0x340
 Jun 11 23:13:08 mars kernel:  [81005aa3] show_stack_log_lvl+0xa3/0x190
 Jun 11 23:13:08 mars kernel:  [81007201] show_stack+0x21/0x50
 Jun 11 23:13:08 mars kernel:  [8167c4aa] dump_stack+0x47/0x67
 Jun 11 23:13:08 mars kernel:  [81063bb1] warn_slowpath_common+0x81/0xb0
 Jun 11 23:13:08 mars kernel:  [81063c2a] warn_slowpath_fmt+0x4a/0x50
 Jun 11 23:13:08 mars kernel:  [a09e598b] __btrfs_abort_transaction+0x4b/0x120 [btrfs]
 Jun 11 23:13:08 mars kernel:  [a0a1d18a] btrfs_finish_ordered_io+0x5aa/0x620 [btrfs]
 Jun 11 23:13:08 mars kernel:  [a0a43253] normal_work_helper+0xc3/0x320 [btrfs]
 Jun 11 23:13:08 mars kernel:  [8107bcf2] process_one_work+0x142/0x420
 Jun 11 23:13:08 mars kernel:  [8107c0e4] worker_thread+0x114/0x460
 Jun 11 23:13:08 mars kernel:  [81081261] kthread+0xc1/0xe0
 Jun 11 23:13:08 mars kernel:  [81682d58] ret_from_fork+0x58/0x90
 Jun 11 23:13:08 mars kernel: ---[ end trace 4c4eb7d6e98afa91 ]---
 Jun 11 23:13:08 mars kernel: BTRFS: error (device sda1) in
 btrfs_finish_ordered_io:2896: errno=-95 unknown
 Jun 11 23:13:08 mars kernel: BTRFS info (device sda1): forced readonly
 
 Some diagnostic info:
 
 - btrfs scrub reports no errors
 - on the host machine I'm running btrfs v4.0+20150429 and kernel 
 4.0.4-3-desktop
 - on the live medium, used to run btrfs-convert, I was running btrfs
 v4.0+20150429 and kernel 4.0.3-1-default
 
 # btrfs fi show
 Label: none  uuid: 54dea125-74cd-4bb2-86a2-f7bc645b76cf
 Total devices 1 FS bytes used 90.22GiB
 devid1 size 223.57GiB used 92.03GiB path /dev/sda1
 
 btrfs-progs v4.0+20150429
 
 # btrfs fi df /
 Data, single: total=89.00GiB, used=88.17GiB
 System, single: total=32.00MiB, used=16.00KiB
 Metadata, single: total=3.00GiB, used=2.05GiB
 GlobalReserve, single: total=512.00MiB, used=0.00B
 
 Is there a way out? I still have the old ext4 image and can revert,
 but I'm keeping the btrfs one for now, in case I can extract some
 useful debugging information from it.
 
 Thanks,
 
 Robert
 
 
 -- 
 http://robert.muntea.nu/

Re: How do I make 'btrfs scrub' report errors via email?

2015-06-17 Thread Marc MERLIN
On Sat, Jun 13, 2015 at 10:48:35PM +0900, crocket wrote:
 I can check the result of 'btrfs scrub' later, but I don't want to
 take time to actually check it.
 Does anyone know how to make 'btrfs scrub' report errors via email?
 It seems google doesn't know.

See the bottom of:
http://marc.merlins.org/perso/btrfs/2014-03.html#Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair

You can remove shlock from the script if you don't have it (or use
another lock script).

Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [systemd-devel] [survey] BTRFS_IOC_DEVICES_READY return status

2015-06-17 Thread Goffredo Baroncelli
On 2015-06-15 19:38, Lennart Poettering wrote:
 On Mon, 15.06.15 19:23, Goffredo Baroncelli (kreij...@inwind.it) wrote:
 
 On 2015-06-15 12:46, Lennart Poettering wrote:
 On Sat, 13.06.15 17:09, Goffredo Baroncelli (kreij...@libero.it) wrote:

 Further, the problem will be more intense in this eg. if you use dd
 and copy device A to device B. After you mount device A, by just
 providing device B in the above two commands you could let kernel
 update the device path, again all the IO (since device is mounted)
 are still going to the device A (not B), but /proc/self/mounts and
 'btrfs fi show' shows it as device B (not A).

 It's a bug, and very tricky to fix.

 In the past [*] I proposed a mount.btrfs helper. I tried to move the
 logic outside the kernel.
 I think the problem is that we try to manage all these cases
 from a device point of view: when a device appears, we register the
 device and we try to mount the filesystem... This works very well
 for a single-volume filesystem. For the other cases there is a
 mess between the different layers:

 - kernel
 - udev/systemd
 - initrd logic

 My attempt followed a different idea: the mount helper waits for the
 devices if needed, or, if appropriate, mounts the filesystem in
 degraded mode. All devices are passed as mount arguments
 (--device=/dev/sdX) and there is no device registration: this avoids
 all these problems.

 Hmm, no. /bin/mount should not block for devices. That's generally
 incompatible with how the tool is used, and in particular from
 systemd. We would not make use for such a scheme in
 systemd. /bin/mount should always be short-running.

 Apart from systemd, what are these incompatibilities?
 
 Well, /bin/mount is not a daemon, and it should not be one.

My helper is not a daemon; you were correct the first time: it blocks until all
needed (or enough) devices have appeared.
Anyway, this should not be different from mounting an NFS filesystem. Even in
that case the mount helper blocks until the connection is established. The
blocking time is not negligible, even though it is not as long as a device
timeout ...
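
To make the blocking behaviour discussed here concrete, the following is only a rough sketch of "wait until all devices have appeared, otherwise fall back to degraded after a timeout". It is not the code of the proposed mount.btrfs helper; the function name, the probe callback and the min_needed parameter are all assumptions made up for illustration.

```python
import time


def wait_for_devices(expected, probe, timeout, min_needed=None,
                     poll_interval=0.5, clock=time.monotonic,
                     sleep=time.sleep):
    """Block until every expected device is present, or until timeout.

    probe is a hypothetical callback returning the device paths that are
    currently visible. Returns ("normal", present) when all devices showed
    up, ("degraded", present) when the timeout expired but at least
    min_needed devices are present, and ("fail", present) otherwise.
    """
    expected = set(expected)
    if min_needed is None:
        min_needed = len(expected)
    deadline = clock() + timeout
    while True:
        present = expected & set(probe())
        if present == expected:
            return "normal", present
        if clock() >= deadline:
            mode = "degraded" if len(present) >= min_needed else "fail"
            return mode, present
        sleep(poll_interval)
```

The clock and sleep parameters exist only so the loop can be exercised without real waiting; a real helper would also have to decide what "enough devices" means for each RAID profile.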

 
 I am pretty sure that if such automatic degraded mounting should be
 supported, then this should be done with some background storage
 daemon that alters the effect of the READY ioctl somehow after the
 timeout, and then retriggers the devices so that systemd takes
 note. (or, alternatively: such a scheme could even be implemented all
 in kernel, based on some configurable kernel setting...)

 I recognize that this solution provides the maximum compatibility
 with the current implementation. However, it seems too complex to
 me. Retriggering a device seems to me more a workaround than a
 solution.
 
 Well, it's not really ugly. I mean, if the state or properties of a
 device change, then udev should update its information about it, and
 that's done via a retrigger. We do that all the time already, for
 example when an existing loopback device gets a backing file assigned
 or removed. I am pretty sure that loopback case is very close to what
 you want to do here, hence retriggering (either from the kernel side,
 or from userspace), appears like an OK thing to do.

What seems strange to me is that in this case the devices have not changed
their status.
How is this problem handled in the md/dm RAID cases?

 
 Could a generator do this job? I.e. this generator (or storage
 daemon) waits until all (or enough) devices have appeared, then it
 creates a .mount unit: do you think that it is doable?
 
 systemd generators are a way to extend the systemd unit dep tree with
 units. They are very short running, and are executed only very very
 early at boot. They cannot wait for anything, they don't have access
 to devices and are not run when devices appear.
 
 Lennart
 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: btrfs partition converted from ext4 becomes read-only minutes after booting: WARNING: CPU: 2 PID: 2777 at ../fs/btrfs/super.c:260 __btrfs_abort_transaction+0x4b/0x120

2015-06-17 Thread Marc Joliet
On Wed, 17 Jun 2015 10:46:30 -0700,
Marc MERLIN m...@merlins.org wrote:

 I tried ext4 to btrfs once a year ago and it severely mangled my
 filesystem.
 I looked at it as a cool feature/hack that may have worked some time ago, but
 that no one really uses anymore, and that may not work right at this
 point.

Just another data point: when I switched to btrfs in the middle of last year I
used btrfs-convert on two file systems (an SSD and my backup partition on a
USB 3.0 HDD), and it worked in both cases (i.e., no data loss). I did see some
strange balance issues (see the ML archives), but IIRC nothing really serious.

-- 
Marc Joliet
--
People who think they know everything really annoy those of us who know we
don't - Bjarne Stroustrup




Re: btrfs partition converted from ext4 becomes read-only minutes after booting: WARNING: CPU: 2 PID: 2777 at ../fs/btrfs/super.c:260 __btrfs_abort_transaction+0x4b/0x120

2015-06-17 Thread Jeff Mahoney

On 6/12/15 8:19 AM, Robert Munteanu wrote:
 Hi,
 
 I have converted my root ext4 partition to btrfs. I used a USB
 stick to boot and used btrfs-convert.
 
 I also did a balance and defrag ( in that order ) , both when the
 fs was mounted.
 
 After logging in to KDE I quickly get a read-only filesystem. I've 
 pasted the backtrace below
 
 Jun 11 23:13:08 mars kernel: WARNING: CPU: 2 PID: 2777 at
 ../fs/btrfs/super.c:260 __btrfs_abort_transaction+0x4b/0x120 [btrfs]()
 Jun 11 23:13:08 mars kernel: BTRFS: Transaction aborted (error -95)
 Jun 11 23:13:08 mars kernel: Modules linked in: bnep bluetooth rfkill
 fuse vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) af_packet
 nf_log_ipv6 xt_pkttype nf_log_ipv4 nf_log_common xt_LOG xt_limit
 ip6t_REJECT xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw
 ipt_REJECT iptable_raw xt_CT iptable_filter ip6table_mangle
 nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4
 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack ip6table_filter
 ip6_tables x_tables xfs libcrc32c snd_hda_codec_hdmi raid1 md_mod
 gpio_ich ppdev iTCO_wdt iTCO_vendor_support coretemp
 snd_hda_codec_realtek snd_hda_codec_generic kvm_intel snd_hda_intel
 dm_mod kvm snd_hda_controller snd_hda_codec snd_hwdep serio_raw pcspkr
 snd_pcm i2c_i801 snd_seq joydev snd_seq_device snd_timer snd
 8250_fintek parport_pc parport acpi_cpufreq lpc_ich
 Jun 11 23:13:08 mars kernel:  soundcore mfd_core shpchp processor
 ata_generic btrfs hid_logitech_hidpp xor raid6_pq sr_mod cdrom
 nvidia_uvm(PO) nvidia(PO) firewire_ohci firewire_core crc_itu_t uas
 usb_storage r8169 mii pata_jmicron hid_logitech_dj drm button sg
 Jun 11 23:13:08 mars kernel: CPU: 2 PID: 2777 Comm: kworker/u8:0
 Tainted: P O 4.0.4-3-desktop #1

openSUSE Tumbleweed, I take it?

We still actively support btrfs-convert through SLES, so we're
invested in ensuring it continues working properly.  I'd be interested
in seeing images of both filesystems to investigate and to see if we
can reproduce it. Errno -95 is -EOPNOTSUPP which is kind of strange to
see.  I can see a few possible places it would get passed up with a
trace like that, but being able to reproduce it would be extremely helpful.
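
For readers who want to decode such a value themselves, here is a small sketch that maps a kernel-style negative errno to its symbolic name(s) using Python's standard errno table. The function name is made up for illustration; on Linux, 95 is EOPNOTSUPP, and ENOTSUP is an alias with the same value, so both names may be reported.

```python
import errno
import os


def decode_kernel_errno(value):
    """Return (symbolic names, message) for a kernel-style negative errno."""
    code = abs(value)
    # Several names can share one number (e.g. EOPNOTSUPP and ENOTSUP on
    # Linux), so collect every constant in the errno module that matches.
    names = sorted(
        name for name in dir(errno)
        if name.startswith("E") and getattr(errno, name) == code
    )
    return names, os.strerror(code)
```

Calling decode_kernel_errno(-95) on the affected machine should list EOPNOTSUPP among the names; the numeric values are platform-specific, so the result is only meaningful on the same kind of system that produced the log.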

-Jeff


-- 
Jeff Mahoney

Re: btrfs differential receive has become excrutiatingly slow on one machine

2015-06-17 Thread Marc MERLIN
On Wed, May 13, 2015 at 12:35:20PM +0100, Filipe David Manana wrote:
  It's a broad question, but how can I diagnose btrfs send being so slow
  without taking the risk of killing my connection?
  (if there is no good answer on this one, I can try another sync later
  with -vvv and strace if you'd like)
 
 Try to see if it's a problem at the sending side or at the receiving
 side. Redirect send's output to a file, see how much it takes. Then
 run receive with that file as input and see how long it takes.
 You can also use 'perf record -ag' while doing both, it might give
 some useful information.

Hi Filipe, sorry this took a while, I've been away and then had my
server hardware die, but things are back up and I'm now trying to sync a
100GB btrfs diff from my laptop to my server.

I've confirmed that btrfs send is fast (I sent it to a file).
Then scp of the diff is fast too (45 min for 100GB over wireless).
But the restore is slow. I see files going by quickly, and then it hangs and
restarts.
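
The measurement suggested above (send to a file, then receive from that file, timing each side separately) could be scripted roughly as follows. This is only a sketch: the helper names and all path arguments are placeholders, and it assumes an incremental send with -p and the -f file options of btrfs send and btrfs receive.

```python
import subprocess
import sys
import time


def timed_run(argv):
    """Run a command to completion and return its wall-clock time in seconds."""
    start = time.monotonic()
    subprocess.run(argv, check=True)
    return time.monotonic() - start


def measure_send_receive(snapshot, parent, stream_file, dest_dir):
    """Time 'btrfs send' into a file and 'btrfs receive' from it, separately."""
    # All four arguments are placeholders to be supplied by the caller.
    send = ["btrfs", "send", "-p", parent, "-f", stream_file, snapshot]
    receive = ["btrfs", "receive", "-f", stream_file, dest_dir]
    return timed_run(send), timed_run(receive)
```

Keeping the stream in a file takes the network and the pipe between the two processes out of the picture, so whichever of the two durations dominates points at the slow side.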

You requested strace -T in the past. I'm showing an excerpt of system calls that
take more than 1 second.

When I see this, I get worried:
truncate(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/merlin-change/Maildir.google/lists2/new/1432663866_0.19916.legolas,U=427356,FMD5=7e806062200fb6d33546530d24aac86c:2,,
 21043) = 0 19.335333

Or this:
unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/media/tuners/mt2266.mod.dwo)
 = 0 28.298224
unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/merlin-change/Maildir.google/INBOX/cur/1432061846_0.2789.legolas,U=381014,FMD5=7e33429f656f1e6e9d79b29c3f82c57e:2,)
 = 0 45.084068

19 seconds for a truncate, or 28 or 45 seconds for an unlink, cannot be right,
of course.
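
As an alternative to grepping, the slow calls can be pulled out of an strace -T log with a few lines of Python. This is a sketch that assumes one syscall record per line with the duration at the end; strace -T normally prints it as <seconds>, and the angle brackets are treated as optional because the excerpts in this mail show the bare number. The function name and threshold parameter are made up for illustration.

```python
import re

# strace -T appends the time spent in each syscall at the end of the line,
# normally as "<seconds>"; the brackets are optional in this pattern.
DURATION = re.compile(r"<?(\d+\.\d+)>?\s*$")


def slow_calls(lines, threshold=1.0):
    """Yield (seconds, syscall_name) for calls slower than threshold."""
    for line in lines:
        match = DURATION.search(line)
        if not match:
            continue
        seconds = float(match.group(1))
        if seconds > threshold:
            # The syscall name is whatever precedes the first "(".
            yield seconds, line.split("(", 1)[0].strip()
```

Sorting the result in descending order gives a quick list of the worst stalls, which is easier to read than scanning the raw log.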

It's btrfs over dmcrypt over swraid5 (5 drives). Filesystem is only half full:

gargamel:~# btrfs fi show /mnt/btrfs_pool2/
Label: 'dshelf2'  uuid: d4a51178-c1e6-4219-95ab-5c5864695bfd
Total devices 1 FS bytes used 4.34TiB
devid1 size 7.28TiB used 4.43TiB path /dev/mapper/dshelf2

gargamel:~# btrfs fi df /mnt/btrfs_pool2/
Data, single: total=4.28TiB, used=4.28TiB
System, DUP: total=8.00MiB, used=496.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=77.50GiB, used=68.05GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=192.00KiB
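
To read the numbers above per chunk type rather than per device, a short sketch like the following computes used/total for each line of 'btrfs fi df'. The parser and its names are assumptions written for this output format only; with the figures above it gives about 88% for Metadata DUP and 100% for Data, which is just arithmetic on the output and not a diagnosis.

```python
import re

UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}

# Matches lines such as "Metadata, DUP: total=77.50GiB, used=68.05GiB".
LINE = re.compile(
    r"^(?P<type>\w+), (?P<profile>\w+): "
    r"total=(?P<total>[\d.]+)(?P<tu>[KMGT]iB|B), "
    r"used=(?P<used>[\d.]+)(?P<uu>[KMGT]iB|B)$"
)


def parse_fi_df(text):
    """Return a list of (type, profile, used_fraction) tuples."""
    rows = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        total = float(match["total"]) * UNITS[match["tu"]]
        used = float(match["used"]) * UNITS[match["uu"]]
        rows.append((match["type"], match["profile"],
                     used / total if total else 0.0))
    return rows
```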


More system calls:

gargamel:~# grep ' [1-9]'  /mnt/btrfs_pool2/strace
ioctl(3, BTRFS_IOC_SNAP_CREATE_V2, 0x7ffca6be3a40) = 0 11.430413
link(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/www/Pix/new/tmp01/dsc05964.jpg,
 
/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/www/Pix/albums/Outings/Dinners/Misc/20150228_Alexander_Patisserie.jpg)
 = 0 1.572029
) = 93 1.195424
truncate(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/merlin-change/.config/google-chrome-beta/Profile
 1/Storage/ext/beknehfpfkghjoafdifaflglpjkojoco/def/Cookies, 7168) = 0 
53.618366
unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/iio/dac/.ad7303.ko.cmd)
 = 0 64.146415
unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/iio/dac/.mcp4725.ko.cmd)
 = 0 10.403782
unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/iio/dac/ad5360.mod.dwo)
 = 0 2.180297
unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/iio/dac/ad5360.mod.o)
 = 0 1.278088
unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/md/persistent-data/built-in.o)
 = 0 3.706420
unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/media/tuners/mt2266.mod.dwo)
 = 0 28.298224
unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/media/usb/dvb-usb-v2/.tmp_af9035.dwo)
 = 0 7.626091
unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/net/wimax/i2400m/i2400m.ko)
 = 0 25.749138
unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/merlin-change/.config/google-chrome-unstable/Default/Cache/f_002221)
 = 0 1.751792

Re: btrfs differential receive has become excrutiatingly slow on one machine

2015-06-17 Thread Marc MERLIN
On Wed, Jun 17, 2015 at 10:58:05AM -0700, Marc MERLIN wrote:
 You requested strace -T in the past. I'm showing an excerpt of system calls
 that take more than 1 second.
 
 When I see this, I get worried:
 truncate(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/merlin-change/Maildir.google/lists2/new/1432663866_0.19916.legolas,U=427356,FMD5=7e806062200fb6d33546530d24aac86c:2,,
  21043) = 0 19.335333
 
 Or this:
 unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/src/linux-3.19.8-amd64-i915-volpreempt-s20150421/drivers/media/tuners/mt2266.mod.dwo)
  = 0 28.298224
 unlink(/mnt/btrfs_pool2/backup/debian64/legolas/varchange_ggm_daily_ro.20150616_23:06:10/merlin-change/Maildir.google/INBOX/cur/1432061846_0.2789.legolas,U=381014,FMD5=7e33429f656f1e6e9d79b29c3f82c57e:2,)
  = 0 45.084068
 
 19 seconds for a truncate, or 28 or 45 seconds for an unlink, cannot be right,
 of course.

Interesting. The restore only took 2.5h.
It's still too long but not as bad as I thought.

But now I think I understand what's going on: because of the frequent
pauses of a few seconds to 30s or more, the TCP flow is totally destroyed.
The sender stops, restarts sending slowly and ramps up the speed, only to
be stopped again.

Given that, no wonder it can take 12h or more when I have send talk to
receive over the network.

So now the question is why the receive pauses for so long, and pseudo-randomly.

Is there anything I can provide on that filesystem that would help?

Thanks,
Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


[PATCH] check: check so offset is not bigger then the leaf

2015-06-17 Thread Robert Marklund
Previously this could crash because of a dangerous dangling
pointer offset.

Signed-off-by: Robert Marklund robbelibob...@gmail.com
---
 cmds-check.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index 778f141..da36758 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -8906,6 +8906,16 @@ static int build_roots_info_cache(struct btrfs_fs_info *info)
 			goto next;
 
 		ei = btrfs_item_ptr(leaf, slot, struct btrfs_extent_item);
+
+		if ((long long)ei > info->extent_root->leafsize) {
+			fprintf(stderr, "Bad leaf = %p, slot = %d\n", leaf, slot);
+			fprintf(stderr, "item ptr = %p\n", ei);
+			fprintf(stderr, "objectid = %llx\n", found_key.objectid);
+			fprintf(stderr, "type = %x\n", found_key.type);
+			fprintf(stderr, "offset   = %llx\n", found_key.offset);
+			goto next;
+		}
+
 		flags = btrfs_extent_flags(leaf, ei);
 
 		if (found_key.type == BTRFS_EXTENT_ITEM_KEY &&
-- 
2.1.0
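
The idea of the check, reduced to its essentials, is to validate an item's offset against the leaf size before reading through it. The sketch below illustrates that in Python; it assumes the item pointer really is an offset into the leaf (as the comparison against leafsize suggests), and the function names and the extra size check are illustrative additions, not part of the patch.

```python
def item_in_leaf(item_offset, item_size, leaf_size):
    """Return True if [item_offset, item_offset + item_size) fits in the leaf."""
    return (
        item_offset >= 0
        and item_size >= 0
        and item_offset + item_size <= leaf_size
    )


def read_item(leaf, item_offset, item_size):
    """Return the item bytes, refusing offsets that point outside the leaf."""
    if not item_in_leaf(item_offset, item_size, len(leaf)):
        raise ValueError(
            "item at offset %d (size %d) is outside the %d-byte leaf"
            % (item_offset, item_size, len(leaf))
        )
    return leaf[item_offset:item_offset + item_size]
```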



btrfs-progs 4.1-rc1: btrfstune -u reporting incorrect current fsid?

2015-06-17 Thread Mike Fleetwood
Hi,

I've done a quick test of changing the UUID of a btrfs filesystem.  It worked,
but btrfstune -u didn't print the same current UUID that btrfs fi sh does.
It also upper-cases the UUID, whereas btrfs fi sh and blkid don't.

Thanks,
Mike

# btrfs filesystem show /dev/sdb1 | fgrep uuid
Label: none  uuid: b2813976-4d8b-4976-9d59-cbfbd588399c

# ~fedora/programming/c/btrfs-progs-unstable/btrfstune -f -u /dev/sdb1
Current fsid: ---00B0-8F12937F
New fsid: D294F3F3-F2B7-4407-B83A-DE5A4F8CBAB1
Set superblock flag CHANGING_FSID
Change fsid in extents
Change fsid on devices
Clear superblock flag CHANGING_FSID
Fsid change finished

# btrfs filesystem show /dev/sdb1 | fgrep uuid
Label: none  uuid: d294f3f3-f2b7-4407-b83a-de5a4f8cbab1

# blkid | fgrep sdb1
/dev/sdb1: UUID="d294f3f3-f2b7-4407-b83a-de5a4f8cbab1"
UUID_SUB="70065403-5ec1-462c-93a4-26cff8b6aea2" TYPE="btrfs"
PARTUUID="b309c48c-486f-4882-896c-34d4d0aeb529"
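
For comparing the outputs of the different tools, it helps to normalise every fsid to the lowercase 8-4-4-4-12 form that btrfs fi show and blkid print. A minimal sketch using Python's uuid module follows; canonical_fsid is a made-up helper name, and it only normalises formatting, it does not explain the wrong "Current fsid" line.

```python
import uuid


def canonical_fsid(raw):
    """Format a 16-byte fsid or a UUID string in lowercase 8-4-4-4-12 form."""
    if isinstance(raw, (bytes, bytearray)):
        return str(uuid.UUID(bytes=bytes(raw)))
    return str(uuid.UUID(raw))
```

Passing the "New fsid" string printed by btrfstune through this helper yields the same text that btrfs filesystem show reports afterwards.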


Re: [systemd-devel] [survey] BTRFS_IOC_DEVICES_READY return status

2015-06-17 Thread Lennart Poettering
On Wed, 17.06.15 21:10, Goffredo Baroncelli (kreij...@libero.it) wrote:

  Well, /bin/mount is not a daemon, and it should not be one.
 
 My helper is not a daemon; you were correct the first time: it blocks
 until all needed (or enough) devices have appeared.
 Anyway, this should not be different from mounting an NFS
 filesystem. Even in that case the mount helper blocks until the
 connection is established. The blocking time is not negligible, even
 though it is not as long as a device timeout ...

Well, the mount tool doesn't wait for the network to be configured or
so. It just waits for a response from the server. That's quite a
difference.

  Well, it's not really ugly. I mean, if the state or properties of a
  device change, then udev should update its information about it, and
  that's done via a retrigger. We do that all the time already, for
  example when an existing loopback device gets a backing file assigned
  or removed. I am pretty sure that loopback case is very close to what
  you want to do here, hence retriggering (either from the kernel side,
  or from userspace), appears like an OK thing to do.
 
 What seems strange to me is that in this case the devices have not changed
 their status.
 How is this problem handled in the md/dm RAID cases?

md has a daemon mdmon to my knowledge.

Lennart

-- 
Lennart Poettering, Red Hat


Re: [RFC PATCH v2 2/2] Btrfs: improve fsync for nocow file

2015-06-17 Thread Liu Bo
On Wed, Jun 17, 2015 at 05:58:36PM +0200, David Sterba wrote:
 On Wed, Jun 17, 2015 at 03:54:32PM +0800, Liu Bo wrote:
  If we're overwriting an allocated file without changing the timestamp
  and inode version, and the file is NODATACOW, we don't have any
  metadata to commit, so we can just flush the data device cache and go
  forward.
  
  However, if there is any change to the extents' disk bytenr, inode size,
  timestamp or inode version, we need to go through the normal btrfs_log_inode
  path.
  
  Test:
  
  1. sysbench test of
  1 file + 1 thread + bs=4k + size=40k + synchronous I/O mode + randomwrite +
  fsync_on_each_write,
  2. loop device associated with tmpfs file
  3.
- For btrfs, -o nodatacow and -o noi_version option
- For ext4 and xfs, no extra mount options
  
  
  Results:
  
  - btrfs:
  w/o: ~30Mb/sec
  w:   ~131Mb/sec
  
  - other filesystems: (both don't enable i_version by default)
  ext4:  203Mb/sec
  xfs:   212Mb/sec
  
  
  Signed-off-by: Liu Bo bo.li@oracle.com
  ---
  v2: Catch errors from data writeback and skip barrier if necessary.
  
   fs/btrfs/btrfs_inode.h |2 +
   fs/btrfs/disk-io.c |2 +-
   fs/btrfs/disk-io.h |1 +
   fs/btrfs/file.c|   54 
  +++
   fs/btrfs/inode.c   |3 ++
   5 files changed, 56 insertions(+), 6 deletions(-)
  
  diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
  index de5e4f2..f7b99b6 100644
  --- a/fs/btrfs/btrfs_inode.h
  +++ b/fs/btrfs/btrfs_inode.h
  @@ -44,6 +44,8 @@
   #define BTRFS_INODE_IN_DELALLOC_LIST   9
   #define BTRFS_INODE_READDIO_NEED_LOCK  10
   #define BTRFS_INODE_HAS_PROPS  11
  +#define BTRFS_INODE_NOTIMESTAMP		12
  +#define BTRFS_INODE_NOISIZE			13
 
 It's not clear what the flags mean and that they're related to syncing
 under some conditions.

What do you think about BTRFS_ILOG_NOTIMESTAMP and BTRFS_ILOG_NOISIZE? 

 
   /*
* The following 3 bits are meant only for the btree inode.
* When any of them is set, it means an error happened while writing an
  --- a/fs/btrfs/disk-io.h
  +++ b/fs/btrfs/disk-io.h
  @@ -60,6 +60,7 @@ void close_ctree(struct btrfs_root *root);
   int write_ctree_super(struct btrfs_trans_handle *trans,
struct btrfs_root *root, int max_mirrors);
   struct buffer_head *btrfs_read_dev_super(struct block_device *bdev);
  +int barrier_all_devices(struct btrfs_fs_info *info);
   int btrfs_commit_super(struct btrfs_root *root);
   struct extent_buffer *btrfs_find_tree_block(struct btrfs_root *root,
  u64 bytenr);
  diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
  index faa7d39..180a3e1 100644
  --- a/fs/btrfs/file.c
  +++ b/fs/btrfs/file.c
  @@ -523,8 +523,12 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
   	 * the disk i_size.  There is no need to log the inode
   	 * at this time.
   	 */
  -	if (end_pos > isize)
  +	if (end_pos > isize) {
   		i_size_write(inode, end_pos);
  +		clear_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
  +	} else {
  +		set_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
  +	}
   	return 0;
   }
   
  @@ -1715,19 +1719,33 @@ out:
   static void update_time_for_write(struct inode *inode)
   {
   	struct timespec now;
  +	int sync_it = 0;
   
  -	if (IS_NOCMTIME(inode))
  +	if (IS_NOCMTIME(inode)) {
  +		set_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
   		return;
  +	}
   
   	now = current_fs_time(inode->i_sb);
  -	if (!timespec_equal(&inode->i_mtime, &now))
  +	if (!timespec_equal(&inode->i_mtime, &now)) {
   		inode->i_mtime = now;
  +		sync_it = S_MTIME;
  +	}
   
  -	if (!timespec_equal(&inode->i_ctime, &now))
  +	if (!timespec_equal(&inode->i_ctime, &now)) {
   		inode->i_ctime = now;
  +		sync_it |= S_CTIME;
  +	}
   
  -	if (IS_I_VERSION(inode))
  +	if (IS_I_VERSION(inode)) {
   		inode_inc_iversion(inode);
  +		sync_it |= S_VERSION;
  +	}
  +
  +	if (!sync_it)
  +		set_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
  +	else
  +		clear_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
   }
   
   static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
  @@ -1983,6 +2001,32 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
   		goto out;
   	}
   
  +	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) {
  +		if (test_and_clear_bit(BTRFS_INODE_NOTIMESTAMP,
  +				&BTRFS_I(inode)->runtime_flags) &&
  +		    test_and_clear_bit(BTRFS_INODE_NOISIZE,
  +   

Re: Automatic balance after mkfs?

2015-06-17 Thread Qu Wenruo



Austin S Hemmelgarn wrote on 2015/06/16 09:21 -0400:

On 2015-06-16 09:13, Holger Hoffstätte wrote:


Forking from the other thread..

On Tue, 16 Jun 2015 12:25:45 +, Hugo Mills wrote:


Yes. It's an artefact of the way that mkfs works. If you run a
balance on those chunks, they'll go away. (btrfs balance start
-dusage=0 -musage=0 /mountpoint)


Since I had to explain this very same thing to a new btrfs-using friend
just yesterday I wondered if it might not make sense for mkfs to issue
a general balance after creating the fs? It should be simple enough
(just issue the balance ioctl?) and not have any negative side effects.

Just doing such a post-mkfs cleanup automatically would certainly
reduce the number of times we have to explain this. :)

Any reasons why we couldn't/shouldn't do this?


Following the same line of thinking, is there any reason we couldn't
just rewrite mkfs to get rid of this legacy behavior?


Compared to the more complex auto balance, rewriting mkfs is a much better
idea.


The original mkfs seems easy for developers, but bad for users.

I like the idea and I'll try to implement it if I have spare time.

Thanks.
Qu
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: trim not working and irreparable errors from btrfsck

2015-06-17 Thread Duncan
Austin S Hemmelgarn posted on Wed, 17 Jun 2015 13:17:22 -0400 as
excerpted:

 On 2015-06-17 11:40, Christian wrote:
 On 06/17/2015 11:28 AM, Chris Murphy wrote:


 However, fstrim still gives me 0 B (0 bytes) trimmed, so that may be
 another problem. Is there a way to check if trim works?

 That sounds like maybe your SSD is blacklisted for trim, is all I can
 think of. So trim shouldn't be the cause of the problem if it's being
 blacklisted. The recent problems appear to be around newer SSDs that
 support queued trim and newer kernels that issue queued trim. There
 have been some patches related to trim to the kernel, but the
 existence of blacklisting and claims of bugs in firmware make it
 difficult to test and isolate.

 http://techreport.com/news/28473/some-samsung-ssds-may-suffer-from-a-buggy-trim-implementation


 This is an Intel SSD in a Lenovo Thinkpad X1 Carbon. Trim worked until
 a few weeks ago and still works for my small ext4 boot partition (just
 ran it to check). I will keep looking for a solution. Thanks!

 I'm seeing the same issue here, but with a Crucial brand SSD.  Somewhat
 interestingly, I don't see any issues like this with BTRFS on top of
 LVM's thin-provisioning volumes, or with any other filesystems, so I
 think it has something to do with how BTRFS is reporting unused space or
 how it is submitting the discard requests.

FWIW, there's a current btrfs patch in progress that relates to problems 
with btrfs trim.

But while I do have SSDs, I purposefully overprovisioned them by nearly 
100% (IOW I partitioned only about 55%, the rest is entirely unused), so 
trim isn't as critical here as it is for many.  I don't use the discard 
mount option, and have a systemd timer job setup to automate my fstrims 
and don't worry about the output too much, so I haven't been following 
the patch progress /that/ closely.

But I do know that recent kernel btrfs trims (either fstrim or discard 
mount option triggered) haven't been working as originally intended due 
to some bug, and this patch is supposed to fix it.

I'd thus conclude that you're very likely hitting this known issue, and 
that either for 4.1 or 4.2 (again, I'm not following progress that 
closely, and don't remember for sure if it's in 4.1, altho I've been 
running the rcs since rc6 or so), the problem should be fixed as that 
patch gets into mainline.

Anyone wishing to investigate further can of course check the list (and/
or possibly the kernel's git log) for discard/trim related patches and 
follow the progress once found.

... Actually, just checked myself.  Looks like the patches were first 
posted on March 30 @ 15:12:17 -0400 or so (that's the time for one of 
them).  There's one for the discard mount option, and another for FITRIM 
(which may or may not be a typo for FSTRIM, I'm not actually sure).  Jeff 
Mahoney je...@suse.com author.  That should be enough to find the 
threads.  And I don't see the patches in the late 4.1-rc I'm running so 
either my git log search foo is bad or it'll be (at least) 4.2.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: [systemd-devel] [survey] BTRFS_IOC_DEVICES_READY return status

2015-06-17 Thread Andrei Borzenkov
On Wed, 17 Jun 2015 23:02:02 +0200
Lennart Poettering lenn...@poettering.net wrote:

 On Wed, 17.06.15 21:10, Goffredo Baroncelli (kreij...@libero.it) wrote:
 
   Well, /bin/mount is not a daemon, and it should not be one.
  
  My helper is not a daemon; you were correct the first time: it blocks
  until all needed/enough devices have appeared.
  Anyway this should not be different from mounting an NFS
  filesystem. Even in this case the mount helper blocks until the
  connection has happened. The block time is not negligible, even though
  it is not as long as a device timeout ...
 
 Well, the mount tool doesn't wait for the network to be configured or
 so. It just waits for a response from the server. That's quite a
 difference.
 
   Well, it's not really ugly. I mean, if the state or properties of a
   device change, then udev should update its information about it, and
   that's done via a retrigger. We do that all the time already, for
   example when an existing loopback device gets a backing file assigned
   or removed. I am pretty sure that loopback case is very close to what
   you want to do here, hence retriggering (either from the kernel side,
   or from userspace), appears like an OK thing to do.
  
  What seems strange to me is that in this case the devices haven't
  changed their status.
  How is this problem managed in the md/dm raid cases?
 
 md has a daemon mdmon to my knowledge.
 

No, mdmon does something different. What mdadm does is start a timer
when the RAID is complete enough to be started in degraded mode. If
notifications for the missing devices appear after that, the RAID is started
normally. If no notification appears before the timer expires, the RAID is
started in degraded mode.

ACTION=="add|change", IMPORT{program}="BINDIR/mdadm --incremental --export $devnode --offroot ${DEVLINKS}"
ACTION=="add|change", ENV{MD_STARTED}=="*unsafe*", ENV{MD_FOREIGN}=="no", ENV{SYSTEMD_WANTS}+="mdadm-last-resort@$env{MD_DEVICE}.timer"
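
The timer logic described above can be sketched roughly as follows. This is only a Python illustration of the decision, not mdadm's code; the function name `assembly_decision` and its parameters are hypothetical.

```python
def assembly_decision(present, expected, min_for_degraded, timer_expired):
    """Model the start decision for an array as members show up.

    Hypothetical model of the behaviour described in this mail,
    not taken from mdadm itself.
    """
    if present >= expected:
        return "start-normal"      # every member has appeared
    if present < min_for_degraded:
        return "wait"              # cannot even be started degraded yet
    # Complete enough for degraded mode: the last-resort timer is running.
    return "start-degraded" if timer_expired else "wait-with-timer"


# A 4-member array that can run with 3 members:
print(assembly_decision(3, 4, 3, timer_expired=False))
print(assembly_decision(3, 4, 3, timer_expired=True))
print(assembly_decision(4, 4, 3, timer_expired=False))
```

The point of the sketch is that nothing blocks: each device notification re-evaluates the state, and only the timer expiry forces the degraded start.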



RE: trim not working and irreparable errors from btrfsck

2015-06-17 Thread Paul Jones
 -Original Message-
 From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
 ow...@vger.kernel.org] On Behalf Of Christian
 Sent: Thursday, 18 June 2015 12:34 AM
 To: linux-btrfs@vger.kernel.org
 Subject: Re: trim not working and irreparable errors from btrfsck
 
 On 06/17/2015 10:22 AM, Chris Murphy wrote:
  On Wed, Jun 17, 2015 at 6:56 AM, Christian Dysthe cdys...@gmail.com
 wrote:
  Hi,
 
  Sorry for asking more about this. I'm not a developer but trying to learn.
  In my case I get several errors like this one:
 
  root 2625 inode 353819 errors 400, nbytes wrong
 
  Is it inode 353819 I should focus on and what is the number after
  root, in this case 2625?
 
  I'm going to guess it's tree root 2625, which is the same thing as fs
  tree, which is the same thing as subvolume. Each subvolume has its own
  inodes. So on a given Btrfs volume, an inode number can exist more
  than once, but in separate subvolumes. When you use btrfs inspect
  inode it will list all files with that inode number, but only the one
  in subvol ID 2625 is what you care about deleting and replacing.
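
To make the point about inode numbers concrete, here is a small Python sketch. The listing is made up for illustration (only root 2625 and inode 353819 come from the error message above), and `find_file` is a hypothetical helper, not a btrfs tool.

```python
# Hypothetical (subvolume id, inode number, path) listing; the paths and the
# other subvolume ids are invented for the example.
files = [
    (5,    353819, "/etc/hostname"),
    (2625, 353819, "/home/user/broken-file"),
    (2630, 353819, "/snapshots/x/other-file"),
]


def find_file(entries, root, inode):
    """Return the paths matching both root and inode.

    The inode number alone is ambiguous across subvolumes; the
    (root, inode) pair is what identifies a file.
    """
    return [path for r, i, path in entries if r == root and i == inode]


print(len([e for e in files if e[1] == 353819]))  # three candidates by inode alone
print(find_file(files, 2625, 353819))             # one match once root is given
```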
 
 Thanks! Deleting the file for that inode took care of it. No more errors.
 Restored it from a backup.
 
 However, fstrim still gives me 0 B (0 bytes) trimmed, so that may be another
 problem. Is there a way to check if trim works?

I've got the same problem. I've got 2 SSDs with 2 partitions in RAID1, fstrim 
always works on the 2nd partition but not the first. There are no errors on 
either filesystem that I know of, but the first one is root so I can't take it 
offline to run btrfs check.

Paul.

Re: [PATCH] fstests: generic test for fsync after adding hard link to a file

2015-06-17 Thread Eryu Guan
On Wed, Jun 17, 2015 at 12:52:16PM +0100, fdman...@kernel.org wrote:
 From: Filipe Manana fdman...@suse.com
 
 This test is motivated by an issue found in btrfs.
 
 It tests that after syncing the filesystem, adding a hard link to a file,
 syncing the filesystem again, doing a write to the file that increases
 its size and then doing a fsync against that file, durably persists the
 data written to the file. That is, after log/journal replay, the data
 is available.
 
 The btrfs issue is fixed by the commit titled:
 
   Btrfs: fix fsync data loss after append write
 
 Signed-off-by: Filipe Manana fdman...@suse.com

Looks good to me. Tested on ext2/3/4 xfs and btrfs, btrfs fails the
test, and _notrun on ext2, as expected.

Reviewed-by: Eryu Guan eg...@redhat.com

 ---
  tests/generic/090     | 108 ++
  tests/generic/090.out |  17 
  tests/generic/group   |   1 +
  3 files changed, 126 insertions(+)
  create mode 100755 tests/generic/090
  create mode 100644 tests/generic/090.out
 
 diff --git a/tests/generic/090 b/tests/generic/090
 new file mode 100755
 index 0000000..a1f2b89
 --- /dev/null
 +++ b/tests/generic/090
 @@ -0,0 +1,108 @@
 +#! /bin/bash
 +# FS QA Test No. 090
 +#
 +# Test that after syncing the filesystem, adding a hard link to a file,
 +# syncing the filesystem again, doing a write to the file that increases
 +# its size and then doing a fsync against that file, durably persists the
 +# data written to the file. That is, after log/journal replay, the data
 +# is available.
 +#
 +# This test is motivated by a bug found in btrfs.
 +#
 +#---
 +# Copyright (C) 2015 SUSE Linux Products GmbH. All Rights Reserved.
 +# Author: Filipe Manana fdman...@suse.com
 +#
 +# This program is free software; you can redistribute it and/or
 +# modify it under the terms of the GNU General Public License as
 +# published by the Free Software Foundation.
 +#
 +# This program is distributed in the hope that it would be useful,
 +# but WITHOUT ANY WARRANTY; without even the implied warranty of
 +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 +# GNU General Public License for more details.
 +#
 +# You should have received a copy of the GNU General Public License
 +# along with this program; if not, write the Free Software Foundation,
 +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
 +#---
 +#
 +
 +seq=`basename $0`
 +seqres=$RESULT_DIR/$seq
 +echo "QA output created by $seq"
 +
 +here=`pwd`
 +tmp=/tmp/$$
 +status=1 # failure is the default!
 +
 +_cleanup()
 +{
 + _cleanup_flakey
 + rm -f $tmp.*
 +}
 +trap "_cleanup; exit \$status" 0 1 2 3 15
 +
 +# get standard environment, filters and checks
 +. ./common/rc
 +. ./common/filter
 +. ./common/dmflakey
 +
 +# real QA test starts here
 +_supported_fs generic
 +_supported_os Linux
 +_need_to_be_root
 +_require_scratch
 +_require_dm_flakey
 +_require_metadata_journaling $SCRATCH_DEV
 +
 +rm -f $seqres.full
 +
 +_scratch_mkfs >> $seqres.full 2>&1
 +_init_flakey
 +_mount_flakey
 +
 +# Create the test file with some initial data and then fsync it.
 +# The fsync here is only needed to trigger the issue in btrfs, as it causes the
 +# flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
 +$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
 + -c "fsync" \
 + $SCRATCH_MNT/foo | _filter_xfs_io
 +sync
 +
 +# Add a hard link to our file.
 +# On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
 +# which is a necessary condition to trigger the issue.
 +ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar
 +
 +# Sync the filesystem to force a commit of the current btrfs transaction, this
 +# is a necessary condition to trigger the bug on btrfs.
 +sync
 +
 +# Now append more data to our file, increasing its size, and fsync the file.
 +# In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
 +# write path did not update the inode item in the btree nor the delayed inode
 +# item (in memory structure) in the current transaction (created by the fsync
 +# handler), the fsync did not record the inode's new i_size in the fsync
 +# log/journal. This made the data unavailable after the fsync log/journal is
 +# replayed.
 +$XFS_IO_PROG -c "pwrite -S 0xbb 32K 32K" \
 + -c "fsync" \
 + $SCRATCH_MNT/foo | _filter_xfs_io
 +
 +echo "File content after fsync and before crash:"
 +od -t x1 $SCRATCH_MNT/foo
 +
 +# Simulate a crash/power loss.
 +_load_flakey_table $FLAKEY_DROP_WRITES
 +_unmount_flakey
 +
 +# Allow writes again and mount. This makes the fs replay its fsync log.
 +_load_flakey_table $FLAKEY_ALLOW_WRITES
 +_mount_flakey
 +
 +echo "File content after crash and log replay:"
 +od -t x1 $SCRATCH_MNT/foo
 +
 +status=0
 +exit
 diff --git a/tests/generic/090.out b/tests/generic/090.out
 new 

Re: trim not working and irreparable errors from btrfsck

2015-06-17 Thread Chris Murphy
And you might as well just attach a full dmesg to the bug report too.
Who knows there might be something buried in there that's useful. The
cleanest approach is to reboot and then reproduce all of this with
fstrim on each volume, then capture the dmesg to a file. Then do the
strace fstrim for each volume. But a bunch of kernel message clutter
may be better than no messages at all.


Chris Murphy


Re: RAID10 Balancing Request for Comments and Advices

2015-06-17 Thread Duncan
Hugo Mills posted on Wed, 17 Jun 2015 13:27:36 + as excerpted:

 Yes, on this 80% full 6x4TB RAID10 -dusage=15 took 2 seconds and
 relocated "0 out of 3026 chunks".
 
 Out of curiosity, I had to use -dusage=90 to have it relocate only 1
 chunk and it took less than 30 seconds.
 
 So I put a -dusage=25 in the weekly cron just before the scrub.
 
 In most cases, all you need to do is clean up one data chunk to
 give the metadata enough space to work in. Instead of manually iterating
 through several values of usage= until you get a useful response, you
 can use limit=n to stop after n successful block group relocations.

Thanks, Hugo.  It wasn't previously clear to me what the practical usage 
for the (relatively new) limit= filter was.  Very useful explanation. =:^)
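
For anyone else who was unsure how the two filters interact, here is a rough Python model of the selection. It is a simplification under stated assumptions, not the kernel's logic: chunk fill levels are plain percentages, `select_chunks` is a hypothetical name, and usage=0 is modelled as "completely empty chunks only".

```python
def select_chunks(usages, usage=None, limit=None):
    """Return indexes of chunks a balance with these filters would relocate.

    usages: per-chunk fill level in percent (simplified model).
    usage:  only chunks filled below this percentage match.
    limit:  stop after this many chunks have been picked.
    """
    picked = []
    for idx, pct in enumerate(usages):
        if usage is not None:
            # usage=0 is modelled as "empty chunks only".
            matches = (pct == 0) if usage == 0 else (pct < usage)
            if not matches:
                continue
        picked.append(idx)
        if limit is not None and len(picked) >= limit:
            break
    return picked


chunks = [95, 0, 40, 5, 12]
print(select_chunks(chunks, usage=0))            # only the empty chunk
print(select_chunks(chunks, usage=15))           # everything under 15% full
print(select_chunks(chunks, usage=50, limit=1))  # first match, then stop
```

With limit= there is no need to guess a usage= value that happens to match exactly one chunk; a generous usage= plus limit=1 does the same job in one pass.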

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: How do I make 'btrfs scrub' report errors via email?

2015-06-17 Thread Marc MERLIN
On Thu, Jun 18, 2015 at 09:56:09AM +0900, crocket wrote:
 I think that's not going to report only errors.

Outside of saying how long the scrub took, that's all it does.

If you're not quite happy with the output, grep -v is your friend :)

Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


Re: [PATCH RFC] btrfs: csum: Introduce partial csum for tree block.

2015-06-17 Thread Qu Wenruo

Ping?

Any new comments?

Thanks,
Qu

Qu Wenruo wrote on 2015/06/16 10:39 +0800:



Qu Wenruo wrote on 2015/06/16 09:22 +0800:



David Sterba wrote on 2015/06/15 15:15 +0200:

On Mon, Jun 15, 2015 at 04:02:49PM +0800, Qu Wenruo wrote:

In the following case of corruption, RAID1 or DUP will fail to recover
it (using 16K as the leafsize):
0      4K     8K     12K    16K
Mirror 0:
|-OK--|ERROR|-OK-|

Mirror 1:
|OK---|--Error-|

Since the CRC32 stored in the header is calculated over the whole leaf,
both mirrors will fail the CRC32 check.

But the corruption is in a different position in each mirror. In fact, if we
know where the corruption is (it need not be very accurate), we can recover
the tree block by using the correct parts.

In the above example, we can just use the correct 0~12K from mirror 1
and then 12K~16K from mirror 0.


If the mirror 0 copy is intact, you can use it entirely. Your
improvement could help if each mirror is partially broken but we can
find good copies of all 4k blocks among all mirrors.

The natural question is how often this happens and if it's worth adding
the code complexity and what's the estimated speed drop.

I think the conditions are very rare and that we could add minimal code
to attempt to build the metadata block from the available copies without
the separate block checksums. This is an immediate idea so I could have
missed something:

* if a metadata-block checksum mismatches, do a direct comparison of the
   metadata-blocks in all available mirrors
   * if they match and checksums match, no help, it's damaged
   * if there's a good copy (ie the original checksum or data were
 corrupted), use it
   * otherwise attempt to rebuild the metadata block from what's
available

* by direct comparisons of the 4k blocks, find the first where the
   metadataA and mirror1 blocks mismatch, offset N
* try to compute the checksum from metadataA[0..N-1] + mirror1 block N +
   rest of metadataA
   * if it's ok, use it
   * if not: the block N is corrupted in mirror1 (we've skipped it in
 metadataA)
 then repeat with metadataA[0..N] + mirror1[N+1..end]

The method sounds good, but the code will be even more complex than
mine.
Also, due to the nature of CRC32 and the RAID1/DUP case, things will get
quite complex, as in the following case using your method.

0      4K     8K     12K    16K
Mirror 0
|      |  X   |      |      |
Mirror 1
|      |      |  X   |      |
If using your method:
[0~4K]
CRC32 of 0~4K are the same. No problem.

[4~8K]
CRC32 of 0~8K differs.
Now we know there is something wrong with 4~8K, but we still don't know
which is the good copy.

We can continue, keeping 2 different CRC32 seeds for the later checksum:
one seed using 4~8K from mirror 0 and one using it from mirror 1.

Note, the crc32 is for the whole tree block, so until we calculate all
the tree block, we can know which one is correct.

Sorry here, can -> can't

Thanks,
Qu


[8~12K]
Now crc32 mismatches again.
We still don't know which part is correct.

We can still continue by recording different CRC32 seeds for them.
But don't forget the previous 2 seeds we recorded.
So we record 4 crc32 seeds for 0~12K: (Mi0, Mi0) (Mi0, Mi1) (Mi1, Mi0)
(Mi1, Mi1).

[12~16K]
Continue calculating the crc32 with the above 4 seeds.
Finally, we find that the crc32 stored for the tree block matches the
combination (Mi1, Mi0), and we can restore the tree block.

Yes, with this behavior, we can restore the tree block even if 3 parts are
corrupted, but in that case, we need to try 8 times.

So, I don't consider this easier to implement than my idea.
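
The enumeration walked through above can be sketched in a few lines of Python. This is only an illustration under assumptions: it uses zlib's CRC32 rather than the crc32c btrfs actually uses, it checksums the whole 16K buffer (no header handling), and `rebuild_by_enumeration` is a hypothetical name.

```python
import itertools
import zlib

BLOCK = 4096


def rebuild_by_enumeration(mirror0, mirror1, want_crc):
    """Try every mix of the differing 4K blocks until the whole-block CRC matches.

    Returns (data, tries); data is None if no combination matches.
    """
    blocks0 = [mirror0[i:i + BLOCK] for i in range(0, len(mirror0), BLOCK)]
    blocks1 = [mirror1[i:i + BLOCK] for i in range(0, len(mirror1), BLOCK)]
    differing = [n for n, (a, b) in enumerate(zip(blocks0, blocks1)) if a != b]
    tries = 0
    # 2 ** len(differing) candidates: the (Mi0, Mi0) ... (Mi1, Mi1) seeds above.
    for choice in itertools.product((0, 1), repeat=len(differing)):
        tries += 1
        candidate = list(blocks0)
        for n, pick in zip(differing, choice):
            if pick:
                candidate[n] = blocks1[n]
        data = b"".join(candidate)
        if zlib.crc32(data) == want_crc:
            return data, tries
    return None, tries


original = bytes(range(256)) * 64          # a 16K stand-in for a tree block
crc = zlib.crc32(original)
mirror0 = bytearray(original)
mirror0[5000] ^= 0xFF                      # damage in 4K~8K
mirror1 = bytearray(original)
mirror1[9000] ^= 0xFF                      # damage in 8K~12K
data, tries = rebuild_by_enumeration(bytes(mirror0), bytes(mirror1), crc)
print(data == original, tries)
```

The number of candidates doubles with every 4K range in which the mirrors differ, which is the cost being objected to here.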

[ROOT CAUSE]
The root cause of the complexity is:
1) The checksum algorithm depends on previous input (seed)
Almost all checksum algorithms depend on previous input.
And you won't know whether the data is correct until all of it has been
processed.

2) Only one extra duplicate for RAID1/DUP
We don't have a further copy to judge which block is correct, as there
are only two mirrors.

So my partial checksum solves both root causes at the same time.
With partial checksums, we don't depend on previous data to calculate
the checksum.
And we have an extra reference: if the mirrors differ from each other, we use
the checksum to judge which copy is correct.
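
As a rough illustration of the partial checksum idea (not the patch itself), the Python sketch below keeps one CRC32 per 4K range and picks, for each range, whichever mirror matches. Assumptions: zlib's CRC32 stands in for crc32c, the stored partial checksums are trusted, and the helper names are hypothetical.

```python
import zlib

BLOCK = 4096


def split_blocks(data):
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]


def partial_csums(data):
    """One independent CRC32 per 4K range of the tree block."""
    return [zlib.crc32(block) for block in split_blocks(data)]


def rebuild_with_partial_csums(mirrors, csums):
    """For each 4K range, take the first mirror whose range matches its checksum.

    Returns None if some range is bad in every mirror.
    """
    per_mirror = [split_blocks(m) for m in mirrors]
    rebuilt = []
    for n, want in enumerate(csums):
        good = next((blocks[n] for blocks in per_mirror
                     if zlib.crc32(blocks[n]) == want), None)
        if good is None:
            return None
        rebuilt.append(good)
    return b"".join(rebuilt)


original = bytes(range(256)) * 64          # a 16K stand-in for a tree block
csums = partial_csums(original)
mirror0 = bytearray(original)
mirror0[5000] ^= 0xFF                      # damage in 4K~8K
mirror1 = bytearray(original)
mirror1[9000] ^= 0xFF                      # damage in 8K~12K
restored = rebuild_with_partial_csums([bytes(mirror0), bytes(mirror1)], csums)
print(restored == original)
```

Each range is judged on its own, so the work stays linear in the number of ranges instead of doubling with every range in which the mirrors differ.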



That's a rough idea that I hope will cover most of the cases when it
happens. With some more exhaustive attempts to rebuild the metadata
block we can try to repair 2 damaged blocks.

As this is completely independent, we can test it separately, and also
add it as a rescue feature to the userspace tools.


Yes, this corruption case may be minor enough, since even corruption in
one mirror is rare enough.
So I didn't introduce a new CRC32 checksum, but used the extra 32-4 bytes
to store the partial CRC32 to keep backward compatibility.


The above would work with any checksums, without the need to store the
per-block checksums which become impossible with stronger algorithms.


[FURTHER CSUM DESIGN]
As you mentioned in later 

Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION

2015-06-17 Thread Liu Bo
On Wed, Jun 17, 2015 at 07:01:18PM +0200, David Sterba wrote:
 On Wed, Jun 17, 2015 at 11:52:36PM +0800, Liu Bo wrote:
  On Wed, Jun 17, 2015 at 05:33:06PM +0200, David Sterba wrote:
   On Wed, Jun 17, 2015 at 03:54:31PM +0800, Liu Bo wrote:
MS_I_VERSION is enabled by default for btrfs, this adds an alternative
option to toggle it off.
   
   There's an existing generic iversion/noiversion mount option pair, no
   need to extra add it to btrfs.
  
  I know, it doesn't work though.
 
 Sigh, I see, btrfs forces MS_I_VERSION flag,
 0c4d2d95d06e920e0c61707e62c7fffc9c57f63a. I read 'enabled by default' as
 that there's a standard way to override the defaults.
 
 So the right way is not to do that but this will break everything that
 relies on that behaviour at the moment. This means to add the exception
 to the upper layers, either VFS or 'mount', which is not very likely to
 happen.
 
 The generic options do not reach the filesystem specific callbacks, so
 we can't check it.

Ext4 also makes its own i_version option, so I think we can do the
same thing until more filesystems require to do it in a generic way.

The performance benefit with no_iversion is obvious for fsync
related workloads since we would avoid some expensive log commits.

Thanks,

-liubo


Re: [PATCH] fstests: generic test for fsync after adding xattr to a file

2015-06-17 Thread Eryu Guan
On Wed, Jun 17, 2015 at 12:52:47PM +0100, fdman...@kernel.org wrote:
 From: Filipe Manana fdman...@suse.com
 
 This test is motivated by an issue found in btrfs.
 
 It tests that after syncing the filesystem, adding a xattr to a file,
 syncing the filesystem again, writing to the file and then doing a fsync
 against that file, the xattr still exists after a power failure. That is,
 after the fsync log/journal is replayed, the xattr still exists and with
 the correct value.
 
 The btrfs issue is fixed by the patch titled:
 
   Btrfs: fix fsync xattr loss in the fast fsync path

If I read the above patch correctly, the issue to be tested here was
introduced by commit 4f764e515361 (Btrfs: remove deleted xattrs on
fsync log replay), which is in mainline kernel since v4.1-rc1, so the
test should fail on my v4.1-rc5 kernel, but I didn't see it fail.

Is it reproducible every time? Or did I miss something?

 
 Signed-off-by: Filipe Manana fdman...@suse.com
 ---
  tests/generic/094     | 112 ++
  tests/generic/094.out |  29 +
  tests/generic/group   |   1 +
  3 files changed, 142 insertions(+)
  create mode 100755 tests/generic/094
  create mode 100644 tests/generic/094.out
 
 diff --git a/tests/generic/094 b/tests/generic/094
 new file mode 100755
 index 0000000..1c6d113
 --- /dev/null
 +++ b/tests/generic/094
 @@ -0,0 +1,112 @@
 +#! /bin/bash
 +# FS QA Test No. 094
 +#
 +# Test that after syncing the filesystem, adding a xattr to a file, syncing
 +# the filesystem again, writing to the file and then doing a fsync against that
 +# file, the xattr still exists after a power failure. That is, after the fsync
 +# log/journal is replayed, the xattr still exists and with the correct value.
 +#
 +# This test is motivated by a bug found in btrfs.
 +#
 +#---
 +# Copyright (C) 2015 SUSE Linux Products GmbH. All Rights Reserved.
 +# Author: Filipe Manana fdman...@suse.com
 +#
 +# This program is free software; you can redistribute it and/or
 +# modify it under the terms of the GNU General Public License as
 +# published by the Free Software Foundation.
 +#
 +# This program is distributed in the hope that it would be useful,
 +# but WITHOUT ANY WARRANTY; without even the implied warranty of
 +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 +# GNU General Public License for more details.
 +#
 +# You should have received a copy of the GNU General Public License
 +# along with this program; if not, write the Free Software Foundation,
 +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
 +#---
 +#
 +
 +seq=`basename $0`
 +seqres=$RESULT_DIR/$seq
 +echo "QA output created by $seq"
 +
 +here=`pwd`
 +tmp=/tmp/$$
 +status=1 # failure is the default!
 +
 +_cleanup()
 +{
 + _cleanup_flakey
 + rm -f $tmp.*
 +}
 +trap "_cleanup; exit \$status" 0 1 2 3 15
 +
 +# get standard environment, filters and checks
 +. ./common/rc
 +. ./common/filter
 +. ./common/dmflakey
 +. ./common/attr
 +
 +# real QA test starts here
 +_supported_fs generic
 +_supported_os Linux
 +_need_to_be_root
 +_require_scratch
 +_require_dm_flakey
 +_require_attrs
 +_require_metadata_journaling $SCRATCH_DEV
 +
 +rm -f $seqres.full
 +
 +_scratch_mkfs >> $seqres.full 2>&1
 +_init_flakey
 +_mount_flakey
 +
 +# Create the test file with some initial data and make sure everything is
 +# durably persisted. We do fsync before calling sync to make sure that if the
 +# filesystem is btrfs, we get the flag BTRFS_INODE_NEEDS_FULL_SYNC cleared
 +# from the btrfs inode - a condition necessary to trigger the issue in btrfs.
 +$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
 + -c "fsync" \
 + $SCRATCH_MNT/foo | _filter_xfs_io
 +sync
 +
 +# Add a xattr to our file.
 +$SETFATTR_PROG -n user.attr -v somevalue $SCRATCH_MNT/foo
 +
 +# Sync the filesystem to force a commit of the current btrfs transaction, this
 +# is a necessary condition to trigger the bug on btrfs.
 +sync
 +
 +# Now update our file's data and fsync the file.
 +# After a successful fsync, if the fsync log/journal is replayed we expect to
 +# see the xattr named "user.attr" with a value of "somevalue" (and the updated
 +# file data of course). Btrfs used to remove the xattr when it replayed the
 +# fsync log/journal.
 +$XFS_IO_PROG -c "pwrite -S 0xbb 8K 16K" \
 + -c "fsync" \
 + $SCRATCH_MNT/foo | _filter_xfs_io
 +
 +echo "File content after fsync and before crash:"
 +od -t x1 $SCRATCH_MNT/foo
 +
 +echo "File xattrs after fsync and before crash:"
 +$GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foo | _filter_scratch
 +
 +# Simulate a crash/power loss.
 +_load_flakey_table $FLAKEY_DROP_WRITES
 +_unmount_flakey
 +
 +# Allow writes again and mount. This makes the fs replay its fsync log.
 +_load_flakey_table $FLAKEY_ALLOW_WRITES
 +_mount_flakey
 +
 +echo 

Re: trim not working and irreparable errors from btrfsck

2015-06-17 Thread Chris Murphy
File a new bug at bugzilla.kernel.org describing this problem. Include
make/model of all involved SSDs, which you can get from smartctl or
hdparm. And then do a strace fstrim on the working and non-working
volumes, saving the output to separate files and attaching them to the
bug report. And then it's probably best to create a new list thread to
post the URL for the bug, since this thread is really about two
problems that may not be related.

Chris Murphy


Re: [PATCH 5/5] Btrfs: incremental send, fix rmdir not send utimes

2015-06-17 Thread Robbie Ko
Hi Filipe,

I've found that the following case is the main cause of such errors,
and its fs tree, as shown by btrfs-debug-tree, is below.

file tree key (459 ROOT_ITEM 20487)
node 132988928 level 1 items 3 free 490 generation 20487 owner 459
fs uuid b451ae42-3b03-4003-b0a4-45dce324557f
chunk uuid d8831db3-2e42-4b32-9a5c-3efdf50d36bc
key (256 INODE_ITEM 0) block 132710400 (8100) gen 20486
key (264 INODE_ITEM 0) block 130695168 (7977) gen 20480
key (266 XATTR_ITEM 952319794) block 126042112 (7693) gen 20464
leaf 132710400 items 166 free space 3639 generation 20486 owner 455
fs uuid b451ae42-3b03-4003-b0a4-45dce324557f
chunk uuid d8831db3-2e42-4b32-9a5c-3efdf50d36bc
item 0 key (256 INODE_ITEM 0) itemoff 16123 itemsize 160
inode generation 20425 transid 20442 size 32 block
group 0 mode 40755 links 1 uid 0 gid 0 rdev 0 flags 0x0
item 1 key (256 INODE_REF 256) itemoff 16111 itemsize 12
inode ref index 0 namelen 2 name: ..
...
item 165 key (262 XATTR_ITEM 1100961104) itemoff 7789 itemsize 39
location key (0 UNKNOWN.0 0) type XATTR
namelen 8 datalen 1 name: user.a78
data a
binary 61
leaf 130695168 items 133 free space 7332 generation 20480 owner 455
fs uuid b451ae42-3b03-4003-b0a4-45dce324557f
chunk uuid d8831db3-2e42-4b32-9a5c-3efdf50d36bc
item 0 key (264 INODE_ITEM 0) itemoff 16123 itemsize 160
inode generation 20428 transid 20434 size 10 block
group 0 mode 40755 links 1 uid 0 gid 0 rdev 0 flags 0x0
item 1 key (264 INODE_REF 256) itemoff 16112 itemsize 11
inode ref index 11 namelen 1 name: c
...

We can see that inode 262 sits right at the end of the leaf. send_utimes() then
uses btrfs_search_slot() to find an appropriate place for 262, and the slot it
gets lies just past the last item for 262. However, that place is
uninitialized on disk. Suppose we read atime tv_sec:576469548413222912,
tv_nsec:1919251317 from it and then send it out. The receiving side will get
EINVAL, since tv_nsec:1919251317 is greater than 999,999,999.
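
For reference, the range check that the tv_nsec value quoted above fails is simply the following. This is a minimal Python sketch, not the receiver's code; `nsec_in_range` is a hypothetical name, and the special values utimensat() accepts (UTIME_NOW, UTIME_OMIT) are ignored here.

```python
NSEC_PER_SEC = 1_000_000_000


def nsec_in_range(tv_nsec):
    """True if tv_nsec is a valid nanosecond field, i.e. 0..999,999,999."""
    return 0 <= tv_nsec < NSEC_PER_SEC


print(nsec_in_range(1919251317))   # the value read from the uninitialized slot
print(nsec_in_range(999_999_999))  # the largest valid value
```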

Thanks.
Robbie Ko

2015-06-10 18:06 GMT+08:00 Robbie Ko robbi...@synology.com:
 Hi Filipe,

 2015-06-09 18:36 GMT+08:00 Filipe David Manana fdman...@gmail.com:
 On Tue, Jun 9, 2015 at 11:04 AM, Robbie Ko robbi...@synology.com wrote:
 Hi Filipe,

 2015-06-08 22:00 GMT+08:00 Filipe David Manana fdman...@gmail.com:
 On Mon, Jun 8, 2015 at 4:44 AM, Robbie Ko robbi...@synology.com wrote:
 Hi Filipe,

 Hi Robbie,


 I've fixed the problem of sending utimes for a non-existing directory with another
 solution.

  In apply_dir_move(), the old parent dir. and new parent dir. will be
 updated after the current dir. has moved.

 And there's only one entry in old parent dir. (e.g. entry with
 smallest ino) will be tagged with rmdir_ino to prevent its parent dir.
 is deleted but updated.

 Can't parse this phrase. What do you mean by tagging an entry with 
 rmdir_ino?
 rmdir_ino corresponds to the number of a inode that wasn't deleted
 when it was processed because there was some inode with a lower number
 that is a child of the directory in the parent snapshot and had its
 rename/move operation delayed (it happens after the directory we want
 to delete is processed).


 Right, my "tagged with rmdir_ino" has the same meaning as you explained here.


 However, if we process rename for another entry not tagged with
 rmdir_ino first, its old parent dir. which is deleted  will be updated
 according to apply_dir_move().

 Therefore, I think we should check the existence of  the dir. before
 we're going to update it's utime.

 The patch is pasted in the following link, could you give me some comment?

 https://friendpaste.com/h8tZqOS9iAUpp2DvgGI2k

 Looks better.
 However I still don't understand your explanation, and just tried the
 example in your commit message:

 Parent snapshot:

 | a/ (ino 259)
   | c (ino 264)
 | b/ (ino 260)
   | d (ino 265)
 | del/ (ino 263)
   | item1/ (ino 261)
   | item2/ (ino 262)

 Send snapshot:
 | a/ (ino 259)
 | b/ (ino 260)
 | c/ (ino 2)
   | item2 (ino 259)
 | d/ (ino 257)
   | item1/ (ino 258)

 So it's confusing after looking at it.
 First the send snapshot mentions inode number 2, which doesn't exist
 in the parent snapshot - I assume you meant inode number 264.
 Then, the send snapshot has two inodes with number 259. Is item2 in
 the send snapshot supposed to be inode 262?


 Your guess is right. I have corrected it as follows.

  # Parent snapshot:
  #
  # | a/            (ino 259)
  # | | c           (ino 264)
  # |
  # | b/            (ino 260)
  # | | d           (ino 265)
  # |
  # | del/          (ino 263)
  #   | item1/      (ino 261)
  #   | item2/      (ino 262)

  # Send snapshot:
  #
  # | a/            (ino 259)
  # | b/            (ino 260)
  # | c/            (ino 264)
  # | | item2/      (ino 262)
  # |
  # | d/            (ino 265)
  

Re: BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)

2015-06-17 Thread Duncan
Marc MERLIN posted on Wed, 17 Jun 2015 09:19:36 -0700 as excerpted:

 On Wed, Jun 17, 2015 at 01:51:26PM +, Duncan wrote:
  Also, if my actual data got corrupted, am I correct that btrfs will
  detect the checksum failure and give me a different error message of
  a read error that cannot be corrected?
  
  I'll do a scrub later, for now I have to wait 20 hours for the raid
  rebuild first.
 
 Yes again.
  
 Great, thanks for confirming.
 Makes me happy to know that checksums and metadata DUP are helping me
 out here.  With ext4 I'd have been worse off for sure.
 
 One thing I'd strongly recommend.  Once the rebuild is complete and you
 do the scrub, there may well be both read/corrected errors, and
 unverified errors.  AFAIK, the unverified errors are a result of bad
 metadata blocks, so missing checksums for what they covered.  So once
 you
 
 I'm slightly confused here. If I have metadata DUP and checksums, how
 can metadata blocks be unverified?
 Data blocks being unverified, I understand, it would mean the data or
 checksum is bad, but I expect that's a different error message I haven't
 seen yet.

Backing up a bit to better explain what I'm seeing here...

What I'm getting here, when the sectors go unreadable on the (slowly) 
failing SSD, is actually a SATA level timeout, which btrfs (correctly) 
interprets as a read error.  But it wouldn't really matter whether it was 
a read error or a corruption error, btrfs would respond the same -- 
because both data and metadata are btrfs raid1 here, it would fetch and 
verify the other copy of the block from the raid1 mirror device, and 
assuming it verified (which it should since the other device is still in 
great condition, zero relocations), rewrite it over the one it couldn't 
read.

Back on the failing device, the rewrite triggers a sector relocation, and 
assuming it doesn't fall in the bad area too, that block is now clean.  
(If it does fall in the defective area, I simply have to repeat the scrub 
another time or two, until there are no more errors.)


But, and this is what I was trying to explain earlier but skipped a step 
I figured was more obvious than it apparently was, btrfs works with 
trees, including a metadata tree.  So each block of metadata that has 
checksums covering actual data, is in turn itself checksummed by a 
metadata block one step closer to the metadata root block, multiple 
levels deep.

I should mention here that this is my non-coder understanding.  If a dev 
says it works differently...

It's these multiple metadata levels and the chained checksums for them, 
that I was referencing.  Suppose it's a metadata block that fails, not a 
data block.  That metadata block will be checksummed, and will in turn 
contain checksums for other blocks, which might be either data blocks, or 
other metadata blocks, a level closer to the data (and further from the 
root) than the failed block.

Because the metadata block failed (either checksum failure or read
error, shouldn't matter at this point), whatever checksums it contained, 
whether for data, or for other metadata blocks, will be unverified.  If 
the affected metadata block is close to the root of the tree, the effect 
could in theory domino thru to several further levels.

These checksum unverified blocks (because the block containing the 
checksums failed) will show up as unverified errors, and whatever that 
checksum was supposed to cover, whether other metadata blocks or data 
blocks, won't be checked in that scrub round, because the level above it 
can't be verified.

Given a checksum-verified raid1 copy on the mirror device, the original 
failed block will be rewritten.  But if it's metadata, whatever checksums 
it in turn contained will still not be verified in that scrub round.  
Again, these show up as unverified errors.

By running scrub repeatedly, however, now that the first error has been
fixed by the rewrite from the good copy, the checksums it contained can
now in turn be checked.  If they all verify, great.  If not, another
rewrite will be triggered, fixing them, but if those checksums were in
turn for other metadata blocks, now /those/ will need to be checked and
will show up as unverified.

So depending on where the bad metadata block was located in the
metadata tree, a second, third, possibly even fourth, scrub may be
needed, in order to correct all the errors at all levels of the
metadata tree, thereby fixing in turn each level of unverified errors
exposed as the level above it (closer to root) was fixed.
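
The repeated-scrub reasoning above can be sketched as a toy model. This is only an illustration of the explanation given here, under the same non-coder assumptions, and not how the btrfs scrub code is actually structured. The names scrub_pass and scrub_until_clean are made up. Each array entry stands for one metadata level on a single path, index 0 being closest to the root, and true means the primary copy is bad while the mirror is good.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * One simulated scrub pass over a single root-to-leaf path.
 * The first bad level found is rewritten from the good mirror; everything
 * below it stays unverified for this pass, so the walk stops there.
 * Returns true if the pass corrected a block.
 */
bool scrub_pass(bool *bad, size_t levels)
{
	for (size_t i = 0; i < levels; i++) {
		if (bad[i]) {
			bad[i] = false; /* rewritten from the mirror copy */
			return true;    /* lower levels were not checked */
		}
	}
	return false;
}

/*
 * Run passes until one finds nothing to correct.
 * Returns the total number of passes, including the final clean one.
 */
unsigned scrub_until_clean(bool *bad, size_t levels)
{
	unsigned passes = 1;

	while (scrub_pass(bad, levels))
		passes++;
	return passes;
}
```

In this model, a path with two bad levels needs three passes: two that each correct one level, and a final one that comes back clean.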


Of course, if your scrub listed all corrected (metadata since it's raid1 
in your case) or uncorrectable (data since it's single in your case, or 
metadata with both copies bad) errors, no unverified errors, then at 
least in theory, a second scrub shouldn't find any further errors to 
correct.  Only if you see unverified errors should it be necessary to 
repeat the scrub, but then you might need to repeat it several times as 
each run will 

Re: trim not working and irreparable errors from btrfsck

2015-06-17 Thread Austin S Hemmelgarn

On 2015-06-17 11:40, Christian wrote:

On 06/17/2015 11:28 AM, Chris Murphy wrote:



However, fstrim still gives me 0 B (0 bytes) trimmed, so that may be
another problem. Is there a way to check if trim works?


That sounds like maybe your SSD is blacklisted for trim, is all I can
think of. So trim shouldn't be the cause of the problem if it's being
blacklisted. The recent problems appear to be around newer SSDs that
support queued trim and newer kernels that issue queued trim. There
have been some patches related to trim to the kernel, but the
existence of blacklisting and claims of bugs in firmware make it
difficult to test and isolate.

http://techreport.com/news/28473/some-samsung-ssds-may-suffer-from-a-buggy-trim-implementation



This is an Intel SSD in a Lenovo Thinkpad X1 Carbon. Trim worked until a
few weeks ago and still works for my small ext4 boot partition (just ran
it to check). I will keep looking for a solution. Thanks!

I'm seeing the same issue here, but with a Crucial brand SSD.  Somewhat 
interestingly, I don't see any issues like this with BTRFS on top of 
LVM's thin-provisioning volumes, or with any other filesystems, so I 
think it has something to do with how BTRFS is reporting unused space or 
how it is submitting the discard requests.






Re: [PATCH] Btrfs: fix typo in the error log

2015-06-17 Thread David Sterba
On Mon, Jun 15, 2015 at 10:02:23PM +0800, Anand Jain wrote:
 ---
  fs/btrfs/disk-io.c | 2 +-
  fs/btrfs/super.c   | 2 +-
  2 files changed, 2 insertions(+), 2 deletions(-)
 
 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 57d48f8..5147cb7 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -2896,7 +2896,7 @@ retry_root_backup:
        fs_info->num_tolerated_disk_barrier_failures &&
       !(sb->s_flags & MS_RDONLY)) {
           printk(KERN_WARNING "BTRFS: "
 -             "too many missing devices, writeable mount is not allowed\n");
 +             "too many missing devices, writable mount is not allowed\n");

I see both forms are accepted:

http://dictionary.reference.com/browse/writeable
http://dictionary.reference.com/browse/writable

Though not here:

http://www.merriam-webster.com/dictionary/writable
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Replacing a drive from a RAID 1 array

2015-06-17 Thread Austin S Hemmelgarn

On 2015-06-16 12:58, Hugo Mills wrote:

On Tue, Jun 16, 2015 at 06:43:23PM +0200, Arnaud Kapp wrote:

Hello,

Consider the following situation: I have a RAID 1 array with 4 drives.
I want to replace one of the drives with a new one of greater capacity.

However, let's say I only have 4 HDD slots, so I cannot plug in the new
drive, add it to the array and then remove the old one.
Is there a *safe* way to change drives in this situation? I'd bet that
booting with 3 drives, adding the new one and then removing the old,
disconnected one would work. However, is there something that could go
wrong in this situation?


The main thing that could go wrong with that is a disk failure. If
you have the SATA ports available, I'd consider operating the machine
with the case open and one of the drives bare and resting on something
stable and insulating for the time it takes to do a btrfs replace
operation.
This would be my first suggestion also; although, if you only have 4 
SATA ports, you might want to invest in a SATA add in card (if you go 
this way, look for one with an ASmedia chipset, those are the best I've 
seen as far as reliability for add on controllers).


If that's not an option, then a good-quality external USB case with
a short cable directly attached to one of the USB ports on the
motherboard would be a reasonable solution (with the proviso that some
USB connections are just plain unstable and throw errors, which can
cause problems with the filesystem code, typically requiring a reboot,
and a restart of the process).
If you decide to go with this option and are using an Intel system, 
avoid using USB3.0 ports, as a number of Intel's chipsets have known 
bugs with their USB3 hardware that will likely cause serious issues.
If your system has an eSATA port however, try to use that instead of 
USB, it will almost certainly be faster and more reliable.


You might also consider using either NBD or iSCSI to present one of
the disks (I'd probably use the outgoing one) over the network from
another machine with more slots in it, but that's going to end up with
horrible performance during the migration.
The other possibility WRT this is ATAoE, which generally gets better 
performance than NBD or iSCSI but has the caveat that both systems have 
to be on the same network link (ie, no gateways between them).  If you 
do decide to use ATAoE look into a program called 'vblade' (most 
distros have it in a package with the same name).






[PATCH] Btrfs-progs: add feature to get mininum size for resizing a fs/device

2015-06-17 Thread fdmanana
From: Filipe Manana fdman...@suse.com

Currently there is no way for a user to know the minimum size a
device of a btrfs filesystem can be resized to. Sometimes the value of
total allocated space (sum of all allocated chunks/device extents), which
can be parsed from 'btrfs filesystem show' and 'btrfs filesystem usage',
works as the minimum size, but sometimes it does not, namely when device
extents have to be relocated to holes (unallocated space) within the new
size of the device (the total allocated space sum).

This change adds the ability to reliably compute such a minimum value and
extends 'btrfs filesystem resize' with the following syntax to get such
a value:

   btrfs filesystem resize [devid:]get_min_size

Signed-off-by: Filipe Manana fdman...@suse.com
---
 Documentation/btrfs-filesystem.asciidoc |   4 +-
 Makefile.in |   8 +-
 cmds-filesystem.c   | 219 +++-
 ctree.h |   3 +
 tests/shrink-min-size-tests.sh  |  72 +++
 5 files changed, 302 insertions(+), 4 deletions(-)
 create mode 100755 tests/shrink-min-size-tests.sh

diff --git a/Documentation/btrfs-filesystem.asciidoc 
b/Documentation/btrfs-filesystem.asciidoc
index f1c35b6..45f8cf7 100644
--- a/Documentation/btrfs-filesystem.asciidoc
+++ b/Documentation/btrfs-filesystem.asciidoc
@@ -88,7 +88,7 @@ If a newlabel optional argument is passed, the label is 
changed.
 NOTE: the maximum allowable length shall be less than 256 chars
 
 // Some wording are extracted by the resize2fs man page
-*resize* [devid:][+/-]size[kKmMgGtTpPeE]|[devid:]max path::
+*resize* 
[devid:][+/-]size[kKmMgGtTpPeE]|[devid:]max|[devid:]get_min_size 
path::
 Resize a mounted filesystem identified by directory path. A particular device
 can be resized by specifying a devid.
 +
@@ -108,6 +108,8 @@ KiB, MiB, GiB, TiB, PiB, or EiB, respectively. Case does 
not matter.
 +
 If \'max' is passed, the filesystem will occupy all available space on the
 device devid.
+If \'get_min_size' is passed, return the minimum size the device can be
+shrunk to, without performing any resize operation.
 +
 The resize command does not manipulate the size of underlying
 partition.  If you wish to enlarge/reduce a filesystem, you must make sure you
diff --git a/Makefile.in b/Makefile.in
index 860a390..202c51e 100644
--- a/Makefile.in
+++ b/Makefile.in
@@ -46,7 +46,7 @@ libbtrfs_objects = send-stream.o send-utils.o rbtree.o 
btrfs-list.o crc32c.o \
 libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \
   crc32c.h list.h kerncompat.h radix-tree.h extent-cache.h \
   extent_io.h ioctl.h ctree.h btrfsck.h version.h
-TESTS = fsck-tests.sh convert-tests.sh
+TESTS = fsck-tests.sh convert-tests.sh shrink-min-size-tests.sh
 
 prefix ?= @prefix@
 exec_prefix = @exec_prefix@
@@ -161,6 +161,10 @@ $(BUILDDIRS):
@echo Making all in $(patsubst build-%,%,$@)
$(Q)$(MAKE) $(MAKEOPTS) -C $(patsubst build-%,%,$@)
 
+test-shrink-min-size: btrfs mkfs.btrfs
+   @echo [TEST]   shrink-min-size-tests.sh
+   $(Q)bash tests/shrink-min-size-tests.sh
+
 test-convert: btrfs btrfs-convert
@echo [TEST]   convert-tests.sh
$(Q)bash tests/convert-tests.sh
@@ -169,7 +173,7 @@ test-fsck: btrfs btrfs-image btrfs-corrupt-block 
btrfs-debug-tree mkfs.btrfs
@echo [TEST]   fsck-tests.sh
$(Q)bash tests/fsck-tests.sh
 
-test: test-fsck test-convert
+test: test-fsck test-convert test-shrink-min-size
 
 #
 # NOTE: For static compiles, you need to have all the required libs
diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index b93bb33..13b5bc5 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -1220,14 +1220,228 @@ static int cmd_defrag(int argc, char **argv)
 }
 
 static const char * const cmd_resize_usage[] = {
-   "btrfs filesystem resize [devid:][+/-]newsize[kKmMgGtTpPeE]|[devid:]max path",
+   "btrfs filesystem resize [devid:][+/-]newsize[kKmMgGtTpPeE]|[devid:]max|[devid:]get_min_size path",
    "Resize a filesystem",
    "If 'max' is passed, the filesystem will occupy all available space",
    "on the device 'devid'.",
+   "If 'get_min_size' is passed, return the minimum size the device can",
+   "be shrunk to.",
    "[kK] means KiB, which denotes 1KiB = 1024B, 1MiB = 1024KiB, etc.",
    NULL
 };
 
+struct dev_extent_elem {
+   u64 start;
+   /* inclusive end */
+   u64 end;
+   struct list_head list;
+};
+
+static int add_dev_extent(struct list_head *list,
+                          const u64 start, const u64 end,
+                          const int append)
+{
+   struct dev_extent_elem *e;
+
+   e = malloc(sizeof(*e));
+   if (!e)
+       return -ENOMEM;
+
+   e->start = start;
+   e->end = end;
+
+   if (append)
+       list_add_tail(&e->list, list);
+   else
+       list_add(&e->list, list);
+
+   return 

[PATCH] fstests: generic test for fsync after adding xattr to a file

2015-06-17 Thread fdmanana
From: Filipe Manana fdman...@suse.com

This test is motivated by an issue found in btrfs.

It tests that after syncing the filesystem, adding a xattr to a file,
syncing the filesystem again, writing to the file and then doing a fsync
against that file, the xattr still exists after a power failure. That is,
after the fsync log/journal is replayed, the xattr still exists and with
the correct value.

The btrfs issue is fixed by the patch titled:

  Btrfs: fix fsync xattr loss in the fast fsync path

Signed-off-by: Filipe Manana fdman...@suse.com
---
 tests/generic/094 | 112 ++
 tests/generic/094.out |  29 +
 tests/generic/group   |   1 +
 3 files changed, 142 insertions(+)
 create mode 100755 tests/generic/094
 create mode 100644 tests/generic/094.out

diff --git a/tests/generic/094 b/tests/generic/094
new file mode 100755
index 000..1c6d113
--- /dev/null
+++ b/tests/generic/094
@@ -0,0 +1,112 @@
+#! /bin/bash
+# FS QA Test No. 094
+#
+# Test that after syncing the filesystem, adding a xattr to a file, syncing
+# the filesystem again, writing to the file and then doing a fsync against that
+# file, the xattr still exists after a power failure. That is, after the fsync
+# log/journal is replayed, the xattr still exists and with the correct value.
+#
+# This test is motivated by a bug found in btrfs.
+#
+#---
+# Copyright (C) 2015 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana fdman...@suse.com
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo QA output created by $seq
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+
+_cleanup()
+{
+   _cleanup_flakey
+   rm -f $tmp.*
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+. ./common/attr
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_need_to_be_root
+_require_scratch
+_require_dm_flakey
+_require_attrs
+_require_metadata_journaling $SCRATCH_DEV
+
+rm -f $seqres.full
+
+_scratch_mkfs >> $seqres.full 2>&1
+_init_flakey
+_mount_flakey
+
+# Create the test file with some initial data and make sure everything is
+# durably persisted. We do fsync before calling sync to make sure that if the
+# filesystem is btrfs, we get the flag BTRFS_INODE_NEEDS_FULL_SYNC cleared
+# from the btrfs inode - a condition necessary to trigger the issue in btrfs.
+$XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
+   -c "fsync" \
+   $SCRATCH_MNT/foo | _filter_xfs_io
+sync
+
+# Add a xattr to our file.
+$SETFATTR_PROG -n user.attr -v somevalue $SCRATCH_MNT/foo
+
+# Sync the filesystem to force a commit of the current btrfs transaction, this
+# is a necessary condition to trigger the bug on btrfs.
+sync
+
+# Now update our file's data and fsync the file.
+# After a successful fsync, if the fsync log/journal is replayed we expect to
+# see the xattr named user.attr with a value of somevalue (and the updated
+# file data of course). Btrfs used to remove the xattr when it replayed the
+# fsync log/journal.
+$XFS_IO_PROG -c "pwrite -S 0xbb 8K 16K" \
+   -c "fsync" \
+   $SCRATCH_MNT/foo | _filter_xfs_io
+
+echo File content after fsync and before crash:
+od -t x1 $SCRATCH_MNT/foo
+
+echo File xattrs after fsync and before crash:
+$GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foo | _filter_scratch
+
+# Simulate a crash/power loss.
+_load_flakey_table $FLAKEY_DROP_WRITES
+_unmount_flakey
+
+# Allow writes again and mount. This makes the fs replay its fsync log.
+_load_flakey_table $FLAKEY_ALLOW_WRITES
+_mount_flakey
+
+echo File content after crash and log replay:
+od -t x1 $SCRATCH_MNT/foo
+
+echo File xattrs after crash and log replay:
+$GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foo | _filter_scratch
+
+status=0
+exit
diff --git a/tests/generic/094.out b/tests/generic/094.out
new file mode 100644
index 000..2e5e0fa
--- /dev/null
+++ b/tests/generic/094.out
@@ -0,0 +1,29 @@
+QA output created by 094
+wrote 32768/32768 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 16384/16384 bytes at offset 

Re: I_ERR_FILE_EXTENT_DISCOUNT when there are no file extent holes in inode

2015-06-17 Thread Qu Wenruo



Przemysław Pawełczyk wrote on 2015/06/16 14:19 +0200:

On Tue, Jun 16, 2015 at 9:54 AM, Qu Wenruo quwen...@cn.fujitsu.com wrote:

Przemysław Pawełczyk wrote on 2015/06/14 21:38 +0200:



I wanted to move and resize /home btrfs partition of my debian jessie
v8.1 (w/ btrfs-tools v3.17) in virtual machine using gparted 0.22.0
after booting from latest SysRescCD 4.5.3 (w/ btrfs-progs v3.19.1).
GParted does fs check before, to make sure that everything is fine,
but it wasn't. There were following errors from `btrfsck`:

checking fs roots
root 5 inode 1521611 errors 100, file extent discount
Found file extent holes:
start: 12288, len:4096
root 5 inode 1521634 errors 100, file extent discount
Found file extent holes:
root 5 inode 1521645 errors 100, file extent discount
Found file extent holes:
start: 8192, len:4096
root 5 inode 1521647 errors 100, file extent discount
Found file extent holes:
start: 8192, len:8192
start: 20480, len:4096
root 5 inode 1521648 errors 100, file extent discount
Found file extent holes:
root 5 inode 1521649 errors 100, file extent discount
Found file extent holes:
...

As you can see not every inode w/ file extent discount error flag has
file extent holes. I'm not sure of exact definition of this error
flag, so cannot tell myself how (ab?)normal it is. I was using this
partition almost daily for almost a year (back then it was debian
testing when installed) and beside occasional VirtualBox hangups
(during rsync from USB), I had no problems at all.

Qu Wenruo's discount file extent hole repairing function landed in
btrfs-progs v3.19, so I couldn't use debian's old btrfsck to improve
the situation, but sysresccd's one was recent enough (and I was
already booted into it), so I went with its `btrfsck --repair`. I got
many 'Fixed discount file extents for inode' messages, but next
`btrfsck` run still reported file extent discount errors. Looking
closely there was some improvement, because 2 inodes were no longer
reported (only one within visible part of the below log dump):

checking fs roots
root 5 inode 1521634 errors 100, file extent discount
Found file extent holes:
root 5 inode 1521645 errors 100, file extent discount
Found file extent holes:
root 5 inode 1521647 errors 100, file extent discount
Found file extent holes:
root 5 inode 1521648 errors 100, file extent discount
Found file extent holes:
root 5 inode 1521649 errors 100, file extent discount
Found file extent holes:
...

I cloned btrfs-progs.git with latest stable v4.0.1, and executed
self-built `btrfsck --repair` from my debian, hoping that maybe there
were some improvements in that department. Sadly no, I got many 'Fixed
discount file extents for inode', but next `btrfsck` revealed same old
file extent discount errors. It looked like the error flag is simply not
cleared, so I finally looked into the code.

When I found repair_inode_discount_extent() in cmds-check.c, I thought
I'd found the bug. I_ERR_FILE_EXTENT_DISCOUNT is cleared only within the
while (node) loop, so if there are no file extent holes, it won't be
cleared. So I moved

  if (RB_EMPTY_ROOT(&rec->holes))
          rec->errors &= ~I_ERR_FILE_EXTENT_DISCOUNT;



Thanks a lot for pointing out the problem.
I'll try to fix it soon.


I would send a patch separately if I was convinced that it fixes the
real problem, but as you can read in the rest of the mail, I am not. It
may seem like a slight optimization (checking things once instead of
repeatedly), but it also masks the error flag (i.e. clears it) for cases
that are not really fixed in the function, and only the next btrfsck run
will set this file extent discount error flag again (in the case of these
holeless inodes having extent_end < isize), so I think that it needs
to be postponed until repair_inode_discount_extent() is smart
enough to thoroughly fix an inode's extent deficiency.


Also, welcome aboard to btrfs development! :)


Thank you, but I don't plan to truly dive into btrfs (at least not yet). :)
I just hoped I could work my problem out myself and, even if not, I
could at least provide a more detailed report than "File extent discount
errors are not fixed by btrfsck".



after the while loop. It must have helped clearing error flag during
`btrfsck --repair`, but rerunning `btrfsck` revealed that there are
still the same file extent discount errors, so apparently they were
reset in some other code path.

I added some debug printf to verify that RB_EMPTY_ROOT(&rec->holes)
was not false (i.e. 0) and another one in maybe_free_inode_rec() after
the conditions that lead to setting the I_ERR_FILE_EXTENT_DISCOUNT error
flag, to see the values that met the conditions:

  if (rec->nlink > 0 && !no_holes
      && (   rec->extent_end < rec->isize
          || first_extent_gap(&rec->holes) < rec->isize
         )
     )
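
A small self-contained sketch of why clearing the flag alone does not stick. The struct and function names are hypothetical stand-ins, not the btrfs-progs definitions, and the flag value 0x100 is an assumption taken from the "errors 100" shown in the btrfsck output. The first_gap field stands in for first_extent_gap(), with UINT64_MAX assumed to mean "no holes".

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag value, inferred from "errors 100" in the btrfsck output. */
#define SKETCH_ERR_FILE_EXTENT_DISCOUNT (1u << 8)

/* Hypothetical, reduced stand-in for the inode record used by the check. */
struct inode_rec_sketch {
	uint32_t nlink;
	uint64_t extent_end;
	uint64_t isize;
	uint64_t first_gap; /* start of first hole, UINT64_MAX if none */
	unsigned errors;
};

/* Mirrors the condition quoted above: set the flag when extents fall short. */
void check_discount(struct inode_rec_sketch *rec, bool no_holes)
{
	if (rec->nlink > 0 && !no_holes &&
	    (rec->extent_end < rec->isize || rec->first_gap < rec->isize))
		rec->errors |= SKETCH_ERR_FILE_EXTENT_DISCOUNT;
}

/* Clears only the flag bit, without repairing anything. */
void clear_discount(struct inode_rec_sketch *rec)
{
	rec->errors &= ~SKETCH_ERR_FILE_EXTENT_DISCOUNT;
}
```

For a record with nlink=1, extent_end=0 and isize=1408 and no holes, clear_discount() zeroes the bit, but the next check_discount() sets it again because extent_end is still smaller than isize.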

Rerunning `btrfsck` gave me this new info:

Checking filesystem on /dev/sda7
UUID: 8b889e4c-5dba-43e3-a116-e13874bfb311
!Set discount file extents for inode1521634 (nlink=1 extent_end=0
  isize=1408   

BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)

2015-06-17 Thread Marc MERLIN
I had a few power offs due to a faulty power supply, and my mdadm raid5
got into fail mode after 2 drives got kicked out since their sequence
numbers didn't match due to the abrupt power offs.

I brought the swraid5 back up by force assembling it with 4 drives (one
was really only a few sequence numbers behind), and it's doing a full
parity rebuild on the 5th drive that was farther behind.

So I can understand how I may have had a few blocks that are in a bad
state.
I'm getting a few (not many) of those messages in syslog.
BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 
sector 459432)

Filesystem looks like this:
Label: 'btrfs_pool1'  uuid: 6358304a-2234-4243-b02d-4944c9af47d7
Total devices 1 FS bytes used 8.29TiB
devid1 size 14.55TiB used 8.32TiB path /dev/mapper/dshelf1

gargamel:~# btrfs fi df /mnt/btrfs_pool1
Data, single: total=8.29TiB, used=8.28TiB
System, DUP: total=8.00MiB, used=920.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=14.00GiB, used=10.58GiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=512.00MiB, used=0.00B

Kernel 3.19.8.

Just to make sure I understand, do those messages in syslog mean that my
metadata got corrupted a bit, but because I have 2 copies, btrfs can fix
the bad copy by using the good one?

Also, if my actual data got corrupted, am I correct that btrfs will
detect the checksum failure and give me a different error message of a
read error that cannot be corrected?

I'll do a scrub later, for now I have to wait 20 hours for the raid rebuild
first.

Thanks,
Marc
-- 
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901


[PATCH] Btrfs: fix fsync xattr loss in the fast fsync path

2015-06-17 Thread fdmanana
From: Filipe Manana fdman...@suse.com

After commit 4f764e515361 ("Btrfs: remove deleted xattrs on fsync log
replay"), we can end up in a situation where during log replay we end up
deleting xattrs that were never deleted when their file was last fsynced.

This happens in the fast fsync path (flag BTRFS_INODE_NEEDS_FULL_SYNC is
not set in the inode) if the inode has the flag BTRFS_INODE_COPY_EVERYTHING
set, the xattr was added in a past transaction and the leaf where the
xattr is located was not updated (COWed or created) in the current
transaction. In this scenario the xattr item never ends up in the log
tree, and therefore at log replay time the replay code deletes
the xattr from the fs/subvol tree, as it thinks that xattr was deleted
prior to the last fsync.

Fix this by using a new item key type that represents xattrs to be deleted
at log replay time. This key type is only used in the log tree. By using
this explicit item we can continue to log only xattrs that were added (or
modified) in the current transaction instead of all xattrs, while still
keeping the intention of commit 4f764e515361 ("Btrfs: remove deleted
xattrs on fsync log replay").

This issue is reproducible with the following test case for fstests:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo QA output created by $seq

  here=`pwd`
  tmp=/tmp/$$
  status=1  # failure is the default!

  _cleanup()
  {
  _cleanup_flakey
  rm -f $tmp.*
  }
  trap "_cleanup; exit \$status" 0 1 2 3 15

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey
  . ./common/attr

  # real QA test starts here
  _supported_fs generic
  _supported_os Linux
  _need_to_be_root
  _require_scratch
  _require_dm_flakey
  _require_attrs
  _require_metadata_journaling $SCRATCH_DEV

  _crash_and_mount()
  {
  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey
  # Allow writes again and mount. This makes the fs replay its fsync log.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey
  }

  rm -f $seqres.full

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create the test file with some initial data and make sure everything is
  # durably persisted. We do fsync before calling sync to make sure that if the
  # filesystem is btrfs, we get the flag BTRFS_INODE_NEEDS_FULL_SYNC cleared
  # from the btrfs inode - a condition necessary to trigger the issue in btrfs.
  $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 32k" \
      -c "fsync" \
      $SCRATCH_MNT/foo | _filter_xfs_io
  sync

  # Add a xattr to our file.
  $SETFATTR_PROG -n user.attr -v somevalue $SCRATCH_MNT/foo

  # Sync the filesystem to force a commit of the current btrfs transaction, this
  # is a necessary condition to trigger the bug on btrfs.
  sync

  # Now update our file's data and fsync the file.
  # After a successful fsync, if the fsync log/journal is replayed we expect to
  # see the xattr named user.attr with a value of somevalue (and the updated
  # file data of course). Btrfs used to remove the xattr when it replayed the
  # fsync log/journal.
  $XFS_IO_PROG -c "pwrite -S 0xbb 8K 16K" \
      -c "fsync" \
      $SCRATCH_MNT/foo | _filter_xfs_io

  echo File content after fsync and before crash:
  od -t x1 $SCRATCH_MNT/foo

  echo File xattrs after fsync and before crash:
  $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foo | _filter_scratch

  _crash_and_mount

  echo File content after crash and log replay:
  od -t x1 $SCRATCH_MNT/foo

  echo File xattrs after crash and log replay:
  $GETFATTR_PROG --absolute-names --dump $SCRATCH_MNT/foo | _filter_scratch

  status=0
  exit

The expected golden output for this test:

  wrote 32768/32768 bytes at offset 0
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  wrote 16384/16384 bytes at offset 8192
  XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
  File content after fsync and before crash:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0060000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0100000
  File xattrs after fsync and before crash:
  # file: SCRATCH_MNT/foo
  user.attr=somevalue

  File content after crash and log replay:
  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0020000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0060000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0100000
  File xattrs after crash and log replay:
  # file: SCRATCH_MNT/foo
  user.attr=somevalue

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/ctree.h|  15 ++
 fs/btrfs/dir-item.c |  71 ++-
 fs/btrfs/inode.c|   5 +-
 fs/btrfs/tree-log.c | 138 +---
 fs/btrfs/tree-log.h |  10 
 fs/btrfs/xattr.c|  10 
 6 files changed, 198 

[PATCH] Btrfs: fix fsync data loss after append write

2015-06-17 Thread fdmanana
From: Filipe Manana fdman...@suse.com

If we do an append write to a file (which increases its inode's i_size)
that does not have the flag BTRFS_INODE_NEEDS_FULL_SYNC set in its inode,
and the previous transaction added a new hard link to the file, which sets
the flag BTRFS_INODE_COPY_EVERYTHING in the file's inode, and then fsync
the file, the inode's new i_size isn't logged. This has the consequence
that after the fsync log is replayed, the file size remains what it was
before the append write operation, which means users/applications will
not be able to read the data that was successfully fsync'ed before.

This happens because neither the inode item nor the delayed inode get
their i_size updated when the append write is made - doing so would
require starting a transaction in the buffered write path, something that
we do not do intentionally for performance reasons.

Fix this by making sure that when the flag BTRFS_INODE_COPY_EVERYTHING is
set the inode is logged with its current i_size (log the in-memory inode
into the log tree).

This issue is not a recent regression and is easy to reproduce with the
following test case for fstests:

  seq=`basename $0`
  seqres=$RESULT_DIR/$seq
  echo QA output created by $seq

  here=`pwd`
  tmp=/tmp/$$
  status=1  # failure is the default!

  _cleanup()
  {
  _cleanup_flakey
  rm -f $tmp.*
  }
  trap "_cleanup; exit \$status" 0 1 2 3 15

  # get standard environment, filters and checks
  . ./common/rc
  . ./common/filter
  . ./common/dmflakey

  # real QA test starts here
  _supported_fs generic
  _supported_os Linux
  _need_to_be_root
  _require_scratch
  _require_dm_flakey
  _require_metadata_journaling $SCRATCH_DEV

  _crash_and_mount()
  {
  # Simulate a crash/power loss.
  _load_flakey_table $FLAKEY_DROP_WRITES
  _unmount_flakey
  # Allow writes again and mount. This makes the fs replay its fsync log.
  _load_flakey_table $FLAKEY_ALLOW_WRITES
  _mount_flakey
  }

  rm -f $seqres.full

  _scratch_mkfs >> $seqres.full 2>&1
  _init_flakey
  _mount_flakey

  # Create the test file with some initial data and then fsync it.
  # The fsync here is only needed to trigger the issue in btrfs, as it causes
  # the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
  $XFS_IO_PROG -f -c pwrite -S 0xaa 0 32k \
  -c fsync \
  $SCRATCH_MNT/foo | _filter_xfs_io
  sync

  # Add a hard link to our file.
  # On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
  # which is a necessary condition to trigger the issue.
  ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar

  # Sync the filesystem to force a commit of the current btrfs transaction, this
  # is a necessary condition to trigger the bug on btrfs.
  sync

  # Now append more data to our file, increasing its size, and fsync the file.
  # In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
  # write path did not update the inode item in the btree nor the delayed inode
  # item (in memory structure) in the current transaction (created by the fsync
  # handler), the fsync did not record the inode's new i_size in the fsync
  # log/journal. This made the data unavailable after the fsync log/journal is
  # replayed.
  $XFS_IO_PROG -c pwrite -S 0xbb 32K 32K \
   -c fsync \
   $SCRATCH_MNT/foo | _filter_xfs_io

  echo File content after fsync and before crash:
  od -t x1 $SCRATCH_MNT/foo

  _crash_and_mount

  echo File content after crash and log replay:
  od -t x1 $SCRATCH_MNT/foo

  status=0
  exit

The expected file output before and after the crash/power failure expects the
appended data to be available, which is:

  0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
  *
  0100000 bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb
  *
  0200000
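
A note on reading that output: od prints addresses in octal by default,
zero-padded to seven digits, so the 32K boundary shows up at address 0100000
and the end of the 64K file at 0200000. The small C sketch below does the
conversion in both directions. Both function names are made up for this
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Format a byte offset like od's default address column: 7 octal digits. */
static void od_format_offset(unsigned long long byte_offset,
			     char *buf, size_t len)
{
	assert(buf != NULL && len >= 8);
	snprintf(buf, len, "%07llo", byte_offset);
}

/* Convert an address printed by od (octal) back to a byte offset. */
static unsigned long long od_parse_offset(const char *addr)
{
	assert(addr != NULL);
	return strtoull(addr, NULL, 8);
}
```

For example, od_parse_offset("0100000") gives 32768, the offset at which the
appended 0xbb bytes start.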

Cc: sta...@vger.kernel.org
Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/tree-log.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index d049683..4920fce 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -4161,6 +4161,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
u64 ino = btrfs_ino(inode);
struct extent_map_tree *em_tree = &BTRFS_I(inode)->extent_tree;
u64 logged_isize = 0;
+   bool need_log_inode_item = true;
 
path = btrfs_alloc_path();
if (!path)
@@ -4269,11 +4270,6 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans,
} else {
if (inode_only == LOG_INODE_ALL)
fast_search = true;
-   ret = log_inode_item(trans, log, dst_path, inode);
-   if (ret) {
-   err = ret;
-   goto out_unlock;
-   }
goto log_extents;
}

Btrfs progs release 4.1-rc1

2015-06-17 Thread David Sterba
Hi,

unusual load of changes. Among the small UI enhancements, the mkfs output
rework is worth mentioning separately. It's based on Goffredo's patches
but I've tweaked the output:

Current:
https://git.kernel.org/cgit/linux/kernel/git/kdave/btrfs-progs.git/commit/?id=60447579c41a10069b67d502b317bf57519acdd3

Original proposal:
https://git.kernel.org/cgit/linux/kernel/git/kdave/btrfs-progs.git/commit/?id=3afd86683b7324c9fec94ca2c35b64ac72ab523f

The ETA for 4.1 release is next week, please give it a test and report
bugs.


* bugfixes
  - fsck.btrfs: no bash-isms
  - bugzilla 97171: invalid memory access (with tests)
  - receive:
    - cloning works with --chroot
    - capabilities not lost
  - mkfs: do not try to register bare file images
  - option --help accepted by the standalone utilities

* enhancements
  - corrupt block: ability to remove csums
  - mkfs:
    - warn if metadata redundancy is lower than for data
    - options to make the output quiet (only errors)
    - mixed case names of raid profiles accepted
    - rework the output:
      - more compact, 'key: value' format
  - subvol:
    - show:
      - print received uuid
      - update the output
    - sync:
      - grab all deleted ids and print them as they're removed,
        previous implementation only checked if there are any
        to be deleted - change in command semantics
  - scrub: print timestamps in days HMS format
  - receive:
    - can specify mount point, do not rely on /proc
    - can work inside subvolumes
  - send:
    - new option to send stream without data (NO_FILE_DATA)
  - convert:
    - specify incompat features on the new fs
  - qgroup:
    - show: distinguish no limits and 0 limit value
    - limit: ability to clear the limit
  - help for 'btrfs' is shorter, 1st level command overview
  - debug tree: print key names according to their C name

* new
  - rescue zero-log
  - btrfstune:
    - rewrite uuid on a filesystem image
    - new option to turn on NO_HOLES incompat feature

* deprecated
  - standalone btrfs-zero-log

* other
  - testing framework updates
    - uuid rewrite test
    - btrfstune feature setting test
    - zero-log tests
    - more testing image formats
  - manual page updates
  - ioctl.h synced with current kernel uapi version
  - convert: preparatory works for more filesystems (reiserfs pending)
  - use static buffers for path handling where possible
  - add new helpers for send utils that check memory allocations,
    switch all users, deprecate old helpers
  - Makefile: fix build dependency generation

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

Anand Jain (2):
  btrfs-progs: add info about list-all to the help
  btrfs-progs: use function is_block_device() instead

Dimitri John Ledkov (1):
  btrfs-progs: fsck.btrfs: Fix bashism and bad getopts processing

Dongsheng Yang (4):
  btrfs-progs: qgroup: show 'none' when we did not limit it on this qgroup
  btrfs-progs: qgroup: allow user to clear some limitation on qgroup.
  btrfs-progs: qgroup limit: error out if input value is negative
  btrfs-progs: qgroup limit: add a check for invalid input of 'T/G/M/K'

Emil Karlson (1):
  btrfs-progs: use openat for process_clone in receive

Goffredo Baroncelli (4):
  btrfs-progs: add strdup in btrfs_add_to_fsid() to track the device path
  btrfs-progs: return the fsid from make_btrfs()
  btrfs-progs: mkfs: track sizes of created block groups
  btrfs-progs: mkfs: print the summary

Jeff Mahoney (8):
  btrfs-progs: convert: clean up blk_iterate_data handling wrt record_file_blocks
  btrfs-progs: convert: remove unused fs argument from block_iterate_proc
  btrfs-progs: convert: remove unused inode_key in copy_single_inode
  btrfs-progs: convert: rename ext2_root to image_root
  btrfs-progs: compat: define DIV_ROUND_UP if not already defined
  btrfs-progs: convert: fix typo in btrfs_insert_dir_item call
  btrfs-progs: convert: factor out adding dirent into convert_insert_dirent
  btrfs-progs: convert: factor out block iteration callback

Josef Bacik (3):
  Btrfs-progs: corrupt-block: add the ability to remove csums
  btrfs-progs: specify mountpoint for recieve
  btrfs-progs: make receive work inside of subvolumes

Qu Wenruo (6):
  btrfs-progs: Enhance read_tree_block to avoid memory corruption
  btrfs-progs: btrfstune: rework change_uuid
  btrfs-progs: btrfstune: add ability to restore unfinished fsid change
  btrfs-progs: btrfstune: add '-U' and '-u' option to change fsid
  btrfs-progs: Documentation: uuid change
  btrfs-progs: btrfstune: fix a bug which makes unfinished fsid change unrecoverable

Sam Tygier (1):
  btrfs-progs: mkfs: check metadata redundancy

David Sterba (73):
  btrfs-progs: tests: log the test name in results file
  btrfs-progs: tests: support more 

[PATCH] fstests: generic test for fsync after adding hard link to a file

2015-06-17 Thread fdmanana
From: Filipe Manana fdman...@suse.com

This test is motivated by an issue found in btrfs.

It tests that after syncing the filesystem, adding a hard link to a file,
syncing the filesystem again, doing a write to the file that increases
its size and then doing a fsync against that file, durably persists the
data written to the file. That is, after log/journal replay, the data
is available.
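
For readers who want to see the sequence at the syscall level, the following
is a rough C sketch of the same steps (write and fsync, sync, link, sync,
append and fsync). It is not part of fstests: the helper names are
hypothetical, and it does not simulate the crash, which the real test does
with dm-flakey. It only performs the workload and reports the resulting file
size.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK (32 * 1024)

/* Write CHUNK bytes of `byte` at `offset`, then fsync the file. */
static int write_and_fsync(int fd, unsigned char byte, off_t offset)
{
	static unsigned char buf[CHUNK];

	memset(buf, byte, sizeof(buf));
	if (pwrite(fd, buf, sizeof(buf), offset) != (ssize_t)sizeof(buf))
		return -1;
	return fsync(fd);
}

/*
 * Run the workload inside `dir` (hypothetical helper). Returns 0 on
 * success and stores the final size of "foo" in *size_out.
 */
static int append_after_link(const char *dir, off_t *size_out)
{
	char foo[4096], bar[4096];
	struct stat st;
	int fd, ret = -1;

	assert(dir != NULL && size_out != NULL);
	snprintf(foo, sizeof(foo), "%s/foo", dir);
	snprintf(bar, sizeof(bar), "%s/bar", dir);

	fd = open(foo, O_CREAT | O_TRUNC | O_RDWR, 0644);
	if (fd < 0)
		return -1;
	if (write_and_fsync(fd, 0xaa, 0))      /* initial data, fsync */
		goto out;
	sync();                                /* sync the filesystem */
	if (link(foo, bar))                    /* add a hard link */
		goto out;
	sync();                                /* sync the filesystem again */
	if (write_and_fsync(fd, 0xbb, CHUNK))  /* append, fsync */
		goto out;
	if (fstat(fd, &st))
		goto out;
	*size_out = st.st_size;
	ret = 0;
out:
	close(fd);
	return ret;
}
```

Pointing `dir` at a directory on the filesystem under test and cutting power
after the function returns would be the manual equivalent of what the test
automates.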

The btrfs issue is fixed by the commit titled:

  Btrfs: fix fsync data loss after append write

Signed-off-by: Filipe Manana fdman...@suse.com
---
 tests/generic/090 | 108 ++
 tests/generic/090.out |  17 
 tests/generic/group   |   1 +
 3 files changed, 126 insertions(+)
 create mode 100755 tests/generic/090
 create mode 100644 tests/generic/090.out

diff --git a/tests/generic/090 b/tests/generic/090
new file mode 100755
index 000..a1f2b89
--- /dev/null
+++ b/tests/generic/090
@@ -0,0 +1,108 @@
+#! /bin/bash
+# FS QA Test No. 090
+#
+# Test that after syncing the filesystem, adding a hard link to a file,
+# syncing the filesystem again, doing a write to the file that increases
+# its size and then doing a fsync against that file, durably persists the
+# data written to the file. That is, after log/journal replay, the data
+# is available.
+#
+# This test is motivated by a bug found in btrfs.
+#
+#---
+# Copyright (C) 2015 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana fdman...@suse.com
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo QA output created by $seq
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+
+_cleanup()
+{
+   _cleanup_flakey
+   rm -f $tmp.*
+}
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_need_to_be_root
+_require_scratch
+_require_dm_flakey
+_require_metadata_journaling $SCRATCH_DEV
+
+rm -f $seqres.full
+
+_scratch_mkfs >> $seqres.full 2>&1
+_init_flakey
+_mount_flakey
+
+# Create the test file with some initial data and then fsync it.
+# The fsync here is only needed to trigger the issue in btrfs, as it causes
+# the flag BTRFS_INODE_NEEDS_FULL_SYNC to be removed from the btrfs inode.
+$XFS_IO_PROG -f -c pwrite -S 0xaa 0 32k \
+   -c fsync \
+   $SCRATCH_MNT/foo | _filter_xfs_io
+sync
+
+# Add a hard link to our file.
+# On btrfs this sets the flag BTRFS_INODE_COPY_EVERYTHING on the btrfs inode,
+# which is a necessary condition to trigger the issue.
+ln $SCRATCH_MNT/foo $SCRATCH_MNT/bar
+
+# Sync the filesystem to force a commit of the current btrfs transaction, this
+# is a necessary condition to trigger the bug on btrfs.
+sync
+
+# Now append more data to our file, increasing its size, and fsync the file.
+# In btrfs because the inode flag BTRFS_INODE_COPY_EVERYTHING was set and the
+# write path did not update the inode item in the btree nor the delayed inode
+# item (in memory structure) in the current transaction (created by the fsync
+# handler), the fsync did not record the inode's new i_size in the fsync
+# log/journal. This made the data unavailable after the fsync log/journal is
+# replayed.
+$XFS_IO_PROG -c pwrite -S 0xbb 32K 32K \
+   -c fsync \
+   $SCRATCH_MNT/foo | _filter_xfs_io
+
+echo File content after fsync and before crash:
+od -t x1 $SCRATCH_MNT/foo
+
+# Simulate a crash/power loss.
+_load_flakey_table $FLAKEY_DROP_WRITES
+_unmount_flakey
+
+# Allow writes again and mount. This makes the fs replay its fsync log.
+_load_flakey_table $FLAKEY_ALLOW_WRITES
+_mount_flakey
+
+echo File content after crash and log replay:
+od -t x1 $SCRATCH_MNT/foo
+
+status=0
+exit
diff --git a/tests/generic/090.out b/tests/generic/090.out
new file mode 100644
index 000..4a4423a
--- /dev/null
+++ b/tests/generic/090.out
@@ -0,0 +1,17 @@
+QA output created by 090
+wrote 32768/32768 bytes at offset 0
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+wrote 32768/32768 bytes at offset 32768
+XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
+File content after fsync 

[RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION

2015-06-17 Thread Liu Bo
MS_I_VERSION is enabled by default for btrfs; this adds a mount option,
noi_version, to turn it off.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/super.c |7 ++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 05fef19..e610e3e 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -324,7 +324,7 @@ enum {
Opt_check_integrity_print_mask, Opt_fatal_errors, Opt_rescan_uuid_tree,
Opt_commit_interval, Opt_barrier, Opt_nodefrag, Opt_nodiscard,
Opt_noenospc_debug, Opt_noflushoncommit, Opt_acl, Opt_datacow,
-   Opt_datasum, Opt_treelog, Opt_noinode_cache,
+   Opt_datasum, Opt_treelog, Opt_noinode_cache, Opt_noi_version,
Opt_err,
 };
 
@@ -351,6 +351,7 @@ static match_table_t tokens = {
{Opt_nossd, "nossd"},
{Opt_acl, "acl"},
{Opt_noacl, "noacl"},
+   {Opt_noi_version, "noi_version"},
{Opt_notreelog, "notreelog"},
{Opt_treelog, "treelog"},
{Opt_flushoncommit, "flushoncommit"},
@@ -593,6 +594,10 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
case Opt_noacl:
root->fs_info->sb->s_flags &= ~MS_POSIXACL;
break;
+   case Opt_noi_version:
+   root->fs_info->sb->s_flags &= ~MS_I_VERSION;
+   btrfs_info(root->fs_info, "disable i_version");
+   break;
case Opt_notreelog:
btrfs_set_and_info(root, NOTREELOG,
   "disabling tree log");
-- 
1.7.7.6

--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
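
For anyone unfamiliar with how such a "no..." mount option works: it clears
one bit in the superblock flags with a bitwise AND-NOT. A standalone C sketch
of that idea follows. The DEMO_* constants and demo_parse_options() are
hypothetical stand-ins for illustration only; this is not the kernel's option
parser.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the superblock flag bits. */
#define DEMO_POSIXACL  (1UL << 16)
#define DEMO_I_VERSION (1UL << 23)

/* Apply a comma-separated option string to `flags` and return the result. */
static unsigned long demo_parse_options(const char *options,
					unsigned long flags)
{
	const char *p = options;

	assert(options != NULL);
	while (*p) {
		size_t n = strcspn(p, ",");  /* length of the next token */

		if (n == strlen("noacl") && strncmp(p, "noacl", n) == 0)
			flags &= ~DEMO_POSIXACL;
		else if (n == strlen("noi_version") &&
			 strncmp(p, "noi_version", n) == 0)
			flags &= ~DEMO_I_VERSION;  /* the new option */
		p += n;
		if (*p == ',')
			p++;
	}
	return flags;
}
```

Unknown tokens are ignored in this sketch, so passing "ssd" leaves the flags
unchanged.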


[RFC PATCH v2 2/2] Btrfs: improve fsync for nocow file

2015-06-17 Thread Liu Bo
If we're overwriting an already allocated range of a NODATACOW file without
changing its timestamps or inode version, there is no metadata to commit, so
we can just flush the data device cache and return.

However, if there is any change to the extents' disk bytenr, the inode size,
the timestamps or the inode version, we need to go through the normal
btrfs_log_inode path.
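
The decision described above boils down to one predicate. Here it is as a
small C sketch. The struct and its field names are hypothetical; the patch
itself tracks this state with the runtime flags BTRFS_INODE_NOTIMESTAMP and
BTRFS_INODE_NOISIZE shown in the diff below, and the exact condition it
checks may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-inode state, for illustration only. */
struct demo_inode_state {
	bool nodatacow;          /* file is NODATACOW */
	bool extents_changed;    /* some extent got a new disk bytenr */
	bool size_changed;       /* inode size changed */
	bool timestamps_changed; /* mtime or ctime changed */
	bool version_changed;    /* inode version was bumped */
};

/* True when fsync has no metadata to log and a data flush is enough. */
static bool demo_can_skip_log(const struct demo_inode_state *s)
{
	assert(s != NULL);
	return s->nodatacow &&
	       !s->extents_changed &&
	       !s->size_changed &&
	       !s->timestamps_changed &&
	       !s->version_changed;
}
```

With i_version enabled every write bumps the inode version, so the fast path
can only be taken when the filesystem is mounted with noi_version.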

Test:

1. sysbench test of
1 file + 1 thread + bs=4k + size=40k + synchronous I/O mode + randomwrite +
fsync_on_each_write,
2. loop device associated with tmpfs file
3.
  - For btrfs, -o nodatacow and -o noi_version option
  - For ext4 and xfs, no extra mount options


Results:

- btrfs:
w/o: ~30Mb/sec
w:   ~131Mb/sec

- other filesystems: (both don't enable i_version by default)
ext4:  203Mb/sec
xfs:   212Mb/sec


Signed-off-by: Liu Bo bo.li@oracle.com
---
v2: Catch errors from data writeback and skip barrier if necessary.

 fs/btrfs/btrfs_inode.h |2 +
 fs/btrfs/disk-io.c |2 +-
 fs/btrfs/disk-io.h |1 +
 fs/btrfs/file.c|   54 +++
 fs/btrfs/inode.c   |3 ++
 5 files changed, 56 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index de5e4f2..f7b99b6 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -44,6 +44,8 @@
 #define BTRFS_INODE_IN_DELALLOC_LIST   9
 #define BTRFS_INODE_READDIO_NEED_LOCK  10
 #define BTRFS_INODE_HAS_PROPS  11
+#define BTRFS_INODE_NOTIMESTAMP        12
+#define BTRFS_INODE_NOISIZE            13
 /*
  * The following 3 bits are meant only for the btree inode.
  * When any of them is set, it means an error happened while writing an
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 639f266..8a41df1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3299,7 +3299,7 @@ static int write_dev_flush(struct btrfs_device *device, 
int wait)
  * send an empty flush down to each device in parallel,
  * then wait for them
  */
-static int barrier_all_devices(struct btrfs_fs_info *info)
+int barrier_all_devices(struct btrfs_fs_info *info)
 {
struct list_head *head;
struct btrfs_device *dev;
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 27d44c0..bea982c 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -60,6 +60,7 @@ void close_ctree(struct btrfs_root *root);
 int write_ctree_super(struct btrfs_trans_handle *trans,
  struct btrfs_root *root, int max_mirrors);
 struct buffer_head *btrfs_read_dev_super(struct block_device *bdev);
+int barrier_all_devices(struct btrfs_fs_info *info);
 int btrfs_commit_super(struct btrfs_root *root);
 struct extent_buffer *btrfs_find_tree_block(struct btrfs_root *root,
u64 bytenr);
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index faa7d39..180a3e1 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -523,8 +523,12 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct 
inode *inode,
 * the disk i_size.  There is no need to log the inode
 * at this time.
 */
-   if (end_pos > isize)
+   if (end_pos > isize) {
i_size_write(inode, end_pos);
+   clear_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
+   } else {
+   set_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
+   }
return 0;
 }
 
@@ -1715,19 +1719,33 @@ out:
 static void update_time_for_write(struct inode *inode)
 {
struct timespec now;
+   int sync_it = 0;
 
-   if (IS_NOCMTIME(inode))
+   if (IS_NOCMTIME(inode)) {
+   set_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
return;
+   }
 
now = current_fs_time(inode->i_sb);
-   if (!timespec_equal(&inode->i_mtime, &now))
+   if (!timespec_equal(&inode->i_mtime, &now)) {
inode->i_mtime = now;
+   sync_it = S_MTIME;
+   }
 
-   if (!timespec_equal(&inode->i_ctime, &now))
+   if (!timespec_equal(&inode->i_ctime, &now)) {
inode->i_ctime = now;
+   sync_it |= S_CTIME;
+   }
 
-   if (IS_I_VERSION(inode))
+   if (IS_I_VERSION(inode)) {
inode_inc_iversion(inode);
+   sync_it |= S_VERSION;
+   }
+
+   if (!sync_it)
+   set_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
+   else
+   clear_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
 }
 
 static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
@@ -1983,6 +2001,32 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
goto out;
}
 
+   if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) {
+   if 

Re: RAID1: system stability

2015-06-17 Thread Timofey Titovets
Update:
I tried removing the disk the 'right' way:
# echo 1 > /sys/block/sdf/device/delete

That went fine: the system doesn't crash immediately on a 'sync' call and can
work for some time without problems. After a while, though, the kernel on the
test system (where I deleted one of the raid1 btrfs devices) crashes. I can
reproduce it with:
  # apt-get update
and I get the following dmesg:

Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Modules linked in: 8021q
garp mrp stp llc binfmt_misc gpio_ich coretemp kvm_intel lpc_ich
ipmi_ssif kvm amdkfd amd_iommu_v2 serio_raw radeon ttm i5000_edac
drm_kms_helper drm edac_core i2c_algo_bit i5k_amb ioatdma dca shpchp
8250_fintek joydev mac_hid ipmi_si ipmi_msghandler bonding autofs4
btrfs xor raid6_pq ses enclosure hid_generic psmouse usbhid hid mptsas
mptscsih e1000e mptbase scsi_transport_sas ptp pps_core
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: CPU: 3 PID: 99 Comm:
kworker/u16:4 Not tainted 4.0.4-040004-generic #201505171336
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Hardware name: Intel
S5000VSA/S5000VSA, BIOS S5000.86B.15.00.0101.110920101604 11/09/2010
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Workqueue: btrfs-endio
btrfs_endio_helper [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: task: 88009ab31400
ti: 88009ab4 task.ti: 88009ab4
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RIP:
0010:[c0477d50]  [c0477d50]
repair_io_failure+0x1c0/0x200 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RSP:
0018:88009ab43bb8  EFLAGS: 00010206
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RAX: 
RBX: 88009b1d3f30 RCX: 88009b53f9c0
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RDX: 88044902f400
RSI:  RDI: 88009b53f9c0
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RBP: 88009ab43c18
R08:  R09: 
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: R10: 880448c1b090
R11:  R12: 3907
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: R13: 880439599e68
R14: 1000 R15: 88009a86
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: FS:
() GS:88045fcc()
knlGS:
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: CS:  0010 DS:  ES:
 CR0: 8005003b
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: CR2: 7f640a27e675
CR3: 98b4b000 CR4: 000407e0
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Stack:
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  
9a860de0 ea0002644380 0003d2ee8000
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  8000
88009b53f9c0 88009ab43c18 88009b1d3f30
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  88044c44a3c0
88009b0c1190  88009a86
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Call Trace:
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [c0477f30]
clean_io_failure+0x1a0/0x1b0 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [c0478218]
end_bio_extent_readpage+0x2d8/0x3d0 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [8137b2c3]
bio_endio+0x53/0xa0
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [8137b322]
bio_endio_nodec+0x12/0x20
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [c044efb8]
end_workqueue_fn+0x48/0x60 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [c0488b2e]
normal_work_helper+0x7e/0x1b0 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [c0488d32]
btrfs_endio_helper+0x12/0x20 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [81092204]
process_one_work+0x144/0x490
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [81092c6e]
worker_thread+0x11e/0x450
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [81092b50] ?
create_worker+0x1f0/0x1f0
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [81098999]
kthread+0xc9/0xe0
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [810988d0] ?
flush_kthread_worker+0x90/0x90
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [817f08d8]
ret_from_fork+0x58/0x90
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  [810988d0] ?
flush_kthread_worker+0x90/0x90
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: Code: 44 00 00 4c 89 ef
e8 b0 34 f0 c0 31 f6 4c 89 e7 e8 06 05 01 00 ba fb ff ff ff e9 c7 fe
ff ff ba fb ff ff ff e9 bd fe ff ff 0f 0b 0f 0b 49 8b 4c 24 30 48 8b
b3 58 fe ff ff 48 83 c1 10 48 85 f6
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: RIP  [c0477d50]
repair_io_failure+0x1c0/0x200 [btrfs]
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel:  RSP 88009ab43bb8
Jun 17 12:00:41 srv-lab-ceph-node-01 kernel: ---[ end trace
0361c6fdca5f7ee2 ]---
---

Another test case:
I deleted the device:
echo 1 > /sys/block/sdf/device/delete
and then reinserted it (physically removed it from the server and plugged it
back in). The server detected it as sdg, which is fine, but the kernel crashed
with the following stack trace:
---
Jun 17 12:08:35 srv-lab-ceph-node-01 kernel: kernel BUG at

Re: [PATCH 3/7] btrfs: skip superblocks during discard

2015-06-17 Thread Filipe David Manana
On Mon, Jun 15, 2015 at 2:41 PM,  je...@suse.com wrote:
 From: Jeff Mahoney je...@suse.com

 Btrfs doesn't track superblocks with extent records so there is nothing
 persistent on-disk to indicate that those blocks are in use.  We track
 the superblocks in memory to ensure they don't get used by removing them
 from the free space cache when we load a block group from disk.  Prior
 to 47ab2a6c6a (Btrfs: remove empty block groups automatically), that
 was fine since the block group would never be reclaimed so the superblock
 was always safe.  Once we started removing the empty block groups, we
 were protected by the fact that discards weren't being properly issued
 for unused space either via FITRIM or -odiscard.  The block groups were
 still being released, but the blocks remained on disk.

 In order to properly discard unused block groups, we need to filter out
 the superblocks from the discard range.  Superblocks are located at fixed
 locations on each device, so it makes sense to filter them out in
 btrfs_issue_discard, which is used by both -odiscard and FITRIM.

 Signed-off-by: Jeff Mahoney je...@suse.com
Reviewed-by: Filipe Manana fdman...@suse.com
Tested-by: Filipe Manana fdman...@suse.com

 ---
  fs/btrfs/extent-tree.c | 59 
 ++
  1 file changed, 55 insertions(+), 4 deletions(-)

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index cf9cefd..1e44b93 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -1884,10 +1884,12 @@ static int remove_extent_backref(struct 
 btrfs_trans_handle *trans,
 return ret;
  }

 +#define in_range(b, first, len)        ((b) >= (first) && (b) < (first) + (len))
  static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
u64 *discarded_bytes)
  {
 -   int ret = 0;
 +   int j, ret = 0;
 +   u64 bytes_left, end;
 u64 aligned_start = ALIGN(start, 1 << 9);

 if (WARN_ON(start != aligned_start)) {
 @@ -1897,11 +1899,60 @@ static int btrfs_issue_discard(struct block_device 
 *bdev, u64 start, u64 len,
 }

 *discarded_bytes = 0;
 -   if (len) {
 -   ret = blkdev_issue_discard(bdev, start >> 9, len >> 9,
 +
 +   if (!len)
 +   return 0;
 +
 +   end = start + len;
 +   bytes_left = len;
 +
 +   /* Skip any superblocks on this device. */
 +   for (j = 0; j < BTRFS_SUPER_MIRROR_MAX; j++) {
 +   u64 sb_start = btrfs_sb_offset(j);
 +   u64 sb_end = sb_start + BTRFS_SUPER_INFO_SIZE;
 +   u64 size = sb_start - start;
 +
 +   if (!in_range(sb_start, start, bytes_left) &&
 +   !in_range(sb_end, start, bytes_left) &&
 +   !in_range(start, sb_start, BTRFS_SUPER_INFO_SIZE))
 +   continue;
 +
 +   /*
 +* Superblock spans beginning of range.  Adjust start and
 +* try again.
 +*/
 +   if (sb_start <= start) {
 +   start += sb_end - start;
 +   if (start > end) {
 +   bytes_left = 0;
 +   break;
 +   }
 +   bytes_left = end - start;
 +   continue;
 +   }
 +
 +   if (size) {
 +   ret = blkdev_issue_discard(bdev, start >> 9, size >> 9,
 +  GFP_NOFS, 0);
 +   if (!ret)
 +   *discarded_bytes += size;
 +   else if (ret != -EOPNOTSUPP)
 +   return ret;
 +   }
 +
 +   start = sb_end;
 +   if (start > end) {
 +   bytes_left = 0;
 +   break;
 +   }
 +   bytes_left = end - start;
 +   }
 +
 +   if (bytes_left) {
 +   ret = blkdev_issue_discard(bdev, start >> 9, bytes_left >> 9,
GFP_NOFS, 0);
 if (!ret)
 -   *discarded_bytes = len;
 +   *discarded_bytes += bytes_left;
 }
 return ret;
  }
 --
 2.4.3




-- 
Filipe David Manana,

Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men.


Re: [PATCH 7/7] btrfs: cleanup, stop casting for extent_map->lookup everywhere

2015-06-17 Thread Filipe David Manana
On Mon, Jun 15, 2015 at 2:41 PM,  je...@suse.com wrote:
 From: Jeff Mahoney je...@suse.com

 Overloading extent_map->bdev to struct map_lookup * might have started out
 as a means to an end, but it's a pattern that's used all over the place
 now. Let's get rid of the casting and just add a union instead.

 Signed-off-by: Jeff Mahoney je...@suse.com
Reviewed-by: Filipe Manana fdman...@suse.com
Tested-by: Filipe Manana fdman...@suse.com
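
For reference, the pattern being introduced is an anonymous union: two
pointer members share the same storage, and a flag says which one is valid,
so callers no longer need a cast. A minimal standalone C sketch follows. The
demo_* types are hypothetical stand-ins, and a plain bool replaces the
EXTENT_FLAG_FS_MAPPING bit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct block_device and struct map_lookup. */
struct demo_block_device { int id; };
struct demo_map_lookup { int num_stripes; };

struct demo_extent_map {
	bool fs_mapping;  /* stands in for the EXTENT_FLAG_FS_MAPPING bit */
	union {
		struct demo_block_device *bdev;      /* regular extents */
		struct demo_map_lookup *map_lookup;  /* chunk mappings only */
	};
};

/* Typed access without a cast; only valid for chunk mappings. */
static int demo_num_stripes(const struct demo_extent_map *em)
{
	assert(em != NULL && em->fs_mapping);
	return em->map_lookup->num_stripes;
}
```

The struct does not grow, since both members occupy the same pointer-sized
slot, and the compiler now checks the pointer type at each use.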

 ---
  fs/btrfs/dev-replace.c |  2 +-
  fs/btrfs/extent_map.c  |  2 +-
  fs/btrfs/extent_map.h  | 10 +-
  fs/btrfs/scrub.c   |  2 +-
  fs/btrfs/volumes.c | 24 
  5 files changed, 24 insertions(+), 16 deletions(-)

 diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
 index 0573848..2ad3289 100644
 --- a/fs/btrfs/dev-replace.c
 +++ b/fs/btrfs/dev-replace.c
 @@ -613,7 +613,7 @@ static void 
 btrfs_dev_replace_update_device_in_mapping_tree(
 em = lookup_extent_mapping(em_tree, start, (u64)-1);
 if (!em)
 break;
 -   map = (struct map_lookup *)em->bdev;
 +   map = em->map_lookup;
 for (i = 0; i < map->num_stripes; i++)
 if (srcdev == map->stripes[i].dev)
 map->stripes[i].dev = tgtdev;
 diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
 index 6a98bdd..84fb56d 100644
 --- a/fs/btrfs/extent_map.c
 +++ b/fs/btrfs/extent_map.c
 @@ -76,7 +76,7 @@ void free_extent_map(struct extent_map *em)
 WARN_ON(extent_map_in_tree(em));
 WARN_ON(!list_empty(&em->list));
 if (test_bit(EXTENT_FLAG_FS_MAPPING, &em->flags))
 -   kfree(em->bdev);
 +   kfree(em->map_lookup);
 kmem_cache_free(extent_map_cache, em);
 }
  }
 diff --git a/fs/btrfs/extent_map.h b/fs/btrfs/extent_map.h
 index b2991fd..eb8b8fa 100644
 --- a/fs/btrfs/extent_map.h
 +++ b/fs/btrfs/extent_map.h
 @@ -32,7 +32,15 @@ struct extent_map {
 u64 block_len;
 u64 generation;
 unsigned long flags;
 -   struct block_device *bdev;
 +   union {
 +   struct block_device *bdev;
 +
 +   /*
 +* used for chunk mappings
 +* flags & EXTENT_FLAG_FS_MAPPING must be set
 +*/
 +   struct map_lookup *map_lookup;
 +   };
 atomic_t refs;
 unsigned int compress_type;
 struct list_head list;
 diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
 index ab58115..19f7241d 100644
 --- a/fs/btrfs/scrub.c
 +++ b/fs/btrfs/scrub.c
 @@ -3339,7 +3339,7 @@ static noinline_for_stack int scrub_chunk(struct 
 scrub_ctx *sctx,
 if (!em)
 return -EINVAL;

 -   map = (struct map_lookup *)em->bdev;
 +   map = em->map_lookup;
 if (em->start != chunk_offset)
 goto out;

 diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
 index 7fdde31..9f48ae5 100644
 --- a/fs/btrfs/volumes.c
 +++ b/fs/btrfs/volumes.c
 @@ -1068,7 +1068,7 @@ again:
 struct map_lookup *map;
 int i;

 -   map = (struct map_lookup *)em->bdev;
 +   map = em->map_lookup;
 for (i = 0; i < map->num_stripes; i++) {
 if (map->stripes[i].dev != device)
 continue;
 @@ -2622,7 +2622,7 @@ int btrfs_remove_chunk(struct btrfs_trans_handle *trans,
 free_extent_map(em);
 return -EINVAL;
 }
 -   map = (struct map_lookup *)em->bdev;
 +   map = em->map_lookup;

 for (i = 0; i < map->num_stripes; i++) {
 struct btrfs_device *device = map->stripes[i].dev;
 @@ -4465,7 +4465,7 @@ static int __btrfs_alloc_chunk(struct 
 btrfs_trans_handle *trans,
 goto error;
 }
 set_bit(EXTENT_FLAG_FS_MAPPING, &em->flags);
 -   em->bdev = (struct block_device *)map;
 +   em->map_lookup = map;
 em->start = start;
 em->len = num_bytes;
 em->block_start = 0;
 @@ -4560,7 +4560,7 @@ int btrfs_finish_chunk_alloc(struct btrfs_trans_handle 
 *trans,
 return -EINVAL;
 }

 -   map = (struct map_lookup *)em->bdev;
 +   map = em->map_lookup;
 item_size = btrfs_chunk_item_size(map->num_stripes);
 stripe_size = em->orig_block_len;

 @@ -4702,7 +4702,7 @@ int btrfs_chunk_readonly(struct btrfs_root *root, u64 
 chunk_offset)
 if (!em)
 return 1;

 -   map = (struct map_lookup *)em->bdev;
 +   map = em->map_lookup;
 for (i = 0; i < map->num_stripes; i++) {
 if (map->stripes[i].dev->missing) {
 miss_ndevs++;
 @@ -4782,7 +4782,7 @@ int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 
 logical, u64 len)
 return 1;
 }

 -   map = (struct map_lookup *)em->bdev;
 +   map = 

Re: [PATCH 4/7] btrfs: iterate over unused chunk space in FITRIM

2015-06-17 Thread Filipe David Manana
On Mon, Jun 15, 2015 at 2:41 PM,  je...@suse.com wrote:
 From: Jeff Mahoney je...@suse.com

 Since we now clean up block groups automatically as they become
 empty, iterating over block groups is no longer sufficient to discard
 unused space.

 This patch iterates over the unused chunk space and discards any regions
 that are unallocated, regardless of whether they were ever used.  This is
 a change for btrfs but is consistent with other file systems.

 We do this in a transactionless manner since the discard process can take
 a substantial amount of time and a transaction would need to be started
 before the acquisition of the device list lock.  That would mean a
 transaction would be held open across /all/ of the discards collectively.
 In order to prevent other threads from allocating or freeing chunks, we
 hold the chunks lock across the search and discard calls.  We release it
 between searches to allow the file system to perform more-or-less
 normally.  Since the running transaction can commit and disappear while
 we're using the transaction pointer, we take a reference to it and
 release it after the search.  This is safe since it would happen normally
 at the end of the transaction commit after any locks are released anyway.
 We also take the commit_root_sem to protect against a transaction starting
 and committing while we're running.
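The "take a reference, use it, release it" step described above can be sketched in userspace C. This is only an illustration under stated assumptions: struct txn, struct fs_state and the helper names are hypothetical, a pthread mutex stands in for trans_lock, and the count is a plain int protected by that mutex, whereas the patch uses atomic_inc() on use_count and btrfs_put_transaction().

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for a running transaction. */
struct txn {
	int use_count;          /* protected by fs_state.lock in this sketch */
};

/* Hypothetical stand-in for the filesystem state that owns the pointer. */
struct fs_state {
	pthread_mutex_t lock;   /* plays the role of trans_lock */
	struct txn *running;    /* cleared when the transaction commits */
};

static void fs_init(struct fs_state *fs)
{
	pthread_mutex_init(&fs->lock, NULL);
	fs->running = malloc(sizeof(*fs->running));
	assert(fs->running != NULL);
	fs->running->use_count = 1;     /* the reference held via fs->running */
}

/* Read the pointer and pin it while still holding the lock. */
static struct txn *txn_get_running(struct fs_state *fs)
{
	struct txn *t;

	pthread_mutex_lock(&fs->lock);
	t = fs->running;
	if (t)
		t->use_count++;
	pthread_mutex_unlock(&fs->lock);
	return t;
}

/* Drop one reference; the last put frees the object. Returns 1 if freed. */
static int txn_put(struct fs_state *fs, struct txn *t)
{
	int freed = 0;

	pthread_mutex_lock(&fs->lock);
	if (--t->use_count == 0) {
		free(t);
		freed = 1;
	}
	pthread_mutex_unlock(&fs->lock);
	return freed;
}

/* Commit: detach the transaction and drop the reference fs->running held. */
static int txn_commit(struct fs_state *fs)
{
	struct txn *t;

	pthread_mutex_lock(&fs->lock);
	t = fs->running;
	fs->running = NULL;
	pthread_mutex_unlock(&fs->lock);
	return t ? txn_put(fs, t) : 0;
}
```

A caller that pinned the transaction with txn_get_running() can keep dereferencing it after txn_commit() has run; the memory only goes away on the caller's own txn_put().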

 Signed-off-by: Jeff Mahoney je...@suse.com
Reviewed-by: Filipe Manana fdman...@suse.com
Tested-by: Filipe Manana fdman...@suse.com

Side note: this still doesn't apply cleanly on the latest integration
branch (integration-4.2). It results in warnings about casting a pointer
from a different type (btrfs_trans_handle to btrfs_transaction) at
btrfs_shrink_device().

 ---
  fs/btrfs/extent-tree.c | 101 
 +
  fs/btrfs/volumes.c |  60 ++---
  fs/btrfs/volumes.h |   3 ++
  3 files changed, 141 insertions(+), 23 deletions(-)

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 1e44b93..24b48df 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -10143,10 +10143,99 @@ int btrfs_error_unpin_extent_range(struct 
 btrfs_root *root, u64 start, u64 end)
 return unpin_extent_range(root, start, end, false);
  }

 +/*
 + * It used to be that old block groups would be left around forever.
 + * Iterating over them would be enough to trim unused space.  Since we
 + * now automatically remove them, we also need to iterate over unallocated
 + * space.
 + *
 + * We don't want a transaction for this since the discard may take a
 + * substantial amount of time.  We don't require that a transaction be
 + * running, but we do need to take a running transaction into account
 + * to ensure that we're not discarding chunks that were released in
 + * the current transaction.
 + *
 + * Holding the chunks lock will prevent other threads from allocating
 + * or releasing chunks, but it won't prevent a running transaction
 + * from committing and releasing the memory that the pending chunks
 + * list head uses.  For that, we need to take a reference to the
 + * transaction.
 + */
 +static int btrfs_trim_free_extents(struct btrfs_device *device,
 +  u64 minlen, u64 *trimmed)
 +{
 +   u64 start = 0, len = 0;
 +   int ret;
 +
 +   *trimmed = 0;
 +
 +   /* Not writeable = nothing to do. */
 +   if (!device->writeable)
 +   return 0;
 +
 +   /* No free space = nothing to do. */
 +   if (device->total_bytes <= device->bytes_used)
 +   return 0;
 +
 +   ret = 0;
 +
 +   while (1) {
 +   struct btrfs_fs_info *fs_info = device->dev_root->fs_info;
 +   struct btrfs_transaction *trans;
 +   u64 bytes;
 +
 +   ret = mutex_lock_interruptible(&fs_info->chunk_mutex);
 +   if (ret)
 +   return ret;
 +
 +   down_read(&fs_info->commit_root_sem);
 +
 +   spin_lock(&fs_info->trans_lock);
 +   trans = fs_info->running_transaction;
 +   if (trans)
 +   atomic_inc(&trans->use_count);
 +   spin_unlock(&fs_info->trans_lock);
 +
 +   ret = find_free_dev_extent_start(trans, device, minlen, start,
 +&start, &len);
 +   if (trans)
 +   btrfs_put_transaction(trans);
 +
 +   if (ret) {
 +   up_read(&fs_info->commit_root_sem);
 +   mutex_unlock(&fs_info->chunk_mutex);
 +   if (ret == -ENOSPC)
 +   ret = 0;
 +   break;
 +   }
 +
 +   ret = btrfs_issue_discard(device->bdev, start, len, &bytes);
 +   up_read(&fs_info->commit_root_sem);
 +   mutex_unlock(&fs_info->chunk_mutex);
 +
 +   if (ret)
 +   break;
 +
 +   

Re: [PATCH 5/7] btrfs: explictly delete unused block groups in close_ctree and ro-remount

2015-06-17 Thread Filipe David Manana
On Mon, Jun 15, 2015 at 2:41 PM,  je...@suse.com wrote:
 From: Jeff Mahoney je...@suse.com

 The cleaner thread may already be sleeping by the time we enter
 close_ctree.  If that's the case, we'll skip removing any unused
 block groups queued for removal, even during a normal umount.
 They'll be cleaned up automatically at next mount, but users
 expect a umount to be a clean synchronization point, especially
 when used on thin-provisioned storage with -odiscard.  We also
 explicitly remove unused block groups in the ro-remount path
 for the same reason.

 Signed-off-by: Jeff Mahoney je...@suse.com
Reviewed-by: Filipe Manana fdman...@suse.com
Tested-by: Filipe Manana fdman...@suse.com

 ---
  fs/btrfs/disk-io.c |  9 +
  fs/btrfs/super.c   | 11 +++
  2 files changed, 20 insertions(+)

 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 2ef9a4b..2e47fef 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -3710,6 +3710,15 @@ void close_ctree(struct btrfs_root *root)
 cancel_work_sync(&fs_info->async_reclaim_work);

 if (!(fs_info->sb->s_flags & MS_RDONLY)) {
 +   /*
 +* If the cleaner thread is stopped and there are
 +* block groups queued for removal, the deletion will be
 +* skipped when we quit the cleaner thread.
 +*/
 +   mutex_lock(&root->fs_info->cleaner_mutex);
 +   btrfs_delete_unused_bgs(root->fs_info);
 +   mutex_unlock(&root->fs_info->cleaner_mutex);
 +
 ret = btrfs_commit_super(root);
 if (ret)
 btrfs_err(fs_info, "commit super ret %d", ret);
 diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
 index 9e66f5e..2ccd8d4 100644
 --- a/fs/btrfs/super.c
 +++ b/fs/btrfs/super.c
 @@ -1539,6 +1539,17 @@ static int btrfs_remount(struct super_block *sb, int 
 *flags, char *data)

 sb->s_flags |= MS_RDONLY;

 +   /*
 +* Setting MS_RDONLY will put the cleaner thread to
 +* sleep at the next loop if it's already active.
 +* If it's already asleep, we'll leave unused block
 +* groups on disk until we're mounted read-write again
 +* unless we clean them up here.
 +*/
 +   mutex_lock(&root->fs_info->cleaner_mutex);
 +   btrfs_delete_unused_bgs(fs_info);
 +   mutex_unlock(&root->fs_info->cleaner_mutex);
 +
 btrfs_dev_replace_suspend_for_unmount(fs_info);
 btrfs_scrub_cancel(fs_info);
 btrfs_pause_balance(fs_info);
 --
 2.4.3

 --
 To unsubscribe from this list: send the line unsubscribe linux-btrfs in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men.


Re: BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)

2015-06-17 Thread Hugo Mills
On Wed, Jun 17, 2015 at 12:16:54AM -0700, Marc MERLIN wrote:
 I had a few power offs due to a faulty power supply, and my mdadm raid5
 got into fail mode after 2 drives got kicked out since their sequence
 numbers didn't match due to the abrupt power offs.
 
 I brought the swraid5 back up by force assembling it with 4 drives (one
 was really only a few sequence numbers behind), and it's doing a full
 parity rebuild on the 5th drive that was farther behind.
 
 So I can understand how I may have had a few blocks that are in a bad
 state.
 I'm getting a few (not many) of those messages in syslog.
 BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 
 sector 459432)
 
 Filesystem looks like this:
 Label: 'btrfs_pool1'  uuid: 6358304a-2234-4243-b02d-4944c9af47d7
 Total devices 1 FS bytes used 8.29TiB
 devid1 size 14.55TiB used 8.32TiB path /dev/mapper/dshelf1
 
 gargamel:~# btrfs fi df /mnt/btrfs_pool1
 Data, single: total=8.29TiB, used=8.28TiB
 System, DUP: total=8.00MiB, used=920.00KiB
 System, single: total=4.00MiB, used=0.00B
 Metadata, DUP: total=14.00GiB, used=10.58GiB
 Metadata, single: total=8.00MiB, used=0.00B
 GlobalReserve, single: total=512.00MiB, used=0.00B
 
 Kernel 3.19.8.
 
 Just to make sure I understand, do those messages in syslog mean that my
 metadata got corrupted a bit, but because I have 2 copies, btrfs can fix
 the bad copy by using the good one?
 
 Also, if my actual data got corrupted, am I correct that btrfs will
 detect the checksum failure and give me a different error message of a
 read error that cannot be corrected?

   Yes, that's my reading of the situation. Note that the 3.19 kernel
is the earliest I would expect this to be able to happen, as it's the
first kernel that actually had the full set of parity RAID repair code
in it.
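The repair behaviour asked about above (two DUP copies, one of them damaged) can be sketched in userspace C. This is a toy model under stated assumptions: struct block, toy_csum() and read_dup() are hypothetical names, the checksum is FNV-1a rather than the crc32c btrfs uses, and the block size is arbitrary. It only illustrates the idea of "verify the checksum, fall back to the other copy, rewrite the bad one".

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16   /* arbitrary size for the sketch */

/* Hypothetical on-disk block: payload plus its stored checksum. */
struct block {
	uint8_t  data[BLOCK_SIZE];
	uint32_t csum;
};

/* Toy checksum (FNV-1a); an assumption, not what btrfs uses. */
static uint32_t toy_csum(const uint8_t *p, size_t n)
{
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < n; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

static int block_ok(const struct block *b)
{
	return toy_csum(b->data, BLOCK_SIZE) == b->csum;
}

enum read_result { READ_OK, READ_CORRECTED, READ_FAILED };

/*
 * Read from the first copy; if its checksum fails, try the second copy
 * and, when that one verifies, rewrite the first copy from it.
 */
static enum read_result read_dup(struct block copies[2],
				 uint8_t out[BLOCK_SIZE])
{
	if (block_ok(&copies[0])) {
		memcpy(out, copies[0].data, BLOCK_SIZE);
		return READ_OK;
	}
	if (!block_ok(&copies[1]))
		return READ_FAILED;     /* no good copy left: uncorrectable */

	memcpy(out, copies[1].data, BLOCK_SIZE);
	copies[0] = copies[1];          /* repair the bad copy in place */
	return READ_CORRECTED;
}
```

With a "single" profile there is no second copy to fall back to, which corresponds to the READ_FAILED branch here.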

 I'll do a scrub later, for now I have to wait 20 hours for the raid rebuild
 first.

   You'll probably find that the rebuild is equivalent to a scrub anyway.

   Hugo.

-- 
Hugo Mills | If you're not part of the solution, you're part of
hugo@... carfax.org.uk | the precipitate.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)

2015-06-17 Thread Sander
Hugo Mills wrote (ao):
 On Wed, Jun 17, 2015 at 12:16:54AM -0700, Marc MERLIN wrote:
  I had a few power offs due to a faulty power supply, and my mdadm raid5
  got into fail mode after 2 drives got kicked out since their sequence
  numbers didn't match due to the abrupt power offs.

  gargamel:~# btrfs fi df /mnt/btrfs_pool1
  Data, single: total=8.29TiB, used=8.28TiB
  System, DUP: total=8.00MiB, used=920.00KiB
  System, single: total=4.00MiB, used=0.00B
  Metadata, DUP: total=14.00GiB, used=10.58GiB
  Metadata, single: total=8.00MiB, used=0.00B
  GlobalReserve, single: total=512.00MiB, used=0.00B

  I'll do a scrub later, for now I have to wait 20 hours for the raid
  rebuild first.
 
You'll probably find that the rebuild is equivalent to a scrub anyway.

He has mdadm raid, which is rebuilding. This is obviously not equivalent
to a btrfs scrub.

Sander


Re: BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)

2015-06-17 Thread Hugo Mills
On Wed, Jun 17, 2015 at 12:58:35PM +0200, Sander wrote:
 Hugo Mills wrote (ao):
  On Wed, Jun 17, 2015 at 12:16:54AM -0700, Marc MERLIN wrote:
   I had a few power offs due to a faulty power supply, and my mdadm raid5
   got into fail mode after 2 drives got kicked out since their sequence
   numbers didn't match due to the abrupt power offs.
 
   gargamel:~# btrfs fi df /mnt/btrfs_pool1
   Data, single: total=8.29TiB, used=8.28TiB
   System, DUP: total=8.00MiB, used=920.00KiB
   System, single: total=4.00MiB, used=0.00B
   Metadata, DUP: total=14.00GiB, used=10.58GiB
   Metadata, single: total=8.00MiB, used=0.00B
   GlobalReserve, single: total=512.00MiB, used=0.00B
 
   I'll do a scrub later, for now I have to wait 20 hours for the raid
   rebuild first.
  
 You'll probably find that the rebuild is equivalent to a scrub anyway.
 
 He has mdadm raid, which is rebuilding. This is obviously not equivalent
 to a btrfs scrub.

   Ah, thanks for the correction. Note to self: read more carefully
before replying.

   Hugo.

-- 
Hugo Mills | If you're not part of the solution, you're part of
hugo@... carfax.org.uk | the precipitate.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




[PATCH 2/2] Btrfs: fix warning of bytes_may_use

2015-06-17 Thread Liu Bo
While running generic/019, dmesg got several warnings from
btrfs_free_reserved_data_space().

Test generic/019 injects disk failures, so submitting dio gets errors. In
that case btrfs_direct_IO() goes to its error handling and frees
bytes_may_use, but bytes_may_use has already been freed during
get_block().

This adds a runtime flag that records whether we have gone through
get_block(); if so, the error path skips the cleanup work.
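A minimal userspace sketch of the "set a flag in one stage, test-and-clear it in the error path" idea, using C11 atomics. The names here (struct fake_inode, get_block_stage(), direct_io_error_path()) are hypothetical stand-ins and the accounting is reduced to a single counter; it is not the kernel code, only the double-release guard in isolation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DIO_READY_BIT 12u   /* same bit number as BTRFS_INODE_DIO_READY */

/* Hypothetical, heavily reduced inode: flags plus one space counter. */
struct fake_inode {
	atomic_ulong runtime_flags;
	uint64_t bytes_may_use;     /* reserved but not yet used data space */
};

static void set_flag(struct fake_inode *inode, unsigned bit)
{
	atomic_fetch_or(&inode->runtime_flags, 1UL << bit);
}

/* Clear the bit and report whether it was set, as one atomic step. */
static bool test_and_clear_flag(struct fake_inode *inode, unsigned bit)
{
	unsigned long mask = 1UL << bit;
	unsigned long old = atomic_fetch_and(&inode->runtime_flags, ~mask);

	return (old & mask) != 0;
}

/* get_block stage: hands the reservation back and records that it did. */
static void get_block_stage(struct fake_inode *inode, uint64_t len)
{
	assert(inode->bytes_may_use >= len);
	inode->bytes_may_use -= len;
	set_flag(inode, DIO_READY_BIT);
}

/* Error path: release only if get_block has not already released. */
static void direct_io_error_path(struct fake_inode *inode, uint64_t len)
{
	if (!test_and_clear_flag(inode, DIO_READY_BIT)) {
		assert(inode->bytes_may_use >= len);
		inode->bytes_may_use -= len;
	}
}
```

Without the flag, running both stages for the same request would subtract the length twice and the counter would underflow, which is the kind of miscount the warning in the commit message points at.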

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/btrfs_inode.h |  2 ++
 fs/btrfs/inode.c   | 16 +---
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 0ef5cc1..81220b2 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -44,6 +44,8 @@
 #define BTRFS_INODE_IN_DELALLOC_LIST   9
 #define BTRFS_INODE_READDIO_NEED_LOCK  10
 #define BTRFS_INODE_HAS_PROPS  11
+/* DIO is ready to submit */
+#define BTRFS_INODE_DIO_READY  12
 /*
  * The following 3 bits are meant only for the btree inode.
  * When any of them is set, it means an error happened while writing an
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 7bf150a..438b56f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7530,6 +7530,7 @@ unlock:
 
current->journal_info = outstanding_extents;
btrfs_free_reserved_data_space(inode, len);
+   set_bit(BTRFS_INODE_DIO_READY, &BTRFS_I(inode)->runtime_flags);
}
 
/*
@@ -8311,9 +8312,18 @@ static ssize_t btrfs_direct_IO(struct kiocb *iocb, 
struct iov_iter *iter,
   btrfs_submit_direct, flags);
if (iov_iter_rw(iter) == WRITE) {
current->journal_info = NULL;
-   if (ret < 0 && ret != -EIOCBQUEUED)
-   btrfs_delalloc_release_space(inode, count);
-   else if (ret >= 0 && (size_t)ret < count)
+   if (ret < 0 && ret != -EIOCBQUEUED) {
+   /*
+* If the error comes from submitting stage,
+* btrfs_get_blocks_direct() has free'd data space,
+* and metadata space will be handled by
+* finish_ordered_fn, don't do that again to make
+* sure bytes_may_use is correct.
+*/
+   if (!test_and_clear_bit(BTRFS_INODE_DIO_READY,
+&BTRFS_I(inode)->runtime_flags))
+   btrfs_delalloc_release_space(inode, count);
+   } else if (ret >= 0 && (size_t)ret < count)
btrfs_delalloc_release_space(inode,
 count - (size_t)ret);
}
-- 
2.1.0



[PATCH 1/2] Btrfs: fix hang when failing to submit bio of directIO

2015-06-17 Thread Liu Bo
The hang is uncovered by generic/019.

btrfs_endio_direct_write() skips the finish_ordered_fn part when it hits
an error, so the ordered extents that were added never get processed, which
blocks processes that are waiting for them via btrfs_start_ordered_extent().

This fixes the above; finish_ordered_fn will also do the space
accounting work.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 fs/btrfs/inode.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 8bb0136..7bf150a 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7855,8 +7855,6 @@ static void btrfs_endio_direct_write(struct bio *bio, int 
err)
struct bio *dio_bio;
int ret;
 
-   if (err)
-   goto out_done;
 again:
ret = btrfs_dec_test_first_ordered_pending(inode, ordered,
   ordered_offset,
@@ -7879,7 +7877,6 @@ out_test:
ordered = NULL;
goto again;
}
-out_done:
dio_bio = dip->dio_bio;
 
kfree(dip);
-- 
2.1.0



Re: [PATCH 1/7] btrfs: make btrfs_issue_discard return bytes discarded

2015-06-17 Thread Filipe David Manana
On Mon, Jun 15, 2015 at 2:41 PM,  je...@suse.com wrote:
 From: Jeff Mahoney je...@suse.com

 Initially this will just be the length argument passed to it,
 but the following patches will adjust that to reflect re-alignment
 and skipped blocks.

 Signed-off-by: Jeff Mahoney je...@suse.com

Reviewed-by: Filipe Manana fdman...@suse.com
Tested-by: Filipe Manana fdman...@suse.com

 ---
  fs/btrfs/extent-tree.c | 19 ++-
  1 file changed, 14 insertions(+), 5 deletions(-)

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 0ec3acd..da1145d 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -1884,10 +1884,17 @@ static int remove_extent_backref(struct 
 btrfs_trans_handle *trans,
 return ret;
  }

 -static int btrfs_issue_discard(struct block_device *bdev,
 -   u64 start, u64 len)
 +static int btrfs_issue_discard(struct block_device *bdev, u64 start, u64 len,
 +  u64 *discarded_bytes)
  {
 -   return blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
 +   int ret = 0;
 +
 +   *discarded_bytes = 0;
 +   ret = blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
 +   if (!ret)
 +   *discarded_bytes = len;
 +
 +   return ret;
  }

  int btrfs_discard_extent(struct btrfs_root *root, u64 bytenr,
 @@ -1908,14 +1915,16 @@ int btrfs_discard_extent(struct btrfs_root *root, u64 
 bytenr,


 for (i = 0; i < bbio->num_stripes; i++, stripe++) {
 +   u64 bytes;
 if (!stripe->dev->can_discard)
 continue;

 ret = btrfs_issue_discard(stripe->dev->bdev,
   stripe->physical,
 - stripe->length);
 + stripe->length,
 + &bytes);
 if (!ret)
 -   discarded_bytes += stripe->length;
 +   discarded_bytes += bytes;
 else if (ret != -EOPNOTSUPP)
 break; /* Logic errors or -ENOMEM, or -EIO 
 but I don't know how that could happen JDM */

 --
 2.4.3




-- 
Filipe David Manana,

Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men.


Re: [PATCH 2/7] btrfs: btrfs_issue_discard ensure offset/length are aligned to sector boundaries

2015-06-17 Thread Filipe David Manana
On Mon, Jun 15, 2015 at 2:41 PM,  je...@suse.com wrote:
 From: Jeff Mahoney je...@suse.com

 It's possible, though unexpected, to pass unaligned offsets and lengths
 to btrfs_issue_discard.  We then shift the offset/length values to sector
 units.  If an unaligned offset has been passed, it will result in the
 entire sector being discarded, possibly losing data.  An unaligned
 length is safe but we'll end up returning an inaccurate number of
 discarded bytes.

 This patch aligns the offset to the 512B boundary, adjusts the length,
 and warns, since we shouldn't be discarding on an offset that isn't
 aligned with our sector size.
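The alignment arithmetic described above can be sketched as plain C. This is a standalone illustration, not the kernel function: struct sector_range and align_discard_range() are hypothetical names, and two details are assumptions of the sketch rather than of the patch (the length is always rounded down to a sector multiple, and a range that ends before the next boundary is clamped to zero instead of underflowing).

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9
#define SECTOR_SIZE  (UINT64_C(1) << SECTOR_SHIFT)   /* 512 bytes */

/* Hypothetical result type: a byte range that covers whole sectors only. */
struct sector_range {
	uint64_t start;   /* first byte, sector aligned */
	uint64_t len;     /* length in bytes, a multiple of SECTOR_SIZE */
};

/*
 * Round start up to the next 512-byte boundary and shrink len to match,
 * so that only sectors lying fully inside the original range remain.
 */
static struct sector_range align_discard_range(uint64_t start, uint64_t len)
{
	struct sector_range r;
	uint64_t aligned_start;
	uint64_t skipped;

	/* Rounding up must not wrap around. */
	assert(start <= UINT64_MAX - (SECTOR_SIZE - 1));

	aligned_start = (start + SECTOR_SIZE - 1) & ~(SECTOR_SIZE - 1);
	skipped = aligned_start - start;

	/* Sketch-only guard: a range shorter than the skipped bytes is empty. */
	len = (len > skipped) ? len - skipped : 0;

	r.start = aligned_start;
	r.len = len & ~(SECTOR_SIZE - 1);
	return r;
}
```

For example, a request for 4096 bytes at offset 1000 becomes 3584 bytes at offset 1024: the 24 bytes before the boundary are skipped and the remaining 4072 bytes are rounded down to seven whole sectors. Shifting both values right by 9 then yields exact sector numbers.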

 Signed-off-by: Jeff Mahoney je...@suse.com
Reviewed-by: Filipe Manana fdman...@suse.com
Tested-by: Filipe Manana fdman...@suse.com

 ---
  fs/btrfs/extent-tree.c | 17 +
  1 file changed, 13 insertions(+), 4 deletions(-)

 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index da1145d..cf9cefd 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -1888,12 +1888,21 @@ static int btrfs_issue_discard(struct block_device 
 *bdev, u64 start, u64 len,
u64 *discarded_bytes)
  {
 int ret = 0;
 +   u64 aligned_start = ALIGN(start, 1 << 9);

 -   *discarded_bytes = 0;
 -   ret = blkdev_issue_discard(bdev, start >> 9, len >> 9, GFP_NOFS, 0);
 -   if (!ret)
 -   *discarded_bytes = len;
 +   if (WARN_ON(start != aligned_start)) {
 +   len -= aligned_start - start;
 +   len = round_down(len, 1 << 9);
 +   start = aligned_start;
 +   }

 +   *discarded_bytes = 0;
 +   if (len) {
 +   ret = blkdev_issue_discard(bdev, start >> 9, len >> 9,
 +  GFP_NOFS, 0);
 +   if (!ret)
 +   *discarded_bytes = len;
 +   }
 return ret;
  }

 --
 2.4.3




-- 
Filipe David Manana,

Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men.


Re: [PATCH 6/7] btrfs: add missing discards when unpinning extents with -o discard

2015-06-17 Thread Filipe David Manana
On Mon, Jun 15, 2015 at 2:41 PM,  je...@suse.com wrote:
 From: Jeff Mahoney je...@suse.com

 When we clear the dirty bits in btrfs_delete_unused_bgs for extents
 in the empty block group, it results in btrfs_finish_extent_commit being
 unable to discard the freed extents.

 The block group removal patch added an alternate path to forget extents
 other than btrfs_finish_extent_commit.  As a result, any extents that
 would be freed when the block group is removed aren't discarded.  In my
 test run, with a large copy of mixed sized files followed by removal, it
 left nearly 2/3 of extents undiscarded.

 To clean up the block groups, we add the removed block group onto a list
 that will be discarded after transaction commit.

 Signed-off-by: Jeff Mahoney je...@suse.com
Reviewed-by: Filipe Manana fdman...@suse.com
Tested-by: Filipe Manana fdman...@suse.com

 ---
  fs/btrfs/ctree.h|  3 ++
  fs/btrfs/extent-tree.c  | 68 
 ++---
  fs/btrfs/free-space-cache.c | 57 +
  fs/btrfs/super.c|  2 +-
  fs/btrfs/transaction.c  |  2 ++
  fs/btrfs/transaction.h  |  2 ++
  6 files changed, 105 insertions(+), 29 deletions(-)

 diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
 index 6f364e1..780acf1 100644
 --- a/fs/btrfs/ctree.h
 +++ b/fs/btrfs/ctree.h
 @@ -3438,6 +3438,8 @@ int btrfs_remove_block_group(struct btrfs_trans_handle 
 *trans,
  struct btrfs_root *root, u64 group_start,
  struct extent_map *em);
  void btrfs_delete_unused_bgs(struct btrfs_fs_info *fs_info);
 +void btrfs_get_block_group_trimming(struct btrfs_block_group_cache *cache);
 +void btrfs_put_block_group_trimming(struct btrfs_block_group_cache *cache);
  void btrfs_create_pending_block_groups(struct btrfs_trans_handle *trans,
struct btrfs_root *root);
  u64 btrfs_get_alloc_profile(struct btrfs_root *root, int data);
 @@ -4068,6 +4070,7 @@ __printf(5, 6)
  void __btrfs_std_error(struct btrfs_fs_info *fs_info, const char *function,
  unsigned int line, int errno, const char *fmt, ...);

 +const char *btrfs_decode_error(int errno);

  void __btrfs_abort_transaction(struct btrfs_trans_handle *trans,
struct btrfs_root *root, const char *function,
 diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
 index 24b48df..3598440 100644
 --- a/fs/btrfs/extent-tree.c
 +++ b/fs/btrfs/extent-tree.c
 @@ -6103,20 +6103,19 @@ int btrfs_finish_extent_commit(struct 
 btrfs_trans_handle *trans,
struct btrfs_root *root)
  {
 struct btrfs_fs_info *fs_info = root->fs_info;
 +   struct btrfs_block_group_cache *block_group, *tmp;
 +   struct list_head *deleted_bgs;
 struct extent_io_tree *unpin;
 u64 start;
 u64 end;
 int ret;

 -   if (trans->aborted)
 -   return 0;
 -
 if (fs_info->pinned_extents == &fs_info->freed_extents[0])
 unpin = &fs_info->freed_extents[1];
 else
 unpin = &fs_info->freed_extents[0];

 -   while (1) {
 +   while (!trans->aborted) {
 mutex_lock(&fs_info->unused_bg_unpin_mutex);
 ret = find_first_extent_bit(unpin, 0, &start, &end,
 EXTENT_DIRTY, NULL);
 @@ -6135,6 +6134,34 @@ int btrfs_finish_extent_commit(struct 
 btrfs_trans_handle *trans,
 cond_resched();
 }

 +   /*
 +* Transaction is finished.  We don't need the lock anymore.  We
 +* do need to clean up the block groups in case of a transaction
 +* abort.
 +*/
 +   deleted_bgs = &trans->transaction->deleted_bgs;
 +   list_for_each_entry_safe(block_group, tmp, deleted_bgs, bg_list) {
 +   u64 trimmed = 0;
 +
 +   ret = -EROFS;
 +   if (!trans->aborted)
 +   ret = btrfs_discard_extent(root,
 +  block_group->key.objectid,
 +  block_group->key.offset,
 +  &trimmed);
 +
 +   list_del_init(&block_group->bg_list);
 +   btrfs_put_block_group_trimming(block_group);
 +   btrfs_put_block_group(block_group);
 +
 +   if (ret) {
 +   const char *errstr = btrfs_decode_error(ret);
 +   btrfs_warn(fs_info,
 +  "Discard failed while removing blockgroup: errno=%d %s\n",
 +  ret, errstr);
 +   }
 +   }
 +
 return 0;
  }

 @@ -9914,6 +9941,11 @@ int btrfs_remove_block_group(struct btrfs_trans_handle 
 *trans,
  * currently running transaction might finish and a new one start,
  * allowing for new block groups to 

[PATCH v5.3 24/27] Btrfs: free the stale device

2015-06-17 Thread Anand Jain
From: Anand Jain anand.j...@oracle.com

When btrfs on a device is overwritten with a new btrfs (mkfs),
the old btrfs instance in the kernel becomes stale. With this
patch, if the kernel finds that a device has been overwritten, it
deletes the stale fsid/uuid.
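The stale-entry rule can be sketched with a small userspace registry. Everything here is hypothetical (struct dev_entry, struct registry, register_device(), the fixed-size table, string fsids); it only illustrates the rule from the description: when a path shows up with a different fsid, an older entry for the same path that is not opened is dropped.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEVS 8

/* Hypothetical registry entry: one scanned device. */
struct dev_entry {
	char path[64];
	char fsid[37];
	int  opened;    /* nonzero while the filesystem is mounted */
	int  in_use;
};

struct registry {
	struct dev_entry devs[MAX_DEVS];
};

static int count_devices(const struct registry *r)
{
	int n = 0;

	for (int i = 0; i < MAX_DEVS; i++)
		n += r->devs[i].in_use;
	return n;
}

/*
 * Register (path, fsid). Entries for the same path with a different fsid
 * are stale and get removed, unless they belong to an opened filesystem.
 * Returns the number of stale entries removed, or -1 if the table is full.
 */
static int register_device(struct registry *r, const char *path,
			   const char *fsid)
{
	int removed = 0;
	int slot = -1;

	/* Pass 1: forget stale entries for this path. */
	for (int i = 0; i < MAX_DEVS; i++) {
		struct dev_entry *d = &r->devs[i];

		if (d->in_use && !d->opened &&
		    strcmp(d->path, path) == 0 && strcmp(d->fsid, fsid) != 0) {
			d->in_use = 0;
			removed++;
		}
	}

	/* Pass 2: already registered, or find a free slot. */
	for (int i = 0; i < MAX_DEVS; i++) {
		struct dev_entry *d = &r->devs[i];

		if (d->in_use && strcmp(d->path, path) == 0 &&
		    strcmp(d->fsid, fsid) == 0)
			return removed;
		if (!d->in_use && slot < 0)
			slot = i;
	}
	if (slot < 0)
		return -1;

	memset(&r->devs[slot], 0, sizeof(r->devs[slot]));
	snprintf(r->devs[slot].path, sizeof(r->devs[slot].path), "%s", path);
	snprintf(r->devs[slot].fsid, sizeof(r->devs[slot].fsid), "%s", fsid);
	r->devs[slot].in_use = 1;
	return removed;
}
```

As in the patch, matching is done on the path string alone, so the caveat in the patch's own Todo comment applies to this sketch too: the same device reappearing under a different path (for example a mapper path) would not be recognised as stale.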

Signed-off-by: Anand Jain anand.j...@oracle.com
---
v5.2->v5.3: removed unnecessary return in a void function
v5.1->v5.2: accepts David's review comments with thanks, especially the
important rcu part
v5->v5.1: since this deals with only devices in unmounted state, don't
try to remove device link

 fs/btrfs/volumes.c | 61 ++
 1 file changed, 61 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e50005f..39d4d48 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -445,6 +445,61 @@ static void pending_bios_fn(struct btrfs_work *work)
run_scheduled_bios(device);
 }
 
+
+void btrfs_free_stale_device(struct btrfs_device *cur_dev)
+{
+   struct btrfs_fs_devices *fs_devs;
+   struct btrfs_device *dev;
+
+   if (!cur_dev->name)
+   return;
+
+   list_for_each_entry(fs_devs, &fs_uuids, list) {
+   int del = 1;
+
+   if (fs_devs->opened)
+   continue;
+   if (fs_devs->seeding)
+   continue;
+
+   list_for_each_entry(dev, &fs_devs->devices, dev_list) {
+
+   if (dev == cur_dev)
+   continue;
+   if (!dev->name)
+   continue;
+
+   /*
+* Todo: This won't be enough. What if the same device
+* comes back (with new uuid and) with its mapper path?
+* But for now, this does help as mostly an admin will
+* either use mapper or non mapper path throughout.
+*/
+   rcu_read_lock();
+   del = strcmp(rcu_str_deref(dev->name),
+   rcu_str_deref(cur_dev->name));
+   rcu_read_unlock();
+   if (!del)
+   break;
+   }
+
+   if (!del) {
+   /* delete the stale device */
+   if (fs_devs->num_devices == 1) {
+   btrfs_sysfs_remove_fsid(fs_devs);
+   list_del(&fs_devs->list);
+   free_fs_devices(fs_devs);
+   } else {
+   fs_devs->num_devices--;
+   list_del(&dev->dev_list);
+   rcu_string_free(dev->name);
+   kfree(dev);
+   }
+   break;
+   }
+   }
+}
+
 /*
  * Add new device to list of registered devices
  *
@@ -560,6 +615,12 @@ static noinline int device_list_add(const char *path,
if (!fs_devices->opened)
device->generation = found_transid;
 
+   /*
+* if there is new btrfs on an already registered device,
+* then remove the stale device entry.
+*/
+   btrfs_free_stale_device(device);
+
*fs_devices_ret = fs_devices;
 
return ret;
-- 
2.4.1



Re: RAID10 Balancing Request for Comments and Advices

2015-06-17 Thread Vincent Olivier

 On Jun 17, 2015, at 9:27 AM, Hugo Mills h...@carfax.org.uk wrote:
 
 On Wed, Jun 17, 2015 at 09:13:08AM -0400, Vincent Olivier wrote:
 
 On Jun 16, 2015, at 8:14 PM, Chris Murphy li...@colorremedies.com wrote:
 
 On Tue, Jun 16, 2015 at 5:58 PM, Duncan 1i5t5.dun...@cox.net wrote:
 
 On a current kernel unlike older ones, btrfs actually automates entirely
 empty chunk reclaim, so this problem doesn't occur anything close to near
 as often as it used to.  However, it's still possible to have mostly but
 not entirely empty chunks that btrfs won't automatically reclaim.  A
 balance can be used to rewrite and combine these mostly empty chunks,
 reclaiming the space saved.  This is what Hugo was recommending.
 
 Yes, as little as a -dusage=5 (data chunks that are 5% or less full)
 can clear the problem and is very fast, seconds. Possibly a bit
 longer, many seconds to single-digit minutes, is -dusage=15. I haven't
 done a full balance in forever.
 
 
 Yes, on this 80% full 6x4TB RAID10 -dusage=15 took 2 seconds and "relocated
 0 out of 3026 chunks".
 
 Out of curiosity, I had to use -dusage=90 to have it relocate only 1 chunk
 and it took less than 30 seconds.
 
 So I put a -dusage=25 in the weekly cron just before the scrub.
 
   In most cases, all you need to do is clean up one data chunk to
 give the metadata enough space to work in. Instead of manually
 iterating through several values of usage= until you get a useful
 response, you can use limit=n to stop after n successful block
 group relocations.


Nice! Will do that instead! Thanks.




Re: RAID10 Balancing Request for Comments and Advices

2015-06-17 Thread Vincent Olivier

 On Jun 16, 2015, at 8:14 PM, Chris Murphy li...@colorremedies.com wrote:
 
 On Tue, Jun 16, 2015 at 5:58 PM, Duncan 1i5t5.dun...@cox.net wrote:
 
 On a current kernel unlike older ones, btrfs actually automates entirely
 empty chunk reclaim, so this problem doesn't occur anything close to near
 as often as it used to.  However, it's still possible to have mostly but
 not entirely empty chunks that btrfs won't automatically reclaim.  A
 balance can be used to rewrite and combine these mostly empty chunks,
 reclaiming the space saved.  This is what Hugo was recommending.
 
 Yes, as little as a -dusage=5 (data chunks that are 5% or less full)
 can clear the problem and is very fast, seconds. Possibly a bit
 longer, many seconds to single-digit minutes, is -dusage=15. I haven't
 done a full balance in forever.


Yes, on this 80% full 6x4TB RAID10 -dusage=15 took 2 seconds and "relocated 0
out of 3026 chunks".

Out of curiosity, I had to use -dusage=90 to have it relocate only 1 chunk and
it took less than 30 seconds.

So I put a -dusage=25 in the weekly cron just before the scrub.

FYI.

Thanks for your help.
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 5/7] btrfs: explictly delete unused block groups in close_ctree and ro-remount

2015-06-17 Thread Filipe David Manana
On Wed, Jun 17, 2015 at 11:04 AM, Filipe David Manana
fdman...@gmail.com wrote:
 On Mon, Jun 15, 2015 at 2:41 PM,  je...@suse.com wrote:
 From: Jeff Mahoney je...@suse.com

 The cleaner thread may already be sleeping by the time we enter
 close_ctree.  If that's the case, we'll skip removing any unused
 block groups queued for removal, even during a normal umount.
 They'll be cleaned up automatically at next mount, but users
 expect a umount to be a clean synchronization point, especially
 when used on thin-provisioned storage with -odiscard.  We also
 explicitly remove unused block groups in the ro-remount path
 for the same reason.

 Signed-off-by: Jeff Mahoney je...@suse.com
 Reviewed-by: Filipe Manana fdman...@suse.com
 Tested-by: Filipe Manana fdman...@suse.com

 ---
  fs/btrfs/disk-io.c |  9 +++++++++
  fs/btrfs/super.c   | 11 +++++++++++
  2 files changed, 20 insertions(+)

 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 2ef9a4b..2e47fef 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -3710,6 +3710,15 @@ void close_ctree(struct btrfs_root *root)
          cancel_work_sync(&fs_info->async_reclaim_work);

          if (!(fs_info->sb->s_flags & MS_RDONLY)) {
 +                /*
 +                 * If the cleaner thread is stopped and there are
 +                 * block groups queued for removal, the deletion will be
 +                 * skipped when we quit the cleaner thread.
 +                 */
 +                mutex_lock(&root->fs_info->cleaner_mutex);
 +                btrfs_delete_unused_bgs(root->fs_info);
 +                mutex_unlock(&root->fs_info->cleaner_mutex);
 +
                  ret = btrfs_commit_super(root);
                  if (ret)
                          btrfs_err(fs_info, "commit super ret %d", ret);
 diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
 index 9e66f5e..2ccd8d4 100644
 --- a/fs/btrfs/super.c
 +++ b/fs/btrfs/super.c
 @@ -1539,6 +1539,17 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)

                  sb->s_flags |= MS_RDONLY;

 +                /*
 +                 * Setting MS_RDONLY will put the cleaner thread to
 +                 * sleep at the next loop if it's already active.
 +                 * If it's already asleep, we'll leave unused block
 +                 * groups on disk until we're mounted read-write again
 +                 * unless we clean them up here.
 +                 */
 +                mutex_lock(&root->fs_info->cleaner_mutex);
 +                btrfs_delete_unused_bgs(fs_info);
 +                mutex_unlock(&root->fs_info->cleaner_mutex);

So actually, this allows for a deadlock after the patch I sent out last week:

https://patchwork.kernel.org/patch/6586811/

In that patch delete_unused_bgs is no longer called under the
cleaner_mutex, and making it so will cause a deadlock with
relocation.

Even without that patch, I don't think you need to use this mutex
anyway - no 2 tasks running this function can get the same bg from the
fs_info->unused_bgs list.

thanks


 +
                  btrfs_dev_replace_suspend_for_unmount(fs_info);
                  btrfs_scrub_cancel(fs_info);
                  btrfs_pause_balance(fs_info);
 --
 2.4.3


-- 
Filipe David Manana,

Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men.


Re: RAID10 Balancing Request for Comments and Advices

2015-06-17 Thread Hugo Mills
On Wed, Jun 17, 2015 at 09:13:08AM -0400, Vincent Olivier wrote:
 
  On Jun 16, 2015, at 8:14 PM, Chris Murphy li...@colorremedies.com wrote:
  
  On Tue, Jun 16, 2015 at 5:58 PM, Duncan 1i5t5.dun...@cox.net wrote:
  
  On a current kernel unlike older ones, btrfs actually automates entirely
  empty chunk reclaim, so this problem doesn't occur anything close to near
  as often as it used to.  However, it's still possible to have mostly but
  not entirely empty chunks that btrfs won't automatically reclaim.  A
  balance can be used to rewrite and combine these mostly empty chunks,
  reclaiming the space saved.  This is what Hugo was recommending.
  
  Yes, as little as a -dusage=5 (data chunks that are 5% or less full)
  can clear the problem and is very fast, seconds. Possibly a bit
  longer, many seconds or single digit minutes, is -dusage=15. I haven't
  done a full balance in forever.
 
 
 Yes, on this 80% full 6x4TB RAID10 -dusage=15 took 2 seconds and relocated
 “0 out of 3026 chunks”.
 
 Out of curiosity, I had to use -dusage=90 to have it relocate only 1 chunk
 and it took less than 30 seconds.
 
 So I put a -dusage=25 in the weekly cron just before the scrub.

   In most cases, all you need to do is clean up one data chunk to
give the metadata enough space to work in. Instead of manually
iterating through several values of usage= until you get a useful
response, you can use limit=n to stop after n successful block
group relocations.
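
As a rough illustration of combining the two filters, here is a minimal Python sketch. It assumes the filters are given as a comma-separated list after -d; build_balance_cmd is a hypothetical helper and /mountpoint is a placeholder. It only builds the argument list and does not run anything.

```python
def build_balance_cmd(mountpoint, usage=None, limit=None):
    """Return the argv for a filtered data balance (nothing is executed)."""
    filters = []
    if usage is not None:
        if not 0 <= usage <= 100:
            raise ValueError("usage is a percentage between 0 and 100")
        filters.append(f"usage={usage}")
    if limit is not None:
        if limit < 1:
            raise ValueError("limit must be at least 1")
        filters.append(f"limit={limit}")
    if not filters:
        # Hypothetical safety choice: never build an unfiltered (full) balance.
        raise ValueError("refusing to build an unfiltered balance")
    return ["btrfs", "balance", "start", "-d" + ",".join(filters), mountpoint]


# Relocate at most one data block group that is 25% full or less.
print(" ".join(build_balance_cmd("/mountpoint", usage=25, limit=1)))
```

With limit=1 the usage value can be set fairly high once, instead of being raised step by step until something gets relocated.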

   Hugo.

-- 
Hugo Mills | Alert status mauve ocelot: Slight chance of
hugo@... carfax.org.uk | brimstone. Be prepared to make a nice cup of tea.
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: RAID10 Balancing Request for Comments and Advices

2015-06-17 Thread Vincent Olivier

 On Jun 16, 2015, at 7:58 PM, Duncan 1i5t5.dun...@cox.net wrote:
 
 Vincent Olivier posted on Tue, 16 Jun 2015 09:34:29 -0400 as excerpted:
 
 
 On Jun 16, 2015, at 8:25 AM, Hugo Mills h...@carfax.org.uk wrote:
 
 On Tue, Jun 16, 2015 at 08:09:17AM -0400, Vincent Olivier wrote:
 
 My first question is this : is it normal to have “single” blocks ?
 Why not only RAID10? I don’t remember the exact mkfs options I used
 but I certainly didn’t ask for “single” so this is unexpected.
 
 Yes. It's an artefact of the way that mkfs works. If you run a
 balance on those chunks, they'll go away. (btrfs balance start
 -dusage=0 -musage=0 /mountpoint)
 
 Thanks! I did and it did go away, except for the “GlobalReserve, single:
 total=512.00MiB, used=0.00B”. But I suppose this is a permanent fixture,
 right?
 
 Yes.  GlobalReserve is for short-term btrfs-internal use, reserved for
 times when btrfs needs to (temporarily) allocate some space in order to
 free space, etc.  It's always single, and you'll rarely see anything but
 0 used except perhaps in the middle of a balance or something.


Got it. Thanks.

Is there any way to put that on another device, say, an SSD? I am thinking of
backing up this RAID10 on a 2x8TB device-managed SMR RAID1 and I want to
minimize random write operations (noatime et al.). I will maybe start a new
thread for that, but first, is there something substantial I can read about
btrfs+SMR? Or should I avoid SMR+btrfs?


 
 For maintenance, I would suggest running a scrub regularly, to
 check for various forms of bitrot. Typical frequencies for a scrub are
 once a week or once a month -- opinions vary (as do runtimes).
 
 
 Yes. I cronned it weekly for now. Takes about 5 hours. Is bitrot
 automatically corrected on RAID10 since a copy of the data exists within
 the filesystem? What happens for RAID0?
 
 For raid10 (and the raid1 I use), yes, it's corrected, from the other
 existing copy, assuming it's good, tho if there are metadata checksum
 errors, there may be corresponding unverified checksums as well, where
 the verification couldn't be done because the metadata containing the
 checksums was bad.  Thus, if there are errors found and corrected, and
 you see unverified errors as well, rerun the scrub, so the newly
 corrected metadata can now be used to verify the previously unverified
 errors.


OK then, rule of thumb: re-run the scrub on “unverified checksum error(s)”.
I have yet to see checksum errors but will keep it in mind.

 
 I'm presently getting a lot of experience with this as one of the ssds in
 my raid1 is gradually failing and rewriting sectors.  Generally what
 happens is that the ssd will take too long, triggering a SATA reset (30
 second timeout), and btrfs will call that an error.  The scrub then
 rewrites the bad copy on the unreliable device with the good copy from
 the more reliable device, with the write triggering a sector relocation
 on the bad device.  The newly written copy then checks out good, but if
 it was metadata, it very likely contained checksums for several other
 blocks, which couldn't be verified because the block containing their
 checksums was itself bad.  Typically I'll see dozens to a couple hundred
 unverified errors for every bad metadata block rewritten in this way.
 Rerunning the scrub then either verifies or fixes the previously
 unverified blocks, tho sometimes one of those in turn ends up bad and if
 it's a metadata block, I may end up rerunning the scrub another time or
 two, until everything checks out.
 
 FWIW, on the bad device, smartctl -A reports (excerpted):
 
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   5 Reallocated_Sector_Ct   0x0032   098   098   036    Old_age   Always   -           259
 182 Erase_Fail_Count_Total  0x0032   100   100   000    Old_age   Always   -           132
 
 While on the paired good device:
 
   5 Reallocated_Sector_Ct   0x0032   253   253   036    Old_age   Always   -           0
 182 Erase_Fail_Count_Total  0x0032   253   253   000    Old_age   Always   -           0
 
 Meanwhile, smartctl -H has already warned once that the device is
 failing, tho it went back to passing status again, but as of now it's
 saying failing, again.  The attribute that actually registers as failing,
 again from the bad device, followed by the good, is:
 
   1 Raw_Read_Error_Rate     0x000f   001   001   006    Pre-fail  Always   FAILING_NOW 3081
 
   1 Raw_Read_Error_Rate     0x000f   160   159   006    Pre-fail  Always   -           41
 
 When it's not actually reporting failing, the FAILING_NOW status is
 replaced with IN_THE_PAST.
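
The VALUE/THRESH comparison behind that FAILING_NOW flag can be sketched in a few lines of Python. This is only an illustration: it assumes the ten-column smartctl -A layout quoted above, SmartAttr, parse_attr and below_threshold are hypothetical names, and the sample lines are the ones from this mail.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int        # normalized ("cooked") current value
    worst: int
    thresh: int
    when_failed: str
    raw: str          # kept as text; raw formats vary between vendors


def parse_attr(line):
    """Parse one attribute row: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW."""
    fields = line.split(None, 9)
    if len(fields) != 10:
        raise ValueError(f"unexpected attribute line: {line!r}")
    return SmartAttr(int(fields[0]), fields[1], int(fields[3]), int(fields[4]),
                     int(fields[5]), fields[8], fields[9].strip())


def below_threshold(attr):
    """An attribute trips when its normalized value is at or below a nonzero threshold."""
    return attr.thresh > 0 and attr.value <= attr.thresh


bad = parse_attr("  1 Raw_Read_Error_Rate 0x000f 001 001 006 Pre-fail Always FAILING_NOW 3081")
good = parse_attr("  1 Raw_Read_Error_Rate 0x000f 160 159 006 Pre-fail Always - 41")
for label, attr in (("failing device", bad), ("good device", good)):
    print(label, attr.name, "below threshold:", below_threshold(attr))
```

Reallocated_Sector_Ct at 098 against a threshold of 036 does not trip this check, even with 259 raw reallocations.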
 
 250 Read_Error_Retry_Rate is the other attribute of interest, with values
 of 100 current and worst for both devices, threshold 0, but a raw value
 of 2488 for the good device and over 17,000,000 for the failing device.
 But with the cooked value never moving from 100 and with no real
 guidance on how to interpret the raw values, while 

Re: BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)

2015-06-17 Thread Duncan
Marc MERLIN posted on Wed, 17 Jun 2015 00:16:54 -0700 as excerpted:

 I had a few power offs due to a faulty power supply, and my mdadm raid5
 got into fail mode after 2 drives got kicked out since their sequence
 numbers didn't match due to the abrupt power offs.
 
 I brought the swraid5 back up by force assembling it with 4 drives (one
 was really only a few sequence numbers behind), and it's doing a full
 parity rebuild on the 5th drive that was farther behind.
 
 So I can understand how I may have had a few blocks that are in a bad
 state.
 I'm getting a few (not many) of those messages in syslog.
 BTRFS: read error corrected: ino 1 off 226840576 (dev
 /dev/mapper/dshelf1 sector 459432)
 
 Filesystem looks like this:
 Label: 'btrfs_pool1'  uuid: 6358304a-2234-4243-b02d-4944c9af47d7
         Total devices 1 FS bytes used 8.29TiB
         devid    1 size 14.55TiB used 8.32TiB path /dev/mapper/dshelf1
 
 gargamel:~# btrfs fi df /mnt/btrfs_pool1
 Data, single: total=8.29TiB, used=8.28TiB
 System, DUP: total=8.00MiB, used=920.00KiB
 System, single: total=4.00MiB, used=0.00B
 Metadata, DUP: total=14.00GiB, used=10.58GiB
 Metadata, single: total=8.00MiB, used=0.00B
 GlobalReserve, single: total=512.00MiB, used=0.00B
 
 Kernel 3.19.8.
 
 Just to make sure I understand, do those messages in syslog mean that my
 metadata got corrupted a bit, but because I have 2 copies, btrfs can fix
 the bad copy by using the good one?

Yes.  Despite the confusion between btrfs raid5 and mdraid5, Hugo was 
correct there.  It's just the 3.19 kernel bit that he got wrong, since he 
was thinking btrfs raid.  Btrfs dup mode should be good going back many 
kernels.

 Also, if my actual data got corrupted, am I correct that btrfs will
 detect the checksum failure and give me a different error message of a
 read error that cannot be corrected?
 
 I'll do a scrub later, for now I have to wait 20 hours for the raid
 rebuild first.

Yes again.

As I mentioned in a different thread a few hours ago, I have an SSD that 
is slowly going bad, relocating sectors, etc (200-some relocated at this 
point, by raw value, that attribute dropped to 100 cooked value on the 
first relocation and is now at 98, with a threshold of 36, so I figure it 
should be good for a few thousand relocations if I let it go that far).  
But it's in a btrfs raid1 with a reliable (no relocations yet) paired-ssd 
and I've been able to scrub-fix the errors so far, plus I have things 
backed up and a replacement ready to insert when I decide it's time, so 
I'm able to watch in more or less morbid fascination as the thing slowly 
dies, a sector at a time.  

The interesting thing is that with btrfs' checksumming and data integrity 
feature, I can continue to use the drive in raid1 even tho it's 
definitely bad enough to be all but unusable with ordinary filesystems.

Anyway, as a result of that, I'm getting lots of experience with scrubs 
and corrected errors.

One thing I'd strongly recommend.  Once the rebuild is complete and you 
do the scrub, there may well be both read/corrected errors, and 
unverified errors.  AFAIK, the unverified errors are a result of bad 
metadata blocks, so missing checksums for what they covered.  So once you 
finish the first scrub and have corrected most of the metadata block 
errors, do another scrub.  The idea is to repeat until you have no more 
unverified errors, they're either all corrected (if dup metadata) or all 
uncorrectable (the single data).  That's what I'm doing here, with both 
data and metadata as raid1 and thus correctable, tho in some instances 
the device is triggering a new relocation on the second and occasionally 
(once?) third scrub, so that's causing me to have to do more scrubs than 
I would if the problem were entirely in the past, as it sounds like yours 
is, or will-be once the mdraid rebuild is done, anyway.
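
The repeat-until-clean idea can be sketched as a small Python loop. This is a sketch under stated assumptions, not a tool: run_scrub is a caller-supplied function (hypothetical) that runs one scrub and returns its error counts, and the counts in the example are made up for illustration.

```python
def scrub_until_verified(run_scrub, max_passes=5):
    """Repeat scrubs until a pass reports no unverified errors, or give up.

    run_scrub() must return a dict with 'corrected', 'unverified' and
    'uncorrectable' counts; how those are obtained is left to the caller.
    """
    history = []
    for _ in range(max_passes):
        counts = run_scrub()
        history.append(counts)
        if counts["unverified"] == 0:
            break
    return history


# Made-up counts: the first pass repairs metadata, the second verifies the rest.
fake_results = iter([
    {"corrected": 3, "unverified": 40, "uncorrectable": 0},
    {"corrected": 2, "unverified": 0, "uncorrectable": 0},
])
history = scrub_until_verified(lambda: next(fake_results))
print(len(history), "scrub passes, last:", history[-1])
```

The max_passes cap matters on a device that keeps failing, where each pass may turn up new errors.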

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: [PATCH 5/7] btrfs: explictly delete unused block groups in close_ctree and ro-remount

2015-06-17 Thread Jeff Mahoney
On 6/17/15 10:32 AM, Jeff Mahoney wrote:
 On 6/17/15 9:24 AM, Filipe David Manana wrote:
 On Wed, Jun 17, 2015 at 11:04 AM, Filipe David Manana 
 fdman...@gmail.com wrote:
 On Mon, Jun 15, 2015 at 2:41 PM,  je...@suse.com wrote:
 From: Jeff Mahoney je...@suse.com
 
 The cleaner thread may already be sleeping by the time we 
 enter close_ctree.  If that's the case, we'll skip removing
 any unused block groups queued for removal, even during a
 normal umount. They'll be cleaned up automatically at next
 mount, but users expect a umount to be a clean
 synchronization point, especially when used on
 thin-provisioned storage with -odiscard.  We also explicitly
 remove unused block groups in the ro-remount path for the
 same reason.
 
 Signed-off-by: Jeff Mahoney je...@suse.com
 Reviewed-by: Filipe Manana fdman...@suse.com
 Tested-by: Filipe Manana fdman...@suse.com
 
 ---
  fs/btrfs/disk-io.c |  9 +++++++++
  fs/btrfs/super.c   | 11 +++++++++++
  2 files changed, 20 insertions(+)
 
 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 2ef9a4b..2e47fef 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -3710,6 +3710,15 @@ void close_ctree(struct btrfs_root *root)
          cancel_work_sync(&fs_info->async_reclaim_work);
 
          if (!(fs_info->sb->s_flags & MS_RDONLY)) {
 +                /*
 +                 * If the cleaner thread is stopped and there are
 +                 * block groups queued for removal, the deletion will be
 +                 * skipped when we quit the cleaner thread.
 +                 */
 +                mutex_lock(&root->fs_info->cleaner_mutex);
 +                btrfs_delete_unused_bgs(root->fs_info);
 +                mutex_unlock(&root->fs_info->cleaner_mutex);
 +
                  ret = btrfs_commit_super(root);
                  if (ret)
                          btrfs_err(fs_info, "commit super ret %d", ret);
 diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
 index 9e66f5e..2ccd8d4 100644
 --- a/fs/btrfs/super.c
 +++ b/fs/btrfs/super.c
 @@ -1539,6 +1539,17 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 
                  sb->s_flags |= MS_RDONLY;
 
 +                /*
 +                 * Setting MS_RDONLY will put the cleaner thread to
 +                 * sleep at the next loop if it's already active.
 +                 * If it's already asleep, we'll leave unused block
 +                 * groups on disk until we're mounted read-write again
 +                 * unless we clean them up here.
 +                 */
 +                mutex_lock(&root->fs_info->cleaner_mutex);
 +                btrfs_delete_unused_bgs(fs_info);
 +                mutex_unlock(&root->fs_info->cleaner_mutex);
 
 So actually, this allows for a deadlock after the patch I sent
 out last week:
 
 https://patchwork.kernel.org/patch/6586811/
 
 In that patch delete_unused_bgs is no longer called under the
 cleaner_mutex, and making it so will cause a deadlock with
 relocation.
 
 Even without that patch, I don't think you need to use this mutex
 anyway - no 2 tasks running this function can get the same bg
 from the fs_info->unused_bgs list.
 
 I was hitting crashes during umount when xfstests would do
 remount-ro and umount in quick succession.  I can go back and
 confirm this, but I believe I was encountering a race between the
 cleaner thread and umount after being set read-only.  It didn't
 trigger all the time.  My hypothesis is that if the cleaner thread
 was running and had a lot of work to do, it could start before we set
 MS_RDONLY and still be performing work through the remount and into
 the umount.  Ro-remount would have set MS_RDONLY so we skip the
 btrfs_commit_super in close_ctree and then blow up afterwards.
 
 Taking the cleaner mutex means we either wait until the cleaner
 thread has finished or we put it to sleep on the next loop before
 it does anything.  In either case, it's safe.  It could just as
 easily have been:
 
 mutex_lock(&root->fs_info->cleaner_mutex);
 mutex_unlock(&root->fs_info->cleaner_mutex);
 
 btrfs_delete_unused_bgs(fs_info);
 
 I think it actually was in a previous version I was testing.  It 
 probably should go back to that version so that we don't end up 
 confusing it with the new mutex you introduced in your patch.

It looks like your:
[PATCH] Btrfs: fix crash on close_ctree() if cleaner starts new
transaction

would also fix this in a more general case.  We can drop taking the
cleaner mutex here.

-Jeff

-- 
Jeff Mahoney
SUSE Labs

Re: [PATCH 0/5] Rework/enhance btrfs-map-logical

2015-06-17 Thread David Sterba
On Wed, Jun 17, 2015 at 03:48:59PM +0800, Qu Wenruo wrote:
 Although btrfs-map-logical is mainly a tool for developers, even as
 a developer, I feel quite frustrated about the bug I found:

Thanks, I'll queue it for 4.1.


Re: Btrfs stable updates for 4.0

2015-06-17 Thread Luis Henriques
On Thu, Jun 11, 2015 at 06:06:32PM +0200, David Sterba wrote:
 Hi,
 
 please queue the following patches to 4.0 stable. There are fixes for user
 visible bugs and one usability regression with RAID1 -> single conversion
 during balance.
 
 One of the patches does not apply cleanly to 4.0.5, there's a minor conflict
 in patch context (153c35b60c72de9fae06c8e2c8b2c47d79d4, the last one). A
 reply to this mail is the adapted version or you can pull/cherry-pick from
 
   git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-stable-4.0
 
 Subjects:
   Btrfs: send, add missing check for dead clone root
   Btrfs: send, don't leave without decrementing clone root's send_progress
   btrfs: incorrect handling for fiemap_fill_next_extent return
   btrfs: cleanup orphans while looking up default subvolume
   Btrfs: fix range cloning when same inode used as source and destination
   Btrfs: fix uninit variable in clone ioctl
   Btrfs: fix regression in raid level conversion
 Commits:
   5cc2b17e80cf5770f2e585c2d90fd8af1b901258 # 3.14+
   2f1f465ae6da244099af55c066e5355abd8ff620 # 3.14+
   26e726afe01c1c82072cf23a5ed89ce25f39d9f2 # 3.10+
   727b9784b6085c99c2f836bf4fcc2848dc9cf904 # 3.14+

Thanks, I'm applying the above commits to the 3.16 kernel as well.

Cheers,
--
Luís

   df858e76723ace61342b118aa4302bd09de4e386 # 4.0+
   de249e66a73d69281cd812087979c6fae552 # 4.0+
   153c35b60c72de9fae06c8e2c8b2c47d79d4 # 4.0+
 
 Thanks.


Re: [PATCH 5/7] btrfs: explictly delete unused block groups in close_ctree and ro-remount

2015-06-17 Thread Jeff Mahoney
On 6/17/15 9:24 AM, Filipe David Manana wrote:
 On Wed, Jun 17, 2015 at 11:04 AM, Filipe David Manana 
 fdman...@gmail.com wrote:
 On Mon, Jun 15, 2015 at 2:41 PM,  je...@suse.com wrote:
 From: Jeff Mahoney je...@suse.com
 
 The cleaner thread may already be sleeping by the time we
 enter close_ctree.  If that's the case, we'll skip removing any
 unused block groups queued for removal, even during a normal
 umount. They'll be cleaned up automatically at next mount, but
 users expect a umount to be a clean synchronization point,
 especially when used on thin-provisioned storage with
 -odiscard.  We also explicitly remove unused block groups in
 the ro-remount path for the same reason.
 
 Signed-off-by: Jeff Mahoney je...@suse.com
 Reviewed-by: Filipe Manana fdman...@suse.com
 Tested-by: Filipe Manana fdman...@suse.com
 
 ---
  fs/btrfs/disk-io.c |  9 +++++++++
  fs/btrfs/super.c   | 11 +++++++++++
  2 files changed, 20 insertions(+)
 
 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 2ef9a4b..2e47fef 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -3710,6 +3710,15 @@ void close_ctree(struct btrfs_root *root)
          cancel_work_sync(&fs_info->async_reclaim_work);
 
          if (!(fs_info->sb->s_flags & MS_RDONLY)) {
 +                /*
 +                 * If the cleaner thread is stopped and there are
 +                 * block groups queued for removal, the deletion will be
 +                 * skipped when we quit the cleaner thread.
 +                 */
 +                mutex_lock(&root->fs_info->cleaner_mutex);
 +                btrfs_delete_unused_bgs(root->fs_info);
 +                mutex_unlock(&root->fs_info->cleaner_mutex);
 +
                  ret = btrfs_commit_super(root);
                  if (ret)
                          btrfs_err(fs_info, "commit super ret %d", ret);
 diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
 index 9e66f5e..2ccd8d4 100644
 --- a/fs/btrfs/super.c
 +++ b/fs/btrfs/super.c
 @@ -1539,6 +1539,17 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 
                  sb->s_flags |= MS_RDONLY;
 
 +                /*
 +                 * Setting MS_RDONLY will put the cleaner thread to
 +                 * sleep at the next loop if it's already active.
 +                 * If it's already asleep, we'll leave unused block
 +                 * groups on disk until we're mounted read-write again
 +                 * unless we clean them up here.
 +                 */
 +                mutex_lock(&root->fs_info->cleaner_mutex);
 +                btrfs_delete_unused_bgs(fs_info);
 +                mutex_unlock(&root->fs_info->cleaner_mutex);
 
 So actually, this allows for a deadlock after the patch I sent out
 last week:
 
 https://patchwork.kernel.org/patch/6586811/
 
 In that patch delete_unused_bgs is no longer called under the
 cleaner_mutex, and making it so will cause a deadlock with
 relocation.
 
 Even without that patch, I don't think you need to use this mutex
 anyway - no 2 tasks running this function can get the same bg from
 the fs_info->unused_bgs list.

I was hitting crashes during umount when xfstests would do remount-ro
and umount in quick succession.  I can go back and confirm this, but I
believe I was encountering a race between the cleaner thread and
umount after being set read-only.  It didn't trigger all the time.  My
hypothesis is that if the cleaner thread was running and had a lot of
work to do, it could start before we set MS_RDONLY and still be
performing work through the remount and into the umount.  Ro-remount
would have set MS_RDONLY so we skip the btrfs_commit_super in
close_ctree and then blow up afterwards.

Taking the cleaner mutex means we either wait until the cleaner thread
has finished or we put it to sleep on the next loop before it does
anything.  In either case, it's safe.  It could just as easily have been:

   mutex_lock(&root->fs_info->cleaner_mutex);
   mutex_unlock(&root->fs_info->cleaner_mutex);

   btrfs_delete_unused_bgs(fs_info);

I think it actually was in a previous version I was testing.  It
probably should go back to that version so that we don't end up
confusing it with the new mutex you introduced in your patch.
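
The lock-then-unlock barrier described above can be modelled outside the kernel. The following is a toy Python sketch, not btrfs code: Cleaner, quiesce and demo are hypothetical names, the read_only flag stands in for MS_RDONLY, and the timer only exists to let the in-flight pass finish.

```python
import threading


class Cleaner:
    """Toy stand-in for a background cleaner: one pass runs under a mutex."""

    def __init__(self):
        self.mutex = threading.Lock()
        self.read_only = False

    def run_one_pass(self, work):
        with self.mutex:
            if self.read_only:      # flag already set: go back to sleep
                return False
            work()
            return True


def quiesce(cleaner):
    """Set the flag, then take and drop the mutex purely as a barrier."""
    cleaner.read_only = True
    cleaner.mutex.acquire()         # blocks while a pass is still in flight
    cleaner.mutex.release()


def demo():
    cleaner = Cleaner()
    started = threading.Event()
    may_finish = threading.Event()
    log = []

    def slow_work():
        started.set()
        may_finish.wait()
        log.append("pass finished")

    worker = threading.Thread(target=cleaner.run_one_pass, args=(slow_work,))
    worker.start()
    started.wait()                  # a pass is now running with the mutex held
    threading.Timer(0.05, may_finish.set).start()
    quiesce(cleaner)                # returns only after that pass has ended
    log.append("quiesced")
    worker.join()
    ran_again = cleaner.run_one_pass(lambda: log.append("should not run"))
    return log, ran_again


print(demo())
```

After quiesce() returns, either the pass that was running has completed or no pass will start, which is the property the empty lock/unlock pair provides.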

-Jeff

-- 
Jeff Mahoney
SUSE Labs

Re: [PATCH 5/7] btrfs: explictly delete unused block groups in close_ctree and ro-remount

2015-06-17 Thread Filipe David Manana
On Wed, Jun 17, 2015 at 3:36 PM, Jeff Mahoney je...@suse.com wrote:
 On 6/17/15 10:32 AM, Jeff Mahoney wrote:
 On 6/17/15 9:24 AM, Filipe David Manana wrote:
 On Wed, Jun 17, 2015 at 11:04 AM, Filipe David Manana
 fdman...@gmail.com wrote:
 On Mon, Jun 15, 2015 at 2:41 PM,  je...@suse.com wrote:
 From: Jeff Mahoney je...@suse.com

 The cleaner thread may already be sleeping by the time we
 enter close_ctree.  If that's the case, we'll skip removing
 any unused block groups queued for removal, even during a
 normal umount. They'll be cleaned up automatically at next
 mount, but users expect a umount to be a clean
 synchronization point, especially when used on
 thin-provisioned storage with -odiscard.  We also explicitly
 remove unused block groups in the ro-remount path for the
 same reason.

 Signed-off-by: Jeff Mahoney je...@suse.com
 Reviewed-by: Filipe Manana fdman...@suse.com
 Tested-by: Filipe Manana fdman...@suse.com

 ---
  fs/btrfs/disk-io.c |  9 +++++++++
  fs/btrfs/super.c   | 11 +++++++++++
  2 files changed, 20 insertions(+)

 diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
 index 2ef9a4b..2e47fef 100644
 --- a/fs/btrfs/disk-io.c
 +++ b/fs/btrfs/disk-io.c
 @@ -3710,6 +3710,15 @@ void close_ctree(struct btrfs_root *root)
          cancel_work_sync(&fs_info->async_reclaim_work);

          if (!(fs_info->sb->s_flags & MS_RDONLY)) {
 +                /*
 +                 * If the cleaner thread is stopped and there are
 +                 * block groups queued for removal, the deletion will be
 +                 * skipped when we quit the cleaner thread.
 +                 */
 +                mutex_lock(&root->fs_info->cleaner_mutex);
 +                btrfs_delete_unused_bgs(root->fs_info);
 +                mutex_unlock(&root->fs_info->cleaner_mutex);
 +
                  ret = btrfs_commit_super(root);
                  if (ret)
                          btrfs_err(fs_info, "commit super ret %d", ret);
 diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
 index 9e66f5e..2ccd8d4 100644
 --- a/fs/btrfs/super.c
 +++ b/fs/btrfs/super.c
 @@ -1539,6 +1539,17 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)

                  sb->s_flags |= MS_RDONLY;

 +                /*
 +                 * Setting MS_RDONLY will put the cleaner thread to
 +                 * sleep at the next loop if it's already active.
 +                 * If it's already asleep, we'll leave unused block
 +                 * groups on disk until we're mounted read-write again
 +                 * unless we clean them up here.
 +                 */
 +                mutex_lock(&root->fs_info->cleaner_mutex);
 +                btrfs_delete_unused_bgs(fs_info);
 +                mutex_unlock(&root->fs_info->cleaner_mutex);

 So actually, this allows for a deadlock after the patch I sent
 out last week:

 https://patchwork.kernel.org/patch/6586811/

 In that patch delete_unused_bgs is no longer called under the
 cleaner_mutex, and making it so will cause a deadlock with
 relocation.

 Even without that patch, I don't think you need to use this mutex
 anyway - no 2 tasks running this function can get the same bg
 from the fs_info->unused_bgs list.

 I was hitting crashes during umount when xfstests would do
 remount-ro and umount in quick succession.  I can go back and
 confirm this, but I believe I was encountering a race between the
 cleaner thread and umount after being set read-only.  It didn't
 trigger all the time.  My hypothesis is that if the cleaner thread
 was running and had a lot of work to do, it could start before we set
 MS_RDONLY and still be performing work through the remount and into
 the umount.  Ro-remount would have set MS_RDONLY so we skip the
 btrfs_commit_super in close_ctree and then blow up afterwards.

 Taking the cleaner mutex means we either wait until the cleaner
 thread has finished or we put it to sleep on the next loop before
 it does anything.  In either case, it's safe.  It could just as
 easily have been:

 mutex_lock(&root->fs_info->cleaner_mutex);
 mutex_unlock(&root->fs_info->cleaner_mutex);

 btrfs_delete_unused_bgs(fs_info);

 I think it actually was in a previous version I was testing.  It
 probably should go back to that version so that we don't end up
 confusing it with the new mutex you introduced in your patch.

 It looks like your:
 [PATCH] Btrfs: fix crash on close_ctree() if cleaner starts new
 transaction

 would also fix this in a more general case.  We can drop taking the
 cleaner mutex here.

Cool, thanks Jeff.



Re: trim not working and irreparable errors from btrfsck

2015-06-17 Thread Chris Murphy
On Wed, Jun 17, 2015 at 6:56 AM, Christian Dysthe cdys...@gmail.com wrote:
 Hi,

 Sorry for asking more about this. I'm not a developer but trying to learn.
 In my case I get several errors like this one:

 root 2625 inode 353819 errors 400, nbytes wrong

 Is it inode 353819 I should focus on and what is the number after root, in
 this case 2625?

I'm going to guess it's tree root 2625, which is the same thing as fs
tree, which is the same thing as subvolume. Each subvolume has its own
inodes. So on a given Btrfs volume, an inode number can exist more
than once, but in separate subvolumes. When you use btrfs inspect
inode it will list all files with that inode number, but only the one
in subvol ID 2625 is what you care about deleting and replacing.
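
A small sketch of how such a line splits into its fields, assuming the format shown in the quoted message ("root <id> inode <ino> errors <value>, <text>"). Treating the errors value as a hexadecimal bitmask, and 0x400 as the "nbytes wrong" bit, is an assumption drawn only from the pairing in that one line; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed from the line "errors 400, nbytes wrong": bit 10 of a hex mask. */
#define DEMO_ERR_NBYTES_WRONG (1u << 10)

struct check_error {
	unsigned long long root;   /* subvolume (fs tree) ID */
	unsigned long long inode;  /* inode number inside that subvolume */
	unsigned int errors;       /* error bits, parsed as hexadecimal */
};

/* Returns 0 on success, -1 if the line does not have the expected shape. */
static int parse_check_error(const char *line, struct check_error *out)
{
	int n;

	assert(line != NULL && out != NULL);
	n = sscanf(line, "root %llu inode %llu errors %x",
		   &out->root, &out->inode, &out->errors);
	return n == 3 ? 0 : -1;
}
```

For the line above this yields root 2625 and inode 353819, so the file to look for is the one with that inode number inside subvolume 2625.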

-- 
Chris Murphy
--
To unsubscribe from this list: send the line unsubscribe linux-btrfs in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: trim not working and irreparable errors from btrfsck

2015-06-17 Thread Christian

On 06/17/2015 10:22 AM, Chris Murphy wrote:

On Wed, Jun 17, 2015 at 6:56 AM, Christian Dysthe cdys...@gmail.com wrote:

Hi,

Sorry for asking more about this. I'm not a developer but trying to learn.
In my case I get several errors like this one:

root 2625 inode 353819 errors 400, nbytes wrong

Is it inode 353819 I should focus on and what is the number after root, in
this case 2625?


I'm going to guess it's tree root 2625, which is the same thing as fs
tree, which is the same thing as subvolume. Each subvolume has its own
inodes. So on a given Btrfs volume, an inode number can exist more
than once, but in separate subvolumes. When you use btrfs inspect
inode it will list all files with that inode number, but only the one
in subvol ID 2625 is what you care about deleting and replacing.

Thanks! Deleting the file for that inode took care of it. No more 
errors. Restored it from a backup.


However, fstrim still gives me 0 B (0 bytes) trimmed, so that may be 
another problem. Is there a way to check if trim works?


--
//Christian




[PATCH 0/5] Rework/enhance btrfs-map-logical

2015-06-17 Thread Qu Wenruo
Although btrfs-map-logical is mainly a tool for developers, even as
a developer I feel quite frustrated by the bugs I found:

1) Assertion failure when passing the bytenr of a tree root
The most annoying one.
---
$ btrfs-map-logical  -l 29425664 /dev/sda6 
mirror 1 logical 29425664 physical 37814272 device /dev/sda6
mirror 2 logical 29425664 physical 556096 device /dev/sda6
extent_io.c:582: free_extent_buffer: Assertion `eb-refs  0` failed.
btrfs-map-logical[0x41c464]
btrfs-map-logical(free_extent_buffer+0xc0)[0x41cf10]
btrfs-map-logical(btrfs_release_all_roots+0x59)[0x40e649]
btrfs-map-logical(close_ctree+0x1aa)[0x40f51a]
btrfs-map-logical(main+0x387)[0x4077c7]
/usr/lib/libc.so.6(__libc_start_main+0xf0)[0x7f1e7f619790]
btrfs-map-logical(_start+0x29)[0x4078f9]
---

2) Strange logical offset and non-existent mapping
---
$ btrfs-map-logical  -l 1 -b 8192 /dev/sda6 
mirror 1 logical 1 physical 1 device /dev/sda6
mirror 1 logical 4097 physical 4097 device /dev/sda6
---
There are normally no extents in that range.
Even so, for the first mirror, it's OK to start from 1 as I passed an
unaligned bytenr.
But why is the 2nd non-existent mapping also unaligned?

For all the fixes, see the commit message of the 5th patch.

Qu Wenruo (5):
  btrfs-progs: Export read_extent_data function.
  btrfs-progs: map-logical: Introduce map_one_extent function.
  Btrfs-progs: map-logical: Introduce print_mapping_info function.
  Btrfs-progs: map-logical: Introduce write_extent_content function.
  btrfs-progs: map-logical: Rework map-logical logics.

 btrfs-map-logical.c | 273 +++-
 cmds-check.c|  34 ---
 disk-io.c   |  34 +++
 disk-io.h   |   2 +
 4 files changed, 243 insertions(+), 100 deletions(-)

-- 
2.4.3



[PATCH 1/5] btrfs-progs: Export read_extent_data function.

2015-06-17 Thread Qu Wenruo
Export it for later btrfs-map-logical cleanup.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 cmds-check.c | 34 --
 disk-io.c| 34 ++
 disk-io.h|  2 ++
 3 files changed, 36 insertions(+), 34 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index db121b1..778f141 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -5235,40 +5235,6 @@ static int check_space_cache(struct btrfs_root *root)
return error ? -EINVAL : 0;
 }
 
-static int read_extent_data(struct btrfs_root *root, char *data,
-   u64 logical, u64 *len, int mirror)
-{
-   u64 offset = 0;
-   struct btrfs_multi_bio *multi = NULL;
-   struct btrfs_fs_info *info = root->fs_info;
-   struct btrfs_device *device;
-   int ret = 0;
-   u64 max_len = *len;
-
-   ret = btrfs_map_block(&info->mapping_tree, READ, logical, len,
- &multi, mirror, NULL);
-   if (ret) {
-   fprintf(stderr, "Couldn't map the block %llu\n",
-   logical + offset);
-   goto err;
-   }
-   device = multi->stripes[0].dev;
-
-   if (device->fd == 0)
-   goto err;
-   if (*len > max_len)
-   *len = max_len;
-
-   ret = pread64(device->fd, data, *len, multi->stripes[0].physical);
-   if (ret != *len)
-   ret = -EIO;
-   else
-   ret = 0;
-err:
-   kfree(multi);
-   return ret;
-}
-
 static int check_extent_csums(struct btrfs_root *root, u64 bytenr,
u64 num_bytes, unsigned long leaf_offset,
struct extent_buffer *eb) {
diff --git a/disk-io.c b/disk-io.c
index 2a7feb0..720fee4 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -340,6 +340,40 @@ struct extent_buffer *read_tree_block(struct btrfs_root *root, u64 bytenr,
return ERR_PTR(ret);
 }
 
+int read_extent_data(struct btrfs_root *root, char *data,
+  u64 logical, u64 *len, int mirror)
+{
+   u64 offset = 0;
+   struct btrfs_multi_bio *multi = NULL;
+   struct btrfs_fs_info *info = root->fs_info;
+   struct btrfs_device *device;
+   int ret = 0;
+   u64 max_len = *len;
+
+   ret = btrfs_map_block(&info->mapping_tree, READ, logical, len,
+ &multi, mirror, NULL);
+   if (ret) {
+   fprintf(stderr, "Couldn't map the block %llu\n",
+   logical + offset);
+   goto err;
+   }
+   device = multi->stripes[0].dev;
+
+   if (device->fd == 0)
+   goto err;
+   if (*len > max_len)
+   *len = max_len;
+
+   ret = pread64(device->fd, data, *len, multi->stripes[0].physical);
+   if (ret != *len)
+   ret = -EIO;
+   else
+   ret = 0;
+err:
+   kfree(multi);
+   return ret;
+}
+
 int write_and_map_eb(struct btrfs_trans_handle *trans,
 struct btrfs_root *root,
 struct extent_buffer *eb)
diff --git a/disk-io.h b/disk-io.h
index 62eb566..87e1cd9 100644
--- a/disk-io.h
+++ b/disk-io.h
@@ -66,6 +66,8 @@ struct btrfs_device;
 int read_whole_eb(struct btrfs_fs_info *info, struct extent_buffer *eb, int mirror);
 struct extent_buffer *read_tree_block(struct btrfs_root *root, u64 bytenr,
  u32 blocksize, u64 parent_transid);
+int read_extent_data(struct btrfs_root *root, char *data, u64 logical,
+u64 *len, int mirror);
 void readahead_tree_block(struct btrfs_root *root, u64 bytenr, u32 blocksize,
  u64 parent_transid);
 struct extent_buffer *btrfs_find_create_tree_block(struct btrfs_root *root,
-- 
2.4.3



[PATCH 3/5] Btrfs-progs: map-logical: Introduce print_mapping_info function.

2015-06-17 Thread Qu Wenruo
The new function will print the mapping info of given range
[logical, logical+len).

Note: the caller must ensure the range is completely inside an extent;
otherwise btrfs_map_block can return -ENOENT.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 btrfs-map-logical.c | 63 +
 1 file changed, 63 insertions(+)

diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c
index 8442779..22ece82 100644
--- a/btrfs-map-logical.c
+++ b/btrfs-map-logical.c
@@ -93,6 +93,69 @@ out:
return ret;
 }
 
+static int __print_mapping_info(struct btrfs_fs_info *fs_info, u64 logical,
+   u64 len, int mirror_num)
+{
+   struct btrfs_multi_bio *multi = NULL;
+   u64 cur_offset = 0;
+   u64 cur_len;
+   int ret = 0;
+
+   while (cur_offset < len) {
+   struct btrfs_device *device;
+   int i;
+
+   cur_len = len - cur_offset;
+   ret = btrfs_map_block(&fs_info->mapping_tree, READ,
+   logical + cur_offset, &cur_len,
+   &multi, mirror_num, NULL);
+   if (ret) {
+   fprintf(info_file,
+   "Error: fails to map mirror%d logical %llu: %s\n",
+   mirror_num, logical, strerror(-ret));
+   return ret;
+   }
+   for (i = 0; i < multi->num_stripes; i++) {
+   device = multi->stripes[i].dev;
+   fprintf(info_file,
+   "mirror %d logical %Lu physical %Lu device %s\n",
+   mirror_num, logical + cur_offset,
+   multi->stripes[0].physical,
+   device->name);
+   }
+   kfree(multi);
+   multi = NULL;
+   cur_offset += cur_len;
+   }
+   return ret;
+}
+
+/*
+ * Logical and len is the exact value of a extent.
+ * And offset is the offset inside the extent. It's only used for case
+ * where user only want to print part of the extent.
+ *
+ * Caller *MUST* ensure the range [logical,logical+len) are in one extent.
+ * Or we can encounter the following case, causing a -ENOENT error:
+ * |-given parameter--|
+ * |-- Extent A -|
+ */
+static int print_mapping_info(struct btrfs_fs_info *fs_info, u64 logical,
+ u64 len)
+{
+   int num_copies;
+   int mirror_num;
+   int ret = 0;
+
+   num_copies = btrfs_num_copies(&fs_info->mapping_tree, logical, len);
+   for (mirror_num = 1; mirror_num <= num_copies; mirror_num++) {
+   ret = __print_mapping_info(fs_info, logical, len, mirror_num);
+   if (ret < 0)
+   return ret;
+   }
+   return ret;
+}
+
 static struct extent_buffer * debug_read_block(struct btrfs_root *root,
u64 bytenr, u32 blocksize, u64 copy)
 {
-- 
2.4.3



[PATCH 4/5] Btrfs-progs: map-logical: Introduce write_extent_content function.

2015-06-17 Thread Qu Wenruo
This function will write the extent content into the desired file.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 btrfs-map-logical.c | 34 ++
 1 file changed, 34 insertions(+)

diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c
index 22ece82..1ee101c 100644
--- a/btrfs-map-logical.c
+++ b/btrfs-map-logical.c
@@ -30,6 +30,8 @@
 #include "list.h"
 #include "utils.h"
 
+#define BUFFER_SIZE (64 * 1024)
+
 /* we write the mirror info to stdout unless they are dumping the data
  * to stdout
  * */
@@ -156,6 +158,38 @@ static int print_mapping_info(struct btrfs_fs_info *fs_info, u64 logical,
return ret;
 }
 
+/* Same requisition as print_mapping_info function */
+static int write_extent_content(struct btrfs_fs_info *fs_info, int out_fd,
+   u64 logical, u64 length, int mirror)
+{
+   char buffer[BUFFER_SIZE];
+   u64 cur_offset = 0;
+   u64 cur_len;
+   int ret = 0;
+
+   while (cur_offset < length) {
+   cur_len = min_t(u64, length - cur_offset, BUFFER_SIZE);
+   ret = read_extent_data(fs_info->tree_root, buffer,
+  logical + cur_offset, &cur_len, mirror);
+   if (ret < 0) {
+   fprintf(stderr,
+   "Failed to read extent at [%llu, %llu]: %s\n",
+   logical, logical + length, strerror(-ret));
+   return ret;
+   }
+   ret = write(out_fd, buffer, cur_len);
+   if (ret < 0 || ret != cur_len) {
+   if (ret > 0)
+   ret = -EINTR;
+   fprintf(stderr, "output file write failed: %s\n",
+   strerror(-ret));
+   return ret;
+   }
+   cur_offset += cur_len;
+   }
+   return ret;
+}
+
 static struct extent_buffer * debug_read_block(struct btrfs_root *root,
u64 bytenr, u32 blocksize, u64 copy)
 {
-- 
2.4.3



[PATCH 5/5] btrfs-progs: map-logical: Rework map-logical logics.

2015-06-17 Thread Qu Wenruo
[[BUG]]
The original map-logical has the following problems:
1) An assertion fails if we pass any tree root bytenr.
The problem is easy to trigger; here the number 29622272 is the bytenr of the
tree root:

 # btrfs-map-logical -l 29622272 /dev/sda6
 mirror 1 logical 29622272 physical 38010880 device /dev/sda6
 mirror 2 logical 29622272 physical 752704 device /dev/sda6
 extent_io.c:582: free_extent_buffer: Assertion `eb-refs  0` failed.
 btrfs-map-logical[0x41c464]
 btrfs-map-logical(free_extent_buffer+0xc0)[0x41cf10]
 btrfs-map-logical(btrfs_release_all_roots+0x59)[0x40e649]
 btrfs-map-logical(close_ctree+0x1aa)[0x40f51a]
 btrfs-map-logical(main+0x387)[0x4077c7]
 /usr/lib/libc.so.6(__libc_start_main+0xf0)[0x7f80a5562790]
 btrfs-map-logical(_start+0x29)[0x4078f9]

The problem is that btrfs-map-logical always uses sectorsize as the default
block size when calling alloc_extent_buffer.
And when it fails to find a block of the same size, it frees
the extent buffer in an incorrect way (free and create a new one with
refs == 1).

2) Will return a map result for a non-existent extent.

 # btrfs-map-logical -l 1 -b 123456 /dev/sda6
 mirror 1 logical 1 physical 1 device /dev/sda6
 mirror 1 logical 4097 physical 4097 device /dev/sda6
 mirror 1 logical 8193 physical 8193 device /dev/sda6
 ...

Normally, before bytenr 12582912, there should be no extent as that's
the mkfs time temp metadata/system chunk.

But map-logical will still map them out.

Not to mention the offset of 1 in all results.

[[FIX]]
This patch reworks the whole of map-logical as follows:
1) Always operate inside an extent
Even in the following case, map-logical will only return the covered
ranges in existing extents.

|-- range given ---|
|-Extent A-|  |-Extent B-|  |---Extent C-|
Result:
|--|  |--|  |--|

So with this patch, we search the extent tree to ensure all operations
are inside an extent before we do anything stupid.

2) No direct call on alloc_extent_buffer function.

That low-level function shouldn't be called at such high level.
It's only designed for low-level tree operation.

So in this patch we only use safe high-level functions to avoid such
problems.
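
The "only return the covered range in existing extents" behaviour from point 1 can be sketched as a plain interval intersection. This is an illustration only: struct range and clamp_to_extents are hypothetical names, the extents are passed in as an array, and the real patch looks them up in the extent tree instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Half-open byte range [start, start + len). */
struct range {
	uint64_t start;
	uint64_t len;
};

/*
 * Intersect the requested range with each extent and store every
 * non-empty intersection in out[]. Returns the number of ranges stored,
 * never more than max_out.
 */
static size_t clamp_to_extents(struct range req,
			       const struct range *extents, size_t nr,
			       struct range *out, size_t max_out)
{
	uint64_t req_end = req.start + req.len;
	size_t found = 0;
	size_t i;

	assert(extents != NULL && out != NULL);
	for (i = 0; i < nr && found < max_out; i++) {
		uint64_t ext_end = extents[i].start + extents[i].len;
		uint64_t lo = req.start > extents[i].start ?
			      req.start : extents[i].start;
		uint64_t hi = req_end < ext_end ? req_end : ext_end;

		if (lo < hi) {          /* the extent overlaps the request */
			out[found].start = lo;
			out[found].len = hi - lo;
			found++;
		}
	}
	return found;
}
```

A request that overlaps no extent yields zero ranges, which corresponds to the "No extent found at range" message in the result section.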

[[RESULT]]
With this patch, no assert is triggered and non-existent extents are
handled better.

 # btrfs-map-logical -l 29622272 /dev/sda6
 mirror 1 logical 29622272 physical 38010880 device /dev/sda6
 mirror 2 logical 29622272 physical 752704 device /dev/sda6

 # btrfs-map-logical -l 1 -b 123456 /dev/sda6
 No extent found at range [1,123457)

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 btrfs-map-logical.c | 148 
 1 file changed, 67 insertions(+), 81 deletions(-)

diff --git a/btrfs-map-logical.c b/btrfs-map-logical.c
index 1ee101c..a88e56e 100644
--- a/btrfs-map-logical.c
+++ b/btrfs-map-logical.c
@@ -190,69 +190,6 @@ static int write_extent_content(struct btrfs_fs_info *fs_info, int out_fd,
return ret;
 }
 
-static struct extent_buffer * debug_read_block(struct btrfs_root *root,
-   u64 bytenr, u32 blocksize, u64 copy)
-{
-   int ret;
-   struct extent_buffer *eb;
-   u64 length;
-   struct btrfs_multi_bio *multi = NULL;
-   struct btrfs_device *device;
-   int num_copies;
-   int mirror_num = 1;
-
-   eb = btrfs_find_create_tree_block(root, bytenr, blocksize);
-   if (!eb)
-   return NULL;
-
-   length = blocksize;
-   while (1) {
-   ret = btrfs_map_block(&root->fs_info->mapping_tree, READ,
- eb->start, &length, &multi,
- mirror_num, NULL);
-   if (ret) {
-   fprintf(info_file,
-   "Error: fails to map mirror%d logical %llu: %s\n",
-   mirror_num, (unsigned long long)eb->start,
-   strerror(-ret));
-   free_extent_buffer(eb);
-   return NULL;
-   }
-   device = multi->stripes[0].dev;
-   eb->fd = device->fd;
-   device->total_ios++;
-   eb->dev_bytenr = multi->stripes[0].physical;
-
-   fprintf(info_file, "mirror %d logical %Lu physical %Lu "
-   "device %s\n", mirror_num, (unsigned long long)bytenr,
-   (unsigned long long)eb->dev_bytenr, device->name);
-   kfree(multi);
-
-   if (!copy || mirror_num == copy) {
-   ret = read_extent_from_disk(eb, 0, eb->len);
-   if (ret) {
-   fprintf(info_file,
-   "Error: failed to read extent: mirror %d logical %llu: %s\n",
-   mirror_num, (unsigned long long)eb->start,
-   strerror(-ret));
-   free_extent_buffer(eb);
- 

Re: BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)

2015-06-17 Thread Chris Murphy
On Wed, Jun 17, 2015 at 1:16 AM, Marc MERLIN m...@merlins.org wrote:

 So I can understand how I may have had a few blocks that are in a bad
 state.
 I'm getting a few (not many) of those messages in syslog.
 BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)

I think more information is needed at the time of this entry, maybe
the previous 20 entries or so. That Btrfs thinks there is a read error
is different than when it thinks there's a checksum error. For example
when I willfully corrupt one sector that I know contains metadata,
then read the file or do a scrub:

[48466.824770] BTRFS: checksum error at logical 20971520 on dev
/dev/sdb, sector 57344: metadata leaf (level 0) in tree 3
[48466.829900] BTRFS: checksum error at logical 20971520 on dev
/dev/sdb, sector 57344: metadata leaf (level 0) in tree 3
[48466.834944] BTRFS: bdev /dev/sdb errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[48466.853589] BTRFS: fixed up error at logical 20971520 on dev /dev/sdb


I'd expect in your case you have a bdev line that reads rd 1, corrupt 0.


 Just to make sure I understand, do those messages in syslog mean that my
 metadata got corrupted a bit, but because I have 2 copies, btrfs can fix
 the bad copy by using the good one?

Probably but not enough information has been given to conclude that.


 Also, if my actual data got corrupted, am I correct that btrfs will
 detect the checksum failure and give me a different error message of a
 read error that cannot be corrected?

Yes, it looks like this when the file is directly read:

[ 1457.231316] BTRFS warning (device sdb): csum failed ino 257 off 0
csum 3703877302 expected csum 1978138932
[ 1457.231842] BTRFS warning (device sdb): csum failed ino 257 off 0
csum 3703877302 expected csum 1978138932

It looks like this during a scrub:

[ 1540.865520] BTRFS: checksum error at logical 12845056 on dev
/dev/sdb, sector 25088, root 5, inode 257, offset 0, length 4096,
links 1 (path: grub2-install)
[ 1540.865534] BTRFS: bdev /dev/sdb errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
[ 1540.866944] BTRFS: unable to fixup (regular) error at logical
12845056 on dev /dev/sdb


-- 
Chris Murphy


Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION

2015-06-17 Thread David Sterba
On Wed, Jun 17, 2015 at 03:54:31PM +0800, Liu Bo wrote:
 MS_I_VERSION is enabled by default for btrfs, this adds an alternative
 option to toggle it off.

There's an existing generic iversion/noiversion mount option pair, no
need to extra add it to btrfs.


Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION

2015-06-17 Thread David Sterba
On Wed, Jun 17, 2015 at 11:52:36PM +0800, Liu Bo wrote:
 On Wed, Jun 17, 2015 at 05:33:06PM +0200, David Sterba wrote:
  On Wed, Jun 17, 2015 at 03:54:31PM +0800, Liu Bo wrote:
   MS_I_VERSION is enabled by default for btrfs, this adds an alternative
   option to toggle it off.
  
  There's an existing generic iversion/noiversion mount option pair, no
  need to extra add it to btrfs.
 
 I know, it doesn't work though.

Sigh, I see, btrfs forces MS_I_VERSION flag,
0c4d2d95d06e920e0c61707e62c7fffc9c57f63a. I read 'enabled by default' as
that there's a standard way to override the defaults.

So the right way is not to do that, but this will break everything that
relies on that behaviour at the moment. This means adding the exception
to the upper layers, either VFS or 'mount', which is not very likely to
happen.

The generic options do not reach the filesystem specific callbacks, so
we can't check it.


Re: trim not working and irreparable errors from btrfsck

2015-06-17 Thread Chris Murphy
On Wed, Jun 17, 2015 at 8:33 AM, Christian cdys...@gmail.com wrote:
 On 06/17/2015 10:22 AM, Chris Murphy wrote:

 On Wed, Jun 17, 2015 at 6:56 AM, Christian Dysthe cdys...@gmail.com
 wrote:

 Hi,

 Sorry for asking more about this. I'm not a developer but trying to
 learn.
 In my case I get several errors like this one:

 root 2625 inode 353819 errors 400, nbytes wrong

 Is it inode 353819 I should focus on and what is the number after root,
 in
 this case 2625?


 I'm going to guess it's tree root 2625, which is the same thing as fs
 tree, which is the same thing as subvolume. Each subvolume has its own
 inodes. So on a given Btrfs volume, an inode number can exist more
 than once, but in separate subvolumes. When you use btrfs inspect
 inode it will list all files with that inode number, but only the one
 in subvol ID 2625 is what you care about deleting and replacing.

 Thanks! Deleting the file for that inode took care of it. No more errors.
 Restored it from a backup.

 However, fstrim still gives me 0 B (0 bytes) trimmed, so that may be
 another problem. Is there a way to check if trim works?

That sounds like maybe your SSD is blacklisted for trim, is all I can
think of. So trim shouldn't be the cause of the problem if it's being
blacklisted. The recent problems appear to be around newer SSDs that
support queue trim and newer kernels that issue queued trim. There
have been some patches related to trim to the kernel, but the
existence of blacklisting and claims of bugs in firmware make it
difficult to test and isolate.

http://techreport.com/news/28473/some-samsung-ssds-may-suffer-from-a-buggy-trim-implementation
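
One way to check whether the kernel thinks a block device accepts discards at all is to read its queue/discard_max_bytes attribute in sysfs; a value of 0 means the block layer will not send discards to that device. The sketch below assumes the usual /sys/block/<disk>/queue layout; the helper names are hypothetical, and a non-zero value only shows that discard is advertised, not that fstrim on a given filesystem will find anything to trim.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Read one unsigned decimal value from a file. Returns 0 on success, -1 otherwise. */
static int read_u64_file(const char *path, uint64_t *val)
{
	FILE *f;
	int ok;

	assert(path != NULL && val != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%" SCNu64, val) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/*
 * Returns 1 if the disk (e.g. "sda") advertises discard support,
 * 0 if discard_max_bytes is zero, -1 if the attribute cannot be read.
 */
static int device_advertises_discard(const char *disk)
{
	char path[256];
	uint64_t max_bytes;
	int n;

	n = snprintf(path, sizeof(path),
		     "/sys/block/%s/queue/discard_max_bytes", disk);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;
	if (read_u64_file(path, &max_bytes))
		return -1;
	return max_bytes != 0;
}
```

If the filesystem sits on top of a stacked device such as a device-mapper target, the attribute of that upper device is the one to look at, since it reflects what actually reaches the filesystem.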

-- 
Chris Murphy


Re: trim not working and irreparable errors from btrfsck

2015-06-17 Thread Christian

On 06/17/2015 11:28 AM, Chris Murphy wrote:



However, fstrim still gives me 0 B (0 bytes) trimmed, so that may be
another problem. Is there a way to check if trim works?


That sounds like maybe your SSD is blacklisted for trim, is all I can
think of. So trim shouldn't be the cause of the problem if it's being
blacklisted. The recent problems appear to be around newer SSDs that
support queue trim and newer kernels that issue queued trim. There
have been some patches related to trim to the kernel, but the
existence of blacklisting and claims of bugs in firmware make it
difficult to test and isolate.

http://techreport.com/news/28473/some-samsung-ssds-may-suffer-from-a-buggy-trim-implementation

This is an Intel SSD in a Lenovo Thinkpad X1 Carbon. Trim worked until a 
few weeks ago and still works for my small ext4 boot partition (just ran 
it to check). I will keep looking for a solution. Thanks!


--
//Christian




Re: BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)

2015-06-17 Thread Marc MERLIN
On Wed, Jun 17, 2015 at 01:51:26PM +, Duncan wrote:
  Also, if my actual data got corrupted, am I correct that btrfs will
  detect the checksum failure and give me a different error message of a
  read error that cannot be corrected?
  
  I'll do a scrub later, for now I have to wait 20 hours for the raid
  rebuild first.
 
 Yes again.
 
Great, thanks for confirming.
Makes me happy to know that checksums and metadata DUP are helping me
out here.
With ext4 I'd have been worse off for sure.

 One thing I'd strongly recommend.  Once the rebuild is complete and you 
 do the scrub, there may well be both read/corrected errors, and 
 unverified errors.  AFAIK, the unverified errors are a result of bad 
 metadata blocks, so missing checksums for what they covered.  So once you 

I'm slightly confused here. If I have metadata DUP and checksums, how
can metadata blocks be unverified?
Data blocks being unverified, I understand, it would mean the data or
checksum is bad, but I expect that's a different error message I haven't
seen yet.
(I use sec.pl which Emails me any btrfs kernel log line that's not
whitelisted as being known/OK)

On Wed, Jun 17, 2015 at 08:58:18AM -0600, Chris Murphy wrote:
 I think more information is needed at the time of this entry, maybe
 the previous 20 entries or so. That Btrfs thinks there is a read error
 is different than when it thinks there's a checksum error. For example
 when I willfully corrupt one sector that I know contains metadata,
 then read the file or do a scrub:
 
 [48466.824770] BTRFS: checksum error at logical 20971520 on dev
 /dev/sdb, sector 57344: metadata leaf (level 0) in tree 3
 [48466.829900] BTRFS: checksum error at logical 20971520 on dev
 /dev/sdb, sector 57344: metadata leaf (level 0) in tree 3
 [48466.834944] BTRFS: bdev /dev/sdb errs: wr 0, rd 0, flush 0, corrupt 1, gen  0
 [48466.853589] BTRFS: fixed up error at logical 20971520 on dev /dev/sdb
 
 I'd expect in your case you have a bdev line that reads rd 1, corrupt 0.

Thankfully I have 0 checksum errors.

I looked at all the warnings I got last night, and none are repeated, so it
looks like it's just minor corruption I got from re-adding a raid drive that
wasn't 100% in sync with the others, and btrfs nicely fixing things for me.
Nice! 

BTRFS: read error corrected: ino 1 off 205565952 (dev /dev/mapper/dshelf1 sector 417880)
BTRFS: read error corrected: ino 1 off 216420352 (dev /dev/mapper/dshelf1 sector 439080)
BTRFS: read error corrected: ino 1 off 216424448 (dev /dev/mapper/dshelf1 sector 439088)
BTRFS: read error corrected: ino 1 off 216473600 (dev /dev/mapper/dshelf1 sector 439184)
BTRFS: read error corrected: ino 1 off 226656256 (dev /dev/mapper/dshelf1 sector 459072)
BTRFS: read error corrected: ino 1 off 226693120 (dev /dev/mapper/dshelf1 sector 459144)
BTRFS: read error corrected: ino 1 off 226729984 (dev /dev/mapper/dshelf1 sector 459216)
BTRFS: read error corrected: ino 1 off 226734080 (dev /dev/mapper/dshelf1 sector 459224)
BTRFS: read error corrected: ino 1 off 226742272 (dev /dev/mapper/dshelf1 sector 459240)
BTRFS: read error corrected: ino 1 off 226758656 (dev /dev/mapper/dshelf1 sector 459272)
BTRFS: read error corrected: ino 1 off 226783232 (dev /dev/mapper/dshelf1 sector 459320)
BTRFS: read error corrected: ino 1 off 226811904 (dev /dev/mapper/dshelf1 sector 459376)
BTRFS: read error corrected: ino 1 off 226840576 (dev /dev/mapper/dshelf1 sector 459432)
BTRFS: read error corrected: ino 1 off 226865152 (dev /dev/mapper/dshelf1 sector 459480)
BTRFS: read error corrected: ino 1 off 226975744 (dev /dev/mapper/dshelf1 sector 459696)
BTRFS: read error corrected: ino 1 off 228589568 (dev /dev/mapper/dshelf1 sector 462848)
BTRFS: read error corrected: ino 1 off 228601856 (dev /dev/mapper/dshelf1 sector 462872)
BTRFS: read error corrected: ino 1 off 228610048 (dev /dev/mapper/dshelf1 sector 462888)
BTRFS: read error corrected: ino 1 off 228614144 (dev /dev/mapper/dshelf1 sector 462896)
BTRFS: read error corrected: ino 1 off 228618240 (dev /dev/mapper/dshelf1 sector 462904)
BTRFS: read error corrected: ino 1 off 228626432 (dev /dev/mapper/dshelf1 sector 462920)
BTRFS: read error corrected: ino 1 off 228630528 (dev /dev/mapper/dshelf1 sector 462928)
BTRFS: read error corrected: ino 1 off 228638720 (dev /dev/mapper/dshelf1 sector 462944)
BTRFS: read error corrected: ino 1 off 228642816 (dev /dev/mapper/dshelf1 sector 462952)
BTRFS: read error corrected: ino 1 off 228646912 (dev /dev/mapper/dshelf1 sector 462960)
BTRFS: read error corrected: ino 1 off 228651008 (dev /dev/mapper/dshelf1 sector 462968)
BTRFS: read error corrected: ino 1 off 228708352 (dev /dev/mapper/dshelf1 sector 463080)
BTRFS: read error corrected: ino 1 off 228712448 (dev /dev/mapper/dshelf1 sector 463088)
BTRFS: read error corrected: ino 1 off 228716544 (dev /dev/mapper/dshelf1 sector 463096)
BTRFS: read error corrected: ino 1 off 228720640 (dev /dev/mapper/dshelf1 sector 463104)
BTRFS: read error 

Re: [RFC PATCH v2 1/2] Btrfs: add noi_version option to disable MS_I_VERSION

2015-06-17 Thread Liu Bo
On Wed, Jun 17, 2015 at 05:33:06PM +0200, David Sterba wrote:
 On Wed, Jun 17, 2015 at 03:54:31PM +0800, Liu Bo wrote:
  MS_I_VERSION is enabled by default for btrfs, this adds an alternative
  option to toggle it off.
 
 There's an existing generic iversion/noiversion mount option pair, no
 need to extra add it to btrfs.

I know, it doesn't work though.

Thanks,

-liubo


Re: [RFC PATCH v2 2/2] Btrfs: improve fsync for nocow file

2015-06-17 Thread David Sterba
On Wed, Jun 17, 2015 at 03:54:32PM +0800, Liu Bo wrote:
 If we're overwriting an allocated file without changing timestamp
 and inode version, and the file is NODATACOW, we don't have any metadata to
 commit, thus we can just flush the data device cache and go forward.
 
 However, if there is any change to the extents' disk bytenr, inode size,
 timestamp or inode version, we need to go through the normal btrfs_log_inode
 path.
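
The rule described above can be written as a single predicate. This is a sketch of the stated conditions only, with hypothetical names (write_summary, fsync_can_skip_log); it does not reproduce how the patch tracks these conditions through inode runtime flags.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what a write changed since the last sync. */
struct write_summary {
	bool nodatacow;          /* file is NODATACOW */
	bool extents_changed;    /* some extent's disk bytenr changed */
	bool isize_changed;      /* inode size changed */
	bool timestamp_changed;  /* mtime or ctime changed */
	bool iversion_changed;   /* inode version was bumped */
};

/* True when, per the description, flushing the device cache is enough. */
static bool fsync_can_skip_log(const struct write_summary *w)
{
	assert(w != NULL);
	if (!w->nodatacow)
		return false;
	return !(w->extents_changed || w->isize_changed ||
		 w->timestamp_changed || w->iversion_changed);
}
```

Any single changed field sends fsync back to the normal btrfs_log_inode path, which is why the benchmark below disables inode version updates as well as COW.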
 
 Test:
 
 1. sysbench test of
 1 file + 1 thread + bs=4k + size=40k + synchronous I/O mode + randomwrite +
 fsync_on_each_write,
 2. loop device associated with tmpfs file
 3.
   - For btrfs, -o nodatacow and -o noi_version option
   - For ext4 and xfs, no extra mount options
 
 
 Results:
 
 - btrfs:
 w/o: ~30Mb/sec
 w:   ~131Mb/sec
 
 - other filesystems: (both don't enable i_version by default)
 ext4:  203Mb/sec
 xfs:   212Mb/sec
 
 
 Signed-off-by: Liu Bo bo.li@oracle.com
 ---
 v2: Catch errors from data writeback and skip barrier if necessary.
 
  fs/btrfs/btrfs_inode.h |2 +
  fs/btrfs/disk-io.c |2 +-
  fs/btrfs/disk-io.h |1 +
  fs/btrfs/file.c|   54 +++
  fs/btrfs/inode.c   |3 ++
  5 files changed, 56 insertions(+), 6 deletions(-)
 
 diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
 index de5e4f2..f7b99b6 100644
 --- a/fs/btrfs/btrfs_inode.h
 +++ b/fs/btrfs/btrfs_inode.h
 @@ -44,6 +44,8 @@
  #define BTRFS_INODE_IN_DELALLOC_LIST 9
  #define BTRFS_INODE_READDIO_NEED_LOCK10
  #define BTRFS_INODE_HAS_PROPS11
 +#define BTRFS_INODE_NOTIMESTAMP  12
 +#define BTRFS_INODE_NOISIZE  13

It's not clear what the flags mean and that they're related to syncing
under some conditions.

  /*
   * The following 3 bits are meant only for the btree inode.
   * When any of them is set, it means an error happened while writing an
 --- a/fs/btrfs/disk-io.h
 +++ b/fs/btrfs/disk-io.h
 @@ -60,6 +60,7 @@ void close_ctree(struct btrfs_root *root);
  int write_ctree_super(struct btrfs_trans_handle *trans,
                        struct btrfs_root *root, int max_mirrors);
  struct buffer_head *btrfs_read_dev_super(struct block_device *bdev);
 +int barrier_all_devices(struct btrfs_fs_info *info);
  int btrfs_commit_super(struct btrfs_root *root);
  struct extent_buffer *btrfs_find_tree_block(struct btrfs_root *root,
                                              u64 bytenr);
 diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
 index faa7d39..180a3e1 100644
 --- a/fs/btrfs/file.c
 +++ b/fs/btrfs/file.c
 @@ -523,8 +523,12 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
  	 * the disk i_size.  There is no need to log the inode
  	 * at this time.
  	 */
 -	if (end_pos > isize)
 +	if (end_pos > isize) {
  		i_size_write(inode, end_pos);
 +		clear_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
 +	} else {
 +		set_bit(BTRFS_INODE_NOISIZE, &BTRFS_I(inode)->runtime_flags);
 +	}
  	return 0;
  }
  
 @@ -1715,19 +1719,33 @@ out:
  static void update_time_for_write(struct inode *inode)
  {
  	struct timespec now;
 +	int sync_it = 0;
  
 -	if (IS_NOCMTIME(inode))
 +	if (IS_NOCMTIME(inode)) {
 +		set_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
  		return;
 +	}
  
  	now = current_fs_time(inode->i_sb);
 -	if (!timespec_equal(&inode->i_mtime, &now))
 +	if (!timespec_equal(&inode->i_mtime, &now)) {
  		inode->i_mtime = now;
 +		sync_it = S_MTIME;
 +	}
  
 -	if (!timespec_equal(&inode->i_ctime, &now))
 +	if (!timespec_equal(&inode->i_ctime, &now)) {
  		inode->i_ctime = now;
 +		sync_it |= S_CTIME;
 +	}
  
 -	if (IS_I_VERSION(inode))
 +	if (IS_I_VERSION(inode)) {
  		inode_inc_iversion(inode);
 +		sync_it |= S_VERSION;
 +	}
 +
 +	if (!sync_it)
 +		set_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
 +	else
 +		clear_bit(BTRFS_INODE_NOTIMESTAMP, &BTRFS_I(inode)->runtime_flags);
  }
  
  static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 @@ -1983,6 +2001,32 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
  		goto out;
  	}
  
 +	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATACOW) {
 +		if (test_and_clear_bit(BTRFS_INODE_NOTIMESTAMP,
 +				       &BTRFS_I(inode)->runtime_flags) &&
 +		    test_and_clear_bit(BTRFS_INODE_NOISIZE,
 +				       &BTRFS_I(inode)->runtime_flags)) {
 +
 +			/* make sure data is on disk and catch error */
 +			if (!full_sync)
 +				ret =