Re: A Big Thank You, and some Notes on Current Recovery Tools.

2018-01-01 Thread Kai Krakow
On Mon, 01 Jan 2018 18:13:10 +0800, Qu Wenruo wrote:

> On 2018年01月01日 08:48, Stirling Westrup wrote:
>> Okay, I want to start this post with a HUGE THANK YOU THANK YOU THANK
>> YOU to Nikolay Borisov and most especially to Qu Wenruo!
>> 
>> Thanks to their tireless help in answering all my dumb questions I have
>> managed to get my BTRFS working again! As I speak I have the full,
>> non-degraded, quad of drives mounted and am updating my latest backup
>> of their contents.
>> 
>> I had a 4-drive setup with 2x4T and 2x2T drives and one of the 2T
>> drives failed, and with help I was able to make a 100% recovery of the
>> lost data. I do have some observations on what I went through though.
>> Take this as constructive criticism, or as a point for discussing
>> additions to the recovery tools:
>> 
>> 1) I had a 2T drive die with exactly 3 hard-sector errors and those 3
>> errors exactly coincided with the 3 super-blocks on the drive.
> 
> WTF, why does all this corruption happen at the btrfs super blocks?!
> 
> What a coincidence.

Maybe it's a hybrid drive with flash? Or something went wrong in the 
drive-internal cache memory at the very moment the superblocks were updated?

I bet that the sectors aren't really broken, just that the on-disk checksum 
didn't match the sector. I remember such things happening to me more than 
once back in the days when drives were still connected by molex power 
connectors. Those connectors started to come loose over time, due to 
thermals or repeated disconnecting and reconnecting. That is, drives sometimes 
no longer had a reliable power source, which led to all sorts 
of very strange problems, mostly resulting in pseudo-defective sectors.

That said, the OP might want to check the power supply after this 
coincidence... Maybe it's aging and no longer able to supply all four 
drives, CPU, GPU and the rest with stable power.


-- 
Regards,
Kai

Replies to list-only preferred.



Re: btrfs balance problems

2017-12-29 Thread Kai Krakow
On Thu, 28 Dec 2017 00:39:37 +0000, Duncan wrote:

>> How can I get btrfs balance to work in the background, without adversely
>> affecting other applications?
> 
> I'd actually suggest a different strategy.
> 
> What I did here way back when I was still on reiserfs on spinning rust,
> where it made more difference than on ssd, but I kept the settings when
> I switched to ssd and btrfs, and at least some others have mentioned
> that similar settings helped them on btrfs as well, is...
> 
> Problem: The kernel virtual-memory subsystem's writeback cache was
> originally configured for systems with well under a Gigabyte of RAM, and
> the defaults no longer work so well on multi-GiB-RAM systems,
> particularly above 8 GiB RAM, because they are based on a percentage of
> available RAM, and will typically let several GiB of dirty writeback
> cache accumulate before kicking off any attempt to actually write it to
> storage.  On spinning rust, when writeback /does/ finally kickoff, this
> can result in hogging the IO for well over half a minute at a time,
> where 30 seconds also happens to be the default "flush it anyway" time.

This is somewhat like the buffer bloat discussion in networking... Big 
buffers increase latency. And there is more than one type of buffer.

In addition to what Duncan wrote (the first type of buffer), the kernel 
recently got a new option to fight this "buffer bloat": writeback 
throttling. It may help to enable that option.

The second type of buffer is the io queue.

So, you may also want to lower the io queue depth (nr_requests) of your 
devices. I think it defaults to 128 while most consumer drives only have 
a queue depth of 31 or 32 commands. Thus, reducing nr_requests for some 
of your devices may help you achieve better latency (but reduces 
throughput).

Especially if working with io schedulers that do not implement io 
priorities, you could simply lower nr_requests to around or below the 
native command queue depth of your devices. The device itself can handle 
it better in that case, especially on spinning rust, as the firmware 
knows when to pull certain selected commands from the queue during a 
rotation of the media. The kernel knows nothing about rotary positions, 
it can only use the queue to prioritize and reorder requests but cannot 
take advantage of rotary positions of the heads.

See

$ grep ^ /sys/block/*/queue/nr_requests


You may instead get better results by increasing nr_requests, but at the 
cost of also adjusting the write buffer sizes, because with large 
nr_requests you don't want writes to block so early, at least not when 
you need good latency. This probably works best with schedulers that care 
about latency, like deadline or kyber.
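
For illustration, such tuning could look roughly like this (device name
and values are just examples, adjust them to your hardware; wbt_lat_usec
only exists if writeback throttling is compiled into the kernel):

# echo 32 > /sys/block/sda/queue/nr_requests       # roughly match the drive's NCQ depth
# echo 75000 > /sys/block/sda/queue/wbt_lat_usec   # writeback throttling target latency
# sysctl -w vm.dirty_background_bytes=$((64*1024*1024))  # start background writeback earlier
# sysctl -w vm.dirty_bytes=$((256*1024*1024))            # hard limit for the dirty writeback cache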

For testing, keep in mind that all of these settings interact. So change 
one at a time, run your tests, then change another and see how it relates 
to the first change, even if the first change made your experience worse.

Another tip that's missing: Put different access classes onto different 
devices. That is, if you have a directory structure that's mostly written 
to, put it on its own physical devices, with separate tuning and an 
appropriate filesystem (log-structured and cow filesystems are good at 
streaming writes). Put read-mostly workloads on their own devices and 
filesystems, too. Put realtime workloads on their own devices and filesystems. 
This gives you a much better chance of success.


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Btrfs allow compression on NoDataCow files? (AFAIK Not, but it does)

2017-12-22 Thread Kai Krakow
On Thu, 21 Dec 2017 13:51:40 -0500, Chris Mason wrote:

> On 12/20/2017 03:59 PM, Timofey Titovets wrote:
>> How reproduce:
>> touch test_file
>> chattr +C test_file
>> dd if=/dev/zero of=test_file bs=1M count=1
>> btrfs fi def -vrczlib test_file
>> filefrag -v test_file
>> 
>> test_file
>> Filesystem type is: 9123683e
>> File size of test_file is 1048576 (256 blocks of 4096 bytes)
>> ext: logical_offset:physical_offset: length:   expected: flags:
>>0:0..  31:   72917050..  72917081: 32: encoded
>>1:   32..  63:   72917118..  72917149: 32:   72917082: encoded
>>2:   64..  95:   72919494..  72919525: 32:   72917150: encoded
>>3:   96.. 127:   72927576..  72927607: 32:   72919526: encoded
>>4:  128.. 159:   72943261..  72943292: 32:   72927608: encoded
>>5:  160.. 191:   72944929..  72944960: 32:   72943293: encoded
>>6:  192.. 223:   72944952..  72944983: 32:   72944961: encoded
>>7:  224.. 255:   72967084..  72967115: 32:   72944984: last,encoded,eof
>> test_file: 8 extents found
>> 
>> I can't find where that error happens in the code right now,
>> but it's reproducible on Linux 4.14.8
> 
> We'll silently cow in a few cases, this is one.

I think the question was about compression, not cow.

I can reproduce this behavior:

$ touch nocow.dat
$ touch cow.dat
$ chattr +c cow.dat
$ chattr +C nocow.dat
$ dd if=/dev/zero of=cow.dat count=1 bs=1M
$ dd if=/dev/zero of=nocow.dat count=1 bs=1M

$ filefrag -v cow.dat
Filesystem type is: 9123683e
File size of cow.dat is 1048576 (256 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31: 1044845154..1044845185: 32: encoded,shared
   1:   32..  63: 1044845166..1044845197: 32: 1044845186: encoded,shared
   2:   64..  95: 1044845167..1044845198: 32: 1044845198: encoded,shared
   3:   96.. 127: 1044851064..1044851095: 32: 1044845199: encoded,shared
   4:  128.. 159: 1044851065..1044851096: 32: 1044851096: encoded,shared
   5:  160.. 191: 1044852160..1044852191: 32: 1044851097: encoded,shared
   6:  192.. 223: 1044943106..1044943137: 32: 1044852192: encoded,shared
   7:  224.. 255: 1045054792..1045054823: 32: 1044943138: last,encoded,shared,eof
cow.dat: 8 extents found

$ filefrag -v nocow.dat 
Filesystem type is: 9123683e
File size of nocow.dat is 1048576 (256 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0.. 255: 1196077983..1196078238:256: last,shared,eof
nocow.dat: 1 extent found

Now, after a compressing defrag run (as in the reproduction above), it seems to be compressed (8x 128k extents):

$ filefrag -v nocow.dat  
Filesystem type is: 9123683e
File size of nocow.dat is 1048576 (256 blocks of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..  31: 1121866367..1121866398: 32: encoded,shared
   1:   32..  63: 1121866369..1121866400: 32: 1121866399: encoded,shared
   2:   64..  95: 1121866370..1121866401: 32: 1121866401: encoded,shared
   3:   96.. 127: 1121866371..1121866402: 32: 1121866402: encoded,shared
   4:  128.. 159: 1121866372..1121866403: 32: 1121866403: encoded,shared
   5:  160.. 191: 1121866373..1121866404: 32: 1121866404: encoded,shared
   6:  192.. 223: 1121866374..1121866405: 32: 1121866405: encoded,shared
   7:  224.. 255: 1121866375..1121866406: 32: 1121866406: last,encoded,shared,eof
nocow.dat: 8 extents found
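
Just to double-check the attributes used in this reproduction, lsattr
should show the 'c' (compress) flag on cow.dat and the 'C' (nocow) flag
on nocow.dat (output omitted):

$ lsattr cow.dat nocow.dat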


-- 
Regards,
Kai

Replies to list-only preferred.



[4.14] WARNING: CPU: 2 PID: 779 at fs/btrfs/backref.c:1255 find_parent_nodes+0x892/0x1340

2017-12-15 Thread Kai Krakow
Hello!

During balance I'm seeing the following in dmesg:

[194123.693226] [ cut here ]
[194123.693231] WARNING: CPU: 2 PID: 779 at fs/btrfs/backref.c:1255 
find_parent_nodes+0x892/0x1340
[194123.693232] Modules linked in: f2fs bridge stp llc cifs ccm xfs rfcomm bnep 
fuse cachefiles snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic 
af_packet iTCO_wdt iTCO_vendor_support btusb btintel bluetooth kvm_intel 
snd_hda_intel rfkill snd_hda_codec kvm ecdh_generic snd_hda_core rtc_cmos 
lpc_ich irqbypass snd_pcm snd_timer snd soundcore uas usb_storage r8168(O) 
nvidia_drm(PO) vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) nvidia_uvm(PO) 
nvidia_modeset(PO) nvidia(PO) nct6775 hwmon_vid coretemp hwmon efivarfs
[194123.693252] CPU: 2 PID: 779 Comm: crawl Tainted: P   O
4.14.0-pf5 #3
[194123.693252] Hardware name: To Be Filled By O.E.M. To Be Filled By 
O.E.M./Z68 Pro3, BIOS L2.16A 02/22/2013
[194123.693253] task: 8803e0a53840 task.stack: c90003f24000
[194123.693254] RIP: 0010:find_parent_nodes+0x892/0x1340
[194123.693255] RSP: 0018:c90003f27b30 EFLAGS: 00010286
[194123.693256] RAX:  RBX: 88000bcdb3a0 RCX: 
88000bef7d68
[194123.693256] RDX: 88000bef7d68 RSI: 88042f29b620 RDI: 
88000bef7c60
[194123.693256] RBP: c90003f27c50 R08: 0001b620 R09: 
812949c0
[194123.693257] R10: ea2f36c0 R11: 88041d003980 R12: 

[194123.693257] R13: 88000bef7c60 R14: 8801c1487a40 R15: 
88000bef7d68
[194123.693258] FS:  7f4cae8cb700() GS:88042f28() 
knlGS:
[194123.693258] CS:  0010 DS:  ES:  CR0: 80050033
[194123.693259] CR2: 7f63e7953038 CR3: 0003e190a002 CR4: 
001606e0
[194123.693259] Call Trace:
[194123.693262]  ? kmem_cache_free+0x13d/0x170
[194123.693265]  ? btrfs_find_all_roots_safe+0x89/0xf0
[194123.693266]  btrfs_find_all_roots_safe+0x89/0xf0
[194123.693267]  ? extent_same_check_offsets+0x60/0x60
[194123.693269]  iterate_extent_inodes+0x154/0x270
[194123.693270]  ? extent_same_check_offsets+0x60/0x60
[194123.693271]  ? iterate_inodes_from_logical+0x81/0x90
[194123.693272]  iterate_inodes_from_logical+0x81/0x90
[194123.693273]  btrfs_ioctl+0x8a3/0x2420
[194123.693275]  ? generic_file_read_iter+0x2c1/0x7c0
[194123.693276]  ? do_vfs_ioctl+0x8a/0x5c0
[194123.693277]  do_vfs_ioctl+0x8a/0x5c0
[194123.693278]  ? __fget+0x62/0xa0
[194123.693279]  SyS_ioctl+0x36/0x70
[194123.693281]  entry_SYSCALL_64_fastpath+0x13/0x94
[194123.693283] RIP: 0033:0x3d9acf0037
[194123.693283] RSP: 002b:7f4cae8c8448 EFLAGS: 0246 ORIG_RAX: 
0010
[194123.693284] RAX: ffda RBX: 000a RCX: 
003d9acf0037
[194123.693284] RDX: 7f4cae8c85a8 RSI: c0389424 RDI: 
0003
[194123.693285] RBP: 000a R08:  R09: 
0400
[194123.693285] R10: 0020 R11: 0246 R12: 
000a
[194123.693285] R13:  R14: 000a R15: 
7f4ca8080f70
[194123.693286] Code: 3d e4 f6 e2 00 4c 89 ee e8 2c 5e ed ff 44 89 f2 45 85 f6 
0f 85 4a fe ff ff e9 4f fb ff ff 85 c0 0f 84 22 fc ff ff e9 5d fc ff ff <0f> ff 
48 85 db 0f 84 49 fc ff ff e9 37 fc ff ff 41 f6 47 42 01 
[194123.693300] ---[ end trace 8c5a2104d4238bb7 ]---


-- 
Regards,
Kai

Replies to list-only preferred.



Ref doesn't match the record start and is compressed

2017-12-10 Thread Kai Krakow
Hello!

Since my system was temporarily overloaded (load 500+ for probably
several hours but it recovered, no reboot needed), I'm seeing
btrfs-related backtraces in dmesg with the fs going RO.

I tried to fix it but it fails:

> Fixed 0 roots.
> checking extents
> ref mismatch on [3043919785984 57344] extent item 1, found 3
> Data backref 3043919785984 root 265 owner 83535 offset 131072 num_refs 0 not 
> found in extent tree
> Incorrect local backref count on 3043919785984 root 265 owner 83535 offset 
> 131072 found 1 wanted 0 back 0x61854760
> Backref disk bytenr does not match extent record, bytenr=3043919785984, ref 
> bytenr=3043919790080
> Backref bytes do not match extent backref, bytenr=3043919785984, ref 
> bytes=57344, backref bytes=40960
> Data backref 3043919785984 root 259 owner 11522804 offset 0 num_refs 0 not 
> found in extent tree
> Incorrect local backref count on 3043919785984 root 259 owner 11522804 offset 
> 0 found 1 wanted 0 back 0x57892430
> Backref disk bytenr does not match extent record, bytenr=3043919785984, ref 
> bytenr=3043919831040
> Backref bytes do not match extent backref, bytenr=3043919785984, ref 
> bytes=57344, backref bytes=8192
> backpointer mismatch on [3043919785984 57344]
> attempting to repair backref discrepency for bytenr 3043919785984
> Ref doesn't match the record start and is compressed, please take a 
> btrfs-image of this file system and send it to a btrfs developer so they can 
> complete this functionality for bytenr 3043919790080
> failed to repair damaged filesystem, aborting
> enabling repair mode
> Checking filesystem on /dev/bcache48
> UUID: bc201ce5-8f2b-4263-995a-6641e89d4c88
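
(Mapping the reported bytenr back to file paths went roughly like this;
the mount point is just a placeholder:)

$ sudo btrfs inspect-internal logical-resolve -v 3043919785984 /mnt/point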

When I identify the files with "btrfs inspect-internal" and try to
delete them, btrfs fails and goes into RO mode:

[  735.877040] [ cut here ]
[  735.877048] WARNING: CPU: 2 PID: 809 at fs/btrfs/extent-tree.c:7053 __btrfs_free_extent.isra.71+0x740/0xbc0
[  735.877049] Modules linked in: uas usb_storage r8168(O) nvidia_drm(PO) vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) nvidia_uvm(PO) nvidia_modeset(PO) nvidia(PO) nct6775 hwmon_vid coretemp hwmon efivarfs
[  735.877065] CPU: 2 PID: 809 Comm: umount Tainted: P   O    4.14.0-pf4 #1
[  735.877066] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z68 Pro3, BIOS L2.16A 02/22/2013
[  735.877067] task: 88040b6cf080 task.stack: c900035b
[  735.877070] RIP: 0010:__btrfs_free_extent.isra.71+0x740/0xbc0
[  735.877071] RSP: 0018:c900035b3c00 EFLAGS: 00010246
[  735.877073] RAX: fffe RBX: 02c4b7c2 RCX: c900035b3bb8
[  735.877074] RDX: 02c4b7c2d000 RSI: 88021c22f607 RDI: c900035b3bb8
[  735.877075] RBP: fffe R08:  R09: 0001464f
[  735.877076] R10: 0002 R11: 076d R12: 880414618000
[  735.877077] R13: 880406b3d850 R14:  R15: 0109

Re: Need help with incremental backup strategy (snapshots, defragmenting & performance)

2017-11-17 Thread Kai Krakow
On Fri, 17 Nov 2017 06:51:52 +0300, Andrei Borzenkov <arvidj...@gmail.com> wrote:

> On 16.11.2017 19:13, Kai Krakow wrote:
> ...
> > > BTW: From user API perspective, btrfs snapshots do not guarantee  
> > perfect granular consistent backups.  
> 
> Is it documented somewhere? I was relying on crash-consistent
> write-order-preserving snapshots in NetApp for as long as I remember.
> And I was sure btrfs offers it, as it is something obvious for the
> redirect-on-write idea.

I think it has ordering guarantees, but it is not as atomic in time as
one might think. That's the point. But the devs can say for sure.


> > A user-level file transaction may
> > still end up only partially in the snapshot. If you are running
> > transaction sensitive applications, those usually do provide some
> > means of preparing a freeze and a thaw of transactions.
> >   
> 
> Is snapshot creation synchronous to know when thaw?

I think you could do "btrfs snap create", then "btrfs fs sync", and
everything should be fine.
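
Something along these lines, with placeholder paths:

$ btrfs subvolume snapshot -r /data /data/.snapshots/data-$(date +%FT%H:%M)
$ btrfs filesystem sync /data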


> > I think the user transactions API which could've been used for this
> > will even be removed during the next kernel cycles. I remember
> > reiserfs4 tried to deploy something similar. But there's no
> > consistent layer in the VFS for subscribing applications to
> > filesystem snapshots so they could prepare and notify the kernel
> > when they are ready. 
> 
> I do not see what VFS has to do with it. NetApp works by simply
> preserving previous consistency point instead of throwing it away.
> I.e. snapshot is always last committed image on stable storage. Would
> something like this be possible on btrfs level by duplicating current
> on-disk root (sorry if I use wrong term)?

I think btrfs gives the same consistency. But the moment you issue
"btrfs snap create", the actual snapshot creation may be delayed a little
bit. So if your application relies on exact point-in-time snapshots, you
need to synchronize your application with the filesystem. I think the
same is true for NetApp.

I just wanted to point that out because it may not be obvious, given
that btrfs snapshot creation is built right into the tool chain of the
filesystem itself, unlike e.g. NetApp or LVM or other storage layers.

Background: A good while back I was told that btrfs snapshots during
ongoing IO may result in some of the later IO being carried over to before
the snapshot. Transactional ordering of IO operations is still
guaranteed, but it may overlap with snapshot creation. So you can still
lose a transaction you didn't expect to lose at that point in time.

So I understood this as:

If you just want to ensure transactional integrity of your database,
you are all fine with btrfs snapshots.

But if you want to ensure that a just finished transaction makes it
into the snapshot completely, you have to sync the processes.

However, things may have changed since then.


-- 
Regards,
Kai

Replies to list-only preferred.




Re: Tiered storage?

2017-11-16 Thread Kai Krakow
On Wed, 15 Nov 2017 08:11:04 +0100, waxhead wrote:

> As for dedupe there is (to my knowledge) nothing fully automatic yet. 
> You have to run a program to scan your filesystem but all the 
> deduplication is done in the kernel.
> duperemove works apparently quite well when I tested it, but there
> may be some performance implications.

There's bees, a near-line deduplication tool: it watches for
generation changes in the filesystem and walks the inodes. It only
looks at extents, not at files. Deduplication itself is then delegated
to the kernel, which ensures all changes are data-safe. The process
runs as a daemon and processes your changes in realtime (delayed by
a few seconds to minutes of course, due to the transaction commit and
hashing phase).

You need to dedicate part of your RAM to it; around 1 GB is
usually sufficient to work well enough. The RAM will be locked and
cannot be swapped out, so you should have a sufficiently equipped
system.

Works very well here (2TB of data, 1GB hash table, 16GB RAM).
Newly duplicated files are picked up within seconds, scanned (hitting
the cache most of the time, thus not requiring physical IO), and then
submitted to the kernel for deduplication.

I'd call that fully automatic: Once set up, it just works, and works
well. Performance impact is very low once the initial scan is done.

https://github.com/Zygo/bees
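
A rough sketch of a manual setup, from memory (check the bees README for
the real instructions, paths and sizing are just examples):

# mkdir /mnt/btrfs-pool/.beeshome
# truncate -s 1G /mnt/btrfs-pool/.beeshome/beeshash.dat  # preallocate the hash table (1 GB as above)
# bees /mnt/btrfs-pool                                   # point it at the mounted filesystem root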


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Need help with incremental backup strategy (snapshots, defragmenting & performance)

2017-11-16 Thread Kai Krakow
Link 2 slipped away, adding it below...

On Tue, 14 Nov 2017 15:51:57 -0500, Dave wrote:

> On Tue, Nov 14, 2017 at 3:50 AM, Roman Mamedov  wrote:
> >
> > On Mon, 13 Nov 2017 22:39:44 -0500
> > Dave  wrote:
> >  
> > > I have my live system on one block device and a backup snapshot
> > > of it on another block device. I am keeping them in sync with
> > > hourly rsync transfers.
> > >
> > > Here's how this system works in a little more detail:
> > >
> > > 1. I establish the baseline by sending a full snapshot to the
> > > backup block device using btrfs send-receive.
> > > 2. Next, on the backup device I immediately create a rw copy of
> > > that baseline snapshot.
> > > 3. I delete the source snapshot to keep the live filesystem free
> > > of all snapshots (so it can be optimally defragmented, etc.)
> > > 4. hourly, I take a snapshot of the live system, rsync all
> > > changes to the backup block device, and then delete the source
> > > snapshot. This hourly process takes less than a minute currently.
> > > (My test system has only moderate usage.)
> > > 5. hourly, following the above step, I use snapper to take a
> > > snapshot of the backup subvolume to create/preserve a history of
> > > changes. For example, I can find the version of a file 30 hours
> > > prior.  
> >
> > Sounds a bit complex, I still don't get why you need all these
> > snapshot creations and deletions, and even still using btrfs
> > send-receive.  
> 
> 
> Hopefully, my comments below will explain my reasons.
> 
> >
> > Here is my scheme:
> > 
> > /mnt/dst <- mounted backup storage volume
> > /mnt/dst/backup  <- a subvolume
> > /mnt/dst/backup/host1/ <- rsync destination for host1, regular
> > directory /mnt/dst/backup/host2/ <- rsync destination for host2,
> > regular directory /mnt/dst/backup/host3/ <- rsync destination for
> > host3, regular directory etc.
> >
> > /mnt/dst/backup/host1/bin/
> > /mnt/dst/backup/host1/etc/
> > /mnt/dst/backup/host1/home/
> > ...
> > Self explanatory. All regular directories, not subvolumes.
> >
> > Snapshots:
> > /mnt/dst/snaps/backup <- a regular directory
> > /mnt/dst/snaps/backup/2017-11-14T12:00/ <- snapshot 1
> > of /mnt/dst/backup /mnt/dst/snaps/backup/2017-11-14T13:00/ <-
> > snapshot 2
> > of /mnt/dst/backup /mnt/dst/snaps/backup/2017-11-14T14:00/ <-
> > snapshot 3 of /mnt/dst/backup
> >
> > Accessing historic data:
> > /mnt/dst/snaps/backup/2017-11-14T12:00/host1/bin/bash
> > ...
> > /bin/bash for host1 as of 2017-11-14 12:00 (time on the backup
> > system).
> > 
> >
> > No need for btrfs send-receive, only plain rsync is used, directly
> > from hostX:/ to /mnt/dst/backup/host1/;  
> 
> 
> I prefer to start with a BTRFS snapshot at the backup destination. I
> think that's the most "accurate" starting point.

No, you should finish with a snapshot. Use the rsync destination as a
"dirty" scratch area and let rsync also delete files which are no longer
in the source. After successfully running rsync, make a snapshot of
that directory and make it RO; leave the scratch area in place (even if
rsync dies or gets killed).

I once made some scripts[2] following those rules, you may want to adapt
them.


> > No need to create or delete snapshots during the actual backup
> > process;  
> 
> Then you can't guarantee consistency of the backed up information.

Take a temporary snapshot of the source, rsync it to the scratch
destination, take a RO snapshot of that destination, then remove the
temporary snapshot.
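
In shell terms, roughly (host names and paths are placeholders):

$ ssh host1 btrfs subvolume snapshot -r / /snap-tmp             # temporary snapshot on the source
$ rsync -aHAX --delete host1:/snap-tmp/ /mnt/dst/backup/host1/  # sync into the scratch area
$ btrfs subvolume snapshot -r /mnt/dst/backup /mnt/dst/snaps/backup/$(date +%FT%H:%M)
$ ssh host1 btrfs subvolume delete /snap-tmp                    # drop the temporary snapshot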

BTW: From the user API perspective, btrfs snapshots do not guarantee
perfectly granular consistent backups. A user-level file transaction may
still end up only partially in the snapshot. If you are running
transaction-sensitive applications, those usually do provide some means
of preparing a freeze and a thaw of transactions.

I think the user transactions API which could've been used for this
will even be removed during the next kernel cycles. I remember
reiserfs4 tried to deploy something similar. But there's no consistent
layer in the VFS for subscribing applications to filesystem snapshots
so they could prepare and notify the kernel when they are ready.


> > A single common timeline is kept for all hosts to be backed up,
> > snapshot count not multiplied by the number of hosts (in my case
> > the backup location is multi-purpose, so I somewhat care about
> > total number of snapshots there as well);
> >
> > Also, all of this works even with source hosts which do not use
> > Btrfs.  
> 
> That's not a concern for me because I prefer to use BTRFS everywhere.

At least I suggest looking into bees[1] to deduplicate the backup
destination. Rsync does not work very efficiently with btrfs snapshots.
It will often break reflinks and write inefficiently sized blocks, even
with the inplace option. Also, rsync won't efficiently catch files 


Re: A partially failing disk in raid0 needs replacement

2017-11-14 Thread Kai Krakow
On Tue, 14 Nov 2017 17:48:56 +0500, Roman Mamedov wrote:

> [1] Note that "ddrescue" and "dd_rescue" are two different programs
> for the same purpose, one may work better than the other. I don't
> remember which. :)

One is a perl implementation and is the one working worse. ;-)


-- 
Regards,
Kai

Replies to list-only preferred.



Re: btrfs seed question

2017-11-03 Thread Kai Krakow
On Thu, 12 Oct 2017 09:20:28 -0400, Joseph Dunn wrote:

> On Thu, 12 Oct 2017 12:18:01 +0800
> Anand Jain  wrote:
> 
> > On 10/12/2017 08:47 AM, Joseph Dunn wrote:  
> > > After seeing how btrfs seeds work I wondered if it was possible
> > > to push specific files from the seed to the rw device.  I know
> > > that removing the seed device will flush all the contents over to
> > > the rw device, but what about flushing individual files on demand?
> > > 
> > > I found that opening a file, reading the contents, seeking back
> > > to 0, and writing out the contents does what I want, but I was
> > > hoping for a bit less of a hack.
> > > 
> > > Is there maybe an ioctl or something else that might trigger a
> > > similar action?
> > 
> >You mean to say - seed-device delete to trigger copy of only the 
> > specified or the modified files only, instead of whole of
> > seed-device ? What's the use case around this ?
> >   
> 
> Not quite.  While the seed device is still connected I would like to
> force some files over to the rw device.  The use case is basically a
> much slower link to a seed device holding significantly more data than
> we currently need.  An example would be a slower iscsi link to the
> seed device and a local rw ssd.  I would like fast access to a
> certain subset of files, likely larger than the memory cache will
> accommodate.  If at a later time I want to discard the image as a
> whole I could unmount the file system or if I want a full local copy
> I could delete the seed-device to sync the fs.  In the mean time I
> would have access to all the files, with some slower (iscsi) and some
> faster (ssd) and the ability to pick which ones are in the faster
> group at the cost of one content transfer.
> 
> I'm not necessarily looking for a new feature addition, just if there
> is some existing call that I can make to push specific files from the
> slow mirror to the fast one.  If I had to push a significant amount of
> metadata that would be fine, but the file contents feeding some
> computations might be large and useful only to certain clients.
> 
> So far I found that I can re-write the file with the same contents and
> thanks to the lack of online dedupe these writes land on the rw mirror
> so later reads to that file should not hit the slower mirror.  By the
> way, if I'm misunderstanding how the read process would work after the
> file push please correct me.
> 
> I hope this makes sense but I'll try to clarify further if you have
> more questions.

You could try to wrap something like bcache on top of the iscsi device,
then make it a read-mostly cache (like bcache's write-around mode). This
probably involves rewriting the iscsi contents to add a bcache header.
You could try mdcache instead.

Then you sacrifice a few gigabytes of local SSD storage to the caching
layer.
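
A very rough sketch of how that could look (device names are
placeholders, and formatting the backing device is destructive, which is
the content rewrite mentioned above):

# make-bcache -C /dev/ssd-partition   # local SSD partition becomes the cache set
# make-bcache -B /dev/iscsi-device    # backing device gets the bcache superblock
# echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach
# echo writearound > /sys/block/bcache0/bcache/cache_mode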

I guess that you're sharing the seed device with different machines. As
bcache will add a protective superblock, you may need to thin-clone the
seed image on the source to have independent superblocks for each
bcache instance. Not sure how this applies to mdcache as I never used
it.

But the caching approach is probably the easiest way to go for you. And
it's mostly automatic once deployed: you don't have to manually
choose which files to move to the sprout...


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Problem with file system

2017-11-03 Thread Kai Krakow
On Tue, 31 Oct 2017 07:28:58 -0400, "Austin S. Hemmelgarn" wrote:

> On 2017-10-31 01:57, Marat Khalili wrote:
> > On 31/10/17 00:37, Chris Murphy wrote:  
> >> But off hand it sounds like hardware was sabotaging the expected
> >> write ordering. How to test a given hardware setup for that, I
> >> think, is really overdue. It affects literally every file system,
> >> and Linux storage technology.
> >>
> >> It kinda sounds like to me something other than supers is being
> >> overwritten too soon, and that's why it's possible for none of the
> >> backup roots to find a valid root tree, because all four possible
> >> root trees either haven't actually been written yet (still) or
> >> they've been overwritten, even though the super is updated. But
> >> again, it's speculation, we don't actually know why your system
> >> was no longer mountable.  
> > Just a detached view: I know hardware should respect
> > ordering/barriers and such, but how hard is it really to avoid
> > overwriting at least one complete metadata tree for half an hour
> > (even better, yet another one for a day)? Just metadata, not data
> > extents.  
> If you're running on an SSD (or thinly provisioned storage, or
> something else which supports discards) and have the 'discard' mount
> option enabled, then there is no backup metadata tree (this issue was
> mentioned on the list a while ago, but nobody ever replied), because
> it's already been discarded.  This is ideally something which should
> be addressed (we need some sort of discard queue for handling in-line
> discards), but it's not easy to address.
> 
> Otherwise, it becomes a question of space usage on the filesystem,
> and this is just another reason to keep some extra slack space on the
> FS (though that doesn't help _much_, it does help).  This, in theory,
> could be addressed, but it probably can't be applied across mounts of
> a filesystem without an on-disk format change.

Well, maybe inline discard is working at the wrong level. It should
kick in when the reference through any of the backup roots is dropped,
not when the current instance is dropped.

Without knowledge of the internals, I guess discards could be added to
a queue within a new tree in btrfs, and only added to that queue when
dropped from the last backup root referencing it. But this will
probably add some bad performance spikes.

I wonder how a regular fstrim run through cron applies to this problem?
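
E.g. something as simple as a weekly crontab entry (or the fstrim.timer
unit that ships with util-linux):

@weekly /sbin/fstrim -av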


-- 
Regards,
Kai

Replies to list-only preferred.



Re: defragmenting best practice?

2017-11-03 Thread Kai Krakow
On Thu, 2 Nov 2017 22:47:31 -0400, Dave <davestechs...@gmail.com> wrote:

> On Thu, Nov 2, 2017 at 5:16 PM, Kai Krakow <hurikha...@gmail.com>
> wrote:
> 
> >
> > You may want to try btrfs autodefrag mount option and see if it
> > improves things (tho, the effect may take days or weeks to apply if
> > you didn't enable it right from the creation of the filesystem).
> >
> > Also, autodefrag will probably unshare reflinks on your snapshots.
> > You may be able to use bees[1] to work against this effect. Its
> > interaction with autodefrag is not well tested but it works fine
> > for me. Also, bees is able to reduce some of the fragmentation
> > during deduplication because it will rewrite extents back into
> > bigger chunks (but only for duplicated data).
> >
> > [1]: https://github.com/Zygo/bees  
> 
> I will look into bees. And yes, I plan to try autodefrag. (I already
> have it enabled now.) However, I need to understand something about
> how btrfs send-receive works in regard to reflinks and fragmentation.
> 
> Say I have 2 snapshots on my live volume. The earlier one of them has
> already been sent to another block device by btrfs send-receive (full
> backup). Now defrag runs on the live volume and breaks some percentage
> of the reflinks. At this point I do an incremental btrfs send-receive
> using "-p" (or "-c") with the diff going to the same other block
> device where the prior snapshot was already sent.
> 
> Will reflinks be "made whole" (restored) on the receiving block
> device? Or is the state of the source volume replicated so closely
> that reflink status is the same on the target?
> 
> Also, is fragmentation reduced on the receiving block device?
> 
> My expectation is that fragmentation would be reduced and duplication
> would be reduced too. In other words, does send-receive result in
> defragmentation and deduplication too?

As far as I understand, btrfs send/receive doesn't create an exact
mirror. It just replays the block operations between generation
numbers. That is: If it finds new blocks referenced between
generations, it will write a _new_ block to the destination.

So, no, it won't reduce fragmentation or duplication. It just keeps
reflinks intact as long as such extents weren't touched within the
generation range. Otherwise they are rewritten as new extents.

Autodefrag and deduplication processes will as such probably increase
duplication at the destination. A developer may have a better clue, tho.
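
For reference, an incremental send pair looks roughly like this (paths
are placeholders):

$ btrfs send -p /mnt/src/snap-1 /mnt/src/snap-2 | btrfs receive /mnt/dst/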


-- 
Regards,
Kai

Replies to list-only preferred.



Re: defragmenting best practice?

2017-11-03 Thread Kai Krakow
On Fri, 3 Nov 2017 08:58:22 +0300, Marat Khalili wrote:

> On 02/11/17 04:39, Dave wrote:
> > I'm going to make this change now. What would be a good way to
> > implement this so that the change applies to the $HOME/.cache of
> > each user?  
> I'd make each user's .cache a symlink (should work but if it won't
> then bind mount) to a per-user directory in some separately mounted
> volume with necessary options.

On a systemd system, each user already has a private tmpfs location
at /run/user/$(id -u).

You could add to the central login script:

# CACHE_DIR="/run/user/$(id -u)/cache"
# mkdir -p $CACHE_DIR && ln -snf $CACHE_DIR $HOME/.cache

You should not run this as root (because of mkdir -p).

You could wrap it into an if statement:

# if [ "$(whoami)" -ne "root" ]; then
#   ...
# fi


-- 
Regards,
Kai

Replies to list-only preferred.



Re: defragmenting best practice?

2017-11-03 Thread Kai Krakow
On Thu, 2 Nov 2017 22:59:36 -0400, Dave wrote:

> On Thu, Nov 2, 2017 at 7:07 AM, Austin S. Hemmelgarn
>  wrote:
> > On 2017-11-01 21:39, Dave wrote:  
> >> I'm going to make this change now. What would be a good way to
> >> implement this so that the change applies to the $HOME/.cache of
> >> each user?
> >>
> >> The simple way would be to create a new subvolume for each existing
> >> user and mount it at $HOME/.cache in /etc/fstab, hard coding that
> >> mount location for each user. I don't mind doing that as there are
> >> only 4 users to consider. One minor concern is that it adds an
> >> unexpected step to the process of creating a new user. Is there a
> >> better way?
> >>  
> > The easiest option is to just make sure nobody is logged in and run
> > the following shell script fragment:
> >
> > for dir in /home/* ; do
> > rm -rf $dir/.cache
> > btrfs subvolume create $dir/.cache
> > done
> >
> > And then add something to the user creation scripts to create that
> > subvolume.  This approach won't pollute /etc/fstab, will still
> > exclude the directory from snapshots, and doesn't require any
> > hugely creative work to integrate with user creation and deletion.
> >
> > In general, the contents of the .cache directory are just that,
> > cached data. Provided nobody is actively accessing it, it's
> > perfectly safe to just nuke the entire directory...  
> 
> I like this suggestion. Thank you. I had intended to mount the .cache
> subvolumes with the NODATACOW option. However, with this approach, I
> won't be explicitly mounting the .cache subvolumes. Is it possible to
> use "chattr +C $dir/.cache" in that loop even though it is a
> subvolume? And, is setting the .cache directory to NODATACOW the right
> choice given this scenario? From earlier comments, I believe it is,
> but I want to be sure I understood correctly.

It is important to apply "chattr +C" to the _empty_ directory, because
even if used recursively, it won't apply to already existing, non-empty
files. But the +C attribute is inherited by newly created files and
directories: so simply follow the "chattr +C on an empty directory" rule
and you're all set.
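
Applied to the loop quoted above, that would be roughly:

for dir in /home/* ; do
    rm -rf $dir/.cache
    btrfs subvolume create $dir/.cache
    chattr +C $dir/.cache   # while still empty, so new files inherit nodatacow
done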

BTW: You cannot mount subvolumes from an already mounted btrfs device
with different mount options. That is currently not implemented (except
for maybe a very few options). So the fstab approach probably wouldn't
have helped you (depending on your partition layout).

You can simply create subvolumes in the locations needed and they are
implicitly part of the mounted filesystem. Then change that particular
subvolume's cow behavior with chattr.


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Need help with incremental backup strategy (snapshots, defragmenting & performance)

2017-11-03 Thread Kai Krakow
On Thu, 2 Nov 2017 23:24:29 -0400, Dave <davestechs...@gmail.com> wrote:

> On Thu, Nov 2, 2017 at 4:46 PM, Kai Krakow <hurikha...@gmail.com>
> wrote:
> > On Wed, 1 Nov 2017 02:51:58 -0400, Dave <davestechs...@gmail.com> wrote:
> >  
>  [...]  
>  [...]  
>  [...]  
> >>
> >> Thanks for confirming. I must have missed those reports. I had
> >> never considered this idea until now -- but I like it.
> >>
> >> Are there any blogs or wikis where people have done something
> >> similar to what we are discussing here?  
> >
> > I used rsync before, backup source and destination both were btrfs.
> > I was experiencing the same btrfs bug from time to time on both
> > devices, luckily not at the same time.
> >
> > I instead switched to using borgbackup, and xfs as the destination
> > (to not fall the same-bug-in-two-devices pitfall).  
> 
> I'm going to stick with btrfs everywhere. My reasoning is that my
> biggest pitfalls will be related to lack of knowledge. So focusing on
> learning one filesystem better (vs poorly learning two) is the better
> strategy for me, given my limited time. (I'm not an IT professional of
> any sort.)
> 
> Is there any problem with the Borgbackup repository being on btrfs?

No. I just wanted to point out that keeping backup and source on
different media (which includes different technology, too) is common
best practice and adheres to the 3-2-1 backup strategy.


> > Borgbackup achieves a
> > much higher deduplication density and compression, and as such also
> > is able to store much more backup history in the same storage
> > space. The first run is much slower than rsync (due to enabled
> > compression) but successive runs are much faster (like 20 minutes
> > per backup run instead of 4-5 hours).
> >
> > I'm currently storing 107 TB of backup history in just 2.2 TB backup
> > space, which counts a little more than one year of history now,
> > containing 56 snapshots. This is my retention policy:
> >
> >   * 5 yearly snapshots
> >   * 12 monthly snapshots
> >   * 14 weekly snapshots (worth around 3 months)
> >   * 30 daily snapshots
> >
> > Restore is fast enough, and a snapshot can even be fuse-mounted
> > (tho, in that case mounted access can be very slow navigating
> > directories).
> >
> > With latest borgbackup version, the backup time increased to around
> > 1 hour from 15-20 minutes in the previous version. That is due to
> > switching the file cache strategy from mtime to ctime. This can be
> > tuned to get back to old performance, but it may miss some files
> > during backup if you're doing awkward things to file timestamps.
> >
> > I'm also backing up some servers with it now, then use rsync to sync
> > the borg repository to an offsite location.
> >
> > Combined with same-fs local btrfs snapshots with short retention
> > times, this could be a viable solution for you.  
> 
> Yes, I appreciate the idea. I'm going to evaluate both rsync and
> Borgbackup.
> 
> The advantage of rsync, I think, is that it will likely run in just a
> couple minutes. That will allow me to run it hourly and to keep my
> live volume almost entire free of snapshots and fully defragmented.
> It's also very simple as I already have rsync. And since I'm going to
> run btrfs on the backup volume, I can perform hourly snapshots there
> and use Snapper to manage retention. It's all very simple and relies
> on tools I already have and know.
> 
> However, the advantages of Borgbackup you mentioned (much higher
> deduplication density and compression) make it worth considering.
> Maybe Borgbackup won't take long to complete successive (incremental)
> backups on my system.

Once a full backup has been taken, incremental backups are extremely fast.
At least for me, it works much faster than rsync. And as with btrfs
snapshots, each incremental backup is also a full backup. It's not like
traditional backup software that needs the backup parent and grandparent
to make use of the differential and/or incremental backups.

There's one caveat, tho: Only one process can access a repository at a
time, that is you need to serialize different backup jobs if you want
them to go into the same repository. Deduplication is done only within
the same repository. Tho, you might be able to leverage btrfs
deduplication (e.g. using bees) across multiple repositories if you're
not using encrypted repositories.

But since you're currently using send/receive and/or rsync, encrypted
storage of the backup doesn't seem to be an important point to you.

Burp with its client/server approach may have an advantage here, so its
setup see

Re: defragmenting best practice?

2017-11-02 Thread Kai Krakow
On Tue, 31 Oct 2017 20:37:27 -0400, Dave wrote:

> > Also, you can declare the '.firefox/default/' directory to be
> > NOCOW, and that "just works".  
> 
> The cache is in a separate location from the profiles, as I'm sure you
> know.  The reason I suggested a separate BTRFS subvolume for
> $HOME/.cache is that this will prevent the cache files for all
> applications (for that user) from being included in the snapshots. We
> take frequent snapshots and (afaik) it makes no sense to include cache
> in backups or snapshots. The easiest way I know to exclude cache from
> BTRFS snapshots is to put it on a separate subvolume. I assumed this
> would make several things related to snapshots more efficient too.
> 
> As far as the Firefox profile being declared NOCOW, as soon as we take
> the first snapshot, I understand that it will become COW again. So I
> don't see any point in making it NOCOW.

Ah well, not really. The files and directories will still be nocow -
however, the next write to any such file after a snapshot will trigger a
cow operation. So you still see the fragmentation effect, but to a much
lesser extent. And the files themselves will remain in nocow format.

You may want to try the btrfs autodefrag mount option and see if it
improves things (tho, the effect may take days or weeks to show if you
didn't enable it right from the creation of the filesystem).
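
It can also be switched on for an already mounted filesystem (mount
point is a placeholder), e.g.:

$ sudo mount -o remount,autodefrag /mnt/btrfs-pool   # or add autodefrag to the options in fstab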

Also, autodefrag will probably unshare reflinks on your snapshots. You
may be able to use bees[1] to work against this effect. Its interaction
with autodefrag is not well tested but it works fine for me. Also, bees
is able to reduce some of the fragmentation during deduplication
because it will rewrite extents back into bigger chunks (but only for
duplicated data).

[1]: https://github.com/Zygo/bees


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Need help with incremental backup strategy (snapshots, defragmenting & performance)

2017-11-02 Thread Kai Krakow
On Wed, 1 Nov 2017 02:51:58 -0400, Dave wrote:

> >  
> >> To reconcile those conflicting goals, the only idea I have come up
> >> with so far is to use btrfs send-receive to perform incremental
> >> backups  
> >
> > As already said by Romain Mamedov, rsync is viable alternative to
> > send-receive with much less hassle. According to some reports it
> > can even be faster.  
> 
> Thanks for confirming. I must have missed those reports. I had never
> considered this idea until now -- but I like it.
> 
> Are there any blogs or wikis where people have done something similar
> to what we are discussing here?

I used rsync before, backup source and destination both were btrfs. I
was experiencing the same btrfs bug from time to time on both devices,
luckily not at the same time.

I instead switched to using borgbackup, and xfs as the destination (to
not fall into the same-bug-in-two-devices pitfall). Borgbackup achieves a
much higher deduplication density and compression, and as such also is
able to store much more backup history in the same storage space. The
first run is much slower than rsync (due to enabled compression) but
successive runs are much faster (like 20 minutes per backup run instead
of 4-5 hours).

I'm currently storing 107 TB of backup history in just 2.2 TB of backup
space, which amounts to a little more than one year of history now,
containing 56 snapshots. This is my retention policy:

  * 5 yearly snapshots
  * 12 monthly snapshots
  * 14 weekly snapshots (worth around 3 months)
  * 30 daily snapshots
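
That retention policy maps more or less directly onto borg prune options
(the repository path is a placeholder):

$ borg prune --keep-daily 30 --keep-weekly 14 --keep-monthly 12 --keep-yearly 5 /path/to/repo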

Restore is fast enough, and a snapshot can even be fuse-mounted (tho,
in that case mounted access can be very slow navigating directories).

With latest borgbackup version, the backup time increased to around 1
hour from 15-20 minutes in the previous version. That is due to
switching the file cache strategy from mtime to ctime. This can be
tuned to get back to old performance, but it may miss some files during
backup if you're doing awkward things to file timestamps.

I'm also backing up some servers with it now, then use rsync to sync
the borg repository to an offsite location.

Combined with same-fs local btrfs snapshots with short retention times,
this could be a viable solution for you.


-- 
Regards,
Kai

Replies to list-only preferred.



Re: Why do full balance and deduplication reduce available free space?

2017-10-09 Thread Kai Krakow
On Mon, 02 Oct 2017 22:19:32 +0200, Niccolò Belli <darkba...@linuxsystems.it> wrote:

> On 2017-10-02 21:35, Kai Krakow wrote:
> > Besides defragging removing the reflinks, duperemove will unshare
> > your snapshots when used in this way: If it sees duplicate blocks
> > within the subvolumes you give it, it will potentially unshare
> > blocks from the snapshots while rewriting extents.
> > 
> > BTW, you should be able to use duperemove with read-only snapshots
> > if used in read-only-open mode. But I'd rather suggest to use bees
> > instead: It works at whole-volume level, walking extents instead of
> > files. That way it is much faster, doesn't reprocess already
> > deduplicated extents, and it works with read-only snapshots.
> > 
> > Until my patch it didn't like mixed nodatasum/datasum workloads.
> > Currently this is fixed by just leaving nocow data alone as users
> > probably set nocow for exactly the reason to not fragment extents
> > and relocate blocks.  
> 
> Bad Btrfs Feature Interactions: btrfs read-only snapshots (never
> tested, probably wouldn't work well)
> 
> Unfortunately it seems that bees doesn't support read-only snapshots,
> so it's a no way.
> 
> P.S.
> I tried duperemove with -A, but besides taking much longer it didn't 
> improve the situation.
> Are you sure that the culprit is duperemove? AFAIK it shouldn't
> unshare extents...

Unsharing of extents depends... If an extent is shared between an
r/o and an r/w snapshot, rewriting the extent for deduplication ends up in
a shared extent again, but it is no longer reflinked with the original
r/o snapshot. At least if btrfs doesn't allow changing extents that are
part of an r/o snapshot... which you all tell me is the case...

And then, there's unsharing of metadata by the deduplication process
itself.

Both effects should be minimal, tho. But since chunks are allocated in
1GB sizes, the allocation may jump by 1GB just for a few extra MB
needed. A metadata rebalance may fix this.
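
Roughly along the lines of (mount point is a placeholder):

$ sudo btrfs balance start -musage=50 /mnt/pool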


-- 
Regards,
Kai

Replies to list-only preferred.




Re: Why do full balance and deduplication reduce available free space?

2017-10-02 Thread Kai Krakow
On Mon, 02 Oct 2017 12:02:16 +0200, Niccolò Belli wrote:

> This is how I perform balance:
> btrfs balance start --full-balance rootfs
> This is how I perform deduplication (duperemove is from git master):
> duperemove -drh --dedupe-options=noblock --hashfile=../rootfs.hash

Besides defragging removing the reflinks, duperemove will unshare your
snapshots when used in this way: If it sees duplicate blocks within the
subvolumes you give it, it will potentially unshare blocks from the
snapshots while rewriting extents.

BTW, you should be able to use duperemove with read-only snapshots if
used in read-only-open mode. But I'd rather suggest to use bees
instead: It works at whole-volume level, walking extents instead of
files. That way it is much faster, doesn't reprocess already
deduplicated extents, and it works with read-only snapshots.
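
If you still want to try duperemove on read-only snapshots, the
read-only-open mode mentioned above should be the -A switch; a rough
invocation (hash file path and subvolume paths are just examples) would
be something like:

$ sudo duperemove -Adrh --dedupe-options=noblock \
    --hashfile=/var/tmp/dedup.hash /mnt/pool/subvol /mnt/pool/.snapshots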

Until my patch it didn't like mixed nodatasum/datasum workloads.
Currently this is fixed by just leaving nocow data alone as users
probably set nocow for exactly the reason to not fragment extents and
relocate blocks.


-- 
Regards,
Kai

Replies to list-only preferred.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 4.13.3 still has the out of space kernel oops

2017-09-26 Thread Kai Krakow
Am Wed, 27 Sep 2017 00:45:04 +0200
schrieb Ian Kumlien :

> I just had my laptop hit the out of space kernel oops which it kinda
> hard to recover from
> 
> Everything states "out of disk" even with 20 gigs free (both according
> to df and btrfs fi df)

You should run balance from time to time. I can suggest the auto
balance script from here:

https://www.spinics.net/lists/linux-btrfs/msg52076.html

It can be run unattended on a regular basis.
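
If you don't want to pull in the whole script, a stripped-down sketch of
what it does (usage thresholds and mount point are just examples) can be
run from cron or a systemd timer:

#!/bin/sh
# reclaim nearly-empty chunks first, then increasingly fuller ones
for u in 0 10 25 50; do
    btrfs balance start -dusage=$u -musage=$u /mnt/btrfs-pool || break
done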


> So I'm suspecting that i need to run btrfs check on it to recover the
> lost space (i have mounted it with clear_cache and nothing)

I don't think that "btrfs check" would recover free space; that's not
file system corruption, it's an allocation issue due to unbalanced
chunks.


> The problem is, finally getting a shell with rd.shell rd.break=cmdline
> - systemd is still a pain and since it's "udev" it's not allowing me
> to do cryptsetup luksopen due to "dependencies"

Does "emergency" as a cmdline work? It should boot to systemd's
emergency mode. Also, "recovery" as a cmdline could work; it boots to a
different mode. Both work for me using dracut on Gentoo with systemd.


> Basically, btrfs check should be able to run on a ro mounted
> fileystem, this is too hard to get working without having to bring a
> live usb stick atm

I think this is possible in the latest version but only running in
non-repair mode.


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


dmesg: csum "varying number" expected csum 0x0 mirror 1 (with trace/oops)

2017-09-26 Thread Kai Krakow
Hello!

I noticed some kernel messages which seem to be related to btrfs -
should I worry?

I'm currently running scrub on the device now.

inode-resolve points to an unimportant, easily recoverable file:

$ sudo btrfs inspect-internal inode-resolve -v 1229624 
/mnt/btrfs-pool/gentoo/usr/portage/
ioctl ret=0, bytes_left=4033, bytes_missing=0, cnt=1, missed=0
/mnt/btrfs-pool/gentoo/usr/portage//packages/dev-lang/tcl/tcl-8.6.6-2.xpak

The file wasn't modified in months, and it was apparently fine before
because the backup didn't choke on it:

$ ls -alt
-rw-r--r-- 1 root root 11241835 19. Apr 22:11 
/mnt/btrfs-pool/gentoo/usr/portage//packages/dev-lang/tcl/tcl-8.6.6-2.xpak

I can read the file without problems, no new messages in dmesg:

$ cat 
/mnt/btrfs-pool/gentoo/usr/portage//packages/dev-lang/tcl/tcl-8.6.6-2.xpak 
>/dev/null

What's strange is the long time gaps between the btrfs reports and the
kernel backtraces... Also, expected csum=0 looks strange.

That mount is subvol_id=0. Not sure if inode-resolve works that way. I
retried for the important subvolumes and it resolved for none.

Just in case, I have a backlog of multiple daily backups.

$ uname -a
Linux jupiter 4.12.14-ck #1 SMP PREEMPT Fri Sep 22 02:47:44 CEST 2017
x86_64 Intel(R) Core(TM) i7-3770K CPU @ 3.50GHz GenuineIntel GNU/Linux


[88597.792462] BTRFS info (device bcache48): no csum found for inode 1229624 
start 1650688
[88597.793304] BTRFS warning (device bcache48): csum failed root 280 ino 
1229624 off 1650688 csum 0x953c1e92 expected csum 0x mirror 1
[128569.451376] [ cut here ]
[128569.451382] WARNING: CPU: 7 PID: 146 at kernel/workqueue.c:2041 
process_one_work+0x44/0x310
[128569.451383] Modules linked in: cifs ccm fuse bridge stp llc veth rfcomm 
bnep cachefiles btusb af_packet btintel iTCO_wdt bluetooth tun 
iTCO_vendor_support rfkill ecdh_generic kvm_intel snd_hda_codec_hdmi 
snd_hda_codec_realtek kvm snd_hda_codec_generic snd_hda_intel snd_hda_codec 
snd_hda_core rtc_cmos snd_pcm snd_timer snd soundcore lpc_ich irqbypass uas 
usb_storage r8168(O) nvidia_drm(PO) vboxpci(O) vboxnetadp(O) vboxnetflt(O) 
vboxdrv(O) nvidia_modeset(PO) nvidia(PO) nct6775 hwmon_vid coretemp hwmon 
efivarfs
[128569.451406] CPU: 7 PID: 146 Comm: bcache Tainted: P   O
4.12.14-ck #1
[128569.451407] Hardware name: To Be Filled By O.E.M. To Be Filled By 
O.E.M./Z68 Pro3, BIOS L2.16A 02/22/2013
[128569.451410] task: 880419bf8bc0 task.stack: c95c4000
[128569.451412] RIP: 0010:process_one_work+0x44/0x310
[128569.451412] RSP: :c95c7e78 EFLAGS: 00013002
[128569.451413] RAX: 0007 RBX: 880419bdcf00 RCX: 
88042f2d8020
[128569.451414] RDX: 88042f2d8018 RSI: 880120454f08 RDI: 
880419bdcf00
[128569.451415] RBP: 88042f2d8000 R08:  R09: 

[128569.451415] R10: 8a209a609e79 R11:  R12: 

[128569.451416] R13: 88042f2e1b00 R14: 88042f2e1b80 R15: 
880419bdcf30
[128569.451417] FS:  () GS:88042f3c() 
knlGS:
[128569.451418] CS:  0010 DS:  ES:  CR0: 80050033
[128569.451418] CR2: 00b0 CR3: 0003f23ef000 CR4: 
001406e0
[128569.451419] Call Trace:
[128569.451422]  ? rescuer_thread+0x20b/0x370
[128569.451424]  ? kthread+0xf2/0x130
[128569.451425]  ? process_one_work+0x310/0x310
[128569.451426]  ? kthread_create_on_node+0x40/0x40
[128569.451428]  ? ret_from_fork+0x22/0x30
[128569.451429] Code: 04 b8 00 00 00 00 4c 0f 44 e8 49 8b 45 08 44 8b a0 00 01 
00 00 41 83 e4 20 f6 45 10 04 75 0e 65 8b 05 79 38 f7 7e 3b 45 04 74 02 <0f> ff 
48 ba eb 83 b5 80 46 86 c8 61 48 0f af d6 48 c1 ea 3a 48 
[128569.451449] ---[ end trace d00a1585e5166d18 ]---
[148997.934146] BUG: unable to handle kernel paging request at c96af000
[148997.934154] IP: memcpy_erms+0x6/0x10
[148997.934155] PGD 41d021067
[148997.934156] P4D 41d021067
[148997.934157] PUD 41d022067
[148997.934158] PMD 41961c067
[148997.934158] PTE 0
[148997.934162] Oops: 0002 [#1] PREEMPT SMP
[148997.934163] Modules linked in: cifs ccm fuse bridge stp llc veth rfcomm 
bnep cachefiles btusb af_packet btintel iTCO_wdt bluetooth tun 
iTCO_vendor_support rfkill ecdh_generic kvm_intel snd_hda_codec_hdmi 
snd_hda_codec_realtek kvm snd_hda_codec_generic snd_hda_intel snd_hda_codec 
snd_hda_core rtc_cmos snd_pcm snd_timer snd soundcore lpc_ich irqbypass uas 
usb_storage r8168(O) nvidia_drm(PO) vboxpci(O) vboxnetadp(O) vboxnetflt(O) 
vboxdrv(O) nvidia_modeset(PO) nvidia(PO) nct6775 hwmon_vid coretemp hwmon 
efivarfs
[148997.934188] CPU: 6 PID: 966 Comm: kworker/u16:16 Tainted: PW  O
4.12.14-ck #1
[148997.934190] Hardware name: To Be Filled By O.E.M. To Be Filled By 
O.E.M./Z68 Pro3, BIOS L2.16A 02/22/2013
[148997.934193] Workqueue: btrfs-endio btrfs_endio_helper
[148997.934194] task: 88001496a340 task.stack: c90011b68000
[148997.934196] RIP: 

Re: Give up on bcache?

2017-09-26 Thread Kai Krakow
Am Tue, 26 Sep 2017 23:33:19 +0500
schrieb Roman Mamedov :

> On Tue, 26 Sep 2017 16:50:00 + (UTC)
> Ferry Toth  wrote:
> 
> > https://www.phoronix.com/scan.php?page=article=linux414-bcache-
> > raid=2
> > 
> > I think it might be idle hopes to think bcache can be used as a ssd
> > cache for btrfs to significantly improve performance..  
> 
> My personal real-world experience shows that SSD caching -- with
> lvmcache -- does indeed significantly improve performance of a large
> Btrfs filesystem with slowish base storage.
> 
> And that article, sadly, only demonstrates once again the general
> mediocre quality of Phoronix content: it is an astonishing oversight
> to not check out lvmcache in the same setup, to at least try to draw
> some useful conclusion, is it Bcache that is strangely deficient, or
> SSD caching as a general concept does not work well in the hardware
> setup utilized.

Bcache is actually not meant to increase benchmark performance except
for very few corner cases. It is designed to improve interactivity and
perceived performance, reducing head movements. On the bcache homepage
there's actually tips on how to benchmark bcache correctly, including
warm-up phase and turning on sequential caching. Phoronix doesn't do
that, they test default settings, which is imho a good thing but you
should know the consequences and research how to turn the knobs.

Depending on the caching mode and cache size, the SQLite test may not
show real-world numbers. Also, you should optimize some btrfs options
to work correctly with bcache, e.g. force it to mount "nossd" as it
detects the bcache device as SSD - which is wrong for some workloads, I
think especially desktop workloads and most server workloads.
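
For example, an fstab line like this (device name and mount point are
placeholders) forces that behavior:

/dev/bcache0  /mnt/btrfs-pool  btrfs  noatime,nossd,compress=lzo  0 0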

Also, you may want to tune udev to correct some attributes so other
applications can do their detection and behavior correctly, too:

$ cat /etc/udev/rules.d/00-ssd-scheduler.rules
# report bcache devices as rotational to other applications
ACTION=="add|change", KERNEL=="bcache*", ATTR{queue/rotational}="1"
# non-rotational (SSD) devices: disable queue idling, use kyber
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/iosched/slice_idle}="0"
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="0", ATTR{queue/scheduler}="kyber"
# rotational (HDD) devices: use bfq
ACTION=="add|change", KERNEL=="sd[a-z]", ATTR{queue/rotational}=="1", ATTR{queue/scheduler}="bfq"

Take note: on a non-mq system you may want to use noop/deadline/cfq
instead of kyber/bfq.
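
After dropping such a rules file in place, the rules can be reloaded and
re-applied without a reboot, roughly like this:

# udevadm control --reload-rules
# udevadm trigger --subsystem-match=block
$ cat /sys/block/bcache0/queue/rotational    # should now report 1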


I've been running bcache for over two years now and the performance
improvement is very high, with boot times going down to 30-40s from 3+
minutes previously, faster app startup times (almost instant, like on
SSD), reduced noise due to fewer head movements, etc. Also, it has an
easy setup (no split metadata/data cache, and you can attach more than
one device to a single cache), and it is rock-solid even when the
system crashes.

Bcache learns by using LRU for caching: What you don't need will be
pushed out of cache over time, what you use, stays. This is actually a
lot like "hot data caching". Given a big enough cache, everything of
your daily needs would stay in cache, easily achieving hit ratios
around 90%. Since sequential access is bypassed, you don't have to
worry about large copy operations flushing the cache.
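
The knobs for this live in sysfs; for example (bcache0 is a placeholder
for your device), to check the cache mode and raise the sequential
bypass cutoff:

$ cat /sys/block/bcache0/bcache/cache_mode
# echo 4M > /sys/block/bcache0/bcache/sequential_cutoff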

My system uses a 512G SSD with 400G dedicated to bcache, attached to 3x
1TB HDD draid0 mraid1 btrfs, filled with 2TB of net data and daily
backups using borgbackup. Bcache runs in writeback mode; the backup
takes around 15 minutes each night to dig through all the data and
stores it to an internal intermediate backup, also on bcache (xfs,
write-around mode). Not yet implemented: this intermediate backup will
later be mirrored to an external, off-site location.

Some of the rest of the SSD is EFI-ESP, some swap space, and
over-provisioned area to keep bcache performance high.

$ uptime && bcache-status
 21:28:44 up 3 days, 20:38,  3 users,  load average: 1,18, 1,44, 2,14
--- bcache ---
UUID                    aacfbcd9-dae5-4377-92d1-6808831a4885
Block Size              4.00 KiB
Bucket Size             512.00 KiB
Congested?              False
Read Congestion         2.0ms
Write Congestion        20.0ms
Total Cache Size        400 GiB
Total Cache Used        400 GiB (100%)
Total Cache Unused      0 B (0%)
Evictable Cache         396 GiB (99%)
Replacement Policy      [lru] fifo random
Cache Mode              (Various)
Total Hits              2364518 (89%)
Total Misses            290764
Total Bypass Hits       4284468 (100%)
Total Bypass Misses     0
Total Bypassed          215 GiB


The bucket size and block size was chosen to best fit with Samsung TLC
arrangement. But this is pure theory, I never benchmarked the benefits.
I just feel more comfortable that way. ;-)


One should also keep in mind: the way btrfs works, it cannot optimally
use bcache, as cow will obviously invalidate data in bcache - but
bcache doesn't have knowledge of this. Of course, such

Re: Btrfs performance with small blocksize on SSD

2017-09-25 Thread Kai Krakow
Am Mon, 25 Sep 2017 07:04:14 +
schrieb "Fuhrmann, Carsten" :

> Well the correct translation for "Laufzeit" is runtime and not
> latency. But thank you for that hint, I'll change it to "gesamt
> Laufzeit" to make it more clear.

How about better translating it to English in the first place as you're
trying to reach an international community?

Also, it would be nice to post the exact test you ran as a command line
or configuration file, so it can be replayed on other systems and -
most importantly - by the developers, to easily uncover what is causing
the behavior...
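
For example, a single fio command line (all numbers picked arbitrarily
here) would already make the test reproducible for others:

$ fio --name=smallblock --filename=/mnt/test/fio.dat --size=1G \
      --rw=randwrite --bs=4k --direct=1 --ioengine=libaio --iodepth=16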


> Best regards
> 
> Carsten
> 
> -Ursprüngliche Nachricht-
> Von: Andrei Borzenkov [mailto:arvidj...@gmail.com] 
> Gesendet: Sonntag, 24. September 2017 18:43
> An: Fuhrmann, Carsten ; Qu Wenruo
> ; linux-btrfs@vger.kernel.org Betreff: Re:
> AW: Btrfs performance with small blocksize on SSD
> 
> 24.09.2017 16:53, Fuhrmann, Carsten пишет:
> > Hello,
> > 
> > 1)
> > I used direct write (no page cache) but I didn't disable the Disk
> > cache of the HDD/SSD itself. In all tests I wrote 1GB and looked
> > for the runtime of that write process.  
> 
> So "latency" on your diagram means total time to write 1GiB file?
> That is highly unusual meaning for "latency" which normally means
> time to perform single IO. If so, you should better rename Y-axis to
> something like "total run time".


-- 
Regards,
Kai

Replies to list-only preferred.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs performance with small blocksize on SSD

2017-09-24 Thread Kai Krakow
Am Sun, 24 Sep 2017 19:43:05 +0300
schrieb Andrei Borzenkov :

> 24.09.2017 16:53, Fuhrmann, Carsten пишет:
> > Hello,
> > 
> > 1)
> > I used direct write (no page cache) but I didn't disable the Disk
> > cache of the HDD/SSD itself. In all tests I wrote 1GB and looked
> > for the runtime of that write process.  
> 
> So "latency" on your diagram means total time to write 1GiB file? That
> is highly unusual meaning for "latency" which normally means time to
> perform single IO. If so, you should better rename Y-axis to something
> like "total run time".

If you look closely it says "Laufzeit" which visually looks similar to
"latency" but really means "run time". ;-)


-- 
Regards,
Kai

Replies to list-only preferred.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: defragmenting best practice?

2017-09-21 Thread Kai Krakow
Am Thu, 21 Sep 2017 22:10:13 +0200
schrieb Kai Krakow <hurikha...@gmail.com>:

> Am Wed, 20 Sep 2017 07:46:52 -0400
> schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
> 
> > >  Fragmentation: Files with a lot of random writes can become
> > > heavily fragmented (1+ extents) causing excessive multi-second
> > > spikes of CPU load on systems with an SSD or large amount a RAM.
> > > On desktops this primarily affects application databases
> > > (including Firefox). Workarounds include manually defragmenting
> > > your home directory using btrfs fi defragment. Auto-defragment
> > > (mount option autodefrag) should solve this problem.
> > > 
> > > Upon reading that I am wondering if fragmentation in the Firefox
> > > profile is part of my issue. That's one thing I never tested
> > > previously. (BTW, this system has 256 GB of RAM and 20 cores.)
> > Almost certainly.  Most modern web browsers are brain-dead and
> > insist on using SQLite databases (or traditional DB files) for
> > everything, including the cache, and the usage for the cache in
> > particular kills performance when fragmentation is an issue.  
> 
> At least in Chrome, you can turn on simple cache backend, which, I
> think, is using many small instead of one huge file. This suit btrfs
> much better:
> 
> chrome://flags/#enable-simple-cache-backend
> 
> 
> And then I suggest also doing this (as your login user):
> 
> $ cd $HOME
> $ mv .cache .cache.old
> $ mkdir .cache
> $ lsattr +C .cache

Oops, of course that's chattr, not lsattr

> $ rsync -av .cache.old/ .cache/
> $ rm -Rf .cache.old
> 
> This makes caches for most applications nocow. Chrome performance was
> completely fixed for me by doing this.
> 
> I'm not sure where Firefox puts its cache, I only use it on very rare
> occasions. But I think it's going to .cache/mozilla last time looked
> at it.
> 
> You may want to close all apps before converting the cache directory.
> 
> Also, I don't see any downsides in making this nocow. That directory
> could easily be also completely volatile. If something breaks due to
> no longer protected by data csum, just clean it out.


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: defragmenting best practice?

2017-09-21 Thread Kai Krakow
Am Wed, 20 Sep 2017 07:46:52 -0400
schrieb "Austin S. Hemmelgarn" :

> >  Fragmentation: Files with a lot of random writes can become
> > heavily fragmented (1+ extents) causing excessive multi-second
> > spikes of CPU load on systems with an SSD or large amount a RAM. On
> > desktops this primarily affects application databases (including
> > Firefox). Workarounds include manually defragmenting your home
> > directory using btrfs fi defragment. Auto-defragment (mount option
> > autodefrag) should solve this problem.
> > 
> > Upon reading that I am wondering if fragmentation in the Firefox
> > profile is part of my issue. That's one thing I never tested
> > previously. (BTW, this system has 256 GB of RAM and 20 cores.)  
> Almost certainly.  Most modern web browsers are brain-dead and insist
> on using SQLite databases (or traditional DB files) for everything, 
> including the cache, and the usage for the cache in particular kills 
> performance when fragmentation is an issue.

At least in Chrome, you can turn on the simple cache backend, which, I
think, uses many small files instead of one huge file. This suits btrfs
much better:

chrome://flags/#enable-simple-cache-backend


And then I suggest also doing this (as your login user):

$ cd $HOME
$ mv .cache .cache.old
$ mkdir .cache
$ lsattr +C .cache
$ rsync -av .cache.old/ .cache/
$ rm -Rf .cache.old

This makes caches for most applications nocow. Chrome performance was
completely fixed for me by doing this.

I'm not sure where Firefox puts its cache; I only use it on very rare
occasions. But I think it was going to .cache/mozilla last time I
looked at it.

You may want to close all apps before converting the cache directory.

Also, I don't see any downsides to making this nocow. That directory
could just as well be completely volatile. If something breaks because
it is no longer protected by data csums, just clean it out.


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SSD caching an existing btrfs raid1

2017-09-20 Thread Kai Krakow
Am Wed, 20 Sep 2017 17:51:15 +0200
schrieb Psalle :

> On 19/09/17 17:47, Austin S. Hemmelgarn wrote:
> (...)
> >
> > A better option if you can afford to remove a single device from
> > that array temporarily is to use bcache.  Bcache has one specific
> > advantage in this case, multiple backend devices can share the same
> > cache device. This means you don't have to carve out dedicated
> > cache space for each disk on the SSD and leave some unused space so
> > that you can add new devices if needed.  The downside is that you
> > can't convert each device in-place, but because you're using BTRFS,
> > you can still convert the volume as a whole in-place.  The
> > procedure for doing so looks like this:
> >
> > 1. Format the SSD as a bcache cache.
> > 2. Use `btrfs device delete` to remove a single hard drive from the 
> > array.
> > 3. Set up the drive you just removed as a bcache backing device
> > bound to the cache you created in step 1.
> > 4. Add the new bcache device to the array.
> > 5. Repeat from step 2 until the whole array is converted.
> >
> > A similar procedure can actually be used to do almost any
> > underlying storage conversion (for example, switching to whole disk
> > encryption, or adding LVM underneath BTRFS) provided all your data
> > can fit on one less disk than you have.  
> 
> Thanks Austin, that's just great. For some reason I had discarded
> bcache thinking that it would force me to rebuild from scratch, but
> this kind of incremental migration is exactly why I hoped was
> possible. I have plenty of space to replace the devices one by one.
> 
> I will report back my experience in a few days, I hope.

I've done it exactly that way in the past and it worked flawlessly (but
it took 24+ hours). But it was easy for me because I was also adding a
third disk to the pool, so existing stuff could easily move.

I suggest initializing bcache in writearound mode while converting, so
your potentially terabytes of disk data don't go through the SSD.

If you later decide to remove bcache, or you're not sure about future
bcache usage, you can still wrap any partition into a bcache container -
just don't connect it to a cache and it will work like a normal
partition.
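
Roughly, the per-device steps of that procedure look like this (device
names and mount point are examples only):

# make-bcache -C /dev/sdd1                   # format the SSD as cache set
# btrfs device delete /dev/sdb /mnt/array    # free one HDD from the array
# make-bcache -B /dev/sdb                    # wrap it as backing device
# bcache-super-show /dev/sdd1 | grep cset.uuid
# echo <cset-uuid> > /sys/block/bcache0/bcache/attach
# echo writearound > /sys/block/bcache0/bcache/cache_mode
# btrfs device add /dev/bcache0 /mnt/array   # add it back to the pool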


-- 
Regards,
Kai

Replies to list-only preferred.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kernel BUG at fs/btrfs/extent_io.c:1989

2017-09-18 Thread Kai Krakow
Am Mon, 18 Sep 2017 20:30:41 +0200
schrieb Holger Hoffstätte :

> On 09/18/17 19:09, Liu Bo wrote:
> > This 'mirror 0' looks fishy, (as mirror comes from
> > btrfs_io_bio->mirror_num, which should be at least 1 if raid1 setup
> > is in use.)
> > 
> > Not sure if 4.13.2-gentoo made any changes on btrfs, but can you  
> 
> No, it did not; Gentoo always strives to be as close to mainline as
> possible except for urgent security & low-risk convenience fixes.

According to
https://dev.gentoo.org/~mpagano/genpatches/patches-4.13-2.htm
it's not only security patches.

But as the list shows, there are indeed no btrfs patches. There's one
that may change btrfs behavior (tho unlikely): it allows enabling
native gcc optimizations if you choose so. I don't think that's a
default option in Gentoo.

I'm using native optimizations myself and see no strange mirror issues
in btrfs. OTOH, I've lately switched to gentoo ck patchset to get
better optimizations for gaming and realtime apps. But it's still at
the 4.12 series.

Are you sure the system crashed and wasn't just stuck at reading from
the disks? If the disks have error correction and recovery enabled, the
Linux block layer times out on the requests that the drives eventually
won't fix anyways and resets the link after 30s. The drive timeout is
120s by default.

You can change that on enterprise-grade and NAS-ready drives; a
handful of desktop drives support it, too. Smartctl is used to set the
values - just google "smartctl scterc". You could also adjust the
timeout of the SCSI layer to above the drive timeout, which means more
than 120s if you cannot change scterc. I think it makes most sense not
to reset the link before the drive had its chance to answer the
request.

I think there are pros and cons to changing these values. I always
recommend keeping the SCSI timeout above the scterc timeout.
Personally, I lower the scterc timeout to 70 centiseconds and leave the
SCSI timeout at its default. RAID setups should use this to keep
control of their own error correction methods: the drive returns from
the request early and the RAID (i.e. btrfs or mdraid) can do its job of
reading from another copy, then repair it by writing back a correct
copy, which the drive converts into a sector relocation aka
self-repair.
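
For reference, setting scterc to 7 seconds, and alternatively raising
the kernel-side timeout (sdX is a placeholder), looks like this:

# smartctl -l scterc,70,70 /dev/sdX          # values are in 0.1s units
# smartctl -l scterc /dev/sdX                # verify the drive accepted it
# echo 180 > /sys/block/sdX/device/timeout   # or raise the SCSI timeout instead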

Other people may jump in and recommend their own perspective of why or
why not change which knob to which value.

But well, as long as you saw no scsi errors reported when the "crash"
occurred, these values are not involved in your problem anyways.

What about "btrfs device stats"?


-- 
Regards,
Kai

Replies to list-only preferred.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: A user cannot remove his readonly snapshots?!

2017-09-16 Thread Kai Krakow
Am Sat, 16 Sep 2017 09:36:33 +0200
schrieb Ulli Horlacher <frams...@rus.uni-stuttgart.de>:

> On Sat 2017-09-16 (01:22), Kai Krakow wrote:
> 
> > > tux@xerus:/test/tux/zz/.snapshot: btrfs subvolume delete
> > > 2017-09-15_1859.test Delete subvolume (no-commit):
> > > '/test/tux/zz/.snapshot/2017-09-15_1859.test' ERROR: cannot delete
> > > '/test/tux/zz/.snapshot/2017-09-15_1859.test': Read-only file
> > > system  
> > 
> > See "man mount" in section btrfs mount options: There is a mount
> > option to allow normal user to delete snapshots.  
> 
> As I wrote first: "I have mounted with option user_subvol_rm_allowed"
> 
> tux@xerus:/test/tux/zz/.snapshot: mount | grep /test
> /dev/sdd4 on /test type btrfs
> (rw,relatime,space_cache,user_subvol_rm_allowed,subvolid=5,subvol=/)
> 
> This does not help. A user cannot remove a readonly snapshot he just
> has created.

Yes, sorry, I only later discovered the other posts.


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: A user cannot remove his readonly snapshots?!

2017-09-15 Thread Kai Krakow
Am Sat, 16 Sep 2017 00:02:01 +0200
schrieb Ulli Horlacher :

> On Fri 2017-09-15 (23:44), Ulli Horlacher wrote:
> > On Fri 2017-09-15 (22:07), Peter Grandi wrote:
> >   
>  [...]  
> > > 
> > > Ordinary permissions still apply both to 'create' and 'delete':  
> > 
> > My user tux is the owner of the snapshot directory, because he has
> > created it!  
> 
> I can delete normal subvolumes but not the readonly snapshots:
> 
> tux@xerus:/test/tux/zz/.snapshot: btrfs subvolume create test
> Create subvolume './test'
> 
> tux@xerus:/test/tux/zz/.snapshot: ll
> drwxr-xr-x  tux  users- 2017-09-15 18:22:26
> 2017-09-15_1822.test drwxr-xr-x  tux  users- 2017-09-15
> 18:22:26 2017-09-15_1824.test drwxr-xr-x  tux  users-
> 2017-09-15 18:57:39 2017-09-15_1859.test drwxr-xr-x  tux
> users- 2017-09-15 23:58:51 test
> 
> tux@xerus:/test/tux/zz/.snapshot: btrfs subvolume delete test
> Delete subvolume (no-commit): '/test/tux/zz/.snapshot/test'
> 
> tux@xerus:/test/tux/zz/.snapshot: btrfs subvolume delete
> 2017-09-15_1859.test Delete subvolume (no-commit):
> '/test/tux/zz/.snapshot/2017-09-15_1859.test' ERROR: cannot delete
> '/test/tux/zz/.snapshot/2017-09-15_1859.test': Read-only file system

See "man mount" in the section on btrfs mount options: there is a mount
option to allow normal users to delete snapshots. But this is said to
have security implications I cannot currently assess. Maybe someone
else knows.
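
The option in question is user_subvol_rm_allowed, e.g. (using the mount
point from your example):

# mount -o remount,user_subvol_rm_allowed /test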


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: defragmenting best practice?

2017-09-15 Thread Kai Krakow
Am Fri, 15 Sep 2017 16:11:50 +0200
schrieb Michał Sokołowski :

> On 09/15/2017 03:07 PM, Tomasz Kłoczko wrote:
> > [...]
> > Case #1
> > 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu cow2
> > storage -> guest BTRFS filesystem  
> > SQL table row insertions per second: 1-2
> >
> > Case #2
> > 2x 7200 rpm HDD -> md raid 1 -> host BTRFS rootfs -> qemu raw
> > storage -> guest EXT4 filesystem
> > SQL table row insertions per second: 10-15
> > Q -1) why you are comparing btrfs against ext4 on top of the btrfs
> > which is doing own COW operations on bottom of such sandwiches ..
> > if we SUPPOSE to be talking about impact of the fragmentation on
> > top of btrfs?  
> 
> Tomasz,
> you seem to be convinced that fragmentation does not matter. I found
> this (extremely bad, true) example says otherwise.

Sorry to jump in, but did you at least set the qemu image to nocow?
Otherwise this example is totally flawed because you're mostly testing
the qemu storage layer and not btrfs.

A better test would've been to test qemu raw on btrfs cow vs on btrfs
nocow, with both the same file system inside the qemu image.

But you are modifying multiple parameters at once during the test, and
I expect each one has a huge impact on performance - but only one is
specific to btrfs, and that one you apparently did not test in
isolation.

Personally, I find running qemu qcow2 on btrfs cow helps nothing except
producing really bad performance. Make one of the two layers nocow and
it should become better.

If you want to give some better numbers, please reduce this test to
just one cow layer, the one at the top layer: btrfs host fs. Copy the
image somewhere else to restore from, and ensure (using filefrag) that
the starting situation matches each test run.

Don't change any parameters of the qemu layer between tests. And run a
file system inside that doesn't do any fancy stuff, like ext2 or ext3
without a journal. Use qemu raw storage.

Then test again with cow vs nocow on the host side.

Create a nocow copy of your image (if you pre-size it with truncate,
use the size of the source image):

# rm -f qemu-image-nocow.raw
# touch qemu-image-nocow.raw
# chattr +C -c qemu-image-nocow.raw
# dd if=source-image.raw of=qemu-image-nocow.raw bs=1M
# btrfs fi defrag -f qemu-image-nocow.raw
# filefrag -v qemu-image-nocow.raw

Create a cow copy of your image:

# rm -f qemu-image-cow.raw
# touch qemu-image-cow.raw
# chattr -C -c qemu-image-cow.raw
# dd if=source-image.raw of=qemu-image-cow.raw bs=1M
# btrfs fi defrag -f qemu-image-cow.raw
# filefrag -v qemu-image-cow.raw

Given that host btrfs is mounted datacow,compress=none and without
autodefrag, and you don't touch the source image contents during tests.

Now run your test script inside both qemu machines, take your
measurements and check fragmentation again after the run.

filefrag should report no more fragments than before the test for the
first test, but should report a magnitude more for the second test.

Now copy (cp) both one at a time to a new file and measure the time. It
should be slower for the highly fragmented version.

Don't forget to run tests with and without flushed caches so we get
cold and warm numbers.
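
Dropping the page cache between runs for the cold numbers can be done
like this:

# sync
# echo 3 > /proc/sys/vm/drop_caches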

In this scenario, qemu would only be the application to modify the raw
image files and you're actually testing the impact of fragmentation of
btrfs.

You could also make a reflink copy of the nocow test image and do a
third test to see that it introduces fragmentation now, tho probably
much lower than for the cow test image. You can verify the numbers with
filefrag.
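
Such a reflink copy (file names follow the examples above) is simply:

# cp --reflink=always qemu-image-nocow.raw qemu-image-reflink.raw
# filefrag -v qemu-image-reflink.raw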

According to Tomasz, your tests should not run at vastly different
speeds because fragmentation has no impact on performance, quod est
demonstrandum... I think we will not get to the "erat" part.


-- 
Regards,
Kai

Replies to list-only preferred.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: defragmenting best practice?

2017-09-14 Thread Kai Krakow
Am Thu, 14 Sep 2017 18:48:54 +0100
schrieb Tomasz Kłoczko <kloczko.tom...@gmail.com>:

> On 14 September 2017 at 16:24, Kai Krakow <hurikha...@gmail.com>
> wrote: [..]
> > Getting e.g. boot files into read order or at least nearby improves
> > boot time a lot. Similar for loading applications.  
> 
> By how much it is possible to improve boot time?
> Just please some example which I can try to replay which ill be
> showing that we have similar results.
> I still have one one of my laptops with spindle on btrfs root fs ( and
> no other FSess in use) so I could be able to confirm that my numbers
> are enough close to your numbers.

I need to create a test setup because this system uses bcache. The
difference (according to systemd-analyze) between warm bcache and no
bcache at all ranges from 16-30s boot time vs. 3+ minutes boot time.

I could turn off bcache, do a boot trace, try to rearrange boot files,
and boot again. However, that is not very reproducible as the current
file layout is not defined. It'd be better to set up a separate machine
where I could start over from a "well defined" state before applying
optimization steps, to see the differences between different
strategies. At least readahead is not very helpful; I tested that in
the past. It reduces boot time just by a few seconds, maybe 20-30, thus
going from 3+ minutes to 2+ minutes.
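
The numbers above come from systemd-analyze; for anyone who wants to
compare, the interesting views are:

$ systemd-analyze time
$ systemd-analyze blame
$ systemd-analyze critical-chain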

I still have an old laptop lying around: Single spindle, should make a
good test scenario. I'll have to see if I can get it back into shape.
It will take me some time.


> > Shake tries to
> > improve this by rewriting the files - and this works because file
> > systems (given enough free space) already do a very good job at
> > doing this. But constant system updates degrade this order over
> > time.  
> 
> OK. Please prepare some database, import some data which size will be
> few times of not used RAM (best if this multiplication factor will be
> at least 10). Then do some batch of selects measuring distribution
> latencies of those queries.

Well, this is pretty easy. Systemd-journald is a real beast when it
comes to cow fragmentation. Results can be easily generated and
reproduced. There are long traces of discussions in the systemd mailing
list and I simply decided to make the files nocow right from the start
and that fixed it for me. I can simply revert it and create benchmarks.


> This will give you some data about. not fragmented data.

Well, I would probably do it the other way around: generate a
fragmented journal file (as that is how journald creates the file over
time), then rewrite it in some manner to reduce extents, then run
journal operations again on this file. Does it bother you to turn this
around?


> Then on next stage try to apply some number of update queries and
> after reboot the system or drop all caches. and repeat the same set of
> selects.
> After this all what you need to do is compare distribution of the
> latencies.

Which tool to use to measure which latencies?

Speaking of latencies: what's of interest here is perceived
performance, resulting mostly from seek overhead (except probably in
the journal file case, which simply overwhelms by the sheer amount of
extents). I'm not sure if measuring VFS latencies would provide any
useful insights here. VFS probably still works fast enough in this
case.


> > It really doesn't matter if some big file is laid out in 1
> > allocation of 1 GB or in 250 allocations of 4MB: It really doesn't
> > make a big difference.
> >
> > Recombining extents into bigger once, tho, can make a big
> > difference in an aging btrfs, even on SSDs.  
> 
> That it may be an issue with using extents.

I can't follow why you argue that a file with thousands of extents vs.
a file of the same size but only a few extents would make no difference
to operate on. And of course this has to do with extents. But btrfs
uses extents. Do you suggest using ZFS instead?

Due to how cow works, the effect would probably be less or barely
noticeable for writes, but read-scanning through the file becomes slow
with clearly more "noise" from the moving heads.


> Again: please show some results of some test unit which anyone will be
> able to reply and confirm or not that this effect really exist.
> 
> If problem really exist and is related ot extents you should have real
> scenario explanation why ZFS is not using extents.

That was never the discussion. You brought in the ZFS point. I read
about the design reasoning behind ZFS when it appeared and started to
gain public interest years back.


> btrfs is not to far from classic approach do FS because it srill uses
> allocation structures.
> This is not the case in context of ZFS because this technology has no
> information about what is already allocates.

What about btrfs free space tree? Isn'

Re: defragmenting best practice?

2017-09-14 Thread Kai Krakow
Am Thu, 14 Sep 2017 17:24:34 +0200
schrieb Kai Krakow <hurikha...@gmail.com>:

Errors corrected, see below...


> Am Thu, 14 Sep 2017 14:31:48 +0100
> schrieb Tomasz Kłoczko <kloczko.tom...@gmail.com>:
> 
> > On 14 September 2017 at 12:38, Kai Krakow <hurikha...@gmail.com>
> > wrote: [..]  
> > >
> > > I suggest you only ever defragment parts of your main subvolume or
> > > rely on autodefrag, and let bees do optimizing the snapshots.  
> 
> Please read that again including the parts you omitted.
> 
> 
> > > Also, I experimented with adding btrfs support to shake, still
> > > working on better integration but currently lacking time... :-(
> > >
> > > Shake is an adaptive defragger which rewrites files. With my
> > > current patches it clones each file, and then rewrites it to its
> > > original location. This approach is currently not optimal as it
> > > simply bails out if some other process is accessing the file and
> > > leaves you with an (intact) temporary copy you need to move back
> > > in place manually.
> > 
> > If you really want to have real and *ideal* distribution of the data
> > across physical disk first you need to build time travel device.
> > This device will allow you to put all blocks which needs to be read
> > in perfect order (to read all data only sequentially without seek).
> > However it will be working only in case of spindles because in case
> > of SSDs there is no seek time.
> > Please let us know when you will write drivers/timetravel/ Linux
> > kernel driver. When such driver will be available I promise I'll
> > write all necessary btrfs code by myself in matter of few days (it
> > will be piece of cake compare to build such device).
> > 
> > But seriously ..  
> 
> Seriously: Defragmentation on spindles is IMHO not about getting the
> perfect continuous allocation but providing better spatial layout of
> the files you work with.
> 
> Getting e.g. boot files into read order or at least nearby improves
> boot time a lot. Similar for loading applications. Shake tries to
> improve this by rewriting the files - and this works because file
> systems (given enough free space) already do a very good job at doing
> this. But constant system updates degrade this order over time.
> 
> It really doesn't matter if some big file is laid out in 1 allocation
> of 1 GB or in 250 allocations of 4MB: It really doesn't make a big
> difference.
> 
> Recombining extents into bigger once, tho, can make a big difference
> in an aging btrfs, even on SSDs.
> 
> Bees is, btw, not about defragmentation: I have some OS containers
> running and I want to deduplicate data after updates. It seems to do a
> good job here, better than other deduplicators I found. And if some
> defrag tools destroyed your snapshot reflinks, bees can also help
> here. On its way it may recombine extents so it may improve
> fragmentation. But usually it probably defragments because it needs
 ^^^
It fragments!

> to split extents that a defragger combined.
> 
> But well, I think getting 100% continuous allocation is really not the
> achievement you want to get, especially when reflinks are a primary
> concern.
> 
> 
> > Only context/scenario when you may want to lower defragmentation is
> > when you are something needs to allocate continuous area lower than
> > free space and larger than largest free chunk. Something like this
> > happens only when volume is working on almost 100% allocated space.
> > In such scenario even you bees cannot do to much as it may be not
> > enough free space to move some other data in larger chunks to
> > defragment FS physical space.  
> 
> Bees does not do that.
> 
> 
> > If your workload will be still writing
> > new data to FS such defragmentation may give you (maybe) few more
> > seconds and just after this FS will be 100% full,
> > 
> > In other words if someone is thinking that such defragmentation
> > daemon is solving any problems he/she may be 100% right .. such
> > person is only *thinking* that this is truth.  
> 
> Bees is not about that.
> 
> 
> > kloczek
> > PS. Do you know first McGyver rule? -> "If it ain't broke, don't fix
> > it".  
> 
> Do you know the saying "think first, then act"?
> 
> 
> > So first show that fragmentation is hurting latency of the
> > access to btrfs data and it will be possible to measurable such
> > impact. Before you will start measuring this you need to learn how o
> > sample for example VFS la

Re: defragmenting best practice?

2017-09-14 Thread Kai Krakow
Am Thu, 14 Sep 2017 14:31:48 +0100
schrieb Tomasz Kłoczko <kloczko.tom...@gmail.com>:

> On 14 September 2017 at 12:38, Kai Krakow <hurikha...@gmail.com>
> wrote: [..]
> >
> > I suggest you only ever defragment parts of your main subvolume or
> > rely on autodefrag, and let bees do optimizing the snapshots.

Please read that again including the parts you omitted.


> > Also, I experimented with adding btrfs support to shake, still
> > working on better integration but currently lacking time... :-(
> >
> > Shake is an adaptive defragger which rewrites files. With my current
> > patches it clones each file, and then rewrites it to its original
> > location. This approach is currently not optimal as it simply bails
> > out if some other process is accessing the file and leaves you with
> > an (intact) temporary copy you need to move back in place
> > manually.  
> 
> If you really want to have real and *ideal* distribution of the data
> across physical disk first you need to build time travel device. This
> device will allow you to put all blocks which needs to be read in
> perfect order (to read all data only sequentially without seek).
> However it will be working only in case of spindles because in case of
> SSDs there is no seek time.
> Please let us know when you will write drivers/timetravel/ Linux
> kernel driver. When such driver will be available I promise I'll
> write all necessary btrfs code by myself in matter of few days (it
> will be piece of cake compare to build such device).
> 
> But seriously ..

Seriously: Defragmentation on spindles is IMHO not about getting the
perfect continuous allocation but providing better spatial layout of
the files you work with.

Getting e.g. boot files into read order or at least nearby improves
boot time a lot. Similar for loading applications. Shake tries to
improve this by rewriting the files - and this works because file
systems (given enough free space) already do a very good job at doing
this. But constant system updates degrade this order over time.

It really doesn't matter if some big file is laid out in 1 allocation
of 1 GB or in 250 allocations of 4MB: It really doesn't make a big
difference.

Recombining extents into bigger ones, tho, can make a big difference in
an aging btrfs, even on SSDs.

Bees is, btw, not about defragmentation: I have some OS containers
running and I want to deduplicate data after updates. It seems to do a
good job here, better than other deduplicators I found. And if some
defrag tools destroyed your snapshot reflinks, bees can also help here.
On its way it may recombine extents so it may improve fragmentation.
But usually it probably defragments because it needs to split extents
that a defragger combined.

But well, I think getting 100% continuous allocation is really not the
achievement you want to get, especially when reflinks are a primary
concern.


> Only context/scenario when you may want to lower defragmentation is
> when you are something needs to allocate continuous area lower than
> free space and larger than largest free chunk. Something like this
> happens only when volume is working on almost 100% allocated space.
> In such scenario even you bees cannot do to much as it may be not
> enough free space to move some other data in larger chunks to
> defragment FS physical space.

Bees does not do that.


> If your workload will be still writing
> new data to FS such defragmentation may give you (maybe) few more
> seconds and just after this FS will be 100% full,
> 
> In other words if someone is thinking that such defragmentation daemon
> is solving any problems he/she may be 100% right .. such person is
> only *thinking* that this is truth.

Bees is not about that.


> kloczek
> PS. Do you know first McGyver rule? -> "If it ain't broke, don't fix
> it".

Do you know the saying "think first, then act"?


> So first show that fragmentation is hurting latency of the
> access to btrfs data and it will be possible to measurable such
> impact. Before you will start measuring this you need to learn how o
> sample for example VFS layer latency. Do you know how to do this to
> deliver such proof?

You didn't get the point. You only read "defragmentation" and your
alarm lights lit up. You even think bees would be a defragmenter. It
probably is more the opposite because it introduces more fragments in
exchange for more reflinks.


> PS2. The same "discussions" about fragmentation where in the past
> about +10 years ago after ZFS has been introduced. Just to let you
> know that after initial ZFS introduction up to now was not written
> even single line of ZFS code to handle active fragmentation and no one
> been able to prove that something about active defragmentation needs
> 

Re: defragmenting best practice?

2017-09-14 Thread Kai Krakow
Am Tue, 12 Sep 2017 18:28:43 +0200
schrieb Ulli Horlacher :

> On Thu 2017-08-31 (09:05), Ulli Horlacher wrote:
> > When I do a 
> > btrfs filesystem defragment -r /directory
> > does it defragment really all files in this directory tree, even if
> > it contains subvolumes?
> > The man page does not mention subvolumes on this topic.  
> 
> No answer so far :-(
> 
> But I found another problem in the man-page:
> 
>   Defragmenting with Linux kernel versions < 3.9 or >= 3.14-rc2 as
> well as with Linux stable kernel versions >= 3.10.31, >= 3.12.12 or
> >= 3.13.4 will break up the ref-links of COW data (for example files
> >copied with
>   cp --reflink, snapshots or de-duplicated data). This may cause
>   considerable increase of space usage depending on the broken up
>   ref-links.
> 
> I am running Ubuntu 16.04 with Linux kernel 4.10 and I have several
> snapshots.
> Therefore, I better should avoid calling "btrfs filesystem defragment
> -r"?
> 
> What is the defragmenting best practice?
> Avoid it completly?

You may want to try https://github.com/Zygo/bees. It is a daemon that
watches the file system's generation changes, scans the new blocks and
then recombines them. Of course, this process somewhat defeats the
purpose of defragging in the first place as it will undo some of the
defragmenting.

I suggest you only ever defragment parts of your main subvolume or rely
on autodefrag, and let bees handle optimizing the snapshots.

Also, I experimented with adding btrfs support to shake, still working
on better integration but currently lacking time... :-(

Shake is an adaptive defragger which rewrites files. With my current
patches it clones each file, and then rewrites it to its original
location. This approach is currently not optimal as it simply bails out
if some other process is accessing the file and leaves you with an
(intact) temporary copy you need to move back in place manually.

Shake works very well with the idea of detecting how fragmented, how
old, and how far away from an "ideal" position a file is, and it
exploits standard Linux file system behavior to optimally place files
by rewriting them. It then records its status per file in extended
attributes. It also works with non-btrfs file systems. My patches try
to avoid defragging files with shared extents, so this may help your
situation. However, it will still shuffle files around if they are too
far from their ideal position, thus destroying shared extents. A future
patch could use extent recombining and skip shared extents in that
process. But first I'd like to clean out some of the rough edges
together with the original author of shake.

Look here: https://github.com/unbrice/shake and also check out the pull
requests and comments there. You shouldn't currently run shake
unattended, and you should only run it on specific parts of your FS
that you feel need defragmenting.


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Help me understand what is going on with my RAID1 FS

2017-09-10 Thread Kai Krakow
Am Sun, 10 Sep 2017 20:15:52 +0200
schrieb Ferenc-Levente Juhos :

> >Problem is that each raid1 block group contains two chunks on two
> >separate devices, it can't utilize fully three devices no matter
> >what. If that doesn't suit you then you need to add 4th disk. After
> >that FS will be able to use all unallocated space on all disks in
> >raid1 profile. But even then you'll be able to safely lose only one
> >disk since BTRFS still will be storing only 2 copies of data.  
> 
> I hope I didn't say that I want to utilize all three devices fully. It
> was clear to me that there will be 2 TB of wasted space.
> Also I'm not questioning the chunk allocator for RAID1 at all. It's
> clear and it always has been clear that for RAID1 the chunks need to
> be allocated on different physical devices.
> If I understood Kai's point of view, he even suggested that I might
> need to do balancing to make sure that the free space on the three
> devices is being used smartly. Hence the questions about balancing.

It will allocate chunks from the device with the most space available.
So while you fill your disks, space usage will distribute evenly.

The problem comes when you start deleting stuff, some chunks may even
be freed, and everything becomes messed up. In an aging file system you
may notice that the chunks are no longer evenly distributed. A balance
is a way to fix that because it will reallocate chunks and coalesce
data back into single chunks, making free space for new allocations. In
this process it will actually evenly distribute your data again.

You may want to use this rebalance script:
https://www.spinics.net/lists/linux-btrfs/msg52076.html

> I mean in worst case it could happen like this:
> 
> Again I have disks of sizes 3, 3, 8:
> Fig.1
> Drive1(8) Drive2(3) Drive3(3)
>  -   X1X1
>  -   X2X2
>  -   X3X3
> Here the new drive is completely unused. Even if one X1 chunk would be
> on Drive1 it would be still a sub-optimal allocation.

This won't happen while filling a fresh btrfs. Chunks are always
allocated from the devices with the most free space (within the raid1
constraints). Thus it will allocate space alternating between disk1+2
and disk1+3.

> This is the optimal allocation. Will btrfs allocate like this?
> Considering that Drive1 has the most free space.
> Fig. 2
> Drive1(8) Drive2(3) Drive3(3)
> X1X1-
> X2-   X2
> X3X3-
> X4-   X4

Yes.

> From my point of view Fig.2 shows the optimal allocation, by the time
> the disks Drive2 and Drive3 are full (3TB) Drive1 must have 6TB
> (because it is exclusively holding the mirrors for both Drive2 and 3).
> For sure now btrfs can say, since two of the drives are completely
> full he can't allocate any more chunks and the remaining 2 TB of space
> from Drive1 is wasted. This is clear it's even pointed out by the
> btrfs size calculator.

Yes.


> But again if the above statements are true, then df might as well tell
> the "truth" and report that I have 3.5 TB space free and not 1.5TB (as
> it is reported now). Again here I fully understand Kai's explanation.
> Because coming back to my first e-mail, my "problem" was that df is
> reporting 1.5 TB free, whereas the whole FS holds 2.5 TB of data.

The size calculator has undergone some revisions. I think it currently
estimates the free space from net data to raw data ratio across all
devices, taking the current raid constraints into account.

Calculating free space in btrfs is difficult because in the future
btrfs may even support different raid levels for different sub volumes.
It's probably best to calculate for the worst case scenario then.

Even today it's already difficult if you use different raid levels for
meta data and content data: The filesystem cannot predict the future of
allocations. It can only give an educated guess. And the calculation
was revised a few times to not "overshoot".


> So the question still remains, is it just that df is intentionally not
> smart enough to give a more accurate estimation,

The df utility doesn't know anything about btrfs allocations. The value
is estimated by btrfs itself. To get more detailed info for capacity
planning, you should use "btrfs fi df" and its various siblings.
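
The siblings worth looking at, using the mount point from your df
output:

$ sudo btrfs filesystem df /mnt/BigVault
$ sudo btrfs filesystem usage /mnt/BigVault
$ sudo btrfs device usage /mnt/BigVault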

> or is the assumption
> that the allocator picks the drive with most free space mistaken?
> If I continue along the lines of what Kai said, and I need to do
> re-balance, because the allocation is not like shown above (Fig.2),
> then my question is still legitimate. Are there any filters that one
> might use to speed up or to selectively balance in my case? or will I
> need to do full balance?

Your assumption is misguided. The total free space estimation is a
totally different thing than what the allocator bases its decision on.
See "btrfs dev usage". The allocator uses space from the biggest
unallocated space 

Re: Help me understand what is going on with my RAID1 FS

2017-09-10 Thread Kai Krakow
Am Sun, 10 Sep 2017 15:45:42 +0200
schrieb FLJ :

> Hello all,
> 
> I have a BTRFS RAID1 volume running for the past year. I avoided all
> pitfalls known to me that would mess up this volume. I never
> experimented with quotas, no-COW, snapshots, defrag, nothing really.
> The volume is a RAID1 from day 1 and is working reliably until now.
> 
> Until yesterday it consisted of two 3 TB drives, something along the
> lines:
> 
> Label: 'BigVault'  uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
> Total devices 2 FS bytes used 2.47TiB
> devid1 size 2.73TiB used 2.47TiB path /dev/sdb
> devid2 size 2.73TiB used 2.47TiB path /dev/sdc
> 
> Yesterday I've added a new drive to the FS and did a full rebalance
> (without filters) over night, which went through without any issues.
> 
> Now I have:
>  Label: 'BigVault'  uuid: a37ad5f5-a21b-41c7-970b-13b6c4db33db
> Total devices 3 FS bytes used 2.47TiB
> devid1 size 2.73TiB used 1.24TiB path /dev/sdb
> devid2 size 2.73TiB used 1.24TiB path /dev/sdc
> devid3 size 7.28TiB used 2.48TiB path /dev/sda
> 
> # btrfs fi df /mnt/BigVault/
> Data, RAID1: total=2.47TiB, used=2.47TiB
> System, RAID1: total=32.00MiB, used=384.00KiB
> Metadata, RAID1: total=4.00GiB, used=2.74GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> But still df -h is giving me:
> Filesystem   Size  Used Avail Use% Mounted on
> /dev/sdb 6.4T  2.5T  1.5T  63% /mnt/BigVault
> 
> Although I've heard and read about the difficulty in reporting free
> space due to the flexibility of BTRFS, snapshots and subvolumes, etc.,
> but I only have a single volume, no subvolumes, no snapshots, no
> quotas and both data and metadata are RAID1.
> 
> My expectation would've been that in case of BigVault Size == Used +
> Avail.
> 
> Actually based on http://carfax.org.uk/btrfs-usage/index.html I
> would've expected 6 TB of usable space. Here I get 6.4 which is odd,
> but that only 1.5 TB is available is even stranger.
> 
> Could anyone explain what I did wrong or why my expectations are
> wrong?
> 
> Thank you in advance

Btrfs reports estimated free space based on the free space of the
smallest member, as it can only guarantee that. In your case this is
2.73 minus 1.24, which is roughly 1.5T. But since this free space is
distributed across three disks with one having much more free space, it
will probably use up that space at half the rate of actual allocation.
But due to how btrfs allocates from free space in chunks, that may not
be possible - thus the unexpectedly low value. You will probably need
to run balance once in a while to evenly redistribute allocated chunks
across all disks.

It may give you better estimates if you combine sdb and sdc into one
logical device, e.g. using raid0 or jbod via md or lvm.


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: csum failed root -9

2017-06-15 Thread Kai Krakow
Am Wed, 14 Jun 2017 15:39:50 +0200
schrieb Henk Slager <eye...@gmail.com>:

> On Tue, Jun 13, 2017 at 12:47 PM, Henk Slager <eye...@gmail.com>
> wrote:
> > On Tue, Jun 13, 2017 at 7:24 AM, Kai Krakow <hurikha...@gmail.com>
> > wrote:  
> >> Am Mon, 12 Jun 2017 11:00:31 +0200
> >> schrieb Henk Slager <eye...@gmail.com>:
> >>  
>  [...]  
> >>
> >> There's btrfs-progs v4.11 available...  
> >
> > I started:
> > # btrfs check -p --readonly /dev/mapper/smr
> > but it stopped with printing 'Killed' while checking extents. The
> > board has 8G RAM, no swap (yet), so I just started lowmem mode:
> > # btrfs check -p --mode lowmem --readonly /dev/mapper/smr
> >
> > Now after a 1 day 77 lines like this are printed:
> > ERROR: extent[5365470154752, 81920] referencer count mismatch (root:
> > 6310, owner: 1771130, offset: 33243062272) wanted: 1, have: 2
> >
> > It is still running, hopefully it will finish within 2 days. But
> > lateron I can compile/use latest progs from git. Same for kernel,
> > maybe with some tweaks/patches, but I think I will also plug the
> > disk into a faster machine then ( i7-4770 instead of the J1900 ).
> >  
>  [...]  
> >>
> >> What looks strange to me is that the parameters of the error
> >> reports seem to be rotated by one... See below:
> >>  
>  [...]  
> >>
> >> Why does it say "ino 1"? Does it mean devid 1?  
> >
> > On a 3-disk btrfs raid1 fs I see in the journal also "read error
> > corrected: ino 1" lines for all 3 disks. This was with a 4.10.x
> > kernel, ATM I don't know if this is right or wrong.
> >  
>  [...]  
> >>
> >> And why does it say "root -9"? Shouldn't it be "failed -9 root 257
> >> ino 515567616"? In that case the "off" value would be completely
> >> missing...
> >>
> >> Those "rotations" may mess up with where you try to locate the
> >> error on disk...  
> >
> > I hadn't looked at the numbers like that, but as you indicate, I
> > also think that the 1-block csum fail location is bogus because the
> > kernel calculates that based on some random corruption in critical
> > btrfs structures, also looking at the 77 referencer count
> > mismatches. A negative root ID is already a sort of red flag. When
> > I can mount the fs again after the check is finished, I can
> > hopefully use the output of the check to get clearer how big the
> > 'damage' is.  
> 
> The btrfs lowmem mode check ends with:
> 
> ERROR: root 7331 EXTENT_DATA[928390 3506176] shouldn't be hole
> ERROR: errors found in fs roots
> found 6968612982784 bytes used, error(s) found
> total csum bytes: 6786376404
> total tree bytes: 25656016896
> total fs tree bytes: 14857535488
> total extent tree bytes: 3237216256
> btree space waste bytes: 3072362630
> file data blocks allocated: 38874881994752
>  referenced 36477629964288
> 
> In total 2000+ of those "shouldn't be hole" lines.
> 
> A non-lowmem check, now done with kernel 4.11.4 and progs v4.11 and
> 16G swap added ends with 'noerrors found'

Don't trust lowmem mode too much. The developer of lowmem mode may tell
you more about specific edge cases.

> W.r.t. holes, maybe it is woth to mention the super-flags:
> incompat_flags  0x369
> ( MIXED_BACKREF |
>   COMPRESS_LZO |
>   BIG_METADATA |
>   EXTENDED_IREF |
>   SKINNY_METADATA |
>   NO_HOLES )

I think it's not worth following up on this holes topic: I guess it was
a false report by lowmem mode, and it was fixed with btrfs-progs 4.11.

> The fs has received snapshots from source fs that had NO_HOLES enabled
> for some time, but after registed this bug:
> https://bugzilla.kernel.org/show_bug.cgi?id=121321
> I put back that NO_HOLES flag to zero on the source fs. It seems I
> forgot to do that on the 8TB target/backup fs. But I don't know if
> there is a relation between this flag flipping and the btrfs check
> error messages.
> 
> I think I leave it as is for the time being, unless there is some news
> how to fix things with low risk (or maybe via a temp overlay snapshot
> with DM). But the lowmem check took 2 days, that's not really fun.
> The goal for the 8TB fs is to have an up to 7 year snapshot history at
> sometime, now the oldest snapshot is from early 2014, so almost
> halfway :)

Btrfs is still much too unstable to trust 7 years' worth of backups to
it. You will probably lose it at some point, especially while many
snapshots are still such a huge performance breaker in btrfs. I suggest
also trying out other alternatives like borg backup for such a project.
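
If you want to try borg for that kind of long history, a minimal
sketch (repository path, source path and retention numbers are made
up, adjust to taste):

# borg init --encryption=repokey /backup/borg-repo
# borg create --stats --compression lz4 /backup/borg-repo::'{hostname}-{now}' /data
# borg prune --keep-daily 7 --keep-weekly 8 --keep-monthly 24 --keep-yearly 7 /backup/borg-repo

The prune step is what gives you the multi-year history without
keeping every single archive around.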


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: csum failed root -9

2017-06-12 Thread Kai Krakow
Am Mon, 12 Jun 2017 11:00:31 +0200
schrieb Henk Slager :

> Hi all,
> 
> there is 1-block corruption a 8TB filesystem that showed up several
> months ago. The fs is almost exclusively a btrfs receive target and
> receives monthly sequential snapshots from two hosts but 1 received
> uuid. I do not know exactly when the corruption has happened but it
> must have been roughly 3 to 6 months ago. with monthly updated
> kernel+progs on that host.
> 
> Some more history:
> - fs was created in november 2015 on top of luks
> - initially bcache between the 2048-sector aligned partition and luks.
> Some months ago I removed 'the bcache layer' by making sure that cache
> was clean and then zeroing 8K bytes at start of partition in an
> isolated situation. Then setting partion offset to 2064 by
> delete-recreate in gdisk.
> - in december 2016 there were more scrub errors, but related to the
> monthly snapshot of december2016. I have removed that snapshot this
> year and now only this 1-block csum error is the only issue.
> - brand/type is seagate 8TB SMR. At least since kernel 4.4+ that
> includes some SMR related changes in the blocklayer this disk works
> fine with btrfs.
> - the smartctl values show no error so far but I will run an extended
> test this week after another btrfs check which did not show any error
> earlier with the csum fail being there
> - I have noticed that the board that has the disk attached has been
> rebooted due to power-failures many times (unreliable power switch and
> power dips from energy company) and the 150W powersupply is broken and
> replaced since then. Also due to this, I decided to remove bcache
> (which has been in write-through and write-around only).
> 
> Some btrfs inpect-internal exercise shows that the problem is in a
> directory in the root that contains most of the data and snapshots.
> But an  rsync -c  with an identical other clone snapshot shows no
> difference (no writes to an rw snapshot of that clone). So the fs is
> still OK as file-level backup, but btrfs replace/balance will fatal
> error on just this 1 csum error. It looks like that this is not a
> media/disk error but some HW induced error or SW/kernel issue.
> Relevant btrfs commands + dmesg info, see below.
> 
> Any comments on how to fix or handle this without incrementally
> sending all snapshots to a new fs (6+ TiB of data, assuming this won't
> fail)?
> 
> 
> # uname -r
> 4.11.3-1-default
> # btrfs --version
> btrfs-progs v4.10.2+20170406

There's btrfs-progs v4.11 available...

> fs profile is dup for system+meta, single for data
> 
> # btrfs scrub start /local/smr

What looks strange to me is that the parameters of the error reports
seem to be rotated by one... See below:

> [27609.626555] BTRFS error (device dm-0): parent transid verify failed
> on 6350718500864 wanted 23170 found 23076
> [27609.685416] BTRFS info (device dm-0): read error corrected: ino 1
> off 6350718500864 (dev /dev/mapper/smr sector 11681212672)
> [27609.685928] BTRFS info (device dm-0): read error corrected: ino 1
> off 6350718504960 (dev /dev/mapper/smr sector 11681212680)
> [27609.686160] BTRFS info (device dm-0): read error corrected: ino 1
> off 6350718509056 (dev /dev/mapper/smr sector 11681212688)
> [27609.687136] BTRFS info (device dm-0): read error corrected: ino 1
> off 6350718513152 (dev /dev/mapper/smr sector 11681212696)
> [37663.606455] BTRFS error (device dm-0): parent transid verify failed
> on 6350453751808 wanted 23170 found 23075
> [37663.685158] BTRFS info (device dm-0): read error corrected: ino 1
> off 6350453751808 (dev /dev/mapper/smr sector 11679647008)
> [37663.685386] BTRFS info (device dm-0): read error corrected: ino 1
> off 6350453755904 (dev /dev/mapper/smr sector 11679647016)
> [37663.685587] BTRFS info (device dm-0): read error corrected: ino 1
> off 635045376 (dev /dev/mapper/smr sector 11679647024)
> [37663.685798] BTRFS info (device dm-0): read error corrected: ino 1
> off 6350453764096 (dev /dev/mapper/smr sector 11679647032)

Why does it say "ino 1"? Does it mean devid 1?

> [43497.234598] BTRFS error (device dm-0): bdev /dev/mapper/smr errs:
> wr 0, rd 0, flush 0, corrupt 1, gen 0
> [43497.234605] BTRFS error (device dm-0): unable to fixup (regular)
> error at logical 7175413624832 on dev /dev/mapper/smr
> 
> # < figure out which chunk with help of btrfs py lib >
> 
> chunk vaddr 7174898057216 type 1 stripe 0 devid 1 offset 6696948727808
> length 1073741824 used 1073741824 used_pct 100
> chunk vaddr 7175971799040 type 1 stripe 0 devid 1 offset 6698022469632
> length 1073741824 used 1073741824 used_pct 100
> 
> # btrfs balance start -v
> -dvrange=7174898057216..7174898057217 /local/smr
> 
> [74250.913273] BTRFS info (device dm-0): relocating block group
> 7174898057216 flags data
> [74255.941105] BTRFS warning (device dm-0): csum failed root -9 ino
> 257 off 515567616 csum 0x589cb236 expected csum 0xee19bf74 mirror 1
> [74255.965804] BTRFS warning (device dm-0): csum 

Re: 4.11.1: cannot btrfs check --repair a filesystem, causes heavy memory stalls

2017-05-23 Thread Kai Krakow
Am Tue, 23 May 2017 07:21:33 -0400
schrieb "Austin S. Hemmelgarn" :

> On 2017-05-22 22:07, Chris Murphy wrote:
> > On Mon, May 22, 2017 at 5:57 PM, Marc MERLIN 
> > wrote:  
> >> On Mon, May 22, 2017 at 05:26:25PM -0600, Chris Murphy wrote:  
>  [...]  
>  [...]  
>  [...]  
> >>
> >> Oh, swap will work, you're sure?
> >> I already have an SSD, if that's good enough, I can give it a
> >> shot.  
> >
> > Yeah although I have no idea how much swap is needed for it to
> > succeed. I'm not sure what the relationship is to fs metadata chunk
> > size to btrfs check RAM requirement is; but if it wants all of the
> > metadata in RAM, then whatever btrfs fi us shows you for metadata
> > may be a guide (?) for how much memory it's going to want.  
> I think the in-memory storage is a bit more space efficient than the 
> on-disk storage, but I'm not certain, and I'm pretty sure it takes up 
> more space when it's actually repairing things.  If I'm doing the
> math correctly, you _may_ need up to 50% _more_ than the total
> metadata size for the FS in virtual memory space.
> >
> > Another possibility is zswap, which still requires a backing device,
> > but it might be able to limit how much swap to disk is needed if the
> > data to swap out is highly compressible. *shrug*
> >  
> zswap won't help in that respect, but it might make swapping stuff
> back in faster.  It just keeps a compressed copy in memory in
> parallel to writing the full copy out to disk, then uses that
> compressed copy to swap in instead of going to disk if the copy is
> still in memory (but it will discard the compressed copies if memory
> gets really low).  In essence, it reduces the impact of swapping when
> memory pressure is moderate (the situation for most desktops for
> example), but becomes almost useless when you have very high memory
> pressure (which is what describes this usage).

Is this really how zswap works?

I always thought it acts as a compressed write-back cache in front of
the swap devices: pages first go to zswap compressed, and later
write-back kicks in and migrates those compressed pages to real swap,
still compressed. Zswap does this by packing two (or up to three in
modern kernels) compressed pages into one page. It has the downside of
uncompressing all "buddy pages" when only one of them is needed back
in, but the data stays compressed. This also tells me zswap will
achieve either around a 1:2 or 1:3 effective compression ratio, or
none at all. So it cannot be compared to how streaming compression
works.

OTOH, if the page is reloaded from cache before write-back kicks in, it
will never be written to swap but just uncompressed and discarded from
the cache.

Under high memory pressure it doesn't really work that well due to the
high CPU overhead when pages constantly swap out, compress, write,
read, uncompress, swap in... This usually results in very low CPU
usage for processes but high IO, disk wait and kernel CPU usage.
Still, it defers memory pressure conditions a little in exchange for a
little more IO and CPU usage. If you have a lot of inactive memory
around, it can make a difference. But it is counterproductive if
almost all your memory is active and pressure is high.

So, in this scenario, it probably still doesn't help.
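
For reference, and assuming a reasonably recent kernel with zswap
built in, enabling and inspecting it looks roughly like this
(compressor and pool percentage are just example values):

# echo 1 > /sys/module/zswap/parameters/enabled
# echo lz4 > /sys/module/zswap/parameters/compressor
# echo 20 > /sys/module/zswap/parameters/max_pool_percent
# grep -r . /sys/kernel/debug/zswap/

The same can be set at boot time via zswap.enabled=1,
zswap.compressor=lz4 and zswap.max_pool_percent=20 on the kernel
command line.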


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 2/2] Btrfs: compression must free at least PAGE_SIZE

2017-05-20 Thread Kai Krakow
Am Sat, 20 May 2017 19:49:53 +0300
schrieb Timofey Titovets :

> Btrfs already skip store of data where compression didn't free at
> least one byte. So make logic better and make check that compression
> free at least one PAGE_SIZE, because in another case it useless to
> store this data compressed
> 
> Signed-off-by: Timofey Titovets 
> ---
>  fs/btrfs/lzo.c  | 5 -
>  fs/btrfs/zlib.c | 3 ++-
>  2 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/lzo.c b/fs/btrfs/lzo.c
> index bd0b0938..7f38bc3c 100644
> --- a/fs/btrfs/lzo.c
> +++ b/fs/btrfs/lzo.c
> @@ -229,8 +229,11 @@ static int lzo_compress_pages(struct list_head
> *ws, in_len = min(bytes_left, PAGE_SIZE);
>   }
> 
> - if (tot_out > tot_in)
> + /* Compression must save at least one PAGE_SIZE */
> + if (tot_out + PAGE_SIZE => tot_in) {

Shouldn't this be ">" instead of ">="? (BTW, I don't think "=>" even
compiles...)

Given the case that tot_in is 8192 and tot_out is 4096, we saved a
complete page, but 4096 + 4096 would still be equal to 8192 and hit
the ">=" bail-out.

The former logic merely assumed that there is no point in compression
if we saved 0 bytes.

BTW: What's the smallest block size that btrfs stores? Is it always
PAGE_SIZE? I'm not familiar with btrfs internals...

> + ret = -E2BIG;
>   goto out;
> + }
> 
>   /* store the size of all chunks of compressed data */
>   cpage_out = kmap(pages[0]);
> diff --git a/fs/btrfs/zlib.c b/fs/btrfs/zlib.c
> index 135b1082..2b04259b 100644
> --- a/fs/btrfs/zlib.c
> +++ b/fs/btrfs/zlib.c
> @@ -191,7 +191,8 @@ static int zlib_compress_pages(struct list_head
> *ws, goto out;
>   }
> 
> - if (workspace->strm.total_out >= workspace->strm.total_in) {
> + /* Compression must save at least one PAGE_SIZE */
> + if (workspace->strm.total_out + PAGE_SIZE >=
> workspace->strm.total_in) { ret = -E2BIG;

Same as above...

>   goto out;
>   }
> --
> 2.13.0


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-17 Thread Kai Krakow
Am Fri, 5 May 2017 08:43:23 -0700
schrieb Marc MERLIN <m...@merlins.org>:

[missing quote of the command]
> > Corrupted blocks are corrupted, that command is just trying to
> > corrupt it again.
> > It won't do the black magic to adjust tree blocks to avoid them.  
>  
> I see. you may hve seen the earlier message from Kai Krakow who was
> able to to recover his FS by trying this trick, but I understand it
> can't work in all cases.

Huh, what trick? I don't take credit for it... ;-)

The corrupt-block trick must've been someone else...


-- 
Regards,
Kai

Replies to list-only preferred.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs/SSD

2017-05-16 Thread Kai Krakow
Am Tue, 16 May 2017 14:21:20 +0200
schrieb Tomasz Torcz <to...@pipebreaker.pl>:

> On Tue, May 16, 2017 at 03:58:41AM +0200, Kai Krakow wrote:
> > Am Mon, 15 May 2017 22:05:05 +0200
> > schrieb Tomasz Torcz <to...@pipebreaker.pl>:
> >   
>  [...]  
> > > 
> > >   Let me add my 2 cents.  bcache-writearound does not cache writes
> > > on SSD, so there are less writes overall to flash.  It is said
> > > to prolong the life of the flash drive.
> > >   I've recently switched from bcache-writeback to
> > > bcache-writearound, because my SSD caching drive is at the edge
> > > of it's lifetime. I'm using bcache in following configuration:
> > > http://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2.svg My
> > > SSD is Samsung SSD 850 EVO 120GB, which I bought exactly 2 years
> > > ago.
> > > 
> > >   Now, according to
> > > http://www.samsung.com/semiconductor/minisite/ssd/product/consumer/850evo.html
> > > 120GB and 250GB warranty only covers 75 TBW (terabytes written).  
> > 
> > According to your chart, all your data is written twice to bcache.
> > It may have been better to buy two drives, one per mirror. I don't
> > think that SSD firmwares do deduplication - so data is really
> > written twice.  
> 
>   I'm aware of that, but 50 GB (I've got 100GB caching partition)
> is still plenty to cache my ~, some media files, two small VMs.
> On the other hand I don't want to overspend. This is just a home
> server.
>   Nb. I'm still waiting for btrfs native SSD caching, which was
> planned for 3.6 kernel 5 years ago :)
> ( 
> https://oss.oracle.com/~mason/presentation/btrfs-jls-12/btrfs.html#/planned-3.6
> )
> 
> > 
> >   
> > > My
> > > drive has  # smartctl -a /dev/sda  | grep LBA 241
> > > Total_LBAs_Written  0x0032   099   099   000Old_age
> > > Always   -   136025596053  
> > 
> > Doesn't say this "99%" remaining? The threshold is far from being
> > reached...
> > 
> > I'm curious, what is Wear_Leveling_Count reporting?  
> 
> ID# ATTRIBUTE_NAME  FLAG VALUE WORST THRESH TYPE
> UPDATED  WHEN_FAILED RAW_VALUE 9 Power_On_Hours  0x0032
> 096   096   000Old_age   Always   -   18227 12
> Power_Cycle_Count   0x0032   099   099   000Old_age
> Always   -   29 177 Wear_Leveling_Count 0x0013   001
> 001   000Pre-fail  Always   -   4916
> 
>  Is this 001 mean 1%? If so, SMART contradicts datasheets. And I
> don't think I shoud see read errors for 1% wear.

It rather means 1% left, that is 99% wear... Most of these values are
counters running from 100 down to zero, with THRESH being the point at
or below which the attribute is considered failed or failing.

Only a few values work the other way around (like temperature).

Be careful with interpreting raw values: they may be very manufacturer
specific and not normalized.

According to Total_LBAs_Written, the manufacturer thinks the drive
could still take 100x more (only 1% used). But your wear level is almost
100% (value = 001). I think that value isn't really designed around the
flash cell lifetime, but intermediate components like caches.

So you need to read most values "backwards": It's not a used counter,
but a "what's left" counter.

What does it tell you about reserved blocks usage? Note that it's sort
of a double negation here: a value of 100 means 100% unused, or 0%
used... ;-) Or just put a minus in front of those values and think of
them counting up to zero. So on a time axis the drive starts at -100%
of its total lifetime scale, and 0 is the fail point (or whatever
THRESH says).
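
For the archives: a quick way to pull out just the wear-related
attributes and compare VALUE against THRESH (attribute names differ
between vendors, so treat the pattern as an example):

# smartctl -A /dev/sda | egrep -i 'wear|lba|realloc|reserv'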


-- 
Regards,
Kai

Replies to list-only preferred.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs/SSD

2017-05-15 Thread Kai Krakow
Am Mon, 15 May 2017 22:05:05 +0200
schrieb Tomasz Torcz <to...@pipebreaker.pl>:

> On Mon, May 15, 2017 at 09:49:38PM +0200, Kai Krakow wrote:
> >   
> > > It's worth noting also that on average, COW filesystems like BTRFS
> > > (or log-structured-filesystems will not benefit as much as
> > > traditional filesystems from SSD caching unless the caching is
> > > built into the filesystem itself, since they don't do in-place
> > > rewrites (so any new write by definition has to drop other data
> > > from the cache).  
> > 
> > Yes, I considered that, too. And when I tried, there was almost no
> > perceivable performance difference between bcache-writearound and
> > bcache-writeback. But the latency of performance improvement was
> > much longer in writearound mode, so I sticked to writeback mode.
> > Also, writing random data is faster because bcache will defer it to
> > background and do writeback in sector order. Sequential access is
> > passed around bcache anyway, harddisks are already good at that.  
> 
>   Let me add my 2 cents.  bcache-writearound does not cache writes
> on SSD, so there are less writes overall to flash.  It is said
> to prolong the life of the flash drive.
>   I've recently switched from bcache-writeback to bcache-writearound,
> because my SSD caching drive is at the edge of it's lifetime. I'm
> using bcache in following configuration:
> http://enotty.pipebreaker.pl/dżogstaff/2016.05.25-opcja2.svg My SSD
> is Samsung SSD 850 EVO 120GB, which I bought exactly 2 years ago.
> 
>   Now, according to
> http://www.samsung.com/semiconductor/minisite/ssd/product/consumer/850evo.html
> 120GB and 250GB warranty only covers 75 TBW (terabytes written).

According to your chart, all your data is written twice to bcache. It
may have been better to buy two drives, one per mirror. I don't think
that SSD firmwares do deduplication - so data is really written twice.

They may do compression, but that would be per-block compression
rather than streaming compression, so it won't help as a deduplicator
here. Also, due to the internal structure, compression would probably
work similarly to how zswap works: by combining compressed blocks into
"buddy blocks", so only compression ratios above 2:1 will merge
compressed blocks into single blocks. For most of your data this won't
be the case, so effectively there is no overall gain. For this reason,
I doubt that any firmware bothers with compression; the effect is just
too small compared to the management overhead and complexity it adds
to the already complicated FTL layer.


> My
> drive has  # smartctl -a /dev/sda  | grep LBA 241
> Total_LBAs_Written  0x0032   099   099   000Old_age
> Always   -   136025596053

Doesn't this say "99%" remaining? The threshold is far from being
reached...

I'm curious, what is Wear_Leveling_Count reporting?

> which multiplied by 512 bytes gives 69.6 TB. Close to 75TB? Well…
> 
> [35354.697513] sd 0:0:0:0: [sda] tag#19 FAILED Result:
> hostbyte=DID_OK driverbyte=DRIVER_SENSE [35354.697516] sd 0:0:0:0:
> [sda] tag#19 Sense Key : Medium Error [current] [35354.697518] sd
> 0:0:0:0: [sda] tag#19 Add. Sense: Unrecovered read error - auto
> reallocate failed [35354.697522] sd 0:0:0:0: [sda] tag#19 CDB:
> Read(10) 28 00 0c 30 82 9f 00 00 48 00 [35354.697524]
> blk_update_request: I/O error, dev sda, sector 204505785
> 
> Above started appearing recently.  So, I was really suprised that:
> - this drive is only rated for 120 TBW
> - I went through this limit in only 2 years
> 
>   The workload is lightly utilised home server / media center.

I think bcache is a real SSD killer for drives of around 120GB or
below... I saw similar life usage on my previous small SSD after just
one year. But I never had a sense error because I took it out of
service early. And I switched to writearound, too.

I think the write pattern of bcache cannot be handled well by the FTL.
It behaves like a log-structured file system, with new writes only
appended, and garbage collection occasionally done by freeing complete
erase blocks. Maybe it could work better if btrfs could pass
information about freed blocks down to bcache - btrfs produces a lot
of those due to its COW nature.

I wonder if this is already covered by turning on discard in btrfs?
Does anyone know?
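
For completeness, the btrfs side of it would just be the discard mount
option; whether bcache then forwards those discards down to the
caching device is exactly the open question here, so take this as a
sketch only (mount point is an example):

# mount -o remount,discard /mnt/pool
# findmnt -t btrfs -o TARGET,OPTIONS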


-- 
Regards,
Kai

Replies to list-only preferred.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs/SSD

2017-05-15 Thread Kai Krakow
Am Mon, 15 May 2017 08:03:48 -0400
schrieb "Austin S. Hemmelgarn" :

> > That's why I don't trust any of my data to them. But I still want
> > the benefit of their speed. So I use SSDs mostly as frontend caches
> > to HDDs. This gives me big storage with fast access. Indeed, I'm
> > using bcache successfully for this. A warm cache is almost as fast
> > as native SSD (at least it feels almost that fast, it will be
> > slower if you threw benchmarks at it).  
> That's to be expected though, most benchmarks don't replicate actual 
> usage patterns for client systems, and using SSD's for caching with 
> bcache or dm-cache for most server workloads except a file server
> will usually get you a performance hit.

You mean "performance boost"? Almost every read-mostly server workload
should benefit... A file server may be the exact opposite...

Also, I think dm-cache and bcache work very differently and are not
directly comparable. Their benefit depends a lot on the applied
workload.

If I remember right, dm-cache is more about keeping "hot data" in the
flash storage while bcache is more about reducing seeking. So dm-cache
optimizes for the higher throughput of SSDs while bcache optimizes for
their almost-zero seek overhead. Depending on your underlying storage,
one or the other may even give zero benefit or worsen performance -
which is what I'd call a "performance hit"... I never tried dm-cache,
though. For reasons I don't remember exactly, I didn't like something
about how it's implemented; I think it was related to crash recovery.
I don't know if that still holds true with modern kernels. It may have
changed, but I never looked back to revise that decision.


> It's worth noting also that on average, COW filesystems like BTRFS
> (or log-structured-filesystems will not benefit as much as
> traditional filesystems from SSD caching unless the caching is built
> into the filesystem itself, since they don't do in-place rewrites (so
> any new write by definition has to drop other data from the cache).

Yes, I considered that, too. And when I tried, there was almost no
perceivable performance difference between bcache-writearound and
bcache-writeback. But it took much longer for the performance
improvement to show up in writearound mode, so I stuck to writeback
mode. Also, writing random data is faster because bcache will defer it
to the background and do the writeback in sector order. Sequential
access bypasses bcache anyway; hard disks are already good at that.

But of course, the COW nature of btrfs will lower the hit rate I can
get on writes. That's why I see no benefit in using bcache-writethrough
with btrfs.
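
For anyone who wants to compare the modes on their own workload:
bcache lets you flip the cache mode per backing device at runtime,
roughly like this (device name is an example; reading cache_mode lists
the available modes with the active one in brackets):

# cat /sys/block/bcache0/bcache/cache_mode
# echo writearound > /sys/block/bcache0/bcache/cache_mode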


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs/SSD

2017-05-15 Thread Kai Krakow
Am Mon, 15 May 2017 07:46:01 -0400
schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>:

> On 2017-05-12 14:27, Kai Krakow wrote:
> > Am Tue, 18 Apr 2017 15:02:42 +0200
> > schrieb Imran Geriskovan <imran.gerisko...@gmail.com>:
> >  
> >> On 4/17/17, Austin S. Hemmelgarn <ahferro...@gmail.com> wrote:  
>  [...]  
> >>
> >> I'm trying to have a proper understanding of what "fragmentation"
> >> really means for an ssd and interrelation with wear-leveling.
> >>
> >> Before continuing lets remember:
> >> Pages cannot be erased individually, only whole blocks can be
> >> erased. The size of a NAND-flash page size can vary, and most
> >> drive have pages of size 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have
> >> blocks of 128 or 256 pages, which means that the size of a block
> >> can vary between 256 KB and 4 MB.
> >> codecapsule.com/.../coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/
> >>
> >> Lets continue:
> >> Since block sizes are between 256k-4MB, data smaller than this will
> >> "probably" will not be fragmented in a reasonably empty and trimmed
> >> drive. And for a brand new ssd we may speak of contiguous series
> >> of blocks.
> >>
> >> However, as drive is used more and more and as wear leveling
> >> kicking in (ie. blocks are remapped) the meaning of "contiguous
> >> blocks" will erode. So any file bigger than a block size will be
> >> written to blocks physically apart no matter what their block
> >> addresses says. But my guess is that accessing device blocks
> >> -contiguous or not- are constant time operations. So it would not
> >> contribute performance issues. Right? Comments?
> >>
> >> So your the feeling about fragmentation/performance is probably
> >> related with if the file is spread into less or more blocks. If #
> >> of blocks used is higher than necessary (ie. no empty blocks can be
> >> found. Instead lots of partially empty blocks have to be used
> >> increasing the total # of blocks involved) then we will notice
> >> performance loss.
> >>
> >> Additionally if the filesystem will gonna try something to reduce
> >> the fragmentation for the blocks, it should precisely know where
> >> those blocks are located. Then how about ssd block informations?
> >> Are they available and do filesystems use it?
> >>
> >> Anyway if you can provide some more details about your experiences
> >> on this we can probably have better view on the issue.  
> >
> > What you really want for SSD is not defragmented files but
> > defragmented free space. That increases life time.
> >
> > So, defragmentation on SSD makes sense if it cares more about free
> > space but not file data itself.
> >
> > But of course, over time, fragmentation of file data (be it meta
> > data or content data) may introduce overhead - and in btrfs it
> > probably really makes a difference if I scan through some of the
> > past posts.
> >
> > I don't think it is important for the file system to know where the
> > SSD FTL located a data block. It's just important to keep
> > everything nicely aligned with erase block sizes, reduce rewrite
> > patterns, and free up complete erase blocks as good as possible.
> >
> > Maybe such a process should be called "compaction" and not
> > "defragmentation". In the end, the more continuous blocks of free
> > space there are, the better the chance for proper wear leveling.  
> 
> There is one other thing to consider though.  From a practical 
> perspective, performance on an SSD is a function of the number of 
> requests and what else is happening in the background.  The second 
> aspect isn't easy to eliminate on most systems, but the first is
> pretty easy to mitigate by defragmenting data.
> 
> Reiterating the example I made elsewhere in the thread:
> Assume you have an SSD and storage controller that can use DMA to 
> transfer up to 16MB of data off of the disk in a single operation.
> If you need to load a 16MB file off of this disk and it's properly
> aligned (it usually will be with most modern filesystems if the
> partition is properly aligned) and defragmented, it will take exactly
> one operation (assuming that doesn't get interrupted).  By contrast,
> if you have 16 fragments of 1MB each, that will take at minimum 2
> operations, and more likely 15-16 (depends on where everything is
> on-disk, and how smart the driver is about minimizing the

Re: Btrfs/SSD

2017-05-15 Thread Kai Krakow
Am Mon, 15 May 2017 14:09:20 +0100
schrieb Tomasz Kusmierz :

> > Traditional hard drives usually do this too these days (they've
> > been under-provisioned since before SSD's existed), which is part
> > of why older disks tend to be noisier and slower (the reserved
> > space is usually at the far inside or outside of the platter, so
> > using sectors from there to replace stuff leads to long seeks).  
> 
> Not true. When HDD uses 10% (10% is just for easy example) of space
> as spare than aligment on disk is (US - used sector, SS - spare
> sector, BS - bad sector)
> 
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> US US US US US US US US US SS
> 
> if failure occurs - drive actually shifts sectors up:
> 
> US US US US US US US US US SS
> US US US BS BS BS US US US US
> US US US US US US US US US US
> US US US US US US US US US US
> US US US US US US US US US SS
> US US US BS US US US US US US
> US US US US US US US US US SS
> US US US US US US US US US SS

This makes sense... "Reserve area" somehow implies it is contiguous
and as such located at one far end of the platter. But your picture
totally makes sense.


> that strategy is in place to actually mitigate the problem that
> you’ve described, actually it was in place since drives were using
> PATA :) so if your drive get’s nosier over time it’s either a broken
> bearing or demagnetised arm magnet causing it to not aim propperly -
> so drive have to readjust position multiple times before hitting a
> right track -- To unsubscribe from this list: send the line
> "unsubscribe linux-btrfs" in the body of a message to
> majord...@vger.kernel.org More majordomo info at
> http://vger.kernel.org/majordomo-info.html

I can confirm that such drives usually do not get noisier unless
something other than just a few sectors is broken. A faulty bearing in
notebook drives is the most common scenario I see. I always recommend
replacing such drives early because they will usually fail completely.
Such notebooks are good candidates for SSD replacements, btw. ;-)

The demagnetised arm magnet is an interesting error scenario - didn't
think of it. Thanks for the pointer.

But still, there's one noise you can easily identify as bad sectors:
the drive starts clicking for 30 or more seconds while trying to read
data, and usually also freezes the OS during that time. Such drives
can be "repaired" by rewriting the offending sectors (because they
will be moved to the reserve area then). But I guess it's best to
replace such a drive by that time anyway.

Back in PATA times, I often had hard disks exposing seemingly bad
sectors when power was cut while the drive was writing data. I usually
used dd to rewrite such sectors and the drive was as good as new again
- except that I maybe lost some file data. Luckily, modern drives
don't show such behavior. And SSDs have also learned to handle this...
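
For the record, the kind of rewrite meant here looks roughly like
this. The device and sector number are placeholders (take the sector
from the kernel log), and the command destroys whatever was stored in
that sector, so only use it when you have given up on that data:

# dd if=/dev/zero of=/dev/sdX bs=512 seek=<bad sector> count=1 oflag=direct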


-- 
Regards,
Kai

Replies to list-only preferred.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: balancing every night broke balancing so now I can't balance anymore?

2017-05-14 Thread Kai Krakow
Am Sun, 14 May 2017 22:57:26 +0200
schrieb Lionel Bouton :

> I've coded one Ruby script which tries to balance between the cost of
> reallocating group and the need for it. The basic idea is that it
> tries to keep the proportion of free space "wasted" by being allocated
> although it isn't used below a threshold. It will bring this
> proportion down enough through balance that minor reallocation won't
> trigger a new balance right away. It should handle pathological
> conditions as well as possible and it won't spend more than 2 hours
> working on a single filesystem by default. We deploy this as a daily
> cron script through Puppet on all our systems and it works very well
> (I didn't have to use balance manually to manage free space since we
> did that). Note that by default it sleeps a random amount of time to
> avoid IO spikes on VMs running on the same host. You can either edit
> it or pass it "0" which will be used for the max amount of time to
> sleep bypassing this precaution.
> 
> Here is the latest version : https://pastebin.com/Rrw1GLtx
> Given its current size, I should probably push it on github...

Yes, please... ;-)

> I've seen other maintenance scripts mentioned on this list so you
> might something simpler or more targeted to your needs by browsing
> through the list's history.


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: balancing every night broke balancing so now I can't balance anymore?

2017-05-14 Thread Kai Krakow
Am Sun, 14 May 2017 13:15:09 -0700
schrieb Marc MERLIN :

> On Sun, May 14, 2017 at 09:13:35PM +0200, Hans van Kranenburg wrote:
> > On 05/13/2017 10:54 PM, Marc MERLIN wrote:  
> > > Kernel 4.11, btrfs-progs v4.7.3
> > > 
> > > I run scrub and balance every night, been doing this for 1.5
> > > years on this filesystem.  
> > 
> > What are the exact commands you run every day?  
>  
> http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html
> (at the bottom)
> every night:
> 1) scrub
> 2) balance -musage=0
> 3) balance -musage=20
> 4) balance -dusage=0
> 5) balance -dusage=20
> 
> > > How did I get into such a misbalanced state when I balance every
> > > night?  
> > 
> > I don't know, since I don't know what you do exactly. :)  
>  
> Now you do :)
> 
> > > My filesystem is not full, I can write just fine, but I sure
> > > cannot rebalance now.  
> > 
> > Yes, because you have quite some allocated but unused space. If
> > btrfs cannot just allocate more chunks, it starts trying a bit
> > harder to reuse all the empty spots in the already existing
> > chunks.  
> 
> Ok. shouldn't balance fix problems just like this?
> I have 60GB-ish free, or in this case that's also >25%, that's a lot
> 
> Speaking of unallocated, I have more now:
> Device unallocated:993.00MiB
> 
> This kind of just magically fixed itself during snapshot rotation and
> deletion I think.
> Sure enough, balance works again, but this feels pretty fragile.
> Looking again:
> Device size:   228.67GiB
> Device allocated:  227.70GiB
> Device unallocated:993.00MiB
> Free (estimated):   58.53GiB  (min: 58.53GiB)
> 
> You're saying that I need unallocated space for new chunks to be
> created, which is required by balance.
> Should btrfs not take care of keeping some space for me?
> Shoudln't a nigthly balance, which I'm already doing, help even more
> with this?
> 
> > > Besides adding another device to add space, is there a way around
> > > this and more generally not getting into that state anymore
> > > considering that I already rebalance every night?  
> > 
> > Add monitoring and alerting on the amount of unallocated space.
> > 
> > FWIW, this is what I use for that purpose:
> > 
> > https://packages.debian.org/sid/munin-plugins-btrfs
> > https://packages.debian.org/sid/monitoring-plugins-btrfs
> > 
> > And, of course the btrfs-heatmap program keeps being a fun tool to
> > create visual timelapses of your filesystem, so you can learn how
> > your usage pattern is resulting in allocation of space by btrfs,
> > and so that you can visually see what the effect of your btrfs
> > balance attempts is:  
> 
> That's interesting, but ultimately, users shoudln't have to
> micromanage their filesystem to that level, even btrfs.
> 
> a) What is wrong in my nightly script that I should fix/improve?

You may want to try
https://www.spinics.net/lists/linux-btrfs/msg52076.html

> b) How do I recover from my current state?

That script may work its way through.

-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[OT] SSD performance patterns (was: Btrfs/SSD)

2017-05-13 Thread Kai Krakow
Am Sat, 13 May 2017 09:39:39 + (UTC)
schrieb Duncan <1i5t5.dun...@cox.net>:

> Kai Krakow posted on Fri, 12 May 2017 20:27:56 +0200 as excerpted:
> 
> > In the end, the more continuous blocks of free space there are, the
> > better the chance for proper wear leveling.  
> 
> Talking about which...
> 
> When I was doing my ssd research the first time around, the going 
> recommendation was to keep 20-33% of the total space on the ssd
> entirely unallocated, allowing it to use that space as an FTL
> erase-block management pool.
> 
> At the time, I added up all my "performance matters" data dirs and 
> allowing for reasonable in-filesystem free-space, decided I could fit
> it in 64 GB if I had to, tho 80 GB would be a more comfortable fit,
> so allowing for the above entirely unpartitioned/unused slackspace 
> recommendations, had a target of 120-128 GB, with a reasonable range 
> depending on actual availability of 100-160 GB.
> 
> It turned out, due to pricing and availability, I ended up spending 
> somewhat more and getting 256 GB (238.5 GiB).  Of course that allowed
> me much more flexibility than I had expected and I ended up with
> basically everything but the media partition on the ssds, PLUS I
> still left them at only just over 50% partitioned, (using the gdisk
> figures, 51%- partitioned, 49%+ free).

I put my ESP (for UEFI) onto the SSD and also played with putting swap
on it, dedicated to hibernation. But I discarded the hibernation idea
and removed the swap because it didn't work well: it wasn't much
faster than waking from HDD, and hibernation is not that reliable
anyway. Also, hybrid hibernation is not yet integrated into KDE, so I
stick to sleep mode currently.

The rest of my SSD (also 500GB) is dedicated to bcache. This fits my
complete daily working set, with hit ratios going up to 90% and
beyond. My filesystem boots and feels like an SSD, the HDDs are almost
silent, and still my file system is 3TB on 3x 1TB HDDs.


> Given that, I've not enabled btrfs trim/discard (which saved me from
> the bugs with it a few kernel cycles ago), and while I do have a
> weekly fstrim systemd timer setup, I've not had to be too concerned
> about btrfs bugs (also now fixed, I believe) when fstrim on btrfs was
> known not to be trimming everything it really should have been.

This is a good recommendation as TRIM is still a slow operation because
Queued TRIM is not used for most drives due to buggy firmware. So you
not only circumvent kernel and firmware bugs, but also get better
performance that way.
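
On systemd systems the weekly-fstrim approach boils down to the
fstrim.timer unit shipped with util-linux (assuming your distribution
packages it, which most do):

# systemctl enable --now fstrim.timer
# systemctl list-timers fstrim.timer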


> Anyway, that 20-33% left entirely unallocated/unpartitioned 
> recommendation still holds, right?  Am I correct in asserting that if
> one is following that, the FTL already has plenty of erase-blocks
> available for management and the discussion about filesystem level
> trim and free space management becomes much less urgent, tho of
> course it's still worth considering if it's convenient to do so?
> 
> And am I also correct in believing that while it's not really worth 
> spending more to over-provision to the near 50% as I ended up doing,
> if things work out that way as they did with me because the
> difference in price between 30% overprovisioning and 50%
> overprovisioning ends up being trivial, there's really not much need
> to worry about active filesystem trim at all, because the FTL has
> effectively half the device left to play erase-block musical chairs
> with as it decides it needs to?

I think things may have changed since back then, see below. But it
certainly depends on which drive manufacturer you chose, I guess.

I can at least confirm that bigger drives wear through their write
cycles much more slowly, even when filled up. My old 128GB Crucial
drive was worn out after only 1 year (I swapped it early, I kept an
eye on the SMART numbers). My 500GB Samsung drive is around 1 year old
now, and I write a lot more data to it, but according to SMART it
should work for at least 5 to 7 more years. By that time, I will
probably have swapped it for a bigger drive already.

So I guess you should look at your SMART numbers and calculate the
expected lifetime:

Power_on_Hours(RAW) * WLC(VALUE) / (100 - WLC(VALUE))
with WLC = Wear_Leveling_Count

should give you the expected remaining power-on hours. My drive is
powered on 24/7 most of the time, but if you power your drive only 8
hours per day, the lifetime in calendar days easily stretches to three
times mine. ;-)
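
As a worked example with made-up numbers - say 18000 power-on hours
and a normalized Wear_Leveling_Count VALUE of 95:

  18000 * 95 / (100 - 95) = 342000 remaining power-on hours,
  which is roughly 39 years of 24/7 operation.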

There is also Total_LBAs_Written but that, at least for me, usually
gives much higher lifetime values so I'd stick with the pessimistic
ones.

Even when WLC goes to zero, the drive should still have reserved blocks
available. My drive sets the threshold to 0 for WLC which makes me
think that it is not fatal when it hits 0 because the drive still has
reserved blocks. And for reserved blocks, the threshold is 10%.

Re: Btrfs/SSD

2017-05-13 Thread Kai Krakow
Am Sat, 13 May 2017 14:52:47 +0500
schrieb Roman Mamedov <r...@romanrm.net>:

> On Fri, 12 May 2017 20:36:44 +0200
> Kai Krakow <hurikha...@gmail.com> wrote:
> 
> > My concern is with fail scenarios of some SSDs which die unexpected
> > and horribly. I found some reports of older Samsung SSDs which
> > failed suddenly and unexpected, and in a way that the drive
> > completely died: No more data access, everything gone. HDDs start
> > with bad sectors and there's a good chance I can recover most of
> > the data except a few sectors.  
> 
> Just have your backups up-to-date, doesn't matter if it's SSD, HDD or
> any sort of RAID.
> 
> In a way it's even better, that SSDs [are said to] fail abruptly and
> entirely. You can then just restore from backups and go on. Whereas a
> failing HDD can leave you puzzled on e.g. whether it's a cable or
> controller problem instead, and possibly can even cause some data
> corruption which you won't notice until too late.

My current backup strategy can handle this. I never back up files from
the source again if they didn't change by timestamp. That way, silent
data corruption won't creep into the backup. Additionally, I keep a
backlog of 5 years of file history. Even if a corrupted file creeps
into the backup, there is enough time to get a good copy back. If it's
older than that, it probably doesn't hurt so much anyway.


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs/SSD

2017-05-12 Thread Kai Krakow
Am Fri, 12 May 2017 15:02:20 +0200
schrieb Imran Geriskovan :

> On 5/12/17, Duncan <1i5t5.dun...@cox.net> wrote:
> > FWIW, I'm in the market for SSDs ATM, and remembered this from a
> > couple weeks ago so went back to find it.  Thanks. =:^)
> >
> > (I'm currently still on quarter-TB generation ssds, plus spinning
> > rust for the larger media partition and backups, and want to be rid
> > of the spinning rust, so am looking at half-TB to TB, which seems
> > to be the pricing sweet spot these days anyway.)  
> 
> Since you are taking ssds to mainstream based on your experience,
> I guess your perception of data retension/reliability is better than
> that of spinning rust. Right? Can you eloborate?
> 
> Or an other criteria might be physical constraints of spinning rust
> on notebooks which dictates that you should handle the device
> with care when running.
> 
> What was your primary motivation other than performance?

Personally, I don't really trust SSDs that much. They are much more
robust when it comes to physical damage because there are no moving
parts. That's absolutely not my concern; in that regard, I trust SSDs
more than HDDs.

My concern is with fail scenarios of some SSDs which die unexpected and
horribly. I found some reports of older Samsung SSDs which failed
suddenly and unexpected, and in a way that the drive completely died:
No more data access, everything gone. HDDs start with bad sectors and
there's a good chance I can recover most of the data except a few
sectors.

When SSD blocks die, they are huge compared to a sector (usually 256kB
to 4MB, because those are typical erase block sizes). If this happens,
the firmware may decide to either allow read-only access or completely
deny access. There's another situation where dying storage chips may
completely mess up the firmware, and then there's no longer any access
to the data.

That's why I don't trust any of my data to them. But I still want the
benefit of their speed, so I use SSDs mostly as frontend caches to
HDDs. This gives me big storage with fast access. Indeed, I'm using
bcache successfully for this. A warm cache is almost as fast as a
native SSD (at least it feels almost that fast; it will be slower if
you throw benchmarks at it).


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs/SSD

2017-05-12 Thread Kai Krakow
Am Tue, 18 Apr 2017 15:02:42 +0200
schrieb Imran Geriskovan :

> On 4/17/17, Austin S. Hemmelgarn  wrote:
> > Regarding BTRFS specifically:
> > * Given my recently newfound understanding of what the 'ssd' mount
> > option actually does, I'm inclined to recommend that people who are
> > using high-end SSD's _NOT_ use it as it will heavily increase
> > fragmentation and will likely have near zero impact on actual device
> > lifetime (but may _hurt_ performance).  It will still probably help
> > with mid and low-end SSD's.  
> 
> I'm trying to have a proper understanding of what "fragmentation"
> really means for an ssd and interrelation with wear-leveling.
> 
> Before continuing lets remember:
> Pages cannot be erased individually, only whole blocks can be erased.
> The size of a NAND-flash page size can vary, and most drive have pages
> of size 2 KB, 4 KB, 8 KB or 16 KB. Most SSDs have blocks of 128 or 256
> pages, which means that the size of a block can vary between 256 KB
> and 4 MB.
> codecapsule.com/.../coding-for-ssds-part-2-architecture-of-an-ssd-and-benchmarking/
> 
> Lets continue:
> Since block sizes are between 256k-4MB, data smaller than this will
> "probably" will not be fragmented in a reasonably empty and trimmed
> drive. And for a brand new ssd we may speak of contiguous series
> of blocks.
> 
> However, as drive is used more and more and as wear leveling kicking
> in (ie. blocks are remapped) the meaning of "contiguous blocks" will
> erode. So any file bigger than a block size will be written to blocks
> physically apart no matter what their block addresses says. But my
> guess is that accessing device blocks -contiguous or not- are
> constant time operations. So it would not contribute performance
> issues. Right? Comments?
> 
> So your the feeling about fragmentation/performance is probably
> related with if the file is spread into less or more blocks. If # of
> blocks used is higher than necessary (ie. no empty blocks can be
> found. Instead lots of partially empty blocks have to be used
> increasing the total # of blocks involved) then we will notice
> performance loss.
> 
> Additionally if the filesystem will gonna try something to reduce
> the fragmentation for the blocks, it should precisely know where
> those blocks are located. Then how about ssd block informations?
> Are they available and do filesystems use it?
> 
> Anyway if you can provide some more details about your experiences
> on this we can probably have better view on the issue.

What you really want for SSD is not defragmented files but defragmented
free space. That increases life time.

So, defragmentation on SSD makes sense if it cares more about free
space but not file data itself.

But of course, over time, fragmentation of file data (be it meta data
or content data) may introduce overhead - and in btrfs it probably
really makes a difference if I scan through some of the past posts.

I don't think it is important for the file system to know where the SSD
FTL located a data block. It's just important to keep everything nicely
aligned with erase block sizes, reduce rewrite patterns, and free up
complete erase blocks as good as possible.

Maybe such a process should be called "compaction" and not
"defragmentation". In the end, the more continuous blocks of free space
there are, the better the chance for proper wear leveling.
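
In btrfs terms, such a "compaction" pass is roughly a usage-filtered
balance followed by a trim, e.g. (thresholds and mount point are only
an example):

# btrfs balance start -dusage=20 -musage=20 /mnt
# fstrim -v /mnt

The balance packs the contents of mostly-empty chunks together and
returns whole chunks to the unallocated pool, and the trim then tells
the SSD that those erase blocks are free again.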


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfsck lowmem mode shows corruptions

2017-05-05 Thread Kai Krakow
Am Fri, 5 May 2017 08:55:10 +0800
schrieb Qu Wenruo <quwen...@cn.fujitsu.com>:

> At 05/05/2017 01:29 AM, Kai Krakow wrote:
> > Hello!
> > 
> > Since I saw a few kernel freezes lately (due to experimenting with
> > ck-sources) including some filesystem-related backtraces, I booted
> > my rescue system to check my btrfs filesystem.
> > 
> > Luckily, it showed no problems. It said, everything's fine. But I
> > also thought: Okay, let's try lowmem mode. And that showed a
> > frightening long list of extent corruptions und unreferenced
> > chunks. Should I worry?  
> 
> Thanks for trying lowmem mode.
> 
> Would you please provide the version of btrfs-progs?

Sorry... I realized it myself the moment I hit the "send" button.

Here it is:

# btrfs version
btrfs-progs v4.10.2

# uname -a
Linux jupiter 4.10.13-ck #2 SMP PREEMPT Thu May 4 23:44:09 CEST 2017
x86_64 Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz GenuineIntel GNU/Linux

> IIRC "ERROR: data extent[96316809216 2097152] backref lost" bug has
> been fixed in recent release.

Is there a patch I could apply?

> And for reference, would you please provide the tree dump of your
> chunk and device tree?
> 
> This can be done by running:
> # btrfs-debug-tree -t device 
> # btrfs-debug-tree -t chunk 

I'll attach those...

I'd like to note that between the OP and these dumps, I scrubbed and
rebalanced the whole device. I think that might scramble up some
numbers. Also, I took the dumps while the fs was online.

If you want me to do clean dumps of the offline device without
intermediate fs processing, let me know.

Thanks,
Kai

> And this 2 dump only contains the btrfs chunk mapping info, so
> nothing sensitive is contained.
> 
> Thanks,
> Qu
> > 
> > PS: The freezes seem to be related to bfq, switching to deadline
> > solved these.
> > 
> > Full log attached, here's an excerpt:
> > 
> > ---8<---
> > 
> > checking extents
> > ERROR: chunk[256 4324327424) stripe 0 did not find the related dev
> > extent ERROR: chunk[256 4324327424) stripe 1 did not find the
> > related dev extent ERROR: chunk[256 4324327424) stripe 2 did not
> > find the related dev extent ERROR: chunk[256 7545552896) stripe 0
> > did not find the related dev extent ERROR: chunk[256 7545552896)
> > stripe 1 did not find the related dev extent ERROR: chunk[256
> > 7545552896) stripe 2 did not find the related dev extent [...]
> > ERROR: device extent[1, 1094713344, 1073741824] did not find the
> > related chunk ERROR: device extent[1, 2168455168, 1073741824] did
> > not find the related chunk ERROR: device extent[1, 3242196992,
> > 1073741824] did not find the related chunk [...]
> > ERROR: device extent[2, 608854605824, 1073741824] did not find the
> > related chunk ERROR: device extent[2, 609928347648, 1073741824] did
> > not find the related chunk ERROR: device extent[2, 611002089472,
> > 1073741824] did not find the related chunk [...]
> > ERROR: device extent[3, 64433946624, 1073741824] did not find the
> > related chunk ERROR: device extent[3, 65507688448, 1073741824] did
> > not find the related chunk ERROR: device extent[3, 66581430272,
> > 1073741824] did not find the related chunk [...]
> > ERROR: data extent[96316809216 2097152] backref lost
> > ERROR: data extent[96316809216 2097152] backref lost
> > ERROR: data extent[96316809216 2097152] backref lost
> > ERROR: data extent[686074396672 13737984] backref lost
> > ERROR: data extent[686074396672 13737984] backref lost
> > ERROR: data extent[686074396672 13737984] backref lost
> > [...]
> > ERROR: errors found in extent allocation tree or chunk allocation
> > checking free space cache
> > checking fs roots
> > ERROR: errors found in fs roots
> > Checking filesystem on /dev/disk/by-label/system
> > UUID: bc201ce5-8f2b-4263-995a-6641e89d4c88
> > found 1960075935744 bytes used, error(s) found
> > total csum bytes: 1673537040
> > total tree bytes: 4899094528
> > total fs tree bytes: 2793914368
> > total extent tree bytes: 190398464
> > btree space waste bytes: 871743708
> > file data blocks allocated: 6907169177600
> >   referenced 1979268648960
> >   
> 
> 
> 



-- 
Regards,
Kai

Replies to list-only preferred.

chunk-tree.txt.gz
Description: application/gzip


device-tree.txt.gz
Description: application/gzip


btrfsck lowmem mode shows corruptions

2017-05-04 Thread Kai Krakow
Hello!

Since I saw a few kernel freezes lately (due to experimenting with
ck-sources) including some filesystem-related backtraces, I booted my
rescue system to check my btrfs filesystem.

Luckily, it showed no problems. It said, everything's fine. But I also
thought: Okay, let's try lowmem mode. And that showed a frighteningly
long list of extent corruptions and unreferenced chunks. Should I worry?

PS: The freezes seem to be related to bfq, switching to deadline solved
these.

Full log attached, here's an excerpt:

---8<---

checking extents
ERROR: chunk[256 4324327424) stripe 0 did not find the related dev extent
ERROR: chunk[256 4324327424) stripe 1 did not find the related dev extent
ERROR: chunk[256 4324327424) stripe 2 did not find the related dev extent
ERROR: chunk[256 7545552896) stripe 0 did not find the related dev extent
ERROR: chunk[256 7545552896) stripe 1 did not find the related dev extent
ERROR: chunk[256 7545552896) stripe 2 did not find the related dev extent
[...]
ERROR: device extent[1, 1094713344, 1073741824] did not find the related chunk
ERROR: device extent[1, 2168455168, 1073741824] did not find the related chunk
ERROR: device extent[1, 3242196992, 1073741824] did not find the related chunk
[...]
ERROR: device extent[2, 608854605824, 1073741824] did not find the related chunk
ERROR: device extent[2, 609928347648, 1073741824] did not find the related chunk
ERROR: device extent[2, 611002089472, 1073741824] did not find the related chunk
[...]
ERROR: device extent[3, 64433946624, 1073741824] did not find the related chunk
ERROR: device extent[3, 65507688448, 1073741824] did not find the related chunk
ERROR: device extent[3, 66581430272, 1073741824] did not find the related chunk
[...]
ERROR: data extent[96316809216 2097152] backref lost
ERROR: data extent[96316809216 2097152] backref lost
ERROR: data extent[96316809216 2097152] backref lost
ERROR: data extent[686074396672 13737984] backref lost
ERROR: data extent[686074396672 13737984] backref lost
ERROR: data extent[686074396672 13737984] backref lost
[...]
ERROR: errors found in extent allocation tree or chunk allocation
checking free space cache
checking fs roots
ERROR: errors found in fs roots
Checking filesystem on /dev/disk/by-label/system
UUID: bc201ce5-8f2b-4263-995a-6641e89d4c88
found 1960075935744 bytes used, error(s) found
total csum bytes: 1673537040
total tree bytes: 4899094528
total fs tree bytes: 2793914368
total extent tree bytes: 190398464
btree space waste bytes: 871743708
file data blocks allocated: 6907169177600
 referenced 1979268648960

-- 
Regards,
Kai

Replies to list-only preferred.

lowmem.txt.gz
Description: application/gzip


Re: Can I see what device was used to mount btrfs?

2017-05-02 Thread Kai Krakow
Am Tue, 2 May 2017 21:50:19 +0200
schrieb Goffredo Baroncelli :

> On 2017-05-02 20:49, Adam Borowski wrote:
> >> It could be some daemon that waits for btrfs to become complete.
> >> Do we have something?  
> > Such a daemon would also have to read the chunk tree.  
> 
> I don't think that a daemon is necessary. As proof of concept, in the
> past I developed a mount helper [1] which handled the mount of a
> btrfs filesystem: this handler first checks if the filesystem is a
> multivolume devices, if so it waits that all the devices are
> appeared. Finally mount the filesystem.
> 
> > It's not so simple -- such a btrfs device would have THREE states:
> > 
> > 1. not mountable yet (multi-device with not enough disks present)
> > 2. mountable ro / rw-degraded
> > 3. healthy  
> 
> My mount.btrfs could be "programmed" to wait a timeout, then it
> mounts the filesystem as degraded if not all devices are present.
> This is a very simple strategy, but this could be expanded.
> 
> I am inclined to think that the current approach doesn't fit well the
> btrfs requirements.  The roles and responsibilities are spread to too
> much layer (udev, systemd, mount)... I hoped that my helper could be
> adopted in order to concentrate all the responsibility to only one
> binary; this would reduce the interface number with the other
> subsystem (eg systemd, udev).
> 
> For example, it would be possible to implement a sane check that
> prevent to mount a btrfs filesystem if two devices exposes the same
> UUID... 

Ideally, the btrfs wouldn't even appear in /dev until it was assembled
by udev. But apparently that's not the case, and I think this is where
the problems come from. I wish btrfs member devices would not show up
in /dev as nodes that the mount command identifies as btrfs. Instead,
btrfs would expose (probably through udev) a device node
in /dev/btrfs/fs_identifier when it is ready.
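
For reference, systemd's udev rules already go partway in this
direction: member devices of a multi-device btrfs are marked as not
ready for systemd until the kernel reports the filesystem complete.
From memory, the shipped 64-btrfs.rules does roughly this (check the
copy your distro installs):

SUBSYSTEM!="block", GOTO="btrfs_end"
ACTION=="remove", GOTO="btrfs_end"
ENV{ID_FS_TYPE}!="btrfs", GOTO="btrfs_end"
# ask the kernel whether all devices of this fs have been seen yet
IMPORT{builtin}="btrfs ready $devnode"
# if not, keep systemd from considering the device ready for mounting
ENV{ID_BTRFS_READY}=="0", ENV{SYSTEMD_READY}="0"
LABEL="btrfs_end"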

Apparently, the core problem of how to handle degraded btrfs still
remains. Maybe it could be solved by adding more stages of btrfs nodes,
like /dev/btrfs-incomplete (for unusable btrfs), /dev/btrfs-degraded
(for btrfs still missing devices but at least one stripe of btrfs raid
available) and /dev/btrfs as the final stage. That way, a mount process
could wait for a while, and if the device doesn't appear, it tries the
degraded stage instead. If the fs is opened from the degraded dev node
stage, udev (or other processes) that scan for devices should stop
assembling the fs if they still do so.

bcache has a similar approach by hiding an fs within a protective
superblock. Unless bcache is set up, the fs won't show up in /dev, and
that fs won't be visible by other means. Btrfs should do something
similar and only show a single device node if assembled completely. The
component devices would have superblocks ignored by mount, and only the
final node would expose a virtual superblock and the compound device
after it. Of course, this makes things like compound device resizing
more complicated maybe even impossible.

If I'm not totally wrong, I think this is also how zfs exposes its
pools. You need user space tools to make the fs pools visible in the
tree. If zfs is incomplete, there's nothing to mount, and thus no race
condition. But I never tried zfs seriously, so I do not know.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-02 Thread Kai Krakow
Am Mon, 1 May 2017 22:56:06 -0600
schrieb Chris Murphy :

> On Mon, May 1, 2017 at 9:23 PM, Marc MERLIN  wrote:
> > Hi Chris,
> >
> > Thanks for the reply, much appreciated.
> >
> > On Mon, May 01, 2017 at 07:50:22PM -0600, Chris Murphy wrote:  
> >> What about btfs check (no repair), without and then also with
> >> --mode=lowmem?
> >>
> >> In theory I like the idea of a 24 hour rollback; but in normal
> >> usage Btrfs will eventually free up space containing stale and no
> >> longer necessary metadata. Like the chunk tree, it's always
> >> changing, so you get to a point, even with snapshots, that the old
> >> state of that tree is just - gone. A snapshot of an fs tree does
> >> not make the chunk tree frozen in time.  
> >
> > Right, of course, I was being way over optimistic here. I kind of
> > forgot that metadata wasn't COW, my bad.  
> 
> Well it is COW. But there's more to the file system than fs trees, and
> just because an fs tree gets snapshot doesn't mean all data is
> snapshot. So whether snapshot or not, there's metadata that becomes
> obsolete as the file system is updated and those areas get freed up
> and eventually overwritten.
> 
> 
> >  
> >> In any case, it's a big problem in my mind if no existing tools can
> >> fix a file system of this size. So before making anymore changes,
> >> make sure you have a btrfs-image somewhere, even if it's huge. The
> >> offline checker needs to be able to repair it, right now it's all
> >> we have for such a case.  
> >
> > The image will be huge, and take maybe 24H to make (last time it
> > took some silly amount of time like that), and honestly I'm not
> > sure how useful it'll be.
> > Outside of the kernel crashing if I do a btrfs balance, and
> > hopefully the crash report I gave is good enough, the state I'm in
> > is not btrfs' fault.
> >
> > If I can't roll back to a reasonably working state, with data loss
> > of a known quantity that I can recover from backup, I'll have to
> > destroy and filesystem and recover from scratch, which will take
> > multiple days. Since I can't wait too long before getting back to a
> > working state, I think I'm going to try btrfs check --repair after
> > a scrub to get a list of all the pathanmes/inodes that are known to
> > be damaged, and work from there.
> > Sounds reasonable?  
> 
> Yes.
> 
> 
> >
> > Also, how is --mode=lowmem being useful?  
> 
> Testing. lowmem is a different implementation, so it might find
> different things from the regular check.
> 
> 
> >
> > And for re-parenting a sub-subvolume, is that possible?
> > (I want to delete /sub1/ but I can't because I have /sub1/sub2
> > that's also a subvolume and I'm not sure how to re-parent sub2 to
> > somewhere else so that I can subvolume delete sub1)  
> 
> Well you can move sub2 out of sub1 just like a directory and then
> delete sub1. If it's read-only it can't be moved, but you can use
> btrfs property get/set ro true/false to temporarily make it not
> read-only, move it, then make it read-only again, and it's still fine
> to use with btrfs send receive.
> 
> 
> 
> 
> 
> >
> > In the meantime, a simple check without repair looks like this. It
> > will likely take many hours to complete:
> > gargamel:/var/local/space# btrfs check /dev/mapper/dshelf2
> > Checking filesystem on /dev/mapper/dshelf2
> > UUID: 03e9a50c-1ae6-4782-ab9c-5f310a98e653
> > checking extents
> > checksum verify failed on 3096461459456 found 0E6B7980 wanted
> > FBE5477A checksum verify failed on 3096461459456 found 0E6B7980
> > wanted FBE5477A checksum verify failed on 2899180224512 found
> > 7A6D427F wanted 7E899EE5 checksum verify failed on 2899180224512
> > found 7A6D427F wanted 7E899EE5 checksum verify failed on
> > 2899180224512 found ABBE39B0 wanted E0735D0E checksum verify failed
> > on 2899180224512 found 7A6D427F wanted 7E899EE5 bytenr mismatch,
> > want=2899180224512, have=3981076597540270796 checksum verify failed
> > on 1449488023552 found CECC36AF wanted 199FE6C5 checksum verify
> > failed on 1449488023552 found CECC36AF wanted 199FE6C5 checksum
> > verify failed on 1449544613888 found 895D691B wanted A0C64D2B
> > checksum verify failed on 1449544613888 found 895D691B wanted
> > A0C64D2B parent transid verify failed on 1671538819072 wanted
> > 293964 found 293902 parent transid verify failed on 1671538819072
> > wanted 293964 found 293902 checksum verify failed on 1671603781632
> > found 18BC28D6 wanted 372655A0 checksum verify failed on
> > 1671603781632 found 18BC28D6 wanted 372655A0 checksum verify failed
> > on 1759425052672 found 843B59F1 wanted F0FF7D00 checksum verify
> > failed on 1759425052672 found 843B59F1 wanted F0FF7D00 checksum
> > verify failed on 2182657212416 found CD8EFC0C wanted 70847071
> > checksum verify failed on 2182657212416 found CD8EFC0C wanted
> > 70847071 checksum verify failed on 2898779357184 found 96395131
> > wanted 433D6E09 checksum verify failed on 2898779357184 found
> > 96395131 wanted 

Re: 4.11 relocate crash, null pointer + rolling back a filesystem by X hours?

2017-05-02 Thread Kai Krakow
Am Tue, 2 May 2017 05:01:02 + (UTC)
schrieb Duncan <1i5t5.dun...@cox.net>:

> Of course on-list I'm somewhat known for my arguments propounding the 
> notion that any filesystem that's too big to be practically
> maintained (including time necessary to restore from backups, should
> that be necessary for whatever reason) is... too big... and should
> ideally be broken along logical and functional boundaries into a
> number of individual smaller filesystems until such point as each one
> is found to be practically maintainable within a reasonably practical
> time frame. Don't put all the eggs in one basket, and when the bottom
> of one of those baskets inevitably falls out, most of your eggs will
> be safe in other baskets. =:^)

Hehe... Yes, you're a fan of small filesystems. I'm more from the
opposite camp, preferring one big filesystem to not mess around with
size constraints of small filesystems fighting for the same volume
space. It also gives better chances for data locality, instead of
related data ending up in totally different parts of the disk across
separate fs mounts, and it can reduce head movement. Of course, much of
this is not true if you
use different devices per filesystem, or use SSDs, or SAN where you
have no real control over the physical placement of image stripes
anyway. But well...

In an ideal world, subvolumes of btrfs would be totally independent of
each other, just only share the same volume and dynamically allocating
chunks of space from it. If one is broken, it is simply not usable and
it should be destroyable. A garbage collector would grab the leftover
chunks from the subvolume and free them, and you could recreate this
subvolume from backup. In reality, shared extents will cross subvolume
borders, so it is probably not how things could work anytime in the near
or far future.

This idea is more like having thinly provisioned LVM volumes which
allocate space as the filesystems on top need them, much like doing
thinly provisioned images with a VM host system. The problem here is
that, unlike with subvolumes, those chunks of space can never be given
back to the host because it doesn't know whether they are still in use.
Of course, there are implementations which allow thinning the images by
passing through TRIM from the guest to the host (or by other means of
communication between host and guest), but that usually doesn't give
good performance, if it's supported at all.

I once tried to exploit this in VirtualBox and hoped it would translate
guest discards into hole-punching requests on the host, and it's even
documented to work that way... But (a) it was horribly slow, and (b) it
was incredibly unstable to the point of being useless. OTOH, it's not
announced as a stable feature and has to be enabled by manually editing
the XML config files.
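
For what it's worth, newer VirtualBox releases seem to expose the same
switches through VBoxManage, so the XML editing may no longer be
needed; a sketch (VM name, controller name and port are examples, and
the medium must be a VDI; whether it then behaves well is another
matter, as described above):

VBoxManage storageattach "MyVM" --storagectl "SATA" --port 0 --device 0 \
  --type hdd --medium guest-disk.vdi --nonrotational on --discard on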

But I still like the idea: Is it possible to make btrfs still work if
one subvolume gets corrupted? Of course it should have ways of telling
the user which other subvolumes are interconnected through shared
extents, so those would also be discarded upon corruption cleanup - at
least if those extents can no longer be made sense of. Since
corruption is an issue mostly of subvolumes being written to, snapshots
should be mostly safe.

Such a feature would also only make sense if btrfs had an online repair
tool. BTW, are there plans for having an online repair tool in the
future? Maybe one that only scans and fixes parts of the filesystem
(for obvious performance reasons, wrt Duncan's idea of handling
filesystems), i.e. those parts where the kernel discovered
corruptions? If I could then just delete and restore affected files,
this would be even better than having independent subvolumes like above.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: backing up a collection of snapshot subvolumes

2017-04-25 Thread Kai Krakow
Am Tue, 25 Apr 2017 00:02:13 -0400
schrieb "J. Hart" :

> I have a remote machine with a filesystem for which I periodically
> take incremental snapshots for historical reasons.  These snapshots
> are stored in an archival filesystem tree on a file server.  Older
> snapshots are removed and newer ones added on a rotational basis.  I
> need to be able to backup this archive by syncing it with a set of
> backup drives. Due to the size, I need to back it up incrementally
> rather than sending the entire content each time.  Due to the
> snapshot rotation, I need to be able to update the state of the
> archive backup filesystem as a whole, in much the same manner that
> rsync handles file trees.
> 
> It seems that I cannot use "btrfs send", as the archive directory 
> contains the snapshots as subvolumes.
> 
> I cannot use rsync as it treats subvolumes as simple directories, and 
> does not preserve subvolume attributes.  Rsync also does not support 
> reflinks, so the snapshot directory content will no longer be
> reflinked to other snapshots on the archive backup.  I cannot use
> hard links in the incrementals as hard links do not cross subvolume
> boundaries.
> 
> Thoughts anyone ?

If this is for archival purposes only and storage efficiency and speed
are your primary concerns, try borgbackup.

Borgbackup deploys its own deduplication and doesn't rely on btrfs
snapshot capabilities. You can store your archives wherever you want
(even on non-btrfs), and it won't store any data blocks twice.

Upon restore, you could simply recreate the snapshot/subvolume from a
similar tree and then rsync a tree restored from borgbackup back to
this snapshot to get back a state with the benefits of btrfs snapshots.

Borg also has an adapter for fuse to get a mounted view into the
archives, tho, it is slow and probably takes a lot of RAM. But it is
good enough for single file lookups and navigation.
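
A minimal sketch of that fuse view (repository path and archive name
are just examples):

borg mount /backup/repo::host1-2017-04-20 /mnt/restore
# browse /mnt/restore, copy out single files, then:
borg umount /mnt/restore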

-- 
Regards,
Kai

Replies to list-only preferred.



Re: About free space fragmentation, metadata write amplification and (no)ssd

2017-04-11 Thread Kai Krakow
Am Tue, 11 Apr 2017 07:33:41 -0400
schrieb "Austin S. Hemmelgarn" :

> >> FWIW, it is possible to use a udev rule to change the rotational
> >> flag from userspace.  The kernel's selection algorithm for
> >> determining is is somewhat sub-optimal (essentially, if it's not a
> >> local disk that can be proven to be rotational, it assumes it's
> >> non-rotational), so re-selecting this ends up being somewhat
> >> important in certain cases (virtual machines for example).  
> >
> > Just putting nossd in fstab seems convenient enough.  
> While that does work, there are other pieces of software that change 
> behavior based on the value of the rotational flag, and likewise make 
> misguided assumptions about what it means.

Something similar happens when you put btrfs on bcache. It now assumes
it is on SSD but in reality it isn't. Thus, I also deployed udev rules
to force back nossd behavior.
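
A minimal sketch of such a rule, assuming the bcache backing device is
spinning rust (drop it into e.g. /etc/udev/rules.d/99-bcache-rotational.rules,
file name is an example):

ACTION=="add|change", KERNEL=="bcache*", ATTR{queue/rotational}="1"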

But maybe, in the bcache case using "nossd" instead would make more
sense. Any ideas on this?


-- 
Regards,
Kai

Replies to list-only preferred.



Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-10 Thread Kai Krakow
Am Mon, 10 Apr 2017 15:43:57 -0400
schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>:

> On 2017-04-10 14:18, Kai Krakow wrote:
> > Am Mon, 10 Apr 2017 13:13:39 -0400
> > schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
> >  
> >> On 2017-04-10 12:54, Kai Krakow wrote:  
>  [...]  
>  [...]  
> >>  [...]
> >>  [...]  
>  [...]  
> >>  [...]
> >>  [...]  
>  [...]  
>  [...]  
> >> The command-line also rejects a number of perfectly legitimate
> >> arguments that BTRFS does understand too though, so that's not much
> >> of a test.  
> >
> > Which are those? I didn't encounter any...  
> I'm not sure there are any anymore, but I know that a handful (mostly 
> really uncommon ones) used to (and BTRFS is not alone in this
> respect, some of the more esoteric ext4 options aren't accepted on
> the kernel command-line either).  I know at a minimum at some point
> in the past alloc-start, check_int, and inode_cache did not work from
> the kernel command-line.

The post from Janos explains why: The difference is with the mount
handler, depending on whether you use initrd or not.

> >> I've just finished some quick testing though, and it looks
> >> like you're right, BTRFS does not support this, which means I now
> >> need to figure out what the hell was causing the IOPS counters in
> >> collectd to change in rough correlation  with remounting
> >> (especially since it appears to happen mostly independent of the
> >> options being changed).  
> >
> > I think that noatime (which I remember you also used?), lazytime,
> > and relatime are mutually exclusive: they all handle the inode
> > updates. Maybe that is the effect you see?  
> They're not exactly exclusive.  The lazytime option will prevent
> changes to the mtime or atime fields in a file from forcing inode
> write-out for up to 24 hours (if the inode would be written out for
> some other reason (such as a file-size change or the inode being
> evicted from the cache), then the timestamps will be too), but it
> does not change the value of the timestamps.  So if you have lazytime
> enabled and use touch to update the mtime on anotherwise idle file,
> the mtime will still be correct as far as userspace is concerned, as
> long as you don't crash before the update hits the disk (but
> userspace will only see the discrepancy _after_ the crash).

Yes, I know all this. But I don't see why you still want noatime or
relatime if you use lazytime, except for super-optimizing. Lazytime
gives you POSIX conformity for a problem that the other options only
tried to solve.

> > Well, relatime is mostly the same thus not perfectly resembling the
> > POSIX standard. I think the only software that relies on atime is
> > mutt...  
> This very much depends on what you're doing.  If you have a WORM 
> workload, then yeah, it's pretty much the same.  If however you have 
> something like a database workload where a specific set of files get 
> internally rewritten regularly, then it actually has a measurable
> impact.

I think "impact" is a whole different story. I'm on your side here.


-- 
Regards,
Kai

Replies to list-only preferred.



Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-10 Thread Kai Krakow
Am Tue, 11 Apr 2017 01:45:32 +0200
schrieb "Janos Toth F." :

> >> The command-line also rejects a number of perfectly legitimate
> >> arguments that BTRFS does understand too though, so that's not much
> >> of a test.  
> >
> > Which are those? I didn't encounter any...  
> 
> I think this bug still stands unresolved (for 3+ years, probably
> because most people use init-rd/fs without ever considering to omit it
> in case they don't really need it at all):
> Bug 61601 - rootflags=noatime causes kernel panic when booting
> without initrd. The last time I tried it applied to Btrfs as well:
> https://bugzilla.kernel.org/show_bug.cgi?id=61601#c18

Ah okay, so the difference is with the mount handler. I can only use
initrd here because I have multi-device btrfs ontop of bcache as rootfs.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-10 Thread Kai Krakow
Am Mon, 10 Apr 2017 13:13:39 -0400
schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>:

> On 2017-04-10 12:54, Kai Krakow wrote:
> > Am Mon, 10 Apr 2017 18:44:44 +0200
> > schrieb Kai Krakow <hurikha...@gmail.com>:
> >  
> >> Am Mon, 10 Apr 2017 08:51:38 -0400
> >> schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
> >>  
>  [...]  
>  [...]  
> >>  [...]  
>  [...]  
>  [...]  
> >>
> >> Did you put it in /etc/fstab only for the rootfs? If yes, it
> >> probably has no effect. You would need to give it as rootflags on
> >> the kernel cmdline.  
> >
> > I did a "fgrep lazytime /usr/src/linux -ir" and it reveals only ext4
> > and f2fs know the flag. Kernel 4.10.
> >
> > So probably you're seeing a placebo effect. If you put lazytime for
> > rootfs just only into fstab, it won't have an effect because on
> > initial mount this file cannot be opened (for obvious reasons), and
> > on remount, btrfs seems to happily accept lazytime but it has no
> > effect. It won't show up in /proc/mounts. Try using it in rootflags
> > kernel cmdline and you should see that the kernel won't accept the
> > flag lazytime. 
> The command-line also rejects a number of perfectly legitimate
> arguments that BTRFS does understand too though, so that's not much
> of a test.

Which are those? I didn't encounter any...

> I've just finished some quick testing though, and it looks
> like you're right, BTRFS does not support this, which means I now
> need to figure out what the hell was causing the IOPS counters in
> collectd to change in rough correlation  with remounting (especially
> since it appears to happen mostly independent of the options being
> changed).

I think that noatime (which I remember you also used?), lazytime, and
relatime are mutually exclusive: they all handle the inode updates.
Maybe that is the effect you see?

> This is somewhat disappointing though, as supporting this would
> probably help with the write-amplification issues inherent in COW
> filesystems. --

Well, relatime is mostly the same, thus not perfectly conforming to the
POSIX standard. I think the only software that relies on atime is
mutt...

-- 
Regards,
Kai

Replies to list-only preferred.



Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-10 Thread Kai Krakow
Am Mon, 10 Apr 2017 18:44:44 +0200
schrieb Kai Krakow <hurikha...@gmail.com>:

> Am Mon, 10 Apr 2017 08:51:38 -0400
> schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
> 
> > On 2017-04-10 08:45, Kai Krakow wrote:  
> > > Am Mon, 10 Apr 2017 08:39:23 -0400
> > > schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
> > >
>  [...]  
> > >
> > > Does btrfs really support lazytime now?
> > >
> > It appears to, I do see fewer writes with it than without it.  At
> > the very least, if it doesn't, then nothing complains about it.  
> 
> Did you put it in /etc/fstab only for the rootfs? If yes, it probably
> has no effect. You would need to give it as rootflags on the kernel
> cmdline.

I did a "fgrep lazytime /usr/src/linux -ir" and it reveals only ext4
and f2fs know the flag. Kernel 4.10.

So probably you're seeing a placebo effect. If you put lazytime for
rootfs just only into fstab, it won't have an effect because on initial
mount this file cannot be opened (for obvious reasons), and on remount,
btrfs seems to happily accept lazytime but it has no effect. It won't
show up in /proc/mounts. Try using it in rootflags kernel cmdline and
you should see that the kernel won't accept the flag lazytime.
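
A quick way to see what actually took effect after a remount (a
sketch; on btrfs of this vintage the flag simply won't show up):

mount -o remount,lazytime /
grep ' / ' /proc/self/mounts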

-- 
Regards,
Kai

Replies to list-only preferred.



Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-10 Thread Kai Krakow
Am Mon, 10 Apr 2017 08:51:38 -0400
schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>:

> On 2017-04-10 08:45, Kai Krakow wrote:
> > Am Mon, 10 Apr 2017 08:39:23 -0400
> > schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>:
> >  
> >> They've been running BTRFS
> >> with LZO compression, the SSD allocator, atime disabled, and mtime
> >> updates deferred (lazytime mount option) the whole time, so it may
> >> be a slightly different use case than the OP from this thread.  
> >
> > Does btrfs really support lazytime now?
> >  
> It appears to, I do see fewer writes with it than without it.  At the 
> very least, if it doesn't, then nothing complains about it.

Did you put it in /etc/fstab only for the rootfs? If yes, it probably
has no effect. You would need to give it as rootflags on the kernel
cmdline.


-- 
Regards,
Kai

Replies to list-only preferred.



Re: btrfs filesystem keeps allocating new chunks for no apparent reason

2017-04-10 Thread Kai Krakow
Am Mon, 10 Apr 2017 08:39:23 -0400
schrieb "Austin S. Hemmelgarn" :

> They've been running BTRFS 
> with LZO compression, the SSD allocator, atime disabled, and mtime 
> updates deferred (lazytime mount option) the whole time, so it may be
> a slightly different use case than the OP from this thread.

Does btrfs really support lazytime now?

-- 
Regards,
Kai

Replies to list-only preferred.



Re: About free space fragmentation, metadata write amplification and (no)ssd

2017-04-08 Thread Kai Krakow
Am Sun, 9 Apr 2017 02:21:19 +0200
schrieb Hans van Kranenburg :

> On 04/08/2017 11:55 PM, Peter Grandi wrote:
> >> [ ... ] This post is way too long [ ... ]  
> > 
> > Many thanks for your report, it is really useful, especially the
> > details.  
> 
> Thanks!
> 
> >> [ ... ] using rsync with --link-dest to btrfs while still
> >> using rsync, but with btrfs subvolumes and snapshots [1]. [
> >> ... ]  Currently there's ~35TiB of data present on the example
> >> filesystem, with a total of just a bit more than 9
> >> subvolumes, in groups of 32 snapshots per remote host (daily
> >> for 14 days, weekly for 3 months, montly for a year), so
> >> that's about 2800 'groups' of them. Inside are millions and
> >> millions and millions of files. And the best part is... it
> >> just works. [ ... ]  
> > 
> > That kind of arrangement, with a single large pool and very many
> > many files and many subdirectories is a worst case scanario for
> > any filesystem type, so it is amazing-ish that it works well so
> > far, especially with 90,000 subvolumes.  
> 
> Yes, this is one of the reasons for this post. Instead of only hearing
> about problems all day on the mailing list and IRC, we need some more
> reports of success.
> 
> The fundamental functionality of doing the cow snapshots, moo, and the
> related subvolume removal on filesystem trees is so awesome. I have no
> idea how we would have been able to continue this type of backup
> system when btrfs was not available. Hardlinks and rm -rf was a total
> dead end road.

I'm absolutely no expert with arrays of the sizes that you use, but I
also stopped using the hardlink-and-remove approach: it was slow to
manage (rsync is slow for it, rm is slow for it) and it was error-prone
(due to the nature of hardlinks). I used btrfs with snapshots and rsync
for a while in my personal testbed, and experienced great slowness over
time: rsync became slower and slower, a full backup took 4 hours with
huge %IO usage, maintaining the backup history was also slow (removing
backups took a while), and rebalancing was needed due to huge amounts
of wasted space. I used rsync with --inplace and --no-whole-file to
waste as little space as possible.

What I first found was an adaptive rebalancer script which I still use
for the main filesystem:

https://www.spinics.net/lists/linux-btrfs/msg52076.html
(thanks to Lionel)

It works pretty well and doesn't have such a big IO overhead, thanks to
the adaptive multi-pass approach.
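
For reference, the manual equivalent of such a multi-pass approach is a
filtered balance with increasing usage thresholds, e.g. (mount point is
an example):

btrfs balance start -dusage=5 -musage=5 /mnt/btrfs-pool
btrfs balance start -dusage=25 -musage=25 /mnt/btrfs-pool
btrfs balance start -dusage=50 -musage=50 /mnt/btrfs-pool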

But it still did not help the slowness. I now tested borgbackup for a
while, and it's fast: It does the same job in 30 minutes or less
instead of 4 hours, and it has much better backup density and comes
with easy history maintenance, too. I can now store much more backup
history in the same space. Full restore time is about the same as
copying back with rsync.
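
A minimal sketch of such a borgbackup cycle (repository path, retention
counts and compression are just examples):

borg init --encryption=repokey /backup/borg-repo
borg create --compression lz4 --stats /backup/borg-repo::home-{now} /home
borg prune --keep-daily 14 --keep-weekly 12 --keep-monthly 12 /backup/borg-repo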

For a professional deployment I'm planning to use XFS as the storage
backend and borgbackup as the backup frontend, because my findings
showed that XFS allocation groups span the disk array diagonally. That
is, if you use a simple JBOD of your iSCSI LUNs, XFS will spread writes
across all the LUNs without you needing to do normal RAID striping,
which should eliminate the need to migrate when adding more LUNs, and
the underlying storage layer on the NetApp side will probably already
do RAID for redundancy anyway. Just feed more space to XFS using LVM.
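
A sketch of what "feed more space to XFS using LVM" could look like
(VG/LV names, the mount point and the LUN device names are made up):

pvcreate /dev/mapper/lun0 /dev/mapper/lun1
vgcreate backupvg /dev/mapper/lun0 /dev/mapper/lun1
lvcreate -l 100%FREE -n backup backupvg
mkfs.xfs /dev/backupvg/backup
# later, when the storage side presents another LUN:
pvcreate /dev/mapper/lun2
vgextend backupvg /dev/mapper/lun2
lvextend -l +100%FREE /dev/backupvg/backup
xfs_growfs /backup    # xfs_growfs takes the mount point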

Borgbackup can do everything that btrfs can do for you but is
targeting the job of doing backups only: it can compress, deduplicate,
encrypt and do history thinning. The only downside I found is that only
one backup job at a time can access the backup repository. So you'd
have to use one backup repo per source machine. That way you cannot
benefit from deduplication across multiple sources. But I'm sure NetApp
can do that. OTOH, maybe backup duration drops to a point that you
could serialize the backup of some machines.

> OTOH, what we do with btrfs (taking a bulldozer and drive across all
> the boundaries of sanity according to all recommendations and
> warnings) on this scale of individual remotes is something that the
> NetApp people should totally be jealous of. Backups management
> (manual create, restore etc on top of the nightlies) is self service
> functionality for our customers, and being able to implement the
> magic behind the APIs with just a few commands like a btrfs sub snap
> and some rsync gives the right amount of freedom and flexibility we
> need.

This is something I'm planning here, too: Self-service backups, do a
btrfs snap, but then use borgbackup for archiving purposes.

BTW: I think the 2M size comes from the assumption that SSDs manage
their storage in groups of erase block sizes. The optimization here
would be that btrfs deallocates (and maybe trims) only whole erase
blocks which typically are 2M. This has a performance benefit. But if
your underlying storage layer is RAID anyways, this no longer maps

Re: Shrinking a device - performance?

2017-04-01 Thread Kai Krakow
Am Mon, 27 Mar 2017 20:06:46 +0500
schrieb Roman Mamedov :

> On Mon, 27 Mar 2017 16:49:47 +0200
> Christian Theune  wrote:
> 
> > Also: the idea of migrating on btrfs also has its downside - the
> > performance of “mkdir” and “fsync” is abysmal at the moment. I’m
> > waiting for the current shrinking job to finish but this is likely
> > limited to the “find free space” algorithm. We’re talking about a
> > few megabytes converted per second. Sigh.  
> 
> Btw since this is all on LVM already, you could set up lvmcache with
> a small SSD-based cache volume. Even some old 60GB SSD would work
> wonders for performance, and with the cache policy of "writethrough"
> you don't have to worry about its reliability (much).

That's maybe the best recommendation to speed things up. I'm using
bcache here for the same reasons (speeding up random workloads) and it
works wonders.

Tho, for such big storage I'd recommend a bigger and brand-new SSD.
Bigger SSDs tend to last much longer. Just don't use the whole of it,
to allow for better wear leveling, and you'll end up with a setup that
can serve the system well beyond the migration period.
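
For reference, a minimal lvmcache sketch along the lines Roman
suggested (VG/LV names and the SSD partition are assumptions; the SSD
PV must already be part of the VG):

lvcreate --type cache-pool -L 50G -n cachepool vg0 /dev/sdX1
lvconvert --type cache --cachepool vg0/cachepool --cachemode writethrough vg0/data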

-- 
Regards,
Kai

Replies to list-only preferred.




Re: backing up a file server with many subvolumes

2017-04-01 Thread Kai Krakow
Am Mon, 27 Mar 2017 08:57:17 +0300
schrieb Marat Khalili :

> Just some consideration, since I've faced similar but no exactly same 
> problem: use rsync, but create snapshots on target machine. Blind
> rsync will destroy deduplication of your snapshots and take huge
> amount of storage, so it's not a solution. But you can rsync --inline
> your snapshots in chronological order to some folder and re-take
> snapshots of that folder, thus recreating your snapshots structure on
> target. Obviously, it can/should be automated.

I think it's --inplace and --no-whole-file...

Apparently, rsync cannot detect moved files which was a big deal for me
regarding deduplication, so I found another solution which is even
faster. See my other reply.
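
For reference, the staging-plus-snapshot cycle Marat describes would
look roughly like this with the corrected options (paths and the
date-based naming are just examples; "current" is a plain subvolume
acting as the staging area):

rsync -aHAX --inplace --no-whole-file --delete /source/ /backup/host1/current/
btrfs subvolume snapshot -r /backup/host1/current /backup/host1/$(date +%Y-%m-%d)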

> On 26/03/17 06:00, J. Hart wrote:
> > I have a Btrfs filesystem on a backup server.  This filesystem has
> > a directory to hold backups for filesystems from remote machines.
> > In this directory is a subdirectory for each machine.  Under each
> > machine subdirectory is one directory for each filesystem
> > (ex /boot, /home, etc) on that machine.  In each filesystem
> > subdirectory are incremental snapshot subvolumes for that
> > filesystem.  The scheme is something like this:
> >
> > /backup///
> >
> > I'd like to try to back up (duplicate) the file server filesystem 
> > containing these snapshot subvolumes for each remote machine.  The 
> > problem is that I don't think I can use send/receive to do this. 
> > "Btrfs send" requires "read-only" snapshots, and snapshots are not 
> > recursive as yet.  I think there are too many subvolumes which
> > change too often to make doing this without recursion practical.
> >
> > Any thoughts would be most appreciated.
> >
> > J. Hart
> >
> 
> 



-- 
Regards,
Kai

Replies to list-only preferred.



Re: backing up a file server with many subvolumes

2017-04-01 Thread Kai Krakow
Am Mon, 27 Mar 2017 07:53:17 -0400
schrieb "Austin S. Hemmelgarn" :

> > I'd like to try to back up (duplicate) the file server filesystem
> > containing these snapshot subvolumes for each remote machine.  The
> > problem is that I don't think I can use send/receive to do this.
> > "Btrfs send" requires "read-only" snapshots, and snapshots are not
> > recursive as yet.  I think there are too many subvolumes which
> > change too often to make doing this without recursion practical.
> >
> > Any thoughts would be most appreciated.  
> In general, I would tend to agree with everyone else so far if you
> have to keep your current setup.  Use rsync with the --inplace option
> to send data to a staging location, then snapshot that staging
> location to do the actual backup.
> 
> Now, that said, I could probably give some more specific advice if I
> had a bit more info on how you're actually storing the backups.
> There are three general ways you can do this with BTRFS and
> subvolumes: 1. Send/receive of snapshots from the system being backed
> up. 2. Use some other software to transfer the data into a staging
> location on the backup server, then snapshot that.
> 3. Use some other software to transfer the data, and have it handle 
> snapshots instead of using BTRFS, possibly having it create
> subvolumes instead of directories at the top level for each system.

If you decide for (3), I can recommend borgbackup. It allows
variable-block-size deduplication across all backup sources, tho to get
the full potential, your backups can only be done serially, not in
parallel. Borgbackup cannot access the same repository with two
processes in parallel, and deduplication is only per repository.

Another recommendation for backups is the 3-2-1 rule:

  * have at least 3 different copies of your data (that means, your
original data, the backup copy, and another backup copy, separated
in a way they cannot fail for the same reason)
  * use at least 2 different media (that also means: don't backup
btrfs to btrfs, and/or use 2 different backup techniques)
  * keep at least 1 external copy (maybe rsync to a remote location)

The 3 copy rule can be deployed by using different physical locations,
different device types, different media, and/or different backup
programs. So it's kind of entangled with the 2 and 1 rule.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: Confusion about snapshots containers

2017-03-31 Thread Kai Krakow
Am Wed, 29 Mar 2017 16:27:30 -0500
schrieb Tim Cuthbertson :

> I have recently switched from multiple partitions with multiple
> btrfs's to a flat layout. I will try to keep my question concise.
> 
> I am confused as to whether a snapshots container should be a normal
> directory or a mountable subvolume. I do not understand how it can be
> a normal directory while being at the same level as, for example, a
> rootfs subvolume. This is with the understanding that the rootfs is
> NOT at the btrfs top level.
> 
> Which should it be, a normal directory or a mountable subvolume
> directly under btrfs top level? If either way can work, what are the
> pros and cons of each?

I think there is no exact standard you could follow. Many distributions
seem to follow the convention of prefixing subvolumes with "@" if they
are meant to be mounted. However, I'm not doing so.

Generally speaking, subvolumes organize your volume into logical
containers which make sense to be snapshotted on their own. Snapshots
won't propagate to sub-subvolumes; it is important to keep that in
mind while designing your structure.

I'm using it like this:

In subvol=0 I have the following subvolumes:

/* - contains distribution specific file systems
/home - contains home directories
/snapshots - contains snapshots I want to keep
/other
  - misc stuff, i.e. a dump of the subvol structure in a txt
  - a copy of my restore script
  - some other supporting docs for restore
  - this subvolume is kept in sync with my backup volume

This means: If I mount one of the rootfs, my home will not be part of
this mount automatically because that subvolume is out of scope of the
rootfs.

Now I have the following subvolumes below these:

/gentoo/rootfs - rootfs of my main distribution
  Note 1: Everything below (except subvolumes) should be maintained
  by the package manager.
  Note 2: currently I installed no other distributions
  Note 3: I could have called it main-system-rootfs

/gentoo/usr
  - actually not a subvolume but a directory for volumes shareable with
other distribution instances

/gentoo/usr/portage - portage, shareable by other gentoo instances
/gentoo/usr/src - the gentoo linux kernel sources, shareable

The following are put below /gentoo/rootfs so they not need to be
mounted separately:

/gentoo/rootfs/var/log
  - log volume because I don't want it to be snapshotted
/gentoo/rootfs/var/tmp
  - tmp volume because it makes no sense to be snapshotted
/gentoo/rootfs/var/lib/machines
  - subvolume for keeping nspawn containers
/gentoo/rootfs/var/lib/machines/*
  - different machines cloned from each other
/gentoo/rootfs/usr/local
  - non-package manager stuff

/home/myuser - my user home
/home/myuser/.VirtualBox
  - VirtualBox machines because I want them snapshotted separately

/etc/fstab now only mounts subvolumes outside of the scope
of /gentoo/rootfs:

LABEL=system /home btrfs compress=lzo,subvol=home,noatime
LABEL=system /usr/portage btrfs 
noauto,compress=lzo,subvol=gentoo/usr/portage,noatime,x-systemd.automount
LABEL=system /usr/src btrfs 
noauto,compress=lzo,subvol=gentoo/usr/src,noatime,x-systemd.automount

Additionally, I mount the subvol=0 for two special purposes:

LABEL=system /mnt/btrfs-pool btrfs 
noauto,compress=lzo,subvolid=0,x-systemd.automount,noatime

First: for managing all the subvolumes and having an untampered view
(without tmpfs or special-purpose mounts) of the volumes.

Second: To take a clean backup of the whole system.

Now, I can give the bootloader subvol=gentoo/rootfs to select which
system to boot (or make it the default subvolume).
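
Making it the default looks like this, as a sketch (the subvolume ID
257 is just an example, look it up first):

btrfs subvolume list /mnt/btrfs-pool | grep gentoo/rootfs
btrfs subvolume set-default 257 /mnt/btrfs-pool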

Maybe you get the idea and find that idea helpful.

PS: It can make sense to have var/lib/machines outside of the rootfs
scope if you want to share it with other distributions.

-- 
Regards,
Kai

Replies to list-only preferred.



Re: Home storage with btrfs

2017-03-15 Thread Kai Krakow
Am Wed, 15 Mar 2017 23:26:32 +0100
schrieb Kai Krakow <hurikha...@gmail.com>:

> Well, bugs can hit you with every filesystem. Nothing as complex as a

Meh... I fooled myself. Find the mistake... ;-)

SPOILER:

"Nothing" should be "something".

-- 
Regards,
Kai

Replies to list-only preferred.



Re: Home storage with btrfs

2017-03-15 Thread Kai Krakow
Am Wed, 15 Mar 2017 23:41:41 +0100
schrieb Kai Krakow <hurikha...@gmail.com>:

> Am Wed, 15 Mar 2017 23:26:32 +0100
> schrieb Kai Krakow <hurikha...@gmail.com>:
> 
> > Well, bugs can hit you with every filesystem. Nothing as complex as
> > a  
> 
> Meh... I fooled myself. Find the mistake... ;-)
> 
> SPOILER:
> 
> "Nothing" should be "something".

*doublefacepalm*

Please forget what I wrote. The original sentence is correct.

I should get some coffee or go to bed. :-\

-- 
Regards,
Kai

Replies to list-only preferred.



Re: Home storage with btrfs

2017-03-15 Thread Kai Krakow
Am Wed, 15 Mar 2017 07:55:51 + (UTC)
schrieb Duncan <1i5t5.dun...@cox.net>:

> Hérikz Nawarro posted on Mon, 13 Mar 2017 08:29:32 -0300 as excerpted:
> 
> > Today is safe to use btrfs for home storage? No raid, just secure
> > storage for some files and create snapshots from it.  
> 
> 
> I'll echo the others... but with emphasis on a few caveats the others 
> mentioned but didn't give the emphasis I thought they deserved:
> 
> 1) Btrfs is, as I repeatedly put it in post after post, "stabilizing,
> but not yet fully stable and mature."  In general, that means it's
> likely to work quite or even very well for you (as it has done for
> us) if you don't try the too unusual or get too cocky, but get too
> close to the edge and you just might find yourself over that edge.
> Don't worry too much, tho, those edges are clearly marked if you're
> paying attention, and just by asking here, you're already paying way
> more attention than too many we see here... /after/ they've found
> themselves over the edge.  That's a _very_ good sign. =:^)

Well, bugs can hit you with every filesystem. Nothing as complex as a
file system can ever be proven bug free (except FAT maybe). But as a
general-purpose-no-fancy-features-needed FS, btrfs should be on par
with other FS these days.

> 2) "Stabilizing, not fully stable and mature", means even more than
> ever, if you value your data more than the time, hassle and resources
> necessary to have backups, you HAVE them, tested and available for
> practical use should it be necessary.

This is totally not dependent on "stabilizing, not fully stable and
mature". If you data matters to you, do backups. It's that simple. If
you don't do backups, your data isn't important - by definition.

> Of course any sysadmin (and that's what you are for at least your own 
> systems if you're making this choice) worth the name will tell you
> the value of the data is really defined by the number of backups it
> has, not by any arbitrary claims to value absent those backups.  No
> backups, you simply didn't value the data enough to have them,
> whatever claims of value you might otherwise try to make.  Backups,
> you /did/ value the data.

Yes. :-)

> And of course the corollary to that first sysadmin's rule of backups
> is that an untested as restorable backup isn't yet a backup, only a 
> potential backup, because the job isn't finished and it can't be
> properly called a backup until you know you can restore from it if
> necessary.

Even more true. :-)

> And lest anyone get the wrong idea, a snapshot is /not/ a backup for 
> purposes of the above rules.  It's on the same filesystem and
> hardware media and if that goes down... you've lost it just the
> same.  And since that filesystem is still stabilizing, you really
> must be even more prepared for it to go down, even if the chances are
> still quite good it won't.

A good backup should follow the 3-2-1 rule: Have 3 different backup
copies, 2 different media, and store at least 1 copy external/off-site.

For customers, we usually deploy a strategy like this for Windows
machines: do one local backup using Windows Image Backup to a local
NAS from inside the VM, use different software to do image backups
from outside of the VM to the local NAS, and mirror the "outside
image" to a remote location (cloud storage). And keep some backup
history: overwriting the one existing backup with a new one won't buy
you anything. All involved software should be able to do efficient
delta backups, otherwise mirroring offsite may be no fun.

On Linux, I'm using borgbackup and rsync to get something similar:
using borgbackup to local storage, and syncing it offsite with rsync,
gives me the "2 media" and "1 off-site" parts of the rule. You can get
the third copy by using rsync to also mirror the local FS off the
machine, but that's usually overkill for personal backups. Instead, I
only keep a third copy of the most valuable data like photos, dev
stuff, documents, etc.

BTW: For me, different media also means different FS types. So a bug in
one FS wouldn't easily hit the other.

[snip]

> 4) Keep the number of snapshots per subvolume under tight control as 
> already suggested.  A few hundred, NOT a few thousand.  Easy enough
> if you do those snapshots manually, but easy enough to get thousands
> if you're not paying attention to thin out the old ones and using an 
> automated tool such as snapper.

Borgbackup is so fast and storage-efficient that you could easily run
it multiple times per day. That in turn means I don't need to rely on
regular snapshots to undo mistakes. I only take snapshots before doing
some knowingly risky stuff, to have fast recovery. But that's all I
need snapshots for (unless you are doing more advanced stuff like
container cloning, VM instance spawning, ...).
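
As a sketch, such a throwaway safety snapshot is just (paths are
examples; the target must live on the same btrfs):

btrfs subvolume snapshot / /snapshots/rootfs-before-risky-change
# ...and once everything still works:
btrfs subvolume delete /snapshots/rootfs-before-risky-change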

> 5) Stay away from quotas.  Either you need the feature and thus need
> a more mature filesystem where it's actually stable and does what it
> says on the label, or you don't, in which 

Re: Hard crash on 4.9.5

2017-03-13 Thread Kai Krakow
Am Sat, 28 Jan 2017 15:50:38 -0500
schrieb Matt McKinnon :

> This same file system (which crashed again with the same errors) is
> also giving this output during a metadata or data balance:

This looks somewhat similar to the err=-17 that I am experiencing when
using a VirtualBox image on btrfs in CoW mode (compress=lzo).

During IO intensive workloads, it results in "object already exists,
err -17" (or similar, someone else also experienced it through another
workload). The resulting btrfs check shows the same errors, reporting
inodes without csum.

Trying to continue using this file system across successive boots
usually results in boot freezes or a completely unmountable filesystem,
broken beyond repair.

I have the feeling that using the bfq elevator lets me trigger this
bug even without VirtualBox, i.e. during normal system usage, and
mostly during boot when IO load is very high. So I also stopped using
bfq although it was giving me much better interactivity.

Marking vbox images nocow and using standard elevators (cfq, deadline)
has exposed no such problems so far - even during excessive IO loads.
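
For reference, the nocow marking only works on files that are still
empty, so the usual trick is to set it on the directory and re-create
the images inside (path is an example):

mkdir /var/lib/vbox-images
chattr +C /var/lib/vbox-images
# new files created in there inherit No_COW; existing images must be
# copied in (not moved), so the data is rewritten into nocow extents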

EOM

> Jan 27 19:42:47 my_machine kernel: [  335.018123] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 2191360
> Jan 27 19:42:47 my_machine kernel: [  335.018128] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 2195456
> Jan 27 19:42:47 my_machine kernel: [  335.018491] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 4018176
> Jan 27 19:42:47 my_machine kernel: [  335.018496] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 4022272
> Jan 27 19:42:47 my_machine kernel: [  335.018499] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 4026368
> Jan 27 19:42:47 my_machine kernel: [  335.018502] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 4030464
> Jan 27 19:42:47 my_machine kernel: [  335.019443] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 6156288
> Jan 27 19:42:47 my_machine kernel: [  335.019688] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 7933952
> Jan 27 19:42:47 my_machine kernel: [  335.019693] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 7938048
> Jan 27 19:42:47 my_machine kernel: [  335.019754] BTRFS info (device 
> sda1): no csum found for inode 28472371 start 8077312
> Jan 27 19:42:47 my_machine kernel: [  335.025485] BTRFS warning
> (device sda1): csum failed ino 28472371 off 2191360 csum 4031061501
> expected csum 0 Jan 27 19:42:47 my_machine kernel: [  335.025490]
> BTRFS warning (device sda1): csum failed ino 28472371 off 2195456
> csum 2371784003 expected csum 0 Jan 27 19:42:47 my_machine kernel:
> [  335.025526] BTRFS warning (device sda1): csum failed ino 28472371
> off 4018176 csum 3812080098 expected csum 0 Jan 27 19:42:47
> my_machine kernel: [  335.025531] BTRFS warning (device sda1): csum
> failed ino 28472371 off 4022272 csum 2776681411 expected csum 0 Jan
> 27 19:42:47 my_machine kernel: [  335.025534] BTRFS warning (device
> sda1): csum failed ino 28472371 off 4026368 csum 1179241675 expected
> csum 0 Jan 27 19:42:47 my_machine kernel: [  335.025540] BTRFS
> warning (device sda1): csum failed ino 28472371 off 4030464 csum
> 1256914217 expected csum 0 Jan 27 19:42:47 my_machine kernel:
> [  335.026142] BTRFS warning (device sda1): csum failed ino 28472371
> off 7933952 csum 2695958066 expected csum 0 Jan 27 19:42:47
> my_machine kernel: [  335.026147] BTRFS warning (device sda1): csum
> failed ino 28472371 off 7938048 csum 3260800596 expected csum 0 Jan
> 27 19:42:47 my_machine kernel: [  335.026934] BTRFS warning (device
> sda1): csum failed ino 28472371 off 6156288 csum 4293116449 expected
> csum 0 Jan 27 19:42:47 my_machine kernel: [  335.033249] BTRFS
> warning (device sda1): csum failed ino 28472371 off 8077312 csum
> 4031878292 expected csum 0
> 
> Can these be ignored?
> 
> 
> On 01/25/2017 04:06 PM, Liu Bo wrote:
> > On Mon, Jan 23, 2017 at 03:03:55PM -0500, Matt McKinnon wrote:  
> >> Wondering what to do about this error which says 'reboot needed'.
> >> Has happened a three times in the past week:
> >>  
> >
> > Well, I don't think btrfs's logic here is wrong, the following stack
> > shows that a nfs client has sent a second unlink against the same
> > inode while somehow the inode was not fully deleted by the first
> > unlink.
> >
> > So it'd be good that you could add some debugging information to
> > get us further.
> >
> > Thanks,
> >
> > -liubo
> >  
> >> Jan 23 14:16:17 my_machine kernel: [ 2568.595648] BTRFS error
> >> (device sda1): err add delayed dir index item(index: 23810) into
> >> the deletion tree of the delayed node(root id: 257, inode id:
> >> 2661433, errno: -17) Jan 23 14:16:17 my_machine kernel:
> >> [ 2568.611010] [ cut here ]
> >> Jan 23 14:16:17 my_machine kernel: [ 2568.615628] kernel BUG at
> >> 

Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-03 Thread Kai Krakow
Am Fri, 3 Mar 2017 07:19:06 -0500
schrieb "Austin S. Hemmelgarn" <ahferro...@gmail.com>:

> On 2017-03-03 00:56, Kai Krakow wrote:
> > Am Thu, 2 Mar 2017 11:37:53 +0100
> > schrieb Adam Borowski <kilob...@angband.pl>:
> >  
> >> On Wed, Mar 01, 2017 at 05:30:37PM -0700, Chris Murphy wrote:  
>  [...]  
> >>
> >> Well, there's Qu's patch at:
> >> https://www.spinics.net/lists/linux-btrfs/msg47283.html
> >> but it doesn't apply cleanly nor is easy to rebase to current
> >> kernels. 
>  [...]  
> >>
> >> Well, yeah.  The current check is naive and wrong.  It does have a
> >> purpose, just fails in this, very common, case.  
> >
> > I guess the reasoning behind this is: Creating any more chunks on
> > this drive will make raid1 chunks with only one copy. Adding
> > another drive later will not replay the copies without user
> > interaction. Is that true?
> >
> > If yes, this may leave you with a mixed case of having a raid1 drive
> > with some chunks not mirrored and some mirrored. When the other
> > drives goes missing later, you are loosing data or even the whole
> > filesystem although you were left with the (wrong) imagination of
> > having a mirrored drive setup...
> >
> > Is this how it works?
> >
> > If yes, a real patch would also need to replay the missing copies
> > after adding a new drive.
> >  
> The problem is that that would use some serious disk bandwidth
> without user intervention.  The way from userspace to fix this is to
> scrub the FS.  It would essentially be the same from kernel space,
> which means that if you had a multi-TB FS and this happened, you'd be
> running at below capacity in terms of bandwidth for quite some time.
> If this were to be implemented, it would have to be keyed off of the
> per-chunk degraded check (so that _only_ the chunks that need it get
> touched), and there would need to be a switch to disable it.

Well, I'd expect that a replaced drive involves reduced bandwidth
for a while. Every traditional RAID does this. The key feature there
is that you can limit bandwidth and/or define priorities (background
rebuild rate).

Btrfs OTOH could be a lot smarter, only rebuilding chunks that are
affected. The kernel can already do IO priorities, and some sort of
bandwidth limiting should also be possible. I think IO throttling is
already implemented in the kernel somewhere (at least with 4.10) and
also in btrfs. So the basics are there.
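
One such basic that already exists today: scrub (which is what would
double as the rebuild pass) accepts an IO priority class. A small
illustration - the mountpoint is made up, not from this thread:

btrfs scrub start -c 3 /mnt/array   # -c 3 = idle class, yields to foreground IO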

In a RAID setup, performance should never have priority over redundancy
by default.

If performance is an important factor, I suggest working with SSD
writeback caches. This is already possible with different kernel
techniques like mdcache or bcache. Proper hardware controllers also
support this in hardware. It's cheap to have a mirrored SSD
writeback cache of 1TB or so if your setup already contains an array of
multiple terabytes. Such a setup has huge performance benefits in setups
we deploy (tho, not btrfs related).

Also, adding/replacing a drive is usually not a totally unplanned
event. Except for hot spares, a missing drive will be replaced at the
time you arrive on-site. If performance is a factor, the rebuild could
be started manually at that same time anyway. So why shouldn't it be
done automatically?

-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: raid1 degraded mount still produce single chunks, writeable mount not allowed

2017-03-03 Thread Kai Krakow
Am Thu, 2 Mar 2017 11:37:53 +0100
schrieb Adam Borowski :

> On Wed, Mar 01, 2017 at 05:30:37PM -0700, Chris Murphy wrote:
> > [1717713.408675] BTRFS warning (device dm-8): missing devices (1)
> > exceeds the limit (0), writeable mount is not allowed
> > [1717713.446453] BTRFS error (device dm-8): open_ctree failed
> > 
> > [chris@f25s ~]$ uname -r
> > 4.9.8-200.fc25.x86_64
> > 
> > I thought this was fixed. I'm still getting a one time degraded rw
> > mount, after that it's no longer allowed, which really doesn't make
> > any sense because those single chunks are on the drive I'm trying to
> > mount.  
> 
> Well, there's Qu's patch at:
> https://www.spinics.net/lists/linux-btrfs/msg47283.html
> but it doesn't apply cleanly nor is easy to rebase to current kernels.
> 
> > I don't understand what problem this proscription is trying to
> > avoid. If it's OK to mount rw,degraded once, then it's OK to allow
> > it twice. If it's not OK twice, it's not OK once.  
> 
> Well, yeah.  The current check is naive and wrong.  It does have a
> purpose, just fails in this, very common, case.

I guess the reasoning behind this is: Creating any more chunks on this
drive will make raid1 chunks with only one copy. Adding another drive
later will not recreate the copies without user interaction. Is that true?

If yes, this may leave you with a mixed case of having a raid1 drive
with some chunks not mirrored and some mirrored. When the other drive
goes missing later, you are losing data or even the whole filesystem,
although you were left with the (wrong) impression of having a
mirrored drive setup...

Is this how it works?

If yes, a real patch would also need to recreate the missing copies after
adding a new drive.

-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Please add more info to dmesg output on I/O error

2017-03-01 Thread Kai Krakow
Am Wed, 1 Mar 2017 19:04:26 +0300
schrieb Timofey Titovets :

> Hi, today i try move my FS from old HDD to new SSD
> While processing i catch I/O error and device remove operation was
> canceled
> 
> Dmesg:
> [ 1015.010241] blk_update_request: I/O error, dev sda, sector 81353664
> [ 1015.010246] BTRFS error (device sdb1): bdev /dev/sda1 errs: wr 0,
> rd 23, flush 0, corrupt 0, gen 0
> [ 1015.010282] ata5: EH complete
> [ 1017.016721] ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0
> action 0x0 [ 1017.016730] ata5.00: irq_stat 0x4008
> [ 1017.016737] ata5.00: failed command: READ FPDMA QUEUED
> [ 1017.016748] ata5.00: cmd 60/08:80:c0:5b:d9/00:00:04:00:00/40 tag 16
> ncq dma 4096 in
>res 41/40:00:c0:5b:d9/00:00:04:00:00/40 Emask
> 0x409 (media error) 
> [ 1017.016754] ata5.00: status: { DRDY ERR }
> [ 1017.016757] ata5.00: error: { UNC }
> [ 1017.029479] ata5.00: configured for UDMA/133
> [ 1017.029506] sd 4:0:0:0: [sda] tag#16 UNKNOWN(0x2003) Result:
> hostbyte=0x00 driverbyte=0x08
> [ 1017.029511] sd 4:0:0:0: [sda] tag#16 Sense Key : 0x3 [current]
> [ 1017.029516] sd 4:0:0:0: [sda] tag#16 ASC=0x11 ASCQ=0x4
> [ 1017.029520] sd 4:0:0:0: [sda] tag#16 CDB: opcode=0x28 28 00 04 d9
> 5b c0 00 00 08 00
> 
> At now, i fixed this problem by doing scrub FS and delete damaged
> files, but scrub are slow, and if btrfs show me a more info on I/O
> error, it's will be more helpful
> i.e. something like i getting by scrub:
> [ 1260.559180] BTRFS warning (device sdb1): i/o error at logical
> 40569896960 on dev /dev/sda1, sector 81351616, root 309, inode 55135,
> offset 71278592, length 4096, links 1 (path:
> nefelim4ag/.config/skypeforlinux/Cache/data_3)
> 
> Thanks.

You should set SCT ERC to a lower value with smartctl, or, if that
doesn't work with your HDD firmware, increase the timeout of the scsi
driver above 120s. This setup, as it is, is not going to work correctly
with btrfs in case of errors.

# smartctl -l scterc,70,70 /dev/sdb

should do the trick if supported. It applies an error correction
timeout of 7 seconds for reading and writing, which is below the kernel
scsi layer timeout of 30 seconds. Otherwise, your drive will fail to
respond for two minutes until the kernel resets the drive. According to
dmesg, this is what happened.

NAS-ready drives usually support this setting, while desktop drives
don't or at least default to standard desktop timeouts.
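
If the drive doesn't support SCT ERC at all, the fallback mentioned
above is to raise the kernel's command timeout instead, so the drive's
own (long) error recovery can finish before the kernel resets the link.
A sketch - sda is just an example device, 180s an example value:

echo 180 > /sys/block/sda/device/timeout   # default is 30s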

-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [4.7.2] btrfs_run_delayed_refs:2963: errno=-17 Object already exists

2017-02-17 Thread Kai Krakow
Am Tue, 14 Feb 2017 13:52:48 +0100
schrieb Marc Joliet :

> Hi again,
> 
> so, it seems that I've solved the problem:  After having to
> umount/mount the FS several times to get btrfs-cleaner to finish, I
> thought of the "failed to load free space [...], rebuilding it now"
> type errors and decided to try the clear_cache mount option.  Since
> then, my home server has been running its backups regularly again.
> Furthermore, I was able back up my desktop again via send/recv (the
> rsync based backup is still running, but I expect it to succeed).
> The kernel log has also stayed clean.
> 
> Kai, I'd be curious whether clear_cache will help in your case, too,
> if you haven't tried it already.

The FS in question has been rebuilt from scratch. AFAIR it wasn't even
mountable back then, or at least it froze the system shortly after
mounting.

I needed the system back in a usable state, so I had to recreate and
restore it.

Next time, if it happens again (fingers crossed it does not), I'll give
it a try.

-- 
Regards,
Kai

Replies to list-only preferred.




Re: [4.7.2] btrfs_run_delayed_refs:2963: errno=-17 Object already exists

2017-02-10 Thread Kai Krakow
Am Fri, 10 Feb 2017 23:15:03 +0100
schrieb Marc Joliet :

> # btrfs filesystem df /media/MARCEC_BACKUP/   
> Data, single: total=851.00GiB, used=831.36GiB
> System, DUP: total=64.00MiB, used=120.00KiB
> Metadata, DUP: total=13.00GiB, used=10.38GiB
> Metadata, single: total=1.12GiB, used=0.00B
> GlobalReserve, single: total=512.00MiB, used=0.00B
> 
> Hmm, I take it that the single metadata is a leftover from running
> --repair?

It's more probably a remnant of an incomplete balance operation or an
older mkfs version. I'd simply rebalance metadata to fix this.
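
Something along these lines should do it (mountpoint taken from your df
output above; the "soft" filter only touches chunks not already DUP):

btrfs balance start -mconvert=dup,soft /media/MARCEC_BACKUP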

I don't think that btrfs check --repair would migrate missing metadata
duplicates back to the single profile; it would more likely trigger
recreating the missing duplicates. But I'm not sure.

If it is a result of the repair operation, that could be an
interesting clue. Could it explain "error -17" from your logs? But that
would mean the duplicates were already missing before the repair
operation and triggered that problem. So the question is, why are those
duplicates missing in the first place as a result of normal operation?
From your logs:

---8<--- snip
Feb 02 22:49:14 thetick kernel: BTRFS: device label MARCEC_BACKUP devid
1 transid 283903 /dev/sdd2
Feb 02 22:49:19 thetick kernel: EXT4-fs (sdd1): mounted filesystem with 
ordered data mode. Opts: (null)
Feb 03 00:18:52 thetick kernel: BTRFS info (device sdd2): use zlib
compression Feb 03 00:18:52 thetick kernel: BTRFS info (device sdd2):
disk space caching is enabled
Feb 03 00:18:52 thetick kernel: BTRFS info (device sdd2): has skinny
extents Feb 03 00:20:09 thetick kernel: BTRFS info (device sdd2): The
free space cache file (3967375376384) is invalid. skip it
Feb 03 01:05:58 thetick kernel: [ cut here ]
Feb 03 01:05:58 thetick kernel: WARNING: CPU: 1 PID: 26544 at
fs/btrfs/extent- tree.c:2967 btrfs_run_delayed_refs+0x26c/0x290
Feb 03 01:05:58 thetick kernel: BTRFS: Transaction aborted (error -17)
--->8--- snap

"error -17" being "object already exists". My only theory would be this
has a direct connection to you finding the single metadata profile.
Like in "the kernel thinks the objects already exists when it really
didn't, and as a result the object is there only once now" aka "single
metadata".

But I'm no dev and no expert on the internals.

-- 
Regards,
Kai

Replies to list-only preferred.




Re: BTRFS and cyrus mail server

2017-02-08 Thread Kai Krakow
Am Wed, 08 Feb 2017 19:38:06 +0100
schrieb Libor Klepáč :

> Hello,
> inspired by recent discussion on BTRFS vs. databases i wanted to ask
> on suitability of BTRFS for hosting a Cyrus imap server spool. I
> haven't found any recent article on this topic.
> 
> I'm preparing migration of our mailserver to Debian Stretch, ie.
> kernel 4.9 for now. We are using XFS for storage now. I will migrate
> using imapsync to new server. Both are virtual machines running on
> vmware on Dell hardware. Disks are on battery backed hw raid
> controllers over vmfs.
> 
> I'm considering using BTRFS, but I'm little concerned because of
> reading this mailing list ;)
> 
> I'm interested in using:
>  - compression (emails should compress well - right?)

Not really... The small part that's compressible (headers and a few
lines of text) is already small, so a sector (maybe 4k) is still a
sector. Compression gains you no benefit here. The big parts of mails
are already compressed (images, attachments). Mail spools only compress
well if you're compressing mails into a solid archive (like 7zip or tgz).
If you're compressing each mail individually, there's almost no gain
because of file system slack.

>  - maybe deduplication (cyrus does it by hardlinking of same content
> messages now) later

It won't work that way. I'd stick to hardlinking. Only
offline/nearline deduplication will help you. And it will have a hard
time finding the duplicates. This would only work properly if Cyrus
separated mail headers and bodies (I don't know if it does; dovecot,
which is what I use, doesn't), because delivering to the spool usually
adds some headers like "Delivered-To". This changes the byte offsets
between similar mails so that deduplication will no longer work.
 
>  - snapshots for history

Don't go too deep with snapshots. I had similar plans but instead decided
it would be better to use the following setup as a continuous backup
strategy: Deliver mails to two spools, one being the user accessible
spool, and one being the backup spool. Once per day you rename the
backup spool and let it be recreated. Then store away the old backup
spool in whatever way you want (snapshots, traditional backup with
retention, ...).
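
Roughly like this, assuming the spools are btrfs subvolumes on the same
filesystem as the backup area (all paths made up):

mv /var/spool/mail-backup /var/spool/mail-backup.old
btrfs subvolume create /var/spool/mail-backup     # fresh spool for new deliveries
btrfs subvolume snapshot -r /var/spool/mail-backup.old /backups/mail/$(date +%F)
btrfs subvolume delete /var/spool/mail-backup.old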

>  - send/receive for offisite backup

It's not that stable that I'd use it in production...

>  - what about data inlining, should it be turned off?

How much data can be inlined? I'm not sure, I never thought about that.

> Our Cyrus pool consist of ~520GB of data in ~2,5million files, ~2000 
> mailboxes.

Similar numbers here, just more mailboxes and less space, because we
take care that customers remove their mails from our servers and store
them in their own systems and backups. With a few exceptions, and those
have really big mailboxes.

> We have message size limit of ~25MB, so emails are not bigger than
> that.

50 MB raw size here... (after base64's 3-in-4 decoding, this makes
around 37 MB worth of attachments)

> There are however bigger files, these are per mailbox
> caches/index files of cyrus (some of them are around 300MB) - and
> these are also files which are most modified.
> Rest of files (messages) are usualy just writen once.

I'm still undecided whether I should try btrfs or stay with xfs. Xfs has
the huge benefit of scaling very well to parallel workloads and
across multiple devices. Btrfs doesn't do that very well yet
(because of write serialization etc).

> 
> ---
> I started using btrfs on backup server as a storage for 4 backuppc
> run in containers (backups are then send away with btrbk), year ago.
> After switching off data inlining i'm satisfied, everything works
> (send/ receive is sometime slow, but i guess it's because of sata
> disks on receive side).

I've started to love borgbackup. It's very fast, efficient, and
reliable. Not sure how well it works for VM images, but for delta
backups in general it's very efficient.


-- 
Regards,
Kai

Replies to list-only preferred.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs-progs: better document btrfs receive security

2017-02-07 Thread Kai Krakow
Am Fri,  3 Feb 2017 08:48:58 -0500
schrieb "Austin S. Hemmelgarn" :

> +user who is running receive, and then move then into the final
> destination 

Typo? s/then/them/?

-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: dup vs raid1 in single disk

2017-02-07 Thread Kai Krakow
Am Thu, 19 Jan 2017 15:02:14 -0500
schrieb "Austin S. Hemmelgarn" :

> On 2017-01-19 13:23, Roman Mamedov wrote:
> > On Thu, 19 Jan 2017 17:39:37 +0100
> > "Alejandro R. Mosteo"  wrote:
> >  
> >> I was wondering, from a point of view of data safety, if there is
> >> any difference between using dup or making a raid1 from two
> >> partitions in the same disk. This is thinking on having some
> >> protection against the typical aging HDD that starts to have bad
> >> sectors.  
> >
> > RAID1 will write slower compared to DUP, as any optimization to
> > make RAID1 devices work in parallel will cause a total performance
> > disaster for you as you will start trying to write to both
> > partitions at the same time, turning all linear writes into random
> > ones, which are about two orders of magnitude slower than linear on
> > spinning hard drives. DUP shouldn't have this issue, but still it
> > will be twice slower than single, since you are writing everything
> > twice.  
> As of right now, there will actually be near zero impact on write 
> performance (or at least, it's way less than the theoretical 50%) 
> because there really isn't any optimization to speak of in the 
> multi-device code.  That will hopefully change over time, but it's
> not likely to do so any time in the future since nobody appears to be 
> working on multi-device write performance.

I think that's only true if you don't account for the seek overhead. In
single-device RAID1 mode you will always seek across half of the device
while writing data, and even when reading between odd and even PIDs. In
contrast, DUP mode doesn't guarantee shorter seeks, but from a
statistical point of view they should be shorter on average. So it
should yield better performance (tho I wouldn't expect it to be
observable, depending on your workload).

So, on devices having no seek overhead (aka SSD), it is probably true
(minus bus bandwidth considerations). For HDD I'd prefer DUP.

From a data safety point of view: It's more likely that adjacent
and nearby sectors are bad. So DUP imposes a higher risk of data
being written only to bad sectors - which means data loss or even
loss of the file system (if metadata hits this problem).

To be realistic: I wouldn't trade space usage for duplicate data on an
already failing disk, no matter if it's DUP or RAID1. HDD disk space is
cheap, and such a scenario is just a waste of performance AND
space - no matter what. I don't understand the purpose of this. It just
results in false safety.

Better get two separate devices half the size. There's a better chance
of getting a good cost/space ratio that way anyway, plus better
performance and safety.

> There's also the fact that you're writing more metadata than data
> most of the time unless you're dealing with really big files, and
> metadata is already DUP mode (unless you are using an SSD), so the
> performance hit isn't 50%, it's actually a bit more than half the
> ratio of data writes to metadata writes.
> >  
> >> On a related note, I see this caveat about dup in the manpage:
> >>
> >> "For example, a SSD drive can remap the blocks internally to a
> >> single copy thus deduplicating them. This negates the purpose of
> >> increased redunancy (sic) and just wastes space"  
> >
> > That ability is vastly overestimated in the man page. There is no
> > miracle content-addressable storage system working at 500 MB/sec
> > speeds all within a little cheap controller on SSDs. Likely most of
> > what it can do, is just compress simple stuff, such as runs of
> > zeroes or other repeating byte sequences.  
> Most of those that do in-line compression don't implement it in 
> firmware, they implement it in hardware, and even DEFLATE can get 500 
> MB/second speeds if properly implemented in hardware.  The firmware
> may control how the hardware works, but it's usually hardware doing
> heavy lifting in that case, and getting a good ASIC made that can hit
> the required performance point for a reasonable compression algorithm
> like LZ4 or Snappy is insanely cheap once you've gotten past the VLSI
> work.

I still think it's a myth... The overhead of managing inline
deduplication is just way too high to implement it without jumping
through expensive hoops. Most workloads have almost zero deduplication
potential. And even when they do, the duplicates are spaced so far
apart in time that an inline deduplicator won't catch them.

If it were all that easy, btrfs would already have it working in
mainline. I don't even know whether those patches are still being
worked on.

With this in mind, I think dup metadata is still a good thing to have
even on SSD, and I would always force-enable it.
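
For the record, forcing that is a one-liner either at mkfs time or
afterwards (device and mountpoint are placeholders):

mkfs.btrfs -m dup /dev/sdX                # new filesystem
btrfs balance start -mconvert=dup /mnt    # convert an existing one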

There is potential for deduplication only when using snapshots (which
are already deduplicated when taken) or when handling user data on a
file server in a multi-user environment. Users tend to copy their files
all over the place - multiple directories of multiple gigabytes.

Re: BTRFS for OLTP Databases

2017-02-07 Thread Kai Krakow
Am Tue, 7 Feb 2017 22:25:29 +0100
schrieb Lionel Bouton <lionel-subscript...@bouton.name>:

> Le 07/02/2017 à 21:47, Austin S. Hemmelgarn a écrit :
> > On 2017-02-07 15:36, Kai Krakow wrote:  
> >> Am Tue, 7 Feb 2017 09:13:25 -0500
> >> schrieb Peter Zaitsev <p...@percona.com>:
> >>  
>  [...]  
> >>
> >> Out of curiosity, I see one problem here:
> >>
> >> If you're doing snapshots of the live database, each snapshot
> >> leaves the database files like killing the database in-flight.
> >> Like shutting the system down in the middle of writing data.
> >>
> >> This is because I think there's no API for user space to subscribe
> >> to events like a snapshot - unlike e.g. the VSS API (volume
> >> snapshot service) in Windows. You should put the database into
> >> frozen state to prepare it for a hotcopy before creating the
> >> snapshot, then ensure all data is flushed before continuing.  
> > Correct.  
> >>
> >> I think I've read that btrfs snapshots do not guarantee single
> >> point in time snapshots - the snapshot may be smeared across a
> >> longer period of time while the kernel is still writing data. So
> >> parts of your writes may still end up in the snapshot after
> >> issuing the snapshot command, instead of in the working copy as
> >> expected.  
> > Also correct AFAICT, and this needs to be better documented (for
> > most people, the term snapshot implies atomicity of the
> > operation).  
> 
> Atomicity can be a relative term. If the snapshot atomicity is
> relative to barriers but not relative to individual writes between
> barriers then AFAICT it's fine because the filesystem doesn't make
> any promise it won't keep even in the context of its snapshots.
> Consider a power loss : the filesystems atomicity guarantees can't go
> beyond what the hardware guarantees which means not all current in fly
> write will reach the disk and partial writes can happen. Modern
> filesystems will remain consistent though and if an application using
> them makes uses of f*sync it can provide its own guarantees too. The
> same should apply to snapshots : all the writes in fly can complete or
> not on disk before the snapshot what matters is that both the snapshot
> and these writes will be completed after the next barrier (and any
> robust application will ignore all the in fly writes it finds in the
> snapshot if they were part of a batch that should be atomically
> commited).
> 
> This is why AFAIK PostgreSQL or MySQL with their default ACID
> compliant configuration will recover from a BTRFS snapshot in the
> same way they recover from a power loss.

This is what I meant in my other reply. But this is also why it should
be documented. Assuming that snapshots are single-point-in-time
snapshots is wrong, with possibly horrible side effects one wouldn't
expect.

Taking a snapshot is like a power loss - even tho there is no power
loss. So the database has to be properly configured. It is simply
short-sighted not to think about this fact. The documentation should
really point it out.


-- 
Regards,
Kai

Replies to list-only preferred.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS for OLTP Databases

2017-02-07 Thread Kai Krakow
Am Tue, 7 Feb 2017 10:43:11 -0500
schrieb "Austin S. Hemmelgarn" :

> > I mean that:
> > You have a 128MB extent, you rewrite random 4k sectors, btrfs will
> > not split 128MB extent, and not free up data, (i don't know
> > internal algo, so i can't predict when this will hapen), and after
> > some time, btrfs will rebuild extents, and split 128 MB exten to
> > several more smaller. But when you use compression, allocator
> > rebuilding extents much early (i think, it's because btrfs also
> > operates with that like 128kb extent, even if it's a continuos
> > 128MB chunk of data). 
> The allocator has absolutely nothing to do with this, it's a function
> of the COW operation.  Unless you're using nodatacow, that 128MB
> extent will get split the moment the data hits the storage device
> (either on the next commit cycle (at most 30 seconds with the default
> commit cycle), or when fdatasync is called, whichever is sooner).  In
> the case of compression, it's still one extent (although on disk it
> will be less than 128MB) and will be split at _exactly_ the same time
> under _exactly_ the same circumstances as an uncompressed extent.
> IOW, it has absolutely nothing to do with the extent handling either.

I don't think that btrfs splits extents which are part of a snapshot.
The extent in a snapshot will stay intact when writing to this extent
in another snapshot. Of course, in the just-written snapshot, the
extent will be represented as a split extent mapping to the original
extent's data blocks plus the new data in the middle (thus resulting in
three extents). This is also why small random writes without autodefrag
result in a vast amount of small extents, bringing the fs performance to
a crawl.

Do that multiple times on multiple snapshots, delete some of the
original snapshots, and you're left with slack space: data blocks that
are inaccessible and won't be reclaimed into free space (because they
are still part of the original extent) and can only be reclaimed by a
defrag operation - which of course unshares data.

Thus, if any of the above-mentioned small extents is still shared with
an extent that was originally much bigger, it will still occupy its
original space on the filesystem - even when its associated
snapshot/subvolume no longer exists. Only when the last remaining
tiny block of such an extent gets rewritten and the reference counter
drops to zero is the extent given up and freed.

To work around this, you can currently only unshare and recombine by
doing defrag and dedupe on all snapshots. This will reclaim space
sitting in parts of the original extents no longer referenced by a
snapshot visible from the VFS layer.
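
A very rough sketch of that workaround, assuming the snapshots live
under a made-up /mnt/snaps and are writable (read-only snapshots would
have to be made writable first):

for s in /mnt/snaps/*; do
    btrfs filesystem defragment -r "$s"    # unshares extents
done
duperemove -dr /mnt/snaps                  # re-shares identical blocks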

This is for performance reasons because btrfs is extent based.

As far as I know, ZFS, on the other side, works differently. It uses
block-based storage for the snapshot feature and can easily throw away
unused blocks. Only a second layer on top maps this back into extents.
The underlying infrastructure, however, is block-based storage, which
also enables the volume pool to create block devices on the fly out of
ZFS storage space.

PS: All above given the fact I understood it right. ;-)

-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS for OLTP Databases

2017-02-07 Thread Kai Krakow
Am Tue, 7 Feb 2017 15:27:34 -0500
schrieb "Austin S. Hemmelgarn" :

> >> I'm not sure about this one.  I would assume based on the fact that
> >> many other things don't work with nodatacow and that regular defrag
> >> doesn't work on files which are currently mapped as executable code
> >> that it does not, but I could be completely wrong about this too.  
> >
> > Technically, there's nothing that prevents autodefrag to work for
> > nodatacow files. The question is: is it really necessary? Standard
> > file systems also have no autodefrag, it's not an issue there
> > because they are essentially nodatacow. Simply defrag the database
> > file once and you're done. Transactional MySQL uses huge data
> > files, probably preallocated. It should simply work with
> > nodatacow.  
> The thing is, I don't have enough knowledge of how defrag is
> implemented in BTRFS to say for certain that ti doesn't use COW
> semantics somewhere (and I would actually expect it to do so, since
> that in theory makes many things _much_ easier to handle), and if it
> uses COW somewhere, then it by definition doesn't work on NOCOW files.

A dev would need to weigh in on this. But from a non-dev point of view,
the defrag operation itself is CoW: Blocks are rewritten to another
location in contiguous order. Only metadata CoW should be needed for
this operation.

It should be nothing else than writing to a nodatacow snapshot... Just
that the snapshot is more or less implicit and temporary.

Hmm? *curious*

-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS for OLTP Databases

2017-02-07 Thread Kai Krakow
Am Tue, 7 Feb 2017 09:13:25 -0500
schrieb Peter Zaitsev :

> Hi Hugo,
> 
> For the use case I'm looking for I'm interested in having snapshot(s)
> open at all time.  Imagine  for example snapshot being created every
> hour and several of these snapshots  kept at all time providing quick
> recovery points to the state of 1,2,3 hours ago.  In  such case (as I
> think you also describe)  nodatacow  does not provide any advantage.

Out of curiosity, I see one problem here:

If you're doing snapshots of the live database, each snapshot leaves
the database files like killing the database in-flight. Like shutting
the system down in the middle of writing data.

This is because I think there's no API for user space to subscribe to
events like a snapshot - unlike e.g. the VSS API (volume snapshot
service) in Windows. You should put the database into frozen state to
prepare it for a hotcopy before creating the snapshot, then ensure all
data is flushed before continuing.

I think I've read that btrfs snapshots do not guarantee single point in
time snapshots - the snapshot may be smeared across a longer period of
time while the kernel is still writing data. So parts of your writes
may still end up in the snapshot after issuing the snapshot command,
instead of in the working copy as expected.

How is this going to be addressed? Is there some snapshot aware API to
let user space subscribe to such events and do proper preparation? Is
this planned? LVM could be a user of such an API, too. I think this
could have nice enterprise-grade value for Linux.

XFS has xfs_freeze for this (freeze with -f, thaw with -u), to prepare
LVM snapshots. But still, this also needs to be integrated with MySQL to
properly work. I once (years ago) researched this but gave up on my
plans when I planned database backups for our web server infrastructure.
We moved to creating SQL dumps instead, although there are binlogs which
can be used to recover to a clean and stable transactional state after
taking snapshots. But I simply didn't want to fiddle around with
properly cleaning up binlogs, which accumulate a horrible amount of
space over time. The cleanup process requires creating a cold copy or
dump of the complete database from time to time; only then is it safe
to remove all binlogs up to that point in time.
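
For reference, the classic freeze-then-LVM-snapshot dance looks roughly
like this (volume group, LV and mountpoint are made up, and the database
should be quiesced first, e.g. with FLUSH TABLES WITH READ LOCK):

xfs_freeze -f /var/lib/mysql                  # stop new writes, flush the fs
lvcreate -s -n mysql-snap -L 10G vg0/mysql    # block-level snapshot below the fs
xfs_freeze -u /var/lib/mysql                  # thaw, back to normal operation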

-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS for OLTP Databases

2017-02-07 Thread Kai Krakow
Am Tue, 7 Feb 2017 14:50:04 -0500
schrieb "Austin S. Hemmelgarn" :

> > Also does autodefrag works with nodatacow (ie with snapshot)  or
> > are these exclusive ?  
> I'm not sure about this one.  I would assume based on the fact that
> many other things don't work with nodatacow and that regular defrag
> doesn't work on files which are currently mapped as executable code
> that it does not, but I could be completely wrong about this too.

Technically, there's nothing that prevents autodefrag from working with
nodatacow files. The question is: is it really necessary? Standard file
systems also have no autodefrag; it's not an issue there because they
are essentially nodatacow. Simply defrag the database file once and
you're done. Transactional MySQL uses huge data files, probably
preallocated. It should simply work with nodatacow.
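
That one-time defrag would be something like this (file path and target
extent size are just an example):

btrfs filesystem defragment -t 32M /var/lib/mysql/ibdata1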

On the other hand: Using snapshots clearly introduces fragmentation over
time. If autodefrag kicks in (given that it is supported for nodatacow),
it will slowly unshare all data. This somewhat defeats the purpose of
having snapshots in the first place for this scenario.

In conclusion, I'd recommend running some maintenance scripts from time
to time: one to re-share identical blocks, and one to defragment the
current workspace.

The bees daemon comes into mind here... I haven't tried it but it
sounds like it could fill a gap here:

https://github.com/Zygo/bees

Another option comes to mind: XFS now supports shared-extent (reflink)
copies. You could simply do a cold copy of the database with this
feature, resulting in the same effect as a snapshot, without the
other performance problems of btrfs. Tho, the fragmentation issue would
remain, and I think there's no dedupe application for XFS yet.
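
With coreutils that's just (file names made up):

cp --reflink=always ibdata1 ibdata1.snapshot   # shares extents instead of copying data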

-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS for OLTP Databases

2017-02-07 Thread Kai Krakow
Am Tue, 7 Feb 2017 10:06:34 -0500
schrieb "Austin S. Hemmelgarn" :

> 4. Try using in-line compression.  This can actually significantly 
> improve performance, especially if you have slow storage devices and
> a really nice CPU.

Just a side note: With nodatacow there'll be no compression, I think.
At least for files with "chattr +C" there'll be no compression. I thus
think "nodatacow" has the same effect.

-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Very slow balance / btrfs-transaction

2017-02-07 Thread Kai Krakow
Am Mon, 6 Feb 2017 08:19:37 -0500
schrieb "Austin S. Hemmelgarn" :

> > MDRAID uses stripe selection based on latency and other measurements
> > (like head position). It would be nice if btrfs implemented similar
> > functionality. This would also be helpful for selecting a disk if
> > there're more disks than stripesets (for example, I have 3 disks in
> > my btrfs array). This could write new blocks to the most idle disk
> > always. I think this wasn't covered by the above mentioned patch.
> > Currently, selection is based only on the disk with most free
> > space.  
> You're confusing read selection and write selection.  MDADM and
> DM-RAID both use a load-balancing read selection algorithm that takes
> latency and other factors into account.  However, they use a
> round-robin write selection algorithm that only cares about the
> position of the block in the virtual device modulo the number of
> physical devices.

Thanks for clearing that up.

> As an example, say you have a 3 disk RAID10 array set up using MDADM 
> (this is functionally the same as a 3-disk raid1 mode BTRFS
> filesystem). Every third block starting from block 0 will be on disks
> 1 and 2, every third block starting from block 1 will be on disks 3
> and 1, and every third block starting from block 2 will be on disks 2
> and 3.  No latency measurements are taken, literally nothing is
> factored in except the block's position in the virtual device.

I didn't know MDADM can use RAID10 with an odd number of disks...
Nice. I'll keep that in mind. :-)


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs receive leaves new subvolume modifiable during operation

2017-02-06 Thread Kai Krakow
Am Mon, 6 Feb 2017 07:30:31 -0500
schrieb "Austin S. Hemmelgarn" :

> > How about mounting the receiver below a directory only traversable
> > by root (chmod og-rwx)? Backups shouldn't be directly accessible by
> > ordinary users anyway.  
> There are perfectly legitimate reasons to have backups directly 
> accessible by users.  If you're running automated backups for _their_ 
> systems _they_ absolutely should have at least read access to the 
> backups _once they're stable_.

This is not what I tried to explain. The OP's question mixes the
creation process with later access. The creation process, however,
should always be isolated. See below, you're even agreeing. ;-)

> This is not a common case, but it is
> a perfectly legitimate one.  In the same way, if you're storing
> backups for your users (in this case you absolutely should not be4
> using send/receive for other reasons), then the use case dictates
> that they have some way to access them.

I don't deny that. But the OP should understand how to properly isolate
both operations from each other. This is best practice; this is how it
should be done.

> > If you want a backup becoming accessible, you can later snapshot it
> > to an accessible location after send/receive finished without
> > errors.
> >
> > An in-transit backup, however, should always be protected from
> > possible disruptive access. This is an issue with any backup
> > software. So place the receive within an inaccessible directory.
> > This is not something the backup process should directly bother
> > with for simplicity.  

> I agree on this point however.  Doing a backup directly into the
> final persistent storage location is never a good idea.  It makes
> error handling more complicated, it makes handling of multi-tier
> storage more complicated (and usually less efficient), and it makes
> security more difficult.

Agreed. It complicates a lot of things. In conclusion: If done right,
the original request isn't a problem, nor is anything wrong by
design. It's a question of isolating the operations.

This is simply one of the most basic principles of a safe and secure
backup.

Personally, if I use rsync for backups, I always rsync to a scratch
location only accessible by the backup process. This scratch area may
even be incomplete, inconsistent or broken in other ways. Only when
rsync has finished successfully is the scratch area snapshotted to its
final destination - which is accessible by its users/owners. This also
has the benefit of the snapshots being self-contained deltas which can
be removed without rewriting or reorganizing partial or complete backup
history. That's a plus for backup safety, performance, and retention
policies.
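
In shell terms the flow is roughly this (paths are made up):

rsync -aHAX --delete /home/ /backup/scratch/ \
  && btrfs subvolume snapshot -r /backup/scratch "/backup/daily/$(date +%F)"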

Currently, I'm using borgbackup for my personal backups. It has a
similar approach by using checkpoints for resuming a partial backup.
Only a successful backup process creates the final checkpoint. The
intermediate checkpoints can be thrown away at any time later. It
currently stores a backup history of 95.8 TB (multiple months) on a 3 TB
hard disk. Unfortunately, I don't yet sync this to an offsite location.
My most important data (photos, mental work like programming) is stored
in a different location through other means (Git, cloud sync).

For customers, I prefer to use a local cache where the backup is
stored first; it is then synced offsite using deduplication algorithms
to reduce transfer overhead. A second, different backup software stores
another local copy for fast recovery in case of disaster. Syncing back
from offsite storage is only needed in case of total local data loss.
And there's a backup for the backup: if one doesn't work, there's
always the other.

In all cases, the intermediate storage is protected from tampering by
the user, or even completely blocked to be accessed by the user. Only
final and clean snapshots are made available.

Also, error handling and cleanup are easy because errors don't leak or
propagate into the final storage. You can simply clean caches,
intermediate checkpoints, or scratch/staging areas. You can even lose
them for whatever reason (hardware problems, storage errors etc). The
only downside would be that the next backup takes longer to complete.

-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Is it possible to have metadata-only device with no data?

2017-02-05 Thread Kai Krakow
Am Mon, 06 Feb 2017 00:42:01 +0300
schrieb Alexander Tomokhov :

> Is it possible, having two drives to do raid1 for metadata but keep
> data on a single drive only? --

No, but you could take a look at bcache, which should get you
something similar if used in write-around mode.

Random access will become cached in bcache, which should most of the
time be metadata, plus of course randomly accessed data from the HDD. If
you reduce the sequential cutoff trigger in bcache, it should cache
mostly metadata only.
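
The knobs for that live in sysfs; a sketch, assuming the cached device
shows up as bcache0 (the cutoff value is just an illustration):

echo writearound > /sys/block/bcache0/bcache/cache_mode
echo 64k > /sys/block/bcache0/bcache/sequential_cutoff   # cache only small/random IO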

-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs receive leaves new subvolume modifiable during operation

2017-02-05 Thread Kai Krakow
Am Thu, 2 Feb 2017 10:52:26 +
schrieb Graham Cobb :

> On 02/02/17 00:02, Duncan wrote:
> > If it's a workaround, then many of the Linux procedures we as
> > admins and users use every day are equally workarounds.  Setting
> > 007 perms on a dir that doesn't have anything immediately security
> > vulnerable in it, simply to keep other users from even potentially
> > seeing or being able to write to something N layers down the subdir
> > tree, is standard practice.  
> 
> No. There is no need to normally place a read-only snapshot below a
> no-execute directory just to prevent write access to it. That is not
> part of the admin's expectation.

Wrong. If you are not in control of the permissions below a
mountpoint, you prevent access to it by restricting permissions on a
parent directory of the mountpoint. It's that easy, and it always has
been. That is standard practice. While your backup is running, you have
no control over it - thus use this standard practice!

If you want to grant selective or general access to the backups,
snapshot them to a different location later. You should really learn
how to work with scratch locations for a backup. The receiver is such a
scratch location and should be handled as such.

Scratch directories for backup locations are not a bug and not a
workaround. It's just how you should handle it and how it works. By
definition, the target directory cannot be read-only while the backup
is running - so by definition you need other means to protect it.

That in turn makes all your current snapshot locations scratch
locations. Put your final snapshots somewhere else if you want your
users to access them after a backup has finished successfully. You never
want people to access in-transit or partial backups. In-transit and
partial backups always go to the scratch directory. Only after the
backup succeeds should it be made available in a final directory.

Maybe the design of your backup is wrong, and not how btrfs handles
things? Using scratch locations is a design you should really always
use for backups. Every seasoned admin would agree that this should be
best practice.


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: btrfs receive leaves new subvolume modifiable during operation

2017-02-05 Thread Kai Krakow
Am Wed, 1 Feb 2017 17:43:32 +
schrieb Graham Cobb :

> On 01/02/17 12:28, Austin S. Hemmelgarn wrote:
> > On 2017-02-01 00:09, Duncan wrote:  
> >> Christian Lupien posted on Tue, 31 Jan 2017 18:32:58 -0500 as
> >> excerpted: 
>  [...]  
> >>
> >> I'm just a btrfs-using list regular not a dev, but AFAIK, the
> >> behavior is likely to be by design and difficult to change,
> >> because the send stream is simply a stream of userspace-context
> >> commands for receive to act upon, and any other suitably
> >> privileged userspace program could run the same commands.  (If
> >> your btrfs-progs is new enough receive even has a dump option,
> >> that prints the metadata operations in human readable form, one
> >> operation per line.)
> >>
> >> So making the receive snapshot read-only during the transfer would
> >> prevent receive itself working.  
> > That's correct.  Fixing this completely would require implementing
> > receive on the kernel side, which is not a practical option for
> > multiple reasons.  
> 
> I am with Christian on this. Both the effects he discovered go against
> my expectation of how send/receive would or should work.
> 
> This first bug is more serious because it appears to allow a
> non-privileged user to disrupt the correct operation of receive,
> creating a form of denial-of-service of a send/receive based backup
> process. If I decided that I didn't want my pron collection (or my
> incriminating emails) appearing in the backups I could just make sure
> that I removed them from the receive snapshots while they were still
> writeable.
> 
> You may be right that fixing this would require receive in the kernel,
> and that is undesirable, although it seems to me that it should be
> possible to do something like allow receive to create the snapshot
> with a special flag that would cause the kernel to treat it as
> read-only to any requests not delivered through the same file
> descriptor, or something like that (or, if that can't be done, at
> least require root access to make any changes). In any case, I
> believe it should be treated as a bug, even if low priority, with an
> explicit warning about the possible corruption of receive-based
> backups in the btrfs-receive man page.

How about mounting the receiver below a directory only traversable by
root (chmod og-rwx)? Backups shouldn't be directly accessible by
ordinary users anyway.

If you want a backup becoming accessible, you can later snapshot it to
an accessible location after send/receive finished without errors.

An in-transit backup, however, should always be protected from possible
disruptive access. This is an issue with any backup software. So place
the receive within an inaccessible directory. This is not something the
backup process should directly bother with for simplicity.
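
Concretely, something like this (paths and subvolume names invented):

mkdir -m 0700 /backup/incoming                    # only root can traverse this
btrfs receive /backup/incoming < /tmp/home.send
# once receive finished without errors, expose a read-only copy:
btrfs subvolume snapshot -r /backup/incoming/home /backup/public/home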


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Very slow balance / btrfs-transaction

2017-02-04 Thread Kai Krakow
Am Sat, 04 Feb 2017 20:50:03 +
schrieb "Jorg Bornschein" :

> February 4, 2017 1:07 AM, "Goldwyn Rodrigues" 
> wrote:
> 
> > Yes, please check if disabling quotas makes a difference in
> > execution time of btrfs balance.  
> 
> Just FYI: With quotas disabled it took ~20h to finish the balance
> instead of the projected >30 days. Therefore, in my case, there was a
> speedup of factor ~35.
> 
> 
> and thanks for the quick reply! (and for btrfs general!)
> 
> 
> BTW: I'm wondering how much sense it makes to activate the underlying
> bcache for my raid1 fs again. I guess btrfs chooses randomly (or
> based predicted of disk latency?) which copy of a given extend to
> load?

As far as I know, it currently uses PID modulo only; no round-robin,
no random value. There are no performance optimizations going into btrfs
yet because there are still a lot of ongoing feature implementations.

I think there were patches to include a rotator value in the stripe
selection. They don't apply to the current kernel. I tried it once and
didn't see any subjective difference for normal desktop workloads. But
that's probably because I use RAID1 for metadata only.

MDRAID uses stripe selection based on latency and other measurements
(like head position). It would be nice if btrfs implemented similar
functionality. This would also be helpful for selecting a disk if
there're more disks than stripesets (for example, I have 3 disks in my
btrfs array). This could write new blocks to the most idle disk always.
I think this wasn't covered by the above mentioned patch. Currently,
selection is based only on the disk with most free space.

> I guess that would mean the effective cache size would only be
> half of the actual cache-set size (+-additional overhead)? Or does
> btrfs try a deterministically determined copy of each extend first? 

I'm currently using a 500GB bcache; it helps a lot during system start -
and probably also while using the system. I think that bcache
mostly caches metadata access, which should improve a lot of btrfs
performance issues. The downside of the RAID1 profile is that probably
every second access is a cache miss unless it has already been cached.
Thus, it's only half as effective as it could be.

I'm using write-back bcache caching, and RAID0 for data (I do daily
backups with borgbackup, I can easily recover broken files). So
writing with bcache is not such an issue for me. The cache is big
enough that double metadata writes are no problem.


-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [4.7.2] btrfs_run_delayed_refs:2963: errno=-17 Object already exists

2017-02-03 Thread Kai Krakow
Am Thu, 02 Feb 2017 13:01:03 +0100
schrieb Marc Joliet <mar...@gmx.de>:

> On Sunday 28 August 2016 15:29:08 Kai Krakow wrote:
> > Hello list!  
> 
> Hi list
> 
> > It happened again. While using VirtualBox the following crash
> > happened, btrfs check found a lot of errors which it couldn't
> > repair. Earlier that day my system crashed which may already
> > introduced errors into my filesystem. Apparently, I couldn't create
> > an image (not enough space available), I only can give this trace
> > from dmesg:
> > 
> > [44819.903435] [ cut here ]
> > [44819.903443] WARNING: CPU: 3 PID: 2787 at
> > fs/btrfs/extent-tree.c:2963 btrfs_run_delayed_refs+0x26c/0x290
> > [44819.903444] BTRFS: Transaction aborted (error -17)
> > [44819.903445] Modules linked in: nls_iso8859_15 nls_cp437 vfat fat
> > fuse rfcomm veth af_packet ipt_MASQUERADE nf_nat_masquerade_ipv4
> > iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat
> > nf_conntrack bridge stp llc w83627ehf bnep hwmon_vid cachefiles
> > btusb btintel bluetooth snd_hda_codec_hdmi snd_hda_codec_realtek
> > snd_hda_codec_generic snd_hda_intel snd_hda_codec rfkill snd_hwdep
> > snd_hda_core snd_pcm snd_timer coretemp hwmon snd r8169 mii
> > kvm_intel kvm iTCO_wdt iTCO_vendor_support rtc_cmos irqbypass
> > soundcore ip_tables uas usb_storage nvidia_drm(PO) vboxpci(O)
> > vboxnetadp(O) vboxnetflt(O) vboxdrv(O) nvidia_modeset(PO)
> > nvidia(PO) efivarfs unix ipv6 [44819.903484] CPU: 3 PID: 2787 Comm:
> > BrowserBlocking Tainted: P   O4.7.2-gentoo #2
> > [44819.903485] Hardware name: To Be Filled By O.E.M. To Be Filled
> > By O.E.M./Z68 Pro3, BIOS L2.16A 02/22/2013 [44819.903487]
> >  8130af2d 8800b7d03d20 
> > [44819.903489]  810865fa 880409374428 8800b7d03d70
> > 8803bf299760 [44819.903491]  ffef
> > 8803f677f000 8108666a [44819.903493] Call Trace:
> > [44819.903496]  [] ? dump_stack+0x46/0x59
> > [44819.903500]  [] ? __warn+0xba/0xe0
> > [44819.903502]  [] ? warn_slowpath_fmt+0x4a/0x50
> > [44819.903504]  [] ?
> > btrfs_run_delayed_refs+0x26c/0x290 [44819.903507]
> > [] ? btrfs_release_path+0xe/0x80 [44819.903509]
> > [] ? btrfs_start_dirty_block_groups+0x2da/0x420
> > [44819.903511] [] ?
> > btrfs_commit_transaction+0x143/0x990 [44819.903514]
> > [] ? kmem_cache_free+0x165/0x180 [44819.903516]
> > [] ? btrfs_wait_ordered_range+0x7c/0x110
> > [44819.903518] [] ? btrfs_sync_file+0x286/0x360
> > [44819.903522] [] ? do_fsync+0x33/0x60
> > [44819.903524]  [] ? SyS_fdatasync+0xa/0x10
> > [44819.903528]  [] ?
> > entry_SYSCALL_64_fastpath+0x13/0x8f [44819.903529] ---[ end trace
> > 6944811e170a0e57 ]--- [44819.903531] BTRFS: error (device bcache2)
> > in btrfs_run_delayed_refs:2963: errno=-17 Object already exists
> > [44819.903533] BTRFS info (device bcache2): forced readonly  
> 
> I got the same error myself, with this stack trace:
> 
> -- Logs begin at Fr 2016-04-01 17:07:28 CEST, end at Mi 2017-02-01
> 22:03:57 CET. --
> Feb 01 01:46:26 diefledermaus kernel: [ cut here
> ] Feb 01 01:46:26 diefledermaus kernel: WARNING: CPU: 1
> PID: 16727 at fs/btrfs/extent-tree.c:2967
> btrfs_run_delayed_refs+0x278/0x2b0 Feb 01 01:46:26 diefledermaus
> kernel: BTRFS: Transaction aborted (error -17) Feb 01 01:46:26
> diefledermaus kernel: BTRFS: error (device sdb2) in
> btrfs_run_delayed_refs:2967: errno=-17 Object already exists Feb 01
> 01:46:27 diefledermaus kernel: BTRFS info (device sdb2): forced
> readonly Feb 01 01:46:27 diefledermaus kernel: Modules linked in: msr
> ctr ccm tun arc4 snd_hda_codec_idt applesmc snd_hda_codec_generic
> input_polldev hwmon snd_hda_intel ath5k snd_hda_codec mac80211
> snd_hda_core ath snd_pcm cfg80211 snd_timer video acpi_cpufreq snd
> backlight sky2 rfkill processor button soundcore sg usb_storage
> sr_mod cdrom ata_generic pata_acpi uhci_hcd ahci libahci ata_piix
> libata ehci_pci ehci_hcd Feb 01 01:46:27 diefledermaus kernel: CPU: 1
> PID: 16727 Comm: kworker/u4:0 Not tainted 4.9.6-gentoo #1
> Feb 01 01:46:27 diefledermaus kernel: Hardware name: Apple Inc. 
> Macmini2,1/Mac-F4208EAA, BIOS MM21.88Z.009A.B00.0706281359
> 06/28/07 Feb 01 01:46:27 diefledermaus kernel: Workqueue:
> btrfs-extent-refs btrfs_extent_refs_helper
> Feb 01 01:46:27 diefledermaus kernel:  
> 812cf739 c9000285fd60 
> Feb 01 01:46:27 diefledermaus kernel:  8104908a
> 8800428df1e0 c9000285fdb0 0020
> Feb 0

Re: mount option nodatacow for VMs on SSD?

2016-11-28 Thread Kai Krakow
Am Mon, 28 Nov 2016 01:38:29 +0100
schrieb Ulli Horlacher <frams...@rus.uni-stuttgart.de>:

> On Sat 2016-11-26 (11:27), Kai Krakow wrote:
> 
> > > I have vmware and virtualbox VMs on btrfs SSD.  
> 
> > As a side note: I don't think you can use "nodatacow" just for one
> > subvolume while the other subvolumes of the same btrfs are mounted
> > different. The wiki is just wrong here.
> > 
> > The list of possible mount options in the wiki explicitly lists
> > "nodatacow" as not working per subvolume - just globally for the
> > whole fs.  
> 
> Thanks for pointing this out!
> I have misunderstood this, first.

You can, however, use chattr to make the subvolume's root directory (the
one where it is mounted) nodatacow (chattr +C) _before_ placing any
files or directories in there. That way, newly created files and
directories will inherit the flag. Take note that this flag can only be
applied to directories and empty (zero-sized) files.
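
A minimal sketch of that, assuming the image subvolume is mounted at
/var/lib/vmimages (the path is made up):

  # set the No_COW attribute on the still-empty directory
  chattr +C /var/lib/vmimages
  # verify - lsattr should show a 'C' in the attribute list
  lsattr -d /var/lib/vmimages
  # anything created in there from now on inherits nodatacow;
  # files that already existed keep their old CoW extents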

That way, you get the intended benefit and your next question applies a
little less because:

> Ok, then next question :-)
> 
> What is better (for a single user workstation): using mount option
> "autodefrag" or call "btrfs filesystem defragment -r" (-t ?) via
> nightly cronjob?
> 
> So far, I use neither.

When using the above method to make your VM images nodatacow, the only
fragmentation issue left to handle comes from snapshots. Snapshots are
subject to copy-on-write: if you do heavy snapshotting, you'll get
heavy fragmentation, depending on the write patterns. I don't know
whether autodefrag will handle nodatacow files. You may want to use a
dedupe utility after defragmentation, like duperemove (running it
manually) or bees (a background daemon that also tries to keep
fragmentation low).
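
For example, a manual duperemove run over the image directory might
look roughly like this (the path is made up; -r recurses into the
directory, -d actually submits the deduplication requests):

  duperemove -d -r /var/lib/vmimages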

If you are doing no or only infrequent snapshots, I wouldn't bother
with manual defragging at all for your VM images since you're on SSD.
If you aren't going to use snapshots at all, you won't even have to
think about autodefrag, though I still recommend enabling it (see the
post from Duncan).

Manual defrag is a highly write-intensive operation, rewriting multiple
gigabytes of data. I strongly recommend against running it on a daily
basis on an SSD.
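
For reference, the manual variant asked about above would look roughly
like this (mount point and the -t extent-size target are only
examples), with autodefrag as the set-and-forget alternative:

  # one-shot recursive defrag with a 32M target extent size;
  # note that defragmenting snapshotted files unshares their extents
  btrfs filesystem defragment -r -t 32M /var/lib/vmimages
  # background alternative: enable autodefrag at mount time
  mount -o autodefrag /dev/sdX /var/lib/vmimages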

-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mount option nodatacow for VMs on SSD?

2016-11-26 Thread Kai Krakow
Am Fri, 25 Nov 2016 09:28:40 +0100
schrieb Ulli Horlacher <frams...@rus.uni-stuttgart.de>:

> I have vmware and virtualbox VMs on btrfs SSD.
> 
> I read in
> https://btrfs.wiki.kernel.org/index.php/SysadminGuide#When_To_Make_Subvolumes
> 
>  certain types of data (databases, VM images and similar
> typically big files that are randomly written internally) may require
> CoW to be disabled for them.  So for example such areas could be
> placed in a subvolume, that is always mounted with the option
> "nodatacow".
> 
> Does this apply to SSDs, too?

As a side note: I don't think you can use "nodatacow" just for one
subvolume while the other subvolumes of the same btrfs are mounted
differently. The wiki is just wrong here.

The list of possible mount options in the wiki explicitly lists
"nodatacow" as not working per subvolume - just globally for the whole
fs.
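
To illustrate the point (device, subvolume names and mount points are
made up): even if you try something like this, the CoW setting cannot
actually differ between the two subvolumes - it applies to the whole
filesystem:

  mount -o subvol=vms,nodatacow /dev/sdX /mnt/vms
  # the per-subvolume distinction doesn't work; both subvolumes end up
  # with the same fs-wide datacow/nodatacow setting
  mount -o subvol=data,datacow /dev/sdX /mnt/data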

-- 
Regards,
Kai

Replies to list-only preferred.

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: corrupt leaf, slot offset bad

2016-11-14 Thread Kai Krakow
Am Tue, 11 Oct 2016 07:09:49 -0700
schrieb Liu Bo :

> On Tue, Oct 11, 2016 at 02:48:09PM +0200, David Sterba wrote:
> > Hi,
> > 
> > looks like a lot of random bitflips.
> > 
> > On Mon, Oct 10, 2016 at 11:50:14PM +0200, a...@aron.ws wrote:  
> > > item 109 has a few strange chars in its name (and it's
> > > truncated): 1-x86_64.pkg.tar.xz 0x62 0x14 0x0a 0x0a
> > > 
> > >   item 105 key (261 DIR_ITEM 54556048) itemoff 11723
> > > itemsize 72 location key (606286 INODE_ITEM 0) type FILE
> > >   namelen 42 datalen 0 name:
> > > python2-gobject-3.20.1-1-x86_64.pkg.tar.xz item 106 key (261
> > > DIR_ITEM 56363628) itemoff 11660 itemsize 63 location key (894298
> > > INODE_ITEM 0) type FILE namelen 33 datalen 0 name:
> > > unrar-1:5.4.5-1-x86_64.pkg.tar.xz item 107 key (261 DIR_ITEM
> > > 66963651) itemoff 11600 itemsize 60 location key (1178 INODE_ITEM
> > > 0) type FILE namelen 30 datalen 0 name:
> > > glibc-2.23-5-x86_64.pkg.tar.xz item 108 key (261 DIR_ITEM
> > > 68561395) itemoff 11532 itemsize 68 location key (660578
> > > INODE_ITEM 0) type FILE namelen 38 datalen 0 name:
> > > squashfs-tools-4.3-4-x86_64.pkg.tar.xz item 109 key (261 DIR_ITEM
> > > 76859450) itemoff 11483 itemsize 65 location key (2397184
> > > UNKNOWN.0 7091317839824617472) type 45 namelen 13102 datalen
> > > 13358 name: 1-x86_64.pkg.tar.xzb  
> > 
> > namelen must be smaller than 255, but the number itself does not
> > look like a bitflip (0x332e), the name looks like a fragment of.
> > 
> > The location key is random garbage, likely an overwritten memory,
> > 7091317839824617472 == 0x62696c010023 contains ascii 'bil', the
> > key type is unknown but should be INODE_ITEM.
> >   
> > >   data
> > >   item 110 key (261 DIR_ITEM 9799832789237604651) itemoff
> > > 11405 itemsize 62
> > >   location key (388547 INODE_ITEM 0) type FILE
> > >   namelen 32 datalen 0 name:
> > > intltool-0.51.0-1-any.pkg.tar.xz item 111 key (261 DIR_ITEM
> > > 81211850) itemoff 11344 itemsize 131133  
> > 
> > itemsize 131133 == 0x2003d is a clear bitflip, 0x3d == 61,
> > corresponds to the expected item size.
> > 
> > There's possibly other random bitflips in the keys or other
> > structures. It's hard to estimate the damage and thus the scope of
> > restorable data.  
> 
> It makes sense since this's a ssd we may have only one copy for
> metadata.
> 
> Thanks,
> 
> -liubo

From this point of view it doesn't make sense to store only one copy
of meta data on SSD... Taking the other garbage into account, the bit
flip probably happened in RAM, so dup meta data could have helped here.

If the SSD firmware collapses duplicate meta data into single blobs,
that's perfectly fine. If the dup meta data arrives with bits flipped,
it won't be deduplicated. So this is fine, too.

BTW: I cannot believe that SSD firmware really does the quite expensive
job of deduplication, other than maybe internal compression. Maybe
there are some drives out there that do, but most won't deduplicate.
It's just too little gain for too much complexity. So I personally
would always switch on duplicate meta data even on SSD. It shouldn't
add too much to wear if you apply the usual SSD optimizations anyway
(like noatime).
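
As a sketch of what that means in practice (device and mount point are
made up):

  # at mkfs time, request duplicated metadata explicitly
  mkfs.btrfs -m dup -d single /dev/sdX
  # or convert the metadata profile of an existing filesystem to dup
  btrfs balance start -mconvert=dup /mnt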

PS: I suggest doing an extensive memtest86 run before trying any
repairs on this system... Are you perhaps mixing different DIMM models
in dual-channel slots? Most of the times I've seen bitflips, that was
the culprit...

-- 
Regards,
Kai

Replies to list-only preferred.


--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

