Re: [RFC] Preliminary BTRFS Encryption

2016-09-19 Thread Zygo Blaxell
On Mon, Sep 19, 2016 at 07:50:07PM +, Alex Elsayed wrote:
> > That would be true if the problem were not already long solved in btrfs.
> > The 32-bit CRC tree stores 4 bytes per block separately and efficiently.
> > With minor changes it can store a 32-byte HMAC for each block.
> 
> I disagree that this "solves" it - in particular, the fact that the fsck 
> tool supports dropping/regenerating the extent tree is wildly unsafe in 
> the face of this.

Those fsck features should no longer work on the AEAD tree (or would
require the keys to work if there was enough filesystem left to salvage).

> For an AEAD that lacks nonce-misuse-resistance, it's "merely" downgrading 
> security from AEAD to simple encryption (GCM, for instance, becomes 
> exactly CTR). This would be almost okay (it's a fsck tool, after all), 
> but the fact that it's a fsck tool makes the next part worse.
> 
> In the case of nonce-misuse-resistant AEAD, it's much worse: Dropping the 
> checksum tree would permanently and irrevocably corrupt every single 
> extent, with no data recoverable at all. This is the _exact_ opposite of 
> _anything_ you would _ever_ want a fsck tool to do.

So...don't put those features in fsck?

In my experience, if you're dropping the checksum or especially the
extent tree, your filesystem is already so badly damaged you might as
well mkfs+restore the filesystem.  It would take longer to reverify the
data at the application level, or to compare it with the last backup,
than to restore.

An AEAD tree would just be like that, except there's no point in even
offering the option.  It would just be "rebuilding the AEAD tree will
erase all your encrypted data, leaving only plaintext data on the
filesystem if you had any, are you very sure about this y/N"

> This is, fundamentally, the problem with treating an "auth tag" as a 
> separate thing: It's only separate at all in weaker systems, and the act 
> of separating the data induces incredibly nasty failure modes.
> 
> It gets even worse if you consider _why_ that option exists for the fsck 
> tool: Because of the possibility that the _structure_ of the checksum 
> tree becomes corrupted. As a result, two bit-flips (one for each 
> duplicate of the metadata) would be entirely capable of irrevocably 
> destroying _all encrypted data on the FS_.

That event already destroys a btrfs filesystem, even without encryption.
btrfs already includes much of the verification process of a Merkle tree,
with weak checksums and no auth.  Currently, if you lose both copies of an
interior tree node, it is only possible to recover the filesystem offline
by brute-force search of the metadata.  It's one of the reasons why it's
so important to have duplicate metadata even on a single disk.

The only difference with encryption is that recovery would be
theoretically impossible instead of just practically infeasible.

> Separating the "auth tag" - simply considering an "auth tag" a separate 
> thing from the overall ciphertext - is a dangerous thing to do.
> 
> >> If you're _not_ using a nonce-misuse-resistant AEAD, it's even worse:
> >> keeping the tag out-of-band makes it far too easy to fail to verify it,
> >> or verify it only after decrypting the ciphertext to plaintext.
> >> Bluntly: that is an immediate security vulnerability.
> >> 
> >> tl;dr: Don't encrypt pages, encrypt extents. They grow a little for the
> >> auth tag, and that's fine.
> >> 
> >> Btrfs already handles needing to read the full extent in order to get a
> >> page out of it with compression, anyway.
> > 
> > It does, but compressed extents are limited to 128K.  Uncompressed
> > extents come in sizes up to 128M, far too large to read in their
> > entirety for many applications.
> 
> Er, yes, and? Just as compressed extents have a different cap for reasons 
> of practicality, so too can encrypted extents.

...which would mean very inefficient space usage for short extents.


signature.asc
Description: Digital signature


Re: multi-device btrfs with single data mode and disk failure

2016-09-19 Thread Alexandre Poux


On 15/09/2016 at 23:54, Chris Murphy wrote:
> On Thu, Sep 15, 2016 at 3:48 PM, Alexandre Poux  wrote:
>> On 15/09/2016 at 18:54, Chris Murphy wrote:
>>> On Thu, Sep 15, 2016 at 10:30 AM, Alexandre Poux  wrote:
 Thank you very much for your answers

 On 15/09/2016 at 17:38, Chris Murphy wrote:
> On Thu, Sep 15, 2016 at 1:44 AM, Alexandre Poux  wrote:
>> Is it possible to do some kind of a "btrfs delete missing" on this
>> kind of setup, in order to recover access in rw to my other data, or
>> must I copy all my data to a new partition?
> That *should* work :) Except that your file system with 6 drives is
> too full to be shrunk to 5 drives. Btrfs will either refuse, or get
> confused, about how to shrink a nearly full 6 drive volume into 5.
>
> So you'll have to do one of three things:
>
> 1. Add a 2+TB drive, then remove the missing one; OR
> 2. btrfs replace is faster and is raid10 reliable; OR
> 3. Read only scrub to get a file listing of bad files, then remount
> read-write degraded and delete them all. Now you maybe can do a device
> delete missing. But it's still a tight fit, it basically has to
> balance things out to get it to fit on an odd number of drives, it may
> actually not work even though there seems to be enough total space,
> there has to be enough space on FOUR drives.
>
 Are you sure you are talking about data in single mode ?
 I don't understand why you are talking about raid10,
 or the fact that it will have to rebalance everything.
>>> Yeah sorry I got confused in that very last sentence. Single, it will
>>> find space in 1GiB increments. Of course this fails because that data
>>> doesn't exist anymore, but to start the operation it needs to be
>>> possible.
>> No problem
 Moreover, even in degraded mode I cannot mount it in rw
 It tells me
 "too many missing devices, writeable remount is not allowed"
 due to the fact I'm in single mode.
>>> Oh you're in that trap. Well now you're stuck. I've had the case where
>>> I could mount read write degraded with metadata raid1 and data single,
>>> but it was good for only one mount, and then I got the same message you
>>> get and it was only possible to mount read only. At that point you're
>>> totally stuck unless you're adept at manipulating the file system with
>>> a hex editor...
>>>
>>> Someone might have a patch somewhere that drops this check and lets
>>> it mount anyway with too many missing devices... I seem to recall this.
>>> It'd be in the archives if it exists.
>>>
>>>
>>>
 And as far as I know, btrfs replace and btrfs delete are not supposed
 to work in read only...
>>> It doesn't. Must be read write mounted.
>>>
>>>
 I would like to tell it to forget about the missing data, and give me back
 my partition.
>>> This feature doesn't exist yet. I really want to see this, it'd be
>>> great for ceph and gluster if the volume could lose a drive, report
>>> all the missing files to the cluster file system, delete the device
>>> and the file references, and then the cluster knows that brick doesn't
>>> have those files and can replicate them somewhere else or even back to
>>> the brick that had them.
>>>
>> So I found this patch : https://patchwork.kernel.org/patch/7014141/
>>
>> Does this seems ok ?
> No idea I haven't tried it.
>
>> So after patching my kernel with it,
>> I should be able to mount in rw my partition, and thus,
>> I will be able to do a btrfs delete missing
>> Which will just forget about the old disk and everything should be fine
>> afterward ?
> It will forget about the old disk but it will try to migrate all
> metadata and data that was on that disk to the remaining drives; so
> until you delete all files that are corrupt, you'll continue to get
> corruption messages about them.
>
>> Is this risky ? or not so much ?
> Probably. If you care about the data, mount read only, back up what
> you can, then see if you can fix it after that.
>
>> The scrubbing is almost finished, and as I was expecting, I lost no data
>> at all.
> Well I'd guess the device delete should work then, but I still have no
> idea if that patch will let you mount it degraded read-write. Worth a
> shot though, it'll save time.
>
OK, so I found some time to work on it.

I decided to do some tests in a vm (virtualbox) with 3 disks
after making an array with 3 disks, metadata in raid1 and data in single,
I removed one disk to reproduce my situation.

I tried the patch and, after updating it (nothing fancy),
I can indeed mount a degraded partition with data in single.

But I can't remove the device :
#btrfs device remove missing /mnt
ERROR: error removing device 'missing': Input/output error
or
#btrfs device remove 2 /mnt
ERROR: error removing devid 2: Input/output error

replace doesn't work either
btrfs replace start -B 2 /dev/sdb /mnt
BTRFS error (device 

Message!

2016-09-19 Thread Ko may l



Hello.

Good evening, and how are you doing? Just a quick note: I have an official
opportunity I would like to discuss with you in private.

I would appreciate a quick reply to my personal private e-mail address for
further communication.


Regards,
Mrs. Ko majus Leung
e-mail: komayln...@gmail.com
Deputy Chairman, Managing Director
and Chief Executive of Chong Hing Bank Limited


Re: Is stability a joke?

2016-09-19 Thread Hans van Kranenburg
On 09/19/2016 05:38 PM, David Sterba wrote:
> On Mon, Sep 12, 2016 at 01:31:42PM -0400, Austin S. Hemmelgarn wrote:
>> [...] A lot of stuff that may seem obvious to us after years of 
>> working with BTRFS isn't going to be to a newcomer, and it's a lot more 
>> likely that some random person will get things right if we have a good, 
>> central BCP document than if it stays as scattered tribal knowledge.
> 
> The IRC tribe answers the same newcomer questions over and over, which
> is fine for the interaction itself, but if all that had also ended up in
> the wiki we'd have had perfect documentation years ago.

Yes, it's not the first time I'm thinking "wow, this #btrfs irc log I
have here is a goldmine of very useful information". Transforming it
into concise usable text on a wiki is a lot of work, but there's
certainly a "turnover" point that can be reached quite fast (I guess).

OTOH, the same happens on the mailing list, where I also see lots of
similar things answered over and over again, and a lot of treasures
being buried and forgotten.

> Also the current status
> of features and bugs is kept in the IRC-hive-mind yet it still needs
> some other way to actually make it appear on wiki. Edit with courage!

Oh, right there at the end, I expected: Join #btrfs on freenode IRC! :-D

-- 
Hans van Kranenburg


spurious call trace during send

2016-09-19 Thread Christoph Anton Mitterer
Hey.

FYI:

Just got this call trace during a send/receive (with -p) between two
btrfs on 4.7.0.

Neither btrfs-send nor -receive showed an error though, and both seem to have
completed successfully (at least a diff of the changes implied that).


Sep 19 20:24:38 heisenberg kernel: BTRFS info (device dm-2): disk space caching 
is enabled
Sep 19 20:25:53 heisenberg kernel: [ cut here ]
Sep 19 20:25:53 heisenberg kernel: WARNING: CPU: 2 PID: 24266 at 
/build/linux-m2Twzh/linux-4.7.2/fs/btrfs/send.c:5964 
btrfs_ioctl_send+0x537/0x1260 [btrfs]
Sep 19 20:25:53 heisenberg kernel: Modules linked in: udp_diag tcp_diag 
inet_diag algif_skcipher af_alg uas hmac drbg ansi_cprng ctr ccm vhost_net 
vhost macvtap macvlan xt_CHECKSUM iptable_mangle ipt_MASQUERADE 
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat xt_tcpudp tun bridge stp 
llc fuse ebtable_filter ebtables joydev rtsx_pci_ms rtsx_pci_sdmmc memstick 
mmc_core cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_conservative 
iTCO_wdt iTCO_vendor_support intel_rapl x86_pkg_temp_thermal intel_powerclamp 
coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel psmouse uvcvideo videobuf2_vmalloc videobuf2_memops 
videobuf2_v4l2 videobuf2_core videodev media btusb pcspkr btrtl btbcm btintel 
bluetooth crc16 sg rtsx_pci arc4 iwldvm mac80211 iwlwifi cfg80211 rfkill 
snd_hda_codec_hdmi snd_hda_codec_realtek
Sep 19 20:25:53 heisenberg kernel:  snd_hda_codec_generic fjes i915 tpm_tis tpm 
battery fujitsu_laptop ac i2c_i801 snd_hda_intel video snd_hda_codec lpc_ich 
mfd_core button snd_hda_core snd_hwdep drm_kms_helper snd_pcm shpchp snd_timer 
e1000e snd drm soundcore mei_me ptp i2c_algo_bit pps_core mei ip6t_REJECT 
nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables 
xt_policy ipt_REJECT nf_reject_ipv4 xt_comment nf_conntrack_ipv4 nf_defrag_ipv4 
xt_multiport xt_conntrack nf_conntrack iptable_filter binfmt_misc loop sunrpc 
parport_pc ppdev lp parport ip_tables x_tables autofs4 dm_crypt dm_mod raid10 
raid456 libcrc32c async_raid6_recov async_memcpy async_pq async_xor async_tx 
raid1 raid0 multipath linear md_mod btrfs crc32c_generic xor raid6_pq uhci_hcd 
usb_storage sd_mod crc32c_intel ahci libahci aesni_intel
Sep 19 20:25:53 heisenberg kernel:  libata aes_x86_64 xhci_pci glue_helper lrw 
xhci_hcd gf128mul ablk_helper cryptd ehci_pci ehci_hcd scsi_mod evdev usbcore 
serio_raw usb_common
Sep 19 20:25:53 heisenberg kernel: CPU: 2 PID: 24266 Comm: btrfs Not tainted 
4.7.0-1-amd64 #1 Debian 4.7.2-1
Sep 19 20:25:53 heisenberg kernel: Hardware name: FUJITSU LIFEBOOK 
E782/FJNB23E, BIOS Version 1.11 05/24/2012
Sep 19 20:25:53 heisenberg kernel:  0286 7d5ad1ff 
aff16655 
Sep 19 20:25:53 heisenberg kernel:   afc7895e 
8802d1ebf42c 7fff6fcb1800
Sep 19 20:25:53 heisenberg kernel:  8800c15da000 40489426 
8802d1ebf000 8803da842100
Sep 19 20:25:53 heisenberg kernel: Call Trace:
Sep 19 20:25:53 heisenberg kernel:  [] ? dump_stack+0x5c/0x77
Sep 19 20:25:53 heisenberg kernel:  [] ? __warn+0xbe/0xe0
Sep 19 20:25:53 heisenberg kernel:  [] ? 
btrfs_ioctl_send+0x537/0x1260 [btrfs]
Sep 19 20:25:53 heisenberg kernel:  [] ? 
intel_pstate_update_util+0x1be/0x320
Sep 19 20:25:53 heisenberg kernel:  [] ? 
__memcg_kmem_get_cache+0x48/0x150
Sep 19 20:25:53 heisenberg kernel:  [] ? 
kmem_cache_alloc+0x149/0x560
Sep 19 20:25:53 heisenberg kernel:  [] ? 
attach_task_cfs_rq+0x3b/0x70
Sep 19 20:25:53 heisenberg kernel:  [] ? 
btrfs_ioctl+0x8f8/0x2300 [btrfs]
Sep 19 20:25:53 heisenberg kernel:  [] ? 
cpumask_next_and+0x2a/0x40
Sep 19 20:25:53 heisenberg kernel:  [] ? 
enqueue_task_fair+0x5d/0x960
Sep 19 20:25:53 heisenberg kernel:  [] ? sched_clock+0x5/0x10
Sep 19 20:25:53 heisenberg kernel:  [] ? 
check_preempt_curr+0x50/0x90
Sep 19 20:25:53 heisenberg kernel:  [] ? 
do_vfs_ioctl+0x9e/0x5d0
Sep 19 20:25:53 heisenberg kernel:  [] ? _do_fork+0x14d/0x3f0
Sep 19 20:25:53 heisenberg kernel:  [] ? SyS_ioctl+0x74/0x80
Sep 19 20:25:53 heisenberg kernel:  [] ? 
system_call_fast_compare_end+0xc/0x96
Sep 19 20:25:53 heisenberg kernel: ---[ end trace b61b956fbe6451d3 ]---


Any ideas?

Cheers,
Chris.

smime.p7s
Description: S/MIME cryptographic signature


Re: stability matrix

2016-09-19 Thread Chris Mason



On 09/19/2016 04:36 PM, Christoph Anton Mitterer wrote:

On Mon, 2016-09-19 at 16:07 -0400, Chris Mason wrote:

That's in the blockdev command (blockdev --setro /dev/xxx).

Well, I know that ;-) ... but I bet most end-users don't (just as most
end-users assume mount -r is truly ro)...



It's a tradeoff: without log replay, traditional filesystems wouldn't
be able to mount at all after a crash.  Since most init systems default
to ro at the beginning, it would have been awkward to introduce the
logging filesystems into established systems.


16+ years later, I still feel it's the path of least surprise.

-chris


Re: Experimental btrfs encryption

2016-09-19 Thread Alex Elsayed
On Mon, 19 Sep 2016 11:15:18 -0400, Theodore Ts'o wrote:

> (I'm not on linux-btrfs@, so please keep me on the cc list.  Or perhaps
> better yet, maybe we can move discussion to the linux-fsdevel@
> list.)

I apologize if this doesn't keep you in the CC, as I'm posting via gmane.

> Hi Anand,
> 
> After reading this thread on the web archives, and seeing that some
> folks seem to be a bit confused about "vfs level crypto", fs/crypto,
> and ext4/f2fs encryption, I thought I would give a few comments.
> 
> First of all, these are all the same thing.  Initially ext4 encryption
> was implemented targetting ChromeOS as the initial customer, and as a
> replacement for ecryptfs.  Folks have already pointed you at the design
> document[1].  Also of interest is the 2015 Linux Security
> Symposium slide set[2].  The first deployed use of this was for Android
> N's File-based Encryption and Direct boot[3]; a technical description
> which left out some of the product details (since LSS 2016 was before
> the Android N release) can be found at the 2016 LSS slides[4].
> 
> [1] https://docs.google.com/document/d/1ft26lUQyuSpiu6VleP70_npaWdRfXFoNnB8JYnykNTg/preview
> [2] http://kernsec.org/files/lss2014/Halcrow_EXT4_Encryption.pdf
> [3] https://android.googleblog.com/2016/08/android-70-nougat-more-powerful-os-made.html
> [4] http://kernsec.org/files/lss2015/halcrow.pdf
> 
> The other thing that perhaps would be worth noting is that Michael
> Halcrow started this as an encryption/security expert who had dabbled in
> file systems, while I was someone for whom encryption/security is a
> hobby (although in a previous life I was the tech lead for Kerberos and
> chaired the IPSEC working group) who was a file system expert.  In order
> to do file system security well, you need people who are well versed in
> both disciplines working together.
> 
> With all due respect, the fact that you chose counter mode and how you
> used it pretty clearly demonstrates that you would be well advised to
> find someone who is a crypto expert to collaborate with you --- or use
> the fs/crypto framework since it was designed and vetted by multiple
> crypto experts as well as file system experts.

100% agreed on the former, and mostly agreed on the latter (though I feel 
that even applying fs/crypto to btrfs includes sufficient novelty as to 
require very careful review by crypto experts).

> Having someone who is a product manager who can discuss with you
> specific goals is also important, because there are lots of tradeoffs
> and lots of design choices  and so what you chose to do is (or at
> least should be!)  very much dependent on your threat model, who is
> planning on using the feature, what you can and cannot count on vis-a-vis
> hardware support, performance requirements, and so on.
> 
> 
> Secondly, in terms of how it all works.  Each user has a "master key"
> which is stored on a keyring.  We use a hash of the key to serve as the
> key identifier, and associated with each inode we store a nonce (a
> random unique string) and the key identifier.  We use the nonce and the
> user's master key to generate a unique key for that inode.

As noted in my discussions with Zygo Blaxell, this is one of the places 
where applying fs/crypto to btrfs without careful reexamination would 
fail badly - using the inode will not work.

> That key is used to protect the contents of the data file, and to
> encrypt filenames and symlink targets --- since filenames can leak
> significant information about what the user is doing.  (For example,
> in the downloads directory of their web browser, leaking filenames is
> just as good as leaking part of their browsing history.)
>
> As far as using the fs/crypto infrastructure, it's actually pretty
> simple.  The file system needs to provide a flag indicating whether or
> not the file is encrypted, and support extended attributes.  When you
> create an inode in an encrypted directory, you call
> fscrypt_inherit_context() and the fscrypto layer will take care of
> creating the necessary xattr for the per-inode key.  When you need to open
> an encrypted file, or operate on an encrypted inode, you call
> fscrypt_get_encryption_info() on the inode.  The per-inode encryption
> key is cached in the i_crypt_info structure, which hangs off of the
> struct inode.

When someone says "pretty simple" regarding cryptography, it's often 
neither pretty nor simple :P

The issue, here, is that inodes are fundamentally not a safe scope to 
attach that information to in btrfs. As extents can be shared between 
inodes (and thus both will need to decrypt them), and inodes can be 
duplicated unmodified (snapshots), attaching keys and nonces to inodes 
opens up a whole host of (possibly insoluble) issues, including 
catastrophic nonce reuse via writable snapshots.

> When you write to an encrypted file, you call fscrypt_encrypt_page(),
> which returns a struct page with the encrypted contents to be written.
> After the write is 

Re: stability matrix

2016-09-19 Thread Christoph Anton Mitterer
On Mon, 2016-09-19 at 16:07 -0400, Chris Mason wrote:
> That's in the blockdev command (blockdev --setro /dev/xxx).
Well, I know that ;-) ... but I bet most end-users don't (just as most
end-users assume mount -r is truly ro)...

At least this is nowadays documented in the mount manpage... so in a
way one can of course argue: if the user can't read you can't help him
anyway... :)

Cheers,
Chris.

smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] btrfs: Fix handling of -ENOENT from btrfs_uuid_iter_rem

2016-09-19 Thread David Sterba
On Mon, Sep 19, 2016 at 02:49:41PM -0400, Chris Mason wrote:
> On 09/19/2016 02:13 PM, David Sterba wrote:
> > On Wed, Sep 07, 2016 at 10:38:58AM +0300, Nikolay Borisov wrote:
> >> btrfs_uuid_iter_rem is able to return -ENOENT, however this condition
> >> is not handled in btrfs_uuid_tree_iterate which can lead to calling
> >> btrfs_next_item with freed path argument, leading to a null pointer
> >> dereference. Fix it by redoing the search but with an incremented
> >> objectid so we don't loop over the same key.
> >>
> >> Signed-off-by: Nikolay Borisov 
> >> Suggested-by: Chris Mason 
> >> Link: https://lkml.kernel.org/r/57a473b0.2040...@kyup.com
> >
> > I'll queue the patch for 4.9, thanks.
> >
> 
> Not having a good test for this kept me from trying the patch cold.  I 
> think bumping the objectid will end up missing items.

Ok, so I can keep it in the branches that are not for the upcoming
merges but still in for-next.


Re: Is stability a joke? (wiki updated)

2016-09-19 Thread Zygo Blaxell
On Mon, Sep 19, 2016 at 01:38:36PM -0400, Austin S. Hemmelgarn wrote:
> >>I'm not sure if the brfsck is really all that helpful to user as much
> >>as it is for developers to better learn about the failure vectors of
> >>the file system.
> >
> >ReiserFS had no working fsck for all of the 8 years I used it (and still
> >didn't last year when I tried to use it on an old disk).  "Not working"
> >here means "much less data is readable from the filesystem after running
> >fsck than before."  It's not that much of an inconvenience if you have
> >backups.
> For a small array, this may be the case.  Once you start looking into double
> digit TB scale arrays though, restoring backups becomes a very expensive
> operation.  If you had a multi-PB array with a single dentry which had no
> inode, would you rather be spending multiple days restoring files and
> possibly losing recent changes, or spend a few hours to check the filesystem
> and fix it with minimal data loss?

I'd really prefer to be able to delete the dead dentry with 'rm' as root,
or failing that, with a ZDB-like tool or ioctl, if it's the only known
instance of such a bad metadata object and I already know where it's
located.

Usually the ultimate failure mode of a btrfs filesystem is a read-only
filesystem from which you can read most or all of your data, but you
can't ever make it writable again because of fsck limitations.

The one thing I do miss about every filesystem that isn't ext2/ext3 is
automated fsck that prioritizes availability, making the filesystem
safely writable even if it can't recover lost data.  On the other
hand, fixing an ext[23] filesystem is utterly trivial compared to any
btree-based filesystem.



signature.asc
Description: Digital signature


Re: [RFC] Preliminary BTRFS Encryption

2016-09-19 Thread Alex Elsayed
On Mon, 19 Sep 2016 14:08:06 -0400, Zygo Blaxell wrote:

> On Sat, Sep 17, 2016 at 06:37:16AM +, Alex Elsayed wrote:
>> > Encryption in ext4 is a per-directory-tree affair. One starts by
>> > setting an encryption policy (using an ioctl() call) for a given
>> > directory, which must be empty at the time; that policy includes a
>> > master key used for all files and directories stored below the target
>> > directory. Each individual file is encrypted with its own key, which
>> > is derived from the master key and a per-file random nonce value
>> > (which is stored in an extended attribute attached to the file's
>> > inode). File names and symbolic links are also encrypted.
> 
> Probably the simplest way to map this to btrfs is to move the nonce from
> the inode to the extent.

I agree. Mostly, I was making a point about how the ext4/VFS code (which 
_does_ put it on the inode) can't just be transported over to btrfs 
unchanged, which is what I read Dave Chinner as advocating.

> Inodes aren't unique within a btrfs filesystem, extents can be shared by
> multiple inodes, and a single extent can appear multiple times in the
> same inode at different offsets.  Attaching the nonce to the inode would
> not be sufficient to read the extent in all but the special case of a
> single reference at the original offset where it was written, and it
> also leads to the replay problems with duplicate inodes you pointed out.

Yup.

> Extents in a btrfs filesystem are unique and carry their own attributes
> (e.g. compression format, checksums) and reference count.  They can
> easily carry a reference to an encryption policy object and a nonce
> attribute.

Definitely agreed.

> Nonces within metadata are more complicated.  btrfs doesn't have
> directory files like ext4 does, so it doesn't get directory filename
> encryption for free with file encryption.  Encryption could be done
> per-item in the metadata trees, but in the special case of directories
> that happen to the the roots of subvols, it would be possible to encrypt
> entire pages of metadata at a time (with the caveat that a snapshot
> would require shared encryption policy between the origin and snapshot
> subvols).

Encrypting tree values per-item is actually one of the best arguments in 
_favor_ of nonce-misuse-resistant AEAD. Its security notion is very, very 
strong:

If a (key, nonce, associated data, message) tuple is repeated, the only 
data an attacker can discover is the fact that the two ciphertexts have 
the same value (a one-bit leak).

In other words, if you encrypt each value in the b-tree with some key, 
some nonce, use the b-tree key as the associated data, and use the value 
as the message, you get a _very_ secure system against a _very_ wide 
variety of attacks - essentially for free. And all _without_ sacrificing 
flexibility, as one could use distinct (crypto) keys for distinct (b-
tree) keys.
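
To make that concrete, here is a rough userspace sketch (hypothetical, not
btrfs code) of encrypting one b-tree item with the item's b-tree key passed
as associated data. AES-256-GCM via OpenSSL merely stands in for the
nonce-misuse-resistant mode argued for above; a real design would use an
SIV-style construction, and the function name is made up:

#include <openssl/evp.h>

/* Returns ciphertext length on success, -1 on error.  'out' must have
 * room for len bytes of ciphertext plus the 16-byte tag appended. */
static int encrypt_btree_item(const unsigned char key[32],
                              const unsigned char nonce[12],
                              const unsigned char *btree_key, int bk_len,
                              const unsigned char *item, int len,
                              unsigned char *out)
{
        EVP_CIPHER_CTX *ctx = EVP_CIPHER_CTX_new();
        int n, total = 0;

        if (!ctx)
                return -1;
        if (EVP_EncryptInit_ex(ctx, EVP_aes_256_gcm(), NULL, key, nonce) != 1)
                goto err;
        /* Associated data: the (objectid, type, offset) b-tree key. */
        if (EVP_EncryptUpdate(ctx, NULL, &n, btree_key, bk_len) != 1)
                goto err;
        if (EVP_EncryptUpdate(ctx, out, &n, item, len) != 1)
                goto err;
        total = n;
        if (EVP_EncryptFinal_ex(ctx, out + total, &n) != 1)
                goto err;
        total += n;
        /* Append the 16-byte authentication tag to the ciphertext. */
        if (EVP_CIPHER_CTX_ctrl(ctx, EVP_CTRL_GCM_GET_TAG, 16, out + total) != 1)
                goto err;
        EVP_CIPHER_CTX_free(ctx);
        return total + 16;
err:
        EVP_CIPHER_CTX_free(ctx);
        return -1;
}

Tampering with either the stored b-tree key or the ciphertext then makes
decryption fail.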

(You still need something for protecting the _structure_ of the B-tree, 
but that's a different issue).

> This is what makes keys at the subvol root level so attractive.

Pretty much.

>> So there isn't quite a "subvol key" in the VFS approach - each
>> directory has a key, and there are derived keys for the entries below
>> it. (I'll note that this framing does not address shared extents _at
>> all_, and would love to have clarification on that).
> 
> Files are modified by creating new extents (using parameters inherited
> from the inode to fill in the extent attributes) and updating the inode
> to refer to the new extent instead of the old one at the modified
> offset. Cloned extents are references to existing extents associated
> with a different inode or at a different place within the same inode (if
> the extent is not compatible with the destination inode, clone fails
> with an error).  A snapshot is an efficient way to clone an entire
> subvol tree at once, including all inodes and attributes.

There is the caveat of chattr +C, which would need to be hard-disabled for 
extent-level encryption (vs block level).

> Inode attributes and extent attributes can sometimes conflict,
> especially during a clone operation.  Encryption attributes could become
> one of these cases (i.e. to prevent an extent from one encryption policy
> from being cloned to an inode under a different encryption policy).

That is a good approach.

>> > I don't see how snapshots could work, writable or otherwise, without
>> > separating the key identity from the subvol identity and having a
>> > many-to-one relationship between subvols and keys.  The extents in
>> > each subvol would be shared, and they'd be encrypted with a single
>> > secret, so there's not really another way to do this.
>> 
>> That's not the issue. The issue is that, assuming the key stays the
>> same,
>> then a user could quite possibly create a snapshot, write into both the
>> original and the snapshot, causing encryption to occur twice with the
>> same key, same nonce, and different data.
> 
> If the 

Re: stability matrix

2016-09-19 Thread Chris Mason



On 09/19/2016 03:52 PM, Christoph Anton Mitterer wrote:

On Mon, 2016-09-19 at 13:18 -0400, Austin S. Hemmelgarn wrote:

- even mounting a fs ro, may cause it to be changed


This would go to the UseCases

My same argument about the UUID issues applies here, just without the
security aspect.


I personally could agree to have that "just" in the usecases.

That a fs may be changed even though it's mounted ro is not unique to
btrfs and the need for not having that happen goes probably rather
into data-forensics and rescue use cases.

IMO there's rather a general problem, namely that the different
filesystems don't provide a mount option that implies every other mount
option currently needed to get an actual "hard ro", i.e. one where the
device is never written to.

Qu was about to add such an option when nologreplay was added, but IIRC he
got some resistance from linux-fs, who probably didn't care enough
whether the end-user can easily do such "hard ro" mount ;)




That's in the blockdev command (blockdev --setro /dev/xxx).
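
Under the hood that is a single ioctl on the device node; a minimal sketch
of what blockdev --setro does:

/*
 * Equivalent of "blockdev --setro /dev/xxx": mark the block device
 * read-only in the kernel via BLKROSET, so no write reaches the device
 * regardless of how the filesystem is mounted afterwards.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
        int ro = 1;     /* 1 = read-only, 0 = read-write again */
        int fd;

        if (argc != 2) {
                fprintf(stderr, "usage: %s /dev/xxx\n", argv[0]);
                return 1;
        }
        fd = open(argv[1], O_RDONLY);
        if (fd < 0 || ioctl(fd, BLKROSET, &ro) < 0) {
                perror("BLKROSET");
                return 1;
        }
        return 0;
}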

We actually try to maintain the established norms where it doesn't 
conflict with the btrfs use cases.  This is one of them ;)


-chris



Re: stability matrix

2016-09-19 Thread Christoph Anton Mitterer
On Mon, 2016-09-19 at 13:18 -0400, Austin S. Hemmelgarn wrote:
> > > - even mounting a fs ro, may cause it to be changed
> > 
> > This would go to the UseCases
> My same argument about the UUID issues applies here, just without the
> security aspect.

I personally could agree to have that "just" in the usecases.

That a fs may be changed even though it's mounted ro is not unique to
btrfs and the need for not having that happen goes probably rather
into data-forensics and rescue use cases.

IMO there's rather a general problem, namely that the different
filesystems don't provide a mount option that implies every other mount
option currently needed to get an actual "hard ro", i.e. one where the
device is never written to.

Qu was about to add such an option when nologreplay was added, but IIRC he
got some resistance from linux-fs, who probably didn't care enough
whether the end-user can easily do such "hard ro" mount ;)


Cheers,
Chris.

smime.p7s
Description: S/MIME cryptographic signature


Re: [RFC] Preliminary BTRFS Encryption

2016-09-19 Thread Alex Elsayed
On Mon, 19 Sep 2016 14:57:33 -0400, Zygo Blaxell wrote:

> On Sat, Sep 17, 2016 at 07:13:45AM +, Alex Elsayed wrote:
>> IMO, this is already a flawed framing - in particular, if encrypting at
>> the extent level, one _should not_ be encrypting (or authenticating)
>> individual pages. The meaningful unit is the extent, and encrypting at
>> page granularity puts you right back where dmcrypt is: dealing with
>> fixed-
>> size space, and needing to find somewhere else to put the auth tag.
>> 
>> This is not a good place to be, and I strongly suspect it motivated
>> choosing XTS in the first place - something I feel is an _error_ in the
>> long run, and a dangerous one. (IMO, anything _but_ AEAD should be
>> forbidden in FS-level encryption.)
>> 
>> In a nonce-misuse-resistant AEAD, there _is_ no auth tag: There's some
>> amount of inherent ciphertext expansion, and the ciphertext _cannot be
>> decrypted at all_ unless all of it is present. In essence, a built-in
>> all-
>> or-nothing transform.
>> 
>> You could, potentially, chop off part of that and store it elsewhere,
>> but now you're dealing with significant added complexity, for
>> absolutely zero gain.
> 
> That would be true if the problem were not already long solved in btrfs.
> The 32-bit CRC tree stores 4 bytes per block separately and efficiently.
> With minor changes it can store a 32-byte HMAC for each block.

I disagree that this "solves" it - in particular, the fact that the fsck 
tool supports dropping/regenerating the extent tree is wildly unsafe in 
the face of this.

For an AEAD that lacks nonce-misuse-resistance, it's "merely" downgrading 
security from AEAD to simple encryption (GCM, for instance, becomes 
exactly CTR). This would be almost okay (it's a fsck tool, after all), 
but the fact that it's a fsck tool makes the next part worse.

In the case of nonce-misuse-resistant AEAD, it's much worse: Dropping the 
checksum tree would permanently and irrevocably corrupt every single 
extent, with no data recoverable at all. This is the _exact_ opposite of 
_anything_ you would _ever_ want a fsck tool to do.

This is, fundamentally, the problem with treating an "auth tag" as a 
separate thing: It's only separate at all in weaker systems, and the act 
of separating the data induces incredibly nasty failure modes.

It gets even worse if you consider _why_ that option exists for the fsck 
tool: Because of the possibility that the _structure_ of the checksum 
tree becomes corrupted. As a result, two bit-flips (one for each 
duplicate of the metadata) would be entirely capable of irrevocably 
destroying _all encrypted data on the FS_.

Separating the "auth tag" - simply considering an "auth tag" a separate 
thing from the overall ciphertext - is a dangerous thing to do.

>> If you're _not_ using a nonce-misuse-resistant AEAD, it's even worse:
>> keeping the tag out-of-band makes it far too easy to fail to verify it,
>> or verify it only after decrypting the ciphertext to plaintext.
>> Bluntly: that is an immediate security vulnerability.
>> 
>> tl;dr: Don't encrypt pages, encrypt extents. They grow a little for the
>> auth tag, and that's fine.
>> 
>> Btrfs already handles needing to read the full extent in order to get a
>> page out of it with compression, anyway.
> 
> It does, but compressed extents are limited to 128K.  Uncompressed
> extents come in sizes up to 128M, far too large to read in their
> entirety for many applications.

Er, yes, and? Just as compressed extents have a different cap for reasons 
of practicality, so too can encrypted extents.



Re: stability matrix (was: Is stability a joke?)

2016-09-19 Thread Christoph Anton Mitterer
+1 for all your changes with the following comments in addition...


On Mon, 2016-09-19 at 17:27 +0200, David Sterba wrote:
> That's more like a usecase, that's out of the scope of the tabular
> overview. But we have an existing page UseCases that I'd like to
> transform to a more structured and complete overview of usecases of
> various features, so the UUID collisions would build on top of that
> with "and this could happen if ...".
Well I don't agree here and see it basically like Austin.

It's not that these UUID collisions can only happen in special
circumstances; they can happen in plain normal situations that always
used to work with probably literally each and every fs. (So much for the
accidental corruptions.)

And an attack is probably never "usecase dependent"... it always
depends on the attacker.
And since that seems to be a pretty real attack vector, I'd also say
it's mandatory to quite clearly warn about that deficiency...

TBH, I'm rather surprised that this situation seems to be kinda
"accepted".

I had a chat with CM recently and he implied things might be solved
with encryption.
While this is probably the case for at least some of the described
problems, it rather seems like a workaround:
- why make btrfs encryption mandatory for devices that have only partially
  secured access (e.g. where a system disk with btrfs is not physically
  accessible but a USB port is)?
- what about users who would rather use block device encryption
  instead of fs-level encryption?


> > - in-band dedupe
> >   deduped blocks are IIRC not bitwise compared by the kernel before
> >   de-duping, as is the case with offline dedupe.
> >   Even if this is considered safe by the community... I think users
> >   should be told.
> Only features merged are reflected. And the out-of-band dedupe does a
> full memcmp. See btrfs_cmp_data() called from btrfs_extent_same().
Ah,... I kinda thought it was already merged ... possibly got confused
by the countless patch iterations of it ;)


> > - btrfs check --repair (and others?)
> >   Telling people that this may often cause more harm than good.
> I think userspace tools do not belong to the overview.
Well... I wouldn't mind if there was a btrfs-progs status page... (and
have both link to each other).
OTOH,... the user probably wants one central point where all relevant
info can be found... and not again having to dig through n websites.


> > - even mounting a fs ro, may cause it to be changed
> 
> This would go to the UseCases
Fine for me.


> 
> > 
> > - DB/VM-image like IO patterns + nodatacow + (!)checksumming
> >   + (auto)defrag + snapshots
> >   a)
> >   People typically may have the impression:
> >   btrfs = checksummed => all is guaranteed to be "valid" (or at
> >   least noticed)
> >   However this isn't the case for nodatacow'ed files, which in turn
> >   is kinda "mandatory" for DB/VM-image like IO patterns, cause
> >   otherwise these would fragment too heavily (see (b)).
> >   Contrary to what some people claim, none of the major DBs or VM-image
> >   formats do general checksumming on their own; most don't even support
> >   it, some that do wouldn't do it without app support, and a few "just"
> >   don't do it by default.
> >   Thus one should point people to this situation and the fact that
> >   they may not get this "correctness" guarantee here.
> >   b)
> >   IIRC, it doesn't even help to simply not use nodatacow on such
> >   files and use auto-defrag instead to counter the fragmenting, as
> >   that one doesn't perform too well on large files.
> 
> Same.
Fine for me either way... you already said above you would mention the
nodatacow=>no-checksumming=>no-verification-and-no-raid-repair in the
general section... this is enough for that place.


> > For specific features:
> > - Autodefrag
> >   - didn't that also cause reflinks to be broken up?
> 
> No, and it never has.

Absolutely sure? One year ago I was told that at first too, so I
started using it, but later on some (IIRC) developer said auto-defrag
would also suffer from it.

> > - RAID*
> >   No userland tools for monitoring/etc.
> 
> That's a usability bug.

Well it is and it will probably go away sooner or later... but the
unaware user may not really realise that he actually has to take care
of this by himself for now.
So I thought it would be helpful to have it added.



Best wishes,
Chris.

smime.p7s
Description: S/MIME cryptographic signature


Re: [PATCH] Btrfs: handle quota reserve failure properly

2016-09-19 Thread Jeff Mahoney
On 9/15/16 2:57 PM, Josef Bacik wrote:
> btrfs/022 was spitting a warning for the case that we exceed the quota.  If we
> fail to make our quota reservation we need to clean up our data space
> reservation.  Thanks,
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 03da2f6..d72eaae 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4286,13 +4286,10 @@ int btrfs_check_data_free_space(struct inode *inode, 
> u64 start, u64 len)
>   if (ret < 0)
>   return ret;
>  
> - /*
> -  * Use new btrfs_qgroup_reserve_data to reserve precious data space
> -  *
> -  * TODO: Find a good method to avoid reserve data space for NOCOW
> -  * range, but don't impact performance on quota disable case.
> -  */
> + /* Use new btrfs_qgroup_reserve_data to reserve precious data space. */
>   ret = btrfs_qgroup_reserve_data(inode, start, len);
> + if (ret)
> + btrfs_free_reserved_data_space_noquota(inode, start, len);
>   return ret;
>  }
>  
> 

Tested-by: Jeff Mahoney 

btrfs/022 passes now.

Thanks,

-Jeff

-- 
Jeff Mahoney
SUSE Labs



signature.asc
Description: OpenPGP digital signature


Re: [RFC] Preliminary BTRFS Encryption

2016-09-19 Thread Zygo Blaxell
On Sat, Sep 17, 2016 at 07:13:45AM +, Alex Elsayed wrote:
> IMO, this is already a flawed framing - in particular, if encrypting at 
> the extent level, one _should not_ be encrypting (or authenticating) 
> individual pages. The meaningful unit is the extent, and encrypting at 
> page granularity puts you right back where dmcrypt is: dealing with fixed-
> size space, and needing to find somewhere else to put the auth tag.
> 
> This is not a good place to be, and I strongly suspect it motivated 
> choosing XTS in the first place - something I feel is an _error_ in the 
> long run, and a dangerous one. (IMO, anything _but_ AEAD should be 
> forbidden in FS-level encryption.)
> 
> In a nonce-misuse-resistant AEAD, there _is_ no auth tag: There's some 
> amount of inherent ciphertext expansion, and the ciphertext _cannot be 
> decrypted at all_ unless all of it is present. In essence, a built-in all-
> or-nothing transform.
> 
> You could, potentially, chop off part of that and store it elsewhere, but 
> now you're dealing with significant added complexity, for absolutely zero 
> gain.

That would be true if the problem were not already long solved in btrfs.
The 32-bit CRC tree stores 4 bytes per block separately and efficiently.
With minor changes it can store a 32-byte HMAC for each block.
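
Roughly, the stored per-block value would go from a 4-byte crc32c to a keyed
32-byte MAC. A userspace sketch of the computation (illustration only; the
kernel side would use the kernel crypto API, and the key handling here is an
assumption):

#include <openssl/evp.h>
#include <openssl/hmac.h>

#define BLOCK_SIZE 4096

/* Compute the 32-byte HMAC-SHA256 that would replace the 4-byte crc32c
 * stored per block in the checksum tree.  'auth_key' stands for a
 * filesystem authentication key, however it ends up being managed. */
static void block_hmac(const unsigned char *auth_key, int key_len,
                       const unsigned char block[BLOCK_SIZE],
                       unsigned char mac[32])
{
        unsigned int mac_len = 32;

        HMAC(EVP_sha256(), auth_key, key_len, block, BLOCK_SIZE,
             mac, &mac_len);
}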

> If you're _not_ using a nonce-misuse-resistant AEAD, it's even worse: 
> keeping the tag out-of-band makes it far too easy to fail to verify it, 
> or verify it only after decrypting the ciphertext to plaintext. Bluntly: 
> that is an immediate security vulnerability.
> 
> tl;dr: Don't encrypt pages, encrypt extents. They grow a little for the 
> auth tag, and that's fine.
> 
> Btrfs already handles needing to read the full extent in order to get a 
> page out of it with compression, anyway.

It does, but compressed extents are limited to 128K.  Uncompressed extents
come in sizes up to 128M, far too large to read in their entirety for
many applications.



signature.asc
Description: Digital signature


Re: [PATCH] btrfs: Fix handling of -ENOENT from btrfs_uuid_iter_rem

2016-09-19 Thread Chris Mason

On 09/19/2016 02:13 PM, David Sterba wrote:

On Wed, Sep 07, 2016 at 10:38:58AM +0300, Nikolay Borisov wrote:

btrfs_uuid_iter_rem is able to return -ENOENT, however this condition
is not handled in btrfs_uuid_tree_iterate which can lead to calling
btrfs_next_item with freed path argument, leading to a null pointer
dereference. Fix it by redoing the search but with an incremented
objectid so we don't loop over the same key.

Signed-off-by: Nikolay Borisov 
Suggested-by: Chris Mason 
Link: https://lkml.kernel.org/r/57a473b0.2040...@kyup.com


I'll queue the patch for 4.9, thanks.



Not having a good test for this kept me from trying the patch cold.  I 
think bumping the objectid will end up missing items.


We know it's returning -ENOENT, so it should in theory be enough to just 
goto again_search_slot, assuming that we just raced with the deletion.


-chris


Re: Is stability a joke? (wiki updated)

2016-09-19 Thread Austin S. Hemmelgarn

On 2016-09-19 14:27, Chris Murphy wrote:

On Mon, Sep 19, 2016 at 11:38 AM, Austin S. Hemmelgarn
 wrote:

ReiserFS had no working fsck for all of the 8 years I used it (and still
didn't last year when I tried to use it on an old disk).  "Not working"
here means "much less data is readable from the filesystem after running
fsck than before."  It's not that much of an inconvenience if you have
backups.


For a small array, this may be the case.  Once you start looking into double
digit TB scale arrays though, restoring backups becomes a very expensive
operation.  If you had a multi-PB array with a single dentry which had no
inode, would you rather be spending multiple days restoring files and
possibly losing recent changes, or spend a few hours to check the filesystem
and fix it with minimal data loss?


Yep restoring backups, even fully re-replicating data in a cluster, is
untenable it's so expensive. But even offline fsck is sufficiently
non-scalable that at a certain volume size it's not tenable. 100TB
takes a long time to fsck offline, and is it even possible to fsck 1PB
Btrfs? Seems to me it's another case where, if it were possible to
isolate what tree limbs are sick, just cut them off and report the
data loss rather than consider the whole fs unusable. That's what we
do with living things.

This is part of why I said the ZFS approach is valid.  At the moment 
though, we can't even do that, and to do it properly, we'd need a tool 
to bypass the VFS layer to prune the tree, which is non-trivial in and 
of itself.  It would be nice to have a mode in check where you could say 
'I know this path in the FS has some kind of issue, figure out what's 
wrong and fix it if possible, otherwise optionally prune that branch 
from the appropriate tree'.  On the same note, it would be nice to be 
able to manually restrict it to specific checks (eg, 'check only for 
orphaned inodes', or 'only validate the FSC/FST').  If we were to add 
such functionality, dealing with some minor corruption in a 100TB+ array 
wouldn't be quite as much of an issue.



Re: Is stability a joke? (wiki updated)

2016-09-19 Thread Chris Murphy
On Mon, Sep 19, 2016 at 11:38 AM, Austin S. Hemmelgarn
 wrote:
>> ReiserFS had no working fsck for all of the 8 years I used it (and still
>> didn't last year when I tried to use it on an old disk).  "Not working"
>> here means "much less data is readable from the filesystem after running
>> fsck than before."  It's not that much of an inconvenience if you have
>> backups.
>
> For a small array, this may be the case.  Once you start looking into double
> digit TB scale arrays though, restoring backups becomes a very expensive
> operation.  If you had a multi-PB array with a single dentry which had no
> inode, would you rather be spending multiple days restoring files and
> possibly losing recent changes, or spend a few hours to check the filesystem
> and fix it with minimal data loss?

Yep restoring backups, even fully re-replicating data in a cluster, is
untenable it's so expensive. But even offline fsck is sufficiently
non-scalable that at a certain volume size it's not tenable. 100TB
takes a long time to fsck offline, and is it even possible to fsck 1PB
Btrfs? Seems to me it's another case where, if it were possible to
isolate what tree limbs are sick, just cut them off and report the
data loss rather than consider the whole fs unusable. That's what we
do with living things.


-- 
Chris Murphy


Re: [RFC] Preliminary BTRFS Encryption

2016-09-19 Thread Zygo Blaxell
On Sat, Sep 17, 2016 at 06:37:16AM +, Alex Elsayed wrote:
> > Encryption in ext4 is a per-directory-tree affair. One starts by
> > setting an encryption policy (using an ioctl() call) for a given
> > directory, which must be empty at the time; that policy includes a
> > master key used for all files and directories stored below the target
> > directory. Each individual file is encrypted with its own key, which is
> > derived from the master key and a per-file random nonce value (which is
> > stored in an extended attribute attached to the file's inode). File
> > names and symbolic links are also encrypted.

Probably the simplest way to map this to btrfs is to move the nonce from
the inode to the extent.

Inodes aren't unique within a btrfs filesystem, extents can be shared
by multiple inodes, and a single extent can appear multiple times in the
same inode at different offsets.  Attaching the nonce to the inode would
not be sufficient to read the extent in all but the special case of a
single reference at the original offset where it was written, and it
also leads to the replay problems with duplicate inodes you pointed out.

Extents in a btrfs filesystem are unique and carry their own attributes
(e.g. compression format, checksums) and reference count.  They can easily
carry a reference to an encryption policy object and a nonce attribute.
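
As a purely illustrative sketch (this is not the actual btrfs on-disk
format, and the field names are made up), such an extent-level record could
look like:

#include <stdint.h>

struct encrypted_extent_item {
        uint64_t refs;          /* reference count, as extents have today    */
        uint8_t  compression;   /* existing per-extent attribute             */
        uint8_t  encryption;    /* cipher/mode identifier, 0 = not encrypted */
        uint64_t policy_id;     /* which encryption policy / key object      */
        uint8_t  nonce[16];     /* per-extent nonce, unique per extent       */
        /* per-block checksums or HMACs stay in the checksum tree */
} __attribute__((packed));

The point being that everything needed to decrypt travels with the extent,
so any number of inodes (or snapshots) can reference it without duplicating
or re-deriving per-inode state.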

Nonces within metadata are more complicated.  btrfs doesn't have directory
files like ext4 does, so it doesn't get directory filename encryption
for free with file encryption.  Encryption could be done per-item in the
metadata trees, but in the special case of directories that happen to
be the roots of subvols, it would be possible to encrypt entire pages
of metadata at a time (with the caveat that a snapshot would require
shared encryption policy between the origin and snapshot subvols).
This is what makes keys at the subvol root level so attractive.

> So there isn't quite a "subvol key" in the VFS approach - each directory 
> has a key, and there are derived keys for the entries below it. (I'll 
> note that this framing does not address shared extents _at all_, and 
> would love to have clarification on that).

Files are modified by creating new extents (using parameters inherited
from the inode to fill in the extent attributes) and updating the inode to
refer to the new extent instead of the old one at the modified offset.
Cloned extents are references to existing extents associated with a
different inode or at a different place within the same inode (if the
extent is not compatible with the destination inode, clone fails with
an error).  A snapshot is an efficient way to clone an entire subvol
tree at once, including all inodes and attributes.
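
For reference, that clone operation is what the FICLONERANGE ioctl (since
Linux 4.5; the older btrfs-specific clone ioctl behaves the same way)
exposes to userspace; a minimal sketch:

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/fs.h>

int main(int argc, char **argv)
{
        struct file_clone_range args = { 0 };
        int src, dst;

        if (argc != 3)
                return 1;
        src = open(argv[1], O_RDONLY);
        dst = open(argv[2], O_RDWR | O_CREAT, 0644);
        if (src < 0 || dst < 0)
                return 1;

        args.src_fd = src;        /* share argv[1]'s extents ...        */
        args.src_offset = 0;
        args.src_length = 0;      /* 0 = clone to the end of the source */
        args.dest_offset = 0;     /* ... into argv[2] at offset 0       */

        if (ioctl(dst, FICLONERANGE, &args) < 0) {
                perror("FICLONERANGE");  /* e.g. incompatible inode flags */
                return 1;
        }
        return 0;
}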

Inode attributes and extent attributes can sometimes conflict, especially
during a clone operation.  Encryption attributes could become one of
these cases (i.e. to prevent an extent from one encryption policy from
being cloned to an inode under a different encryption policy).

> > I don't see how snapshots could work, writable or otherwise, without
> > separating the key identity from the subvol identity and having a
> > many-to-one relationship between subvols and keys.  The extents in each
> > subvol would be shared, and they'd be encrypted with a single secret,
> > so there's not really another way to do this.
> 
> That's not the issue. The issue is that, assuming the key stays the same, 
> then a user could quite possibly create a snapshot, write into both the 
> original and the snapshot, causing encryption to occur twice with the 
> same key, same nonce, and different data.

If the extents have nonces (and inodes do not) then this doesn't happen.
A write to either snapshot necessarily creates new extents in all cases
(the nodatacow feature, the only way to modify a data extent in-place,
is disabled when the extent is shared).



signature.asc
Description: Digital signature


Re: [PATCH] btrfs: Fix handling of -ENOENT from btrfs_uuid_iter_rem

2016-09-19 Thread David Sterba
On Wed, Sep 07, 2016 at 10:38:58AM +0300, Nikolay Borisov wrote:
> btrfs_uuid_iter_rem is able to return -ENOENT, however this condition
> is not handled in btrfs_uuid_tree_iterate which can lead to calling
> btrfs_next_item with freed path argument, leading to a null pointer
> dereference. Fix it by redoing the search but with an incremented
> objectid so we don't loop over the same key.
> 
> Signed-off-by: Nikolay Borisov 
> Suggested-by: Chris Mason 
> Link: https://lkml.kernel.org/r/57a473b0.2040...@kyup.com

I'll queue the patch for 4.9, thanks.


Re: [PATCH] Btrfs: handle quota reserve failure properly

2016-09-19 Thread David Sterba
On Fri, Sep 16, 2016 at 09:02:22AM +, Holger Hoffstätte wrote:
> On Thu, 15 Sep 2016 14:57:48 -0400, Josef Bacik wrote:
> 
> > btrfs/022 was spitting a warning for the case that we exceed the quota.  If 
> > we
> > fail to make our quota reservation we need to clean up our data space
> > reservation.  Thanks,
> > 
> > Signed-off-by: Josef Bacik 
> > ---
> >  fs/btrfs/extent-tree.c | 9 +++--
> >  1 file changed, 3 insertions(+), 6 deletions(-)
> > 
> > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> > index 03da2f6..d72eaae 100644
> > --- a/fs/btrfs/extent-tree.c
> > +++ b/fs/btrfs/extent-tree.c
> > @@ -4286,13 +4286,10 @@ int btrfs_check_data_free_space(struct inode 
> > *inode, u64 start, u64 len)
> > if (ret < 0)
> > return ret;
> >  
> > -   /*
> > -* Use new btrfs_qgroup_reserve_data to reserve precious data space
> > -*
> > -* TODO: Find a good method to avoid reserve data space for NOCOW
> > -* range, but don't impact performance on quota disable case.
> > -*/
> > +   /* Use new btrfs_qgroup_reserve_data to reserve precious data space. */
> > ret = btrfs_qgroup_reserve_data(inode, start, len);
> > +   if (ret)
> > +   btrfs_free_reserved_data_space_noquota(inode, start, len);
> > return ret;
> >  }
> >  
> > -- 
> > 2.7.4
> 
> This came up before, though slightly different:
> http://www.spinics.net/lists/linux-btrfs/msg56644.html
> 
> Which version is correct - with or without _noquota ?

Seems that it's the _noquota variant.


Re: [PATCH] Btrfs: kill BUG_ON in do_relocation

2016-09-19 Thread David Sterba
On Thu, Sep 15, 2016 at 02:58:12PM -0400, Chris Mason wrote:
> 
> 
> On 09/15/2016 03:01 PM, Liu Bo wrote:
> > On Wed, Sep 14, 2016 at 11:19:04AM -0700, Liu Bo wrote:
> >> On Wed, Sep 14, 2016 at 01:31:31PM -0400, Josef Bacik wrote:
> >>> On 09/14/2016 01:29 PM, Chris Mason wrote:
> 
> 
>  On 09/14/2016 01:13 PM, Josef Bacik wrote:
> > On 09/14/2016 12:27 PM, Liu Bo wrote:
> >> While updating btree, we try to push items between sibling
> >> nodes/leaves in order to keep height as low as possible.
> >> But we don't memset the original places with zero when
> >> pushing items so that we could end up leaving stale content
> >> in nodes/leaves.  One may read the above stale content by
> >> increasing btree blocks' @nritems.
> >>
> >
> > Ok this sounds really bad.  Is this as bad as I think it sounds?  We
> > should probably fix this like right now right?
> 
>  He's bumping @nritems with a fuzzer I think?  As in this happens when 
>  someone
>  forces it (or via some other bug) but not in normal operations.
> 
> >>>
> >>> Oh ok, if this happens with a fuzzer then this is fine, but I'd rather do
> >>> -EIO so we know this is something bad with the fs.
> >>
> >> -EIO may be more appropriate to be given while reading btree blocks and
> >> checking their validation?
> >
> > Looks like EIO doesn't fit into this case, either, do we have any errno
> > representing 'corrupted filesystem'?
> 
> That's EIO.  Sometimes the EIO is big enough we have to abort, but 
> really the abort is just adding bonus.

I think we misuse EIO where we should really return EFSCORRUPTED, which
is an alias for EUCLEAN (looking at xfs or ext4). EIO should really be a
message that the hardware is bad.
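For reference, neither xfs nor ext4 adds a new errno value for this; they
alias it locally. A minimal sketch of that convention (the validation
helper below is made up for illustration):

#include <linux/types.h>
#include <linux/errno.h>

/* xfs and ext4 define this alias in their private headers. */
#define EFSCORRUPTED	EUCLEAN

/*
 * Hypothetical sanity check: a structurally invalid btree block means a
 * corrupted filesystem, not a failing disk, so report it as such rather
 * than returning -EIO.
 */
static int my_check_nritems(u32 nritems, u32 max_items)
{
	if (nritems > max_items)
		return -EFSCORRUPTED;
	return 0;
}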


Re: [PATCH] btrfs-progs: subvolume verbose delete flag

2016-09-19 Thread David Sterba
On Thu, Sep 15, 2016 at 03:15:50PM -0400, Vincent Batts wrote:
> There was already the logic for verbose output, but the flag parsing did
> not include it.
> 
> Signed-off-by: Vincent Batts 

Applied, thanks. I wonder where the original argument got lost. In
commit 2ed161bd281beca29feebebbc8c4227cc6e918c3 it got added to getopt,
but inside cmd_subvol_create. I'll fix it as well.


Re: Is stability a joke? (wiki updated)

2016-09-19 Thread Austin S. Hemmelgarn

On 2016-09-19 00:08, Zygo Blaxell wrote:

On Thu, Sep 15, 2016 at 01:02:43PM -0600, Chris Murphy wrote:

Right, well I'm vaguely curious why ZFS, as different as it is,
basically take the position that if the hardware went so batshit that
they can't unwind it on a normal mount, then an fsck probably can't
help either... they still don't have an fsck and don't appear to want
one.


ZFS has no automated fsck, but it does have a kind of interactive
debugger that can be used to manually fix things.

ZFS seems to be a lot more robust when it comes to handling bad metadata
(contrast with btrfs-style BUG_ON panics).

When you delete a directory entry that has a missing inode on ZFS,
the dirent goes away.  In the ZFS administrator documentation they give
examples of this as a response in cases where ZFS metadata gets corrupted.

When you delete a file with a missing inode on btrfs, something
(VFS?) wants to check the inode to see if it has attributes that might
affect unlink (e.g. the immutable bit), gets an error reading the
inode, and bombs out of the unlink() before unlink() can get rid of the
dead dirent.  So if you get a dirent with no inode on btrfs on a large
filesystem (too large for btrfs check to handle), you're basically stuck
with it forever.  You can't even rename it.  Hopefully it doesn't happen
in a top-level directory.

ZFS is also infamous for saying "sucks to be you, I'm outta here" when
things go wrong.  People do want ZFS fsck and defrag, but nobody seems
to be bothered much about making those things happen.

At the end of the day I'm not sure fsck really matters.  If the filesystem
is getting corrupted enough that both copies of metadata are broken,
there's something fundamentally wrong with that setup (hardware bugs,
software bugs, bad RAM, etc) and it's just going to keep slowly eating
more data until the underlying problem is fixed, and there's no guarantee
that a repair is going to restore data correctly.  If we exclude broken
hardware, the only thing btrfs check is going to repair is btrfs kernel
bugs...and in that case, why would we expect btrfs check to have fewer
bugs than the filesystem itself?
I wouldn't, but I would still expect to have some tool to deal with 
things like orphaned inodes, dentries which are missing inodes, and 
other similar cases that don't make the filesystem unusable, but can't 
easily be fixed in a sane manner on a live filesystem.  The ZFS approach 
is valid, but it can't deal with things like orphaned inodes where 
there's no reference in the directories any more.



I'm not sure if the btrfsck is really all that helpful to users as much
as it is for developers to better learn about the failure vectors of
the file system.


ReiserFS had no working fsck for all of the 8 years I used it (and still
didn't last year when I tried to use it on an old disk).  "Not working"
here means "much less data is readable from the filesystem after running
fsck than before."  It's not that much of an inconvenience if you have
backups.
For a small array, this may be the case.  Once you start looking into 
double digit TB scale arrays though, restoring backups becomes a very 
expensive operation.  If you had a multi-PB array with a single dentry 
which had no inode, would you rather be spending multiple days restoring 
files and possibly losing recent changes, or spend a few hours to check 
the filesystem and fix it with minimal data loss?



Re: [PATCH] btrfs-progs: change btrfs_csum_final result param type to u8

2016-09-19 Thread David Sterba
On Sun, Sep 18, 2016 at 12:10:22AM +0100, Domagoj Tršan wrote:
> csum member of struct btrfs_super_block has array type of u8. It makes sense
> that function btrfs_csum_final should be also declared to accept u8 *. I
> changed the declaration of method void btrfs_csum_final(u32 crc, char 
> *result);
> to void btrfs_csum_final(u32 crc, u8 *result);
> Also, I changed definitions of various csum variables to be consistent with
> kernel code.

Aligning the progs code with the kernel is useful, even for the seemingly
trivial changes (though there will always be some differences). Feel
free to send more patches like that.

Patch applied, thanks.
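For context, the function is a one-liner; after the change its declaration
matches the u8 csum arrays it writes into. A sketch of the post-change
version, assuming the put_unaligned_le32() helper used in the kernel's
disk-io.c:

#include <linux/types.h>
#include <asm/unaligned.h>

/* result is now u8 *, matching btrfs_super_block::csum
 * (u8 csum[BTRFS_CSUM_SIZE]) and the other checksum buffers. */
void btrfs_csum_final(u32 crc, u8 *result)
{
	put_unaligned_le32(~crc, result);
}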


Re: [PATCH] btrfs: change btrfs_csum_final result param type to u8

2016-09-19 Thread David Sterba
On Sun, Sep 18, 2016 at 12:10:34AM +0100, Domagoj Tršan wrote:
> csum member of struct btrfs_super_block has array type of u8. It makes sense
> that function btrfs_csum_final should be also declared to accept u8 *. I
> changed the declaration of method void btrfs_csum_final(u32 crc, char 
> *result);
> to void btrfs_csum_final(u32 crc, u8 *result);

You should put similar text in the patch itself; it's not necessary to
send a cover letter for single patches. Otherwise the change is ok.


Re: [PATCH]btrfs-progs: btrfs-convert.c : check source file system state

2016-09-19 Thread David Sterba
On Thu, Sep 15, 2016 at 02:08:52PM +0200, Lakshmipathi.G wrote:
> Signed-off-by: Lakshmipathi.G 
> ---
>  btrfs-convert.c | 15 +++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/btrfs-convert.c b/btrfs-convert.c
> index c10dc17..27da9ce 100644
> --- a/btrfs-convert.c
> +++ b/btrfs-convert.c
> @@ -2171,6 +2171,17 @@ static void ext2_copy_inode_item(struct 
> btrfs_inode_item *dst,
>   }
>   memset(&dst->reserved, 0, sizeof(dst->reserved));
>  }
> +static int check_filesystem_state(struct btrfs_convert_context *cctx)
> +{
> +	ext2_filsys fs = cctx->fs_data;
> +
> +	if (!(fs->super->s_state & EXT2_VALID_FS))
> +		return 1;
> +	else if (fs->super->s_state & EXT2_ERROR_FS)
> +		return 1;
> +	else
> +		return 0;
> +}
>  
>  /*
>   * copy a single inode. do all the required works, such as cloning
> @@ -2340,6 +2351,10 @@ static int do_convert(const char *devname, int 
> datacsum, int packing,
>   ret = convert_open_fs(devname, &cctx);
>   if (ret)
>   goto fail;
> +	ret = check_filesystem_state(&cctx);
> +	if (ret)
> +		warning("Source Filesystem is not clean, \
> +			running e2fsck is recommended.");

I'm wondering if this should be a hard error or not, I'll leave it as a
warning for now.


Re: stability matrix

2016-09-19 Thread Austin S. Hemmelgarn

On 2016-09-19 11:27, David Sterba wrote:

Hi,

On Thu, Sep 15, 2016 at 04:14:04AM +0200, Christoph Anton Mitterer wrote:

In general:
- I think another column should be added, which tells when and for
  which kernel version the feature-status of each row was
  revised/updated the last time and especially by whom.
  If a core dev makes a statement on a particular feature, this
  probably means much more, than if it was made by "just" a list
  regular.


It's going to be revised per release. If there's a bug that affect the
status, the page will be updated. I'm going to do that among other
per-release regular boring tasks.

I'm still not decided if the kernel version will be useful enough, but
if anybody is willing to do the research and fill the table I don't
object.
Moving forwards, I think it's worth it, but I don't feel that it's worth 
looking back at anything before 4.4 to list versions.



  And yes I know, in the beginning it already says "this is for 4.7"...
  but let's be honest, it's pretty likely when this is bumped to 4.8
  that not each and every point will be thoroughly checked again.
- Optionally even one further column could be added, that lists bugs
  where the specific cases are kept record of (if any).


There's a new section under the table to write anything that would not
fit. Mostly pointers to other documentation (manual pages) or bugzilla.


- Perhaps a 3rd Status like "eats-your-data" which is worse than
  critical, e.g. for things were it's known that there is a high
  chance for still getting data corruption (RAID56?)


Perhaps there should be another section that lists general caveats
and pitfalls including:
- defrag/auto-defrag causes ref-link break up (which in turn causes
  possible extensive space being eaten up)


Updated accordingly.


- nodatacow files are not yet[0] checksummed, which in turn means
  that any errors (especially silent data corruption) will not be
  noticed AND which in turn also means the data itself cannot be
  repaired even in case of RAIDs (only the RAIDs are made consistent
  again)


Added to the table.


- subvolume UUID attacks discussed in the recent thread
- fs/device UUID collisions
  - the accidental corruption that can happen in case colliding
fs/device UUIDs appear in a system (and telling the user that
this is e.g. the case when dd'ing and image or using lvm
snapshots, probably also when having btrfs on MD RAID1 or RAID10)
  - the attacks that are possible when UUIDs are known to an attacker


That's more like a usecase, that's out of the scope of the tabular
overview. But we have an existing page UseCases that I'd like to
transform into a more structured and complete overview of usecases of
various features, so the UUID collisions would build on top of that with
"and this could happen if ...".
I don't agree with this being use case specific.  Whether or not someone 
cares could technically be use case specific, but the use cases where 
this actually doesn't matter are pretty much limited to tight embedded 
systems with no way to attach external storage.  This behavior results 
in both a number of severe security holes for anyone without proper 
physical security (read as 'almost all desktop and laptop users, as well 
as many server admins'), and severe potential for data loss when 
performing normal recovery activities that work on every other filesystem.



- in-band dedupe
  deduped are IIRC not bitwise compared by the kernel before de-duping,
  as it's the case with offline dedupe.
  Even if this is considered safe by the community... I think users
  should be told.


Only merged features are reflected. And the out-of-band dedupe does a full
memcmp; see btrfs_cmp_data() called from btrfs_extent_same().


- btrfs check --repair (and others?)
  Telling people that this may often cause more harm than good.


I think userspace tools do not belong to the overview.


- even mounting a fs ro, may cause it to be changed


This would go to the UseCases
My same argument about the UUID issues applies here, just without the 
security aspect.  The only difference here is that it's common behavior 
across most filesystems (but not widely known to most people who aren't 
FS developers or sysops experts).



- DB/VM-image like IO patterns + nodatacow + (!)checksumming
  + (auto)defrag + snapshots
  a)
  People typically may have the impression:
  btrfs = checksummed => all is guaranteed to be "valid" (or at least
  noticed)
  However this isn't the case for nodatacow'ed files, which in turn is
  kinda "mandatory" for DB/VM-image like IO patterns, cause otherwise
  these would fragment too heavily (see (b)).
  Unless claimed by some people, none of the major DBs or VM-image
  formats do general checksumming on their own, most even don't support
  it, some that do wouldn't do it without app-support and few "just"
  don't do it per default.
  Thus one should bump people to this situation and that they may not
  get this "correctness" guarantee here.
 

Re: [PATCH]btrfs-progs: Add fast,slow symlinks and fifo types to convert test

2016-09-19 Thread David Sterba
On Thu, Sep 15, 2016 at 11:34:07AM +0200, Lakshmipathi.G wrote:
> + slow_symlink)
> + for num in $(seq 1 $DATASET_SIZE); do
> + fname64=`date +%s | sha256sum | cut -f1 -d'-'`

Do you need to generate the date and sha all the time?

> + run_check $SUDO_HELPER touch $dirpath/$fname64
> + run_check $SUDO_HELPER ln -s $dirpath/$fname64 
> $dirpath/slow_slink.$num
> + done
> + ;;
>   esac
>  }
>  
>  populate_fs() {
>  
> -for dataset_type in 'small' 'hardlink' 'symlink' 'brokenlink' 'perm' 
> 'sparse' 'acls'; do
> +for dataset_type in 'small' 'hardlink' 'fast_symlink' 'brokenlink' 
> 'perm' 'sparse' 'acls' 'fifo' 'slow_symlink'; do
>   generate_dataset "$dataset_type"
>   done
>  }
> -- 
> 1.9.3
> 


Re: Is stability a joke?

2016-09-19 Thread David Sterba
On Mon, Sep 12, 2016 at 01:31:42PM -0400, Austin S. Hemmelgarn wrote:
> On 2016-09-12 12:51, David Sterba wrote:
> > On Mon, Sep 12, 2016 at 10:54:40AM -0400, Austin S. Hemmelgarn wrote:
> >>> Somebody has put that table on the wiki, so it's a good starting point.
> >>> I'm not sure we can fit everything into one table, some combinations do
> >>> not bring new information and we'd need n-dimensional matrix to get the
> >>> whole picture.
> >> Agreed, especially because some things are only bad in specific
> >> circumstances (For example, snapshots generally work fine on almost
> >> anything, until you get into the range of more than about 250, then they
> >> start causing issues).
> >
> > The performance aspect could be hard to estimate. Each feature has some
> > cost, we can document what's expected hit but various combinations and
> > actual runtime performance is unpredictable. I'd rather let the tools do
> > what the user asks for, as we might not be able to even detect there are
> > some bad external factors. I think that 250 snapshots would perform
> > better on an ssd than a rotational disk. In the end this leads to the
> > "dos & don'ts".
> >
> In general yes in this case, but performance starts to degrade 
> exponentially beyond a certain point.  The difference between (for 
> example) 10 and 20 snapshots is not as much as between 1000 and 1010. 
> The problem here is that we don't really have a BCP document that anyone 
> ever reads.  A lot of stuff that may seem obvious to us after years of 
> working with BTRFS isn't going to be to a newcomer, and it's a lot more 
> likely that some random person will get things right if we have a good, 
> central BCP document than if it stays as scattered tribal knowledge.

The IRC tribe answers the same newcomer questions over and over, which
is fine for the interaction itself, but if all that had also ended up in
the wiki we'd have had perfect documentation years ago. Also, the current
status of features and bugs is kept in the IRC hive-mind, yet it still
needs some other way to actually make it appear on the wiki. Edit with
courage!


Re: Is stability a joke? (wiki updated)

2016-09-19 Thread Zygo Blaxell
On Mon, Sep 19, 2016 at 08:32:14AM -0400, Austin S. Hemmelgarn wrote:
> On 2016-09-18 23:47, Zygo Blaxell wrote:
> >On Mon, Sep 12, 2016 at 12:56:03PM -0400, Austin S. Hemmelgarn wrote:
> >>4. File Range Cloning and Out-of-band Dedupe: Similarly, work fine if the FS
> >>is healthy.
> >
> >I've found issues with OOB dedup (clone/extent-same):
> >
> >1.  Don't dedup data that has not been committed--either call fsync()
> >on it, or check the generation numbers on each extent before deduping
> >it, or make sure the data is not being actively modified during dedup;
> >otherwise, a race condition may lead to the filesystem locking up and
> >becoming inaccessible until the kernel is rebooted.  This is particularly
> >important if you are doing bedup-style incremental dedup on a live system.
> >
> >I've worked around #1 by placing a fsync() call on the src FD immediately
> >before calling FILE_EXTENT_SAME.  When I do an A/B experiment with and
> >without the fsync, "with-fsync" runs for weeks at a time without issues,
> >while "without-fsync" hangs, sometimes in just a matter of hours.  Note
> >that the fsync() doesn't resolve the underlying race condition, it just
> >makes the filesystem hang less often.
> >
> >2.  There is a practical limit to the number of times a single duplicate
> >extent can be deduplicated.  As more references to a shared extent
> >are created, any part of the filesystem that uses backref walking code
> >gets slower.  This includes dedup itself, balance, device replace/delete,
> >FIEMAP, LOGICAL_INO, and mmap() (which can be bad news if the duplicate
> >files are executables).  Several factors (including file size and number
> >of snapshots) are involved, making it difficult to devise workarounds or
> >set up test cases.  99.5% of the time, these operations just get slower
> >by a few ms each time a new reference is created, but the other 0.5% of
> >the time, write operations will abruptly grow to consume hours of CPU
> >time or dozens of gigabytes of RAM (in millions of kmalloc-32 slabs)
> >when they touch one of these over-shared extents.  When this occurs,
> >it effectively (but not literally) crashes the host machine.
> >
> >I've worked around #2 by building tables of "toxic" hashes that occur too
> >frequently in a filesystem to be deduped, and using these tables in dedup
> >software to ignore any duplicate data matching them.  These tables can
> >be relatively small as they only need to list hashes that are repeated
> >more than a few thousand times, and typical filesystems (up to 10TB or
> >so) have only a few hundred such hashes.
> >
> >I happened to have a couple of machines taken down by these issues this
> >very weekend, so I can confirm the issues are present in kernels 4.4.21,
> >4.5.7, and 4.7.4.
> OK, that's good to know.  In my case, I'm not operating on a very big data
> set (less than 40GB, but the storage cluster I'm doing this on only has
> about 200GB of total space, so I'm trying to conserve as much as possible),
> and it's mostly static data (less than 100MB worth of changes a day except
> on Sunday when I run backups), so it makes sense that I've not seen either
> of these issues.

I ran into issue #2 on an 8GB filesystem last weekend.  The lower limit
on filesystem size could be as low as a few megabytes if they're arranged
in *just* the right way.

> The second one sounds like the same performance issue caused by having very
> large numbers of snapshots, and based on what's happening, I don't think
> there's any way we could fix it without rewriting certain core code.

find_parent_nodes is the usual culprit for CPU usage.  Fixing this is
required for in-band dedup as well, so I assume someone has it on their
roadmap and will get it done eventually.
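For anyone wanting to script the fsync-before-dedup workaround from point 1,
the userspace side looks roughly like the following (a sketch with error
handling trimmed; as noted above, the fsync() only narrows the race, it does
not remove it):

#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/btrfs.h>

/* Ask btrfs to dedup one range of src_fd against the same range in dst_fd. */
static int dedup_range(int src_fd, int dst_fd, __u64 offset, __u64 len)
{
	struct btrfs_ioctl_same_args *args;
	struct btrfs_ioctl_same_extent_info *info;
	int ret;

	/* Workaround: flush uncommitted data in the source file before
	 * asking the kernel to compare and share the extents. */
	if (fsync(src_fd))
		return -1;

	args = calloc(1, sizeof(*args) + sizeof(*info));
	if (!args)
		return -1;
	args->logical_offset = offset;
	args->length = len;
	args->dest_count = 1;
	info = &args->info[0];
	info->fd = dst_fd;
	info->logical_offset = offset;

	ret = ioctl(src_fd, BTRFS_IOC_FILE_EXTENT_SAME, args);
	if (ret == 0 && info->status != 0)
		ret = -1;	/* BTRFS_SAME_DATA_DIFFERS or a negative errno */
	free(args);
	return ret;
}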





Re: Experimental btrfs encryption

2016-09-19 Thread Theodore Ts'o
(I'm not on linux-btrfs@, so please keep me on the cc list.  Or
perhaps better yet, maybe we can move discussion to the linux-fsdevel@
list.)

Hi Anand,

After reading this thread on the web archives, and seeing that some
folks seem to be a bit confused about "vfs level crypto", fs/crypto,
and ext4/f2fs encryption, I thought I would give a few comments.

First of all, these are all the same thing.  Initially ext4 encryption
was implemented targeting ChromeOS as the initial customer, and as a
replacement for ecryptfs.  Folks have already pointed you at the
design document[1].  Also of interest is the 2015 Linux
Security Symposium slide set[2].  The first deployed use of this was
for Android N's File-based Encryption and Direct boot[3]; a technical
description which left out some of the product details (since LSS 2016
was before the Android N release) can be found at the 2016 LSS
slides[4].

[1] 
https://docs.google.com/document/d/1ft26lUQyuSpiu6VleP70_npaWdRfXFoNnB8JYnykNTg/preview
[2] http://kernsec.org/files/lss2014/Halcrow_EXT4_Encryption.pdf
[3] 
https://android.googleblog.com/2016/08/android-70-nougat-more-powerful-os-made.html
[4] http://kernsec.org/files/lss2015/halcrow.pdf

The other thing that perhaps would be worth noting is that Michael
Halcrow started this as an encryption/security expert who had dabbled
in file systems, while I was someone for whom encryption/security is a
hobby (although in a previous life I was the tech lead for Kerberos
and chaired the IPSEC working group) who was a file system expert.  In
order to do file system security well, you need people who are well
versed in both disciplines working together.

With all due respect, the fact that you chose counter mode and how you
used it pretty clearly demonstrates that you would be well advised to
find someone who is a crypto expert to collaborate with you --- or use
the fs/crypto framework since it was designed and vetted by multiple
crypto experts as well as file system experts.

Having someone who is a product manager who can discuss specific goals
with you is also important, because there are lots of tradeoffs and lots
of design choices, and so what you choose to do is (or at least should
be!) very much dependent on your threat model, who is planning on using
the feature, what you can and cannot count upon vis-a-vis hardware
support, performance requirements, and so on.


Secondly, in terms of how it all works: each user has a "master key"
which is stored on a keyring.  We use a hash of the key to serve as
the key identifier, and associated with each inode we store a nonce (a
random unique string) and the key identifier.  We use the nonce and
the user's master key to generate a unique key for that inode.

That key is used to protect the contents of the data file, and to
encrypt filenames and symlink targets --- since filenames can leak
significant information about what the user is doing.  (For example,
in the downloads directory of their web browser, leaking filenames is
just as good as leaking part of their browsing history.)

As far as using the fs/crypto infrastructure, it's actually pretty
simple.  The file system needs to provide a flag indicating whether or
not the file is encrypted, and support extended attributes.  When you
create an inode in an encrypted directory, you call
fscrypt_inherit_context() and the fscrypto layer will take care of
creating the necessary xattr for the per-inode key.  When you need to
open an encrypted file, or operate on an encrypted inode, you call
fscrypt_get_encryption_info() on the inode.  The per-inode encryption
key is cached in the i_crypt_info structure, which hangs off of the
struct inode.

When you write to an encrypted file, you call fscrypt_encrypt_page(),
which returns a struct page with the encrypted contents to be written.
After the write is completed (or in the error case), you call
fscrypt_restore_control_page() to release encrypted page.

To read from an encrypted page, you call fscrypt_get_ctx() to get an
encryption context, which gets stashed in the bio's bi_private
pointer.  (If btrfs is already using bi_private, then you'll need to
add a field in the structure which hangs off of bi_private to stash
the encryption context.)  After the read completes, you call
fscrypt_decrypt_bio_pages() to decrypt all of the pages read as part
of the read/write operation.

It's actually relatively straightforward to use.  If you have any
questions please feel free to ask on linux-fsdevel.
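To make the flow above concrete, here is a rough sketch of how a
filesystem's write-out and read-completion paths might use those calls.
The function names are the ones described above; the exact signatures are
assumptions based on the 4.8-era fs/crypto API (they have changed across
releases), and the my_fs_* helpers are hypothetical:

/* Schematic only: error handling and locking are elided. */
static int my_fs_write_encrypted_page(struct inode *inode, struct page *page)
{
	struct page *ciphertext_page;
	int ret;

	/* Make sure the per-inode key is cached in inode->i_crypt_info. */
	ret = fscrypt_get_encryption_info(inode);
	if (ret)
		return ret;

	/* Returns a bounce page holding the encrypted contents. */
	ciphertext_page = fscrypt_encrypt_page(inode, page, GFP_NOFS);
	if (IS_ERR(ciphertext_page))
		return PTR_ERR(ciphertext_page);

	ret = my_fs_submit_page(inode, ciphertext_page);  /* hypothetical I/O helper */

	/* Release the bounce page once the write has completed (or failed). */
	fscrypt_restore_control_page(ciphertext_page);
	return ret;
}

/* Read completion: ctx came from fscrypt_get_ctx() before bio submission and
 * was stashed in bi_private (or a structure hanging off it).  Real
 * filesystems defer this to a workqueue rather than decrypting in the
 * end_io handler itself. */
static void my_fs_read_done(struct bio *bio)
{
	struct fscrypt_ctx *ctx = bio->bi_private;

	/* Decrypt all of the pages attached to the bio. */
	fscrypt_decrypt_bio_pages(ctx, bio);
}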


As far as people commenting that it might be better to encrypt on the
extent level --- the reason why we didn't choose that path is because
while it does make it easier to do authenticated encryption modes, the
downside is that you can only do the data integrity check if you read
in the entire extent.  This has obvious memory utilization impacts and
will also impact your 4k random read/write performance.

We do have a solution in mind to solve the authenticated encryption
problem; in fact, an intern 

Re: stability matrix (was: Is stability a joke?)

2016-09-19 Thread David Sterba
Hi,

On Thu, Sep 15, 2016 at 04:14:04AM +0200, Christoph Anton Mitterer wrote:
> In general:
> - I think another column should be added, which tells when and for
>   which kernel version the feature-status of each row was 
>   revised/updated the last time and especially by whom.
>   If a core dev makes a statement on a particular feature, this
>   probably means much more, than if it was made by "just" a list
>   regular.

It's going to be revised per release. If there's a bug that affect the
status, the page will be updated. I'm going to do that among other
per-release regular boring tasks.

I'm still not decided if the kernel version will be useful enough, but
if anybody is willing to do the research and fill the table I don't
object.

>   And yes I know, in the beginning it already says "this is for 4.7"...
>   but let's be honest, it's pretty likely when this is bumped to 4.8
>   that not each and every point will be thoroughly checked again.
> - Optionally even one further column could be added, that lists bugs
>   where the specific cases are kept record of (if any).

There's a new section under the table to write anything that would not
fit. Mostly pointers to other documentation (manual pages) or bugzilla.

> - Perhaps a 3rd Status like "eats-your-data" which is worse than
>   critical, e.g. for things were it's known that there is a high
>   chance for still getting data corruption (RAID56?)
> 
> 
> Perhaps there should be another section that lists general caveats
> and pitfalls including:
> - defrag/auto-defrag causes ref-link break up (which in turn causes
>   possible extensive space being eaten up)

Updated accordingly.

> - nodatacow files are not yet[0] checksummed, which in turn means
>   that any errors (especially silent data corruption) will not be
>   noticed AND which in turn also means the data itself cannot be
>   repaired even in case of RAIDs (only the RAIDs are made consistent
>   again)

Added to the table.

> - subvolume UUID attacks discussed in the recent thread
> - fs/device UUID collisions
>   - the accidental corruption that can happen in case colliding
>     fs/device UUIDs appear in a system (and telling the user that
>     this is e.g. the case when dd'ing and image or using lvm
>     snapshots, probably also when having btrfs on MD RAID1 or RAID10)
>   - the attacks that are possible when UUIDs are known to an attacker

That's more like a usecase, that's out of the scope of the tabular
overview. But we have an existing page UseCases that I'd like to
transform into a more structured and complete overview of usecases of
various features, so the UUID collisions would build on top of that with
"and this could happen if ...".

> - in-band dedupe
>   deduped are IIRC not bitwise compared by the kernel before de-duping,
>   as it's the case with offline dedupe.
>   Even if this is considered safe by the community... I think users
>   should be told.

Only merged features are reflected. And the out-of-band dedupe does a full
memcmp; see btrfs_cmp_data() called from btrfs_extent_same().

> - btrfs check --repair (and others?)
>   Telling people that this may often cause more harm than good.

I think userspace tools do not belong to the overview.

> - even mounting a fs ro, may cause it to be changed

This would go to the UseCases

> - DB/VM-image like IO patterns + nodatacow + (!)checksumming
>   + (auto)defrag + snapshots
>   a)
>   People typically may have the impression:
>   btrfs = checksummed => all is guaranteed to be "valid" (or at least
>   noticed)
>   However this isn't the case for nodatacow'ed files, which in turn is
>   kinda "mandatory" for DB/VM-image like IO patterns, cause otherwise
>   these would fragment too heavily (see (b)).
>   Unless claimed by some people, none of the major DBs or VM-image
>   formats do general checksumming on their own, most even don't support
>   it, some that do wouldn't do it without app-support and few "just"
>   don't do it per default.
>   Thus one should bump people to this situation and that they may not
>   get this "correctness" guarantee here.
>   b)
>   IIRC, it doesn't even help to simply not use nodatacow on such files
>   and using auto-defrag instead to countermeasure the fragmenting, as
>   that one doesn't perform too well on large files.

Same.

> For specific features:
> - Autodefrag
>   - didn't that also cause reflinks to be broken up?

No and never had.

> that should be
>     mentioned than as well, as it is (more or less) for defrag and
>     people could then assume it's not the case for autodefrag (which I
>     did initially)
>   - wasn't it said that autodefrag performs bad with files > ~1GB?
>     Perhaps that should be mentioned too
> - defrag
>   "extents get unshared" is IMO not an adequate description for the end
>   user,... it should perhaps link to the defrag article and there
>   explain in detail that any ref-linked files will be broken up, which
>   means space usage will increase, and may especially 

Re: Is stability a joke? (wiki updated)

2016-09-19 Thread Sean Greenslade
On Mon, Sep 19, 2016 at 12:08:55AM -0400, Zygo Blaxell wrote:
> 
> At the end of the day I'm not sure fsck really matters.  If the filesystem
> is getting corrupted enough that both copies of metadata are broken,
> there's something fundamentally wrong with that setup (hardware bugs,
> software bugs, bad RAM, etc) and it's just going to keep slowly eating
> more data until the underlying problem is fixed, and there's no guarantee
> that a repair is going to restore data correctly.  If we exclude broken
> hardware, the only thing btrfs check is going to repair is btrfs kernel
> bugs...and in that case, why would we expect btrfs check to have fewer
> bugs than the filesystem itself?

I see btrfs check as having a very useful role: fixing known problems
introduced by previous versions of kernel / progs. In my ext conversion
thread, I seem to have discovered a problem introduced by convert,
balance, or defrag. The data and metadata seem to be OK, however the
filesystem cannot be written to without btrfs falling over. If this was
caused by some edge-case data in the btrfs partition, it makes a lot
more sense to have btrfs check repair it than it does to modify the
kernel code to work around this and possibly many other bugs. The upshot
to this is that since (potentially all of) the data is intact, a
functional btrfs check would save me the hassle of restoring from
backup.

--Sean



Re: Post ext3 conversion problems

2016-09-19 Thread Sean Greenslade
On Mon, Sep 19, 2016 at 02:30:28PM +0800, Qu Wenruo wrote:
> All chunks are completely converted to DUP, no small chunks, all at their maximum
> chunk size.
> So from chunk level, nothing related to convert yet.
> 
> But for extent tree, I found several extents are heavily referred to.
> Like extent 158173081600 or 183996522496.
> 
> If you're not using out-of-band dedupe, then it's quite possible that's the
> remaining structure of convert.

I never ran any sort of dedup on this partition.

> Not pretty sure if it's related to the bug, but did you do the
> balance/defrag operation just after removing ext_save subvolume?

That's quite possible. I did it in a live boot, so I don't have the bash
history to check. I checked it just now using "btrfs subvol list -d",
and there's nothing listed. I ran a full balance after that, but the
problem remains. So whatever the problem is, it can survive a full
balance after the ext_save subvol is completely deleted.

--Sean


Re: stability matrix

2016-09-19 Thread David Sterba
On Thu, Sep 15, 2016 at 07:54:26AM -0400, Austin S. Hemmelgarn wrote:
> > I'd like to help creating/maintaining this bug overview. A good start
> > would be to just crawl through all stable kernels and some distro
> > kernels and see which commits show up in fs/btrfs.
> >
> As of right now, we kind of do have such a page:
> https://btrfs.wiki.kernel.org/index.php/Gotchas
> It's not really well labeled though, ans it's easy to overlook.

The page was created a long time ago; if you need to start a new
page with similar content I can add a redirect so the link still works.

A more detailed bug page would be welcome by users. The changelogs I
write per release are terse as I don't want to spend the day just on
that. All the information should be in the git log and possibly in
recent mails, so this is manual work to present it on the wiki and does
not need devs to assist.


[PATCH] Btrfs: fix incremental send failure caused by balance

2016-09-19 Thread fdmanana
From: Filipe Manana 

Commit 951555856b88 ("Btrfs: send, don't bug on inconsistent snapshots")
removed some BUG_ON() statements (replacing them with returning errors
to user space and logging error messages) when a snapshot is in an
inconsistent state due to failures to update a delayed inode item (ENOMEM
or ENOSPC) after adding/updating/deleting references, xattrs or file
extent items.

However there is a case, when no errors happen, where a file extent item
can be modified without having the corresponding inode item updated. This
case happens during balance under very specific timings, when relocation
is in the stage where it updates data pointers and a leaf that contains
file extent items is COWed. When that happens file extent items get their
disk_bytenr field updated to a new value that reflects the post relocation
logical address of the extent, without updating their respective inode
items (as there is nothing that needs to be updated on them). This is
performed at relocation.c:replace_file_extents() through
relocation.c:btrfs_reloc_cow_block().

So make an incremental send deal with this case and don't do any processing
for a file extent item that got its disk_bytenr field updated by relocation,
since the extent's data is the same as the one pointed by the file extent
item in the parent snapshot.

After the recent commit mentioned above this case resulted in EIO errors
returned to user space (and an error message logged to dmesg/syslog) when
doing an incremental send, while before it, it resulted in hitting a
BUG_ON leading to the following trace:

[  952.206705] [ cut here ]
[  952.206714] kernel BUG at ../fs/btrfs/send.c:5653!
[  952.206719] Internal error: Oops - BUG: 0 [#1] SMP
[  952.209854] Modules linked in: st dm_mod nls_utf8 isofs fuse nf_log_ipv6 
xt_pkttype xt_physdev br_netfilter nf_log_ipv4 nf_log_common xt_LOG xt_limit 
ebtable_filter ebtables af_packet bridge stp llc ip6t_REJECT xt_tcpudp 
nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT iptable_raw xt_CT 
iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast 
nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables xt_conntrack nf_conntrack 
ip6table_filter ip6_tables x_tables xfs libcrc32c nls_iso8859_1 nls_cp437 vfat 
fat joydev aes_ce_blk ablk_helper cryptd snd_intel8x0 aes_ce_cipher 
snd_ac97_codec ac97_bus snd_pcm ghash_ce sha2_ce sha1_ce snd_timer snd 
virtio_net soundcore btrfs xor sr_mod cdrom hid_generic usbhid raid6_pq 
virtio_blk virtio_scsi bochs_drm drm_kms_helper syscopyarea sysfillrect 
sysimgblt fb_sys_fops ttm virtio_mmio xhci_pci xhci_hcd usbcore usb_common 
virtio_pci virtio_ring virtio drm sg efivarfs
[  952.228333] Supported: Yes
[  952.228908] CPU: 0 PID: 12779 Comm: snapperd Not tainted 4.4.14-50-default #1
[  952.230329] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
[  952.231683] task: 800058e94100 ti: 8000d866c000 task.ti: 
8000d866c000
[  952.233279] PC is at changed_cb+0x9f4/0xa48 [btrfs]
[  952.234375] LR is at changed_cb+0x58/0xa48 [btrfs]
[  952.236552] pc : [] lr : [] pstate: 
8145
[  952.238049] sp : 8000d866fa20
[  952.238732] x29: 8000d866fa20 x28: 0019
[  952.239840] x27: 28d5 x26: 24a2
[  952.241008] x25: 0002 x24: 8000e66e92f0
[  952.242131] x23: 8000b8c76800 x22: 800092879140
[  952.243238] x21: 0002 x20: 8000d866fb78
[  952.244348] x19: 8000b8f8c200 x18: 2710
[  952.245607] x17: 90d42480 x16: 80237dc0
[  952.246719] x15: 90de7510 x14: ab000c000a2faf08
[  952.247835] x13: 00577c2b x12: ab000c000b696665
[  952.248981] x11: 2e65726f632f6966 x10: 652d34366d72612f
[  952.250101] x9 : 32627572672f746f x8 : ab000c00092f1671
[  952.251352] x7 : 80577c2b x6 : 800053eadf45
[  952.252468] x5 :  x4 : 80005e169494
[  952.253582] x3 : 0004 x2 : 8000d866fb78
[  952.254695] x1 : 0003e2a3 x0 : 0003e2a4
[  952.255803]
[  952.256150] Process snapperd (pid: 12779, stack limit = 0x8000d866c020)
[  952.257516] Stack: (0x8000d866fa20 to 0x8000d867)
[  952.258654] fa20: 8000d866fae0 7c308fc0 800092879140 
8000e66e92f0
[  952.260219] fa40: 0035 800055de6000 8000b8c76800 
8000d866fb78
[  952.261745] fa60: 0002 24a2 28d5 
0019
[  952.263269] fa80: 8000d866fae0 7c3090f0 8000d866fae0 
7c309128
[  952.264797] faa0: 800092879140 8000e66e92f0 0035 
800055de6000
[  952.268261] fac0: 8000b8c76800 8000d866fb78 0002 
1000
[  952.269822] fae0: 8000d866fbc0 7c39ecfc 8000b8f8c200 
8000b8f8c368
[  952.271368] fb00: 8000b8f8c378 800055de6000 0001 
8000ecb17500
[  952.272893] fb20: 8000b8c76800 800092879140 

Re: Is stability a joke? (wiki updated)

2016-09-19 Thread Austin S. Hemmelgarn

On 2016-09-18 22:57, Zygo Blaxell wrote:

On Fri, Sep 16, 2016 at 08:00:44AM -0400, Austin S. Hemmelgarn wrote:

To be entirely honest, both zero-log and super-recover could probably be
pretty easily integrated into btrfs check such that it detects when they
need to be run and does so.  zero-log has a very well defined situation in
which it's absolutely needed (log tree corrupted such that it can't be
replayed), which is pretty easy to detect (the kernel obviously does so,
albeit by crashing).


Check already includes zero-log.  It loses a little data that way, but
that is probably better than the alternative (try to teach btrfs check
how to replay the log tree and keep up with kernel changes).
Interesting, as I've never seen check try to zero the log (even in cases 
where it would fix things) unless it makes some other change in the FS. 
I won't dispute that it clears the log tree _if_ it makes other changes 
to the FS (it kind of has to for safety reasons), but that's the only 
circumstance that I've seen it do so (even on filesystems where the log 
tree was corrupted, but the rest of the FS was fine).


There have been at least two log-tree bugs (or, more accurately,
bugs triggered while processing the log tree during mount) in the 3.x
and 4.x kernels.  The most recent I've encountered was in one of the
4.7-rc kernels.  zero-log is certainly not obsolete.
I won't dispute this, as I've had it happen myself (albeit not quite 
that recently), all I was trying to say was that it fixes a very well 
defined problem.


For a filesystem where availablity is more important than integrity
(e.g. root filesystems) it's really handy to have zero-log as a separate
tool without the huge overhead (and regression risk) of check.
Agreed, hence my later statement that if it gets fully merged, there 
should be an option to run just that.




Re: Is stability a joke? (wiki updated)

2016-09-19 Thread Austin S. Hemmelgarn

On 2016-09-18 23:47, Zygo Blaxell wrote:

On Mon, Sep 12, 2016 at 12:56:03PM -0400, Austin S. Hemmelgarn wrote:

4. File Range Cloning and Out-of-band Dedupe: Similarly, work fine if the FS
is healthy.


I've found issues with OOB dedup (clone/extent-same):

1.  Don't dedup data that has not been committed--either call fsync()
on it, or check the generation numbers on each extent before deduping
it, or make sure the data is not being actively modified during dedup;
otherwise, a race condition may lead to the filesystem locking up and
becoming inaccessible until the kernel is rebooted.  This is particularly
important if you are doing bedup-style incremental dedup on a live system.

I've worked around #1 by placing a fsync() call on the src FD immediately
before calling FILE_EXTENT_SAME.  When I do an A/B experiment with and
without the fsync, "with-fsync" runs for weeks at a time without issues,
while "without-fsync" hangs, sometimes in just a matter of hours.  Note
that the fsync() doesn't resolve the underlying race condition, it just
makes the filesystem hang less often.

2.  There is a practical limit to the number of times a single duplicate
extent can be deduplicated.  As more references to a shared extent
are created, any part of the filesystem that uses backref walking code
gets slower.  This includes dedup itself, balance, device replace/delete,
FIEMAP, LOGICAL_INO, and mmap() (which can be bad news if the duplicate
files are executables).  Several factors (including file size and number
of snapshots) are involved, making it difficult to devise workarounds or
set up test cases.  99.5% of the time, these operations just get slower
by a few ms each time a new reference is created, but the other 0.5% of
the time, write operations will abruptly grow to consume hours of CPU
time or dozens of gigabytes of RAM (in millions of kmalloc-32 slabs)
when they touch one of these over-shared extents.  When this occurs,
it effectively (but not literally) crashes the host machine.

I've worked around #2 by building tables of "toxic" hashes that occur too
frequently in a filesystem to be deduped, and using these tables in dedup
software to ignore any duplicate data matching them.  These tables can
be relatively small as they only need to list hashes that are repeated
more than a few thousand times, and typical filesystems (up to 10TB or
so) have only a few hundred such hashes.

I happened to have a couple of machines taken down by these issues this
very weekend, so I can confirm the issues are present in kernels 4.4.21,
4.5.7, and 4.7.4.
OK, that's good to know.  In my case, I'm not operating on a very big 
data set (less than 40GB, but the storage cluster I'm doing this on only 
has about 200GB of total space, so I'm trying to conserve as much as 
possible), and it's mostly static data (less than 100MB worth of changes 
a day except on Sunday when I run backups), so it makes sense that I've 
not seen either of these issues.


The second one sounds like the same performance issue caused by having 
very large numbers of snapshots, and based on what's happening, I don't 
think there's any way we could fix it without rewriting certain core code.



Re: RAID1 availability issue[2], Hot-spare and auto-replace

2016-09-19 Thread Austin S. Hemmelgarn

On 2016-09-18 13:28, Chris Murphy wrote:

On Sun, Sep 18, 2016 at 2:34 AM, Anand Jain  wrote:


(updated the subject, was [1])


IMO the hot-spare feature makes most sense with the raid56,



  Why. ?


Raid56 is not scalable, has less redundancy in most all
configurations, rebuild impacts the entire array performance, and in
the case of raid6 two drives lost means incredibly slow rebuild. All
of that adds up to more disk for raid56 to be mitigated with a hot
spare being available for immediate rebuild.

Who currently would use hot spare right now? Problem 1 is Btrfs raid10
is not scalable like other raid10 implementations (mdadm, lvm,
hardware). Problem 2 is Btrfs the raid56 parity scrub bug; and
arguably also partial stripe writes not being CoW. I think hot spare
is pointless with those two problems still being true, and the way to
mitigate them right now is a clusterfs. Hot spare doesn't mitigate
these Btrfs weaknesses.





which is stuck where it is, so we need to get it working first.




  We need at least one RAID which does not have the availability
  issue. We could achieve that with raid1, there are patches
  which needs maintainer time.


I agree with the idea of degraded raid1 chunks. It's a nasty surprise
to realize this only once it's too late and there's data loss. That
there is a user space work around, maybe makes it less of a big deal?
But I don't think it's documented on gotchas page with the soft
conversion work around to do the rebuild properly: scrub/balance alone
is not correct.

I kinda think we need a list of priorities for multiple device stuff,
and honestly hot spare while important I think is bottom of the list.

1. multiple fs UUID dev UUID corruption problem (the cloned device problem)
2. degraded volumes new bg's are single profile (Anand's April patchset)
3. raid56 bad parity created during scrub when data strip is bad and gets fixed
4. better faulty device tolerance (no crashing)
5. raid10 scaling, needs a way for even number block devices of the
same size to get fixed mirroring so it can tolerate multiple drive
failures so long as a mirrored pair don't fail
6. raid56 partial stripe RMW need to be CoW, doesn't matter if it
slows things down, if you don't like it, use raid10
7. raid1 threaded/async reads (whatever the correct term is to read
from all raid1 drives rather than PID based)
8. better faulty device notifications
9. raid56 parity needs to be checksummed
10. hotspare
FWIW, I'd probably list the faulty device tolerance and notifications 
(in that order) immediately after the first two items, put the raid1 
threaded reads at the end of the list (after hot-spares), and put the 
raid10 scaling after raid1 threading (anyone who's actually concerned 
with performance and has done their homework is more likely to be using 
BTRFS in raid1 mode on top of a pair of RAID0 arrays (most likely MD or 
LVM based) instead of BTRFS raid10 mode, not only because of the 
reliability factor, but also because it gets significantly better 
performance than BTRFS raid10 mode, and will continue to do so until we 
get proper load-balancing of reads on raid1 and raid10 profiles).  I'd 
also add that we should be parallelizing reads of stripe components in 
raid0, raid10, raid5, and raid6 modes (ie, if we're using raid10 mode 
and need to read both halves of a stripe, both reads should get 
dispatched at the same time), but that would likely go in with the raid1 
performance stuff.



2 and 3 might seem tied. Both can result in data loss, both have user
space work arounds (undocumented); but 2 has a greater chance of
happening than 3.
2 also impacts things other than raid5/6, which means (at least IMO) it 
should be higher priority.


4 is probably worse than 3, but 4 is much more nebulous and 3 produces
a big negative perception.

I'm sure someone could argue hotspare could get squeezed in between 4
and 5; but that's really my one bias in the list, I don't care about
hot spare. I think it's more scalable to take advantage of Btrfs
uniqueness to shrink the file system to drop the bad drive to regain
full redundancy, rather than do hot spares, this is faster, and
doesn't waste a drive that's not doing any work.
This isn't just you, I'm pretty much of the same opinion on this 
particular item.


I see shrink as more scalable with hard drives than hot spares,
especially in the case of data single profile with clusterfs's: drop
the bad device and its data, autodelete the lost files, rebuild
metadata to regain complete fs redundancy,  inform the cluster of
partial data loss - boom the array is completely fixed, let the
cluster figure out what to do next. Plus each brick isn't spinning an
unused hot spare. There is in effect a hot spare *somewhere* partially
used somewhere else in a cluster fs anyway. I see hot spare as an edge
case need, especially with hard drives. It's not a general purpose
need.

I agree on this too to a certain extent, except:
1. There aren't any 

Re: RAID1 availability issue[2], Hot-spare and auto-replace

2016-09-19 Thread Austin S. Hemmelgarn

On 2016-09-18 22:25, Anand Jain wrote:


Chris Murphy,

 Thanks for writing in detail, it makes sense..

 Generally hot spare is to reduce the risk of double disk failures
 leading to the data lose at the data centers before the data is
 reconstructed again for redundancy.

On 09/19/2016 01:28 AM, Chris Murphy wrote:

On Sun, Sep 18, 2016 at 2:34 AM, Anand Jain 
wrote:


(updated the subject, was [1])


IMO the hot-spare feature makes most sense with the raid56,



  Why. ?


Raid56 is not scalable, has less redundancy in most all
configurations, rebuild impacts the entire array performance, and in
the case of raid6 two drives lost means incredibly slow rebuild. All
of that adds up to more disk for raid56 to be mitigated with a hot
spare being available for immediate rebuild.

Who currently would use hot spare right now?


 Probably you mean to say hot spare is not P1 right now, looking at
 other things to fix, I agree.  raid1 availability issue is p1.
 I do get ping-ed on it once in a while.

 I am curious what you would recommend as a btrfs VM data solution for
 enterprise production?
I have no idea what Chris would recommend, but in my case, it depends on 
what you want to do.  For use inside a VM, I'd say it's entirely up to 
your requirements, but I'd only trust it for catching corruption, not 
preventing data loss (that's the job of the storage host anyway).  For 
use for storing VM images, there are much better options.  For a single 
user system or a small single server without HA requirements you should 
be using LVM (or something similar) and setting proper ACL's on the LV's 
so you don't need to run the VM's as root (and easy portability is a 
bogus argument against this, it's trivial to generate image files from 
block devices on Linux).  For HA setups, I'd probably set up a SAN using 
GlusterFS+iSCSI (possibly with BTRFS as a back-end for Gluster) or Ceph.


Thanks, Anand


Problem 1 is Btrfs raid10
is not scalable like other raid10 implementations (mdadm, lvm,
hardware). Problem 2 is Btrfs the raid56 parity scrub bug; and
arguably also partial stripe writes not being CoW. I think hot spare
is pointless with those two problems still being true, and the way to
mitigate them right now is a clusterfs. Hot spare doesn't mitigate
these Btrfs weaknesses.





which is stuck where it is, so we need to get it working first.




  We need at least one RAID which does not have the availability
  issue. We could achieve that with raid1, there are patches
  which needs maintainer time.


I agree with the idea of degraded raid1 chunks. It's a nasty surprise
to realize this only once it's too late and there's data loss. That
there is a user space work around, maybe makes it less of a big deal?
But I don't think it's documented on gotchas page with the soft
conversion work around to do the rebuild properly: scrub/balance alone
is not correct.

I kinda think we need a list of priorities for multiple device stuff,
and honestly hot spare while important I think is bottom of the list.

1. multiple fs UUID dev UUID corruption problem (the cloned device
problem)
2. degraded volumes new bg's are single profile (Anand's April patchset)
3. raid56 bad parity created during scrub when data strip is bad and
gets fixed
4. better faulty device tolerance (no crashing)
5. raid10 scaling, needs a way for even number block devices of the
same size to get fixed mirroring so it can tolerate multiple drive
failures so long as a mirrored pair don't fail
6. raid56 partial stripe RMW need to be CoW, doesn't matter if it
slows things down, if you don't like it, use raid10
7. raid1 threaded/async reads (whatever the correct term is to read
from all raid1 drives rather than PID based)
8. better faulty device notifications
9. raid56 parity needs to be checksummed
10. hotspare


2 and 3 might seem tied. Both can result in data loss, both have user
space work arounds (undocumented); but 2 has a greater chance of
happening than 3.

4 is probably worse than 3, but 4 is much more nebulous and 3 produces
a big negative perception.

I'm sure someone could argue hotspare could get squeezed in between 4
and 5; but that's really my one bias in the list, I don't care about
hot spare. I think it's more scalable to take advantage of Btrfs
uniqueness to shrink the file system to drop the bad drive to regain
full redundancy, rather than do hot spares, this is faster, and
doesn't waste a drive that's not doing any work.

I see shrink as more scalable with hard drives than hot spares,
especially in the case of data single profile with clusterfs's: drop
the bad device and its data, autodelete the lost files, rebuild
metadata to regain complete fs redundancy,  inform the cluster of
partial data loss - boom the array is completely fixed, let the
cluster figure out what to do next. Plus each brick isn't spinning an
unused hot spare. There is in effect a hot spare *somewhere* partially
used somewhere else in a cluster fs anyway. I 

Transaction

2016-09-19 Thread Jon S. Cunliffe
Date: 9/18/2016
 
I am Sir Jonathan Stephen Cunliffe, Deputy Governor for Financial Stability, 
Bank of England. I have an interesting offer worth (£11.5 million) to share 
with you. If you are interested, write back to my personal email: 
jonl1...@aol.co.uk for more details.
 
Note: This message was translated with Google Translate. If you can speak, 
write and understand English, please let me know, as that will be better for 
our communication.
 
Regards,
Sir Jon S. Cunliffe

