Re: [f2fs-dev] [PATCH 2/2] f2fs: support {d,id,did,x}node checksum

2018-04-16 Thread Jaegeuk Kim
On 04/13, Chao Yu wrote:
> Ping again..
> 
> Do you have time to discuss this?

We may need to find a time to chat in person. Is there any chance you could
visit the US?

> 
> On 2018/2/27 22:16, Chao Yu wrote:
> > Ping,
> > 
> > On 2018/2/13 15:34, Chao Yu wrote:
> >> Hi Jaegeuk,
> >>
> >> On 2018/2/10 10:52, Chao Yu wrote:
> >>> On 2018/2/10 9:41, Jaegeuk Kim wrote:
>  On 02/01, Chao Yu wrote:
> >
> >
> > On 2018/2/1 6:15, Jaegeuk Kim wrote:
> >> On 01/31, Chao Yu wrote:
> >>> On 2018/1/31 10:02, Jaegeuk Kim wrote:
>  What if we want to add more entries in addition to node_checksum? Do 
>  we have
>  to add a new feature flag every time? How about adding a layout 
>  value instead
> >>>
> >>> Hmm.. for the previous implementation, IMO, we'd better add a new
> >>> feature flag each time; otherwise, with extra_nsize only, we can know
> >>> the valid range of the extended area in a node block, but we can't
> >>> know which fields/features are enabled.
> >>>
> >>> One more thing: if we add one feature flag for each field, we also
> >>> get the chance to disable each field dynamically.
> >>>
>  of extra_nsize? For example, layout #1 means node_checksum with 
>  extra_nsize=X?
> 
> 
>  What does 1017 mean? We need to make this structure more flexible 
>  for new
> >>>
> >>> Yes, using raw 1017 is not appropriate here.
> >>>
>  entries. Like this?
>   union {
>   struct node_v1;
>   struct node_v2;
>   struct node_v3;
>   ...
>   struct direct_node dn;
>   struct indirect_node in;
>   };
>   };
> 
>   struct node_v1 {
>   __le32 data[DEF_ADDRS_PER_BLOCK - V1_NSIZE=1];
>   __le32 node_checksum;
>   }
> 
>   struct node_v2 {
>   __le32 data[DEF_ADDRS_PER_BLOCK - V2_NSIZE=500];
> >>>
> >>> Hmm.. If we only need to add one more 4-byte field in struct
> >>> node_v2, but V2_NSIZE is defined as a fixed 500, then 492 bytes
> >>> are wasted.
> >>>
> >>> Or we can define V2_NSIZE as 8, but if more and more extended fields
> >>> come along, the node version count can grow large, which results in
> >>> complicated version recognition and handling.
> >>>
> >>> One more question: how can we control which fields are valid in
> >>> comp[Vx_NSIZE]?
> >>>
> >>>
> >>> Anyway, what I'm thinking is that maybe we can restructure the node
> >>> block layout like the one used by f2fs_inode:
> >>>
> >>> struct f2fs_node {
> >>>   union {
> >>>   struct f2fs_inode i;
> >>>   union {
> >>>   struct {
> >>>   __le32 node_checksum;
> >>>   __le32 feature_field_1;
> >>>   __le32 feature_field_2;
> >>>   
> >>>   __le32 addr[];
> >>>   
> >>>   };
> >>>   struct direct_node dn;
> >>>   struct indirect_node in;
> >>>   };
> >>>   };
> >>>   struct node_footer footer;
> >>> } __packed;
> >>>
> >>> By moving all extended fields to the head of f2fs_node, we don't
> >>> have to use a macro to indicate the actual size of addr.
> >>
> >> I'm thinking about what'd be the best way. My concern is, once we
> >> get more entries, we
> >
> > OK, I think we need more discussion.. ;)
> >
> >> can't set each of the features individually. Like, the second entry
> >> should have
> >
> > Oh, that will be hard. If we have to avoid that, we have to tag it
> > somewhere, e.g. in f2fs_inode::i_flags2, to indicate which new fields
> > in f2fs_node are valid, for example:
> >
> > #define F2FS_NODE_CHECKSUM  0x0001
> > #define F2FS_NODE_FIELD10x0002
> > #define F2FS_NODE_FIELD20x0004
> >
> > union {
> > struct {
> > __le32 node_checksum;
> > __le32 field_1;
> > __le32 field_2;
> > 
> > __le32 addr[];
> > };
> > struct direct_node dn;
> > struct indirect_node in;
> > };
> >
> > f2fs_inode::i_flags2 = 

Re: [f2fs-dev] Symlink not persisted even after fsync

2018-04-16 Thread Vijay Chidambaram
On Mon, Apr 16, 2018 at 7:07 PM, Dave Chinner  wrote:
> On Sun, Apr 15, 2018 at 07:10:52PM -0500, Vijay Chidambaram wrote:
>> Thanks! As I mentioned before, this is useful. I have a follow-up
>> question. Consider the following workload:
>>
>>  creat foo
>>  link (foo, A/bar)
>>  fsync(foo)
>>  crash
>>
>> In this case, after the file system recovers, do we expect foo's link
>> count to be 2 or 1?
>
> So, strictly ordered behaviour:
>
> create foo:
> - creates dirent in inode B and new inode A in an atomic
>   transaction sequence #1
>
> link foo -> A/bar
> - creates dirent in inode C and bumps inode A link count in
>   an atomic transaction sequence #2.
>
> fsync foo
> - looks at inode A, sees its "last modification" sequence
>   counter as #2
> - flushes all transactions up to and including #2 to the
>   journal.
>
> See the dependency chain? Both the inodes and dirents in the create
> operation and the link operation are chained to the inode foo via
> the atomic transactions. Hence when we flush foo, we also flush the
> dependent changes because of the change atomicity requirements.
>
>> I would say 2,
>
> Correct, for strict ordering. But
>
>> but POSIX is silent on this,
>
> Well, it's not silent, POSIX explicitly allows for fsync() to do
> nothing and report success. Hence we can't really look to POSIX to
> define how fsync() should behave.
>
>> so
>> thought I would confirm. The tricky part here is we are not calling
>> fsync() on directory A.
>
> Right. But directory A has a dependent change linked to foo. If we
> fsync() foo, we are persisting the link count change in that file,
> and hence all the other changes related to that link count change
> must also be flushed. Similarly, all the changes related to the
> creation of foo must be flushed, too.
>
>> In this case, it's not a symlink; it's a hard link, so I would say the
>> link count for foo should be 2.
>
> Right - that's the "reference counted object dependency" I referred
> to. I.e. it's a bi-directional atomic dependency - either we show both
> the new dirent and the link count change, or we show neither of
> them.  Hence fsync on one object implies that we are also persisting
> the related changes in the other object, too.
>
>> But btrfs and F2FS show link count of
>> 1 after a crash.
>
> That may be valid if the dirent A/bar does not exist after recovery,
> but it also means fsync() hasn't actually guaranteed inode changes
> made prior to the fsync to be persistent on disk. i.e. that's a
> violation of ordered metadata semantics and probably a bug.

Great, this matches our understanding perfectly. We have separately
posted to the btrfs mailing list to confirm it is a bug. Thanks!

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] Symlink not persisted even after fsync

2018-04-16 Thread Dave Chinner
On Sun, Apr 15, 2018 at 07:10:52PM -0500, Vijay Chidambaram wrote:
> Thanks! As I mentioned before, this is useful. I have a follow-up
> question. Consider the following workload:
> 
>  creat foo
>  link (foo, A/bar)
>  fsync(foo)
>  crash
> 
> In this case, after the file system recovers, do we expect foo's link
> count to be 2 or 1? 

So, strictly ordered behaviour:

create foo:
- creates dirent in inode B and new inode A in an atomic
  transaction sequence #1

link foo -> A/bar
- creates dirent in inode C and bumps inode A link count in
  an atomic transaction sequence #2.

fsync foo
- looks at inode A, sees its "last modification" sequence
  counter as #2
- flushes all transactions up to and including #2 to the
  journal.

See the dependency chain? Both the inodes and dirents in the create
operation and the link operation are chained to the inode foo via
the atomic transactions. Hence when we flush foo, we also flush the
dependent changes because of the change atomicity requirements.

> I would say 2,

Correct, for strict ordering. But

> but POSIX is silent on this,

Well, it's not silent, POSIX explicitly allows for fsync() to do
nothing and report success. Hence we can't really look to POSIX to
define how fsync() should behave.

> so
> thought I would confirm. The tricky part here is we are not calling
> fsync() on directory A.

Right. But directory A has a dependent change linked to foo. If we
fsync() foo, we are persisting the link count change in that file,
and hence all the other changes related to that link count change
must also be flushed. Similarly, all the changes related to the
creation of foo must be flushed, too.

> In this case, it's not a symlink; it's a hard link, so I would say the
> link count for foo should be 2.

Right - that's the "reference counted object dependency" I referred
to. I.e. it's a bi-directional atomic dependency - either we show both
the new dirent and the link count change, or we show neither of
them.  Hence fsync on one object implies that we are also persisting
the related changes in the other object, too.

> But btrfs and F2FS show link count of
> 1 after a crash.

That may be valid if the dirent A/bar does not exist after recovery,
but it also means fsync() hasn't actually guaranteed inode changes
made prior to the fsync to be persistent on disk. i.e. that's a
violation of ordered metadata semantics and probably a bug.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com



Re: [f2fs-dev] [PATCH 2/2] f2fs: refactor read path to allow multiple postprocessing steps

2018-04-16 Thread Michael Halcrow via Linux-f2fs-devel
Given recent talk I've seen on potentially applying file-based
protections in NFS, I think it's worth making some cautionary
observations at this stage.

Moxie's Cryptographic Doom Principle is an approachable take on the
argument that one should verify before performing any other
cryptographic operations.

https://moxie.org/blog/the-cryptographic-doom-principle/

As we consider inline cryptographic engines, this becomes challenging
because encryption is delegated to the block layer, but authenticity
still happens at the file system layer.

That said, the feasibility of applying attacks such as CCA2 against
file system content is in question, depending on the adversarial model
for the context in which the file system is integrated.  In
particular, does the model include a MiTM between the kernel and the
storage?

The fs-crypt adversarial model explicitly concerns itself only with a
single point-in-time permanent offline compromise of storage.  I'm not
aware of any real analysis anyone has done when we're instead dealing
with storage that can be arbitrarily modified at arbitrary points in
time.

Is the user concerned about malicious drive controller firmware?  Will
the content be transmitted over an insecure network connection?  Will
the content be otherwise manipulable by an adversary who can make
changes to the backing storage?

The other caution relates to compress-then-encrypt.  I refer to the
CRIME and BREACH attacks.

https://en.wikipedia.org/wiki/CRIME

https://en.wikipedia.org/wiki/BREACH

Again, it's context-dependent whether or how attacks like this can
apply to a file system.  We'll need to be explicit about our
expectations as we investigate whether or how to use fs-crypt and
fs-verity capabilities in NFS.  I think this feeds into a discussion
on efficient protection of file system metadata, but I'll defer to
some future thread for that.

On Mon, Apr 16, 2018 at 12:31:47PM -0700, Eric Biggers wrote:
> Currently f2fs's ->readpage() and ->readpages() assume that either the
> data undergoes no postprocessing, or decryption only.  But with
> fs-verity, there will be an additional authenticity verification step,
> and it may be needed either by itself, or combined with decryption.
> 
> To support this, store a 'struct bio_post_read_ctx' in ->bi_private
> which contains a work struct, a bitmask of postprocessing steps that are
> enabled, and an indicator of the current step.  The bio completion
> routine, if there was no I/O error, enqueues the first postprocessing
> step.  When that completes, it continues to the next step.  Pages that
> fail any postprocessing step have PageError set.  Once all steps have
> completed, pages without PageError set are set Uptodate, and all pages
> are unlocked.
> 
> Also replace f2fs_encrypted_file() with a new function
> f2fs_post_read_required() in places like direct I/O and garbage
> collection that really should be testing whether the file needs special
> I/O processing, not whether it is encrypted specifically.
> 
> This may also be useful for other future f2fs features such as
> compression.
> 
> Signed-off-by: Eric Biggers 
> ---
>  fs/f2fs/data.c   | 163 +++
>  fs/f2fs/f2fs.h   |  12 +++-
>  fs/f2fs/file.c   |   4 +-
>  fs/f2fs/gc.c |   6 +-
>  fs/f2fs/inline.c |   2 +-
>  fs/f2fs/super.c  |   6 ++
>  6 files changed, 144 insertions(+), 49 deletions(-)
> 
> diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
> index 39225519de1c..ec70de0cd1d5 100644
> --- a/fs/f2fs/data.c
> +++ b/fs/f2fs/data.c
> @@ -30,6 +30,9 @@
>  #include "trace.h"
>  #include 
>  
> +static struct kmem_cache *bio_post_read_ctx_cache;
> +static mempool_t *bio_post_read_ctx_pool;
> +
>  static bool __is_cp_guaranteed(struct page *page)
>  {
>   struct address_space *mapping = page->mapping;
> @@ -50,11 +53,77 @@ static bool __is_cp_guaranteed(struct page *page)
>   return false;
>  }
>  
> -static void f2fs_read_end_io(struct bio *bio)
> +/* postprocessing steps for read bios */
> +enum bio_post_read_step {
> + STEP_INITIAL = 0,
> + STEP_DECRYPT,
> +};
> +
> +struct bio_post_read_ctx {
> + struct bio *bio;
> + struct work_struct work;
> + unsigned int cur_step;
> + unsigned int enabled_steps;

What if enabled_steps has gaps?  E.g., 0x09 will be feasible if
there's compression, encryption, virus signature scanning, and verity.
I don't really see how the cur_step logic can be made elegant
in that case.

If you start "initial" at 1, then you could collapse these two into
"pending_steps", setting each bit position to 0 after processing.
Just a thought.

> +};
> +
> +static void __read_end_io(struct bio *bio)
>  {
> - struct bio_vec *bvec;
> + struct page *page;
> + struct bio_vec *bv;
>   int i;
>  
> + bio_for_each_segment_all(bv, bio, i) {
> + page = bv->bv_page;
> +
> + /* PG_error was set if any post_read step failed */
> + if 

Re: [f2fs-dev] [PATCH] f2fs: set deadline to drop expired inmem pages

2018-04-16 Thread Jaegeuk Kim
On 04/13, Chao Yu wrote:
> On 2018/4/13 12:05, Jaegeuk Kim wrote:
> > On 04/13, Chao Yu wrote:
> >> On 2018/4/13 9:04, Jaegeuk Kim wrote:
> >>> On 04/10, Chao Yu wrote:
>  Hi Jaegeuk,
> 
>  On 2018/4/8 16:13, Chao Yu wrote:
> > f2fs doesn't allow abuse of the atomic write class interface, so
> > besides limiting in-mem pages' total memory usage, we need to limit
> > the start-commit time as well; otherwise we may run into an infinite
> > loop during foreground GC because target blocks in the victim segment
> > belong to an atomic-opened file for a long time.
> >
> > Now, we will check the condition with f2fs_balance_fs_bg in
> > background threads; if the user doesn't commit data for more than 30
> > seconds, we will drop all cached data, so I expect it can keep our
> > system running safely and prevent a DoS attack.
> 
>  Is it worth adding this patch to avoid user abuse of the atomic write
>  interface?
> >>>
> >>> Hmm, hope to see a real problem first in this case.
> >>
> >> I think this can be a critical security hole rather than just a
> >> potential issue; waiting for someone to report it may be too late.
> >>
> >> For example, a user can simply write a huge file whose data is spread
> >> across all f2fs segments; once the user opens that file as atomic,
> >> foreground GC will loop forever, denying any further f2fs service.
> > 
> > How can you guarantee it won't happen within 30 seconds? If you want
> > to avoid that,
> 
> Now the value is smaller than the generic hung task threshold, in order
> to avoid foreground GC holding gc_mutex too long; can we tune that
> parameter?
> 
> > you have to take a look at foreground gc.
> 
> What do you mean? Let GC move blocks of an atomic-write-opened file?

I thought that we first need to detect when foreground GC is stuck due to
such a huge number of atomic writes. Then, we need to do something like
dropping all the atomic writes.

> 
> Thanks,
> 
> > 
> >>
> >> Thanks,
> >>
> >>>
>  Thanks,
> >>>
> >>> .
> >>>
> > 
> > .
> > 



[f2fs-dev] [PATCH 1/2] fscrypt: allow synchronous bio decryption

2018-04-16 Thread Eric Biggers via Linux-f2fs-devel
Currently, fscrypt provides fscrypt_decrypt_bio_pages() which decrypts a
bio's pages asynchronously, then unlocks them afterwards.  But, this
assumes that decryption is the last "postprocessing step" for the bio,
so it's incompatible with additional postprocessing steps such as
authenticity verification after decryption.

Therefore, rename the existing fscrypt_decrypt_bio_pages() to
fscrypt_enqueue_decrypt_bio().  Then, add fscrypt_decrypt_bio() which
decrypts the pages in the bio synchronously without unlocking the pages,
nor setting them Uptodate; and add fscrypt_enqueue_decrypt_work(), which
enqueues work on the fscrypt_read_workqueue.  The new functions will be
used by filesystems that support both fscrypt and fs-verity.

Signed-off-by: Eric Biggers 
---
 fs/crypto/bio.c | 35 +
 fs/crypto/crypto.c  |  8 +++-
 fs/crypto/fscrypt_private.h |  1 -
 fs/ext4/readpage.c  |  2 +-
 fs/f2fs/data.c  |  2 +-
 include/linux/fscrypt_notsupp.h | 13 +---
 include/linux/fscrypt_supp.h|  5 -
 7 files changed, 45 insertions(+), 21 deletions(-)

diff --git a/fs/crypto/bio.c b/fs/crypto/bio.c
index 0d5e6a569d58..0959044c5cee 100644
--- a/fs/crypto/bio.c
+++ b/fs/crypto/bio.c
@@ -26,15 +26,8 @@
 #include 
 #include "fscrypt_private.h"
 
-/*
- * Call fscrypt_decrypt_page on every single page, reusing the encryption
- * context.
- */
-static void completion_pages(struct work_struct *work)
+static void __fscrypt_decrypt_bio(struct bio *bio, bool done)
 {
-   struct fscrypt_ctx *ctx =
-   container_of(work, struct fscrypt_ctx, r.work);
-   struct bio *bio = ctx->r.bio;
struct bio_vec *bv;
int i;
 
@@ -46,22 +39,38 @@ static void completion_pages(struct work_struct *work)
if (ret) {
WARN_ON_ONCE(1);
SetPageError(page);
-   } else {
+   } else if (done) {
SetPageUptodate(page);
}
-   unlock_page(page);
+   if (done)
+   unlock_page(page);
}
+}
+
+void fscrypt_decrypt_bio(struct bio *bio)
+{
+   __fscrypt_decrypt_bio(bio, false);
+}
+EXPORT_SYMBOL(fscrypt_decrypt_bio);
+
+static void completion_pages(struct work_struct *work)
+{
+   struct fscrypt_ctx *ctx =
+   container_of(work, struct fscrypt_ctx, r.work);
+   struct bio *bio = ctx->r.bio;
+
+   __fscrypt_decrypt_bio(bio, true);
fscrypt_release_ctx(ctx);
bio_put(bio);
 }
 
-void fscrypt_decrypt_bio_pages(struct fscrypt_ctx *ctx, struct bio *bio)
+void fscrypt_enqueue_decrypt_bio(struct fscrypt_ctx *ctx, struct bio *bio)
 {
INIT_WORK(&ctx->r.work, completion_pages);
ctx->r.bio = bio;
-   queue_work(fscrypt_read_workqueue, &ctx->r.work);
+   fscrypt_enqueue_decrypt_work(&ctx->r.work);
 }
-EXPORT_SYMBOL(fscrypt_decrypt_bio_pages);
+EXPORT_SYMBOL(fscrypt_enqueue_decrypt_bio);
 
 void fscrypt_pullback_bio_page(struct page **page, bool restore)
 {
diff --git a/fs/crypto/crypto.c b/fs/crypto/crypto.c
index ce654526c0fb..0758d32ad01b 100644
--- a/fs/crypto/crypto.c
+++ b/fs/crypto/crypto.c
@@ -45,12 +45,18 @@ static mempool_t *fscrypt_bounce_page_pool = NULL;
 static LIST_HEAD(fscrypt_free_ctxs);
 static DEFINE_SPINLOCK(fscrypt_ctx_lock);
 
-struct workqueue_struct *fscrypt_read_workqueue;
+static struct workqueue_struct *fscrypt_read_workqueue;
 static DEFINE_MUTEX(fscrypt_init_mutex);
 
 static struct kmem_cache *fscrypt_ctx_cachep;
 struct kmem_cache *fscrypt_info_cachep;
 
+void fscrypt_enqueue_decrypt_work(struct work_struct *work)
+{
+   queue_work(fscrypt_read_workqueue, work);
+}
+EXPORT_SYMBOL(fscrypt_enqueue_decrypt_work);
+
 /**
  * fscrypt_release_ctx() - Releases an encryption context
  * @ctx: The encryption context to release.
diff --git a/fs/crypto/fscrypt_private.h b/fs/crypto/fscrypt_private.h
index ad6722bae8b7..4012558f6115 100644
--- a/fs/crypto/fscrypt_private.h
+++ b/fs/crypto/fscrypt_private.h
@@ -97,7 +97,6 @@ static inline bool fscrypt_valid_enc_modes(u32 contents_mode,
 /* crypto.c */
 extern struct kmem_cache *fscrypt_info_cachep;
 extern int fscrypt_initialize(unsigned int cop_flags);
-extern struct workqueue_struct *fscrypt_read_workqueue;
 extern int fscrypt_do_page_crypto(const struct inode *inode,
  fscrypt_direction_t rw, u64 lblk_num,
  struct page *src_page,
diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index 9ffa6fad18db..19b87a8de6ff 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -77,7 +77,7 @@ static void mpage_end_io(struct bio *bio)
if (bio->bi_status) {
fscrypt_release_ctx(bio->bi_private);
} else {
-   fscrypt_decrypt_bio_pages(bio->bi_private, bio);
+   

[f2fs-dev] [PATCH 2/2] f2fs: refactor read path to allow multiple postprocessing steps

2018-04-16 Thread Eric Biggers via Linux-f2fs-devel
Currently f2fs's ->readpage() and ->readpages() assume that either the
data undergoes no postprocessing, or decryption only.  But with
fs-verity, there will be an additional authenticity verification step,
and it may be needed either by itself, or combined with decryption.

To support this, store a 'struct bio_post_read_ctx' in ->bi_private
which contains a work struct, a bitmask of postprocessing steps that are
enabled, and an indicator of the current step.  The bio completion
routine, if there was no I/O error, enqueues the first postprocessing
step.  When that completes, it continues to the next step.  Pages that
fail any postprocessing step have PageError set.  Once all steps have
completed, pages without PageError set are set Uptodate, and all pages
are unlocked.

Also replace f2fs_encrypted_file() with a new function
f2fs_post_read_required() in places like direct I/O and garbage
collection that really should be testing whether the file needs special
I/O processing, not whether it is encrypted specifically.

This may also be useful for other future f2fs features such as
compression.

Signed-off-by: Eric Biggers 
---
 fs/f2fs/data.c   | 163 +++
 fs/f2fs/f2fs.h   |  12 +++-
 fs/f2fs/file.c   |   4 +-
 fs/f2fs/gc.c |   6 +-
 fs/f2fs/inline.c |   2 +-
 fs/f2fs/super.c  |   6 ++
 6 files changed, 144 insertions(+), 49 deletions(-)

diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 39225519de1c..ec70de0cd1d5 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -30,6 +30,9 @@
 #include "trace.h"
 #include 
 
+static struct kmem_cache *bio_post_read_ctx_cache;
+static mempool_t *bio_post_read_ctx_pool;
+
 static bool __is_cp_guaranteed(struct page *page)
 {
struct address_space *mapping = page->mapping;
@@ -50,11 +53,77 @@ static bool __is_cp_guaranteed(struct page *page)
return false;
 }
 
-static void f2fs_read_end_io(struct bio *bio)
+/* postprocessing steps for read bios */
+enum bio_post_read_step {
+   STEP_INITIAL = 0,
+   STEP_DECRYPT,
+};
+
+struct bio_post_read_ctx {
+   struct bio *bio;
+   struct work_struct work;
+   unsigned int cur_step;
+   unsigned int enabled_steps;
+};
+
+static void __read_end_io(struct bio *bio)
 {
-   struct bio_vec *bvec;
+   struct page *page;
+   struct bio_vec *bv;
int i;
 
+   bio_for_each_segment_all(bv, bio, i) {
+   page = bv->bv_page;
+
+   /* PG_error was set if any post_read step failed */
+   if (bio->bi_status || PageError(page)) {
+   ClearPageUptodate(page);
+   SetPageError(page);
+   } else {
+   SetPageUptodate(page);
+   }
+   unlock_page(page);
+   }
+   if (bio->bi_private)
+   mempool_free(bio->bi_private, bio_post_read_ctx_pool);
+   bio_put(bio);
+}
+
+static void bio_post_read_processing(struct bio_post_read_ctx *ctx);
+
+static void decrypt_work(struct work_struct *work)
+{
+   struct bio_post_read_ctx *ctx =
+   container_of(work, struct bio_post_read_ctx, work);
+
+   fscrypt_decrypt_bio(ctx->bio);
+
+   bio_post_read_processing(ctx);
+}
+
+static void bio_post_read_processing(struct bio_post_read_ctx *ctx)
+{
+   switch (++ctx->cur_step) {
+   case STEP_DECRYPT:
+   if (ctx->enabled_steps & (1 << STEP_DECRYPT)) {
+   INIT_WORK(&ctx->work, decrypt_work);
+   fscrypt_enqueue_decrypt_work(&ctx->work);
+   return;
+   }
+   ctx->cur_step++;
+   /* fall-through */
+   default:
+   __read_end_io(ctx->bio);
+   }
+}
+
+static bool f2fs_bio_post_read_required(struct bio *bio)
+{
+   return bio->bi_private && !bio->bi_status;
+}
+
+static void f2fs_read_end_io(struct bio *bio)
+{
 #ifdef CONFIG_F2FS_FAULT_INJECTION
if (time_to_inject(F2FS_P_SB(bio_first_page_all(bio)), FAULT_IO)) {
f2fs_show_injection_info(FAULT_IO);
@@ -62,28 +131,15 @@ static void f2fs_read_end_io(struct bio *bio)
}
 #endif
 
-   if (f2fs_bio_encrypted(bio)) {
-   if (bio->bi_status) {
-   fscrypt_release_ctx(bio->bi_private);
-   } else {
-   fscrypt_enqueue_decrypt_bio(bio->bi_private, bio);
-   return;
-   }
-   }
-
-   bio_for_each_segment_all(bvec, bio, i) {
-   struct page *page = bvec->bv_page;
+   if (f2fs_bio_post_read_required(bio)) {
+   struct bio_post_read_ctx *ctx = bio->bi_private;
 
-   if (!bio->bi_status) {
-   if (!PageUptodate(page))
-   SetPageUptodate(page);
-   } else {
-   ClearPageUptodate(page);
-   SetPageError(page);
-   }
- 

[f2fs-dev] [PATCH 0/2] fscrypt / f2fs: prepare I/O path for fs-verity

2018-04-16 Thread Eric Biggers via Linux-f2fs-devel
Hello,

These two patches restructure f2fs's read path to allow the data to go
through multiple postprocessing steps, rather than just decryption as is
implemented currently.  This is mainly in preparation for doing
authenticity verification of data via fs-verity, though this change
might also be useful for other future f2fs features, e.g. compression.

These patches don't yet add the fs-verity work, however, as it depends
on the rest of the fs-verity patchset.  I'm planning to send the full
patchset out as an RFC, but some parts need further investigation first.
(The work-in-progress version can be found at
 git://git.kernel.org/pub/scm/linux/kernel/git/mhalcrow/linux.git,
 branch "fs-verity-dev".)

Eric Biggers (2):
  fscrypt: allow synchronous bio decryption
  f2fs: refactor read path to allow multiple postprocessing steps

 fs/crypto/bio.c |  35 ---
 fs/crypto/crypto.c  |   8 +-
 fs/crypto/fscrypt_private.h |   1 -
 fs/ext4/readpage.c  |   2 +-
 fs/f2fs/data.c  | 163 
 fs/f2fs/f2fs.h  |  12 ++-
 fs/f2fs/file.c  |   4 +-
 fs/f2fs/gc.c|   6 +-
 fs/f2fs/inline.c|   2 +-
 fs/f2fs/super.c |   6 ++
 include/linux/fscrypt_notsupp.h |  13 ++-
 include/linux/fscrypt_supp.h|   5 +-
 12 files changed, 188 insertions(+), 69 deletions(-)

-- 
2.17.0.484.g0c8726318c-goog




Re: [f2fs-dev] [PATCH v11 00/63] Convert page cache to XArray

2018-04-16 Thread Ross Zwisler
On Sat, Apr 14, 2018 at 07:12:13AM -0700, Matthew Wilcox wrote:
> From: Matthew Wilcox 
> 
> This conversion keeps the radix tree and XArray data structures in sync
> at all times.  That allows us to convert the page cache one function at
> a time and should allow for easier bisection.  Other than renaming some
> elements of the structures, the data structures are fundamentally
> unchanged; a radix tree walk and an XArray walk will touch the same
> number of cachelines.  I have changes planned to the XArray data
> structure, but those will happen in future patches.

I've hit a few failures when throwing this into my test setup.  The first two
seem like similar bugs hit in two different ways, the third seems unique.
I've verified that these don't seem to happen with the next-20180413 baseline.

1) Just run some parted commands in a loop:

# while true; do
> parted -s /dev/pmem0 mktable msdos
> parted -s -a optimal /dev/pmem0 mkpart Primary 2MiB 12GiB
> parted -s -a optimal /dev/pmem0 mkpart Primary 12GiB 16382MiB
> done

Within a few seconds I hit:

page:ea0004293040 count:0 mapcount:0 mapping: index:0x0
flags: 0x2fffc1(locked)
raw: 002fffc1   
raw: dead0100 dead0200  
page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) <= 0)
[ cut here ]
kernel BUG at ./include/linux/mm.h:853!
invalid opcode:  [#1] PREEMPT SMP PTI
Modules linked in: dax_pmem device_dax nd_pmem nd_btt nfit libnvdimm
CPU: 10 PID: 1539 Comm: systemd-udevd Not tainted 
4.16.0-next-20180413-00063-gbbcfa4572878 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 
04/01/2014
RIP: 0010:__add_to_page_cache_locked+0x34b/0x400
RSP: 0018:c90003427a58 EFLAGS: 00010246
RAX: 003e RBX: ea0004293040 RCX: 0001
RDX:  RSI:  RDI: 
RBP: c90003427ac8 R08: 001ff3fd4371 R09: 
R10: 0001 R11:  R12: 88010cd82210
R13:  R14:  R15: c90003427ad8
FS:  7ff6e400c1c0() GS:88011580() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 55af28d0e390 CR3: 00010a64e000 CR4: 06e0
Call Trace:
 ? memcg_drain_all_list_lrus+0x260/0x260
 add_to_page_cache_lru+0x4f/0xe0
 mpage_readpages+0xde/0x1d0
 ? check_disk_change+0x70/0x70
 ? xa_load+0xbe/0x150
 blkdev_readpages+0x1d/0x20
 __do_page_cache_readahead+0x203/0x2f0
 force_page_cache_readahead+0xb8/0x110
 ? force_page_cache_readahead+0xb8/0x110
 page_cache_sync_readahead+0x43/0x50
 generic_file_read_iter+0x842/0xb70
 blkdev_read_iter+0x35/0x40
 __vfs_read+0xfe/0x170
 vfs_read+0xa3/0x150
 ksys_read+0x58/0xc0
 __x64_sys_read+0x1a/0x20
 do_syscall_64+0x65/0x220
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7ff6e3bf2d31
RSP: 002b:7ffc54c38a78 EFLAGS: 0246 ORIG_RAX: 
RAX: ffda RBX: 55af28cdb7c0 RCX: 7ff6e3bf2d31
RDX: 0004 RSI: 55af28de66d8 RDI: 000f
RBP:  R08: 55af28de66b0 R09: 7ff6e3bdd090
R10:  R11: 0246 R12: 0004
R13: 55af28de66b0 R14: 55af28cdb810 R15: 55af28de66c8
Code: 88 fb 55 82 48 89 df e8 64 42 04 00 0f 0b 48 c7 c6 a8 fc 55 82 48 89 df 
e8 53 42 04 00 0f 0b 48 c7 c6 18 b5 52 82 e8 45 42 04 00 <0f> 0b 48 c1 f8 02 85 
c0 0f 84 59 fe ff ff 45 85 ed 48 c7 43 08
RIP: __add_to_page_cache_locked+0x34b/0x400 RSP: c90003427a58
---[ end trace 0a2ff3a36c6cabde ]---

2) xfs + DAX + generic/095

A spew of this new message:

Page cache invalidation failure on direct I/O.  Possible data corruption due to 
collision with buffered I/O!

Then a bug similar to the one hit with parted:

BUG: Bad page state in process fio  pfn:11e38c
page:ea000478e300 count:0 mapcount:0 mapping: index:0x60
page:ea000478e300 count:0 mapcount:0 mapping: index:0x60
flags: 0x3fffc10068(uptodate|lru|active|mappedtodisk)
raw: 003fffc10068  0060 
raw: ea0004ed6220 88011a04d830  
page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
[ cut here ]
kernel BUG at ./include/linux/mm.h:492!
invalid opcode:  [#1] PREEMPT SMP PTI
Modules linked in: nd_pmem nd_btt dax_pmem device_dax nfit libnvdimm
CPU: 5 PID: 599 Comm: systemd-journal Tainted: GW 
4.16.0-next-20180413-00063-gbbcfa4572878 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 
04/01/2014
RIP: 0010:release_pages+0x30e/0x3f0
RSP: 0018:c90001223a68 EFLAGS: 00010046
RAX: 003e RBX: ea000478e300 RCX: 0003
RDX:  RSI:  RDI: 
RBP: 

Re: [f2fs-dev] Symlink not persisted even after fsync

2018-04-16 Thread Vijay Chidambaram
On Mon, Apr 16, 2018 at 12:39 AM, Amir Goldstein  wrote:
> On Mon, Apr 16, 2018 at 3:10 AM, Vijay Chidambaram  
> wrote:
> [...]
>> Consider the following workload:
>>
>>  creat foo
>>  link (foo, A/bar)
>>  fsync(foo)
>>  crash
>>
>> In this case, after the file system recovers, do we expect foo's link
>> count to be 2 or 1? I would say 2, but POSIX is silent on this, so
>> thought I would confirm. The tricky part here is we are not calling
>> fsync() on directory A.
>>
>> In this case, it's not a symlink; it's a hard link, so I would say the
>> link count for foo should be 2. But btrfs and F2FS show a link count of
>> 1 after a crash.
>>
>
> That sounds like a clear bug - nlink is metadata of inode foo, so
> should be made persistent by fsync(foo).

This is what we think as well. We have posted this as a separate
thread to confirm this with other btrfs developers.

> For non-journaled fs you would need to fsync(A) to guarantee
> seeing A/bar after crash, but for a journaled fs, if you didn't see
> A/bar after crash and did see nlink 2 on foo then you would get
> a filesystem inconsistency, so practically, fsync(foo) takes care
> of persisting A/bar entry as well. But as you already understand,
> these rules have not been formalized by a standard, instead, they
> have been "formalized" by various fsck.* tools.

I don't think fsck tools are very useful here: fsck could return the
file system to an empty state, and that would still be consistent.
fsck makes no guarantees about data loss. I think fsck is allowed to
truncate files, remove directory entries, etc., which could lead to
data loss.

But I agree the guarantees haven't been formalized.

> Allow me to suggest a different framing for CrashMonkey.
> You seem to be engaging in discussions with the community
> about whether X behavior is a bug or not and as you can see
> the answer depends on the filesystem (and sometimes on the
> developer). Instead, you could declare that CrashMonkey
> is a "Certification tool" to certify filesystems to a certain
> crash consistency behavior. Then you can discuss with the
> community about specific models that CrashMonkey should
> be testing. The model describes the implicit dependencies
> and ordering guarantees between operations.
> Dave has mentioned the "strictly ordered metadata" model.
> I do not know of any formal definition of this model for filesystems,
> but you can take a shot at starting one and encoding it into
> CrashMonkey. This sounds like a great paper to me.

This is a great idea! We will be submitting the basic CrashMonkey
paper soon, so I don't know if we have enough time to do this.
Currently, we just explicitly say this behavior is supported by ext4,
but not btrfs, etc. So the bugs are file-system specific. But we would
definitely consider doing this in the future.

Btw, such models are what we introduced in the ALICE paper that Ted
had mentioned before. We called them "Abstract Persistence Models",
but it was essentially the same idea.

> I don't know if Btrfs and f2fs will qualify as "strictly ordered
> metadata" and I don't know if they would want to qualify.
> Mind you a filesystem can be crash consistent without
> following "strictly ordered metadata". In fact, in many cases
> "strictly ordered metadata" imposes performance penalty by
> coupling together unrelated metadata updates (e.g. create
> A/a and create B/b), but it is also quite hard to decouple them
> because future operation can create a dependency (e.g.
> mv A/a B/b).

I agree that total ordering might lead to performance loss. I'm not
advocating for btrfs/F2FS to be totally ordered; I merely want them to
be clear about what guarantees they do provide.

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Linux-f2fs-devel mailing list
Linux-f2fs-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/linux-f2fs-devel


Re: [f2fs-dev] Symlink not persisted even after fsync

2018-04-16 Thread Vijay Chidambaram
On Mon, Apr 16, 2018 at 12:52 AM, Theodore Y. Ts'o  wrote:
> On Sun, Apr 15, 2018 at 07:10:52PM -0500, Vijay Chidambaram wrote:
>>
>> I don't think this is what the paper's ext3-fast does. All the paper
>> says is if you have a file system where the fsync of a file persisted
>> only data related to that file, it would increase performance.
>> ext3-fast is the name given to such a file system. Note that we do not
>> present a design of ext3-fast or analyze it in any detail. In fact, we
>> explicitly say "The ext3-fast file system (derived from inferences
>> provided by ALICE) seems interesting for application safety, though
>> further investigation is required into the validity of its design."
>
> Well, it says that it's based on ext3's data=journal "Abstract Persistence
> Model".  It's true that a design was not proposed --- but if you
> don't propose a design, how do you know what the performance is or
> whether it's even practical?  That's one of those things I find
> extremely distasteful in the paper.  Sure, I can model a faster than
> light interstellar engine ala Star Trek's Warp Drive --- and I can
> talk about it having, say, better performance than a reaction drive.
> But it doesn't tell us anything useful about whether it can be built,
> or whether it's even useful to dream about it.
>
> To me, that part of the paper, really read as, "watch as I wave my
> hands around widely, that they never leave the ends of my arms!"

I partially understand where you are coming from, but your argument
seems to boil down to "don't say anything until you have worked out
every detail". I don't agree with this. Yes, it was speculative, but
we did have a fairly clear disclaimer.

To the point about it being obvious: you might be surprised at how
many people outside this community take it for granted that if you
fsync a file, only that file's contents and metadata will be persisted
:) So it was obvious to you, but truly shocking for many.

Btw, ext3-fast is what led to our CCFS work in FAST 17:
http://www.cs.utexas.edu/~vijay/papers/fast17-c2fs.pdf. In this paper,
we do show that if you divide your application writes into streams, it
is possible to persist only the data/metadata of one stream,
independent of the IO being done in other streams. So as it turned
out, it wasn't an impossible file-system design.

But we digress. I think we both agree that researchers should engage
more with the file-system community.

>
>> Thanks! As I mentioned before, this is useful. I have a follow-up
>> question. Consider the following workload:
>>
>>  creat foo
>>  link (foo, A/bar)
>>  fsync(foo)
>>  crash
>>
>> In this case, after the file system recovers, do we expect foo's link
>> count to be 2 or 1? I would say 2, but POSIX is silent on this, so
>> thought I would confirm. The tricky part here is we are not calling
>> fsync() on directory A.
>>
>> In this case, it's not a symlink; it's a hard link, so I would say the
>> link count for foo should be 2. But btrfs and F2FS show a link count of
>> 1 after a crash.
>
> Well, is the link count accurate?  That is to say, does A/bar exist?
> I would think that the requirement that the file system be
> self-consistent is the most important consideration.

There are two ways to look at this.

1. A/bar does not exist, link count is 1, and so it is not a bug.

2. We are calling fsync on the inode when the inode's link count is 2.
So it should persist the inode plus the dependency that is A/bar. The
file system after a crash should show both A/bar and the file with
link count 2. This is what ext4, xfs, and F2FS do.

We've posted separately to figure out what semantics btrfs supports.



[f2fs-dev] [PATCH] f2fs: Copy summary entries in bulk

2018-04-16 Thread LiFan
Copy summary entries in bulk in write_compacted_summaries(),
and zero the unused remainder of each page after writing the
summaries into it.

Signed-off-by: Fan li 
---
 fs/f2fs/segment.c | 38 ++
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c
index 3bc77c2..a08b2a8 100644
--- a/fs/f2fs/segment.c
+++ b/fs/f2fs/segment.c
@@ -3103,14 +3103,12 @@ static void write_compacted_summaries(struct f2fs_sb_info *sbi, block_t blkaddr)
 {
struct page *page;
unsigned char *kaddr;
-   struct f2fs_summary *summary;
struct curseg_info *seg_i;
int written_size = 0;
-   int i, j;
+   int i;
 
page = grab_meta_page(sbi, blkaddr++);
kaddr = (unsigned char *)page_address(page);
-   memset(kaddr, 0, PAGE_SIZE);
 
/* Step 1: write nat cache */
seg_i = CURSEG_I(sbi, CURSEG_HOT_DATA);
@@ -3124,34 +3122,42 @@ static void write_compacted_summaries(struct f2fs_sb_info *sbi, block_t blkaddr)
 
/* Step 3: write summary entries */
for (i = CURSEG_HOT_DATA; i <= CURSEG_COLD_DATA; i++) {
-   unsigned short blkoff;
+   unsigned short blkoff, page_cap, summary_cnt;
+
seg_i = CURSEG_I(sbi, i);
if (sbi->ckpt->alloc_type[i] == SSR)
blkoff = sbi->blocks_per_seg;
else
blkoff = curseg_blkoff(sbi, i);
 
-   for (j = 0; j < blkoff; j++) {
+   while (blkoff) {
if (!page) {
page = grab_meta_page(sbi, blkaddr++);
kaddr = (unsigned char *)page_address(page);
-   memset(kaddr, 0, PAGE_SIZE);
written_size = 0;
}
-   summary = (struct f2fs_summary *)(kaddr + written_size);
-   *summary = seg_i->sum_blk->entries[j];
-   written_size += SUMMARY_SIZE;
 
-   if (written_size + SUMMARY_SIZE <= PAGE_SIZE -
-   SUM_FOOTER_SIZE)
-   continue;
-
-   set_page_dirty(page);
-   f2fs_put_page(page, 1);
-   page = NULL;
+   page_cap = (PAGE_SIZE - SUM_FOOTER_SIZE - written_size)
+   / SUMMARY_SIZE;
+   summary_cnt = min(page_cap, blkoff);
+   memcpy(kaddr + written_size, seg_i->sum_blk->entries,
+   summary_cnt * SUMMARY_SIZE);
+   written_size += summary_cnt * SUMMARY_SIZE;
+   blkoff -= summary_cnt;
+
+   if (page_cap == summary_cnt) {
+   memset(kaddr + written_size, 0,
+   PAGE_SIZE - written_size);
+   set_page_dirty(page);
+   f2fs_put_page(page, 1);
+   page = NULL;
+   }
}
}
+
if (page) {
+   memset(kaddr + written_size, 0,
+   PAGE_SIZE - written_size);
set_page_dirty(page);
f2fs_put_page(page, 1);
}
-- 
2.7.4






[f2fs-dev] [PATCH] f2fs: allocate hot_data for atomic write more strictly

2018-04-16 Thread Yunlei He
If a file is not already marked as hot and has more dirty pages
than the threshold of 64 before starting an atomic write, it may
lose the hot flag.

Signed-off-by: Yunlei He 
---
 fs/f2fs/file.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c
index 79eeed5..bf61e20 100644
--- a/fs/f2fs/file.c
+++ b/fs/f2fs/file.c
@@ -1685,7 +1685,6 @@ static int f2fs_ioc_start_atomic_write(struct file *filp)
goto out;
 
set_inode_flag(inode, FI_ATOMIC_FILE);
-   set_inode_flag(inode, FI_HOT_DATA);
f2fs_update_time(F2FS_I_SB(inode), REQ_TIME);
 
if (!get_dirty_pages(inode))
@@ -1697,11 +1696,11 @@ static int f2fs_ioc_start_atomic_write(struct file *filp)
ret = filemap_write_and_wait_range(inode->i_mapping, 0, LLONG_MAX);
if (ret) {
clear_inode_flag(inode, FI_ATOMIC_FILE);
-   clear_inode_flag(inode, FI_HOT_DATA);
goto out;
}
 
 inc_stat:
+   set_inode_flag(inode, FI_HOT_DATA);
F2FS_I(inode)->inmem_task = current;
stat_inc_atomic_write(inode);
stat_update_max_atomic_write(inode);
-- 
1.9.1

