from:"Theodore Y. Ts'o"

Re: [PATCH -next] ext4: use DEFINE_MUTEX (and mutex_init() had been too late)

2020-12-23 Thread Theodore Y. Ts'o

On Wed, Dec 23, 2020 at 10:12:54PM +0800, Zheng Yongjun wrote:
> Signed-off-by: Zheng Yongjun 

Why is mutex_init() too late?  We only take the mutex after we
mounting an ext4 file system, and that can't happen until ext4_init_fs
is called.

- Ted

>  fs/ext4/super.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 94472044f4c1..8776f06a639d 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -59,7 +59,7 @@
>  #include 
>  
>  static struct ext4_lazy_init *ext4_li_info;
> -static struct mutex ext4_li_mtx;
> +static DEFINE_MUTEX(ext4_li_mtx);
>  static struct ratelimit_state ext4_mount_msg_ratelimit;
>  
>  static int ext4_load_journal(struct super_block *, struct ext4_super_block *,
> @@ -6640,7 +6640,6 @@ static int __init ext4_init_fs(void)
>  
>   ratelimit_state_init(_mount_msg_ratelimit, 30 * HZ, 64);
>   ext4_li_info = NULL;
> - mutex_init(_li_mtx);
>  
>   /* Build-time check for flags consistency */
>   ext4_check_flag_values();
> -- 
> 2.22.0
>

[GIT PULL] ext4 updates for v5.11-rc1

2020-12-22 Thread Theodore Y. Ts'o

The following changes since commit 418baf2c28f3473039f2f7377760bd8f6897ae18:

  Linux 5.10-rc5 (2020-11-22 15:36:08 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4_for_linus

for you to fetch changes up to be993933d2e997fdb72b8b1418d2a84df79b8962:

  ext4: remove unnecessary wbc parameter from ext4_bio_write_page (2020-12-22 
13:08:45 -0500)

NOTE: The reason why the branch had recently changed was to add a
one-line fix which added flush_work() call to an error/cleanup patgh,
to address a syzbot reported failure.  See the thread at:

http://lore.kernel.org/r/1faff305b709b...@google.com

There were also some commit description updates to add some Cc:
sta...@kernel.org tags.

This branch was tested and passes xfstests regression tests, and in
any case, it's all bug fixes and cleanups:

TESTRUNID: tytso-20201222152130
KERNEL:5.10.0-rc5-xfstests-00029-gbe993933d2e9 #2064 SMP Tue Dec 22 
15:19:12 EST 2020 x86_64
CMDLINE:   -c ext4/4k -g auto
CPUS:  2
MEM:   7680

ext4/4k: 520 tests, 43 skipped, 6608 seconds
Totals: 477 tests, 43 skipped, 0 failures, 0 errors, 6554s


Various bug fixes and cleanups for ext4; no new features this cycle.



Alexander Lochmann (1):
  Updated locking documentation for transaction_t

Chunguang Xu (7):
  ext4: use ASSERT() to replace J_ASSERT()
  ext4: remove redundant mb_regenerate_buddy()
  ext4: simplify the code of mb_find_order_for_block
  ext4: update ext4_data_block_valid related comments
  ext4: delete nonsensical (commented-out) code inside 
ext4_xattr_block_set()
  ext4: fix a memory leak of ext4_free_data
  ext4: avoid s_mb_prefetch to be zero in individual scenarios

Colin Ian King (1):
  ext4: remove redundant assignment of variable ex

Dan Carpenter (1):
  ext4: fix an IS_ERR() vs NULL check

Gustavo A. R. Silva (1):
  ext4: fix fall-through warnings for Clang

Harshad Shirwadkar (3):
  ext4: add docs about fast commit idempotence
  ext4: make fast_commit.h byte identical with e2fsprogs/fast_commit.h
  jbd2: add a helper to find out number of fast commit blocks

Jan Kara (8):
  ext4: fix deadlock with fs freezing and EA inodes
  ext4: don't remount read-only with errors=continue on reboot
  ext4: remove redundant sb checksum recomputation
  ext4: standardize error message in ext4_protect_reserved_inode()
  ext4: make ext4_abort() use __ext4_error()
  ext4: move functions in super.c
  ext4: simplify ext4 error translation
  ext4: defer saving error info from atomic context

Kaixu Xia (2):
  ext4: remove redundant operation that set bh to NULL
  ext4: remove the unused EXT4_CURRENT_REV macro

Lei Chen (1):
  ext4: remove unnecessary wbc parameter from ext4_bio_write_page

Roman Anufriev (2):
  ext4: add helpers for checking whether quota can be enabled/is journalled
  ext4: print quota journalling mode on (re-)mount

Theodore Ts'o (1):
  ext4: check for invalid block size early when mounting a file system

Xianting Tian (1):
  ext4: remove the null check of bio_vec page

 Documentation/filesystems/ext4/journal.rst |  50 ++
 fs/ext4/balloc.c   |   2 +-
 fs/ext4/block_validity.c   |  16 +-
 fs/ext4/ext4.h |  77 ++---
 fs/ext4/ext4_jbd2.c|   4 +-
 fs/ext4/ext4_jbd2.h|   9 +-
 fs/ext4/extents.c  |   5 +-
 fs/ext4/fast_commit.c  |  99 +++-
 fs/ext4/fast_commit.h  |  78 +++--
 fs/ext4/fsync.c|   2 +-
 fs/ext4/indirect.c |   4 +-
 fs/ext4/inode.c|  35 ++--
 fs/ext4/mballoc.c  |  39 ++---
 fs/ext4/namei.c|  12 +-
 fs/ext4/page-io.c  |   5 +-
 fs/ext4/super.c| 422 
-
 fs/ext4/xattr.c|   1 -
 fs/jbd2/journal.c  |   8 +-
 include/linux/jbd2.h   |  14 +-
 19 files changed, 504 insertions(+), 378 deletions(-)

Re: general protection fault in ext4_commit_super

2020-12-22 Thread Theodore Y. Ts'o

On Tue, Dec 22, 2020 at 12:28:53PM +0100, Jan Kara wrote:
> > Fix e810c942a325 ("ext4: save error info to sb through journal if 
> > available")
> > by flushing work as part of rollback.
> 
> Thanks for having a look. I don't think the fix is quite correct though. The
> flush_work() should be at failed_mount3: label. So something like attached
> fixup. Ted, can you please fold it into the buggy commit?

Done.  I folded it into "ext4: defer saving error info from atomic
context" since this is the commit where we introduced the s_error_work
workqueue.

Thanks!!

- Ted

Re: [PATCH v2 0/3] add support for metadata encryption to F2FS

2020-12-17 Thread Theodore Y. Ts'o

On Thu, Dec 17, 2020 at 08:51:14PM +, Satya Tangirala wrote:
> On Thu, Dec 17, 2020 at 01:08:49PM -0500, Theodore Y. Ts'o wrote:
> > On Thu, Dec 17, 2020 at 03:04:32PM +, Satya Tangirala wrote:
> > > This patch series adds support for metadata encryption to F2FS using
> > > blk-crypto.
> > 
> > Is there a companion patch series needed so that f2fstools can
> > check/repair a file system with metadata encryption enabled?
> > 
> > - Ted
> Yes! It's at
> https://lore.kernel.org/linux-f2fs-devel/20201217151013.1513045-1-sat...@google.com/

Cool, I've been meaning to update f2fs-tools in Debian, and including
these patches will allow us to generate {kvm,gce,android}-xfstests
images with this support.  I'm hoping to get to it sometime betweeen
Christmas and New Year's.

I guess there will need to be some additional work needed to create
the f2fs image with a fixed keys for a particular file system in
xfstests-bld, and then mounting and checking said image with the
appropriatre keys as well.   Is that something you've put together?

Cheers,

- Ted

Re: [PATCH] ext4: Don't leak old mountpoint samples

2020-12-17 Thread Theodore Y. Ts'o

On Tue, Dec 01, 2020 at 04:13:01PM +0100, Richard Weinberger wrote:
> As soon the first file is opened, ext4 samples the mountpoint
> of the filesystem in 64 bytes of the super block.
> It does so using strlcpy(), this means that the remaining bytes
> in the super block string buffer are untouched.
> If the mount point before had a longer path than the current one,
> it can be reconstructed.
> 
> Consider the case where the fs was mounted to "/media/johnjdeveloper"
> and later to "/".
> The the super block buffer then contains "/\x00edia/johnjdeveloper".
> 
> This case was seen in the wild and caused confusion how the name
> of a developer ands up on the super block of a filesystem used
> in production...
> 
> Fix this by clearing the string buffer before writing to it,
> 
> Signed-off-by: Richard Weinberger 

Thank for reporting this issue.  In fact, the better fix is to use
strncpy().  See my revised patch for an explanation of why

commit cdc9ad7d3f201a77749432878fb4caa490862de6
Author: Theodore Ts'o 
Date:   Thu Dec 17 13:24:15 2020 -0500

ext4: don't leak old mountpoint samples

When the first file is opened, ext4 samples the mountpoint of the
filesystem in 64 bytes of the super block.  It does so using
strlcpy(), this means that the remaining bytes in the super block
string buffer are untouched.  If the mount point before had a longer
path than the current one, it can be reconstructed.

Consider the case where the fs was mounted to "/media/johnjdeveloper"
and later to "/".  The super block buffer then contains
"/\x00edia/johnjdeveloper".

This case was seen in the wild and caused confusion how the name
of a developer ands up on the super block of a filesystem used
in production...

Fix this by using strncpy() instead of strlcpy().  The superblock
field is defined to be a fixed-size char array, and it is already
marked using __nonstring in fs/ext4/ext4.h.  The consumer of the field
in e2fsprogs already assumes that in the case of a 64+ byte mount
path, that s_last_mounted will not be NUL terminated.

Reported-by: Richard Weinberger 
Signed-off-by: Theodore Ts'o 

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1cd3d26e3217..349b27f0dda0 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -810,7 +810,7 @@ static int ext4_sample_last_mounted(struct super_block *sb,
if (err)
goto out_journal;
lock_buffer(sbi->s_sbh);
-   strlcpy(sbi->s_es->s_last_mounted, cp,
+   strncpy(sbi->s_es->s_last_mounted, cp,
sizeof(sbi->s_es->s_last_mounted));
ext4_superblock_csum_set(sb);
unlock_buffer(sbi->s_sbh);

Re: [PATCH v2 0/3] add support for metadata encryption to F2FS

2020-12-17 Thread Theodore Y. Ts'o

On Thu, Dec 17, 2020 at 03:04:32PM +, Satya Tangirala wrote:
> This patch series adds support for metadata encryption to F2FS using
> blk-crypto.

Is there a companion patch series needed so that f2fstools can
check/repair a file system with metadata encryption enabled?

- Ted

Re: [PATCH] fs: ext4: remove unnecessary wbc parameter from ext4_bio_write_page

2020-12-16 Thread Theodore Y. Ts'o

On Fri, Dec 11, 2020 at 02:54:24PM +0800, chenle...@gmail.com wrote:
> From: Lei Chen 
> 
> ext4_bio_write_page does not need wbc parameter, since its parameter
> io contains the io_wbc field. The io::io_wbc is initialized by
> ext4_io_submit_init which is called in ext4_writepages and
> ext4_writepage functions prior to ext4_bio_write_page.
> Therefor, when ext4_bio_write_page is called, wbc info
> has already been included in io parameter.
> 
> Signed-off-by: Lei Chen 

Thanks, applied.

- Ted

Re: [PATCH 031/141] ext4: Fix fall-through warnings for Clang

2020-12-16 Thread Theodore Y. Ts'o

On Fri, Nov 20, 2020 at 12:28:32PM -0600, Gustavo A. R. Silva wrote:
> In preparation to enable -Wimplicit-fallthrough for Clang, fix a warning
> by explicitly adding a break statement instead of just letting the code
> fall through to the next case.
> 
> Link: https://github.com/KSPP/linux/issues/115
> Signed-off-by: Gustavo A. R. Silva 

Thanks, applied.

- Ted

Re: [PATCH] ext4: fix -Wstringop-truncation warnings

2020-12-15 Thread Theodore Y. Ts'o

On Thu, Nov 12, 2020 at 05:33:24PM +0800, Kang Wenlin wrote:
> From: Wenlin Kang 
> 
> The strncpy() function may create a unterminated string,
> use strscpy_pad() instead.
> 
> This fixes the following warning:
> 
> fs/ext4/super.c: In function '__save_error_info':
> fs/ext4/super.c:349:2: warning: 'strncpy' specified bound 32 equals 
> destination size [-Wstringop-truncation]
>   strncpy(es->s_last_error_func, func, sizeof(es->s_last_error_func));
>   ^~~
> fs/ext4/super.c:353:3: warning: 'strncpy' specified bound 32 equals 
> destination size [-Wstringop-truncation]
>strncpy(es->s_first_error_func, func,
>^
> sizeof(es->s_first_error_func));
> ~~~

What compiler are you using?  s_last_error_func is defined to not
necessarily be NUL terminated.  So strscpy_pad() is not a proper
replacement for strncpy() in this use case.

>From Documentation/process/deprecated:

   If a caller is using non-NUL-terminated strings, strncpy() can
   still be used, but destinations should be marked with the `__nonstring
   `_
   attribute to avoid future compiler warnings.

s_{first,last}_error_func is properly annotated with __nonstring in
fs/ext4/ext4.h.

- Ted

Re: UBSAN: shift-out-of-bounds in ext4_fill_super

2020-12-14 Thread Theodore Y. Ts'o

(Dropping off-topic lists)

On Mon, Dec 14, 2020 at 03:37:37PM +0100, Dmitry Vyukov wrote:
> > It's going to make everyone else's tags who pull from ext4.git messy,
> > though, with gobs of tags that probably won't be of use to them.  It
> > does avoid the need to use git fetch --tags --force, and I guess
> > people are used to the need to GC tags with the linux-repo.

(I had meant to say linux-next repo above.)

> syzbot is now prepared and won't fail next time, nor on other similar
> trees. Which is good.
> So it's really up to you.

I'm curious --- are you having to do anything special in terms of
deleting old tags to keep the size of the repo under control?  Git
will keep a tag around indefinitely, so if you have huge numbers of
next-MMDD tags in your repo, the size will grow without bound.
Are you doing anything to automatically garbage collect tags to preven
this from being a problem?

(I am not pulling linux-next every day; only when I need to debug a
bug reported against the -next tree, so I just manually delete the
tags as necessary.  So I'm curious what folks who are following
linux-next are doing, and whether they have something specific for
linux-next tags, or whether they have a more general solution.)

Cheers,

- Ted

Re: UBSAN: shift-out-of-bounds in ext4_fill_super

2020-12-10 Thread Theodore Y. Ts'o

On Thu, Dec 10, 2020 at 09:09:51AM +0100, Dmitry Vyukov wrote:
> >  * [new tag]   ext4-for-linus-5.8-rc1-2 -> 
> > ext4-for-linus-5.8-rc1-2
> >  ! [rejected]  ext4_for_linus   -> ext4_for_linus  
> > (would clobber existing tag)
> 
> Interesting. First time I see this. Should syzkaller use 'git fetch
> --tags --force"?...
> StackOverflow suggests it should help:
> https://stackoverflow.com/questions/58031165/how-to-get-rid-of-would-clobber-existing-tag

Yeah, sorry, ext4_for_linus is a signed tag which is only used to
authenticate my pull request to Linus.  After Linus accepts the pull,
the digital signature is going to be upstream, and so I end up
deleting and the reusing that tag for the next merge window.

I guess I could just start always using ext4_for_linus- and
just delete the tags once they have been accepted, to keep my list of
tags clean. 

It's going to make everyone else's tags who pull from ext4.git messy,
though, with gobs of tags that probably won't be of use to them.  It
does avoid the need to use git fetch --tags --force, and I guess
people are used to the need to GC tags with the linux-repo.  So maybe
that's the right thing to do going forward.

- Ted

Re: UBSAN: shift-out-of-bounds in ext4_fill_super

2020-12-09 Thread Theodore Y. Ts'o

On Tue, Dec 08, 2020 at 11:33:11PM -0800, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:15ac8fdb Add linux-next specific files for 20201207
> git tree:   linux-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=1125c92350
> kernel config:  https://syzkaller.appspot.com/x/.config?x=3696b8138207d24d
> dashboard link: https://syzkaller.appspot.com/bug?extid=345b75652b1d24227443
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=151bf86b50
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=139212cb50

#syz test git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
e360ba58d067a30a4e3e7d55ebdd919885a058d6

>From 3d3bc303a8a8f7123cf486f49fa9060116fa1465 Mon Sep 17 00:00:00 2001
From: Theodore Ts'o 
Date: Wed, 9 Dec 2020 15:59:11 -0500
Subject: [PATCH] ext4: check for invalid block size early when mounting a file
 system

Check for valid block size directly by validating s_log_block_size; we
were doing this in two places.  First, by calculating blocksize via
BLOCK_SIZE << s_log_block_size, and then checking that the blocksize
was valid.  And then secondly, by checking s_log_block_size directly.

The first check is not reliable, and can trigger an UBSAN warning if
s_log_block_size on a maliciously corrupted superblock is greater than
22.  This is harmless, since the second test will correctly reject the
maliciously fuzzed file system, but to make syzbot shut up, and
because the two checks are duplicative in any case, delete the
blocksize check, and move the s_log_block_size earlier in
ext4_fill_super().

Signed-off-by: Theodore Ts'o 
Reported-by: syzbot+345b75652b1d24227...@syzkaller.appspotmail.com
---
 fs/ext4/super.c | 40 
 1 file changed, 16 insertions(+), 24 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index f86220a8df50..4a16bbf0432c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4202,18 +4202,25 @@ static int ext4_fill_super(struct super_block *sb, void 
*data, int silent)
 */
sbi->s_li_wait_mult = EXT4_DEF_LI_WAIT_MULT;
 
-   blocksize = BLOCK_SIZE << le32_to_cpu(es->s_log_block_size);
-
-   if (blocksize == PAGE_SIZE)
-   set_opt(sb, DIOREAD_NOLOCK);
-
-   if (blocksize < EXT4_MIN_BLOCK_SIZE ||
-   blocksize > EXT4_MAX_BLOCK_SIZE) {
+   if (le32_to_cpu(es->s_log_block_size) >
+   (EXT4_MAX_BLOCK_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE)) {
ext4_msg(sb, KERN_ERR,
-  "Unsupported filesystem blocksize %d (%d 
log_block_size)",
-blocksize, le32_to_cpu(es->s_log_block_size));
+"Invalid log block size: %u",
+le32_to_cpu(es->s_log_block_size));
goto failed_mount;
}
+   if (le32_to_cpu(es->s_log_cluster_size) >
+   (EXT4_MAX_CLUSTER_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE)) {
+   ext4_msg(sb, KERN_ERR,
+"Invalid log cluster size: %u",
+le32_to_cpu(es->s_log_cluster_size));
+   goto failed_mount;
+   }
+
+   blocksize = EXT4_MIN_BLOCK_SIZE << le32_to_cpu(es->s_log_block_size);
+
+   if (blocksize == PAGE_SIZE)
+   set_opt(sb, DIOREAD_NOLOCK);
 
if (le32_to_cpu(es->s_rev_level) == EXT4_GOOD_OLD_REV) {
sbi->s_inode_size = EXT4_GOOD_OLD_INODE_SIZE;
@@ -4432,21 +4439,6 @@ static int ext4_fill_super(struct super_block *sb, void 
*data, int silent)
if (!ext4_feature_set_ok(sb, (sb_rdonly(sb
goto failed_mount;
 
-   if (le32_to_cpu(es->s_log_block_size) >
-   (EXT4_MAX_BLOCK_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE)) {
-   ext4_msg(sb, KERN_ERR,
-"Invalid log block size: %u",
-le32_to_cpu(es->s_log_block_size));
-   goto failed_mount;
-   }
-   if (le32_to_cpu(es->s_log_cluster_size) >
-   (EXT4_MAX_CLUSTER_LOG_SIZE - EXT4_MIN_BLOCK_LOG_SIZE)) {
-   ext4_msg(sb, KERN_ERR,
-"Invalid log cluster size: %u",
-le32_to_cpu(es->s_log_cluster_size));
-   goto failed_mount;
-   }
-
if (le16_to_cpu(sbi->s_es->s_reserved_gdt_blocks) > (blocksize / 4)) {
ext4_msg(sb, KERN_ERR,
 "Number of reserved GDT blocks insanely large: %d",
-- 
2.28.0

Re: Pass modules to Linux kernel without initrd

2020-12-08 Thread Theodore Y. Ts'o

On Tue, Dec 08, 2020 at 10:24:08AM +0100, Paul Menzel wrote:
> Dear Linux folks,
> 
> Trying to reduce the boot time of standard distributions, I would like to
> get rid of the initrd. The initrd is for mounting the root file system and
> on most end user systems with standard distributions that means loading the
> bus driver for the drive and the file system driver. Everyone could build
> their own Linux kernel and build the drivers into the Linux kernel, but most
> users enjoy using the distribution Linux kernel, which build the drivers as
> modules to support a lot of systems. (I think Fedora builds the default file
> system driver (of the installer) into the Linux kernel.)

It's unclear what you are trying to speed up by replacing the initrd
with "appending the required modules to the Linux kernel image".  Why
do you think this will speed things up?  What do you think is
currently slow with using an initrd?

If what you are concerned about is the speed to load an initrd which
has all of the kernel modules shipped by the distribution, including
those not needed by a particular hardware platform, there are
distributions which can be configured to automatically include only
those kernel modules needed for a particular system.

There are also some shell scripts which some people have written that
will automatically create a kernel config file which only has the
device drivers needed for a particular system.  Creating a system
which used such a script, and then compiled a custom kernel image
would also not be hard.

You seem to be assuming that building a custom kernel image ish
hard(tm), and so no user would want to do this.  If this were 
automated, what is your objection to such an approach?

Without a clear understanding what part of the boot process you think
is slow, and which you are trying to optimize, and what precisely your
constraints are, or at least, what you *think* your constraints are,
and why you think things have to be that way, it's going to be hard to
comment further.

Cheers,

- Ted

Re: Why the auxiliary cipher in gss_krb5_crypto.c?

2020-12-04 Thread Theodore Y. Ts'o

On Fri, Dec 04, 2020 at 02:59:35PM +, David Howells wrote:
> Hi Chuck, Bruce,
> 
> Why is gss_krb5_crypto.c using an auxiliary cipher?  For reference, the
> gss_krb5_aes_encrypt() code looks like the attached.
> 
> From what I can tell, in AES mode, the difference between the main cipher and
> the auxiliary cipher is that the latter is "cbc(aes)" whereas the former is
> "cts(cbc(aes))" - but they have the same key.
> 
> Reading up on CTS, I'm guessing the reason it's like this is that CTS is the
> same as the non-CTS, except for the last two blocks, but the non-CTS one is
> more efficient.

The reason to use CTS is if you don't want to expand the size of the
cipher text to the cipher block size.  e.g., if you have a 53 byte
plaintext, and you can't afford to let the ciphertext be 56 bytes, the
cryptographic engineer will reach for CTS instead of CBC.

So that probably explains the explanation to use CTS (and it's
required by the spec in any case).  As far as why CBC is being used
instead of CTS, the only reason I can think of is the one you posted.
Perhaps there was some hardware or software configureation where
cbc(aes) was hardware accelerated, and cts(cbc(aes)) would not be?

In any case, using cbc(aes) for all but the last two blocks, and using
cts(cbc(aes)) for the last two blocks, is identical to using
cts(cbc(aes)) for the whole encryption.  So the only reason to do this
in the more complex way would be because for performance reasons.

 - Ted

Re: [PATCH V2] uapi: fix statx attribute value overlap for DAX & MOUNT_ROOT

2020-12-04 Thread Theodore Y. Ts'o

On Thu, Dec 03, 2020 at 08:18:23AM +0200, Amir Goldstein wrote:
> Here is a recent example, where during patch review, I requested NOT to 
> include
> any stable backport triggers [1]:
> "...We should consider sending this to stable, but maybe let's merge
> first and let it
>  run in master for a while before because it is not a clear and
> immediate danger..."
>
> As a developer and as a reviewer, I wish (as Dave implied) that I had a way to
> communicate to AUTOSEL that auto backport of this patch has more risk than
> the risk of not backporting.

My suggestion is that we could put something in the MAINTAINERS file
which indicates what the preferred delay time should be for (a)
patches explicitly cc'ed to stable, and (b) preferred time should be
for patches which are AUTOSEL'ed for stable for that subsystem.  That
time might be either in days/weeks, or "after N -rc releases", "after
the next full release", or, "never" (which would be a way for a
subsystem to opt out of the AUTOSEL process).

It should also be possible specify the delay in the trailer, e.g.:

Stable-Defer: 
Auto-Stable-Defer: 

The advantage of specifying the delay relative to when they show up in
Linus's tree helps deal with the case where the submaintainer might
not be sure when their patches will get pushed to Linus by the
maintainer.

Cheers,

- Ted

Re: [Ksummit-discuss] crediting bug reports and fixes folded into original patch

2020-12-03 Thread Theodore Y. Ts'o

On Thu, Dec 03, 2020 at 12:43:52AM +0100, Vlastimil Babka wrote:
> 
> there was a bit of debate on Twitter about this, so I thought I would bring it
> here. Imagine a scenario where patch sits as a commit in -next and there's a 
> bug
> report or fix, possibly by a bot or with some static analysis. The maintainer
> decides to fold it into the original patch, which makes sense for e.g.
> bisectability. But there seem to be no clear rules about attribution in this
> case, which looks like there should be, probably in
> Documentation/maintainer/modifying-patches.rst

I don't think there should be any kind of fixed, inflexible rules
about this.  

1) Sometimes there will be a *huge* number of comments and
suggestions.  Do we really want to require links to dozens of mail
message id's, and/or dozens or more e-mail addresses?

2) Sometimes a fixup is pretty trivial; even if it is expressed in the
form of a one-line patch, versus someone who does a detailed review of
a patch, but doesn't actually end up appending an explicit
Reviewed-by, perhaps because he or she didn't completely agree with
the final version of the patch.

3) I think this very much should be up to the maintainer's discretion,
as opposed to making rules that may result in some rediculous amount
of bloat in the git log.

4) It's really unhealthy, in my opinion for people to be fixed on
counting attributions.  If we create fixed rules, this can turn into
people try to game the system.  It's the same reason why I'm not
terribly enthusiastic about people trying to game Signed-off-by counts
by sending gazillions of white space or spelling fixes.

If the fix is large enough that for copyright reasons we need to
acknowledge the work, then folding in the SoB as for DCO reason makes
perfect sense.  But if it's a trivial patch (the kind where projects
that require copyright assignment wouldn't require executed legal
agreements), then perhaps attribution is not always a requirement.
Again, there are times when people who spend a lot of work discussing
patch may not get attributiionm even if they didn't actually create
the one-line whitespace fix and sent it in as a patch with a
signed-off-by with a demand that the attribution be preserved.

Common sense really needs to prevale here, and I'm concerned that
people who like to create rules don't realize what a mess this can
create when contributors approach their participation with a sense of
entitlement.

Cheers,

- Ted

Re: [PATCH v3] Updated locking documentation for transaction_t

2020-12-03 Thread Theodore Y. Ts'o

On Thu, Dec 03, 2020 at 03:38:40PM +0100, Alexander Lochmann wrote:
> 
> 
> On 03.12.20 15:04, Theodore Y. Ts'o wrote:
> > On Thu, Oct 15, 2020 at 03:26:28PM +0200, Alexander Lochmann wrote:
> > > Hi folks,
> > > 
> > > I've updated the lock documentation according to our finding for
> > > transaction_t.
> > > Does this patch look good to you?
> > 
> > I updated the annotations to match with the local usage, e.g:
> > 
> >  * When commit was requested [journal_t.j_state_lock]
> > 
> > became:
> > 
> >  * When commit was requested [j_state_lock]What do you mean by local 
> > usage?
> The annotations of other members of transaction_t?

Yes, I'd like the annotations of the other objects to be consistent,
and just use j_state_lock, j_list_lock, etc., for the other annotations.

> Shouldn't the annotation look like this?
> [t_journal->j_state_lock]
> It would be more precise.

It's more precise, but it's also unnecessary in this case, since all
of the elements of the journal have a j_ prefix, elements of a
transaction_t have a t_ prefix, etc.  There is also no other structure
element which has a j_state_lock name *other* than in journal_t.

Cheers,

- Ted

Re: [PATCH][next] ext4: remove redundant assignment of variable ex

2020-12-03 Thread Theodore Y. Ts'o

On Wed, Oct 21, 2020 at 02:23:26PM +0100, Colin King wrote:
> From: Colin Ian King 
> 
> Variable ex is assigned a variable that is not being read, the assignment
> is redundant and can be removed.
> 
> Addresses-Coverity: ("Unused value")
> Signed-off-by: Colin Ian King 

Thanks, applied.

- Ted

Re: [PATCH] ext4: remove the null check of bio_vec page

2020-12-03 Thread Theodore Y. Ts'o

On Wed, Oct 21, 2020 at 12:25:03PM +0200, Jan Kara wrote:
> On Tue 20-10-20 16:22:01, Xianting Tian wrote:
> > bv_page can't be NULL in a valid bio_vec, so we can remove the NULL check,
> > as we did in other places when calling bio_for_each_segment_all() to go
> > through all bio_vec of a bio.
> > 
> > Signed-off-by: Xianting Tian 
> 
> Thanks for the patch. It looks good to me. You can add:
> 
> Reviewed-by: Jan Kara 

Applied, thanks.

- Ted

Re: [PATCH v3] Updated locking documentation for transaction_t

2020-12-03 Thread Theodore Y. Ts'o

On Thu, Oct 15, 2020 at 03:26:28PM +0200, Alexander Lochmann wrote:
> Hi folks,
> 
> I've updated the lock documentation according to our finding for
> transaction_t.
> Does this patch look good to you?

I updated the annotations to match with the local usage, e.g:

 * When commit was requested [journal_t.j_state_lock]

became:

 * When commit was requested [j_state_lock]

Otherwise, looks good.  Thanks for the patch!

   - Ted

Re: [PATCH v7 6/8] ext4: support direct I/O with fscrypt using blk-crypto

2020-12-03 Thread Theodore Y. Ts'o

On Tue, Nov 17, 2020 at 02:07:06PM +, Satya Tangirala wrote:
> From: Eric Biggers 
> 
> Wire up ext4 with fscrypt direct I/O support. Direct I/O with fscrypt is
> only supported through blk-crypto (i.e. CONFIG_BLK_INLINE_ENCRYPTION must
> have been enabled, the 'inlinecrypt' mount option must have been specified,
> and either hardware inline encryption support must be present or
> CONFIG_BLK_INLINE_ENCYRPTION_FALLBACK must have been enabled). Further,
> direct I/O on encrypted files is only supported when the *length* of the
> I/O is aligned to the filesystem block size (which is *not* necessarily the
> same as the block device's block size).
> 
> fscrypt_limit_io_blocks() is called before setting up the iomap to ensure
> that the blocks of each bio that iomap will submit will have contiguous
> DUNs. Note that fscrypt_limit_io_blocks() is normally a no-op, as normally
> the DUNs simply increment along with the logical blocks. But it's needed
> to handle an edge case in one of the fscrypt IV generation methods.
> 
> Signed-off-by: Eric Biggers 
> Co-developed-by: Satya Tangirala 
> Signed-off-by: Satya Tangirala 
> Reviewed-by: Jaegeuk Kim 

Acked-by: Theodore Ts'o

Re: [PATCH v9 2/2] fs: ext4: Modify inode-test.c to use KUnit parameterized testing feature

2020-12-02 Thread Theodore Y. Ts'o

On Mon, Nov 16, 2020 at 11:11:50AM +0530, Arpitha Raghunandan wrote:
> Modify fs/ext4/inode-test.c to use the parameterized testing
> feature of KUnit.
> 
> Signed-off-by: Arpitha Raghunandan <98.a...@gmail.com>
> Signed-off-by: Marco Elver 

Acked-by: Theodore Ts'o

Re: [PATCH v9 1/2] kunit: Support for Parameterized Testing

2020-12-02 Thread Theodore Y. Ts'o

On Mon, Nov 30, 2020 at 02:22:22PM -0800, 'Brendan Higgins' via KUnit 
Development wrote:
> 
> Looks good to me. I would definitely like to pick this up. But yeah,
> in order to pick up 2/2 we will need an ack from either Ted or Iurii.
> 
> Ted seems to be busy right now, so I think I will just ask Shuah to go
> ahead and pick this patch up by itself and we or Ted can pick up patch
> 2/2 later.

I have been paying attention to this patch series, but I had presumed
that this was much more of a kunit change than an ext4 change, and the
critical bits was a review of the kunit infrastructure.  I certainly
have no objection to changing the ext4 test to use the new
parameterized testing, and if you'd like me to give a quick review,
I'll take a quick look.  I assume, Brendan, that you've already tried
doing a compile and run test of the patch series, so I'm not going to
do that?

- Ted

Re: drivers/char/random.c needs a (new) maintainer

2020-11-30 Thread Theodore Y. Ts'o

On Mon, Nov 30, 2020 at 04:15:23PM +0100, Jason A. Donenfeld wrote:
> I am willing to maintain random.c and have intentions to have a
> formally verified RNG. I've mentioned this to Ted before.
> 
> But I think Ted's reluctance to not accept the recent patches sent to
> this list is mostly justified, and I have no desire to see us rush
> into replacing random.c with something suboptimal or FIPSy.

Being a maintainer is not about *accepting* patches, it's about
*reviewing* them.  I do plan to make time to catch up on reviewing
patches this cycle.  One thing that would help me is if folks
(especially Jason, if you would) could start with a detailed review of
Nicolai's patches.  His incremental approach is I believe the best one
from a review perspective, and certainly his cleanup patches are ones
which I would expect are no-brainers.

- Ted

[GIT PULL] ext4 bug fixes for 5.10-rc

2020-11-22 Thread Theodore Y. Ts'o

The following changes since commit 09162bc32c880a791c6c0668ce0745cf7958f576:

  Linux 5.10-rc4 (2020-11-15 16:44:31 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4_for_linus_fixes2

for you to fetch changes up to f902b216501094495ff75834035656e8119c537f:

  ext4: fix bogus warning in ext4_update_dx_flag() (2020-11-19 22:41:10 -0500)


A final set of miscellaneous bug fixes for ext4


Jan Kara (1):
  ext4: fix bogus warning in ext4_update_dx_flag()

Mauro Carvalho Chehab (1):
  jbd2: fix kernel-doc markups

Theodore Ts'o (1):
  ext4: drop fast_commit from /proc/mounts

 fs/ext4/ext4.h|  3 ++-
 fs/ext4/super.c   |  4 
 fs/jbd2/journal.c | 34 ++
 fs/jbd2/transaction.c | 31 ---
 include/linux/jbd2.h  |  2 +-
 5 files changed, 37 insertions(+), 37 deletions(-)

Re: [PATCH v4 12/27] jbd2: fix kernel-doc markups

2020-11-19 Thread Theodore Y. Ts'o

On Mon, Nov 16, 2020 at 11:18:08AM +0100, Mauro Carvalho Chehab wrote:
> Kernel-doc markup should use this format:
> identifier - description
> 
> They should not have any type before that, as otherwise
> the parser won't do the right thing.
> 
> Also, some identifiers have different names between their
> prototypes and the kernel-doc markup.
> 
> Reviewed-by: Jan Kara 
> Signed-off-by: Mauro Carvalho Chehab 

Applied to the ext4 tree, thanks!

- Ted

Re: [RESEND][PATCH] ima: Set and clear FMODE_CAN_READ in ima_calc_file_hash()

2020-11-17 Thread Theodore Y. Ts'o

On Tue, Nov 17, 2020 at 10:23:58AM -0800, Linus Torvalds wrote:
> On Mon, Nov 16, 2020 at 10:35 AM Mimi Zohar  wrote:
> >
> > We need to differentiate between signed files, which by definition are
> > immutable, and those that are mutable.  Appending to a mutable file,
> > for example, would result in the file hash not being updated.
> > Subsequent reads would fail.
> 
> Why would that require any reading of the file at all AT WRITE TIME?
> 
> Don't do it. Really.
> 
> When opening the file write-only, you just invalidate the hash. It
> doesn't matter anyway - you're only writing.
> 
> Later on, when reading, only at that point does the hash matter, and
> then you can do the verification.
> 
> Although honestly, I don't even see the point. You know the hash won't
> match, if you wrote to the file.

I think the use case the IMA folks might be thinking about is where
they want to validate the file at open time, *before* the userspace
application starts writing to the file, since there might be some
subtle attacks where Boris changes the first part of the file before
Alice appends "I agree" to said file.

Of course, Boris will be able to modify the file after Alice has
modified it, so it's a bit of a moot point, but one could imagine a
scenario where the file is modified while the system is not running
(via an evil hotel maid) and then after Alice modifies the file, of
*course* the hash will be invalid, so no one would notice.  A sane
application would have read the file to make sure it contained the
proper contents before appending "I agree" to said file, so it's a bit
of an esoteric point.

The other case I could imagine is if the file is marked execute-only,
without read access, and IMA wanted to be able to read the file to
check the hash.  But we already make an execption for allowing the
file to be read during page faults, so that's probably less
controversial.

- Ted

Re: [PATCH v2 1/3] libfs: Add generic function for setting dentry_ops

2020-11-17 Thread Theodore Y. Ts'o

On Tue, Nov 17, 2020 at 04:03:13AM +, Daniel Rosenberg wrote:
> This adds a function to set dentry operations at lookup time that will
> work for both encrypted filenames and casefolded filenames.
> 
> A filesystem that supports both features simultaneously can use this
> function during lookup preparations to set up its dentry operations once
> fscrypt no longer does that itself.
> 
> Currently the casefolding dentry operation are always set if the
> filesystem defines an encoding because the features is toggleable on
> empty directories. Since we don't know what set of functions we'll
> eventually need, and cannot change them later, we add just add them.
> 
> Signed-off-by: Daniel Rosenberg 

Reviewed-by: Theodore Ts'o 

- Ted

Re: [PATCH v2 2/3] fscrypt: Have filesystems handle their d_ops

2020-11-17 Thread Theodore Y. Ts'o

On Tue, Nov 17, 2020 at 09:04:11AM -0800, Jaegeuk Kim wrote:
> 
> I'd like to pick this patch series in f2fs/dev for -next, so please let me 
> know
> if you have any concern.

No concern for me as far as ext4 is concerned, thanks!

 - Ted

Re: [PATCH v7 0/8] add support for direct I/O with fscrypt using blk-crypto

2020-11-17 Thread Theodore Y. Ts'o

What is the expected use case for Direct I/O using fscrypt?  This
isn't a problem which is unique to fscrypt, but one of the really
unfortunate aspects of the DIO interface is the silent fallback to
buffered I/O.  We've lived with this because DIO goes back decades,
and the original use case was to keep enterprise databases happy, and
the rules around what is necessary for DIO to work was relatively well
understood.

But with fscrypt, there's going to be some additional requirements
(e.g., using inline crypto) required or else DIO silently fall back to
buffered I/O for encrypted files.  Depending on the intended use case
of DIO with fscrypt, this caveat might or might not be unfortunately
surprising for applications.

I wonder if we should have some kind of interface so we can more
explicitly allow applications to query exactly what the requirements
might be for a particular file vis-a-vis Direct I/O.  What are the
memory alignment requirements, what are the file offset alignment
requirements, what are the write size requirements, for a particular
file.

- Ted

Re: [PATCH v2 2/3] fscrypt: Have filesystems handle their d_ops

2020-11-17 Thread Theodore Y. Ts'o

On Tue, Nov 17, 2020 at 04:03:14AM +, Daniel Rosenberg wrote:
> This shifts the responsibility of setting up dentry operations from
> fscrypt to the individual filesystems, allowing them to have their own
> operations while still setting fscrypt's d_revalidate as appropriate.
> 
> Most filesystems can just use generic_set_encrypted_ci_d_ops, unless
> they have their own specific dentry operations as well. That operation
> will set the minimal d_ops required under the circumstances.
> 
> Since the fscrypt d_ops are set later on, we must set all d_ops there,
> since we cannot adjust those later on. This should not result in any
> change in behavior.
> 
> Signed-off-by: Daniel Rosenberg 

Acked-by: Theodore Ts'o

[GIT PULL] more ext4 fixes for v5.10-rc4

2020-11-12 Thread Theodore Y. Ts'o

The following changes since commit 52d1998d09af92d44ffce7454637dd3fd1afdc7d:

  Merge tag 'fscrypt-for-linus' of 
git://git.kernel.org/pub/scm/fs/fscrypt/fscrypt (2020-11-10 10:05:37 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4_for_linus_bugfixes

for you to fetch changes up to d196e229a80c39254f4adbc312f55f5198e98941:

  Revert "ext4: fix superblock checksum calculation race" (2020-11-11 14:24:18 
-0500)


Two ext4 bug fixes, one via a revert of a commit sent during the merge window.


Harshad Shirwadkar (1):
  ext4: handle dax mount option collision

Theodore Ts'o (1):
  Revert "ext4: fix superblock checksum calculation race"

 fs/ext4/ext4.h  |  6 +++---
 fs/ext4/super.c | 11 ---
 2 files changed, 3 insertions(+), 14 deletions(-)

Re: How to enable auto-suspend by default

2020-11-10 Thread Theodore Y. Ts'o

One note...  I'll double check, but on my XPS 13 9380, as I recall, I
have to manually disable autosuspend on all of the XHCI controllers
and internal hubs after running "powertop --auto-tune", or else any
external mouse attached to said USB device will be dead to the world
for 2-3 seconds if the autosuspend timeout has kicked in, which was
***super*** annoying.

- Ted

Re: [PATCH 0/2] Tristate moount option comatibility fixup

2020-11-10 Thread Theodore Y. Ts'o

On Mon, Nov 09, 2020 at 08:10:07PM +0100, Michal Suchanek wrote:
> Hello,
> 
> after the tristate dax option change some applications fail to detect
> pmem devices because the dax option no longer shows in mtab when device
> is mounted with -o dax.

Which applications?  Name them.

We *really* don't want to encourage applications to make decisions
only based on the mount options.  For example, it could be that the
application's files will have the S_DAX flag set.

It would be a real shame if we are actively encourage applications to
use a broken configuration mechanism which was only used as a hack
while DAX was in experimental status.

- Ted

[GIT PULL] ext4 cleanups for 5.10-rc4

2020-11-09 Thread Theodore Y. Ts'o

(Resent with missing cc's, sorry.)

The following changes since commit 3cea11cd5e3b00d91caf0b4730194039b45c5891:

  Linux 5.10-rc2 (2020-11-01 14:43:51 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4_for_linus_cleanups

for you to fetch changes up to 05d5233df85e9621597c5838e95235107eb624a2:

  jbd2: fix up sparse warnings in checkpoint code (2020-11-07 00:09:08 -0500)


More fixes and cleanups for the new fast_commit features, but also a
few other miscellaneous bug fixes and a cleanup for the MAINTAINERS
file.


Chao Yu (1):
  MAINTAINERS: add missing file in ext4 entry

Dan Carpenter (1):
  ext4: silence an uninitialized variable warning

Harshad Shirwadkar (22):
  ext4: describe fast_commit feature flags
  ext4: mark fc ineligible if inode gets evictied due to mem pressure
  ext4: drop redundant calls ext4_fc_track_range
  ext4: fixup ext4_fc_track_* functions' signature
  jbd2: rename j_maxlen to j_total_len and add jbd2_journal_max_txn_bufs
  ext4: clean up the JBD2 API that initializes fast commits
  jbd2: drop jbd2_fc_init documentation
  jbd2: don't use state lock during commit path
  jbd2: don't pass tid to jbd2_fc_end_commit_fallback()
  jbd2: add todo for a fast commit performance optimization
  jbd2: don't touch buffer state until it is filled
  jbd2: don't read journal->j_commit_sequence without taking a lock
  ext4: dedpulicate the code to wait on inode that's being committed
  ext4: fix code documentatioon
  ext4: mark buf dirty before submitting fast commit buffer
  ext4: remove unnecessary fast commit calls from ext4_file_mmap
  ext4: fix inode dirty check in case of fast commits
  ext4: disable fast commit with data journalling
  ext4: issue fsdev cache flush before starting fast commit
  ext4: make s_mount_flags modifications atomic
  jbd2: don't start fast commit on aborted journal
  ext4: cleanup fast commit mount options

Joseph Qi (1):
  ext4: unlock xattr_sem properly in ext4_inline_data_truncate()

Kaixu Xia (1):
  ext4: correctly report "not supported" for {usr,grp}jquota when 
!CONFIG_QUOTA

Theodore Ts'o (2):
  ext4: fix sparse warnings in fast_commit code
  jbd2: fix up sparse warnings in checkpoint code

 Documentation/filesystems/ext4/journal.rst |   6 ++
 Documentation/filesystems/ext4/super.rst   |   7 +++
 Documentation/filesystems/journalling.rst  |   6 +-
 MAINTAINERS|   1 +
 fs/ext4/ext4.h |  66 ++--
 fs/ext4/extents.c  |   7 +--
 fs/ext4/fast_commit.c  | 174 
+++--
 fs/ext4/fast_commit.h  |   6 +-
 fs/ext4/file.c |   6 +-
 fs/ext4/fsmap.c|   2 +-
 fs/ext4/fsync.c|   2 +-
 fs/ext4/inline.c   |   1 +
 fs/ext4/inode.c|  19 +++---
 fs/ext4/mballoc.c  |   6 +-
 fs/ext4/namei.c|  61 +--
 fs/ext4/super.c|  47 ---
 fs/jbd2/checkpoint.c   |   2 +
 fs/jbd2/commit.c   |  11 +++-
 fs/jbd2/journal.c  | 138 
+++---
 fs/jbd2/recovery.c |   6 +-
 fs/jbd2/transaction.c  |   4 +-
 fs/ocfs2/journal.c |   2 +-
 include/linux/jbd2.h   |  23 ---
 include/trace/events/ext4.h|  10 +--
 24 files changed, 342 insertions(+), 271 deletions(-)

Re: [PATCH] MAINTAINERS: add missing file in ext4 entry

2020-11-06 Thread Theodore Y. Ts'o

On Fri, Oct 30, 2020 at 10:24:35AM +0800, Chao Yu wrote:
> include/trace/events/ext4.h belongs to ext4 module, add the file path into
> ext4 entry in MAINTAINERS.
> 
> Signed-off-by: Chao Yu 

Thanks, applied.

- Ted

Re: [PATCH] ext4: Use generic casefolding support

2020-10-29 Thread Theodore Y. Ts'o

On Wed, Oct 28, 2020 at 05:08:20AM +, Daniel Rosenberg wrote:
> This switches ext4 over to the generic support provided in libfs.
> 
> Since casefolded dentries behave the same in ext4 and f2fs, we decrease
> the maintenance burden by unifying them, and any optimizations will
> immediately apply to both.
> 
> Signed-off-by: Daniel Rosenberg 
> Reviewed-by: Eric Biggers 

Applied, thanks.

- Ted

GIT PULL] ext4 fixes for 5.10-rc2

2020-10-29 Thread Theodore Y. Ts'o

The following changes since commit 96485e4462604744d66bf4301557d996d80b85eb:

  Merge tag 'ext4_for_linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 (2020-10-22 10:31:08 
-0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4_for_linus_fixes

for you to fetch changes up to 6694875ef8045cdb1e6712ee9b68fe08763507d8:

  ext4: indicate that fast_commit is available via /sys/fs/ext4/feature/... 
(2020-10-28 13:43:22 -0400)


Bug fixes for the new ext4 fast commit feature, plus a fix for the
data=journal bug fix.  Also use the generic casefolding support which
has now landed in fs/libfs.c for 5.10.


Andrea Righi (1):
  ext4: properly check for dirty state in ext4_inode_datasync_dirty()

Daniel Rosenberg (1):
  ext4: use generic casefolding support

Harshad Shirwadkar (4):
  ext4: fix double locking in ext4_fc_commit_dentry_updates()
  ext4: make num of fast commit blocks configurable
  ext4: use s_mount_flags instead of s_mount_state for fast commit state
  ext4: use IS_ERR() for error checking of path

Jan Kara (1):
  ext4: fix mmap write protection for data=journal mode

Mauro Carvalho Chehab (1):
  jbd2: fix a kernel-doc markup

Theodore Ts'o (1):
  ext4: indicate that fast_commit is available via /sys/fs/ext4/feature/...

yangerkun (1):
  ext4: do not use extent after put_bh

 fs/ext4/dir.c | 64 
++--
 fs/ext4/ext4.h| 20 
 fs/ext4/extents.c | 30 +++---
 fs/ext4/fast_commit.c | 37 -
 fs/ext4/hash.c|  2 +-
 fs/ext4/inode.c   | 15 +--
 fs/ext4/namei.c   | 20 
 fs/ext4/super.c   | 16 
 fs/ext4/sysfs.c   |  2 ++
 include/linux/jbd2.h  |  7 +--
 10 files changed, 78 insertions(+), 135 deletions(-)

Re: [PATCH] ext4: properly check for dirty state in ext4_inode_datasync_dirty()

2020-10-28 Thread Theodore Y. Ts'o

On Wed, Oct 28, 2020 at 08:57:03AM +0530, Ritesh Harjani wrote:
> 
> Well, I too noticed this yesterday while I was testing xfstests -g swap.
> Those tests were returning _notrun, hence that could be the reason why
> it didn't get notice in XFSTESTing from Ted.

Yeah, one of the things I discussed with Harshad is we really need a
test that looks like generic/472, but which is in shared/NNN, and
which unconditionally tries to use swapon for those file systems where
swapfiles are expected to work.  This is actually the second
regression caused by our breaking swapfile support (the other being
the iomap bmap change), which escaped our testing because we didn't
notice that generic/472 was skipped.

(Mental note; perhaps we should have a way of flagging tests that are
skipped when previously they would run in the {kvm,gce}-xfstests
framework.)

- Ted

Re: [PATCH v3 23/32] jbd2: fix a kernel-doc markup

2020-10-27 Thread Theodore Y. Ts'o

On Tue, Oct 27, 2020 at 10:51:27AM +0100, Mauro Carvalho Chehab wrote:
> The kernel-doc markup that documents _fc_replay_callback is
> missing an asterisk, causing this warning:
> 
>   ../include/linux/jbd2.h:1271: warning: Function parameter or member 
> 'j_fc_replay_callback' not described in 'journal_s'
> 
> When building the docs.
> 
> Fixes: 609f928af48f ("jbd2: fast commit recovery path")
> Signed-off-by: Mauro Carvalho Chehab 

Thanks, I'm accomulating some bug fix patches to push to Linus, so
I'll grab this for the ext4 git tree.

- Ted

Re: PROBLEM: Reiser4 hard lockup

2020-10-27 Thread Theodore Y. Ts'o

On Tue, Oct 27, 2020 at 01:53:31AM +0100, Edward Shishkin wrote:
> > > reiser4progs 1.1.x Software Framework Release Number (SFRN) 4.0.1 file
> > > system utilities should not be used to check/fix media formatted 'a
> > > priori' in SFRN 4.0.2 and vice-versa.
> > 
> > Honestly, this is the first time I've heard about a Linux FS having
> > versioning other than a major one
> 
> This is because, unlike other Linux file systems, reiser4 is a
> framework.
> 
> In vanilla kernel having a filesystem-as-framework is discouraged for
> ideological reasons. As they explained: "nobody's interested in
> plugins". A huge monolithic mess without any internal structure -
> welcome :)

I wouldn't call it an ideological problem, but more about wanting to
assure interoperability issues and wanting to reduce confusion on the
part of users, especially if images get moved between systems.  There
is also plenty of way of introducing internal structure and code
cleanliness without going completely undisciplined with respect to
on-disk format extensions.  :-)

Finally, I'll note that ext 2/3/4 does have a rather fine-grained set
of feature flags, with specific rules about what the kernel --- and
e2fsck --- should do if it finds a feature bit it doesn't understand
in the incompat, ro_compat, and compat feature flags set.  This is
especially helpful since we have multiple implementations of ext 2/3/4
out there (in FreeBSD, the GRUB bootloader, GNU HURD, Fuchsia, etc.)
and so using feature bits allow for safe and reliable interoperability
with the user being warned if they can safely only mount the file
system read-only, or not at all, if the file system has some new
feature that their current OS version does not support.  We can also
give appropriate warnings if they are using an insufficiently recent
version of the userspace tools.

Cheers,

- Ted

Re: [RFC] Removing b_end_io

2020-10-25 Thread Theodore Y. Ts'o

On Sun, Oct 25, 2020 at 04:44:38AM +, Matthew Wilcox wrote:
> @@ -3068,6 +3069,12 @@ static int submit_bh_wbc(int op, int op_flags, struct 
> buffer_head *bh,
>   }
>  
>   submit_bio(bio);
> +}
> +
> +static int submit_bh_wbc(int op, int op_flags, struct buffer_head *bh,
> +  enum rw_hint write_hint, struct writeback_control *wbc)
> +{
> + __bh_submit(bh, op | op_flags, write_hint, wbc, end_bio_bh_io_sync);
>   return 0;
>  }
>

I believe this will break use cases where the file system sets
bh->b_end_io and then calls submit_bh(), which then calls
submit_bh_wbc().  That's because with this change, calls to
submit_bh_wbc() --- include submit_bh() --- ignores bh->b_end_io and
results in end_bio_bh_io_sync getting used.

Filesystems that do this includes fs/ntfs, fs/resiserfs.

In this case, that can probably be fixed by changing submit_bh() to
pass in bh->b_end_io, or switching those users to use the new
bh_submit() function to prevent these breakages.

- Ted

Re: [PATCH] ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it

2020-10-23 Thread Theodore Y. Ts'o

On Thu, Oct 22, 2020 at 04:52:52PM -0700, Brendan Higgins wrote:
> So you, me, Luis, David, and a whole bunch of other people have been
> thinking about this problem for a while. What if we just put
> kunitconfig fragments in directories along side the test files they
> enable?
> 
> For example, we could add a file to fs/ext4/kunitconfig which contains:
> 
> CONFIG_EXT4_FS=y
> CONFIG_EXT4_KUNIT_TESTS=y
> 
> We could do something similar in fs/jdb2, etc.
> 
> Obviously some logically separate KUnit tests (different maintainers,
> different Kconfig symbols, etc) reside in the same directory, for
> these we could name the kunitconfig file something like
> lib/list-test.kunitconfig (not a great example because lists are
> always built into Linux), but you get the idea.
> 
> Then like Ted suggested, if you call kunit.py run foo/bar, then
> 
> if bar is a directory, then kunit.py will look for foo/bar/kunitconfig
> 
> if bar is a file ending with .kunitconfig like foo/bar.kunitconfig,
> then it will use that kunitconfig
> 
> if bar is '...' (foo/...) then kunit.py will look for all kunitconfigs
> underneath foo.
> 
> Once all the kunitconfigs have been resolved, they will be merged into
> the .kunitconfig. If they can be successfully merged together, the new
> .kunitconfig will then continue to function as it currently does.

I was thinking along a similar set of lines this morning.  One thing
I'd add in addition to your suggestion to that is to change how
.kunitconfig is interpreted such that

CONFIG_KUNIT=y

is always implied, so it doesn't have to be specified explicitly, and
that if a line like:

fs/ext4

or

mm

etc. occurs, that will cause a include of the Kunitconfig (I'd using a
capitalized version of the filename like Kconfig, so that it's easier
to see in a directory listing) in the named directory.

That way, .kunitconfig is backwards compatible, but it also allows
people to put a one-liner into .kunitconfig to enable the unit tests
for that particular directory.

What do folks think?

Cheers,

- Ted

[GIT PULL] ext4 changes for 5.10

2020-10-22 Thread Theodore Y. Ts'o

The following changes since commit a1b8638ba1320e6684aa98233c15255eb803fac7:

  Linux 5.9-rc7 (2020-09-27 14:38:10 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4_for_linus

for you to fetch changes up to 1322181170bb01bce3c228b82ae3d5c6b793164f:

  ext4: fix invalid inode checksum (2020-10-21 23:22:38 -0400)


The siginificant new ext4 feature this time around is Harshad's new
fast_commit mode.  In addition, thanks to Mauricio for fixing a race
where mmap'ed pages that are being changed in parallel with a
data=journal transaction commit could result in bad checksums in the
failure that could cause journal replays to fail.  Also notable is
Ritesh's buffered write optimization which can result in significant
improvements on parallel write workloads.  (The kernel test robot
reported a 330.6% improvement on fio.write_iops on a 96 core system
using DAX[1].)

Besides that, we have the usual miscellaneous cleanups and bug fixes.

[1] https://lore.kernel.org/r/20200925071217.GO28663@shao2-debian


Chunguang Xu (4):
  ext4: rename journal_dev to s_journal_dev inside ext4_sb_info
  ext4: rename system_blks to s_system_blks inside ext4_sb_info
  ext4: delete invalid comments near mb_buddy_adjust_border
  ext4: make mb_check_counter per group

Constantine Sapuntzakis (1):
  ext4: fix superblock checksum calculation race

Darrick J. Wong (1):
  ext4: limit entries returned when counting fsmap records

Dinghao Liu (1):
  ext4: fix error handling code in add_new_gdb

Eric Biggers (1):
  ext4: fix leaking sysfs kobject after failed mount

Harshad Shirwadkar (9):
  doc: update ext4 and journalling docs to include fast commit feature
  ext4: add fast_commit feature and handling for extended mount options
  ext4 / jbd2: add fast commit initialization
  jbd2: add fast commit machinery
  ext4: main fast-commit commit path
  jbd2: fast commit recovery path
  ext4: fast commit recovery path
  ext4: add a mount opt to forcefully turn fast commits on
  ext4: add fast commit stats in procfs

Hui Su (1):
  jbd2: fix the comment of struct jbd2_journal_handle

Jan Kara (2):
  ext4: discard preallocations before releasing group lock
  ext4: Detect already used quota file early

Jens Axboe (1):
  ext4: flag as supporting buffered async reads

Kaixu Xia (1):
  ext4: use the normal helper to get the actual inode

Luo Meng (1):
  ext4: fix invalid inode checksum

Mauricio Faria de Oliveira (4):
  jbd2: introduce/export functions 
jbd2_journal_submit|finish_inode_data_buffers()
  jbd2, ext4, ocfs2: introduce/use journal callbacks 
j_submit|finish_inode_data_buffers()
  ext4: data=journal: fixes for ext4_page_mkwrite()
  ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()

Nikolay Borisov (1):
  ext4: remove unused argument from ext4_(inc|dec)_count

Petr Malat (1):
  ext4: do not interpret high bytes if 64bit feature is disabled

Randy Dunlap (1):
  ext4: delete duplicated words + other fixes

Ritesh Harjani (3):
  ext4: implement swap_activate aops using iomap
  ext4: optimize file overwrites
  ext4: fix bs < ps issue reported with dioread_nolock mount opt

Tian Tao (1):
  ext4: remove unused including 

Xiao Yang (1):
  ext4: disallow modifying DAX inode flag if inline_data has been set

Ye Bin (1):
  ext4: fix dead loop in ext4_mb_new_blocks

Zhang Qilong (1):
  ext4: add trace exit in exception path.

Zhang Xiaoxu (1):
  ext4: fix bdev write error check failed when mount fs with ro

changfengnan (1):
  jbd2: avoid transaction reuse after reformatting

zhangyi (F) (7):
  ext4: clear buffer verified flag if read meta block from disk
  ext4: introduce new metadata buffer read helpers
  ext4: use common helpers in all places reading metadata buffers
  ext4: use ext4_buffer_uptodate() in __ext4_get_inode_loc()
  ext4: introduce ext4_sb_breadahead_unmovable() to replace 
sb_breadahead_unmovable()
  ext4: use ext4_sb_bread() instead of sb_bread()
  ext4: introduce ext4_sb_bread_unmovable() to replace sb_bread_unmovable()

 Documentation/filesystems/ext4/journal.rst |   66 ++
 Documentation/filesystems/journalling.rst  |   33 +
 fs/ext4/Makefile   |2 +-
 fs/ext4/acl.c  |2 +
 fs/ext4/balloc.c   |   14 +-
 fs/ext4/block_validity.c   |   10 +-
 fs/ext4/dir.c  |4 +-
 fs/ext4/ext4.h |  136 +++-
 fs/ext4/ext4_jbd2.c|2 +-
 fs/ext4/extents.c  |  315 +++-
 fs/ext4/extents_status.c   |   24 +
 fs/ext4/fast_commit.c

Re: [PATCH] ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it

2020-10-21 Thread Theodore Y. Ts'o

On Wed, Oct 21, 2020 at 04:07:15PM -0700, Randy Dunlap wrote:
> > I'm don't particularly care how this gets achieved, but please think
> > about how to make it easy for a kernel developer to run a specific set
> > of subsystem unit tests.  (In fact, being able to do something like
> > "kunit.py run fs/ext4 fs/jbd2" or maybe "kunit.py run fs/..." would be
> > *great*.  No need to fuss with hand editing the .kunitconfig file at
> > all would be **wonderful**.
> 
> I understand the wish for ease of use, but this is still the tail
> wagging the dog.
> 
> The primary documentation for 'select' is
> Documentation/kbuild/kconfig-language.rst, which says:
> 
>   Note:
>   select should be used with care. select will force
>   a symbol to a value without visiting the dependencies.
>   By abusing select you are able to select a symbol FOO even
>   if FOO depends on BAR that is not set.
>   In general use select only for non-visible symbols
>   (no prompts anywhere) and for symbols with no dependencies.
>   That will limit the usefulness but on the other hand avoid
>   the illegal configurations all over.
> 

Well, the KUNIT configs are kinda of a special case, since normally
they don't have a lot of huge number of dependencies, since unit tests
in general are not integration tests.  So ideally, dependencies will
mostly be replaced with mocking functions.  And if there are *real*
dependencies that the Kunit Unit tests need, they can be explicitly
pulled in with selects.

That being said, as I said, I'm not picky about *how* this gets
achieved.  But ease of use is a key part of making people more likely
to run the unit tests.  So another way of solving the problem might be
to put some kind of automated dependency solver into kunit.py, or some
way of manually adding the necessary dependencies in some kind of
Kunitconfig file that are in directories where their are Unit tests,
or maybe some kind of extenstion to the Kconfig file.  My main
requirement is that the only thing that should be necessary for
enabling the ext4 Kunit tests should be adding a single line to the
.kunitconfig file.  It's not fair to make the human developer manually
have to figure out the dependency chains.

As far as I'm concerned, ease of use is important enough to justfy
special casing and/or bending the rules as far as "select" is concered
for Kunit-related CONFIG items.  But if someone else want to suggest a
better approach, I'm all ears.

Cheers,

- Ted

Re: [PATCH] ext: EXT4_KUNIT_TESTS should depend on EXT4_FS instead of selecting it

2020-10-21 Thread Theodore Y. Ts'o

On Wed, Oct 21, 2020 at 02:16:56PM -0700, Randy Dunlap wrote:
> On 10/21/20 2:15 PM, Brendan Higgins wrote:
> > On Tue, Oct 20, 2020 at 12:37 AM Geert Uytterhoeven
> >  wrote:
> >>
> >> EXT4_KUNIT_TESTS selects EXT4_FS, thus enabling an optional feature the
> >> user may not want to enable.  Fix this by making the test depend on
> >> EXT4_FS instead.
> >>
> >> Fixes: 1cbeab1b242d16fd ("ext4: add kunit test for decoding extended 
> >> timestamps")
> >> Signed-off-by: Geert Uytterhoeven 
> > 
> > If I remember correctly, having EXT4_KUNIT_TESTS select EXT4_FS was
> > something that Ted specifically requested, but I don't have any strong
> > feelings on it either way.
> 
> omg, please No. depends on is the right fix here.

So my requirement which led to that particular request is to keep what
needs to be placed in .kunitconfig to a small and reasonable set.

Per Documentation/dev-tools/kunit, we start by:

cd $PATH_TO_LINUX_REPO
cp arch/um/configs/kunit_defconfig .kunitconfig

we're then supposed to add whatever Kunit tests we want to enable, to wit:

CONFIG_EXT4_KUNIT_TESTS=y

so that .kunitconfig would look like this:

CONFIG_KUNIT=y
CONFIG_KUNIT_TEST=y
CONFIG_KUNIT_EXAMPLE_TEST=y
CONFIG_EXT4_KUNIT_TESTS=y

... and then you should be able to run:

./tools/testing/kunit/kunit.py run

... and have the kunit tests run.  I would *not* like to have to put a
huge long list of CONFIG_* dependencies into the .kunitconfig file.

I'm don't particularly care how this gets achieved, but please think
about how to make it easy for a kernel developer to run a specific set
of subsystem unit tests.  (In fact, being able to do something like
"kunit.py run fs/ext4 fs/jbd2" or maybe "kunit.py run fs/..." would be
*great*.  No need to fuss with hand editing the .kunitconfig file at
all would be **wonderful**.

Cheers,

- Ted

Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

2020-10-09 Thread Theodore Y. Ts'o

On Thu, Oct 08, 2020 at 03:22:59PM -0700, Josh Triplett wrote:
> 
> I wasn't trying to make a *new* general principle or policy. I was under
> the impression that this *was* the policy, because it never occurred to
> me that it could be otherwise. It seemed like a natural aspect of the
> kernel/userspace boundary, to the point that the idea of it *not* being
> part of the kernel's stability guarantees didn't cross my mind. 

>From our perspective (and Darrick and I discussed this on this week's
ext4 video conference, so it represents the ext4 and xfs maintainer's
position) is that the file system format is different.  First, the
on-disk format is not an ABI, and it is several orders more complex
than a system call interface.  Second, we make no guarantees about
what the file system created by malicious tools will do.  For example,
XFS developers reject bug reports from file system fuzzers, because
the v5 format has CRC checks, so randomly corrupted file systems won't
crash the kernel.  Yes, this doesn't protect against maliciously
created file systems where the attacker makes sure the checksums are
valid, but only crazy people who think containers are just as secure
as VM's and that unprivileged users should be allowed to make the
kernel mount potentially maliciously created file systems would be
exposing the kernel to such maliciously created images.

> Finally, I think there's also some clarification needed in the role of
> what some of the incompat and ro_compat flags mean. For instance,
> "RO_COMPAT_READONLY" is documented as:
> > - Read-only filesystem image; the kernel will not mount this image
> >   read-write and most tools will refuse to write to the image.
> Is it reasonable to interpret this as "this can never, ever become
> writable", such that no kernel should ever "understand" that flag in
> ro_compat?

Yes.  However...

> I'd assumed so, but this discussion is definitely leading me
> to want to confirm any such assumptions. Is this a flag that e2fsck
> could potentially use to determine that it shouldn't check
> read-write-specific data structures, or should that be a different flag?

Just because it won't be modifiable, shouldn't mean that e2fsck won't
check to make sure that such structures are valid.  "Won't be changed"
and "valid" are two different concepts.  And certainly, today we *do*
check to make sure the bitmaps are valid and don't overlap, and we
can't go back in time to change that.

That being said, on the ext4 weekly video chat, we did discuss other
uses of an incompat feature flag that would allow the allocation
bitmap blocks and inode table block fields in the superblock to be
zero, which would mean that they are unallocated.  This would allow us
to dynamically grow the inode table by adding an extra block group
descriptor.  In fact, I'd probably use this as an opportunity to make
some other changes, such using inodes to store locations of the block
group descriptors, inode tables, and allocation bitmaps at the same
time.  Those details can be discussed later, but the point is that
this is why it's good to discuss format changes from a requirements
perspective, so that if we do need to make an incompat change, we can
kill multiple birds with a single stone.

> It's an arbitrary filesystem hierarchy, including directories, files of
> various sizes (hence using inline_data), and permissions. The problem
> isn't to get data from point A to point B; the problem is (in part) to
> turn a representation of a filesystem into an actual mounted filesystem
> as efficiently as possible, live-serving individual blocks on demand
> rather than generating the whole image in advance.

Ah, so you want to be able to let the other side "look at" the file
system in parallel with it being generated on demand?  The cache
coherency problems would seem to be... huge.  For example, how can you
add a file to directory after the reader has looked at the directory
inode and directory blocks?  Or for that matter, looked at a portion
of the inode table block?  Or are you using 4k inodes so there is only
one inode per block?  What about the fact that we sometimes do
readahead of inode table blocks?

I can think of all sorts of implementation level changes in terms of
caching, readahead behavior, etc., that we might make in the future
that might break you if you are doing some quite as outré are that.
Again, the fact that you're being cagey about what you are doing, and
potentially complaining about changes we might make that would break
you, is ***really*** scaring me now.

Can you go into more details here?  I'm sorry if you're working for
some startup who might want to patent these ideas, but if you want to
guarantee future compatibility, I'm really going to have to insist.

   - Ted

Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

2020-10-07 Thread Theodore Y. Ts'o

On Wed, Oct 07, 2020 at 01:14:24PM -0700, Josh Triplett wrote:
> 
> That sounds like a conversation that would have been a lot more
> interesting and enjoyable if it hadn't started with "can we shoot it in
> the head", and continued with the notion that anything other than
> e2fsprogs making something to be mounted by mount(2) and handled by
> fs/ext4 is being "inflicted", and if the goal didn't still seem to be
> "how do we make it go away so that only e2fsprogs and the kernel ever
> touch ext4". I started this thread because I'd written some userspace
> code, a new version of the kernel made that userspace code stop working,
> so I wanted to report that the moment I'd discovered that, along with a
> potential way to address it with as little disrupton to ext4 as
> possible.

What is really getting my dander up is your attempt to claim that the
on-disk file system format is like the userspace/kernel interface,
where if we break any file system that file system that was
"previously accepted by an older kernel", this is a bug that must be
reverted or otherwise fixed to allow file systems that had previously
worked, to continue to work.  And this is true even if the file system
is ***invalid***.

And the problem with this is that there have been any number of
commits where file systems which were previously invalid, but which
could be caused to trigger a syzbot whine, which was fixed by
tightening up the validity tests in the kernel.  In some cases, I had
to also had to fix up e2fsck to detect the invalid file system which
was generated by the file system fuzzer.  Yes, it's unfortunate that
we didn't have these checks earlier, but a file system has a huge
amount of state.

The principle you've articulated would make it impossible for me to
fix these bugs, unless I can prove that the failure to check a
particular invalid file system corruption could lead to a security
vulnerability.  (Would it be OK for me to make the kernel more strict
and reject an invalid file system if it triggers a WARN_ON, so I get
the syzbot complaint, but it doesn't actually cause a security issue?)

So this conversation would have been a lot more pleasant for *me* if
you hadn't tried to elevate your request to a general principle, where
if someone is deliberately generating an invalid file system, I'm not
allowed to make the kernel more strict to detect said invalidity and
to reject the invalid / corrupted / fuzzed file system.

And note that sometimes the security problem happens when there are
multiple file system corruptions that are chained together.  So
enabling block validity *can* sometimes prevent the fuzzed file system
from proceeding further.  Granted, this is less likely in the case of
a read-only file system, but it really worries me when there are
proprietary programs (maybe your library isn't proprietary, but I note
you haven't send me a link to your git repo, but instead have offered
sending sample file systems) which insist on generating their own file
systems, which might or might not be valid, and then expecting them to
receive first class support as part of an iron-bound contract where
I'm not even allowed to add stronger sanity checks which might reject
said invalid file system in the future.

> The short version is that I needed a library to rapidly turn
> dynamically-obtained data into a set of disk blocks to be served
> on-the-fly as a software-defined disk, and then mounted on the other
> side of that interface by the Linux kernel. Turns out that's *many
> orders of magnitude* faster than any kind of network filesystem like
> NFS. It's slightly similar to a vvfat for ext4. The less blocks it can
> generate and account for and cache, the faster it can run, and
> microseconds matter.

So are you actually trying to dedup data blocks, or are you just
trying to avoid needing to track the block allocation bitmaps?  And
are you just writing a single file, or multiple files?  Do you know
what the maximum size of the file or files will be?  Do you need a
complex directory structure, or just a single root directory?  Can the
file system be sparse?

So for example, you can do something like this, which puts all of the
metadata at beginning of the file system, and then you could write to
contiguous data blocks.  Add the following in mke2fs.conf:

[fs_types]
hugefile = {
features = 
extent,huge_file,bigalloc,flex_bg,uninit_bg,dir_nlink,extra_isize,^resize_inode,sparse_super2
cluster_size = 32768
hash_alg = half_md4
reserved_ratio = 0.0
num_backup_sb = 0
packed_meta_blocks = 1
make_hugefiles = 1
inode_ratio = 4194304
hugefiles_dir = /storage
hugefiles_name = huge-file
hugefiles_digits = 0
hugefiles_size = 0
hugefiles_align = 256M
hugefiles_align_disk = true
num_hugefiles = 1
zero_hugefiles = false
inode_size = 128
}

   hugefiles = {
features =

Re: [PATCH] ext4/xfs: add page refcount helper

2020-10-07 Thread Theodore Y. Ts'o

On Tue, Oct 06, 2020 at 07:40:05PM -0700, Dan Williams wrote:
> On Tue, Oct 6, 2020 at 4:09 PM Ralph Campbell  wrote:
> >
> > There are several places where ZONE_DEVICE struct pages assume a reference
> > count == 1 means the page is idle and free. Instead of open coding this,
> > add a helper function to hide this detail.
> >
> > Signed-off-by: Ralph Campbell 
> > Reviewed-by: Christoph Hellwig 
> > ---
> >
> > I'm resending this as a separate patch since I think it is ready to
> > merge. Originally, this was part of an RFC and is unchanged from v3:
> > https://lore.kernel.org/linux-mm/20201001181715.17416-1-rcampb...@nvidia.com
> >
> > It applies cleanly to linux-5.9.0-rc7-mm1 but doesn't really
> > depend on anything, just simple merge conflicts when applied to
> > other trees.
> > I'll let the various maintainers decide which tree and when to merge.
> > It isn't urgent since it is a clean up patch.
> 
> Thanks Ralph, it looks good to me. Jan, or Ted care to ack? I don't
> have much else pending for dax at the moment as Andrew is carrying my
> dax updates for this cycle. Andrew please take this into -mm if you
> get a chance. Otherwise I'll cycle back to it when some other dax
> updates arrive in my queue.

Acked-by: Theodore Ts'o  # for fs/ext4/inode.c

Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

2020-10-07 Thread Theodore Y. Ts'o

On Wed, Oct 07, 2020 at 01:03:04AM -0700, Josh Triplett wrote:
> > But can we *please* take your custom tool out back and shoot it in the
> > head?
> 
> Nope. As mentioned, this isn't about creating ext4 filesystem images,
> and it isn't even remotely similar to mke2fs.

Can you please tell us what your tool is for, then?  Why are you doing
this?  Why are you inflicting this on us?

And if the answer is nope, I can't, it's propietary, sorry  then I
think we should treat this the same way we treat proprietary kernel
modules.

- Ted

Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

2020-10-06 Thread Theodore Y. Ts'o

On Mon, Oct 05, 2020 at 10:03:06PM -0700, Josh Triplett wrote:
> 
> I'm not trying to create a problem here; I'm trying to address a whole
> family of problems. I was generally under the impression that mounting
> existing root filesystems fell under the scope of the kernel<->userspace
> or kernel<->existing-system boundary, as defined by what the kernel
> accepts and existing userspace has used successfully, and that upgrading
> the kernel should work with existing userspace and systems. If there's
> some other rule that applies for filesystems, I'm not aware of that.
> (I'm also not trying to suggest that every random corner case of what
> the kernel *could* accept needs to be the format definition, but rather,
> cases that correspond to existing userspace.)

I'm not opposed to the kernel side change; it's *this time*.  I'm more
interested in killing off the tool that generated the malformed file
system in the first place.  As I keep pointing out, things aren't
going to go well if "e2fsck -E unshare_blocks" is applied to it.  So
users who use this unofficial tool to create this file system is can
run into at least this corner case, if not others, and that will
result in, as the UI designers like to say, "a poor user experience".

We had a similar issue with Android.  Many years ago, Andy Rubin was
originally quite allergic to the GPL, and had tried to promulgate the
rule, "no GPL in Android Userspace".  This is why bionic is used as
libc, and this resulted in Android engineers (I think before the
Google acquisition, but I'm not 100% sure), creating an unofficial,
"unauthorized" make_ext4fs which was a BSD-licensed version of mke2fs.

Unfortuantely, it created file systems which the kernel would never
complain about, but which, 50% of the time, would result in a file
system which under some circumstances, would get corrupted (even more)
when e2fsck attempted to repair the file system.  So if a user had a
bit flip caused by an eMMC hiccup, e2fsck could end up making things
worse.

Worse, make_ext4fs had over time, grown extra functionality, such as
pre-setting the SELinux xattrs, such that you couldn't just replace it
with mke2fs.  It took *years* to fix the problem, and that's why
contrib/e2fsdroid exists today.  We finally, a few years ago, were
able to retire make_ext4fs and replace it with the combination of
mke2fs and e2fsdroid.

So that's why I really don't like it when there are "unauthorized",
unofficial tools creating file systems out there which we are now
obliged to support.  Even if it's OK as far as the kernel is
concerned, unless you're planning on forking and/or reimplementing all
of e2fsprogs, and doing so correctly, that way is going to cause
headaches for file system developers.

As far as I'm concerned, it's not just about on-disk file system
format, it's also about the official user space tools.  If you create
a file system which the kernel is happy with, but which wasn't created
using the official user space tools, file systems are so full of state
and permutations of how things should be done that the opportunities
for mischief are huge.

And what's especially aggravating is when it's done for petty reasons
--- whether it's trying to sae an extra 0.0003% of storage, or because
some VP was allergic to the GPL, it's stupid stuff.

> I don't *want* to rely on what apparently turned out to be an
> undocumented bug in the kernel's validator. That's why I was trying to
> fix the issue in what seemed like the right way, by detecting the
> situation and turning off the validator. That seemed like it would fully
> address the issue. If it would help, I could also supply a tiny filesystem
> image for regression testing.

I'm OK with working around the problem, and we're lucky that it's this
simple this time around.

But can we *please* take your custom tool out back and shoot it in the
head?  Like make_ext4fs, it's just going to cause more headaches down
the line.

And perhaps we need to make a policy that makes it clear that for file
systems, it's not just about "whatever the kernel happens to accept".
It should also be, "was it generated using an official version of the
userspace tools", at least as a consideration.  Yes, we can try to
make the kernel more strict, and that's a good thing to do, but
inevitably, as we make the kernel more strict, we can potentially
break other unffocial tools out there, and it's going to make it a lot
harder to be able to do backwards compatible format enhancements to
the file system.

- Ted

Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

2020-10-05 Thread Theodore Y. Ts'o

On Mon, Oct 05, 2020 at 07:51:10PM -0700, Darrick J. Wong wrote:
> > > Could /somebody/ please document the ondisk format changes that are
> > > associated with this feature?
> > 
> > I pretty much had to sort it out by looking at a combination of
> > e2fsprogs and the kernel, and a lot of experimentation, until I ended up
> > with something that the kernel was completely happy with without a
> > single complaint.
> > 
> > I'd be happy to write up a summary of the format.
> 
> Seems like a good idea, particularly since you're asking for a format
> change that requires kernel support and the ondisk format documentation
> lives under Documentation/.  That said...

> > If you set up the rest of the metadata consistently with it (for
> > instance, 0 free blocks and 0 free inodes), you'll only get a single
> > complaint, from the e2fsck equivalent of block_validity. See
> > https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=956509 for details on
> > that;
> 
> ...Ted shot down this whole thing six months ago.
> 
> The Debian bug database is /not/ the designated forum to discuss changes
> to the ondisk format; linux-ext4 is.

What Josh is proposing I'm pretty sure would also break "e2fsck -E
unshare_blocks", so that's another reason not to accept this as a
valid format change.

As far as I'm concerned, contrib/e2fsdroid is the canonical definition
of how to create valid file systems with shared_blocks.

- Ted

Re: ext4 regression in v5.9-rc2 from e7bfb5c9bb3d on ro fs with overlapped bitmaps

2020-10-05 Thread Theodore Y. Ts'o

On Mon, Oct 05, 2020 at 10:36:39AM -0700, Darrick J. Wong wrote:
> > Commit e7bfb5c9bb3d ("ext4: handle add_system_zone() failure in
> > ext4_setup_system_zone()") breaks mounting of read-only ext4 filesystems
> > with intentionally overlapping bitmap blocks.
> > 
> > On an always-read-only filesystem explicitly marked with
> > EXT4_FEATURE_RO_COMPAT_SHARED_BLOCKS, prior to that commit, it's safe to
> > point all the block and inode bitmaps to a single block
> 
> LOL, WHAT?
> 
> I didn't know shared blocks applied to fs metadata.  I thought that
> "shared" only applied to file extent maps being able to share physical
> bloctks.

My understanding matches Darrick's.  I was going to track down the
Google engineer who has most recently (as far as I know) enhanced
e2fsprogs's support of the shared block feature (see the commits
returned by "git log --author dvan...@google.com contrib/android") but
he's apparently out of the office today.  Hopefully I'll be able to
track him down and ask about this tomorrow.

> Oookay.  So that's not how you create these shared block ext4s,
> apparently...

Yeah, they are created by the e2fsdroid program.  See sources in
contrib/e2fsdroid.  I took a quick look, and I don't see anything
there which is frobbing with with the bitmaps; but perhaps I'm missing
something, which is why I'd ask David to see if he knows anything
about this.

More to the point, if we are have someone who is trying to dedup or
otherwise frob with bitmaps, I suspect this will break "e2fsck -E
unshare_blocks /dev/XXX", which is a way that you can take a root file
system which is using shared_blocks, and turn it into something that
can actually be mount read/write.  This is something that I believe
was being used by AOSP "debug" or "userdebug" (I'm a bit fuzzy on the
details) so that Android developers couldn't actually modify the root
file system.  (Of course, you have to also disable dm-verity in order
for this to work)

Unfortunately, e2fsdroid is currently not buildable under the standard
Linux compilation environment.  For the reason why, see:

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=928551#75

The first step would be to teach e2fsprogs's configure to check for
libsparse, and to link against it if it's available.  But before we
could enable this by default for Linux distribution, we need to link
against libsparse using dlopen(), since most distro release engineers
would be cranky if mke2fs were to drag in some random Android
libraries that have to be installed as shared libraries in their
installers.  Which is the point of comment #75 in the above bug.

Since the only use of shared_blocks is for Android, since very few
other projects want a completely read-only root file system, and where
dedup is actually significantly helpful, we've never tried to make
this work outside of the Android context.  At least in theory, though,
it might be useful if we could create shared_block file systems using
"mke2fs -O shared_blocks -d /path/to/embedded-root-fs system.img 1G".
Patches gratefully accepted

Cheers,

- Ted

Re: [PATCH] FIX the comment of struct jbd2_journal_handle

2020-10-02 Thread Theodore Y. Ts'o

On Wed, Sep 23, 2020 at 01:12:31AM +0800, Hui Su wrote:
> the struct name was modified long ago, but the comment still
> use struct handle_s.
> 
> Signed-off-by: Hui Su 

Tnanks, applied.  I updated the commit summary to be:

jbd2: fix the comment of struct jbd2_journal_handle

- Ted

Re: [PATCH] ext4: fix leaking sysfs kobject after failed mount

2020-10-02 Thread Theodore Y. Ts'o

On Thu, Sep 24, 2020 at 11:08:59AM +0200, Jan Kara wrote:
> On Tue 22-09-20 09:24:56, Eric Biggers wrote:
> > From: Eric Biggers 
> > 
> > ext4_unregister_sysfs() only deletes the kobject.  The reference to it
> > needs to be put separately, like ext4_put_super() does.
> > 
> > This addresses the syzbot report
> > "memory leak in kobject_set_name_vargs (3)"
> > (https://syzkaller.appspot.com/bug?extid=9f864abad79fae7c17e1).
> > 
> > Reported-by: syzbot+9f864abad79fae7c1...@syzkaller.appspotmail.com
> > Fixes: 72ba74508b28 ("ext4: release sysfs kobject when failing to enable 
> > quotas on mount")
> > Cc: sta...@vger.kernel.org
> > Signed-off-by: Eric Biggers 
> 
> Looks good. You can add:
> 
> Reviewed-by: Jan Kara 

Thanks, applied.

- Ted

Re: [PATCHv3 1/1] ext4: Optimize file overwrites

2020-10-02 Thread Theodore Y. Ts'o

On Fri, Sep 18, 2020 at 10:36:35AM +0530, Ritesh Harjani wrote:
> In case if the file already has underlying blocks/extents allocated
> then we don't need to start a journal txn and can directly return
> the underlying mapping. Currently ext4_iomap_begin() is used by
> both DAX & DIO path. We can check if the write request is an
> overwrite & then directly return the mapping information.
> 
> This could give a significant perf boost for multi-threaded writes
> specially random overwrites.
> On PPC64 VM with simulated pmem(DAX) device, ~10x perf improvement
> could be seen in random writes (overwrite). Also bcoz this optimizes
> away the spinlock contention during jbd2 slab cache allocation
> (jbd2_journal_handle). On x86 VM, ~2x perf improvement was observed.
> 
> Reported-by: Dan Williams 
> Suggested-by: Jan Kara 
> Signed-off-by: Ritesh Harjani 

Thanks, applied.

- Ted

Re: [PATCH] [v2] ext4: Fix error handling code in add_new_gdb

2020-10-02 Thread Theodore Y. Ts'o

On Sat, Aug 29, 2020 at 10:54:02AM +0800, Dinghao Liu wrote:
> When ext4_journal_get_write_access() fails, we should
> terminate the execution flow and release n_group_desc,
> iloc.bh, dind and gdb_bh.
> 
> Signed-off-by: Dinghao Liu 

Thanks, applied.

- Ted

Re: [PATCHv2 1/3] ext4: Refactor ext4_overwrite_io() to take ext4_map_blocks as argument

2020-10-02 Thread Theodore Y. Ts'o

On Sat, Aug 22, 2020 at 05:04:35PM +0530, Ritesh Harjani wrote:
> Refactor ext4_overwrite_io() to take struct ext4_map_blocks
> as it's function argument with m_lblk and m_len filled
> from caller
> 
> There should be no functionality change in this patch.
> 
> Signed-off-by: Ritesh Harjani 
> ---
>  fs/ext4/file.c | 22 +++---
>  1 file changed, 11 insertions(+), 11 deletions(-)
> 
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 2a01e31a032c..84f73ed91af2 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -188,26 +188,22 @@ ext4_extending_io(struct inode *inode, loff_t offset, 
> size_t len)
>  }
>  
>  /* Is IO overwriting allocated and initialized blocks? */
> -static bool ext4_overwrite_io(struct inode *inode, loff_t pos, loff_t len)
> +static bool ext4_overwrite_io(struct inode *inode, struct ext4_map_blocks 
> *map)
>  {
> - struct ext4_map_blocks map;
>   unsigned int blkbits = inode->i_blkbits;
> - int err, blklen;ts
> + loff_t end = (map->m_lblk + map->m_len) << blkbits;

As Dan Carpenter has pointed out, we need to cast map->m_lblk to
loff_t, since m_lblk is 32 bits, and when this get shifted left by
blkbits, we could end up losing bits.

> - if (pos + len > i_size_read(inode))
> + if (end > i_size_read(inode))
>   return false;

This transformation is not functionally identical.

The problem is that pos is not necessarily a multiple of the file
system blocksize.From below, 

> + map.m_lblk = offset >> inode->i_blkbits;
> + map.m_len = EXT4_MAX_BLOCKS(count, offset, inode->i_blkbits);

So what previously was the starting offset of the overwrite, is now
offset shifted right by blkbits, and then shifted left back by blkbits.

So unless I'm missing something, this looks not quite right?

- Ted

Re: [PATCH] ext4: flag as supporting buffered async reads

2020-10-02 Thread Theodore Y. Ts'o

On Mon, Aug 03, 2020 at 05:02:11PM -0600, Jens Axboe wrote:
> ext4 uses generic_file_read_iter(), which already supports this.
> 
> Cc: Theodore Ts'o 
> Signed-off-by: Jens Axboe 

Applied, thanks.   (And apologies for the delay.)

- Ted

Re: [DISCUSSION PATCH 00/41] random: possible ways towards NIST SP800-90B compliance

2020-10-02 Thread Theodore Y. Ts'o

On Fri, Oct 02, 2020 at 03:39:35PM +, Van Leeuwen, Pascal wrote:
> > Then your company can not contribute in Linux kernel development, as
> > this is obviously not allowed by such a footer.
> >
> Interesting, this has never been raised as a problem until today ...
> Going back through my mail archive, it looks like they started automatically 
> adding that some
> 3 months ago. Not that they informed anyone about that, it just silently 
> happened.

So use a private e-mail address (e.g., at fastmail.fm if you don't
want to run your mail server) and then tunnel out SMTP requests using
ssh.  It's not hard.  :-)

I've worked a multiple $BIG_COMPANY's, and I've been doing this for
decades.  It's also helpful when I need to send e-mails from
conference networks from my laptop

- Ted

Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

2020-09-27 Thread Theodore Y. Ts'o

On Fri, Sep 25, 2020 at 02:18:48PM -0700, Shakeel Butt wrote:
> 
> Yes, you are right. Let's first get this patch tested and after
> confirmation we can update the commit message.

Thanks Shakeel!  I've tested your patch, as well as reverting the
three commits that Linus had suggested, and both seem to address the
problem for me as well.  I did see a small number of failures
immediately as soon as the VM has booted, when testing with the
"revert the three commits" but this appears to be a different failure,
which I had been seeing periodically during the bisect as well which
was no doubt introducing noise in my testing:

[   28.545018] watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [swapper/1:0]
[   28.545018] Modules linked in:
[   28.545018] irq event stamp: 4517759
[   28.545018] hardirqs last  enabled at (4517758): [] 
asm_common_interrupt+0x1e/0x40
[   28.545018] hardirqs last disabled at (4517759): [] 
sysvec_apic_timer_interrupt+0xb/0x90
[   28.545018] softirqs last  enabled at (10634): [] 
irq_enter_rcu+0x6d/0x70
[   28.545018] softirqs last disabled at (10635): [] 
asm_call_on_stack+0x12/0x20
[   28.545018] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 
5.9.0-rc6-xfstests-7-g3f3cb48a7d90 #1916
[   28.545018] Hardware name: Google Google Compute Engine/Google Compute 
Engine, BIOS Google 01/01/2011
[   28.545018] RIP: 0010:__do_softirq+0xa3/0x435
[   28.545018] Code: 00 83 80 ac 07 00 00 01 48 89 44 24 08 c7 44 24 1c 0a 00 
00 00 65 66 c7 05 a8 ae 9e 55 00 00 e8 d3 92 3b ff fb b8 ff ff ff ff <48> c7 c3 
40 51 00 ab 41 0f bc c7 89 c6 83 c6 01 89 74 24 04 75 6a
[   28.545018] RSP: :b89f000e0f98 EFLAGS: 0202
[   28.545018] RAX:  RBX:  RCX: 298a
[   28.545018] RDX:  RSI:  RDI: aa80009d
[   28.545018] RBP: b89f000abda0 R08: 0001 R09: 
[   28.545018] R10: 0001 R11: 0046 R12: 0001
[   28.545018] R13:  R14:  R15: 0080
[   28.545018] FS:  () GS:998e5920() 
knlGS:
[   28.545018] CS:  0010 DS:  ES:  CR0: 80050033
[   28.545018] CR2:  CR3: 00023e012001 CR4: 001706e0
[   28.545018] DR0:  DR1:  DR2: 
[   28.545018] DR3:  DR6: fffe0ff0 DR7: 0400
[   28.545018] Call Trace:
[   28.545018]  
[   28.545018]  asm_call_on_stack+0x12/0x20
[   28.545018]  
[   28.545018]  do_softirq_own_stack+0x4e/0x60
[   28.545018]  irq_exit_rcu+0x9f/0xe0
[   28.545018]  sysvec_call_function_single+0x43/0x90
[   28.545018]  asm_sysvec_call_function_single+0x12/0x20
[   28.545018] RIP: 0010:acpi_idle_do_entry+0x54/0x70
[   28.545018] Code: ed c3 e9 cf fe ff ff 65 48 8b 04 25 00 6e 01 00 48 8b 00 
a8 08 75 ea e8 ba c0 5b ff e9 07 00 00 00 0f 00 2d f8 3d 4e 00 fb f4 <9c> 58 fa 
f6 c4 02 74 cf e9 5f c2 5b ff cc cc cc cc cc cc cc cc cc
[   28.545018] RSP: :b89f000abe88 EFLAGS: 0202
[   28.545018] RAX: 293b RBX: 998e5564 RCX: 1a12
[   28.545018] RDX:  RSI:  RDI: aa5fd2b6
[   28.545018] RBP: ab163760 R08: 0001 R09: 000e003c
[   28.545018] R10: 998e582e2340 R11: 0046 R12: 0001
[   28.545018] R13: 0001 R14: ab1637e0 R15: 
[   28.545018]  ? acpi_idle_do_entry+0x46/0x70
[   28.545018]  ? acpi_idle_do_entry+0x46/0x70
[   28.545018]  acpi_idle_enter+0x7d/0xb0
[   28.545018]  cpuidle_enter_state+0x84/0x2c0
[   28.545018]  cpuidle_enter+0x29/0x40
[   28.545018]  cpuidle_idle_call+0x111/0x180
[   28.545018]  do_idle+0x7b/0xd0
[   28.545018]  cpu_startup_entry+0x19/0x20
[   28.545018]  secondary_startup_64+0xb6/0xc0

I think this was an issue relating to acpi_idle that others have
reported, but I thought this was fixed before -rc6 was released?  In
any case, this is post -rc6, so apparently there is something else
going on here, and this is probably unrelated to the regression which
Shakeel's patch is addressing.

- Ted

Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

2020-09-24 Thread Theodore Y. Ts'o

On Thu, Sep 24, 2020 at 10:33:45AM -0400, Theodore Y. Ts'o wrote:
> HOWEVER, thanks to a hint from a colleague at $WORK, and realizing
> that one of the stack traces had virtio balloon in the trace, I
> realized that when I switched the GCE VM type from e1-standard-2 to
> n1-standard-2 (where e1 VM's are cheaper because they use
> virtio-balloon to better manage host OS memory utilization), problem
> has become, much, *much* rarer (and possibly has gone away, although
> I'm going to want to run a lot more tests before I say that
> conclusively) on my test setup.  At the very least, using an n1 VM
> (which doesn't have virtio-balloon enabled in the hypervisor) is
> enough to unblock ext4 development.

 and I spoke too soon.  A number of runs using -rc6 are now
failing even with the n1-standard-2 VM, so virtio-ballon may not be an
indicator.

This is why debugging this is frustrating; it is very much a heisenbug
--- although 5.8 seems to work completely reliably, as does commits
before 37f4a24c2469.  Anything after that point will show random
failures.  :-(

- Ted

Re: [PATCH] ext4: Implement swap_activate aops using iomap

2020-09-24 Thread Theodore Y. Ts'o

On Fri, Sep 04, 2020 at 02:46:53PM +0530, Ritesh Harjani wrote:
> After moving ext4's bmap to iomap interface, swapon functionality
> on files created using fallocate (which creates unwritten extents) are
> failing. This is since iomap_bmap interface returns 0 for unwritten
> extents and thus generic_swapfile_activate considers this as holes
> and hence bail out with below kernel msg :-
> 
> [340.915835] swapon: swapfile has holes
> 
> To fix this we need to implement ->swap_activate aops in ext4
> which will use ext4_iomap_report_ops. Since we only need to return
> the list of extents so ext4_iomap_report_ops should be enough.
> 
> Reported-by: Yuxuan Shui 
> Fixes: ac58e4fb03f ("ext4: move ext4 bmap to use iomap infrastructure")
> Signed-off-by: Ritesh Harjani 

Thanks, applied.

- Ted

Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

2020-09-24 Thread Theodore Y. Ts'o

On Thu, Sep 24, 2020 at 08:59:01AM +0800, Ming Lei wrote:
> 
> The list corruption issue can be reproduced on kvm/qumu guest too when
> running xfstests(ext4) generic/038.
> 
> However, the issue may become not reproduced when adding or removing memory
> debug options, such as adding KASAN.

Can you reliably reprodue this crash?  And if so, with what config and
what kernel version.

One of the reasons why I had gone silent on this bug is that I've been
trying many, many configurations and configurations which reliably
reproduced on say, -rc4 would not reproduce on -rc5, and then I would
get a completely different set of results on -rc6.  So I've been
trying to run a lot of different experiments to try to understand what
might be going on, since it seems pretty clear this must be a very
timing-sensitive failure.

I also found that the re-occrance went down significantly if I enabled
KASAN, and while it didn't go away, I wasn't able to get a KASAN
failure to trigger, either.  Turning off CONFIG_PROVE_LOCKING and a
*lot* of other debugging configs made the problem vanish in -rc4, but
that trick didn't work with -rc5 or -rc6.

Each time I discovered one of these things, I was about to post to the
e-mail thread, only to have a confirmation test run on a different
kernel version make the problem go away.  In particular, your revert
helped with -rc4 and -rc6 IIRC, but it didn't help in -rc5.

HOWEVER, thanks to a hint from a colleague at $WORK, and realizing
that one of the stack traces had virtio balloon in the trace, I
realized that when I switched the GCE VM type from e1-standard-2 to
n1-standard-2 (where e1 VM's are cheaper because they use
virtio-balloon to better manage host OS memory utilization), problem
has become, much, *much* rarer (and possibly has gone away, although
I'm going to want to run a lot more tests before I say that
conclusively) on my test setup.  At the very least, using an n1 VM
(which doesn't have virtio-balloon enabled in the hypervisor) is
enough to unblock ext4 development.

Any chance your kvm/qemu configuration might have been using
virtio-ballon?  Because other ext4 developers who have been using
kvm-xftests have not had any problems

> When I enable PAGE_POISONING, double free on kmalloc(192) is captured:
> 
> [ 1198.317139] slab: double free detected in cache 'kmalloc-192', objp 
> 89ada7584300^M
> [ 1198.326651] [ cut here ]^M
> [ 1198.327969] kernel BUG at mm/slab.c:2535!^M
> [ 1198.329129] invalid opcode:  [#1] SMP PTI^M
> [ 1198.333776] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 
> 5.9.0-rc4_quiesce_srcu-xfstests #102^M
> [ 1198.336085] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
> rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014^M
> [ 1198.339826] RIP: 0010:free_block.cold.92+0x13/0x15^M
> [ 1198.341472] Code: 8d 44 05 f0 eb d0 48 63 83 e0 00 00 00 48 8d 54 05 f8 e9 
> 4b 81 ff ff 48 8b 73 58 48 89 ea 48 c7 c7 98 e7 4a 9c e8 20 c3 eb ff <0f> 0b 
> 48 8b 73 58 48 c7 c2 20 e8 4a 9c 48 c7 c7 70 32 22 9c e8 19^M
> [ 1198.347331] RSP: 0018:982e40710be8 EFLAGS: 00010046^M
> [ 1198.349091] RAX: 0048 RBX: 89adb6441400 RCX: 
> ^M
> [ 1198.351839] RDX:  RSI: 89adbaa97800 RDI: 
> 89adbaa97800^M
> [ 1198.354572] RBP: 89ada7584300 R08: 0417 R09: 
> 0057^M
> [ 1198.357150] R10: 0001 R11: 982e40710aa5 R12: 
> 89adbaaae598^M
> [ 1198.359067] R13: e7bc819d6108 R14: e7bc819d6100 R15: 
> 89adb6442280^M
> [ 1198.360975] FS:  () GS:89adbaa8() 
> knlGS:^M
> [ 1198.363202] CS:  0010 DS:  ES:  CR0: 80050033^M
> [ 1198.365986] CR2: 55f6a3811318 CR3: 00017adca005 CR4: 
> 00770ee0^M
> [ 1198.368679] DR0:  DR1:  DR2: 
> ^M
> [ 1198.371386] DR3:  DR6: fffe0ff0 DR7: 
> 0400^M
> [ 1198.374203] PKRU: 5554^M
> [ 1198.375174] Call Trace:^M
> [ 1198.376165]  ^M
> [ 1198.376908]  ___cache_free+0x56d/0x770^M
> [ 1198.378355]  ? kmem_freepages+0xa0/0xf0^M
> [ 1198.379814]  kfree+0x91/0x120^M
> [ 1198.382121]  kmem_freepages+0xa0/0xf0^M
> [ 1198.383474]  slab_destroy+0x9f/0x120^M
> [ 1198.384779]  slabs_destroy+0x6d/0x90^M
> [ 1198.386110]  ___cache_free+0x632/0x770^M
> [ 1198.387547]  ? kmem_freepages+0xa0/0xf0^M
> [ 1198.389016]  kfree+0x91/0x120^M
> [ 1198.390160]  kmem_freepages+0xa0/0xf0^M
> [ 1198.391551]  slab_destroy+0x9f/0x120^M
> [ 1198.392964]  slabs_destroy+0x6d/0x90^M
> [ 1198.394439]  ___cache_free+0x632/0x770^M
> [ 1198.395896]  kmem_cache_free.part.75+0x19/0x70^M
> [ 1198.397791]  rcu_core+0x1eb/0x6b0^M
> [ 1198.399829]  ? ktime_get+0x37/0xa0^M
> [ 1198.401343]  __do_softirq+0xdf/0x2c5^M
> [ 1198.403010]  asm_call_on_stack+0x12/0x20^M
> [ 1198.404847]  ^M
> [ 1198.405799]  do_softirq_own_stack+0x39/0x50^M
> [ 1198.407621]  irq_exit_rcu+0x97/0xa0^M
> [

Re: [PATCH] random: initialize ChaCha20 constants with correct endianness

2020-09-18 Thread Theodore Y. Ts'o

On Tue, Sep 15, 2020 at 09:50:13PM -0700, Eric Biggers wrote:
> From: Eric Biggers 
> 
> On big endian CPUs, the ChaCha20-based CRNG is using the wrong
> endianness for the ChaCha20 constants.
> 
> This doesn't matter cryptographically, but technically it means it's not
> ChaCha20 anymore.  Fix it to always use the standard constants.

I'll note that we're not technically ChaCha20 in terms of how we
handle the IV.  ChaCha20 is defined as having a 96 bit IV and a 32-bit
counter.  The counter is "usually initialized to be zero or one" (per
RFC 7539) and the counter is defined to be Little Endian.

We're currently not bothering to deal with Endian conversions with the
counter, and we're using a 64-bit counter, instead of a 32-bit
counter.  We also twiddle 32-bits of the state (crng->state[14]) by
XOR'ing it with RDRAND if available at each round, which is also a
spec violation.

WE also initialize the counter to be a random value, using the
input_pool or the primary crng state (if we are initializing the
secondary state), but given that the specification says _usually_ zero
or one, that's not an out-and-out spec violation.

As far as the other deviations / "spec violations" from ChaCha-20 are
concerned...  I'm "sorry not sorry".  :-)

I have no objections to changing things so that the first 4 words of
the crng state are more ChaCha-20-like, on the theory that most of the
cryptoanlysis work (both positive and negative) have been done with
the little-endian version of "expand 32-byte k".  I don't think it
really makes a difference, either positively or negatively.  But
technically we'd *still* not be using ChaCha20.  We could say that
we're using the ChaCha20 block function, regardless.

Cheers,

- Ted

Re: [PATCH AUTOSEL 4.19 059/206] ext4: make dioread_nolock the default

2020-09-18 Thread Theodore Y. Ts'o

On Thu, Sep 17, 2020 at 07:58:59PM -0700, Eric Biggers wrote:
> On Thu, Sep 17, 2020 at 10:05:35PM -0400, Sasha Levin wrote:
> > From: Theodore Ts'o 
> > 
> > [ Upstream commit 244adf6426ee31a83f397b700d964cff12a247d3 ]
> > 
> > This fixes the direct I/O versus writeback race which can reveal stale
> > data, and it improves the tail latency of commits on slow devices.
> > 
> > Link: https://lore.kernel.org/r/20200125022254.1101588-1-ty...@mit.edu
> > Signed-off-by: Theodore Ts'o 
> > Signed-off-by: Sasha Levin I
> 
> Any particular reason to be backporting this?  I thought I saw some fixes for
> dioread_nolock go by, after it was made the default.  Are you getting all of
> those fixes too?

Agreed, making dioread_nolock the default has enough issues that it's
not something that I'd suggest backporting at this point.  It's a
fundamental behavioral change that it's not something we should change
in a stable kernel.

- Ted

Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

2020-09-17 Thread Theodore Y. Ts'o

On Thu, Sep 17, 2020 at 10:20:51AM +0800, Ming Lei wrote:
> 
> Obviously there is other more serious issue, since 568f27006577 is
> completely reverted in your test, and you still see list corruption
> issue.
> 
> So I'd suggest to find the big issue first. Once it is fixed, maybe
> everything becomes fine.
> ...
> Looks it is more like a memory corruption issue, is there any helpful log
> dumped when running kernel with kasan?

Last night, I ran six VM's using -rc4 with and without KASAN; without
Kasan, half of them hung.  With KASAN enabled, all of the test VM's
ran to completion.

This strongly suggests whatever the problem is, it's timing related.
I'll run a larger set of test runs to see if this pattern is confirmed
today.

> BTW, I have kvm/qumu auto test which runs blktest/xfstest over 
> virtio-blk/virito-scsi/loop/nvme
> with xfs/ext4 every two days, and not see such failure recently, but the 
> kernel config is based
> rhel8's config.

Here is the configs I'm using, with and without KASAN.  (With KASAN is
enabled is sent as a diff to avoid running into LKML's e-mail size
restrictrions.)

 - Ted
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 5.9.0-rc4 Kernel Configuration
#
CONFIG_CC_VERSION_TEXT="gcc (Debian 10.2.0-7) 10.2.0"
CONFIG_CC_IS_GCC=y
CONFIG_GCC_VERSION=100200
CONFIG_LD_VERSION=23500
CONFIG_CLANG_VERSION=0
CONFIG_CC_CAN_LINK=y
CONFIG_CC_CAN_LINK_STATIC=y
CONFIG_CC_HAS_ASM_GOTO=y
CONFIG_CC_HAS_ASM_INLINE=y
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_TABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION="-xfstests"
CONFIG_LOCALVERSION_AUTO=y
CONFIG_BUILD_SALT=""
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_HAVE_KERNEL_ZSTD=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
# CONFIG_KERNEL_ZSTD is not set
CONFIG_DEFAULT_INIT=""
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_WATCH_QUEUE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_USELIB=y
# CONFIG_AUDIT is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_HARDIRQS_SW_RESEND=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
# end of IRQ subsystem

CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_INIT=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_HAVE_POSIX_CPU_TIMERS_TASK_WORK=y
CONFIG_POSIX_CPU_TIMERS_TASK_WORK=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
# end of Timers subsystem

CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
CONFIG_PREEMPT_COUNT=y

#
# CPU/Task time and stats accounting
#
CONFIG_TICK_CPU_ACCOUNTING=y
# CONFIG_VIRT_CPU_ACCOUNTING_GEN is not set
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_HAVE_SCHED_AVG_IRQ=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_PSI is not set
# end of CPU/Task time and stats accounting

CONFIG_CPU_ISOLATION=y

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
CONFIG_TASKS_RCU_GENERIC=y
CONFIG_TASKS_RUDE_RCU=y
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
# end of RCU Subsystem

CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
# CONFIG_IKHEADERS is not set
CONFIG_LOG_BUF_SHIFT=17
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y

#
# Scheduler features
#
# CONFIG_UCLAMP_TASK is not set
# end of Scheduler features

CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_CC_HAS_INT128=y
CONFIG_ARCH_SUPPORTS_INT128=y
# CONFIG_NUMA_BALANCING is not set
CONFIG_CGROUPS=y
CONFIG_PAGE_COUNTER=y
CONFIG_MEMCG=y
CONFIG_MEMCG_SWAP=y
CONFIG_MEMCG_KMEM=y
CONFIG_BLK_CGROUP=y
CONFIG_CGROUP_WRITEBACK=y
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_CFS_BANDWIDTH is not set
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_CGROUP_PIDS=y
CONFIG_CGROUP_RDMA=y
CONFIG_CGROUP_FREEZER=y
CONFIG_CPUSETS=y

Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

2020-09-16 Thread Theodore Y. Ts'o

On Wed, Sep 16, 2020 at 07:09:41AM +0800, Ming Lei wrote:
> > The problem is it's a bit tricky to revert 568f27006577, since there
> > is a merge conflict in blk_kick_flush().  I attempted to do the bisect
> > manually here, but it's clearly not right since the kernel is not
> > booting after the revert:
> > 
> > https://github.com/tytso/ext4/commit/1e67516382a33da2c9d483b860ac4ec2ad390870
> > 
> > branch:
> > 
> > https://github.com/tytso/ext4/tree/manual-revert-of-568f27006577
> > 
> > Can you send me a patch which correctly reverts 568f27006577?  I can
> > try it against -rc1 .. -rc4, whichever is most convenient.
> 
> Please test the following revert patch against -rc4.

Unfortunately the results of the revert is... wierd.

With -rc4, *all* of the VM's are failing --- reliably.  With rc4 with
the revert, *some* of the VM's are able to complete the tests, but
over half are still locking up or failing with some kind of oops.  So
that seems to imply that there is some kind of timing issue going on,
or maybe there are multiple bugs in play?

So let's review the bidding.   We're going to review four commits:

7bf137298cb7: (Parent of 568f27006577)  Completely clean, all VM's complete the 
tests

568f27006577: Fails reliably.  In 9 of the 11 VM's there is nothing on
the console; the I/O is just stopped.  If I've been able to
get to the VM before it gets killed from the timeout, ssh
works, but any attempt do any I/O will hang, which presumably
explains why the tests are hanging.  In the other two VM's
there are a hung task timeouts, with stack traces that look
like this...

v5.9-rc4: More than half of the VM's are failing --- but at least some are 
succeeding,
which is more than can be said for 568f27006577.  There is a
*variety* of different sort of failures.  So the fact that
we're not seeing the silent hangs in -rc4 is... interesting

v5.9-rc4 with the revert of 568f27006577: we're seeing a similar
number of VM failures, but the failure signature is different.
The most common failure is...

(More details below, with the stack traces.)

I really don't know what to make of this.  It looks like there's
something going on in the block layer, based the fact that
568f27006577 fails reliably, but its predecssor is completely clean.
But then things have changed significantly by the time we get to -rc4.
I'll do a more in-depth analysis of -rc1 to see if the failure
patterns are more similar to 568f27006577 than -rc4.  But hopefully
you can see something that I'm missing?

Thanks,

- Ted

---

7bf137298cb7: (Parent of 568f27006577)  Completely clean, all VM's complete the 
tests

---

568f27006577: Fails reliably.  In 9 of the 11 VM's there is nothing on
the console; the I/O is just stopped.  If I've been able to
get to the VM before it gets killed from the timeout, ssh
works, but any attempt do any I/O will hang, which presumably
explains why the tests are hanging.  In the other two VM's
there are a hung task timeouts, with stack traces that look
like this:

[14375.634282] INFO: task jbd2/sda1-8:116 blocked for more than 122 
seconds.
[14375.641679]  Not tainted 5.8.0-rc2-xfstests-30545-g568f27006577 
#6
[14375.648517] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[14375.656523] jbd2/sda1-8 D0   116  2 0x4000
[14375.656530] Call Trace:
[14375.656548]  __schedule+0x2cc/0x6e0
[14375.656695]  ? sched_clock_cpu+0xc/0xb0
[14375.656699]  schedule+0x55/0xd0
[14375.656702]  io_schedule+0x12/0x40
[14375.656708]  blk_mq_get_tag+0x11e/0x280
[14375.656715]  ? __wake_up_common_lock+0xc0/0xc0
[14375.656719]  __blk_mq_alloc_request+0xb6/0x100
[14375.656722]  blk_mq_submit_bio+0x13f/0x7d0
[14375.656727]  ? blk_queue_enter+0x15c/0x510
[14375.656731]  submit_bio_noacct+0x48d/0x500
[14375.656737]  ? kvm_sched_clock_read+0x14/0x30
[14375.656740]  ? submit_bio+0x42/0x150
[14375.656744]  submit_bio+0x42/0x150
[14375.656748]  ? guard_bio_eod+0x90/0x140
[14375.656754]  submit_bh_wbc+0x16d/0x190
[14375.656761]  jbd2_journal_commit_transaction+0x70d/0x1f23
[14375.656767]  ? kjournald2+0x128/0x3b0
[14375.656771]  kjournald2+0x128/0x3b0
[14375.656777]  ? trace_hardirqs_on+0x1c/0xf0
[14375.656781]  ? __wake_up_common_lock+0xc0/0xc0
[14375.656785]  ? __jbd2_debug+0x50/0x50
[14375.656788]  kthread+0x136/0x150
[14375.656792]  ? __kthread_queue_delayed_work+0x90/0x90
[14375.656796]  ret_from_fork+0x22/0x30

---

v5.9-rc4: More than half of the VM's are failing --- but at least some are 
succeeding,
which

Re: REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

2020-09-15 Thread Theodore Y. Ts'o

On Tue, Sep 15, 2020 at 03:33:03PM +0800, Ming Lei wrote:
> Hi Theodore,
> 
> On Tue, Sep 15, 2020 at 12:45:19AM -0400, Theodore Y. Ts'o wrote:
> > On Thu, Sep 03, 2020 at 11:55:28PM -0400, Theodore Y. Ts'o wrote:
> > > Worse, right now, -rc1 and -rc2 is causing random crashes in my
> > > gce-xfstests framework.  Sometimes it happens before we've run even a
> > > single xfstests; sometimes it happens after we have successfully
> > > completed all of the tests, and we're doing a shutdown of the VM under
> > > test.  Other times it happens in the middle of a test run.  Given that
> > > I'm seeing this at -rc1, which is before my late ext4 pull request to
> > > Linus, it's probably not an ext4 related bug.  But it also means that
> > > I'm partially blind in terms of my kernel testing at the moment.  So I
> > > can't even tell Linus that I've run lots of tests and I'm 100%
> > > confident your one-line change is 100% safe.
> > 
> > I was finally able to bisect it down to the commit:
> > 
> > 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag
> 
> 37f4a24c2469 has been reverted in:
> 
>   4e2f62e566b5 Revert "blk-mq: put driver tag when this request is 
> completed"
> 
> And later the patch is committed as the following after being fixed:
> 
>   568f27006577 blk-mq: centralise related handling into 
> blk_mq_get_driver_tag
> 
> So can you reproduce the issue by running kernel of commit 568f27006577?

Yes.  And things work fine if I try 4e2f62e566b5.

> If yes, can the issue be fixed by reverting 568f27006577?

The problem is it's a bit tricky to revert 568f27006577, since there
is a merge conflict in blk_kick_flush().  I attempted to do the bisect
manually here, but it's clearly not right since the kernel is not
booting after the revert:

https://github.com/tytso/ext4/commit/1e67516382a33da2c9d483b860ac4ec2ad390870

branch:

https://github.com/tytso/ext4/tree/manual-revert-of-568f27006577

Can you send me a patch which correctly reverts 568f27006577?  I can
try it against -rc1 .. -rc4, whichever is most convenient.

> Can you share the exact mount command line for setup the environment?
> and the exact xfstest item?

It's a variety of mount command lines, since I'm using gce-xfstests[1][2]
using a variety of file system scenarios --- but the basic one, which
is ext4 using the default 4k block size is failing (they all are failing).

[1] https://thunk.org/gce-xfstests
[2] 
https://github.com/tytso/xfstests-bld/blob/master/Documentation/gce-xfstests.md

It's also not one consistent xfstests which is failing, but it does
tend to be tests which are loading up the storage stack with a lot of
small random read/writes, especially involving metadata blocks/writes.
(For example, tests which run fsstress.)

Since this reliably triggers for me, and other people running
kvm-xfstests or are running xfstests on their own test environments
aren't seeing it, I'm assuming it must be some kind of interesting
interaction between virtio-scsi, perhaps with how Google Persistent
Disk is behaving (maybe timing related?  who knows?).  Darrick Wong
did say he saw something like it once using Oracle's Cloud
infrastructure, but as far as I know it hasn't reproduced since.  On
Google Compute Engine VM's, it reproduces *extremely* reliably.

I expect that if you were to set up gce-xfstests, get a free GCE
account with the initial $300 free credits, you could run
"gce-xfstests -c ext4/4k -g auto" and it would reproduce within an
hour or so.  (So under a dollar's worth of VM credits, so long as you
notice that it's hung and shut down the VM after gathering debugging
data.)

The instructions are at [2], and the image xfstests-202008311554 in
the xfstests-cloud project is a public copy of the VM test appliance I
was using.

% gcloud compute images describe --project xfstests-cloud xfstests-202008311554
archiveSizeBytes: '1720022528'
creationTimestamp: '2020-09-15T15:09:30.544-07:00'
description: Linux Kernel File System Test Appliance
diskSizeGb: '10'
family: xfstests
guestOsFeatures:
- type: VIRTIO_SCSI_MULTIQUEUE
- type: UEFI_COMPATIBLE
id: '1558420969906537845'
kind: compute#image
labelFingerprint: V-2Qgcxt2uw=
labels:
  blktests: g8a75bed
  e2fsprogs: v1_45_6
  fio: fio-3_22
  fsverity: v1_2
  ima-evm-utils: v1_3_1
  nvme-cli: v1_12
  quota: g13bb8c2
  util-linux: v2_36
  xfsprogs: v5_8_0-rc1
  xfstests: linux-v3_8-2838-geb439bf2
  xfstests-bld: gb5085ab
licenseCodes:
- '5543610867827062957'
licenses:
- 
https://www.googleapis.com/compute/v1/projects/debian-cloud/global/licenses/debian-10-buster
name: xfstests-202008311554
selfLink: 
https://www.googleapis.com/compute/v1/projects/xfstests-cloud/global/images/xfstests-202008311554
sourceDisk: 
https://www.googleapis.com/compute/v1/projects/xfstests-cloud/zones

REGRESSION: 37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

2020-09-14 Thread Theodore Y. Ts'o

On Thu, Sep 03, 2020 at 11:55:28PM -0400, Theodore Y. Ts'o wrote:
> Worse, right now, -rc1 and -rc2 is causing random crashes in my
> gce-xfstests framework.  Sometimes it happens before we've run even a
> single xfstests; sometimes it happens after we have successfully
> completed all of the tests, and we're doing a shutdown of the VM under
> test.  Other times it happens in the middle of a test run.  Given that
> I'm seeing this at -rc1, which is before my late ext4 pull request to
> Linus, it's probably not an ext4 related bug.  But it also means that
> I'm partially blind in terms of my kernel testing at the moment.  So I
> can't even tell Linus that I've run lots of tests and I'm 100%
> confident your one-line change is 100% safe.

I was finally able to bisect it down to the commit:

37f4a24c2469: blk-mq: centralise related handling into blk_mq_get_driver_tag

(See below for [1] Bisect log.)

The previous commit allows the tests to run to completion.  With
commit 37f4a24c2469 and later all 11 test scenarios (4k blocks, 1k
blocks, ext3 compat, ext4 w/ fscrypt, nojournal mode, data=journal,
bigalloc, etc.) the VM will get stuck.

The symptom is that while running xfstests in a Google Compute Engine
(GCE) VM, the tests just hang.  There are a number of tests where this
is more likely, but it's not unique to a single test.

In most cases, there is nothing; just the test stops running until the
test framework times out after an hour (tests usually complete in
seconds or at most a few tens of minutes or so in the worst case) and
kills the VM.  In one case, I did get a report like this.  (See below
for [2] stack trace from 37f4a24c2469.)

I attempted to revert the commit in question against -rc1 and -rc4;
that result can be found at the branches manual-revert-of-blk-mq-patch
and manual-revert-of-blk-mq-patch-rc4 at https://github.com/tytso/ext4.

I don't think I got the revert quite right; with the revert, most of
the test VM's successfully complete, but 2 out of the 11 fail, with a
different stack trace.  (See below for [3] stack trace from my
attempted manual revert of 37f4a24c2469).  But it does seem to confirm
that the primary cause of the test VM hangs is caused by commit
37f4a24c2469.

Does this make any sense as to what might be going on?  I hope it does
for you, since I'm pretty confused what might be going on.

Thanks,

   - Ted

[1] Bisect log

git bisect start
# bad: [9123e3a74ec7b934a4a099e98af6a61c2f80bbf5] Linux 5.9-rc1
git bisect bad 9123e3a74ec7b934a4a099e98af6a61c2f80bbf5
# good: [bcf876870b95592b52519ed4aafcf9d95999bc9c] Linux 5.8
git bisect good bcf876870b95592b52519ed4aafcf9d95999bc9c
# bad: [8186749621ed6b8fc42644c399e8c755a2b6f630] Merge tag 
'drm-next-2020-08-06' of git://anongit.freedesktop.org/drm/drm
git bisect bad 8186749621ed6b8fc42644c399e8c755a2b6f630
# bad: [2324d50d051ec0f14a548e78554fb02513d6dcef] Merge tag 'docs-5.9' of 
git://git.lwn.net/linux
git bisect bad 2324d50d051ec0f14a548e78554fb02513d6dcef
# bad: [92c59e126b21fd212195358a0d296e787e444087] Merge tag 'arm-defconfig-5.9' 
of git://git.kernel.org/pub/scm/linux/kernel/git/soc/soc
git bisect bad 92c59e126b21fd212195358a0d296e787e444087
# bad: [cdc8fcb49905c0b67e355e027cb462ee168ffaa3] Merge tag 
'for-5.9/io_uring-20200802' of git://git.kernel.dk/linux-block
git bisect bad cdc8fcb49905c0b67e355e027cb462ee168ffaa3
# good: [ab5c60b79ab6cc50b39bbb21b2f9fb55af900b84] Merge branch 'linus' of 
git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
git bisect good ab5c60b79ab6cc50b39bbb21b2f9fb55af900b84
# bad: [d958e343bdc3de2643ce25225bed082dc222858d] block: blk-timeout: delete 
duplicated word
git bisect bad d958e343bdc3de2643ce25225bed082dc222858d
# bad: [53042f3cc411adc79811ba3cfbca5d7a42a7b806] ps3vram: stop using 
->queuedata
git bisect bad 53042f3cc411adc79811ba3cfbca5d7a42a7b806
# good: [621c1f42945e76015c3a585e7a9fe6e71665eba0] block: move struct 
block_device to blk_types.h
git bisect good 621c1f42945e76015c3a585e7a9fe6e71665eba0
# good: [36a3df5a4574d5ddf59804fcd0c4e9654c514d9a] blk-mq: put driver tag when 
this request is completed
git bisect good 36a3df5a4574d5ddf59804fcd0c4e9654c514d9a
# good: [570e9b73b0af2e5381ca5343759779b8c1ed20e3] blk-mq: move 
blk_mq_get_driver_tag into blk-mq.c
git bisect good 570e9b73b0af2e5381ca5343759779b8c1ed20e3
# bad: [b5fc1e8bedf8ad2c6381e0df6331ad5686aca425] blk-mq: remove pointless call 
of list_entry_rq() in hctx_show_busy_rq()
git bisect bad b5fc1e8bedf8ad2c6381e0df6331ad5686aca425
# bad: [37f4a24c2469a10a4c16c641671bd766e276cf9f] blk-mq: centralise related 
handling into blk_mq_get_driver_tag
git bisect bad 37f4a24c2469a10a4c16c641671bd766e276cf9f
# good: [723bf178f158abd1ce6069cb049581b3cb003aab] blk-mq: move 
blk_mq_put_driver_tag() into blk-mq.c
git bisect good 723bf178f158abd1ce6069cb049581b3cb003aab
# first bad commit: [37f4a24c2469a10a4c16c641671bd766e276cf9f] blk

Re: [PATCH] ext4: flag as supporting buffered async reads

2020-09-03 Thread Theodore Y. Ts'o

99dedf RBX: 0001 RCX: 
[ 3326.428656] RDX:  RSI:  RDI: ad5e433b
[ 3326.435910] RBP: 949dd7a29800 R08: 0001 R09: 0001
[ 3326.443169] R10: 949dd91ea724 R11: 08ee R12: 949dd7a29864
[ 3326.450420] R13: 0001 R14: 0001 R15: 
[ 3326.457690]  ? acpi_safe_halt+0x1b/0x30
[ 3326.461647]  acpi_idle_enter+0x1d0/0x260
[ 3326.465704]  cpuidle_enter_state+0x6e/0x3a0
[ 3326.470019]  cpuidle_enter+0x29/0x40
[ 3326.473726]  cpuidle_idle_call+0xf8/0x160
[ 3326.477854]  do_idle+0x72/0xc0
[ 3326.481036]  cpu_startup_entry+0x19/0x20
[ 3326.485079]  start_kernel+0x433/0x452
[ 3326.488899]  secondary_startup_64+0xb6/0xc0
[ 3326.493204] CR2: 949e7c958e3f
[ 3326.496757] ---[ end trace 464b5b002bebb81c ]---



On Thu, Sep 03, 2020 at 06:10:19PM -0600, Jens Axboe wrote:
> On 8/26/20 7:54 PM, Jens Axboe wrote:
> > On 8/25/20 8:18 AM, Jens Axboe wrote:
> >> On 8/24/20 4:56 AM, Jens Axboe wrote:
> >>> On 8/22/20 9:48 AM, Jens Axboe wrote:
> >>>> On 8/22/20 8:33 AM, Theodore Y. Ts'o wrote:
> >>>>> On Fri, Aug 21, 2020 at 03:26:35PM -0600, Jens Axboe wrote:
> >>>>>>>>> Resending this one, as I've been carrying it privately since May. 
> >>>>>>>>> The
> >>>>>>>>> necessary bits are now upstream (and XFS/btrfs equiv changes as 
> >>>>>>>>> well),
> >>>>>>>>> please consider this one for 5.9. Thanks!
> >>>>>>>>
> >>>>>>>> The necessary commit only hit upstream as of 5.9-rc1, unless I'm
> >>>>>>>> missing something?  It's on my queue to send to Linus once I get my
> >>>>>>>> (late) ext4 primary pull request for 5.9.
> >>>>>>>
> >>>>>>> Right, it went in at the start of the merge window for 5.9. Thanks 
> >>>>>>> Ted!
> >>>>>>
> >>>>>> Didn't see it in the queue that just sent in, is it still queued up?
> >>>>>
> >>>>> It wasn't in the queue which I queued up because that was based on
> >>>>> 5.8-rc4.  Linus was a bit grumpy (fairly so) because it was late, and
> >>>>> that's totally on me.
> >>>>>
> >>>>> He has said that he's going to start ignoring pull requests that
> >>>>> aren't fixes only if this becomes a pattern, so while I can send him
> >>>>> another pull request which will just have that one change, there are
> >>>>> no guarantees he's going to take it at this late date.
> >>>>>
> >>>>> Sorry, when you sent me the commit saying that the changes that were
> >>>>> needed were already upstream on August 3rd, I thought that meant that
> >>>>> they were aready in Linus's tree.  I should have checked and noticed
> >>>>> that that in fact "ext4: flag as supporting buffered async reads"
> >>>>> wasn't compiling against Linus's upstream tree, so I didn't realize
> >>>>> this needed to be handled as a special case during the merge window.
> >>>>
> >>>> Well to be honest, this kind of sucks. I've been posting it since May,
> >>>> and the ideal approach would have been to just ack it and I could have
> >>>> carried it in my tree. That's what we did for btrfs and XFS, both of
> >>>> which have it.
> >>>>
> >>>> The required patches *were* upstreamed on August 3rd, which is why I
> >>>> mentioned that. But yes, not in 5.8 or earlier, of course.
> >>>>
> >>>> So I suggest that you either include it for the next pull request for
> >>>> Linus, or that I put it in with your ack. Either is fine with me. I'd
> >>>> consider this a "dropping the ball" kind of thing, it's not like the
> >>>> patch hasn't been in linux-next or hasn't been ready for months. This
> >>>> isn't some "oh I wrote this feature after the merge window" event. It'd
> >>>> be a real shame to ship 5.9 and ext4 not have support for the more
> >>>> efficient async buffered reads, imho, especially since the two other
> >>>> major local file systems already have it.
> >>>>
> >>>> Let me know what you think.
> >>>
> >>> Ted, can you make a call on this, please? It's now post -rc2. Let's
> >>> get this settled and included, one way or another.
> >>
> >> Daily ping on this one...
> > 
> > And again. Ted, not sure how to make any progress with this, to be
> > honest, it's like pounding sand.
> 
> And 8 days later...
> 
> -- 
> Jens Axboe
>

Re: [PATCH] ext4: flag as supporting buffered async reads

2020-08-22 Thread Theodore Y. Ts'o

On Fri, Aug 21, 2020 at 03:26:35PM -0600, Jens Axboe wrote:
> >>> Resending this one, as I've been carrying it privately since May. The
> >>> necessary bits are now upstream (and XFS/btrfs equiv changes as well),
> >>> please consider this one for 5.9. Thanks!
> >>
> >> The necessary commit only hit upstream as of 5.9-rc1, unless I'm
> >> missing something?  It's on my queue to send to Linus once I get my
> >> (late) ext4 primary pull request for 5.9.
> > 
> > Right, it went in at the start of the merge window for 5.9. Thanks Ted!
> 
> Didn't see it in the queue that just sent in, is it still queued up?

It wasn't in the queue which I queued up because that was based on
5.8-rc4.  Linus was a bit grumpy (fairly so) because it was late, and
that's totally on me.

He has said that he's going to start ignoring pull requests that
aren't fixes only if this becomes a pattern, so while I can send him
another pull request which will just have that one change, there are
no guarantees he's going to take it at this late date.

Sorry, when you sent me the commit saying that the changes that were
needed were already upstream on August 3rd, I thought that meant that
they were aready in Linus's tree.  I should have checked and noticed
that that in fact "ext4: flag as supporting buffered async reads"
wasn't compiling against Linus's upstream tree, so I didn't realize
this needed to be handled as a special case during the merge window.

Cheers,

- Ted

Re: [LKP] Re: [ext4] d3b6f23f71: stress-ng.fiemap.ops_per_sec -60.5% regression

2020-08-19 Thread Theodore Y. Ts'o

Looking at what the stress-ng fiemap workload is doing, and
it's interesting.

It is running 4 processes which are calling FIEMAP on a particular
file in a loop, with a 25ms sleep every 64 times.  And then there is a
fifth process which is randomly writing to the file and calling
punch_hole to random offsets in the file.

So this is quite different from what Ritesh has been benchmarking
which is fiemap in isolation, as opposed to fiemap racing against a 3
other fiemap processes plus a process which is actively modifying the
file.

In the original code, if I remember correctly, we were using a shared
reader/writer lock to look at the extent tree blocks directly, but we
hold the i_data_sem rw_sempahore for the duration of the fiemap call.

In the new code, we're going through the extent_status cache, which is
grabbing the rw_spinlock each time we do a lookup in the extents
status tree.  So this is a much finer-grained locking and that is
probably the explanation for the increased time for running fiemap in
the contended case.

If this theory is correct, we would probably get back the performance
by wrapping the calls to iomap_fiemap() with {up,down}_read(>i_data_sem)
in ext4_fiemap().

That being said, however  it's clear what real-life workload cares
about FIEMAP performance, especially with multiple threads all calling
FIEMAP racing against a file which is being actively modified.  Having
stress-ng do this to find potential kernel bugs is a great thing, so I
understand why stress-ng might be doing this as a QA tool.  Why we
should care about stress-ng as a performance benchmark, at least in
this case, is much less clear to me.

Cheers,

- Ted

Re: [PATCH] ext4: Fix comment typo "the the".

2020-08-18 Thread Theodore Y. Ts'o

On Sat, Apr 25, 2020 at 02:16:24AM +0900, kyoungho koo wrote:
> I have found double typed comments "the the". So i modified it to
> one "the"
> 
> Signed-off-by: kyoungho koo 

Thanks, applied; apologies for this falling through the cracks!

- Ted

Re: [PATCH] mballoc: Replace seq_printf with seq_puts

2020-08-18 Thread Theodore Y. Ts'o

On Mon, Aug 10, 2020 at 02:21:58AM +, Xu Wang wrote:
> seq_puts is a lot cheaper than seq_printf, so use that to print
> literal strings.
> 
> Signed-off-by: Xu Wang 

Applied, thanks.

- Ted

Re: [PATCH] ext4: flag as supporting buffered async reads

2020-08-18 Thread Theodore Y. Ts'o

On Mon, Aug 03, 2020 at 05:02:11PM -0600, Jens Axboe wrote:
> ext4 uses generic_file_read_iter(), which already supports this.
> 
> Cc: Theodore Ts'o 
> Signed-off-by: Jens Axboe 
> 
> ---
> 
> Resending this one, as I've been carrying it privately since May. The
> necessary bits are now upstream (and XFS/btrfs equiv changes as well),
> please consider this one for 5.9. Thanks!

The necessary commit only hit upstream as of 5.9-rc1, unless I'm
missing something?  It's on my queue to send to Linus once I get my
(late) ext4 primary pull request for 5.9.

- Ted

Re: Maintainers / Kernel Summit 2020 planning kick-off

2020-07-01 Thread Theodore Y. Ts'o

On Wed, Jul 01, 2020 at 09:12:31AM +1000, Dave Airlie wrote:
> 
> What timezone are the conferences being held in? It impacts on what I
> can attend quite heavily :-)

When you register for the Linux Plumbers Conference, there will be an
opportunity for you to indicate your timezone preferences.  The
current thinking is that the timing will be like many of the virtual
conferences, which is to run it only for at most 3-4 hours a day, at
hours when it is most, convenient for the people who have registered.
That's for the Kernel Summit.

For the Maintainer's Summit, due to the lack of topic proposals, we
are probably going to end up skipping it for this year.

Cheers,

 - Ted

Maintainers / Kernel Summit 2020 submissions

2020-06-15 Thread Theodore Y. Ts'o

So far, we have received 5 techinical topic submissions for the Kernel
Summit; thanks to those who have submitted.  If you have some
additional ideas of technical topics you'd like to discuss at the
Kernel Summit, please submit them this week.  For details on how to
proposal a topic for the Kernel Summit, please see [1].

[1] https://lore.kernel.org/r/20200515163956.ga2158...@mit.edu

We have not, so far, gotten any submissions for the Maintainer's
Summit.  It's unclear whether this is because people have been
distracted with issues relating to the pandemic situation, or whether
things have been going swimingly from a process perspective, or if
people are just not as motivated to suggest topics if they can't
discuss them in a face-to-face setting as opposed to a virtual setting
 or some combination of all of the above.

If you do have some ideas for things that you think are worth the
attention of a Maintainer's Summit this year, please kindly submit
them this week.  (Again, please see [1] for submission instructions.)
If we do not get sufficient submissions, we will need to consider
whether or not it makes sense to hold a Maintainer's Summit this year.

Thanks,

- Ted

[GIT PULL] ext4 changes part 2 for 5.8

2020-06-14 Thread Theodore Y. Ts'o

The following changes since commit 6b8ed62008a49751fc71fefd2a4f89202a7c2d4d:

  ext4: avoid unnecessary transaction starts during writeback (2020-06-03 
23:16:56 -0400)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4-for-linus-5.8-rc1-2

for you to fetch changes up to 7b97d868b7ab2448859668de9222b8af43f76e78:

  ext4, jbd2: ensure panic by fix a race between jbd2 abort and ext4 error 
handlers (2020-06-12 14:51:41 -0400)


This is the second round of ext4 commits for 5.8 merge window.  It
includes the per-inode DAX support, which was dependant on the DAX
infrastructure which came in via the XFS tree, and a number of
regression and bug fixes; most notably the "BUG: using
smp_processor_id() in preemptible code in ext4_mb_new_blocks" reported
by syzkaller.


Eric Biggers (1):
  ext4: avoid utf8_strncasecmp() with unstable name

Ira Weiny (14):
  fs: Remove unneeded IS_DAX() check in io_is_direct()
  fs/stat: Define DAX statx attribute
  Documentation/dax: Update Usage section
  fs: Lift XFS_IDONTCACHE to the VFS layer
  fs: Introduce DCACHE_DONTCACHE
  fs/ext4: Narrow scope of DAX check in setflags
  fs/ext4: Disallow verity if inode is DAX
  fs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS
  fs/ext4: Update ext4_should_use_dax()
  fs/ext4: Only change S_DAX on inode load
  fs/ext4: Make DAX mount option a tri-state
  fs/ext4: Remove jflag variable
  fs/ext4: Introduce DAX inode flag
  Documentation/dax: Update DAX enablement for ext4

Jan (janneke) Nieuwenhuizen (1):
  ext4: support xattr gnu.* namespace for the Hurd

Jeffle Xu (1):
  ext4: fix partial cluster initialization when splitting extent

Ritesh Harjani (1):
  ext4: mballoc: Use this_cpu_read instead of this_cpu_ptr

Theodore Ts'o (2):
  Enable ext4 support for per-file/directory dax operations
  ext4: avoid race conditions when remounting with options that change dax

yangerkun (1):
  ext4: stop overwrite the errcode in ext4_setup_super

zhangyi (F) (1):
  ext4, jbd2: ensure panic by fix a race between jbd2 abort and ext4 error 
handlers

 Documentation/filesystems/dax.txt | 142 
--
 Documentation/filesystems/ext4/verity.rst |   3 ++
 drivers/block/loop.c  |   6 +--
 fs/dcache.c   |  19 
 fs/ext4/Makefile  |   3 +-
 fs/ext4/dir.c |  16 ++
 fs/ext4/ext4.h|  27 ---
 fs/ext4/extents.c |   2 +-
 fs/ext4/ialloc.c  |   2 +-
 fs/ext4/inode.c   |  26 +++---
 fs/ext4/ioctl.c   |  65 -
 fs/ext4/mballoc.c |   2 +-
 fs/ext4/super.c   | 124 
+--
 fs/ext4/verity.c  |   5 +-
 fs/ext4/xattr.c   |   2 +
 fs/ext4/xattr.h   |   1 +
 fs/ext4/xattr_hurd.c  |  51 
 fs/jbd2/journal.c |  17 +--
 fs/stat.c |   3 ++
 fs/xfs/xfs_icache.c   |   4 +-
 fs/xfs/xfs_inode.h|   3 +-
 fs/xfs/xfs_super.c|   2 +-
 include/linux/dcache.h|   2 +
 include/linux/fs.h|  14 +++---
 include/linux/jbd2.h  |   6 ++-
 include/uapi/linux/fs.h   |   1 +
 include/uapi/linux/stat.h |   1 +
 include/uapi/linux/xattr.h|   4 ++
 28 files changed, 465 insertions(+), 88 deletions(-)
 create mode 100644 fs/ext4/xattr_hurd.c

Re: [PATCH v2] ext4: support xattr gnu.* namespace for the Hurd

2020-06-12 Thread Theodore Y. Ts'o

On Fri, May 29, 2020 at 10:39:39AM +0200, Jan Nieuwenhuizen wrote:
> Theodore Y. Ts'o writes:
> 
> Hello!
> 
> > On Mon, May 25, 2020 at 09:39:40PM +0200, Jan (janneke) Nieuwenhuizen wrote:
> >> The Hurd gained[0] support for moving the translator and author
> >> fields out of the inode and into the "gnu.*" xattr namespace.
> >> 
> >> In anticipation of that, an xattr INDEX was reserved[1].  The Hurd has
> >> now been brought into compliance[2] with that.
> >> 
> >> This patch adds support for reading and writing such attributes from
> >> Linux; you can now do something like
> 
> [...]
> >
> > This patch is missing a Signed-off-by.  If you don't understand why
> > this is really important, please read: 
> >
> > https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin
> 
> Ah, that makes sense.  Sorry for omitting that/being not clear on it.
> 
> > Can you resubmit with the DCO or confirm that it's OK for me to add
> > your Signed-off-by?
> 
> Yes, that's OK, please do!  I am the original author of this patch.

Thanks, I've applied this to the ext4.git tree.

- Ted

Re: BUG: unable to handle kernel NULL pointer dereference in generic_perform_write (2)

2020-06-10 Thread Theodore Y. Ts'o

#syz test: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
5749fe5af3db176659978718ddaecebb450cdb6b

Re: BUG: unable to handle kernel NULL pointer dereference in generic_perform_write (2)

2020-06-10 Thread Theodore Y. Ts'o

#syz test: https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
5b8b9d0c6d0e0f1993c6c56deaf9646942c49d94

Re: BUG: unable to handle kernel NULL pointer dereference in generic_perform_write (2)

2020-06-10 Thread Theodore Y. Ts'o

#syz test: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
5b8b9d0c6d0e0f1993c6c56deaf9646942c49d94

[GIT PULL] ext4 changes for 5.8-rc1

2020-06-04 Thread Theodore Y. Ts'o

The following changes since commit 0e698dfa282211e414076f9dc7e83c1c288314fd:

  Linux 5.7-rc4 (2020-05-03 14:56:04 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git 
tags/ext4_for_linus

for you to fetch changes up to 6b8ed62008a49751fc71fefd2a4f89202a7c2d4d:

  ext4: avoid unnecessary transaction starts during writeback (2020-06-03 
23:16:56 -0400)


A lot of bug fixes and cleanups for ext4, including:

* Fix performance problems found in dioread_nolock now that it is the
  default, caused by transaction leaks.
* Clean up fiemap handling in ext4
* Clean up and refactor multiple block allocator (mballoc) code
* Fix a problem with mballoc with a smaller file systems running out
  of blocks because they couldn't properly use blocks that had been
  reserved by inode preallocation.
* Fixed a race in ext4_sync_parent() versus rename()
* Simplify the error handling in the extent manipulation code
* Make sure all metadata I/O errors are felected to ext4_ext_dirty()'s and
  ext4_make_inode_dirty()'s callers.
* Avoid passing an error pointer to brelse in ext4_xattr_set()
* Fix race which could result to freeing an inode on the dirty last
  in data=journal mode.
* Fix refcount handling if ext4_iget() fails
* Fix a crash in generic/019 caused by a corrupted extent node


Carlos Guerrero Álvarez (1):
  ext4: fix a style issue in fs/ext4/acl.c

Christoph Hellwig (10):
  ext4: fix fiemap size checks for bitmap files
  ext4: split _ext4_fiemap
  ext4: remove the call to fiemap_check_flags in ext4_fiemap
  fs: mark __generic_block_fiemap static
  fs: move the fiemap definitions out of fs.h
  iomap: fix the iomap_fiemap prototype
  fs: move fiemap range validation into the file systems instances
  fs: handle FIEMAP_FLAG_SYNC in fiemap_prep
  fs: remove the access_ok() check in ioctl_fiemap
  ext4: remove the access_ok() check in ext4_ioctl_get_es_cache

Christophe JAILLET (1):
  ext4: fix a typo in a comment

Eric Biggers (2):
  ext4: fix race between ext4_sync_parent() and rename()
  ext4: add casefold flag to EXT4_INODE_* flags

Eric Whitney (7):
  ext4: remove EXT4_GET_BLOCKS_KEEP_SIZE flag
  ext4: translate a few more map flags to strings in tracepoints
  ext4: remove dead GET_BLOCKS_ZERO code
  ext4: remove redundant GET_BLOCKS_CONVERT code
  ext4: clean up GET_BLOCKS_PRE_IO error handling
  ext4: clean up ext4_ext_convert_to_initialized() error handling
  ext4: rework map struct instantiation in ext4_ext_map_blocks()

Harshad Shirwadkar (3):
  ext4: fix EXT_MAX_EXTENT/INDEX to check for zeroed eh_max
  ext4: handle ext4_mark_inode_dirty errors
  ext4: don't ignore return values from ext4_ext_dirty()

Jan Kara (5):
  writeback: Export inode_io_list_del()
  ext4: Avoid freeing inodes on dirty list
  ext4: drop ext4_journal_free_reserved()
  jbd2: avoid leaking transaction credits when unreserving handle
  ext4: avoid unnecessary transaction starts during writeback

Jason Yan (1):
  ext4: remove unnecessary comparisons to bool

Jeffle Xu (1):
  ext4: fix error pointer dereference

Jens Axboe (1):
  ext4: don't block for O_DIRECT if IOCB_NOWAIT is set

Jonathan Grant (1):
  add comment for ext4_dir_entry_2 file_type member

Kaixu Xia (2):
  ext4: remove unnecessary test_opt for DIOREAD_NOLOCK
  ext4: remove redundant variable has_bigalloc in ext4_fill_super

Ritesh Harjani (21):
  ext4: mballoc: print bb_free info even when it is 0
  ext4: mballoc: refactor ext4_mb_show_ac()
  ext4: mballoc: add more mb_debug() msgs
  ext4: mballoc: correct the mb_debug() format specifier for pa_len var
  ext4: mballoc: fix few other format specifier in mb_debug()
  ext4: mballoc: simplify error handling in ext4_init_mballoc()
  ext4: mballoc: make ext4_mb_use_preallocated() return type as bool
  ext4: mballoc: refactor code inside DOUBLE_CHECK into separate function
  ext4: mballoc: fix possible NULL ptr & remove BUG_ONs from DOUBLE_CHECK
  ext4: balloc: use task_pid_nr() helper
  ext4: use BIT() macro for BH_** state bits
  ext4: improve ext_debug() msg in case of block allocation failure
  ext4: replace EXT_DEBUG with __maybe_unused in 
ext4_ext_handle_unwritten_extents()
  ext4: mballoc: make mb_debug() implementation to use pr_debug()
  ext4: make ext_debug() implementation to use pr_debug()
  ext4: mballoc: add blocks to PA list under same spinlock after allocating 
blocks
  ext4: mballoc: refactor ext4_mb_discard_preallocations()
  ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC 
handling
  ext4: mballoc: refactor ext4_mb_good_group()
  ext4: mballoc: use lock for checking free blocks while retrying

Re: [PATCH V5 0/9] Enable ext4 support for per-file/directory DAX operations

2020-05-28 Thread Theodore Y. Ts'o

On Thu, May 28, 2020 at 10:54:41PM -0400, Theodore Y. Ts'o wrote:
> 
> Thanks, applied to the ext4-dax branch.
> 

I spoke too soon.  While I tried merging with the ext4.git dev branch,
a merge conflict made me look closer and I realize I needed to make
the following changes (see diff between your patch set and what is
currently in ext4-dax).

Essentially, I needed to rework the branch to take into account commit
e0198aff3ae3 ("ext4: reject mount options not supported when
remounting in handle_mount_opt()").

The problem is that if you allow handle_mount_opt() to apply the
changes to the dax settings, and then later on, ext4_remount() realize
that we're remounting, and we need to reject the change, there's a
race if we restore the mount options to the original configuration.
Specifically, as Syzkaller pointed out, between when we change the dax
settings and then reset them, it's possible for some file to be opened
with "wrong" dax setting, and then when they are reset, *boom*.

The correct way to deal with this is to reject the mount option change
much earlier, in handle_mount_opt(), *before* we mess with the dax
settings.

Please take a look at the ext4-dax for the actual changes which I
made.

Cheers,

- Ted


diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3658e3016999..9a37d70394b2 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1733,7 +1733,7 @@ static int clear_qf_name(struct super_block *sb, int 
qtype)
 #define MOPT_NO_EXT3   0x0200
 #define MOPT_EXT4_ONLY (MOPT_NO_EXT2 | MOPT_NO_EXT3)
 #define MOPT_STRING0x0400
-#define MOPT_SKIP  0x0800
+#define MOPT_NO_REMOUNT0x0800
 
 static const struct mount_opts {
int token;
@@ -1783,18 +1783,15 @@ static const struct mount_opts {
{Opt_min_batch_time, 0, MOPT_GTE0},
{Opt_inode_readahead_blks, 0, MOPT_GTE0},
{Opt_init_itable, 0, MOPT_GTE0},
-   {Opt_dax, EXT4_MOUNT_DAX_ALWAYS, MOPT_SET | MOPT_SKIP},
-   {Opt_dax_always, EXT4_MOUNT_DAX_ALWAYS,
-   MOPT_EXT4_ONLY | MOPT_SET | MOPT_SKIP},
-   {Opt_dax_inode, EXT4_MOUNT2_DAX_INODE,
-   MOPT_EXT4_ONLY | MOPT_SET | MOPT_SKIP},
-   {Opt_dax_never, EXT4_MOUNT2_DAX_NEVER,
-   MOPT_EXT4_ONLY | MOPT_SET | MOPT_SKIP},
+   {Opt_dax, 0, MOPT_NO_REMOUNT},
+   {Opt_dax_always, 0, MOPT_NO_REMOUNT},
+   {Opt_dax_inode, 0, MOPT_NO_REMOUNT},
+   {Opt_dax_never, 0, MOPT_NO_REMOUNT},
{Opt_stripe, 0, MOPT_GTE0},
{Opt_resuid, 0, MOPT_GTE0},
{Opt_resgid, 0, MOPT_GTE0},
-   {Opt_journal_dev, 0, MOPT_NO_EXT2 | MOPT_GTE0},
-   {Opt_journal_path, 0, MOPT_NO_EXT2 | MOPT_STRING},
+   {Opt_journal_dev, 0, MOPT_NO_EXT2 | MOPT_GTE0 | MOPT_NO_REMOUNT},
+   {Opt_journal_path, 0, MOPT_NO_EXT2 | MOPT_STRING | MOPT_NO_REMOUNT},
{Opt_journal_ioprio, 0, MOPT_NO_EXT2 | MOPT_GTE0},
{Opt_data_journal, EXT4_MOUNT_JOURNAL_DATA, MOPT_NO_EXT2 | MOPT_DATAJ},
{Opt_data_ordered, EXT4_MOUNT_ORDERED_DATA, MOPT_NO_EXT2 | MOPT_DATAJ},
@@ -1831,7 +1828,7 @@ static const struct mount_opts {
{Opt_jqfmt_vfsv1, QFMT_VFS_V1, MOPT_QFMT},
{Opt_max_dir_size_kb, 0, MOPT_GTE0},
{Opt_test_dummy_encryption, 0, MOPT_GTE0},
-   {Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET},
+   {Opt_nombcache, EXT4_MOUNT_NO_MBCACHE, MOPT_SET | MOPT_NO_REMOUNT},
{Opt_err, 0, 0}
 };
 
@@ -1929,6 +1926,12 @@ static int handle_mount_opt(struct super_block *sb, char 
*opt, int token,
 "Mount option \"%s\" incompatible with ext3", opt);
return -1;
}
+   if ((m->flags & MOPT_NO_REMOUNT) && is_remount) {
+   ext4_msg(sb, KERN_ERR,
+"Mount option \"%s\" not supported when remounting",
+opt);
+   return -1;
+   }
 
if (args->from && !(m->flags & MOPT_STRING) && match_int(args, ))
return -1;
@@ -2008,11 +2011,6 @@ static int handle_mount_opt(struct super_block *sb, char 
*opt, int token,
}
sbi->s_resgid = gid;
} else if (token == Opt_journal_dev) {
-   if (is_remount) {
-   ext4_msg(sb, KERN_ERR,
-"Cannot specify journal on remount");
-   return -1;
-   }
*journal_devnum = arg;
} else if (token == Opt_journal_path) {
char *journal_path;
@@ -2020,11 +2018,6 @@ static int handle_mount_opt(struct super_block *sb, char 
*opt, int token,
struct path path;
int error;
 
-   if (is_remount) {
-   ext4_msg(sb, KERN_ERR,
-"Cannot specify journal on remount");
-

Re: [PATCH v2] ext4: support xattr gnu.* namespace for the Hurd

2020-05-28 Thread Theodore Y. Ts'o

On Mon, May 25, 2020 at 09:39:40PM +0200, Jan (janneke) Nieuwenhuizen wrote:
> The Hurd gained[0] support for moving the translator and author
> fields out of the inode and into the "gnu.*" xattr namespace.
> 
> In anticipation of that, an xattr INDEX was reserved[1].  The Hurd has
> now been brought into compliance[2] with that.
> 
> This patch adds support for reading and writing such attributes from
> Linux; you can now do something like
> 
> mkdir -p hurd-root/servers/socket
> touch hurd-root/servers/socket/1
> setfattr --name=gnu.translator --value='"/hurd/pflocal\0"' \
> hurd-root/servers/socket/1
> getfattr --name=gnu.translator hurd-root/servers/socket/1
> # file: 1
> gnu.translator="/hurd/pflocal"
> 
> to setup a pipe translator, which is being used to create[3] a
> vm-image for the Hurd from GNU Guix.
> 
> [0] https://summerofcode.withgoogle.com/projects/#5869799859027968
> [1] 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3980bd3b406addb327d858aebd19e229ea340b9a
> [2] 
> https://git.savannah.gnu.org/cgit/hurd/hurd.git/commit/?id=a04c7bf83172faa7cb080fbe3b6c04a8415ca645
> [3] https://git.savannah.gnu.org/cgit/guix.git/log/?h=wip-hurd-vm

This patch is missing a Signed-off-by.  If you don't understand why
this is really important, please read: 

https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin

Can you resubmit with the DCO or confirm that it's OK for me to add your 
Signed-off-by?

  - Ted

Re: [PATCH V5 0/9] Enable ext4 support for per-file/directory DAX operations

2020-05-28 Thread Theodore Y. Ts'o

On Thu, May 28, 2020 at 07:59:54AM -0700, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> Changes from V4:
>   Fix up DAX mutual exclusion with other flags.
>   Add clean up patch (remove jflags)
> 
> Changes from V3:
>   Change EXT4_DAX_FL to bit24
>   Cache device DAX support in the super block and use that is
>   ext4_should_use_dax()
> 
> Changes from V2:
>   Rework DAX exclusivity with verity and encryption based on feedback
>   from Eric
> 
> Enable the same per file DAX support in ext4 as was done for xfs.  This series
> builds and depends on the V11 series for xfs.[1]
> 
> This passes the same xfstests test as XFS.
> 
> The only issue is that this modifies the old mount option parsing code rather
> than waiting for the new parsing code to be finalized.
> 
> This series starts with 3 fixes which include making Verity and Encrypt truly
> mutually exclusive from DAX.  I think these first 3 patches should be picked 
> up
> for 5.8 regardless of what is decided regarding the mount parsing.
> 
> [1] https://lore.kernel.org/lkml/20200428002142.404144-1-ira.we...@intel.com/
> 
> To: linux-e...@vger.kernel.org
> To: Andreas Dilger 
> To: "Theodore Y. Ts'o" 
> To: Jan Kara 
> To: Eric Biggers 

Thanks, applied to the ext4-dax branch.

- Ted

Re: [PATCHv5 0/5] Improve ext4 handling of ENOSPC with multi-threaded use-case

2020-05-28 Thread Theodore Y. Ts'o



Thanks, I've applied this patch series.

- Ted

Re: [PATCH] ext4: Fix a typo in a comment

2020-05-21 Thread Theodore Y. Ts'o

On Sun, May 03, 2020 at 10:06:47PM +0200, Christophe JAILLET wrote:
> s/extnets/extents/
> 
> Signed-off-by: Christophe JAILLET 

Thanks, applied.

- Ted

Maintainers / Kernel Summit 2020 planning kick-off

2020-05-15 Thread Theodore Y. Ts'o

[ Feel free to forward this to other Linux kernel mailing lists as
  appropriate -- Ted ]

This year, the Maintainers and Kernel Summit will NOT be held in
Halifax, August 25 -- 28th, as a result of the COVID-19 pandemic.
Instead, we will be pursuing a virtual conference format for both the
Maintainers and Kernel Summit, around the last week of August.

As in previous years, the Maintainers Summit is invite-only, where the
primary focus will be process issues around Linux Kernel Development.
It will be limited to 30 invitees and a handful of sponsored
attendees.

The Kernel Summit is organized as a track which is run in parallel
with the other tracks at the Linux Plumbers Conference (LPC), and is
open to all registered attendees of LPC.

Linus will be generating a core list of people to be invited to the
Maintainers Summit.  The top ten people from that list will receive
invites, and then program committee will use the rest of Linus's list
as a starting point of people to be considered.  People who suggest
topics that should be discussed at the Maintainers Summit will also
be added to the list for consideration.  To make topic suggestions for
the Maintainers Summit, please send e-mail to the
ksummit-disc...@lists.linuxfoundation.org list with a subject prefix
of [MAINTAINERS SUMMIT].

The other job of the program committee will be to organize the program
for the Kernel Summit.  The goal of the Kernel Summit track will be to
provide a forum to discuss specific technical issues that would be
easier to resolve in person than over e-mail.  The program committee
will also consider "information sharing" topics if they are clearly of
interest to the wider development community (i.e., advanced training
in topics that would be useful to kernel developers).

To suggest a topic for the Kernel Summit, please do two things.
First, please tag your e-mail with [TECH TOPIC].  As before, please
use a separate e-mail for each topic, and send the topic suggestions
to the ksummit-discuss list.

Secondly, please create a topic at the Linux Plumbers Conference
proposal submission site and target it to the Kernel Summit track.
For your convenience you can use:

https://bit.ly/lpc20-submit

Please do both steps.  I'll try to notice if someone forgets one or
the other, but your chances of making sure your proposal gets the
necessary attention and consideration are maximized by submitting both
to the mailing list and the web site.

People who submit topic suggestions before June 15th and which are
accepted, will be given free admission to the Linux Plumbers
Conference.

We will be reserving roughly half of the Kernel Summit slots for
last-minute discussions that will be scheduled during the week of
Plumbers, in an "unconference style".  This allows last-minute ideas
that come up to be given given slots for discussion.

If you were not subscribed on to the kernel-discuss mailing list from
last year (or if you had removed yourself after the kernel summit),
you can subscribe to the discuss list using mailman:

   https://lists.linuxfoundation.org/mailman/listinfo/ksummit-discuss

The program committee this year is composed of the following people:

Greg Kroah-Hartman
Jens Axboe
Jon Corbet
Ted Ts'o
Thomas Gleixner

Re: [PATCH] ext4: Fix buffer_head refcnt leak when ext4_iget() fails

2020-05-14 Thread Theodore Y. Ts'o

On Thu, Apr 23, 2020 at 01:09:27PM +0800, Xiyu Yang wrote:
> ext4_orphan_get() invokes ext4_read_inode_bitmap(), which returns a
> reference of the specified buffer_head object to "bitmap_bh" with
> increased refcnt.
> 
> When ext4_orphan_get() returns, local variable "bitmap_bh" becomes
> invalid, so the refcount should be decreased to keep refcount balanced.
> 
> The reference counting issue happens in one exception handling path of
> ext4_orphan_get(). When ext4_iget() fails, the function forgets to
> decrease the refcnt increased by ext4_read_inode_bitmap(), causing a
> refcnt leak.
> 
> Fix this issue by calling brelse() when ext4_iget() fails.
> 
> Signed-off-by: Xiyu Yang 
> Signed-off-by: Xin Tan 

Applied, thanks.

- Ted

Re: [PATCH] ext4: remove unnecessary comparisons to bool

2020-05-14 Thread Theodore Y. Ts'o

On Mon, Apr 20, 2020 at 12:29:18PM +0800, Jason Yan wrote:
> Fix the following coccicheck warning:
> 
> fs/ext4/extents_status.c:1057:5-28: WARNING: Comparison to bool
> fs/ext4/inode.c:2314:18-24: WARNING: Comparison to bool
> 
> Signed-off-by: Jason Yan 

Applied, thanks.

- Ted

Re: [PATCH v3 5/6] fs: ext4: default KUNIT_* fragments to KUNIT_ALL_TESTS

2020-05-11 Thread Theodore Y. Ts'o

On Mon, May 11, 2020 at 03:14:38PM +0200, Anders Roxell wrote:
> This makes it easier to enable all KUnit fragments.
> 
> Adding 'if !KUNIT_ALL_TESTS' so individual tests can not be turned off.
> Therefore if KUNIT_ALL_TESTS is enabled that will hide the prompt in
> menuconfig.
> 
> Reviewed-by: David Gow 
> Signed-off-by: Anders Roxell 

Acked-off-by: Theodore Ts'o

Re: [PATCH linux-kselftest/test v1] apparmor: add AppArmor KUnit tests for policy unpack

2019-10-18 Thread Theodore Y. Ts'o

On Thu, Oct 17, 2019 at 05:43:07PM -0700, Brendan Higgins wrote:
> > +config SECURITY_APPARMOR_TEST
> > +   bool "Build KUnit tests for policy_unpack.c"
> > +   default n
> > +   depends on KUNIT && SECURITY_APPARMOR
> 
> Ted, here is an example where doing select on direct dependencies is
> tricky because SECURITY_APPARMOR has a number of indirect dependencies.

Well, that could be solved by adding a select on all of the indirect
dependencies.  I did get your point about the fact that we could have
cases where the indirect dependencies might conflict with one another.
That's going to be a tough situation regardless of whether we have a
sat-solver or a human who has to struggle with that situation.

It's also going to be a bit sad because it means that we won't be able
to create a single config that could be used to run all the kunit
tests when a user pushes a change to a Gerrit server for review.  :-/

I suppose that if we use a strict definition of "unit tests", and we
assume that all of the tests impacted by a change in foo/bar/baz.c
will be found in foo/bar/baz-test.c, or maybe foo/bar/*-test.c, we can
automate the generation of the kunitconfig file, perhaps?

The other sad bit about having mutually exclusive config options is
that we can't easily "run all KUinit tests" for some kind of test
spinner or zero-day bot.

I'm not sure there's a good solution to that issue, though.

 - Ted

Re: [REGRESSION] kmemleak: commit c566586818 causes failure to boot

2019-10-14 Thread Theodore Y. Ts'o

On Mon, Oct 14, 2019 at 01:51:15PM +0100, Catalin Marinas wrote:
> In your case, CONFIG_DEBUG_KMEMLEAK_DEFAULT_OFF=y, so it disables itself
> irrespective of the pool size and trips over the bug. Even with default
> off, it still involves the clean-up since kmemleak needs to track early
> allocations in case it is turned on by the kmemleak=on cmdline option.
> 
> So I think 16000 is sufficient in your case, the default-off triggered
> the bug (well, unless you find in the logs "kmemleak: Memory pool empty,
> consider increasing CONFIG_DEBUG_KMEMLEAK_MEM_POOL_SIZE").

Ah, got it, thanks for the clarification!

- Ted

Re: [REGRESSION] kmemleak: commit c566586818 causes failure to boot

2019-10-14 Thread Theodore Y. Ts'o

On Mon, Oct 14, 2019 at 08:03:14AM +0100, Catalin Marinas wrote:
> Thanks for the report. I have a fix already:
> 
> http://lkml.kernel.org/r/20191004134624.46216-1-catalin.mari...@arm.com
> 
> I was hoping Andrew had sent it to Linus before -rc3 but it doesn't seem
> to be in mainline yet.

Thanks for the pointer to the fix!  Does that mean that the workaround
is to increase the kmemleak pool size?  I had been using the default
(16000) and it seems surprising that that it wasn't enough to even get
the kernel through a standard boot sequence.  Should we perhaps
increase the default mempool size?

- Ted

[REGRESSION] kmemleak: commit c566586818 causes failure to boot

2019-10-13 Thread Theodore Y. Ts'o

Commit c566586818 ("mm: kmemleak: use the memory pool for early
allocations") causes my test kernels to fail to boot on using both kvm
and using Google Compute Engine.  A git bisect localized it to
c566586818, and I confirmed by test building v5.4-rc3, which failed as
above using KVM.  When I reverted c566586818 the kernel booted
successfully.

The symptoms are that the boot hangs after:

[2.844808] hctosys: unable to open rtc device (rtc0)

and then about 25 seconds later, we get the following warning:

[   28.237938] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1:7]
[   28.239345] irq event stamp: 198897938
[   28.240017] hardirqs last  enabled at (198897937): [] 
_raw_write_unlock_irqrestore+
0x43/0x47
[   28.241979] hardirqs last disabled at (198897938): [] 
trace_hardirqs_off_thunk+0x1a
/0x20
[   28.243930] softirqs last  enabled at (198876302): [] 
__do_softirq+0x32a/0x42a
[   28.247350] softirqs last disabled at (198876295): [] 
irq_exit+0xb3/0xc0
[   28.250080] CPU: 0 PID: 7 Comm: kworker/0:1 Not tainted 
5.4.0-rc3-xfstests-00403-g4f5cafb5cb84 #1225
[   28.253081] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.12.0-1 04/01/2014
[   28.254885] Workqueue: events kmemleak_do_cleanup
[   28.255570] RIP: 0010:_raw_write_unlock_irqrestore+0x45/0x47
[   28.256401] Code: e8 b0 4d 60 ff 48 89 ef e8 d8 a6 60 ff f6 c7 02 75 11 53 
9d e8 dc b1 68 ff 65 ff 0d
 cd 73 10 5f 5b 5d c3 e8 ed b0 68 ff 53 9d  ed 90 90 90 90 90 90 90 90 90 
90 90 90 90 90 90 90 90 90
 90 90
[   28.260440] RSP: :98984006fdf8 EFLAGS: 0246 ORIG_RAX: 
ff13
[   28.262258] RAX: 94d7fd23a1c0 RBX: 0246 RCX: 0006
[   28.264238] RDX: 0007 RSI: 94d7fd23a9c0 RDI: 94d7fd23a1c0
[   28.267333] RBP: a1c94bc0 R08: 0006931c1cf2 R09: 
[   28.269871] R10:  R11:  R12: 
[   28.272175] R13:  R14:  R15: a1c94aa8
[   28.274649] FS:  () GS:94d7fd80() 
knlGS:
[   28.277758] CS:  0010 DS:  ES:  CR0: 80050033
[   28.279638] CR2:  CR3: 5fc12001 CR4: 00360ef0
[   28.282367] Call Trace:
[   28.283075]  find_and_remove_object+0x7f/0x90
[   28.284335]  delete_object_full+0xc/0x20
[   28.285488]  __kmemleak_do_cleanup+0x63/0x100
[   28.286913]  process_one_work+0x246/0x570
[   28.288801]  worker_thread+0x50/0x3b0
[   28.290406]  ? process_one_work+0x570/0x570
[   28.291497]  kthread+0x126/0x140
[   28.292316]  ? kthread_delayed_work_timer_fn+0xa0/0xa0
[   28.294262]  ret_from_fork+0x3a/0x50
[   28.837921] rcu: INFO: rcu_sched self-detected stall on CPU
...

I've attached the log from the KVM session and the config.gz used to
build the kernels.

- Ted


config.gz
Description: application/gzip


log.201910132216.gz
Description: application/gzip

Re: x86/random: Speculation to the rescue

2019-10-07 Thread Theodore Y. Ts'o

On Sun, Oct 06, 2019 at 08:21:03PM +0200, Pavel Machek wrote:
> Even without cycle counter... if we _know_ we are trying to generate
> entropy and have MMC available, we don't care about power and
> performance.
> 
> So we can just...
> 
>issue read request on MMC
>while (!interrupt_done)
>i++
>
> ...and then use i++ as poor man's version of cycle counter.

I suggest that you try that and see how much uncertainty you really
have before you assume that this is actually going to work.  Again, on
"System on a Chip" systems, there is very likely a single master
oscillator, and the eMMC is going to be on the the same silicon die as
the CPU.  At least for spinning rust platters it's on a physically
separate peripheral, and there was a physical claim about how chaotic
air turbulence might add enough uncertainty that timing HDD reads
could be unpredictable (although note that the paper[1] which claimed
this was written 25 years ago, and HDD's have had to get more precise,
not less, in order to cram more bits into the same squire millimeter).
more.)

[1] D. Davis, R. Ihaka, P.R. Fenstermacher, "Cryptographic Randomness
from Air Turbulence in Disk Drives", in Advances in Cryptology --
CRYPTO '94 Conference Proceedings, edited by Yvo G. Desmedt,
pp.114--120. Lecture Notes in Computer Science #839. Heidelberg:
Springer-Verlag, 1994

But for a flash device that is on the same silicon as the CPU, and
driven off of the same clock crystal, and where you are doing *reads*,
so you can't even depend on SSD GC and uncertainties in programming
the flash cell --- I'm going to be dubious.

If someone wants to use this as a last ditch hope, then we can let
systemd do this, where there's at least a bit more uncertainty in
userspace, and maybe you could be doing something useful, like running
fsck on the stateful partitions of the system.

Ultimately, though, we really want to try to get a hardware RNG, and
for that matter, a cryptographic secure element built into these
devices, and the sooner we can pound it into CPU manufacturers' heads
that not doing this should be professional malpractice, the better.

   - Ted

P.S.  Note that this Don Davis's paper[1] claims that they were able
to extract 100 independent unpredictable bits per _minute_.  Given
that modern init scripts want us to be able to boot in under a few
seconds, and we need at least 128 independent bits to securely
initialize the CRNG, people who think that we can get secure random
bits suitable for long-term cryptographic keys (say, such as
initializing the host's SSH key) during early boot based only on HDD
timings may be a bit optimistic.

P.P.S.  I actually knew Don; he was at MIT Project Athena as a staff
member at the same time I was working on Kerberos as my day job, and
Linux's e2fsprogs, /dev/random, etc. as a hobby...

Re: Stop breaking the CSRNG

2019-10-02 Thread Theodore Y. Ts'o

On Wed, Oct 02, 2019 at 06:55:33PM +0200, Kurt Roeckx wrote:
> 
> But it seems people are now thinking about breaking getrandom() too,
> to let it return data when it's not initialized by default. Please
> don't.

"It's complicated"

The problem is that whether a CRNG can be considered secure is a
property of the entire system, including the hardware, and given the
large number of hardware configurations which the kernel and OpenSSL
can be used, in practice, we can't assure that getrandom(2) is
"secure" without making certain assumptions.  For example, if we
assume that the CPU is an x86 processor new enough to support RDRAND,
and that RDRAND is competently implemented (e.g., it won't disappear
after a suspend/resume) and doesn't have any backdoors implanted in
it, then it's easy to say that getrandom() will always be secure.

But if you assume that there is no hardware random number generator,
and everything is driven from a single master oscillator, with no
exernal input, and the CPU is utterly simple, with speculation or
anything else that might be non-determinstic, AND if we assume that
the idiots who make an IOT device use the same random seed across
millions of devices all cloned off of the same master imagine, there
is ***absoutely*** nothing the kernel can do to guarantee, with 100%
certainty, that the CRNG will be initialzied.  (This is especially
true if the idiots who design the IOT device call OpenSSL to generate
their long-term private key the moment the device is first plugged in,
before any networking device is brought on-line.)

The point with all of this is that both the kernel and OpenSSL, and
whether or not they can be combined to create a secure overall
solution is going to be dependent on the hardware choices, and choices
of the distribution and the application programmers in terms of what
other software components are used, and when and where those
components try to request random numbers, especially super-early in
the boot process.

Historically, I've tried to work around this problem by being super
paranoid about the choices of thresholds before declaring the CRNG to
be initialized, while *also* making sure that at least on most common
x86 systems, the CRNG could be considered initialized before the root
file system was mounted read/write.

But over time, assumptions of what is common hardware changes.  SSD's
replace HDD's; NAPI and other polling techniques are more common to
reduce the number of interrupts; the use of a single master oscillator
to drive the all of the various clocks on the system, etc.  And
software changes --- systemd running boot scripts in parallel means
that boot times are reduced, which is good, but it also means the time
to when the root is mounted read/write is much shortened.

So in the absence of a hardware RNG, or a hardware random number
generator which is considered trusted (i.e., should RDRAND
beconsidered trusted?), there *will* be times when we will simply fail
to be able to generate secure random numbers (at least by our
hueristics, which can potentially be overly optimistic on some
hardware platforms, and overly conservative on others).

The question is then, what do we do?  Do we hang the boot --- at which
point users will complain to Linus?  Or do we just hope that things
are "good enough", and that even if the user has elected to say that
they don't trust RDRAND, that we'll hope it's competently implement
and not backdoored?  Or do we assume that using a jitter entropy
scheme is actually secure, as opposed to security through obscurity
(and maybe is completely pointless on a simple and completely open
architecture with no speculation such as RISC-V)?

There really are no good choices here.  The one thing which Linus has
made very clear is that hanging at boot is Not Acceptable.  Long term,
the best we can do is to through the kitchen sink at the problem.  So
we should try to use UEFI's RNG if available; use the TPM's RNG if
available; use RDRAND if available; try to use a seed file if
available (and hope it's not cloned to be identical on a million IOT
devices); and so on.  Hopefully, they won't *all* incompetently
implemented and/or implanted with a backdoor from the NSA or MSS or
the KGB.

The only words of hope that I can give you is that it's likely that
there are so many zero day bugs in the kernel, in userspace
applications, and crypto libraries (including maybe OpenSSL), that we
don't have to make the CRNG impossible to attack in order to make a
difference.  We just have to make it harder than finding and
exploiting zero day security bugs in *other* parts of the system.

   "When a mountain bear is chasing after you, you don’t have to
   outrun the bear. You only have to outrun the person running next to
   you."  :-)

Bottom line, we can do the best we can with each of our various
components, but without control over the hardware that will be in use,
or for OpenSSL, what applications are trying to call OpenSSL for, and
when they might

Re: x86/random: Speculation to the rescue

2019-10-02 Thread Theodore Y. Ts'o

On Tue, Oct 01, 2019 at 06:15:02PM +0200, Ahmed S. Darwish wrote:
> 
> Using the "ent" tool, [2] also used to test randomness in the Stephen
> Müller LRNG paper, on a 50-byte file, produced the following
> results:

The "ent" tool is really, really useless.  If you take any CRNG, even
intialized with a known seed, "ent" will say that it's *GREAT*!

If you don't believe me, disable all entropy inputs into the CRNG,
initialize it with "THE NSA IS OUR LORD AND MASTER", and then run it.
You'll get substantially the same results.  (And if we didn't the Cha
Cha 20 encryption algorithm would be totally broken).

 - Ted

1 2 3 4 5 6 7 8 9 >

1 - 100 of 886 matches

Mail list logo