[Kernel-packages] [Bug 1934709] [NEW] btrfs: Automatic balance returns -EUCLEAN and leads to forced readonly filesystem

Matthew Ruffell Mon, 05 Jul 2021 20:30:58 -0700

Public bug reported:

BugLink: https://bugs.launchpad.net/bugs/1934709


[Impact]

During an automatic balance, users may encounter an error when writing
the transaction log to disk, when the log tree is being parsed, which
forces the filesystem to be remounted read-only and outputs the
following kernel oops:

BTRFS: error (device dm-14) in balance_level:1962: errno=-117 unknown
BTRFS info (device dm-14): forced readonly
BTRFS: Transaction aborted (error -117)
WARNING: CPU: 7 PID: 10818 at 
/build/linux-99Rib2/linux-4.15.0/fs/btrfs/tree-log.c:2908 
btrfs_sync_log+0xa28/0xbc0 [btrfs]
CPU: 7 PID: 10818 Comm: qemu-system-s39 Tainted: G           OE    
4.15.0-136-generic #140-Ubuntu
Hardware name: IBM 3907 LR1 A00 (LPAR)
Krnl PSW : 0000000076bc1d64 000000009cc65255 (btrfs_sync_log+0xa28/0xbc0 
[btrfs])
           R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
Krnl GPRS: 007899a600000000 0000000000000006 0000000000000027 0000000000000007
           000003ff801fdd8c 000000000002394c 00000053650a7000 ffffffffffffff8b
           000000536b7f6000 00000053ffffff8b 00000053650a3800 0000005385935000
           000000532054de01 0000005385935290 000003ff801fdd8c 0000004eebb1fae0
Krnl Code: 000003ff801fdd80: c0200002a032       larl    %r2,000003ff80251de4
           000003ff801fdd86: c0e5fffb2181       brasl   %r14,000003ff80162088
          #000003ff801fdd8c: a7f40001           brc     15,000003ff801fdd8e
          >000003ff801fdd90: e3a0f0a80004       lg      %r10,168(%r15)
           000003ff801fdd96: b9040057           lgr     %r5,%r7
           000003ff801fdd9a: a7490b5c           lghi    %r4,2908
           000003ff801fdd9e: b904002a           lgr     %r2,%r10
           000003ff801fdda2: c0300002604f       larl    %r3,000003ff80249e40
Call Trace:
([<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs])
 [<000003ff801c74e2>] btrfs_sync_file+0x3e2/0x550 [btrfs]
 [<00000000003ce6ce>] do_fsync+0x5e/0x90
 [<00000000003ce9ca>] SyS_fdatasync+0x32/0x48
 [<00000000008ffe5c>] system_call+0xd8/0x2c8
Last Breaking-Event-Address:
 [<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs]
BTRFS: error (device dm-14) in btrfs_sync_log:2908: errno=-117 unknown
BTRFS error (device dm-14): pending csums is 269639680

This bug appears to be linked to bug 1933172, but is different and has a
different root cause. Again, I believe that this is a regression
introduced in the fixing of CVE-2019-19036, from 4.15.0-109-generic.

[Fix]

Analysis of the kernel oops is as follows:

The first thing we see is that BTRFS entered ERROR state with the
reason:

in balance_level:1962: errno=-117 unknown

Now errno -117 is:

100 #define EUCLEAN     117 /* Structure needs cleaning */

btrfs treats -EUCLEAN as if corruption has happened. Let's see where
this is returned from.

If we start at fs/btrfs/ctree.c in balance_level(), line 1962:

1917 static noinline int balance_level(struct btrfs_trans_handle *trans,
1918              struct btrfs_root *root,
1919              struct btrfs_path *path, int level)
1920 {
...
1958         /* promote the child to a root */
1959         child = read_node_slot(fs_info, mid, 0);
1960         if (IS_ERR(child)) {
1961             ret = PTR_ERR(child);
1962             btrfs_handle_fs_error(fs_info, ret, NULL);
1963             goto enospc;
1964         }
...
2136 }

We are in the middle of a balancing operation, and if you happen to be
familiar with how b-tree data structures work, we are promoting a child
node to a topmost root node.

The error most likely happens in read_node_slot(), with the lines after
it printing that an error has happened.

1887 static noinline struct extent_buffer *
1888 read_node_slot(struct btrfs_fs_info *fs_info, struct extent_buffer *parent,
1889            int slot)
1890 {
...
1900     btrfs_node_key_to_cpu(parent, &first_key, slot);
1901     eb = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot),
1902                  btrfs_node_ptr_generation(parent, slot),
1903                  level - 1, &first_key);
...
1910 }

There are two calls here which are relevant. btrfs_node_key_to_cpu() and
read_tree_block().

Let's look at read_tree_block() in fs/btrfs/disk-io.c:

1147 struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 
bytenr,
1148                       u64 parent_transid, int level,
1149                       struct btrfs_key *first_key)
1150 {
1151     struct extent_buffer *buf = NULL;
1152     int ret;
1153
1154     buf = btrfs_find_create_tree_block(fs_info, bytenr);
1155     if (IS_ERR(buf))
1156         return buf;
1157
1158     ret = btree_read_extent_buffer_pages(fs_info, buf, parent_transid,
1159                          level, first_key);
1160     if (ret) {
1161         free_extent_buffer_stale(buf);
1162         return ERR_PTR(ret);
1163     }
1164     return buf;
1165
1166 }

The interesting one here is btree_read_extent_buffer_pages():

498 static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
499                       struct extent_buffer *eb,
500                       u64 parent_transid, int level,
501                       struct btrfs_key *first_key)
502 {
...
511     while (1) {
512         clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
513         ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
514                            btree_get_extent, mirror_num);
515         if (!ret) {
516             if (verify_parent_transid(io_tree, eb,
517                            parent_transid, 0))
518                 ret = -EIO;
519             else if (verify_level_key(fs_info, eb, level,
520                           first_key))
521                 ret = -EUCLEAN;
522             else
523                 break;
524         }
525
526         /*
527          * This buffer's crc is fine, but its contents are corrupted, so
528          * there is no reason to read the other copies, they won't be
529          * any less wrong.
530          */
531         if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
532             ret == -EUCLEAN)
533             break;
...

If read_extent_buffer_pages() returns zero, we call verify_level_key(),
and if this returns non-zero, we return -EUCLEAN.

verify_level_key() can fail on three conditions.

 440 static int verify_level_key(struct btrfs_fs_info *fs_info,
 441                 struct extent_buffer *eb, int level,
 442                 struct btrfs_key *first_key)
 443 {
 448     found_level = btrfs_header_level(eb);
 449     if (found_level != level) {
...
 456         return -EIO;
 457     }
 458
 459     if (!first_key)
 460         return 0;
 461
 462     /* We have @first_key, so this @eb must have at least one item */
 463     if (btrfs_header_nritems(eb) == 0) {
 464         btrfs_err(fs_info,
 465         "invalid tree nritems, bytenr=%llu nritems=0 expect >0",
 466               eb->start);
 467         WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
 468         return -EUCLEAN;
 469     }
 470
 471     if (found_level)
 472         btrfs_node_key_to_cpu(eb, &found_key, 0);
 473     else
 474         btrfs_item_key_to_cpu(eb, &found_key, 0);
 475     ret = btrfs_comp_cpu_keys(first_key, &found_key);
...
 487     return ret;
 488 }

1) If the eb level doesn't match the provided level.
2) If the eb has 0 items.
3) If the found_key doesn't match the first_key.

With the information we currently have, we don't know what one caused
the problem.

I looked to see when verify_level_key() was first introduced. It seems
it arrived in 4.15.0-109-generic through the SRU of CVE-2019-19036, with
the commit:

ubuntu-bionic 50ab1ff51db0c5eb77ffc6f15ef32f07764f86ff
Author: Qu Wenruo <w...@suse.com>
Date:   Thu Mar 29 09:08:11 2018 +0800
Subject: btrfs: Validate child tree block's level and first key
Link: https://paste.ubuntu.com/p/DQjTkfNRDt/

If you look at this particular hunk:

https://paste.ubuntu.com/p/z4qXw2Jp9K/

We see lines 18 - 26 of the above pastebin are introduced here.

Looking at the original upstream commit:

commit 581c1760415c48cca9349b198bba52dd38750765
Author: Qu Wenruo <w...@suse.com>
Date:   Thu Mar 29 09:08:11 2018 +0800
Subject: btrfs: Validate child tree block's level and first key
Link: 
https://github.com/torvalds/linux/commit/581c1760415c48cca9349b198bba52dd38750765

Particularly the same hunk:

https://paste.ubuntu.com/p/vW3Y9BvK4C/

There is a subtle difference, the second if statement is extended with
the ret == -EUCLEAN check, and not implemented entirely.

Why is this?

I looked up when the if statement was first introduced, and it was a
very old commit from v2.6.39-rc1:

https://paste.ubuntu.com/p/vPT36xXq66/

Particularly this hunk:

https://paste.ubuntu.com/p/2jmQRSmVfc/

Interesting. Why is a commit from 4.15-109-generic re-implement
something that should have been there since 2.6.39?

I checked upstream, and found the if statement to be removed entirely.

That is when I came across:

commit f8397d69daef06d358430d3054662fb597e37c00
Author: Nikolay Borisov <nbori...@suse.com>
Date:   Tue Nov 6 16:40:20 2018 +0200
Subject: btrfs: Always try all copies when reading extent buffers
Link: 
https://github.com/torvalds/linux/commit/f8397d69daef06d358430d3054662fb597e37c00

Which talks about balance operations failing out of the blue after a
raid 1 disk was added back to the array. The commit removes the if
statement, and moves the location of the clear EXTENT_BUFFER_CORRUPT
inside the while loop.

I checked the Bionic 4.15 kernel, and I found that this commit was
applied in 4.15.0-56-generic:

ubuntu-bionic 03e1b5c9a1c1704e109466b375d09a4721b65ec5
Author: Nikolay Borisov <nbori...@suse.com>
Date:   Tue Nov 6 16:40:20 2018 +0200
Subject: btrfs: Always try all copies when reading extent buffers
Link: https://paste.ubuntu.com/p/TS2c5Mptr2/

It appears that the if statement was removed in 4.15.0-56-generic
intentionally, and was brought back mistakenly in a backport of "btrfs:
Validate child tree block's level and first key" 4.15.0-109-generic.

The root cause is likely some interaction between bug 1933172 and this
bug, which leads to EXTENT_BUFFER_CORRUPT being set, or the incorrect
first_key being set for the root node, which means we end up returning
-EUCLEAN and aborting the transaction.

Unfortunately, this is the second bad backport of CVE-2019-19036.

The fix is to revert "btrfs: Validate child tree block's level and first
key" and its dependency commit "btrfs: Detect unbalanced tree with empty
leaf before crashing btree operations", and re-apply correct backports
of these commits with that if statement removed, to keep the spirit of
the already applied "btrfs: Always try all copies when reading extent
buffers".

We will also add the below commit as it is a fixup commit for "btrfs:
Validate child tree block's level and first key" to correct usage of the
wrong "key" variable, which also leads to similar symptoms of -EUCLEAN
being returned from verify_level_key().

commit 17515f1b764df36271f3166c714f5a78301fbaa7
Author: Qu Wenruo <w...@suse.com>
Date:   Mon Apr 23 17:32:04 2018 +0800
Subject: btrfs: Fix wrong first_key parameter in replace_path
Link: 
https://github.com/torvalds/linux/commit/17515f1b764df36271f3166c714f5a78301fbaa7

[Testcase]

Unfortunately, the customer did not image the affected filesystem and
has since restored a backup ontop of it.

I have been attempting to reproduce this issue for some time, but I have
not experienced the same call trace. I ran into bug 1933172 while trying
to reproduce this bug.

I have been trying to balance nearly full btrfs filesystems, and I have
looped xfstests btrfs/124 for hours attempting to trigger "btrfs: Always
try all copies when reading extent buffers" but I haven't experienced a
crash yet.

I have a test kernel in the following ppa:

https://launchpad.net/~mruffell/+archive/ubuntu/sf311164-test-2

If you install it, balances still complete as expected.

I will keep attempting to reproduce the issue, and will update this
section if I manage to create a testcase.

[Where problems could occur]

If a regression were to occur, it would affect users of btrfs
filesystems, and would likely show during a routine balance operation.

I believe affected users would have nearly full filesystems, and would
also experience filesystem corruption from bug 1933172, which would then
cause the issues from this bug when the transaction log is written to
disk.

With all modifications to btrfs, there is a risk of data corruption and
filesystem corruption for all btrfs users, since balances happen
automatically and on a regular basis. If a regression does happen, users
should remount their filesystems with the "nobalance" flag, backup their
data, and attempt a repair if necessary.

[Other info]

A community member has hit this issue before, and has reported it
upstream to linux-btrfs here, although they never received a reply.

https://www.spinics.net/lists/linux-btrfs/msg112261.html

I have written to Richard and asked if he has any additional information
that might help reproduce, but I have yet to receive a reply.

If you read Richard's mailing list link, it mentions filesystem
corruption with missing extents. This suggests this crash might be
linked to bug 1933172, which I came across while trying to reproduce the
issue in this bug.

** Affects: linux (Ubuntu)
     Importance: Undecided
         Status: Fix Released

** Affects: linux (Ubuntu Bionic)
     Importance: Medium
     Assignee: Matthew Ruffell (mruffell)
         Status: In Progress

** Changed in: linux (Ubuntu)
       Status: New => Fix Released

** Also affects: linux (Ubuntu Bionic)
   Importance: Undecided
       Status: New

** Changed in: linux (Ubuntu Bionic)
       Status: New => In Progress

** Changed in: linux (Ubuntu Bionic)
   Importance: Undecided => Medium

** Changed in: linux (Ubuntu Bionic)
     Assignee: (unassigned) => Matthew Ruffell (mruffell)

** Description changed:

+ BugLink: https://bugs.launchpad.net/bugs/1934709
+ 
  [Impact]
  
  During an automatic balance, users may encounter an error when writing
  the transaction log to disk, when the log tree is being parsed, which
  forces the filesystem to be remounted read-only and outputs the
  following kernel oops:
  
  BTRFS: error (device dm-14) in balance_level:1962: errno=-117 unknown
  BTRFS info (device dm-14): forced readonly
  BTRFS: Transaction aborted (error -117)
  WARNING: CPU: 7 PID: 10818 at 
/build/linux-99Rib2/linux-4.15.0/fs/btrfs/tree-log.c:2908 
btrfs_sync_log+0xa28/0xbc0 [btrfs]
  CPU: 7 PID: 10818 Comm: qemu-system-s39 Tainted: G           OE    
4.15.0-136-generic #140-Ubuntu
  Hardware name: IBM 3907 LR1 A00 (LPAR)
  Krnl PSW : 0000000076bc1d64 000000009cc65255 (btrfs_sync_log+0xa28/0xbc0 
[btrfs])
-            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
+            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
  Krnl GPRS: 007899a600000000 0000000000000006 0000000000000027 0000000000000007
-            000003ff801fdd8c 000000000002394c 00000053650a7000 ffffffffffffff8b
-            000000536b7f6000 00000053ffffff8b 00000053650a3800 0000005385935000
-            000000532054de01 0000005385935290 000003ff801fdd8c 0000004eebb1fae0
+            000003ff801fdd8c 000000000002394c 00000053650a7000 ffffffffffffff8b
+            000000536b7f6000 00000053ffffff8b 00000053650a3800 0000005385935000
+            000000532054de01 0000005385935290 000003ff801fdd8c 0000004eebb1fae0
  Krnl Code: 000003ff801fdd80: c0200002a032     larl    %r2,000003ff80251de4
-            000003ff801fdd86: c0e5fffb2181     brasl   %r14,000003ff80162088
-           #000003ff801fdd8c: a7f40001         brc     15,000003ff801fdd8e
-           >000003ff801fdd90: e3a0f0a80004     lg      %r10,168(%r15)
-            000003ff801fdd96: b9040057         lgr     %r5,%r7
-            000003ff801fdd9a: a7490b5c         lghi    %r4,2908
-            000003ff801fdd9e: b904002a         lgr     %r2,%r10
-            000003ff801fdda2: c0300002604f     larl    %r3,000003ff80249e40
+            000003ff801fdd86: c0e5fffb2181     brasl   %r14,000003ff80162088
+           #000003ff801fdd8c: a7f40001         brc     15,000003ff801fdd8e
+           >000003ff801fdd90: e3a0f0a80004     lg      %r10,168(%r15)
+            000003ff801fdd96: b9040057         lgr     %r5,%r7
+            000003ff801fdd9a: a7490b5c         lghi    %r4,2908
+            000003ff801fdd9e: b904002a         lgr     %r2,%r10
+            000003ff801fdda2: c0300002604f     larl    %r3,000003ff80249e40
  Call Trace:
  ([<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs])
-  [<000003ff801c74e2>] btrfs_sync_file+0x3e2/0x550 [btrfs] 
-  [<00000000003ce6ce>] do_fsync+0x5e/0x90 
-  [<00000000003ce9ca>] SyS_fdatasync+0x32/0x48 
-  [<00000000008ffe5c>] system_call+0xd8/0x2c8 
+  [<000003ff801c74e2>] btrfs_sync_file+0x3e2/0x550 [btrfs]
+  [<00000000003ce6ce>] do_fsync+0x5e/0x90
+  [<00000000003ce9ca>] SyS_fdatasync+0x32/0x48
+  [<00000000008ffe5c>] system_call+0xd8/0x2c8
  Last Breaking-Event-Address:
-  [<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs]
+  [<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs]
  BTRFS: error (device dm-14) in btrfs_sync_log:2908: errno=-117 unknown
  BTRFS error (device dm-14): pending csums is 269639680
  
  This bug appears to be linked to bug 1933172, but is different and has a
  different root cause. Again, I believe that this is a regression
  introduced in the fixing of CVE-2019-19036, from 4.15.0-109-generic.
  
  [Fix]
  
  Analysis of the kernel oops is as follows:
  
  The first thing we see is that BTRFS entered ERROR state with the
  reason:
  
  in balance_level:1962: errno=-117 unknown
  
  Now errno -117 is:
  
  100 #define EUCLEAN     117 /* Structure needs cleaning */
  
  btrfs treats -EUCLEAN as if corruption has happened. Let's see where
  this is returned from.
  
  If we start at fs/btrfs/ctree.c in balance_level(), line 1962:
  
  1917 static noinline int balance_level(struct btrfs_trans_handle *trans,
  1918              struct btrfs_root *root,
  1919              struct btrfs_path *path, int level)
  1920 {
  ...
  1958         /* promote the child to a root */
  1959         child = read_node_slot(fs_info, mid, 0);
  1960         if (IS_ERR(child)) {
  1961             ret = PTR_ERR(child);
  1962             btrfs_handle_fs_error(fs_info, ret, NULL);
  1963             goto enospc;
  1964         }
  ...
  2136 }
  
  We are in the middle of a balancing operation, and if you happen to be
  familiar with how b-tree data structures work, we are promoting a child
  node to a topmost root node.
  
  The error most likely happens in read_node_slot(), with the lines after
  it printing that an error has happened.
  
  1887 static noinline struct extent_buffer *
  1888 read_node_slot(struct btrfs_fs_info *fs_info, struct extent_buffer 
*parent,
  1889            int slot)
  1890 {
  ...
  1900     btrfs_node_key_to_cpu(parent, &first_key, slot);
  1901     eb = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot),
  1902                  btrfs_node_ptr_generation(parent, slot),
  1903                  level - 1, &first_key);
  ...
  1910 }
  
  There are two calls here which are relevant. btrfs_node_key_to_cpu() and
  read_tree_block().
  
  Let's look at read_tree_block() in fs/btrfs/disk-io.c:
  
  1147 struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 
bytenr,
  1148                       u64 parent_transid, int level,
  1149                       struct btrfs_key *first_key)
  1150 {
  1151     struct extent_buffer *buf = NULL;
  1152     int ret;
- 1153 
+ 1153
  1154     buf = btrfs_find_create_tree_block(fs_info, bytenr);
  1155     if (IS_ERR(buf))
  1156         return buf;
- 1157 
+ 1157
  1158     ret = btree_read_extent_buffer_pages(fs_info, buf, parent_transid,
  1159                          level, first_key);
  1160     if (ret) {
  1161         free_extent_buffer_stale(buf);
  1162         return ERR_PTR(ret);
  1163     }
  1164     return buf;
- 1165 
+ 1165
  1166 }
  
  The interesting one here is btree_read_extent_buffer_pages():
  
  498 static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
  499                       struct extent_buffer *eb,
  500                       u64 parent_transid, int level,
  501                       struct btrfs_key *first_key)
  502 {
  ...
  511     while (1) {
  512         clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
  513         ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
  514                            btree_get_extent, mirror_num);
  515         if (!ret) {
  516             if (verify_parent_transid(io_tree, eb,
  517                            parent_transid, 0))
  518                 ret = -EIO;
  519             else if (verify_level_key(fs_info, eb, level,
  520                           first_key))
  521                 ret = -EUCLEAN;
  522             else
  523                 break;
  524         }
- 525 
+ 525
  526         /*
  527          * This buffer's crc is fine, but its contents are corrupted, so
  528          * there is no reason to read the other copies, they won't be
  529          * any less wrong.
  530          */
  531         if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
  532             ret == -EUCLEAN)
  533             break;
  ...
  
  If read_extent_buffer_pages() returns zero, we call verify_level_key(),
  and if this returns non-zero, we return -EUCLEAN.
  
  verify_level_key() can fail on three conditions.
  
-  440 static int verify_level_key(struct btrfs_fs_info *fs_info,
-  441                 struct extent_buffer *eb, int level,
-  442                 struct btrfs_key *first_key)
-  443 {
-  448     found_level = btrfs_header_level(eb);
-  449     if (found_level != level) {
- ...
-  456         return -EIO;
-  457     }
-  458 
-  459     if (!first_key)
-  460         return 0;
-  461 
-  462     /* We have @first_key, so this @eb must have at least one item */
-  463     if (btrfs_header_nritems(eb) == 0) {
-  464         btrfs_err(fs_info,
-  465         "invalid tree nritems, bytenr=%llu nritems=0 expect >0",
-  466               eb->start);
-  467         WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
-  468         return -EUCLEAN;
-  469     }
-  470 
-  471     if (found_level)
-  472         btrfs_node_key_to_cpu(eb, &found_key, 0);
-  473     else
-  474         btrfs_item_key_to_cpu(eb, &found_key, 0);
-  475     ret = btrfs_comp_cpu_keys(first_key, &found_key);
- ...
-  487     return ret;
-  488 }
-  
+  440 static int verify_level_key(struct btrfs_fs_info *fs_info,
+  441                 struct extent_buffer *eb, int level,
+  442                 struct btrfs_key *first_key)
+  443 {
+  448     found_level = btrfs_header_level(eb);
+  449     if (found_level != level) {
+ ...
+  456         return -EIO;
+  457     }
+  458
+  459     if (!first_key)
+  460         return 0;
+  461
+  462     /* We have @first_key, so this @eb must have at least one item */
+  463     if (btrfs_header_nritems(eb) == 0) {
+  464         btrfs_err(fs_info,
+  465         "invalid tree nritems, bytenr=%llu nritems=0 expect >0",
+  466               eb->start);
+  467         WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
+  468         return -EUCLEAN;
+  469     }
+  470
+  471     if (found_level)
+  472         btrfs_node_key_to_cpu(eb, &found_key, 0);
+  473     else
+  474         btrfs_item_key_to_cpu(eb, &found_key, 0);
+  475     ret = btrfs_comp_cpu_keys(first_key, &found_key);
+ ...
+  487     return ret;
+  488 }
+ 
  1) If the eb level doesn't match the provided level.
  2) If the eb has 0 items.
  3) If the found_key doesn't match the first_key.
  
  With the information we currently have, we don't know what one caused
  the problem.
  
  I looked to see when verify_level_key() was first introduced. It seems
  it arrived in 4.15.0-109-generic through the SRU of CVE-2019-19036, with
  the commit:
  
  ubuntu-bionic 50ab1ff51db0c5eb77ffc6f15ef32f07764f86ff
  Author: Qu Wenruo <w...@suse.com>
  Date:   Thu Mar 29 09:08:11 2018 +0800
  Subject: btrfs: Validate child tree block's level and first key
  Link: https://paste.ubuntu.com/p/DQjTkfNRDt/
  
  If you look at this particular hunk:
  
  https://paste.ubuntu.com/p/z4qXw2Jp9K/
  
  We see lines 18 - 26 of the above pastebin are introduced here.
  
  Looking at the original upstream commit:
  
  commit 581c1760415c48cca9349b198bba52dd38750765
  Author: Qu Wenruo <w...@suse.com>
  Date:   Thu Mar 29 09:08:11 2018 +0800
  Subject: btrfs: Validate child tree block's level and first key
  Link: 
https://github.com/torvalds/linux/commit/581c1760415c48cca9349b198bba52dd38750765
  
  Particularly the same hunk:
  
  https://paste.ubuntu.com/p/vW3Y9BvK4C/
  
  There is a subtle difference, the second if statement is extended with
  the ret == -EUCLEAN check, and not implemented entirely.
  
  Why is this?
  
  I looked up when the if statement was first introduced, and it was a
  very old commit from v2.6.39-rc1:
  
  https://paste.ubuntu.com/p/vPT36xXq66/
  
  Particularly this hunk:
  
  https://paste.ubuntu.com/p/2jmQRSmVfc/
  
  Interesting. Why is a commit from 4.15-109-generic re-implement
  something that should have been there since 2.6.39?
  
  I checked upstream, and found the if statement to be removed entirely.
  
  That is when I came across:
  
  commit f8397d69daef06d358430d3054662fb597e37c00
  Author: Nikolay Borisov <nbori...@suse.com>
  Date:   Tue Nov 6 16:40:20 2018 +0200
  Subject: btrfs: Always try all copies when reading extent buffers
  Link: 
https://github.com/torvalds/linux/commit/f8397d69daef06d358430d3054662fb597e37c00
  
  Which talks about balance operations failing out of the blue after a
  raid 1 disk was added back to the array. The commit removes the if
  statement, and moves the location of the clear EXTENT_BUFFER_CORRUPT
  inside the while loop.
  
  I checked the Bionic 4.15 kernel, and I found that this commit was
  applied in 4.15.0-56-generic:
  
  ubuntu-bionic 03e1b5c9a1c1704e109466b375d09a4721b65ec5
  Author: Nikolay Borisov <nbori...@suse.com>
  Date:   Tue Nov 6 16:40:20 2018 +0200
  Subject: btrfs: Always try all copies when reading extent buffers
  Link: https://paste.ubuntu.com/p/TS2c5Mptr2/
  
  It appears that the if statement was removed in 4.15.0-56-generic
  intentionally, and was brought back mistakenly in a backport of "btrfs:
  Validate child tree block's level and first key" 4.15.0-109-generic.
  
  The root cause is likely some interaction between bug 1933172 and this
  bug, which leads to EXTENT_BUFFER_CORRUPT being set, or the incorrect
  first_key being set for the root node, which means we end up returning
  -EUCLEAN and aborting the transaction.
  
  Unfortunately, this is the second bad backport of CVE-2019-19036.
  
  The fix is to revert "btrfs: Validate child tree block's level and first
  key" and its dependency commit "btrfs: Detect unbalanced tree with empty
  leaf before crashing btree operations", and re-apply correct backports
  of these commits with that if statement removed, to keep the spirit of
  the already applied "btrfs: Always try all copies when reading extent
  buffers".
  
  We will also add the below commit as it is a fixup commit for "btrfs:
  Validate child tree block's level and first key" to correct usage of the
  wrong "key" variable, which also leads to similar symptoms of -EUCLEAN
  being returned from verify_level_key().
  
  commit 17515f1b764df36271f3166c714f5a78301fbaa7
  Author: Qu Wenruo <w...@suse.com>
  Date:   Mon Apr 23 17:32:04 2018 +0800
  Subject: btrfs: Fix wrong first_key parameter in replace_path
  Link: 
https://github.com/torvalds/linux/commit/17515f1b764df36271f3166c714f5a78301fbaa7
  
  [Testcase]
  
  Unfortunately, the customer did not image the affected filesystem and
  has since restored a backup ontop of it.
  
  I have been attempting to reproduce this issue for some time, but I have
  not experienced the same call trace. I ran into bug 1933172 while trying
  to reproduce this bug.
  
  I have been trying to balance nearly full btrfs filesystems, and I have
  looped xfstests btrfs/124 for hours attempting to trigger "btrfs: Always
  try all copies when reading extent buffers" but I haven't experienced a
  crash yet.
  
  I have a test kernel in the following ppa:
  
+ https://launchpad.net/~mruffell/+archive/ubuntu/sf311164-test-2
+ 
  If you install it, balances still complete as expected.
  
  I will keep attempting to reproduce the issue, and will update this
  section if I manage to create a testcase.
  
  [Where problems could occur]
  
  If a regression were to occur, it would affect users of btrfs
  filesystems, and would likely show during a routine balance operation.
  
  I believe affected users would have nearly full filesystems, and would
  also experience filesystem corruption from bug 1933172, which would then
  cause the issues from this bug when the transaction log is written to
  disk.
  
  With all modifications to btrfs, there is a risk of data corruption and
  filesystem corruption for all btrfs users, since balances happen
  automatically and on a regular basis. If a regression does happen, users
  should remount their filesystems with the "nobalance" flag, backup their
  data, and attempt a repair if necessary.
  
  [Other info]
  
  A community member has hit this issue before, and has reported it
  upstream to linux-btrfs here, although they never received a reply.
  
  https://www.spinics.net/lists/linux-btrfs/msg112261.html
  
  I have written to Richard and asked if he has any additional information
  that might help reproduce, but I have yet to receive a reply.
  
  If you read Richard's mailing list link, it mentions filesystem
  corruption with missing extents. This suggests this crash might be
  linked to bug 1933172, which I came across while trying to reproduce the
  issue in this bug.

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1934709

Title:
  btrfs: Automatic balance returns -EUCLEAN and leads to forced readonly
  filesystem

Status in linux package in Ubuntu:
  Fix Released
Status in linux source package in Bionic:
  In Progress

Bug description:
  BugLink: https://bugs.launchpad.net/bugs/1934709

  [Impact]

  During an automatic balance, users may encounter an error when writing
  the transaction log to disk, when the log tree is being parsed, which
  forces the filesystem to be remounted read-only and outputs the
  following kernel oops:

  BTRFS: error (device dm-14) in balance_level:1962: errno=-117 unknown
  BTRFS info (device dm-14): forced readonly
  BTRFS: Transaction aborted (error -117)
  WARNING: CPU: 7 PID: 10818 at 
/build/linux-99Rib2/linux-4.15.0/fs/btrfs/tree-log.c:2908 
btrfs_sync_log+0xa28/0xbc0 [btrfs]
  CPU: 7 PID: 10818 Comm: qemu-system-s39 Tainted: G           OE    
4.15.0-136-generic #140-Ubuntu
  Hardware name: IBM 3907 LR1 A00 (LPAR)
  Krnl PSW : 0000000076bc1d64 000000009cc65255 (btrfs_sync_log+0xa28/0xbc0 
[btrfs])
             R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
  Krnl GPRS: 007899a600000000 0000000000000006 0000000000000027 0000000000000007
             000003ff801fdd8c 000000000002394c 00000053650a7000 ffffffffffffff8b
             000000536b7f6000 00000053ffffff8b 00000053650a3800 0000005385935000
             000000532054de01 0000005385935290 000003ff801fdd8c 0000004eebb1fae0
  Krnl Code: 000003ff801fdd80: c0200002a032     larl    %r2,000003ff80251de4
             000003ff801fdd86: c0e5fffb2181     brasl   %r14,000003ff80162088
            #000003ff801fdd8c: a7f40001         brc     15,000003ff801fdd8e
            >000003ff801fdd90: e3a0f0a80004     lg      %r10,168(%r15)
             000003ff801fdd96: b9040057         lgr     %r5,%r7
             000003ff801fdd9a: a7490b5c         lghi    %r4,2908
             000003ff801fdd9e: b904002a         lgr     %r2,%r10
             000003ff801fdda2: c0300002604f     larl    %r3,000003ff80249e40
  Call Trace:
  ([<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs])
   [<000003ff801c74e2>] btrfs_sync_file+0x3e2/0x550 [btrfs]
   [<00000000003ce6ce>] do_fsync+0x5e/0x90
   [<00000000003ce9ca>] SyS_fdatasync+0x32/0x48
   [<00000000008ffe5c>] system_call+0xd8/0x2c8
  Last Breaking-Event-Address:
   [<000003ff801fdd8c>] btrfs_sync_log+0xa24/0xbc0 [btrfs]
  BTRFS: error (device dm-14) in btrfs_sync_log:2908: errno=-117 unknown
  BTRFS error (device dm-14): pending csums is 269639680

  This bug appears to be linked to bug 1933172, but is different and has
  a different root cause. Again, I believe that this is a regression
  introduced in the fixing of CVE-2019-19036, from 4.15.0-109-generic.

  [Fix]

  Analysis of the kernel oops is as follows:

  The first thing we see is that BTRFS entered ERROR state with the
  reason:

  in balance_level:1962: errno=-117 unknown

  Now errno -117 is:

  100 #define EUCLEAN     117 /* Structure needs cleaning */

  btrfs treats -EUCLEAN as if corruption has happened. Let's see where
  this is returned from.

  If we start at fs/btrfs/ctree.c in balance_level(), line 1962:

  1917 static noinline int balance_level(struct btrfs_trans_handle *trans,
  1918              struct btrfs_root *root,
  1919              struct btrfs_path *path, int level)
  1920 {
  ...
  1958         /* promote the child to a root */
  1959         child = read_node_slot(fs_info, mid, 0);
  1960         if (IS_ERR(child)) {
  1961             ret = PTR_ERR(child);
  1962             btrfs_handle_fs_error(fs_info, ret, NULL);
  1963             goto enospc;
  1964         }
  ...
  2136 }

  We are in the middle of a balancing operation, and if you happen to be
  familiar with how b-tree data structures work, we are promoting a
  child node to a topmost root node.

  The error most likely happens in read_node_slot(), with the lines
  after it printing that an error has happened.

  1887 static noinline struct extent_buffer *
  1888 read_node_slot(struct btrfs_fs_info *fs_info, struct extent_buffer 
*parent,
  1889            int slot)
  1890 {
  ...
  1900     btrfs_node_key_to_cpu(parent, &first_key, slot);
  1901     eb = read_tree_block(fs_info, btrfs_node_blockptr(parent, slot),
  1902                  btrfs_node_ptr_generation(parent, slot),
  1903                  level - 1, &first_key);
  ...
  1910 }

  There are two calls here which are relevant. btrfs_node_key_to_cpu()
  and read_tree_block().

  Let's look at read_tree_block() in fs/btrfs/disk-io.c:

  1147 struct extent_buffer *read_tree_block(struct btrfs_fs_info *fs_info, u64 
bytenr,
  1148                       u64 parent_transid, int level,
  1149                       struct btrfs_key *first_key)
  1150 {
  1151     struct extent_buffer *buf = NULL;
  1152     int ret;
  1153
  1154     buf = btrfs_find_create_tree_block(fs_info, bytenr);
  1155     if (IS_ERR(buf))
  1156         return buf;
  1157
  1158     ret = btree_read_extent_buffer_pages(fs_info, buf, parent_transid,
  1159                          level, first_key);
  1160     if (ret) {
  1161         free_extent_buffer_stale(buf);
  1162         return ERR_PTR(ret);
  1163     }
  1164     return buf;
  1165
  1166 }

  The interesting one here is btree_read_extent_buffer_pages():

  498 static int btree_read_extent_buffer_pages(struct btrfs_fs_info *fs_info,
  499                       struct extent_buffer *eb,
  500                       u64 parent_transid, int level,
  501                       struct btrfs_key *first_key)
  502 {
  ...
  511     while (1) {
  512         clear_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags);
  513         ret = read_extent_buffer_pages(io_tree, eb, WAIT_COMPLETE,
  514                            btree_get_extent, mirror_num);
  515         if (!ret) {
  516             if (verify_parent_transid(io_tree, eb,
  517                            parent_transid, 0))
  518                 ret = -EIO;
  519             else if (verify_level_key(fs_info, eb, level,
  520                           first_key))
  521                 ret = -EUCLEAN;
  522             else
  523                 break;
  524         }
  525
  526         /*
  527          * This buffer's crc is fine, but its contents are corrupted, so
  528          * there is no reason to read the other copies, they won't be
  529          * any less wrong.
  530          */
  531         if (test_bit(EXTENT_BUFFER_CORRUPT, &eb->bflags) ||
  532             ret == -EUCLEAN)
  533             break;
  ...

  If read_extent_buffer_pages() returns zero, we call
  verify_level_key(), and if this returns non-zero, we return -EUCLEAN.

  verify_level_key() can fail on three conditions.

   440 static int verify_level_key(struct btrfs_fs_info *fs_info,
   441                 struct extent_buffer *eb, int level,
   442                 struct btrfs_key *first_key)
   443 {
   448     found_level = btrfs_header_level(eb);
   449     if (found_level != level) {
  ...
   456         return -EIO;
   457     }
   458
   459     if (!first_key)
   460         return 0;
   461
   462     /* We have @first_key, so this @eb must have at least one item */
   463     if (btrfs_header_nritems(eb) == 0) {
   464         btrfs_err(fs_info,
   465         "invalid tree nritems, bytenr=%llu nritems=0 expect >0",
   466               eb->start);
   467         WARN_ON(IS_ENABLED(CONFIG_BTRFS_DEBUG));
   468         return -EUCLEAN;
   469     }
   470
   471     if (found_level)
   472         btrfs_node_key_to_cpu(eb, &found_key, 0);
   473     else
   474         btrfs_item_key_to_cpu(eb, &found_key, 0);
   475     ret = btrfs_comp_cpu_keys(first_key, &found_key);
  ...
   487     return ret;
   488 }

  1) If the eb level doesn't match the provided level.
  2) If the eb has 0 items.
  3) If the found_key doesn't match the first_key.

  With the information we currently have, we don't know what one caused
  the problem.

  I looked to see when verify_level_key() was first introduced. It seems
  it arrived in 4.15.0-109-generic through the SRU of CVE-2019-19036,
  with the commit:

  ubuntu-bionic 50ab1ff51db0c5eb77ffc6f15ef32f07764f86ff
  Author: Qu Wenruo <w...@suse.com>
  Date:   Thu Mar 29 09:08:11 2018 +0800
  Subject: btrfs: Validate child tree block's level and first key
  Link: https://paste.ubuntu.com/p/DQjTkfNRDt/

  If you look at this particular hunk:

  https://paste.ubuntu.com/p/z4qXw2Jp9K/

  We see lines 18 - 26 of the above pastebin are introduced here.

  Looking at the original upstream commit:

  commit 581c1760415c48cca9349b198bba52dd38750765
  Author: Qu Wenruo <w...@suse.com>
  Date:   Thu Mar 29 09:08:11 2018 +0800
  Subject: btrfs: Validate child tree block's level and first key
  Link: 
https://github.com/torvalds/linux/commit/581c1760415c48cca9349b198bba52dd38750765

  Particularly the same hunk:

  https://paste.ubuntu.com/p/vW3Y9BvK4C/

  There is a subtle difference, the second if statement is extended with
  the ret == -EUCLEAN check, and not implemented entirely.

  Why is this?

  I looked up when the if statement was first introduced, and it was a
  very old commit from v2.6.39-rc1:

  https://paste.ubuntu.com/p/vPT36xXq66/

  Particularly this hunk:

  https://paste.ubuntu.com/p/2jmQRSmVfc/

  Interesting. Why is a commit from 4.15-109-generic re-implement
  something that should have been there since 2.6.39?

  I checked upstream, and found the if statement to be removed entirely.

  That is when I came across:

  commit f8397d69daef06d358430d3054662fb597e37c00
  Author: Nikolay Borisov <nbori...@suse.com>
  Date:   Tue Nov 6 16:40:20 2018 +0200
  Subject: btrfs: Always try all copies when reading extent buffers
  Link: 
https://github.com/torvalds/linux/commit/f8397d69daef06d358430d3054662fb597e37c00

  Which talks about balance operations failing out of the blue after a
  raid 1 disk was added back to the array. The commit removes the if
  statement, and moves the location of the clear EXTENT_BUFFER_CORRUPT
  inside the while loop.

  I checked the Bionic 4.15 kernel, and I found that this commit was
  applied in 4.15.0-56-generic:

  ubuntu-bionic 03e1b5c9a1c1704e109466b375d09a4721b65ec5
  Author: Nikolay Borisov <nbori...@suse.com>
  Date:   Tue Nov 6 16:40:20 2018 +0200
  Subject: btrfs: Always try all copies when reading extent buffers
  Link: https://paste.ubuntu.com/p/TS2c5Mptr2/

  It appears that the if statement was removed in 4.15.0-56-generic
  intentionally, and was brought back mistakenly in a backport of
  "btrfs: Validate child tree block's level and first key"
  4.15.0-109-generic.

  The root cause is likely some interaction between bug 1933172 and this
  bug, which leads to EXTENT_BUFFER_CORRUPT being set, or the incorrect
  first_key being set for the root node, which means we end up returning
  -EUCLEAN and aborting the transaction.

  Unfortunately, this is the second bad backport of CVE-2019-19036.

  The fix is to revert "btrfs: Validate child tree block's level and
  first key" and its dependency commit "btrfs: Detect unbalanced tree
  with empty leaf before crashing btree operations", and re-apply
  correct backports of these commits with that if statement removed, to
  keep the spirit of the already applied "btrfs: Always try all copies
  when reading extent buffers".

  We will also add the below commit as it is a fixup commit for "btrfs:
  Validate child tree block's level and first key" to correct usage of
  the wrong "key" variable, which also leads to similar symptoms of
  -EUCLEAN being returned from verify_level_key().

  commit 17515f1b764df36271f3166c714f5a78301fbaa7
  Author: Qu Wenruo <w...@suse.com>
  Date:   Mon Apr 23 17:32:04 2018 +0800
  Subject: btrfs: Fix wrong first_key parameter in replace_path
  Link: 
https://github.com/torvalds/linux/commit/17515f1b764df36271f3166c714f5a78301fbaa7

  [Testcase]

  Unfortunately, the customer did not image the affected filesystem and
  has since restored a backup ontop of it.

  I have been attempting to reproduce this issue for some time, but I
  have not experienced the same call trace. I ran into bug 1933172 while
  trying to reproduce this bug.

  I have been trying to balance nearly full btrfs filesystems, and I
  have looped xfstests btrfs/124 for hours attempting to trigger "btrfs:
  Always try all copies when reading extent buffers" but I haven't
  experienced a crash yet.

  I have a test kernel in the following ppa:

  https://launchpad.net/~mruffell/+archive/ubuntu/sf311164-test-2

  If you install it, balances still complete as expected.

  I will keep attempting to reproduce the issue, and will update this
  section if I manage to create a testcase.

  [Where problems could occur]

  If a regression were to occur, it would affect users of btrfs
  filesystems, and would likely show during a routine balance operation.

  I believe affected users would have nearly full filesystems, and would
  also experience filesystem corruption from bug 1933172, which would
  then cause the issues from this bug when the transaction log is
  written to disk.

  With all modifications to btrfs, there is a risk of data corruption
  and filesystem corruption for all btrfs users, since balances happen
  automatically and on a regular basis. If a regression does happen,
  users should remount their filesystems with the "nobalance" flag,
  backup their data, and attempt a repair if necessary.

  [Other info]

  A community member has hit this issue before, and has reported it
  upstream to linux-btrfs here, although they never received a reply.

  https://www.spinics.net/lists/linux-btrfs/msg112261.html

  I have written to Richard and asked if he has any additional
  information that might help reproduce, but I have yet to receive a
  reply.

  If you read Richard's mailing list link, it mentions filesystem
  corruption with missing extents. This suggests this crash might be
  linked to bug 1933172, which I came across while trying to reproduce
  the issue in this bug.

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1934709/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp

[Kernel-packages] [Bug 1934709] [NEW] btrfs: Automatic balance returns -EUCLEAN and leads to forced readonly filesystem

Reply via email to