Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Andrei Borzenkov
18.09.2018 22:11, Austin S. Hemmelgarn wrote:
> On 2018-09-18 14:38, Andrei Borzenkov wrote:
>> 18.09.2018 21:25, Austin S. Hemmelgarn wrote:
>>> On 2018-09-18 14:16, Andrei Borzenkov wrote:
 18.09.2018 08:37, Chris Murphy wrote:
> On Mon, Sep 17, 2018 at 11:24 PM, Andrei Borzenkov
>  wrote:
>> 18.09.2018 07:21, Chris Murphy wrote:
>>> On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy
>>>  wrote:
>> ...
>>>
>>> There are a couple of reserve locations in Btrfs at the start and I
>>> think after the first superblock, for bootloader embedding. Possibly
>>> one or both of those areas could be used for this so it's outside
>>> the
>>> file system. But other implementations are going to run into this
>>> problem too.
>>>
>>
>> That's what SUSE's grub2 version does - it includes patches to redirect
>> writes on btrfs to a reserved area. I am not sure how it behaves in the
>> case of multi-device btrfs though.
>
> The patches aren't upstream yet? Will they be?
>

 I do not know. Personally I think it is much easier to make the grub
 location independent of /boot, allowing grub to be installed in a separate
 partition. That automatically covers all the other cases (MD, LVM, etc.).
>>> It actually is independent of /boot already.  I've got it running just
>>> fine on my laptop off of the EFI system partition (which is independent
>>> of my /boot partition), and thus have no issues with handling of the
>>> grubenv file.  The problem is that all the big distros assume you want
>>> it in /boot, so they have no option for putting it anywhere else.
>>>
>>
>> This requires more than just an explicit --boot-directory. With the
>> current monolithic configuration file listing all available kernels, this
>> file cannot be in the same location; it must live together with the
>> kernels (think about rolling back to a snapshot with completely different
>> content). Otherwise some different, more flexible configuration is needed.
> Uh, no, it doesn't need to be with the kernels.

It does not need to be *with* the kernels, but it must match the content of
/boot if you want to allow booting from multiple subvolumes (or even
partitions) using the same grub instance. The most obvious case is the
snapper rollback used by SUSE. You still have a single instance of the
bootloader, but multiple subvolumes with different kernels. So somehow the
bootloader must know which kernels to offer depending on which subvolume
you select.


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Duncan
Chris Murphy posted on Tue, 18 Sep 2018 13:34:14 -0600 as excerpted:

> I've run into an issue where grub2-mkconfig and grubby can change the
> grub.cfg, and then do a really fast reboot without cleanly unmounting
> the volume - and what happens? Can't boot. The bootloader can't do log
> replay, so it doesn't see the new grub.cfg at all. If all you do is mount
> the volume and unmount it, log replay happens, the fs metadata is all
> fixed up just fine, and now the bootloader can see it.
> This same problem can happen with the kernel and initramfs
> installations.
> 
> (Hilariously the reason why this can happen is because of a process
> exempting itself from being forcibly killed by systemd *against* the
> documented advice of systemd devs that you should only do this for
> processes not on rootfs; but as a consequence of this process doing the
> wrong thing, systemd at reboot time ends up doing an unclean unmount and
> reboot because it won't kill the kill exempt process.)

That's... interesting!

FWIW here I use grub2, but like many admins I'm quite comfortable with 
bash, and the high-level grub2 config mechanisms simply didn't let me do 
what I needed to do.  So I had to learn the lower-level grub bash-like 
scripting language to do what I wanted, and I even go so far as to 
install-mask some of the higher-level stuff so it doesn't get installed 
at all, and thus can't somehow run and screw up my config.

So I edit my grub scripts (and grubenv) much like I'd edit any other 
system script (and its separate config file where I have them) I might 
need to update, then save my work, and with both a bios-boot partition 
setup for grub-core and an entirely separate /boot that's not routinely 
mounted unless I'm updating it, I normally unmount it when I'm done, 
before I actually reboot.

So I've never had systemd interfere.

(And of course I have backups.  In fact, on my main personal system, with 
both the working root and its primary backup being btrfs pair-device 
raid1 on separate devices, I have four physical ssds installed, with a 
bios-boot partition with grub installed and a separate dedicated (btrfs 
dup mode) /boot on each of all four, so I have a working grub and /boot 
and three backups, each of which I can point the bios at and have tested 
separately as bootable.  So if upgrading grub or anything on /boot goes 
wrong I find that out testing the working copy, and boot one of the 
backups to resolve the problem before eventually upgrading all three 
backups after the working copy upgrade is well tested.)

> So *already* we have file systems that are becoming too complicated for
> the bootloader to reliably read, because they cannot do journal replay,
> let alone have any chance of modifying them (nor would I want them to do
> this). So yeah, I'm very rapidly becoming opposed to grubenv on anything
> but super simple volumes like maybe ext4 without a journal (extents are
> nice); or even perhaps GRUB should just implement its own damn file
> system and we give it its own partition - similar to BIOS Boot - but
> probably a little bigger

You realize that solution is already standardized as EFI and its standard 
FAT filesystem, right?

=:^)

>>> but is the bootloader overwrite of grubenv going to recompute parity
>>> and write to multiple devices? Eek!
>>
>> Recomputing the parity should not be a big deal. Updating all the
>> (b)trees would be too complex a goal.
> 
> I think it's just asking for trouble. Sometimes the best answer ends up
> being no, no and definitely no.

Agreed.  I actually /like/ the fact that at the grub prompt I can rely on 
everything being read-only, and if that SuSE patch to put grubenv in the 
reserved space and make it writable gets upstreamed, I really hope 
there's a build-time configure option to disable the feature, because IMO 
grub doesn't /need/ to save state at that point, and allowing it to do so 
is effectively needlessly playing a risky Russian Roulette game with my 
storage devices.  Were it actually needed that'd be different, but it's 
not needed, so any risk is too much risk.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman



Re: [patch] file dedupe (and maybe clone) data corruption (was Re: [PATCH] generic: test for deduplication between different files)

2018-09-18 Thread Zygo Blaxell
On Mon, Sep 10, 2018 at 07:06:46PM +1000, Dave Chinner wrote:
> On Thu, Sep 06, 2018 at 11:53:06PM -0400, Zygo Blaxell wrote:
> > On Thu, Sep 06, 2018 at 06:38:09PM +1000, Dave Chinner wrote:
> > > On Fri, Aug 31, 2018 at 01:10:45AM -0400, Zygo Blaxell wrote:
> > > > On Thu, Aug 30, 2018 at 04:27:43PM +1000, Dave Chinner wrote:
> > > > > On Thu, Aug 23, 2018 at 08:58:49AM -0400, Zygo Blaxell wrote:
> > > > For future development I've abandoned the entire dedupe_file_range
> > > > approach.  I need to be able to read and dedupe the data blocks of
> > > > the filesystem directly without having to deal with details like which
> > > > files those blocks belong to, especially on filesystems with lots of
> > > > existing deduped blocks and snapshots. 
> > > 
> > > IOWs, your desired OOB dedupe algorithm is:
> > > 
> > >   a) ask the filesystem where all its file data is
> > 
> > Actually, it's "ask the filesystem where all the *new* file data is"
> > since we don't want to read any unique data twice on subsequent runs.
> 
> Sorry, how do you read "unique data" twice? By definition, unique
> data only occurs once

...but once it has been read, we don't want to read it again.  Ever.
Even better would be to read unique data less than 1.0 times on average.

> Oh, and you still need to hash the old data so you can find
> collisions with the new data that got written. Unless, of course,
> you are keeping your hash tree in a persistent database 

I do that.

> and can work out how to prune stale entries out of it efficiently

I did that first.

Well, more like I found that even a bad algorithm can still find
most of the duplicate data in a typical filesystem, and there's a
steep diminishing returns curve the closer you get to 100% efficiency.
So I just used a bad algorithm (random drop with a bias toward keeping
hashes that matched duplicate blocks).  There's room to improve that,
but the possible gains are small, so it's at least #5 on the performance
whack-a-mole list and probably lower.

The randomness means each full-filesystem sweep finds a different subset
of duplicates, so I can arbitrarily cut hash table size in half and get
almost all of the match rate back by doing two full scans.  Or I cut
the filesystem up into a few large pieces and feed the pieces through in
different orders on different scan runs, so different subsets of data in
the hash table meet different subsets of data on disk during each scan.
An early prototype of bees worked that way, but single-digit efficiency
gains were not worth doubling iops, so I stopped.
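The "random drop with a bias toward keeping hashes that matched duplicate blocks" policy can be sketched as a tiny standalone simulation. The table size and the keep-bias ratio below are illustrative assumptions, not bees' actual tuning:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TABLE_SIZE 1024              /* fixed RAM budget: slot count */

struct entry {
    uint64_t hash;                   /* block hash */
    int      matched;                /* has this hash ever found a dup? */
    int      used;
};

static struct entry table[TABLE_SIZE];

/* Insert a block hash under a constant-RAM constraint.  On a slot
 * collision the loser is chosen at random, biased so that entries
 * which already matched duplicate blocks survive more often. */
static void insert_hash(uint64_t hash)
{
    struct entry *e = &table[hash % TABLE_SIZE];

    if (e->used && e->hash == hash) {
        e->matched = 1;              /* duplicate block found */
        return;
    }
    if (e->used) {
        /* assumed bias: keep a matched entry 3 times out of 4,
         * an unmatched one only half the time */
        int keep_odds = e->matched ? 4 : 2;
        if (rand() % keep_odds)
            return;                  /* keep old entry, drop new hash */
    }
    e->hash = hash;
    e->matched = 0;
    e->used = 1;
}
```

Because the evictions are random, each sweep with a half-sized table keeps a different surviving subset, which is why the match rate largely recovers across multiple scans.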

> [...]I thought that "details omitted for
> reasons of brevity" would be understood, not require omitted details
> to be explained to me.

Sorry.  I don't know what you already know.

> > Bees also operates under a constant-RAM constraint, so it doesn't operate
> > in two distinct "collect data" and "act on data collected" passes,
> > and cannot spend memory to store data about more than a few extents at
> > any time.
> 
> I suspect that I'm thinking at a completely different scale to you.
> I don't really care for highly constrained or optimal dedupe
> algorithms  because those last few dedupe percentages really don't
> matter that much to me. 

At large scales RAM is always constrained.  It's the dedupe triangle of
RAM, iops, and match hit rate--any improvement in one comes at the cost
of the others.  Any dedupe can go faster or use less RAM by raising the
block size or partitioning the input data set to make it smaller.

bees RAM usage is a bit more explicitly controlled--the admin tells bees
how much RAM to use, and bees scales the other parameters to fit that.
Other dedupe engines make the admin do math to set parameters to avoid
overflowing RAM with dynamic memory allocations, or leave the admin to
discover what their RAM constraint is the hard way.
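The "admin tells bees how much RAM to use" model amounts to deriving the other parameters from the budget. A rough sketch, where the entry size and dedupe block size are assumed constants for illustration rather than bees' real values:

```c
#include <assert.h>
#include <stdint.h>

/* Derive dedupe parameters from an admin-supplied RAM budget. */
struct dedupe_params {
    uint64_t table_entries;          /* hash table capacity */
    uint64_t indexable_bytes;        /* unique data the table can cover */
};

static struct dedupe_params params_for_ram(uint64_t ram_bytes)
{
    const uint64_t entry_size = 16;          /* assumed: hash + address */
    const uint64_t block_size = 128 * 1024;  /* assumed dedupe block */
    struct dedupe_params p;

    p.table_entries   = ram_bytes / entry_size;
    p.indexable_bytes = p.table_entries * block_size;
    return p;
}
```

With a 1 GiB budget and these constants the table holds 64 Mi entries and indexes about 8 TiB of unique data; raising the block size raises coverage at the cost of match granularity, which is the RAM/iops/hit-rate triangle in miniature.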

One big difference I am noticing in our approaches is latency.  ZFS (and
in-kernel btrfs dedupe) provides minimal dedupe latency (duplicate
data occupies disk space for zero time as it is never written to disk
at all) but it requires more RAM for a given dedupe hit rate than any
other dedupe implementation I've seen.  What you've written tells me
XFS saves RAM by partitioning the data and relying on an existing but
very large source of iops (sharing scrub reads with dedupe), but then
the dedupe latency is the same as the scrub interval (the worst so far).
bees aims to have latency of a few minutes (ideally scanning data while
it's still dirty in cache, but there's no good userspace API for that)
though it's obviously not there yet.

> I care much more about using all the
> resources we can and running as fast as we possibly can, then
> providing the admin with means to throttle performance back to what
> they need.
> 
> i.e. I'm concerned about how to effectively scan and dedupe PBs of
> data, where scan rates may need to be measured in tens of GB/s.  

My targets are only one order of magnitude smaller--in the limit, I 

Re: btrfs panic problem

2018-09-18 Thread Qu Wenruo


On 2018/9/19 8:35 AM, sunny.s.zhang wrote:
> 
> On 2018-09-19 08:05, Qu Wenruo wrote:
>>
>> On 2018/9/18 8:28 AM, sunny.s.zhang wrote:
>>> Hi All,
>>>
>>> My OS (kernel 4.1.12) panics in kmem_cache_alloc, which is called by
>>> btrfs_get_or_create_delayed_node.
>> Any reproducer?
>>
>> Anyway we need a reproducer as a testcase.
> 
> I have given it a try, but could not reproduce it yet.

Since it's just one hit in a production environment, I'm afraid we need to
inject some sleep or delay into this code and try bombing it with fsstress.

That said, I have no good idea how to reproduce it.

Thanks,
Qu

> 
> Any advice to reproduce it?
> 
>>
>> The code looks
>>
>>> I found that the freelist of the slub is wrong.
>>>
>>> crash> struct kmem_cache_cpu 887e7d7a24b0
>>>
>>> struct kmem_cache_cpu {
>>>    freelist = 0x2026,   <<< the value is id of one inode
>>>    tid = 29567861,
>>>    page = 0xea0132168d00,
>>>    partial = 0x0
>>> }
>>>
>>> Also, I found two different btrfs inodes pointing to the same
>>> delayed_node. It means that the same slab object is used twice.
>>>
>>> I think this object was freed twice, so the freelist next pointer of
>>> this object points to itself and we get the same object twice.
>>>
>>> When this object is used again, that breaks the freelist.
>>>
>>> The following code can make the delayed node be freed twice, but I have
>>> not found which process does this.
>>>
>>> Process A (btrfs_evict_inode) Process B
>>>
>>> call btrfs_remove_delayed_node call  btrfs_get_delayed_node
>>>
>>> node = ACCESS_ONCE(btrfs_inode->delayed_node);
>>>
>>> BTRFS_I(inode)->delayed_node = NULL;
>>> btrfs_release_delayed_node(delayed_node);
>>>
>>> if (node) {
>>> atomic_inc(&node->refs);
>>> return node;
>>> }
>>>
>>> ..
>>>
>>> btrfs_release_delayed_node(delayed_node);
>>>
>>>
>>> 1313 void btrfs_remove_delayed_node(struct inode *inode)
>>> 1314 {
>>> 1315 struct btrfs_delayed_node *delayed_node;
>>> 1316
>>> 1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
>>> 1318 if (!delayed_node)
>>> 1319 return;
>>> 1320
>>> 1321 BTRFS_I(inode)->delayed_node = NULL;
>>> 1322 btrfs_release_delayed_node(delayed_node);
>>> 1323 }
>>>
>>>
>>>    87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct
>>> inode *inode)
>>>    88 {
>>>    89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
>>>    90 struct btrfs_root *root = btrfs_inode->root;
>>>    91 u64 ino = btrfs_ino(inode);
>>>    92 struct btrfs_delayed_node *node;
>>>    93
>>>    94 node = ACCESS_ONCE(btrfs_inode->delayed_node);
>>>    95 if (node) {
>>>    96 atomic_inc(&node->refs);
>>>    97 return node;
>>>    98 }
>>>
>> The analysis looks valid.
>> Can be fixed by adding a spinlock.
>>
>> Just wondering why we didn't hit it.
> 
> It just appeared once in our production environment.
> 
> Thanks,
> Sunny
>>
>> Thanks,
>> Qu
>>
>>> Thanks,
>>>
>>> Sunny
>>>
>>>
>>> PS:
>>>
>>> 
>>>
>>> panic informations
>>>
>>> PID: 73638  TASK: 887deb586200  CPU: 38  COMMAND: "dockerd"
>>>   #0 [88130404f940] machine_kexec at 8105ec10
>>>   #1 [88130404f9b0] crash_kexec at 811145b8
>>>   #2 [88130404fa80] oops_end at 8101a868
>>>   #3 [88130404fab0] no_context at 8106ea91
>>>   #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
>>>   #5 [88130404fb50] bad_area_nosemaphore at 8106eda3
>>>   #6 [88130404fb60] __do_page_fault at 8106f328
>>>   #7 [88130404fbd0] do_page_fault at 8106f637
>>>   #8 [88130404fc10] page_fault at 816f6308
>>>  [exception RIP: kmem_cache_alloc+121]
>>>  RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
>>>  RAX:   RBX:   RCX: 01c32b76
>>>  RDX: 01c32b75  RSI:   RDI: 000224b0
>>>  RBP: 88130404fd08   R8: 887e7d7a24b0   R9: 
>>>  R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
>>>  R13: 2026  R14: 887e3e230a00  R15: a01abf49
>>>  ORIG_RAX:   CS: 0010  SS: 0018
>>>   #9 [88130404fd10] btrfs_get_or_create_delayed_node at
>>> a01abf49 [btrfs]
>>> #10 [88130404fd60] btrfs_delayed_update_inode at a01aea12
>>> [btrfs]
>>> #11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs]
>>> #12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs]
>>> #13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs]
>>> #14 [88130404fe50] touch_atime at 812286d3
>>> #15 [88130404fe90] iterate_dir at 81221929
>>> #16 [88130404fee0] sys_getdents64 at 81221a19
>>> #17 [88130404ff50] system_call_fastpath at 816f2594
>>>  RIP: 006b68e4  RSP: 00c866259080  RFLAGS: 0246
>>>  RAX: 

Re: btrfs panic problem

2018-09-18 Thread sunny.s.zhang



On 2018-09-19 08:05, Qu Wenruo wrote:


On 2018/9/18 8:28 AM, sunny.s.zhang wrote:

Hi All,

My OS (kernel 4.1.12) panics in kmem_cache_alloc, which is called by
btrfs_get_or_create_delayed_node.

Any reproducer?

Anyway we need a reproducer as a testcase.


I have given it a try, but could not reproduce it yet.

Any advice to reproduce it?



The code looks


I found that the freelist of the slub is wrong.

crash> struct kmem_cache_cpu 887e7d7a24b0

struct kmem_cache_cpu {
   freelist = 0x2026,   <<< the value is id of one inode
   tid = 29567861,
   page = 0xea0132168d00,
   partial = 0x0
}

Also, I found two different btrfs inodes pointing to the same
delayed_node. It means that the same slab object is used twice.

I think this object was freed twice, so the freelist next pointer of
this object points to itself and we get the same object twice.

When this object is used again, that breaks the freelist.

The following code can make the delayed node be freed twice, but I have
not found which process does this.

Process A (btrfs_evict_inode) Process B

call btrfs_remove_delayed_node call  btrfs_get_delayed_node

node = ACCESS_ONCE(btrfs_inode->delayed_node);

BTRFS_I(inode)->delayed_node = NULL;
btrfs_release_delayed_node(delayed_node);

if (node) {
atomic_inc(&node->refs);
return node;
}

..

btrfs_release_delayed_node(delayed_node);


1313 void btrfs_remove_delayed_node(struct inode *inode)
1314 {
1315 struct btrfs_delayed_node *delayed_node;
1316
1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
1318 if (!delayed_node)
1319 return;
1320
1321 BTRFS_I(inode)->delayed_node = NULL;
1322 btrfs_release_delayed_node(delayed_node);
1323 }


   87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct
inode *inode)
   88 {
   89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
   90 struct btrfs_root *root = btrfs_inode->root;
   91 u64 ino = btrfs_ino(inode);
   92 struct btrfs_delayed_node *node;
   93
   94 node = ACCESS_ONCE(btrfs_inode->delayed_node);
   95 if (node) {
   96 atomic_inc(&node->refs);
   97 return node;
   98 }


The analysis looks valid.
Can be fixed by adding a spinlock.

Just wondering why we didn't hit it.


It just appeared once in our production environment.

Thanks,
Sunny


Thanks,
Qu


Thanks,

Sunny


PS:



panic informations

PID: 73638  TASK: 887deb586200  CPU: 38  COMMAND: "dockerd"
  #0 [88130404f940] machine_kexec at 8105ec10
  #1 [88130404f9b0] crash_kexec at 811145b8
  #2 [88130404fa80] oops_end at 8101a868
  #3 [88130404fab0] no_context at 8106ea91
  #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
  #5 [88130404fb50] bad_area_nosemaphore at 8106eda3
  #6 [88130404fb60] __do_page_fault at 8106f328
  #7 [88130404fbd0] do_page_fault at 8106f637
  #8 [88130404fc10] page_fault at 816f6308
     [exception RIP: kmem_cache_alloc+121]
     RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
     RAX:   RBX:   RCX: 01c32b76
     RDX: 01c32b75  RSI:   RDI: 000224b0
     RBP: 88130404fd08   R8: 887e7d7a24b0   R9: 
     R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
     R13: 2026  R14: 887e3e230a00  R15: a01abf49
     ORIG_RAX:   CS: 0010  SS: 0018
  #9 [88130404fd10] btrfs_get_or_create_delayed_node at
a01abf49 [btrfs]
#10 [88130404fd60] btrfs_delayed_update_inode at a01aea12
[btrfs]
#11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs]
#12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs]
#13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs]
#14 [88130404fe50] touch_atime at 812286d3
#15 [88130404fe90] iterate_dir at 81221929
#16 [88130404fee0] sys_getdents64 at 81221a19
#17 [88130404ff50] system_call_fastpath at 816f2594
     RIP: 006b68e4  RSP: 00c866259080  RFLAGS: 0246
     RAX: ffda  RBX: 00c828dbbe00  RCX: 006b68e4
     RDX: 1000  RSI: 00c83da14000  RDI: 0011
     RBP:    R8:    R9: 
     R10:   R11: 0246  R12: 00c7
     R13: 02174e74  R14: 0555  R15: 0038
     ORIG_RAX: 00d9  CS: 0033  SS: 002b


We also found list double-add warnings, for both n_list and p_list:

[8642921.110568] [ cut here ]
[8642921.167929] WARNING: CPU: 38 PID: 73638 at lib/list_debug.c:33
__list_add+0xbe/0xd0()
[8642921.263780] list_add corruption. prev->next should be next
(887e40fa5368), but was ff:ff884c85a36288. 

Re: [PATCH 32/36] btrfs: clear delayed_refs_rsv for dirty bg cleanup

2018-09-18 Thread Omar Sandoval
On Tue, Sep 11, 2018 at 01:58:03PM -0400, Josef Bacik wrote:
> We keep track of dirty bgs as a reservation in the delayed_refs_rsv, so
> when we abort and clean up those dirty bgs we need to drop their
> reservation so we don't have accounting issues and lots of scary
> messages on umount.

Shouldn't this just be part of patch 6?

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/disk-io.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index caaca8154a1a..54fbdc944a3f 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -4412,6 +4412,7 @@ void btrfs_cleanup_dirty_bgs(struct btrfs_transaction 
> *cur_trans,
>  
>   spin_unlock(&cur_trans->dirty_bgs_lock);
>   btrfs_put_block_group(cache);
> + btrfs_delayed_refs_rsv_release(fs_info, 1);
>   spin_lock(&cur_trans->dirty_bgs_lock);
>   }
>   spin_unlock(&cur_trans->dirty_bgs_lock);
> -- 
> 2.14.3
> 


Re: [PATCH 33/36] btrfs: only free reserved extent if we didn't insert it

2018-09-18 Thread Omar Sandoval
On Tue, Sep 11, 2018 at 01:58:04PM -0400, Josef Bacik wrote:
> When we insert the file extent once the ordered extent completes we free
> the reserved extent reservation as it'll have been migrated to the
> bytes_used counter.  However if we error out after this step we'll still
> clear the reserved extent reservation, resulting in a negative
> accounting of the reserved bytes for the block group and space info.
> Fix this by only doing the free if we didn't successfully insert a file
> extent for this extent.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/inode.c | 10 +-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 60bcad901857..fd6ade4680b5 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2992,6 +2992,7 @@ static int btrfs_finish_ordered_io(struct 
> btrfs_ordered_extent *ordered_extent)
>   bool truncated = false;
>   bool range_locked = false;
>   bool clear_new_delalloc_bytes = false;
> + bool clear_reserved_extent = true;
>  
>   if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
>   !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags) &&
> @@ -3095,10 +3096,12 @@ static int btrfs_finish_ordered_io(struct 
> btrfs_ordered_extent *ordered_extent)
>   logical_len, logical_len,
>   compress_type, 0, 0,
>   BTRFS_FILE_EXTENT_REG);
> - if (!ret)
> + if (!ret) {
> + clear_reserved_extent = false;
>   btrfs_release_delalloc_bytes(fs_info,
>ordered_extent->start,
>ordered_extent->disk_len);
> + }
>   }
>   unpin_extent_cache(&BTRFS_I(inode)->extent_tree,
>  ordered_extent->file_offset, ordered_extent->len,
> @@ -3159,8 +3162,13 @@ static int btrfs_finish_ordered_io(struct 
> btrfs_ordered_extent *ordered_extent)
>* wrong we need to return the space for this ordered extent
>* back to the allocator.  We only free the extent in the
>* truncated case if we didn't write out the extent at all.
> +  *
> +  * If we made it past insert_reserved_file_extent before we
> +  * errored out then we don't need to do this as the accounting
> +  * has already been done.
>*/
>   if ((ret || !logical_len) &&
> + clear_reserved_extent &&
>   !test_bit(BTRFS_ORDERED_NOCOW, &ordered_extent->flags) &&
>   !test_bit(BTRFS_ORDERED_PREALLOC, &ordered_extent->flags))
>   btrfs_free_reserved_extent(fs_info,
> -- 
> 2.14.3
> 


Re: [PATCH 17/36] btrfs: loop in inode_rsv_refill

2018-09-18 Thread Omar Sandoval
On Tue, Sep 11, 2018 at 01:57:48PM -0400, Josef Bacik wrote:
> With severe fragmentation we can end up with our inode rsv size being
> huge during writeout, which would cause us to need to make very large
> metadata reservations.  However we may not actually need that much once
> writeout is complete.  So instead try to make our reservation, and if we
> couldn't make it re-calculate our new reservation size and try again.
> If our reservation size doesn't change between tries then we know we are
> actually out of space and can error out.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 19 +--
>  1 file changed, 17 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 57567d013447..e43834380ce6 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -5790,10 +5790,11 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode 
> *inode,
>  {
>   struct btrfs_root *root = inode->root;
>   struct btrfs_block_rsv *block_rsv = &inode->block_rsv;
> - u64 num_bytes = 0;
> + u64 num_bytes = 0, last = 0;
>   u64 qgroup_num_bytes = 0;
>   int ret = -ENOSPC;
>  
> +again:
>   spin_lock(&block_rsv->lock);
>   if (block_rsv->reserved < block_rsv->size)
>   num_bytes = block_rsv->size - block_rsv->reserved;
> @@ -5818,8 +5819,22 @@ static int btrfs_inode_rsv_refill(struct btrfs_inode 
> *inode,
>   spin_lock(&block_rsv->lock);
>   block_rsv->qgroup_rsv_reserved += qgroup_num_bytes;
>   spin_unlock(&block_rsv->lock);
> - } else
> + } else {
>   btrfs_qgroup_free_meta_prealloc(root, qgroup_num_bytes);
> +
> + /*
> +  * If we are fragmented we can end up with a lot of outstanding
> +  * extents which will make our size be much larger than our
> +  * reserved amount.  If we happen to try to do a reservation
> +  * here that may result in us trying to do a pretty hefty
> +  * reservation, which we may not need once delalloc flushing
> +  * happens.  If this is the case try and do the reserve again.
> +  */
> + if (flush == BTRFS_RESERVE_FLUSH_ALL && last != num_bytes) {

Is there any point in retrying the reservation if num_bytes didn't
change? As this is written, we will:

1. Calculate num_bytes
2. Try reservation, say it fails
3. Recalculate num_bytes, say it doesn't change
4. Retry the reservation anyways, and it fails again

Maybe we should check if it changed before we retry the reservation? So
then we'd have

1. Calculate num_bytes
2. Try reservation, fails
3. Recalculate num_bytes, it doesn't change, bail out

Also, is it possible that num_bytes > last because of other operations
happening at the same time, and should we still retry in that case?

> + last = num_bytes;
> + goto again;
> + }
> + }
>   return ret;
>  }
>  
> -- 
> 2.14.3
> 
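The ordering suggested in the review, recalculating first and bailing out once the size stops shrinking, can be sketched as a userspace simulation. The stub reservation and recalculation functions below are illustrative stand-ins, not btrfs code:

```c
#include <assert.h>
#include <stdint.h>

#define SIM_ENOSPC 28

/* Stub reservation: pretend anything above this threshold fails. */
static uint64_t reservable_limit;

static int try_reserve(uint64_t num_bytes)
{
    return num_bytes <= reservable_limit ? 0 : -SIM_ENOSPC;
}

/* Stub recalculation: the needed size shrinks as delalloc flushing
 * completes, until it bottoms out and stops changing. */
static uint64_t recalc_needed(uint64_t prev)
{
    return prev > 256 ? prev / 2 : prev;
}

/* Retry a reservation, but recalculate *before* deciding to retry,
 * and bail out as soon as the recalculated size stops changing. */
static int rsv_refill(uint64_t needed, int *attempts)
{
    uint64_t last;

    for (;;) {
        (*attempts)++;
        if (try_reserve(needed) == 0)
            return 0;
        last = needed;
        needed = recalc_needed(needed);
        if (needed == last)
            return -SIM_ENOSPC;  /* size stable: really out of space */
    }
}
```

This avoids the wasted fourth step in the reviewer's trace: when the recalculated size equals the last one, the loop returns immediately instead of retrying a reservation that is known to fail.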


Re: btrfs panic problem

2018-09-18 Thread Qu Wenruo


On 2018/9/18 8:28 AM, sunny.s.zhang wrote:
> Hi All,
> 
> My OS (kernel 4.1.12) panics in kmem_cache_alloc, which is called by
> btrfs_get_or_create_delayed_node.

Any reproducer?

Anyway we need a reproducer as a testcase.

The code looks

> 
> I found that the freelist of the slub is wrong.
> 
> crash> struct kmem_cache_cpu 887e7d7a24b0
> 
> struct kmem_cache_cpu {
>   freelist = 0x2026,   <<< the value is id of one inode
>   tid = 29567861,
>   page = 0xea0132168d00,
>   partial = 0x0
> }
> 
> Also, I found two different btrfs inodes pointing to the same
> delayed_node. It means that the same slab object is used twice.
> 
> I think this object was freed twice, so the freelist next pointer of
> this object points to itself and we get the same object twice.
> 
> When this object is used again, that breaks the freelist.
> 
> The following code can make the delayed node be freed twice, but I have
> not found which process does this.
> 
> Process A (btrfs_evict_inode) Process B
> 
> call btrfs_remove_delayed_node call  btrfs_get_delayed_node
> 
> node = ACCESS_ONCE(btrfs_inode->delayed_node);
> 
> BTRFS_I(inode)->delayed_node = NULL;
> btrfs_release_delayed_node(delayed_node);
> 
> if (node) {
> atomic_inc(&node->refs);
> return node;
> }
> 
> ..
> 
> btrfs_release_delayed_node(delayed_node);
> 
> 
> 1313 void btrfs_remove_delayed_node(struct inode *inode)
> 1314 {
> 1315 struct btrfs_delayed_node *delayed_node;
> 1316
> 1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
> 1318 if (!delayed_node)
> 1319 return;
> 1320
> 1321 BTRFS_I(inode)->delayed_node = NULL;
> 1322 btrfs_release_delayed_node(delayed_node);
> 1323 }
> 
> 
>   87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct
> inode *inode)
>   88 {
>   89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
>   90 struct btrfs_root *root = btrfs_inode->root;
>   91 u64 ino = btrfs_ino(inode);
>   92 struct btrfs_delayed_node *node;
>   93
>   94 node = ACCESS_ONCE(btrfs_inode->delayed_node);
>   95 if (node) {
>   96 atomic_inc(&node->refs);
>   97 return node;
>   98 }
> 

The analysis looks valid.
Can be fixed by adding a spinlock.

Just wondering why we didn't hit it.
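The suggested fix, making the pointer-load-plus-refcount-increment and the pointer clear mutually exclusive, could look roughly like the following userspace sketch. A C11 atomic_flag stands in for the kernel spinlock, the structures are reduced to the fields involved in the race, and the names only mirror the report; this is not a kernel patch:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct delayed_node {
    int refs;                        /* atomic_t in the kernel */
};

struct fake_inode {
    atomic_flag lock;                /* the proposed per-inode spinlock */
    struct delayed_node *delayed_node;
};

static void node_lock(atomic_flag *l)
{
    while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
        ;                            /* spin */
}

static void node_unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

/* Reader path (btrfs_get_delayed_node): the reference is taken while
 * holding the lock, so the remover can no longer drop the last ref
 * between our pointer load and our increment. */
static struct delayed_node *get_delayed_node(struct fake_inode *inode)
{
    struct delayed_node *node;

    node_lock(&inode->lock);
    node = inode->delayed_node;
    if (node)
        node->refs++;                /* atomic_inc(&node->refs) */
    node_unlock(&inode->lock);
    return node;
}

/* Eviction path (btrfs_remove_delayed_node): detach under the same
 * lock; the caller then releases its own reference outside it. */
static struct delayed_node *remove_delayed_node(struct fake_inode *inode)
{
    struct delayed_node *node;

    node_lock(&inode->lock);
    node = inode->delayed_node;
    inode->delayed_node = NULL;
    node_unlock(&inode->lock);
    return node;
}
```

With both paths serialized on the same lock, the interleaving in the report (B loads the pointer, A clears it and drops the last reference, B then increments a freed object) can no longer happen.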

Thanks,
Qu

> 
> Thanks,
> 
> Sunny
> 
> 
> PS:
> 
> 
> 
> panic informations
> 
> PID: 73638  TASK: 887deb586200  CPU: 38  COMMAND: "dockerd"
>  #0 [88130404f940] machine_kexec at 8105ec10
>  #1 [88130404f9b0] crash_kexec at 811145b8
>  #2 [88130404fa80] oops_end at 8101a868
>  #3 [88130404fab0] no_context at 8106ea91
>  #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
>  #5 [88130404fb50] bad_area_nosemaphore at 8106eda3
>  #6 [88130404fb60] __do_page_fault at 8106f328
>  #7 [88130404fbd0] do_page_fault at 8106f637
>  #8 [88130404fc10] page_fault at 816f6308
>     [exception RIP: kmem_cache_alloc+121]
>     RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
>     RAX:   RBX:   RCX: 01c32b76
>     RDX: 01c32b75  RSI:   RDI: 000224b0
>     RBP: 88130404fd08   R8: 887e7d7a24b0   R9: 
>     R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
>     R13: 2026  R14: 887e3e230a00  R15: a01abf49
>     ORIG_RAX:   CS: 0010  SS: 0018
>  #9 [88130404fd10] btrfs_get_or_create_delayed_node at
> a01abf49 [btrfs]
> #10 [88130404fd60] btrfs_delayed_update_inode at a01aea12
> [btrfs]
> #11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs]
> #12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs]
> #13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs]
> #14 [88130404fe50] touch_atime at 812286d3
> #15 [88130404fe90] iterate_dir at 81221929
> #16 [88130404fee0] sys_getdents64 at 81221a19
> #17 [88130404ff50] system_call_fastpath at 816f2594
>     RIP: 006b68e4  RSP: 00c866259080  RFLAGS: 0246
>     RAX: ffda  RBX: 00c828dbbe00  RCX: 006b68e4
>     RDX: 1000  RSI: 00c83da14000  RDI: 0011
>     RBP:    R8:    R9: 
>     R10:   R11: 0246  R12: 00c7
>     R13: 02174e74  R14: 0555  R15: 0038
>     ORIG_RAX: 00d9  CS: 0033  SS: 002b
> 
> 
> We also found list double-add warnings, for both n_list and p_list:
> 
> [8642921.110568] [ cut here ]
> [8642921.167929] WARNING: CPU: 38 PID: 73638 at lib/list_debug.c:33
> __list_add+0xbe/0xd0()
> [8642921.263780] list_add corruption. prev->next should be next
> (887e40fa5368), but 

Re: [PATCH 16/36] btrfs: run delayed iputs before committing

2018-09-18 Thread Omar Sandoval
On Tue, Sep 11, 2018 at 01:57:47PM -0400, Josef Bacik wrote:
> Delayed iputs mean we can have final iputs of deleted inodes in the
> queue, which could potentially generate a lot of pinned space that could
> be freed.  So before we decide to commit the transaction for ENOSPC
> reasons, run the delayed iputs so that any potential space is freed up.
> If there are any and we freed enough, we can then commit the transaction
> and potentially be able to make our reservation.

Reviewed-by: Omar Sandoval 

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 9 +
>  1 file changed, 9 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 76941fc5af79..57567d013447 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4823,6 +4823,15 @@ static int may_commit_transaction(struct btrfs_fs_info 
> *fs_info,
>   if (!bytes)
>   return 0;
>  
> + /*
> +  * If we have pending delayed iputs then we could free up a bunch of
> +  * pinned space, so make sure we run the iputs before we do our pinned
> +  * bytes check below.
> +  */
> + mutex_lock(&fs_info->cleaner_delayed_iput_mutex);
> + btrfs_run_delayed_iputs(fs_info);
> + mutex_unlock(&fs_info->cleaner_delayed_iput_mutex);
> +
>   trans = btrfs_join_transaction(fs_info->extent_root);
>   if (IS_ERR(trans))
>   return -ENOSPC;
> -- 
> 2.14.3
> 


Re: [PATCH 14/36] btrfs: reset max_extent_size properly

2018-09-18 Thread Omar Sandoval
On Tue, Sep 11, 2018 at 01:57:45PM -0400, Josef Bacik wrote:
> If we use up our block group before allocating a new one, we'll easily
> get a max_extent_size that's set really really low, which will result in
> a lot of fragmentation.  We need to make sure we're resetting the
> max_extent_size when we add a new chunk or add new space.
> 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/extent-tree.c | 7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 13441a293c73..44d59bee6e5e 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -4573,6 +4573,7 @@ static int do_chunk_alloc(struct btrfs_trans_handle 
> *trans, u64 flags,
>   goto out;
>   } else {
>   ret = 1;
> + space_info->max_extent_size = 0;
>   }
>  
>   space_info->force_alloc = CHUNK_ALLOC_NO_FORCE;
> @@ -8084,11 +8085,17 @@ static int __btrfs_free_reserved_extent(struct 
> btrfs_fs_info *fs_info,
>   if (pin)
>   pin_down_extent(fs_info, cache, start, len, 1);
>   else {
> + struct btrfs_space_info *space_info = cache->space_info;
> +
>   if (btrfs_test_opt(fs_info, DISCARD))
>   ret = btrfs_discard_extent(fs_info, start, len, NULL,
>   BTRFS_CLEAR_OP_DISCARD);
>   btrfs_add_free_space(cache, start, len);
>   btrfs_free_reserved_bytes(cache, len, delalloc);
> +
> + spin_lock(&space_info->lock);
> + space_info->max_extent_size = 0;
> + spin_unlock(&space_info->lock);
>   trace_btrfs_reserved_extent_free(fs_info, start, len);
>   }

Do we need to do the same for btrfs_free_tree_block()? If so, maybe it
can go in btrfs_free_reserved_bytes() instead?


Re: btrfs panic problem

2018-09-18 Thread sunny.s.zhang

Hi Duncan,

Thank you for your advice. I understand what you mean.  But I have 
reviewed the latest btrfs code, and I think the issue still exists.


At line 71, if btrfs_get_delayed_node has just run past this line and we 
then switch to another process, that process can run past line 1282 and 
release the delayed node at the end.


Then, when we switch back to btrfs_get_delayed_node, it finds that the 
node is not NULL and uses it as normal. That means we are using freed 
memory.


At some point, this memory will be freed again.

The latest code is as below.

1278 void btrfs_remove_delayed_node(struct btrfs_inode *inode)
1279 {
1280 struct btrfs_delayed_node *delayed_node;
1281
1282 delayed_node = READ_ONCE(inode->delayed_node);
1283 if (!delayed_node)
1284 return;
1285
1286 inode->delayed_node = NULL;
1287 btrfs_release_delayed_node(delayed_node);
1288 }


  64 static struct btrfs_delayed_node *btrfs_get_delayed_node(
  65 struct btrfs_inode *btrfs_inode)
  66 {
  67 struct btrfs_root *root = btrfs_inode->root;
  68 u64 ino = btrfs_ino(btrfs_inode);
  69 struct btrfs_delayed_node *node;
  70
  71 node = READ_ONCE(btrfs_inode->delayed_node);
  72 if (node) {
  73 refcount_inc(&node->refs);
  74 return node;
  75 }
  76
  77 spin_lock(&root->inode_lock);
  78 node = radix_tree_lookup(&root->delayed_nodes_tree, ino);


On 2018-09-18 13:05, Duncan wrote:

sunny.s.zhang posted on Tue, 18 Sep 2018 08:28:14 +0800 as excerpted:


My OS(4.1.12) panic in kmem_cache_alloc, which is called by
btrfs_get_or_create_delayed_node.

I found that the freelist of the slub is wrong.

[Not a dev, just a btrfs list regular and user, myself.  But here's a
general btrfs list recommendations reply...]

You appear to mean kernel 4.1.12 -- confirmed by the version reported in
the posted dump:  4.1.12-112.14.13.el6uek.x86_64

OK, so from the perspective of this forward-development-focused list,
kernel 4.1 is pretty ancient history, but you do have a number of options.

First let's consider the general situation.  Most people choose an
enterprise distro for supported stability, and that's certainly a valid
thing to want.  However, btrfs, while now reaching early maturity for the
basics (single device in single or dup mode, and multi-device in single/
raid0/1/10 modes, note that raid56 mode is newer and less mature),
remains under quite heavy development, and keeping reasonably current is
recommended for that reason.

So you chose an enterprise distro, presumably to lock in supported
stability for several years, but you chose a filesystem, btrfs, that's
still under heavy development, with reasonably current kernels and
userspace recommended as tending to have the known bugs fixed.  There's a
bit of a conflict there, and the /general/ recommendation would thus be
to consider whether one or the other of those choices are inappropriate
for your use-case, because it's really quite likely that if you really
want the stability of an enterprise distro and kernel, that btrfs isn't
as stable a filesystem as you're likely to want to match with it.
Alternatively, if you want something newer to match the still under heavy
development btrfs, you very likely want a distro that's not focused on
years-old stability just for the sake of it.  One or the other is likely
to be a poor match for your needs, and choosing something else that's a
better match is likely to be a much better experience for you.

But perhaps you do have reason to want to run the newer and not quite to
traditional enterprise-distro level stability btrfs, on an otherwise
older and very stable enterprise distro.  That's fine, provided you know
what you're getting yourself into, and are prepared to deal with it.

In that case, for best support from the list, we'd recommend running one
of the latest two kernels in either the current or mainline LTS tracks.

For the current track, with 4.18 being the latest kernel, that'd be 4.18 or
4.17, as available on kernel.org (tho 4.17 is already EOL, no further
releases, at 4.17.19).

For mainline-LTS track, 4.14 and 4.9 are the latest two LTS series
kernels, tho IIRC 4.19 is scheduled to be this year's LTS (or was it 4.18
and it's just not out of normal stable range yet so not yet marked LTS?),
so it'll be coming up soon and 4.9 will then be dropping to third LTS
series and thus out of our best recommended range.  4.4 was the previous
LTS and while still in LTS support, is outside the two newest LTS series
that this list recommends.

And of course 4.1 is older than 4.4, so as I said, in btrfs development
terms, it's quite ancient indeed... quite out of practical support range
here, tho of course we'll still try, but in many cases the first question
when any problem's reported is going to be whether it's reproducible on
something closer to current.

But... you ARE on an enterprise kernel, likely on an enterprise distro,
and 

Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Chris Murphy
On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy  wrote:
> https://btrfs.wiki.kernel.org/index.php/FAQ#Does_grub_support_btrfs.3F
>
> Does anyone know if this is still a problem on Btrfs if grubenv has
> xattr +C set? In which case it should be possible to overwrite and
> there's no csums that are invalidated.

I'm wrong.

$ sudo grub2-editenv --verbose grubenv create
[sudo] password for chris:
[chris@f29h ~]$ ll
-rw-r--r--. 1 root  root 1024 Sep 18 13:37 grubenv
[chris@f29h ~]$ stat -f grubenv
  File: "grubenv"
ID: ac9ba8ecdce5b017 Namelen: 255 Type: btrfs
Block size: 4096   Fundamental block size: 4096
Blocks: Total: 46661632   Free: 37479747   Available: 37422535
Inodes: Total: 0  Free: 0
[chris@f29h ~]$ sudo filefrag -v grubenv
Filesystem type is: 9123683e
File size of grubenv is 1024 (1 block of 4096 bytes)
 ext: logical_offset:physical_offset: length:   expected: flags:
   0:0..4095:  0..  4095:   4096:
last,not_aligned,inline,eof
grubenv: 1 extent found
[chris@f29h ~]$

So it's an inline extent, which means nocow doesn't apply. It's
metadata so it *must* be COW. And any overwrite would trigger a
metadata checksum error.

First I'd argue it should refuse to create the file on Btrfs. But if
it does create grubenv, instead it should know that on Btrfs it must
redirect it to the appropriate btrfs reserved area (no idea how this
works) rather than to a file.



-- 
Chris Murphy


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Chris Murphy
On Tue, Sep 18, 2018 at 1:11 PM, Goffredo Baroncelli  wrote:


>> I think it's a problem, and near as I can tell it'll be a problem for
>> all kinds of complex storage. I don't see how the bootloader itself
>> can do an overwrite onto raid5 or raid6.
>
>
>> That's certainly supported by GRUB for reading
> Not yet, I am working on that [1]


Sorry! I meant mdadm raid56. It definitely can read that format for
some time and even degraded! It's pretty cool. But I see no way that
it's sane to have the bootloader write to such a volume.

I've run into issues where grub2-mkconfig and grubby can change
the grub.cfg, and then a really fast reboot happens without cleanly
unmounting the volume - and what happens? Can't boot. The bootloader
can't do log replay so it doesn't see the new grub.cfg at all. If all
you do is mount the volume and unmount, log replay happens, the fs
metadata is all fixed up just fine, and now the bootloader can see it.
This same problem can happen with the kernel and initramfs
installations.

(Hilariously the reason why this can happen is because of a process
exempting itself from being forcibly killed by systemd *against* the
documented advice of systemd devs that you should only do this for
processes not on rootfs; but as a consequence of this process doing
the wrong thing, systemd at reboot time ends up doing an unclean
unmount and reboot because it won't kill the kill exempt process.)

So *already* we have file systems that are becoming too complicated
for the bootloader to reliably read, because they cannot do journal
replay, let alone have any chance of modifying anything (nor would I
want them to). So yeah, I'm very rapidly becoming opposed to grubenv on
anything but super simple volumes like maybe ext4 without a journal
(extents are nice); or even perhaps GRUB should just implement its own
damn file system and we give it its own partition - similar to BIOS
Boot - but probably a little bigger.


>
>> but is the bootloader overwrite of grubenv going to
>> recompute parity and write to multiple devices? Eek!
>
> Recomputing the parity should not be a big deal. Updating all the (b)trees 
> would be too complex a goal.

I think it's just asking for trouble. Sometimes the best answer ends
up being no, no and definitely no.

-- 
Chris Murphy


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Chris Murphy
On Tue, Sep 18, 2018 at 1:01 PM, Andrei Borzenkov  wrote:
> 18.09.2018 21:57, Chris Murphy пишет:
>> On Tue, Sep 18, 2018 at 12:16 PM, Andrei Borzenkov  
>> wrote:
>>> 18.09.2018 08:37, Chris Murphy пишет:
>>
 The patches aren't upstream yet? Will they be?

>>>
>>> I do not know. Personally I think much easier is to make grub location
>>> independent of /boot, allowing grub be installed in separate partition.
>>> This automatically covers all other cases (like MD, LVM etc).
>>
>> The only case where I'm aware of this happens is Fedora on UEFI where
>> they write grubenv and grub.cfg on the FAT ESP. I'm pretty sure
>> upstream expects grubenv and grub.cfg at /boot/grub and I haven't ever
>> seen it elsewhere (except Fedora on UEFI).
>>
>> I'm not sure this is much easier. Yet another volume that would be
>> persistently mounted? Where? A nested mount at /boot/grub? I'm not
>> liking that at all. Even Windows and macOS have saner and simpler to
>> understand booting methods than this.
>>
>>
> That's exactly what Windows ended up with - separate boot volume with
> bootloader related files.

The OEM installer will absolutely install to a single partition. If
you point it to a blank drive on BIOS it will preferentially create a
"system" volume that's used for booting. But it's not mandatory. On
UEFI, it doesn't create a "system" volume, just "recovery" is ~500M
and "reserved" 16M. The reserved partition is blank unless you've done
some resizing on the main volume. The recovery volume contains
Winre.wim which is used for doing resets. If you blow away that
partition, you can still boot, but you can't do resets.

-- 
Chris Murphy


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Goffredo Baroncelli
On 18/09/2018 20.52, Chris Murphy wrote:
> On Tue, Sep 18, 2018 at 11:15 AM, Goffredo Baroncelli
>  wrote:
>> On 18/09/2018 06.21, Chris Murphy wrote:
>>> b. The bootloader code, would have to have sophisticated enough Btrfs
>>> knowledge to know if the grubenv has been reflinked or snapshot,
>>> because even if +C, it may not be valid to overwrite, and COW must
>> still happen, and there's no way the code in GRUB can do full-blown COW
>>> and update a bunch of metadata.
>>
>> And what if GRUB ignores the possibility of COWing and overwrites the data? 
>> Is it such a big problem that the data is changed in all the snapshots?
>> It would be interesting to know if the same problem happens for a swap file.
> 
> I think it's an abomination :-) It totally perverts the idea of
> reflinks and snapshots and blurs the line between domains. 

:-)

> Is it a
> user file or not and are these user space commands or not and are they
> reliable or do they have exceptions?

On this statement I fully agree, on the one below a bit less
> 
> I have a boot subvolume mounted at /boot, and this boot subvolume gets
> snapshot, and if GRUB can overwrite grubenv, it overwrites the
> purported GRUB state information in every one of those boots, going
> back maybe months, even when these are read only subvolumes.

Also the 'SUSE' behavior has the same issue: storing the data somewhere in the 
reserved storage area suffers from the same problem. We should be realistic: 
without implementing a full btrfs filesystem engine, it is nearly impossible to 
have a grubenv file that is visible to the filesystem and snapshot-able.


> 
> I think it's a problem, and near as I can tell it'll be a problem for
> all kinds of complex storage. I don't see how the bootloader itself
> can do an overwrite onto raid5 or raid6. 


> That's certainly supported by GRUB for reading
Not yet, I am working on that [1]

> but is the bootloader overwrite of gruvenv going to
> recompute parity and write to multiple devices? Eek!

Recomputing the parity should not be a big deal. Updating all the (b)trees would 
be too complex a goal.
> 
> 

[1] http://lists.gnu.org/archive/html/grub-devel/2018-06/msg00064.html
-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Austin S. Hemmelgarn

On 2018-09-18 15:00, Chris Murphy wrote:

On Tue, Sep 18, 2018 at 12:25 PM, Austin S. Hemmelgarn
 wrote:


It actually is independent of /boot already.  I've got it running just fine
on my laptop off of the EFI system partition (which is independent of my
/boot partition), and thus have no issues with handling of the grubenv file.
The problem is that all the big distros assume you want it in /boot, so they
have no option for putting it anywhere else.

Actually installing it elsewhere is not hard though, you just pass
`--boot-directory=/wherever` to the `grub-install` script and turn off your
distributions automatic reinstall mechanism so it doesn't get screwed up by
the package manager when the GRUB package gets updated. You can also make
`/boot/grub` a symbolic link pointing to the real GRUB directory, so that
you don't have to pass any extra options to tools like grub-reboot or
grub-set-default.


This is how Fedora builds their signed grubx64.efi to behave. But you
cannot ever run grub-install on a Secure Boot enabled computer, or you
now have to learn all about signing your own binaries. I don't even
like doing that, let alone saner users.

So for those distros that support Secure Boot, in practice you're
stuck with the behavior of their prebuilt GRUB binary that goes on the
ESP.
Agreed, but that avoids the issues we're talking about here completely 
because the grubenv file ends up on the ESP too.




Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Austin S. Hemmelgarn

On 2018-09-18 14:57, Chris Murphy wrote:

On Tue, Sep 18, 2018 at 12:16 PM, Andrei Borzenkov  wrote:

18.09.2018 08:37, Chris Murphy пишет:



The patches aren't upstream yet? Will they be?



I do not know. Personally I think much easier is to make grub location
independent of /boot, allowing grub be installed in separate partition.
This automatically covers all other cases (like MD, LVM etc).


The only case where I'm aware of this happens is Fedora on UEFI where
they write grubenv and grub.cfg on the FAT ESP. I'm pretty sure
upstream expects grubenv and grub.cfg at /boot/grub and I haven't ever
seen it elsewhere (except Fedora on UEFI).

I'm not sure this is much easier. Yet another volume that would be
persistently mounted? Where? A nested mount at /boot/grub? I'm not
liking that at all. Even Windows and macOS have saner and simpler to
understand booting methods than this.
On this front maybe, but Windows' boot sequence is insane in its own 
way (fun fact, if you have the Windows 8/8.1/10 boot-loader set up to 
multi-boot and want it to boot to something other than the default, it 
has to essentially _reboot the machine_ to actually boot that 
alternative entry).


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Austin S. Hemmelgarn

On 2018-09-18 14:38, Andrei Borzenkov wrote:

18.09.2018 21:25, Austin S. Hemmelgarn пишет:

On 2018-09-18 14:16, Andrei Borzenkov wrote:

18.09.2018 08:37, Chris Murphy пишет:

On Mon, Sep 17, 2018 at 11:24 PM, Andrei Borzenkov
 wrote:

18.09.2018 07:21, Chris Murphy пишет:

On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy
 wrote:

...


There are a couple of reserve locations in Btrfs at the start and I
think after the first superblock, for bootloader embedding. Possibly
one or both of those areas could be used for this so it's outside the
file system. But other implementations are going to run into this
problem too.



That's what SUSE grub2 version does - it includes patches to redirect
writes on btrfs to reserved area. I am not sure how it behaves in case
of multi-device btrfs though.


The patches aren't upstream yet? Will they be?



I do not know. Personally I think much easier is to make grub location
independent of /boot, allowing grub be installed in separate partition.
This automatically covers all other cases (like MD, LVM etc).

It actually is independent of /boot already.  I've got it running just
fine on my laptop off of the EFI system partition (which is independent
of my /boot partition), and thus have no issues with handling of the
grubenv file.  The problem is that all the big distros assume you want
it in /boot, so they have no option for putting it anywhere else.



This requires more than just explicit --boot-directory. With current
monolithic configuration file listing all available kernels this file
cannot be in the same location, it must be together with kernels (think
about rollback to snapshot with completely different content). Or some
different, more flexible configuration is needed.
Uh, no, it doesn't need to be with the kernels.  Fedora stores it on the 
ESP separate from the kernels (which are still on the boot partition) if 
you use Secure Boot, and I'm doing the same (without secure boot) 
without issue.  You do have to explicitly set the `root` variable 
correctly in the config to get it to work, though, and the default 
upstream 'easy configuration' arrangement does not do this consistently. 
 It's not too hard to hack in though, and it's positively trivial if 
you just write your own configuration files by hand like I do (no, I'm 
not crazy, the default configuration generator just produces a 
brobdingnagian monstrosity of a config that has tons of stuff I don't 
need and makes invalid assumptions about how I want things invoked, and 
the config syntax is actually not that hard).


As it is now, grub silently assumes everything is under /boot. This turned
out to be oversimplified.
No, it assumes everything is under whatever you told GRUB to set the 
default value of the `prefix` variable to when you built the GRUB image, 
which is automatically set to the path you pass to `--boot-directory` 
when you use grub-install.  This persists until you explicitly set that 
variable to a different location, or change the `root` variable (but 
GRUB still uses `prefix` for module look-ups if you just change the 
`root` variable).
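The `prefix`/`root` split described above can be made concrete with a small hand-written config sketch. Everything below is illustrative only: the device names `(hd0,gpt1)`/`(hd0,gpt2)`, the kernel paths, and the root device are assumptions, not a tested layout.

```
# GRUB's own files (modules, grub.cfg, grubenv) live on their own
# partition; kernels live on a separate /boot partition.

# `prefix` is where GRUB loads modules and its config from; it defaults
# to the path baked in at grub-install time via --boot-directory.
set prefix=(hd0,gpt1)/grub

# `root` is the device that later file references (kernel, initrd)
# resolve against; point it at the partition holding the kernels.
set root=(hd0,gpt2)

menuentry 'Linux' {
    linux /vmlinuz root=/dev/sda3 ro
    initrd /initramfs.img
}
```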


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Chris Murphy
On Tue, Sep 18, 2018 at 12:25 PM, Austin S. Hemmelgarn
 wrote:

> It actually is independent of /boot already.  I've got it running just fine
> on my laptop off of the EFI system partition (which is independent of my
> /boot partition), and thus have no issues with handling of the grubenv file.
> The problem is that all the big distros assume you want it in /boot, so they
> have no option for putting it anywhere else.
>
> Actually installing it elsewhere is not hard though, you just pass
> `--boot-directory=/wherever` to the `grub-install` script and turn off your
> distributions automatic reinstall mechanism so it doesn't get screwed up by
> the package manager when the GRUB package gets updated. You can also make
> `/boot/grub` a symbolic link pointing to the real GRUB directory, so that
> you don't have to pass any extra options to tools like grub-reboot or
> grub-set-default.

This is how Fedora builds their signed grubx64.efi to behave. But you
cannot ever run grub-install on a Secure Boot enabled computer, or you
now have to learn all about signing your own binaries. I don't even
like doing that, let alone saner users.

So for those distros that support Secure Boot, in practice you're
stuck with the behavior of their prebuilt GRUB binary that goes on the
ESP.


-- 
Chris Murphy


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Andrei Borzenkov
18.09.2018 21:57, Chris Murphy пишет:
> On Tue, Sep 18, 2018 at 12:16 PM, Andrei Borzenkov  
> wrote:
>> 18.09.2018 08:37, Chris Murphy пишет:
> 
>>> The patches aren't upstream yet? Will they be?
>>>
>>
>> I do not know. Personally I think much easier is to make grub location
>> independent of /boot, allowing grub be installed in separate partition.
>> This automatically covers all other cases (like MD, LVM etc).
> 
> The only case where I'm aware of this happens is Fedora on UEFI where
> they write grubenv and grub.cfg on the FAT ESP. I'm pretty sure
> upstream expects grubenv and grub.cfg at /boot/grub and I haven't ever
> seen it elsewhere (except Fedora on UEFI).
> 
> I'm not sure this is much easier. Yet another volume that would be
> persistently mounted? Where? A nested mount at /boot/grub? I'm not
> liking that at all. Even Windows and macOS have saner and simpler to
> understand booting methods than this.
> 
> 
That's exactly what Windows ended up with - separate boot volume with
bootloader related files.




Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Chris Murphy
On Tue, Sep 18, 2018 at 12:16 PM, Andrei Borzenkov  wrote:
> 18.09.2018 08:37, Chris Murphy пишет:

>> The patches aren't upstream yet? Will they be?
>>
>
> I do not know. Personally I think much easier is to make grub location
> independent of /boot, allowing grub be installed in separate partition.
> This automatically covers all other cases (like MD, LVM etc).

The only case where I'm aware of this happens is Fedora on UEFI where
they write grubenv and grub.cfg on the FAT ESP. I'm pretty sure
upstream expects grubenv and grub.cfg at /boot/grub and I haven't ever
seen it elsewhere (except Fedora on UEFI).

I'm not sure this is much easier. Yet another volume that would be
persistently mounted? Where? A nested mount at /boot/grub? I'm not
liking that at all. Even Windows and macOS have saner and simpler to
understand booting methods than this.


-- 
Chris Murphy


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Chris Murphy
On Tue, Sep 18, 2018 at 11:15 AM, Goffredo Baroncelli
 wrote:
> On 18/09/2018 06.21, Chris Murphy wrote:
>> b. The bootloader code, would have to have sophisticated enough Btrfs
>> knowledge to know if the grubenv has been reflinked or snapshot,
>> because even if +C, it may not be valid to overwrite, and COW must
>> still happen, and there's no way the code in GRUB can do full-blown COW
>> and update a bunch of metadata.
>
> And what if GRUB ignores the possibility of COWing and overwrites the data? Is 
> it such a big problem that the data is changed in all the snapshots?
> It would be interesting to know if the same problem happens for a swap file.

I think it's an abomination :-) It totally perverts the idea of
reflinks and snapshots and blurs the line between domains. Is it a
user file or not and are these user space commands or not and are they
reliable or do they have exceptions?

I have a boot subvolume mounted at /boot, and this boot subvolume gets
snapshot, and if GRUB can overwrite grubenv, it overwrites the
purported GRUB state information in every one of those boots, going
back maybe months, even when these are read only subvolumes.

I think it's a problem, and near as I can tell it'll be a problem for
all kinds of complex storage. I don't see how the bootloader itself
can do an overwrite onto raid5 or raid6. That's certainly supported by
GRUB for reading, but is the bootloader overwrite of grubenv going to
recompute parity and write to multiple devices? Eek!


-- 
Chris Murphy


Re: btrfs receive incremental stream on another uuid

2018-09-18 Thread Hugo Mills
On Tue, Sep 18, 2018 at 06:28:37PM +, Gervais, Francois wrote:
> > No. It is already possible (by setting received UUID); it should not be
> made too open to easy abuse.
> 
> 
> Do you mean edit the UUID in the byte stream before btrfs receive?

   No, there's an ioctl to change the received UUID of a
subvolume. It's used by receive, at the very end of the receive
operation.

   Messing around in this area is basically a recipe for ending up
with a half-completed send/receive full of broken data because the
receiving subvolume isn't quite as identical as you thought. It
enforces the rules for a reason.

   Now, it's possible to modify the send stream and the logic around
it a bit to support a number of additional modes of operation
(bidirectional send, for example), but that's queued up waiting for
(a) a definitive list of send stream format changes, and (b) David's
bandwidth to put them together in one patch set.

   If you want to see more on the underlying UUID model, and how it
could be (ab)used and modified, there's a write-up here, in a thread
on pretty much exactly the same proposal that you've just made:

https://www.spinics.net/lists/linux-btrfs/msg44089.html

   Hugo.

-- 
Hugo Mills | Great films about cricket: Monster's No-Ball
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4  |




Re: btrfs receive incremental stream on another uuid

2018-09-18 Thread Andrei Borzenkov
18.09.2018 21:28, Gervais, Francois пишет:
>> No. It is already possible (by setting received UUID); it should not be
> made too open to easy abuse.
> 
> 
> Do you mean edit the UUID in the byte stream before btrfs receive?
> 
No, I mean setting received UUID on subvolume. Unfortunately, it is
possible. Fortunately, it is not trivially done.


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Andrei Borzenkov
18.09.2018 21:25, Austin S. Hemmelgarn пишет:
> On 2018-09-18 14:16, Andrei Borzenkov wrote:
>> 18.09.2018 08:37, Chris Murphy пишет:
>>> On Mon, Sep 17, 2018 at 11:24 PM, Andrei Borzenkov
>>>  wrote:
 18.09.2018 07:21, Chris Murphy пишет:
> On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy
>  wrote:
...
>
> There are a couple of reserve locations in Btrfs at the start and I
> think after the first superblock, for bootloader embedding. Possibly
> one or both of those areas could be used for this so it's outside the
> file system. But other implementations are going to run into this
> problem too.
>

 That's what SUSE grub2 version does - it includes patches to redirect
 writes on btrfs to reserved area. I am not sure how it behaves in case
 of multi-device btrfs though.
>>>
>>> The patches aren't upstream yet? Will they be?
>>>
>>
>> I do not know. Personally I think much easier is to make grub location
>> independent of /boot, allowing grub be installed in separate partition.
>> This automatically covers all other cases (like MD, LVM etc).
> It actually is independent of /boot already.  I've got it running just
> fine on my laptop off of the EFI system partition (which is independent
> of my /boot partition), and thus have no issues with handling of the
> grubenv file.  The problem is that all the big distros assume you want
> it in /boot, so they have no option for putting it anywhere else.
> 

This requires more than just explicit --boot-directory. With current
monolithic configuration file listing all available kernels this file
cannot be in the same location, it must be together with kernels (think
about rollback to snapshot with completely different content). Or some
different, more flexible configuration is needed.

As it is now, grub silently assumes everything is under /boot. This turned
out to be oversimplified.

> Actually installing it elsewhere is not hard though, you just pass
> `--boot-directory=/wherever` to the `grub-install` script and turn off
> your distributions automatic reinstall mechanism so it doesn't get
> screwed up by the package manager when the GRUB package gets updated.
> You can also make `/boot/grub` a symbolic link pointing to the real GRUB
> directory, so that you don't have to pass any extra options to tools
> like grub-reboot or grub-set-default.



Re: Move data and mount point to subvolume

2018-09-18 Thread Hans van Kranenburg
On 09/18/2018 08:10 PM, Marc Joliet wrote:
> Am Sonntag, 16. September 2018, 14:50:04 CEST schrieb Hans van Kranenburg:
>> The last example, where you make a subvolume and move everything into
>> it, will not do what you want. Since a subvolume is a separate new
>> directoty/file hierarchy, mv will turn into a cp and rm operation
>> (without warning you) probably destroying information about data shared
>> between files.
> 
> I thought that wasn't true anymore.  The NEWS file to coreutils contains this 
> (for version 8.24):
> 
>   mv will try a reflink before falling back to a standard copy, which is
>   more efficient when moving files across BTRFS subvolume boundaries.

I was indeed wrong when saying that mv will copy/rm between subvolumes,
because you can reflink files between subvolumes, as long as they're
under the same mount point. (You still can NOT reflink between mount
points IIRC, even when they are mounts of the same btrfs.)

But still, mv silently does one or the other, which is also confusing,
because if it starts copying/removing things when that was not your
intention, at first you're like "hm, this takes longer than I thought"
and then "oh wait, no", and then it's too late already.

-- 
Hans van Kranenburg


Re: btrfs receive incremental stream on another uuid

2018-09-18 Thread Gervais, Francois
> No. It is already possible (by setting received UUID); it should not be
made too open to easy abuse.


Do you mean edit the UUID in the byte stream before btrfs receive?


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Austin S. Hemmelgarn

On 2018-09-18 14:16, Andrei Borzenkov wrote:

18.09.2018 08:37, Chris Murphy пишет:

On Mon, Sep 17, 2018 at 11:24 PM, Andrei Borzenkov  wrote:

18.09.2018 07:21, Chris Murphy пишет:

On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy  wrote:

https://btrfs.wiki.kernel.org/index.php/FAQ#Does_grub_support_btrfs.3F

Does anyone know if this is still a problem on Btrfs if grubenv has
xattr +C set? In that case it should be possible to overwrite it, and
there are no csums to be invalidated.

I kinda wonder if in 2018 it's specious for effectively out-of-tree
code to be making modifications to the file system, outside of the
file system.


a. The bootloader code (pre-boot, not user space setup stuff) would
have to know how to read xattr and refuse to overwrite a grubenv
lacking xattr +C.
b. The bootloader code, would have to have sophisticated enough Btrfs
knowledge to know if the grubenv has been reflinked or snapshot,
because even if +C, it may not be valid to overwrite, and COW must
still happen, and there's no way the code in GRUB can do full-blown COW
and update a bunch of metadata.
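Point (a) above, reading the +C (NOCOW) flag, is at least straightforward from user space; the pre-boot GRUB code would have to reimplement the equivalent against the on-disk inode flags. A hedged sketch using the standard Linux ioctl (the file names in the usage are made up for illustration):

```c
/* Check whether a file carries FS_NOCOW_FL (what `chattr +C` sets),
 * via the standard FS_IOC_GETFLAGS ioctl.  Sketch only: GRUB itself
 * cannot use this ioctl pre-boot, it would need its own flag reader. */
#include <fcntl.h>
#include <linux/fs.h>      /* FS_IOC_GETFLAGS, FS_NOCOW_FL */
#include <sys/ioctl.h>
#include <unistd.h>

/* Returns 1 if NOCOW is set, 0 if not, -1 if the flags are
 * unreadable (e.g. the filesystem doesn't support the ioctl). */
static int is_nocow(const char *path)
{
	int flags, ret = -1;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0)
		ret = (flags & FS_NOCOW_FL) ? 1 : 0;
	close(fd);
	return ret;
}
```

A setup tool could refuse to install a grubenv for which `is_nocow()` does not return 1, but as the rest of the thread notes, +C alone is not sufficient once snapshots or reflinks are involved.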

So answering my own question, this isn't workable. And it seems the
same problem for dm-thin.

There are a couple of reserve locations in Btrfs at the start and I
think after the first superblock, for bootloader embedding. Possibly
one or both of those areas could be used for this so it's outside the
file system. But other implementations are going to run into this
problem too.



That's what SUSE grub2 version does - it includes patches to redirect
writes on btrfs to reserved area. I am not sure how it behaves in case
of multi-device btrfs though.


The patches aren't upstream yet? Will they be?



I do not know. Personally I think much easier is to make grub location
independent of /boot, allowing grub be installed in separate partition.
This automatically covers all other cases (like MD, LVM etc).
It actually is independent of /boot already.  I've got it running just 
fine on my laptop off of the EFI system partition (which is independent 
of my /boot partition), and thus have no issues with handling of the 
grubenv file.  The problem is that all the big distros assume you want 
it in /boot, so they have no option for putting it anywhere else.


Actually installing it elsewhere is not hard though, you just pass 
`--boot-directory=/wherever` to the `grub-install` script and turn off 
your distribution's automatic reinstall mechanism so it doesn't get
screwed up by the package manager when the GRUB package gets updated. 
You can also make `/boot/grub` a symbolic link pointing to the real GRUB 
directory, so that you don't have to pass any extra options to tools 
like grub-reboot or grub-set-default.


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Andrei Borzenkov
18.09.2018 08:37, Chris Murphy пишет:
> On Mon, Sep 17, 2018 at 11:24 PM, Andrei Borzenkov  
> wrote:
>> 18.09.2018 07:21, Chris Murphy пишет:
>>> On Mon, Sep 17, 2018 at 9:44 PM, Chris Murphy  
>>> wrote:
 https://btrfs.wiki.kernel.org/index.php/FAQ#Does_grub_support_btrfs.3F

Does anyone know if this is still a problem on Btrfs if grubenv has
xattr +C set? In that case it should be possible to overwrite it, and
there are no csums to be invalidated.

I kinda wonder if in 2018 it's specious for effectively out-of-tree
code to be making modifications to the file system, outside of the
file system.
>>>
>>> a. The bootloader code (pre-boot, not user space setup stuff) would
>>> have to know how to read xattr and refuse to overwrite a grubenv
>>> lacking xattr +C.
>>> b. The bootloader code, would have to have sophisticated enough Btrfs
>>> knowledge to know if the grubenv has been reflinked or snapshot,
>>> because even if +C, it may not be valid to overwrite, and COW must
>>> still happen, and there's no way the code in GRUB can do full-blown COW
>>> and update a bunch of metadata.
>>>
>>> So answering my own question, this isn't workable. And it seems the
>>> same problem for dm-thin.
>>>
>>> There are a couple of reserve locations in Btrfs at the start and I
>>> think after the first superblock, for bootloader embedding. Possibly
>>> one or both of those areas could be used for this so it's outside the
>>> file system. But other implementations are going to run into this
>>> problem too.
>>>
>>
>> That's what SUSE grub2 version does - it includes patches to redirect
>> writes on btrfs to reserved area. I am not sure how it behaves in case
>> of multi-device btrfs though.
> 
> The patches aren't upstream yet? Will they be?
> 

I do not know. Personally I think much easier is to make grub location
independent of /boot, allowing grub be installed in separate partition.
This automatically covers all other cases (like MD, LVM etc).

> They redirect writes to grubenv specifically? Or do they use the
> reserved areas like a hidden and fixed location for what grubenv would
> contain?
> 
> I guess the user space grub-editenv could write to grubenv, which even
> if COW, GRUB can pick up that change. But GRUB itself writes its
> changes to a reserved area.
> 
> Hmmm. Complicated.
> 



Re: btrfs receive incremental stream on another uuid

2018-09-18 Thread Andrei Borzenkov
18.09.2018 20:56, Gervais, Francois пишет:
> 
> Hi,
> 
> I'm trying to apply a btrfs send diff (done through -p) to another subvolume 
> with the same content as the proper parent but with a different uuid.
> 
> I looked through btrfs receive and I get the feeling that this is not 
> possible right now.
> 
> I'm thinking of adding a -p option to btrfs receive which could override the 
> parent information from the stream.
> 
> Would that make sense?
> 
No. It is already possible (by setting received UUID); it should not be
made too open to easy abuse.


Re: Move data and mount point to subvolume

2018-09-18 Thread Marc Joliet
Am Sonntag, 16. September 2018, 14:50:04 CEST schrieb Hans van Kranenburg:
> The last example, where you make a subvolume and move everything into
> it, will not do what you want. Since a subvolume is a separate new
> directory/file hierarchy, mv will turn into a cp and rm operation
> (without warning you) probably destroying information about data shared
> between files.

I thought that wasn't true anymore.  The NEWS file to coreutils contains this 
(for version 8.24):

  mv will try a reflink before falling back to a standard copy, which is
  more efficient when moving files across BTRFS subvolume boundaries.

-- 
Marc Joliet
--
"People who think they know everything really annoy those of us who know we
don't" - Bjarne Stroustrup


signature.asc
Description: This is a digitally signed message part.


btrfs receive incremental stream on another uuid

2018-09-18 Thread Gervais, Francois


Hi,

I'm trying to apply a btrfs send diff (done through -p) to another subvolume 
with the same content as the proper parent but with a different uuid.

I looked through btrfs receive and I get the feeling that this is not possible 
right now.

I'm thinking of adding a -p option to btrfs receive which could override the 
parent information from the stream.

Would that make sense?


Re: GRUB writing to grubenv outside of kernel fs code

2018-09-18 Thread Goffredo Baroncelli
On 18/09/2018 06.21, Chris Murphy wrote:
> b. The bootloader code, would have to have sophisticated enough Btrfs
> knowledge to know if the grubenv has been reflinked or snapshot,
> because even if +C, it may not be valid to overwrite, and COW must
> still happen, and there's no way the code in GRUB can do full-blown COW
> and update a bunch of metadata.

And what if GRUB ignores the possibility of COWing and just overwrites the
data? Is it such a big problem that the data is changed in all the snapshots?
It would be interesting to know whether the same problem happens for a swap
file.


BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5


[PATCH V5 RESEND] Btrfs: enhance raid1/10 balance heuristic

2018-09-18 Thread Timofey Titovets
From: Timofey Titovets 

Currently the btrfs raid1/10 balancer balances requests across mirrors
based on pid % number of mirrors.

Make the logic aware of:
 - whether one of the underlying devices is non-rotational
 - the queue length of the underlying devices

By default keep the pid % num_mirrors guess, but:
 - if one of the mirrors is non-rotational, repick it as optimal
 - if another underlying mirror has a smaller queue length than the
   optimal one, repick that mirror

To avoid round-robin request balancing, round the queue length down:
 - by 8 for rotational devices
 - by 2 if all devices are non-rotational

Changes:
  v1 -> v2:
- Use helper part_in_flight() from genhd.c
  to get the queue length
- Move guess code to guess_optimal()
- Change balancer logic: try pid % num_mirrors by default,
  balance on spinning rust only if one of the underlying
  devices is overloaded
  v2 -> v3:
- Fix arg for RAID10 - use sub_stripes, instead of num_stripes
  v3 -> v4:
- Rebased on latest misc-next
  v4 -> v5:
- Rebased on latest misc-next

Signed-off-by: Timofey Titovets 
---
 block/genhd.c  |   1 +
 fs/btrfs/volumes.c | 111 -
 2 files changed, 110 insertions(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 9656f9e9f99e..5ea5acc88d3c 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct 
*part,
atomic_read(&part->in_flight[1]);
}
 }
+EXPORT_SYMBOL_GPL(part_in_flight);
 
 void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
   unsigned int inflight[2])
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c95af358b71f..fa7dd6ac087f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "extent_map.h"
@@ -5201,6 +5202,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
*fs_info, u64 logical, u64 len)
return ret;
 }
 
+/**
+ * bdev_get_queue_len - return rounded down in flight queue length of bdev
+ *
+ * @bdev: target bdev
+ * @round_down: round factor big for hdd and small for ssd, like 8 and 2
+ */
+static int bdev_get_queue_len(struct block_device *bdev, int round_down)
+{
+   int sum;
+   struct hd_struct *bd_part = bdev->bd_part;
+   struct request_queue *rq = bdev_get_queue(bdev);
+   uint32_t inflight[2] = {0, 0};
+
+   part_in_flight(rq, bd_part, inflight);
+
+   sum = max_t(uint32_t, inflight[0], inflight[1]);
+
+   /*
+    * Try to prevent switching mirrors on every sneeze
+    * by rounding the result down by some value
+    */
+   return ALIGN_DOWN(sum, round_down);
+}
+
+/**
+ * guess_optimal - return guessed optimal mirror
+ *
+ * Optimal is expected to be pid % num_stripes
+ *
+ * That's generally OK for spreading load
+ * Add some balancing based on queue length per device
+ *
+ * Basic ideas:
+ *  - Sequential reads generate a low number of requests,
+ *    so if the load of the drives is equal, use pid % num_stripes balancing
+ *  - For mixed rotational/non-rotational mirrors, pick the non-rotational
+ *    one as optimal, and repick if another dev has a "significantly"
+ *    smaller queue length
+ *  - Repick optimal if the queue length of another mirror is smaller
+ */
+static int guess_optimal(struct map_lookup *map, int num, int optimal)
+{
+   int i;
+   int round_down = 8;
+   int qlen[num];
+   bool is_nonrot[num];
+   bool all_bdev_nonrot = true;
+   bool all_bdev_rotate = true;
+   struct block_device *bdev;
+
+   if (num == 1)
+   return optimal;
+
+   /* Check accessible bdevs */
+   for (i = 0; i < num; i++) {
+   /* Init for missing bdevs */
+   is_nonrot[i] = false;
+   qlen[i] = INT_MAX;
+   bdev = map->stripes[i].dev->bdev;
+   if (bdev) {
+   qlen[i] = 0;
+   is_nonrot[i] = blk_queue_nonrot(bdev_get_queue(bdev));
+   if (is_nonrot[i])
+   all_bdev_rotate = false;
+   else
+   all_bdev_nonrot = false;
+   }
+   }
+
+   /*
+* Don't bother with computation
+* if only one of two bdevs are accessible
+*/
+   if (num == 2 && qlen[0] != qlen[1]) {
+   if (qlen[0] < qlen[1])
+   return 0;
+   else
+   return 1;
+   }
+
+   if (all_bdev_nonrot)
+   round_down = 2;
+
+   for (i = 0; i < num; i++) {
+   if (qlen[i])
+   continue;
+   bdev = map->stripes[i].dev->bdev;
+   qlen[i] = bdev_get_queue_len(bdev, round_down);
+   }
+
+   /* For the mixed case, pick a non-rotational dev as optimal */
+   if (all_bdev_rotate == all_bdev_nonrot) {
+   for (i = 0; i < num; i++) {
+ 
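For reference, the heuristic this commit message describes can be modeled as a standalone user-space function. This is a simplified sketch of the idea, not the kernel patch itself; the rounding factors 8 and 2 are taken from the text above, and the tie-breaking details of the real `guess_optimal()` are not reproduced:

```c
/* Standalone model of the balance heuristic: start from
 * optimal = pid % num_mirrors, round queue lengths down to damp
 * jitter, and repick only when another mirror is clearly less
 * loaded (or is the only non-rotational one in a mixed setup). */
#include <stddef.h>

#define QLEN_ROUND_HDD 8
#define QLEN_ROUND_SSD 2

static int round_down_qlen(int qlen, int factor)
{
	return qlen - qlen % factor;   /* ALIGN_DOWN for positive values */
}

static int pick_mirror(const int *qlen, const int *nonrot,
		       int num, int pid)
{
	int best = pid % num;
	int all_nonrot = 1, i;

	for (i = 0; i < num; i++)
		if (!nonrot[i])
			all_nonrot = 0;

	int factor = all_nonrot ? QLEN_ROUND_SSD : QLEN_ROUND_HDD;

	for (i = 0; i < num; i++) {
		/* prefer a non-rotational mirror in a mixed setup */
		if (nonrot[i] && !nonrot[best])
			best = i;
		/* repick if another mirror is clearly less loaded */
		if (round_down_qlen(qlen[i], factor) <
		    round_down_qlen(qlen[best], factor))
			best = i;
	}
	return best;
}
```

The rounding is what keeps the picker from ping-ponging between mirrors whose in-flight counts differ by only a request or two.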

[PATCH RESEND] Btrfs: make should_defrag_range() understand compressed extents

2018-09-18 Thread Timofey Titovets
From: Timofey Titovets 

 Both the defrag ioctl and autodefrag call btrfs_defrag_file()
 for file defragmentation.

 The kernel's default target extent size is 256KiB.
 The btrfs-progs default is 32MiB.

 Both are bigger than the maximum size of a compressed extent, 128KiB.
 That leads to rewriting all compressed data on disk.

 Fix that by checking compressed extents with different logic.

 In addition, make should_defrag_range() understand the compressed extent
 type: if the requested target compression is the same as the current
 extent's compression type, just don't recompress/rewrite the extent.
 This avoids useless recompression of compressed extents.

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/ioctl.c | 28 +---
 1 file changed, 25 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index a990a9045139..0a5ea1ccc89d 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -1142,7 +1142,7 @@ static bool defrag_check_next_extent(struct inode *inode, 
struct extent_map *em)
 
 static int should_defrag_range(struct inode *inode, u64 start, u32 thresh,
   u64 *last_len, u64 *skip, u64 *defrag_end,
-  int compress)
+  int compress, int compress_type)
 {
struct extent_map *em;
int ret = 1;
@@ -1177,8 +1177,29 @@ static int should_defrag_range(struct inode *inode, u64 
start, u32 thresh,
 * real extent, don't bother defragging it
 */
if (!compress && (*last_len == 0 || *last_len >= thresh) &&
-   (em->len >= thresh || (!next_mergeable && !prev_mergeable)))
+   (em->len >= thresh || (!next_mergeable && !prev_mergeable))) {
ret = 0;
+   goto out;
+   }
+
+
+   /*
+    * Try not to recompress compressed extents:
+    * thresh >= BTRFS_MAX_UNCOMPRESSED would otherwise lead to
+    * recompressing all compressed extents
+    */
+   if (em->compress_type != 0 && thresh >= BTRFS_MAX_UNCOMPRESSED) {
+   if (!compress) {
+   if (em->len == BTRFS_MAX_UNCOMPRESSED)
+   ret = 0;
+   } else {
+   if (em->compress_type != compress_type)
+   goto out;
+   if (em->len == BTRFS_MAX_UNCOMPRESSED)
+   ret = 0;
+   }
+   }
+
 out:
/*
 * last_len ends up being a counter of how many bytes we've defragged.
@@ -1477,7 +1498,8 @@ int btrfs_defrag_file(struct inode *inode, struct file 
*file,
 
if (!should_defrag_range(inode, (u64)i << PAGE_SHIFT,
 extent_thresh, _len, ,
-_end, do_compress)){
+_end, do_compress,
+compress_type)){
unsigned long next;
/*
 * the should_defrag function tells us how much to skip
-- 
2.19.0
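The rules from the commit message above can be condensed into a pure decision function. This is an illustrative model only, not the kernel's `should_defrag_range()`; compression types are arbitrary integers here, with 0 meaning "none":

```c
/* Condensed model of the rules in the commit message: a 128KiB
 * extent is already as large as a compressed extent can get, so
 * don't rewrite it just because it is smaller than the defrag
 * target size.  Mirrors the intent only, not the kernel code. */
#include <stdbool.h>

#define MAX_UNCOMPRESSED (128 * 1024)   /* BTRFS_MAX_UNCOMPRESSED */

/* Return true when the extent should be defragged/rewritten. */
static bool should_rewrite(unsigned long extent_len, int extent_comp,
			   unsigned long thresh, int want_comp)
{
	/* Uncompressed extent: rewrite only if below the threshold. */
	if (extent_comp == 0)
		return extent_len < thresh;

	/* Compressed extent already at maximum size, and the caller
	 * wants the same compression (or none): rewriting gains
	 * nothing, so skip it. */
	if (thresh >= MAX_UNCOMPRESSED &&
	    extent_len == MAX_UNCOMPRESSED &&
	    (want_comp == 0 || want_comp == extent_comp))
		return false;

	return true;
}
```

With the old logic every 128KiB compressed extent fell below the 256KiB/32MiB targets and was rewritten; the second branch is what the patch adds.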


Re: btrfs panic problem

2018-09-18 Thread sunny.s.zhang

Add Junxiao


在 2018年09月18日 13:05, Duncan 写道:

sunny.s.zhang posted on Tue, 18 Sep 2018 08:28:14 +0800 as excerpted:


My OS (4.1.12) panics in kmem_cache_alloc(), which is called by
btrfs_get_or_create_delayed_node().

I found that the freelist of the SLUB is corrupted.

[Not a dev, just a btrfs list regular and user, myself.  But here's a
general btrfs list recommendations reply...]

You appear to mean kernel 4.1.12 -- confirmed by the version reported in
the posted dump:  4.1.12-112.14.13.el6uek.x86_64

OK, so from the perspective of this forward-development-focused list,
kernel 4.1 is pretty ancient history, but you do have a number of options.

First let's consider the general situation.  Most people choose an
enterprise distro for supported stability, and that's certainly a valid
thing to want.  However, btrfs, while now reaching early maturity for the
basics (single device in single or dup mode, and multi-device in single/
raid0/1/10 modes, note that raid56 mode is newer and less mature),
remains under quite heavy development, and keeping reasonably current is
recommended for that reason.

So you chose an enterprise distro presumably to lock in supported
stability for several years, but you chose a filesystem, btrfs, that's
still under heavy development, with reasonably current kernels and
userspace recommended as tending to have the known bugs fixed.  There's a
bit of a conflict there, and the /general/ recommendation would thus be
to consider whether one or the other of those choices are inappropriate
for your use-case, because it's really quite likely that if you really
want the stability of an enterprise distro and kernel, that btrfs isn't
as stable a filesystem as you're likely to want to match with it.
Alternatively, if you want something newer to match the still under heavy
development btrfs, you very likely want a distro that's not focused on
years-old stability just for the sake of it.  One or the other is likely
to be a poor match for your needs, and choosing something else that's a
better match is likely to be a much better experience for you.

But perhaps you do have reason to want to run the newer and not quite to
traditional enterprise-distro level stability btrfs, on an otherwise
older and very stable enterprise distro.  That's fine, provided you know
what you're getting yourself into, and are prepared to deal with it.

In that case, for best support from the list, we'd recommend running one
of the latest two kernels in either the current or mainline LTS tracks.

For current track, With 4.18 being the latest kernel, that'd be 4.18 or
4.17, as available on kernel.org (tho 4.17 is already EOL, no further
releases, at 4.17.19).

For mainline-LTS track, 4.14 and 4.9 are the latest two LTS series
kernels, tho IIRC 4.19 is scheduled to be this year's LTS (or was it 4.18
and it's just not out of normal stable range yet so not yet marked LTS?),
so it'll be coming up soon and 4.9 will then be dropping to third LTS
series and thus out of our best recommended range.  4.4 was the previous
LTS and while still in LTS support, is outside the two newest LTS series
that this list recommends.

And of course 4.1 is older than 4.4, so as I said, in btrfs development
terms, it's quite ancient indeed... quite out of practical support range
here, tho of course we'll still try, but in many cases the first question
when any problem's reported is going to be whether it's reproducible on
something closer to current.

But... you ARE on an enterprise kernel, likely on an enterprise distro,
and very possibly actually paying /them/ for support.  So you're not
without options if you prefer to stay with your supported enterprise
kernel.  If you're paying them for support, you might as well use it, and
of course of the very many fixes since 4.1, they know what they've
backported and what they haven't, so they're far better placed to provide
that support in any case.

Or, given what you posted, you appear to be reasonably able to do at
least limited kernel-dev-level analysis yourself.  Given that, you're
already reasonably well placed to simply decide to stick with what you
have and take the support you can get, diving into things yourself if
necessary.


So those are your kernel options.  What about userspace btrfs-progs?

Generally speaking, while the filesystem's running, it's the kernel code
doing most of the work.  If you have old userspace, it simply means you
can't take advantage of some of the newer features as the old userspace
doesn't know how to call for them.

But the situation changes as soon as you have problems and can't mount,
because it's userspace code that runs to try to fix that sort of problem,
or failing that, it's userspace code that btrfs restore runs to try to
grab what files can be grabbed off of the unmountable filesystem.

So for routine operation, it's no big deal if userspace is a bit old, at
least as long as it's new enough to have all the newer command formats,
etc, that you need, and for 

[PATCH] btrfs-progs: tests: Add the testcase for subvolume name length limit test

2018-09-18 Thread Su Yanjun
A total of three conditions is tested: one with a short name, one with
name length 255, and the last with more than 255.

This case should pass after commit
'btrfs-progs: change filename limit to 255 when creating subvolume'.

Signed-off-by: Su Yanjun 
---
 .../033-filename-length-limit/test.sh | 86 +++
 1 file changed, 86 insertions(+)
 create mode 100755 tests/misc-tests/033-filename-length-limit/test.sh

diff --git a/tests/misc-tests/033-filename-length-limit/test.sh 
b/tests/misc-tests/033-filename-length-limit/test.sh
new file mode 100755
index ..7764ad9b584c
--- /dev/null
+++ b/tests/misc-tests/033-filename-length-limit/test.sh
@@ -0,0 +1,86 @@
+#!/bin/bash
+#
+# test file name length limit settings
+
+source "$TEST_TOP/common"
+
+check_prereq mkfs.btrfs
+check_prereq btrfs
+
+setup_root_helper
+
+prepare_test_dev
+run_check "$TOP/mkfs.btrfs" -f "$TEST_DEV"
+run_check_mount_test_dev
+run_check $SUDO_HELPER chmod a+rw "$TEST_MNT"
+
+cd "$TEST_MNT"
+
+longname=\
+0123456789\
+0123456789\
+0123456789\
+0123456789\
+0123456789\
+\
+0123456789\
+0123456789\
+0123456789\
+0123456789\
+0123456789\
+\
+0123456789\
+0123456789\
+0123456789\
+0123456789\
+0123456789\
+\
+0123456789\
+0123456789\
+0123456789\
+0123456789\
+0123456789\
+\
+0123456789\
+0123456789\
+0123456789\
+0123456789\
+0123456789\
+\
+01234
+
+#
+# subvolume name length limit test
+#
+
+# short name test
+run_check $SUDO_HELPER "$TOP/btrfs" subvolume create aaa
+# 255
+run_check $SUDO_HELPER "$TOP/btrfs" subvolume create "$longname"
+# 256, must fail
+run_mustfail "name 256 bytes long succeeded" \
+   $SUDO_HELPER "$TOP/btrfs" subvolume create "$longname"5
+# 255*2, must fail
+run_mustfail "name 2 * 255 bytes long succeeded" \
+   $SUDO_HELPER "$TOP/btrfs" subvolume create "$longname$longname"
+
+#
+# snapshot name length limit test
+#
+
+run_check $SUDO_HELPER mkdir snaps
+
+# short name test
+run_check $SUDO_HELPER "$TOP/btrfs" subvolume snapshot aaa snaps/bbb
+# 255
+run_check $SUDO_HELPER "$TOP/btrfs" subvolume snapshot aaa snaps/"$longname"
+# 256, must fail
+run_mustfail "name 256 bytes long succeeded" \
+   $SUDO_HELPER "$TOP/btrfs" subvolume snapshot aaa snaps/"$longname"5
+# 255*2, must fail
+run_mustfail "name 2 * 255 bytes long succeeded" \
+   $SUDO_HELPER "$TOP/btrfs" subvolume snapshot aaa snaps/"$longname$longname"
+
+cd ..
+
+run_check_umount_test_dev
-- 
2.18.0





Re: [PATCH v3 5/7] btrfs-progs: lowmem: do missing check of last item after check_inode_item()

2018-09-18 Thread Qu Wenruo



On 2018/9/18 下午4:01, Su Yue wrote:
> 
> 
> On 9/18/18 1:32 PM, Qu Wenruo wrote:
>>
>>
>> On 2018/9/17 下午9:24, Su Yue wrote:
>>>
>>>
>>> On 2018/9/17 8:53 PM, Qu Wenruo wrote:


 On 2018/9/17 下午3:28, Su Yue wrote:
> After call of check_inode_item(), path may point to the last unchecked
> slot of the leaf. The outer walk_up_tree() always treats the position
> as checked slot then skips to the next. The last item will never be
> checked.
>
> While checking backrefs, path passed to walk_up_tree() always
> points to a checked slot.
> While checking fs trees, path passed to walk_up_tree() always
> points to an unchecked slot.

 Can we unify this behavior?
I have considered three ways:
>>> 1) Change the logic of process_one_leaf(). After it returns, path
>>> points to the next checked slot.
>>> To unify it, we can use a saved key, but that costs one search_slot()
>>> across several nodes (>=1). Or call btrfs_previous_item() every time
>>> check_inode_items() returns.
>>>
>>> But why should we spend time swinging the path around? So I
>>> abandoned 1).
>>>
>>> 2) Change the logic of check_leaf_items(). All the function does
>>> is traverse all items and then return, which seems quite natural.
>>> So I abandoned it.
>>
>> Well, this can also be interpreted as "it's a pretty good place to
>> change the behavior".
>>
>> IMHO, since check_leaf_items() are just really simple hub functions, it
>> will just need a btrfs_next_item() in its out: tag.
>>
> After some more thought, sorry, the idea does not work as expected.
> In fact, the backrefs check and the fs check walk a little differently.

Just as discussed offline, unfortunately that's the case.

> 
> The backrefs check always walks nodes one by one, never skipping any.
> The fs check will try to skip shared nodes to speed up.
Exactly.

> 
> While checking backrefs with your idea, suppose the tree has many
> levels. Assume that before calling btrfs_next_item():
> path->slots[0] points to one past the last item.
> path->slots[1] points to the last slot of nodes[1].
> path->slots[2] points to the last slot of nodes[2].
> path->slots[3] points to the one *before* last slot of nodes[3].
> 
> After btrfs_next_item():
> path->slots[0] points to the first item of another leaf.
> path->slots[1] points to the first item of another node.
> path->slots[2] points to the first item of another node.
> path->slots[3] points to a slot of the *old* nodes[3].

This info is pretty useful, please consider including it in the next version.

It's not that obvious from the code.

And now your patch makes sense.

Thanks,
Qu

> 
> Then walk_up_tree() comes in, thinks the slot is unchecked, and
> returns with *level=0. Then walk_down_tree() just walks from that
> level to the leaf.
> Backrefs of the new nodes[1,2] will never be checked; the most
> obvious negative effect is inaccurate accounting info.
> Although we could check whether the slot is the first one in
> walk_up_tree(), that is magic and worse than this patch.
> 
> Thanks,
> Su
> 
>> By that we can unify their behavior so that both always point to the
>> next *unchecked* slot.
>> And no need for the extra parameter.
>>
>> Thanks,
>> Qu
>>
>>>
>>>
>>> 3) It's what this patch does. The extra argument may seem strange,
>>> but I prefer this way.
>>>
>>> Maybe we can do something after check_leaf_items() returns; is that
>>> acceptable? I have no idea.
>>>
>>> Thanks,
>>> Su
>>>
E.g., always point to an unchecked slot.

 It would make things easier and no need for the extra parameter.

 Thanks,
 Qu

>
> Solution:
> Add an argument @is_checked to walk_up_tree() to decide whether
> to skip current slot.
>
> Fixes: 5e2dc770471b ("btrfs-progs: check: skip shared node or leaf
> check for low_memory mode")
> Signed-off-by: Su Yue 
> ---
>    check/mode-lowmem.c | 37 +
>    1 file changed, 33 insertions(+), 4 deletions(-)
>
> diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
> index db44456fd85b..612e5e28e45b 100644
> --- a/check/mode-lowmem.c
> +++ b/check/mode-lowmem.c
> @@ -4597,22 +4597,38 @@ static int walk_down_tree(struct btrfs_root
> *root, struct btrfs_path *path,
>    return err;
>    }
>    +/*
> + * Walk up through the path. Make path point to the next slot to be checked.
> + * walk_down_tree() should be called after this function.
> + *
> + * @root:    root of the tree
> + * @path:    will point to next slot to check for walk_down_tree()
> + * @level:    returns with level of next node to be checked
> + * @is_checked:    whether the current slot has already been checked
> + *    if false, the slot is unchecked, do not increase the slot
> + *    if true, increase the slot of the current node
> + *
> + * Returns 0 on success.
> + * Returns >0 when the whole walk up/down loop should be broken.

Re: [PATCH v3 5/7] btrfs-progs: lowmem: do missing check of last item after check_inode_item()

2018-09-18 Thread Su Yue




On 9/18/18 1:32 PM, Qu Wenruo wrote:



On 2018/9/17 下午9:24, Su Yue wrote:



On 2018/9/17 8:53 PM, Qu Wenruo wrote:



On 2018/9/17 下午3:28, Su Yue wrote:

After call of check_inode_item(), path may point to the last unchecked
slot of the leaf. The outer walk_up_tree() always treats the position
as checked slot then skips to the next. The last item will never be
checked.

While checking backrefs, path passed to walk_up_tree() always
points to a checked slot.
While checking fs trees, path passed to walk_up_tree() always
points to an unchecked slot.


Can we unify this behavior?
I have considered three ways:

1) Change the logic of process_one_leaf(). After it returns, path
points to the next checked slot.
To unify it, we can use a saved key, but that costs one search_slot()
across several nodes (>=1). Or call btrfs_previous_item() every time
check_inode_items() returns.

But why should we spend time swinging the path around? So I abandoned 1).

2) Change the logic of check_leaf_items(). All the function does
is traverse all items and then return, which seems quite natural.
So I abandoned it.


Well, this can also be interpreted as "it's a pretty good place to
change the behavior".

IMHO, since check_leaf_items() are just really simple hub functions, it
will just need a btrfs_next_item() in its out: tag.


After some more thought, sorry, the idea does not work as expected.
In fact, the backrefs check and the fs check walk a little differently.

The backrefs check always walks nodes one by one, never skipping any.
The fs check will try to skip shared nodes to speed up.

While checking backrefs with your idea, suppose the tree has many
levels. Assume that before calling btrfs_next_item():
path->slots[0] points to one past the last item.
path->slots[1] points to the last slot of nodes[1].
path->slots[2] points to the last slot of nodes[2].
path->slots[3] points to the one *before* last slot of nodes[3].

After btrfs_next_item():
path->slots[0] points to the first item of another leaf.
path->slots[1] points to the first item of another node.
path->slots[2] points to the first item of another node.
path->slots[3] points to a slot of the *old* nodes[3].

Then walk_up_tree() comes in, thinks the slot is unchecked, and
returns with *level=0. Then walk_down_tree() just walks from that
level to the leaf.
Backrefs of the new nodes[1,2] will never be checked; the most
obvious negative effect is inaccurate accounting info.
Although we could check whether the slot is the first one in
walk_up_tree(), that is magic and worse than this patch.

Thanks,
Su


By that we can unify their behavior so that both always point to the
next *unchecked* slot.
And no need for the extra parameter.

Thanks,
Qu




3) It's what this patch does. The extra argument may seem strange,
but I prefer this way.

Maybe we can do something after check_leaf_items() returns; is that
acceptable? I have no idea.

Thanks,
Su


E.g., always point to an unchecked slot.

It would make things easier and no need for the extra parameter.

Thanks,
Qu



Solution:
Add an argument @is_checked to walk_up_tree() to decide whether
to skip current slot.

Fixes: 5e2dc770471b ("btrfs-progs: check: skip shared node or leaf
check for low_memory mode")
Signed-off-by: Su Yue 
---
   check/mode-lowmem.c | 37 +
   1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/check/mode-lowmem.c b/check/mode-lowmem.c
index db44456fd85b..612e5e28e45b 100644
--- a/check/mode-lowmem.c
+++ b/check/mode-lowmem.c
@@ -4597,22 +4597,38 @@ static int walk_down_tree(struct btrfs_root
*root, struct btrfs_path *path,
   return err;
   }
   +/*
+ * Walk up through the path. Make path point to the next slot to be checked.
+ * walk_down_tree() should be called after this function.
+ *
+ * @root:    root of the tree
+ * @path:    will point to next slot to check for walk_down_tree()
+ * @level:    returns with level of next node to be checked
+ * @is_checked:    whether the current slot has already been checked
+ *    if false, the slot is unchecked, do not increase the slot
+ *    if true, increase the slot of the current node
+ *
+ * Returns 0 on success.
+ * Returns >0 when the whole walk up/down loop should be broken.
+ */
   static int walk_up_tree(struct btrfs_root *root, struct btrfs_path
*path,
-    int *level)
+    int *level, bool is_checked)
   {
   int i;
   struct extent_buffer *leaf;
+	int skip_cur = is_checked ? 1 : 0;

 	for (i = *level; i < BTRFS_MAX_LEVEL - 1 && path->nodes[i]; i++) {
 		leaf = path->nodes[i];
-		if (path->slots[i] + 1 < btrfs_header_nritems(leaf)) {
-			path->slots[i]++;
+		if (path->slots[i] + skip_cur < btrfs_header_nritems(leaf)) {
+			path->slots[i] += skip_cur;
 			*level = i;
 			return 0;
 		}
 		free_extent_buffer(path->nodes[*level]);
 		path->nodes[*level] = NULL;
 		*level = i + 1;
+		skip_cur = 1;
 	}
 	return 1;
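The fixed walk-up logic can be exercised outside btrfs by modeling the path as plain arrays. This is a toy model of the patch above, not the btrfs-progs code; extent buffers and reference counting are left out:

```c
/* Toy model of walk_up_tree() with the @is_checked fix: slots[] and
 * nritems[] describe the path, and the function finds the lowest
 * level with another slot to visit.  With is_checked == 0 the
 * current slot itself is returned (it still needs checking).
 * Returns 0 and sets *level on success, 1 when the walk is done. */
static int walk_up_model(int *slots, const int *nritems, int nlevels,
			 int *level, int is_checked)
{
	int skip_cur = is_checked ? 1 : 0;
	int i;

	for (i = *level; i < nlevels; i++) {
		if (slots[i] + skip_cur < nritems[i]) {
			slots[i] += skip_cur;   /* stay, or step right */
			*level = i;
			return 0;
		}
		*level = i + 1;
		/* parent levels were left on an already-checked slot */
		skip_cur = 1;
	}
	return 1;
}
```

The `skip_cur = 1` at the end of each iteration is the crux: only the starting level may be unchecked; once we move up, every parent slot has been visited and must be stepped past.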