Re: btrfs panic problem
On 2018-09-25 16:31, Nikolay Borisov wrote:

On 25.09.2018 11:20, sunny.s.zhang wrote:

On 2018-09-20 02:36, Liu Bo wrote:

On Mon, Sep 17, 2018 at 5:28 PM, sunny.s.zhang wrote:

Hi All,

My OS (4.1.12) panics in kmem_cache_alloc, which is called by
btrfs_get_or_create_delayed_node.

I found that the freelist of the slub is wrong.

crash> struct kmem_cache_cpu 887e7d7a24b0

struct kmem_cache_cpu {
  freelist = 0x2026, <<< the value is the id of one inode
  tid = 29567861,
  page = 0xea0132168d00,
  partial = 0x0
}

Also, I found two different btrfs inodes pointing to the same
delayed_node. It means that the same slub object is used twice.

I think this object was freed twice, so its next pointer points to
itself and we get the same object twice. Using this object again
breaks the freelist.

The following code can make the delayed node be freed twice, but I have
not found which process does it.

Process A (btrfs_evict_inode)            Process B

calls btrfs_remove_delayed_node          calls btrfs_get_delayed_node

                                         node = ACCESS_ONCE(btrfs_inode->delayed_node);

BTRFS_I(inode)->delayed_node = NULL;
btrfs_release_delayed_node(delayed_node);

                                         if (node) {
                                                 atomic_inc(&node->refs);
                                                 return node;
                                         }

                                         ..

                                         btrfs_release_delayed_node(delayed_node);

By looking at the race, it seems the following commit has addressed it.

btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ec35e48b286959991cdbb886f1bdeda4575c80b4

thanks,
liubo

I don't think so. That patch resolved the problem in the
radix_tree_lookup path. I don't think it can resolve my problem, where
the race occurs after ACCESS_ONCE(btrfs_inode->delayed_node), because
if ACCESS_ONCE(btrfs_inode->delayed_node) returns the node, then
btrfs_get_delayed_node returns and does not continue.

Can you reproduce the problem on an upstream kernel with added delays?
The original report is from some RHEL-based distro (presumably Oracle
Unbreakable Linux) so there is no indication currently that this is a
genuine problem in upstream kernels.

Not yet. I will try to reproduce it later, but I don't have any clue
about this race now.

Thanks,
Sunny

Thanks,
Sunny

1313 void btrfs_remove_delayed_node(struct inode *inode)
1314 {
1315         struct btrfs_delayed_node *delayed_node;
1316
1317         delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
1318         if (!delayed_node)
1319                 return;
1320
1321         BTRFS_I(inode)->delayed_node = NULL;
1322         btrfs_release_delayed_node(delayed_node);
1323 }

87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct inode *inode)
88 {
89         struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
90         struct btrfs_root *root = btrfs_inode->root;
91         u64 ino = btrfs_ino(inode);
92         struct btrfs_delayed_node *node;
93
94         node = ACCESS_ONCE(btrfs_inode->delayed_node);
95         if (node) {
96                 atomic_inc(&node->refs);
97                 return node;
98         }

Thanks,
Sunny

PS: panic information

PID: 73638 TASK: 887deb586200 CPU: 38 COMMAND: "dockerd"
#0 [88130404f940] machine_kexec at 8105ec10
#1 [88130404f9b0] crash_kexec at 811145b8
#2 [88130404fa80] oops_end at 8101a868
#3 [88130404fab0] no_context at 8106ea91
#4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
#5 [88130404fb50] bad_area_nosemaphore at 8106eda3
#6 [88130404fb60] __do_page_fault at 8106f328
#7 [88130404fbd0] do_page_fault at 8106f637
#8 [88130404fc10] page_fault at 816f6308
    [exception RIP: kmem_cache_alloc+121]
    RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
    RAX:   RBX:   RCX: 01c32b76
    RDX: 01c32b75  RSI:   RDI: 000224b0
    RBP: 88130404fd08  R8: 887e7d7a24b0  R9:
    R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
    R13: 2026  R14: 887e3e230a00  R15: a01abf49
    ORIG_RAX:   CS: 0010  SS: 0018
#9 [88130404fd10] btrfs_get_or_create_delayed_node at a01abf49 [btrfs]
#10 [88130404fd60] btrfs_delayed_update_inode at a01aea12 [btrfs]
#11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs]
#12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs]
#13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs]
#14 [88130404fe50] touch_atime at 812286d3
#1
Re: btrfs panic problem
On 25.09.2018 11:20, sunny.s.zhang wrote:
>
> On 2018-09-20 02:36, Liu Bo wrote:
>> On Mon, Sep 17, 2018 at 5:28 PM, sunny.s.zhang wrote:
>>> Hi All,
>>>
>>> My OS (4.1.12) panics in kmem_cache_alloc, which is called by
>>> btrfs_get_or_create_delayed_node.
>>>
>>> I found that the freelist of the slub is wrong.
>>>
>>> crash> struct kmem_cache_cpu 887e7d7a24b0
>>>
>>> struct kmem_cache_cpu {
>>>   freelist = 0x2026, <<< the value is the id of one inode
>>>   tid = 29567861,
>>>   page = 0xea0132168d00,
>>>   partial = 0x0
>>> }
>>>
>>> Also, I found two different btrfs inodes pointing to the same
>>> delayed_node. It means that the same slub object is used twice.
>>>
>>> I think this object was freed twice, so its next pointer points to
>>> itself and we get the same object twice.
>>>
>>> Using this object again breaks the freelist.
>>>
>>> The following code can make the delayed node be freed twice, but I
>>> have not found which process does it.
>>>
>>> Process A (btrfs_evict_inode)           Process B
>>>
>>> calls btrfs_remove_delayed_node         calls btrfs_get_delayed_node
>>>
>>>                 node = ACCESS_ONCE(btrfs_inode->delayed_node);
>>>
>>> BTRFS_I(inode)->delayed_node = NULL;
>>> btrfs_release_delayed_node(delayed_node);
>>>
>>>                 if (node) {
>>>                         atomic_inc(&node->refs);
>>>                         return node;
>>>                 }
>>>
>>>                 ..
>>>
>>>                 btrfs_release_delayed_node(delayed_node);
>>>
>> By looking at the race, it seems the following commit has addressed it.
>>
>> btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ec35e48b286959991cdbb886f1bdeda4575c80b4
>>
>> thanks,
>> liubo
>
> I don't think so. That patch resolved the problem in the
> radix_tree_lookup path. I don't think it can resolve my problem, where
> the race occurs after ACCESS_ONCE(btrfs_inode->delayed_node), because
> if ACCESS_ONCE(btrfs_inode->delayed_node) returns the node, then
> btrfs_get_delayed_node returns and does not continue.

Can you reproduce the problem on an upstream kernel with added delays?
The original report is from some RHEL-based distro (presumably Oracle
Unbreakable Linux) so there is no indication currently that this is a
genuine problem in upstream kernels.

> Thanks,
> Sunny
>
>>> 1313 void btrfs_remove_delayed_node(struct inode *inode)
>>> 1314 {
>>> 1315         struct btrfs_delayed_node *delayed_node;
>>> 1316
>>> 1317         delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
>>> 1318         if (!delayed_node)
>>> 1319                 return;
>>> 1320
>>> 1321         BTRFS_I(inode)->delayed_node = NULL;
>>> 1322         btrfs_release_delayed_node(delayed_node);
>>> 1323 }
>>>
>>> 87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct inode *inode)
>>> 88 {
>>> 89         struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
>>> 90         struct btrfs_root *root = btrfs_inode->root;
>>> 91         u64 ino = btrfs_ino(inode);
>>> 92         struct btrfs_delayed_node *node;
>>> 93
>>> 94         node = ACCESS_ONCE(btrfs_inode->delayed_node);
>>> 95         if (node) {
>>> 96                 atomic_inc(&node->refs);
>>> 97                 return node;
>>> 98         }
>>>
>>> Thanks,
>>> Sunny
>>>
>>> PS: panic information
>>>
>>> PID: 73638 TASK: 887deb586200 CPU: 38 COMMAND: "dockerd"
>>> #0 [88130404f940] machine_kexec at 8105ec10
>>> #1 [88130404f9b0] crash_kexec at 811145b8
>>> #2 [88130404fa80] oops_end at 8101a868
>>> #3 [88130404fab0] no_context at 8106ea91
>>> #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
>>> #5 [88130404fb50] bad_area_nosemaphore at 8106eda3
>>> #6 [88130404fb60] __do_page_fault at 8106f328
>>> #7 [88130404fbd0] do_page_fault at 8106f637
>>> #8 [88130404fc10] page_fault at 816f6308
>>>     [exception RIP: kmem_cache_alloc+121]
>>>     RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
>>>     RAX:   RBX:   RCX: 01c32b76
>>>     RDX: 01c32b75  RSI:   RDI: 000224b0
>>>     RBP: 88130404fd08  R8: 887e7d7a24b0  R9:
>>>     R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
>>>     R13: 2026  R14: 887e3e230a00  R15: a01abf49
>>>     ORIG_RAX:   CS: 0010  SS: 0018
>>> #9 [88130404fd10] btrfs_get_or_create_delayed_node at a01abf49 [btrfs]
>>> #10 [88130404fd60] btrfs_delayed_update_inode at fff
Re: btrfs panic problem
On 2018-09-20 00:12, Nikolay Borisov wrote:

On 19.09.2018 02:53, sunny.s.zhang wrote:

Hi Duncan,

Thank you for your advice. I understand what you mean, but I have
reviewed the latest btrfs code and I think the issue still exists.

At line 71, if btrfs_get_delayed_node runs past this line and we then
switch to another process, which runs past line 1282 and releases the
delayed node at the end, and we then switch back to
btrfs_get_delayed_node, it finds that the node is not NULL and uses it
as normal. That means we use freed memory. At some point, this memory
will be freed again.

The latest code is as below.

1278 void btrfs_remove_delayed_node(struct btrfs_inode *inode)
1279 {
1280         struct btrfs_delayed_node *delayed_node;
1281
1282         delayed_node = READ_ONCE(inode->delayed_node);
1283         if (!delayed_node)
1284                 return;
1285
1286         inode->delayed_node = NULL;
1287         btrfs_release_delayed_node(delayed_node);
1288 }

64 static struct btrfs_delayed_node *btrfs_get_delayed_node(
65                 struct btrfs_inode *btrfs_inode)
66 {
67         struct btrfs_root *root = btrfs_inode->root;
68         u64 ino = btrfs_ino(btrfs_inode);
69         struct btrfs_delayed_node *node;
70
71         node = READ_ONCE(btrfs_inode->delayed_node);
72         if (node) {
73                 refcount_inc(&node->refs);
74                 return node;
75         }
76
77         spin_lock(&root->inode_lock);
78         node = radix_tree_lookup(&root->delayed_nodes_tree, ino);

Your analysis is correct; however, it's missing one crucial point:
btrfs_remove_delayed_node is called only from btrfs_evict_inode, and
inodes are evicted when all other references have been dropped. Check
the code in evict_inodes(): inodes are added to the dispose list when
their i_count is 0, at which point there should be no references to
this inode. This invalidates your analysis...

Thanks. Yes, I know this, and I know that other processes cannot use
this inode if the inode is in the I_FREEING state. But Chris has fixed
a similar bug that was found in production, which means this can occur
under some conditions:

btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ec35e48b286959991cdbb886f1bdeda4575c80b4

On 2018-09-18 13:05, Duncan wrote:

sunny.s.zhang posted on Tue, 18 Sep 2018 08:28:14 +0800 as excerpted:

My OS (4.1.12) panics in kmem_cache_alloc, which is called by
btrfs_get_or_create_delayed_node.

I found that the freelist of the slub is wrong.

[Not a dev, just a btrfs list regular and user, myself. But here's a
general btrfs list recommendations reply...]

You appear to mean kernel 4.1.12 -- confirmed by the version reported
in the posted dump: 4.1.12-112.14.13.el6uek.x86_64

OK, so from the perspective of this forward-development-focused list,
kernel 4.1 is pretty ancient history, but you do have a number of
options.

First let's consider the general situation. Most people choose an
enterprise distro for supported stability, and that's certainly a valid
thing to want. However, btrfs, while now reaching early maturity for
the basics (single device in single or dup mode, and multi-device in
single/raid0/1/10 modes; note that raid56 mode is newer and less
mature), remains under quite heavy development, and keeping reasonably
current is recommended for that reason.

So you chose an enterprise distro presumably to lock in supported
stability for several years, but you chose a filesystem, btrfs, that's
still under heavy development, with reasonably current kernels and
userspace recommended as tending to have the known bugs fixed.

There's a bit of a conflict there, and the /general/ recommendation
would thus be to consider whether one or the other of those choices is
inappropriate for your use-case, because it's really quite likely that
if you really want the stability of an enterprise distro and kernel,
btrfs isn't as stable a filesystem as you're likely to want to match
with it.

Alternatively, if you want something newer to match the
still-under-heavy-development btrfs, you very likely want a distro
that's not focused on years-old stability just for the sake of it. One
or the other is likely to be a poor match for your needs, and choosing
something else that's a better match is likely to be a much better
experience for you.

But perhaps you do have reason to want to run the newer and not quite
to traditional enterprise-distro level stability btrfs, on an otherwise
older and very stable enterprise distro. That's fine, provided you know
what you're getting yourself into, and are prepared to deal with it.
Re: btrfs panic problem
On 2018-09-20 02:36, Liu Bo wrote:

On Mon, Sep 17, 2018 at 5:28 PM, sunny.s.zhang wrote:

Hi All,

My OS (4.1.12) panics in kmem_cache_alloc, which is called by
btrfs_get_or_create_delayed_node.

I found that the freelist of the slub is wrong.

crash> struct kmem_cache_cpu 887e7d7a24b0

struct kmem_cache_cpu {
  freelist = 0x2026, <<< the value is the id of one inode
  tid = 29567861,
  page = 0xea0132168d00,
  partial = 0x0
}

Also, I found two different btrfs inodes pointing to the same
delayed_node. It means that the same slub object is used twice.

I think this object was freed twice, so its next pointer points to
itself and we get the same object twice. Using this object again
breaks the freelist.

The following code can make the delayed node be freed twice, but I have
not found which process does it.

Process A (btrfs_evict_inode)            Process B

calls btrfs_remove_delayed_node          calls btrfs_get_delayed_node

                                         node = ACCESS_ONCE(btrfs_inode->delayed_node);

BTRFS_I(inode)->delayed_node = NULL;
btrfs_release_delayed_node(delayed_node);

                                         if (node) {
                                                 atomic_inc(&node->refs);
                                                 return node;
                                         }

                                         ..

                                         btrfs_release_delayed_node(delayed_node);

By looking at the race, it seems the following commit has addressed it.

btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ec35e48b286959991cdbb886f1bdeda4575c80b4

thanks,
liubo

I don't think so. That patch resolved the problem in the
radix_tree_lookup path. I don't think it can resolve my problem, where
the race occurs after ACCESS_ONCE(btrfs_inode->delayed_node), because
if ACCESS_ONCE(btrfs_inode->delayed_node) returns the node, then
btrfs_get_delayed_node returns and does not continue.

Thanks,
Sunny

1313 void btrfs_remove_delayed_node(struct inode *inode)
1314 {
1315         struct btrfs_delayed_node *delayed_node;
1316
1317         delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
1318         if (!delayed_node)
1319                 return;
1320
1321         BTRFS_I(inode)->delayed_node = NULL;
1322         btrfs_release_delayed_node(delayed_node);
1323 }

87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct inode *inode)
88 {
89         struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
90         struct btrfs_root *root = btrfs_inode->root;
91         u64 ino = btrfs_ino(inode);
92         struct btrfs_delayed_node *node;
93
94         node = ACCESS_ONCE(btrfs_inode->delayed_node);
95         if (node) {
96                 atomic_inc(&node->refs);
97                 return node;
98         }

Thanks,
Sunny

PS: panic information

PID: 73638 TASK: 887deb586200 CPU: 38 COMMAND: "dockerd"
#0 [88130404f940] machine_kexec at 8105ec10
#1 [88130404f9b0] crash_kexec at 811145b8
#2 [88130404fa80] oops_end at 8101a868
#3 [88130404fab0] no_context at 8106ea91
#4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
#5 [88130404fb50] bad_area_nosemaphore at 8106eda3
#6 [88130404fb60] __do_page_fault at 8106f328
#7 [88130404fbd0] do_page_fault at 8106f637
#8 [88130404fc10] page_fault at 816f6308
    [exception RIP: kmem_cache_alloc+121]
    RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
    RAX:   RBX:   RCX: 01c32b76
    RDX: 01c32b75  RSI:   RDI: 000224b0
    RBP: 88130404fd08  R8: 887e7d7a24b0  R9:
    R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
    R13: 2026  R14: 887e3e230a00  R15: a01abf49
    ORIG_RAX:   CS: 0010  SS: 0018
#9 [88130404fd10] btrfs_get_or_create_delayed_node at a01abf49 [btrfs]
#10 [88130404fd60] btrfs_delayed_update_inode at a01aea12 [btrfs]
#11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs]
#12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs]
#13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs]
#14 [88130404fe50] touch_atime at 812286d3
#15 [88130404fe90] iterate_dir at 81221929
#16 [88130404fee0] sys_getdents64 at 81221a19
#17 [88130404ff50] system_call_fastpath at 816f2594
    RIP: 006b68e4  RSP: 00c866259080  RFLAGS: 0246
    RAX: ffda  RBX: 00c828dbbe00  RCX: 006b68e4
    RDX: 1000  RSI: 00c83da14000  RDI: 0011
    RBP:   R8:   R9:
Re: btrfs panic problem
On Mon, Sep 17, 2018 at 5:28 PM, sunny.s.zhang wrote:
> Hi All,
>
> My OS (4.1.12) panics in kmem_cache_alloc, which is called by
> btrfs_get_or_create_delayed_node.
>
> I found that the freelist of the slub is wrong.
>
> crash> struct kmem_cache_cpu 887e7d7a24b0
>
> struct kmem_cache_cpu {
>   freelist = 0x2026, <<< the value is the id of one inode
>   tid = 29567861,
>   page = 0xea0132168d00,
>   partial = 0x0
> }
>
> Also, I found two different btrfs inodes pointing to the same
> delayed_node. It means that the same slub object is used twice.
>
> I think this object was freed twice, so its next pointer points to
> itself and we get the same object twice.
>
> Using this object again breaks the freelist.
>
> The following code can make the delayed node be freed twice, but I
> have not found which process does it.
>
> Process A (btrfs_evict_inode)           Process B
>
> calls btrfs_remove_delayed_node         calls btrfs_get_delayed_node
>
>                 node = ACCESS_ONCE(btrfs_inode->delayed_node);
>
> BTRFS_I(inode)->delayed_node = NULL;
> btrfs_release_delayed_node(delayed_node);
>
>                 if (node) {
>                         atomic_inc(&node->refs);
>                         return node;
>                 }
>
>                 ..
>
>                 btrfs_release_delayed_node(delayed_node);
>
By looking at the race, it seems the following commit has addressed it.

btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ec35e48b286959991cdbb886f1bdeda4575c80b4

thanks,
liubo

> 1313 void btrfs_remove_delayed_node(struct inode *inode)
> 1314 {
> 1315         struct btrfs_delayed_node *delayed_node;
> 1316
> 1317         delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
> 1318         if (!delayed_node)
> 1319                 return;
> 1320
> 1321         BTRFS_I(inode)->delayed_node = NULL;
> 1322         btrfs_release_delayed_node(delayed_node);
> 1323 }
>
> 87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct inode *inode)
> 88 {
> 89         struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
> 90         struct btrfs_root *root = btrfs_inode->root;
> 91         u64 ino = btrfs_ino(inode);
> 92         struct btrfs_delayed_node *node;
> 93
> 94         node = ACCESS_ONCE(btrfs_inode->delayed_node);
> 95         if (node) {
> 96                 atomic_inc(&node->refs);
> 97                 return node;
> 98         }
>
> Thanks,
> Sunny
>
> PS: panic information
>
> PID: 73638 TASK: 887deb586200 CPU: 38 COMMAND: "dockerd"
> #0 [88130404f940] machine_kexec at 8105ec10
> #1 [88130404f9b0] crash_kexec at 811145b8
> #2 [88130404fa80] oops_end at 8101a868
> #3 [88130404fab0] no_context at 8106ea91
> #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
> #5 [88130404fb50] bad_area_nosemaphore at 8106eda3
> #6 [88130404fb60] __do_page_fault at 8106f328
> #7 [88130404fbd0] do_page_fault at 8106f637
> #8 [88130404fc10] page_fault at 816f6308
>     [exception RIP: kmem_cache_alloc+121]
>     RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
>     RAX:   RBX:   RCX: 01c32b76
>     RDX: 01c32b75  RSI:   RDI: 000224b0
>     RBP: 88130404fd08  R8: 887e7d7a24b0  R9:
>     R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
>     R13: 2026  R14: 887e3e230a00  R15: a01abf49
>     ORIG_RAX:   CS: 0010  SS: 0018
> #9 [88130404fd10] btrfs_get_or_create_delayed_node at a01abf49 [btrfs]
> #10 [88130404fd60] btrfs_delayed_update_inode at a01aea12 [btrfs]
> #11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs]
> #12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs]
> #13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs]
> #14 [88130404fe50] touch_atime at 812286d3
> #15 [88130404fe90] iterate_dir at 81221929
> #16 [88130404fee0] sys_getdents64 at 81221a19
> #17 [88130404ff50] system_call_fastpath at 816f2594
>     RIP: 006b68e4  RSP: 00c866259080  RFLAGS: 0246
>     RAX: ffda  RBX: 00c828dbbe00  RCX: 006b68e4
>     RDX: 1000  RSI: 00c83da14000  RDI: 0011
>     RBP:   R8:   R9:
>     R10:   R11: 0246  R12: 00c7
>     R13: 02174e74  R14: 0555  R15: 0038
>     ORIG_RAX: 00d9  CS: 0033  SS: 002b
>
> We also found list double-add information, including n_list and p_list:
>
> [8642921.110568] [ cut here ]
> [8642921.167929] WARNING: CPU: 38 PID: 73638 at lib/list_debug.c:33 __list_add+0xbe/0xd0()
> [8642921.263780] lis
Re: btrfs panic problem
On 19.09.2018 02:53, sunny.s.zhang wrote:
> Hi Duncan,
>
> Thank you for your advice. I understand what you mean, but I have
> reviewed the latest btrfs code and I think the issue still exists.
>
> At line 71, if btrfs_get_delayed_node runs past this line and we then
> switch to another process, which runs past line 1282 and releases the
> delayed node at the end,
>
> and we then switch back to btrfs_get_delayed_node, it finds that the
> node is not NULL and uses it as normal. That means we use freed memory.
>
> At some point, this memory will be freed again.
>
> The latest code is as below.
>
> 1278 void btrfs_remove_delayed_node(struct btrfs_inode *inode)
> 1279 {
> 1280         struct btrfs_delayed_node *delayed_node;
> 1281
> 1282         delayed_node = READ_ONCE(inode->delayed_node);
> 1283         if (!delayed_node)
> 1284                 return;
> 1285
> 1286         inode->delayed_node = NULL;
> 1287         btrfs_release_delayed_node(delayed_node);
> 1288 }
>
> 64 static struct btrfs_delayed_node *btrfs_get_delayed_node(
> 65                 struct btrfs_inode *btrfs_inode)
> 66 {
> 67         struct btrfs_root *root = btrfs_inode->root;
> 68         u64 ino = btrfs_ino(btrfs_inode);
> 69         struct btrfs_delayed_node *node;
> 70
> 71         node = READ_ONCE(btrfs_inode->delayed_node);
> 72         if (node) {
> 73                 refcount_inc(&node->refs);
> 74                 return node;
> 75         }
> 76
> 77         spin_lock(&root->inode_lock);
> 78         node = radix_tree_lookup(&root->delayed_nodes_tree, ino);

Your analysis is correct; however, it's missing one crucial point:
btrfs_remove_delayed_node is called only from btrfs_evict_inode, and
inodes are evicted when all other references have been dropped. Check
the code in evict_inodes(): inodes are added to the dispose list when
their i_count is 0, at which point there should be no references to
this inode. This invalidates your analysis...

> On 2018-09-18 13:05, Duncan wrote:
>> sunny.s.zhang posted on Tue, 18 Sep 2018 08:28:14 +0800 as excerpted:
>>
>>> My OS (4.1.12) panics in kmem_cache_alloc, which is called by
>>> btrfs_get_or_create_delayed_node.
>>>
>>> I found that the freelist of the slub is wrong.
>>
>> [Not a dev, just a btrfs list regular and user, myself. But here's a
>> general btrfs list recommendations reply...]
>>
>> You appear to mean kernel 4.1.12 -- confirmed by the version reported
>> in the posted dump: 4.1.12-112.14.13.el6uek.x86_64
>>
>> OK, so from the perspective of this forward-development-focused list,
>> kernel 4.1 is pretty ancient history, but you do have a number of
>> options.
>>
>> First let's consider the general situation. Most people choose an
>> enterprise distro for supported stability, and that's certainly a
>> valid thing to want. However, btrfs, while now reaching early maturity
>> for the basics (single device in single or dup mode, and multi-device
>> in single/raid0/1/10 modes; note that raid56 mode is newer and less
>> mature), remains under quite heavy development, and keeping reasonably
>> current is recommended for that reason.
>>
>> So you chose an enterprise distro presumably to lock in supported
>> stability for several years, but you chose a filesystem, btrfs, that's
>> still under heavy development, with reasonably current kernels and
>> userspace recommended as tending to have the known bugs fixed. There's
>> a bit of a conflict there, and the /general/ recommendation would thus
>> be to consider whether one or the other of those choices is
>> inappropriate for your use-case, because it's really quite likely that
>> if you really want the stability of an enterprise distro and kernel,
>> btrfs isn't as stable a filesystem as you're likely to want to match
>> with it.
>>
>> Alternatively, if you want something newer to match the
>> still-under-heavy-development btrfs, you very likely want a distro
>> that's not focused on years-old stability just for the sake of it. One
>> or the other is likely to be a poor match for your needs, and choosing
>> something else that's a better match is likely to be a much better
>> experience for you.
>>
>> But perhaps you do have reason to want to run the newer and not quite
>> to traditional enterprise-distro level stability btrfs, on an
>> otherwise older and very stable enterprise distro. That's fine,
>> provided you know what you're getting yourself into, and are prepared
>> to deal with it.
>>
>> In that case, for best support from the list, we'd recommend running
>> one of the latest two kernels in either the current or mainline LTS
>> tracks.
>>
>> For the current track, with 4.18 being the latest kernel, that'd be
>> 4.18 or 4.17, as available on kernel.org (tho 4.17 is already EOL, no
>> further releases, at 4.17.19).
>>
>> For the mainline-LTS track, 4.14 and 4.9 are the latest two LTS series
>> kernels, tho IIRC 4.19 is scheduled to be this year's LTS (or was it
>> 4.18 and it's just not ou
Re: btrfs panic problem
On 2018/9/19 8:35 AM, sunny.s.zhang wrote:
>
> On 2018-09-19 08:05, Qu Wenruo wrote:
>>
>> On 2018/9/18 8:28 AM, sunny.s.zhang wrote:
>>> Hi All,
>>>
>>> My OS (4.1.12) panics in kmem_cache_alloc, which is called by
>>> btrfs_get_or_create_delayed_node.
>> Any reproducer?
>>
>> Anyway, we need a reproducer as a testcase.
>
> I have had a try, but could not reproduce it yet.

Since it's just one hit in a production environment, I'm afraid we need
to inject some sleep or delay into this code and try bombing it with
fsstress. Despite that, I have no good idea on how to reproduce it.

Thanks,
Qu

> Any advice on how to reproduce it?
>
>> The code looks
>>
>>> I found that the freelist of the slub is wrong.
>>>
>>> crash> struct kmem_cache_cpu 887e7d7a24b0
>>>
>>> struct kmem_cache_cpu {
>>>   freelist = 0x2026, <<< the value is the id of one inode
>>>   tid = 29567861,
>>>   page = 0xea0132168d00,
>>>   partial = 0x0
>>> }
>>>
>>> Also, I found two different btrfs inodes pointing to the same
>>> delayed_node. It means that the same slub object is used twice.
>>>
>>> I think this object was freed twice, so its next pointer points to
>>> itself and we get the same object twice.
>>>
>>> Using this object again breaks the freelist.
>>>
>>> The following code can make the delayed node be freed twice, but I
>>> have not found which process does it.
>>>
>>> Process A (btrfs_evict_inode)           Process B
>>>
>>> calls btrfs_remove_delayed_node         calls btrfs_get_delayed_node
>>>
>>>                 node = ACCESS_ONCE(btrfs_inode->delayed_node);
>>>
>>> BTRFS_I(inode)->delayed_node = NULL;
>>> btrfs_release_delayed_node(delayed_node);
>>>
>>>                 if (node) {
>>>                         atomic_inc(&node->refs);
>>>                         return node;
>>>                 }
>>>
>>>                 ..
>>>
>>>                 btrfs_release_delayed_node(delayed_node);
>>>
>>> 1313 void btrfs_remove_delayed_node(struct inode *inode)
>>> 1314 {
>>> 1315         struct btrfs_delayed_node *delayed_node;
>>> 1316
>>> 1317         delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
>>> 1318         if (!delayed_node)
>>> 1319                 return;
>>> 1320
>>> 1321         BTRFS_I(inode)->delayed_node = NULL;
>>> 1322         btrfs_release_delayed_node(delayed_node);
>>> 1323 }
>>>
>>> 87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct inode *inode)
>>> 88 {
>>> 89         struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
>>> 90         struct btrfs_root *root = btrfs_inode->root;
>>> 91         u64 ino = btrfs_ino(inode);
>>> 92         struct btrfs_delayed_node *node;
>>> 93
>>> 94         node = ACCESS_ONCE(btrfs_inode->delayed_node);
>>> 95         if (node) {
>>> 96                 atomic_inc(&node->refs);
>>> 97                 return node;
>>> 98         }
>>>
>> The analysis looks valid.
>> Can be fixed by adding a spinlock.
>>
>> Just wondering why we didn't hit it.
>
> It just appeared once in our production environment.
>
> Thanks,
> Sunny
>>
>> Thanks,
>> Qu
>>
>>> Thanks,
>>> Sunny
>>>
>>> PS: panic information
>>>
>>> PID: 73638 TASK: 887deb586200 CPU: 38 COMMAND: "dockerd"
>>> #0 [88130404f940] machine_kexec at 8105ec10
>>> #1 [88130404f9b0] crash_kexec at 811145b8
>>> #2 [88130404fa80] oops_end at 8101a868
>>> #3 [88130404fab0] no_context at 8106ea91
>>> #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
>>> #5 [88130404fb50] bad_area_nosemaphore at 8106eda3
>>> #6 [88130404fb60] __do_page_fault at 8106f328
>>> #7 [88130404fbd0] do_page_fault at 8106f637
>>> #8 [88130404fc10] page_fault at 816f6308
>>>     [exception RIP: kmem_cache_alloc+121]
>>>     RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
>>>     RAX:   RBX:   RCX: 01c32b76
>>>     RDX: 01c32b75  RSI:   RDI: 000224b0
>>>     RBP: 88130404fd08  R8: 887e7d7a24b0  R9:
>>>     R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
>>>     R13: 2026  R14: 887e3e230a00  R15: a01abf49
>>>     ORIG_RAX:   CS: 0010  SS: 0018
>>> #9 [88130404fd10] btrfs_get_or_create_delayed_node at a01abf49 [btrfs]
>>> #10 [88130404fd60] btrfs_delayed_update_inode at a01aea12 [btrfs]
>>> #11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs]
>>> #12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs]
>>> #13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs]
>>> #14 [88130404fe50] touch_atime at 812286d3
>>> #15 [88130404fe90] iterate_dir at 81221929
>>> #16 [88130404fee0] sys_getdents64 at 81221a19
>>> #17 [88130404ff50] system_call_fastpath at 816f2594
>>>     RIP: 006b68e4  RSP: 00c866259080  RFLAGS: 0246
>>>
Re: btrfs panic problem
On 2018-09-19 08:05, Qu Wenruo wrote:

On 2018/9/18 8:28 AM, sunny.s.zhang wrote:

Hi All,

My OS (4.1.12) panics in kmem_cache_alloc, which is called by
btrfs_get_or_create_delayed_node.

Any reproducer?

Anyway, we need a reproducer as a testcase.

I have had a try, but could not reproduce it yet.

Any advice on how to reproduce it?

The code looks

I found that the freelist of the slub is wrong.

crash> struct kmem_cache_cpu 887e7d7a24b0

struct kmem_cache_cpu {
  freelist = 0x2026, <<< the value is the id of one inode
  tid = 29567861,
  page = 0xea0132168d00,
  partial = 0x0
}

Also, I found two different btrfs inodes pointing to the same
delayed_node. It means that the same slub object is used twice.

I think this object was freed twice, so its next pointer points to
itself and we get the same object twice. Using this object again
breaks the freelist.

The following code can make the delayed node be freed twice, but I have
not found which process does it.

Process A (btrfs_evict_inode)            Process B

calls btrfs_remove_delayed_node          calls btrfs_get_delayed_node

                                         node = ACCESS_ONCE(btrfs_inode->delayed_node);

BTRFS_I(inode)->delayed_node = NULL;
btrfs_release_delayed_node(delayed_node);

                                         if (node) {
                                                 atomic_inc(&node->refs);
                                                 return node;
                                         }

                                         ..

                                         btrfs_release_delayed_node(delayed_node);

1313 void btrfs_remove_delayed_node(struct inode *inode)
1314 {
1315         struct btrfs_delayed_node *delayed_node;
1316
1317         delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
1318         if (!delayed_node)
1319                 return;
1320
1321         BTRFS_I(inode)->delayed_node = NULL;
1322         btrfs_release_delayed_node(delayed_node);
1323 }

87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct inode *inode)
88 {
89         struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
90         struct btrfs_root *root = btrfs_inode->root;
91         u64 ino = btrfs_ino(inode);
92         struct btrfs_delayed_node *node;
93
94         node = ACCESS_ONCE(btrfs_inode->delayed_node);
95         if (node) {
96                 atomic_inc(&node->refs);
97                 return node;
98         }

The analysis looks valid.
Can be fixed by adding a spinlock.

Just wondering why we didn't hit it.

It just appeared once in our production environment.

Thanks,
Sunny

Thanks,
Qu

Thanks,
Sunny

PS: panic information

PID: 73638 TASK: 887deb586200 CPU: 38 COMMAND: "dockerd"
#0 [88130404f940] machine_kexec at 8105ec10
#1 [88130404f9b0] crash_kexec at 811145b8
#2 [88130404fa80] oops_end at 8101a868
#3 [88130404fab0] no_context at 8106ea91
#4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
#5 [88130404fb50] bad_area_nosemaphore at 8106eda3
#6 [88130404fb60] __do_page_fault at 8106f328
#7 [88130404fbd0] do_page_fault at 8106f637
#8 [88130404fc10] page_fault at 816f6308
    [exception RIP: kmem_cache_alloc+121]
    RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
    RAX:   RBX:   RCX: 01c32b76
    RDX: 01c32b75  RSI:   RDI: 000224b0
    RBP: 88130404fd08  R8: 887e7d7a24b0  R9:
    R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
    R13: 2026  R14: 887e3e230a00  R15: a01abf49
    ORIG_RAX:   CS: 0010  SS: 0018
#9 [88130404fd10] btrfs_get_or_create_delayed_node at a01abf49 [btrfs]
#10 [88130404fd60] btrfs_delayed_update_inode at a01aea12 [btrfs]
#11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs]
#12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs]
#13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs]
#14 [88130404fe50] touch_atime at 812286d3
#15 [88130404fe90] iterate_dir at 81221929
#16 [88130404fee0] sys_getdents64 at 81221a19
#17 [88130404ff50] system_call_fastpath at 816f2594
    RIP: 006b68e4  RSP: 00c866259080  RFLAGS: 0246
    RAX: ffda  RBX: 00c828dbbe00  RCX: 006b68e4
    RDX: 1000  RSI: 00c83da14000  RDI: 0011
    RBP:   R8:   R9:
    R10:   R11: 0246  R12: 00c7
    R13: 02174e74  R14: 0555  R15: 0038
    ORIG_RAX: 00d9  CS: 0033  SS: 002b

We also found list double-add information, including n_list and p_list:

[8642921.110568] [ cut here ]
[8642921.167929] WARNING: CPU: 38 PID: 73638 at lib/list_debug.c:33 __list_add+0xbe/0xd0()
[8642921.263780] list_add corruption. prev->next should be next (887e40fa5368), but was ff:ff884c85a362
Re: btrfs panic problem
On 2018/9/18 上午8:28, sunny.s.zhang wrote:
> Hi All,
>
> My OS (4.1.12) panicked in kmem_cache_alloc, which is called by
> btrfs_get_or_create_delayed_node.

Any reproducer? Anyway we need a reproducer as a testcase.

The code looks

> I found that the freelist of the slub is wrong.
>
> crash> struct kmem_cache_cpu 887e7d7a24b0
> struct kmem_cache_cpu {
>   freelist = 0x2026,   <<< the value is the id of one inode
>   tid = 29567861,
>   page = 0xea0132168d00,
>   partial = 0x0
> }
>
> And I found two different btrfs inodes pointing to the same
> delayed_node. It means that the same slub object is used twice.
>
> I think this slub object was freed twice, so the freelist next
> pointer of this object points to itself and we get the same object
> twice. Using this object again then breaks the freelist.
>
> The following interleaving would make the delayed node be freed
> twice, but I have not found which code path triggers it:
>
> Process A (btrfs_evict_inode)          Process B
>
> call btrfs_remove_delayed_node         call btrfs_get_delayed_node
>
>                                        node = ACCESS_ONCE(btrfs_inode->delayed_node);
>
> BTRFS_I(inode)->delayed_node = NULL;
> btrfs_release_delayed_node(delayed_node);
>
>                                        if (node) {
>                                                atomic_inc(&node->refs);
>                                                return node;
>                                        }
>                                        ...
>                                        btrfs_release_delayed_node(delayed_node);
>
> 1313 void btrfs_remove_delayed_node(struct inode *inode)
> 1314 {
> 1315         struct btrfs_delayed_node *delayed_node;
> 1316
> 1317         delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
> 1318         if (!delayed_node)
> 1319                 return;
> 1320
> 1321         BTRFS_I(inode)->delayed_node = NULL;
> 1322         btrfs_release_delayed_node(delayed_node);
> 1323 }
>
>   87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct
> inode *inode)
>   88 {
>   89         struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
>   90         struct btrfs_root *root = btrfs_inode->root;
>   91         u64 ino = btrfs_ino(inode);
>   92         struct btrfs_delayed_node *node;
>   93
>   94         node = ACCESS_ONCE(btrfs_inode->delayed_node);
>   95         if (node) {
>   96                 atomic_inc(&node->refs);
>   97                 return node;
>   98         }

The analysis looks valid.

Can be fixed by adding a spinlock. Just wondering why we didn't hit it.

Thanks,
Qu

> Thanks,
> Sunny

[quoted panic backtrace snipped]
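Qu's suggestion above, serializing the unprotected pointer read with the same lock that protects the clearing, can be sketched in userspace C. This is an illustrative model only, not the actual btrfs patch: `inode_model`, `get_delayed_node`, and the other names are made up, a pthread mutex stands in for `root->inode_lock`, and C11 atomics stand in for the kernel refcount.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>

struct delayed_node {
    atomic_int refs;
};

struct inode_model {
    pthread_mutex_t lock;              /* stands in for root->inode_lock */
    struct delayed_node *delayed_node;
};

/* get: take the lock before reading the pointer and bumping refs, so we
 * can never take a reference on a node whose last reference is
 * concurrently being dropped by remove. */
static struct delayed_node *get_delayed_node(struct inode_model *i)
{
    struct delayed_node *node;

    pthread_mutex_lock(&i->lock);
    node = i->delayed_node;
    if (node)
        atomic_fetch_add(&node->refs, 1);
    pthread_mutex_unlock(&i->lock);
    return node;
}

/* release: drop one reference; free the node on the last one */
static void release_delayed_node(struct delayed_node *node)
{
    if (atomic_fetch_sub(&node->refs, 1) == 1)
        free(node);
}

/* remove: clear the inode's pointer under the same lock, then drop the
 * reference the inode held.  A concurrent get() either sees the pointer
 * before it is cleared (and has already taken its own ref under the
 * lock) or sees NULL -- never a half-released node. */
static void remove_delayed_node(struct inode_model *i)
{
    struct delayed_node *node;

    pthread_mutex_lock(&i->lock);
    node = i->delayed_node;
    i->delayed_node = NULL;
    pthread_mutex_unlock(&i->lock);
    if (node)
        release_delayed_node(node);
}
```

The cost of this sketch is that the lock-free fast path in `btrfs_get_delayed_node()` is gone; the upstream code keeps that fast path, which is exactly why the window discussed in this thread exists.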
Re: btrfs panic problem
Hi Duncan,

Thank you for your advice. I understand what you mean, but I have
reviewed the latest btrfs code and I think the issue still exists.

If btrfs_get_delayed_node() runs past line 71 and the CPU then
switches to another process, which runs past line 1282 and releases
the delayed node at the end, then when btrfs_get_delayed_node()
resumes it finds the node is not NULL and uses it as normal. That
means we are using freed memory, and at some point this memory will be
freed again.

The latest code is as below.

1278 void btrfs_remove_delayed_node(struct btrfs_inode *inode)
1279 {
1280         struct btrfs_delayed_node *delayed_node;
1281
1282         delayed_node = READ_ONCE(inode->delayed_node);
1283         if (!delayed_node)
1284                 return;
1285
1286         inode->delayed_node = NULL;
1287         btrfs_release_delayed_node(delayed_node);
1288 }

  64 static struct btrfs_delayed_node *btrfs_get_delayed_node(
  65                 struct btrfs_inode *btrfs_inode)
  66 {
  67         struct btrfs_root *root = btrfs_inode->root;
  68         u64 ino = btrfs_ino(btrfs_inode);
  69         struct btrfs_delayed_node *node;
  70
  71         node = READ_ONCE(btrfs_inode->delayed_node);
  72         if (node) {
  73                 refcount_inc(&node->refs);
  74                 return node;
  75         }
  76
  77         spin_lock(&root->inode_lock);
  78         node = radix_tree_lookup(&root->delayed_nodes_tree, ino);

On 2018/09/18 13:05, Duncan wrote:
[Duncan's reply quoted in full; snipped]
Re: btrfs panic problem
Add Junxiao.

On 2018/09/18 13:05, Duncan wrote:
[Duncan's reply quoted in full; snipped]
Re: btrfs panic problem
sunny.s.zhang posted on Tue, 18 Sep 2018 08:28:14 +0800 as excerpted:

> My OS (4.1.12) panicked in kmem_cache_alloc, which is called by
> btrfs_get_or_create_delayed_node.
>
> I found that the freelist of the slub is wrong.

[Not a dev, just a btrfs list regular and user, myself. But here's a
general btrfs list recommendations reply...]

You appear to mean kernel 4.1.12 -- confirmed by the version reported
in the posted dump: 4.1.12-112.14.13.el6uek.x86_64

OK, so from the perspective of this forward-development-focused list,
kernel 4.1 is pretty ancient history, but you do have a number of
options. First let's consider the general situation.

Most people choose an enterprise distro for supported stability, and
that's certainly a valid thing to want. However, btrfs, while now
reaching early maturity for the basics (single device in single or dup
mode, and multi-device in single/raid0/1/10 modes; note that raid56
mode is newer and less mature), remains under quite heavy development,
and keeping reasonably current is recommended for that reason.

So you chose an enterprise distro, presumably to lock in supported
stability for several years, but you chose a filesystem, btrfs, that's
still under heavy development, with reasonably current kernels and
userspace recommended as tending to have the known bugs fixed. There's
a bit of a conflict there, and the /general/ recommendation would thus
be to consider whether one or the other of those choices is
inappropriate for your use-case, because it's really quite likely that
if you really want the stability of an enterprise distro and kernel,
btrfs isn't as stable a filesystem as you're likely to want to match
with it. Alternatively, if you want something newer to match the
still-under-heavy-development btrfs, you very likely want a distro
that's not focused on years-old stability just for the sake of it. One
or the other is likely to be a poor match for your needs, and choosing
something else that's a better match is likely to be a much better
experience for you.

But perhaps you do have reason to want to run the newer and not quite
up to traditional enterprise-distro-level stability btrfs on an
otherwise older and very stable enterprise distro. That's fine,
provided you know what you're getting yourself into and are prepared
to deal with it.

In that case, for best support from the list, we'd recommend running
one of the latest two kernels in either the current or mainline LTS
tracks. For the current track, with 4.18 being the latest kernel,
that'd be 4.18 or 4.17, as available on kernel.org (tho 4.17 is
already EOL, no further releases, at 4.17.19).

For the mainline-LTS track, 4.14 and 4.9 are the latest two LTS series
kernels, tho IIRC 4.19 is scheduled to be this year's LTS (or was it
4.18 and it's just not out of normal stable range yet so not yet
marked LTS?), so it'll be coming up soon and 4.9 will then be dropping
to third LTS series and thus out of our best recommended range. 4.4
was the previous LTS, and while still in LTS support, it is outside
the two newest LTS series that this list recommends. And of course 4.1
is older than 4.4, so as I said, in btrfs development terms it's quite
ancient indeed... quite out of practical support range here, tho of
course we'll still try, but in many cases the first question when any
problem's reported is going to be whether it's reproducible on
something closer to current.

But... you ARE on an enterprise kernel, likely on an enterprise
distro, and very possibly actually paying /them/ for support. So
you're not without options if you prefer to stay with your supported
enterprise kernel. If you're paying them for support, you might as
well use it, and of course of the very many fixes since 4.1, they know
what they've backported and what they haven't, so they're far better
placed to provide that support in any case.

Or, given what you posted, you appear to be reasonably able to do at
least limited kernel-dev-level analysis yourself. Given that, you're
already reasonably well placed to simply decide to stick with what you
have and take the support you can get, diving into things yourself if
necessary.

So those are your kernel options. What about userspace btrfs-progs?

Generally speaking, while the filesystem's running, it's the kernel
code doing most of the work. If you have old userspace, it simply
means you can't take advantage of some of the newer features, as the
old userspace doesn't know how to call for them.

But the situation changes as soon as you have problems and can't
mount, because it's userspace code that runs to try to fix that sort
of problem, or failing that, it's userspace code that btrfs restore
runs to try to grab what files can be grabbed off of the unmountable
filesystem.

So for routine operation it's no big deal if userspace is a bit old,
at least as long as it's new enough to have all the newer command
formats, etc, that you need, and for com
Re: btrfs panic problem
Sorry, let me correct some errors:

Process A (btrfs_evict_inode)          Process B

call btrfs_remove_delayed_node         call btrfs_get_delayed_node

                                       node = ACCESS_ONCE(btrfs_inode->delayed_node);

BTRFS_I(inode)->delayed_node = NULL;
btrfs_release_delayed_node(delayed_node);

                                       if (node) {
                                               atomic_inc(&node->refs);
                                               return node;
                                       }
                                       ...
                                       btrfs_release_delayed_node(delayed_node);

On 2018/09/18 08:28, sunny.s.zhang wrote:
[original message quoted in full; snipped]
btrfs panic problem
Hi All,

My OS (4.1.12) panicked in kmem_cache_alloc, which is called by
btrfs_get_or_create_delayed_node.

I found that the freelist of the slub is wrong:

crash> struct kmem_cache_cpu 887e7d7a24b0
struct kmem_cache_cpu {
  freelist = 0x2026,   <<< the value is the id of one inode
  tid = 29567861,
  page = 0xea0132168d00,
  partial = 0x0
}

And I found two different btrfs inodes pointing to the same
delayed_node. It means that the same slub object is used twice.

I think this slub object was freed twice, so the freelist next pointer
of this object points to itself and we get the same object twice.
Using this object again then breaks the freelist.

The following interleaving would make the delayed node be freed twice,
but I have not found which code path triggers it:

Process A (btrfs_evict_inode)          Process B

call btrfs_remove_delayed_node         call btrfs_get_delayed_node

                                       node = ACCESS_ONCE(btrfs_inode->delayed_node);

BTRFS_I(inode)->delayed_node = NULL;
btrfs_release_delayed_node(delayed_node);

                                       if (node) {
                                               atomic_inc(&node->refs);
                                               return node;
                                       }
                                       ...
                                       btrfs_release_delayed_node(delayed_node);

1313 void btrfs_remove_delayed_node(struct inode *inode)
1314 {
1315         struct btrfs_delayed_node *delayed_node;
1316
1317         delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
1318         if (!delayed_node)
1319                 return;
1320
1321         BTRFS_I(inode)->delayed_node = NULL;
1322         btrfs_release_delayed_node(delayed_node);
1323 }

  87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct
inode *inode)
  88 {
  89         struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
  90         struct btrfs_root *root = btrfs_inode->root;
  91         u64 ino = btrfs_ino(inode);
  92         struct btrfs_delayed_node *node;
  93
  94         node = ACCESS_ONCE(btrfs_inode->delayed_node);
  95         if (node) {
  96                 atomic_inc(&node->refs);
  97                 return node;
  98         }

Thanks,
Sunny


PS: panic information

PID: 73638  TASK: 887deb586200  CPU: 38  COMMAND: "dockerd"
 #0 [88130404f940] machine_kexec at 8105ec10
 #1 [88130404f9b0] crash_kexec at 811145b8
 #2 [88130404fa80] oops_end at 8101a868
 #3 [88130404fab0] no_context at 8106ea91
 #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
 #5 [88130404fb50] bad_area_nosemaphore at 8106eda3
 #6 [88130404fb60] __do_page_fault at 8106f328
 #7 [88130404fbd0] do_page_fault at 8106f637
 #8 [88130404fc10] page_fault at 816f6308
    [exception RIP: kmem_cache_alloc+121]
    RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
    RAX:  RBX:  RCX: 01c32b76
    RDX: 01c32b75  RSI:  RDI: 000224b0
    RBP: 88130404fd08  R8: 887e7d7a24b0  R9:
    R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
    R13: 2026  R14: 887e3e230a00  R15: a01abf49
    ORIG_RAX:  CS: 0010  SS: 0018
 #9 [88130404fd10] btrfs_get_or_create_delayed_node at a01abf49 [btrfs]
#10 [88130404fd60] btrfs_delayed_update_inode at a01aea12 [btrfs]
#11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs]
#12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs]
#13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs]
#14 [88130404fe50] touch_atime at 812286d3
#15 [88130404fe90] iterate_dir at 81221929
#16 [88130404fee0] sys_getdents64 at 81221a19
#17 [88130404ff50] system_call_fastpath at 816f2594
    RIP: 006b68e4  RSP: 00c866259080  RFLAGS: 0246
    RAX: ffda  RBX: 00c828dbbe00  RCX: 006b68e4
    RDX: 1000  RSI: 00c83da14000  RDI: 0011
    RBP:  R8:  R9:
    R10:  R11: 0246  R12: 00c7
    R13: 02174e74  R14: 0555  R15: 0038
    ORIG_RAX: 00d9  CS: 0033  SS: 002b

We also found list double-add warnings, on both n_list and p_list:

[8642921.110568] ------------[ cut here ]------------
[8642921.167929] WARNING: CPU: 38 PID: 73638 at lib/list_debug.c:33
__list_add+0xbe/0xd0()
[8642921.263780] list_add corruption. prev->next should be next
(887e40fa5368), but was 884c85a36288. (prev=884c85a36288).
[8642921.405490] Modules linked in: ipt_MASQUERADE
nf_nat_masquerade_ipv4 xt_conntrack iptable_filter arc4 ecb ppp_mppe
ppp_async crc_ccitt ppp_generic slhc nfsv3 nfs_acl rpcsec_gss_krb5
auth_rpcgss nfsv4 nfs fscache lockd sunrpc grace veth xt_nat
xt_addrtype br_netfilter bridge tcp_diag inet_diag oracleacfs(POE)
oracleadvm(POE) oracleoks(POE) oracleasm autofs4 dm_queue_length
cpufreq_powersave be2iscsi iscsi_bo
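The freelist corruption described above (freelist = 0x2026, an inode number) can be modeled in a few lines of userspace C. This is a minimal, hypothetical sketch, not SLUB itself: like SLUB, it stores the next-free pointer in the free object's first word, so a double free links the object to itself, the allocator hands out the same object twice, and the second user's data is later interpreted as the freelist head.

```c
#include <assert.h>
#include <stddef.h>

/* Toy per-cpu freelist: each free object's first word holds the
 * pointer to the next free object, just like SLUB's embedded
 * freelist. */
struct freelist { void *head; };

static void fl_free(struct freelist *fl, void *obj)
{
    *(void **)obj = fl->head;      /* store old head inside the object */
    fl->head = obj;
}

static void *fl_alloc(struct freelist *fl)
{
    void *obj = fl->head;
    if (obj)
        fl->head = *(void **)obj;  /* follow the embedded next pointer */
    return obj;
}
```

Freeing the same object twice makes its embedded next pointer point at itself, so consecutive allocations return the same memory; once a user stores an inode number in it, that number becomes the "next free" pointer, which matches the corrupted freelist seen in the crash dump.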