Re: xfstests generic/476 failed on btrfs(errno=-12 Out of memory, kernel 5.11.10)
Hi,

This patch works.

From: Dennis Zhou
To: Wang Yugui
Cc: Vlastimil Babka, linux...@kvack.org, linux-btrfs@vger.kernel.org
Date: Thu, 8 Apr 2021 13:48:33 +
Subject: Re: unexpected -ENOMEM from percpu_counter_init()

Ah. Can you try the following patch?
https://lore.kernel.org/lkml/20210408035736.883861-4-g...@fb.com/

Best Regards
Wang Yugui (wangyu...@e16-tech.com)
2021/04/13
Re: xfstests generic/476 failed on btrfs(errno=-12 Out of memory, kernel 5.11.10)
Had what looks like the same issue happening on a server:

[19146.391015] [ cut here ]
[19146.391017] BTRFS: Transaction aborted (error -12)
[19146.391035] WARNING: CPU: 13 PID: 1825871 at fs/btrfs/transaction.c:1684 create_pending_snapshot+0x912/0xd10
[19146.391036] Modules linked in: bcache crc64 loop dm_crypt bfq xfs dm_mod st sr_mod cdrom intel_powerclamp coretemp dcdbas kvm_intel snd_pcm snd_timer kvm snd irqbypass soundcore mgag200 serio_raw pcspkr drm_kms_helper evdev joydev iTCO_wdt iTCO_vendor_support i2c_algo_bit i7core_edac sg ipmi_si ipmi_devintf ipmi_msghandler wmi acpi_power_meter button ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi drm configfs ip_tables x_tables autofs4 raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx raid1 raid0 multipath linear md_mod sd_mod hid_generic usbhid hid crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd ahci cryptd glue_helper mpt3sas libahci uhci_hcd ehci_pci psmouse ehci_hcd lpc_ich raid_class libata nvme scsi_transport_sas mfd_core usbcore nvme_core scsi_mod t10_pi bnx2
[19146.391092] CPU: 13 PID: 1825871 Comm: btrfs Tainted: G W I 5.10.26 #1
[19146.391093] Hardware name: Dell Inc. PowerEdge R510/0DPRKF, BIOS 1.14.0 05/30/2018
[19146.391095] RIP: 0010:create_pending_snapshot+0x912/0xd10
[19146.391097] Code: 48 0f ba aa 40 0a 00 00 02 72 28 83 f8 fb 74 48 83 f8 e2 74 43 89 c6 48 c7 c7 70 2d 10 82 48 89 85 78 ff ff ff e8 d5 65 55 00 <0f> 0b 48 8b 85 78 ff ff ff 89 c1 ba 94 06 00 00 48 c7 c6 70 46 e4
[19146.391098] RSP: 0018:c900201c3b00 EFLAGS: 00010286
[19146.391099] RAX: RBX: 8881ba393200 RCX: 0fb98b88
[19146.391100] RDX: ffd8 RSI: 0027 RDI: 0fb98b80
[19146.391101] RBP: c900201c3bd0 R08: 825e2148 R09: 00027ffb
[19146.391101] R10: 8000 R11: 3fff R12: 888119dd39c0
[19146.391102] R13: 888248c36800 R14: 888a1bf69800 R15: fff4
[19146.391103] FS: 7f1d7c9488c0() GS:0fb8() knlGS:
[19146.391104] CS: 0010 DS: ES: CR0: 80050033
[19146.391105] CR2: 7fffef58d000 CR3: 00028c988004 CR4: 000206e0
[19146.391106] Call Trace:
[19146.39] ? create_pending_snapshots+0xa2/0xc0
[19146.391112] create_pending_snapshots+0xa2/0xc0
[19146.391114] btrfs_commit_transaction+0x4b9/0xb40
[19146.391116] ? start_transaction+0xd2/0x580
[19146.391119] btrfs_mksubvol+0x29e/0x450
[19146.391122] btrfs_mksnapshot+0x7b/0xb0
[19146.391124] __btrfs_ioctl_snap_create+0x16f/0x180
[19146.391126] btrfs_ioctl_snap_create_v2+0xb3/0x130
[19146.391128] btrfs_ioctl+0x15f/0x3040
[19146.391131] ? __x64_sy
Re: xfstests generic/476 failed on btrfs(errno=-12 Out of memory, kernel 5.11.10)
Hi,

> You are showing that the server has 192G of installed memory, you have
> not shown any stats which prove at the time of failure what is the state
> of the MM subsystem. At the very least at the time of failure inspect
> the output of :
>
> cat /proc/meminfo
>
> and "free -m" commands.

Only one xfstest job is running on this server.

Best Regards
Wang Yugui (wangyu...@e16-tech.com)
2021/03/30
Re: xfstests generic/476 failed on btrfs(errno=-12 Out of memory, kernel 5.11.10)
On 30.03.21 г. 9:24, Wang Yugui wrote:
> Hi, Nikolay Borisov
>
> With a lot of dump_stack()/printk inserted around ENOMEM in btrfs code,
> we found the call stack for ENOMEM.
> See the file -btrfs-dump_stack-when-ENOMEM.patch
>
> #cat /usr/hpc-bio/xfstests/results//generic/476.dmesg
> ...
> [ 5759.102929] ENOMEM btrfs_drew_lock_init
> [ 5759.102943] ENOMEM btrfs_init_fs_root
> [ 5759.102947] [ cut here ]
> [ 5759.102950] BTRFS: Transaction aborted (error -12)
> [ 5759.103052] WARNING: CPU: 14 PID: 2741468 at
> /ssd/hpc-bio/linux-5.10.27/fs/btrfs/transaction.c:1705
> create_pending_snapshot+0xb8c/0xd50 [btrfs]
> ...
>
> btrfs_drew_lock_init() returns -ENOMEM;
> this is the source:
>
> 	/*
> 	 * We might be called under a transaction (e.g. indirect backref
> 	 * resolution) which could deadlock if it triggers memory reclaim
> 	 */
> 	nofs_flag = memalloc_nofs_save();
> 	ret = btrfs_drew_lock_init(&root->snapshot_lock);
> 	memalloc_nofs_restore(nofs_flag);
> 	if (ret == -ENOMEM) printk("ENOMEM btrfs_drew_lock_init\n");
> 	if (ret)
> 		goto fail;
>
> And the source comes from:
>
> commit dcc3eb9638c3c927f1597075e851d0a16300a876
> Author: Nikolay Borisov
> Date: Thu Jan 30 14:59:45 2020 +0200
>
>     btrfs: convert snapshot/nocow exlcusion to drew lock
>
> Any advice to fix this ENOMEM problem?

This is likely coming from changed behavior in MM; it doesn't seem related
to btrfs. We have multiple places where nofs_save() is called. By the
same token, the failure might have occurred in any other place, in any
other piece of code which uses memalloc_nofs_save. There is no
indication that this is directly related to btrfs.

> The top command shows that this server has enough memory.
>
> The hardware of this server:
> CPU: Xeon(R) CPU E5-2660 v2(10 core) *2
> memory: 192G, no swap

You are showing that the server has 192G of installed memory, but you have
not shown any stats which prove what the state of the MM subsystem was at
the time of failure. At the very least, at the time of failure inspect
the output of:

cat /proc/meminfo

and "free -m".
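
The request above is about the memory state at the moment of failure, not the installed total. A minimal userspace sketch of how such a snapshot could be captured and queried, assuming the usual "Name: value kB" layout of /proc/meminfo. The helper names meminfo_field_kb() and meminfo_read() are hypothetical and not part of xfstests or any tool mentioned in this thread:

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: look up one field (e.g. "MemAvailable") in a
 * buffer holding /proc/meminfo-style text and return its value in kB,
 * or -1 if the field is not present or cannot be parsed.
 */
long meminfo_field_kb(const char *text, const char *field)
{
	size_t flen = strlen(field);
	const char *line = text;

	while (line && *line) {
		/* A match is the field name at line start followed by ':' */
		if (strncmp(line, field, flen) == 0 && line[flen] == ':') {
			long kb;

			if (sscanf(line + flen + 1, "%ld", &kb) == 1)
				return kb;
			return -1;
		}
		line = strchr(line, '\n');
		if (line)
			line++;	/* step past the newline */
	}
	return -1;
}

/*
 * Read the live /proc/meminfo into buf (Linux only). Returns the number
 * of bytes read, or -1 if the buffer is empty or the file cannot be opened.
 */
long meminfo_read(char *buf, size_t size)
{
	FILE *f;
	size_t n;

	if (size == 0)
		return -1;
	f = fopen("/proc/meminfo", "r");
	if (!f)
		return -1;
	n = fread(buf, 1, size - 1, f);
	buf[n] = '\0';
	fclose(f);
	return (long)n;
}
```

Calling meminfo_read() and then meminfo_field_kb(buf, "MemFree") from whatever hook fires when the "Transaction aborted" warning appears would record the kind of evidence asked for here. How to trigger it at the right moment is left open.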
Re: xfstests generic/476 failed on btrfs(errno=-12 Out of memory, kernel 5.11.10)
Hi,

> kmem_cache_zalloc() without __GFP_NOFAIL may fail.
>
> btrfs uses kmem_cache_zalloc() with GFP_NOFS mostly,
> and only a few places with __GFP_NOFAIL.
>
> xfs uses kmem_cache_zalloc() with __GFP_NOFAIL mostly.
>
> It is very difficult to test all cases of failure in kmem_cache_zalloc().
>
> Should btrfs use kmem_cache_zalloc() with __GFP_NOFAIL just like xfs,
> or use a mempool with pre-allocation to prevent failure?

I tried both ways.

1) Add __GFP_NOFAIL to kmem_cache_zalloc();
   see 0001-btrfs-add-__GFP_NOFAIL-to-kmem_cache.patch.
   The problem still happened in the test.

2) Switch to mempool_t for btrfs_path;
   see 0001-btrfs-switch-to-mempool_t-for-btrfs_path.patch.
   The problem has not yet happened in the test.

But memory allocation failure is difficult to test, so we need more review.

Best Regards
Wang Yugui (wangyu...@e16-tech.com)
2021/03/28

0001-btrfs-add-__GFP_NOFAIL-to-kmem_cache.patch
Description: Binary data
0001-btrfs-switch-to-mempool_t-for-btrfs_path.patch
Description: Binary data
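
The two options tried above get their guarantee from different places: __GFP_NOFAIL asks the allocator itself to keep retrying, while a mempool keeps a reserve of pre-allocated objects to fall back on when the normal allocation fails. The sketch below is a userspace analogy of the second idea only. It is not the kernel mempool_t API and not the attached patch, and every name in it (reserve_pool, pool_alloc, always_fail, ...) is hypothetical:

```c
#include <stdlib.h>

/* Allocator callback (must return malloc-compatible memory), so a
 * failing allocator can be injected to exercise the fallback path. */
typedef void *(*alloc_fn)(size_t size);

struct reserve_pool {
	size_t obj_size;	/* size of each object */
	size_t min_nr;		/* number of objects kept in reserve */
	size_t cur_nr;		/* reserve objects currently available */
	void **reserve;		/* stack of pre-allocated objects */
	alloc_fn alloc;		/* normal allocation path */
};

/* Fault-injection stub: simulates an allocator that always fails. */
void *always_fail(size_t size)
{
	(void)size;
	return NULL;
}

/* Pre-allocate min_nr objects up front; return NULL if that fails. */
struct reserve_pool *pool_create(size_t min_nr, size_t obj_size, alloc_fn alloc)
{
	struct reserve_pool *p = malloc(sizeof(*p));

	if (!p)
		return NULL;
	p->obj_size = obj_size;
	p->min_nr = min_nr;
	p->cur_nr = 0;
	p->alloc = alloc;
	p->reserve = calloc(min_nr ? min_nr : 1, sizeof(void *));
	if (!p->reserve) {
		free(p);
		return NULL;
	}
	while (p->cur_nr < min_nr) {
		void *obj = malloc(obj_size);

		if (!obj) {
			/* Undo the partial reserve before giving up */
			while (p->cur_nr)
				free(p->reserve[--p->cur_nr]);
			free(p->reserve);
			free(p);
			return NULL;
		}
		p->reserve[p->cur_nr++] = obj;
	}
	return p;
}

/* Try the normal allocator first; fall back to the reserve on failure. */
void *pool_alloc(struct reserve_pool *p)
{
	void *obj = p->alloc(p->obj_size);

	if (obj)
		return obj;
	if (p->cur_nr)
		return p->reserve[--p->cur_nr];
	return NULL;	/* reserve exhausted; this sketch never waits */
}

/* Refill the reserve first; only release memory once it is full again. */
void pool_free(struct reserve_pool *p, void *obj)
{
	if (p->cur_nr < p->min_nr)
		p->reserve[p->cur_nr++] = obj;
	else
		free(obj);
}

void pool_destroy(struct reserve_pool *p)
{
	while (p->cur_nr)
		free(p->reserve[--p->cur_nr]);
	free(p->reserve);
	free(p);
}
```

With always_fail as the allocator, pool_create(2, 64, always_fail) still hands out two objects from the reserve before pool_alloc() returns NULL. The guarantee therefore only holds for as many concurrent users as the reserve was sized for, which is one reason the mempool variant also needs the review asked for above.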
Re: xfstests generic/476 failed on btrfs(errno=-12 Out of memory, kernel 5.11.10)
Hi,

kmem_cache_zalloc() without __GFP_NOFAIL may fail.

btrfs uses kmem_cache_zalloc() with GFP_NOFS mostly,
and only a few places with __GFP_NOFAIL.

xfs uses kmem_cache_zalloc() with __GFP_NOFAIL mostly.

It is very difficult to test all cases of failure in kmem_cache_zalloc().

Should btrfs use kmem_cache_zalloc() with __GFP_NOFAIL just like xfs,
or use a mempool with pre-allocation to prevent failure?

Best Regards
Wang Yugui (wangyu...@e16-tech.com)
2021/03/27
Re: xfstests generic/476 failed on btrfs(errno=-12 Out of memory, kernel 5.11.10)
Hi,

These call stacks share the same root failure:

struct btrfs_path *btrfs_alloc_path(void)
{
	return kmem_cache_zalloc(btrfs_path_cachep, GFP_NOFS);
}

fs/btrfs/transaction.c:1679 create_pending_snapshot+0xc1a/0xda0 [btrfs]
	(fail)new_root = btrfs_get_new_fs_root(fs_info, objectid, anon_dev);
	(fail)btrfs_get_root_ref
	(fail)btrfs_alloc_path

fs/btrfs/ioctl.c:718 create_subvol+0x888/0x8f0 [btrfs]
	(fail)new_root = btrfs_get_new_fs_root(fs_info, objectid, anon_dev);
	(fail)btrfs_get_root_ref
	(fail)btrfs_alloc_path

Any advice on whether this is a btrfs usage problem, a Linux mm
implementation problem, or expected behavior?

This server has 192G memory and no swap.

/etc/sysctl.conf
#10G/1G
vm.dirty_bytes=10737418240
vm.dirty_background_bytes=1073741824

And the filesystem is 10GiB:
# cat /usr/hpc-bio/xfstests/results//generic/476.full
btrfs-progs v5.10.1
See http://btrfs.wiki.kernel.org for more information.

Detected a SSD, turning off metadata duplication. Mkfs with -m dup if you want to force metadata duplication.
Label:              (null)
UUID:               776508dd-165d-4150-89a9-0cdd13a0004a
Node size:          16384
Sector size:        4096
Filesystem size:    10.00GiB
Block group profiles:
  Data:             single            8.00MiB
  Metadata:         single            8.00MiB
  System:           single            4.00MiB
SSD detected:       yes
Incompat features:  extref, skinny-metadata, no-holes
Runtime features:   free-space-tree
Checksum:           crc32c
Number of devices:  1
Devices:
   ID        SIZE  PATH
    1    10.00GiB  /dev/sdb1

Best Regards
Wang Yugui (wangyu...@e16-tech.com)
2021/03/27
Re: xfstests generic/476 failed on btrfs(errno=-12 Out of memory, kernel 5.11.10)
Hi,

It is easier to reproduce this problem on SSD/SAS than on SSD/NVMe.

I have not yet been able to reproduce this problem on another server:
CPU: Xeon(R) CPU E5-2680 v2(10 core) *2
memory: 192G, no swap
disk: SSD/NVMe with same partition size as SSD/SAS.

This problem also happened on kernel 5.10.26 + btrfs backport from
5.12.0-rc4, with a different call stack.

[10459.782442] run fstests generic/476 at 2021-03-27 15:02:14
[10459.988507] BTRFS info (device nvme0n1p1): has skinny extents
[10459.988515] BTRFS info (device nvme0n1p1): using free space tree
[10459.991086] BTRFS info (device nvme0n1p1): enabling ssd optimizations
[10460.062565] BTRFS: device fsid 776508dd-165d-4150-89a9-0cdd13a0004a devid 1 transid 6 /dev/sdb1 scanned by mkfs.btrfs (2713399)
[10460.075938] BTRFS info (device sdb1): has skinny extents
[10460.075947] BTRFS info (device sdb1): flagging fs with big metadata feature
[10460.075950] BTRFS info (device sdb1): using free space tree
[10460.077791] BTRFS info (device sdb1): enabling ssd optimizations
[10460.078662] BTRFS info (device sdb1): checking UUID tree
[10604.622052] [ cut here ]
[10604.622062] BTRFS: Transaction aborted (error -12)
[10604.622182] WARNING: CPU: 10 PID: 2713438 at fs/btrfs/ioctl.c:718 create_subvol+0x888/0x8f0 [btrfs]
[10604.622187] Modules linked in: dm_thin_pool dm_persistent_data dm_bio_prison dm_flakey loop rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rfkill rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_umad snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi intel_rapl_msr intel_rapl_common snd_hda_intel snd_intel_dspcfg soundwire_intel soundwire_generic_allocation snd_soc_core sb_edac x86_pkg_temp_thermal snd_compress iTCO_wdt intel_powerclamp intel_pmc_bxt snd_pcm_dmaengine coretemp soundwire_cadence mei_wdt mei_hdcp iTCO_vendor_support kvm_intel snd_hda_codec dcdbas dell_smm_hwmon snd_hda_core kvm ac97_bus snd_hwdep snd_seq snd_seq_device irqbypass snd_pcm rapl intel_cstate mei_me snd_timer i2c_i801 intel_uncore snd mei i2c_smbus lpc_ich soundcore nvme_rdma nvme_fabrics rdma_cm iw_cm ib_cm rdmavt rdma_rxe nfsd ib_uverbs ip6_udp_tunnel udp_tunnel ib_core auth_rpcgss nfs_acl
[10604.622244] lockd grace nfs_ssc ip_tables xfs radeon i2c_algo_bit bnx2x ttm drm_kms_helper cec crct10dif_pclmul crc32_pclmul crc32c_intel drm nvme mpt3sas e1000e pcspkr mdio ghash_clmulni_intel nvme_core raid_class scsi_transport_sas wmi dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua btrfs xor raid6_pq sunrpc i2c_dev [last unloaded: scsi_debug]
[10604.622292] CPU: 10 PID: 2713438 Comm: fsstress Tainted: G S 5.10.26-3.el7.x86_64 #1
[10604.622296] Hardware name: Dell Inc. Precision T7610/0NK70N, BIOS A18 09/11/2019
[10604.622333] RIP: 0010:create_subvol+0x888/0x8f0 [btrfs]
[10604.622337] Code: 8b 40 50 f0 48 0f ba a8 50 0a 00 00 03 72 1d 41 83 ff fb 74 37 41 83 ff e2 74 31 44 89 fe 48 c7 c7 f0 44 59 c0 e8 ec 6b 5a f6 <0f> 0b 48 8b bd 30 ff ff ff 44 89 f9 ba ce 02 00 00 48 c7 c6 80 2a
[10604.622342] RSP: 0018:afd7326cfc08 EFLAGS: 00010286
[10604.622346] RAX: RBX: 9071a992ca00 RCX: 0027
[10604.622348] RDX: 0027 RSI: 90812f818a80 RDI: 90812f818a88
[10604.622351] RBP: afd7326cfce8 R08: R09: c000d1cc
[10604.622354] R10: fffd00d0 R11: afd7326cfa10 R12: fff4
[10604.622358] R13: 905445386000 R14: 9071e9b40230 R15: fff4
[10604.622361] FS: 7fb9e7398000() GS:90812f80() knlGS:
[10604.622364] CS: 0010 DS: ES: CR0: 80050033
[10604.622367] CR2: 009c5c40 CR3: 00020d826002 CR4: 001706e0
[10604.622370] Call Trace:
[10604.622411] btrfs_mksubvol+0x368/0x440 [btrfs]
[10604.622447] __btrfs_ioctl_snap_create+0x11c/0x170 [btrfs]
[10604.622455] ? _copy_from_user+0x3a/0x70
[10604.622488] btrfs_ioctl_snap_create_v2+0x111/0x140 [btrfs]
[10604.622522] btrfs_ioctl+0x9d5/0x2f80 [btrfs]
[10604.622528] ? __handle_mm_fault+0x797/0x7c0
[10604.622534] ? __x64_sys_ioctl+0x84/0xc0
[10604.622536] __x64_sys_ioctl+0x84/0xc0
[10604.622542] do_syscall_64+0x33/0x40
[10604.622549] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[10604.622552] RIP: 0033:0x7fb9e668988b
[10604.622556] Code: 0f 1e fa 48 8b 05 fd 95 2c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d cd 95 2c 00 f7 d8 64 89 01 48
[10604.622561] RSP: 002b:7fff461628b8 EFLAGS: 0202 ORIG_RAX: 0010
[10604.622565] RAX: ffda RBX: RCX: 7fb9e668988b
[10604.622569] RDX: 7fff461628c0 RSI: 50009418 RDI: 0004
[10604.622572] RBP: 0004 R08: R09: 0006
[10604.622575] R10: R11: 0