[lustre-discuss] Lustre 2.12.6 client crashes
Date: Thu, 20 Jan 2022 12:07:40 +
From: Christopher Mountford
To: lustre-discuss@lists.lustre.org
Subject: Client crashes
User-Agent: NeoMutt/20170306 (1.8.0)

Hi All,

We've started getting some fairly regular client panics on our Lustre 2.12.7 filesystem. Looking at the stack trace, I think we are hitting this bug: https://jira.whamcloud.com/browse/LU-12752. I note that a fix is in 2.15.0 - is this likely to be patched in a 2.12 release? We're still trying to isolate the job that is causing the crash, but once we have we should be able to reproduce this reliably.

Kind Regards,
Christopher.

Log entry:

Jan 20 10:23:39 lmem006 kernel: LustreError: 4661:0:(osc_cache.c:2519:osc_teardown_async_page()) extent 937e2756e4d0@{[0 -> 255/255], [2|0|-|cache|wi|92fdd1dd8b40], [1703936|1|+|-|932384f1e880|256| (null)]} trunc at 42.
Jan 20 10:23:39 lmem006 kernel: LustreError: 4661:0:(osc_cache.c:2519:osc_teardown_async_page()) ### extent: 937e2756e4d0 ns: alice3-OST001f-osc-938e6a743000 lock: 932384f1e880/0x6024b6d908313ce7 lrc: 2/0,0 mode: PW/PW res: [0x7c400:0x5c888a:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 65536->172031) flags: 0x8000200 nid: local remote: 0x345e4fe1c451a182 expref: -99 pid: 955 timeout: 0 lvb_type: 1
Jan 20 10:23:39 lmem006 kernel: LustreError: 4661:0:(osc_page.c:192:osc_page_delete()) page@933651225e00[2 93228480b2f0 4 1 (null)]
Jan 20 10:23:39 lmem006 kernel: LustreError: 4661:0:(osc_page.c:192:osc_page_delete()) vvp-page@933651225e50(0:0) vm@eaeada357d80 6f0879 3:0 933651225e00 42 lru
Jan 20 10:23:39 lmem006 kernel: LustreError: 4661:0:(osc_page.c:192:osc_page_delete()) lov-page@933651225e90, comp index: 1, gen: 6
Jan 20 10:23:39 lmem006 kernel: LustreError: 4661:0:(osc_page.c:192:osc_page_delete()) osc-page@933651225ec8 42: 1< 0x845fed 2 0 + - > 2< 172032 0 4096 0x0 0x420 | (null) 938e52a7d738 92fdd1dd8b40 > 3< 0 0 0 > 4< 0 0 8 1703936 - | - - + - > 5< - - + - | 0 - | 1 - ->
Jan 20 10:23:39 lmem006 kernel: LustreError: 4661:0:(osc_page.c:192:osc_page_delete()) end page@933651225e00
Jan 20 10:23:39 lmem006 kernel: LustreError: 4661:0:(osc_page.c:192:osc_page_delete()) Trying to teardown failed: -16
Jan 20 10:23:39 lmem006 kernel: LustreError: 4661:0:(osc_page.c:193:osc_page_delete()) ASSERTION( 0 ) failed:
Jan 20 10:23:40 lmem006 kernel: LustreError: 4661:0:(osc_page.c:193:osc_page_delete()) LBUG
Jan 20 10:23:40 lmem006 kernel: Pid: 4661, comm: diamond 3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021
Jan 20 10:23:40 lmem006 kernel: Call Trace:
Jan 20 10:23:40 lmem006 kernel: [] libcfs_call_trace+0x8c/0xc0 [libcfs]
Jan 20 10:23:40 lmem006 kernel: [] lbug_with_loc+0x4c/0xa0 [libcfs]
Jan 20 10:23:40 lmem006 kernel: [] osc_page_delete+0x48f/0x500 [osc]
Jan 20 10:23:40 lmem006 kernel: [] cl_page_delete0+0x80/0x220 [obdclass]
Jan 20 10:23:40 lmem006 kernel: [] cl_page_delete+0x33/0x110 [obdclass]
Jan 20 10:23:40 lmem006 kernel: [] ll_invalidatepage+0x7f/0x170 [lustre]
Jan 20 10:23:40 lmem006 kernel: [] do_invalidatepage_range+0x7d/0x90
Jan 20 10:23:40 lmem006 kernel: [] truncate_inode_page+0x77/0x80
Jan 20 10:23:40 lmem006 kernel: [] truncate_inode_pages_range+0x1ea/0x750
Jan 20 10:23:40 lmem006 kernel: [] truncate_inode_pages_final+0x4f/0x60
Jan 20 10:23:40 lmem006 kernel: [] ll_delete_inode+0x4f/0x230 [lustre]
Jan 20 10:23:40 lmem006 kernel: [] evict+0xb4/0x180
Jan 20 10:23:40 lmem006 kernel: [] iput+0xfc/0x190
Jan 20 10:23:40 lmem006 kernel: [] __dentry_kill+0x158/0x1d0
Jan 20 10:23:40 lmem006 kernel: [] dput+0xb5/0x1a0
Jan 20 10:23:40 lmem006 kernel: [] __fput+0x18d/0x230
Jan 20 10:23:40 lmem006 kernel: [] fput+0xe/0x10
Jan 20 10:23:40 lmem006 kernel: [] task_work_run+0xbb/0xe0
Jan 20 10:23:40 lmem006 kernel: [] do_notify_resume+0xa5/0xc0
Jan 20 10:23:40 lmem006 kernel: [] int_signal+0x12/0x17
Jan 20 10:23:40 lmem006 kernel: [] 0x
Jan 20 10:23:40 lmem006 kernel: Kernel panic - not syncing: LBUG

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
[lustre-discuss] MDT hanging
Hi,

We've had a couple of MDT hangs on 2 of our Lustre filesystems after updating to 2.12.6 (though I'm sure I've seen this exact behaviour on previous versions). The symptoms are a gradually increasing load on the affected MDS and processes doing I/O on the filesystem blocking indefinitely, showing messages on the client similar to:

Mar 9 15:37:22 spectre09 kernel: Lustre: 25309:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed out for slow reply: [sent 1615303641/real 1615303641] req@972dbe51bf00 x1692620480891456/t0(0) o44->ahome3-MDT0001-mdc-9718e3be@10.143.254.212@o2ib:12/10 lens 448/440 e 2 to 1 dl 1615304242 ref 2 fl Rpc:X/0/ rc 0/-1
Mar 9 15:37:22 spectre09 kernel: Lustre: ahome3-MDT0001-mdc-9718e3be: Connection to ahome3-MDT0001 (at 10.143.254.212@o2ib) was lost; in progress operations using this service will wait for recovery to complete
Mar 9 15:37:22 spectre09 kernel: Lustre: ahome3-MDT0001-mdc-9718e3be: Connection restored to 10.143.254.212@o2ib (at 10.143.254.212@o2ib)

There are also warnings of hung mdt_io tasks on the MDS, and Lustre debug log dumps to /tmp. Rebooting the affected MDS cleared the problem and everything recovered. Looking at the MDS system logs, the first sign of trouble appears to be:

Mar 9 15:24:11 amds01b kernel: VERIFY3(dr->dr_dbuf->db_level == level) failed (0 == 18446744073709551615)
Mar 9 15:24:11 amds01b kernel: PANIC at dbuf.c:3391:dbuf_sync_list()
Mar 9 15:24:11 amds01b kernel: Showing stack for process 18137
Mar 9 15:24:11 amds01b kernel: CPU: 3 PID: 18137 Comm: dp_sync_taskq Tainted: P OE 3.10.0-1160.2.1.el7_lustre.x86_64 #1
Mar 9 15:24:11 amds01b kernel: Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 07/16/2020
Mar 9 15:24:11 amds01b kernel: Call Trace:
Mar 9 15:24:11 amds01b kernel: [] dump_stack+0x19/0x1b
Mar 9 15:24:11 amds01b kernel: [] spl_dumpstack+0x44/0x50 [spl]
Mar 9 15:24:11 amds01b kernel: [] spl_panic+0xc9/0x110 [spl]
Mar 9 15:24:11 amds01b kernel: [] ? tracing_is_on+0x15/0x30
Mar 9 15:24:11 amds01b kernel: [] ? tracing_record_cmdline+0x1d/0x120
Mar 9 15:24:11 amds01b kernel: [] ? spl_kmem_free+0x35/0x40 [spl]
Mar 9 15:24:11 amds01b kernel: [] ? update_curr+0x14c/0x1e0
Mar 9 15:24:11 amds01b kernel: [] ? account_entity_dequeue+0xae/0xd0
Mar 9 15:24:11 amds01b kernel: [] dbuf_sync_list+0x7b/0xd0 [zfs]
Mar 9 15:24:11 amds01b kernel: [] dnode_sync+0x370/0x890 [zfs]
Mar 9 15:24:11 amds01b kernel: [] sync_dnodes_task+0x61/0x150 [zfs]
Mar 9 15:24:11 amds01b kernel: [] taskq_thread+0x2ac/0x4f0 [spl]
Mar 9 15:24:11 amds01b kernel: [] ? wake_up_state+0x20/0x20
Mar 9 15:24:11 amds01b kernel: [] ? taskq_thread_spawn+0x60/0x60 [spl]
Mar 9 15:24:11 amds01b kernel: [] kthread+0xd1/0xe0
Mar 9 15:24:11 amds01b kernel: [] ? insert_kthread_work+0x40/0x40
Mar 9 15:24:11 amds01b kernel: [] ret_from_fork_nospec_begin+0x7/0x21
Mar 9 15:24:11 amds01b kernel: [] ? insert_kthread_work+0x40/0x40

My read of this is that ZFS failed whilst syncing cached data out to disk and panicked (I guess this panic is internal to ZFS, as the system remained up and otherwise responsive - no kernel panic was triggered). Does this seem correct? The Pacemaker ZFS resource did not pick up the failure; it relies on 'zpool list -H -o health'. Is there any way anyone can think of that we can detect this sort of problem to trigger an automated reset of the affected server? Unfortunately I'd rebooted the server before I spotted the log entry. Next time I'll run some zfs commands to see what they return before rebooting. Any advice on what additional steps to take? I guess this is probably more a ZFS rather than a Lustre issue.

The MDSs are based on HPE DL360s connected to D3700 JBODs; the MDTs are on ZFS. CentOS 7.9, ZFS 0.7.13, Lustre 2.12.6, kernel 3.10.0-1160.2.1.el7_lustre.x86_64.

Kind Regards,
Christopher.
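One possible answer to the detection question above is to pair the zpool health check with a scan of the kernel log for the SPL panic marker, since an internal SPL/ZFS panic can leave the pool state reported as healthy. A minimal sketch follows - the script name and grep patterns are assumptions, not a tested Pacemaker agent:

```shell
#!/bin/sh
# zfs_panic_check.sh - hypothetical monitor sketch, not a tested agent.

# True (exit 0) if the log text on stdin contains the SPL panic marker,
# e.g. "PANIC at dbuf.c:3391:dbuf_sync_list()".
panic_in_log() {
    grep -q 'PANIC at '
}

status=0
# An internal SPL panic hangs the sync task without crashing the node,
# so scan the kernel ring buffer in addition to the pool health.
if dmesg 2>/dev/null | panic_in_log; then
    echo "SPL/ZFS internal panic detected" >&2
    status=1
fi
# Fall back to ZFS's own view of pool health (when the tools are present).
if command -v zpool >/dev/null 2>&1; then
    zpool status -x | grep -q 'all pools are healthy' || status=1
fi
# As a monitor script this would end with: exit "$status"
```

A Pacemaker setup could run something like this as an extra monitor (or feed a hardware watchdog) so a hung MDS gets fenced rather than left blocking clients.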
Re: [lustre-discuss] MDS using D3710 DAS
Hi Sid. We use the D3700s (and our D8000s) as JBODs, with ZFS providing the redundancy - do you have some kind of hardware RAID? If so, are your RAID controllers the array controllers or on the HBAs? Off the top of my head, if the latter, there might be an issue with multiple HBAs trying to assemble the same RAID array.

- Chris.

On Mon, Feb 15, 2021 at 08:42:43AM +1000, Sid Young wrote:
>Hi Christopher,
>Just some background, all servers are DL385's all servers are running
>the same image of Centos 7.9, The MDS HA pair have a SAS connected
>D3710 and the dual OSS HA pair have a D8000 each with 45 disks in each
>of them.
>The D3710 (which has 24x 960G SSD's) seams a bit hit and miss at
>presenting two LV's, I had setup a /lustre and /home which I was going
>to use ldiskfs rather than zfs however I am finding that the disks MAY
>present to both servers after some reboots but usually the first server
>to reboot see's the LV presented and the other only see's its local
>internal disks only, so the array appears to only present the LV's to
>one host most of the time.
>With the 4 OSS servers. i see the same issue, sometimes the LV's
>present and sometimes they don't.
>I was planning on setting up the OST's as ldiskfs as well, but I could
>also go zfs, my test bed system and my current HPC uses ldsikfs.
>Correct me if I am wrong, but disks should present to both servers all
>the time and using PCS I should be able to mount up a /lustre and /home
>one the first server while the disks present on the second server but
>no software is mounting them so there should be no issues?
>Sid Young
>
>On Fri, Feb 12, 2021 at 7:27 PM Christopher Mountford
><[1]cj...@leicester.ac.uk> wrote:
>
> Hi Sid,
> We've a similar hardware configuration - 2 MDS pairs and 1 OSS pair
> which each consist of 2 DL360 connected to a single D3700. However
> we are using Lustre on ZFS with each array split into 2 or 4 zpools
> (depending on the usage) and haven't seen any problems of this sort.
> Are you using ldiskfs?
> - Chris
> On Fri, Feb 12, 2021 at 03:14:58PM +1000, Sid Young wrote:
> >G'day all,
> >Is anyone using a HPe D3710 with two HPeDL380/385 servers in a MDS HA
> >Configuration? If so, is your D3710 presenting LV's to both servers at
> >the same time AND are you using PCS with the Lustre PCS Resources?
> >I've just received new kit and cannot get disk to present to the MDS
> >servers at the same time. :(
> >Sid Young
> > ___
> > lustre-discuss mailing list
> > [2]lustre-discuss@lists.lustre.org
> > [3]http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>
> References
>
>1. mailto:cj...@leicester.ac.uk
>2. mailto:lustre-discuss@lists.lustre.org
>3. http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
Re: [lustre-discuss] MDS using D3710 DAS
Hi Sid,

We've a similar hardware configuration - 2 MDS pairs and 1 OSS pair which each consist of 2 DL360 connected to a single D3700. However we are using Lustre on ZFS with each array split into 2 or 4 zpools (depending on the usage) and haven't seen any problems of this sort. Are you using ldiskfs?

- Chris

On Fri, Feb 12, 2021 at 03:14:58PM +1000, Sid Young wrote:
>G'day all,
>Is anyone using a HPe D3710 with two HPeDL380/385 servers in a MDS HA
>Configuration? If so, is your D3710 presenting LV's to both servers at
>the same time AND are you using PCS with the Lustre PCS Resources?
>I've just received new kit and cannot get disk to present to the MDS
>servers at the same time. :(
>Sid Young
> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
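On the presentation question in this thread: with a dual-domain SAS JBOD cabled correctly, the same LUNs should be visible from both HA nodes at once, and a common way to confirm that is to compare SCSI WWIDs across the hosts. A sketch (the helper name is made up; assumes udev's scsi_id is installed at its usual EL7 path):

```shell
#!/bin/sh
# list_wwids.sh - hypothetical helper: print "device WWID" pairs so the
# output can be diffed between the two HA nodes. A shared LUN should show
# the same WWID on both servers; a missing line means the array is not
# presenting that LUN to this host.
list_wwids() {
    for d in "$@"; do
        wwid=$(/usr/lib/udev/scsi_id -g "$d" 2>/dev/null) || wwid="(no id)"
        printf '%s %s\n' "$d" "$wwid"
    done
}

# Intended use on each node (requires root):
#   list_wwids /dev/sd?
#   multipath -ll    # if device-mapper-multipath manages the two SAS paths
```

If the WWID lists differ between the nodes, the problem is on the array/zoning side rather than in Pacemaker or Lustre.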
Re: [lustre-discuss] Lustre 2.12 client crashes
Thank you for your help. I've created an issue in the community JIRA for this: LU-13168.

Kind Regards,
Christopher.

On Mon, Jan 20, 2020 at 05:22:58PM +, Peter Jones wrote:
> Christopher
>
> Apologies for the confusing message about requesting an account for JIRA -
> I'll see if we can remove that message but I think that it might be
> system-generated. We've had to disable self-registration because of repeated
> hacking attempts via that mechanism. The message on the left "For questions
> or login request, send email to Jira administrators" works - the link there
> sends an email to i...@whamcloud.com and several requests come through per
> week via that channel - but I can see why the message on the right would draw
> your eye...
>
> Peter
>
> On 2020-01-20, 8:15 AM, "lustre-discuss on behalf of Christopher Mountford" wrote:
>
> We've seen 3 lustre client panics in the last few hours when using the
> b2_12 branch (we're using it on client nodes as it patches a data on MDT bug
> in 2.12.3. Still using 2.12.3 on MDS/OSS). This looks similar to
> LU-12581, which we had seen on our system before but was fixed in 2.12.3.
> Could this have been re-introduced in the b2_12 branch?
>
> I've included the dmesg from one of the panics below. Unfortunately we
> have not yet found a way to reproduce the problem. Has anyone seen anything
> similar to this?
>
> Is this mailing list a suitable place to ask for help on this sort of
> bug? I've been looking at the Whamcloud Community Jira, but the link to
> request an account returns "Your Jira administrator has not yet configured
> this contact form."
> > dmesg from failed client: > > [542909.741793] > = > [542909.741800] BUG kmalloc-8 (Tainted: G OE ): > Freechain corrupt > [542909.741802] > - > > [542909.741805] Disabling lock debugging due to kernel taint > [542909.741809] INFO: Slab 0xe0933440b3c0 objects=102 used=75 > fp=0x9bb6902cf558 flags=0x6f0081 > [542909.741812] INFO: Object 0x9bb6902cfad0 @offset=2768 > fp=0x7fff9bb6902cfdf0 > > [542909.741816] Redzone 9bb6902cfac8: bb 3b 3b 3b 3b bb bb bb > .... > [542909.741818] Object 9bb6902cfad0: 6b 6b 6b 6b 6b 6b 6b a5 > kkk. > [542909.741821] Redzone 9bb6902cfad8: bb bb bb 3b bb bb bb bb > ...; > [542909.741823] Padding 9bb6902cfae8: 5a 5a 5a 5a 5a 5a 5a 5a > > [542909.741828] CPU: 25 PID: 50461 Comm: pool Kdump: loaded Tainted: G > B OE 3.10.0-1062.9.1.el7.x86_64 #1 > [542909.741830] Hardware name: HP ProLiant BL460c Gen9, BIOS I36 > 10/21/2019 > [542909.741832] Call Trace: > [542909.741846] [] dump_stack+0x19/0x1b > [542909.741852] [] print_trailer+0x161/0x280 > [542909.741856] [] on_freelist+0xff/0x270 > [542909.741860] [] free_debug_processing+0x18d/0x270 > [542909.741867] [] ? kvfree+0x35/0x40 > [542909.741870] [] __slab_free+0x1ce/0x290 > [542909.741878] [] ? generic_setxattr+0x68/0x80 > [542909.741883] [] ? __vfs_setxattr_noperm+0x65/0x1b0 > [542909.741889] [] ? evm_inode_setxattr+0xe/0x10 > [542909.741892] [] ? kvfree+0x35/0x40 > [542909.741895] [] kfree+0x106/0x140 > [542909.741899] [] kvfree+0x35/0x40 > [542909.741902] [] setxattr+0x15b/0x1e0 > [542909.741909] [] ? putname+0x3d/0x60 > [542909.741914] [] ? user_path_at_empty+0x72/0xc0 > [542909.741920] [] ? __sb_start_write+0x58/0x120 > [542909.741926] [] ? do_utimes+0xf1/0x180 > [542909.741930] [] SyS_setxattr+0xb7/0x100 > [542909.741937] [] system_call_fastpath+0x25/0x2a > [542909.741940] > = > [542909.741942] BUG kmalloc-8 (Tainted: GB OE ): > Wrong object count. 
Counter is 75 but counted were 95 > [542909.741944] > - > > [542909.741947] INFO: Slab 0xe0933440b3c0 objects=102 used=75 > fp=0x9bb6902cf558 flags=0x6f0081 > [542909.741951] CPU: 25 PID: 50461 Comm: pool Kdump: loaded Tainted: G > B OE 3.10.0-1062.9.1.el7.x86_64 #1 > [542909.741953] Hardware name: HP ProLia
18 eb 2e 0f 1f 00 48 3b 72 e0 48 8d 42 e0 73 1d 48 8b 52 10 48 85 d2 74 0f <48> 3b 72 e8 72 e7 48 8b 52 08 48 85 d2 75 f1 48 85 c0 74 04 48
[542911.665436] RIP [] find_vma+0x3b/0x60
[542911.695917] RSP

--
# Dr. Christopher Mountford
# System specialist - Research Computing/HPC
#
# IT services,
# University of Leicester, University Road,
# Leicester, LE1 7RH, UK
#
# t: 0116 252 3471
# e: cj...@le.ac.uk
Re: [lustre-discuss] Lustre client crashes in Lustre 2.12.3 with data on MDT
Thank you for the suggestion. I've just grabbed and built b2_12 from git, and this does fix the problem. We may have to temporarily use this on clients whilst we wait for the 2.12.4 release.

Many Thanks,
Christopher.

On Fri, Jan 10, 2020 at 03:52:03PM +, Peter Jones wrote:
> While I'm not who you need to interpret the stack trace, I can decipher JIRA
> and the state of LU-12462 is that it is already landed for the upcoming
> 2.12.4 release. So, if you have a good reproducer, you could always test a
> single client on the tip of b2_12 (either building from git or else grabbing
> the latest build from https://build.whamcloud.com/job/lustre-b2_12/).
> What's there now is close to the finished article and this will let you
> know whether moving to 2.12.4 when it comes out will resolve this issue for
> you.
> > On 2020-01-10, 7:42 AM, "lustre-discuss on behalf of Christopher Mountford" > > wrote: > > Hi, > > We just switched to a new 2.12.3 Lustre storage system on our local HPC > cluster have seen a number of client node crashes - all leaving a similar > syslog entry: > > Jan 10 13:21:08 spectre15 kernel: LustreError: > 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) extent > 9238ac5133f0@{[0 -> 255/255], > [1|0|-|cache|wiY|9238a7370f00],[1703936|89|+|-|9238733e7180|256| > (null)]} > Jan 10 13:21:08 spectre15 kernel: LustreError: > 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) ### extent: > 9238ac5133f0 ns: alice3-OST0019-osc-9248e7337800 lock: > 9238733e7180/0x3c4db4a67c3d39eb lrc: 2/0,0 mode: PW/PW res: > [0x74400:0x132478:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] > (req 65536->262143) flags: 0x200 nid: local remote: > 0xda6676eba5fdbd0c expref: -99 pid: 24499 timeout: 0 lvb_type: 1 > Jan 10 13:21:08 spectre15 kernel: LustreError: > 24567:0:(osc_cache.c:1241:osc_extent_tree_dump0()) Dump object > 9238a7370f00 extents at osc_cache_writeback_range:3062, mppr: 256. > Jan 10 13:21:08 spectre15 kernel: LustreError: > 24567:0:(osc_cache.c:1246:osc_extent_tree_dump0()) extent > 9238ac5133f0@{[0 -> 255/255], [1|0|-|cache|wiY|9238a7370f00], > [1703936|89|+|-|9238733e7180|256| (null)]} in tree 1. 
> Jan 10 13:21:08 spectre15 kernel: LustreError: > 24567:0:(osc_cache.c:1246:osc_extent_tree_dump0()) ### extent: > 9238ac5133f0 ns: alice3-OST0019-osc-9248e7337800 lock: > 9238733e7180/0x3c4db4a67c3d39eb lrc: 2/0,0 mode: PW/PW res: > [0x74400:0x132478:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] > (req 65536->262143) flags: 0x200 nid: local remote: > 0xda6676eba5fdbd0c expref: -99 pid: 24499 timeout: 0 lvb_type: 1 > Jan 10 13:21:08 spectre15 kernel: LustreError: > 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) ASSERTION( > ext->oe_start >= start && ext->oe_end <= end ) failed: > Jan 10 13:21:08 spectre15 kernel: LustreError: > 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) LBUG > Jan 10 13:21:08 spectre15 kernel: Pid: 24567, comm: rm > 3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019 > Jan 10 13:21:08 spectre15 kernel: Call Trace: > Jan 10 13:21:08 spectre15 kernel: [] > libcfs_call_trace+0x8c/0xc0 [libcfs] > Jan 10 13:21:08 spectre15 kernel: [] > lbug_with_loc+0x4c/0xa0 [libcfs] > Jan 10 13:21:08 spectre15 kernel: [] > osc_cache_writeback_range+0xacd/0x1260 [osc] > Jan 10 13:21:08 spectre15 kernel: [] > osc_io_fsync_start+0x85/0x1a0 [osc] > Jan 10 13:21:08 spectre15 kernel: [] > cl_io_start+0x68/0x130 [obdclass] > Jan 10 13:21:08 spectre15 kernel: [] > lov_io_call.isra.7+0x87/0x140 [lov] > Jan 10 13:21:08 spectre15 kernel: [] > lov_io_start+0x56/0x150 [lov] > Jan 10 13:21:08 spectre15 kernel: [] > cl_io_start+0x68/0x130 [obdclass] > Jan 10 13:21:08 spectre15 kernel: [] > cl_io_loop+0xcc/0x1c0 [obdclass] > Jan 10 13:21:08 spectre15 kernel: [] > cl_sync_file_range+0x2db/0x380 [lustre] > Jan 10 13:21:08 spectre15 kernel: [] > ll_delete_inode+0x160/0x230 [lustre] > Jan 10 13:21:08 spectre15 kernel: [] evict+0xb4/0x180 > Jan 10 13:21:08 spectre15 kernel: [] iput+0xfc/0x190 > Jan 10 13:21:08 spectre15 kernel: [] > do_unlinkat+0x1ae/0x2d0 > Jan 10 13:21:08 spectre15 kernel: [] > SyS_unlinkat+0x1b/0x40 > Jan 10 13:
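For anyone following along, the "building from git" route mentioned above looks roughly like this on an EL7 client. This is a sketch only: configure options vary by site, and the kernel-devel path is an assumption.

```shell
#!/bin/sh
# Sketch of building Lustre client RPMs from the b2_12 branch.
# Wrapped in a function so nothing runs until explicitly invoked.
build_b2_12_client() {
    git clone git://git.whamcloud.com/fs/lustre-release.git || return 1
    cd lustre-release || return 1
    git checkout b2_12
    sh autogen.sh
    # Client-only build against the running kernel's headers
    # (path assumed; adjust to where kernel-devel is installed):
    ./configure --disable-server \
        --with-linux="/usr/src/kernels/$(uname -r)"
    make rpms    # RPMs land in the top-level build directory
}
# Run on a build host with kernel-devel, autoconf, and rpm-build installed:
#   build_b2_12_client
```

Installing the resulting client RPMs on a single test node, as Peter suggests, avoids touching the rest of the cluster.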
[lustre-discuss] Lustre client crashes in Lustre 2.12.3 with data on MDT
Hi,

We just switched to a new 2.12.3 Lustre storage system on our local HPC cluster and have seen a number of client node crashes - all leaving a similar syslog entry:

Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) extent 9238ac5133f0@{[0 -> 255/255], [1|0|-|cache|wiY|9238a7370f00],[1703936|89|+|-|9238733e7180|256| (null)]}
Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) ### extent: 9238ac5133f0 ns: alice3-OST0019-osc-9248e7337800 lock: 9238733e7180/0x3c4db4a67c3d39eb lrc: 2/0,0 mode: PW/PW res: [0x74400:0x132478:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 65536->262143) flags: 0x200 nid: local remote: 0xda6676eba5fdbd0c expref: -99 pid: 24499 timeout: 0 lvb_type: 1
Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:1241:osc_extent_tree_dump0()) Dump object 9238a7370f00 extents at osc_cache_writeback_range:3062, mppr: 256.
Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:1246:osc_extent_tree_dump0()) extent 9238ac5133f0@{[0 -> 255/255], [1|0|-|cache|wiY|9238a7370f00], [1703936|89|+|-|9238733e7180|256| (null)]} in tree 1.
Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:1246:osc_extent_tree_dump0()) ### extent: 9238ac5133f0 ns: alice3-OST0019-osc-9248e7337800 lock: 9238733e7180/0x3c4db4a67c3d39eb lrc: 2/0,0 mode: PW/PW res: [0x74400:0x132478:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 65536->262143) flags: 0x200 nid: local remote: 0xda6676eba5fdbd0c expref: -99 pid: 24499 timeout: 0 lvb_type: 1
Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) ASSERTION( ext->oe_start >= start && ext->oe_end <= end ) failed:
Jan 10 13:21:08 spectre15 kernel: LustreError: 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) LBUG
Jan 10 13:21:08 spectre15 kernel: Pid: 24567, comm: rm 3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019
Jan 10 13:21:08 spectre15 kernel: Call Trace:
Jan 10 13:21:08 spectre15 kernel: [] libcfs_call_trace+0x8c/0xc0 [libcfs]
Jan 10 13:21:08 spectre15 kernel: [] lbug_with_loc+0x4c/0xa0 [libcfs]
Jan 10 13:21:08 spectre15 kernel: [] osc_cache_writeback_range+0xacd/0x1260 [osc]
Jan 10 13:21:08 spectre15 kernel: [] osc_io_fsync_start+0x85/0x1a0 [osc]
Jan 10 13:21:08 spectre15 kernel: [] cl_io_start+0x68/0x130 [obdclass]
Jan 10 13:21:08 spectre15 kernel: [] lov_io_call.isra.7+0x87/0x140 [lov]
Jan 10 13:21:08 spectre15 kernel: [] lov_io_start+0x56/0x150 [lov]
Jan 10 13:21:08 spectre15 kernel: [] cl_io_start+0x68/0x130 [obdclass]
Jan 10 13:21:08 spectre15 kernel: [] cl_io_loop+0xcc/0x1c0 [obdclass]
Jan 10 13:21:08 spectre15 kernel: [] cl_sync_file_range+0x2db/0x380 [lustre]
Jan 10 13:21:08 spectre15 kernel: [] ll_delete_inode+0x160/0x230 [lustre]
Jan 10 13:21:08 spectre15 kernel: [] evict+0xb4/0x180
Jan 10 13:21:08 spectre15 kernel: [] iput+0xfc/0x190
Jan 10 13:21:08 spectre15 kernel: [] do_unlinkat+0x1ae/0x2d0
Jan 10 13:21:08 spectre15 kernel: [] SyS_unlinkat+0x1b/0x40
Jan 10 13:21:08 spectre15 kernel: [] system_call_fastpath+0x25/0x2a
Jan 10 13:21:08 spectre15 kernel: [] 0x
Jan 10 13:21:08 spectre15 kernel: Kernel panic - not syncing: LBUG

We were able to reproduce the error on a test system - it appears to be caused by removing multiple files with a single rm -f *; strangely, repeating this and deleting the files one at a time is fine (both results are reproducible). Only files with a data on MDT layout cause the crash.

We have been using the 2.12.3 client (with 2.10.7 servers) since December without issue. The problem seems to have started since we moved to a new Lustre 2.12.3 filesystem which has data on MDT enabled. We have confirmed that deleting files which do not have a data on MDT layout does not cause the above problem.

This looks to me like LU-12462 (https://jira.whamcloud.com/browse/LU-12462); however, it looks like this is only known to affect 2.13.0 (and 2.12.4) - not 2.12.3. I'm not familiar with Jira though, so I could be reading this wrong! Any suggestions on how best to report/resolve this? We have repeated the tests using a 2.13.0 test client and we do not see any crashes on this client (LU-12462 says fixed in 2.13).

Regards,
Christopher.

--
# Dr. Christopher Mountford
# System specialist - Research Computing/HPC
#
# IT services,
# University of Leicester, University Road,
# Leicester, LE1 7RH, UK
#
# t: 0116 252 3471
# e: cj...@le.ac.uk
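The reproducer described above (bulk rm of data-on-MDT files) can be scripted roughly as follows. The directory, component size, and file count are illustrative; `lfs setstripe -E 64K -L mdt` creates a first component on the MDT and needs a DoM-enabled filesystem, so run this on a disposable test client only.

```shell
#!/bin/sh
# Hypothetical reproducer sketch for the LBUG described above.
# Wrapped in a function so nothing runs until explicitly invoked.
reproduce_dom_unlink() {
    dir=${1:-/lustre/domtest}    # illustrative path
    mkdir -p "$dir" && cd "$dir" || return 1
    # First 64 KiB of every new file lives on the MDT (Data-on-MDT),
    # the remainder in a normal 1 MiB-striped OST component:
    lfs setstripe -E 64K -L mdt -E -1 -S 1M .
    for i in $(seq 1 32); do
        dd if=/dev/zero of="file.$i" bs=4k count=16 2>/dev/null
    done
    rm -f ./file.*   # bulk unlink - the pattern reported to trigger the crash
}
# On a *test* client only: reproduce_dom_unlink
```

Per the report, unlinking the same files one at a time does not trigger the assertion, so the loop plus single rm is the interesting part.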