[lustre-discuss] Lustre 2.12.6 client crashes

2022-01-20 Thread Christopher Mountford via lustre-discuss
Date: Thu, 20 Jan 2022 12:07:40 +
From: Christopher Mountford 
To: lustre-discuss@lists.lustre.org
Subject: Client crashes
User-Agent: NeoMutt/20170306 (1.8.0)

Hi All,

We've started getting some fairly regular client panics on our lustre 2.12.7 
filesystem. Looking at the stack trace, I think we are hitting this bug: 
https://jira.whamcloud.com/browse/LU-12752

I note that a fix is in 2.15.0; is this likely to be backported to a 2.12 release?
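
In the meantime, a quick way to check whether the fix has been pulled onto 
the maintenance branch is to search the branch history for the ticket number, 
since Whamcloud commit subjects carry the LU id - a sketch, assuming a 
checkout of lustre-release:

  git log --oneline --grep='LU-12752' b2_12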

We're still trying to isolate the job that is causing the crash, but once we 
have, we should be able to reproduce this reliably.

Kind Regards,
Christopher.

Log entry:

Jan 20 10:23:39 lmem006 kernel: LustreError: 
4661:0:(osc_cache.c:2519:osc_teardown_async_page()) extent 937e2756e4d0@{[0 
-> 255/255], [2|0|-|cache|wi|92fdd1dd8b40], 
[1703936|1|+|-|932384f1e880|256|  (null)]} trunc at 42.
Jan 20 10:23:39 lmem006 kernel: LustreError: 
4661:0:(osc_cache.c:2519:osc_teardown_async_page()) ### extent: 
937e2756e4d0 ns: alice3-OST001f-osc-938e6a743000 lock: 
932384f1e880/0x6024b6d908313ce7 lrc: 2/0,0 mode: PW/PW res:
[0x7c400:0x5c888a:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 
65536->172031) flags: 0x8000200 nid: local remote: 0x345e4fe1c451a182 
expref: -99 pid: 955 timeout: 0 lvb_type: 1
Jan 20 10:23:39 lmem006 kernel: LustreError: 
4661:0:(osc_page.c:192:osc_page_delete()) page@933651225e00[2 
93228480b2f0 4 1   (null)]
Jan 20 10:23:39 lmem006 kernel: LustreError: 
4661:0:(osc_page.c:192:osc_page_delete()) vvp-page@933651225e50(0:0) 
vm@eaeada357d80 6f0879 3:0 933651225e00 42 lru
Jan 20 10:23:39 lmem006 kernel: LustreError: 
4661:0:(osc_page.c:192:osc_page_delete()) lov-page@933651225e90, comp 
index: 1, gen: 6
Jan 20 10:23:39 lmem006 kernel: LustreError: 
4661:0:(osc_page.c:192:osc_page_delete()) osc-page@933651225ec8 42: 1< 
0x845fed 2 0 + - > 2< 172032 0 4096 0x0 0x420 |   (null) 
938e52a7d738 92fdd1dd8b40 > 3< 0 0 0 > 4< 0 0 8 1703936 - | - - + - >
5< - - + - | 0 - | 1 - ->
Jan 20 10:23:39 lmem006 kernel: LustreError: 
4661:0:(osc_page.c:192:osc_page_delete()) end page@933651225e00
Jan 20 10:23:39 lmem006 kernel: LustreError: 
4661:0:(osc_page.c:192:osc_page_delete()) Trying to teardown failed: -16
Jan 20 10:23:39 lmem006 kernel: LustreError: 
4661:0:(osc_page.c:193:osc_page_delete()) ASSERTION( 0 ) failed:
Jan 20 10:23:40 lmem006 kernel: LustreError: 
4661:0:(osc_page.c:193:osc_page_delete()) LBUG
Jan 20 10:23:40 lmem006 kernel: Pid: 4661, comm: diamond 
3.10.0-1160.49.1.el7.x86_64 #1 SMP Tue Nov 30 15:51:32 UTC 2021
Jan 20 10:23:40 lmem006 kernel: Call Trace:
Jan 20 10:23:40 lmem006 kernel: [] 
libcfs_call_trace+0x8c/0xc0 [libcfs]
Jan 20 10:23:40 lmem006 kernel: [] lbug_with_loc+0x4c/0xa0 
[libcfs]
Jan 20 10:23:40 lmem006 kernel: [] 
osc_page_delete+0x48f/0x500 [osc]
Jan 20 10:23:40 lmem006 kernel: [] cl_page_delete0+0x80/0x220 
[obdclass]
Jan 20 10:23:40 lmem006 kernel: [] cl_page_delete+0x33/0x110 
[obdclass]
Jan 20 10:23:40 lmem006 kernel: [] 
ll_invalidatepage+0x7f/0x170 [lustre]
Jan 20 10:23:40 lmem006 kernel: [] 
do_invalidatepage_range+0x7d/0x90
Jan 20 10:23:40 lmem006 kernel: [] 
truncate_inode_page+0x77/0x80
Jan 20 10:23:40 lmem006 kernel: [] 
truncate_inode_pages_range+0x1ea/0x750
Jan 20 10:23:40 lmem006 kernel: [] 
truncate_inode_pages_final+0x4f/0x60
Jan 20 10:23:40 lmem006 kernel: [] ll_delete_inode+0x4f/0x230 
[lustre]
Jan 20 10:23:40 lmem006 kernel: [] evict+0xb4/0x180
Jan 20 10:23:40 lmem006 kernel: [] iput+0xfc/0x190
Jan 20 10:23:40 lmem006 kernel: [] __dentry_kill+0x158/0x1d0
Jan 20 10:23:40 lmem006 kernel: [] dput+0xb5/0x1a0
Jan 20 10:23:40 lmem006 kernel: [] __fput+0x18d/0x230
Jan 20 10:23:40 lmem006 kernel: [] fput+0xe/0x10
Jan 20 10:23:40 lmem006 kernel: [] task_work_run+0xbb/0xe0
Jan 20 10:23:40 lmem006 kernel: [] do_notify_resume+0xa5/0xc0
Jan 20 10:23:40 lmem006 kernel: [] int_signal+0x12/0x17
Jan 20 10:23:40 lmem006 kernel: [] 0x
Jan 20 10:23:40 lmem006 kernel: Kernel panic - not syncing: LBUG


___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] MDT hanging

2021-03-09 Thread Christopher Mountford via lustre-discuss
Hi,

We've had a couple of MDT hangs on 2 of our lustre filesystems after updating 
to 2.12.6 (though I'm sure I've seen this exact behaviour on previous versions).

The symptoms are a gradually increasing load on the affected MDS and processes 
doing I/O on the filesystem blocking indefinitely, with messages on the 
client similar to:

Mar  9 15:37:22 spectre09 kernel: Lustre: 
25309:0:(client.c:2146:ptlrpc_expire_one_request()) @@@ Request sent has timed 
out for slow reply: [sent 1615303641/real 1615303641]  req@972dbe51bf00 
x1692620480891456/t0(0) o44->ahome3-MDT0001-mdc-9718e3be@10.143.254.212@o2ib:12/10 
lens 448/440 e 2 to 1 dl 1615304242 ref 2 fl Rpc:X/0/ rc 0/-1
Mar  9 15:37:22 spectre09 kernel: Lustre: ahome3-MDT0001-mdc-9718e3be: 
Connection to ahome3-MDT0001 (at 10.143.254.212@o2ib) was lost; in progress 
operations using this service will wait for recovery to complete
Mar  9 15:37:22 spectre09 kernel: Lustre: ahome3-MDT0001-mdc-9718e3be: 
Connection restored to 10.143.254.212@o2ib (at 10.143.254.212@o2ib)

There were warnings of hung mdt_io tasks on the MDS, and lustre debug logs were dumped to /tmp.

Rebooting the affected MDS cleared the problem and everything recovered.



Looking at the MDS system logs, the first sign of trouble appears to be:

Mar  9 15:24:11 amds01b kernel: VERIFY3(dr->dr_dbuf->db_level == level) failed 
(0 == 18446744073709551615)
Mar  9 15:24:11 amds01b kernel: PANIC at dbuf.c:3391:dbuf_sync_list()
Mar  9 15:24:11 amds01b kernel: Showing stack for process 18137
Mar  9 15:24:11 amds01b kernel: CPU: 3 PID: 18137 Comm: dp_sync_taskq Tainted: 
P   OE     3.10.0-1160.2.1.el7_lustre.x86_64 #1
Mar  9 15:24:11 amds01b kernel: Hardware name: HPE ProLiant DL360 
Gen10/ProLiant DL360 Gen10, BIOS U32 07/16/2020
Mar  9 15:24:11 amds01b kernel: Call Trace:
Mar  9 15:24:11 amds01b kernel: [] dump_stack+0x19/0x1b
Mar  9 15:24:11 amds01b kernel: [] spl_dumpstack+0x44/0x50 
[spl]
Mar  9 15:24:11 amds01b kernel: [] spl_panic+0xc9/0x110 [spl]
Mar  9 15:24:11 amds01b kernel: [] ? tracing_is_on+0x15/0x30
Mar  9 15:24:11 amds01b kernel: [] ? 
tracing_record_cmdline+0x1d/0x120
Mar  9 15:24:11 amds01b kernel: [] ? spl_kmem_free+0x35/0x40 
[spl]
Mar  9 15:24:11 amds01b kernel: [] ? update_curr+0x14c/0x1e0
Mar  9 15:24:11 amds01b kernel: [] ? 
account_entity_dequeue+0xae/0xd0
Mar  9 15:24:11 amds01b kernel: [] dbuf_sync_list+0x7b/0xd0 
[zfs]
Mar  9 15:24:11 amds01b kernel: [] dnode_sync+0x370/0x890 
[zfs]
Mar  9 15:24:11 amds01b kernel: [] 
sync_dnodes_task+0x61/0x150 [zfs]
Mar  9 15:24:11 amds01b kernel: [] taskq_thread+0x2ac/0x4f0 
[spl]
Mar  9 15:24:11 amds01b kernel: [] ? wake_up_state+0x20/0x20
Mar  9 15:24:11 amds01b kernel: [] ? 
taskq_thread_spawn+0x60/0x60 [spl]
Mar  9 15:24:11 amds01b kernel: [] kthread+0xd1/0xe0
Mar  9 15:24:11 amds01b kernel: [] ? 
insert_kthread_work+0x40/0x40
Mar  9 15:24:11 amds01b kernel: [] 
ret_from_fork_nospec_begin+0x7/0x21
Mar  9 15:24:11 amds01b kernel: [] ? 
insert_kthread_work+0x40/0x40




My read of this is that ZFS failed whilst syncing cached data out to disk and 
panicked (I guess this panic is internal to ZFS as the system remained up and 
otherwise responsive - no kernel panic triggered). Does this seem correct?

The Pacemaker ZFS resource did not pick up the failure; it relies on 'zpool 
list -H -o health'. Can anyone think of a way we could detect this sort of 
problem and trigger an automated reset of the affected server? 
Unfortunately I'd rebooted the server before I spotted the log entry. Next time 
I'll run some zfs commands to see what they return before rebooting.
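
Something along these lines is what I have in mind - a minimal sketch, 
assuming journald and a pcs-managed cluster; the grep pattern, log tag and 
escalation step are illustrative, not tested:

  #!/bin/sh
  # Watch the kernel log for SPL/ZFS PANIC messages, which 'zpool list -H -o
  # health' does not reflect, and escalate to a node-level failure that the
  # cluster can act on. Run from cron or a systemd timer every minute.
  if journalctl -k --since "5 min ago" | grep -Eq 'PANIC at .*\.c:[0-9]+'; then
      logger -p daemon.crit "zfs-panic-monitor: SPL PANIC detected"
      # Put the node in standby so pacemaker fails the MDT resources over;
      # fall back to a reboot if pcs itself is stuck.
      pcs node standby "$(hostname)" || systemctl reboot
  fi

A hardware watchdog fed by a similar check would be more robust than relying 
on the wedged node to reboot itself, but this would at least catch the case 
above.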

Any advice on what additional steps to take? I guess this is probably more of 
a ZFS issue than a Lustre one.

The MDSs are based on HPE DL360s connected to D3700 JBODs. The MDTs are on 
ZFS; CentOS 7.9, zfs 0.7.13, lustre 2.12.6, kernel 
3.10.0-1160.2.1.el7_lustre.x86_64.

Kind Regards,
Christopher.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] MDS using D3710 DAS

2021-02-15 Thread Christopher Mountford


Hi Sid.

We use the D3700s (and our D8000s) as JBODs with zfs providing the redundancy - 
do you have some kind of hardware RAID? If so, are your RAID controllers the 
array controllers or on the HBAs? Off the top of my head, if the latter, there 
might be an issue with multiple HBAs trying to assemble the same RAID array?
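
If it is plain disk presentation that's flaky, a quick check is to compare 
what each head actually sees - a sketch, assuming WWN-based names under 
/dev/disk/by-id:

  # run on both MDS nodes; the JBOD disks' WWNs should appear on both
  lsscsi
  ls -l /dev/disk/by-id/wwn-*

If the WWNs only show up on one host, the problem is below Lustre and PCS - 
cabling, zoning, or the array presentation itself.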

- Chris.

On Mon, Feb 15, 2021 at 08:42:43AM +1000, Sid Young wrote:
>Hi Christopher,
>Just some background: all servers are DL385s, all running
>the same image of Centos 7.9. The MDS HA pair have a SAS-connected
>D3710, and the two OSS HA pairs each have a D8000 with 45 disks
>in it.
>The D3710 (which has 24x 960G SSDs) seems a bit hit and miss at
>presenting two LVs. I had set up a /lustre and /home which I was going
>to use with ldiskfs rather than zfs; however, I am finding that the disks MAY
>present to both servers after some reboots, but usually the first server
>to reboot sees the LV presented and the other sees only its local
>internal disks, so the array appears to only present the LVs to
>one host most of the time.
>With the 4 OSS servers I see the same issue: sometimes the LVs
>present and sometimes they don't.
>I was planning on setting up the OSTs as ldiskfs as well, but I could
>also go zfs; my test bed system and my current HPC use ldiskfs.
>Correct me if I am wrong, but disks should present to both servers all
>the time, and using PCS I should be able to mount a /lustre and /home
>on the first server while the disks present on the second server but
>no software is mounting them, so there should be no issues?
>Sid Young
> 
>On Fri, Feb 12, 2021 at 7:27 PM Christopher Mountford
><[1]cj...@leicester.ac.uk> wrote:
> 
>  Hi Sid,
>  We've a similar hardware configuration - 2 MDS pairs and 1 OSS pair
>  which each consist of 2 DL360 connected to a single D3700. However
>  we are using Lustre on ZFS with each array split into 2 or 4 zpools
>  (depending on the usage) and haven't seen any problems of this sort.
>  Are you using ldiskfs?
>  - Chris
>  On Fri, Feb 12, 2021 at 03:14:58PM +1000, Sid Young wrote:
>  >G'day all,
>  >Is anyone using a HPe D3710 with two HPe DL380/385 servers in an MDS HA
>  >Configuration? If so, is your D3710 presenting LVs to both servers at
>  >the same time AND are you using PCS with the Lustre PCS Resources?
>  >I've just received new kit and cannot get disk to present to the MDS
>  >servers at the same time. :(
>  >Sid Young
>  > ___
>  > lustre-discuss mailing list
>  > [2]lustre-discuss@lists.lustre.org
>  > [3]http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
> 
> References
> 
>1. mailto:cj...@leicester.ac.uk
>2. mailto:lustre-discuss@lists.lustre.org
>3. 
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] MDS using D3710 DAS

2021-02-12 Thread Christopher Mountford
Hi Sid,

We've a similar hardware configuration - 2 MDS pairs and 1 OSS pair which each 
consist of 2 DL360 connected to a single D3700. However we are using Lustre on 
ZFS with each array split into 2 or 4 zpools (depending on the usage) and 
haven't seen any problems of this sort. Are you using ldiskfs?
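
For a concrete picture, one of our pools is created roughly along these lines 
- a sketch only; the pool name, WWNs and MGS NID are illustrative, and 
multihost (MMP) needs zfs >= 0.7:

  zpool create -o multihost=on mdt0pool \
      mirror /dev/disk/by-id/wwn-0xAAAA /dev/disk/by-id/wwn-0xBBBB
  mkfs.lustre --mdt --backfstype=zfs --fsname=home3 --index=0 \
      --mgsnode=10.10.10.1@o2ib mdt0pool/mdt0

With zfs providing the redundancy and MMP guarding against a double import, 
nothing on the array side has to "present" an LV to exactly one host.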

- Chris


On Fri, Feb 12, 2021 at 03:14:58PM +1000, Sid Young wrote:
>G'day all,
>Is anyone using a HPe D3710 with two HPe DL380/385 servers in an MDS HA
>Configuration? If so, is your D3710 presenting LVs to both servers at
>the same time AND are you using PCS with the Lustre PCS Resources?
>I've just received new kit and cannot get disk to present to the MDS
>servers at the same time. :(
>Sid Young

> ___
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre 2.12 client crashes

2020-01-22 Thread Christopher Mountford
Thank you for your help. I've created an issue in the community JIRA for this: 
LU-13168.

Kind Regards,
Christopher.

On Mon, Jan 20, 2020 at 05:22:58PM +, Peter Jones wrote:
> Christopher
> 
> Apologies for the confusing message about requesting an account for JIRA - 
> I'll see if we can remove that message but I think that it might be 
> system-generated. We've had to disable self-registration because of repeated 
> hacking attempts via that mechanism. The message on the left "For questions 
> or login request, send email to Jira administrators" works - the link there 
> sends an email to i...@whamcloud.com and several requests come through per 
> week via that channel - but I can see why the message on the right would draw 
> your eye...
> 
> Peter
> 
> On 2020-01-20, 8:15 AM, "lustre-discuss on behalf of Christopher Mountford" 
>  
> wrote:
> 
> We've seen 3 lustre client panics in the last few hours when using the 
> b2_12 branch (we're using it on client nodes as it patches a data on MDT bug 
> in 2.12.3. Still using 2.12.3 on MDS/OSS). This looks similar to 
> LU-12581, which we had seen on our system before but was fixed in 2.12.3. 
> Could this have been re-introduced in the b2_12 branch?
> 
> I've included the dmesg from one of the panics below. Unfortunately we 
> have not yet found a way to reproduce the problem. Has anyone seen anything 
> similar to this?
> 
> Is this mailing list a suitable place to ask for help on this sort of 
> bug? I've been looking at the Whamcloud Community Jira, but the link to 
> request an account returns "Your Jira administrator has not yet configured 
> this contact form."
> 
> dmesg from failed client:
> 
> [542909.741793] 
> =
> [542909.741800] BUG kmalloc-8 (Tainted: G   OE    ): 
> Freechain corrupt
> [542909.741802] 
> -
> 
> [542909.741805] Disabling lock debugging due to kernel taint
> [542909.741809] INFO: Slab 0xe0933440b3c0 objects=102 used=75 
> fp=0x9bb6902cf558 flags=0x6f0081
> [542909.741812] INFO: Object 0x9bb6902cfad0 @offset=2768 
> fp=0x7fff9bb6902cfdf0
> 
> [542909.741816] Redzone 9bb6902cfac8: bb 3b 3b 3b 3b bb bb bb 
>  ....
> [542909.741818] Object 9bb6902cfad0: 6b 6b 6b 6b 6b 6b 6b a5  
> kkk.
> [542909.741821] Redzone 9bb6902cfad8: bb bb bb 3b bb bb bb bb 
>  ...;
> [542909.741823] Padding 9bb6902cfae8: 5a 5a 5a 5a 5a 5a 5a 5a 
>  
> [542909.741828] CPU: 25 PID: 50461 Comm: pool Kdump: loaded Tainted: G
> B  OE     3.10.0-1062.9.1.el7.x86_64 #1
> [542909.741830] Hardware name: HP ProLiant BL460c Gen9, BIOS I36 
> 10/21/2019
> [542909.741832] Call Trace:
> [542909.741846]  [] dump_stack+0x19/0x1b
> [542909.741852]  [] print_trailer+0x161/0x280
> [542909.741856]  [] on_freelist+0xff/0x270
> [542909.741860]  [] free_debug_processing+0x18d/0x270
> [542909.741867]  [] ? kvfree+0x35/0x40
> [542909.741870]  [] __slab_free+0x1ce/0x290
> [542909.741878]  [] ? generic_setxattr+0x68/0x80
> [542909.741883]  [] ? __vfs_setxattr_noperm+0x65/0x1b0
> [542909.741889]  [] ? evm_inode_setxattr+0xe/0x10
> [542909.741892]  [] ? kvfree+0x35/0x40
> [542909.741895]  [] kfree+0x106/0x140
> [542909.741899]  [] kvfree+0x35/0x40
> [542909.741902]  [] setxattr+0x15b/0x1e0
> [542909.741909]  [] ? putname+0x3d/0x60
> [542909.741914]  [] ? user_path_at_empty+0x72/0xc0
> [542909.741920]  [] ? __sb_start_write+0x58/0x120
> [542909.741926]  [] ? do_utimes+0xf1/0x180
> [542909.741930]  [] SyS_setxattr+0xb7/0x100
> [542909.741937]  [] system_call_fastpath+0x25/0x2a
> [542909.741940] 
> =
> [542909.741942] BUG kmalloc-8 (Tainted: GB  OE    ): 
> Wrong object count. Counter is 75 but counted were 95
> [542909.741944] 
> -
> 
> [542909.741947] INFO: Slab 0xe0933440b3c0 objects=102 used=75 
> fp=0x9bb6902cf558 flags=0x6f0081
> [542909.741951] CPU: 25 PID: 50461 Comm: pool Kdump: loaded Tainted: G
> B  OE     3.10.0-1062.9.1.el7.x86_64 #1
> [542909.741953] Hardware name: HP ProLia

[lustre-discuss] Lustre 2.12 client crashes

2020-01-20 Thread Christopher Mountford
18 eb 2e 0f 1f 00 48 3b 72 e0 48 8d 42 e0 73 1d 48 8b 52 10 48 85 d2 74 0f 
<48> 3b 72 e8 72 e7 48 8b 52 08 48 85 d2 75 f1 48 85 c0 74 04 48 
[542911.665436] RIP  [] find_vma+0x3b/0x60
[542911.695917]  RSP 

-- 
-- 
# Dr. Christopher Mountford
# System specialist - Research Computing/HPC
# 
# IT services,
# University of Leicester, University Road, 
# Leicester, LE1 7RH, UK 
#
# t: 0116 252 3471
# e: cj...@le.ac.uk

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


Re: [lustre-discuss] Lustre client crashes in Lustre 2.12.3 with data on MDT

2020-01-10 Thread Christopher Mountford

Thank you for the suggestion. I've just grabbed and built b2_12 from git, and 
this does fix the problem. We may have to temporarily use this on clients 
whilst we wait for the 2.12.4 release.
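
For anyone finding this thread later, the build was roughly as follows - a 
sketch; configure options will vary with your kernel and interconnect, and 
this is a client-only build:

  git clone git://git.whamcloud.com/fs/lustre-release.git
  cd lustre-release
  git checkout b2_12
  sh ./autogen.sh
  ./configure --disable-server   # client modules and tools only
  make rpms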

Many Thanks,
Christopher.

On Fri, Jan 10, 2020 at 03:52:03PM +, Peter Jones wrote:
> While I'm not who you need to interpret the stack trace, I can decipher JIRA 
> and the state of LU-12462 is that it is already landed for the upcoming 
> 2.12.4 release. So, if you have a good reproducer, you could always test a 
> single client on the tip of b2_12 (either building from git or else grabbing 
> the latest build from 
> https://build.whamcloud.com/job/lustre-b2_12/)
>  . What's there now is close to the finished article and this will let you 
> know whether moving to 2.12.4 when it comes out will resolve this issue for 
> you. 
> 
> On 2020-01-10, 7:42 AM, "lustre-discuss on behalf of Christopher Mountford" 
>  
> wrote:
> 
> Hi,
> 
> We just switched to a new 2.12.3 Lustre storage system on our local HPC 
> cluster and have seen a number of client node crashes - all leaving a similar 
> syslog entry:
> 
> Jan 10 13:21:08 spectre15 kernel: LustreError: 
> 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) extent 
> 9238ac5133f0@{[0 -> 255/255], 
> [1|0|-|cache|wiY|9238a7370f00],[1703936|89|+|-|9238733e7180|256|  
> (null)]}
> Jan 10 13:21:08 spectre15 kernel: LustreError: 
> 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) ### extent: 
> 9238ac5133f0 ns: alice3-OST0019-osc-9248e7337800 lock: 
> 9238733e7180/0x3c4db4a67c3d39eb lrc: 2/0,0 mode: PW/PW res: 
> [0x74400:0x132478:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] 
> (req 65536->262143) flags: 0x200 nid: local remote: 
> 0xda6676eba5fdbd0c expref: -99 pid: 24499 timeout: 0 lvb_type: 1
> Jan 10 13:21:08 spectre15 kernel: LustreError: 
> 24567:0:(osc_cache.c:1241:osc_extent_tree_dump0()) Dump object 
> 9238a7370f00 extents at osc_cache_writeback_range:3062, mppr: 256.
> Jan 10 13:21:08 spectre15 kernel: LustreError: 
> 24567:0:(osc_cache.c:1246:osc_extent_tree_dump0()) extent 
> 9238ac5133f0@{[0 -> 255/255], [1|0|-|cache|wiY|9238a7370f00], 
> [1703936|89|+|-|9238733e7180|256|  (null)]} in tree 1.
> Jan 10 13:21:08 spectre15 kernel: LustreError: 
> 24567:0:(osc_cache.c:1246:osc_extent_tree_dump0()) ### extent: 
> 9238ac5133f0 ns: alice3-OST0019-osc-9248e7337800 lock: 
> 9238733e7180/0x3c4db4a67c3d39eb lrc: 2/0,0 mode: PW/PW res: 
> [0x74400:0x132478:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] 
> (req 65536->262143) flags: 0x200 nid: local remote: 
> 0xda6676eba5fdbd0c expref: -99 pid: 24499 timeout: 0 lvb_type: 1
> Jan 10 13:21:08 spectre15 kernel: LustreError: 
> 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) ASSERTION( 
> ext->oe_start >= start && ext->oe_end <= end ) failed:
> Jan 10 13:21:08 spectre15 kernel: LustreError: 
> 24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) LBUG
> Jan 10 13:21:08 spectre15 kernel: Pid: 24567, comm: rm 
> 3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019
> Jan 10 13:21:08 spectre15 kernel: Call Trace:
> Jan 10 13:21:08 spectre15 kernel: [] 
> libcfs_call_trace+0x8c/0xc0 [libcfs]
> Jan 10 13:21:08 spectre15 kernel: [] 
> lbug_with_loc+0x4c/0xa0 [libcfs]
> Jan 10 13:21:08 spectre15 kernel: [] 
> osc_cache_writeback_range+0xacd/0x1260 [osc]
> Jan 10 13:21:08 spectre15 kernel: [] 
> osc_io_fsync_start+0x85/0x1a0 [osc]
> Jan 10 13:21:08 spectre15 kernel: [] 
> cl_io_start+0x68/0x130 [obdclass]
> Jan 10 13:21:08 spectre15 kernel: [] 
> lov_io_call.isra.7+0x87/0x140 [lov]
> Jan 10 13:21:08 spectre15 kernel: [] 
> lov_io_start+0x56/0x150 [lov]
> Jan 10 13:21:08 spectre15 kernel: [] 
> cl_io_start+0x68/0x130 [obdclass]
> Jan 10 13:21:08 spectre15 kernel: [] 
> cl_io_loop+0xcc/0x1c0 [obdclass]
> Jan 10 13:21:08 spectre15 kernel: [] 
> cl_sync_file_range+0x2db/0x380 [lustre]
> Jan 10 13:21:08 spectre15 kernel: [] 
> ll_delete_inode+0x160/0x230 [lustre]
> Jan 10 13:21:08 spectre15 kernel: [] evict+0xb4/0x180
> Jan 10 13:21:08 spectre15 kernel: [] iput+0xfc/0x190
> Jan 10 13:21:08 spectre15 kernel: [] 
> do_unlinkat+0x1ae/0x2d0
> Jan 10 13:21:08 spectre15 kernel: [] 
> SyS_unlinkat+0x1b/0x40
> Jan 10 13:

[lustre-discuss] Lustre client crashes in Lustre 2.12.3 with data on MDT

2020-01-10 Thread Christopher Mountford
Hi,

We just switched to a new 2.12.3 Lustre storage system on our local HPC cluster 
and have seen a number of client node crashes - all leaving a similar syslog entry:

Jan 10 13:21:08 spectre15 kernel: LustreError: 
24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) extent 
9238ac5133f0@{[0 -> 255/255], 
[1|0|-|cache|wiY|9238a7370f00],[1703936|89|+|-|9238733e7180|256|
  (null)]}
Jan 10 13:21:08 spectre15 kernel: LustreError: 
24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) ### extent: 
9238ac5133f0 ns: alice3-OST0019-osc-9248e7337800 lock: 
9238733e7180/0x3c4db4a67c3d39eb lrc: 2/0,0 mode: PW/PW res: 
[0x74400:0x132478:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 
65536->262143) flags: 0x200 nid: local remote: 0xda6676eba5fdbd0c 
expref: -99 pid: 24499 timeout: 0 lvb_type: 1
Jan 10 13:21:08 spectre15 kernel: LustreError: 
24567:0:(osc_cache.c:1241:osc_extent_tree_dump0()) Dump object 9238a7370f00 
extents at osc_cache_writeback_range:3062, mppr: 256.
Jan 10 13:21:08 spectre15 kernel: LustreError: 
24567:0:(osc_cache.c:1246:osc_extent_tree_dump0()) extent 9238ac5133f0@{[0 
-> 255/255], [1|0|-|cache|wiY|9238a7370f00], 
[1703936|89|+|-|9238733e7180|256|  (null)]} in tree 1.
Jan 10 13:21:08 spectre15 kernel: LustreError: 
24567:0:(osc_cache.c:1246:osc_extent_tree_dump0()) ### extent: 9238ac5133f0 
ns: alice3-OST0019-osc-9248e7337800 lock: 
9238733e7180/0x3c4db4a67c3d39eb lrc: 2/0,0 mode: PW/PW res: 
[0x74400:0x132478:0x0].0x0 rrc: 2 type: EXT [0->18446744073709551615] (req 
65536->262143) flags: 0x200 nid: local remote: 0xda6676eba5fdbd0c 
expref: -99 pid: 24499 timeout: 0 lvb_type: 1
Jan 10 13:21:08 spectre15 kernel: LustreError: 
24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) ASSERTION( ext->oe_start 
>= start && ext->oe_end <= end ) failed:
Jan 10 13:21:08 spectre15 kernel: LustreError: 
24567:0:(osc_cache.c:3062:osc_cache_writeback_range()) LBUG
Jan 10 13:21:08 spectre15 kernel: Pid: 24567, comm: rm 
3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019
Jan 10 13:21:08 spectre15 kernel: Call Trace:
Jan 10 13:21:08 spectre15 kernel: [] 
libcfs_call_trace+0x8c/0xc0 [libcfs]
Jan 10 13:21:08 spectre15 kernel: [] lbug_with_loc+0x4c/0xa0 
[libcfs]
Jan 10 13:21:08 spectre15 kernel: [] 
osc_cache_writeback_range+0xacd/0x1260 [osc]
Jan 10 13:21:08 spectre15 kernel: [] 
osc_io_fsync_start+0x85/0x1a0 [osc]
Jan 10 13:21:08 spectre15 kernel: [] cl_io_start+0x68/0x130 
[obdclass]
Jan 10 13:21:08 spectre15 kernel: [] 
lov_io_call.isra.7+0x87/0x140 [lov]
Jan 10 13:21:08 spectre15 kernel: [] lov_io_start+0x56/0x150 
[lov]
Jan 10 13:21:08 spectre15 kernel: [] cl_io_start+0x68/0x130 
[obdclass]
Jan 10 13:21:08 spectre15 kernel: [] cl_io_loop+0xcc/0x1c0 
[obdclass]
Jan 10 13:21:08 spectre15 kernel: [] 
cl_sync_file_range+0x2db/0x380 [lustre]
Jan 10 13:21:08 spectre15 kernel: [] 
ll_delete_inode+0x160/0x230 [lustre]
Jan 10 13:21:08 spectre15 kernel: [] evict+0xb4/0x180
Jan 10 13:21:08 spectre15 kernel: [] iput+0xfc/0x190
Jan 10 13:21:08 spectre15 kernel: [] do_unlinkat+0x1ae/0x2d0
Jan 10 13:21:08 spectre15 kernel: [] SyS_unlinkat+0x1b/0x40
Jan 10 13:21:08 spectre15 kernel: [] 
system_call_fastpath+0x25/0x2a
Jan 10 13:21:08 spectre15 kernel: [] 0x
Jan 10 13:21:08 spectre15 kernel: Kernel panic - not syncing: LBUG


We are able to reproduce the error on a test system - it appears to be caused 
by removing multiple files with a single 'rm -f *'; strangely, repeating this and 
deleting the files one at a time is fine (both results are reproducible). 
Only files with a data on MDT layout cause the crash.
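
Roughly, the reproducer looks like this - a sketch; the DoM component size 
and file count are illustrative, and any data on MDT layout seems to trigger it:

  # directory whose files get a data on MDT first component
  lfs setstripe -E 64K -L mdt -E -1 /mnt/lustre/domdir
  for i in $(seq 1 100); do
      dd if=/dev/zero of=/mnt/lustre/domdir/f$i bs=4k count=4
  done
  rm -f /mnt/lustre/domdir/*   # the client LBUGs here
  # removing the files one at a time instead does not crash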

We have been using the 2.12.3 client (with 2.10.7 servers) since December 
without issue. The problem has been occurring since we moved to a new 
Lustre 2.12.3 filesystem which has data on MDT enabled. We have confirmed that 
deleting files which do not have a data on MDT layout does not cause the above 
problem.

This looks to me like LU-12462 (https://jira.whamcloud.com/browse/LU-12462); 
however, it looks like this is only known to affect 2.13.0 (and 2.12.4), not 
2.12.3. I'm not familiar with JIRA though, so I could be reading this wrong!

Any suggestions on how best to report/resolve this?

We have repeated the tests using a 2.13.0 test client and we do not see any 
crashes on this client (LU-12462 says fixed in 2.13).

Regards,
Christopher.


-- 
-- 
# Dr. Christopher Mountford
# System specialist - Research Computing/HPC
# 
# IT services,
# University of Leicester, University Road, 
# Leicester, LE1 7RH, UK 
#
# t: 0116 252 3471
# e: cj...@le.ac.uk

___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org