Re: [lustre-discuss] [EXTERNAL] tgt_grant.c:571:tgt_grant_incoming
Maybe this bug? https://jira.whamcloud.com/browse/LU-11939

--Rick

On 3/11/21, 3:01 AM, "lustre-discuss on behalf of bkmz via lustre-discuss" wrote:

> Hello, please help :) Periodically I get this error in dmesg:
>
> Mar 9 17:42:32 oss05 kernel: LustreError: 14715:0:(tgt_grant.c:571:tgt_grant_incoming()) scratch-OST001c: cli dd4a4653-12d7-4/96b92f789800 dirty 28672 pend 0 grant -741
> Mar 9 17:42:32 oss05 kernel: LustreError: 14715:0:(tgt_grant.c:573:tgt_grant_incoming()) LBUG
>
> [...]
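For reference, the grant accounting on the affected OST can be inspected on the OSS before the counters go negative. A minimal sketch, assuming the 2.12.x obdfilter parameter names (verify with "lctl list_param -R obdfilter.scratch-OST001c" on your system):

    # aggregate grant counters for the OST named in the console message
    lctl get_param obdfilter.scratch-OST001c.tot_granted
    lctl get_param obdfilter.scratch-OST001c.tot_dirty
    lctl get_param obdfilter.scratch-OST001c.tot_pending

A tot_granted value that steadily drifts away from what the clients hold may be consistent with the grant-accounting bugs fixed after 2.12.2, such as the LU-11939 ticket above.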
Re: [lustre-discuss] [EXTERNAL] MDT mount stuck
Thomas,

Is the behavior any different if you mount with the "-o abort_recov" option to avoid the recovery phase? (A sketch of such a mount follows below the quote.)

--Rick

On 3/11/21, 11:48 AM, "lustre-discuss on behalf of Thomas Roth via lustre-discuss" wrote:

> Hi all,
> after not getting out of the ldlm_lockd situation, we are trying a shutdown plus restart.
> It does not work at all. The very first mount of the restart is MGS + MDT0, of course. It is quite busy writing traces to the log:
>
> [...]
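A minimal sketch of the suggested mount, with illustrative device and mount-point paths (substitute your actual MDT device):

    # abort_recov skips the recovery window; clients waiting on replay get evicted
    mount -t lustre -o abort_recov /dev/mapper/mdt0 /mnt/lustre/mdt0

Since abort_recov evicts clients that would otherwise replay their transactions, it is a troubleshooting measure rather than a routine mount option.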
Re: [lustre-discuss] MDT mount stuck
And a perhaps minor observation: comparing to previous restarts in the log files, I see the line

Lustre: MGS: Connection restored to 2519f316-4f30-9698-3487-70eb31a73320 (at 0@lo)

Before, it was

Lustre: MGS: Connection restored to c70c1b4e-3517-5631-28b1-7163f13e7bed (at 0@lo)

What is this number? A unique identifier for the MGS? Which changes between restarts?

Regards,
Thomas

On 11/03/2021 17.47, Thomas Roth via lustre-discuss wrote:

> Hi all,
> after not getting out of the ldlm_lockd situation, we are trying a shutdown plus restart.
> It does not work at all. The very first mount of the restart is MGS + MDT0, of course. It is quite busy writing traces to the log:
>
> [...]
[lustre-discuss] MDT mount stuck
Hi all,

after not getting out of the ldlm_lockd situation, we are trying a shutdown plus restart. It does not work at all. The very first mount of the restart is MGS + MDT0, of course. It is quite busy writing traces to the log:

Mar 11 17:21:17 lxmds19.gsi.de kernel: INFO: task mount.lustre:2948 blocked for more than 120 seconds.
Mar 11 17:21:17 lxmds19.gsi.de kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Mar 11 17:21:17 lxmds19.gsi.de kernel: mount.lustre D 9616ffc5acc0 0 2948 2947 0x0082
Mar 11 17:21:17 lxmds19.gsi.de kernel: Call Trace:
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] schedule+0x29/0x70
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] schedule_timeout+0x221/0x2d0
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] ? select_task_rq_fair+0x5a6/0x760
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] wait_for_completion+0xfd/0x140
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] ? wake_up_state+0x20/0x20
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] llog_process_or_fork+0x244/0x450 [obdclass]
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] llog_process+0x14/0x20 [obdclass]
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] class_config_parse_llog+0x125/0x350 [obdclass]
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] mgc_process_cfg_log+0x790/0xc40 [mgc]
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] mgc_process_log+0x3dc/0x8f0 [mgc]
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] ? config_recover_log_add+0x13f/0x280 [mgc]
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] mgc_process_config+0x88b/0x13f0 [mgc]
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] lustre_process_log+0x2d8/0xad0 [obdclass]
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] ? libcfs_debug_msg+0x57/0x80 [libcfs]
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] ? lprocfs_counter_add+0xf9/0x160 [obdclass]
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] server_start_targets+0x13a4/0x2a20 [obdclass]
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] ? lustre_start_mgc+0x260/0x2510 [obdclass]
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] ? class_config_dump_handler+0x7e0/0x7e0 [obdclass]
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] server_fill_super+0x10cc/0x1890 [obdclass]
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] lustre_fill_super+0x468/0x960 [obdclass]
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] ? lustre_common_put_super+0x270/0x270 [obdclass]
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] mount_nodev+0x4f/0xb0
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] lustre_mount+0x38/0x60 [obdclass]
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] mount_fs+0x3e/0x1b0
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] vfs_kern_mount+0x67/0x110
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] do_mount+0x1ef/0xd00
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] ? __check_object_size+0x1ca/0x250
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] ? kmem_cache_alloc_trace+0x3c/0x200
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] SyS_mount+0x83/0xd0
Mar 11 17:21:17 lxmds19.gsi.de kernel: [] system_call_fastpath+0x25/0x2a

Other than that, nothing is happening. The Lustre processes have started, but e.g. recovery_status = Inactive. OK, perhaps that is because there is nothing out there to recover besides this MDS; all other Lustre servers and clients are still stopped. Still, on previous occasions the mount would not block in this way: the device would be mounted, but now it does not make it into /proc/mounts.

Btw, the disk device can be mounted as type ldiskfs (a read-only version of that check is sketched after this message). So it exists, and on the inside it definitely looks like a Lustre MDT.
Best,
Thomas

--
Thomas Roth
Department: Informationstechnologie
Location: SB3 2.291
Phone: +49-6159-71 1453
Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt, Germany, www.gsi.de

Commercial Register / Handelsregister: Amtsgericht Darmstadt, HRB 1528
Managing Directors / Geschäftsführung: Professor Dr. Paolo Giubellino, Dr. Ulrich Breuer, Jörg Blaurock
Chairman of the Supervisory Board / Vorsitzender des GSI-Aufsichtsrats: State Secretary / Staatssekretär Dr. Volkmar Dietz
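Two quick checks for a stuck MDT mount like this; a sketch with illustrative paths, assuming standard 2.x parameter names:

    # recovery state of the MDT (the "Inactive" value quoted above)
    lctl get_param mdt.*.recovery_status

    # read-only ldiskfs cross-check of the MDT device
    mount -t ldiskfs -o ro /dev/mapper/mdt0 /mnt/mdt_check
    ls /mnt/mdt_check    # CONFIGS/, O/, oi.16.* indicate a Lustre target
    umount /mnt/mdt_check

The stack trace itself shows mount.lustre waiting in llog_process_or_fork() while parsing the configuration llog from the MGS, which points to a hang in config-log processing rather than in device access.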
Re: [lustre-discuss] Multiple IB interfaces
Alastair,

A few scenarios you may consider:

1) Define two LNets, one per IB interface (say o2ib1 and o2ib2), and share out one OST through o2ib1 and the other one through o2ib2. You can map HBA and disk locality so that they are attached to the same CPU.

2) Same as above, but share the OST(s) from both LNets, and configure odd clients (clients with odd IPs) to use o2ib1 and even clients to use o2ib2.

This may not be exactly what you are looking for, but it can utilize both interfaces efficiently. (A configuration sketch follows the quoted message below.)

-Raj

On Tue, Mar 9, 2021 at 9:18 AM Alastair Basden via lustre-discuss <lustre-discuss@lists.lustre.org> wrote:

> Hi,
>
> We are installing some new Lustre servers with 2 InfiniBand cards, 1 attached to each CPU socket. Storage is NVMe; again, some drives are attached to each socket.
>
> We want to ensure that data to/from each drive uses the appropriate IB card and doesn't need to travel through the inter-CPU link. Data being written is fairly easy, I think: we just set that OST to the appropriate IP address. However, data being read may well go out the other NIC, if I understand correctly.
>
> What setup do we need for this?
>
> I think probably not bonding, as that will presumably not tie NIC interfaces to CPUs. But I also see a note in the Lustre manual:
>
> """If the server has multiple interfaces on the same subnet, the Linux kernel will send all traffic using the first configured interface. This is a limitation of Linux, not Lustre. In this case, network interface bonding should be used. For more information about network interface bonding, see Chapter 7, Setting Up Network Interface Bonding."""
>
> (Plus, no idea if bonding is supported on InfiniBand.)
>
> Thanks,
> Alastair.
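A sketch of the LNet half of scenario 1, using modprobe-style configuration; the interface names are illustrative, and on newer releases the same can be expressed with lnetctl YAML:

    # server-side /etc/modprobe.d/lustre.conf: one LNet network per IB port,
    # so each OST can be registered with a NID on the network local to its CPU
    options lnet networks="o2ib1(ib0),o2ib2(ib1)"

    # an "odd" client under scenario 2 would configure only the first network:
    options lnet networks="o2ib1(ib0)"

Pinning each OST to the NID of the socket-local HCA should keep both request and reply traffic on that port, which addresses the read-path concern in the original question.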
[lustre-discuss] tgt_grant.c:571:tgt_grant_incoming
Hello, please help :) Periodically I get this error in dmesg:

Mar 9 17:42:32 oss05 kernel: LustreError: 14715:0:(tgt_grant.c:571:tgt_grant_incoming()) scratch-OST001c: cli dd4a4653-12d7-4/96b92f789800 dirty 28672 pend 0 grant -741
Mar 9 17:42:32 oss05 kernel: LustreError: 14715:0:(tgt_grant.c:573:tgt_grant_incoming()) LBUG

Package information:

lustre-2.12.2-1.el7.x86_64
Name        : lustre
Version     : 2.12.2
Release     : 1.el7
Architecture: x86_64
Install Date: Tue 19 Jan 2021 06:55:34 PM MSK
Group       : System Environment/Kernel
Size        : 2586107
License     : GPL
Signature   : (none)
Source RPM  : lustre-2.12.2-1.el7.src.rpm
Build Date  : Mon 27 May 2019 01:11:16 AM MSK
Build Host  : trevis-310-el7-x8664-4.trevis.whamcloud.com
Relocations : (not relocatable)
URL         : https://wiki.whamcloud.com/
Summary     : Lustre File System

System information:

CentOS Linux release 7.6.1810 (Core)
Linux oss06 3.10.0-957.el7.x86_64 #1 SMP Thu Nov 8 23:39:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
OFED.4.6.0.4.1.46101.x86_64

Line 560 in tgt_grant.c is:

    if (ted->ted_dirty < 0 || ted->ted_grant < 0 || ted->ted_pending < 0) {

but I don't understand the reason :( Why is ted->ted_grant < 0?

Best regards,
Ilya Fateev
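For context, the check that fires here can be paraphrased as follows. This is a sketch reconstructed from the console message format above, so variable names and the surrounding locking in the actual 2.12.2 tgt_grant.c may differ:

    /* tgt_grant_incoming(), around line 560: sanity-check the per-export
     * grant counters (exp, ted and obd are set up earlier in the function).
     * The CERROR at line 571 prints the "dirty ... pend ... grant ..."
     * message seen in dmesg, and LBUG() at line 573 halts the thread. */
    if (ted->ted_dirty < 0 || ted->ted_grant < 0 || ted->ted_pending < 0) {
            CERROR("%s: cli %s/%p dirty %ld pend %ld grant %ld\n",
                   obd->obd_name, exp->exp_client_uuid.uuid, exp,
                   ted->ted_dirty, ted->ted_pending, ted->ted_grant);
            LBUG();
    }

A negative ted_grant means the server's record of space granted to this client has underflowed, i.e. the accounting believes the client gave back or consumed more grant than it was ever given. LU-11939, which the reply earlier in this digest points to as a likely match, tracks a grant-accounting bug of this kind in 2.12.2.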