Re: [lustre-discuss] RDMA too fragmented, OSTs unavailable (permanently)
Hi Doug,

Thomas Stibor has already added our findings to LU-5718 (he is our Dr. Lustre here at GSI).

Just to have this also on the mailing list: the error occurs if a user is close to their quota limit while the RPC size is at the default value. The workaround is setting "max_pages_per_rpc=64".

Regards,
Thomas

On 09/22/2016 11:56 PM, Oucharek, Doug S wrote:
> [...]

--
Thomas Roth
Department: Informationstechnologie
Location: SB3 1.250
Phone: +49-6159-71 1453   Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt
www.gsi.de
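For reference, a minimal sketch of checking the current RPC size and applying that workaround on a client. This is illustrative only; the OSC wildcard matches whatever file systems the client has mounted:

    # Current RPC size (pages per bulk RPC) for all OSCs on this client:
    lctl get_param osc.*.max_pages_per_rpc

    # Apply the workaround (takes effect immediately, not persistent
    # across a reboot):
    lctl set_param osc.*.max_pages_per_rpc=64

    # Verify:
    lctl get_param osc.*.max_pages_per_rpc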
Re: [lustre-discuss] RDMA too fragmented, OSTs unavailable (permanently)
Hi Thomas,

It is interesting that you have encountered this error without a router. Good information. I have updated LU-5718 with a link to this discussion.

The original fix posted to LU-5718 by Liang should fix this problem for you (it does not assume a router is the cause). That fix does double the amount of memory used per QP. That is probably not an issue for a client, but it could be an issue for a router (as Cray has found).

Are you using the quotas feature? There is some evidence that it may play a role here.

Doug

> On Sep 10, 2016, at 12:38 AM, Thomas Roth wrote:
> [...]
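If it helps, a quick sketch of checking whether quotas are in play; the user name and mount point below are placeholders, adjust to your setup:

    # On a client: how close is the suspect user to their limit?
    lfs quota -u someuser /lustre

    # On the servers (Lustre 2.4+): is quota enforcement actually enabled,
    # and for which ID types?
    lctl get_param osd-*.*.quota_slave.info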
Re: [lustre-discuss] RDMA too fragmented, OSTs unavailable (permanently)
Thomas,

It is somewhat sideways from your questions, but when Cray has seen this problem historically, it has almost always been due to lots of small direct I/O from a user code.

- Patrick

From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of Thomas Roth <t.r...@gsi.de>
Sent: Saturday, September 10, 2016 2:38:37 AM
To: lustre-discuss@lists.lustre.org
Subject: [lustre-discuss] RDMA too fragmented, OSTs unavailable (permanently)

> [...]
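One way to look for that pattern on a client is the per-OSC RPC statistics, which include a pages-per-RPC histogram; lots of small direct I/O tends to show up there as many small RPCs. A sketch, with the mount point and file name as placeholders:

    # Reset and then inspect the RPC size histogram while a suspect job runs
    # (writing to rpc_stats clears the counters):
    lctl set_param osc.*.rpc_stats=0
    lctl get_param osc.*.rpc_stats

    # Illustration of the kind of I/O Patrick describes: many small,
    # page-sized direct writes to a Lustre file:
    dd if=/dev/zero of=/lustre/scratch/testfile bs=4k count=1000 oflag=direct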
[lustre-discuss] RDMA too fragmented, OSTs unavailable (permanently)
Hi all,

we are running Lustre 2.5.3 on Infiniband. We have massive problems with clients being unable to communicate with any number of OSTs, rendering the entire cluster quite unusable.

Clients show

> LNetError: 1399:0:(o2iblnd_cb.c:1140:kiblnd_init_rdma()) RDMA too fragmented for 10.20.0.242@o2ib1 (256): 231/256 src 231/256 dst frags
> LNetError: 1399:0:(o2iblnd_cb.c:1690:kiblnd_reply()) Can't setup rdma for GET from 10.20.0.242@o2ib1: -90

which eventually results in OSTs at that nid becoming "temporarily unavailable". However, the OSTs are never recovered until they are manually evicted or the host is rebooted.

On the OSS side, this reads

> LNetError: 13660:0:(o2iblnd_cb.c:3075:kiblnd_check_conns()) Timed out RDMA with 10.20.0.220@o2ib1 (56): c: 7, oc: 0, rc: 7

We have checked the IB fabric, which shows no errors. Since we are not able to reproduce this effect in a simple way, we have also scrutinized the user code, so far without results.

Whenever this happens, the connection between client and OSS is fine under all IB test commands. Communication between client and OSS is still going on, but obviously when Lustre tries to replay the missed transaction, this fragmentation limit is hit again, so the OST never becomes available again.

If we understand correctly, the map_on_demand parameter should be increased as a workaround. The ko2iblnd module seems to provide this parameter,

> modinfo ko2iblnd
> parm: map_on_demand:map on demand (int)

but no matter what we load the module with, map_on_demand always remains at the default value:

> cat /sys/module/ko2iblnd/parameters/map_on_demand
> 0

Is there any way to understand
- why this memory fragmentation occurs/becomes so large?
- how to measure the real fragmentation degree (o2iblnd simply stops at 256, perhaps we are at 1000)?
- why map_on_demand cannot be changed?

Of course this all looks very much like LU-5718, but our clients are not behind LNET routers. There is one router which connects to the campus network but is not in use. And there are some routers which connect to an older cluster, but of course the old (1.8) clients never show any of these errors.

Cheers,
Thomas

--
Thomas Roth
Department: HPC
Location: SB3 1.262
Phone: +49-6159-71 1453   Fax: +49-6159-71 2986

GSI Helmholtzzentrum für Schwerionenforschung GmbH
Planckstraße 1, 64291 Darmstadt
www.gsi.de
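For completeness, a sketch of how an LND module parameter like map_on_demand would normally be set on a client; the modprobe file name and mount point are only examples, and why the value nevertheless stays at 0 on this 2.5.3 build is exactly the open question above:

    # Put the option where modprobe will pick it up (file name is an example):
    echo 'options ko2iblnd map_on_demand=256' > /etc/modprobe.d/ko2iblnd.conf

    # The whole Lustre/LNet module stack has to be reloaded for the option
    # to take effect; the client must be unmounted first:
    umount /lustre
    lustre_rmmod
    modprobe lustre
    cat /sys/module/ko2iblnd/parameters/map_on_demand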