Poking at this further, it doesn't look like it's an ECMP issue.
Are there any known reports of issues when running Lustre over IPoIB
on an OPA fabric? It seems a stretch, but it's the only difference in
the network at this point.
Can anyone suggest somewhere to look for more debug info?
/var/log/messages and dmesg don't reveal much.
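
One cheap check that is independent of Lustre is whether a plain TCP
connection from a compute node to the MGS completes at all. The sketch
below assumes the clients use the TCP LND over the IPoIB interface and
that the MGS listens on the default LNet acceptor port, 988; the
function name is made up for illustration. A successful connect only
shows that the TCP handshake survives the routed path. It says nothing
about LNet itself, for which "lctl ping <nid>" is the closer test.

```python
import socket

# Assumption: the default LNet TCP acceptor port; adjust if the site overrides it.
LNET_TCP_PORT = 988


def tcp_reachable(host, port=LNET_TCP_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port completes within timeout.

    Hypothetical helper, not part of Lustre. Any socket-level failure
    (refused, unreachable, timed out, name lookup error) yields False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling tcp_reachable() with the MGS address from a client behind the
routers and again from a working Ethernet client gives a quick
side-by-side comparison of the two paths.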
On Mon, Feb 4, 2019 at 9:19 AM Michael Di Domenico wrote:
>
> Has anyone heard of Lustre having trouble mounting when ECMP is used
> on the compute nodes' default gateway?
>
> I'm trying to mount an existing Lustre filesystem on a new cluster,
> where the connections ride over OPA IPoIB, which is then converted to
> 10GbE via four routers. I'm using ECMP to distribute the packets over
> the four routers.
>
> I can mount Lustre on other Ethernet clients, but not the ones behind
> my ECMP gateways. Changing the compute node gateway from ECMP to a
> single device doesn't change anything. I'm not easily able to revert
> the network side from ECMP to a single route, so I haven't tried that.
>
> The output I get from mount is: "failed: Input/output error retries left: 0"
>
> Syslog on the client and the MGS seems to show that the connection
> between the MGS and the client is being broken during the mount, with
> a "timed out for slow reply" message.