Re: [lustre-discuss] mount issue and ecmp?

2019-02-08 Thread Michael Di Domenico
Poking at this further, it doesn't look like it's an ECMP issue.

Are there any known reports of issues when running Lustre over IPoIB
over an OPA fabric?  It seems a stretch, but it's the only difference in
the network at this point.

Can anyone suggest somewhere to look for more debug info?
/var/log/messages and dmesg don't reveal much.




On Mon, Feb 4, 2019 at 9:19 AM Michael Di Domenico wrote:
>
> Has anyone heard of lustre having trouble mounting when ECMP is used
> on the compute nodes' default gateway?
>
> I'm trying to mount an existing lustre filesystem on a new cluster,
> where the connections ride over OPA IPoIB, which is then converted to
> 10ge via four routers.  I'm using ECMP to distribute the packets over
> the four routers.
>
> I can mount lustre on other ethernet clients, but not the ones behind
> my ECMP gateways.  Changing the compute node gateway from ECMP to a
> single device doesn't change anything.  I'm not easily able to revert
> the network side from ECMP to a single route, so i haven't tried that.
>
> The output i get from mount is, "failed: Input/output error retries left: 0"
>
> syslog on the client and the MGS seems to show that the connection is
> being broken between the MGS and client during the mount with a "timed
> out for slow reply" message.
___
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org


[lustre-discuss] mount issue and ecmp?

2019-02-04 Thread Michael Di Domenico
Has anyone heard of lustre having trouble mounting when ECMP is used
on the compute nodes' default gateway?

I'm trying to mount an existing lustre filesystem on a new cluster,
where the connections ride over OPA IPoIB, which is then converted to
10ge via four routers.  I'm using ECMP to distribute the packets over
the four routers.
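
For anyone unfamiliar with ECMP, here is a rough sketch of per-flow
path selection, assuming the routers hash the usual 5-tuple.  The hash
function, the router addresses, and the client/server addresses below
are all made up for illustration; the real hash policy depends on the
kernel or router configuration and is not something I've confirmed.
The point is only that one TCP connection (e.g. to LNet's default
port 988) should stay on one router, while different connections can
land on different ones.

```python
import hashlib

# Hypothetical addresses for the four routers (not the real ones).
ROUTERS = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]


def pick_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    """Pick a next hop by hashing the flow 5-tuple (illustrative only).

    Real ECMP implementations use their own hash functions and may hash
    fewer fields (for example only source and destination address).
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    # Reduce the first 4 bytes of the digest to an index into next_hops.
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]


if __name__ == "__main__":
    # The same flow always maps to the same router.
    flow = ("192.168.1.10", "10.10.0.5", 1023, 988, "tcp")
    print(pick_next_hop(*flow, ROUTERS))
    print(pick_next_hop(*flow, ROUTERS))

    # A different source port is a different flow and may map elsewhere.
    other = ("192.168.1.10", "10.10.0.5", 1022, 988, "tcp")
    print(pick_next_hop(*other, ROUTERS))
```

If the hashing were per-packet instead of per-flow, packets of one
connection could take different routers and arrive out of order, which
is the kind of thing I was worried about.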

I can mount lustre on other ethernet clients, but not the ones behind
my ECMP gateways.  Changing the compute node gateway from ECMP to a
single device doesn't change anything.  I'm not easily able to revert
the network side from ECMP to a single route, so i haven't tried that.

The output i get from mount is, "failed: Input/output error retries left: 0"

syslog on the client and the MGS seems to show that the connection is
being broken between the MGS and client during the mount with a "timed
out for slow reply" message.