Re: [gridengine users] Intermittent commlib errors with MPI jobs

Reuti Wed, 14 Nov 2012 10:04:16 -0800

On 14.11.2012 at 00:56, Brendan Moloney wrote:

> Ok I will test that out once I can schedule some down time.  I might even be 
> able to get my hands on another switch by then.


Depending on your NFS setup you can also change this on-the-fly.

-- Reuti


> I appreciate all the help.
> ________________________________________
> From: Reuti [[email protected]]
> Sent: Tuesday, November 13, 2012 3:33 AM
> To: Brendan Moloney
> Cc: [email protected]
> Subject: Re: [gridengine users] Intermittent commlib errors with MPI jobs
> 
> On 12.11.2012 at 22:03, Brendan Moloney wrote:
> 
>> I suppose it could be the switch.  Is the only way to test this to swap it 
>> out for a different switch?
> 
> Are all ports on the switch in use? If not, try different ports.
> 
> -- Reuti
> 
> 
>> Thanks again,
>> Brendan
>> ________________________________________
>> From: Reuti [[email protected]]
>> Sent: Monday, November 12, 2012 4:17 AM
>> To: Brendan Moloney
>> Cc: [email protected]
>> Subject: Re: [gridengine users] Intermittent commlib errors with MPI jobs
>> 
>> On 10.11.2012 at 00:31, Brendan Moloney wrote:
>> 
>>> I spent some time researching this issue in the context of OpenSSH and 
>>> found some mentions of similar problems due to the initial handshake 
>>> packet being too large 
>>> (http://serverfault.com/questions/265244/ssh-client-problem-connection-reset-by-peer).
>>> I was dubious that this was my problem, but after manually specifying the 
>>> cipher to use ('-c aes256-ctr') I haven't seen the problem again. With the 
>>> number of submissions I have done now I would expect to have seen the issue 
>>> several times, so I am fairly sure it is fixed.  Will keep an eye on it of 
>>> course.
>>> 
>>>>>>> Sometimes I get "Connection reset by peer"
>>>> 
>>>> After a long time or instantly? There are some settings in ssh to avoid a 
>>>> timeout, in ssh_config or ~/.ssh/config respectively:
>>>> 
>>>> Host *
>>>> Compression yes
>>>> ServerAliveInterval 900
>>> 
>>> Seems to happen fast enough that it is not a timeout issue.
>>> 
>>>>> I am indeed using SSH with a wrapper script for adding the group ID:
>>>>> 
>>>>> qlogin_command               /usr/global/bin/qlogin-wrapper
>>>>> qlogin_daemon                /usr/global/bin/rshd-wrapper
>>>>> rlogin_command               /usr/bin/ssh
>>>>> rlogin_daemon                /usr/global/bin/rshd-wrapper
>>>>> rsh_command                  /usr/bin/ssh
>>>>> rsh_daemon                   /usr/global/bin/rshd-wrapper
>>> 
>>>> It's also possible to set different methods for each of the three pairs. 
>>>> So, rsh_command/rsh_daemon could be set to builtin and the others left as 
>>>> they are. Would this be appropriate for your intended setup of X11 
>>>> forwarding?
>>> 
>>> So using the builtin option would still allow enforcement of memory/time 
>>> limits on parallel jobs?
>> 
>> The ones set by SGE - yes.
>> 
>> Back to the original problem: could it be a problem with the switch?
>> 
>> -- Reuti
>> 
> 
> 

