On 14.11.2012 at 00:56, Brendan Moloney wrote:

> Ok I will test that out once I can schedule some downtime. I might even be
> able to get my hands on another switch by then.

Depending on your NFS setup you can also change this on-the-fly.

-- Reuti

> I appreciate all the help.
> ________________________________________
> From: Reuti [[email protected]]
> Sent: Tuesday, November 13, 2012 3:33 AM
> To: Brendan Moloney
> Cc: [email protected]
> Subject: Re: [gridengine users] Intermittent commlib errors with MPI jobs
>
> On 12.11.2012 at 22:03, Brendan Moloney wrote:
>
>> I suppose it could be the switch. Is the only way to test this to swap it
>> out for a different switch?
>
> Are all ports used on the switch? Change the used ports.
>
> -- Reuti
>
>
>> Thanks again,
>> Brendan
>> ________________________________________
>> From: Reuti [[email protected]]
>> Sent: Monday, November 12, 2012 4:17 AM
>> To: Brendan Moloney
>> Cc: [email protected]
>> Subject: Re: [gridengine users] Intermittent commlib errors with MPI jobs
>>
>> On 10.11.2012 at 00:31, Brendan Moloney wrote:
>>
>>> I spent some time researching this issue in the context of OpenSSH and
>>> found some mentions of similar problems due to the initial handshake
>>> packet being too large
>>> (http://serverfault.com/questions/265244/ssh-client-problem-connection-reset-by-peer).
>>> I was dubious that this was my problem but after manually specifying the
>>> cipher to use ('-c aes256-ctr') I haven't seen the problem again. With the
>>> number of submissions I have done now I would expect to have seen the issue
>>> several times, so I am fairly sure it is fixed. Will keep an eye on it of
>>> course.
>>>
>>>>>>> Sometimes I get "Connection reset by peer"
>>>>
>>>> After a long time or instantly? There are some settings in ssh to avoid a
>>>> timeout in ssh_config or ~/.ssh/config:
>>>>
>>>> Host *
>>>> Compression yes
>>>> ServerAliveInterval 900
>>>
>>> Seems to happen fast enough that it is not a timeout issue.
>>>
>>>>> I am indeed using SSH with a wrapper script for adding the group ID:
>>>>>
>>>>> qlogin_command /usr/global/bin/qlogin-wrapper
>>>>> qlogin_daemon /usr/global/bin/rshd-wrapper
>>>>> rlogin_command /usr/bin/ssh
>>>>> rlogin_daemon /usr/global/bin/rshd-wrapper
>>>>> rsh_command /usr/bin/ssh
>>>>> rsh_daemon /usr/global/bin/rshd-wrapper
>>>
>>>> It's also possible to set different methods for each of the three pairs.
>>>> So, rsh_command/rsh_daemon could be set to builtin and the others left as
>>>> they are. Would this be appropriate for your intended setup of X11
>>>> forwarding?
>>>
>>> So using the builtin option would still allow enforcement of memory/time
>>> limits on parallel jobs?
>>
>> The ones set by SGE - yes.
>>
>> To the original problem: can it be a problem in the switch?
>>
>> -- Reuti
>>
>
>
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
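
The cipher workaround reported in the thread was passed on the command line ('-c aes256-ctr'). A minimal sketch of making the same choice persistent in the OpenSSH client configuration follows. It assumes the client and the servers on the execution hosts both support aes256-ctr. The "Host *" pattern is only an illustration and could be narrowed to the cluster nodes. The Compression and ServerAliveInterval lines are the ones suggested in the thread for timeout-related resets, which turned out not to be the cause here.

```
# ~/.ssh/config (sketch; "Host *" is an assumption, narrow it as needed)
Host *
    # Same cipher selection as 'ssh -c aes256-ctr'
    Ciphers aes256-ctr
    # Settings suggested in the thread against timeouts
    Compression yes
    ServerAliveInterval 900
```

Whether this helps on another cluster is not established by the thread; the poster only reports that the resets stopped after pinning the cipher.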

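
The suggestion to use a different method for each of the three command/daemon pairs could look roughly like the fragment below. This is a sketch, assuming a Grid Engine version that accepts "builtin" for these parameters and that the configuration is edited with 'qconf -mconf'. The wrapper paths are the ones quoted in the thread. Only the rsh pair is switched to builtin, and the qlogin and rlogin pairs keep the SSH wrappers.

```
qlogin_command               /usr/global/bin/qlogin-wrapper
qlogin_daemon                /usr/global/bin/rshd-wrapper
rlogin_command               /usr/bin/ssh
rlogin_daemon                /usr/global/bin/rshd-wrapper
rsh_command                  builtin
rsh_daemon                   builtin
```

As stated in the thread, limits set by SGE would still be enforced with the builtin method. Whether the remaining SSH-based pairs cover the intended X11 forwarding setup is the open question raised there.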