Hi all,

I am having an issue with a node rebooting. I am running Desmond FEP jobs...
Thanks for any help in advance!

/etc/resolv.conf is the same on all nodes
/etc/hosts is the same on all nodes
All nodes are connected to the same switch in a server rack.

Qping from master to node

[root@rndusljpp2 lx-amd64]# qping padme 6445 execd 1
11/25/2016 20:57:26 endpoint padme/execd/1 at port 6445 is up for 16733 seconds
11/25/2016 20:57:27 endpoint padme/execd/1 at port 6445 is up for 16734 seconds
11/25/2016 20:57:28 endpoint padme/execd/1 at port 6445 is up for 16735 seconds
11/25/2016 20:57:29 endpoint padme/execd/1 at port 6445 is up for 16736 seconds
11/25/2016 20:57:30 endpoint padme/execd/1 at port 6445 is up for 16737 seconds
11/25/2016 20:57:31 endpoint padme/execd/1 at port 6445 is up for 16738 seconds

Qping from node to master

[root@padme ~]# qping s1 6444 qmaster 1
11/25/2016 20:59:10 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 6444 is up for 2440537 seconds
11/25/2016 20:59:11 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 6444 is up for 2440538 seconds
11/25/2016 20:59:12 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 6444 is up for 2440539 seconds
11/25/2016 20:59:13 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 6444 is up for 2440540 seconds
11/25/2016 20:59:14 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 6444 is up for 2440541 seconds

################### from NODE
[root@padme lx-amd64]# ./gethostbyaddr -name 192.168.1.8
rndusljpp2.na.jnj.com
[root@padme lx-amd64]# ./gethostbyname -name s1
rndusljpp2.na.jnj.com

################### from QMASTER
[root@rndusljpp2 lx-amd64]# ./gethostbyaddr -name 192.168.1.159
padme
[root@rndusljpp2 lx-amd64]# ./gethostbyname -name padme
padme

############# NODE SGE logs
11/25/2016 07:38:56| main|padme|I|restarting load sensor /opt/schrodinger/2016-3/utilities/flexlm_sensor.pl
11/25/2016 07:38:56| main|padme|W|[load_sensor 6137] fflush failed [Broken pipe]
11/25/2016 07:38:57| main|padme|W|load sensor exited with exit status = 1
11/25/2016 07:39:36| main|padme|I|restarting load sensor /opt/schrodinger/2016-3/utilities/flexlm_sensor.pl
11/25/2016 07:39:36| main|padme|W|[load_sensor 6139] fflush failed [Broken pipe]
11/25/2016 07:39:37| main|padme|W|load sensor exited with exit status = 1
11/25/2016 07:41:58| main|padme|I|starting load sensor /opt/schrodinger/2016-3/utilities/flexlm_sensor.pl
11/25/2016 07:41:58| main|padme|I|registered at qmasterhost "rndusljpp2.na.jnj.com"
11/25/2016 07:41:58| main|padme|I|starting up SGE 8.1.8(lx-amd64)
11/25/2016 07:41:58| main|padme|I|memory accounting inaccurate with USE_SMAPS=false
11/25/2016 07:41:58| main|padme|I|successfully started PDC and PTF
11/25/2016 07:41:58| main|padme|I|checking for old jobs
11/25/2016 07:41:58| main|padme|I|no old jobs at startup
11/25/2016 07:41:59| main|padme|W|load sensor exited with exit status = 1
11/25/2016 07:42:38| main|padme|I|restarting load sensor /opt/schrodinger/2016-3/utilities/flexlm_sensor.pl
11/25/2016 07:42:38| main|padme|W|[load_sensor 5111] fflush failed [Broken pipe]

############# QMASTER log
11/25/2016 07:41:27|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
11/25/2016 07:41:27|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
11/25/2016 07:41:29|worker|rndusljpp2|I|execd on padme registered
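
In case it helps with reading longer excerpts of the logs above, here is a rough Python sketch that tallies entries by severity and counts load sensor restarts. The line layout (timestamp|thread|host|level|text) is only inferred from the excerpts in this post, and parse_line and summarize are hypothetical helper names, not part of SGE.

```python
import re
from collections import Counter

# Assumed layout, inferred from the excerpts in this post:
#   MM/DD/YYYY HH:MM:SS|thread|host|level|text
LINE_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2})\|"
    r"\s*(?P<thread>[^|]+)\|(?P<host>[^|]+)\|(?P<level>[A-Z])\|(?P<text>.*)$"
)


def parse_line(line):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def summarize(lines):
    """Count entries per severity level and count load sensor restarts."""
    levels = Counter()
    restarts = 0
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip anything that is not a log entry
        levels[entry["level"]] += 1
        if entry["text"].startswith("restarting load sensor"):
            restarts += 1
    return levels, restarts


# A few lines copied from the node log in this post.
SAMPLE = [
    "11/25/2016 07:38:56| main|padme|I|restarting load sensor /opt/schrodinger/2016-3/utilities/flexlm_sensor.pl",
    "11/25/2016 07:38:56| main|padme|W|[load_sensor 6137] fflush failed [Broken pipe]",
    "11/25/2016 07:38:57| main|padme|W|load sensor exited with exit status = 1",
    "11/25/2016 07:41:58| main|padme|I|starting up SGE 8.1.8(lx-amd64)",
]

if __name__ == "__main__":
    levels, restarts = summarize(SAMPLE)
    print(dict(levels), "load sensor restarts:", restarts)
```

To run it against a real file, pass an open file handle to summarize() instead of SAMPLE; the location of the execd messages file depends on the installation.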