Hi Reuti,

I am not sure what I am looking for, but here are the contents of /tmp on the rebooting node. Do you see anything obviously wrong?

[root@padme tmp]# ls -l
total 20
prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:09 jmonitor.mcolem19.37995
prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:35 jmonitor.mcolem19.38497
prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:45 jmonitor.mcolem19.38615
prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:45 jmonitor.mcolem19.38624
prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:27 jmonitor.schrogpu.28331
prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:27 jmonitor.schrogpu.28377
prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:40 jmonitor.schrogpu.31781
prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:41 jmonitor.schrogpu.31829
prw-rw-r-- 1 schrogpu schrogpu 0 Sep 9 12:17 jmonitor.schrogpu.5042
prw-rw-r-- 1 schrogpu schrogpu 0 Sep 9 12:17 jmonitor.schrogpu.5043
prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:08 jmonitor.schrogpu.8041
prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:39 jmonitor.schrogpu.8220
prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:26 jmonitor.schrogpu.8346
prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:39 jmonitor.schrogpu.8557
prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:27 jmonitor.schrogpu.8740
drwx------ 2 root root 4096 Nov 4 16:09 keyring-6CWKlB
drwxrwxrwx 2 mcolem19 mcolem19 4096 Nov 23 11:03 mmjob.lock
prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28352
prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28400
prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28480
prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28487
prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.31802
prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.31850
prw------- 1 schrogpu schrogpu 0 Sep 5 00:40 mmjob.schrogpu.31876
prw------- 1 schrogpu schrogpu 0 Sep 5 00:41 mmjob.schrogpu.31891
prw------- 1 schrogpu schrogpu 0 Sep 5 00:08 mmjob.schrogpu.8087
prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.8266
prw------- 1 schrogpu schrogpu 0 Sep 5 00:26 mmjob.schrogpu.8392
prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.8603
prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.8787
drwx------ 2 gdm gdm 4096 Nov 25 07:42 orbit-gdm
drwx------. 2 gdm gdm 4096 Nov 25 07:42 pulse-5mlDwNemaGym
drwx------ 2 root root 4096 Nov 4 16:09 pulse-GAI9xhuCTgeg
[root@padme tmp]#

-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Saturday, November 26, 2016 6:31 AM
To: Coleman, Marcus [JRDUS Non-J&J]
Cc: users@gridengine.org
Subject: [EXTERNAL] Re: [gridengine users] commlib

Hi,

On 26.11.2016 at 06:10, Coleman, Marcus [JRDUS Non-J&J] wrote:

> I am having an issue with a node rebooting. I am running Desmond fep
> jobs...
>
> Thanks for any help in advance!
>
> /etc/resolv.conf is the same on all nodes
> /etc/hosts is the same on all nodes
> All nodes are connected to the same switch in a server rack.
>
>
> Qping from master to node
> [root@rndusljpp2 lx-amd64]# qping padme 6445 execd 1
> 11/25/2016 20:57:26 endpoint padme/execd/1 at port 6445 is up for 16733 seconds
> 11/25/2016 20:57:27 endpoint padme/execd/1 at port 6445 is up for 16734 seconds
> 11/25/2016 20:57:28 endpoint padme/execd/1 at port 6445 is up for 16735 seconds
> 11/25/2016 20:57:29 endpoint padme/execd/1 at port 6445 is up for 16736 seconds
> 11/25/2016 20:57:30 endpoint padme/execd/1 at port 6445 is up for 16737 seconds
> 11/25/2016 20:57:31 endpoint padme/execd/1 at port 6445 is up for 16738 seconds
>
> Qping from node to master
> [root@padme ~]# qping s1 6444 qmaster 1
> 11/25/2016 20:59:10 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 6444 is up for 2440537 seconds
> 11/25/2016 20:59:11 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 6444 is up for 2440538 seconds
> 11/25/2016 20:59:12 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 6444 is up for 2440539 seconds
> 11/25/2016 20:59:13 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 6444 is up for 2440540 seconds
> 11/25/2016 20:59:14 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 6444 is up for 2440541 seconds
>
> ################### from NODE
> [root@padme lx-amd64]# ./gethostbyaddr -name 192.168.1.8
> rndusljpp2.na.jnj.com
> [root@padme lx-amd64]# ./gethostbyname -name s1
> rndusljpp2.na.jnj.com
> ################### from QMASTER
> [root@rndusljpp2 lx-amd64]# ./gethostbyaddr -name 192.168.1.159
> padme
> [root@rndusljpp2 lx-amd64]# ./gethostbyname -name padme
> padme
>
>
> ############# NODE SGE logs
>
> 11/25/2016 07:38:56| main|padme|I|restarting load sensor /opt/schrodinger/2016-3/utilities/flexlm_sensor.pl
> 11/25/2016 07:38:56| main|padme|W|[load_sensor 6137] fflush failed [Broken pipe]
> 11/25/2016 07:38:57| main|padme|W|load sensor exited with exit status = 1
> 11/25/2016 07:39:36| main|padme|I|restarting load sensor /opt/schrodinger/2016-3/utilities/flexlm_sensor.pl
> 11/25/2016 07:39:36| main|padme|W|[load_sensor 6139] fflush failed [Broken pipe]
> 11/25/2016 07:39:37| main|padme|W|load sensor exited with exit status = 1
> 11/25/2016 07:41:58| main|padme|I|starting load sensor /opt/schrodinger/2016-3/utilities/flexlm_sensor.pl
> 11/25/2016 07:41:58| main|padme|I|registered at qmasterhost "rndusljpp2.na.jnj.com"
> 11/25/2016 07:41:58| main|padme|I|starting up SGE 8.1.8(lx-amd64)
> 11/25/2016 07:41:58| main|padme|I|memory accounting inaccurate with USE_SMAPS=false
> 11/25/2016 07:41:58| main|padme|I|successfully started PDC and PTF
> 11/25/2016 07:41:58| main|padme|I|checking for old jobs
> 11/25/2016 07:41:58| main|padme|I|no old jobs at startup
> 11/25/2016 07:41:59| main|padme|W|load sensor exited with exit status = 1
> 11/25/2016 07:42:38| main|padme|I|restarting load sensor /opt/schrodinger/2016-3/utilities/flexlm_sensor.pl
> 11/25/2016 07:42:38| main|padme|W|[load_sensor 5111] fflush failed [Broken pipe]
>
> ############# QMASTER log
> 11/25/2016 07:41:27|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
> 11/25/2016 07:41:27|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
> 11/25/2016 07:41:29|worker|rndusljpp2|I|execd on padme registered

Are there any files in /tmp on the node pointing to a problem starting execd?

-- Reuti

_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users