Reuti,

So it rebooted again without any jobs running. I also don't understand the message 'sgead...@rndusljpp2.na.jnj.com removed "mcolem19" from user list', because as you can see I got added back later.

11/27/2016 01:30:04| timer|rndusljpp2|I|sgead...@rndusljpp2.na.jnj.com removed "mcolem19" from user list
11/27/2016 01:30:04| timer|rndusljpp2|I|sgead...@rndusljpp2.na.jnj.com removed "mcolem19" from user list
11/27/2016 20:35:12|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
11/27/2016 20:35:12|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
11/27/2016 20:35:13|worker|rndusljpp2|I|execd on padme registered
11/28/2016 06:26:20|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
11/28/2016 06:26:20|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
11/28/2016 06:26:20|worker|rndusljpp2|I|execd on padme registered
11/28/2016 08:49:52|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
11/28/2016 08:49:52|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
11/28/2016 08:49:52|worker|rndusljpp2|I|execd on padme registered
11/28/2016 13:25:54|worker|rndusljpp2|I|sgead...@rndusljpp2.na.jnj.com added "mcolem19" to user list

-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Monday, November 28, 2016 11:55 AM
To: Coleman, Marcus [JRDUS Non-J&J]
Cc: users@gridengine.org
Subject: [EXTERNAL] Re: [gridengine users] commlib

On 28.11.2016 at 20:36, Coleman, Marcus [JRDUS Non-J&J] wrote:

> Thanks Reuti!
>
> I was hoping it was something there... Any ideas on where to go from here?

What do:

$ ./gethostbyname -all padme
$ ./gethostbyaddr -all 192.168.1.159

show on the node and headnode?

-- Reuti

> -----Original Message-----
> From: Reuti [mailto:re...@staff.uni-marburg.de]
> Sent: Sunday, November 27, 2016 4:37 AM
> To: Coleman, Marcus [JRDUS Non-J&J]
> Cc: users@gridengine.org
> Subject: [EXTERNAL] Re: [gridengine users] commlib
>
>
> On 27.11.2016 at 03:23, Coleman, Marcus [JRDUS Non-J&J] wrote:
>
>> Hi Reuti
>>
>> I am not sure what I am looking for, but here are the contents of
>> /tmp on the rebooting node. Does anything stand out?
>>
>> [root@padme tmp]# ls -l
>> total 20
>> prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:09 jmonitor.mcolem19.37995
>> prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:35 jmonitor.mcolem19.38497
>> prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:45 jmonitor.mcolem19.38615
>> prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:45 jmonitor.mcolem19.38624
>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:27 jmonitor.schrogpu.28331
>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:27 jmonitor.schrogpu.28377
>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:40 jmonitor.schrogpu.31781
>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:41 jmonitor.schrogpu.31829
>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 9 12:17 jmonitor.schrogpu.5042
>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 9 12:17 jmonitor.schrogpu.5043
>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:08 jmonitor.schrogpu.8041
>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:39 jmonitor.schrogpu.8220
>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:26 jmonitor.schrogpu.8346
>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:39 jmonitor.schrogpu.8557
>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:27 jmonitor.schrogpu.8740
>> drwx------ 2 root root 4096 Nov 4 16:09 keyring-6CWKlB
>> drwxrwxrwx 2 mcolem19 mcolem19 4096 Nov 23 11:03 mmjob.lock
>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28352
>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28400
>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28480
>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28487
>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.31802
>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.31850
>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:40 mmjob.schrogpu.31876
>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:41 mmjob.schrogpu.31891
>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:08 mmjob.schrogpu.8087
>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.8266
>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:26 mmjob.schrogpu.8392
>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.8603
>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.8787
>> drwx------ 2 gdm gdm 4096 Nov 25 07:42 orbit-gdm
>> drwx------. 2 gdm gdm 4096 Nov 25 07:42 pulse-5mlDwNemaGym
>> drwx------ 2 root root 4096 Nov 4 16:09 pulse-GAI9xhuCTgeg
>
> Thx, I was looking for a file created by the execd in case it faces problems
> during startup. Such files will be saved in /tmp as a last resort for the
> logfiles. Unfortunately there are none, hence the startup per se was
> successful.
>
>
>> [root@padme tmp]#
>>
>>
>> -----Original Message-----
>> From: Reuti [mailto:re...@staff.uni-marburg.de]
>> Sent: Saturday, November 26, 2016 6:31 AM
>> To: Coleman, Marcus [JRDUS Non-J&J]
>> Cc: users@gridengine.org
>> Subject: [EXTERNAL] Re: [gridengine users] commlib
>>
>> Hi,
>>
>> On 26.11.2016 at 06:10, Coleman, Marcus [JRDUS Non-J&J] wrote:
>>
>>> I am having an issue with a node rebooting. I am running Desmond fep
>>> jobs...
>>>
>>> Thanks for any help in advance!
>>>
>>> /etc/resolv.conf is the same on all nodes
>>> /etc/hosts is the same on all nodes
>>> All nodes are connected to the same switch in a server rack.
>>>
>>> ################### from NODE
>>> [root@padme lx-amd64]# ./gethostbyaddr -name 192.168.1.8
>>> rndusljpp2.na.jnj.com
>>> [root@padme lx-amd64]# ./gethostbyname -name s1
>>> rndusljpp2.na.jnj.com
>>> ################### from QMASTER
>>> [root@rndusljpp2 lx-amd64]# ./gethostbyaddr -name 192.168.1.159
>>> padme
>>> [root@rndusljpp2 lx-amd64]# ./gethostbyname -name padme
>>> padme
>
> What do:
>
> $ ./gethostbyname -all padme
> $ ./gethostbyaddr -all 192.168.1.159
>
> show?
>
> -- Reuti
>
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
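
The forward and reverse lookups asked for in this thread can also be cross-checked with Python's standard socket module. What follows is only a rough sketch under assumptions: `check_host` and its `forward`/`reverse` parameters are names invented here for illustration, and the function uses the system resolver only, so it is not a replacement for the Grid Engine `gethostbyname` and `gethostbyaddr` utilities.

```python
import socket


def check_host(name, forward=socket.gethostbyname_ex, reverse=socket.gethostbyaddr):
    """Resolve name to its addresses, then resolve each address back.

    Returns a list of (address, reverse_name, matches) tuples. matches is
    True when the reverse lookup returns the original name, or the
    canonical name from the forward lookup, among its names.
    The forward/reverse parameters are hypothetical hooks (not part of any
    Grid Engine tool) so the function can be tried without a live resolver.
    """
    canonical, _aliases, addresses = forward(name)
    wanted = {name.lower(), canonical.lower()}
    results = []
    for addr in addresses:
        try:
            rname, raliases, _addrs = reverse(addr)
        except OSError:
            # socket.herror and socket.gaierror are both OSError subclasses
            results.append((addr, None, False))
            continue
        returned = {rname.lower()} | {alias.lower() for alias in raliases}
        results.append((addr, rname, bool(wanted & returned)))
    return results
```

Calling `check_host("padme")` on both the node and the qmaster and comparing the tuples shows whether each side sees the same name for the same address. A `False` in the last position, or different results on the two hosts, would point to a name resolution mismatch worth investigating.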