Reuti,

Thanks for the information! Any idea what is causing the reboot?

-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de]
Sent: Tuesday, November 29, 2016 6:02 AM
To: Coleman, Marcus [JRDUS Non-J&J]
Cc: users@gridengine.org
Subject: [EXTERNAL] Re: Re: [gridengine users] commlib

> On 29.11.2016 at 00:17, Coleman, Marcus [JRDUS Non-J&J] <mcole...@its.jnj.com> wrote:
>
> Reuti
>
> So it rebooted again without any jobs running...and I don't understand
> "sgead...@rndusljpp2.na.jnj.com removed "mcolem19" from user list" but as you
> see I got added back ???

Yes, there is an auto-delete time for users who were added automatically due to a job submission.

$ qconf -suser mcolem19

will show when the next deletion will take place (unless you set it to 0).

$ qconf -suserl

shows all currently known users.

-- Reuti

>
> 11/27/2016 01:30:04| timer|rndusljpp2|I|sgead...@rndusljpp2.na.jnj.com removed "mcolem19" from user list
> 11/27/2016 01:30:04| timer|rndusljpp2|I|sgead...@rndusljpp2.na.jnj.com removed "mcolem19" from user list
> 11/27/2016 20:35:12|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
> 11/27/2016 20:35:12|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
> 11/27/2016 20:35:13|worker|rndusljpp2|I|execd on padme registered
> 11/28/2016 06:26:20|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
> 11/28/2016 06:26:20|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
> 11/28/2016 06:26:20|worker|rndusljpp2|I|execd on padme registered
> 11/28/2016 08:49:52|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
> 11/28/2016 08:49:52|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
> 11/28/2016 08:49:52|worker|rndusljpp2|I|execd on padme registered
> 11/28/2016 13:25:54|worker|rndusljpp2|I|sgead...@rndusljpp2.na.jnj.com added "mcolem19" to user list
>
> -----Original Message-----
> From: Reuti [mailto:re...@staff.uni-marburg.de]
> Sent: Monday, November 28, 2016 11:55 AM
> To: Coleman, Marcus [JRDUS Non-J&J]
> Cc: users@gridengine.org
> Subject: [EXTERNAL] Re: [gridengine users] commlib
>
> On 28.11.2016 at 20:36, Coleman, Marcus [JRDUS Non-J&J] wrote:
>
>> Thanks Reuti!
>>
>> I was hoping it was something there....Any ideas on where to go from here?
>
> What do:
>
> $ ./gethostbyname -all padme
> $ ./gethostbyaddr -all 192.168.1.159
>
> show on the node and headnode?
>
> -- Reuti
>
>
>> -----Original Message-----
>> From: Reuti [mailto:re...@staff.uni-marburg.de]
>> Sent: Sunday, November 27, 2016 4:37 AM
>> To: Coleman, Marcus [JRDUS Non-J&J]
>> Cc: users@gridengine.org
>> Subject: [EXTERNAL] Re: [gridengine users] commlib
>>
>> On 27.11.2016 at 03:23, Coleman, Marcus [JRDUS Non-J&J] wrote:
>>
>>> Hi Reuti
>>>
>>> I am not sure what I am looking for...but here are the contents of
>>> /tmp on the rebooting node. Any outrights you can see?
>>>
>>> [root@padme tmp]# ls -l
>>> total 20
>>> prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:09 jmonitor.mcolem19.37995
>>> prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:35 jmonitor.mcolem19.38497
>>> prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:45 jmonitor.mcolem19.38615
>>> prw-rw-r-- 1 mcolem19 mcolem19 0 Nov 23 22:45 jmonitor.mcolem19.38624
>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:27 jmonitor.schrogpu.28331
>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:27 jmonitor.schrogpu.28377
>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:40 jmonitor.schrogpu.31781
>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:41 jmonitor.schrogpu.31829
>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 9 12:17 jmonitor.schrogpu.5042
>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 9 12:17 jmonitor.schrogpu.5043
>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:08 jmonitor.schrogpu.8041
>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:39 jmonitor.schrogpu.8220
>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:26 jmonitor.schrogpu.8346
>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:39 jmonitor.schrogpu.8557
>>> prw-rw-r-- 1 schrogpu schrogpu 0 Sep 5 00:27 jmonitor.schrogpu.8740
>>> drwx------ 2 root root 4096 Nov 4 16:09 keyring-6CWKlB
>>> drwxrwxrwx 2 mcolem19 mcolem19 4096 Nov 23 11:03 mmjob.lock
>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28352
>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28400
>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28480
>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.28487
>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.31802
>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.31850
>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:40 mmjob.schrogpu.31876
>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:41 mmjob.schrogpu.31891
>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:08 mmjob.schrogpu.8087
>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.8266
>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:26 mmjob.schrogpu.8392
>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:39 mmjob.schrogpu.8603
>>> prw------- 1 schrogpu schrogpu 0 Sep 5 00:27 mmjob.schrogpu.8787
>>> drwx------ 2 gdm gdm 4096 Nov 25 07:42 orbit-gdm
>>> drwx------. 2 gdm gdm 4096 Nov 25 07:42 pulse-5mlDwNemaGym
>>> drwx------ 2 root root 4096 Nov 4 16:09 pulse-GAI9xhuCTgeg
>>
>> Thx, I was looking for a file created by the execd in case it faces problems
>> during startup. Such files are saved in /tmp as a last resort when the
>> logfiles cannot be written. Unfortunately there are none, hence the startup
>> per se was successful.
>>
>>
>>> [root@padme tmp]#
>>>
>>>
>>> -----Original Message-----
>>> From: Reuti [mailto:re...@staff.uni-marburg.de]
>>> Sent: Saturday, November 26, 2016 6:31 AM
>>> To: Coleman, Marcus [JRDUS Non-J&J]
>>> Cc: users@gridengine.org
>>> Subject: [EXTERNAL] Re: [gridengine users] commlib
>>>
>>> Hi,
>>>
>>> On 26.11.2016 at 06:10, Coleman, Marcus [JRDUS Non-J&J] wrote:
>>>
>>>> I am having an issue with a node rebooting. I am running Desmond
>>>> fep jobs...
>>>>
>>>> Thanks for any help in advance!
>>>>
>>>> /etc/resolv.conf is the same on all nodes
>>>> /etc/hosts is the same on all nodes
>>>> All nodes are connected to the same switch in a server rack.
>>>>
>>>> ################### from NODE
>>>> [root@padme lx-amd64]# ./gethostbyaddr -name 192.168.1.8
>>>> rndusljpp2.na.jnj.com
>>>> [root@padme lx-amd64]# ./gethostbyname -name s1
>>>> rndusljpp2.na.jnj.com
>>>> ################### from QMASTER
>>>> [root@rndusljpp2 lx-amd64]# ./gethostbyaddr -name 192.168.1.159
>>>> padme
>>>> [root@rndusljpp2 lx-amd64]# ./gethostbyname -name padme
>>>> padme
>>
>> What do:
>>
>> $ ./gethostbyname -all padme
>> $ ./gethostbyaddr -all 192.168.1.159
>>
>> show?
>>
>> -- Reuti
>>
>
>
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
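
As a rough aid for lining up the reboots with the qmaster log quoted in this thread, here is a minimal Python sketch that pulls out the times at which the execd on a node re-registered. The field layout (timestamp|thread|host|level|message) is only inferred from the excerpt above, and the function names parse_line and reregistrations are made up for this example; they are not part of Grid Engine. The sample lines are copied from the quoted log.

```python
from datetime import datetime

# Sample lines copied from the qmaster messages excerpt quoted in this thread.
SAMPLE = """\
11/27/2016 20:35:12|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
11/27/2016 20:35:12|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
11/27/2016 20:35:13|worker|rndusljpp2|I|execd on padme registered
11/28/2016 06:26:20|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
11/28/2016 06:26:20|worker|rndusljpp2|I|execd on padme registered
"""


def parse_line(line):
    """Split one log line into (time, thread, host, level, message).

    Assumption: five '|'-separated fields with a month/day/year timestamp,
    as seen in the excerpt. Returns None for lines that do not fit.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    stamp, thread, host, level, message = parts
    try:
        when = datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    return when, thread.strip(), host.strip(), level.strip(), message.strip()


def reregistrations(lines, node):
    """Return the timestamps at which the execd on `node` registered."""
    marker = f"execd on {node} registered"
    records = (parse_line(line) for line in lines)
    return [rec[0] for rec in records if rec is not None and rec[4] == marker]


if __name__ == "__main__":
    for when in reregistrations(SAMPLE.splitlines(), "padme"):
        print(when.isoformat(sep=" "))
```

Comparing these times with the boot times the node itself reports (for example from `last reboot`) would show whether each re-registration follows a reboot or whether the execd alone was restarted.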