Reuti

So it rebooted again without any jobs running. I also don't understand the 
message 'sgead...@rndusljpp2.na.jnj.com removed "mcolem19" from user list', 
because, as you can see at the end of the log, I was added back later.

11/27/2016 01:30:04| timer|rndusljpp2|I|sgead...@rndusljpp2.na.jnj.com removed "mcolem19" from user list
11/27/2016 01:30:04| timer|rndusljpp2|I|sgead...@rndusljpp2.na.jnj.com removed "mcolem19" from user list
11/27/2016 20:35:12|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
11/27/2016 20:35:12|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
11/27/2016 20:35:13|worker|rndusljpp2|I|execd on padme registered
11/28/2016 06:26:20|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
11/28/2016 06:26:20|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
11/28/2016 06:26:20|worker|rndusljpp2|I|execd on padme registered
11/28/2016 08:49:52|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
11/28/2016 08:49:52|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
11/28/2016 08:49:52|worker|rndusljpp2|I|execd on padme registered
11/28/2016 13:25:54|worker|rndusljpp2|I|sgead...@rndusljpp2.na.jnj.com added "mcolem19" to user list
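
To line the execd registrations up against the reboot times, a minimal Python 
sketch (not part of Grid Engine; the names join_records and 
execd_registrations are made up for illustration) could pull the timestamps 
of the "execd on padme registered" entries out of a qmaster messages file. It 
assumes the "date time|thread|host|level|message" layout seen in the excerpt 
above, and it treats any line that does not start with a timestamp as a 
wrapped continuation of the previous entry.

```python
import re
from datetime import datetime

# Assumption: every log record starts with "MM/DD/YYYY HH:MM:SS|".
# Lines that do not match are taken to be wrapped continuations.
RECORD_START = re.compile(r"^\d{2}/\d{2}/\d{4} \d{2}:\d{2}:\d{2}\|")


def join_records(lines):
    """Merge wrapped continuation lines back into single log records."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if RECORD_START.match(line):
            records.append(line)
        elif records and line.strip():
            records[-1] = records[-1].rstrip() + " " + line.strip()
    return records


def execd_registrations(lines, host):
    """Return the timestamps at which the execd on `host` registered."""
    needle = "execd on %s registered" % host
    times = []
    for record in join_records(lines):
        # Split into at most five fields so a "|" in the message survives.
        fields = [field.strip() for field in record.split("|", 4)]
        if len(fields) == 5 and fields[4] == needle:
            times.append(datetime.strptime(fields[0], "%m/%d/%Y %H:%M:%S"))
    return times


# Sample taken from the excerpt above, including one wrapped record.
sample = [
    "11/27/2016 20:35:12|listen|rndusljpp2|E|commlib error: endpoint is not unique ",
    'error (endpoint "padme/execd/1" is already connected)',
    "11/27/2016 20:35:13|worker|rndusljpp2|I|execd on padme registered",
    "11/28/2016 06:26:20|worker|rndusljpp2|I|execd on padme registered",
]

for stamp in execd_registrations(sample, "padme"):
    print(stamp)
```

The resulting timestamps can then be compared with the boot times recorded on 
the node itself, for example the ones listed by "last reboot".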

-----Original Message-----
From: Reuti [mailto:re...@staff.uni-marburg.de] 
Sent: Monday, November 28, 2016 11:55 AM
To: Coleman, Marcus [JRDUS Non-J&J]
Cc: users@gridengine.org
Subject: [EXTERNAL] Re: [gridengine users] commlib


On 28.11.2016 at 20:36, Coleman, Marcus [JRDUS Non-J&J] wrote:

> Thanks Reuti! 
> 
> I was hoping it was something there... Any ideas on where to go from here?

What do:

$ ./gethostbyname -all padme
$ ./gethostbyaddr -all 192.168.1.159

show on the node and headnode?

-- Reuti


> -----Original Message-----
> From: Reuti [mailto:re...@staff.uni-marburg.de]
> Sent: Sunday, November 27, 2016 4:37 AM
> To: Coleman, Marcus [JRDUS Non-J&J]
> Cc: users@gridengine.org
> Subject: [EXTERNAL] Re: [gridengine users] commlib
> 
> 
> On 27.11.2016 at 03:23, Coleman, Marcus [JRDUS Non-J&J] wrote:
> 
>> Hi Reuti
>> 
>> I am not sure what I am looking for, but here are the contents of 
>> /tmp on the rebooting node. Does anything stand out to you?
>> 
>> [root@padme tmp]# ls -l
>> total 20
>> prw-rw-r--  1 mcolem19 mcolem19    0 Nov 23 22:09 jmonitor.mcolem19.37995
>> prw-rw-r--  1 mcolem19 mcolem19    0 Nov 23 22:35 jmonitor.mcolem19.38497
>> prw-rw-r--  1 mcolem19 mcolem19    0 Nov 23 22:45 jmonitor.mcolem19.38615
>> prw-rw-r--  1 mcolem19 mcolem19    0 Nov 23 22:45 jmonitor.mcolem19.38624
>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:27 jmonitor.schrogpu.28331
>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:27 jmonitor.schrogpu.28377
>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:40 jmonitor.schrogpu.31781
>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:41 jmonitor.schrogpu.31829
>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  9 12:17 jmonitor.schrogpu.5042
>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  9 12:17 jmonitor.schrogpu.5043
>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:08 jmonitor.schrogpu.8041
>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:39 jmonitor.schrogpu.8220
>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:26 jmonitor.schrogpu.8346
>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:39 jmonitor.schrogpu.8557
>> prw-rw-r--  1 schrogpu schrogpu    0 Sep  5 00:27 jmonitor.schrogpu.8740
>> drwx------  2 root     root     4096 Nov  4 16:09 keyring-6CWKlB
>> drwxrwxrwx  2 mcolem19 mcolem19 4096 Nov 23 11:03 mmjob.lock
>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:27 mmjob.schrogpu.28352
>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:27 mmjob.schrogpu.28400
>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:27 mmjob.schrogpu.28480
>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:27 mmjob.schrogpu.28487
>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:39 mmjob.schrogpu.31802
>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:39 mmjob.schrogpu.31850
>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:40 mmjob.schrogpu.31876
>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:41 mmjob.schrogpu.31891
>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:08 mmjob.schrogpu.8087
>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:39 mmjob.schrogpu.8266
>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:26 mmjob.schrogpu.8392
>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:39 mmjob.schrogpu.8603
>> prw-------  1 schrogpu schrogpu    0 Sep  5 00:27 mmjob.schrogpu.8787
>> drwx------  2 gdm      gdm      4096 Nov 25 07:42 orbit-gdm
>> drwx------. 2 gdm      gdm      4096 Nov 25 07:42 pulse-5mlDwNemaGym
>> drwx------  2 root     root     4096 Nov  4 16:09 pulse-GAI9xhuCTgeg
> 
> Thanks. I was looking for a file that the execd creates when it runs into 
> problems during startup. Such files are written to /tmp as a last resort in 
> place of the regular logfiles. There are none, so the startup itself was 
> successful.
> 
> 
>> [root@padme tmp]#
>> 
>> 
>> -----Original Message-----
>> From: Reuti [mailto:re...@staff.uni-marburg.de]
>> Sent: Saturday, November 26, 2016 6:31 AM
>> To: Coleman, Marcus [JRDUS Non-J&J]
>> Cc: users@gridengine.org
>> Subject: [EXTERNAL] Re: [gridengine users] commlib
>> 
>> Hi,
>> 
>> On 26.11.2016 at 06:10, Coleman, Marcus [JRDUS Non-J&J] wrote:
>> 
>>> I am having an issue with a node rebooting. I am running Desmond fep 
>>> jobs...
>>> 
>>> Thanks for any help in advance!
>>> 
>>> /etc/resolv.conf is the same on all nodes
>>> /etc/hosts is the same on all nodes
>>> All nodes are connected to the same switch in a server rack.
>>> ################### from NODE
>>> [root@padme lx-amd64]# ./gethostbyaddr -name 192.168.1.8
>>> rndusljpp2.na.jnj.com
>>> [root@padme lx-amd64]# ./gethostbyname -name s1
>>> rndusljpp2.na.jnj.com
>>> ################### from QMASTER
>>> [root@rndusljpp2 lx-amd64]# ./gethostbyaddr -name 192.168.1.159
>>> padme
>>> [root@rndusljpp2 lx-amd64]# ./gethostbyname -name padme
>>> padme
> 
> What do:
> 
> $ ./gethostbyname -all padme
> $ ./gethostbyaddr -all 192.168.1.159
> 
> show?
> 
> -- Reuti
> 


_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users
