Re: [gridengine users] commlib

Reuti Sat, 26 Nov 2016 06:34:15 -0800

Hi,

On 26.11.2016 at 06:10, Coleman, Marcus [JRDUS Non-J&J] wrote:


> I am having an issue with a node rebooting. I am running Desmond fep jobs…
>  
> Thanks for any help in advance!
>  
> /etc/resolv.conf is the same on all nodes
> /etc/hosts is the same on all nodes
> All nodes are connected to the same switch in a server rack.
>  
>  
> Qping from master to node
> [root@rndusljpp2 lx-amd64]# qping padme 6445 execd 1
> 11/25/2016 20:57:26 endpoint padme/execd/1 at port 6445 is up for 16733 seconds
> 11/25/2016 20:57:27 endpoint padme/execd/1 at port 6445 is up for 16734 seconds
> 11/25/2016 20:57:28 endpoint padme/execd/1 at port 6445 is up for 16735 seconds
> 11/25/2016 20:57:29 endpoint padme/execd/1 at port 6445 is up for 16736 seconds
> 11/25/2016 20:57:30 endpoint padme/execd/1 at port 6445 is up for 16737 seconds
> 11/25/2016 20:57:31 endpoint padme/execd/1 at port 6445 is up for 16738 seconds
>  
> Qping from node to master
> [root@padme ~]# qping s1 6444 qmaster 1
> 11/25/2016 20:59:10 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 6444 is up for 2440537 seconds
> 11/25/2016 20:59:11 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 6444 is up for 2440538 seconds
> 11/25/2016 20:59:12 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 6444 is up for 2440539 seconds
> 11/25/2016 20:59:13 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 6444 is up for 2440540 seconds
> 11/25/2016 20:59:14 endpoint rndusljpp2.na.jnj.com/qmaster/1 at port 6444 is up for 2440541 seconds
>  
> ################### from NODE
> [root@padme lx-amd64]# ./gethostbyaddr -name 192.168.1.8
> rndusljpp2.na.jnj.com
> [root@padme lx-amd64]# ./gethostbyname -name s1
> rndusljpp2.na.jnj.com
> ################### from QMASTER
> [root@rndusljpp2 lx-amd64]# ./gethostbyaddr -name 192.168.1.159
> padme
> [root@rndusljpp2 lx-amd64]# ./gethostbyname -name padme
> padme
>  
>  
> ############# NODE SGE logs
>  
> 11/25/2016 07:38:56|  main|padme|I|restarting load sensor /opt/schrodinger/2016-3/utilities/flexlm_sensor.pl
> 11/25/2016 07:38:56|  main|padme|W|[load_sensor 6137] fflush failed [Broken pipe]
> 11/25/2016 07:38:57|  main|padme|W|load sensor exited with exit status = 1
> 11/25/2016 07:39:36|  main|padme|I|restarting load sensor /opt/schrodinger/2016-3/utilities/flexlm_sensor.pl
> 11/25/2016 07:39:36|  main|padme|W|[load_sensor 6139] fflush failed [Broken pipe]
> 11/25/2016 07:39:37|  main|padme|W|load sensor exited with exit status = 1
> 11/25/2016 07:41:58|  main|padme|I|starting load sensor /opt/schrodinger/2016-3/utilities/flexlm_sensor.pl
> 11/25/2016 07:41:58|  main|padme|I|registered at qmasterhost "rndusljpp2.na.jnj.com"
> 11/25/2016 07:41:58|  main|padme|I|starting up SGE 8.1.8(lx-amd64)
> 11/25/2016 07:41:58|  main|padme|I|memory accounting inaccurate with USE_SMAPS=false
> 11/25/2016 07:41:58|  main|padme|I|successfully started PDC and PTF
> 11/25/2016 07:41:58|  main|padme|I|checking for old jobs
> 11/25/2016 07:41:58|  main|padme|I|no old jobs at startup
> 11/25/2016 07:41:59|  main|padme|W|load sensor exited with exit status = 1
> 11/25/2016 07:42:38|  main|padme|I|restarting load sensor /opt/schrodinger/2016-3/utilities/flexlm_sensor.pl
> 11/25/2016 07:42:38|  main|padme|W|[load_sensor 5111] fflush failed [Broken pipe]
>  
> ############# QMASTER log
> 11/25/2016 07:41:27|listen|rndusljpp2|E|commlib error: endpoint is not unique error (endpoint "padme/execd/1" is already connected)
> 11/25/2016 07:41:27|listen|rndusljpp2|E|commlib error: got select error (Connection reset by peer)
> 11/25/2016 07:41:29|worker|rndusljpp2|I|execd on padme registered

Are there any files in /tmp on the node pointing to a problem starting execd?
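
A quick way to look, as a rough sketch only: the file name pattern `execd_messages.*` is an assumption about what execd leaves in /tmp when it cannot write to its spool directory, and `list_execd_startup_logs` is a made-up helper name, so adjust both to your installation.

```shell
#!/bin/sh
# Hypothetical helper: list candidate execd startup logs in a directory
# (defaults to /tmp). Only the top level of the directory is searched.
list_execd_startup_logs() {
    dir="${1:-/tmp}"
    # ASSUMPTION: fallback startup messages are named execd_messages.<pid>.
    find "$dir" -maxdepth 1 -type f -name 'execd_messages.*' 2>/dev/null
}

# Run on the node (padme in this thread) and inspect any file it prints.
list_execd_startup_logs /tmp
```

If it prints nothing, there are no files matching that pattern; a plain `ls -lt /tmp` sorted by time may still show something written around the 07:41 restart.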

-- Reuti
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
