Re: [gridengine users] Error message- failed receving gdi request when calling qsub, but job is started

William Hay Wed, 22 Jun 2016 04:17:01 -0700

On Wed, Jun 22, 2016 at 08:39:35AM +0000, sudha.penme...@wipro.com wrote:
> Hi,
> 
> We have added the below qmaster params in the SGE configuration
> 
> qmaster_params               gdi_timeout=240 gdi_retries=-1 cl_ping=true
> 
> Could you let me know the difference between gdi_timeout and gdi_retries?
> Why is there a gdi_retries parameter? Why can't we use gdi_timeout alone to
> retry permanently, for example by allowing -1 as a value for gdi_timeout? I
> don't see the specific purpose of having the extra gdi_retries parameter.
> 
The difference is described in the sge_conf(5) manual page. gdi_timeout
specifies how long to wait between retries; gdi_retries specifies how many
times to retry. The timeout setting prevents you from bombarding a slow
server with repeated requests, while the retries setting ensures that things
will progress even if the odd request gets lost for some reason. If you used
a single magic value in gdi_timeout to represent "try forever", there would
be no way to specify how long to wait between retries.
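
To make the split between the two settings concrete, here is a minimal
Python sketch of a generic retry loop. It is not Grid Engine's actual code:
the names call_with_retries and attempt are hypothetical, and the convention
that 0 means "one attempt, no retry" and -1 means "retry forever" is my
reading of the manual page rather than something checked against the source.

```python
import itertools


def call_with_retries(attempt, timeout, retries):
    """Call attempt(timeout) until it returns or the retries run out.

    attempt -- hypothetical callable that waits up to `timeout` seconds
               for a reply and raises TimeoutError if none arrives
    timeout -- how long each attempt may wait (the gdi_timeout role)
    retries -- extra attempts after the first (the gdi_retries role);
               assumed here: 0 = try once, -1 = retry forever
    """
    # An endless counter for -1, otherwise one initial try plus `retries`.
    tries = itertools.count() if retries < 0 else range(retries + 1)
    for _ in tries:
        try:
            return attempt(timeout)
        except TimeoutError:
            # This attempt timed out; loop round for the next one, if any.
            continue
    raise TimeoutError(
        f"no reply after {retries + 1} attempt(s) of {timeout}s each"
    )
```

With timeout=240 and retries=-1 the loop never gives up, but each attempt
still waits up to 240 seconds, so the server is not flooded. Folding both
meanings into one number would lose one of the two controls.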


> When we have NFS latency issues we receive the error "failed receiving
> gdi request", yet the job is submitted, which is causing confusion.
> 
It has been my practice to keep the file system holding the Grid Engine
configuration local to the qmaster and export it to the rest of the cluster
via NFS, precisely because the speed with which the qmaster accesses this
file system matters a lot more than it does for other nodes. This does mean
our current setup lacks a shadow master, but one of my colleagues is
currently setting up a pair of servers with DRBD so we can support failover
in the event of hardware failure.

William


