Hi William, Thanks for your reply. See my comments below.
On Thu, Nov 23, 2017 at 2:53 AM, William Hay <[email protected]> wrote:

> On Wed, Nov 22, 2017 at 09:53:17AM -0800, Mun Johl wrote:
> > Hi,
> >
> > Periodically I am seeing the following error:
> >
> >   Unable to initialize environment because of error: cannot register
> >   event client. Only 100 event clients are allowed in the system
> >
> > The error first showed up a few days ago but stated "950 event clients
> > are allowed". Because MAX_DYN_EC was not set in my config, I equated it
> > to 100.
>
> I am not sure what you mean by "I equated it to 100"? Did you set it to
> 100 after getting the error? IIRC the default is 1000.

Yes. After getting the error I tried to check what the MAX_DYN_EC parameter
was set to, but it was not set in our configuration. I assumed it was
implicitly set to 950 based on the original error message. However, that
value is *way* larger than I would ever expect to need in our configuration,
so I was perplexed as to how we could have that many event clients. I wasn't
sure whether MAX_DYN_EC was actually set to 950 or whether the error message
was incorrect.

So I set MAX_DYN_EC to 100 via 'qconf -mconf' as a test. Note that 100 is
roughly an order of magnitude larger than I would expect we need. Currently,
we have very few qsub jobs running at any given time.

One other note: the grid has been working fine for months, and this error
only showed up a couple of weeks ago. That said, we may be exhausting our
consumable resources more frequently as of late. Not that that should result
in the error I'm seeing, but it is another piece of data.

> > However, our sim ring is fairly small at this point and we shouldn't be
> > getting anywhere near 100 outstanding qsub's (let alone 950). Therefore,
> > I'm wondering what other factors could result in this error?
> >
> > For example, could a slow network or slow grid master result in this
> > error?
> >
> > Any suggestions on how I can get to root cause would be most appreciated.
> >
> > Thanks,
>
> Are you actually using qsub? IIRC when using DRMAA it is possible to leak
> event clients (ie the event client is created when a job is qsub'd but
> isn't automatically freed when the job terminates only when the client
> program does) if you launch multiple jobs from the same process.
>
> If you are using qsub -sync y check that the qsub processes are actually
> being reaped (ie there aren't a bunch of zombie qsubs hanging around).

We're using 'qsub -sync y' and I don't see any zombie qsubs on our grid
hosts. But perhaps I should start a cron job to periodically check the
number of qsubs that are active.

> Also check that you aren't short of filehandles (ie ulimit) either where
> the submit program runs or where the qmaster lives.

'ulimit -n' reports 1024 on our qmaster and the execution hosts. However,
'sysctl fs.file-nr' outputs:

  fs.file-nr = 4736 0 13065172

So I'm a little confused as to why the number of file handles reported by
the sysctl command exceeds the ulimit value.

Regards,

-- 
Mun
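
A minimal sketch of the periodic qsub check mentioned above, in case it is useful. It is not a tested tool: the function name count_qsubs and the --live flag are made up for illustration, and it assumes a procps-style ps that accepts '-eo stat=,comm='. It counts processes whose command name is qsub and how many of them are in the zombie (Z) state.

```shell
#!/bin/sh
# Sketch only: count qsub processes and zombie qsubs on this host.
# count_qsubs reads "STAT COMMAND" lines on stdin, in the format
# printed by `ps -eo stat=,comm=`.
count_qsubs() {
    awk '$2 == "qsub" { total++; if ($1 ~ /^Z/) zombies++ }
         END { printf "qsub_total=%d qsub_zombies=%d\n", total, zombies }'
}

# Query the live process table only when asked to, so the function
# can also be fed saved ps output.
if [ "${1:-}" = "--live" ]; then
    printf '%s ' "$(date '+%Y-%m-%d %H:%M:%S')"
    ps -eo stat=,comm= | count_qsubs
fi
```

Run with --live from cron on each submit host and appended to a log, this would give a timestamped count that can be lined up against the times the event client error appears.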
