On Wed, Feb 22, 2012 at 10:33 AM, Reuti <[email protected]> wrote:
> Hi,
>
> On 22.02.2012 at 18:21, Michael Coffman wrote:
>
>> Just got the following error returned to my qrsh command line:
>>
>> error: error while waiting for builtin IJS connection: "got select timeout"
>
> Was it working before? Do the startup methods in the global configuration and
> the node configuration match for rlogin_command/rlogin_daemon?

Yes, it was working before. Both the node and global configurations are set to builtin.
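
For anyone who wants to script the check Reuti asks about, here is a minimal sketch. It assumes the two configurations have been saved as text (for example from `qconf -sconf global` and `qconf -sconf cs424`) in the usual "name value" layout, and that the submit host relies on the global configuration alone. The function names `parse_conf` and `startup_mismatches` are made up for this illustration and are not part of Grid Engine.

```python
# Command/daemon pairs that select the interactive startup method.
PAIRS = [
    ("qlogin_command", "qlogin_daemon"),
    ("rlogin_command", "rlogin_daemon"),
    ("rsh_command", "rsh_daemon"),
]


def parse_conf(text):
    """Turn 'name  value' lines into a dict, skipping blanks and '#' lines."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def startup_mismatches(global_text, host_text):
    """List pairs where only one side (client or daemon) is set to builtin.

    Assumption: the client side comes from the global configuration and the
    daemon side from the exec host, whose local entries override global ones.
    """
    global_conf = parse_conf(global_text)
    host_conf = {**global_conf, **parse_conf(host_text)}
    problems = []
    for command_key, daemon_key in PAIRS:
        client = global_conf.get(command_key, "(unset)")
        daemon = host_conf.get(daemon_key, "(unset)")
        if (client == "builtin") != (daemon == "builtin"):
            problems.append((command_key, client, daemon_key, daemon))
    return problems
```

An empty list means both sides agree on builtin versus non-builtin for every pair, which is what I see here. It does not explain the select timeout by itself.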

>
> -- Reuti
>
>
>> Have only seen this once, it may be unrelated...
>>
>> On Wed, Feb 22, 2012 at 10:12 AM, Michael Coffman
>> <[email protected]> wrote:
>>> I have a system that will not accept qrsh jobs.  qstat shows the job in
>>> the r state:
>>>
>>> 174955 0.50934 QRLOGIN    coffman      r     02/22/2012 09:59:21 all.q@cs424
>>>
>>> But the login never goes through.
>>>
>>> A simple qsub works fine.
>>>
>>> From the qmaster spool/messages file:
>>>
>>> 02/22/2012 10:00:21|worker|geshadow|W|job 174955.1 failed on host
>>> cs424 assumedly after job because: job 174955.1 died through signal
>>> KILL (9)
>>>
>>> I was seeing the following in the spool/messages file on the exec host:
>>> 02/22/2012 09:39:47|  main|cs424|E|shepherd of job 174535.1 exited
>>> with exit status = 25
>>>
>>> I restarted execd using a softstop and start and now nothing is being 
>>> logged.
>>>
>>> There are 2 active jobs on the exec host, so I can't reboot it.
>>> There are some comments about the 25 status error in sge_conf.
>>>
>>> max_advance_reservations - If the max_advance_reservations limit is
>>> exceeded  by  an Advance Reservation request then the submission
>>> command exits with exit status 25 and an appropriate error message
>>>
>>> max_u_jobs - If  the  max_u_jobs  limit is exceeded by a job
>>> submission then the submission command exits with exit status 25 and
>>> an appropriate error message.
>>>
>>> max_jobs - If  the max_jobs  limit  is  exceeded  by  a job submission
>>> then the submission command exits with exit status 25 and an
>>> appropriate error message.
>>>
>>> None of these are the issue.  Any thoughts on how to debug this?
>>>
>>> I also tried running strace against the sgeexecd and did not see
>>> anything interesting.  A grep of the jid produces the following:
>>>
>>> mkdir("active_jobs/174996.1", 0755)     = 0
>>> mkdir("/tmp/174996.1.all.q", 0755)      = 0
>>> chown("/tmp/174996.1.all.q", 26927, 20) = 0
>>> open("/opt/grid-6.2u5/ftcrnd/spool/cs424/active_jobs/174996.1/pe_hostfile",
>>> O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
>>> open("/opt/grid-6.2u5/ftcrnd/spool/cs424/active_jobs/174996.1/environment",
>>> O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
>>> open("active_jobs/174996.1/config", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
>>> chdir("active_jobs/174996.1")           = 0
>>> stat("active_jobs/174996.1/addgrpid", {st_mode=S_IFREG|0644,
>>> st_size=6, ...}) = 0
>>> open("active_jobs/174996.1/addgrpid", O_RDONLY) = 4
>>> stat("active_jobs/174996.1", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>>> open("active_jobs/174996.1/pid", O_RDONLY) = 6
>>> stat("active_jobs/174996.1", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>>> open("active_jobs/174996.1/config", O_RDONLY) = 4
>>> open("active_jobs/174996.1/exit_status", O_RDONLY) = 4
>>> open("active_jobs/174996.1/error", O_RDONLY) = 4
>>> open("active_jobs/174996.1/pid", O_RDONLY) = 4
>>> open("active_jobs/174996.1/usage", O_RDONLY) = 4
>>> stat("active_jobs/174996.1/checkpointed", 0x7ffff75e7280) = -1 ENOENT
>>> (No such file or directory)
>>> stat("active_jobs/174996.1/noresources", 0x7ffff75e7280) = -1 ENOENT
>>> (No such file or directory)
>>> open("active_jobs/174996.1", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 4
>>> lstat("active_jobs/174996.1/error", {st_mode=S_IFREG|0644, st_size=0, ...}) 
>>> = 0
>>> unlink("active_jobs/174996.1/error")    = 0
>>> lstat("active_jobs/174996.1/config", {st_mode=S_IFREG|0644,
>>> st_size=2167, ...}) = 0
>>> unlink("active_jobs/174996.1/config")   = 0
>>> lstat("active_jobs/174996.1/usage", {st_mode=S_IFREG|0644,
>>> st_size=309, ...}) = 0
>>> unlink("active_jobs/174996.1/usage")    = 0
>>> lstat("active_jobs/174996.1/environment", {st_mode=S_IFREG|0644,
>>> st_size=1364, ...}) = 0
>>> unlink("active_jobs/174996.1/environment") = 0
>>> lstat("active_jobs/174996.1/pe_hostfile", {st_mode=S_IFREG|0644,
>>> st_size=66, ...}) = 0
>>> unlink("active_jobs/174996.1/pe_hostfile") = 0
>>> lstat("active_jobs/174996.1/addgrpid", {st_mode=S_IFREG|0644,
>>> st_size=6, ...}) = 0
>>> unlink("active_jobs/174996.1/addgrpid") = 0
>>> lstat("active_jobs/174996.1/trace", {st_mode=S_IFREG|0644,
>>> st_size=5979, ...}) = 0
>>> unlink("active_jobs/174996.1/trace")    = 0
>>> lstat("active_jobs/174996.1/job_pid", {st_mode=S_IFREG|0644,
>>> st_size=6, ...}) = 0
>>> unlink("active_jobs/174996.1/job_pid")  = 0
>>> lstat("active_jobs/174996.1/pid", {st_mode=S_IFREG|0644, st_size=6, ...}) = 0
>>> unlink("active_jobs/174996.1/pid")      = 0
>>> lstat("active_jobs/174996.1/shepherd_about_to_exit",
>>> {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
>>> unlink("active_jobs/174996.1/shepherd_about_to_exit") = 0
>>> lstat("active_jobs/174996.1/exit_status", {st_mode=S_IFREG|0644,
>>> st_size=2, ...}) = 0
>>> unlink("active_jobs/174996.1/exit_status") = 0
>>> rmdir("active_jobs/174996.1")           = 0
>>> open("/tmp/174996.1.all.q", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 4
>>> rmdir("/tmp/174996.1.all.q")            = 0
>>>
>>> Thanks.
>>> --
>>> -MichaelC
>>
>>
>>
>> --
>> -MichaelC
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>>
>



-- 
-MichaelC

