On Wed, Feb 22, 2012 at 10:33 AM, Reuti <[email protected]> wrote:
> Hi,
>
> On 22.02.2012 at 18:21, Michael Coffman wrote:
>
>> Just got the following error returned to my qrsh command line:
>>
>> error: error while waiting for builtin IJS connection: "got select timeout"
>
> Was it working before? Do the startup methods in the global configuration and
> the node configuration match for rlogin_command/rlogin_daemon?
Yes. It was working before. Node and global are both set to builtin.

>
> -- Reuti
>
>
>> Have only seen this once, it may be unrelated...
>>
>> On Wed, Feb 22, 2012 at 10:12 AM, Michael Coffman
>> <[email protected]> wrote:
>>> I have a system that will not accept qrsh jobs. qstat shows the job in
>>> the r state:
>>>
>>> 174955 0.50934 QRLOGIN coffman r 02/22/2012 09:59:21 all.q@cs424
>>>
>>> But the login never goes through.
>>>
>>> A simple qsub will work fine.
>>>
>>> From the qmaster spool/messages file:
>>>
>>> 02/22/2012 10:00:21|worker|geshadow|W|job 174955.1 failed on host cs424 assumedly after job because: job 174955.1 died through signal KILL (9)
>>>
>>> I was seeing the following in the spool/messages file on the exec host:
>>> 02/22/2012 09:39:47| main|cs424|E|shepherd of job 174535.1 exited with exit status = 25
>>>
>>> I restarted execd using a softstop and start and now nothing is being
>>> logged.
>>>
>>> There are 2 active jobs on the exec host, so I can't reboot it.
>>> There are some comments about the 25 status error in sge_conf.
>>>
>>> max_advance_reservations - If the max_advance_reservations limit is
>>> exceeded by an Advance Reservation request then the submission
>>> command exits with exit status 25 and an appropriate error message
>>>
>>> max_u_jobs - If the max_u_jobs limit is exceeded by a job
>>> submission then the submission command exits with exit status 25 and
>>> an appropriate error message.
>>>
>>> max_jobs - If the max_jobs limit is exceeded by a job submission
>>> then the submission command exits with exit status 25 and an
>>> appropriate error message.
>>>
>>> None of these are the issue. Any thoughts on how to debug this?
>>>
>>> I also tried running strace against the sgeexecd and did not see
>>> anything interesting.
>>> A grep of the jid produces the following:
>>>
>>> mkdir("active_jobs/174996.1", 0755) = 0
>>> mkdir("/tmp/174996.1.all.q", 0755) = 0
>>> chown("/tmp/174996.1.all.q", 26927, 20) = 0
>>> open("/opt/grid-6.2u5/ftcrnd/spool/cs424/active_jobs/174996.1/pe_hostfile", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
>>> open("/opt/grid-6.2u5/ftcrnd/spool/cs424/active_jobs/174996.1/environment", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
>>> open("active_jobs/174996.1/config", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
>>> chdir("active_jobs/174996.1") = 0
>>> stat("active_jobs/174996.1/addgrpid", {st_mode=S_IFREG|0644, st_size=6, ...}) = 0
>>> open("active_jobs/174996.1/addgrpid", O_RDONLY) = 4
>>> stat("active_jobs/174996.1", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>>> open("active_jobs/174996.1/pid", O_RDONLY) = 6
>>> stat("active_jobs/174996.1", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
>>> open("active_jobs/174996.1/config", O_RDONLY) = 4
>>> open("active_jobs/174996.1/exit_status", O_RDONLY) = 4
>>> open("active_jobs/174996.1/error", O_RDONLY) = 4
>>> open("active_jobs/174996.1/pid", O_RDONLY) = 4
>>> open("active_jobs/174996.1/usage", O_RDONLY) = 4
>>> stat("active_jobs/174996.1/checkpointed", 0x7ffff75e7280) = -1 ENOENT (No such file or directory)
>>> stat("active_jobs/174996.1/noresources", 0x7ffff75e7280) = -1 ENOENT (No such file or directory)
>>> open("active_jobs/174996.1", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 4
>>> lstat("active_jobs/174996.1/error", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
>>> unlink("active_jobs/174996.1/error") = 0
>>> lstat("active_jobs/174996.1/config", {st_mode=S_IFREG|0644, st_size=2167, ...}) = 0
>>> unlink("active_jobs/174996.1/config") = 0
>>> lstat("active_jobs/174996.1/usage", {st_mode=S_IFREG|0644, st_size=309, ...}) = 0
>>> unlink("active_jobs/174996.1/usage") = 0
>>> lstat("active_jobs/174996.1/environment", {st_mode=S_IFREG|0644, st_size=1364, ...}) = 0
>>> unlink("active_jobs/174996.1/environment") = 0
>>> lstat("active_jobs/174996.1/pe_hostfile", {st_mode=S_IFREG|0644, st_size=66, ...}) = 0
>>> unlink("active_jobs/174996.1/pe_hostfile") = 0
>>> lstat("active_jobs/174996.1/addgrpid", {st_mode=S_IFREG|0644, st_size=6, ...}) = 0
>>> unlink("active_jobs/174996.1/addgrpid") = 0
>>> lstat("active_jobs/174996.1/trace", {st_mode=S_IFREG|0644, st_size=5979, ...}) = 0
>>> unlink("active_jobs/174996.1/trace") = 0
>>> lstat("active_jobs/174996.1/job_pid", {st_mode=S_IFREG|0644, st_size=6, ...}) = 0
>>> unlink("active_jobs/174996.1/job_pid") = 0
>>> lstat("active_jobs/174996.1/pid", {st_mode=S_IFREG|0644, st_size=6, ...}) = 0
>>> unlink("active_jobs/174996.1/pid") = 0
>>> lstat("active_jobs/174996.1/shepherd_about_to_exit", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
>>> unlink("active_jobs/174996.1/shepherd_about_to_exit") = 0
>>> lstat("active_jobs/174996.1/exit_status", {st_mode=S_IFREG|0644, st_size=2, ...}) = 0
>>> unlink("active_jobs/174996.1/exit_status") = 0
>>> rmdir("active_jobs/174996.1") = 0
>>> open("/tmp/174996.1.all.q", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 4
>>> rmdir("/tmp/174996.1.all.q") = 0
>>>
>>> Thanks.
>>> --
>>> -MichaelC
>>
>>
>>
>> --
>> -MichaelC
>>
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
>>
>

--
-MichaelC
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
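
For anyone repeating the check Reuti asks about (whether the global and node configurations agree on rlogin_command/rlogin_daemon), here is a minimal Python sketch. It assumes you have saved the output of `qconf -sconf` and `qconf -sconf cs424` as text; the sample values below, and the helper names parse_conf and mismatches, are made up for illustration and are not taken from the cluster in this thread. It also assumes a host-local entry overrides the global one and that a key absent from the host config falls back to the global value.

```python
# Sketch only: compare interactive-job startup settings between two
# Grid Engine configuration dumps ("name   value" per line).
# The sample text is hypothetical, not output from the cluster above.

KEYS = ("qlogin_command", "qlogin_daemon",
        "rlogin_command", "rlogin_daemon",
        "rsh_command", "rsh_daemon")

def parse_conf(text):
    """Turn 'name value' lines into a dict; skip blank and '#' lines."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf

def mismatches(global_text, host_text, keys=KEYS):
    """Return {key: (global_value, host_value)} for keys that differ.

    A key missing from the host config is treated as inheriting the
    global value, so it is never reported as a mismatch.
    """
    g = parse_conf(global_text)
    h = parse_conf(host_text)
    diff = {}
    for key in keys:
        global_value = g.get(key)
        host_value = h.get(key, global_value)
        if global_value != host_value:
            diff[key] = (global_value, host_value)
    return diff

# Hypothetical sample dumps.
GLOBAL_SAMPLE = """#global:
qlogin_command   builtin
qlogin_daemon    builtin
rlogin_command   builtin
rlogin_daemon    builtin
"""

HOST_SAMPLE = """#cs424:
rlogin_daemon    /usr/sbin/in.rlogind
"""

if __name__ == "__main__":
    for key, (gv, hv) in mismatches(GLOBAL_SAMPLE, HOST_SAMPLE).items():
        print(f"{key}: global={gv!r} host={hv!r}")
```

An empty result only says the two dumps agree on these six keys; it does not rule out other causes of the select timeout.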
