Hello, I'm running SGE 6.2u5 on an Xserve cluster running Mac OS X Server v10.6 Snow Leopard with Open Directory network accounts. All users belong to the same default group staff.
I have a weird problem with some users who are unable to run qlogin or submit a job. On the qmaster, I see

09/28/2011 15:05:08|worker|xserve01|W|job 295287.1 failed on host xserve12.cpfs.mpg.de general before job because: 09/28/2011 15:05:08 [1319:18564]: can't open file job_pid: Permission denied
09/28/2011 15:05:08|worker|xserve01|E|queue late06.q marked QERROR as result of job 295287's failure at host xserve12.cpfs.mpg.de

and in the corresponding error message sent by mail, I see entries like those attached at the end of this message.

This happens to some users only. The only common property of the failing accounts I can see at the moment is that these have been created after the upgrade of the OD master from v10.5 Leopard to v10.6 Snow Leopard.

I'd be thankful for any hints where to search for the origin of this problem.

Best regards,
Burkhard.

Job 295287 caused action: Queue "[email protected]" set to ERROR
User = bschmidt4
Queue = [email protected]
Start Time = <unknown>
End Time = <unknown>
failed before job:09/28/2011 15:05:08 [1319:18564]: can't open file job_pid: Permission denied
Shepherd trace:
09/28/2011 15:05:06 [501:18562]: shepherd called with uid = 0, euid = 501
09/28/2011 15:05:06 [501:18562]: qlogin_daemon = builtin
09/28/2011 15:05:06 [501:18562]: starting up 6.2u5
09/28/2011 15:05:06 [501:18562]: setpgid(18562, 18562) returned 0
09/28/2011 15:05:06 [501:18562]: no prolog script to start
09/28/2011 15:05:06 [501:18562]: pipe to child uses fds 4 and 5
09/28/2011 15:05:06 [501:18562]: calling fork_pty()
09/28/2011 15:05:06 [501:18562]: parent: forked "job" with pid 18564
09/28/2011 15:05:06 [501:18562]: parent: job-pid: 18564
09/28/2011 15:05:06 [501:18562]: parent: closing childs end of the pipe
09/28/2011 15:05:06 [501:18562]: csp = 0
09/28/2011 15:05:06 [501:18562]: parent: starting parent loop with remote_host =xserve01.cpfs.mpg.de, remote_port = 62902, job_owner = bschmidt4, fd_pty_master = 6, fd_pipe_in = -1, fd_pipe_out = -1, fd_pipe_err = -1, fd_pipe_to_child = 5
09/28/2011 15:05:06 [501:18562]: parent: opening connection to qrsh/qlogin client
09/28/2011 15:05:06 [501:18564]: child: closing parents end of the pipe
09/28/2011 15:05:06 [501:18564]: child: trying to read from parent through the pipe
09/28/2011 15:05:06 [501:18562]: parent: sending REGISTER_CTRL_MSG to qrsh/qlogin client
09/28/2011 15:05:06 [501:18562]: parent: creating pty_to_commlib thread
09/28/2011 15:05:06 [501:18562]: parent: creating commlib_to_pty thread
09/28/2011 15:05:06 [501:18562]: parent: created both worker threads, now waiting for jobs end
09/28/2011 15:05:06 [501:18562]: commlib_to_pty: received window size message, changing window size
09/28/2011 15:05:06 [501:18562]: commlib_to_pty: received settings message
09/28/2011 15:05:06 [501:18562]: commlib_to_pty: writing to child 11 bytes: noshell = 0
09/28/2011 15:05:06 [501:18564]: child: parent sent us 'noshell = 0'
09/28/2011 15:05:06 [501:18564]: child: starting son(job, QLOGIN, 0);
09/28/2011 15:05:06 [501:18564]: processing qlogin job
09/28/2011 15:05:06 [501:18564]: pid=18564 pgrp=18564 sid=18564 old pgrp=18564 getlogin()=_atsserver
09/28/2011 15:05:06 [501:18564]: reading passwd information for user 'bschmidt4'
09/28/2011 15:05:06 [501:18564]: setosjobid: uid = 0, euid = 501
09/28/2011 15:05:06 [501:18564]: setting limits
09/28/2011 15:05:06 [501:18564]: RLIMIT_CPU setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
09/28/2011 15:05:06 [501:18564]: RLIMIT_FSIZE setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
09/28/2011 15:05:06 [501:18564]: RLIMIT_DATA setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
09/28/2011 15:05:06 [501:18564]: RLIMIT_STACK setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 67104768 hard 67104768)
09/28/2011 15:05:06 [501:18564]: RLIMIT_CORE setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
09/28/2011 15:05:06 [501:18564]: RLIMIT_RSS setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
09/28/2011 15:05:06 [501:18564]: RLIMIT_RSS setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
09/28/2011 15:05:06 [501:18564]: setting environment
09/28/2011 15:05:06 [501:18564]: Initializing error file
09/28/2011 15:05:06 [501:18564]: switching to intermediate/target user
09/28/2011 15:05:06 [1319:18564]: closing all filedescriptors
09/28/2011 15:05:06 [1319:18564]: further messages are in "error" and "trace"
09/28/2011 15:05:08 [1319:18564]: now running with uid=1319, euid=1319
09/28/2011 15:05:08 [1319:18564]: execle(, -(null), NULL, env)
09/28/2011 15:05:08 [1319:18564]: parent: forked "job" with pid 0
09/28/2011 15:05:08 [1319:18564]: can't open file job_pid: Permission denied
09/28/2011 15:05:08 [501:18562]: pty_to_commlib: our child seems to have exited -> exiting
09/28/2011 15:05:08 [501:18562]: wait3 returned 18564 (status: 2816; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 11)
09/28/2011 15:05:08 [501:18562]: job exited with exit status 11
09/28/2011 15:05:08 [501:18562]: parent: wait_my_child returned exit_status = 2816
09/28/2011 15:05:08 [501:18562]: parent: rusage.ru_stime.tv_sec = 0
09/28/2011 15:05:08 [501:18562]: parent: rusage.ru_stime.tv_usec = 2910
09/28/2011 15:05:08 [501:18562]: parent: rusage.ru_utime.tv_sec = 0
09/28/2011 15:05:08 [501:18562]: parent: rusage.ru_utime.tv_usec = 1344
09/28/2011 15:05:08 [501:18562]: parent: received event 1000, g_raised_event = 2
09/28/2011 15:05:08 [501:18562]: parent: shutting down pty_to_commlib thread
09/28/2011 15:05:08 [501:18562]: parent: shutting down commlib_to_pty thread
09/28/2011 15:05:08 [501:18562]: parent: thread_cleanup_lib()
09/28/2011 15:05:08 [501:18562]: parent: leaving main loop. From here on, only the main thread is running.
09/28/2011 15:05:08 [501:18562]: reaped "job" with pid 18564
09/28/2011 15:05:08 [501:18562]: job exited not due to signal
09/28/2011 15:05:08 [501:18562]: job exited with status 11
09/28/2011 15:05:08 [501:18562]: now sending signal KILL to pid -18564
09/28/2011 15:05:08 [501:18562]: no tasker to notify
09/28/2011 15:05:08 [501:18562]: failed starting job
09/28/2011 15:05:08 [501:18562]: no epilog script to start
09/28/2011 15:05:08 [501:18562]: writing exit status to qrsh: 0
09/28/2011 15:05:08 [501:18562]: sending UNREGISTER_CTRL_MSG with exit_status = "0"
09/28/2011 15:05:08 [501:18562]: sending to host: xserve01.cpfs.mpg.de
09/28/2011 15:05:08 [501:18562]: waiting for UNREGISTER_RESPONSE_CTRL_MSG
09/28/2011 15:05:08 [501:18562]: Received UNREGISTER_RESPONSE_CTRL_MSG
09/28/2011 15:05:08 [501:18562]: parent: cl_com_ignore_timeouts
09/28/2011 15:05:08 [501:18562]: parent: leaving closinge_parent_loop()
Shepherd error:
09/28/2011 15:05:08 [1319:18564]: can't open file job_pid: Permission denied
Shepherd pe_hostfile:
xserve12.cpfs.mpg.de 1 [email protected] UNDEFINED
