Hi,
On 28.09.2011 at 15:41, Schmidt, Burkhard wrote:
I'm running SGE 6.2u5 on an Xserve cluster running Mac OS X Server
v10.6 Snow Leopard with Open Directory network accounts. All users
belong to the same default group staff.
Is the complete cluster running OS X, or only the master node, or only
the slaves?
There were issues in the past caused by an account having too many
additional groups, but I'm not sure whether that applies here, as
the error message was different.
http://gridengine.org/pipermail/users/2011-March/000447.html
Nevertheless: can you check the group count of the users in question?
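One way to check the group count is sketched below in Python. The helper names `group_ids` and `too_many` are made up for illustration, and the default limit of 16 is an assumption (it is the NGROUPS_MAX value commonly found on Mac OS X; `getconf NGROUPS_MAX` on the exec host should show the real one). The count reflects what getgrouplist() reports on the host where the script runs, so it is best run on the failing exec host.

```python
import os
import pwd
import sys


def group_ids(user):
    """Return all group IDs (primary and supplementary) the OS reports for user."""
    entry = pwd.getpwnam(user)  # raises KeyError if the account is unknown
    return os.getgrouplist(user, entry.pw_gid)


def too_many(gids, limit=16):
    """True if the number of groups exceeds limit (16 is an assumed default)."""
    return len(gids) > limit


if __name__ == "__main__":
    # Usage: pass one or more account names on the command line.
    for user in sys.argv[1:]:
        gids = group_ids(user)
        flag = "  <-- over limit" if too_many(gids) else ""
        print(f"{user}: {len(gids)} groups{flag}")
```

Comparing the output for a working account and a failing one (for example bschmidt4) should show quickly whether the group count differs between the two.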
-- Reuti
I have a weird problem with some users who are unable to run qlogin
or submit a job. On the qmaster, I see
09/28/2011 15:05:08|worker|xserve01|W|job 295287.1 failed on host
xserve12.cpfs.mpg.de general before job because: 09/28/2011 15:05:08
[1319:18564]: can't open file job_pid: Permission denied
09/28/2011 15:05:08|worker|xserve01|E|queue late06.q marked QERROR
as result of job 295287's failure at host xserve12.cpfs.mpg.de
and in the corresponding error message sent by mail, I see entries
like those attached at the end of this message.
This happens to some users only. The only common property of the
failing accounts I can see at the moment is that these have been
created after the upgrade of the OD master from v10.5 Leopard to
v10.6 Snow Leopard.
I'd be thankful for any hints where to search for the origin of this
problem.
Best regards, Burkhard.
Job 295287 caused action: Queue "[email protected]" set
to ERROR
User = bschmidt4
Queue = [email protected]
Start Time = <unknown>
End Time = <unknown>
failed before job:09/28/2011 15:05:08 [1319:18564]: can't open file
job_pid: Permission denied
Shepherd trace:
09/28/2011 15:05:06 [501:18562]: shepherd called with uid = 0, euid
= 501
09/28/2011 15:05:06 [501:18562]: qlogin_daemon = builtin
09/28/2011 15:05:06 [501:18562]: starting up 6.2u5
09/28/2011 15:05:06 [501:18562]: setpgid(18562, 18562) returned 0
09/28/2011 15:05:06 [501:18562]: no prolog script to start
09/28/2011 15:05:06 [501:18562]: pipe to child uses fds 4 and 5
09/28/2011 15:05:06 [501:18562]: calling fork_pty()
09/28/2011 15:05:06 [501:18562]: parent: forked "job" with pid 18564
09/28/2011 15:05:06 [501:18562]: parent: job-pid: 18564
09/28/2011 15:05:06 [501:18562]: parent: closing childs end of the
pipe
09/28/2011 15:05:06 [501:18562]: csp = 0
09/28/2011 15:05:06 [501:18562]: parent: starting parent loop with
remote_host =xserve01.cpfs.mpg.de, remote_port = 62902, job_owner =
bschmidt4, fd_pty_master = 6, fd_pipe_in = -1, fd_pipe_out = -1,
fd_pipe_err = -1, fd_pipe_to_child = 5
09/28/2011 15:05:06 [501:18562]: parent: opening connection to
qrsh/qlogin client
09/28/2011 15:05:06 [501:18564]: child: closing parents end of the
pipe
09/28/2011 15:05:06 [501:18564]: child: trying to read from parent
through the pipe
09/28/2011 15:05:06 [501:18562]: parent: sending REGISTER_CTRL_MSG
to qrsh/qlogin client
09/28/2011 15:05:06 [501:18562]: parent: creating pty_to_commlib
thread
09/28/2011 15:05:06 [501:18562]: parent: creating commlib_to_pty
thread
09/28/2011 15:05:06 [501:18562]: parent: created both worker
threads, now waiting for jobs end
09/28/2011 15:05:06 [501:18562]: commlib_to_pty: received window
size message, changing window size
09/28/2011 15:05:06 [501:18562]: commlib_to_pty: received settings
message
09/28/2011 15:05:06 [501:18562]: commlib_to_pty: writing to child 11
bytes: noshell = 0
09/28/2011 15:05:06 [501:18564]: child: parent sent us 'noshell = 0'
09/28/2011 15:05:06 [501:18564]: child: starting son(job, QLOGIN, 0);
09/28/2011 15:05:06 [501:18564]: processing qlogin job
09/28/2011 15:05:06 [501:18564]: pid=18564 pgrp=18564 sid=18564 old
pgrp=18564 getlogin()=_atsserver
09/28/2011 15:05:06 [501:18564]: reading passwd information for user
'bschmidt4'
09/28/2011 15:05:06 [501:18564]: setosjobid: uid = 0, euid = 501
09/28/2011 15:05:06 [501:18564]: setting limits
09/28/2011 15:05:06 [501:18564]: RLIMIT_CPU setting: (soft
0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard
0INFINITY)
09/28/2011 15:05:06 [501:18564]: RLIMIT_FSIZE setting: (soft
0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard
0INFINITY)
09/28/2011 15:05:06 [501:18564]: RLIMIT_DATA setting: (soft
0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard
0INFINITY)
09/28/2011 15:05:06 [501:18564]: RLIMIT_STACK setting: (soft
0INFINITY hard 0INFINITY) resulting: (soft 67104768 hard 67104768)
09/28/2011 15:05:06 [501:18564]: RLIMIT_CORE setting: (soft
0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard
0INFINITY)
09/28/2011 15:05:06 [501:18564]: RLIMIT_RSS setting: (soft
0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard
0INFINITY)
09/28/2011 15:05:06 [501:18564]: RLIMIT_RSS setting: (soft
0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard
0INFINITY)
09/28/2011 15:05:06 [501:18564]: setting environment
09/28/2011 15:05:06 [501:18564]: Initializing error file
09/28/2011 15:05:06 [501:18564]: switching to intermediate/target user
09/28/2011 15:05:06 [1319:18564]: closing all filedescriptors
09/28/2011 15:05:06 [1319:18564]: further messages are in "error"
and "trace"
09/28/2011 15:05:08 [1319:18564]: now running with uid=1319, euid=1319
09/28/2011 15:05:08 [1319:18564]: execle(, -(null), NULL, env)
09/28/2011 15:05:08 [1319:18564]: parent: forked "job" with pid 0
09/28/2011 15:05:08 [1319:18564]: can't open file job_pid:
Permission denied
09/28/2011 15:05:08 [501:18562]: pty_to_commlib: our child seems to
have exited -> exiting
09/28/2011 15:05:08 [501:18562]: wait3 returned 18564 (status: 2816;
WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 11)
09/28/2011 15:05:08 [501:18562]: job exited with exit status 11
09/28/2011 15:05:08 [501:18562]: parent: wait_my_child returned
exit_status = 2816
09/28/2011 15:05:08 [501:18562]: parent:
rusage.ru_stime.tv_sec = 0
09/28/2011 15:05:08 [501:18562]: parent:
rusage.ru_stime.tv_usec = 2910
09/28/2011 15:05:08 [501:18562]: parent:
rusage.ru_utime.tv_sec = 0
09/28/2011 15:05:08 [501:18562]: parent:
rusage.ru_utime.tv_usec = 1344
09/28/2011 15:05:08 [501:18562]: parent: received event 1000,
g_raised_event = 2
09/28/2011 15:05:08 [501:18562]: parent: shutting down
pty_to_commlib thread
09/28/2011 15:05:08 [501:18562]: parent: shutting down
commlib_to_pty thread
09/28/2011 15:05:08 [501:18562]: parent: thread_cleanup_lib()
09/28/2011 15:05:08 [501:18562]: parent: leaving main loop. From
here on, only the main thread is running.
09/28/2011 15:05:08 [501:18562]: reaped "job" with pid 18564
09/28/2011 15:05:08 [501:18562]: job exited not due to signal
09/28/2011 15:05:08 [501:18562]: job exited with status 11
09/28/2011 15:05:08 [501:18562]: now sending signal KILL to pid -18564
09/28/2011 15:05:08 [501:18562]: no tasker to notify
09/28/2011 15:05:08 [501:18562]: failed starting job
09/28/2011 15:05:08 [501:18562]: no epilog script to start
09/28/2011 15:05:08 [501:18562]: writing exit status to qrsh: 0
09/28/2011 15:05:08 [501:18562]: sending UNREGISTER_CTRL_MSG with
exit_status = "0"
09/28/2011 15:05:08 [501:18562]: sending to host: xserve01.cpfs.mpg.de
09/28/2011 15:05:08 [501:18562]: waiting for
UNREGISTER_RESPONSE_CTRL_MSG
09/28/2011 15:05:08 [501:18562]: Received UNREGISTER_RESPONSE_CTRL_MSG
09/28/2011 15:05:08 [501:18562]: parent: cl_com_ignore_timeouts
09/28/2011 15:05:08 [501:18562]: parent: leaving
closinge_parent_loop()
Shepherd error:
09/28/2011 15:05:08 [1319:18564]: can't open file job_pid:
Permission denied
Shepherd pe_hostfile:
xserve12.cpfs.mpg.de 1 [email protected] UNDEFINED
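
The trace above reports both "status: 2816" from wait3 and "exit status 11". These are the same value in two encodings. A minimal Python sketch of the decoding follows, assuming the traditional Unix wait-status layout (exit code in the second byte, terminating signal in the low 7 bits); the function name `decode_wait_status` is hypothetical, and stopped or core-dump states are deliberately ignored.

```python
def decode_wait_status(status):
    """Split a classic 16-bit wait status into (exited_normally, exit_code, signal)."""
    signal = status & 0x7F            # low 7 bits: terminating signal, 0 if none
    exit_code = (status >> 8) & 0xFF  # second byte: exit code passed to exit()
    exited_normally = signal == 0
    return exited_normally, exit_code, signal


# 2816 == 11 * 256, so the child exited normally with code 11.
print(decode_wait_status(2816))  # (True, 11, 0)
```

This matches the WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 11 fields in the trace, so the child exited on its own with code 11 rather than being killed by a signal.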
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users