Hi Barry,

SGE calls sysconf(_SC_NGROUPS_MAX) to find the maximum number of
supplementary group IDs a user can have, and uses that value to size
a buffer that it later passes to setgroups(). SGE needs room for one
additional GID, which it uses as the OS-level "job id" so that it can
tell which process belongs to which job.

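A minimal sketch of that bookkeeping, for illustration only. The helper
name append_job_gid() and its parameters are hypothetical and this is not
SGE's actual code; it just shows why an existing group list that already
fills the limit leaves no room for the job's extra GID:

```c
#include <sys/types.h>

/*
 * Hypothetical helper, not taken from SGE: copy the ncur existing
 * supplementary GIDs into out[] and append one extra GID.
 * max_groups is the value reported by sysconf(_SC_NGROUPS_MAX).
 * out[] must have room for ncur + 1 entries.
 * Returns the new group count, or -1 if the list would exceed the limit.
 */
int append_job_gid(const gid_t *cur, int ncur, gid_t job_gid,
                   long max_groups, gid_t *out)
{
    int i;

    /* No room left: the user already has too many group IDs. */
    if (ncur < 0 || (long)ncur + 1 > max_groups)
        return -1;

    for (i = 0; i < ncur; i++)
        out[i] = cur[i];
    out[ncur] = job_gid;    /* the extra GID that tags the job */

    return ncur + 1;        /* count that would be passed to setgroups() */
}
```

When the count of existing groups already equals (or exceeds) the limit,
the helper returns -1, which is the situation the "too many group ids"
message in the quoted log describes.
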
However, on OSX, there seems to be a bug or "interesting platform
behavior": http://bugs.python.org/issue7900

To see whether you are hitting this bug, can you run this C program as
the user(s) who can't submit jobs?

#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>

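int main(void)
{
   /* number of supplementary group IDs the calling process has */
   printf("%d\n", getgroups(0, NULL));
   /* system limit on supplementary groups; sysconf() returns a long */
   printf("%ld\n", sysconf(_SC_NGROUPS_MAX));
   return 0;
}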

It would also be great if you could include the output of "id -G",
run as the same user(s).

If you really are hitting this issue, I can put a workaround into
Open Grid Scheduler and make the fix available to the other Grid Engine
forks, such as "Son of Grid Engine" and "Univa Grid Engine" -- we
currently have 3 forks!! :-D

Rayson

P.S. Nice to see you on this list again!


On Fri, Mar 18, 2011 at 11:31 AM, Barry McInnes
<[email protected]> wrote:
> Hi,
> When running qmaster on 10.5 we get submit errors for users who are in
> too many groups, so the job fails. Some users in fewer groups (6-8) can
> run jobs, e.g. the first user below cannot submit but the second user can:
> [mac27:~/SGE] bmcinnes% id bmcinnes
> uid=2101(bmcinnes) gid=200(climate)
> groups=200(climate),1953027852(PSD\sysadmins),829578209(PSD\domain
> admins),801476512(PSD\log1),204(_developer),100(_lpoperator),98(_lpadmin),81(_appserveradm),80(admin),79(_appserverusr),62(netaccounts),12(everyone),1207(rain),1100(systems),998(lmadmin),900(sawrtrs),400(cuac),2109053379(PSD\domain
> users),1858905114(PSD\denied rodc password replication
> group),1358185131(PSD\it_wikis),404(com.apple.sharepoint.group.3),928177777(PSD\coopcall),401(com.apple.access_screensharing),403(com.apple.sharepoint.group.2),402(com.apple.sharepoint.group.1)
> [mac27:~/SGE] bmcinnes%
> [mac27:~/SGE] bmcinnes%
> [mac27:~/SGE] bmcinnes% id ppegion
> uid=3009(ppegion) gid=200(climate)
> groups=200(climate),62(netaccounts),12(everyone),594189391(PSD\climate),247203070(PSD\psd1group),2109053379(PSD\domain
> users),404(com.apple.sharepoint.group.3),928177777(PSD\coopcall),403(com.apple.sharepoint.group.2),402(com.apple.sharepoint.group.1)
> [mac27:~/SGE] bmcinnes%
>
> Mac OS adds its own group memberships to users, on top of our group
> settings.
>
> When we go to Mac 10.6 Intel, the qmaster server fails to put any nodes
> in service because of the same error, so users have no chance to even
> submit jobs.
>
> 03/16/2011 13:41:49|worker|g5s2|W|rescheduling job 15015.1
> 03/16/2011 13:41:49|worker|g5s2|E|queue quad marked QERROR as result of
> job 15015's failure at host mac40.psd.esrl.noaa.gov
> 03/16/2011 14:02:49|worker|g5s2|W|job 15015.1 failed on host
> mac65.psd.esrl.noaa.gov general before job because: 03/16/2011 14:02:49
> [0:22624]: can't set additional group id (uid=0, euid=0): the user
> already has too many group ids
> 03/16/2011 14:02:49|worker|g5s2|W|rescheduling job 15015.1
> 03/16/2011 14:02:49|worker|g5s2|E|queue quad marked QERROR as result of
> job 15015's failure at host mac65.psd.esrl.noaa.gov
> 03/16/2011 14:08:19|worker|g5s2|W|job 15015.1 failed on host
> mac18.psd.esrl.noaa.gov general before job because: 03/16/2011 14:08:19
> [0:42391]: can't set additional group id (uid=0, euid=0): the user
> already has too many group ids
>
> We are using Active Directory authentication, and the Mac clients are
> all 10.6.6.
> We tried OGE 62u7 with the same group id error.
>
> We are currently back on the 10.5 PPC qmaster server to get jobs
> submitted and run.
>
> Any help appreciated.
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
>