[slurm-dev] Re: sreport TRES permissions issue
Thanks for the info, Michael!

On 24/11/15 14:03, Michael Gutteridge wrote:
> This was a bug fixed in 15.08.2:
>
>     -- MYSQL - Remove restriction to have to be at least an operator to query TRES
>
> https://groups.google.com/forum/?fromgroups#!topic/slurm-devel/XiL7GA8CYj8
>
> I am still running 15.08.1 but have a patch that seems to fix it if you're interested.
>
> M
>
> On Tue, Nov 24, 2015 at 6:04 AM, Lucas Gabriel Vuotto wrote:
>> Hello,
>>
>> we have a small HPC cluster managed by slurm. We're running version 15.08.1 on SL 6.5. We implemented some per-user cpu and gpu monthly quotas and we want users to be able to check their consumed quota. sreport would fill this task perfectly *except* that it returns:
>>
>>     salvador@odin ~ $ sreport user top
>>     sreport: error: Access/permission denied
>>     sreport: fatal: Problem getting TRES data: Access/permission denied
>>
>> when run by a user with AdminLevel set to none. Even just running `sreport` gives the same error message. Both the slurm.conf and slurmdbd.conf man pages say, in the PrivateData description, that all users have, by default, access to all the information, and neither one says anything about TRES data being private. To be clear, we left `PrivateData` unset in both config files.
>>
>> The slurmdbd log doesn't show any significant data (in our opinion) even when setting DebugLevel to debug4:
>>
>>     slurmdbd: debug2: Opened connection 8 from 127.0.0.1
>>     slurmdbd: debug:  DBD_INIT: CLUSTER:odin VERSION:7424 UID:2007 IP:127.0.0.1 CONN:8
>>     slurmdbd: debug2: acct_storage_p_get_connection: request new connection 1
>>     slurmdbd: debug2: DBD_GET_TRES: called
>>     slurmdbd: error: Processing last message from connection 8(127.0.0.1) uid(2007)
>>     slurmdbd: debug4: got 0 commits
>>     slurmdbd: debug2: Closed connection 8 uid(2007)
>>
>> The "issue" isn't present when `sreport` is run by a user with AdminLevel set to operator or admin.
>>
>> Has anyone had this problem? Is there any way to fix it? Or should we stick to running a cron job every 5 minutes to gather the data with a privileged enough user and then build a mechanism so unprivileged users can access this data?
>>
>> If it's significant, we have both slurmctld and slurmdbd on the same machine.
>>
>> Cheers,
>>
>> -- lv.

-- lv.
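The cron-job fallback mentioned above could look roughly like the sketch below. This is a hypothetical outline, not a tested setup: the output path, the exact sreport arguments, and the script name are placeholders, and the report command is overridable so the script can be exercised without a running slurmdbd.

```shell
#!/bin/sh
# Rough sketch of the cron workaround: a privileged account runs this
# every 5 minutes and drops the report where unprivileged users can read it.
SREPORT=${SREPORT:-sreport}              # report command (overridable for testing)
OUT=${OUT:-/var/local/slurm_usage.txt}   # hypothetical world-readable drop file

dump_report() {
    tmp=$(mktemp) || return 1
    # Gather the report as the privileged user; clean up quietly on failure.
    if "$SREPORT" -n user top > "$tmp" 2>/dev/null; then
        chmod 644 "$tmp" && mv "$tmp" "$OUT"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Example crontab entry for the privileged user (path is a placeholder):
#   */5 * * * * /usr/local/sbin/dump_sreport.sh
```

The temp-file-then-mv dance keeps readers from ever seeing a half-written report.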
[slurm-dev] Re: Cannot exclude hosts with --exclude
There seems to be a wrong character in the double dashes "--".

On Tue, 24 Nov 2015, 22:04 Zentz, Scott C. wrote:
> Hello Everyone!
>
> I have a user who is trying to exclude some hosts from their job
> submission and was using –exclude to accomplish this. He claims that he was
> able to do this in the past and the man pages state that –exclude is an
> option but when we include that argument, we receive the following error:
>
>     zentz@diamond(~)% sbatch testsumit.sh
>     sbatch: error: Invalid argument: —-exclude=bc1node1
>     zentz@diamond(~)%
>
> Here is a copy of the "testsubmit.sh":
>
>     #!/bin/bash
>
>     #SBATCH —-exclude=bc1node1
>     srun echo "test"
>
> Is the exclude option only for specific cluster types or is there
> something else going awry?
>
> Thanks!
> -scz
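One quick way to confirm the stray character, as a sketch assuming GNU grep, is to scan the batch script for any byte outside printable ASCII; a dash pasted from a word processor shows up immediately:

```shell
# Recreate the submit script as it appears in the post (the dash before
# "exclude" here is a Unicode em dash, not two ASCII hyphens).
cat > testsubmit.sh <<'EOF'
#!/bin/bash
#SBATCH —-exclude=bc1node1
srun echo "test"
EOF

# Printable ASCII runs from space (0x20) to tilde (0x7e); anything outside
# that range in a batch script is suspect.  Prints the offending line number.
LC_ALL=C grep -n '[^ -~]' testsubmit.sh
```

Retyping the option by hand as two plain hyphens (`--exclude`) makes sbatch accept it.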
[slurm-dev] Cannot exclude hosts with --exclude
Hello Everyone!

I have a user who is trying to exclude some hosts from their job submission and was using -exclude to accomplish this. He claims that he was able to do this in the past and the man pages state that -exclude is an option but when we include that argument, we receive the following error:

    zentz@diamond(~)% sbatch testsumit.sh
    sbatch: error: Invalid argument: --exclude=bc1node1
    zentz@diamond(~)%

Here is a copy of the "testsubmit.sh":

    #!/bin/bash

    #SBATCH --exclude=bc1node1
    srun echo "test"

Is the exclude option only for specific cluster types or is there something else going awry?

Thanks!

-scz
[slurm-dev] Re: sreport TRES permissions issue
This was a bug fixed in 15.08.2:

    -- MYSQL - Remove restriction to have to be at least an operator to query TRES

https://groups.google.com/forum/?fromgroups#!topic/slurm-devel/XiL7GA8CYj8

I am still running 15.08.1 but have a patch that seems to fix it if you're interested.

M

On Tue, Nov 24, 2015 at 6:04 AM, Lucas Gabriel Vuotto wrote:
>
> Hello,
>
> we have a small HPC cluster managed by slurm. We're running version 15.08.1 on SL 6.5. We implemented some per-user cpu and gpu monthly quotas and we want users to be able to check their consumed quota. sreport would fill this task perfectly *except* that it returns:
>
>     salvador@odin ~ $ sreport user top
>     sreport: error: Access/permission denied
>     sreport: fatal: Problem getting TRES data: Access/permission denied
>
> when run by a user with AdminLevel set to none. Even just running `sreport` gives the same error message. Both the slurm.conf and slurmdbd.conf man pages say, in the PrivateData description, that all users have, by default, access to all the information, and neither one says anything about TRES data being private. To be clear, we left `PrivateData` unset in both config files.
>
> The slurmdbd log doesn't show any significant data (in our opinion) even when setting DebugLevel to debug4:
>
>     slurmdbd: debug2: Opened connection 8 from 127.0.0.1
>     slurmdbd: debug:  DBD_INIT: CLUSTER:odin VERSION:7424 UID:2007 IP:127.0.0.1 CONN:8
>     slurmdbd: debug2: acct_storage_p_get_connection: request new connection 1
>     slurmdbd: debug2: DBD_GET_TRES: called
>     slurmdbd: error: Processing last message from connection 8(127.0.0.1) uid(2007)
>     slurmdbd: debug4: got 0 commits
>     slurmdbd: debug2: Closed connection 8 uid(2007)
>
> The "issue" isn't present when `sreport` is run by a user with AdminLevel set to operator or admin.
>
> Has anyone had this problem? Is there any way to fix it? Or should we stick to running a cron job every 5 minutes to gather the data with a privileged enough user and then build a mechanism so unprivileged users can access this data?
>
> If it's significant, we have both slurmctld and slurmdbd on the same machine.
>
> Cheers,
>
> -- lv.
[slurm-dev] weird error (bug?) on srun (16.05.0-0pre1)
Hi all,

I am facing a quite weird error on the latest version of slurm (16.05.0-0pre1): srun crashes when executed.

I have 2 experimental testbeds, one based on virtual machines and one physical. Both clusters, physical and virtual, run:

- OS: CentOS 7, updated
- MPICH version: 3.1.4
- slurm 16.05.0-0pre1
- munge-0.5.11
- slurm.conf configured with "MpiDefault=pmi2"

I have a test helloWorldMPI application. In the virtual cluster, the application can be executed with

    srun -n 2 --cpus-per-task=1 --ntasks-per-node=1 ./helloWorldMPI
    sbatch -n 2 --cpus-per-task=1 --ntasks-per-node=1 helloWorldMPI.sh

(helloWorldMPI.sh is a script with a single line, "mpiexec helloWorldMPI"); both work OK.

However, in the physical cluster, I can run the sbatch command, but the srun one crashes:

    -bash-4.2$ srun --version
    slurm 16.05.0-0pre1
    -bash-4.2$ srun -n 2 --cpus-per-task=1 --ntasks-per-node=1 ./helloWorldMPI
    *** Error in `srun': free(): invalid pointer: 0x7fc1ff774ed0 ***
    ======= Backtrace: =========
    /lib64/libc.so.6(+0x7d1fd)[0x7fc2000191fd]
    srun(slurm_xfree+0x49)[0x442ce6]
    srun(slurm_free_forward_data_msg+0x34)[0x4c0a34]
    srun(slurm_free_msg_data+0xc70)[0x4c66b6]
    srun(slurm_free_msg+0x53)[0x4864ae]
    /home/localsoft/slurm/lib/slurm/mpi_pmi2.so(tree_msg_to_stepds+0x189)[0x7fc1ff56cbc3]
    /home/localsoft/slurm/lib/slurm/mpi_pmi2.so(temp_kvs_send+0xd7)[0x7fc1ff563bfc]
    /home/localsoft/slurm/lib/slurm/mpi_pmi2.so(+0xf18d)[0x7fc1ff56b18d]
    /home/localsoft/slurm/lib/slurm/mpi_pmi2.so(handle_tree_cmd+0x49d)[0x7fc1ff56c601]
    /home/localsoft/slurm/lib/slurm/mpi_pmi2.so(+0x556f)[0x7fc1ff56156f]
    /home/localsoft/slurm/lib/slurm/mpi_pmi2.so(+0x5760)[0x7fc1ff561760]
    srun[0x428b58]
    srun[0x42891e]
    srun(eio_handle_mainloop+0x1b0)[0x428528]
    /home/localsoft/slurm/lib/slurm/mpi_pmi2.so(+0x5b41)[0x7fc1ff561b41]
    /lib64/libpthread.so.0(+0x7df5)[0x7fc200364df5]
    /lib64/libc.so.6(clone+0x6d)[0x7fc2000921ad]
    ======= Memory map: ========
    [memory map garbled in the archive; it listed the srun binary, its heap and
    thread stacks, and the loaded libraries libgcc_s, route_default.so,
    libmunge.so.2.0.0 and auth_munge.so before cutting off mid-listing]
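Since the backtrace points at a free() inside the mpi_pmi2 plugin, one next step could be to rerun the failing command under a memory checker to get the allocation context of the bad pointer. The sketch below is a hypothetical helper, not part of the original report; it falls back to running the command plain when valgrind is not installed:

```shell
#!/bin/sh
# Hypothetical triage helper: run a command under valgrind's memcheck when
# available, so "free(): invalid pointer" is reported with the stack of the
# matching allocation; otherwise run the command unchanged.
run_checked() {
    if command -v valgrind >/dev/null 2>&1; then
        valgrind --tool=memcheck "$@"
    else
        "$@"
    fi
}

# Intended use on the physical cluster (arguments are the ones from the post):
#   run_checked srun -n 2 --cpus-per-task=1 --ntasks-per-node=1 ./helloWorldMPI
```

Comparing the valgrind report against the same run with `--mpi=none` would also show whether the pmi2 plugin is the only path that trips the invalid free.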
[slurm-dev] sreport TRES permissions issue
Hello,

we have a small HPC cluster managed by slurm. We're running version 15.08.1 on SL 6.5. We implemented some per-user cpu and gpu monthly quotas and we want users to be able to check their consumed quota. sreport would fill this task perfectly *except* that it returns:

    salvador@odin ~ $ sreport user top
    sreport: error: Access/permission denied
    sreport: fatal: Problem getting TRES data: Access/permission denied

when run by a user with AdminLevel set to none. Even just running `sreport` gives the same error message. Both the slurm.conf and slurmdbd.conf man pages say, in the PrivateData description, that all users have, by default, access to all the information, and neither one says anything about TRES data being private. To be clear, we left `PrivateData` unset in both config files.

The slurmdbd log doesn't show any significant data (in our opinion) even when setting DebugLevel to debug4:

    slurmdbd: debug2: Opened connection 8 from 127.0.0.1
    slurmdbd: debug:  DBD_INIT: CLUSTER:odin VERSION:7424 UID:2007 IP:127.0.0.1 CONN:8
    slurmdbd: debug2: acct_storage_p_get_connection: request new connection 1
    slurmdbd: debug2: DBD_GET_TRES: called
    slurmdbd: error: Processing last message from connection 8(127.0.0.1) uid(2007)
    slurmdbd: debug4: got 0 commits
    slurmdbd: debug2: Closed connection 8 uid(2007)

The "issue" isn't present when `sreport` is run by a user with AdminLevel set to operator or admin.

Has anyone had this problem? Is there any way to fix it? Or should we stick to running a cron job every 5 minutes to gather the data with a privileged enough user and then build a mechanism so unprivileged users can access this data?

If it's significant, we have both slurmctld and slurmdbd on the same machine.

Cheers,

-- lv.