[slurm-dev] Re: sreport TRES permissions issue

2015-11-24 Thread Lucas Gabriel Vuotto


Thanks for the info, Michael!

On 24/11/15 14:03, Michael Gutteridge wrote:

This was a bug fixed in 15.08.2:

   -- MYSQL - Remove restriction to have to be at least an operator to
  query TRES

https://groups.google.com/forum/?fromgroups#!topic/slurm-devel/XiL7GA8CYj8

I am still running 15.08.1 but have a patch that seems to fix it if
you're interested.

M

On Tue, Nov 24, 2015 at 6:04 AM, Lucas Gabriel Vuotto wrote:


Hello,

we have a small HPC cluster managed by Slurm. We're running version
15.08.1 on SL 6.5. We implemented some per-user monthly CPU and GPU
quotas and we want users to be able to check their consumed quota.
sreport would fit this task perfectly *except* that it returns:

salvador@odin ~ $ sreport user top
sreport: error: Access/permission denied
sreport: fatal: Problem getting TRES data: Access/permission denied

when run by a user with AdminLevel set to none. Even just running
`sreport` gives the same error message. Both the slurm.conf and
slurmdbd.conf man pages say, in the PrivateData description, that all
users have, by default, access to all the information, and neither
one says anything about TRES data being private. To be clear, we
left `PrivateData` unset in both config files.

The slurmdbd log doesn't show anything significant (in our opinion),
even with DebugLevel set to debug4:

slurmdbd: debug2: Opened connection 8 from 127.0.0.1
slurmdbd: debug:  DBD_INIT: CLUSTER:odin VERSION:7424 UID:2007 IP:127.0.0.1 CONN:8
slurmdbd: debug2: acct_storage_p_get_connection: request new connection 1
slurmdbd: debug2: DBD_GET_TRES: called
slurmdbd: error: Processing last message from connection 8(127.0.0.1) uid(2007)
slurmdbd: debug4: got 0 commits
slurmdbd: debug2: Closed connection 8 uid(2007)

The "issue" isn't present when `sreport` is run by a user with
AdminLevel set to operator or admin.

Has anyone had this problem? Is there any way to fix it? Or should
we stick to running a cron job every 5 minutes that gathers the data
as a sufficiently privileged user, plus some mechanism that lets
unprivileged users access that data?

In case it matters, slurmctld and slurmdbd both run on the same
machine.

Cheers,


-- lv.






-- lv.


[slurm-dev] Re: Cannot exclude hosts with --exclude

2015-11-24 Thread Carlos Fenoy
There seems to be a wrong character in the double dashes "--": the
quoted error shows "—-" (an em dash followed by a hyphen) instead of
two plain hyphens.
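
As a quick check, a sketch like the following could list the lines of a
batch script that contain a Unicode en dash (U+2013) or em dash (U+2014).
The function name find_unicode_dashes is made up for illustration; it
assumes a UTF-8 encoded script and a grep that supports -a and matches
raw bytes under LC_ALL=C.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm): print, with line numbers, every
# line of the file given as $1 that contains an en dash or an em dash.
# Returns non-zero when no such line is found.
find_unicode_dashes() {
    en_dash=$(printf '\342\200\223')   # UTF-8 bytes of U+2013
    em_dash=$(printf '\342\200\224')   # UTF-8 bytes of U+2014
    # LC_ALL=C compares raw bytes; -a forces grep to treat the file as text.
    LC_ALL=C grep -a -n -e "$en_dash" -e "$em_dash" -- "$1"
}
```

Calling `find_unicode_dashes testsubmit.sh` would print the offending
lines; retyping those options with two plain ASCII hyphens should then
let sbatch parse them.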

On Tue, 24 Nov 2015, 22:04 Zentz, Scott C. wrote:

> Hello Everyone!
>
> I have a user who is trying to exclude some hosts from their job
> submission and was using –exclude to accomplish this. He claims that he was
> able to do this in the past and the man pages state that –exclude is an
> option but when we include that argument, we receive the following error:
>
> zentz@diamond(~)% sbatch testsumit.sh
> sbatch: error: Invalid argument: —-exclude=bc1node1
> zentz@diamond(~)%
>
> Here is a copy of the “testsubmit.sh”
>
> #!/bin/bash
>
> #SBATCH —-exclude=bc1node1
> srun echo "test"
>
> Is the exclude option only for specific cluster types or is there
> something else going awry?
>
> Thanks!
> -scz
>


[slurm-dev] Cannot exclude hosts with --exclude

2015-11-24 Thread Zentz, Scott C.
Hello Everyone!

I have a user who is trying to exclude some hosts from their job submission and 
was using -exclude to accomplish this. He claims that he was able to do this in 
the past and the man pages state that -exclude is an option but when we include 
that argument, we receive the following error:

zentz@diamond(~)% sbatch testsumit.sh
sbatch: error: Invalid argument: --exclude=bc1node1
zentz@diamond(~)%

Here is a copy of the "testsubmit.sh"

#!/bin/bash

#SBATCH --exclude=bc1node1
srun echo "test"



Is the exclude option only for specific cluster types or is there something 
else going awry?

Thanks!
-scz


[slurm-dev] Re: sreport TRES permissions issue

2015-11-24 Thread Michael Gutteridge
This was a bug fixed in 15.08.2:

  -- MYSQL - Remove restriction to have to be at least an operator to
 query TRES

https://groups.google.com/forum/?fromgroups#!topic/slurm-devel/XiL7GA8CYj8

I am still running 15.08.1 but have a patch that seems to fix it if you're
interested.
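
To tell whether a given installation already includes the fix, one could
compare its version string against 15.08.2. The sketch below assumes GNU
sort with the -V (version sort) option; version_ge is a hypothetical
helper name, not part of Slurm.

```shell
#!/bin/sh
# Hypothetical helper, not part of Slurm: succeed if version $1 >= version $2.
# Assumes GNU sort, whose -V option orders version strings.
version_ge() {
    lowest=$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)
    [ "$lowest" = "$2" ]
}

# Example: 15.08.1 predates the release that carries the fix.
if version_ge "15.08.1" "15.08.2"; then
    echo "15.08.1: fix included"
else
    echo "15.08.1: fix not included"
fi
```

The installed version string would have to be supplied by the caller,
for example from the output of the tools' --version option.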

M

On Tue, Nov 24, 2015 at 6:04 AM, Lucas Gabriel Vuotto wrote:

>
> Hello,
>
> we have a small HPC cluster managed by Slurm. We're running version
> 15.08.1 on SL 6.5. We implemented some per-user monthly CPU and GPU
> quotas and we want users to be able to check their consumed quota.
> sreport would fit this task perfectly *except* that it returns:
>
> salvador@odin ~ $ sreport user top
> sreport: error: Access/permission denied
> sreport: fatal: Problem getting TRES data: Access/permission denied
>
> when run by a user with AdminLevel set to none. Even just running
> `sreport` gives the same error message. Both the slurm.conf and
> slurmdbd.conf man pages say, in the PrivateData description, that all
> users have, by default, access to all the information, and neither
> one says anything about TRES data being private. To be clear, we
> left `PrivateData` unset in both config files.
>
> The slurmdbd log doesn't show anything significant (in our opinion),
> even with DebugLevel set to debug4:
>
> slurmdbd: debug2: Opened connection 8 from 127.0.0.1
> slurmdbd: debug:  DBD_INIT: CLUSTER:odin VERSION:7424 UID:2007 IP:127.0.0.1 CONN:8
> slurmdbd: debug2: acct_storage_p_get_connection: request new connection 1
> slurmdbd: debug2: DBD_GET_TRES: called
> slurmdbd: error: Processing last message from connection 8(127.0.0.1) uid(2007)
> slurmdbd: debug4: got 0 commits
> slurmdbd: debug2: Closed connection 8 uid(2007)
>
> The "issue" isn't present when `sreport` is run by a user with AdminLevel
> set to operator or admin.
>
> Has anyone had this problem? Is there any way to fix it? Or should
> we stick to running a cron job every 5 minutes that gathers the data
> as a sufficiently privileged user, plus some mechanism that lets
> unprivileged users access that data?
>
> In case it matters, slurmctld and slurmdbd both run on the same
> machine.
>
> Cheers,
>
>
> -- lv.
>


[slurm-dev] weird error (bug?) on srun (16.05.0-0pre1)

2015-11-24 Thread Manuel Rodríguez Pascual

Hi all,

I am facing a rather weird error with the latest version of Slurm
(16.05.0-0pre1): srun crashes when executed.

I have two experimental testbeds, one based on virtual machines and
one physical.

Both clusters, physical and virtual, run:
- OS: CentOS 7, updated
- MPICH version: 3.1.4
- Slurm 16.05.0-0pre1
- munge-0.5.11
- slurm.conf configured with "MpiDefault=pmi2"

I have a test helloWorldMPI application.

In the virtual cluster, the application can be executed with

---
srun  -n 2 --cpus-per-task=1 --ntasks-per-node=1 ./helloWorldMPI
sbatch  -n 2 --cpus-per-task=1 --ntasks-per-node=1 helloWorldMPI.sh
---

(helloWorldMPI.sh is a script with a single line, "mpiexec
helloWorldMPI"), and both work OK.

However, in the physical cluster, I can run the sbatch command, but
the srun one crashes.

---
-bash-4.2$ srun --version
slurm 16.05.0-0pre1

-bash-4.2$ srun  -n 2 --cpus-per-task=1 --ntasks-per-node=1 ./helloWorldMPI
*** Error in `srun': free(): invalid pointer: 0x7fc1ff774ed0 ***
=== Backtrace: =
/lib64/libc.so.6(+0x7d1fd)[0x7fc2000191fd]
srun(slurm_xfree+0x49)[0x442ce6]
srun(slurm_free_forward_data_msg+0x34)[0x4c0a34]
srun(slurm_free_msg_data+0xc70)[0x4c66b6]
srun(slurm_free_msg+0x53)[0x4864ae]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(tree_msg_to_stepds+0x189)[0x7fc1ff56cbc3]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(temp_kvs_send+0xd7)[0x7fc1ff563bfc]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(+0xf18d)[0x7fc1ff56b18d]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(handle_tree_cmd+0x49d)[0x7fc1ff56c601]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(+0x556f)[0x7fc1ff56156f]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(+0x5760)[0x7fc1ff561760]
srun[0x428b58]
srun[0x42891e]
srun(eio_handle_mainloop+0x1b0)[0x428528]
/home/localsoft/slurm/lib/slurm/mpi_pmi2.so(+0x5b41)[0x7fc1ff561b41]
/lib64/libpthread.so.0(+0x7df5)[0x7fc200364df5]
/lib64/libc.so.6(clone+0x6d)[0x7fc2000921ad]
=== Memory map: 
0040-005c2000 r-xp  00:22 26456
  /home/localsoft/slurm/bin/srun
007c1000-007c2000 r--p 001c1000 00:22 26456
  /home/localsoft/slurm/bin/srun
007c2000-007c9000 rw-p 001c2000 00:22 26456
  /home/localsoft/slurm/bin/srun
007c9000-007cf000 rw-p  00:00 0
02628000-0288d000 rw-p  00:00 0  [heap]
7fc1e000-7fc1e0021000 rw-p  00:00 0
7fc1e0021000-7fc1e400 ---p  00:00 0
7fc1e800-7fc1e8021000 rw-p  00:00 0
7fc1e8021000-7fc1ec00 ---p  00:00 0
7fc1ec00-7fc1ec021000 rw-p  00:00 0
7fc1ec021000-7fc1f000 ---p  00:00 0
7fc1f000-7fc1f0021000 rw-p  00:00 0
7fc1f0021000-7fc1f400 ---p  00:00 0
7fc1f400-7fc1f4021000 rw-p  00:00 0
7fc1f4021000-7fc1f800 ---p  00:00 0
7fc1f800-7fc1f8021000 rw-p  00:00 0
7fc1f8021000-7fc1fc00 ---p  00:00 0
7fc1fe03-7fc1fe045000 r-xp  08:17 67109001
  /usr/lib64/libgcc_s-4.8.3-20140911.so.1
7fc1fe045000-7fc1fe244000 ---p 00015000 08:17 67109001
  /usr/lib64/libgcc_s-4.8.3-20140911.so.1
7fc1fe244000-7fc1fe245000 r--p 00014000 08:17 67109001
  /usr/lib64/libgcc_s-4.8.3-20140911.so.1
7fc1fe245000-7fc1fe246000 rw-p 00015000 08:17 67109001
  /usr/lib64/libgcc_s-4.8.3-20140911.so.1
7fc1fe246000-7fc1fe247000 ---p  00:00 0
7fc1fe247000-7fc1fe347000 rw-p  00:00 0
7fc1fe347000-7fc1fe348000 ---p  00:00 0
7fc1fe348000-7fc1fe448000 rw-p  00:00 0
7fc1fe448000-7fc1fe449000 r-xp  00:22 21727
  /home/localsoft/slurm/lib/slurm/route_default.so
7fc1fe449000-7fc1fe648000 ---p 1000 00:22 21727
  /home/localsoft/slurm/lib/slurm/route_default.so
7fc1fe648000-7fc1fe649000 r--p  00:22 21727
  /home/localsoft/slurm/lib/slurm/route_default.so
7fc1fe649000-7fc1fe64a000 rw-p 1000 00:22 21727
  /home/localsoft/slurm/lib/slurm/route_default.so
7fc1fe64a000-7fc1fe64b000 ---p  00:00 0
7fc1fe64b000-7fc1fe74b000 rw-p  00:00 0
  [stack:16505]
7fc1fe74b000-7fc1fe74c000 ---p  00:00 0
7fc1fe74c000-7fc1fe84c000 rw-p  00:00 0
  [stack:16504]
7fc1fe84c000-7fc1fe84d000 ---p  00:00 0
7fc1fe84d000-7fc1ff04d000 rw-p  00:00 0
  [stack:16503]
7fc1ff04d000-7fc1ff04e000 ---p  00:00 0
7fc1ff04e000-7fc1ff14e000 rw-p  00:00 0
  [stack:16502]
7fc1ff14e000-7fc1ff157000 r-xp  08:17 67390882
  /usr/lib64/libmunge.so.2.0.0
7fc1ff157000-7fc1ff356000 ---p 9000 08:17 67390882
  /usr/lib64/libmunge.so.2.0.0
7fc1ff356000-7fc1ff357000 r--p 8000 08:17 67390882
  /usr/lib64/libmunge.so.2.0.0
7fc1ff357000-7fc1ff358000 rw-p 9000 08:17 67390882
  /usr/lib64/libmunge.so.2.0.0
7fc1ff358000-7fc1ff35b000 r-xp  00:22 1228
  /home/localsoft/slurm/lib/slurm/auth_munge.so
7fc1ff35b000-7fc1ff55a000 ---p 3000 00:22 1228
  /home/localsoft/slurm/lib/slurm/auth_munge.so
7fc1ff55a000-7fc1ff55b000 r--p 2000 00:22 1228
  /home/localsoft/slurm/lib/slurm/auth_munge.so

[slurm-dev] sreport TRES permissions issue

2015-11-24 Thread Lucas Gabriel Vuotto


Hello,

we have a small HPC cluster managed by Slurm. We're running version
15.08.1 on SL 6.5. We implemented some per-user monthly CPU and GPU
quotas and we want users to be able to check their consumed quota.
sreport would fit this task perfectly *except* that it returns:


salvador@odin ~ $ sreport user top
sreport: error: Access/permission denied
sreport: fatal: Problem getting TRES data: Access/permission denied

when run by a user with AdminLevel set to none. Even just running
`sreport` gives the same error message. Both the slurm.conf and
slurmdbd.conf man pages say, in the PrivateData description, that all
users have, by default, access to all the information, and neither
one says anything about TRES data being private. To be clear, we
left `PrivateData` unset in both config files.

The slurmdbd log doesn't show anything significant (in our opinion),
even with DebugLevel set to debug4:


slurmdbd: debug2: Opened connection 8 from 127.0.0.1
slurmdbd: debug:  DBD_INIT: CLUSTER:odin VERSION:7424 UID:2007 IP:127.0.0.1 CONN:8
slurmdbd: debug2: acct_storage_p_get_connection: request new connection 1
slurmdbd: debug2: DBD_GET_TRES: called
slurmdbd: error: Processing last message from connection 8(127.0.0.1) uid(2007)
slurmdbd: debug4: got 0 commits
slurmdbd: debug2: Closed connection 8 uid(2007)

The "issue" isn't present when `sreport` is run by a user with 
AdminLevel set to operator or admin.


Has anyone had this problem? Is there any way to fix it? Or should we
stick to running a cron job every 5 minutes that gathers the data as a
sufficiently privileged user, plus some mechanism that lets unprivileged
users access that data?
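
A minimal sketch of that cron-based workaround, under stated
assumptions: publish_report is a hypothetical helper, the output path
and the script location are made up, and the -n (no header) and -P
(parsable) options are the ones documented in the sreport man page, not
something taken from this thread.

```shell
#!/bin/sh
# Hypothetical helper: run the command given after the first argument and
# publish its output to the file named by the first argument. The output
# is written to a temporary file first and renamed into place, so readers
# never see a half-written report.
publish_report() {
    out=$1
    shift
    tmp=$(mktemp "${out}.XXXXXX") || return 1
    if "$@" > "$tmp"; then
        # Make the report world-readable, then move it into place.
        chmod 644 "$tmp" && mv -f "$tmp" "$out"
    else
        rm -f "$tmp"
        return 1
    fi
}

# Example call, to be run as a user with operator or admin AdminLevel
# (the output path is an assumption):
#   publish_report /var/tmp/sreport-user-top.txt sreport -n -P user top
#
# Example crontab entry running a script that makes this call every
# 5 minutes (the script path is an assumption):
#   */5 * * * * /usr/local/sbin/publish-sreport.sh
```

Unprivileged users could then read the published file instead of
calling sreport themselves.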


In case it matters, slurmctld and slurmdbd both run on the same
machine.


Cheers,


-- lv.