[slurm-dev] Disabling screen/tmux like applications

2017-07-06 Thread Jerin Philip
Hello,

Are there techniques to prevent processes from detaching from slurmstepd
and re-attaching under init? Will changing *ProcTrackType* help, assuming the
above applications work by changing their session and group IDs? As of now, it
is pgid.
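
For reference, a minimal sketch of the cgroup-based alternative (the settings
below are an assumption for illustration, not a confirmed fix):

    # slurm.conf: track every process of a step through a cgroup rather than
    # its process group id, so a daemonized screen/tmux that changes its
    # session id still belongs to the step and is cleaned up with the job
    ProcTrackType=proctrack/cgroup

    # cgroup.conf (minimal)
    CgroupAutomount=yes
    ConstrainCores=yes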



--
Jerin Philip


[slurm-dev] Re: knl_generic plugin on non-KNL node

2017-07-06 Thread Gilles Gouaillardet


Victor,


in your slurm.conf, you should have a line like this one

NodeName=n[1-4] Feature=knl Sockets=1 CoresPerSocket=68 State=UNKNOWN

First, make sure your regular Xeon nodes do *not* have the 'knl' feature.


I guess another option is not to have the

NodeFeaturesPlugins=knl_generic

line on your regular Xeon nodes

(note that unless you specify an option, you will get some warnings 
since your slurm.conf files are not all identical)
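
As an illustration only (node names and counts below are made up, and every
copy of slurm.conf must still list all nodes), the two per-node copies could
differ like this:

    # slurm.conf shipped to the KNL nodes
    NodeFeaturesPlugins=knl_generic
    NodeName=knl[1-4] Feature=knl Sockets=1 CoresPerSocket=68 State=UNKNOWN
    NodeName=xeon[1-8] Sockets=2 CoresPerSocket=12 State=UNKNOWN

    # slurm.conf shipped to the regular Xeon nodes: identical node lines,
    # but without the NodeFeaturesPlugins=knl_generic entry
    NodeName=knl[1-4] Feature=knl Sockets=1 CoresPerSocket=68 State=UNKNOWN
    NodeName=xeon[1-8] Sockets=2 CoresPerSocket=12 State=UNKNOWN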




Cheers,


Gilles


On 7/6/2017 2:38 AM, Victor Gamayunov wrote:

knl_generic plugin on non-KNL node
Hi,

I have a cluster with a mix of regular Xeon and KNL nodes. I use 
knl_generic to switch KNL modes, which works very well.
However, there is a side effect on non-KNL nodes: every time I 
allocate a non-KNL node and specify a constraint (which has nothing to 
do with KNL), the node is rebooted.


Is there a way to selectively disable the plugin on non-KNL nodes?

I looked at the code but couldn't quite figure out what forces the reboot.

Thanks
Victor


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Uwe Sauter


> Yes, this is possible, but I would say it's discouraged to do so.
> With RHEL/CentOS 7 you really should be using firewalld, and forget about the 
> old iptables.  Here's a nice introduction:
> https://www.certdepot.net/rhel7-get-started-firewalld/
> 
> Having worked with firewalld for a while now, I find it more flexible to use. 
> Admittedly, there is a bit of a learning
> curve.

I disagree. Firewalld might be better in dynamic environments where you need to 
reconfigure your firewall automatically,
but with static services I find iptables much easier.

But this discussion is as moot as the question of whether SysVInit or Systemd is 
better. Only with iptables vs.
firewalld do you still have a choice.

>> The crucial part is to ensure that either firewalld *or* iptables is running 
>> but not both. Or you could run without
>> firewall at
>> all *if* you trust your network…
> 
> Agreed.  The compute node network *has to be* trusted in order for Slurm to 
> work.
> 
> /Ole


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Ole Holm Nielsen


On 07/06/2017 04:31 PM, Uwe Sauter wrote:


Alternatively you can

   systemctl disable firewalld.service

   systemctl mask firewalld.service

   yum install iptables-services

   systemctl enable iptables.service ip6tables.service

and configure iptables in /etc/sysconfig/iptables and 
/etc/sysconfig/ip6tables, then

   systemctl start iptables.service ip6tables.service


Yes, this is possible, but I would say it's discouraged to do so.
With RHEL/CentOS 7 you really should be using firewalld, and forget 
about the old iptables.  Here's a nice introduction: 
https://www.certdepot.net/rhel7-get-started-firewalld/


Having worked with firewalld for a while now, I find it more flexible to 
use. Admittedly, there is a bit of a learning curve.
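
As an illustration only (6817, 6818 and 6819 are the default ports for 
slurmctld, slurmd and slurmdbd; adjust if your slurm.conf overrides them), 
opening them with firewalld might look like:

    # on the compute nodes: allow slurmd
    firewall-cmd --permanent --add-port=6818/tcp
    # on the controller: allow slurmctld (and slurmdbd, if it runs there)
    firewall-cmd --permanent --add-port=6817/tcp
    firewall-cmd --permanent --add-port=6819/tcp
    firewall-cmd --reload

Keep in mind that srun also listens on ephemeral ports for step traffic, 
which is one more reason the compute network usually ends up simply trusted.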



The crucial part is to ensure that either firewalld *or* iptables is running 
but not both. Or you could run without firewall at
all *if* you trust your network…


Agreed.  The compute node network *has to be* trusted in order for Slurm 
to work.


/Ole


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Uwe Sauter

Alternatively you can

  systemctl disable firewalld.service

  systemctl mask firewalld.service

  yum install iptables-services

  systemctl enable iptables.service ip6tables.service

and configure iptables in /etc/sysconfig/iptables and 
/etc/sysconfig/ip6tables, then

  systemctl start iptables.service ip6tables.service
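
For illustration, a minimal /etc/sysconfig/iptables along these lines could 
cover a compute node (6817-6819 are the default Slurm ports; 10.0.0.0/24 is 
just a placeholder for the cluster subnet):

    *filter
    :INPUT DROP [0:0]
    :FORWARD DROP [0:0]
    :OUTPUT ACCEPT [0:0]
    # keep loopback and established connections
    -A INPUT -i lo -j ACCEPT
    -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    # slurmctld/slurmd/slurmdbd from the cluster network, plus ssh
    -A INPUT -s 10.0.0.0/24 -p tcp --dport 6817:6819 -j ACCEPT
    -A INPUT -p tcp --dport 22 -j ACCEPT
    COMMIT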



The crucial part is to ensure that either firewalld *or* iptables is running 
but not both. Or you could run without firewall at
all *if* you trust your network…




On 06.07.2017 at 14:12, Ole Holm Nielsen wrote:
> 
> Firewall problems, like I suggested initially!  Nmap is a great tool for 
> probing open ports!
> 
> The iptables *must not* be configured on CentOS 7, you *must* use firewalld.  
> See
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
>  for Slurm configurations.
> 
> /Ole
> 
> On 07/06/2017 01:22 PM, Said Mohamed Said wrote:
>> John and Others,
>>
>>
>> Thank you very much for your support. The problem is finally solved.
>>
>>
>> After installing nmap, I realized that some ports were blocked even 
>> with the firewall daemon stopped and disabled. It turned out
>> that iptables was on and enabled. After stopping iptables everything works 
>> just fine.
>>
>>
>>
>> Best Regards,
>>
>>
>> Said.
>>
>> 
>> *From:* John Hearns 
>> *Sent:* Thursday, July 6, 2017 6:47:48 PM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>> Said, you are not out of ideas.
>>
>> I would suggest 'nmap' as a good tool to start with.   Install nmap on your 
>> compute node and see which ports are open on the
>> controller node
>>
>> Also do we have a DNS name resolution problem here?
>> I always remember Sun Grid Engine as being notoriously sensitive to name 
>> resolution, and that was my first question when any SGE
>> problem was reported.
>> So a couple of questions:
>>
>> On the controller node and on the compute node run this:
>> hostname
>> hostname -f
>>
>> Do the cluster controller node or the compute nodes have more than one 
>> network interface.
>> I bet the cluster controller node does!   From the compute node, do an 
>> nslookup or a dig  and see what the COMPUTE NODE thinks
>> are the names of both of those interfaces.
>>
>> Also as Rajul says - how are you making sure that both controller and 
>> compute nodes have the same slurm.conf file
>> Actually if the slurm.conf files are different this will be logged when the 
>> compute node starts up, but let us check everything.
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> On 6 July 2017 at 11:37, Said Mohamed Said > > wrote:
>>
>> Even after reinstalling everything from the beginning the problem is
>> still there. Right now I am out of Ideas.
>>
>>
>>
>>
>> Best Regards,
>>
>>
>> Said.
>>
>> 
>> *From:* Said Mohamed Said
>> *Sent:* Thursday, July 6, 2017 2:23:05 PM
>> *To:* slurm-dev
>> *Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>> Thank you all for your suggestions, the only thing I can do for now
>> is to uninstall and install from the beginning and I will use the
>> most recent version of slurm on both nodes.
>>
>> For Felix who asked, the OS is CentOS 7.3 on both machines.
>>
>> I will let you know if that can solve the issue.
>> 
>> *From:* Rajul Kumar > >
>> *Sent:* Thursday, July 6, 2017 12:41:51 AM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>> Sorry for the typo
>> It's generally when one of the controller or compute can reach the
>> other one but it's *not* happening vice-versa.
>>
>>
>> On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar
>> > wrote:
>>
>> I came across the same problem sometime back. It's generally
>> when one of the controller or compute can reach to other one but
>> it's happening vice-versa.
>>
>> Have a look at the following points:
>> - controller and compute can ping to each other
>> - both share the same slurm.conf
>> - slurm.conf has the location of both controller and compute
>> - slurm services are running on the compute node when the
>> controller says it's down
>> - TCP connections are not being dropped
>> - Ports are accessible that are to be used for communication,
>> specifically response ports
>> - Check the routing rules if any
>> - Clocks are synced across
>> - Hope there isn't any version mismatch but still have a look
>> (doesn't recognize the nodes for major version differences)

[slurm-dev] Re: different behaviour of signals with sbatch in different machines

2017-07-06 Thread Manuel Rodríguez Pascual

Hi all,

just in case anybody faces this problem at some point...

I found the solution with a set of good examples in
http://mywiki.wooledge.org/SignalTrap

Applied to my problem (full code below), it comes down to executing

"sh son.sh & wait $!"

in the father script.
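
For reference, a sketch of just the changed tail of father.sh: while bash is
blocked on a foreground child it postpones running traps until that child
exits, whereas a 'wait' on a background child returns as soon as a trapped
signal arrives, so the handler runs immediately.

sh son.sh &     # run the child in the background
wait $!         # wait on it; SIGTERM from scancel now interrupts the wait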

Best regards,

Manuel

2017-07-04 15:55 GMT+02:00 Manuel Rodríguez Pascual
:
>
> Hi all,
>
> While developing a Slurm plugin I've come across a funny problem. I guess it is not 
> strictly related to Slurm but just system administration, but maybe someone 
> can point me in the right direction.
>
> I have 2 machines, one with CentOS 7 and one with BullX (based on CentOS6). 
> When I send a signal to finish a running task, the behaviours are different.
>
> It can be seen with 2 nested scripts, based on slurm_trap.sh by Mike Drake  
> (https://gist.github.com/MikeDacre/10ae23dcd3986793c3fd ). The code is at the 
> bottom of the mail. As can be seen, both father and son are trapping SIGTERM 
> (among other signals). The execution consists of "father" calling "son", and "son" 
> waiting forever until it is killed.
>
>
> As you can see in the execution results (bottom of the mail), one of the 
> machines executes the functions stated in "trap", but the other does not. 
> Moreover, this second machine does execute the functions in trap when only a 
> single script is executed, not two nested ones.
>
> Have you got an explanation for this? Is it possible to ensure that the 
> "trap" command will always be executed?
>
> Thanks for your help,
>
> Manuel
>
> -
> -
> -bash-4.2$ more father.sh
>
> #!/bin/bash
>
> trap_with_arg() {
> func="$1" ; shift
> for sig ; do
> trap "$func $sig" "$sig"
> done
> }
>
> func_trap() {
> echo father: trapped $1
> }
>
> trap_with_arg func_trap 0 1 USR1 EXIT HUP INT QUIT PIPE TERM
>
> cat /dev/zero > /dev/null &
>
> sh son.sh
> -bash-4.2$ more son.sh
> #!/bin/bash
>
>
> trap_with_arg() {
> func="$1" ; shift
> for sig ; do
> trap "$func $sig" "$sig"
> done
> }
>
> func_trap() {
> echo son: trapped $1
> }
>
> trap_with_arg func_trap 0 1 USR1 EXIT HUP INT QUIT PIPE TERM
>
> cat /dev/zero > /dev/null &
> wait
> -
> -
>
>
> Output in CentOS7:
> -bash-4.2$ sbatch  father.sh
> Submitted batch job 1563
> -bash-4.2$ scancel 1563
> -bash-4.2$ more slurm-1563.out
> slurmstepd: error: *** JOB 1563 ON acme12 CANCELLED AT 2017-07-04T15:39:00 ***
> son: trapped TERM
> son: trapped EXIT
> father: trapped TERM
> father: trapped EXIT
>
> Output in BullX:
> ~/signalTests> sbatch  father.sh
> Submitted batch job 233
> ~/signalTests> scancel 233
> ~/signalTests> more slurm-233.out
> slurmstepd: error: *** JOB 233 ON taurusi5089 CANCELLED AT 
> 2017-07-04T15:43:54 ***
>
> Output in BullX, just son:
> ~/signalTests> sbatch -- son.sh
> Submitted batch job 235
> ~/signalTests> scancel 235
> ~/signalTests> more slurm-235.out
> slurmstepd: error: *** JOB 235 ON taurusi4061 CANCELLED AT 
> 2017-07-04T15:48:29 ***
> son: trapped TERM
> son: trapped EXIT
>
>
>
>
>


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Said Mohamed Said
Yes Nielsen, my biggest mistake was assuming that iptables wouldn't be used in 
CentOS 7.
But I am relieved that I found the solution with the help of all of you. I 
really appreciate it.

Best Regards,
Said.

From: Ole Holm Nielsen 
Sent: Thursday, July 6, 2017 9:11:54 PM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP


Firewall problems, like I suggested initially!  Nmap is a great tool for
probing open ports!

The iptables *must not* be configured on CentOS 7, you *must* use
firewalld.  See
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons
for Slurm configurations.

/Ole

On 07/06/2017 01:22 PM, Said Mohamed Said wrote:
> John and Others,
>
>
> Thank you very much for your support. The problem is finally solved.
>
>
> After installing nmap, I realized that some ports were blocked
> even with the firewall daemon stopped and disabled. It turned out that iptables
> was on and enabled. After stopping iptables everything works just fine.
>
>
>
> Best Regards,
>
>
> Said.
>
> 
> *From:* John Hearns 
> *Sent:* Thursday, July 6, 2017 6:47:48 PM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
> Said, you are not out of ideas.
>
> I would suggest 'nmap' as a good tool to start with.   Install nmap on
> your compute node and see which ports are open on the controller node
>
> Also do we have a DNS name resolution problem here?
> I always remember Sun Grid Engine as being notoriously sensitive to name
> resolution, and that was my first question when any SGE problem was
> reported.
> So a couple of questions:
>
> On the controller node and on the compute node run this:
> hostname
> hostname -f
>
> Do the cluster controller node or the compute nodes have more than one
> network interface.
> I bet the cluster controller node does!   From the compute node, do an
> nslookup or a dig and see what the COMPUTE NODE thinks are the names of
> both of those interfaces.
>
> Also as Rajul says - how are you making sure that both controller and
> compute nodes have the same slurm.conf file
> Actually if the slurm.conf files are different this will be logged when
> the compute node starts up, but let us check everything.
>
>
>
>
>
>
>
>
>
> On 6 July 2017 at 11:37, Said Mohamed Said  > wrote:
>
> Even after reinstalling everything from the beginning the problem is
> still there. Right now I am out of Ideas.
>
>
>
>
> Best Regards,
>
>
> Said.
>
> 
> *From:* Said Mohamed Said
> *Sent:* Thursday, July 6, 2017 2:23:05 PM
> *To:* slurm-dev
> *Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> Thank you all for your suggestions, the only thing I can do for now
> is to uninstall and install from the beginning and I will use the
> most recent version of slurm on both nodes.
>
> For Felix who asked, the OS is CentOS 7.3 on both machines.
>
> I will let you know if that can solve the issue.
> 
> *From:* Rajul Kumar  >
> *Sent:* Thursday, July 6, 2017 12:41:51 AM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
> Sorry for the typo
> It's generally when one of the controller or compute can reach the
> other one but it's *not* happening vice-versa.
>
>
> On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar
> > wrote:
>
> I came across the same problem sometime back. It's generally
> when one of the controller or compute can reach to other one but
> it's happening vice-versa.
>
> Have a look at the following points:
> - controller and compute can ping to each other
> - both share the same slurm.conf
> - slurm.conf has the location of both controller and compute
> - slurm services are running on the compute node when the
> controller says it's down
> - TCP connections are not being dropped
> - Ports are accessible that are to be used for communication,
> specifically response ports
> - Check the routing rules if any
> - Clocks are synced across
> - Hope there isn't any version mismatch but still have a look
> (doesn't recognize the nodes for major version differences)
>
> Hope this helps.
>
> Best,
> Rajul
>
> On Wed, Jul 5, 2017 at 10:52 AM, John Hearns
> > wrote:
>
> Said,
> a problem like this always has a simple cause. We share
> 

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Ole Holm Nielsen


Firewall problems, like I suggested initially!  Nmap is a great tool for 
probing open ports!


The iptables *must not* be configured on CentOS 7, you *must* use 
firewalld.  See 
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-firewall-for-slurm-daemons 
for Slurm configurations.


/Ole

On 07/06/2017 01:22 PM, Said Mohamed Said wrote:

John and Others,


Thank you very much for your support. The problem is finally solved.


After installing nmap, I realized that some ports were blocked 
even with the firewall daemon stopped and disabled. It turned out that iptables 
was on and enabled. After stopping iptables everything works just fine.




Best Regards,


Said.


*From:* John Hearns 
*Sent:* Thursday, July 6, 2017 6:47:48 PM
*To:* slurm-dev
*Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
Said, you are not out of ideas.

I would suggest 'nmap' as a good tool to start with.   Install nmap on 
your compute node and see which ports are open on the controller node


Also do we have a DNS name resolution problem here?
I always remember Sun Grid Engine as being notoriously sensitive to name 
resolution, and that was my first question when any SGE problem was 
reported.

So a couple of questions:

On the controller node and on the compute node run this:
hostname
hostname -f

Do the cluster controller node or the compute nodes have more than one 
network interface.
I bet the cluster controller node does!   From the compute node, do an 
nslookup or a dig and see what the COMPUTE NODE thinks are the names of 
both of those interfaces.


Also as Rajul says - how are you making sure that both controller and 
compute nodes have the same slurm.conf file
Actually if the slurm.conf files are different this will be logged when 
the compute node starts up, but let us check everything.










On 6 July 2017 at 11:37, Said Mohamed Said > wrote:


Even after reinstalling everything from the beginning the problem is
still there. Right now I am out of Ideas.




Best Regards,


Said.


*From:* Said Mohamed Said
*Sent:* Thursday, July 6, 2017 2:23:05 PM
*To:* slurm-dev
*Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP

Thank you all for your suggestions, the only thing I can do for now
is to uninstall and install from the beginning and I will use the
most recent version of slurm on both nodes.

For Felix who asked, the OS is CentOS 7.3 on both machines.

I will let you know if that can solve the issue.

*From:* Rajul Kumar >
*Sent:* Thursday, July 6, 2017 12:41:51 AM
*To:* slurm-dev
*Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
Sorry for the typo
It's generally when one of the controller or compute can reach the
other one but it's *not* happening vice-versa.


On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar
> wrote:

I came across the same problem sometime back. It's generally
when one of the controller or compute can reach to other one but
it's happening vice-versa.

Have a look at the following points:
- controller and compute can ping to each other
- both share the same slurm.conf
- slurm.conf has the location of both controller and compute
- slurm services are running on the compute node when the
controller says it's down
- TCP connections are not being dropped
- Ports are accessible that are to be used for communication,
specifically response ports
- Check the routing rules if any
- Clocks are synced across
- Hope there isn't any version mismatch but still have a look
(doesn't recognize the nodes for major version differences)

Hope this helps.

Best,
Rajul

On Wed, Jul 5, 2017 at 10:52 AM, John Hearns
> wrote:

Said,
a problem like this always has a simple cause. We share
your frustration, and several people here have offered help.
So please do not get discouraged. We have all been in your
situation!

The only way to handle problems like this is
a) start at the beginning and read the manuals and webpages
closely
b) start at the lowest level, ie here the network and do NOT
assume that any component is working
c) look at all the log files closely
d) start daemon processes in a terminal with any 'verbose'
flags set
 

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread John Hearns
Said,
thank you for letting us know.
I'm going to blame this one on systemd.  Just because I can.


On 6 July 2017 at 13:22, Said Mohamed Said  wrote:

> John and Others,
>
>
> Thank you very much for your support. The problem is finally solved.
>
>
> After installing nmap, I realized that some ports were blocked even
> with the firewall daemon stopped and disabled. It turned out that iptables was on
> and enabled. After stopping iptables everything works just fine.
>
>
>
> Best Regards,
>
>
> Said.
> --
> *From:* John Hearns 
> *Sent:* Thursday, July 6, 2017 6:47:48 PM
>
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> Said, you are not out of ideas.
>
> I would suggest 'nmap' as a good tool to start with.   Install nmap on
> your compute node and see which ports are open on the controller node
>
> Also do we have a DNS name resolution problem here?
> I always remember Sun Grid Engine as being notoriously sensitive to name
> resolution, and that was my first question when any SGE problem was
> reported.
> So a couple of questions:
>
> On the controller node and on the compute node run this:
> hostname
> hostname -f
>
> Do the cluster controller node or the compute nodes have more than one
> network interface.
> I bet the cluster controller node does!   From the compute node, do an
> nslookup or a dig and see what the COMPUTE NODE thinks are the names of
> both of those interfaces.
>
> Also as Rajul says - how are you making sure that both controller and
> compute nodes have the same slurm.conf file
> Actually if the slurm.conf files are different this will be logged when
> the compute node starts up, but let us check everything.
>
>
>
>
>
>
>
>
>
> On 6 July 2017 at 11:37, Said Mohamed Said  wrote:
>
>> Even after reinstalling everything from the beginning the problem is
>> still there. Right now I am out of Ideas.
>>
>>
>>
>>
>> Best Regards,
>>
>>
>> Said.
>> --
>> *From:* Said Mohamed Said
>> *Sent:* Thursday, July 6, 2017 2:23:05 PM
>> *To:* slurm-dev
>> *Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>>
>> Thank you all for your suggestions, the only thing I can do for now is to
>> uninstall and install from the beginning and I will use the most recent
>> version of slurm on both nodes.
>>
>> For Felix who asked, the OS is CentOS 7.3 on both machines.
>>
>> I will let you know if that can solve the issue.
>> --
>> *From:* Rajul Kumar 
>> *Sent:* Thursday, July 6, 2017 12:41:51 AM
>> *To:* slurm-dev
>> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>>
>> Sorry for the typo
>> It's generally when one of the controller or compute can reach the other
>> one but it's *not* happening vice-versa.
>>
>>
>> On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar 
>> wrote:
>>
>>> I came across the same problem sometime back. It's generally when one of
>>> the controller or compute can reach to other one but it's happening
>>> vice-versa.
>>>
>>> Have a look at the following points:
>>> - controller and compute can ping to each other
>>> - both share the same slurm.conf
>>> - slurm.conf has the location of both controller and compute
>>> - slurm services are running on the compute node when the controller
>>> says it's down
>>> - TCP connections are not being dropped
>>> - Ports are accessible that are to be used for communication,
>>> specifically response ports
>>> - Check the routing rules if any
>>> - Clocks are synced across
>>> - Hope there isn't any version mismatch but still have a look (doesn't
>>> recognize the nodes for major version differences)
>>>
>>> Hope this helps.
>>>
>>> Best,
>>> Rajul
>>>
>>> On Wed, Jul 5, 2017 at 10:52 AM, John Hearns 
>>> wrote:
>>>
 Said,
a problem like this always has a simple cause. We share your
 frustration, and several people here have offered help.
 So please do not get discouraged. We have all been in your situation!

 The only way to handle problems like this is
 a) start at the beginning and read the manuals and webpages closely
 b) start at the lowest level, ie here the network and do NOT assume
 that any component is working
 c) look at all the log files closely
 d) start daemon processes in a terminal with any 'verbose' flags set
 e) then start on more low-level diagnostics, such as tcpdump of network
 adapters and straces of the processes and gstacks


 you have been doing steps a b and c very well
 I suggest staying with these - I myself am going for Adam Huffman's
 suggestion of the NTP clock times.
 Are you SURE that on all nodes you have run the 'date' command and also
 'ntpq -p'
 Are you SURE the master node and the node OBU-N6   are both connecting
 to an NTP server?   ntpq -p will tell you that


 

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Said Mohamed Said
John and Others,


Thank you very much for your support. The problem is finally solved.


After installing nmap, I realized that some ports were blocked even with the 
firewall daemon stopped and disabled. It turned out that iptables was on and 
enabled. After stopping iptables everything works just fine.



Best Regards,


Said.


From: John Hearns 
Sent: Thursday, July 6, 2017 6:47:48 PM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP

Said, you are not out of ideas.

I would suggest 'nmap' as a good tool to start with.   Install nmap on your 
compute node and see which ports are open on the controller node

Also do we have a DNS name resolution problem here?
I always remember Sun Grid Engine as being notoriously sensitive to name 
resolution, and that was my first question when any SGE problem was reported.
So a couple of questions:

On the controller node and on the compute node run this:
hostname
hostname -f

Do the cluster controller node or the compute nodes have more than one network 
interface.
I bet the cluster controller node does!   From the compute node, do an nslookup 
or a dig and see what the COMPUTE NODE thinks are the names of both of those 
interfaces.

Also as Rajul says - how are you making sure that both controller and compute 
nodes have the same slurm.conf file
Actually if the slurm.conf files are different this will be logged when the 
compute node starts up, but let us check everything.









On 6 July 2017 at 11:37, Said Mohamed Said 
> wrote:

Even after reinstalling everything from the beginning the problem is still 
there. Right now I am out of Ideas.




Best Regards,


Said.


From: Said Mohamed Said
Sent: Thursday, July 6, 2017 2:23:05 PM
To: slurm-dev
Subject: Re: [slurm-dev] Re: SLURM ERROR! NEED HELP


Thank you all for your suggestions, the only thing I can do for now is to 
uninstall and install from the beginning and I will use the most recent version 
of slurm on both nodes.

For Felix who asked, the OS is CentOS 7.3 on both machines.

I will let you know if that can solve the issue.

From: Rajul Kumar >
Sent: Thursday, July 6, 2017 12:41:51 AM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP

Sorry for the typo
It's generally when one of the controller or compute can reach the other one 
but it's not happening vice-versa.


On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar 
> wrote:
I came across the same problem sometime back. It's generally when one of the 
controller or compute can reach to other one but it's happening vice-versa.

Have a look at the following points:
- controller and compute can ping to each other
- both share the same slurm.conf
- slurm.conf has the location of both controller and compute
- slurm services are running on the compute node when the controller says it's 
down
- TCP connections are not being dropped
- Ports are accessible that are to be used for communication, specifically 
response ports
- Check the routing rules if any
- Clocks are synced across
- Hope there isn't any version mismatch but still have a look (doesn't 
recognize the nodes for major version differences)

Hope this helps.

Best,
Rajul

On Wed, Jul 5, 2017 at 10:52 AM, John Hearns 
> wrote:
Said,
   a problem like this always has a simple cause. We share your frustration, 
and several people here have offered help.
So please do not get discouraged. We have all been in your situation!

The only way to handle problems like this is
a) start at the beginning and read the manuals and webpages closely
b) start at the lowest level, ie here the network and do NOT assume that any 
component is working
c) look at all the log files closely
d) start daemon processes in a terminal with any 'verbose' flags set
e) then start on more low-level diagnostics, such as tcpdump of network 
adapters and straces of the processes and gstacks


you have been doing steps a b and c very well
I suggest staying with these - I myself am going for Adam Huffman's suggestion 
of the NTP clock times.
Are you SURE that on all nodes you have run the 'date' command and also 'ntpq 
-p'
Are you SURE the master node and the node OBU-N6   are both connecting to an 
NTP server?   ntpq -p will tell you that


And do not lose heart.  This is how we all learn.

















On 5 July 2017 at 16:23, Said Mohamed Said 
> wrote:
Sinfo -R gives "NODE IS NOT RESPONDING"
ping gives successful results from both nodes

I really can not figure out what is causing the problem.

Regards,
Said

From: Felix Willenborg 

[slurm-dev] Re: slurm accounting schema

2017-07-06 Thread Emyr James


After doing a dump of the database I can see that there are no triggers 
set up. I can't see any Slurm-related cron jobs on the slurmdb or 
slurmdbd machine. How is the data for the 
_assoc_usage__table generated? Can someone 
point me to a place in the GitHub source relevant to this? I'd like to 
see the query used to populate the table from the data in the job and 
step tables.



On 06/07/2017 10:11, Emyr James wrote:


Hi,
I'm trying to put together some reports using the slurm accounting 
database.
For some reason, some jobs are showing up with an association id of 0 
even though there is a valid user id in the job table. Consequently, 
the hour, day and monthly summary tables are not showing the right 
amount of cpu seconds against each user. Due to issues with missing 
associations I'm looking at generating reports directly from a join 
between _job_table and _step_table but the numbers 
are not adding up.


MariaDB [slurm_acct_db]> select 
sum(sys_sec+sys_usec+user_sec+user_usec),id_user, id_assoc

-> from umic_step_table s ,umic_job_table j
-> where j.job_db_inx=s.job_db_inx
-> group by id_user,id_assoc
-> order by 1 desc;
+------------------------------------------+---------+----------+
| sum(sys_sec+sys_usec+user_sec+user_usec) | id_user | id_assoc |
+------------------------------------------+---------+----------+
|                               6493170358 |   20006 |        9 |
|                                  9857479 |     999 |        0 |
|                                  4407923 |   20021 |        0 |
|                                  4372063 |   20022 |        0 |
|                                  3580520 |   20019 |        0 |
|                                  3482616 |   20020 |        0 |
|                                   803045 |   20013 |       10 |
|                                   103101 |   20013 |        8 |
|                                    41168 |   20018 |       19 |
|                                    26423 |   20017 |       18 |
|                                    24744 |   20026 |       22 |
|                                        0 |   20022 |       21 |
+------------------------------------------+---------+----------+

The user with the most cpu time has a valid id_assoc for every job in 
the job table. I also know for a fact that this user stopped running 
jobs a while ago, so all the above jobs should have made it into the 
(hour/day/month) summary tables.


If I do a sum over the _assoc_usage_hour_table, I don't get 
the same answer as above for this user, and the sum for the id_assoc 
above does not equal the value in the NULL/NULL row below...


MariaDB [slurm_acct_db]> select  a.acct, a.user,sum(u.alloc_secs)
-> from _assoc_usage_hour_table as u
-> left join _assoc_table a
-> on a.id_assoc=u.id
-> where u.id_tres=1
-> group by a.acct, a.user
-> order by 3 desc ;
+-------+-------+-------------------+
| acct  | user  | sum(u.alloc_secs) |
+-------+-------+-------------------+
| acct1 | user1 |        5784629904 |
| NULL  | NULL  |            301693 |
| acct2 | user2 |             83248 |
| acct3 | user3 |              1167 |
| acct4 | user4 |                20 |
| acct5 | user5 |                 3 |
+-------+-------+-------------------+

Am I doing something wrong here, or is there something I'm missing? What 
do sys_usec and user_usec represent? Do the summary tables look at 
walltime for each host rather than the CPU seconds used in the step 
table? Can anyone point me to detailed documentation on how the 
accounting DB schema hangs together?


Many thanks,

Emyr







--
The Wellcome Trust Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 


[slurm-dev] Length of possible SlurmDBD without HA

2017-07-06 Thread Loris Bennett

Hi,

On the Slurm FAQ page

  https://slurm.schedmd.com/faq.html

it says the following:

  52. How critical is configuring high availability for my database?

  Consider if you really need mysql failover. Short outage of
  slurmdbd is not a problem, because slurmctld will store all data
  in memory and send it to slurmdbd when it's back operating. The
  slurmctld daemon will also cache all user limits and fair share
  information.

I was wondering how long a "short outage" can be.  Presumably this is
determined by the amount of free memory on the server running slurmctld,
the number of jobs, and the amount of memory required per job.

So roughly how much memory will be required per job?

Cheers,

Loris

-- 
Dr. Loris Bennett (Mr.)
ZEDAT, Freie Universität Berlin Email loris.benn...@fu-berlin.de


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread John Hearns
Said, you are not out of ideas.

I would suggest 'nmap' as a good tool to start with.   Install nmap on your
compute node and see which ports are open on the controller node.
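
For example (a sketch only; 6817 and 6818 are the default slurmctld and slurmd
ports, and the host names are placeholders):

# from the compute node, check the controller's slurmctld port
nmap -p 6817 controller-node
# from the controller, check slurmd on the compute node
nmap -p 6818 compute-node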

Also do we have a DNS name resolution problem here?
I always remember Sun Grid Engine as being notoriously sensitive to name
resolution, and that was my first question when any SGE problem was
reported.
So a couple of questions:

On the controller node and on the compute node run this:
hostname
hostname -f

Do the cluster controller node or the compute nodes have more than one
network interface?
I bet the cluster controller node does!   From the compute node, do an
nslookup or a dig and see what the COMPUTE NODE thinks are the names of
both of those interfaces.

Also as Rajul says - how are you making sure that both controller and
compute nodes have the same slurm.conf file?
Actually if the slurm.conf files are different this will be logged when the
compute node starts up, but let us check everything.









On 6 July 2017 at 11:37, Said Mohamed Said  wrote:

> Even after reinstalling everything from the beginning the problem is still
> there. Right now I am out of Ideas.
>
>
>
>
> Best Regards,
>
>
> Said.
> --
> *From:* Said Mohamed Said
> *Sent:* Thursday, July 6, 2017 2:23:05 PM
> *To:* slurm-dev
> *Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP
>
>
> Thank you all for your suggestions, the only thing I can do for now is to
> uninstall and install from the beginning and I will use the most recent
> version of slurm on both nodes.
>
> For Felix who asked, the OS is CentOS 7.3 on both machines.
>
> I will let you know if that can solve the issue.
> --
> *From:* Rajul Kumar 
> *Sent:* Thursday, July 6, 2017 12:41:51 AM
> *To:* slurm-dev
> *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
>
> Sorry for the typo
> It's generally when one of the controller or compute can reach the other
> one but it's *not* happening vice-versa.
>
>
> On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar 
> wrote:
>
>> I came across the same problem sometime back. It's generally when one of
>> the controller or compute can reach to other one but it's happening
>> vice-versa.
>>
>> Have a look at the following points:
>> - controller and compute can ping to each other
>> - both share the same slurm.conf
>> - slurm.conf has the location of both controller and compute
>> - slurm services are running on the compute node when the controller says
>> it's down
>> - TCP connections are not being dropped
>> - Ports are accessible that are to be used for communication,
>> specifically response ports
>> - Check the routing rules if any
>> - Clocks are synced across
>> - Hope there isn't any version mismatch but still have a look (doesn't
>> recognize the nodes for major version differences)
>>
>> Hope this helps.
>>
>> Best,
>> Rajul
>>
>> On Wed, Jul 5, 2017 at 10:52 AM, John Hearns 
>> wrote:
>>
>>> Said,
>>>a problem like this always has a simple cause. We share your
>>> frustration, and several people here have offered help.
>>> So please do not get discouraged. We have all been in your situation!
>>>
>>> The only way to handle problems like this is
>>> a) start at the beginning and read the manuals and webpages closely
>>> b) start at the lowest level, ie here the network and do NOT assume that
>>> any component is working
>>> c) look at all the log files closely
>>> d) start daemon processes in a terminal with any 'verbose' flags set
>>> e) then start on more low-level diagnostics, such as tcpdump of network
>>> adapters and straces of the processes and gstacks
>>>
>>>
>>> you have been doing steps a b and c very well
>>> I suggest staying with these - I myself am going for Adam Huffman's
>>> suggestion of the NTP clock times.
>>> Are you SURE that on all nodes you have run the 'date' command and also
>>> 'ntpq -p'
>>> Are you SURE the master node and the node OBU-N6   are both connecting
>>> to an NTP server?   ntpq -p will tell you that
>>>
>>>
>>> And do not lose heart.  This is how we all learn.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> On 5 July 2017 at 16:23, Said Mohamed Said  wrote:
>>>
 Sinfo -R gives "NODE IS NOT RESPONDING"
 ping gives successful results from both nodes

 I really can not figure out what is causing the problem.

 Regards,
 Said
 --
 *From:* Felix Willenborg 
 *Sent:* Wednesday, July 5, 2017 9:07:05 PM

 *To:* slurm-dev
 *Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP

 When the nodes change to the down state, what is 'sinfo -R' saying?
 Sometimes it gives you a reason for that.

 Best,
 Felix

 On 05.07.2017 at 13:16, Said Mohamed Said wrote:

 Thank you Adam, For NTP I did that as well before posting but it didn't fix the issue.

[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Ole Holm Nielsen


Have you followed my Wiki for installing Slurm on CentOS 7?
This has worked for us: https://wiki.fysik.dtu.dk/niflheim/SLURM

If your problems are caused by your network setup, then it's almost 
impossible for external people to help you...


/Ole

On 07/06/2017 11:38 AM, Said Mohamed Said wrote:
Even after reinstalling everything from the beginning the problem is 
still there. Right now I am out of Ideas.





Best Regards,


Said.


*From:* Said Mohamed Said
*Sent:* Thursday, July 6, 2017 2:23:05 PM
*To:* slurm-dev
*Subject:* Re: [slurm-dev] Re: SLURM ERROR! NEED HELP

Thank you all for your suggestions, the only thing I can do for now is 
to uninstall and install from the beginning and I will use the most 
recent version of slurm on both nodes.


For Felix who asked, the OS is CentOS 7.3 on both machines.

I will let you know if that can solve the issue.

*From:* Rajul Kumar 
*Sent:* Thursday, July 6, 2017 12:41:51 AM
*To:* slurm-dev
*Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
Sorry for the typo
It's generally when one of the controller or compute can reach the other 
one but it's *not* happening vice-versa.



On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar > wrote:


I came across the same problem sometime back. It's generally when
one of the controller or compute can reach to other one but it's
happening vice-versa.

Have a look at the following points:
- controller and compute can ping to each other
- both share the same slurm.conf
- slurm.conf has the location of both controller and compute
- slurm services are running on the compute node when the controller
says it's down
- TCP connections are not being dropped
- Ports are accessible that are to be used for communication,
specifically response ports
- Check the routing rules if any
- Clocks are synced across
- Hope there isn't any version mismatch but still have a look
(doesn't recognize the nodes for major version differences)

Hope this helps.

Best,
Rajul

On Wed, Jul 5, 2017 at 10:52 AM, John Hearns > wrote:

Said,
a problem like this always has a simple cause. We share your
frustration, and several people here have offered help.
So please do not get discouraged. We have all been in your
situation!

The only way to handle problems like this is
a) start at the beginning and read the manuals and webpages closely
b) start at the lowest level, ie here the network and do NOT
assume that any component is working
c) look at all the log files closely
d) start daemon processes in a terminal with any 'verbose'
flags set
e) then start on more low-level diagnostics, such as tcpdump of
network adapters and straces of the processes and gstacks


you have been doing steps a b and c very well
I suggest staying with these - I myself am going for Adam
Huffman's suggestion of the NTP clock times.
Are you SURE that on all nodes you have run the 'date' command
and also 'ntpq -p'
Are you SURE the master node and the node OBU-N6   are both
connecting to an NTP server?   ntpq -p will tell you that


And do not lose heart.  This is how we all learn.

















On 5 July 2017 at 16:23, Said Mohamed Said > wrote:

Sinfo -R gives "NODE IS NOT RESPONDING"
ping gives successful results from both nodes

I really can not figure out what is causing the problem.

Regards,
Said


*From:* Felix Willenborg >
*Sent:* Wednesday, July 5, 2017 9:07:05 PM

*To:* slurm-dev
*Subject:* [slurm-dev] Re: SLURM ERROR! NEED HELP
When the nodes change to the down state, what is 'sinfo -R'
saying? Sometimes it gives you a reason for that.

Best,
Felix

On 05.07.2017 at 13:16, Said Mohamed Said wrote:

Thank you Adam. For NTP, I did that as well before posting
but it didn't fix the issue.

Regards,
Said


*From:* Adam Huffman 

*Sent:* Wednesday, July 5, 2017 8:11:03 PM
*To:* slurm-dev
*Subject:* 

[slurm-dev] Re: Need for recompiling openmpi built with --with-pmi?

2017-07-06 Thread Ole Holm Nielsen


I'd like a second revival of this thread!  The full thread is available 
at 
https://groups.google.com/forum/#!msg/slurm-devel/oDoHPoAbiPQ/q9pQL2Uw3y0J


We're in the process of upgrading Slurm from 16.05 to 17.02.  I'd like 
to be certain that our MPI libraries don't require a specific library 
version such as libslurm.so.30.  See the thread's example "$ readelf -d 
libmca_common_pmi.so":

 0x0001 (NEEDED) Shared library: [libslurm.so.27]

Question: Can anyone suggest which OpenMPI libraries I have to go 
through with readelf in order to make sure we don't have the 
libslurm.so.xx problem?
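
For what it's worth, a small sketch of the check I have in mind (the OpenMPI 
prefix is a placeholder for wherever EasyBuild installed it):

OMPI_PREFIX=/apps/easybuild/software/OpenMPI   # placeholder path
find "$OMPI_PREFIX" -name '*.so*' -type f | while read -r lib; do
    # flag any shared object with a NEEDED entry pointing at libslurm
    if readelf -d "$lib" 2>/dev/null | grep -q 'NEEDED.*libslurm'; then
        echo "$lib"
        readelf -d "$lib" | grep libslurm
    fi
done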


The libmca_common_pmi.so file doesn't exist on our systems.  We have 
OpenMPI 1.10.3 and 2.0.2 installed with EasyBuild.


Our builds of OpenMPI were done on top of a Slurm 16.05 base, and our 
build hosts do **not** have the lib64/libpmi2.la and lib64/libpmi.la 
which cause problems.  According to the above thread, these files were 
removed from the slurm-devel RPM package starting from Slurm 16.05.  So 
I hope that we're good...


I expect the consequences of having an undetected libslurm.so.xx problem 
would be that all MPI jobs would start crashing :-(


Thanks for your help,
Ole

On 02/04/2016 11:26 PM, Kilian Cavalotti wrote:

Hi all,

I would like to revive this old thread, as we've been bitten by this
also when moving from 14.11 to 15.08.

On Mon, Oct 5, 2015 at 4:38 AM, Bjørn-Helge Mevik  wrote:

We have verified that we can compile openmpi (1.8.6) against slurm
14.03.7 (with the .la files removed), and then upgrade slurm to 15.08.0
without having to recompile openmpi.

My understanding of linking and libraries is not very thorough,
unfortunately, but according to

https://lists.fedoraproject.org/pipermail/mingw/2012-January/004421.html

the .la files are only needed in order to link against static libraries,
and since Slurm doesn't provide any static libraries, I guess it would
be safe for the slurm-devel rpm not to include these files.


I think the link above describes the situation pretty well. Could we
please remove the .la files from the slurm-devel RPM if they don't
serve any specific purpose?
The attached patch to slurm.spec worked for me.


[slurm-dev] Re: SLURM ERROR! NEED HELP

2017-07-06 Thread Said Mohamed Said
Even after reinstalling everything from the beginning the problem is still 
there. Right now I am out of Ideas.




Best Regards,


Said.


From: Said Mohamed Said
Sent: Thursday, July 6, 2017 2:23:05 PM
To: slurm-dev
Subject: Re: [slurm-dev] Re: SLURM ERROR! NEED HELP


Thank you all for your suggestions, the only thing I can do for now is to 
uninstall and install from the beginning and I will use the most recent version 
of slurm on both nodes.

For Felix who asked, the OS is CentOS 7.3 on both machines.

I will let you know if that can solve the issue.

From: Rajul Kumar 
Sent: Thursday, July 6, 2017 12:41:51 AM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP

Sorry for the typo
It's generally when one of the controller or compute can reach the other one 
but it's not happening vice-versa.


On Wed, Jul 5, 2017 at 11:38 AM, Rajul Kumar 
> wrote:
I came across the same problem sometime back. It's generally when one of the 
controller or compute can reach to other one but it's happening vice-versa.

Have a look at the following points:
- controller and compute can ping to each other
- both share the same slurm.conf
- slurm.conf has the location of both controller and compute
- slurm services are running on the compute node when the controller says it's 
down
- TCP connections are not being dropped
- Ports are accessible that are to be used for communication, specifically 
response ports
- Check the routing rules if any
- Clocks are synced across
- Hope there isn't any version mismatch but still have a look (doesn't 
recognize the nodes for major version differences)

Hope this helps.

Best,
Rajul

On Wed, Jul 5, 2017 at 10:52 AM, John Hearns 
> wrote:
Said,
   a problem like this always has a simple cause. We share your frustration, 
and several people here have offered help.
So please do not get discouraged. We have all been in your situation!

The only way to handle problems like this is
a) start at the beginning and read the manuals and webpages closely
b) start at the lowest level, ie here the network and do NOT assume that any 
component is working
c) look at all the log files closely
d) start daemon processes in a terminal with any 'verbose' flags set
e) then start on more low-level diagnostics, such as tcpdump of network 
adapters and straces of the processes and gstacks


you have been doing steps a b and c very well
I suggest staying with these - I myself am going for Adam Huffman's suggestion 
of the NTP clock times.
Are you SURE that on all nodes you have run the 'date' command and also 'ntpq 
-p'
Are you SURE the master node and the node OBU-N6   are both connecting to an 
NTP server?   ntpq -p will tell you that


And do not lose heart.  This is how we all learn.

















On 5 July 2017 at 16:23, Said Mohamed Said 
> wrote:
Sinfo -R gives "NODE IS NOT RESPONDING"
ping gives successful results from both nodes

I really can not figure out what is causing the problem.

Regards,
Said

From: Felix Willenborg 
>
Sent: Wednesday, July 5, 2017 9:07:05 PM

To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP

When the nodes change to the down state, what is 'sinfo -R' saying? Sometimes 
it gives you a reason for that.

Best,
Felix

On 05.07.2017 at 13:16, Said Mohamed Said wrote:
Thank you Adam. For NTP, I did that as well before posting but it didn't fix the 
issue.

Regards,
Said

From: Adam Huffman 
Sent: Wednesday, July 5, 2017 8:11:03 PM
To: slurm-dev
Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP


I've seen something similar when node clocks were skewed.

Worth checking that NTP is running and they're all synchronised.

On Wed, Jul 5, 2017 at 12:06 PM, Said Mohamed Said 
 wrote:
> Thank you all for suggestions. I turned off firewall on both machines but
> still no luck. I can confirm that No managed switch is preventing the nodes
> from communicating. If you check the log file, there is communication for
> about 4mins and then the node state goes down.
> Any other idea?
> 
> From: Ole Holm Nielsen 
> 
> Sent: Wednesday, July 5, 2017 7:07:15 PM
> To: slurm-dev
> Subject: [slurm-dev] Re: SLURM ERROR! NEED HELP
>
>
> On 07/05/2017 11:40 AM, Felix Willenborg wrote:
>> in my network I encountered that managed switches were preventing
>> necessary network communication between the nodes, on which SLURM
>> relies. You should check if you're using managed switches to connect
>> nodes to the 

[slurm-dev] slurm accounting schema

2017-07-06 Thread Emyr James


Hi,
I'm trying to put together some reports using the slurm accounting database.
For some reason, some jobs are showing up with an association id of 0 
even though there is a valid user id in the job table. Consequently, the 
hour, day and monthly summary tables are not showing the right amount of 
cpu seconds against each user. Due to issues with missing associations 
I'm looking at generating reports directly from a join between 
_job_table and _step_table but the numbers are not 
adding up.


MariaDB [slurm_acct_db]> select 
sum(sys_sec+sys_usec+user_sec+user_usec),id_user, id_assoc

-> from umic_step_table s ,umic_job_table j
-> where j.job_db_inx=s.job_db_inx
-> group by id_user,id_assoc
-> order by 1 desc;
+------------------------------------------+---------+----------+
| sum(sys_sec+sys_usec+user_sec+user_usec) | id_user | id_assoc |
+------------------------------------------+---------+----------+
|                               6493170358 |   20006 |        9 |
|                                  9857479 |     999 |        0 |
|                                  4407923 |   20021 |        0 |
|                                  4372063 |   20022 |        0 |
|                                  3580520 |   20019 |        0 |
|                                  3482616 |   20020 |        0 |
|                                   803045 |   20013 |       10 |
|                                   103101 |   20013 |        8 |
|                                    41168 |   20018 |       19 |
|                                    26423 |   20017 |       18 |
|                                    24744 |   20026 |       22 |
|                                        0 |   20022 |       21 |
+------------------------------------------+---------+----------+

The user with the most cpu time has a valid id_assoc for every job in 
the job table. I also know for a fact that this user stopped running 
jobs a while ago, so all the above jobs should have made it into the 
(hour/day/month) summary tables.


If I do a sum over the _assoc_usage_hour_table, I don't get the 
same answer as above for this user, and the sum for the id_assoc above 
does not equal the value in the NULL/NULL row below...


MariaDB [slurm_acct_db]> select  a.acct, a.user,sum(u.alloc_secs)
-> from _assoc_usage_hour_table as u
-> left join _assoc_table a
-> on a.id_assoc=u.id
-> where u.id_tres=1
-> group by a.acct, a.user
-> order by 3 desc ;
+-------+-------+-------------------+
| acct  | user  | sum(u.alloc_secs) |
+-------+-------+-------------------+
| acct1 | user1 |        5784629904 |
| NULL  | NULL  |            301693 |
| acct2 | user2 |             83248 |
| acct3 | user3 |              1167 |
| acct4 | user4 |                20 |
| acct5 | user5 |                 3 |
+-------+-------+-------------------+

Am I doing something wrong here, or is there something I'm missing? What do 
sys_usec and user_usec represent? Do the summary tables look at 
walltime for each host rather than the CPU seconds used in the step 
table? Can anyone point me to detailed documentation on how the 
accounting DB schema hangs together?
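
One more observation, offered as an assumption rather than something confirmed 
from the schema documentation: sys_usec and user_usec look like the microsecond 
remainders of the CPU times (the usual rusage split), so adding them to the 
*_sec columns as whole seconds inflates the totals. A per-user CPU-time sum 
along these lines avoids that particular distortion:

MariaDB [slurm_acct_db]> select sum(sys_sec + user_sec + (sys_usec + user_usec)/1000000) as cpu_secs,
-> id_user, id_assoc
-> from umic_step_table s, umic_job_table j
-> where j.job_db_inx = s.job_db_inx
-> group by id_user, id_assoc
-> order by 1 desc;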


Many thanks,

Emyr



--
The Wellcome Trust Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE.