Have you added the cluster to the database?

Something like "sacctmgr add cluster CLUSTER_NAME", where CLUSTER_NAME
matches the ClusterName set in your slurm.conf.
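
In case it helps, here is a rough sketch of how I would work out the exact commands. It only reads ClusterName from slurm.conf and prints the commands to run by hand; it does not touch the database. The helper names (cluster_name_from_conf, print_registration_steps) are made up for this mail, the /etc/slurm-llnl/slurm.conf path is my assumption for the Ubuntu packages, and it assumes the key is spelled exactly "ClusterName". sacctmgr talks to slurmdbd, so slurmdbd has to be running when you execute the printed commands.

```shell
#!/bin/sh
# Sketch only: derive the sacctmgr commands from slurm.conf and print them.
# Nothing here is executed against slurmdbd or MySQL/MariaDB.

# cluster_name_from_conf FILE
# Prints the value of the last uncommented ClusterName= line (may be empty).
cluster_name_from_conf() {
    sed -n 's/^[[:space:]]*ClusterName=\([^[:space:]#]*\).*/\1/p' "$1" | tail -n 1
}

# print_registration_steps FILE
# Prints the commands to run by hand, or fails if no ClusterName is set.
print_registration_steps() {
    name=$(cluster_name_from_conf "$1")
    if [ -z "$name" ]; then
        echo "no ClusterName found in $1" >&2
        return 1
    fi
    echo "sacctmgr show cluster"          # check whether it is already registered
    echo "sacctmgr add cluster $name"     # register it if it is missing
    echo "systemctl restart slurmctld"    # then retry the controller
}

# Assumed config path for the Ubuntu packages; adjust for your install:
# print_registration_steps /etc/slurm-llnl/slurm.conf
```

If "sacctmgr show cluster" already lists your cluster, then the missing registration is probably not the cause and the slurmdbd log would be the next place to look.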

On Wed, Aug 24, 2016 at 11:04 AM, Bancal Samuel wrote:

>
> Hi,
>
> Thanks for your quick answer.
>
> In fact NodeName=DEFAULT is not the server's hostname, but matches all
> subsequent nodes defined ( http://slurm.schedmd.com/slurm.conf.html ).
> The server's hostname is "our-slurm-master". Here is the /etc/hosts (which
> I think is correct):
>
> root@our-slurm-master:~# cat /etc/hosts
> 127.0.0.1    localhost
> 123.234.1.2  our-slurm-master.epfl.ch    our-slurm-master
>
> # The following lines are desirable for IPv6 capable hosts
> ::1     localhost ip6-localhost ip6-loopback
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
>
> I checked /var/run/slurm-llnl/ ... it was automatically created and
> belongs to slurm:slurm 755.
> Also /var/log/slurm-llnl/ was automatically created and belongs to
> slurm:slurm 755.
>
> The NodeName and PartitionName part of the slurm.conf is an exact copy of
> the previous install (slurm 2.6.1), in which we didn't declare the
> master node as a compute node. Is that still possible?
>
> The complete error message is:
>
> ● slurmctld.service - Slurm controller daemon
>    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor
> preset: enabled)
>    Active: failed (Result: exit-code) since Wed 2016-08-24 09:18:58 CEST;
> 1h 30min ago
>   Process: 46742 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS
> (code=exited, status=0/SUCCESS)
>  Main PID: 46746 (code=exited, status=1/FAILURE)
>
> Aug 24 09:18:58 our-slurm-master systemd[1]: Starting Slurm controller
> daemon...
> Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: PID file
> /var/run/slurm-llnl/slurmctld.pid not readable (yet?) after start:
> Aug 24 09:18:58 our-slurm-master systemd[1]: Started Slurm controller
> daemon.
> Aug 24 09:18:58 our-slurm-master slurmctld[46746]: fatal: You are running
> with a database but for some reason we have no TRES from it.  This should
> only happen if the database is down and you don't have any state files.
> Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: Main
> process exited, code=exited, status=1/FAILURE
> Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: Unit
> entered failed state.
> Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: Failed
> with result 'exit-code'.
>
> Sorry for cutting it.
>
> The thing is that the DB is up:
>
> root@our-slurm-master:~# netstat -ntaupe | grep 3306
> tcp        0      0 127.0.0.1:3306 0.0.0.0:*               LISTEN
> 115        1223659 40389/mysqld
>
> I even tried:
>
> root@our-slurm-master:~# mysql -u slurm -p slurm_acct_db
> Enter password:
> Reading table information for completion of table and column names
> You can turn off this feature to get a quicker startup with -A
>
> Welcome to the MariaDB monitor.  Commands end with ; or \g.
> Your MariaDB connection id is 38
> Server version: 10.0.25-MariaDB-0ubuntu0.16.04.1 Ubuntu 16.04
>
> Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.
>
> Type 'help;' or '\h' for help. Type '\c' to clear the current input
> statement.
>
> MariaDB [slurm_acct_db]> show tables;
> +-------------------------+
> | Tables_in_slurm_acct_db |
> +-------------------------+
> | acct_coord_table        |
> | acct_table              |
> | clus_res_table          |
> | cluster_table           |
> | qos_table               |
> | res_table               |
> | table_defs_table        |
> | tres_table              |
> | txn_table               |
> | user_table              |
> +-------------------------+
> 10 rows in set (0.00 sec)
>
> MariaDB [slurm_acct_db]> Bye
>
> Regards,
> Samuel
>
>
>
> On 24. 08. 16 10:06, Raymond Wan wrote:
>
>> Hi Samuel,
>>
>>
>> On Wed, Aug 24, 2016 at 3:44 PM, Bancal Samuel wrote:
>>
>>> We're trying to set up Slurm on an Ubuntu 16.04 server.
>>> I've attached the steps we followed for the setup.
>>>
>>
>> It's been a long time since I installed Slurm on our Ubuntu 16.04
>> system.  I'm not sure if I remember everything I did...
>>
>>
>>> Aug 24 09:17:38 our-slurm-master systemd[1]: Starting Slurm node daemon...
>>> Aug 24 09:17:38 our-slurm-master slurmd[46263]: fatal: Unable to
>>> determine
>>> this slurmd's NodeName
>>> Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Control
>>> process
>>> exited, code=exited status=1
>>> Aug 24 09:17:38 our-slurm-master systemd[1]: Failed to start Slurm node
>>> daemon.
>>> Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Unit entered
>>> failed state.
>>> Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Failed with
>>> result 'exit-code'.
>>>
>>
>> In your attachment, I noticed a line that says:
>>
>> NodeName=Default ..
>>
>> Is that correct?  I'm just surprised someone would call their server
>> "default".  (Of course, you can do that...  :-) )
>>
>> One thing I remember is that the node names you have in the COMPUTE
>> NODES section should match the names in your /etc/hosts file.  When I
>> had the error above, I think this was the problem that I had...
>>
>>
>>> Aug 24 09:18:58 our-slurm-master systemd[1]: Starting Slurm controller
>>> daemon...
>>> Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: PID file
>>> /var/run/slurm-llnl/slurmctld.pid not readable (yet?) after start:
>>>
>>
>> Despite using the standard Slurm packages for Ubuntu, I had to do a
>> few things manually.  One of them was creating directories such as
>> /var/run/slurm-llnl and making sure that the permissions of the
>> directory were correct.  I think the owner had to be the Slurm user
>> ("slurm" in my case).
>>
>> I went through a loop a few times: I tried to start Slurm, and it
>> complained about the permissions, or even the existence, of the PID or
>> log directory.  I created it, and then it moved on to the next error.
>> A bit tedious...but eventually, it did stop complaining.
>>
>>
>>> Aug 24 09:18:58 our-slurm-master systemd[1]: Started Slurm controller
>>> daemon.
>>> Aug 24 09:18:58 our-slurm-master slurmctld[46746]: fatal: You are running
>>> with a database but for some reason we have no TRES from it.  Thi
>>>
>>
>> I don't know what this error means, but perhaps you can copy the rest
>> of it and maybe I (or someone else) might have an idea.
>>
>> Good luck!
>>
>> Ray
>>
>
> --
> Samuel Bancal
> ENAC-IT
> EPFL
>



-- 
Carles Fenoy
