Have you added the cluster to the database? Something like: "sacctmgr add cluster CLUSTER_NAME"
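
In case it helps, here is a rough sketch of how the cluster name could be read from slurm.conf and turned into that command. The name given to sacctmgr should match the ClusterName value in slurm.conf. The two helper functions are hypothetical names made up for this example, and the config path in the comments is an assumption based on the slurm-llnl directories mentioned in this thread. The sketch only prints the command, it does not run it.

```shell
#!/bin/sh
# Sketch only: read ClusterName from a slurm.conf-style file and print the
# sacctmgr command that would register that cluster in the accounting DB.
# Both function names are hypothetical helpers, not part of Slurm.

# Print the value of the ClusterName= line (the last one wins if repeated).
cluster_name_from_conf() {
    sed -n 's/^[[:space:]]*ClusterName[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*/\1/p' "$1" | tail -n 1
}

# Print (do not execute) the registration command for the given config file.
print_add_cluster_cmd() {
    name=$(cluster_name_from_conf "$1")
    if [ -z "$name" ]; then
        echo "no ClusterName found in $1" >&2
        return 1
    fi
    # -i makes sacctmgr commit without asking for confirmation.
    echo "sacctmgr -i add cluster $name"
}

# Example usage (the path is an assumption, adjust to your install):
#   print_add_cluster_cmd /etc/slurm-llnl/slurm.conf
#   sacctmgr show cluster    # lists clusters already known to the database
```

If "sacctmgr show cluster" already lists your cluster, then this is probably not the cause and the problem lies elsewhere.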

On Wed, Aug 24, 2016 at 11:04 AM, Bancal Samuel <[email protected]> wrote:

> Hi,
>
> Thanks for your quick answer.
>
> In fact NodeName=DEFAULT is not the server's hostname, but matches all
> subsequent nodes defined ( http://slurm.schedmd.com/slurm.conf.html ).
> The server's hostname is "our-slurm-master". Here is the /etc/hosts (which
> I think is correct):
>
> root@our-slurm-master:~# cat /etc/hosts
> 127.0.0.1 localhost
> 123.234.1.2 our-slurm-master.epfl.ch our-slurm-master
>
> # The following lines are desirable for IPv6 capable hosts
> ::1 localhost ip6-localhost ip6-loopback
> ff02::1 ip6-allnodes
> ff02::2 ip6-allrouters
>
> I checked /var/run/slurm-llnl/ ... it was automatically created and
> belongs to slurm:slurm 755.
> Also /var/log/slurm-llnl/ was automatically created and belongs to
> slurm:slurm 755.
>
> The NodeName and PartitionName part of the slurm.conf is the exact copy of
> the previous install (slurm 2.6.1) ... in which we didn't declare the
> master node as a calculation node. Is it still possible?
>
> The complete error message is:
>
> ● slurmctld.service - Slurm controller daemon
>    Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
>    Active: failed (Result: exit-code) since Wed 2016-08-24 09:18:58 CEST; 1h 30min ago
>   Process: 46742 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
>  Main PID: 46746 (code=exited, status=1/FAILURE)
>
> Aug 24 09:18:58 our-slurm-master systemd[1]: Starting Slurm controller daemon...
> Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: PID file /var/run/slurm-llnl/slurmctld.pid not readable (yet?) after start:
> Aug 24 09:18:58 our-slurm-master systemd[1]: Started Slurm controller daemon.
> Aug 24 09:18:58 our-slurm-master slurmctld[46746]: fatal: You are running with a database but for some reason we have no TRES from it. This should only happen if the database is down and you don't have any state files.
> Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
> Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: Unit entered failed state.
> Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: Failed with result 'exit-code'.
>
> Sorry for cutting it.
>
> The thing is that the DB is up:
>
> root@our-slurm-master:~# netstat -ntaupe | grep 3306
> tcp 0 0 127.0.0.1:3306 0.0.0.0:* LISTEN 115 1223659 40389/mysqld
>
> And I even tried:
>
> root@our-slurm-master:~# mysql -u slurm -p slurm_acct_db
> Enter password:
> Reading table information for completion of table and column names
> You can turn off this feature to get a quicker startup with -A
>
> Welcome to the MariaDB monitor. Commands end with ; or \g.
> Your MariaDB connection id is 38
> Server version: 10.0.25-MariaDB-0ubuntu0.16.04.1 Ubuntu 16.04
>
> Copyright (c) 2000, 2016, Oracle, MariaDB Corporation Ab and others.
>
> Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
>
> MariaDB [slurm_acct_db]> show tables;
> +-------------------------+
> | Tables_in_slurm_acct_db |
> +-------------------------+
> | acct_coord_table        |
> | acct_table              |
> | clus_res_table          |
> | cluster_table           |
> | qos_table               |
> | res_table               |
> | table_defs_table        |
> | tres_table              |
> | txn_table               |
> | user_table              |
> +-------------------------+
> 10 rows in set (0.00 sec)
>
> MariaDB [slurm_acct_db]> Bye
>
> Regards,
> Samuel
>
> On 24. 08. 16 10:06, Raymond Wan wrote:
>
>> Hi Samuel,
>>
>> On Wed, Aug 24, 2016 at 3:44 PM, Bancal Samuel <[email protected]> wrote:
>>
>>> We're trying to set up Slurm on an Ubuntu 16.04 server.
>>> I attach the steps we did for the setup.
>>
>> It's been a long time since I installed Slurm on our Ubuntu 16.04
>> system. I'm not sure if I remember everything I did...
>>
>>> Aug 24 09:17:38 our-slurm-master systemd[1]: Starting Slurm node daemon...
>>> Aug 24 09:17:38 our-slurm-master slurmd[46263]: fatal: Unable to determine this slurmd's NodeName
>>> Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Control process exited, code=exited status=1
>>> Aug 24 09:17:38 our-slurm-master systemd[1]: Failed to start Slurm node daemon.
>>> Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Unit entered failed state.
>>> Aug 24 09:17:38 our-slurm-master systemd[1]: slurmd.service: Failed with result 'exit-code'.
>>
>> In your attachment, I noticed a line that says:
>>
>> NodeName=Default ..
>>
>> Is that correct? I'm just surprised someone would call their server
>> "default". (Of course, you can do that... :-) )
>>
>> One thing I remember is that the node names you have in the COMPUTE
>> NODES section should match the names in your /etc/hosts file. When I
>> had the error above, I think this was the problem that I had...
>>
>>> Aug 24 09:18:58 our-slurm-master systemd[1]: Starting Slurm controller daemon...
>>> Aug 24 09:18:58 our-slurm-master systemd[1]: slurmctld.service: PID file /var/run/slurm-llnl/slurmctld.pid not readable (yet?) after start:
>>
>> Despite using the standard Slurm packages for Ubuntu, I had to do a
>> few things manually. One of them was creating directories such as
>> /var/run/slurm-llnl and making sure that the permissions of the
>> directory were correct. I think the owner had to be the Slurm user
>> ("slurm" in my case).
>>
>> I did go through a loop a few times where I tried to start Slurm and it
>> complained about permissions or even the existence of the PID or log
>> directory. I created it, and then it moved to the next error. A bit
>> tedious...but eventually, it did stop complaining.
>>
>>> Aug 24 09:18:58 our-slurm-master systemd[1]: Started Slurm controller daemon.
>>> Aug 24 09:18:58 our-slurm-master slurmctld[46746]: fatal: You are running with a database but for some reason we have no TRES from it. Thi
>>
>> I don't know what this error means, but perhaps you can copy the rest
>> of it and maybe I (or someone else) might have an idea.
>>
>> Good luck!
>>
>> Ray
>
> --
> Samuel Bancal
> ENAC-IT
> EPFL

--
Carles Fenoy
