[slurm-users] gres/gpu: count changed for node node002 from 0 to 1
We're running slurm-17.11.12 on Bright Cluster 8.1, and our node002 keeps going into a draining state:

$ sinfo -a
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
defq*        up   infinite      1   drng node002

$ sinfo -N -o "%.20N %.15C %.10t %.10m %.15P %.15G %.35E"
            NODELIST   CPUS(A/I/O/T)      STATE     MEMORY       PARTITION            GRES                              REASON
             node001       9/15/0/24        mix     191800           defq*           gpu:1                                none
             node002       1/0/23/24       drng     191800           defq*           gpu:1 gres/gpu count changed and jobs are
             node003       1/23/0/24        mix     191800           defq*           gpu:1                                none

None of the nodes have a separate slurm.conf file; it's all shared from the head node. What else could be causing this?

From slurmctld.log:

[2020-03-13T07:14:28.590] gres/gpu: count changed for node node002 from 0 to 1
[2020-03-13T07:14:28.590] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2020-03-13T07:14:28.590] error: Node node001 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T07:14:28.590] error: Node node003 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T07:47:48.787] error: Node node001 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T07:47:48.787] error: Node node003 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T07:47:48.788] gres/gpu: count changed for node node002 from 0 to 1
[2020-03-13T07:47:48.788] error: _slurm_rpc_node_registration node=node002: Invalid argument
[2020-03-13T08:21:08.057] error: Node node001 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T08:21:08.058] error: Node node003 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
[2020-03-13T08:21:08.058] gres/gpu: count changed for node node002 from 0 to 1
[2020-03-13T08:21:08.058] error: _slurm_rpc_node_registration node=node002: Invalid argument
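[Editor's note: the "count changed for node node002 from 0 to 1" registration error generally means slurmctld's recorded GRES count for the node disagrees with what node002's slurmd reports when it registers, i.e. slurm.conf's Gres= line and the node's gres.conf are out of sync. A minimal consistent pair of definitions, using the CPU and memory figures from the sinfo output above and a hypothetical NVIDIA device path, would look like:

```
# slurm.conf (shared from the head node); CPUs/RealMemory per the sinfo output
GresTypes=gpu
NodeName=node[001-003] CPUs=24 RealMemory=191800 Gres=gpu:1

# gres.conf, identical on every compute node; /dev/nvidia0 is an assumed path
NodeName=node[001-003] Name=gpu File=/dev/nvidia0
```

If gres.conf is missing or unreadable on node002, its slurmd registers with gpu:0 and the controller drains the node with exactly this message.]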
Re: [slurm-users] log rotation for slurmctld.
navin srivastava writes:

> can i move the log file to some other location and then restart/reload of
> slurm service will start a new log file.

Yes, restarting it will start a new log file if the old one is moved away. However, a reconfig will also do, and you can trigger that by sending the process a HUP signal. That way you don't have to restart the daemon. We have this in our logrotate file:

    postrotate
        ## Using the newer feature of reconfig when getting a SIGHUP.
        kill -HUP $(ps -C slurmctld h -o pid)
        kill -HUP $(ps -C slurmdbd h -o pid)
    endscript

(That is for both slurmctld.log and slurmdbd.log.)

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo
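[Editor's note: putting the two posts together, a complete logrotate stanza covering both daemons might look like the sketch below. The paths, the slurm user, and the use of pkill (instead of the ps pipeline above) are assumptions to adapt to your installation:

```
/var/log/slurm/slurmctld.log /var/log/slurm/slurmdbd.log {
    weekly
    missingok
    notifempty
    sharedscripts
    create 0600 slurm slurm
    rotate 8
    compress
    postrotate
        pkill -HUP slurmctld || true
        pkill -HUP slurmdbd || true
    endscript
}
```

With sharedscripts, postrotate runs once even though two logs are rotated.]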
[slurm-users] log rotation for slurmctld.
Hi,

I wanted to understand how log rotation of slurmctld works. In my environment there is no log rotation for slurmctld.log, and the file has now reached 125 GB.

Can I move the log file to some other location, so that a restart/reload of the slurm service will start a new log file? I think this should work without any issues; am I right, or will it create a problem?

Also, I need to set up logrotate. Will the config below work as-is? I need to do this on a production environment, so I'm asking to make sure it will work without any issue.

    /var/log/slurm/slurmctld.log {
        weekly
        missingok
        notifempty
        sharedscripts
        create 0600 slurm slurm
        rotate 8
        compress
        postrotate
            /bin/systemctl reload slurmctld.service > /dev/null 2>/dev/null || true
        endscript
    }

Regards
Navin.
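[Editor's note: on the first question (move the file, then reload), here is a generic, non-Slurm demonstration of why the reload step is needed. A process that holds its log open keeps writing to the renamed file through its open file descriptor until it is told to reopen the path:

```shell
#!/bin/sh
# A writer holds its log open; renaming the file does not redirect its
# output, so the renamed file keeps growing until the writer reopens
# the original path (which slurmctld does on SIGHUP/reconfig).
tmp=$(mktemp -d)
( exec >>"$tmp/app.log"; echo line1; sleep 1; echo line2 ) &
writer=$!
while [ ! -s "$tmp/app.log" ]; do :; done   # wait for the writer to start
mv "$tmp/app.log" "$tmp/app.log.1"          # "rotate" by renaming
wait "$writer"
wc -l < "$tmp/app.log.1"                    # both lines are in the renamed file
ls "$tmp"                                   # app.log was never recreated
```

So yes, moving the file away is safe, but without the reload/HUP the daemon would keep appending to the moved file forever.]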
Re: [slurm-users] Error upgrading slurmdbd from 19.05 to 20.02
Hi,

I think I found the problem. It seems to come from this file:

src/plugins/accounting_storage/mysql/as_mysql_convert.c

in particular from here:

--- code ---
static int _convert_job_table_pre(mysql_conn_t *mysql_conn,
				  char *cluster_name)
{
	int rc = SLURM_SUCCESS;
	char *query = NULL;

	if (db_curr_ver < 8) {
		/*
		 * Change the names pack_job_id and pack_job_offset to be het_*
		 */
		query = xstrdup_printf(
			"alter table \"%s_%s\" "
			"change pack_job_id het_job_id int unsigned not null, "
			"change pack_job_offset het_job_offset "
			"int unsigned not null;",
			cluster_name, job_table);
	}

	if (query) {
		if (debug_flags & DEBUG_FLAG_DB_QUERY)
			DB_DEBUG(mysql_conn->conn, "query\n%s", query);
		rc = mysql_db_query(mysql_conn, query);
		xfree(query);
		if (rc != SLURM_SUCCESS)
			error("%s: Can't convert %s_%s info: %m",
			      __func__, cluster_name, job_table);
	}

	return rc;
}
--- code ---

It checks whether the recorded schema version is below 8, and if so renames the pack_job_* columns to het_job_*. In my database the version is 7:

--- mysql ---
MariaDB [slurm_acct_db]> select * from convert_version_table;
+------------+---------+
| mod_time   | version |
+------------+---------+
| 1579853103 |       7 |
+------------+---------+
1 row in set (0.00 sec)
--- mysql ---

But my table already has the renamed columns:

--- table ---
MariaDB [slurm_acct_db]> show columns from `mpip-cluster_job_table`;
+--------------------+---------------------+------+-----+------------+----------------+
| Field              | Type                | Null | Key | Default    | Extra          |
+--------------------+---------------------+------+-----+------------+----------------+
| job_db_inx         | bigint(20) unsigned | NO   | PRI | NULL       | auto_increment |
| mod_time           | bigint(20) unsigned | NO   |     | 0          |                |
| deleted            | tinyint(4)          | NO   |     | 0          |                |
| account            | tinytext            | YES  |     | NULL       |                |
| admin_comment      | text                | YES  |     | NULL       |                |
| array_task_str     | text                | YES  |     | NULL       |                |
| array_max_tasks    | int(10) unsigned    | NO   |     | 0          |                |
| array_task_pending | int(10) unsigned    | NO   |     | 0          |                |
| constraints        | text                | YES  |     | NULL       |                |
| cpus_req           | int(10) unsigned    | NO   |     | NULL       |                |
| derived_ec         | int(10) unsigned    | NO   |     | 0          |                |
| derived_es         | text                | YES  |     | NULL       |                |
| exit_code          | int(10) unsigned    | NO   |     | 0          |                |
| flags              | int(10) unsigned    | NO   |     | 0          |                |
| job_name           | tinytext            | NO   |     | NULL       |                |
| id_assoc           | int(10) unsigned    | NO   | MUL | NULL       |                |
| id_array_job       | int(10) unsigned    | NO   | MUL | 0          |                |
| id_array_task      | int(10) unsigned    | NO   |     | 4294967294 |                |
| id_block           | tinytext            | YES  |     | NULL       |                |
| id_job             | int(10) unsigned    | NO   | MUL | NULL       |                |
| id_qos             | int(10) unsigned    | NO   | MUL | 0          |                |
| id_resv            | int(10) unsigned    | NO   | MUL | NULL       |                |
| id_wckey           | int(10) unsigned    | NO   | MUL | NULL       |                |
| id_user            | int(10) unsigned    | NO   | MUL | NULL       |                |
| id_group           | int(10) unsigned    | NO   |     | NULL       |                |
| het_job_id         | int(10) unsigned    | NO   | MUL | NULL       |                |
| het_job_offset     | int(10) unsigned    | NO   |     | NULL       |                |
| kill_requid        | int(11)             | NO   |     | -1         |                |
| state_reason_prev  | int(10) unsigned    | NO   |     | NULL       |                |
| mcs_label          | tinytext            | YES  |     | NULL       |                |
| mem_req            | bigint(20) unsigned | NO   |     | 0          |                |
| nodelist           | text                | YES  |     | NULL       |                |
| nodes_alloc        | int(10) unsigned    | NO   | MUL | NULL       |                |
| node_inx           | text                | YES  |     |
--- table ---
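[Editor's note: a quick way to confirm this mismatched state before the converter runs is to check which of the two column sets actually exists. This is a diagnostic-only sketch; the table name is taken from the post above and would differ for another cluster:

```
MariaDB [slurm_acct_db]> show columns from `mpip-cluster_job_table` like 'pack_job%';
MariaDB [slurm_acct_db]> show columns from `mpip-cluster_job_table` like 'het_job%';
```

If the first query returns no rows while the second returns two, the rename has already happened even though convert_version_table still says 7, and the converter's ALTER TABLE will fail exactly as described.]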
Re: [slurm-users] slurmd -C showing incorrect core count
From what I know of how this works, no, it's not getting it from a local file or the master node. I don't believe it even makes a network connection, nor requires a slurm.conf in order to run. If you can run it fresh on a node with no config and that's what it comes up with, it's probably getting it from the VM somehow.

--
|| \\UTGERS,      |---*O*---
||_// the State   | Ryan Novosielski - novos...@rutgers.edu
|| \\ University  | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\ of NJ      | Office of Advanced Research Computing - MSB C630, Newark
     `'

> On Mar 11, 2020, at 10:26 AM, mike tie wrote:
>
> Yep, slurmd -C is obviously getting the data from somewhere, either a local
> file or from the master node. Hence my email to the group; I was hoping
> that someone would just say: "yeah, modify file ". But oh well. I'll
> start playing with strace and gdb later this week; looking through the
> source might also be helpful.
>
> I'm not cloning existing virtual machines with slurm. I have access to a
> VMware system that from time to time isn't running at full capacity; usage
> is stable for blocks of a month or two at a time, so my thought/plan was to
> spin up a slurm compute node on it and resize it appropriately every few
> months (why not put it to work). I started with 10 cores, and it looks like
> I can up it to 16 cores for a while, and that's when I ran into the problem.
> -mike
>
> Michael Tie
> Technical Director
> Mathematics, Statistics, and Computer Science
>
> One North College Street    phn: 507-222-4067
> Northfield, MN 55057        cel: 952-212-8933
> m...@carleton.edu           fax: 507-222-4312
>
> On Wed, Mar 11, 2020 at 1:15 AM Kirill 'kkm' Katsnelson wrote:
> On Tue, Mar 10, 2020 at 1:41 PM mike tie wrote:
> Here is the output of lstopo
>
> $ lstopo -p
> Machine (63GB)
>   Package P#0 + L3 (16MB)
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#0
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#1
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#2
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#3
>   Package P#1 + L3 (16MB)
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#4
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#5
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#6
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#7
>   Package P#2 + L3 (16MB)
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#8
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#9
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#10
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#11
>   Package P#3 + L3 (16MB)
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#0 + PU P#12
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#1 + PU P#13
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#2 + PU P#14
>     L2 (4096KB) + L1d (32KB) + L1i (32KB) + Core P#3 + PU P#15
>
> There is no sane way to derive the number 10 from this topology, obviously:
> it has a prime factor of 5, but everything in the lstopo output is sized in
> powers of 2 (4 packages, a.k.a. sockets, 4 single-threaded CPU cores per).
>
> I responded yesterday but somehow managed to plop my signature into the
> middle of it, so maybe you have missed the inline replies?
>
> It's very, very likely that the number is stored *somewhere*. First to
> eliminate is the hypothesis that the number is acquired from the control
> daemon. That's the simplest step and the largest landgrab in the
> divide-and-conquer analysis plan. Then just look where it comes from on the
> VM. strace(1) will reveal all files slurmd reads.
>
> You are not rolling out the VMs from an image, are you? I'm wondering why
> you need to tweak an existing VM that is already in a weird state. Is simply
> setting its snapshot aside and creating a new one from an image
> hard/impossible? I did not touch VMware for more than 10 years, so I may be
> a bit naive; on the platform I'm working with now (GCE), the create-use-drop
> pattern of VM use is much more common and simpler than creating a VM and
> maintaining it either *ad infinitum* or *ad nauseam*, whichever is reached
> first. But I don't know anything about VMware; maybe it's not possible or
> feasible with it.
>
> -kkm
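[Editor's note: as a footnote to the strace suggestion above, slurmd -C derives its report from the hardware topology (via hwloc and the kernel), not from slurm.conf, so the raw CPU count can be cross-checked on the node with generic tools that likewise consult no Slurm configuration:

```shell
#!/bin/sh
# Cross-check the logical CPU count that slurmd -C should see.
# Both commands read the kernel's view of the hardware.
getconf _NPROCESSORS_ONLN            # online logical CPUs
grep -c '^processor' /proc/cpuinfo   # logical CPUs per the kernel's table
```

If these numbers already disagree with what slurmd -C prints, the discrepancy is below Slurm, in the VM's presented topology.]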