It is not a good idea to update in a piecemeal fashion like this. While the slurmctld is down, I would update the slurmds first and then update the slurmctld (always update the slurmdbd first, though). I don't think a 2.5 slurmctld would work with a 2.3 slurmd, so I'm guessing what happened couldn't have been avoided.
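For reference, the crash in the report quoted below comes from an unconditional memcpy() whose source, reg_msg->energy, a 2.3 slurmd never fills in. A guard of roughly the following shape would tolerate such a registration; this is only a sketch to illustrate the failure mode (the struct is a simplified stand-in for acct_gather_energy_t), not the actual upstream code:

#include <stdio.h>
#include <string.h>

/* Simplified stand-in for acct_gather_energy_t; the real struct
 * has more fields (base_watts, current_watts, ...). */
typedef struct {
	unsigned int consumed_energy;
} energy_t;

/* Copy energy data only when both pointers are valid, so a
 * registration from an older slurmd (src == NULL) is ignored
 * instead of crashing inside memcpy(). */
static void copy_node_energy(energy_t *dst, const energy_t *src)
{
	if ((dst == NULL) || (src == NULL))
		return;
	memcpy(dst, src, sizeof(*dst));
}

int main(void)
{
	energy_t node_energy = { 0 };
	/* A 2.3 slurmd sends no energy data: no-op instead of SIGSEGV. */
	copy_node_energy(&node_energy, NULL);
	printf("consumed_energy = %u\n", node_energy.consumed_energy);
	return 0;
}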
On 02/28/13 08:40, Antonio Messina wrote:
> Just to let you know that I've updated to 2.5.3 and now the
> ``--test-only`` option works.
>
> On a side note, I'm not sure if this is on some documentation page,
> but I had a few troubles upgrading. While the master was running
> version 2.5.3 and the clients still 2.3.4, the slurmctld daemon was
> dying with SIGSEGV. The problem was in job_mgr.c, there is a memcpy()
> with a NULL pointer as source. It has been fixed after upgrading all
> the clients.
>
> I'm attaching the dump of the gdb session. If you need other infos let
> me know.
>
> root@slurm:/tmp/slurm-llnl-2.5.3# gdb src/slurmctld/slurmctld
> GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
> Copyright (C) 2012 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> For bug reporting instructions, please see:
> <http://bugs.launchpad.net/gdb-linaro/>...
> Reading symbols from /tmp/slurm-llnl-2.5.3/src/slurmctld/slurmctld...done.
> (gdb) args
> Undefined command: "args". Try "help".
> (gdb) set args -D
> (gdb) r
> Starting program: /tmp/slurm-llnl-2.5.3/src/slurmctld/slurmctld -D
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> slurmctld: pidfile not locked, assuming no running daemon
> slurmctld: error: Configured MailProg is invalid
> slurmctld: Job accounting information stored, but details not gathered
> slurmctld: Accounting storage FileTxt plugin loaded
> slurmctld: slurmctld version 2.5.3 started on cluster gc3cluster
> slurmctld: Munge cryptographic signature plugin loaded
> slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 17
> slurmctld: preempt/none loaded
> slurmctld: Checkpoint plugin loaded: checkpoint/none
> slurmctld: Job accounting gather NOT_INVOKED plugin loaded
> slurmctld: switch NONE plugin loaded
> slurmctld: topology NONE plugin loaded
> slurmctld: sched: Backfill scheduler plugin loaded
> [New Thread 0x7ffff7f80700 (LWP 13646)]
> slurmctld: error: Could not open node state file /var/lib/slurm-llnl/slurmctld/node_state: No such file or directory
> slurmctld: error: NOTE: Trying backup state save file. Information may be lost!
> slurmctld: No node state file (/var/lib/slurm-llnl/slurmctld/node_state.old) to recover
> slurmctld: error: Incomplete node data checkpoint file
> slurmctld: Recovered state of 0 nodes
> slurmctld: error: Could not open job state file /var/lib/slurm-llnl/slurmctld/job_state: No such file or directory
> slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
> slurmctld: No job state file (/var/lib/slurm-llnl/slurmctld/job_state.old) to recover
> slurmctld: cons_res: select_p_node_init
> slurmctld: cons_res: preparing for 2 partitions
> slurmctld: error: Could not open reservation state file /var/lib/slurm-llnl/slurmctld/resv_state: No such file or directory
> slurmctld: error: NOTE: Trying backup state save file. Reservations may be lost
> slurmctld: No reservation state file (/var/lib/slurm-llnl/slurmctld/resv_state.old) to recover
> slurmctld: Recovered state of 0 reservations
> slurmctld: error: Could not open trigger state file /var/lib/slurm-llnl/slurmctld/trigger_state: No such file or directory
> slurmctld: error: NOTE: Trying backup state save file. Triggers may be lost!
> slurmctld: No trigger state file (/var/lib/slurm-llnl/slurmctld/trigger_state.old) to recover
> slurmctld: error: Incomplete trigger data checkpoint file
> slurmctld: State of 0 triggers recovered
> slurmctld: read_slurm_conf: backup_controller not specified.
> slurmctld: Reinitializing job accounting state
> slurmctld: cons_res: select_p_reconfigure
> slurmctld: cons_res: select_p_node_init
> slurmctld: cons_res: preparing for 2 partitions
> slurmctld: Running as primary controller
> [New Thread 0x7ffff5ba9700 (LWP 13653)]
> [New Thread 0x7ffff5aa8700 (LWP 13654)]
> [New Thread 0x7ffff59a7700 (LWP 13655)]
> [New Thread 0x7ffff58a6700 (LWP 13656)]
> [Thread 0x7ffff58a6700 (LWP 13656) exited]
> [New Thread 0x7ffff558f700 (LWP 13657)]
> slurmctld: auth plugin for Munge (http://code.google.com/p/munge/) loaded
> [Thread 0x7ffff558f700 (LWP 13657) exited]
> [New Thread 0x7ffff558f700 (LWP 13658)]
> [Thread 0x7ffff558f700 (LWP 13658) exited]
> [New Thread 0x7ffff558f700 (LWP 13659)]
> [Thread 0x7ffff558f700 (LWP 13659) exited]
> [New Thread 0x7ffff558f700 (LWP 13660)]
> [Thread 0x7ffff558f700 (LWP 13660) exited]
> [New Thread 0x7ffff558f700 (LWP 13661)]
> [Thread 0x7ffff558f700 (LWP 13661) exited]
> [New Thread 0x7ffff558f700 (LWP 13662)]
> [Thread 0x7ffff558f700 (LWP 13662) exited]
> [New Thread 0x7ffff558f700 (LWP 13663)]
> [Thread 0x7ffff558f700 (LWP 13663) exited]
> [New Thread 0x7ffff558f700 (LWP 13664)]
> [Thread 0x7ffff558f700 (LWP 13664) exited]
> [New Thread 0x7ffff558f700 (LWP 13676)]
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff558f700 (LWP 13676)]
> 0x000000000044e894 in validate_jobs_on_node (reg_msg=0x7fffe8001a58) at job_mgr.c:7885
> 7885            memcpy(node_ptr->energy, reg_msg->energy, sizeof(acct_gather_energy_t));
> (gdb) bt
> #0  0x000000000044e894 in validate_jobs_on_node (reg_msg=0x7fffe8001a58) at job_mgr.c:7885
> #1  0x0000000000475fdc in _slurm_rpc_node_registration (msg=0x7fffe8000f58) at proc_req.c:1940
> #2  0x000000000047138a in slurmctld_req (msg=0x7fffe8000f58) at proc_req.c:253
> #3  0x0000000000430de3 in _service_connection (arg=0x7ffff0000958) at controller.c:1022
> #4  0x00007ffff79c0e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
> #5  0x00007ffff76edcbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) list
> 7880                    error("slurmd registered on unknown node %s",
> 7881                          reg_msg->node_name);
> 7882                    return;
> 7883            }
> 7884
> 7885            memcpy(node_ptr->energy, reg_msg->energy, sizeof(acct_gather_energy_t));
> 7886
> 7887            if (node_ptr->up_time > reg_msg->up_time) {
> 7888                    verbose("Node %s rebooted %u secs ago",
> 7889                            reg_msg->node_name, reg_msg->up_time);
> (gdb) p node_ptr
> $1 = (struct node_record *) 0x80e2a8
> (gdb) p node_ptr->energy
> $2 = (acct_gather_energy_t *) 0x8109e8
> (gdb) p *node_ptr->energy
> $3 = {previous_consumed_energy = 0, base_consumed_energy = 0, base_watts = 0, consumed_energy = 0, current_watts = 0}
> (gdb) p reg_msg
> $4 = (slurm_node_registration_status_msg_t *) 0x7fffe8001a58
> (gdb) p reg_msg->energy
> $5 = (acct_gather_energy_t *) 0x0
>
> On Wed, Feb 27, 2013 at 7:27 PM, Antonio Messina
> <[email protected]> wrote:
>> On Wed, Feb 27, 2013 at 4:54 PM, Danny Auble <[email protected]> wrote:
>>> I would test with a more modern version, 2.5, and see if the problem
>>> still exists.
>>>
>>> Knowing your configuration would also help.
>> In attach, my slurm.conf file. We have just one frontend and a bunch
>> of worker nodes.
>>
>> .a.
>>
>>> Antonio Messina <[email protected]> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> In our test cluster, running slurm 2.3.4 (rebuilt ubuntu packages) we
>>>> have the following issue: when running "srun --test-only" it shows
>>>> incorrect dates:
>>>>
>>>> antonio@slurm:~$ date
>>>> Wed Feb 27 16:36:45 CET 2013
>>>> antonio@slurm:~$ srun --test-only hostname
>>>> srun: Job 295 to start at 2064-03-13T01:01:52 using 1 processors on node-08-01-07
>>>>
>>>> Please note that the cluster is empty and if I remove the
>>>> ``--test-only`` option the job will run instantaneously. The current
>>>> date&time on the machine is also correct (ntp is running).
>>>>
>>>> .a.
>>
>> --
>> [email protected]
>> GC3: Grid Computing Competence Center
>> http://www.gc3.uzh.ch/
>> University of Zurich
>> Winterthurerstrasse 190
>> CH-8057 Zurich Switzerland
