On Thu, Feb 28, 2013 at 5:44 PM, Danny Auble <[email protected]> wrote:
> It is not a wise idea to update in a piecemeal fashion such as this. While

I agree, but first of all this is a testbed, so I don't care much
about it. Moreover, the update was supposed to be managed by cfengine;
I just happened to test in the middle of the upgrade, while the server
was already upgraded and the clients weren't, which is how I caught
the bug.

> the slurmctld is down I would update the slurmd's first and then update the
> slurmctld (Always update the slurmdbd first, though). I don't think a 2.5
> slurmctld would work with a 2.3 slurmd, I am guessing what happened couldn't
> have been avoided.

I think the daemon should be able to eat everything the client sends
it, even if it's just garbage, so IMHO this is a bug, since it exposes
you to a DoS attack.
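Something along these lines in validate_jobs_on_node() would have kept
the daemon up. This is only a sketch against the code visible in the
gdb session quoted below, not a tested patch, and I haven't checked
what the actual fix in 2.5.x looks like:

    /* Untested sketch: a 2.3 slurmd never sends the energy record,
     * so reg_msg->energy can arrive as NULL.  Guard the copy instead
     * of trusting the client. */
    if (reg_msg->energy)
        memcpy(node_ptr->energy, reg_msg->energy,
               sizeof(acct_gather_energy_t));
    else
        error("node registration from %s has no energy data",
              reg_msg->node_name);

More generally, anything unpacked from a slurmd running a different
version should probably be treated as untrusted input.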
> On 02/28/13 08:40, Antonio Messina wrote:
>>
>> Just to let you know that I've updated to 2.5.3 and now the
>> ``--test-only`` option works.
>>
>> On a side note, I'm not sure if this is on some documentation page,
>> but I had a few troubles upgrading. While the master was running
>> version 2.5.3 and the clients were still on 2.3.4, the slurmctld
>> daemon was dying with SIGSEGV. The problem is in job_mgr.c: there is
>> a memcpy() with a NULL pointer as source. It went away after
>> upgrading all the clients.
>>
>> I'm attaching the dump of the gdb session. If you need any other
>> info, let me know.
>>
>>
>> root@slurm:/tmp/slurm-llnl-2.5.3# gdb src/slurmctld/slurmctld
>> GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
>> Copyright (C) 2012 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later
>> <http://gnu.org/licenses/gpl.html>
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-linux-gnu".
>> For bug reporting instructions, please see:
>> <http://bugs.launchpad.net/gdb-linaro/>...
>> Reading symbols from /tmp/slurm-llnl-2.5.3/src/slurmctld/slurmctld...done.
>> (gdb) args
>> Undefined command: "args". Try "help".
>> (gdb) set args -D
>> (gdb) r
>> Starting program: /tmp/slurm-llnl-2.5.3/src/slurmctld/slurmctld -D
>> [Thread debugging using libthread_db enabled]
>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>> slurmctld: pidfile not locked, assuming no running daemon
>> slurmctld: error: Configured MailProg is invalid
>> slurmctld: Job accounting information stored, but details not gathered
>> slurmctld: Accounting storage FileTxt plugin loaded
>> slurmctld: slurmctld version 2.5.3 started on cluster gc3cluster
>> slurmctld: Munge cryptographic signature plugin loaded
>> slurmctld: Consumable Resources (CR) Node Selection plugin loaded with
>> argument 17
>> slurmctld: preempt/none loaded
>> slurmctld: Checkpoint plugin loaded: checkpoint/none
>> slurmctld: Job accounting gather NOT_INVOKED plugin loaded
>> slurmctld: switch NONE plugin loaded
>> slurmctld: topology NONE plugin loaded
>> slurmctld: sched: Backfill scheduler plugin loaded
>> [New Thread 0x7ffff7f80700 (LWP 13646)]
>> slurmctld: error: Could not open node state file
>> /var/lib/slurm-llnl/slurmctld/node_state: No such file or directory
>> slurmctld: error: NOTE: Trying backup state save file. Information may be
>> lost!
>> slurmctld: No node state file
>> (/var/lib/slurm-llnl/slurmctld/node_state.old) to recover
>> slurmctld: error: Incomplete node data checkpoint file
>> slurmctld: Recovered state of 0 nodes
>> slurmctld: error: Could not open job state file
>> /var/lib/slurm-llnl/slurmctld/job_state: No such file or directory
>> slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
>> slurmctld: No job state file
>> (/var/lib/slurm-llnl/slurmctld/job_state.old) to recover
>> slurmctld: cons_res: select_p_node_init
>> slurmctld: cons_res: preparing for 2 partitions
>> slurmctld: error: Could not open reservation state file
>> /var/lib/slurm-llnl/slurmctld/resv_state: No such file or directory
>> slurmctld: error: NOTE: Trying backup state save file. Reservations may be
>> lost
>> slurmctld: No reservation state file
>> (/var/lib/slurm-llnl/slurmctld/resv_state.old) to recover
>> slurmctld: Recovered state of 0 reservations
>> slurmctld: error: Could not open trigger state file
>> /var/lib/slurm-llnl/slurmctld/trigger_state: No such file or directory
>> slurmctld: error: NOTE: Trying backup state save file. Triggers may be
>> lost!
>> slurmctld: No trigger state file
>> (/var/lib/slurm-llnl/slurmctld/trigger_state.old) to recover
>> slurmctld: error: Incomplete trigger data checkpoint file
>> slurmctld: State of 0 triggers recovered
>> slurmctld: read_slurm_conf: backup_controller not specified.
>> slurmctld: Reinitializing job accounting state
>> slurmctld: cons_res: select_p_reconfigure
>> slurmctld: cons_res: select_p_node_init
>> slurmctld: cons_res: preparing for 2 partitions
>> slurmctld: Running as primary controller
>> [New Thread 0x7ffff5ba9700 (LWP 13653)]
>> [New Thread 0x7ffff5aa8700 (LWP 13654)]
>> [New Thread 0x7ffff59a7700 (LWP 13655)]
>> [New Thread 0x7ffff58a6700 (LWP 13656)]
>> [Thread 0x7ffff58a6700 (LWP 13656) exited]
>> [New Thread 0x7ffff558f700 (LWP 13657)]
>> slurmctld: auth plugin for Munge (http://code.google.com/p/munge/) loaded
>> [Thread 0x7ffff558f700 (LWP 13657) exited]
>> [New Thread 0x7ffff558f700 (LWP 13658)]
>> [Thread 0x7ffff558f700 (LWP 13658) exited]
>> [New Thread 0x7ffff558f700 (LWP 13659)]
>> [Thread 0x7ffff558f700 (LWP 13659) exited]
>> [New Thread 0x7ffff558f700 (LWP 13660)]
>> [Thread 0x7ffff558f700 (LWP 13660) exited]
>> [New Thread 0x7ffff558f700 (LWP 13661)]
>> [Thread 0x7ffff558f700 (LWP 13661) exited]
>> [New Thread 0x7ffff558f700 (LWP 13662)]
>> [Thread 0x7ffff558f700 (LWP 13662) exited]
>> [New Thread 0x7ffff558f700 (LWP 13663)]
>> [Thread 0x7ffff558f700 (LWP 13663) exited]
>> [New Thread 0x7ffff558f700 (LWP 13664)]
>> [Thread 0x7ffff558f700 (LWP 13664) exited]
>> [New Thread 0x7ffff558f700 (LWP 13676)]
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> [Switching to Thread 0x7ffff558f700 (LWP 13676)]
>> 0x000000000044e894 in validate_jobs_on_node (reg_msg=0x7fffe8001a58)
>>     at job_mgr.c:7885
>> 7885            memcpy(node_ptr->energy, reg_msg->energy,
>>                        sizeof(acct_gather_energy_t));
>> (gdb) bt
>> #0  0x000000000044e894 in validate_jobs_on_node
>>     (reg_msg=0x7fffe8001a58) at job_mgr.c:7885
>> #1  0x0000000000475fdc in _slurm_rpc_node_registration
>>     (msg=0x7fffe8000f58) at proc_req.c:1940
>> #2  0x000000000047138a in slurmctld_req (msg=0x7fffe8000f58) at
>>     proc_req.c:253
>> #3  0x0000000000430de3 in _service_connection (arg=0x7ffff0000958) at
>>     controller.c:1022
>> #4  0x00007ffff79c0e9a in start_thread () from
>>     /lib/x86_64-linux-gnu/libpthread.so.0
>> #5  0x00007ffff76edcbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
>> #6  0x0000000000000000 in ?? ()
>> (gdb) list
>> 7880                    error("slurmd registered on unknown node %s",
>> 7881                          reg_msg->node_name);
>> 7882                    return;
>> 7883            }
>> 7884
>> 7885            memcpy(node_ptr->energy, reg_msg->energy,
>>                        sizeof(acct_gather_energy_t));
>> 7886
>> 7887            if (node_ptr->up_time > reg_msg->up_time) {
>> 7888                    verbose("Node %s rebooted %u secs ago",
>> 7889                            reg_msg->node_name, reg_msg->up_time);
>> (gdb) p node_ptr
>> $1 = (struct node_record *) 0x80e2a8
>> (gdb) p node_ptr->energy
>> $2 = (acct_gather_energy_t *) 0x8109e8
>> (gdb) p *node_ptr->energy
>> $3 = {previous_consumed_energy = 0, base_consumed_energy = 0,
>>   base_watts = 0, consumed_energy = 0, current_watts = 0}
>> (gdb) p reg_msg
>> $4 = (slurm_node_registration_status_msg_t *) 0x7fffe8001a58
>> (gdb) p reg_msg->energy
>> $5 = (acct_gather_energy_t *) 0x0
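The $5 above is the whole story, I guess: a 2.3 slurmd doesn't send an
energy record at all (energy accounting only appeared in 2.5), so the
unpack code on the 2.5 slurmctld leaves reg_msg->energy NULL, and the
memcpy() then faults. Purely as an illustration of what I mean by
defensive handling, the unpack side could substitute an empty record
for a missing one. This is a made-up sketch, not the actual
slurm_protocol_pack.c code; xmalloc() is SLURM's zero-filling
allocator, the other names are invented:

    /* Hypothetical sketch, not the real SLURM unpack code (these
     * names are made up): when the peer speaks an older protocol
     * version, hand back a zeroed energy record rather than leaving
     * a NULL pointer behind. */
    if (remote_version >= PROTOCOL_VERSION_2_5)
        unpack_energy_record(&msg->energy, buffer);
    else
        msg->energy = xmalloc(sizeof(acct_gather_energy_t)); /* zero-filled */

Either that, or the receiving end checks for NULL as in the sketch
earlier in this mail; both would keep a mixed-version cluster from
killing the controller.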
>> On Wed, Feb 27, 2013 at 7:27 PM, Antonio Messina
>> <[email protected]> wrote:
>>>
>>> On Wed, Feb 27, 2013 at 4:54 PM, Danny Auble <[email protected]> wrote:
>>>>
>>>> I would test with a more modern version, 2.5, and see if the problem
>>>> still exists.
>>>>
>>>> Knowing your configuration would also help.
>>>
>>> Attached is my slurm.conf file. We have just one frontend and a bunch
>>> of worker nodes.
>>>
>>> .a.
>>>
>>>> Antonio Messina <[email protected]> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> In our test cluster, running slurm 2.3.4 (rebuilt Ubuntu packages),
>>>>> we have the following issue: when running "srun --test-only" it
>>>>> shows an incorrect start date:
>>>>>
>>>>> antonio@slurm:~$ date
>>>>> Wed Feb 27 16:36:45 CET 2013
>>>>> antonio@slurm:~$ srun --test-only hostname
>>>>> srun: Job 295 to start at 2064-03-13T01:01:52 using 1 processors on
>>>>> node-08-01-07
>>>>>
>>>>> Please note that the cluster is empty, and if I remove the
>>>>> ``--test-only`` option the job runs immediately. The current date
>>>>> and time on the machine are also correct (ntp is running).
>>>>>
>>>>> .a.
>>>
>>> --
>>> [email protected]
>>> GC3: Grid Computing Competence Center
>>> http://www.gc3.uzh.ch/
>>> University of Zurich
>>> Winterthurerstrasse 190
>>> CH-8057 Zurich Switzerland

--
[email protected]
GC3: Grid Computing Competence Center
http://www.gc3.uzh.ch/
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich Switzerland
