On 02/28/13 09:18, Antonio Messina wrote:
> On Thu, Feb 28, 2013 at 5:44 PM, Danny Auble <[email protected]> wrote:
>> It is not a wise idea to update in a piecemeal fashion such as this. While
> I agree, but first of all this is a testbed, so I don't care much
> about it. Moreover, the update was supposed to be managed by
> cfengine, but I happened to test in the middle of the upgrade, while
> the server was already upgraded and the clients weren't, so I caught
> the bug.
>
>> the slurmctld is down I would update the slurmd's first and then update the
>> slurmctld (always update the slurmdbd first though). I don't think a 2.5
>> slurmctld would work with a 2.3 slurmd, so I am guessing what happened
>> couldn't have been avoided.
> I think the daemon should be able to eat everything the client sends
> it, even if it's just garbage, so IMHO this is a bug, since it exposes
> you to a DoS attack.

If you send a patch we will incorporate it into the source.
Thanks,
Danny

>
>> On 02/28/13 08:40, Antonio Messina wrote:
>>> Just to let you know that I've updated to 2.5.3 and now the
>>> ``--test-only`` option works.
>>>
>>> On a side note, I'm not sure if this is on some documentation page,
>>> but I ran into some trouble upgrading. While the master was running
>>> version 2.5.3 and the clients were still on 2.3.4, the slurmctld
>>> daemon was dying with SIGSEGV. The problem is in job_mgr.c: there is
>>> a memcpy() with a NULL pointer as source. The crash went away once
>>> all the clients were upgraded.
>>>
>>> I'm attaching the dump of the gdb session. If you need more
>>> information, let me know.
>>>
>>>
>>> root@slurm:/tmp/slurm-llnl-2.5.3# gdb src/slurmctld/slurmctld
>>> GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
>>> Copyright (C) 2012 Free Software Foundation, Inc.
>>> License GPLv3+: GNU GPL version 3 or later
>>> <http://gnu.org/licenses/gpl.html>
>>> This is free software: you are free to change and redistribute it.
>>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
>>> and "show warranty" for details.
>>> This GDB was configured as "x86_64-linux-gnu".
>>> For bug reporting instructions, please see:
>>> <http://bugs.launchpad.net/gdb-linaro/>...
>>> Reading symbols from /tmp/slurm-llnl-2.5.3/src/slurmctld/slurmctld...done.
>>> (gdb) args
>>> Undefined command: "args". Try "help".
>>> (gdb) set args -D
>>> (gdb) r
>>> Starting program: /tmp/slurm-llnl-2.5.3/src/slurmctld/slurmctld -D
>>> [Thread debugging using libthread_db enabled]
>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>> slurmctld: pidfile not locked, assuming no running daemon
>>> slurmctld: error: Configured MailProg is invalid
>>> slurmctld: Job accounting information stored, but details not gathered
>>> slurmctld: Accounting storage FileTxt plugin loaded
>>> slurmctld: slurmctld version 2.5.3 started on cluster gc3cluster
>>> slurmctld: Munge cryptographic signature plugin loaded
>>> slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 17
>>> slurmctld: preempt/none loaded
>>> slurmctld: Checkpoint plugin loaded: checkpoint/none
>>> slurmctld: Job accounting gather NOT_INVOKED plugin loaded
>>> slurmctld: switch NONE plugin loaded
>>> slurmctld: topology NONE plugin loaded
>>> slurmctld: sched: Backfill scheduler plugin loaded
>>> [New Thread 0x7ffff7f80700 (LWP 13646)]
>>> slurmctld: error: Could not open node state file /var/lib/slurm-llnl/slurmctld/node_state: No such file or directory
>>> slurmctld: error: NOTE: Trying backup state save file. Information may be lost!
>>> slurmctld: No node state file (/var/lib/slurm-llnl/slurmctld/node_state.old) to recover
>>> slurmctld: error: Incomplete node data checkpoint file
>>> slurmctld: Recovered state of 0 nodes
>>> slurmctld: error: Could not open job state file /var/lib/slurm-llnl/slurmctld/job_state: No such file or directory
>>> slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
>>> slurmctld: No job state file (/var/lib/slurm-llnl/slurmctld/job_state.old) to recover
>>> slurmctld: cons_res: select_p_node_init
>>> slurmctld: cons_res: preparing for 2 partitions
>>> slurmctld: error: Could not open reservation state file /var/lib/slurm-llnl/slurmctld/resv_state: No such file or directory
>>> slurmctld: error: NOTE: Trying backup state save file. Reservations may be lost
>>> slurmctld: No reservation state file (/var/lib/slurm-llnl/slurmctld/resv_state.old) to recover
>>> slurmctld: Recovered state of 0 reservations
>>> slurmctld: error: Could not open trigger state file /var/lib/slurm-llnl/slurmctld/trigger_state: No such file or directory
>>> slurmctld: error: NOTE: Trying backup state save file. Triggers may be lost!
>>> slurmctld: No trigger state file (/var/lib/slurm-llnl/slurmctld/trigger_state.old) to recover
>>> slurmctld: error: Incomplete trigger data checkpoint file
>>> slurmctld: State of 0 triggers recovered
>>> slurmctld: read_slurm_conf: backup_controller not specified.
>>> slurmctld: Reinitializing job accounting state
>>> slurmctld: cons_res: select_p_reconfigure
>>> slurmctld: cons_res: select_p_node_init
>>> slurmctld: cons_res: preparing for 2 partitions
>>> slurmctld: Running as primary controller
>>> [New Thread 0x7ffff5ba9700 (LWP 13653)]
>>> [New Thread 0x7ffff5aa8700 (LWP 13654)]
>>> [New Thread 0x7ffff59a7700 (LWP 13655)]
>>> [New Thread 0x7ffff58a6700 (LWP 13656)]
>>> [Thread 0x7ffff58a6700 (LWP 13656) exited]
>>> [New Thread 0x7ffff558f700 (LWP 13657)]
>>> slurmctld: auth plugin for Munge (http://code.google.com/p/munge/) loaded
>>> [Thread 0x7ffff558f700 (LWP 13657) exited]
>>> [New Thread 0x7ffff558f700 (LWP 13658)]
>>> [Thread 0x7ffff558f700 (LWP 13658) exited]
>>> [New Thread 0x7ffff558f700 (LWP 13659)]
>>> [Thread 0x7ffff558f700 (LWP 13659) exited]
>>> [New Thread 0x7ffff558f700 (LWP 13660)]
>>> [Thread 0x7ffff558f700 (LWP 13660) exited]
>>> [New Thread 0x7ffff558f700 (LWP 13661)]
>>> [Thread 0x7ffff558f700 (LWP 13661) exited]
>>> [New Thread 0x7ffff558f700 (LWP 13662)]
>>> [Thread 0x7ffff558f700 (LWP 13662) exited]
>>> [New Thread 0x7ffff558f700 (LWP 13663)]
>>> [Thread 0x7ffff558f700 (LWP 13663) exited]
>>> [New Thread 0x7ffff558f700 (LWP 13664)]
>>> [Thread 0x7ffff558f700 (LWP 13664) exited]
>>> [New Thread 0x7ffff558f700 (LWP 13676)]
>>>
>>> Program received signal SIGSEGV, Segmentation fault.
>>> [Switching to Thread 0x7ffff558f700 (LWP 13676)]
>>> 0x000000000044e894 in validate_jobs_on_node (reg_msg=0x7fffe8001a58)
>>>     at job_mgr.c:7885
>>> 7885            memcpy(node_ptr->energy, reg_msg->energy,
>>>                        sizeof(acct_gather_energy_t));
>>> (gdb) bt
>>> #0  0x000000000044e894 in validate_jobs_on_node (reg_msg=0x7fffe8001a58) at job_mgr.c:7885
>>> #1  0x0000000000475fdc in _slurm_rpc_node_registration (msg=0x7fffe8000f58) at proc_req.c:1940
>>> #2  0x000000000047138a in slurmctld_req (msg=0x7fffe8000f58) at proc_req.c:253
>>> #3  0x0000000000430de3 in _service_connection (arg=0x7ffff0000958) at controller.c:1022
>>> #4  0x00007ffff79c0e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
>>> #5  0x00007ffff76edcbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
>>> #6  0x0000000000000000 in ?? ()
>>> (gdb) list
>>> 7880                    error("slurmd registered on unknown node %s",
>>> 7881                          reg_msg->node_name);
>>> 7882                    return;
>>> 7883            }
>>> 7884
>>> 7885            memcpy(node_ptr->energy, reg_msg->energy,
>>>                        sizeof(acct_gather_energy_t));
>>> 7886
>>> 7887            if (node_ptr->up_time > reg_msg->up_time) {
>>> 7888                    verbose("Node %s rebooted %u secs ago",
>>> 7889                            reg_msg->node_name, reg_msg->up_time);
>>> (gdb) p node_ptr
>>> $1 = (struct node_record *) 0x80e2a8
>>> (gdb) p node_ptr->energy
>>> $2 = (acct_gather_energy_t *) 0x8109e8
>>> (gdb) p *node_ptr->energy
>>> $3 = {previous_consumed_energy = 0, base_consumed_energy = 0, base_watts = 0, consumed_energy = 0, current_watts = 0}
>>> (gdb) p reg_msg
>>> $4 = (slurm_node_registration_status_msg_t *) 0x7fffe8001a58
>>> (gdb) p reg_msg->energy
>>> $5 = (acct_gather_energy_t *) 0x0
>>>
>>>
>>> On Wed, Feb 27, 2013 at 7:27 PM, Antonio Messina
>>> <[email protected]> wrote:
>>>> On Wed, Feb 27, 2013 at 4:54 PM, Danny Auble <[email protected]> wrote:
>>>>> I would test with a more modern version, 2.5, and see if the
>>>>> problem still exists.
>>>>>
>>>>> Knowing your configuration would also help.
>>>> Attached is my slurm.conf file. We have just one frontend and a
>>>> bunch of worker nodes.
>>>>
>>>> .a.
>>>>
>>>>> Antonio Messina <[email protected]> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> In our test cluster, running slurm 2.3.4 (rebuilt Ubuntu
>>>>>> packages), we have the following issue: when running
>>>>>> "srun --test-only" it shows an incorrect date:
>>>>>>
>>>>>> antonio@slurm:~$ date
>>>>>> Wed Feb 27 16:36:45 CET 2013
>>>>>> antonio@slurm:~$ srun --test-only hostname
>>>>>> srun: Job 295 to start at 2064-03-13T01:01:52 using 1 processors on
>>>>>> node-08-01-07
>>>>>>
>>>>>> Please note that the cluster is empty, and if I remove the
>>>>>> ``--test-only`` option the job runs immediately. The current date
>>>>>> and time on the machine are also correct (ntp is running).
>>>>>>
>>>>>> .a.
>>>>
>>>>
>>>> --
>>>> [email protected]
>>>> GC3: Grid Computing Competence Center
>>>> http://www.gc3.uzh.ch/
>>>> University of Zurich
>>>> Winterthurerstrasse 190
>>>> CH-8057 Zurich Switzerland
