It is not a wise idea to update in a piecemeal fashion such as this.  
While the slurmctld is down, I would update the slurmd's first and then 
update the slurmctld (always update the slurmdbd first, though).  I don't 
think a 2.5 slurmctld would work with a 2.3 slurmd, so I am guessing what 
happened couldn't have been avoided.

On 02/28/13 08:40, Antonio Messina wrote:
> Just to let you know that I've updated to 2.5.3 and now the
> ``--test-only`` option works.
>
> On a side note, I'm not sure whether this is covered on some
> documentation page, but I had some trouble upgrading. While the master
> was running version 2.5.3 and the clients were still on 2.3.4, the
> slurmctld daemon kept dying with SIGSEGV. The problem is in job_mgr.c:
> there is a memcpy() with a NULL pointer as source. It went away after
> upgrading all the clients.
>
> I'm attaching the dump of the gdb session. If you need more info, let
> me know.
>
>
> root@slurm:/tmp/slurm-llnl-2.5.3# gdb src/slurmctld/slurmctld
> GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
> Copyright (C) 2012 Free Software Foundation, Inc.
> License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
> This is free software: you are free to change and redistribute it.
> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
> and "show warranty" for details.
> This GDB was configured as "x86_64-linux-gnu".
> For bug reporting instructions, please see:
> <http://bugs.launchpad.net/gdb-linaro/>...
> Reading symbols from /tmp/slurm-llnl-2.5.3/src/slurmctld/slurmctld...done.
> (gdb) args
> Undefined command: "args".  Try "help".
> (gdb) set args -D
> (gdb) r
> Starting program: /tmp/slurm-llnl-2.5.3/src/slurmctld/slurmctld -D
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> slurmctld: pidfile not locked, assuming no running daemon
> slurmctld: error: Configured MailProg is invalid
> slurmctld: Job accounting information stored, but details not gathered
> slurmctld: Accounting storage FileTxt plugin loaded
> slurmctld: slurmctld version 2.5.3 started on cluster gc3cluster
> slurmctld: Munge cryptographic signature plugin loaded
> slurmctld: Consumable Resources (CR) Node Selection plugin loaded with
> argument 17
> slurmctld: preempt/none loaded
> slurmctld: Checkpoint plugin loaded: checkpoint/none
> slurmctld: Job accounting gather NOT_INVOKED plugin loaded
> slurmctld: switch NONE plugin loaded
> slurmctld: topology NONE plugin loaded
> slurmctld: sched: Backfill scheduler plugin loaded
> [New Thread 0x7ffff7f80700 (LWP 13646)]
> slurmctld: error: Could not open node state file
> /var/lib/slurm-llnl/slurmctld/node_state: No such file or directory
> slurmctld: error: NOTE: Trying backup state save file. Information may be 
> lost!
> slurmctld: No node state file
> (/var/lib/slurm-llnl/slurmctld/node_state.old) to recover
> slurmctld: error: Incomplete node data checkpoint file
> slurmctld: Recovered state of 0 nodes
> slurmctld: error: Could not open job state file
> /var/lib/slurm-llnl/slurmctld/job_state: No such file or directory
> slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
> slurmctld: No job state file
> (/var/lib/slurm-llnl/slurmctld/job_state.old) to recover
> slurmctld: cons_res: select_p_node_init
> slurmctld: cons_res: preparing for 2 partitions
> slurmctld: error: Could not open reservation state file
> /var/lib/slurm-llnl/slurmctld/resv_state: No such file or directory
> slurmctld: error: NOTE: Trying backup state save file. Reservations may be 
> lost
> slurmctld: No reservation state file
> (/var/lib/slurm-llnl/slurmctld/resv_state.old) to recover
> slurmctld: Recovered state of 0 reservations
> slurmctld: error: Could not open trigger state file
> /var/lib/slurm-llnl/slurmctld/trigger_state: No such file or directory
> slurmctld: error: NOTE: Trying backup state save file. Triggers may be lost!
> slurmctld: No trigger state file
> (/var/lib/slurm-llnl/slurmctld/trigger_state.old) to recover
> slurmctld: error: Incomplete trigger data checkpoint file
> slurmctld: State of 0 triggers recovered
> slurmctld: read_slurm_conf: backup_controller not specified.
> slurmctld: Reinitializing job accounting state
> slurmctld: cons_res: select_p_reconfigure
> slurmctld: cons_res: select_p_node_init
> slurmctld: cons_res: preparing for 2 partitions
> slurmctld: Running as primary controller
> [New Thread 0x7ffff5ba9700 (LWP 13653)]
> [New Thread 0x7ffff5aa8700 (LWP 13654)]
> [New Thread 0x7ffff59a7700 (LWP 13655)]
> [New Thread 0x7ffff58a6700 (LWP 13656)]
> [Thread 0x7ffff58a6700 (LWP 13656) exited]
> [New Thread 0x7ffff558f700 (LWP 13657)]
> slurmctld: auth plugin for Munge (http://code.google.com/p/munge/) loaded
> [Thread 0x7ffff558f700 (LWP 13657) exited]
> [New Thread 0x7ffff558f700 (LWP 13658)]
> [Thread 0x7ffff558f700 (LWP 13658) exited]
> [New Thread 0x7ffff558f700 (LWP 13659)]
> [Thread 0x7ffff558f700 (LWP 13659) exited]
> [New Thread 0x7ffff558f700 (LWP 13660)]
> [Thread 0x7ffff558f700 (LWP 13660) exited]
> [New Thread 0x7ffff558f700 (LWP 13661)]
> [Thread 0x7ffff558f700 (LWP 13661) exited]
> [New Thread 0x7ffff558f700 (LWP 13662)]
> [Thread 0x7ffff558f700 (LWP 13662) exited]
> [New Thread 0x7ffff558f700 (LWP 13663)]
> [Thread 0x7ffff558f700 (LWP 13663) exited]
> [New Thread 0x7ffff558f700 (LWP 13664)]
> [Thread 0x7ffff558f700 (LWP 13664) exited]
> [New Thread 0x7ffff558f700 (LWP 13676)]
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff558f700 (LWP 13676)]
> 0x000000000044e894 in validate_jobs_on_node (reg_msg=0x7fffe8001a58)
> at job_mgr.c:7885
> 7885          memcpy(node_ptr->energy, reg_msg->energy, 
> sizeof(acct_gather_energy_t));
> (gdb) bt
> #0  0x000000000044e894 in validate_jobs_on_node
> (reg_msg=0x7fffe8001a58) at job_mgr.c:7885
> #1  0x0000000000475fdc in _slurm_rpc_node_registration
> (msg=0x7fffe8000f58) at proc_req.c:1940
> #2  0x000000000047138a in slurmctld_req (msg=0x7fffe8000f58) at proc_req.c:253
> #3  0x0000000000430de3 in _service_connection (arg=0x7ffff0000958) at
> controller.c:1022
> #4  0x00007ffff79c0e9a in start_thread () from
> /lib/x86_64-linux-gnu/libpthread.so.0
> #5  0x00007ffff76edcbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
> #6  0x0000000000000000 in ?? ()
> (gdb) list
> 7880                  error("slurmd registered on unknown node %s",
> 7881                          reg_msg->node_name);
> 7882                  return;
> 7883          }
> 7884  
> 7885          memcpy(node_ptr->energy, reg_msg->energy, 
> sizeof(acct_gather_energy_t));
> 7886  
> 7887          if (node_ptr->up_time > reg_msg->up_time) {
> 7888                  verbose("Node %s rebooted %u secs ago",
> 7889                          reg_msg->node_name, reg_msg->up_time);
> (gdb) p node_ptr
> $1 = (struct node_record *) 0x80e2a8
> (gdb) p node_ptr->energy
> $2 = (acct_gather_energy_t *) 0x8109e8
> (gdb) p *node_ptr->energy
> $3 = {previous_consumed_energy = 0, base_consumed_energy = 0,
> base_watts = 0, consumed_energy = 0, current_watts = 0}
> (gdb) p reg_msg
> $4 = (slurm_node_registration_status_msg_t *) 0x7fffe8001a58
> (gdb) p reg_msg->energy
> $5 = (acct_gather_energy_t *) 0x0
>
>
> On Wed, Feb 27, 2013 at 7:27 PM, Antonio Messina
> <[email protected]> wrote:
>> On Wed, Feb 27, 2013 at 4:54 PM, Danny Auble <[email protected]> wrote:
>>> I would test with a more modern version, 2.5, and see if the problem still
>>> exists.
>>>
>>> Knowing your configuration would also help.
>> Attached is my slurm.conf file. We have just one frontend and a bunch
>> of worker nodes.
>>
>> .a.
>>
>>> Antonio Messina <[email protected]> wrote:
>>>>
>>>> Hi all,
>>>>
>>>> In our test cluster, running Slurm 2.3.4 (rebuilt Ubuntu packages), we
>>>> have the following issue: running "srun --test-only" shows an
>>>> incorrect start date:
>>>>
>>>> antonio@slurm:~$ date
>>>> Wed Feb 27 16:36:45 CET 2013
>>>> antonio@slurm:~$ srun  --test-only  hostname
>>>> srun: Job 295 to start at 2064-03-13T01:01:52 using 1 processors on
>>>> node-08-01-07
>>>>
>>>> Please note that the cluster is empty, and if I remove the
>>>> ``--test-only`` option the job runs immediately. The current
>>>> date and time on the machine are also correct (ntp is running).
>>>>
>>>> .a.
>>
>>
>> --
>> [email protected]
>> GC3: Grid Computing Competence Center
>> http://www.gc3.uzh.ch/
>> University of Zurich
>> Winterthurerstrasse 190
>> CH-8057 Zurich Switzerland
>>
>
>
