On Thu, Feb 28, 2013 at 5:44 PM, Danny Auble <[email protected]> wrote:
> It is not a wise idea to update in a piecemeal fashion such as this. While

I agree, but first of all this is a testbed, so I don't care much
about it. Moreover, the update was supposed to be managed by cfengine;
I just happened to test in the middle of the upgrade, while the server
was already upgraded and the clients weren't, which is how I caught
the bug.

> the slurmctld is down I would update the slurmd's first and then update the
> slurmctld (Always update the slurmdbd first, though). I don't think a 2.5
> slurmctld would work with a 2.3 slurmd, I am guessing what happened couldn't
> have been avoided.

I think the daemon should be able to eat everything the client sends
it, even if it's just garbage, so IMHO this is a bug, since it exposes
you to a DoS attack.
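Something along these lines in validate_jobs_on_node() would have kept
the daemon up. This is only a sketch against the code visible in the
gdb session quoted below, not a tested patch, and I haven't checked
what the actual fix in 2.5.x looks like:

    /* Untested sketch: a 2.3 slurmd never sends the energy record,
     * so reg_msg->energy can arrive as NULL.  Guard the copy instead
     * of trusting the client. */
    if (reg_msg->energy)
        memcpy(node_ptr->energy, reg_msg->energy,
               sizeof(acct_gather_energy_t));
    else
        error("node registration from %s has no energy data",
              reg_msg->node_name);

More generally, anything unpacked from a slurmd running a different
version should probably be treated as untrusted input.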
> On 02/28/13 08:40, Antonio Messina wrote:
>>
>> Just to let you know that I've updated to 2.5.3 and now the
>> ``--test-only`` option works.
>>
>> On a side note, I'm not sure if this is on some documentation page,
>> but I had a few troubles upgrading. While the master was running
>> version 2.5.3 and the clients were still on 2.3.4, the slurmctld
>> daemon was dying with SIGSEGV. The problem is in job_mgr.c: there is
>> a memcpy() with a NULL pointer as source. It went away after
>> upgrading all the clients.
>>
>> I'm attaching the dump of the gdb session. If you need any other
>> info, let me know.
>>
>>
>> root@slurm:/tmp/slurm-llnl-2.5.3# gdb src/slurmctld/slurmctld
>> GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
>> Copyright (C) 2012 Free Software Foundation, Inc.
>> License GPLv3+: GNU GPL version 3 or later
>> <http://gnu.org/licenses/gpl.html>
>> This is free software: you are free to change and redistribute it.
>> There is NO WARRANTY, to the extent permitted by law. Type "show copying"
>> and "show warranty" for details.
>> This GDB was configured as "x86_64-linux-gnu".
>> For bug reporting instructions, please see:
>> <http://bugs.launchpad.net/gdb-linaro/>...
>> Reading symbols from /tmp/slurm-llnl-2.5.3/src/slurmctld/slurmctld...done.
>> (gdb) args
>> Undefined command: "args". Try "help".
>> (gdb) set args -D
>> (gdb) r
>> Starting program: /tmp/slurm-llnl-2.5.3/src/slurmctld/slurmctld -D
>> [Thread debugging using libthread_db enabled]
>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>> slurmctld: pidfile not locked, assuming no running daemon
>> slurmctld: error: Configured MailProg is invalid
>> slurmctld: Job accounting information stored, but details not gathered
>> slurmctld: Accounting storage FileTxt plugin loaded
>> slurmctld: slurmctld version 2.5.3 started on cluster gc3cluster
>> slurmctld: Munge cryptographic signature plugin loaded
>> slurmctld: Consumable Resources (CR) Node Selection plugin loaded with
>> argument 17
>> slurmctld: preempt/none loaded
>> slurmctld: Checkpoint plugin loaded: checkpoint/none
>> slurmctld: Job accounting gather NOT_INVOKED plugin loaded
>> slurmctld: switch NONE plugin loaded
>> slurmctld: topology NONE plugin loaded
>> slurmctld: sched: Backfill scheduler plugin loaded
>> [New Thread 0x7ffff7f80700 (LWP 13646)]
>> slurmctld: error: Could not open node state file
>> /var/lib/slurm-llnl/slurmctld/node_state: No such file or directory
>> slurmctld: error: NOTE: Trying backup state save file. Information may be
>> lost!
>> slurmctld: No node state file
>> (/var/lib/slurm-llnl/slurmctld/node_state.old) to recover
>> slurmctld: error: Incomplete node data checkpoint file
>> slurmctld: Recovered state of 0 nodes
>> slurmctld: error: Could not open job state file
>> /var/lib/slurm-llnl/slurmctld/job_state: No such file or directory
>> slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
>> slurmctld: No job state file
>> (/var/lib/slurm-llnl/slurmctld/job_state.old) to recover
>> slurmctld: cons_res: select_p_node_init
>> slurmctld: cons_res: preparing for 2 partitions
>> slurmctld: error: Could not open reservation state file
>> /var/lib/slurm-llnl/slurmctld/resv_state: No such file or directory
>> slurmctld: error: NOTE: Trying backup state save file. Reservations may be
>> lost
>> slurmctld: No reservation state file
>> (/var/lib/slurm-llnl/slurmctld/resv_state.old) to recover
>> slurmctld: Recovered state of 0 reservations
>> slurmctld: error: Could not open trigger state file
>> /var/lib/slurm-llnl/slurmctld/trigger_state: No such file or directory
>> slurmctld: error: NOTE: Trying backup state save file. Triggers may be
>> lost!
>> slurmctld: No trigger state file
>> (/var/lib/slurm-llnl/slurmctld/trigger_state.old) to recover
>> slurmctld: error: Incomplete trigger data checkpoint file
>> slurmctld: State of 0 triggers recovered
>> slurmctld: read_slurm_conf: backup_controller not specified.
>> slurmctld: Reinitializing job accounting state
>> slurmctld: cons_res: select_p_reconfigure
>> slurmctld: cons_res: select_p_node_init
>> slurmctld: cons_res: preparing for 2 partitions
>> slurmctld: Running as primary controller
>> [New Thread 0x7ffff5ba9700 (LWP 13653)]
>> [New Thread 0x7ffff5aa8700 (LWP 13654)]
>> [New Thread 0x7ffff59a7700 (LWP 13655)]
>> [New Thread 0x7ffff58a6700 (LWP 13656)]
>> [Thread 0x7ffff58a6700 (LWP 13656) exited]
>> [New Thread 0x7ffff558f700 (LWP 13657)]
>> slurmctld: auth plugin for Munge (http://code.google.com/p/munge/) loaded
>> [Thread 0x7ffff558f700 (LWP 13657) exited]
>> [New Thread 0x7ffff558f700 (LWP 13658)]
>> [Thread 0x7ffff558f700 (LWP 13658) exited]
>> [New Thread 0x7ffff558f700 (LWP 13659)]
>> [Thread 0x7ffff558f700 (LWP 13659) exited]
>> [New Thread 0x7ffff558f700 (LWP 13660)]
>> [Thread 0x7ffff558f700 (LWP 13660) exited]
>> [New Thread 0x7ffff558f700 (LWP 13661)]
>> [Thread 0x7ffff558f700 (LWP 13661) exited]
>> [New Thread 0x7ffff558f700 (LWP 13662)]
>> [Thread 0x7ffff558f700 (LWP 13662) exited]
>> [New Thread 0x7ffff558f700 (LWP 13663)]
>> [Thread 0x7ffff558f700 (LWP 13663) exited]
>> [New Thread 0x7ffff558f700 (LWP 13664)]
>> [Thread 0x7ffff558f700 (LWP 13664) exited]
>> [New Thread 0x7ffff558f700 (LWP 13676)]
>>
>> Program received signal SIGSEGV, Segmentation fault.
>> [Switching to Thread 0x7ffff558f700 (LWP 13676)]
>> 0x000000000044e894 in validate_jobs_on_node (reg_msg=0x7fffe8001a58)
>>     at job_mgr.c:7885
>> 7885            memcpy(node_ptr->energy, reg_msg->energy,
>>                        sizeof(acct_gather_energy_t));
>> (gdb) bt
>> #0  0x000000000044e894 in validate_jobs_on_node
>>     (reg_msg=0x7fffe8001a58) at job_mgr.c:7885
>> #1  0x0000000000475fdc in _slurm_rpc_node_registration
>>     (msg=0x7fffe8000f58) at proc_req.c:1940
>> #2  0x000000000047138a in slurmctld_req (msg=0x7fffe8000f58) at
>>     proc_req.c:253
>> #3  0x0000000000430de3 in _service_connection (arg=0x7ffff0000958) at
>>     controller.c:1022
>> #4  0x00007ffff79c0e9a in start_thread () from
>>     /lib/x86_64-linux-gnu/libpthread.so.0
>> #5  0x00007ffff76edcbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
>> #6  0x0000000000000000 in ?? ()
>> (gdb) list
>> 7880                    error("slurmd registered on unknown node %s",
>> 7881                          reg_msg->node_name);
>> 7882                    return;
>> 7883            }
>> 7884
>> 7885            memcpy(node_ptr->energy, reg_msg->energy,
>>                        sizeof(acct_gather_energy_t));
>> 7886
>> 7887            if (node_ptr->up_time > reg_msg->up_time) {
>> 7888                    verbose("Node %s rebooted %u secs ago",
>> 7889                            reg_msg->node_name, reg_msg->up_time);
>> (gdb) p node_ptr
>> $1 = (struct node_record *) 0x80e2a8
>> (gdb) p node_ptr->energy
>> $2 = (acct_gather_energy_t *) 0x8109e8
>> (gdb) p *node_ptr->energy
>> $3 = {previous_consumed_energy = 0, base_consumed_energy = 0,
>>   base_watts = 0, consumed_energy = 0, current_watts = 0}
>> (gdb) p reg_msg
>> $4 = (slurm_node_registration_status_msg_t *) 0x7fffe8001a58
>> (gdb) p reg_msg->energy
>> $5 = (acct_gather_energy_t *) 0x0
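The $5 above is the whole story, I guess: a 2.3 slurmd doesn't send an
energy record at all (energy accounting only appeared in 2.5), so the
unpack code on the 2.5 slurmctld leaves reg_msg->energy NULL, and the
memcpy() then faults. Purely as an illustration of what I mean by
defensive handling, the unpack side could substitute an empty record
for a missing one. This is a made-up sketch, not the actual
slurm_protocol_pack.c code; xmalloc() is SLURM's zero-filling
allocator, the other names are invented:

    /* Hypothetical sketch, not the real SLURM unpack code (these
     * names are made up): when the peer speaks an older protocol
     * version, hand back a zeroed energy record rather than leaving
     * a NULL pointer behind. */
    if (remote_version >= PROTOCOL_VERSION_2_5)
        unpack_energy_record(&msg->energy, buffer);
    else
        msg->energy = xmalloc(sizeof(acct_gather_energy_t)); /* zero-filled */

Either that, or the receiving end checks for NULL as in the sketch
earlier in this mail; both would keep a mixed-version cluster from
killing the controller.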
>> On Wed, Feb 27, 2013 at 7:27 PM, Antonio Messina
>> <[email protected]> wrote:
>>>
>>> On Wed, Feb 27, 2013 at 4:54 PM, Danny Auble <[email protected]> wrote:
>>>>
>>>> I would test with a more modern version, 2.5, and see if the problem
>>>> still exists.
>>>>
>>>> Knowing your configuration would also help.
>>>
>>> Attached is my slurm.conf file. We have just one frontend and a bunch
>>> of worker nodes.
>>>
>>> .a.
>>>
>>>> Antonio Messina <[email protected]> wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> In our test cluster, running slurm 2.3.4 (rebuilt Ubuntu packages),
>>>>> we have the following issue: when running "srun --test-only" it
>>>>> shows an incorrect start date:
>>>>>
>>>>> antonio@slurm:~$ date
>>>>> Wed Feb 27 16:36:45 CET 2013
>>>>> antonio@slurm:~$ srun --test-only hostname
>>>>> srun: Job 295 to start at 2064-03-13T01:01:52 using 1 processors on
>>>>> node-08-01-07
>>>>>
>>>>> Please note that the cluster is empty, and if I remove the
>>>>> ``--test-only`` option the job runs immediately. The current date
>>>>> and time on the machine are also correct (ntp is running).
>>>>>
>>>>> .a.
>>>
>>> --
>>> [email protected]
>>> GC3: Grid Computing Competence Center
>>> http://www.gc3.uzh.ch/
>>> University of Zurich
>>> Winterthurerstrasse 190
>>> CH-8057 Zurich Switzerland

--
[email protected]
GC3: Grid Computing Competence Center
http://www.gc3.uzh.ch/
University of Zurich
Winterthurerstrasse 190
CH-8057 Zurich Switzerland
