On 02/28/13 09:18, Antonio Messina wrote:
> On Thu, Feb 28, 2013 at 5:44 PM, Danny Auble <[email protected]> wrote:
>> It is not a wise idea to update in a piecemeal fashion such as this.  While
> I agree, but first of all this is a testbed, so I don't care much
> about it. Moreover, the update was supposed to be managed by
> cfengine; I just happened to test in the middle of the upgrade, while
> the server was already upgraded and the clients weren't, so I caught
> the bug.
>
>> the slurmctld is down, I would update the slurmds first and then update the
>> slurmctld (always update the slurmdbd first, though).  I don't think a 2.5
>> slurmctld would work with a 2.3 slurmd, so I am guessing what happened
>> couldn't have been avoided.
> I think the daemon should be able to handle anything the client sends
> it, even if it's just garbage, so IMHO this is a bug, since it exposes
> you to a DoS attack.
If you send a patch we will incorporate it into the source.

Thanks,
Danny
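
A minimal sketch of the kind of guard such a patch could add around the
memcpy() at job_mgr.c:7885, based on the backtrace below. This is an
illustrative assumption, not the fix that was actually merged: a 2.3-era
slurmd predates the acct_gather_energy_t data that 2.5 expects, so
reg_msg->energy arrives NULL and should be checked before copying.

    if (reg_msg->energy) {
            memcpy(node_ptr->energy, reg_msg->energy,
                   sizeof(acct_gather_energy_t));
    } else {
            /* A pre-2.5 slurmd never packs the energy record, so
             * reg_msg->energy is NULL and the unconditional memcpy()
             * dereferences it.  Log and continue instead of letting
             * the controller crash on a bad registration message. */
            error("node %s sent no energy data (pre-2.5 slurmd?)",
                  reg_msg->node_name);
    }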
>
>> On 02/28/13 08:40, Antonio Messina wrote:
>>> Just to let you know that I've updated to 2.5.3, and the
>>> ``--test-only`` option now works.
>>>
>>> On a side note, I'm not sure if this is on some documentation page,
>>> but I had some trouble upgrading. While the master was running
>>> version 2.5.3 and the clients were still on 2.3.4, the slurmctld
>>> daemon was dying with SIGSEGV. The problem is in job_mgr.c: there is
>>> a memcpy() with a NULL pointer as its source. The crash went away
>>> after upgrading all the clients.
>>>
>>> I'm attaching the dump of the gdb session. If you need any other
>>> info, let me know.
>>>
>>>
>>> root@slurm:/tmp/slurm-llnl-2.5.3# gdb src/slurmctld/slurmctld
>>> GNU gdb (Ubuntu/Linaro 7.4-2012.04-0ubuntu2.1) 7.4-2012.04
>>> Copyright (C) 2012 Free Software Foundation, Inc.
>>> License GPLv3+: GNU GPL version 3 or later
>>> <http://gnu.org/licenses/gpl.html>
>>> This is free software: you are free to change and redistribute it.
>>> There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
>>> and "show warranty" for details.
>>> This GDB was configured as "x86_64-linux-gnu".
>>> For bug reporting instructions, please see:
>>> <http://bugs.launchpad.net/gdb-linaro/>...
>>> Reading symbols from /tmp/slurm-llnl-2.5.3/src/slurmctld/slurmctld...done.
>>> (gdb) args
>>> Undefined command: "args".  Try "help".
>>> (gdb) set args -D
>>> (gdb) r
>>> Starting program: /tmp/slurm-llnl-2.5.3/src/slurmctld/slurmctld -D
>>> [Thread debugging using libthread_db enabled]
>>> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>>> slurmctld: pidfile not locked, assuming no running daemon
>>> slurmctld: error: Configured MailProg is invalid
>>> slurmctld: Job accounting information stored, but details not gathered
>>> slurmctld: Accounting storage FileTxt plugin loaded
>>> slurmctld: slurmctld version 2.5.3 started on cluster gc3cluster
>>> slurmctld: Munge cryptographic signature plugin loaded
>>> slurmctld: Consumable Resources (CR) Node Selection plugin loaded with
>>> argument 17
>>> slurmctld: preempt/none loaded
>>> slurmctld: Checkpoint plugin loaded: checkpoint/none
>>> slurmctld: Job accounting gather NOT_INVOKED plugin loaded
>>> slurmctld: switch NONE plugin loaded
>>> slurmctld: topology NONE plugin loaded
>>> slurmctld: sched: Backfill scheduler plugin loaded
>>> [New Thread 0x7ffff7f80700 (LWP 13646)]
>>> slurmctld: error: Could not open node state file
>>> /var/lib/slurm-llnl/slurmctld/node_state: No such file or directory
>>> slurmctld: error: NOTE: Trying backup state save file. Information may be
>>> lost!
>>> slurmctld: No node state file
>>> (/var/lib/slurm-llnl/slurmctld/node_state.old) to recover
>>> slurmctld: error: Incomplete node data checkpoint file
>>> slurmctld: Recovered state of 0 nodes
>>> slurmctld: error: Could not open job state file
>>> /var/lib/slurm-llnl/slurmctld/job_state: No such file or directory
>>> slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
>>> slurmctld: No job state file
>>> (/var/lib/slurm-llnl/slurmctld/job_state.old) to recover
>>> slurmctld: cons_res: select_p_node_init
>>> slurmctld: cons_res: preparing for 2 partitions
>>> slurmctld: error: Could not open reservation state file
>>> /var/lib/slurm-llnl/slurmctld/resv_state: No such file or directory
>>> slurmctld: error: NOTE: Trying backup state save file. Reservations may be
>>> lost
>>> slurmctld: No reservation state file
>>> (/var/lib/slurm-llnl/slurmctld/resv_state.old) to recover
>>> slurmctld: Recovered state of 0 reservations
>>> slurmctld: error: Could not open trigger state file
>>> /var/lib/slurm-llnl/slurmctld/trigger_state: No such file or directory
>>> slurmctld: error: NOTE: Trying backup state save file. Triggers may be
>>> lost!
>>> slurmctld: No trigger state file
>>> (/var/lib/slurm-llnl/slurmctld/trigger_state.old) to recover
>>> slurmctld: error: Incomplete trigger data checkpoint file
>>> slurmctld: State of 0 triggers recovered
>>> slurmctld: read_slurm_conf: backup_controller not specified.
>>> slurmctld: Reinitializing job accounting state
>>> slurmctld: cons_res: select_p_reconfigure
>>> slurmctld: cons_res: select_p_node_init
>>> slurmctld: cons_res: preparing for 2 partitions
>>> slurmctld: Running as primary controller
>>> [New Thread 0x7ffff5ba9700 (LWP 13653)]
>>> [New Thread 0x7ffff5aa8700 (LWP 13654)]
>>> [New Thread 0x7ffff59a7700 (LWP 13655)]
>>> [New Thread 0x7ffff58a6700 (LWP 13656)]
>>> [Thread 0x7ffff58a6700 (LWP 13656) exited]
>>> [New Thread 0x7ffff558f700 (LWP 13657)]
>>> slurmctld: auth plugin for Munge (http://code.google.com/p/munge/) loaded
>>> [Thread 0x7ffff558f700 (LWP 13657) exited]
>>> [New Thread 0x7ffff558f700 (LWP 13658)]
>>> [Thread 0x7ffff558f700 (LWP 13658) exited]
>>> [New Thread 0x7ffff558f700 (LWP 13659)]
>>> [Thread 0x7ffff558f700 (LWP 13659) exited]
>>> [New Thread 0x7ffff558f700 (LWP 13660)]
>>> [Thread 0x7ffff558f700 (LWP 13660) exited]
>>> [New Thread 0x7ffff558f700 (LWP 13661)]
>>> [Thread 0x7ffff558f700 (LWP 13661) exited]
>>> [New Thread 0x7ffff558f700 (LWP 13662)]
>>> [Thread 0x7ffff558f700 (LWP 13662) exited]
>>> [New Thread 0x7ffff558f700 (LWP 13663)]
>>> [Thread 0x7ffff558f700 (LWP 13663) exited]
>>> [New Thread 0x7ffff558f700 (LWP 13664)]
>>> [Thread 0x7ffff558f700 (LWP 13664) exited]
>>> [New Thread 0x7ffff558f700 (LWP 13676)]
>>>
>>> Program received signal SIGSEGV, Segmentation fault.
>>> [Switching to Thread 0x7ffff558f700 (LWP 13676)]
>>> 0x000000000044e894 in validate_jobs_on_node (reg_msg=0x7fffe8001a58)
>>> at job_mgr.c:7885
>>> 7885            memcpy(node_ptr->energy, reg_msg->energy,
>>> sizeof(acct_gather_energy_t));
>>> (gdb) bt
>>> #0  0x000000000044e894 in validate_jobs_on_node
>>> (reg_msg=0x7fffe8001a58) at job_mgr.c:7885
>>> #1  0x0000000000475fdc in _slurm_rpc_node_registration
>>> (msg=0x7fffe8000f58) at proc_req.c:1940
>>> #2  0x000000000047138a in slurmctld_req (msg=0x7fffe8000f58) at
>>> proc_req.c:253
>>> #3  0x0000000000430de3 in _service_connection (arg=0x7ffff0000958) at
>>> controller.c:1022
>>> #4  0x00007ffff79c0e9a in start_thread () from
>>> /lib/x86_64-linux-gnu/libpthread.so.0
>>> #5  0x00007ffff76edcbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
>>> #6  0x0000000000000000 in ?? ()
>>> (gdb) list
>>> 7880                    error("slurmd registered on unknown node %s",
>>> 7881                            reg_msg->node_name);
>>> 7882                    return;
>>> 7883            }
>>> 7884
>>> 7885            memcpy(node_ptr->energy, reg_msg->energy,
>>> sizeof(acct_gather_energy_t));
>>> 7886
>>> 7887            if (node_ptr->up_time > reg_msg->up_time) {
>>> 7888                    verbose("Node %s rebooted %u secs ago",
>>> 7889                            reg_msg->node_name, reg_msg->up_time);
>>> (gdb) p node_ptr
>>> $1 = (struct node_record *) 0x80e2a8
>>> (gdb) p node_ptr->energy
>>> $2 = (acct_gather_energy_t *) 0x8109e8
>>> (gdb) p *node_ptr->energy
>>> $3 = {previous_consumed_energy = 0, base_consumed_energy = 0,
>>> base_watts = 0, consumed_energy = 0, current_watts = 0}
>>> (gdb) p reg_msg
>>> $4 = (slurm_node_registration_status_msg_t *) 0x7fffe8001a58
>>> (gdb) p reg_msg->energy
>>> $5 = (acct_gather_energy_t *) 0x0
>>>
>>>
>>> On Wed, Feb 27, 2013 at 7:27 PM, Antonio Messina
>>> <[email protected]> wrote:
>>>> On Wed, Feb 27, 2013 at 4:54 PM, Danny Auble <[email protected]> wrote:
>>>>> I would test with a more modern version, 2.5, and see if the problem
>>>>> still
>>>>> exists.
>>>>>
>>>>> Knowing your configuration would also help.
>>>> Attached is my slurm.conf file. We have just one frontend and a
>>>> bunch of worker nodes.
>>>>
>>>> .a.
>>>>
>>>>> Antonio Messina <[email protected]> wrote:
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> In our test cluster, running SLURM 2.3.4 (rebuilt Ubuntu packages),
>>>>>> we have the following issue: when running "srun --test-only", it
>>>>>> reports an incorrect start date:
>>>>>>
>>>>>> antonio@slurm:~$ date
>>>>>> Wed Feb 27 16:36:45 CET 2013
>>>>>> antonio@slurm:~$ srun  --test-only  hostname
>>>>>> srun: Job 295 to start at 2064-03-13T01:01:52 using 1 processors on
>>>>>> node-08-01-07
>>>>>>
>>>>>> Please note that the cluster is empty, and if I remove the
>>>>>> ``--test-only`` option the job runs immediately. The current
>>>>>> date and time on the machine are also correct (NTP is running).
>>>>>>
>>>>>> .a.
>>>>
>>>>
>>>> --
>>>> [email protected]
>>>> GC3: Grid Computing Competence Center
>>>> http://www.gc3.uzh.ch/
>>>> University of Zurich
>>>> Winterthurerstrasse 190
>>>> CH-8057 Zurich Switzerland
>>>
>>>
>
>
