I have scp'd it as moab.log.invalid.gz

On 12/11/12 1:00 PM, "Moe Jette" <[email protected]> wrote:

>
>I would guess that your machine can communicate with the cluster's
>head node (where the slurmctld daemon executes and creates the job
>allocation), but not the compute nodes (where the slurmd daemons
>execute and spawn your tasks). It's probably a network issue.
>
>Quoting Reza Ramazani-Rend <[email protected]>:
>
>> Hi,
>>
>> I am trying to set up a machine for submitting jobs to a cluster that
>> uses slurm. But, when I try to submit a job, for example, using the
>> srun command, despite the job being allocated resources (for example,
>> squeue shows the job running with the correct amount of resources
>> allocated), it fails to run the application, and I have to terminate
>> the srun process by a kill command on the local machine or use scancel
>> to cancel the job and free the resources for other users. I tried to
>> follow the instructions given on the mailing list for similar problems,
>> and it seems that the machine that submits the job fails to receive
>> signals from the compute node. I am attaching the output from
>> "scontrol show config", the srun command log (logsrunlocal from
>> "srun -vvvvvvvvv -p partitionname date 2>&1 | tee log"), and the output
>> of strace (from "strace -r -f -o logfile srun ...").
>>
>> Other machines on the network with similar configurations can submit
>> jobs without a problem. The log file from the "srun -vvvvv..." command
>> does not indicate any problems that I could see until I terminate the
>> job to free the resources (for comparison, logsrun301 is the log file
>> from a successful run from one of the compute nodes). The strace log,
>> however, shows that the client is waiting for a signal that it never
>> receives (line 744, futex(0x4724ba4,
>> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, {1355239853, 0},
>> ffffffff <unfinished ...>, and line 745, <... rt_sigtimedwait
>> resumed> ) = 15).
>>
>> The munge daemon is running on the client, and the permissions to all
>> the directories and files are set up as instructed in the installation
>> document. I also thought selinux might be blocking the communications,
>> but disabling it didn't help.
>>
>> I was wondering if you can identify any problems that I have
>> overlooked or if anything is wrong with the set-up.
>>
>>  Thank you.
>>
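
One way to test the network explanation suggested above is a plain TCP reachability check run from the submitting machine. The following is only a minimal sketch, not a Slurm tool: the function names are made up for illustration, and the ports are Slurm's documented defaults (6817 for slurmctld, 6818 for slurmd), which a site may have changed; the real values are listed as SlurmctldPort and SlurmdPort in the "scontrol show config" output.

```python
import socket

# Slurm's documented default ports. Assumption: the site has not overridden
# them; compare with SlurmctldPort and SlurmdPort in "scontrol show config".
SLURMCTLD_PORT = 6817
SLURMD_PORT = 6818


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_targets(targets, timeout=3.0):
    """Map each (host, port) pair to True (reachable) or False."""
    return {
        (host, port): can_connect(host, port, timeout)
        for host, port in targets
    }
```

Calling check_targets([("headnode", SLURMCTLD_PORT), ("node001", SLURMD_PORT)]) with the real host names (the two used here are placeholders) would show whether the head node answers while a compute node does not. Note that this only covers connections going out from the submitting machine; connections coming back from the compute nodes to srun would have to be tested from a compute node, and a firewall on the submitting machine could block those even when this check passes.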
