I have scp'd it as moab.log.invalid.gz

On 12/11/12 1:00 PM, "Moe Jette" <[email protected]> wrote:
>
>I would guess that your machine can communicate with the cluster's
>head node (where the slurmctld daemon executes and creates the job
>allocation), but not the compute nodes (where the slurmd daemons
>execute and spawn your tasks). It's probably a network issue.
>
>Quoting Reza Ramazani-Rend <[email protected]>:
>
>> Hi,
>>
>> I am trying to set up a machine for submitting jobs to a cluster that
>> uses slurm. But, when I try to submit a job, for example, using srun
>> command, despite the job being allocated resources (for example using
>> squeue shows the job running with the correct amount of resources
>> allocated), it fails to run the application, and I have to terminate
>> the srun process by a kill command on the local machine or use scancel
>> to cancel the job and free the resources for other users. I tried to
>> follow the instructions given on the mailing list for similar problems,
>> and it seems that the machine that submits the job fails to receive
>> signals from the compute node. I am attaching the output from
>> "scontrol show config", the srun command log (logsrunlocal from
>> "srun -vvvvvvvvv -p partitionname date 2>&1 | tee log"), and the output
>> of strace (from "strace -r -f -o logfile srun …").
>>
>> Other machines on the network with similar configurations can submit
>> jobs without a problem. The log file from the "srun -vvvvv…" command
>> does not indicate any problems that I could see until I terminate the
>> job to free the resources (for comparison, logsrun301 is the log file
>> from a successful run from one of the compute nodes). The strace log,
>> however, shows that the client is waiting for a signal that it never
>> receives (line 744,
>> futex(0x4724ba4, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1,
>> {1355239853, 0}, ffffffff <unfinished ...>, and line 745,
>> <... rt_sigtimedwait resumed> ) = 15).
>>
>> The munge daemon is running on the client, and the permissions to all
>> the directories and files are set up as instructed in the installation
>> document. I also thought selinux might be blocking the communications,
>> but disabling it didn't help.
>>
>> I was wondering if you can identify any problems that I have
>> overlooked or if anything is wrong with the set-up.
>>
>> Thank you.
>>
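
One way to test the network guess quoted above is to check, from the submitting machine, whether TCP connections to the head node and to the compute nodes succeed. The following is only a rough sketch, not something from the thread: it assumes the default Slurm ports (6817 for slurmctld, 6818 for slurmd), which a site may have changed (the actual values appear as SlurmctldPort and SlurmdPort in "scontrol show config"), and the function names are made up for illustration.

```python
import socket

# Assumed defaults; a site may override these with SlurmctldPort / SlurmdPort
# in slurm.conf (check the "scontrol show config" output).
SLURMCTLD_PORT = 6817
SLURMD_PORT = 6818


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_cluster(head_node, compute_nodes, timeout=3.0):
    """Map each (host, port) pair to True/False reachability."""
    results = {
        (head_node, SLURMCTLD_PORT): can_connect(head_node, SLURMCTLD_PORT, timeout)
    }
    for node in compute_nodes:
        results[(node, SLURMD_PORT)] = can_connect(node, SLURMD_PORT, timeout)
    return results
```

Calling check_cluster with the real head node name and a list of compute node names (whatever they are at your site) returns a dictionary of which endpoints answered. Note that this only tests connections going out from the submitting machine. srun also listens on ports of its own for the compute nodes to connect back to, so a firewall on the submitting machine that blocks incoming connections could produce the same hang and would not show up in this check.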
