Hi,

Fresh SLURM 15.08.6 on CentOS 7.2.

I have fixed my Low RealMemory issues - turns out that the advertised 32GB 
memory was *actually* 31.2GB. Ouch.
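
For anyone hitting the same thing, here is a minimal sketch of how the usable 
figure can be read off a node. It assumes a Linux /proc/meminfo, and 
mem_total_mb is just a name I made up. The result is in MB, the unit that 
RealMemory in slurm.conf uses.

```shell
# Print MemTotal from a meminfo-style file, converted from kB to MB
# (rounded down). The file defaults to /proc/meminfo.
mem_total_mb() {
    awk '$1 == "MemTotal:" { print int($2 / 1024); exit }' "${1:-/proc/meminfo}"
}

# Example: compare the result with RealMemory in slurm.conf
# mem_total_mb
# slurmd -C    # also prints the hardware configuration slurmd detects
```

Setting RealMemory at or slightly below that number is what stopped the 
draining for me.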

Between that (which stopped the worker nodes from going into the "draining" 
state) and sharing a file system, I've reduced the test errors to a handful.

I'd like to ask about some of them.

=============
test1.83   Test of contiguous option with multiple nodes (--contiguous option).
=============

Is this because my worker nodes are called slurm-w[1-3] but there is also a 
slurm-test worker node?


=============
test7.11   Test of SPANK plugin.
=============

I presume this isn't a requirement anyway - given that it's not in the core?


=============
test10.5   Test bg partition information (-Db option).
=============
=============
test10.13  Test bluegene.conf file creation and validate it (-Dc option).
=============

Given that I'm not on a BG system, I presume it's ok to fail these tests?


Test 17.35 threw this error:

couldn't execute "time": no such file or directory
    while executing
"spawn time -p ./$file_in"
    (file "./test17.35" line 59)

Before failing
=============
test17.35  Test performance/timing of job submissions.
=============
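
One thing worth ruling out here: the message says Expect could not execute 
"time" itself, and spawn needs a real executable on PATH, whereas in an 
interactive bash "time" is a shell keyword and works without one. A minimal 
sketch of a check, where find_external is a hypothetical helper name and not 
part of the test suite:

```shell
# Print the first executable file called $1 found on PATH.
# Returns 1 if there is none (shell keywords and builtins do not count).
find_external() {
    name=$1
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        [ -n "$dir" ] || dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            IFS=$old_ifs
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Example:
# find_external time || echo "no external time binary on PATH"
```

If nothing is printed, installing an external time binary (on CentOS I would 
expect the "time" package to provide /usr/bin/time, but that is an assumption) 
may be all this test needs.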

Test.17.35.in exists, but is owned by root (I ran the regression with sudo) - 
which I presume is causing the error?
Also note that when I ran the command "time -p ./test17.35_in" in the CLI, I 
then couldn't kill the queued commands with scancel --name=test17.35 *or* with 
the name that squeue listed, scancel --name=test17.3. scancel 
--user=ec2-user did work though.
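
A possible explanation for the name mismatch: squeue's default output 
truncates the NAME column to 8 characters, so "test17.3" is probably not the 
full job name. Below is a minimal sketch for listing untruncated names and 
picking job IDs by name prefix. job_ids_by_prefix is a hypothetical helper, 
not a Slurm command.

```shell
# Read "JOBID NAME" lines on stdin and print the IDs of the jobs whose
# name starts with the given prefix.
job_ids_by_prefix() {
    awk -v prefix="$1" 'index($2, prefix) == 1 { print $1 }'
}

# Example:
# squeue -h -u ec2-user -o '%i %j'      # job IDs with full job names
# squeue -h -u ec2-user -o '%i %j' | job_ids_by_prefix test17.35
```

The IDs it prints can then be given to scancel as plain arguments.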


=============
test30.1   Validates that rpms are built with the correct prefix.
=============

Ok, so the failure is written on the label. But I installed as closely to "as 
written" in the docs as possible: download the bz2, rpmbuild, rpm --install.
So again I presume this is also always going to fail?

(for the record, my slurm is installed in all the places you'd expect - mostly 
in /usr/.)


=============
test32.4   Validates that sgather copies specified files from compute nodes.
=============

When I run the test manually, it makes me think that this is failing because 
the SPANK plugin isn't working? (see second FAILURE above)

[ec2-user@slurm-head testing]$ ./test32.4
============================================
TEST: 32.4
spawn /usr/bin/sbatch -N1-4 -o test32.4.out -t1 test32.4.in
Submitted batch job 6447
Job 6447 is in state PENDING, desire DONE
Job 6447 is DONE (COMPLETED)
spawn cat test32.4.out
SLURM_NNODES=4
srun: slurm_spank_local_user_init
srun: slurm_spank_local_user_init
lost connection
lost connection
lost connection
lost connection
srun: error: slurm-w3: task 3: Exited with exit code 1
srun: error: slurm-w2: task 2: Exited with exit code 1
srun: error: slurm-w1: task 1: Exited with exit code 1
srun: error: slurm-test: task 0: Exited with exit code 1
43595     9 /usr/bin/sgather
sum: test32.4_sgather.out*: No such file or directory
srun: slurm_spank_local_user_init
-rwxr-xr-x 1 ec2-user ec2-user 8979 Dec 22 23:16 /tmp/test32.4
-rwxr-xr-x 1 ec2-user ec2-user 8979 Dec 22 23:16 /tmp/test32.4
-rwxr-xr-x 1 ec2-user ec2-user 8979 Dec 22 23:16 /tmp/test32.4
-rwxr-xr-x 1 ec2-user ec2-user 8979 Dec 22 23:16 /tmp/test32.4
srun: slurm_spank_local_user_init

FAILURE: Failed to gather files from all allocated nodes (0 != 4)

FAILURE: Failed to remove gathered files from all allocated nodes (1 != 4)
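
The "lost connection" lines look like scp output rather than anything from 
SPANK. So, assuming sgather copies the files back with scp (I have not 
verified this in the source), non-interactive ssh between the nodes is worth 
ruling out. A small sketch that pulls the failing node names out of output 
like the above; failed_nodes is a made-up helper name:

```shell
# Print the unique node names from lines of the form
# "srun: error: <node>: task <n>: Exited with exit code <c>".
failed_nodes() {
    sed -n 's/^srun: error: \([^:]*\): task [0-9]*: Exited with exit code [0-9]*$/\1/p' | sort -u
}

# Example:
# failed_nodes < test32.4.out
# ssh -o BatchMode=yes slurm-head true   # run on a worker as the job user;
#                                        # fails instead of prompting
```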


=============
test32.5   Validates that sgather -k keeps the original source file.
=============

Similar to the 32.4 error - no SPANK?

Ok, 32.6, 32.8, 32.10, 32.11 and 32.12 all failed as well:

32.6: Error is expected. No worries.

FAILURE: Failed to gather files from all allocated nodes (0 != 4)
test32.6 FAILURE

32.8: Error is expected. No worries.

FAILURE: Failed to gather files from all allocated nodes (0 != 4)

FAILURE: Failed to remove gathered files from all allocated nodes (1 != 4)
test32.8 FAILURE

32.10: FAILURE: Failed to gather files from all allocated nodes (0 != 8)
test32.10 FAILURE

32.11 (I think this is the sudo problem again - root has no key to other 
systems)

32.12: FAILURE: Failed to gather files from all allocated nodes (0 != 4)

FAILURE: Failed to remove gathered files from all allocated nodes (1 != 4)
test32.12 FAILURE



WRT running the tests as root: I have to, don't I? If I run any of the slurm 
commands (sinfo, scontrol, squeue, etc.) without sudo, they look for the conf 
in /usr/local/etc/slurm.conf instead of /etc/slurm/slurm.conf.
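
A sketch of one possible workaround, under the assumption that the non-sudo 
commands only need to be told where the conf lives: the Slurm commands honour 
the SLURM_CONF environment variable, and "scontrol show config" reports the 
conf path in use. conf_from_scontrol is a hypothetical helper that extracts 
that path.

```shell
# Read "scontrol show config" output on stdin and print the value of
# its SLURM_CONF line.
conf_from_scontrol() {
    awk -F' *= *' '$1 == "SLURM_CONF" { print $2; exit }'
}

# Example:
# sudo scontrol show config | conf_from_scontrol
# SLURM_CONF=/etc/slurm/slurm.conf sinfo    # run as ec2-user, no sudo
```

If that works, exporting SLURM_CONF for ec2-user might let the regression run 
without sudo at all.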

Cheers
L.      


