I can't speak to what you get from sacct, but I can say that things will 
definitely be different when launched directly via srun versus indirectly 
through mpirun. The reason is that mpirun uses srun to launch the ORTE 
daemons, which then fork/exec all the application processes under them (as 
opposed to launching those application processes through srun). This means 
two things:

1. Slurm has no direct knowledge of, or visibility into, the application 
processes themselves when they are launched by mpirun; Slurm only sees the 
ORTE daemons. I expect Slurm rolls up all the resources used by those daemons 
and their children, so the totals should include the application processes.

2. Since all Slurm can do is roll everything up, the resources shown in sacct 
will include those used by the daemons and mpirun as well as the application 
processes. Slurm doesn't include its own daemons or the slurmctld in its 
accounting, so the two numbers will be significantly different. If you are 
attempting to limit overall resource usage, you may need to leave some slack 
for the daemons and mpirun.

You should also see an extra "step" in the mpirun-launched job, as mpirun 
itself generally occupies the first step and the launch of the daemons 
occupies a second step.
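To make the two launch paths concrete, here is a hypothetical batch script 
sketch (the script name, app name, and exact step numbering are assumptions, 
and may vary with your Slurm and Open MPI versions):

```shell
#!/bin/bash
#SBATCH --ntasks=4

# Direct launch: slurmd starts each application process itself, so
# Slurm accounts for the app procs directly as a regular job step.
srun ./my_app

# Indirect launch (use one or the other in a real script): mpirun uses
# srun only to start one ORTE daemon per node; the daemons then
# fork/exec ./my_app, so Slurm sees and accounts for the daemons, with
# the app procs only as their children.
mpirun ./my_app

# Afterwards, compare the recorded steps; expect an extra step in the
# mpirun case:
#   sacct -j <jobid> -o JobID,JobName,MaxRSS,MaxVMSize
```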

As for the strange numbers you are seeing, it looks to me like you are 
hitting a mismatch of unsigned vs. signed values. When those are added 
together, the result can wrap and cause all kinds of erroneous output, such 
as a negative MaxRSS.


On Aug 6, 2013, at 11:55 PM, Christopher Samuel <sam...@unimelb.edu.au> wrote:

> 
> On 07/08/13 16:19, Christopher Samuel wrote:
> 
>> Anyone seen anything similar, or any ideas on what could be going
>> on?
> 
> Sorry, this was with:
> 
> # ACCOUNTING
> JobAcctGatherType=jobacct_gather/linux
> JobAcctGatherFrequency=30
> 
> Since those initial tests we've started enforcing memory limits (the
> system is not yet in full production) and found that this causes jobs
> to get killed.
> 
> We tried the cgroups gathering method, but jobs still die with mpirun
> and now the numbers don't seem to be right for mpirun or srun either:
> 
> mpirun (killed):
> 
> [samuel@barcoo-test Mem]$ sacct -j 94564 -o JobID,MaxRSS,MaxVMSize
>       JobID     MaxRSS  MaxVMSize
> ------------ ---------- ----------
> 94564
> 94564.batch    -523362K          0
> 94564.0         394525K          0
> 
> srun:
> 
> [samuel@barcoo-test Mem]$ sacct -j 94565 -o JobID,MaxRSS,MaxVMSize
>       JobID     MaxRSS  MaxVMSize
> ------------ ---------- ----------
> 94565
> 94565.batch        998K          0
> 94565.0          88663K          0
> 
> 
> All the best,
> Chris
> -- 
> Christopher Samuel        Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/      http://twitter.com/vlsci
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
