Dear Prof. Marks,

   As I suspected, users can not use ganglia. Our administrators are very
jealous !!

Dear Elias Assmann,

   Many thanks for your comments. I will try to comment on some of them.

First of all, I wonder: To what extent is this problem reproducible?
> E.g., does your job always run on the same 4 nodes?


> Is it always the
> same node(s) that are slow?


> Does the problem also show up in other
> calculations (maybe just changing the number of k-points, or
> restarting the same case from scratch).

The strangest part: at the beginning of this month, the same calculation
was running properly. I had a crash for convergence problems and when I
reduced the "mixing factor" in case.inm (it is now 0.04 in pre-convergence
scf cycle) the problems started. Obviously, I do not believe that the
mixing factor is the problem.

> Is it only lapw1 that is slow?

No. All the executables are running slowly in the problematic node.

> Second, how did you make those ‘top’s?  As for ‘lapw0’ and ‘lapw1’, I
> am guessing that this is just because the snapshots were taken at
> different times (notice that the CPU times of lapw0 on the two nodes
> are quite different, too).

Users can do nothing. The administrator sent me the "top's" and I have
asked him for simultaneous ones.

> About the CPU usage on ‘n2’, I find this very suspicious.  If it is as
> Peter said that the jobs are in the initialization and therefore not
> computing much, that may be fine; but I have to disagree with his
> assessment, because the memory usage of lapw1 on the two nodes is
> basically the same (if anything, the image sizes on ‘n2’ are slightly
> larger).  Note also that it is *not* the case that other processes are
> using the CPU; the total usage is at 7.5 %.
> It would be good to clarify that by getting a ‘top’ such that we know
> that lapw1 had been running for a while.  To this end, top has an ‘-n’
> option which says how many frames to output, e.g. ‘top -bn 10’.
> I am also curious about the load averages.  ‘n2’ has larger “mid-term”
> and “long-term” load averages than the others, and its “short-term”
> average is just as large.  I am not sure what that means.
> On 09/23/2015 02:21 PM, Luis Ogando wrote:
> > I can not access the nodes. SSH among them is forbidden ! We have
> > to ask the administrators for anything !! It is the hell !! Of
> > course, only the PBS jobs can "travel" among the nodes.
> I do not know about PBS Pro, but Torque and SGE have an option (I
> think ‘-I’ in either case) to submit an interactive job where you get
> a login on a node.  Of course that is only a realistic option when the
> queuing time is not too long.  Otherwise, any information that a more
> sophisticated tool can give you will also be available from the
> command line (just more painful to extract!) via ‘top’, ‘ps’, ‘/proc’,
> etc.  You can also put these things in a jobs script (which you
> apparently already did with ‘top’).
> Good luck,
>         Elias

    Finally, I would like to thank all the comments and say that if I did
not comment on them is because the administrators said they can not be the
origin of the problem, "everything is 0K" (?).
   All the best,
Wien mailing list

Reply via email to