If it happens again, ask them to check swap usage and how much memory is
cached. On some of my nodes I have noticed that cached memory is not
always released, and the node can start swapping. If this happens the
job gets very slow. The commands to clear the cache can be found at
http://www.tecmint.com/clear-ram-memory-cache-buffer-and-swap-space-on-linux/
or similar pages. (They need root access.) top can also show memory use.
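
For anyone who wants to check this on a node without reading top by eye,
here is a rough sketch in Python that reads /proc/meminfo and reports swap
in use and cached memory. It is only an illustration, not something from
the Wien2k distribution: the function names are made up, and it assumes a
Linux node where /proc/meminfo has the usual MemFree, Cached, SwapTotal and
SwapFree fields. It only reads values; clearing the cache is a separate
step that still needs root.

```python
#!/usr/bin/env python3
"""Report swap usage and page-cache size from /proc/meminfo (Linux only)."""


def parse_meminfo(text):
    """Return a dict mapping /proc/meminfo field names to integer values.

    Most fields are sizes in kB; a few (e.g. HugePages_Total) are counts.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def memory_summary(info):
    """Compute swap in use, cached memory and free memory, all in MiB."""
    swap_used_kb = info.get("SwapTotal", 0) - info.get("SwapFree", 0)
    return {
        "swap_used_mib": swap_used_kb / 1024,
        "cached_mib": info.get("Cached", 0) / 1024,
        "free_mib": info.get("MemFree", 0) / 1024,
    }


if __name__ == "__main__":
    # Hypothetical usage: run this on each compute node, e.g. via ssh.
    with open("/proc/meminfo") as handle:
        summary = memory_summary(parse_meminfo(handle.read()))
    for key, value in summary.items():
        print(f"{key:>14}: {value:10.1f}")
```

A node where swap_used_mib keeps growing while cached_mib stays large would
be a candidate for the cache-clearing commands mentioned above.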

While there should be no need to do this, I have found that I need to do
it every 3 hours on 4 of my nodes; the other 20 don't need it. It is mainly
an issue for big calculations.

Alternatively it was something else: a zombie process, big log files or
other things. Rebooting gets rid of a lot of system caches and helps; I even
reboot my Android tablet every week or two. These are murky waters.

---
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
http://www.numis.northwestern.edu
Corrosion in 4D http://MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what nobody
else has thought"
Albert Szent-Györgyi

Hi Elias,

   There were no other jobs in the specific queue I was using, and the nodes
are dedicated to that queue, so it was an opportunity to reboot them
without furious reactions from other users.
   After trying everything suggested by the Wien2k community, the
administrators resignedly remembered the words of wisdom given by the
cluster guru, Shakespeare, and followed the suggestion given by Lyudmila
Dobysheva. In other words, they killed my job, restarted all the nodes and
I resubmitted the calculation.
   All the best,
                     Luis


2015-09-29 3:50 GMT-03:00 Elias Assmann <elias.assm...@gmail.com>:

> On 09/28/2015 01:58 PM, Luis Ogando wrote:
> > The problem is solved ! The solution was one suggested by Lyudmila
> > Dobysheva : reboot the nodes. We will never know the origin of the
> > problem, but, honestly, I do not care !
>
> Good to hear that!  So, how did you get the admins to reboot them?
>
> > "There are more things in heaven and earth, Horatio, Than are
> > dreamt of in your philosophy."
>
> That is an apt quote for people working on clusters ;-).
>
>
>         Elias
>
> _______________________________________________
> Wien mailing list
> Wien@zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at:
> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>
