Re: [Wien] time difference among nodes

2015-09-29 Thread Gavin Abo
From the 'top' outputs sent before, it looks like the administrators might
have configured the system with no swap:


r1i1n2

Swap:  0M total,  0M used,  0M free, 10563M cached

r1i1n3

Swap:  0M total,  0M used,  0M free, 23089M cached

Keep in mind that having swap can mean the difference between degraded
performance and a hard crash under low memory [
http://unix.stackexchange.com/questions/190398/do-i-need-swap-space-if-i-have-more-than-enough-amount-of-ram
].
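
A quick way to verify this on a node is the memory summary itself; a minimal
sketch, assuming a standard Linux node (it can go inside a batch job if there
is no SSH access):

free -m     # memory and swap totals in MB; "cached" is reclaimable page cache
swapon -s   # list configured swap devices; no output means no swap at all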


On 9/29/2015 5:57 AM, Laurence Marks wrote:


If it happens again, one thing to ask them to check is swap usage and 
how much memory is cached. On some of my nodes I have noticed that 
they do not always release cached memory, and can start swapping. If 
this happens the job will get very slow. The commands to use to clear 
the cache can be found at
http://www.tecmint.com/clear-ram-memory-cache-buffer-and-swap-space-on-linux/ 
or similar. (Needs root access.) Top can also show memory use.


While there should be no need to do this, I have noticed that I need
to do it every 3 hours on 4 nodes - the other 20 don't need it. It is an
issue mainly for big calculations.


Alternatively, it may have been something else: a zombie process, big log
files, or other things. Rebooting gets rid of a lot of system caches and
helps -- even on my Android tablet every week or two. It's murky waters.


---
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
http://www.numis.northwestern.edu
Corrosion in 4D http://MURI4D.numis.northwestern.edu


Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what 
nobody else has thought"

Albert Szent-Gyorgi

___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-29 Thread Luis Ogando
Hi Lyudmila,

   Unfortunately, they do not have "top mode 1" output corresponding to the
problem period.
   Thanks again.
   All the best,
 Luis


2015-09-29 10:37 GMT-03:00 Lyudmila Dobysheva :

> 29.09.2015 14:57, Laurence Marks wrote:
>
>> If it happens again, one thing to ask them to check is swap usage and
>> how much memory is cached.
>>
> ...
>
>> Alternatively it was something else, a zombie, big log files or other
>> things. Rebooting gets rid of a lot of system caches and helps
>>
>
> My bet is that parallelization was lost on that node for some unclear reason
> (maybe the bad swapping/caching threw the parallel options out of memory and
> all jobs were sent to only one processor of the node).
>
> I would like to know what the administrator saw in the "1" mode of the top
> command.
>
> Best wishes
>   Lyudmila Dobysheva
> --
> Phys.-Techn. Institute of Ural Br. of Russian Ac. of Sci.
> 426001 Izhevsk, ul.Kirova 132
> RUSSIA
> --
> Tel.:7(3412) 432045(office), 722529(Fax)
> E-mail: l...@ftiudm.ru, lyuk...@mail.ru (office)
> lyuk...@gmail.com (home)
> Skype:  lyuka17 (home), lyuka18 (office)
> http://ftiudm.ru/content/view/25/103/lang,english/
> --
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-29 Thread Luis Ogando
Hi Lyudmila,

   Thanks again !
   I will ask them.
   All the best,
  Luis


2015-09-29 10:37 GMT-03:00 Lyudmila Dobysheva :

> 29.09.2015 14:57, Laurence Marks wrote:
>
>> If it happens again, one thing to ask them to check is swap usage and
>> how much memory is cached.
>>
> ...
>
>> Alternatively it was something else, a zombie, big log files or other
>> things. Rebooting gets rid of a lot of system caches and helps
>>
>
> My bet is that parallelization was lost on that node for some unclear reason
> (maybe the bad swapping/caching threw the parallel options out of memory and
> all jobs were sent to only one processor of the node).
>
> I would like to know what the administrator saw in the "1" mode of the top
> command.
>
> Best wishes
>   Lyudmila Dobysheva
> --
> Phys.-Techn. Institute of Ural Br. of Russian Ac. of Sci.
> 426001 Izhevsk, ul.Kirova 132
> RUSSIA
> --
> Tel.:7(3412) 432045(office), 722529(Fax)
> E-mail: l...@ftiudm.ru, lyuk...@mail.ru (office)
> lyuk...@gmail.com (home)
> Skype:  lyuka17 (home), lyuka18 (office)
> http://ftiudm.ru/content/view/25/103/lang,english/
> --
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-29 Thread Lyudmila Dobysheva

29.09.2015 14:57, Laurence Marks wrote:

If it happens again, one thing to ask them to check is swap usage and
how much memory is cached.

...

Alternatively it was something else, a zombie, big log files or other
things. Rebooting gets rid of a lot of system caches and helps


My bet is that parallelization was lost on that node for some unclear reason
(maybe the bad swapping/caching threw the parallel options out of memory and
all jobs were sent to only one processor of the node).


I would like to know what the administrator saw in the "1" mode of the top
command.


Best wishes
  Lyudmila Dobysheva
--
Phys.-Techn. Institute of Ural Br. of Russian Ac. of Sci.
426001 Izhevsk, ul.Kirova 132
RUSSIA
--
Tel.:7(3412) 432045(office), 722529(Fax)
E-mail: l...@ftiudm.ru, lyuk...@mail.ru (office)
lyuk...@gmail.com (home)
Skype:  lyuka17 (home), lyuka18 (office)
http://ftiudm.ru/content/view/25/103/lang,english/
--
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-29 Thread Luis Ogando
Dear Prof. Marks,

   Thanks !
   I will send your message to the administrators !
   All the best,
   Luis


2015-09-29 8:57 GMT-03:00 Laurence Marks :

> If it happens again, one thing to ask them to check is swap usage and how
> much memory is cached. On some of my nodes I have noticed that they do not
> always release cached memory, and can start swapping. If this happens the
> job will get very slow. The commands to use to clear the cache can be found
> at
>
> http://www.tecmint.com/clear-ram-memory-cache-buffer-and-swap-space-on-linux/
> or similar. (Needs root access.) Top can also show memory use.
>
> While there should be no need to do this, I have noticed that I need to do
> it every 3hrs on 4 nodes - the other 20 don't need it. It is an issue
> mainly for big calculations.
>
> Alternatively it was something else, a zombie, big log files or other
> things. Rebooting gets rid of a lot of system caches and helps -- even on
> my Android tablet every week or two. It's murky waters.
>
> ---
> Professor Laurence Marks
> Department of Materials Science and Engineering
> Northwestern University
> http://www.numis.northwestern.edu
> Corrosion in 4D http://MURI4D.numis.northwestern.edu
> Co-Editor, Acta Cryst A
> "Research is to see what everybody else has seen, and to think what nobody
> else has thought"
> Albert Szent-Gyorgi
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-29 Thread Laurence Marks
If it happens again, one thing to ask them to check is swap usage and how
much memory is cached. On some of my nodes I have noticed that they do not
always release cached memory, and can start swapping. If this happens the
job will get very slow. The commands to use to clear the cache can be found
at
http://www.tecmint.com/clear-ram-memory-cache-buffer-and-swap-space-on-linux/
or similar. (Needs root access.) Top can also show memory use.
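
For reference, a minimal sketch of the cache-clearing step described at that
link (needs root; writing to drop_caches only discards clean, reclaimable
cache, so no data is lost):

sync                               # flush dirty pages to disk first
echo 3 > /proc/sys/vm/drop_caches  # drop page cache, dentries and inodes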

While there should be no need to do this, I have noticed that I need to do
it every 3 hours on 4 nodes - the other 20 don't need it. It is an issue
mainly for big calculations.

Alternatively, it may have been something else: a zombie process, big log
files, or other things. Rebooting gets rid of a lot of system caches and
helps -- even on my Android tablet every week or two. It's murky waters.

---
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
http://www.numis.northwestern.edu
Corrosion in 4D http://MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what nobody
else has thought"
Albert Szent-Gyorgi
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-29 Thread Luis Ogando
Hi Elias,

   There were no other jobs in the specific queue I was using and the nodes
are dedicated to that queue, so, it was the opportunity to reboot them
without furious reactions from other users.
   After trying everything suggested by the Wien2k community, the
administrators resignedly remembered the words of wisdom given by the
cluster guru, Shakespeare, and followed the suggestion given by Lyudmila
Dobysheva. In other words, they killed my job, restarted all the nodes, and
I resubmitted the calculation.
   All the best,
 Luis


2015-09-29 3:50 GMT-03:00 Elias Assmann :

> On 09/28/2015 01:58 PM, Luis Ogando wrote:
> > The problem is solved ! The solution was one suggested by Lyudmila
> > Dobysheva : reboot the nodes. We will never know the origin of the
> > problem, but, honestly, I do not care !
>
> Good to hear that!  So, how did you get the admins to reboot them?
>
> > "There are more things in heaven and earth, Horatio, Than are
> > dreamt of in your philosophy."
>
> That is an apt quote for people working on clusters ;-).
>
>
> Elias
>
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-28 Thread Elias Assmann

On 09/28/2015 01:58 PM, Luis Ogando wrote:
> The problem is solved ! The solution was one suggested by Lyudmila 
> Dobysheva : reboot the nodes. We will never know the origin of the 
> problem, but, honestly, I do not care !

Good to hear that!  So, how did you get the admins to reboot them?

> "There are more things in heaven and earth, Horatio, Than are
> dreamt of in your philosophy."

That is an apt quote for people working on clusters ;-).


Elias

___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-28 Thread Luis Ogando
Dear Wien2k community,

   I would like to thank so many hints !
   The problem is solved ! The solution was one suggested by Lyudmila
Dobysheva : reboot the nodes. We will never know the origin of the problem,
but, honestly, I do not care !

"There are more things in heaven and earth, Horatio,
Than are dreamt of in your philosophy."
- *Hamlet* (1.5.167-8), Hamlet to Horatio
 Shakespeare

   I would like to thank you all again.
   All the best,
 Luis

2015-09-25 5:56 GMT-03:00 Pawel Lesniak :

> Hello,
>
> I'd suggest trying three things.
>
> First of all - does your cluster allow running interactive jobs? If yes,
> then you should create an interactive job to run /bin/bash. I'm not
> familiar with PBS, but in SGE/OGE, if you print the cluster queues with
> "qstat -f" you'll see "I" in the qtype column, which means that the given
> queue allows running interactive jobs. Using bash you should be able to
> run top on a given node without SSH access.
>
> Regardless of success or failure, you should be able to look at node
> statistics using the "qhost" command. You should see at least the
> current load, memory usage and swap usage. In SGE/OGE there's a switch "-j"
> to qhost which will also show you what jobs are currently running on each
> node. This lets you see the load of the machine interactively
> instead of a view at a single point in time.
>
> The last idea is to prepare a job to run at the same time on the same
> node as Wien2K, consisting of several
> "ps auxww | grep ogando >> ${HOME}/ps.output; sleep 2"
> commands. It will give you some information on what's going on. Think of
> it as a non-interactive top stored in a text file.
>
>
> Best regards,
>
> Pawel Lesniak
>
>

Re: [Wien] time difference among nodes

2015-09-25 Thread Pawel Lesniak

Hello,

I'd suggest trying three things.

First of all - does your cluster allow running interactive jobs? If yes,
then you should create an interactive job to run /bin/bash. I'm not
familiar with PBS, but in SGE/OGE, if you print the cluster queues with
"qstat -f" you'll see "I" in the qtype column, which means that the given
queue allows running interactive jobs. Using bash you should be able to
run top on a given node without SSH access.


Regardless of success or failure, you should be able to look at node
statistics using the "qhost" command. You should see at least the current
load, memory usage and swap usage. In SGE/OGE there's a switch "-j" to
qhost which will also show you what jobs are currently running on each
node. This lets you see the load of the machine interactively instead of
a view at a single point in time.


The last idea is to prepare a job to run at the same time on the same
node as Wien2K, consisting of several

"ps auxww | grep ogando >> ${HOME}/ps.output; sleep 2"
commands. It will give you some information on what's going on. Think of
it as a non-interactive top stored in a text file.
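
A minimal sketch of such a monitoring job in bash (the user name and output
file follow the example above; the iteration bound is an assumption so the
job ends on its own after 20 minutes):

#!/bin/bash
# Append a process-table snapshot every 2 seconds; the "[o]gando" pattern
# keeps grep from matching its own command line.
for i in $(seq 1 600); do
    ps auxww | grep "[o]gando" >> ${HOME}/ps.output
    sleep 2
done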



Best regards,

Pawel Lesniak


W dniu 23.09.2015 o 14:25, Luis Ogando pisze:

OK ! In this case, I will try it !
   Many thanks,
  Luis


___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html

Re: [Wien] time difference among nodes

2015-09-25 Thread Elias Assmann

Sounds like a nasty problem …  In terms of strategy, I think the first
thing should be to find out if the node is really to blame.  If so,
you have to convince the admins and/or find a way to avoid it.  If
not, you can turn to figuring out whatever else (presumably in your
Wien2k setup) is causing the trouble.

On 09/24/2015 07:37 PM, Luis Ogando wrote:
> First of all, I wonder: To what extent is this problem
> reproducible? E.g., does your job always run on the same 4 nodes?
> 
> Yes.
> 
> Is it always the same node(s) that are slow?
> 
> Yes

It seems unusual that your job should always be assigned the same
nodes, but okay.  If you get your job to run on a different set it
could help establish if the node is really to blame.  In some queuing
systems, you can request specific nodes.  Or you could submit two
copies of your job.

> The strangest part: at the beginning of this month, the same
> calculation was running properly. I had a crash for convergence
> problems and when I reduced the "mixing factor" in case.inm (it is
> now 0.04 in pre-convergence scf cycle) the problems started.
> Obviously, I do not believe that the mixing factor is the problem.
> 
> No. All the executables are running slowly in the problematic
> node.

I would try to widen the tests then -- restart the calculation from
scratch, try a different case, try other programs …

> Users can do nothing. The administrator sent me the "top's" and I
> have asked him for simultaneous ones.

Like I said, even if you have no direct access you can put it in a job
script.  Something along these lines (in bash):

run &

# remember the PID of the background job
pid=$(jobs -p %1)

# poll while the job is still alive
while kill -0 "$pid" 2>/dev/null; do
   for n in $NODES; do
      ssh $n top -bn1 >> $n.top
      # plus whatever else you want to check
   done
   sleep 30   # pause between snapshots instead of busy-looping
done

wait



Elias

___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-24 Thread Luis Ogando
Dear Prof. Marks,

   As I suspected, users cannot use ganglia. Our administrators are very
protective !!

Dear Elias Assmann,

   Many thanks for your comments. I will try to comment on some of them.


First of all, I wonder: To what extent is this problem reproducible?
> E.g., does your job always run on the same 4 nodes?


Yes.


> Is it always the
> same node(s) that are slow?


Yes


> Does the problem also show up in other
> calculations (maybe just changing the number of k-points, or
> restarting the same case from scratch).


The strangest part: at the beginning of this month, the same calculation
was running properly. I had a crash due to convergence problems, and when I
reduced the "mixing factor" in case.inm (it is now 0.04 in the
pre-convergence scf cycle), the problems started. Obviously, I do not
believe that the mixing factor is the problem.


> Is it only lapw1 that is slow?
>

No. All the executables are running slowly in the problematic node.


>
> Second, how did you make those ‘top’s?  As for ‘lapw0’ and ‘lapw1’, I
> am guessing that this is just because the snapshots were taken at
> different times (notice that the CPU times of lapw0 on the two nodes
> are quite different, too).
>

Users can do nothing. The administrator sent me the "top's" and I have
asked him for simultaneous ones.


>
> About the CPU usage on ‘n2’, I find this very suspicious.  If it is as
> Peter said that the jobs are in the initialization and therefore not
> computing much, that may be fine; but I have to disagree with his
> assessment, because the memory usage of lapw1 on the two nodes is
> basically the same (if anything, the image sizes on ‘n2’ are slightly
> larger).  Note also that it is *not* the case that other processes are
> using the CPU; the total usage is at 7.5 %.
>
> It would be good to clarify that by getting a ‘top’ such that we know
> that lapw1 had been running for a while.  To this end, top has an ‘-n’
> option which says how many frames to output, e.g. ‘top -bn 10’.
>
> I am also curious about the load averages.  ‘n2’ has larger “mid-term”
> and “long-term” load averages than the others, and its “short-term”
> average is just as large.  I am not sure what that means.
>
> On 09/23/2015 02:21 PM, Luis Ogando wrote:
> > I can not access the nodes. SSH among them is forbidden ! We have
> > to ask the administrators for anything !! It is the hell !! Of
> > course, only the PBS jobs can "travel" among the nodes.
>
> I do not know about PBS Pro, but Torque and SGE have an option (I
> think ‘-I’ in either case) to submit an interactive job where you get
> a login on a node.  Of course that is only a realistic option when the
> queuing time is not too long.  Otherwise, any information that a more
> sophisticated tool can give you will also be available from the
> command line (just more painful to extract!) via ‘top’, ‘ps’, ‘/proc’,
> etc.  You can also put these things in a jobs script (which you
> apparently already did with ‘top’).
>
>
> Good luck,
>
> Elias
>

Finally, I would like to thank everyone for the comments and say that if I
did not comment on them, it is because the administrators said they cannot
be the origin of the problem: "everything is OK" (?).
   All the best,
  Luis
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-24 Thread Elias Assmann

Luis,

First of all, I wonder: To what extent is this problem reproducible?
E.g., does your job always run on the same 4 nodes?  Is it always the
same node(s) that are slow?  Does the problem also show up in other
calculations (maybe just changing the number of k-points, or
restarting the same case from scratch).  Is it only lapw1 that is slow?

Second, how did you make those ‘top’s?  As for ‘lapw0’ and ‘lapw1’, I
am guessing that this is just because the snapshots were taken at
different times (notice that the CPU times of lapw0 on the two nodes
are quite different, too).

About the CPU usage on ‘n2’, I find this very suspicious.  If it is as
Peter said that the jobs are in the initialization and therefore not
computing much, that may be fine; but I have to disagree with his
assessment, because the memory usage of lapw1 on the two nodes is
basically the same (if anything, the image sizes on ‘n2’ are slightly
larger).  Note also that it is *not* the case that other processes are
using the CPU; the total usage is at 7.5 %.

It would be good to clarify that by getting a ‘top’ such that we know
that lapw1 had been running for a while.  To this end, top has an ‘-n’
option which says how many frames to output, e.g. ‘top -bn 10’.

I am also curious about the load averages.  ‘n2’ has larger “mid-term”
and “long-term” load averages than the others, and its “short-term”
average is just as large.  I am not sure what that means.

On 09/23/2015 02:21 PM, Luis Ogando wrote:
> I can not access the nodes. SSH among them is forbidden ! We have
> to ask the administrators for anything !! It is the hell !! Of
> course, only the PBS jobs can "travel" among the nodes.

I do not know about PBS Pro, but Torque and SGE have an option (I
think ‘-I’ in either case) to submit an interactive job where you get
a login on a node.  Of course that is only a realistic option when the
queuing time is not too long.  Otherwise, any information that a more
sophisticated tool can give you will also be available from the
command line (just more painful to extract!) via ‘top’, ‘ps’, ‘/proc’,
etc.  You can also put these things in a jobs script (which you
apparently already did with ‘top’).
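
(For Torque, a hypothetical example of such an interactive submission -- the
queue name and resource string here are assumptions for illustration:

qsub -I -q myqueue -l nodes=1:ppn=12

In SGE, 'qlogin' or 'qrsh' plays the same role.)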


Good luck,

Elias


___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-23 Thread Luis Ogando
OK ! In this case, I will try it !
   Many thanks,
  Luis


2015-09-23 9:23 GMT-03:00 Laurence Marks :

> Ganglia is web based, you don't need ssh. Please read the link I sent.
>
> ---
> Professor Laurence Marks
> Department of Materials Science and Engineering
> Northwestern University
> http://www.numis.northwestern.edu
> Corrosion in 4D http://MURI4D.numis.northwestern.edu
> Co-Editor, Acta Cryst A
> "Research is to see what everybody else has seen, and to think what nobody
> else has thought"
> Albert Szent-Gyorgi
> On Sep 23, 2015 07:21, "Luis Ogando"  wrote:
>
>>Hi,
>>
>>I can not access the nodes. SSH among them is forbidden ! We have to
>> ask the administrators for anything !! It is the hell !!
>>Of course, only the PBS jobs can "travel" among the nodes.
>>All the best,
>>Luis
>>
>>
>> 2015-09-23 9:14 GMT-03:00 Laurence Marks :
>>
>>> Nooo!
>>>
>>> You should use ganglia yourself.
>>>
>>> ---
>>> Professor Laurence Marks
>>> Department of Materials Science and Engineering
>>> Northwestern University
>>> http://www.numis.northwestern.edu
>>> Corrosion in 4D http://MURI4D.numis.northwestern.edu
>>> Co-Editor, Acta Cryst A
>>> "Research is to see what everybody else has seen, and to think what
>>> nobody else has thought"
>>> Albert Szent-Gyorgi
>>> On Sep 23, 2015 07:13, "Luis Ogando"  wrote:
>>>
 Dear Prof. Marks,

Thank you for your comment.
I sent your suggestions to the administrators.
All the best,
 Luis


 2015-09-23 8:56 GMT-03:00 Laurence Marks :

> It is hard to work this out remotely, particularly with unfriendly
> sys_admin.
>
> I would find out if you have ganglia available, see
> http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/linux/bks/SGI_Admin/books/ICEX_Admin_Guide/sgi_html/ch05.html#Z1190844523tls
> . This is much more useful than top. Try doing http://... to relevant
> head or admin nodes.
>
> ---
> Professor Laurence Marks
> Department of Materials Science and Engineering
> Northwestern University
> http://www.numis.northwestern.edu
> Corrosion in 4D http://MURI4D.numis.northwestern.edu
> Co-Editor, Acta Cryst A
> "Research is to see what everybody else has seen, and to think what
> nobody else has thought"
> Albert Szent-Gyorgi
> On Sep 23, 2015 06:31, "Luis Ogando"  wrote:
>
>> Dear Prof. Blaha and Lyudmila Dobysheva,
>>
>>Many thanks for your comments !
>>Unfortunately, users have no privileges in the cluster. I will
>> send your comments to the administrators and let's see what happens.
>>Many thanks again,
>> Luis
>>
>>
>>
>>
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-23 Thread Laurence Marks
Ganglia is web based, you don't need ssh. Please read the link I sent.

---
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
http://www.numis.northwestern.edu
Corrosion in 4D http://MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what nobody
else has thought"
Albert Szent-Gyorgi
On Sep 23, 2015 07:21, "Luis Ogando"  wrote:

>Hi,
>
>I can not access the nodes. SSH among them is forbidden ! We have to
> ask the administrators for anything !! It is the hell !!
>Of course, only the PBS jobs can "travel" among the nodes.
>All the best,
>Luis
>
>
> 2015-09-23 9:14 GMT-03:00 Laurence Marks :
>
>> Nooo!
>>
>> You should use ganglia yourself.
>>
>> ---
>> Professor Laurence Marks
>> Department of Materials Science and Engineering
>> Northwestern University
>> http://www.numis.northwestern.edu
>> Corrosion in 4D http://MURI4D.numis.northwestern.edu
>> Co-Editor, Acta Cryst A
>> "Research is to see what everybody else has seen, and to think what
>> nobody else has thought"
>> Albert Szent-Gyorgi
>> On Sep 23, 2015 07:13, "Luis Ogando"  wrote:
>>
>>> Dear Prof. Marks,
>>>
>>>Thank you for your comment.
>>>I sent your suggestions to the administrators.
>>>All the best,
>>> Luis
>>>
>>>
>>> 2015-09-23 8:56 GMT-03:00 Laurence Marks :
>>>
 It is hard to work this out remotely, particularly with unfriendly
 sys_admin.

 I would find out if you have ganglia available, see
 http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/linux/bks/SGI_Admin/books/ICEX_Admin_Guide/sgi_html/ch05.html#Z1190844523tls
 . This is much more useful than top. Try doing http://... to relevant
 head or admin nodes.

 ---
 Professor Laurence Marks
 Department of Materials Science and Engineering
 Northwestern University
 http://www.numis.northwestern.edu
 Corrosion in 4D http://MURI4D.numis.northwestern.edu
 Co-Editor, Acta Cryst A
 "Research is to see what everybody else has seen, and to think what
 nobody else has thought"
 Albert Szent-Gyorgi
 On Sep 23, 2015 06:31, "Luis Ogando"  wrote:

> Dear Prof. Blaha and Lyudmila Dobysheva,
>
>Many thanks for your comments !
>Unfortunately, users have no privileges in the cluster. I will send
> your comments to the administrators and let's see what happens.
>Many thanks again,
> Luis
>
>
>
>
>
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-23 Thread Luis Ogando
   Hi,

   I cannot access the nodes. SSH among them is forbidden ! We have to ask
the administrators for anything !! It is hell !!
   Of course, only the PBS jobs can "travel" among the nodes.
   All the best,
   Luis


2015-09-23 9:14 GMT-03:00 Laurence Marks :

> Nooo!
>
> You should use ganglia yourself.
>
> ---
> Professor Laurence Marks
> Department of Materials Science and Engineering
> Northwestern University
> http://www.numis.northwestern.edu
> Corrosion in 4D http://MURI4D.numis.northwestern.edu
> Co-Editor, Acta Cryst A
> "Research is to see what everybody else has seen, and to think what nobody
> else has thought"
> Albert Szent-Gyorgi
> On Sep 23, 2015 07:13, "Luis Ogando"  wrote:
>
>> Dear Prof. Marks,
>>
>>Thank you for your comment.
>>I sent your suggestions to the administrators.
>>All the best,
>> Luis
>>
>>
>> 2015-09-23 8:56 GMT-03:00 Laurence Marks :
>>
>>> It is hard to work this out remotely, particularly with unfriendly
>>> sys_admin.
>>>
>>> I would find out if you have ganglia available, see
>>> http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/linux/bks/SGI_Admin/books/ICEX_Admin_Guide/sgi_html/ch05.html#Z1190844523tls
>>> . This is much more useful than top. Try doing http://... to relevant
>>> head or admin nodes.
>>>
>>> ---
>>> Professor Laurence Marks
>>> Department of Materials Science and Engineering
>>> Northwestern University
>>> http://www.numis.northwestern.edu
>>> Corrosion in 4D http://MURI4D.numis.northwestern.edu
>>> Co-Editor, Acta Cryst A
>>> "Research is to see what everybody else has seen, and to think what
>>> nobody else has thought"
>>> Albert Szent-Gyorgi
>>> On Sep 23, 2015 06:31, "Luis Ogando"  wrote:
>>>
 Dear Prof. Blaha and Lyudmila Dobysheva,

Many thanks for your comments !
Unfortunately, users have no privileges in the cluster. I will send
 your comments to the administrators and let's see what happens.
Many thanks again,
 Luis




___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-23 Thread Laurence Marks
Nooo!

You should use ganglia yourself.

---
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
http://www.numis.northwestern.edu
Corrosion in 4D http://MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what nobody
else has thought"
Albert Szent-Gyorgi
On Sep 23, 2015 07:13, "Luis Ogando"  wrote:

> Dear Prof. Marks,
>
>Thank you for your comment.
>I sent your suggestions to the administrators.
>All the best,
> Luis
>
>
> 2015-09-23 8:56 GMT-03:00 Laurence Marks :
>
>> It is hard to work this out remotely, particularly with unfriendly
>> sys_admin.
>>
>> I would find out if you have ganglia available, see
>> http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/linux/bks/SGI_Admin/books/ICEX_Admin_Guide/sgi_html/ch05.html#Z1190844523tls
>> . This is much more useful than top. Try doing http://... to relevant
>> head or admin nodes.
>>
>> ---
>> Professor Laurence Marks
>> Department of Materials Science and Engineering
>> Northwestern University
>> http://www.numis.northwestern.edu
>> Corrosion in 4D http://MURI4D.numis.northwestern.edu
>> Co-Editor, Acta Cryst A
>> "Research is to see what everybody else has seen, and to think what
>> nobody else has thought"
>> Albert Szent-Gyorgi
>> On Sep 23, 2015 06:31, "Luis Ogando"  wrote:
>>
>>> Dear Prof. Blaha and Lyudmila Dobysheva,
>>>
>>>Many thanks for your comments !
>>>Unfortunately, users have no privileges in the cluster. I will send
>>> your comments to the administrators and let's see what happens.
>>>Many thanks again,
>>> Luis
>>>
>>>
>>>
>>>
>
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-23 Thread Luis Ogando
Dear Prof. Marks,

   Thank you for your comment.
   I sent your suggestions to the administrators.
   All the best,
Luis


2015-09-23 8:56 GMT-03:00 Laurence Marks :

> It is hard to work this out remotely, particularly with unfriendly
> sys_admin.
>
> I would find out if you have ganglia available, see
> http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/linux/bks/SGI_Admin/books/ICEX_Admin_Guide/sgi_html/ch05.html#Z1190844523tls
> . This is much more useful than top. Try doing http://... to relevant
> head or admin nodes.
>
> ---
> Professor Laurence Marks
> Department of Materials Science and Engineering
> Northwestern University
> http://www.numis.northwestern.edu
> Corrosion in 4D http://MURI4D.numis.northwestern.edu
> Co-Editor, Acta Cryst A
> "Research is to see what everybody else has seen, and to think what nobody
> else has thought"
> Albert Szent-Gyorgi
> On Sep 23, 2015 06:31, "Luis Ogando"  wrote:
>
>> Dear Prof. Blaha and Lyudmila Dobysheva,
>>
>>Many thanks for your comments !
>>Unfortunately, users have no privileges in the cluster. I will send
>> your comments to the administrators and let's see what happens.
>>Many thanks again,
>> Luis
>>
>>
>>
>>
>
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-23 Thread Laurence Marks
It is hard to work this out remotely, particularly with an unfriendly
sys_admin.

I would find out if you have ganglia available, see
http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/linux/bks/SGI_Admin/books/ICEX_Admin_Guide/sgi_html/ch05.html#Z1190844523tls
. This is much more useful than top. Try doing http://... to relevant head
or admin nodes.

---
Professor Laurence Marks
Department of Materials Science and Engineering
Northwestern University
http://www.numis.northwestern.edu
Corrosion in 4D http://MURI4D.numis.northwestern.edu
Co-Editor, Acta Cryst A
"Research is to see what everybody else has seen, and to think what nobody
else has thought"
Albert Szent-Gyorgi
On Sep 23, 2015 06:31, "Luis Ogando"  wrote:

> Dear Prof. Blaha and Lyudmila Dobysheva,
>
>Many thanks for your comments !
>Unfortunately, users have no privileges in the cluster. I will send
> your comments to the administrators and let's see what happens.
>Many thanks again,
> Luis
>
>
>
>
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-23 Thread Luis Ogando
Dear Prof. Blaha and Lyudmila Dobysheva,

   Many thanks for your comments !
   Unfortunately, users have no privileges in the cluster. I will send your
comments to the administrators and let's see what happens.
   Many thanks again,
Luis
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-23 Thread Lyudmila Dobysheva

23.09.2015 12:22, Lyudmila Dobysheva wrote:

the jobs are all at one processor of the node


To be sure, try this:
In top on n2, type "1" to show individual CPU usage.
It is better to do this after some time, once the starting phase has passed.

23.09.2015 11:25, Peter Blaha wrote:
> With only a few seconds cpu time, the job is just in the starting phase


Lyudmila Dobysheva
--
Phys.-Techn. Institute of Ural Br. of Russian Ac. of Sci.
426001 Izhevsk, ul.Kirova 132
RUSSIA
--
Tel.:7(3412) 432045(office), 722529(Fax)
E-mail: l...@ftiudm.ru, lyuk...@mail.ru (office)
lyuk...@gmail.com (home)
Skype:  lyuka17 (home), lyuka18 (office)
http://ftiudm.ru/content/view/25/103/lang,english/
--
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-23 Thread Peter Blaha
With only a few seconds cpu time, the job is just in the starting phase 
(allocating memory, reading files, distributing data) and thus cpu-load 
is very low.


A few seconds later, this should reach about 100 % for each lapw1_mpi.

On 09/23/2015 11:20 AM, Lyudmila Dobysheva wrote:

22.09.2015 23:08, Luis Ogando wrote:

r1i1n2
  PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
 2096 ogando  20   0  927m 642m  20m R    9  1.8 0:09.30 lapw1c_mpi
 2109 ogando  20   0  926m 633m  17m R    9  1.8 0:14.58 lapw1c_mpi
 2122 ogando  20   0  924m 633m  19m R    9  1.8 0:09.65 lapw1c_mpi
 2124 ogando  20   0  922m 627m  15m R    9  1.7 0:06.72 lapw1c_mpi
 2108 ogando  20   0  927m 633m  17m R    8  1.8 0:09.04 lapw1c_mpi
 2110 ogando  20   0  926m 633m  17m R    8  1.7 0:09.01 lapw1c_mpi
 2111 ogando  20   0  924m 627m  13m R    8  1.7 0:14.56 lapw1c_mpi
 2095 ogando  20   0  930m 641m  17m R    8  1.8 0:09.32 lapw1c_mpi
 2121 ogando  20   0  927m 634m  17m R    8  1.8 0:06.76 lapw1c_mpi
 2123 ogando  20   0  924m 632m  18m R    8  1.7 0:09.65 lapw1c_mpi
 2098 ogando  20   0  922m 634m  16m R    8  1.8 0:06.71 lapw1c_mpi
 2097 ogando  20   0  927m 641m  19m R    7  1.8 0:06.75 lapw1c_mpi


If we sum up the %CPU we obtain 100%, so the jobs are all at one node,
sure.
What does this mean? Maybe the .machines file?
Or the parallel_options in the WIEN2k root on n2?

Best wishes
   Lyudmila Dobysheva
--
Phys.-Techn. Institute of Ural Br. of Russian Ac. of Sci.
426001 Izhevsk, ul.Kirova 132
RUSSIA
--
Tel.:7(3412) 432045(office), 722529(Fax)
E-mail: l...@ftiudm.ru, lyuk...@mail.ru (office)
 lyuk...@gmail.com (home)
Skype:  lyuka17 (home), lyuka18 (office)
http://ftiudm.ru/content/view/25/103/lang,english/
--


--

  P.Blaha
--
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: bl...@theochem.tuwien.ac.atWIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at/staff/tc_group_e.php
--
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-23 Thread Lyudmila Dobysheva

23.09.2015 12:20, Lyudmila Dobysheva wrote:

the jobs are all at one node

at one processor of the node, of course

  Lyudmila Dobysheva
--
Phys.-Techn. Institute of Ural Br. of Russian Ac. of Sci.
426001 Izhevsk, ul.Kirova 132
RUSSIA
--
Tel.:7(3412) 432045(office), 722529(Fax)
E-mail: l...@ftiudm.ru, lyuk...@mail.ru (office)
lyuk...@gmail.com (home)
Skype:  lyuka17 (home), lyuka18 (office)
http://ftiudm.ru/content/view/25/103/lang,english/
--
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-23 Thread Lyudmila Dobysheva

22.09.2015 23:08, Luis Ogando wrote:

r1i1n2
  PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+  COMMAND
 2096 ogando  20   0  927m 642m  20m R    9  1.8 0:09.30 lapw1c_mpi
 2109 ogando  20   0  926m 633m  17m R    9  1.8 0:14.58 lapw1c_mpi
 2122 ogando  20   0  924m 633m  19m R    9  1.8 0:09.65 lapw1c_mpi
 2124 ogando  20   0  922m 627m  15m R    9  1.7 0:06.72 lapw1c_mpi
 2108 ogando  20   0  927m 633m  17m R    8  1.8 0:09.04 lapw1c_mpi
 2110 ogando  20   0  926m 633m  17m R    8  1.7 0:09.01 lapw1c_mpi
 2111 ogando  20   0  924m 627m  13m R    8  1.7 0:14.56 lapw1c_mpi
 2095 ogando  20   0  930m 641m  17m R    8  1.8 0:09.32 lapw1c_mpi
 2121 ogando  20   0  927m 634m  17m R    8  1.8 0:06.76 lapw1c_mpi
 2123 ogando  20   0  924m 632m  18m R    8  1.7 0:09.65 lapw1c_mpi
 2098 ogando  20   0  922m 634m  16m R    8  1.8 0:06.71 lapw1c_mpi
 2097 ogando  20   0  927m 641m  19m R    7  1.8 0:06.75 lapw1c_mpi


If we sum up the %CPU we obtain 100%, so the jobs are all at one node, 
sure.

What does this mean? Maybe the .machines file?
Or the parallel_options in the WIEN2k root on n2?
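
As a quick check, the %CPU column can be summed directly from a batch
snapshot of top; a sketch, assuming the column layout above, where %CPU is
the 9th field:

top -bn1 | grep lapw1c_mpi | awk '{sum += $9} END {print sum}'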

Best wishes
  Lyudmila Dobysheva
--
Phys.-Techn. Institute of Ural Br. of Russian Ac. of Sci.
426001 Izhevsk, ul.Kirova 132
RUSSIA
--
Tel.:7(3412) 432045(office), 722529(Fax)
E-mail: l...@ftiudm.ru, lyuk...@mail.ru (office)
lyuk...@gmail.com (home)
Skype:  lyuka17 (home), lyuka18 (office)
http://ftiudm.ru/content/view/25/103/lang,english/
--
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-23 Thread Lyudmila Dobysheva

22.09.2015 23:08, Luis Ogando wrote:

r1i1n1 -
top - 17:40:46 up 12 days, 9 min,  2 users,  load average: 10.55, 4.34, 1.74
Cpu(s): 100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,
r1i1n2 -
top - 17:42:30 up 221 days,  6:29,  1 user,  load average: 10.76, 9.59, 8.79
Cpu(s):  7.5%us,  0.1%sy,  0.0%ni, 92.4%id,  0.0%wa,  0.0%hi,  0.0%si,
r1i1n3 -
top - 17:42:50 up 56 days,  3:25,  1 user,  load average: 10.57, 6.02, 2.59
Cpu(s): 99.5%us,  0.4%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,


1) The first difference I see: the node in question has not been
restarted for 221 days. I'd start by rebooting (the problem may
disappear, and you will never know why it happened).


2) You didn't check:
> 2015-09-18 23:24 GMT-03:00 Laurence Marks
>  * Bad memory
>  * Full disc

Try "df" on n2 and on some other node for comparison; check and send the
output.
Also check which directory is used as the working directory on the nodes
(there should be something like "export SCRATCH=./" in .bashrc; run
"set > aaa" and check the variable SCRATCH in the file aaa). Compare it
with the output of df.
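
For example (a sketch; "aaa" is just the scratch file name used above):

   df -h              # compare free space on n2 and on a healthy node
   set > aaa          # dump all shell variables into the file aaa
   grep SCRATCH aaa   # where does WIEN2k write its scratch files?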


3) Just to be sure: you showed us top output for user ogando only. I hope
you really verified that there were no other users (in top on n2, press
"u" and give a blank answer to "Which user (blank for all)"). Top writes
"1 user", but there should be at least root, syslog, statd and so forth.
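
If interactive top is inconvenient, the same snapshot can be captured
non-interactively (a sketch using standard top batch options):

   top -b -n 1 > top_n2.txt   # one snapshot, all users, all processes
   head -40 top_n2.txt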


> We also have the first two nodes executing lapw0_mpi while the other
> two are executing lapw1c_mpi. Is this normal ?

I do not know; it looks suspicious but, IMHO, it is not connected with
the problem under discussion.


Best wishes
  Lyudmila Dobysheva


On 09/21/2015 02:51 PM, Luis Ogando wrote:
7) The mystery : two weeks ago, everything was working properly !!
 On Sep 18, 2015 8:58 PM, "Luis Ogando" wrote:
 I am using Wien2k in an SGI cluster with 32 nodes. My calculation is
 running in 4 nodes that have the same characteristics, and only my job
 is running in these 4 nodes.
 I noticed that one of these 4 nodes is spending more than 20 times the
 time spent by the other 3 nodes in the run_lapw execution.
 Could someone imagine a reason for this ? Any advice ?

--
Phys.-Techn. Institute of Ural Br. of Russian Ac. of Sci.
426001 Izhevsk, ul.Kirova 132
RUSSIA
--
Tel.:7(3412) 432045(office), 722529(Fax)
E-mail: l...@ftiudm.ru, lyuk...@mail.ru (office)
lyuk...@gmail.com (home)
Skype:  lyuka17 (home), lyuka18 (office)
http://ftiudm.ru/content/view/25/103/lang,english/
--
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-22 Thread Peter Blaha
Of course, at the "same time" ONLY lapw0_mpi OR lapw1_mpi should be
running.
However, I assume you did these "tops" sequentially, one after the
other??? And of course, in an scf cycle, after a few minutes running
lapw0, lapw1 will start ...

Do these tests in several windows in parallel.

The only suspicious info is the memory consumption. On the slow node you
see:

> Mem:  36176M total,  8820M used, 27355M free,
on the fast one:
> Mem:  36176M total, 36080M used,    96M free,

It may indicate that the slow node has a different configuration; in
particular, it does not seem to buffer I/O, ... but keeps only the
running programs (12 x 500 MB) in memory. The fast one uses "all" memory,
which is typically used by the operating system to hold various daemons
and buffers permanently in memory.
The latter behavior is what I normally see on my nodes and what should
be the default behavior.
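
The two memory configurations can be compared quickly and without root
(a sketch with standard Linux tools; run it on the slow node and on a
fast one):

   free -m   # total/used/free/buffers/cached, in MB
   grep -E 'MemTotal|MemFree|Buffers|Cached|Dirty' /proc/meminfo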




On 09/22/2015 10:08 PM, Luis Ogando wrote:

Trying to decrease the size of a previous message !!!

--
Dear Prof. Blaha and Marks,

Please, find below the "top" output for my calculation.
As you can see, there is a huge difference in CPU use for the r1i1n2
node (the problematic one). What could be the reason ? What can I do ?
   We also have the first two nodes executing lapw0_mpi while the other
two are executing lapw1c_mpi. Is this normal ?
Thank you again,
 Luis


r1i1n0

top - 17:41:29 up 11 days,  8:49,  2 users,  load average: 10.95, 4.99, 2.01
Tasks: 248 total,  13 running, 235 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.9%us,  0.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,
  0.0%st
Mem:  36176M total,  8820M used, 27355M free,     0M buffers
Swap:     0M total,     0M used,     0M free,  7248M cached

  PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6670 ogando  20   0  517m  70m  14m R  100  0.2   2:22.27 lapw0_mpi
 6671 ogando  20   0  511m  71m  19m R  100  0.2   2:22.57 lapw0_mpi
 6672 ogando  20   0  512m  67m  15m R  100  0.2   2:22.26 lapw0_mpi
 6673 ogando  20   0  511m  69m  18m R  100  0.2   2:22.49 lapw0_mpi
 6674 ogando  20   0  511m  64m  13m R  100  0.2   2:22.69 lapw0_mpi
 6675 ogando  20   0  511m  67m  16m R  100  0.2   2:22.63 lapw0_mpi
 6676 ogando  20   0  511m  63m  12m R  100  0.2   2:22.24 lapw0_mpi
 6677 ogando  20   0  511m  62m  11m R  100  0.2   2:22.59 lapw0_mpi
 6679 ogando  20   0  511m  67m  16m R  100  0.2   2:22.20 lapw0_mpi
 6681 ogando  20   0  512m  62m  11m R  100  0.2   2:22.70 lapw0_mpi
 6678 ogando  20   0  511m  64m  13m R  100  0.2   2:22.64 lapw0_mpi
 6680 ogando  20   0  510m  62m  12m R  100  0.2   2:22.55 lapw0_mpi
  924 ogando  20   0 12916 1620  996 S    0  0.0   0:00.28 run_lapw
 6506 ogando  20   0 13024 1820  992 S    0  0.0   0:00.02 x
 6527 ogando  20   0 12740 1456  996 S    0  0.0   0:00.02 lapw0para
 6669 ogando  20   0 74180 3632 2236 S    0  0.0   0:00.09 mpirun
17182 ogando  20   0 13308 1892 1060 S    0  0.0   0:00.13 csh
17183 ogando  20   0 10364  656  396 S    0  0.0   0:00.40 pbs_demux
17203 ogando  20   0 12932 1720 1008 S    0  0.0   0:00.07 csh


r1i1n1

top - 17:40:46 up 12 days, 9 min,  2 users,  load average: 10.55, 4.34, 1.74
Tasks: 242 total,  13 running, 229 sleeping,   0 stopped,   0 zombie
Cpu(s): 100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,
  0.0%st
Mem:  36176M total, 36080M used,    96M free,     0M buffers
Swap:     0M total,     0M used,     0M free, 34456M cached

  PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27446 ogando  20   0  516m  65m 9368 R  100  0.2   1:34.78 lapw0_mpi
27447 ogando  20   0  517m  66m 9432 R  100  0.2   1:35.16 lapw0_mpi
27448 ogando  20   0  516m  65m 9412 R  100  0.2   1:34.88 lapw0_mpi
27449 ogando  20   0  516m  65m 9464 R  100  0.2   1:33.37 lapw0_mpi
27450 ogando  20   0  515m  65m 9440 R  100  0.2   1:33.96 lapw0_mpi
27453 ogando  20   0  516m  65m 9480 R  100  0.2   1:35.44 lapw0_mpi
27454 ogando  20   0  515m  65m 9424 R  100  0.2   1:35.85 lapw0_mpi
27455 ogando  20   0  516m  65m 9452 R  100  0.2   1:34.47 lapw0_mpi
27456 ogando  20   0  516m  65m 9440 R  100  0.2   1:34.78 lapw0_mpi
27457 ogando  20   0  516m  65m 9420 R  100  0.2   1:30.90 lapw0_mpi
27451 ogando  20   0  517m  65m 9472 R  100  0.2   1:34.65 lapw0_mpi
27452 ogando  20   0  516m  65m 9436 R  100  0.2   1:33.63 lapw0_mpi
27445 ogando  20   0 67540 3336 2052 S    0  0.0   0:00.11 orted

r1i1n2

top - 17:42:30 up 221 days,  6:29,  1 user,  load average: 10.76, 9.59, 8.79
Tasks: 242 total,  13 running, 229 sleeping

[Wien] time difference among nodes

2015-09-22 Thread Luis Ogando
Trying to decrease the size of a previous message !!!

--
Dear Prof. Blaha and Marks,

   Please, find below the "top" output for my calculation.
   As you can see, there is a huge difference in CPU use for the r1i1n2
node (the problematic one). What could be the reason ? What can I do ?
  We also have the first two nodes executing lapw0_mpi while the other two
are executing lapw1c_mpi. Is this normal ?
   Thank you again,
Luis


r1i1n0

top - 17:41:29 up 11 days,  8:49,  2 users,  load average: 10.95, 4.99, 2.01
Tasks: 248 total,  13 running, 235 sleeping,   0 stopped,   0 zombie
Cpu(s): 99.9%us,  0.1%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,
 0.0%st
Mem:  36176M total,  8820M used, 27355M free,     0M buffers
Swap:     0M total,     0M used,     0M free,  7248M cached

  PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6670 ogando  20   0  517m  70m  14m R  100  0.2   2:22.27 lapw0_mpi
 6671 ogando  20   0  511m  71m  19m R  100  0.2   2:22.57 lapw0_mpi
 6672 ogando  20   0  512m  67m  15m R  100  0.2   2:22.26 lapw0_mpi
 6673 ogando  20   0  511m  69m  18m R  100  0.2   2:22.49 lapw0_mpi
 6674 ogando  20   0  511m  64m  13m R  100  0.2   2:22.69 lapw0_mpi
 6675 ogando  20   0  511m  67m  16m R  100  0.2   2:22.63 lapw0_mpi
 6676 ogando  20   0  511m  63m  12m R  100  0.2   2:22.24 lapw0_mpi
 6677 ogando  20   0  511m  62m  11m R  100  0.2   2:22.59 lapw0_mpi
 6679 ogando  20   0  511m  67m  16m R  100  0.2   2:22.20 lapw0_mpi
 6681 ogando  20   0  512m  62m  11m R  100  0.2   2:22.70 lapw0_mpi
 6678 ogando  20   0  511m  64m  13m R  100  0.2   2:22.64 lapw0_mpi
 6680 ogando  20   0  510m  62m  12m R  100  0.2   2:22.55 lapw0_mpi
  924 ogando  20   0 12916 1620  996 S    0  0.0   0:00.28 run_lapw
 6506 ogando  20   0 13024 1820  992 S    0  0.0   0:00.02 x
 6527 ogando  20   0 12740 1456  996 S    0  0.0   0:00.02 lapw0para
 6669 ogando  20   0 74180 3632 2236 S    0  0.0   0:00.09 mpirun
17182 ogando  20   0 13308 1892 1060 S    0  0.0   0:00.13 csh
17183 ogando  20   0 10364  656  396 S    0  0.0   0:00.40 pbs_demux
17203 ogando  20   0 12932 1720 1008 S    0  0.0   0:00.07 csh


r1i1n1

top - 17:40:46 up 12 days, 9 min,  2 users,  load average: 10.55, 4.34, 1.74
Tasks: 242 total,  13 running, 229 sleeping,   0 stopped,   0 zombie
Cpu(s): 100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,
 0.0%st
Mem:  36176M total, 36080M used,    96M free,     0M buffers
Swap:     0M total,     0M used,     0M free, 34456M cached

  PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27446 ogando  20   0  516m  65m 9368 R  100  0.2   1:34.78 lapw0_mpi
27447 ogando  20   0  517m  66m 9432 R  100  0.2   1:35.16 lapw0_mpi
27448 ogando  20   0  516m  65m 9412 R  100  0.2   1:34.88 lapw0_mpi
27449 ogando  20   0  516m  65m 9464 R  100  0.2   1:33.37 lapw0_mpi
27450 ogando  20   0  515m  65m 9440 R  100  0.2   1:33.96 lapw0_mpi
27453 ogando  20   0  516m  65m 9480 R  100  0.2   1:35.44 lapw0_mpi
27454 ogando  20   0  515m  65m 9424 R  100  0.2   1:35.85 lapw0_mpi
27455 ogando  20   0  516m  65m 9452 R  100  0.2   1:34.47 lapw0_mpi
27456 ogando  20   0  516m  65m 9440 R  100  0.2   1:34.78 lapw0_mpi
27457 ogando  20   0  516m  65m 9420 R  100  0.2   1:30.90 lapw0_mpi
27451 ogando  20   0  517m  65m 9472 R  100  0.2   1:34.65 lapw0_mpi
27452 ogando  20   0  516m  65m 9436 R  100  0.2   1:33.63 lapw0_mpi
27445 ogando  20   0 67540 3336 2052 S    0  0.0   0:00.11 orted

r1i1n2

top - 17:42:30 up 221 days,  6:29,  1 user,  load average: 10.76, 9.59, 8.79
Tasks: 242 total,  13 running, 229 sleeping,   0 stopped,   0 zombie
Cpu(s):  7.5%us,  0.1%sy,  0.0%ni, 92.4%id,  0.0%wa,  0.0%hi,  0.0%si,
 0.0%st
Mem:  36176M total, 31464M used,  4712M free,     0M buffers
Swap:     0M total,     0M used,     0M free, 10563M cached

  PID USER    PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 2096 ogando  20   0  927m 642m  20m R    9  1.8   0:09.30 lapw1c_mpi
 2109 ogando  20   0  926m 633m  17m R    9  1.8   0:14.58 lapw1c_mpi
 2122 ogando  20   0  924m 633m  19m R    9  1.8   0:09.65 lapw1c_mpi
 2124 ogando  20   0  922m 627m  15m R    9  1.7   0:06.72 lapw1c_mpi
 2108 ogando  20   0  927m 633m  17m R    8  1.8   0:09.04 lapw1c_mpi
 2110 ogando  20   0  926m 633m  17m R    8  1.7   0:09.01 lapw1c_mpi
 2111 ogando  20   0  924m 627m  13m R    8  1.7   0:14.56 lapw1c_mpi
 2095 ogando  20   0  930m 641m  17m R    8  1.8   0:09.32 lapw1c_mpi
 2121 ogando  20   0  927m 634m  17m R    8  1.8   0:06.76 lapw1c_mpi

Re: [Wien] time difference among nodes

2015-09-21 Thread Luis Ogando
Dear Professor Blaha,

   Thank you !
   My .machines file is OK.
   I will ask the administrator to follow your other suggestions (users do
not have privileges).
   All the best,
   Luis


2015-09-21 10:22 GMT-03:00 Peter Blaha :

> a) Check your .machines file.  Does it meet your expectations, or does
> this node have too large a load?
>
> b) Can you interactively log in to these nodes while your job is running ?
> If yes, log in on 2 nodes (in two windows) and run top
>
> c) If nothing obvious is wrong so far, test the network by doing some
> bigger copying from/to these nodes from your $home (or $scratch) to see if
> file-io is killing you.
>
>
> On 09/21/2015 02:51 PM, Luis Ogando wrote:
>
>> Dear Prof. Marks,
>>
>> Many thanks for your help.
>> The administrators said that everything is OK, the software is the
>> problem (the easy answer) : no zombies, no other jobs on the node, ... !!
>> Let me give you more information to see if you can imagine other
>> possibilities:
>>
>> 1) Intel Xeon Six Core 5680, 3.33GHz
>>
>> 2) Intel(R) Fortran/CC/OpenMPI Intel(R) 64 Compiler XE for applications
>> running on Intel(R) 64, Version 12.1.1.256 Build 20111011
>>
>> 3) OpenMPI 1.6.5
>>
>> 4) PBS Pro 11.0.2
>>
>> 5) OpenMPI built using  --with-tm  due to prohibited ssh among nodes  (
>> http://www.open-mpi.org/faq/?category=building#build-rte-tm )
>>
>> 6) Wien2k 14.2
>>
>> 7) The mystery : two weeks ago, everything was working properly !!
>>
>> Many thanks again !
>> All the best,
>> Luis
>>
>> 2015-09-18 23:24 GMT-03:00 Laurence Marks <laurence.ma...@gmail.com>:
>>
>> Almost certainly one or more of:
>> * Other jobs on the node
>> * Zombie process(es)
>> * Too many mpi
>> * Bad memory
>> * Full disc
>> * Too hot
>>
>> If you have it, use ganglia; if not, ssh in and use top/ps or whatever
>> SGI has. If you cannot sudo, get help from someone who can.
>>
>> On Sep 18, 2015 8:58 PM, "Luis Ogando" <lcoda...@gmail.com> wrote:
>>
>> Dear Wien2k community,
>>
>> I am using Wien2k in an SGI cluster with 32 nodes. My
>> calculation is running in 4 nodes that have the same
>> characteristics and only my job is running in these 4 nodes.
>> I noticed that one of these 4 nodes is spending more than 20
>> times the time spent by the other 3 nodes in the run_lapw
>> execution.
>> Could someone imagine a reason for this ? Any advice ?
>> All the best,
>>  Luis
>>
>>
>> ___
>> Wien mailing list
>> Wien@zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>> SEARCH the MAILING-LIST at:
>>
>> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>>
>>
>>
>>
>> ___
>> Wien mailing list
>> Wien@zeus.theochem.tuwien.ac.at
>> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
>> SEARCH the MAILING-LIST at:
>> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>>
>>
> --
>
>   P.Blaha
> --
> Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
> Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
> Email: bl...@theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
> WWW:   http://www.imc.tuwien.ac.at/staff/tc_group_e.php
> --
>
> ___
> Wien mailing list
> Wien@zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at:
> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-21 Thread Peter Blaha
a) Check your .machines file.  Does it meet your expectations, or does
this node have too large a load?
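
For comparison, a minimal sketch of a .machines file for four 12-core
nodes as in this thread (the node names, the k-point split and the core
counts are only illustrative; see the WIEN2k user's guide for the exact
syntax that matches your setup):

   # lapw0 over all four nodes, one k-point-parallel job per node
   lapw0: r1i1n0:12 r1i1n1:12 r1i1n2:12 r1i1n3:12
   1: r1i1n0:12
   1: r1i1n1:12
   1: r1i1n2:12
   1: r1i1n3:12
   granularity:1
   extrafine:1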


b) Can you interactively log in to these nodes while your job is running ?
If yes, log in on 2 nodes (in two windows) and run top

c) If nothing obvious is wrong so far, test the network by doing some
bigger copying from/to these nodes and your $home (or $scratch) to see
if file I/O is killing you.
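
For instance (a sketch; the file size and the paths are only
illustrative):

   dd if=/dev/zero of=$SCRATCH/ddtest bs=1M count=1024   # ~1 GB test file
   time cp $SCRATCH/ddtest $HOME/ddtest                  # node -> $home
   time cp $HOME/ddtest $SCRATCH/ddtest.back             # $home -> node
   rm -f $SCRATCH/ddtest $SCRATCH/ddtest.back $HOME/ddtest

Run it on the slow node and on a fast one; a large difference in the copy
times points to the network or the file server rather than the CPUs.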



On 09/21/2015 02:51 PM, Luis Ogando wrote:

Dear Prof. Marks,

Many thanks for your help.
The administrators said that everything is OK, the software is the
problem (the easy answer) : no zombies, no other jobs on the node, ... !!
Let me give you more information to see if you can imagine other
possibilities:

1) Intel Xeon Six Core 5680, 3.33GHz

2) Intel(R) Fortran/CC/OpenMPI Intel(R) 64 Compiler XE for applications
running on Intel(R) 64, Version 12.1.1.256 Build 20111011

3) OpenMPI 1.6.5

4) PBS Pro 11.0.2

5) OpenMPI built using  --with-tm  due to prohibited ssh among nodes  (
http://www.open-mpi.org/faq/?category=building#build-rte-tm )

6) Wien2k 14.2

7) The mystery : two weeks ago, everything was working properly !!

Many thanks again !
All the best,
Luis

2015-09-18 23:24 GMT-03:00 Laurence Marks <laurence.ma...@gmail.com>:

Almost certainly one or more of:
* Other jobs on the node
* Zombie process(es)
* Too many mpi
* Bad memory
* Full disc
* Too hot

If you have it, use ganglia; if not, ssh in and use top/ps or whatever
SGI has. If you cannot sudo, get help from someone who can.

On Sep 18, 2015 8:58 PM, "Luis Ogando" <lcoda...@gmail.com> wrote:

Dear Wien2k community,

I am using Wien2k in an SGI cluster with 32 nodes. My
calculation is running in 4 nodes that have the same
characteristics and only my job is running in these 4 nodes.
I noticed that one of these 4 nodes is spending more than 20
times the time spent by the other 3 nodes in the run_lapw execution.
Could someone imagine a reason for this ? Any advice ?
All the best,
 Luis


___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at 
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html




___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html



--

  P.Blaha
--
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300 FAX: +43-1-58801-165982
Email: bl...@theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at/staff/tc_group_e.php
--
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-21 Thread Luis Ogando
Dear Prof. Marks,

   Many thanks for your help.
   The administrators said that everything is OK, the software is the
problem (the easy answer) : no zombies, no other jobs on the node, ... !!
   Let me give you more information to see if you can imagine other
possibilities:

1) Intel Xeon Six Core 5680, 3.33GHz

2) Intel(R) Fortran/CC/OpenMPI Intel(R) 64 Compiler XE for applications
running on Intel(R) 64, Version 12.1.1.256 Build 20111011

3) OpenMPI 1.6.5

4) PBS Pro 11.0.2

5) OpenMPI built using  --with-tm  because ssh among nodes is prohibited
(see the build sketch after this list;
http://www.open-mpi.org/faq/?category=building#build-rte-tm )

6) Wien2k 14.2

7) The mystery : two weeks ago, everything was working properly !!
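
For reference, a sketch of such a build (the PBS installation prefix
/opt/pbs is only an assumption; adjust it to your cluster):

   ./configure --with-tm=/opt/pbs --prefix=$HOME/openmpi-1.6.5
   make -j4 && make install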

   Many thanks again !
   All the best,
   Luis

2015-09-18 23:24 GMT-03:00 Laurence Marks :

> Almost certainly one or more of:
> * Other jobs on the node
> * Zombie process(es)
> * Too many mpi
> * Bad memory
> * Full disc
> * Too hot
>
> If you have it, use ganglia; if not, ssh in and use top/ps or whatever SGI
> has. If you cannot sudo, get help from someone who can.
> On Sep 18, 2015 8:58 PM, "Luis Ogando"  wrote:
>
>> Dear Wien2k community,
>>
>> I am using Wien2k in an SGI cluster with 32 nodes. My calculation is
>> running in 4 nodes that have the same characteristics and only my job is
>> running in these 4 nodes.
>>I noticed that one of these 4 nodes is spending more than 20 times the
>> time spent by the other 3 nodes in the run_lapw execution.
>>Could someone imagine a reason for this ? Any advice ?
>>All the best,
>> Luis
>>
>>
>
> ___
> Wien mailing list
> Wien@zeus.theochem.tuwien.ac.at
> http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
> SEARCH the MAILING-LIST at:
> http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
>
>
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


Re: [Wien] time difference among nodes

2015-09-18 Thread Laurence Marks
Almost certainly one or more of:
* Other jobs on the node
* Zombie process(es)
* Too many mpi
* Bad memory
* Full disc
* Too hot

If you have it, use ganglia; if not, ssh in and use top/ps or whatever SGI
has. If you cannot sudo, get help from someone who can.
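
Without sudo, a first pass over most of these points is still possible
from a normal account (a sketch with standard tools):

   uptime                        # load: other jobs or too many mpi?
   ps aux --sort=-%cpu | head    # biggest CPU users; zombies show state Z
   free -m                       # memory pressure
   df -h                         # full disc?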
On Sep 18, 2015 8:58 PM, "Luis Ogando"  wrote:

> Dear Wien2k community,
>
> I am using Wien2k in an SGI cluster with 32 nodes. My calculation is
> running in 4 nodes that have the same characteristics and only my job is
> running in these 4 nodes.
>I noticed that one of these 4 nodes is spending more than 20 times the
> time spent by the other 3 nodes in the run_lapw execution.
>Could someone imagine a reason for this ? Any advice ?
>All the best,
> Luis
>
>
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html


[Wien] time difference among nodes

2015-09-18 Thread Luis Ogando
Dear Wien2k community,

   I am using Wien2k in an SGI cluster with 32 nodes. My calculation is
running in 4 nodes that have the same characteristics and only my job is
running in these 4 nodes.
   I noticed that one of these 4 nodes is spending more than 20 times the
time spent by the other 3 nodes in the run_lapw execution.
   Could someone imagine a reason for this ? Any advice ?
   All the best,
Luis
___
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html