Re: [Wien] time difference among nodes

2015-09-29 Thread Elias Assmann
On 09/28/2015 01:58 PM, Luis Ogando wrote: > The problem is solved ! The solution was one suggested by Lyudmila > Dobysheva : reboot the nodes. We will never know the origin of the > problem, but, honestly, I do not care ! Good to hear that! So,

Re: [Wien] time difference among nodes

2015-09-29 Thread Luis Ogando
Hi Lyudmila, Thanks again ! I will ask them. All the best, Luis 2015-09-29 10:37 GMT-03:00 Lyudmila Dobysheva : > 29.09.2015 14:57, Laurence Marks wrote: > >> If it happens again, one thing to ask them to check is swap usage and >> how much memory is

Re: [Wien] time difference among nodes

2015-09-29 Thread Lyudmila Dobysheva
29.09.2015 14:57, Laurence Marks wrote: If it happens again, one thing to ask them to check is swap usage and how much memory is cached. ... Alternatively it was something else, a zombie, big log files or other things. Rebooting gets rid of a lot of system caches and helps I stand for losing

Re: [Wien] time difference among nodes

2015-09-29 Thread Gavin Abo
From the top output sent earlier, it looks like the administrators might have configured the system with no swap:
r1i1n2 Swap: 0M total, 0M used, 0M free, 10563M cached
r1i1n3 Swap: 0M total, 0M used, 0M free, 23089M cached
Keep in mind that having
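For anyone who wants to check many nodes at once, here is a minimal sketch that parses the "Swap:" summary line of top, using the two lines quoted above as input. The function names are hypothetical and the regular expression assumes the older procps layout shown in this thread (values in M, with a trailing "cached" field); other top versions print this line differently.

```python
import re

# Assumed layout: "Swap: 0M total, 0M used, 0M free, 10563M cached"
# (spaces after the colon and commas are optional).
SWAP_RE = re.compile(
    r"Swap:\s*(\d+)M total,\s*(\d+)M used,\s*(\d+)M free,\s*(\d+)M cached"
)


def parse_swap_line(line):
    """Return the swap/cached figures in MB, or None if the line does not match."""
    match = SWAP_RE.search(line)
    if match is None:
        return None
    total, used, free, cached = (int(group) for group in match.groups())
    return {"total": total, "used": used, "free": free, "cached": cached}


def has_swap(line):
    """True only if the line parses and reports a non-zero swap total."""
    info = parse_swap_line(line)
    return info is not None and info["total"] > 0


if __name__ == "__main__":
    samples = {
        "r1i1n2": "Swap:0M total,0M used,0M free,10563M cached",
        "r1i1n3": "Swap:0M total,0M used,0M free,23089M cached",
    }
    for node, line in samples.items():
        print(node, parse_swap_line(line), "swap configured:", has_swap(line))
```

A zero swap total on every node would support the reading that swap is simply not configured, rather than exhausted.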

Re: [Wien] time difference among nodes

2015-09-29 Thread Luis Ogando
Hi Lyudmila, Unfortunately, they do not have "top mode 1" output corresponding to the problem period. Thanks again. All the best, Luis 2015-09-29 10:37 GMT-03:00 Lyudmila Dobysheva : > 29.09.2015 14:57, Laurence Marks wrote: > >> If it happens again, one

Re: [Wien] time difference among nodes

2015-09-29 Thread Luis Ogando
Dear Prof. Marks, Thanks ! I will send your message to the administrators ! All the best, Luis 2015-09-29 8:57 GMT-03:00 Laurence Marks : > If it happens again, one thing to ask them to check is swap usage and how > much memory is cached. On

Re: [Wien] time difference among nodes

2015-09-29 Thread Luis Ogando
Hi Elias, There were no other jobs in the specific queue I was using and the nodes are dedicated to that queue, so, it was the opportunity to reboot them without furious reactions from other users. After trying everything suggested by the Wien2k community, the administrators resignedly

Re: [Wien] time difference among nodes

2015-09-29 Thread Laurence Marks
If it happens again, one thing to ask them to check is swap usage and how much memory is cached. On some of my nodes I have noticed that they do not always release cached memory, and can start swapping. If this happens the job will get very slow. The commands to use to clear the cache can be found
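A small sketch of the check described above (swap in use and how much memory is cached), assuming a Linux node where /proc/meminfo is readable. The function names are hypothetical and the sample text below is made up for illustration; on a real node one would pass the contents of /proc/meminfo instead.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_report(values):
    """Summarise swap in use and the share of RAM held as page cache."""
    swap_used = values.get("SwapTotal", 0) - values.get("SwapFree", 0)
    total = values.get("MemTotal", 0)
    cached = values.get("Cached", 0)
    cached_pct = 100.0 * cached / total if total else 0.0
    return {"swap_used_kb": swap_used, "cached_kb": cached, "cached_pct": cached_pct}


if __name__ == "__main__":
    # Made-up sample; on a node use: open("/proc/meminfo").read()
    sample = (
        "MemTotal:       1000 kB\n"
        "MemFree:         100 kB\n"
        "Cached:          500 kB\n"
        "SwapTotal:       200 kB\n"
        "SwapFree:        150 kB\n"
    )
    print(memory_report(parse_meminfo(sample)))
```

Running this on each node of the job at intervals would show whether one node keeps a much larger cache, or starts using swap, while the others do not.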

Re: [Wien] time difference among nodes

2015-09-28 Thread Luis Ogando
Dear Wien2k community, I would like to thank you for so many hints ! The problem is solved ! The solution was one suggested by Lyudmila Dobysheva : reboot the nodes. We will never know the origin of the problem, but, honestly, I do not care ! "There are more things in heaven and earth, Horatio,

Re: [Wien] time difference among nodes

2015-09-25 Thread Elias Assmann
Sounds like a nasty problem … In terms of strategy, I think the first thing should be to find out if the node is really to blame. If so, you have to convince the admins and/or find a way to avoid it. If not, you can turn to figuring out whatever

Re: [Wien] time difference among nodes

2015-09-25 Thread Pawel Lesniak
Hello, I'd suggest trying three things. First of all - does your cluster allow running interactive jobs? If yes, then you should create an interactive job to run /bin/bash. I'm not familiar with PBS, but in SGE/OGE if you print cluster queues with "qstat -f" you'll see "I" in column qtype

Re: [Wien] time difference among nodes

2015-09-24 Thread Elias Assmann
Luis, First of all, I wonder: To what extent is this problem reproducible? E.g., does your job always run on the same 4 nodes? Is it always the same node(s) that are slow? Does the problem also show up in other calculations (maybe just changing the

Re: [Wien] time difference among nodes

2015-09-24 Thread Luis Ogando
Dear Prof. Marks, As I suspected, users cannot use ganglia. Our administrators are very jealous !! Dear Elias Assmann, Many thanks for your comments. I will try to comment on some of them. > First of all, I wonder: To what extent is this problem reproducible? > E.g., does your job always

Re: [Wien] time difference among nodes

2015-09-23 Thread Laurence Marks
Ganglia is web based, you don't need ssh. Please read the link I sent. --- Professor Laurence Marks Department of Materials Science and Engineering Northwestern University http://www.numis.northwestern.edu Corrosion in 4D http://MURI4D.numis.northwestern.edu Co-Editor, Acta Cryst A "Research is

Re: [Wien] time difference among nodes

2015-09-23 Thread Laurence Marks
Nooo! You should use ganglia yourself. --- Professor Laurence Marks Department of Materials Science and Engineering Northwestern University http://www.numis.northwestern.edu Corrosion in 4D http://MURI4D.numis.northwestern.edu Co-Editor, Acta Cryst A "Research is to see what everybody else

Re: [Wien] time difference among nodes

2015-09-23 Thread Luis Ogando
OK ! In this case, I will try it ! Many thanks, Luis 2015-09-23 9:23 GMT-03:00 Laurence Marks : > Ganglia is web based, you don't need ssh. Please read the link I sent. > > --- > Professor Laurence Marks > Department of Materials Science and

Re: [Wien] time difference among nodes

2015-09-23 Thread Luis Ogando
Hi, I cannot access the nodes. SSH among them is forbidden ! We have to ask the administrators for anything !! It is hell !! Of course, only the PBS jobs can "travel" among the nodes. All the best, Luis 2015-09-23 9:14 GMT-03:00 Laurence Marks

Re: [Wien] time difference among nodes

2015-09-23 Thread Luis Ogando
Dear Prof. Marks, Thank you for your comment. I sent your suggestions to the administrators. All the best, Luis 2015-09-23 8:56 GMT-03:00 Laurence Marks : > It is hard to work this out remotely, particularly with unfriendly > sys_admin. > > I

Re: [Wien] time difference among nodes

2015-09-23 Thread Luis Ogando
Dear Prof. Blaha and Lyudmila Dobysheva, Many thanks for your comments ! Unfortunately, users have no privileges in the cluster. I will send your comments to the administrators and let's see what happens. Many thanks again, Luis

Re: [Wien] time difference among nodes

2015-09-23 Thread Laurence Marks
It is hard to work this out remotely, particularly with unfriendly sys_admin. I would find out if you have ganglia available, see http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi/linux/bks/SGI_Admin/books/ICEX_Admin_Guide/sgi_html/ch05.html#Z1190844523tls . This is much more useful than

Re: [Wien] time difference among nodes

2015-09-23 Thread Peter Blaha
Of course, at the "same time" ONLY lapw0_mpi OR lapw1_mpi should be running. However, I assume you did these "tops" sequentially one after the other ??? And of course, in an scf-cycle, after a few minutes running lapw0, lapw1 will start. Do these tests in several windows in parallel.

Re: [Wien] time difference among nodes

2015-09-23 Thread Lyudmila Dobysheva
22.09.2015 23:08, Luis Ogando wrote:
r1i1n1 - top - 17:40:46 up 12 days, 9 min, 2 users, load average: 10.55, 4.34, 1.74
Cpu(s):100.0%us, 0.0%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si,
r1i1n2 - top - 17:42:30 up 221 days, 6:29, 1 user, load average: 10.76,

Re: [Wien] time difference among nodes

2015-09-23 Thread Lyudmila Dobysheva
23.09.2015 12:22, Lyudmila Dobysheva wrote: > the jobs are all at one processor of the node To be sure, try this: in top at n2, type "1" to show individual CPU usage. It is better to do this after some time, once the starting phase has passed. 23.09.2015 11:25, Peter Blaha wrote: > With only a few

Re: [Wien] time difference among nodes

2015-09-23 Thread Lyudmila Dobysheva
22.09.2015 23:08, Luis Ogando wrote: r1i1n2
  PID USER   PR NI VIRT RES  SHR S %CPU %MEM TIME+   COMMAND
 2096 ogando 20  0 927m 642m 20m R    9  1.8 0:09.30 lapw1c_mpi
 2109 ogando 20  0 926m 633m 17m R    9  1.8 0:14.58 lapw1c_mpi
 2122 ogando 20  0 924m 633m

Re: [Wien] time difference among nodes

2015-09-23 Thread Lyudmila Dobysheva
23.09.2015 12:20, Lyudmila Dobysheva wrote: the jobs are all at one node at one processor of the node, of course Lyudmila Dobysheva -- Phys.-Techn. Institute of Ural Br. of Russian Ac. of Sci. 426001 Izhevsk, ul.Kirova 132

Re: [Wien] time difference among nodes

2015-09-23 Thread Peter Blaha
With only a few seconds cpu time, the job is just in the starting phase (allocating memory, reading files, distributing data) and thus cpu-load is very low. A few seconds later, this should reach about 100 % for each lapw1_mpi. On 09/23/2015 11:20 AM, Lyudmila Dobysheva wrote: 22.09.2015
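The rule of thumb above (low CPU load is normal in the starting phase, but each lapw1_mpi should reach about 100 % afterwards) could be sketched as a filter over top's process rows. This is only an illustration: the function names, the 90 % threshold and the 60 s warm-up are assumptions, the column order is the standard top layout (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND), and the last sample row is made up.

```python
def cpu_seconds(time_plus):
    """Convert a top TIME+ value such as '0:09.30' to seconds."""
    total = 0.0
    for part in time_plus.split(":"):
        total = total * 60 + float(part)
    return total


def parse_top_row(row):
    """Return pid, %CPU, accumulated CPU seconds and command, or None."""
    fields = row.split()
    if len(fields) < 12:
        return None
    try:
        return {
            "pid": int(fields[0]),
            "cpu": float(fields[8]),
            "seconds": cpu_seconds(fields[10]),
            "command": fields[11],
        }
    except ValueError:
        # Header lines and other non-process rows end up here.
        return None


def underused(rows, min_cpu=90.0, warmup_s=60.0):
    """PIDs that are past the assumed warm-up time but still below min_cpu."""
    flagged = []
    for row in rows:
        proc = parse_top_row(row)
        if proc is None:
            continue
        if proc["seconds"] >= warmup_s and proc["cpu"] < min_cpu:
            flagged.append(proc["pid"])
    return flagged


if __name__ == "__main__":
    rows = [
        "PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND",
        "2096 ogando 20 0 927m 642m 20m R 9 1.8 0:09.30 lapw1c_mpi",
        "2109 ogando 20 0 926m 633m 17m R 9 1.8 0:14.58 lapw1c_mpi",
        # Made-up row: same low load, but after 12 minutes of CPU time.
        "3001 ogando 20 0 927m 642m 20m R 9 1.8 12:00.00 lapw1c_mpi",
    ]
    print(underused(rows))
```

With these thresholds the two processes quoted from r1i1n2 are not flagged, since they have only accumulated a few seconds of CPU time; only a process that stays at low load well past start-up would be.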

Re: [Wien] time difference among nodes

2015-09-21 Thread Luis Ogando
Dear Prof. Marks, Many thanks for your help. The administrators said that everything is OK, the software is the problem (the easy answer) : no zombies, no other jobs in the node, ... !! Let me give you more information to see if you can imagine other possibilities: 1) Intel Xeon Six

Re: [Wien] time difference among nodes

2015-09-21 Thread Peter Blaha
a) Check your .machines file. Does it meet your expectations, or does this node carry too large a load? b) Can you log in interactively to these nodes while your job is running? If yes, log in on 2 nodes (in two windows) and run top. c) If nothing obvious is wrong so far, test the network by doing

Re: [Wien] time difference among nodes

2015-09-21 Thread Luis Ogando
Dear Professor Blaha, Thank you ! My .machines file is OK. I will ask the administrator to follow your other suggestions (users do not have privileges). All the best, Luis 2015-09-21 10:22 GMT-03:00 Peter Blaha : > a) Check your .machines

Re: [Wien] time difference among nodes

2015-09-18 Thread Laurence Marks
Almost certainly one or more of:
* Other jobs on the node
* Zombie process(es)
* Too many mpi
* Bad memory
* Full disc
* Too hot
If you have it, use ganglia; if not, ssh in and use top/ps or whatever SGI has. If you cannot sudo, get help from someone who can. On Sep 18, 2015 8:58 PM, "Luis Ogando"
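One item from the list above, zombie processes, is easy to check from ps output. A minimal sketch, assuming the text comes from `ps -eo pid,stat,comm` (a state beginning with "Z" marks a zombie); the function name is hypothetical and the sample output is made up.

```python
def find_zombies(ps_output):
    """Return (pid, command) pairs whose process state starts with 'Z'."""
    zombies = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue
        pid, stat, command = parts
        if stat.startswith("Z"):
            zombies.append((int(pid), command.strip()))
    return zombies


if __name__ == "__main__":
    # Made-up sample; on a node, capture the output of: ps -eo pid,stat,comm
    sample = (
        "  PID STAT COMMAND\n"
        "    1 Ss   systemd\n"
        " 2096 R    lapw1c_mpi\n"
        " 2200 Z    lapw1c_mpi <defunct>\n"
    )
    print(find_zombies(sample))
```

The other items (other users' jobs, full disc, temperature) need different tools, so this covers only one line of the checklist.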