22.09.2015 23:08, Luis Ogando wrote:
r1i1n1 -------------
top - 17:40:46 up 12 days, 9 min,  2 users,  load average: 10.55, 4.34, 1.74
Cpu(s):100.0%us,  0.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,
r1i1n2 -------------
top - 17:42:30 up 221 days,  6:29,  1 user,  load average: 10.76, 9.59, 8.79
Cpu(s):  7.5%us,  0.1%sy,  0.0%ni, 92.4%id,  0.0%wa,  0.0%hi,  0.0%si,
r1i1n3 -------------
top - 17:42:50 up 56 days,  3:25,  1 user,  load average: 10.57, 6.02, 2.59
Cpu(s): 99.5%us,  0.4%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,

1) The first difference I see: the node in question has not been restarted for 221 days. I would start by rebooting it (the problem may disappear, and you may never find out why it happened).

2) You didn't check:
>         2015-09-18 23:24 GMT-03:00 Laurence Marks
>              * Bad memory
>              * Full disc

Try "df" on n2 and on some other node for comparison. Check the output and send it.
Check which working directory is used on the nodes (there should be something like "export SCRATCH=./" in .bashrc; run "set > aaa" and look at the variable SCRATCH in the file aaa). Compare it with the output of df.
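
The check above can be sketched as one small script to run on each node. This is only a sketch: the function name scratch_report is made up for illustration, and it assumes that SCRATCH is exported in the environment of the shell that runs it (if it is not set, the current directory is used instead).

```shell
#!/bin/sh
# Hypothetical helper (not part of WIEN2k): show where SCRATCH points
# on this node and how full the filesystem holding it is.
scratch_report() {
    dir="${SCRATCH:-.}"            # assumption: fall back to "." if SCRATCH is unset
    echo "host:    $(uname -n)"    # node name, to tell the outputs apart
    echo "SCRATCH: $dir"
    df -P "$dir"                   # -P: POSIX format, one line per filesystem
}

scratch_report
```

Run it on n2 and on one of the fast nodes, then compare the Capacity column of the two outputs.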

3) Just to be sure: you showed us top for user ogando only. I hope you really checked that there were no other users (in top on n2 press "u" and leave the answer to "Which user (blank for all)" blank). It reports "1 user", but there should be at least root, syslog, statd and so forth.
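
A non-interactive way to do the same check is sketched below. It is only an illustration: the function name cpu_by_user is made up, and the script assumes a ps that accepts the POSIX options -e and -o with the fields user and pcpu. It sums the CPU share of all processes per user, so anybody else loading the node shows up at once.

```shell
#!/bin/sh
# Hypothetical helper: read lines of the form "user pcpu" on stdin
# and print the total %CPU per user, largest first.
cpu_by_user() {
    awk '{ cpu[$1] += $2 }
         END { for (u in cpu) printf "%s %.1f\n", u, cpu[u] }' |
        sort -k2,2nr
}

# All processes of all users on this node (skipped if ps is not installed).
if command -v ps >/dev/null 2>&1; then
    ps -e -o user= -o pcpu= | cpu_by_user
fi
```

On n2 the interesting part is any user other than ogando with a noticeable total.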

>    We also have the first two nodes executing lapw0_mpi while the other
> two are executing lapw1c_mpi. Is this normal ?

I do not know. It looks suspicious but, IMHO, it is not connected with the problem under discussion.

Best wishes
  Lyudmila Dobysheva

    On 09/21/2015 02:51 PM, Luis Ogando wrote:
        7) The mystery: two weeks ago, everything was working properly!!

    On Sep 18, 2015 8:58 PM, "Luis Ogando" wrote:
        I am using Wien2k in a SGI cluster with 32 nodes. My calculation is
        running in 4 nodes that have the same characteristics and only my job
        is running in these 4
        I noticed that one of these 4 nodes is spending more than 20 times the
        time spent by the other 3 nodes in the run_lapw execution.
        Could someone imagine a reason for this? Any advice?

Phys.-Techn. Institute of Ural Br. of Russian Ac. of Sci.
426001 Izhevsk, ul.Kirova 132
Tel.:7(3412) 432045(office), 722529(Fax)
E-mail: l...@ftiudm.ru, lyuk...@mail.ru (office)
        lyuk...@gmail.com (home)
Skype:  lyuka17 (home), lyuka18 (office)