a) Check your .machines file. Does it meet your expectations, or is this node assigned too large a load?
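
For reference, a k-point-parallel .machines file usually has one line per
parallel job; the hostnames below are hypothetical, so compare against your
own file:

   1:node01
   1:node02
   1:node03
   1:node04
   granularity:1
   extrafine:1

If one hostname appears much more often than the others, that node carries a
correspondingly larger share of the work.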

b) Can you interactively log in to these nodes while your job is running?
If yes, log in on 2 nodes (in two windows) and run    top
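
If an interactive top session is inconvenient, a one-shot batch query gives
the same information (the node name is hypothetical):

   ssh node03 'uptime; top -b -n 1 | head -20'

Compare the load average and the per-process CPU usage between a fast node
and the slow one; on the slow node, look for processes you do not own, or
lapw processes stuck near 0% CPU.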

c) If nothing obvious is wrong so far, test the network by copying some larger files from/to these nodes from your $home (or $scratch) to see if file I/O is killing you.
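
A minimal timing sketch for this, assuming $SCRATCH points to your scratch
directory and again using a hypothetical node name (GNU dd reports the
throughput when it finishes):

   # write ~1 GB to scratch on the suspect node, then read it back
   ssh node03 'dd if=/dev/zero of=$SCRATCH/iotest bs=1M count=1024'
   ssh node03 'dd if=$SCRATCH/iotest of=/dev/null bs=1M && rm $SCRATCH/iotest'

If the slow node's throughput is far below that of the other three, the disk
or the interconnect is the suspect rather than WIEN2k.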


On 09/21/2015 02:51 PM, Luis Ogando wrote:
Dear Prof. Marks,

    Many thanks for your help.
    The administrators said that everything is OK and that the software is
the problem (the easy answer): no zombies, no other jobs on the node, ... !!
    Let me give you more information to see if you can imagine other
possibilities:

1) Intel Xeon Six Core 5680, 3.33GHz

2) Intel(R) Fortran/CC/OpenMPI Intel(R) 64 Compiler XE for applications
running on Intel(R) 64, Version 12.1.1.256 Build 20111011

3) OpenMPI 1.6.5

4) PBS Pro 11.0.2

5) OpenMPI built using  --with-tm , because ssh among the nodes is prohibited
( http://www.open-mpi.org/faq/?category=building#build-rte-tm ); see the
configure sketch after this list

6) Wien2k 14.2

7) The mystery: two weeks ago, everything was working properly !!
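
Regarding point 5, a minimal sketch of such a TM-enabled build follows; the
install prefix and the PBS path are assumptions and must match the local
installation:

   ./configure --prefix=/opt/openmpi-1.6.5 --with-tm=/opt/pbs \
               CC=icc CXX=icpc F77=ifort FC=ifort
   make -j4 && make install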

    Many thanks again !
    All the best,
                    Luis

2015-09-18 23:24 GMT-03:00 Laurence Marks <laurence.ma...@gmail.com>:

    Almost certainly one or more of:
    * Other jobs on the node
    * Zombie process(es)
    * Too many MPI processes
    * Bad memory
    * Full disc
    * Too hot

    If you have it, use Ganglia; if not, ssh in and use top/ps or whatever
    SGI provides. If you cannot sudo, get help from someone who can. A quick
    sweep over the suspects above is sketched below.
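
    For example, run on the slow node (the sensors command assumes
    lm-sensors is installed):

       uptime                    # load average; should not exceed the core count
       ps aux | awk '$8 ~ /Z/'   # zombie processes (state Z)
       free -m                   # memory in MB; watch for heavy swapping
       df -h                     # full filesystems
       sensors                   # CPU temperatures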

    On Sep 18, 2015 8:58 PM, "Luis Ogando" <lcoda...@gmail.com> wrote:

        Dear Wien2k community,

            I am using Wien2k on an SGI cluster with 32 nodes. My
        calculation is running on 4 nodes that have the same
        characteristics, and only my job is running on these 4 nodes.
            I noticed that one of these 4 nodes is spending more than 20
        times the time spent by the other 3 nodes in the run_lapw execution.
            Could someone imagine a reason for this? Any advice?
            All the best,
                     Luis


--

                                      P.Blaha
--------------------------------------------------------------------------
Peter BLAHA, Inst.f. Materials Chemistry, TU Vienna, A-1060 Vienna
Phone: +43-1-58801-165300             FAX: +43-1-58801-165982
Email: bl...@theochem.tuwien.ac.at    WIEN2k: http://www.wien2k.at
WWW:   http://www.imc.tuwien.ac.at/staff/tc_group_e.php
--------------------------------------------------------------------------
_______________________________________________
Wien mailing list
Wien@zeus.theochem.tuwien.ac.at
http://zeus.theochem.tuwien.ac.at/mailman/listinfo/wien
SEARCH the MAILING-LIST at:  
http://www.mail-archive.com/wien@zeus.theochem.tuwien.ac.at/index.html
