To clear things up:

I can still run a hello world on all 16 threads, but after a few more
repetitions of the example the kernel crashes :(

fcluster@agua:~$ mpirun --hostfile localhostfile -np 16 testMPI/hola
Process 0 on agua out of 16
Process 2 on agua out of 16
Process 14 on agua out of 16
Process 8 on agua out of 16
Process 1 on agua out of 16
Process 7 on agua out of 16
Process 9 on agua out of 16
Process 3 on agua out of 16
Process 4 on agua out of 16
Process 10 on agua out of 16
Process 15 on agua out of 16
Process 5 on agua out of 16
Process 6 on agua out of 16
Process 11 on agua out of 16
Process 13 on agua out of 16
Process 12 on agua out of 16
fcluster@agua:~$
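
(For reference, hola is basically the standard MPI hello world -- the sketch
below is only my assumption of what testMPI/hola looks like, the actual source
may differ slightly:)

#include <stdio.h>
#include <mpi.h>

/* minimal MPI hello world -- assumed to be roughly what testMPI/hola does */
int main(int argc, char *argv[])
{
    int rank, size, len;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Get_processor_name(name, &len);     /* host name, e.g. "agua" */

    printf("Process %d on %s out of %d\n", rank, name, size);

    MPI_Finalize();
    return 0;
}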



On Wed, Jul 28, 2010 at 2:47 PM, Cristobal Navarro <axisch...@gmail.com> wrote:

>
>
> On Wed, Jul 28, 2010 at 11:09 AM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>
>> Hi Cristobal
>>
>> In case you are not using full path name for mpiexec/mpirun,
>> what does "which mpirun" say?
>>
>
> --> $which mpirun
>       /opt/openmpi-1.4.2
>
>>
>> Oftentimes this is a source of confusion; old versions may
>> be first on the PATH.
>>
>> Gus
>>
>
> The Open MPI version problem is now gone; I can confirm that the version is
> consistent now :), thanks.
>
> However, I keep getting this kernel crash randomly when I execute with -np
> higher than 5.
> These are Xeons with Hyper-Threading on; is that a problem?
>
> I'm trying to locate the kernel error in the logs, but after rebooting from a
> crash, the error is not in kern.log (nor in kern.log.1).
> All I remember is that it starts with "Kernel BUG..."
> and at some point it mentions a certain CPU X, where that CPU can be any from 0
> to 15 (I'm testing only on the main node). Does anyone know where the kernel
> error log could be?
>
>>
>> Cristobal Navarro wrote:
>>
>>>
>>> On Tue, Jul 27, 2010 at 7:29 PM, Gus Correa <g...@ldeo.columbia.edu> wrote:
>>>
>>>    Hi Cristobal
>>>
>>>    Does it run only on the head node alone?
>>>    (Fuego? Agua? Acatenango?)
>>>    Try to put only the head node on the hostfile and execute with
>>> mpiexec.
>>>
>>> --> I will try with only the head node, and post results back.
>>>    This may help sort out what is going on.
>>>    Hopefully it will run on the head node.
>>>
>>>    Also, do you have InfiniBand connecting the nodes?
>>>    The error messages refer to the openib btl (i.e., InfiniBand),
>>>    and complain of
>>>
>>>
>>> --> No, we are just using a normal 100 Mbit/s network, since I am only
>>> testing for now.
>>>
>>>
>>>    "perhaps a missing symbol, or compiled for a different
>>>    version of Open MPI?".
>>>    It sounds like a mixup of versions/builds.
>>>
>>>
>>> --> I agree; somewhere there must be remains of the older version.
>>>
>>>    Did you configure/build OpenMPI from source, or did you install
>>>    it with apt-get?
>>>    It may be easier/less confusing to install from source.
>>>    If you did, what configure options did you use?
>>>
>>>
>>> --> I installed from source: ./configure --prefix=/opt/openmpi-1.4.2
>>> --with-sge --without-xgid --disable-static
>>>
>>>    Also, as for the OpenMPI runtime environment,
>>>    it is not enough to set it on
>>>    the command line, because it will be effective only on the head node.
>>>    You need to either add them to the PATH and LD_LIBRARY_PATH
>>>    on your .bashrc/.cshrc files (assuming these files and your home
>>>    directory are *also* shared with the nodes via NFS),
>>>    or use the --prefix option of mpiexec to point to the OpenMPI main
>>>    directory.
>>>
>>>
>>> --> Yes, all nodes have their PATH and LD_LIBRARY_PATH set up properly in
>>> the login scripts (.bashrc in my case).
>>>
>>>    Needless to say, you need to check and ensure that the OpenMPI
>>>    directory (and maybe your home directory, and your work directory)
>>>    is (are)
>>>    really mounted on the nodes.
>>>
>>>
>>> --> Yes, I double-checked that they are.
>>>
>>>    I hope this helps,
>>>
>>>
>>> --> Thanks, really!
>>>
>>>    Gus Correa
>>>
>>> Update: I just reinstalled Open MPI with the same parameters, and it
>>> seems that the problem is gone. I couldn't test it entirely, but I'll
>>> confirm when I get back to the lab.
>>>
>>> Best regards! Cristobal
>>>
>>>
>
>
