Pak Lui wrote:
Prakash,

tm_poll: protocol number dis error 11
ret is 17002 instead of 0: tm_init failed
3 processes killed (possibly by Open MPI)

I encountered a similar problem with OpenPBS before, which also uses the TM interface. It returned TM_ENOTCONNECTED (17002) when I tried to call tm_init a second time (tm_init in turn calls tm_poll, which returned that errno).

I think what you did was call tm_init from another node and connect to another MOM, which I do not think is allowed. The TM module in Open MPI has already called tm_init once. I am curious why you need to call tm_init again.

If you are curious about the PBS implementation, you can download the source from openpbs.org. OpenPBS source: v2.3.16/src/lib/Libifl/tm.c

I am interested in getting this to work because I am implementing support for dynamic scheduling in Torque. I want any node in an MPI-2 job (basically the Open MPI implementation) to be able to request more nodes from the Torque/PBS server. I am doing a small study on that right now. Rather than having nodes talk directly to the server, I want them to talk to the Mother Superior (MS), which will in turn talk to the server.

Could you please explain why this does not work now? And why does it work when I call tm_init from the MS, but fail from any other MOM?

Thanks,
Prakash
