I'm afraid the rank-file mapper in 1.3.3 has several known problems
that users have described on the list. Hopefully those are fixed in
the upcoming 1.3.4 release.
On Aug 31, 2009, at 10:01 AM, Sacerdoti, Federico wrote:
Hi,
I am trying to use the rankmap to bind a 4-proc MPI job to one
socket of a two-socket, 8-core machine. However, I'm getting a
strange error.
CMDS USED
orterun --hostfile hostlist.1 -n 4 --mca rmaps_rank_file_path ./rankmap.1 \
    desres-netscan -o $OUTDIR
$ cat rankmap.1
rank 0=drdb0235.en slot=0:0
rank 1=drdb0235.en slot=0:1
rank 2=drdb0235.en slot=0:2
rank 3=drdb0235.en slot=0:3
$ cat hostlist.1
drdb0235.en slots=8
ERROR SEEN
--------------------------------------------------------------------------
Rankfile claimed host drdb0235.en that was not allocated or
oversubscribed it's slots:
--------------------------------------------------------------------------
[drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG:
Bad parameter in file rmaps_rank_file.c at line 108
[drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG:
Bad parameter in file base/rmaps_base_map_job.c at line 87
[drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG:
Bad parameter in file base/plm_base_launch_support.c at line 77
[drdb0235.en.desres.deshaw.com:14242] [[37407,0],0] ORTE_ERROR_LOG:
Bad parameter in file plm_rsh_module.c at line 985
From looking at the code in rmaps_rank_file.c, it seems the error
occurs when the node-gathering code wraps twice around the hostlist.
However, I don't see why that is happening.
If I specify 8 slots in the rankmap, I see a different error: Error,
invalid rank (4) in the rankfile (./rankmap.1)
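
For anyone wanting to rule out a plain mismatch between the two files before suspecting the mapper, here is a minimal sketch of a cross-check in Python. It assumes only the simple `rank N=host slot=S:C` and `host slots=K` layouts shown above; the function names (`parse_rankfile`, `parse_hostfile`, `check`) are made up for illustration, and a host line with no `slots=` value is assumed to mean one slot. It is not Open MPI's parser and does not reproduce the logic in rmaps_rank_file.c, so a clean result here says nothing about the 1.3.3 mapper itself.

```python
import re
from collections import Counter

# Patterns for the two simple line layouts shown in this post (an assumption;
# the real file formats accept more variants than this sketch handles).
RANK_RE = re.compile(r"^rank\s+(\d+)\s*=\s*(\S+)\s+slot\s*=\s*(\S+)$")
HOST_RE = re.compile(r"^(\S+)(?:\s+slots\s*=\s*(\d+))?$")

RANKFILE = """\
rank 0=drdb0235.en slot=0:0
rank 1=drdb0235.en slot=0:1
rank 2=drdb0235.en slot=0:2
rank 3=drdb0235.en slot=0:3
"""
HOSTFILE = "drdb0235.en slots=8\n"


def _lines(text):
    """Yield stripped, non-empty, non-comment lines."""
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            yield line


def parse_rankfile(text):
    """Return {rank: (host, slot_spec)}."""
    entries = {}
    for line in _lines(text):
        m = RANK_RE.match(line)
        if m is None:
            raise ValueError(f"unrecognized rankfile line: {line!r}")
        entries[int(m.group(1))] = (m.group(2), m.group(3))
    return entries


def parse_hostfile(text):
    """Return {host: slots}; a missing slots= is treated as 1 (assumption)."""
    slots = {}
    for line in _lines(text):
        m = HOST_RE.match(line)
        if m is None:
            raise ValueError(f"unrecognized hostfile line: {line!r}")
        slots[m.group(1)] = int(m.group(2)) if m.group(2) else 1
    return slots


def check(rank_entries, host_slots, nprocs):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    missing = [r for r in range(nprocs) if r not in rank_entries]
    if missing:
        problems.append(f"ranks with no rankfile entry: {missing}")
    extra = sorted(r for r in rank_entries if r >= nprocs)
    if extra:
        problems.append(f"rankfile ranks not below -n {nprocs}: {extra}")
    per_host = Counter(host for host, _ in rank_entries.values())
    for host, count in per_host.items():
        if host not in host_slots:
            problems.append(f"host {host} is not in the hostfile")
        elif count > host_slots[host]:
            problems.append(
                f"host {host} has {count} ranks but {host_slots[host]} slots"
            )
    return problems


if __name__ == "__main__":
    print(check(parse_rankfile(RANKFILE), parse_hostfile(HOSTFILE), nprocs=4))
```

With the files from this post and `-n 4` the check reports no problems, which suggests the inputs themselves are consistent. A rankfile listing ranks 0 through 7 would be flagged for ranks 4 and above under `-n 4`; that lines up with the "invalid rank (4)" message, though whether it is the actual cause is not confirmed here.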
Thanks,
Federico