Thank you for your suggestion.
I have more information, a possible explanation, and more questions.
It looks like NUMA plays a big role here.
In summary, the synchronization overhead of MPI file I/O across sockets
appears to be much higher than the overhead among processes within a
single socket.
But it still looks too big.

I ran IOR on a Lustre parallel file system to test single-shared-file
write performance.
With "--map-by socket --bind-to socket" the performance is poor:
write performance gets worse as the number of MPI processes increases.
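
For reference, the first run looked roughly like this; the process
count, block/transfer sizes, and output path are placeholders, not my
exact settings:

  mpirun -np 64 --map-by socket --bind-to socket \
      ior -a MPIIO -w -o /lustre/testdir/shared_file -b 16m -t 1m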

When I switched to "--map-by core --bind-to socket", the results got
better: a couple of times faster as the number of MPI processes
increases, but still not scalable.
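
The only change from the sketch above was the mapping option, roughly:

  mpirun -np 64 --map-by core --bind-to socket \
      ior -a MPIIO -w -o /lustre/testdir/shared_file -b 16m -t 1m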

I think the (sync) overhead of MPI file I/O across sockets is too big.
I'm running the test on a big single shared-memory machine, and I'm not
sure whether this only happens on such a machine.
Is there any way to reduce the sync overhead of MPI file I/O across
sockets on a shared-memory machine?

Thanks,
David
