Thank you for your suggestion. I have more information, a possible explanation, and more questions. It looks like NUMA plays a big role here. In summary, the synchronization overhead of MPI file I/O across sockets appears to be much higher than the overhead among processes within a socket, though the gap still seems too large.
I ran IOR on a Lustre parallel file system to test single-shared-file write performance (the invocations are sketched below for reference). With "--map-by socket --bind-to socket", performance is poor, and write bandwidth gets worse as the number of MPI processes increases. When I switched to "--map-by core --bind-to socket", the results improved, a few times faster as the number of MPI processes increases, but it is still not scalable. I suspect the synchronization overhead of MPI file I/O across sockets is too high. I am running the test on a large single shared-memory machine, and I am not sure whether this behavior is specific to such a machine. Is there any way to reduce the synchronization overhead of MPI file I/O across sockets on a shared-memory machine?
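For reference, the invocations were roughly along the following lines; the process count, block/transfer sizes, and file path here are placeholders rather than the exact values from my runs:

  # Slow case: ranks distributed round-robin across sockets
  mpirun -np 32 --map-by socket --bind-to socket \
      ior -a MPIIO -w -b 1g -t 1m -o /lustre/scratch/ior_testfile

  # Faster case: ranks packed onto cores, still bound to their socket
  mpirun -np 32 --map-by core --bind-to socket \
      ior -a MPIIO -w -b 1g -t 1m -o /lustre/scratch/ior_testfile

(No -F flag, so all ranks write to a single shared file through MPI-IO.)

Thanks, David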