It seems I misunderstood something about attaching files. And sorry for the 
footer; I used my company email so that I also get answers while I am at work.

Here is the valgrind output: https://pastebin.com/Wwvn8Pa7
Here is the ompi_info --all output: https://pastebin.com/FW0fazZH
Here is the gdb output: https://pastebin.com/4fNsxUd1

From: users <users-boun...@lists.open-mpi.org> On Behalf Of Alexander Stadik via users
Sent: Thursday, 3 February 2022 14:06
To: users@lists.open-mpi.org
Cc: Alexander Stadik <alexander.sta...@essteyr.com>
Subject: [OMPI users] cuda-aware OpenMPI - high number of small asynch sent messages create invalid write


Hello whoever reads this,

I am running my code using CUDA-aware Open MPI (see ompi_info --all attached).
First I will explain the problem; further down I will give additional 
information about versions, hardware and debugging.

The Problem:

My application solves multiple mathematical equations on GPU via CUDA. 
Multi-GPU capability is enabled via CUDA-aware Open MPI, where I send many 
chunks of data from sections of one simulation domain to a simple halo buffer 
on neighbouring processes (so between the domain partitions mapped to each 
MPI process).
For all other kinds of cases there are no issues at all. The basic installation 
seems fine and there seems to be no bug, invalid memory access or similar 
problem in the code (which I also verified for the problem case by hand and 
via debuggers).
The error I get is a SEGFAULT when calling many MPI_Isend and MPI_Irecv 
operations (see the code below) directly followed by an MPI_Waitall. The 
interesting part is that all accessed data in the device array is allocated 
and initialized (which I verified).
It happens randomly at different operations in the loop, but always at MPI_Isend.

Additionally, I should note that for the problem case each process handles 
about 765 requests per halo exchange and Waitall, and the error only occurs 
starting from 6 processes (and GPUs), which means about 4590 total requests on 
the node. It fails during the first execution of the routine, at a random 
point in the loop. The simplified code looks like this:

=====================================

MPI_Request Requ[req_count];
MPI_Status  Stat[req_count];

LoopAllTasks
{
    // ... redefine source, destination, tags and request numbers
    // Receive halos backwards
    MPI_Irecv(devPtr + recv_config.shift_right, recv_config.halo_right, MPI_FLOAT,
              src, tag, MPI_COMM_WORLD_, &Requ[req1]);
    // Send halos forwards
    MPI_Isend(devPtr + send_config.shift_right, send_config.halo_right, MPI_FLOAT,
              dst, tag, MPI_COMM_WORLD_, &Requ[req2]);

    // ... redefine source, destination, tags and request numbers
    // Receive halos forwards
    MPI_Irecv(devPtr + recv_config.shift_left, recv_config.halo_left, MPI_FLOAT,
              src, tag, MPI_COMM_WORLD_, &Requ[req3]);
    // Send halos backwards
    MPI_Isend(devPtr + send_config.shift_left, send_config.halo_left, MPI_FLOAT,
              dst, tag, MPI_COMM_WORLD_, &Requ[req4]);
}
MPI_Waitall(req_count, Requ, Stat);
=====================================
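
For reference, a minimal self-contained sketch of the same communication 
pattern would look roughly like the following (illustrative only, not my 
actual application code; the chunk count, chunk length and tags are made-up 
placeholders): every rank posts many very small non-blocking sends and 
receives on one cudaMalloc'd buffer towards its two ring neighbours and then 
waits on all requests with a single MPI_Waitall.

=====================================
/* Hypothetical standalone reproducer sketch, assuming a CUDA-aware Open MPI.
 * Build e.g.: mpicc repro.c -o repro -lcudart */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define N_CHUNKS  200   /* chunks per direction -> 800 requests per rank (made up) */
#define CHUNK_LEN 2     /* a few floats per chunk, i.e. very small messages */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;

    /* One device buffer per rank; every chunk is a sub-range of this buffer. */
    float *devPtr;
    size_t n_floats = (size_t)4 * N_CHUNKS * CHUNK_LEN;
    cudaMalloc((void **)&devPtr, n_floats * sizeof(float));
    cudaMemset(devPtr, 0, n_floats * sizeof(float));

    int req_count = 4 * N_CHUNKS;   /* 2 recvs + 2 sends per chunk */
    MPI_Request *Requ = malloc(req_count * sizeof(MPI_Request));
    MPI_Status  *Stat = malloc(req_count * sizeof(MPI_Status));

    int r = 0;
    for (int i = 0; i < N_CHUNKS; ++i) {
        float *chunk = devPtr + (size_t)4 * i * CHUNK_LEN;
        /* receive from / send to the right neighbour */
        MPI_Irecv(chunk + 0 * CHUNK_LEN, CHUNK_LEN, MPI_FLOAT, right, i,
                  MPI_COMM_WORLD, &Requ[r++]);
        MPI_Isend(chunk + 1 * CHUNK_LEN, CHUNK_LEN, MPI_FLOAT, right, i,
                  MPI_COMM_WORLD, &Requ[r++]);
        /* receive from / send to the left neighbour */
        MPI_Irecv(chunk + 2 * CHUNK_LEN, CHUNK_LEN, MPI_FLOAT, left, i,
                  MPI_COMM_WORLD, &Requ[r++]);
        MPI_Isend(chunk + 3 * CHUNK_LEN, CHUNK_LEN, MPI_FLOAT, left, i,
                  MPI_COMM_WORLD, &Requ[r++]);
    }
    MPI_Waitall(req_count, Requ, Stat);

    if (rank == 0) printf("halo exchange completed\n");

    free(Requ);
    free(Stat);
    cudaFree(devPtr);
    MPI_Finalize();
    return 0;
}
=====================================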
What I found until now:

Attached are the outputs of valgrind, gdb and ompi_info --all (sorry for the 
bad quality, I had to optimize the picture to fit into this mail).
It doesn't seem to have anything to do with the request array or the allocated 
device memory, as all accesses are in range. Also, in the real code I use 
wrappers over MPI_Isend and MPI_Irecv, which means any object passed is either 
a copy or a valid pointer.
Valgrind (see attached file) identified it as an 'invalid write of size 8', 
while I only operate on floats, and it only happens in MPI_Isend, not MPI_Irecv.
From gdb and cuda-gdb I could identify that it happens for small messages of 
3-7 bytes. Also, they are perfectly in range of the allocated global memory.
I couldn't find any similar issue reported, and I also couldn't find any 
documentation on internal limitations and how to change them.
Simply reducing the number of requests by half, by using blocking MPI_Send 
instead of MPI_Isend, fixed the issue. But I would like to understand the 
underlying behaviour, so this is not an acceptable solution.
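
Concretely, the workaround looks roughly like this (again only a sketch of the 
change, not the literal code): the two sends in the loop become blocking 
MPI_Send calls without a request, so only the MPI_Irecv requests remain and 
MPI_Waitall sees half as many of them.

=====================================
// Workaround sketch: blocking sends, only the receives stay non-blocking,
// so req_count is cut in half.
MPI_Irecv(devPtr + recv_config.shift_right, recv_config.halo_right, MPI_FLOAT,
          src, tag, MPI_COMM_WORLD_, &Requ[req1]);
MPI_Send (devPtr + send_config.shift_right, send_config.halo_right, MPI_FLOAT,
          dst, tag, MPI_COMM_WORLD_);

MPI_Irecv(devPtr + recv_config.shift_left, recv_config.halo_left, MPI_FLOAT,
          src, tag, MPI_COMM_WORLD_, &Requ[req2]);
MPI_Send (devPtr + send_config.shift_left, send_config.halo_left, MPI_FLOAT,
          dst, tag, MPI_COMM_WORLD_);
...
MPI_Waitall(req_count, Requ, Stat);   // req_count now only counts the Irecvs
=====================================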

The problem occurs both on a CentOS cluster node using Open MPI 3.0.0 + CUDA 
11.0 with 6-8 dedicated GPUs (all GTX 1080 Ti), and on an Ubuntu 20.04 machine 
with Open MPI 4.1.0 and 4.1.2 + CUDA 11.2 and 11.4 with 2 dedicated GPUs (RTX 
2080) as well as overloaded GPUs.

I would be glad about any input, thanks a lot
Alex
