Re: [OMPI users] openib issues
Hi Mike,

The HCA card is a Mellanox Technologies MT25418 (ConnectX IB DDR, PCIe 2.0 2.5GT/s, rev a0).

I cannot post code or instructions to reproduce these errors, as they appeared randomly during tests I ran to locate the origin of a segmentation fault during an MPI collective call.

Have you ever experienced such issues? And do you know what these messages mean?

Regards,
Eloi

On Tuesday 10 August 2010 13:19:45 Mike Dubman wrote:
> Hey Eloi,
>
> What HCA card do you have? Can you post code/instructions on how to
> reproduce it?
> 10x
> Mike

--
Eloi Gaudry
Free Field Technologies
Company Website: http://www.fft.be
Company Phone: +32 10 487 959
Re: [OMPI users] openib issues
Hey Eloi,

What HCA card do you have? Can you post code/instructions on how to reproduce it?

10x
Mike

On Mon, Aug 9, 2010 at 5:22 PM, Eloi Gaudry wrote:
> Hi,
>
> Could someone have a look at these two different error messages? I'd like
> to know the reason(s) why they were displayed and their actual meaning.
>
> Thanks,
> Eloi
>
> ___
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
Re: [OMPI users] openib issues
Hi,

Could someone have a look at these two different error messages? I'd like to know the reason(s) why they were displayed and their actual meaning.

Thanks,
Eloi

On Monday 19 July 2010 16:38:57 Eloi Gaudry wrote:
> I'd like to know what these two errors mean and where they come from.

--
Eloi Gaudry
Free Field Technologies
Company Website: http://www.fft.be
Company Phone: +32 10 487 959
[OMPI users] openib issues
Hi,

I've been working on a random segmentation fault that seems to occur during a collective communication when using the openib btl (see [OMPI users] [openib] segfault when using openib btl).

During my tests, I've come across different issues reported by OpenMPI-1.4.2:

1/
[[12770,1],0][btl_openib_component.c:3227:handle_wc] from bn0103 to: bn0122 error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 560618664 opcode 1 vendor error 105 qp_idx 3

2/
[[992,1],6][btl_openib_component.c:3227:handle_wc] from pbn04 to: pbn05 error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 162858496 opcode 1 vendor error 136 qp_idx 0
[[992,1],5][btl_openib_component.c:3227:handle_wc] from pbn05 to: pbn04 error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 485900928 opcode 0 vendor error 249 qp_idx 0
--------------------------------------------------------------------------
The OpenFabrics stack has reported a network error event. Open MPI will
try to continue, but your job may end up failing.

  Local host:p'"
  MPI process PID: 20743
  Error number: 3 (IBV_EVENT_QP_ACCESS_ERR)

This error may indicate connectivity problems within the fabric; please
contact your system administrator.
--------------------------------------------------------------------------

I'd like to know what these two errors mean and where they come from.

Thanks for your help,
Eloi

--
Eloi Gaudry
Free Field Technologies
Company Website: http://www.fft.be
Company Phone: +32 10 487 959
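
For anyone comparing reports like the ones in this thread: the "status number" printed in each handle_wc line is the numeric value of the libibverbs ibv_wc_status enum (1, 10 and 5 here). The sketch below pulls the fields out of such lines and attaches the enum name. It is only an illustration: the regular expression and the function name parse_handle_wc are hypothetical, written against the three lines quoted in this thread and not taken from Open MPI, and only part of the enum is listed.

```python
import re

# Subset of the ibv_wc_status enum from libibverbs (<infiniband/verbs.h>).
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Hypothetical pattern: matches the layout of the lines quoted in this
# thread only; other Open MPI versions may print a different format.
HANDLE_WC = re.compile(
    r"(?P<proc>\[\[\d+,\d+\],\d+\])"
    r"\[btl_openib_component\.c:\d+:handle_wc\]\s+"
    r"from\s+(?P<src>\S+)\s+to:\s+(?P<dst>\S+)\s+"
    r"error polling (?P<cq>LP|HP) CQ with status (?P<status_text>[A-Z ]+?)\s+"
    r"status number\s+(?P<status>\d+)\s+for\s+wr_id\s+(?P<wr_id>\d+)\s+"
    r"opcode\s+(?P<opcode>\d+)\s+vendor error\s+(?P<vendor_err>\d+)\s+"
    r"qp_idx\s+(?P<qp_idx>\d+)"
)


def parse_handle_wc(text):
    """Return one dict per handle_wc error line found in text."""
    # Collapse line wraps so a message split across lines still matches.
    flat = " ".join(text.split())
    events = []
    for m in HANDLE_WC.finditer(flat):
        status = int(m.group("status"))
        events.append({
            "proc": m.group("proc"),
            "src": m.group("src"),
            "dst": m.group("dst"),
            "cq": m.group("cq"),
            "status": status,
            "status_name": IBV_WC_STATUS.get(status, "unknown"),
            "wr_id": int(m.group("wr_id")),
            "opcode": int(m.group("opcode")),
            "vendor_err": int(m.group("vendor_err")),
            "qp_idx": int(m.group("qp_idx")),
        })
    return events


# The three lines reported in this thread, including the run-together
# "qp_idx 0[[992,1],5]" from the second report.
SAMPLE = (
    "[[12770,1],0][btl_openib_component.c:3227:handle_wc] from bn0103 to: bn0122\n"
    "error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for\n"
    "wr_id 560618664 opcode 1 vendor error 105 qp_idx 3\n"
    "[[992,1],6][btl_openib_component.c:3227:handle_wc] from pbn04 to: pbn05\n"
    "error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for\n"
    "wr_id 162858496 opcode 1 vendor error 136 qp_idx\n"
    "0[[992,1],5][btl_openib_component.c:3227:handle_wc] from pbn05 to: pbn04\n"
    "error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5\n"
    "for wr_id 485900928 opcode 0 vendor error 249 qp_idx 0\n"
)

for event in parse_handle_wc(SAMPLE):
    print(event["proc"], event["src"], "->", event["dst"],
          event["cq"], event["status_name"], "qp_idx", event["qp_idx"])
```

Grouping the parsed events by host pair and qp_idx can help show whether the failures cluster on one link or one queue pair, which may be useful when no reproducer is available.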