Re: [OMPI users] openib issues

2010-08-10 Thread Eloi Gaudry
Hi Mike,

The HCA card is a Mellanox Technologies MT25418 (ConnectX IB DDR, PCIe 2.0 
2.5GT/s, rev a0).
I cannot post code or instructions to reproduce these errors, as they appeared 
randomly during tests I performed to locate the origin of a segmentation fault 
during an MPI collective call.

Have you ever experienced such issues? And do you know what these messages 
mean?

Regards,
Eloi


On Tuesday 10 August 2010 13:19:45 Mike Dubman wrote:
> Hey Eloi,
> 
> What HCA card do you have? Can you post code/instructions on how to
> reproduce it?
> 10x
> Mike
> 
> On Mon, Aug 9, 2010 at 5:22 PM, Eloi Gaudry  wrote:
> > Hi,
> > 
> > Could someone have a look at these two different error messages? I'd
> > like to know the reason(s) why they were displayed and their actual
> > meaning.
> > 
> > Thanks,
> > Eloi
> > 
> > On Monday 19 July 2010 16:38:57 Eloi Gaudry wrote:
> > > Hi,
> > > 
> > > I've been working on a random segmentation fault that seems to occur
> > > during a collective communication when using the openib btl (see
> > > [OMPI users] [openib] segfault when using openib btl).
> > > 
> > > During my tests, I've come across different issues reported by
> > > OpenMPI-1.4.2:
> > > 
> > > 1/
> > > [[12770,1],0][btl_openib_component.c:3227:handle_wc] from bn0103 to:
> > > bn0122 error polling LP CQ with status LOCAL LENGTH ERROR status
> > > number 1 for wr_id 560618664 opcode 1  vendor error 105 qp_idx 3
> > > 
> > > 2/
> > > [[992,1],6][btl_openib_component.c:3227:handle_wc] from pbn04 to: pbn05
> > > error polling LP CQ with status REMOTE ACCESS ERROR status number 10
> > > for wr_id 162858496 opcode 1  vendor error 136 qp_idx
> > > 0[[992,1],5][btl_openib_component.c:3227:handle_wc] from pbn05 to:
> > > pbn04 error polling HP CQ with status WORK REQUEST FLUSHED ERROR
> > > status number 5 for wr_id 485900928 opcode 0  vendor error 249 qp_idx 0
> > > 
> > > --
> > > The OpenFabrics stack has reported a network error event.  Open MPI
> > > will try to continue, but your job may end up failing.
> > > 
> > >   Local host:p'"
> > >   MPI process PID:   20743
> > >   Error number:  3 (IBV_EVENT_QP_ACCESS_ERR)
> > > 
> > > This error may indicate connectivity problems within the fabric; please
> > > contact your system administrator.
> > > --
> > > 
> > > I'd like to know what these two errors mean and where they come from.
> > > 
> > > Thanks for your help,
> > > Eloi
> > 
> > --
> > 
> > 
> > Eloi Gaudry
> > 
> > Free Field Technologies
> > Company Website: http://www.fft.be
> > Company Phone:   +32 10 487 959
> > ___
> > users mailing list
> > us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/users

-- 


Eloi Gaudry

Free Field Technologies
Company Website: http://www.fft.be
Company Phone:   +32 10 487 959


Re: [OMPI users] openib issues

2010-08-10 Thread Mike Dubman
Hey Eloi,

What HCA card do you have? Can you post code/instructions on how to reproduce
it?
10x
Mike

On Mon, Aug 9, 2010 at 5:22 PM, Eloi Gaudry  wrote:

> Hi,
>
> Could someone have a look at these two different error messages? I'd like
> to know the reason(s) why they were displayed and their actual meaning.
>
> Thanks,
> Eloi
>
> On Monday 19 July 2010 16:38:57 Eloi Gaudry wrote:
> > Hi,
> >
> > I've been working on a random segmentation fault that seems to occur
> > during a collective communication when using the openib btl (see
> > [OMPI users] [openib] segfault when using openib btl).
> >
> > During my tests, I've come across different issues reported by
> > OpenMPI-1.4.2:
> >
> > 1/
> > [[12770,1],0][btl_openib_component.c:3227:handle_wc] from bn0103 to:
> > bn0122 error polling LP CQ with status LOCAL LENGTH ERROR status number
> > 1 for wr_id 560618664 opcode 1  vendor error 105 qp_idx 3
> >
> > 2/
> > [[992,1],6][btl_openib_component.c:3227:handle_wc] from pbn04 to: pbn05
> > error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for
> > wr_id 162858496 opcode 1  vendor error 136 qp_idx
> > 0[[992,1],5][btl_openib_component.c:3227:handle_wc] from pbn05 to: pbn04
> > error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number
> > 5 for wr_id 485900928 opcode 0  vendor error 249 qp_idx 0
> >
> > --
> > The OpenFabrics stack has reported a network error event.  Open MPI will
> > try to continue, but your job may end up failing.
> >
> >   Local host:p'"
> >   MPI process PID:   20743
> >   Error number:  3 (IBV_EVENT_QP_ACCESS_ERR)
> >
> > This error may indicate connectivity problems within the fabric; please
> > contact your system administrator.
> > --
> >
> > I'd like to know what these two errors mean and where they come from.
> >
> > Thanks for your help,
> > Eloi
>
> --
>
>
> Eloi Gaudry
>
> Free Field Technologies
> Company Website: http://www.fft.be
> Company Phone:   +32 10 487 959
>


Re: [OMPI users] openib issues

2010-08-09 Thread Eloi Gaudry
Hi,

Could someone have a look at these two different error messages? I'd like to 
know the reason(s) why they were displayed and their actual meaning.

Thanks,
Eloi

On Monday 19 July 2010 16:38:57 Eloi Gaudry wrote:
> Hi,
> 
> I've been working on a random segmentation fault that seems to occur during
> a collective communication when using the openib btl (see [OMPI users]
> [openib] segfault when using openib btl).
> 
> During my tests, I've come across different issues reported by
> OpenMPI-1.4.2:
> 
> 1/
> [[12770,1],0][btl_openib_component.c:3227:handle_wc] from bn0103 to: bn0122
> error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for
> wr_id 560618664 opcode 1  vendor error 105 qp_idx 3
> 
> 2/
> [[992,1],6][btl_openib_component.c:3227:handle_wc] from pbn04 to: pbn05
> error polling LP CQ with status REMOTE ACCESS ERROR status number 10 for
> wr_id 162858496 opcode 1  vendor error 136 qp_idx
> 0[[992,1],5][btl_openib_component.c:3227:handle_wc] from pbn05 to: pbn04
> error polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5
> for wr_id 485900928 opcode 0  vendor error 249 qp_idx 0
> 
> --
> The OpenFabrics stack has reported a network error event.  Open MPI will
> try to continue, but your job may end up failing.
> 
>   Local host:p'"
>   MPI process PID:   20743
>   Error number:  3 (IBV_EVENT_QP_ACCESS_ERR)
> 
> This error may indicate connectivity problems within the fabric; please
> contact your system administrator.
> --
> 
> I'd like to know what these two errors mean and where they come from.
> 
> Thanks for your help,
> Eloi

-- 


Eloi Gaudry

Free Field Technologies
Company Website: http://www.fft.be
Company Phone:   +32 10 487 959


[OMPI users] openib issues

2010-07-19 Thread Eloi Gaudry
Hi,

I've been working on a random segmentation fault that seems to occur during a 
collective communication when using the openib btl (see [OMPI users] [openib] 
segfault when using openib btl).

During my tests, I've come across different issues reported by OpenMPI-1.4.2:

1/ 
[[12770,1],0][btl_openib_component.c:3227:handle_wc] from bn0103 to: bn0122 
error polling LP CQ with status LOCAL LENGTH ERROR status number 1 for wr_id 
560618664 opcode 1  vendor error 105 qp_idx 3

2/
[[992,1],6][btl_openib_component.c:3227:handle_wc] from pbn04 to: pbn05 error 
polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id 
162858496 opcode 1  vendor error 136 qp_idx 
0[[992,1],5][btl_openib_component.c:3227:handle_wc] from pbn05 to: pbn04 error 
polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id 
485900928 opcode 0  vendor error 249 
qp_idx 0

--
The OpenFabrics stack has reported a network error event.  Open MPI will try to 
continue, but your job may end up failing.

  Local host:p'"
  MPI process PID:   20743
  Error number:  3 (IBV_EVENT_QP_ACCESS_ERR)

This error may indicate connectivity problems within the fabric; please contact 
your system administrator.
--

I'd like to know what these two errors mean and where they come from.
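
For anyone comparing similar output, here is a minimal sketch that splits the handle_wc lines above into named fields and looks up the status number. The regular expression is inferred only from the two samples in this message, so it is an assumption about the layout rather than a description of what Open MPI guarantees. The function name parse_handle_wc and the field names are hypothetical. The lookup table is a small subset of libibverbs' enum ibv_wc_status, limited to the three values that appear here.

```python
import re

# Subset of libibverbs' enum ibv_wc_status, limited to the values seen in
# this message. Any other number is reported as "unknown" by this sketch.
WC_STATUS = {
    1: "IBV_WC_LOC_LEN_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
}

# Layout assumed from the samples above; the group names are hypothetical
# labels chosen for this sketch, not Open MPI terminology.
HANDLE_WC = re.compile(
    r"\[\[(?P<job_a>\d+),(?P<job_b>\d+)\],(?P<proc>\d+)\]"
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>\w+)\] "
    r"from (?P<src>\S+) to: (?P<dst>\S+) "
    r"error polling (?P<cq>LP|HP) CQ with status (?P<status>[A-Z ]+?) "
    r"status number (?P<num>\d+) for wr_id (?P<wr_id>\d+) "
    r"opcode (?P<opcode>\d+) vendor error (?P<vendor>\d+) "
    r"qp_idx (?P<qp_idx>\d+)"
)

INT_FIELDS = ("job_a", "job_b", "proc", "line", "num",
              "wr_id", "opcode", "vendor", "qp_idx")


def parse_handle_wc(text):
    """Return one dict per handle_wc message found in text."""
    # Collapse line wraps and repeated spaces so that wrapped or
    # run-together messages can be matched on a single line.
    flat = " ".join(text.split())
    records = []
    for match in HANDLE_WC.finditer(flat):
        rec = match.groupdict()
        for key in INT_FIELDS:
            rec[key] = int(rec[key])
        rec["enum"] = WC_STATUS.get(rec["num"], "unknown")
        records.append(rec)
    return records


sample = """
[[992,1],6][btl_openib_component.c:3227:handle_wc] from pbn04 to: pbn05 error
polling LP CQ with status REMOTE ACCESS ERROR status number 10 for wr_id
162858496 opcode 1  vendor error 136 qp_idx
0[[992,1],5][btl_openib_component.c:3227:handle_wc] from pbn05 to: pbn04 error
polling HP CQ with status WORK REQUEST FLUSHED ERROR status number 5 for wr_id
485900928 opcode 0  vendor error 249
qp_idx 0
"""

for rec in parse_handle_wc(sample):
    print(rec["src"], "->", rec["dst"], rec["cq"], rec["enum"],
          "vendor error", rec["vendor"])
```

The parser only pulls the fields apart; it does not say why the completion failed. It does show that the second sample contains two separate messages, one from each side of the pbn04/pbn05 pair, printed back to back.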

Thanks for your help,
Eloi

-- 


Eloi Gaudry

Free Field Technologies
Company Website: http://www.fft.be
Company Phone:   +32 10 487 959