On Thu, Mar 03, 2011 at 07:11:52PM +0100, Mike Belopuhov wrote:
> On Fri, Feb 04, 2011 at 14:53 +0000, emeric boit wrote:
> > Hello,
> > 
> > After doing a clean install of OpenBSD 4.8 (AMD64) on an IBM x3550 M3,
> > I find 
> > the
> > system randomly panics after a period of use.
> > uvm_fault(0xffffffff80cc8360, 0xffff8000149b7000, 0, 1) -> e
> > kernel: page
> > fault trap, code=0
> > Stopped at      mpi_reply+0x102:        movq
> > 0(%r13),%rax
> > ddb{0}>
> > 
> > ddb{0}> trace
> > mpi_reply() at mpi_reply+0x102
> > mpi_intr()
> > at mpi_intr+0x20
> > Xintr_ioapic_level18() at Xintr_ioapic_level18+0xec
> > ---
> > interrupt ---
> > Bad frame pointer: 0xffff8000194e1920
> > end trace frame:
> > 0xffff8000194e1920, count: -3
> > Xspllower+0xe:
> > ddb{0}>
> > 
> 
> We've tried different things, but after this hint i realised
> that what might be happening is that bnx and mpi interrupts
> are chained (it's bnx0 actually, my initial guess about bnx1
> was wrong) and mpi_intr is called first.  Currently neither
> mpi(4) nor mpii(4) don't check the interrupt status register
> but look directly into the reply post queue.  Although,
> there's not supposed to be any race between host cpu reading
> from the memory and ioc writing to it, in practice it turns
> out that in some particular hardware configurations this rule
> is violated and we read a garbled reply from the controller.
> 
> If my memory serves, I've considered this for the mpii_intr
> but never got into the situation where it was needed and
> thus omitted it.  I guess I have to bring it back too.
> 
> Emeric tortured the machine with this diff and reported that
> it solves the issue for him.  OK to commit?
> 
> On Wed, Mar 02, 2011 at 17:20 +0000, emeric boit wrote:
> > hi,
> > 
> > This change doesn't solve the issue.
> > 
> > I have remarked that the server crash when I use the network.
> > 
> > I copy a small file several times without problem.
> > On the IBM I do :
> > scp USER@IP:/tmp/mpi.c .
> > 
> > And when I copy a larger file the server crash :
> > scp USER@IP:/bsd .
> > 
> > 
> > And when I copy th same file (bsd) from an usb key I don't have problem.
> > 
> > Emeric.
> > 
> 
> that sounds like an interrupt sharing bug of some sort.
> is it bnx1 that you're using to reproduce a crash?
> 
> try the following diff please (on a clean checkout):
> 
> Index: mpi.c
> ===================================================================
> RCS file: /home/cvs/src/sys/dev/ic/mpi.c,v
> retrieving revision 1.166
> diff -u -p -r1.166 mpi.c
> --- mpi.c     1 Mar 2011 23:48:33 -0000       1.166
> +++ mpi.c     2 Mar 2011 17:40:13 -0000
> @@ -887,6 +887,9 @@ mpi_intr(void *arg)
>       u_int32_t                       reg;
>       int                             rv = 0;
>  
> +     if ((mpi_read_intr(sc) & MPI_INTR_STATUS_REPLY) == 0)
> +             return (rv);
> +
>       while ((reg = mpi_pop_reply(sc)) != 0xffffffff) {
>               mpi_reply(sc, reg);
>               rv = 1;
> 

ok krw@.

.... Ken

Reply via email to