On Wed, 8 Jul 2015 22:20:49 +0100
Stuart Henderson <st...@openbsd.org> wrote:

> On 2015/07/08 20:00, Max Fillinger wrote:
> > On Wed, Jul 08, 2015 at 03:53:46PM +0200, Mark Kettenis wrote:
> > > I'm looking for testers for this diff.  This should be safe to
> > > run on amd64, i386 and sparc64.  But has been reported to lock up
> > > i386 machines.  I can't reproduce this on any of my own systems.
> > > So I'm looking for help.  I'm looking for people that are able to
> > > build a kernel with this diff and the MP_LOCKDEBUG option enabled
> > > (uncommented) in their GENERIC.MP kernel, run it on an MP machine
> > > and put some load on it to see if it locks up and/or panics.
> > > 
> > > Being able to move forward with this would make OpenBSD run
> > > significantly better on MP systems.
> > > 
> > > Thanks,
> > > 
> > > Mark
> > 
> > I just finished compiling the kernel for amd64; I might test i386
> > later. What kind of load would be required to give useful feedback?
> > Would building the userland or some of the bigger ports be a useful
> > test?
> 

I have been running with the patch applied to a Jul 8 snapshot for ~2h
under a decent load on amd64. No crashes, no panics, and dmesg doesn't
show anything unusual.
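
For anyone else who wants to try it, this is roughly what I did to get
the MP_LOCKDEBUG kernel. It's from memory, so double-check against the
FAQ before copying it blindly:

    # cd /usr/src/sys                    (apply Mark's diff here)
    # cd arch/amd64/conf                 (or arch/i386/conf)
    (edit GENERIC.MP and uncomment the "#option MP_LOCKDEBUG" line)
    # config GENERIC.MP
    # cd ../compile/GENERIC.MP
    # make clean && make
    # make install                       (or copy bsd to /bsd by hand)
    # reboot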

> Building base with the reaper unlock diff on i386 doesn't seem to
> trigger problems, or at least I haven't run into them in a few
> attempts.
> 
> I do see problems when building ports on a dpb cluster, quite quickly
> in some cases - I just did a run and one node locked after 261s,
> another after 756s (dpb master stayed up FWIW).
> 

Stuart, do you think the issue could be memory-related and more easily
triggered when the system is forced to swap?

My amd64 box has 8 GB of RAM, whereas the i386 machine I used before was
always a *lot* easier to starve of memory, to say the least. The patches
touch uvm, which is responsible for handling virtual memory and
swapping.

That could explain a hard-to-reproduce bug on a memory-starved i386
machine, couldn't it?

I can slap a new snapshot on my old i386 box and try memory-starving it
tomorrow. I'd appreciate any debunking of this train of thought, of
course, or pointers on what type of load to put on the system (CPU,
memory, I/O?).
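
For the memory part, something like the program below is what I had in
mind: a small hog that dirties more anonymous memory than the box has
RAM, so uvm has to keep pushing pages out to swap. Just a sketch; the
chunk sizes are made up, and you may need to raise the datasize limit
in login.conf (or with ulimit -d) for it to get anywhere.

#include <err.h>
#include <stdlib.h>
#include <string.h>

#define CHUNK   (16UL * 1024 * 1024)    /* 16 MB per allocation */
#define NCHUNKS 256                     /* ~4 GB total; tune per machine */

int
main(void)
{
        static char *p[NCHUNKS];
        size_t i, j;

        for (i = 0; i < NCHUNKS; i++) {
                if ((p[i] = malloc(CHUNK)) == NULL) {
                        warn("malloc (got %zu chunks)", i);
                        break;
                }
                memset(p[i], 0xa5, CHUNK);      /* force the pages in */
        }

        /* keep re-dirtying every page so nothing can stay swapped out */
        for (;;) {
                for (i = 0; i < NCHUNKS && p[i] != NULL; i++)
                        for (j = 0; j < CHUNK; j += 4096)
                                p[i][j]++;
        }
        /* NOTREACHED */
}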

> If you're trying to reproduce, make sure you set ddb.console=1 and
> check that you can break into ddb under normal conditions. If you
> manage to trigger a hang, see if you can break into ddb and get
> the usual things (backtrace, ps, sh reg, etc).
> 
> I've been unable to get into ddb after a hang, including on this
> most recent run with MP_LOCKDEBUG.
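
Noted, I'll set ddb.console and check that I can actually break in
before I start loading the i386 box. For my own notes, the sequence I
have in mind is roughly the following (from ddb(4) as I remember it, so
correct me if any of it is off):

    # sysctl ddb.console=1        (and ddb.console=1 in /etc/sysctl.conf)
    (Ctrl+Alt+Esc on the glass console, or a BREAK over serial)
    ddb> trace
    ddb> ps
    ddb> show registers
    ddb> machine ddbcpu 1         (to look at the other CPUs, if it works)
    ddb> cont
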
> 
> Nothing particularly special was being built during the last hang;
> from dpb term-report, the last entry before "i386-2-" appeared
> (indicating that the host is no longer contactable) showed these
> 
> archivers/libzip
> audio/libogg
> archivers/lzo2
> 
> Looking at build logs (which are streamed over ssh and logged
> on the dpb master) lzo2 and libzip were compiling (cc from base)
> and libogg was doing pkg_create/gzip when contact was lost.
> So I don't think it's going to be triggered by any particular
> ports, there is nothing out of the ordinary about these, and
> no funny autoconf checks were occurring at the time.
> 
> The main other build-related active process would be sshd,
> and since pkg_create was running it would also most likely
> have been writing to nfs at the time.
