Re: ALIX segfaulting on current/i386

2012-02-19 Thread Gonzalo L. R.

Try with the last snapshot:

OpenBSD 5.1 (GENERIC) #160: Sun Feb 12 09:46:33 MST 2012
dera...@i386.openbsd.org:/usr/src/sys/arch/i386/compile/GENERIC

I have the same machine and no problems.

On Sun, 19 Feb 2012 15:17:41 +0100, Jan Stary wrote:

On a recent install of current/i386 on an ALIX (see dmesg below),
processes (such as a simple 'ls') started to magically segfault and die.


Feb 19 14:43:17 www /bsd: pid 26001 (bogofilter): user write of
4096@0x3d5b000 at 1776 failed: 14
Feb 19 14:44:08 www /bsd: pid 7571 (bogofilter): user write of
4096@0x2cfe000 at 1360 failed: 14
Feb 19 14:45:04 www /bsd: pid 9409 (sh): user write of
4096@0x8434b000 at 99760 failed: 14
Feb 19 14:45:12 www /bsd: pid 18943 (cron): user write of
4096@0x7c23e000 at 165360 failed: 14
Feb 19 14:46:38 www /bsd: pid 18781 (error): user write of
118784@0x2eddd000 at 145008 failed: 14
Feb 19 14:47:39 www /bsd: pid 10912 (flush): user write of
4096@0xb0dc000 at 1360 failed: 14
Feb 19 14:47:52 www /bsd: pid 13255 (cleanup): user write of
4096@0x66 at 1872 failed: 14

What does this indicate? Is my RAM bad? Is my CF card bad?
Could someone more knowledgeable please explain the above
messages in detail?

The system acts as a NAT router, and in that respect, nothing
wrong happens to the clients - I browse the web and everything
from behind this machine. But when it does something IO related
(such as opening my mailbox when I launch mutt), it _sometimes_
segfaults now.

For example: I tried to run 'file file.core' in a ktrace.
That ended in a segfault. The kdump ends with

 22449 file CALL  sigprocmask(SIG_BLOCK,0x)
 22449 file RET   sigprocmask 0
 22449 file CALL  mprotect(0x3c005000,0x1000,0x3<PROT_READ|PROT_WRITE>)
 22449 file RET   mprotect 0
 22449 file CALL  mprotect(0x3c005000,0x1000,0x1<PROT_READ>)
 22449 file RET   mprotect 0
 22449 file CALL  sigprocmask(SIG_SETMASK,0)
 22449 file RET   sigprocmask -65793/0xfffefeff
 22449 file PSIG  SIGSEGV SIG_DFL code SEGV_MAPERR<1> addr=0x87fb313c trapno=2
 22449 file NAMI  file.core


Thank you for your time

Jan


OpenBSD 5.1-beta (GENERIC) #140: Sat Jan 21 00:40:23 MST 2012
dera...@i386.openbsd.org:/usr/src/sys/arch/i386/compile/GENERIC
cpu0: Geode(TM) Integrated Processor by AMD PCS (AuthenticAMD
586-class) 432 MHz
cpu0: FPU,DE,PSE,TSC,MSR,CX8,SEP,PGE,CMOV,CFLUSH,MMX,MMXX,3DNOW2,3DNOW
real mem  = 133758976 (127MB)
avail mem = 121544704 (115MB)
mainbus0 at root
bios0 at mainbus0: AT/286+ BIOS, date 12/10/07, BIOS32 rev. 0 @ 0xfceb2
pcibios0 at bios0: rev 2.1 @ 0xf/0x1
pcibios0: pcibios_get_intr_routing - function not supported
pcibios0: PCI IRQ Routing information unavailable.
pcibios0: PCI bus #0 is the last bus
bios0: ROM list: 0xe/0xa800
cpu0 at mainbus0: (uniprocessor)
pci0 at mainbus0 bus 0: configuration mode 1 (bios)
pchb0 at pci0 dev 1 function 0 AMD Geode LX rev 0x31
glxsb0 at pci0 dev 1 function 2 AMD Geode LX Crypto rev 0x00: RNG AES
vr0 at pci0 dev 9 function 0 VIA VT6105M RhineIII rev 0x96: irq 10,
address 00:0d:b9:12:9f:2c
ukphy0 at vr0 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI
0x004063, model 0x0034
vr1 at pci0 dev 10 function 0 VIA VT6105M RhineIII rev 0x96: irq
11, address 00:0d:b9:12:9f:2d
ukphy1 at vr1 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI
0x004063, model 0x0034
vr2 at pci0 dev 11 function 0 VIA VT6105M RhineIII rev 0x96: irq
12, address 00:0d:b9:12:9f:2e
ukphy2 at vr2 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI
0x004063, model 0x0034
ral0 at pci0 dev 12 function 0 Ralink RT2560 rev 0x01: irq 9,
address 00:11:09:0d:d3:36
ral0: MAC/BBP RT2560 (rev 0x04), RF RT2525
glxpcib0 at pci0 dev 15 function 0 AMD CS5536 ISA rev 0x03: rev 3,
32-bit 3579545Hz timer, watchdog, gpio
gpio0 at glxpcib0: 32 pins
pciide0 at pci0 dev 15 function 2 AMD CS5536 IDE rev 0x01: DMA,
channel 0 wired to compatibility, channel 1 wired to compatibility
wd0 at pciide0 channel 0 drive 0: LEXAR ATA FLASH CARD
wd0: 1-sector PIO, LBA, 15263MB, 31260096 sectors
wd0(pciide0:0:0): using PIO mode 4, Ultra-DMA mode 2
pciide0: channel 1 ignored (disabled)
ohci0 at pci0 dev 15 function 4 AMD CS5536 USB rev 0x02: irq 15,
version 1.0, legacy support
ehci0 at pci0 dev 15 function 5 AMD CS5536 USB rev 0x02: irq 15
usb0 at ehci0: USB revision 2.0
uhub0 at usb0 AMD EHCI root hub rev 2.00/1.00 addr 1
isa0 at glxpcib0
isadma0 at isa0
com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
com0: console
pcppi0 at isa0 port 0x61
spkr0 at pcppi0
npx0 at isa0 port 0xf0/16: reported by CPUID; using exception 16
usb1 at ohci0: USB revision 1.0
uhub1 at usb1 AMD OHCI root hub rev 1.00/1.00 addr 1
mtrr: K6-family MTRR support (2 registers)
nvram: invalid checksum
vscsi0 at root
scsibus0 at vscsi0: 256 targets
softraid0 at root
scsibus1 at softraid0: 256 targets
root on wd0a (5bea3261eefd6b7e.a) swap on wd0b dump on wd0b
clock: unknown CMOS layout


--
Sending from my VCR



Re: ALIX segfaulting on current/i386

2012-02-19 Thread Philip Guenther
On Sun, Feb 19, 2012 at 6:17 AM, Jan Stary h...@stare.cz wrote:
 On a recent install of current/i386 on an ALIX (see dmesg below),
 processes (such as a simple 'ls') started to magically segfault and die.

 Feb 19 14:43:17 www /bsd: pid 26001 (bogofilter): user write of 
 4096@0x3d5b000 at 1776 failed: 14

14 == EFAULT.  Those are generated when the kernel tries to write out
a process's memory image for a coredump and the indicated range of
memory couldn't be faulted in so that it could be written to the
filesystem.
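
As an illustration of how to read these messages (an addition to the thread, not something the posters wrote), the sketch below parses one of the quoted kernel log lines and maps the trailing number to its errno name. The regular expression, the function name parse_coredump_failure, and the field labels (length, address, offset) are assumptions made for this example. The errno table is that of whatever platform runs the script; 14 is EFAULT on the common Unix-like systems.

```python
import errno
import os
import re

# Pattern for the kernel message quoted in this thread. The field names
# (length, address, offset) are this example's interpretation, not an
# official description of the log format.
LINE_RE = re.compile(
    r"pid (?P<pid>\d+) \((?P<comm>[^)]+)\): user write of\s+"
    r"(?P<length>\d+)@(?P<address>0x[0-9a-fA-F]+) at (?P<offset>\d+) "
    r"failed: (?P<error>\d+)"
)


def parse_coredump_failure(line):
    """Return the fields of one 'user write ... failed' line, or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    error = int(match.group("error"))
    return {
        "pid": int(match.group("pid")),
        "comm": match.group("comm"),
        "length": int(match.group("length")),
        "address": int(match.group("address"), 16),
        "offset": int(match.group("offset")),
        # Symbolic name and human-readable text for the numeric error.
        "errno_name": errno.errorcode.get(error, "unknown"),
        "errno_text": os.strerror(error),
    }


if __name__ == "__main__":
    sample = ("Feb 19 14:43:17 www /bsd: pid 26001 (bogofilter): user write of "
              "4096@0x3d5b000 at 1776 failed: 14")
    print(parse_coredump_failure(sample))
```

Feeding a whole log through the same function and grouping by "comm" would show at a glance which programs were being dumped.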


 What does this indicate? Is my RAM bad? Is my CF card bad?
 Could someone more knowledgeable please explain the above
 messages in detail?

The inability to fault in memory that the kernel thinks should be
there makes me wonder if you're swapping and the device you're
swapping to is failing.  Your dmesg suggests you might be swapping to
your CF card and you (only?) have 128MB of real memory.  When this is
happening, what's the output of swapctl -l?  If that shows you are
indeed into swap, then a failing CF card would be my guess.

(Swapping to CF seems like a bad idea to me, but I'm not expert in
that sort of hardware...)


Philip Guenther



Re: ALIX segfaulting on current/i386

2012-02-19 Thread Jan Stary
On Feb 19 15:00:18, Gonzalo L. R. wrote:
 Try with the last snapshot:
 
 OpenBSD 5.1 (GENERIC) #160: Sun Feb 12 09:46:33 MST 2012
 dera...@i386.openbsd.org:/usr/src/sys/arch/i386/compile/GENERIC
 
 I have the same machine and no problems.

This machine has been running for five years without problems;
that's why I am speculating about a HW failure ...


 On Sun, 19 Feb 2012 15:17:41 +0100, Jan Stary wrote:
 On a recent install of current/i386 on an ALIX (see dmesg below),
 processes (such as a simple 'ls') started to magically segfault
 and die.
 
 Feb 19 14:43:17 www /bsd: pid 26001 (bogofilter): user write of
 4096@0x3d5b000 at 1776 failed: 14
 Feb 19 14:44:08 www /bsd: pid 7571 (bogofilter): user write of
 4096@0x2cfe000 at 1360 failed: 14
 Feb 19 14:45:04 www /bsd: pid 9409 (sh): user write of
 4096@0x8434b000 at 99760 failed: 14
 Feb 19 14:45:12 www /bsd: pid 18943 (cron): user write of
 4096@0x7c23e000 at 165360 failed: 14
 Feb 19 14:46:38 www /bsd: pid 18781 (error): user write of
 118784@0x2eddd000 at 145008 failed: 14
 Feb 19 14:47:39 www /bsd: pid 10912 (flush): user write of
 4096@0xb0dc000 at 1360 failed: 14
 Feb 19 14:47:52 www /bsd: pid 13255 (cleanup): user write of
 4096@0x66 at 1872 failed: 14
 
 What does this indicate? Is my RAM bad? Is my CF card bad?
 Could someone more knowledgeable please explain the above
 messages in detail?



Re: ALIX segfaulting on current/i386

2012-02-19 Thread Jan Stary
On Feb 19 10:12:03, Philip Guenther wrote:
 On Sun, Feb 19, 2012 at 6:17 AM, Jan Stary h...@stare.cz wrote:
  On a recent install of current/i386 on an ALIX (see dmesg below),
  processes (such as a simple 'ls') started to magically segfault and die.
 
  Feb 19 14:43:17 www /bsd: pid 26001 (bogofilter): user write of 
  4096@0x3d5b000 at 1776 failed: 14
 
 14 == EFAULT.  Those are generated when the kernel tries to write out
 a process's memory image for a coredump and the indicated range of
 memory couldn't be faulted in so that it could be written to the
 filesystem.
 

Thank you for the explanation.

So, firstly, the kernel decides a process needs to be coredumped.
(That alone is a problem for me - why would that happen?)
And secondly, the attempt to coredump the process fails. Right?

  What does this indicate? Is my RAM bad? Is my CF card bad?
  Could someone more knowledgeable please explain the above
  messages in detail?
 
 The inability to fault in memory that the kernel thinks should be
 there makes me wonder if you're swapping and the device you're
 swapping to is failing. Your dmesg suggests you might be swapping to
 your CF card and you (only?) have 128MB of real memory.  When this is
 happening, what's the output of swapctl -l?  If that shows you are
 indeed into swap, then a failing CF card would be my guess.

Yes, the machine only has 128MB of memory - which I think should be
enough for what it does: NATing pf, dhcpd and resolver for the
internal network, and postfix and httpd for my domain (which
amounts to almost no traffic).

It does not have any swap configured. In fact, I try to design
my systems so that they don't ever need to swap.
 
 $ swapctl -l 
 swapctl: no swap devices configured

Would you please care to explain further how the swapping
is related to the coredumping EFAULTs?

 (Swapping to CF seems like a bad idea to me, but I'm not expert in
 that sort of hardware...)

I don't swap to the CF.

If it so happens that there is not enough memory for some running
process (a situation I cannot rule out now), and there is no swap
to deal with this, is that a reason for a process to be coredumped?
(I think that I have seen processes just die with ENOMEM
in that situation.)

Thank you for your time

Jan



Re: ALIX segfaulting on current/i386

2012-02-19 Thread Ted Unangst
On Sun, Feb 19, 2012, Jan Stary wrote:

 If it so happens that there is not enough memory for some running
 process (a situation I cannot rule out now), and there is no swap
 to deal with this, is that a reason for a process to be coredumped?
 (I think that I have seen processes just die with ENOMEM
 in that situation.)

ENOMEM is somewhat unlikely even in low memory situations, because the
kernel allows overcommit.  A process can allocate memory that's not
technically available, then when it tries to use that memory and the
kernel can't find anything to provide, segfault.  ENOMEM is an error
code, but not a signal, so a process cannot strictly speaking die from
it.
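
To make the distinction above concrete, here is a small sketch (an addition, assuming a POSIX system with Python 3). One child program receives an ENOMEM-style error, simulated by raising OSError by hand, and decides for itself to exit with status 1. The other is terminated by SIGSEGV, which it sends to itself as a stand-in for a real fault. The helper names run_child and describe are made up for this example.

```python
import signal
import subprocess
import sys

# Child 1: a call fails with ENOMEM. The program only sees an error code
# and chooses how to react. The failure is simulated, not a real allocation.
ERROR_CHILD = (
    "import errno, sys\n"
    "try:\n"
    "    raise OSError(errno.ENOMEM, 'simulated allocation failure')\n"
    "except OSError as exc:\n"
    "    sys.exit(1 if exc.errno == errno.ENOMEM else 2)\n"
)

# Child 2: the process is terminated by a signal. Core dumps are disabled
# first so the example leaves no core file behind.
SIGNAL_CHILD = (
    "import os, resource, signal\n"
    "resource.setrlimit(resource.RLIMIT_CORE, (0, 0))\n"
    "os.kill(os.getpid(), signal.SIGSEGV)\n"
)


def run_child(source):
    """Run a short Python program and return its subprocess return code."""
    return subprocess.run([sys.executable, "-c", source]).returncode


def describe(returncode):
    """Translate a POSIX subprocess return code into a short description."""
    if returncode < 0:
        # subprocess reports death by signal N as the return code -N.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


if __name__ == "__main__":
    print("ENOMEM child: ", describe(run_child(ERROR_CHILD)))
    print("SIGSEGV child:", describe(run_child(SIGNAL_CHILD)))
```

Only the second child "dies" in the sense of being killed; the first one exits on its own terms after seeing the error.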



Re: ALIX segfaulting on current/i386

2012-02-19 Thread Philip Guenther
On Sun, Feb 19, 2012 at 11:24 AM, Jan Stary h...@stare.cz wrote:
 On Feb 19 10:12:03, Philip Guenther wrote:
 On Sun, Feb 19, 2012 at 6:17 AM, Jan Stary h...@stare.cz wrote:
  On a recent install of current/i386 on an ALIX (see dmesg below),
  processes (such as a simple 'ls') started to magically segfault and die.
 
  Feb 19 14:43:17 www /bsd: pid 26001 (bogofilter): user write of
  4096@0x3d5b000 at 1776 failed: 14

 14 == EFAULT.  Those are generated when the kernel tries to write out
 a process's memory image for a coredump and the indicated range of
 memory couldn't be faulted in so that it could be written to the
 filesystem.


 Thank you for the explanation.

 So, firstly, the kernel decides a process needs to be coredumped.
 (That alone is a problem for me - why would that happen?)
 And secondly, the attempt to coredump the process fails. Right?

Yep.


  What does this indicate? Is my RAM bad? Is my CF card bad?
  Could someone more knowledgeable please explain the above
  messages in detail?

 The inability to fault in memory that the kernel thinks should be
 there makes me wonder if you're swapping and the device you're
 swapping to is failing. Your dmesg suggests you might be swapping to
 your CF card and you (only?) have 128MB of real memory.  When this is
 happening, what's the output of swapctl -l?  If that shows you are
 indeed into swap, then a failing CF card would be my guess.

 Yes, the machine only has 128MB of memory - which I think should be
 enough for what it does: NATing pf, dhcpd and resolver for the
 internal network, and postfix and httpd for my domain (which
 amounts to almost no traffic).

Have you monitored the memory usage to confirm or deny your belief
that it's sufficient?


 It does not have any swap configured. In fact, I try to design
 my systems so that they don't ever need to swap.

  $ swapctl -l
  swapctl: no swap devices configured

 Would you please care to explain further how the swapping
 is related to the coredumping EFAULTs?

It was a hypothesis based on the available evidence.  Your additional
evidence rules it out, so I see no reason to waste our time explaining
it.

At this point, I suggest you gather data about the system and see if
there's a correlation between the data and when this occurs.  Then
make a hypothesis from that, figure out a way to test it, etc.  In
short, use *SCIENCE* on it!


Philip Guenther



Re: ALIX segfaulting on current/i386

2012-02-19 Thread Jan Stary
On Feb 19 10:12:03, Philip Guenther wrote:
 14 == EFAULT.  Those are generated when the kernel tries to write out
 a process's memory image for a coredump and the indicated range of
 memory couldn't be faulted in so that it could be written to the
 filesystem.

  The inability to fault in memory that the kernel thinks should be
  there makes me wonder if you're swapping and the device you're
  swapping to is failing. Your dmesg suggests you might be swapping to
  your CF card and you (only?) have 128MB of real memory.  When this is
  happening, what's the output of swapctl -l?  If that shows you are
  indeed into swap, then a failing CF card would be my guess.
 
 Yes, the machine only has 128MB of memory - which I think should be
 enough for what it does: NATing pf, dhcpd and resolver for the
 internal network, and postfix and httpd for my domain (which
 amounts to almost no traffic).
 
 It does not have any swap configured. In fact, I try to design
 my systems so that they don't ever need to swap.
  
  $ swapctl -l 
  swapctl: no swap devices configured
 
 If it so happens that there is not enough memory for some running
 process (a situation I cannot rule out now), and there is no swap
 to deal with this, is that a reason for a process to be coredumped?

On Feb 19 12:26:03, Philip Guenther wrote:
 Have you monitored the memory usage to confirm or deny your belief
 that it's sufficient?

I have now (and should have before of course).
The memory is _not_ sufficient because of a single
process (a demanding user's tomcat installation)
eating all the memory.

Specifically, the user is in the 'default' login class,
which entitles him to 512M on this 128M machine. His
java process requires about 180M (says top).
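
The arithmetic behind this observation can be written down as a short sketch (an addition to the thread). The 512M limit and the 180M demand are the figures quoted above, and 133758976 is the "real mem" value from the dmesg earlier in the thread. The helper names parse_size and check_limit are hypothetical.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_size(text):
    """Convert a size such as '512M' or '133758976' to bytes."""
    text = text.strip().upper()
    if text and text[-1] in UNITS:
        return int(text[:-1]) * UNITS[text[-1]]
    return int(text)


def check_limit(limit, physical, demand):
    """Report how a per-process limit and a demand relate to physical RAM."""
    limit_b, phys_b, demand_b = (parse_size(v) for v in (limit, physical, demand))
    return {
        # The limit alone permits more memory than the machine has.
        "limit_exceeds_ram": limit_b > phys_b,
        # The process stays under its limit, so the limit never stops it.
        "demand_within_limit": demand_b <= limit_b,
        # Yet what it asks for cannot all be backed by RAM.
        "demand_exceeds_ram": demand_b > phys_b,
    }


if __name__ == "__main__":
    print(check_limit("512M", "133758976", "180M"))
```

With the numbers from this thread all three values come out true: the limit does nothing to hold back a process that wants more than the machine can provide.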


On Feb 19 15:17:31, Ted Unangst wrote:
  If it so happens that there is not enough memory for some running
  process (a situation I cannot rule out now), and there is no swap
  to deal with this, is that a reason for a process to be coredumped?
  (I think that I have seen processes just die with ENOMEM
  in that situation.)
 
 ENOMEM is somewhat unlikely even in low memory situations, because the
 kernel allows overcommit.  A process can allocate memory that's not
 technically available,

This must be what's happening on my machine now:
a java process getting 180M on a 128M machine.

 then when it tries to use that memory and the kernel can't find
 anything to provide, segfault. 

What puzzles me is that it's the other processes that are segfaulting;
the java process that ate the (nonexistent) memory keeps running,
but an innocent vi(1) gets killed later, or an ntpd trying to sync.
Is this how the memory overcommit functionality works?

I have killed the java process about an hour ago,
and limited the memory usage in login.conf to

:datasize-max=100M:\
:datasize-cur=100M:\

which makes the java process fail to start:

Error occurred during initialization of VM
Could not reserve enough space for object heap
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
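
For readers who want to see a data-size limit being applied to a process from code, here is a minimal sketch (an addition, assuming a POSIX system with Python 3). It is only an analogy to the login.conf setting: it uses setrlimit(2) with RLIMIT_DATA through Python's resource module, starts a child under a 100M cap, and has the child report the limit it sees. The function names are hypothetical, and whether a large allocation then fails under such a cap depends on the operating system and the allocator.

```python
import resource
import subprocess
import sys

LIMIT = 100 * 1024 * 1024  # 100M, matching the datasize values above


def limit_data_segment():
    """Runs in the child before exec: cap the data segment size."""
    resource.setrlimit(resource.RLIMIT_DATA, (LIMIT, LIMIT))


def child_data_limit():
    """Start a child under the cap and return the soft limit it observes."""
    out = subprocess.run(
        [sys.executable, "-c",
         "import resource; print(resource.getrlimit(resource.RLIMIT_DATA)[0])"],
        preexec_fn=limit_data_segment,  # apply the limit before the child starts
        stdout=subprocess.PIPE,
        check=True,
    ).stdout
    return int(out)


if __name__ == "__main__":
    print("child sees a data limit of", child_data_limit(), "bytes")
```

A program that insists on reserving more than the cap at startup, as the JVM does for its heap, would then be refused up front rather than being allowed to overcommit.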

None of the symptoms has occurred since.

At any rate, thank you for the hints. What can I do
to further test my current suspicion that the memory
insufficiency is what was causing it?

Jan