Re: ALIX segfaulting on current/i386
Try with the last snapshot:

OpenBSD 5.1 (GENERIC) #160: Sun Feb 12 09:46:33 MST 2012
    dera...@i386.openbsd.org:/usr/src/sys/arch/i386/compile/GENERIC

I have the same machine and no problems.

On Sun, 19 Feb 2012 15:17:41 +0100, Jan Stary wrote:
> On a recent install of current/i386 on an ALIX (see dmesg below), processes (such as a simple 'ls') started to magically segfault and die.
>
> Feb 19 14:43:17 www /bsd: pid 26001 (bogofilter): user write of 4096@0x3d5b000 at 1776 failed: 14
> Feb 19 14:44:08 www /bsd: pid 7571 (bogofilter): user write of 4096@0x2cfe000 at 1360 failed: 14
> Feb 19 14:45:04 www /bsd: pid 9409 (sh): user write of 4096@0x8434b000 at 99760 failed: 14
> Feb 19 14:45:12 www /bsd: pid 18943 (cron): user write of 4096@0x7c23e000 at 165360 failed: 14
> Feb 19 14:46:38 www /bsd: pid 18781 (error): user write of 118784@0x2eddd000 at 145008 failed: 14
> Feb 19 14:47:39 www /bsd: pid 10912 (flush): user write of 4096@0xb0dc000 at 1360 failed: 14
> Feb 19 14:47:52 www /bsd: pid 13255 (cleanup): user write of 4096@0x66 at 1872 failed: 14
>
> What does this indicate? Is my RAM bad? Is my CF card bad? Could someone more knowledgeable please explain the above messages in detail?
>
> The system acts as a NAT router, and in that respect, nothing wrong happens to the clients - I browse the web and everything from behind this machine. But when it does something IO related (such as opening my mailbox when I launch mutt), it _sometimes_ segfaults now.
>
> For example: I tried to run 'file file.core' in a ktrace. That ended in a segfault. The kdump ends with
>
> 22449 file CALL sigprocmask(SIG_BLOCK,0x)
> 22449 file RET sigprocmask 0
> 22449 file CALL mprotect(0x3c005000,0x1000,0x3<PROT_READ|PROT_WRITE>)
> 22449 file RET mprotect 0
> 22449 file CALL mprotect(0x3c005000,0x1000,0x1<PROT_READ>)
> 22449 file RET mprotect 0
> 22449 file CALL sigprocmask(SIG_SETMASK,0)
> 22449 file RET sigprocmask -65793/0xfffefeff
> 22449 file PSIG SIGSEGV SIG_DFL code SEGV_MAPERR<1> addr=0x87fb313c trapno=2
> 22449 file NAMI file.core
>
> Thank you for your time
>
>   Jan
>
> OpenBSD 5.1-beta (GENERIC) #140: Sat Jan 21 00:40:23 MST 2012
>     dera...@i386.openbsd.org:/usr/src/sys/arch/i386/compile/GENERIC
> cpu0: Geode(TM) Integrated Processor by AMD PCS (AuthenticAMD 586-class) 432 MHz
> cpu0: FPU,DE,PSE,TSC,MSR,CX8,SEP,PGE,CMOV,CFLUSH,MMX,MMXX,3DNOW2,3DNOW
> real mem = 133758976 (127MB)
> avail mem = 121544704 (115MB)
> mainbus0 at root
> bios0 at mainbus0: AT/286+ BIOS, date 12/10/07, BIOS32 rev. 0 @ 0xfceb2
> pcibios0 at bios0: rev 2.1 @ 0xf/0x1
> pcibios0: pcibios_get_intr_routing - function not supported
> pcibios0: PCI IRQ Routing information unavailable.
> pcibios0: PCI bus #0 is the last bus
> bios0: ROM list: 0xe/0xa800
> cpu0 at mainbus0: (uniprocessor)
> pci0 at mainbus0 bus 0: configuration mode 1 (bios)
> pchb0 at pci0 dev 1 function 0 AMD Geode LX rev 0x31
> glxsb0 at pci0 dev 1 function 2 AMD Geode LX Crypto rev 0x00: RNG AES
> vr0 at pci0 dev 9 function 0 VIA VT6105M RhineIII rev 0x96: irq 10, address 00:0d:b9:12:9f:2c
> ukphy0 at vr0 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI 0x004063, model 0x0034
> vr1 at pci0 dev 10 function 0 VIA VT6105M RhineIII rev 0x96: irq 11, address 00:0d:b9:12:9f:2d
> ukphy1 at vr1 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI 0x004063, model 0x0034
> vr2 at pci0 dev 11 function 0 VIA VT6105M RhineIII rev 0x96: irq 12, address 00:0d:b9:12:9f:2e
> ukphy2 at vr2 phy 1: Generic IEEE 802.3u media interface, rev. 3: OUI 0x004063, model 0x0034
> ral0 at pci0 dev 12 function 0 Ralink RT2560 rev 0x01: irq 9, address 00:11:09:0d:d3:36
> ral0: MAC/BBP RT2560 (rev 0x04), RF RT2525
> glxpcib0 at pci0 dev 15 function 0 AMD CS5536 ISA rev 0x03: rev 3, 32-bit 3579545Hz timer, watchdog, gpio
> gpio0 at glxpcib0: 32 pins
> pciide0 at pci0 dev 15 function 2 AMD CS5536 IDE rev 0x01: DMA, channel 0 wired to compatibility, channel 1 wired to compatibility
> wd0 at pciide0 channel 0 drive 0: LEXAR ATA FLASH CARD
> wd0: 1-sector PIO, LBA, 15263MB, 31260096 sectors
> wd0(pciide0:0:0): using PIO mode 4, Ultra-DMA mode 2
> pciide0: channel 1 ignored (disabled)
> ohci0 at pci0 dev 15 function 4 AMD CS5536 USB rev 0x02: irq 15, version 1.0, legacy support
> ehci0 at pci0 dev 15 function 5 AMD CS5536 USB rev 0x02: irq 15
> usb0 at ehci0: USB revision 2.0
> uhub0 at usb0 AMD EHCI root hub rev 2.00/1.00 addr 1
> isa0 at glxpcib0
> isadma0 at isa0
> com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
> com0: console
> pcppi0 at isa0 port 0x61
> spkr0 at pcppi0
> npx0 at isa0 port 0xf0/16: reported by CPUID; using exception 16
> usb1 at ohci0: USB revision 1.0
> uhub1 at usb1 AMD OHCI root hub rev 1.00/1.00 addr 1
> mtrr: K6-family MTRR support (2 registers)
> nvram: invalid checksum
> vscsi0 at root
> scsibus0 at vscsi0: 256 targets
> softraid0 at root
> scsibus1 at softraid0: 256 targets
> root on wd0a (5bea3261eefd6b7e.a) swap on wd0b dump on wd0b
> clock: unknown CMOS layout

--
Sending from my VCR
Re: ALIX segfaulting on current/i386
On Sun, Feb 19, 2012 at 6:17 AM, Jan Stary <h...@stare.cz> wrote:
> On a recent install of current/i386 on an ALIX (see dmesg below), processes (such as a simple 'ls') started to magically segfault and die.
>
> Feb 19 14:43:17 www /bsd: pid 26001 (bogofilter): user write of 4096@0x3d5b000 at 1776 failed: 14

14 == EFAULT. Those are generated when the kernel tries to write out a process's memory image for a coredump and the indicated range of memory couldn't be faulted in so that it could be written to the filesystem.

> What does this indicate? Is my RAM bad? Is my CF card bad? Could someone more knowledgeable please explain the above messages in detail?

The inability to fault in memory that the kernel thinks should be there makes me wonder if you're swapping and the device you're swapping to is failing. Your dmesg suggests you might be swapping to your CF card and you (only?) have 128MB of real memory. When this is happening, what's the output of swapctl -l? If that shows you are indeed into swap, then a failing CF card would be my guess.

(Swapping to CF seems like a bad idea to me, but I'm not expert in that sort of hardware...)

Philip Guenther
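For anyone decoding such messages by hand, here is a minimal Python sketch that pulls the fields out of one of the log lines quoted in this thread and maps the trailing number to an errno name. The regular expression and the field names (length, address, offset) are assumptions based only on the sample lines, not on the kernel source, and Python's errno table is that of the host it runs on, which happens to agree with OpenBSD for 14.

```python
import errno
import os
import re

# Pattern inferred from the sample log lines in this thread (an assumption,
# not taken from the kernel source). Field names are illustrative.
LINE_RE = re.compile(
    r"pid (?P<pid>\d+) \((?P<comm>[^)]+)\): user write of "
    r"(?P<length>\d+)@(?P<addr>0x[0-9a-fA-F]+) at (?P<offset>\d+) "
    r"failed: (?P<err>\d+)"
)


def decode_core_write_failure(line):
    """Return a dict describing one failed coredump write, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"))
    return {
        "pid": int(m.group("pid")),
        "command": m.group("comm"),
        "length": int(m.group("length")),
        "address": int(m.group("addr"), 16),
        "offset": int(m.group("offset")),
        "errno": err,
        # Look the number up in the host's errno table.
        "errno_name": errno.errorcode.get(err, "unknown"),
        "message": os.strerror(err),
    }


sample = ("Feb 19 14:43:17 www /bsd: pid 26001 (bogofilter): "
          "user write of 4096@0x3d5b000 at 1776 failed: 14")
info = decode_core_write_failure(sample)
print(info["command"], info["errno_name"], info["message"])
```

Feeding the whole log through such a function makes it easy to see which commands are affected and whether the error number is always the same.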
Re: ALIX segfaulting on current/i386
On Feb 19 15:00:18, Gonzalo L. R. wrote:
> Try with the last snapshot:
>
> OpenBSD 5.1 (GENERIC) #160: Sun Feb 12 09:46:33 MST 2012
>     dera...@i386.openbsd.org:/usr/src/sys/arch/i386/compile/GENERIC
>
> I have the same machine and no problems.

This machine has been running for five years without problems; that's why I am speculating about a HW failure ...

> On Sun, 19 Feb 2012 15:17:41 +0100, Jan Stary wrote:
> > On a recent install of current/i386 on an ALIX (see dmesg below), processes (such as a simple 'ls') started to magically segfault and die.
> >
> > Feb 19 14:43:17 www /bsd: pid 26001 (bogofilter): user write of 4096@0x3d5b000 at 1776 failed: 14
> > Feb 19 14:44:08 www /bsd: pid 7571 (bogofilter): user write of 4096@0x2cfe000 at 1360 failed: 14
> > Feb 19 14:45:04 www /bsd: pid 9409 (sh): user write of 4096@0x8434b000 at 99760 failed: 14
> > Feb 19 14:45:12 www /bsd: pid 18943 (cron): user write of 4096@0x7c23e000 at 165360 failed: 14
> > Feb 19 14:46:38 www /bsd: pid 18781 (error): user write of 118784@0x2eddd000 at 145008 failed: 14
> > Feb 19 14:47:39 www /bsd: pid 10912 (flush): user write of 4096@0xb0dc000 at 1360 failed: 14
> > Feb 19 14:47:52 www /bsd: pid 13255 (cleanup): user write of 4096@0x66 at 1872 failed: 14
> >
> > What does this indicate? Is my RAM bad? Is my CF card bad? Could someone more knowledgeable please explain the above messages in detail?
Re: ALIX segfaulting on current/i386
On Feb 19 10:12:03, Philip Guenther wrote:
> On Sun, Feb 19, 2012 at 6:17 AM, Jan Stary <h...@stare.cz> wrote:
> > On a recent install of current/i386 on an ALIX (see dmesg below), processes (such as a simple 'ls') started to magically segfault and die.
> >
> > Feb 19 14:43:17 www /bsd: pid 26001 (bogofilter): user write of 4096@0x3d5b000 at 1776 failed: 14
>
> 14 == EFAULT. Those are generated when the kernel tries to write out a process's memory image for a coredump and the indicated range of memory couldn't be faulted in so that it could be written to the filesystem.

Thank you for the explanation. So, firstly, the kernel decides a process needs to be coredumped. (That alone is a problem for me - why would that happen?) And secondly, the attempt to coredump the process fails. Right?

> > What does this indicate? Is my RAM bad? Is my CF card bad? Could someone more knowledgeable please explain the above messages in detail?
>
> The inability to fault in memory that the kernel thinks should be there makes me wonder if you're swapping and the device you're swapping to is failing. Your dmesg suggests you might be swapping to your CF card and you (only?) have 128MB of real memory. When this is happening, what's the output of swapctl -l? If that shows you are indeed into swap, then a failing CF card would be my guess.

Yes, the machine only has 128MB of memory - which I think should be enough for what it does: NATing pf, dhcpd and resolver for the internal network, and postfix and httpd for my domain (which amounts to almost no traffic).

It does not have any swap configured. In fact, I try to design my systems so that they don't ever need to swap.

$ swapctl -l
swapctl: no swap devices configured

Would you please care to explain further how the swapping is related to the coredumping EFAULTs?

> (Swapping to CF seems like a bad idea to me, but I'm not expert in that sort of hardware...)

I don't swap to the CF.

If it so happens that there is not enough memory for some running process (a situation I cannot rule out now), and there is no swap to deal with this, is that a reason for a process to be coredumped? (I think that I have seen processes just die with ENOMEM in that situation.)

Thank you for your time

  Jan
Re: ALIX segfaulting on current/i386
On Sun, Feb 19, 2012, Jan Stary wrote:
> If it so happens that there is not enough memory for some running process (a situation I cannot rule out now), and there is no swap to deal with this, is that a reason for a process to be coredumped? (I think that I have seen processes just die with ENOMEM in that situation.)

ENOMEM is somewhat unlikely even in low memory situations, because the kernel allows overcommit. A process can allocate memory that's not technically available, then when it tries to use that memory and the kernel can't find anything to provide, segfault.

ENOMEM is an error code, but not a signal, so a process cannot strictly speaking die from it.
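The distinction between an error code and a signal can be seen from the parent's side of a child process. The following is a small Python sketch for a POSIX system, not anything OpenBSD-specific: one child notices a failure and exits on its own with ENOMEM as its status, the other is killed by SIGSEGV. The helper name describe_exit is made up for illustration, and the second child disables its own core file so the demonstration leaves nothing behind.

```python
import errno
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Classify a return code as reported by Python's subprocess module."""
    if returncode < 0:
        # subprocess reports death by signal N as the negative value -N.
        return "killed by signal " + signal.Signals(-returncode).name
    return "exited with status %d" % returncode


# Child 1: treats the failure as an ordinary error and chooses to exit,
# using the ENOMEM number as its exit status.
handled = subprocess.run(
    [sys.executable, "-c", "import errno, sys; sys.exit(errno.ENOMEM)"]
)

# Child 2: is terminated by SIGSEGV. It first sets its core size limit to 0
# so that no core file is written.
crashed = subprocess.run(
    [sys.executable, "-c",
     "import os, resource, signal;"
     "resource.setrlimit(resource.RLIMIT_CORE, (0, 0));"
     "os.kill(os.getpid(), signal.SIGSEGV)"]
)

print("child 1:", describe_exit(handled.returncode))
print("child 2:", describe_exit(crashed.returncode))
```

Only the second child "dies" in the sense discussed here. The first one received an error value and decided what to do with it.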
Re: ALIX segfaulting on current/i386
On Sun, Feb 19, 2012 at 11:24 AM, Jan Stary <h...@stare.cz> wrote:
> On Feb 19 10:12:03, Philip Guenther wrote:
> > On Sun, Feb 19, 2012 at 6:17 AM, Jan Stary <h...@stare.cz> wrote:
> > > On a recent install of current/i386 on an ALIX (see dmesg below), processes (such as a simple 'ls') started to magically segfault and die.
> > >
> > > Feb 19 14:43:17 www /bsd: pid 26001 (bogofilter): user write of 4096@0x3d5b000 at 1776 failed: 14
> >
> > 14 == EFAULT. Those are generated when the kernel tries to write out a process's memory image for a coredump and the indicated range of memory couldn't be faulted in so that it could be written to the filesystem.
>
> Thank you for the explanation. So, firstly, the kernel decides a process needs to be coredumped. (That alone is a problem for me - why would that happen?) And secondly, the attempt to coredump the process fails. Right?

Yep.

> > > What does this indicate? Is my RAM bad? Is my CF card bad? Could someone more knowledgeable please explain the above messages in detail?
> >
> > The inability to fault in memory that the kernel thinks should be there makes me wonder if you're swapping and the device you're swapping to is failing. Your dmesg suggests you might be swapping to your CF card and you (only?) have 128MB of real memory. When this is happening, what's the output of swapctl -l? If that shows you are indeed into swap, then a failing CF card would be my guess.
>
> Yes, the machine only has 128MB of memory - which I think should be enough for what it does: NATing pf, dhcpd and resolver for the internal network, and postfix and httpd for my domain (which amounts to almost no traffic).

Have you monitored the memory usage to confirm or deny your belief that it's sufficient?

> It does not have any swap configured. In fact, I try to design my systems so that they don't ever need to swap.
>
> $ swapctl -l
> swapctl: no swap devices configured
>
> Would you please care to explain further how the swapping is related to the coredumping EFAULTs?

It was a hypothesis based on the available evidence. Your additional evidence rules it out, so I see no reason to waste our time explaining it.

At this point, I suggest you gather data about the system and see if there's a correlation between the data and when this occurs. Then make a hypothesis from that, figure out a way to test it, etc. In short, use *SCIENCE* on it!

Philip Guenther
Re: ALIX segfaulting on current/i386
> > On Feb 19 10:12:03, Philip Guenther wrote:
> > > 14 == EFAULT. Those are generated when the kernel tries to write out a process's memory image for a coredump and the indicated range of memory couldn't be faulted in so that it could be written to the filesystem.
> > >
> > > The inability to fault in memory that the kernel thinks should be there makes me wonder if you're swapping and the device you're swapping to is failing. Your dmesg suggests you might be swapping to your CF card and you (only?) have 128MB of real memory. When this is happening, what's the output of swapctl -l? If that shows you are indeed into swap, then a failing CF card would be my guess.
> >
> > Yes, the machine only has 128MB of memory - which I think should be enough for what it does: NATing pf, dhcpd and resolver for the internal network, and postfix and httpd for my domain (which amounts to almost no traffic).
> >
> > It does not have any swap configured. In fact, I try to design my systems so that they don't ever need to swap.
> >
> > $ swapctl -l
> > swapctl: no swap devices configured
> >
> > If it so happens that there is not enough memory for some running process (a situation I cannot rule out now), and there is no swap to deal with this, is that a reason for a process to be coredumped?

On Feb 19 12:26:03, Philip Guenther wrote:
> Have you monitored the memory usage to confirm or deny your belief that it's sufficient?

I have now (and should have before of course). The memory is _not_ sufficient because of a single process (a demanding user's tomcat installation) eating all the memory. Specifically, the user is in the 'default' login class, which entitles him to 512M on this 128M machine. His java process requires about 180M (says top).

On Feb 19 15:17:31, Ted Unangst wrote:
> > If it so happens that there is not enough memory for some running process (a situation I cannot rule out now), and there is no swap to deal with this, is that a reason for a process to be coredumped? (I think that I have seen processes just die with ENOMEM in that situation.)
>
> ENOMEM is somewhat unlikely even in low memory situations, because the kernel allows overcommit. A process can allocate memory that's not technically available,

This must be what's happening on my machine now: a java process getting 180M on a 128M machine.

> then when it tries to use that memory and the kernel can't find anything to provide, segfault.

What puzzles me is that it's the other processes that are segfaulting; the java process that ate the (nonexistent) memory keeps running, but an innocent vi(1) gets killed later, or an ntpd trying to sync. Is this how the memory overcommit functionality works?

I killed the java process about an hour ago, and limited the memory usage in login.conf to

    :datasize-max=100M:\
    :datasize-cur=100M:\

which makes the java process fail to start:

Error occurred during initialization of VM
Could not reserve enough space for object heap
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.

None of the symptoms has occurred since.

At any rate, thank you for the hints. What can I do to further test my current suspicion that the memory insufficiency is what was causing it?

  Jan
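As a rough sanity check on the numbers mentioned in this thread, here is a minimal Python sketch of the overcommit arithmetic. It is deliberately crude: it ignores the kernel's own usage and every other process, it treats the "180M" from top as 180 MiB of memory that actually needs backing (an assumption, since that figure may include pages never touched), and the function name memory_shortfall is made up for illustration.

```python
MIB = 1024 * 1024


def memory_shortfall(avail_bytes, swap_bytes, demands_bytes):
    """Bytes by which the summed demands exceed RAM plus swap (0 if they fit)."""
    backing = avail_bytes + swap_bytes
    return max(0, sum(demands_bytes) - backing)


avail = 121544704   # "avail mem" from the dmesg posted in this thread
swap = 0            # swapctl -l reported no swap devices configured
java = 180 * MIB    # "about 180M (says top)", read as MiB (assumption)

# Shortfall with the java process as it was, and with a process capped
# at the 100M datasize limit set in login.conf.
before = memory_shortfall(avail, swap, [java])
after = memory_shortfall(avail, swap, [100 * MIB])

print(before // MIB, "MiB short before the limit;",
      after // MIB, "MiB short after")
```

Under these assumptions the single java process alone asks for more than the machine can back, while a process held to the 100M limit fits. That is consistent with the symptoms stopping, though it does not prove the cause.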