Ariane van der Steldt wrote:
On Thu, Feb 04, 2010 at 09:29:13AM -0700, Jeff Ross wrote:
Jeff Ross wrote:
On Sat, 30 Jan 2010, Bob Beck wrote:

Ooooh. nice one. Obviously ami couldn't get memory mappings and freaked out.

While not completely necessary, I'd love for you to file that whole
thing into sendbug() as a PR so we don't forget it. But for that one I
need to pester krw, art, dlg, and maybe marco about what ami is doing.

Note that the behaviour you see, with free memory dropping but swap
never being touched, is what I expect. Basically that makes the buffer
cache subordinate to working set memory, anywhere between 10 and 90% of
physmem: the buffer cache will throw away pages before allowing the
system to swap.
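(Rough arithmetic, assuming something like a 4GB box: at 90% the cache
could claim up to about 4GB * 0.90 = ~3.6GB worth of pages, at 10% only
~400MB, but either way those pages are handed back to the working set
before the pager ever pushes anything out to swap.)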

Drop it back to 70% and tell me if you still get the same panic,
please. And if you have a fixed test case that reproduces this on your
machine (a load generator for postgres with clients), I'd love to have
a copy in the PR as well.
70% produces the same panic.

panic: pmap_enter: no pv entries available
Stopped at      Debugger+0x4:   leave
RUN AT LEAST 'trace' AND 'ps' AND INCLUDE OUTPUT WHEN REPORTING THIS PANIC!
IF RUNNING SMP, USE 'mach ddbcpu <#>' AND 'trace' ON OTHER PROCESSORS, TOO.
DO NOT EVEN BOTHER REPORTING THIS WITHOUT INCLUDING THAT INFORMATION!
ddb{0}> trace
Debugger(7149,dfe3ede8,d08edf18,c,0) at Debugger+0x4
panic(d077d740,0,7000,d08ad6e0,ffffffff) at panic+0x55
pmap_enter(d08f3520,e0031000,7149000,3,13) at pmap_enter+0x2e5
_bus_dmamem_map(d0875c40,d505fc44,1,6344,d505fc58,1,dfe3eebc,1) at _bus_dmamem_map+0x9c
ami_allocmem(d49b5800,6344,20,d0753ddc) at ami_allocmem+0x92
ami_mgmt(d49b5800,a1,4,0,0,6344,d49cd000,1) at ami_mgmt+0x268
ami_refresh_sensors(d49b5800,da987028,da987050,80000000,da987028) at ami_refresh_sensors+0x25
sensor_task_work(d49b3d80,0,50,200286) at sensor_task_work+0x1f
workq_thread(d0863100) at workq_thread+0x32
Bad frame pointer: 0xd0a32e78
ddb{0}>

I'll skip the ps for this go-round since it should be pretty much the
same, and go directly to sendbug, including the pgbench script I use to
trigger it.

Thanks!

Jeff
Okay, I've been testing. I brought everything up to -current, applied
the ami.c patch sent by David Gwynne as modified by Philip Guenther,
and the patch to bus_dma.c sent by Kenneth Westerback.

I started by setting kern.bufcachepercent=60 and then moving down by 10 after
each panic. Anything 20 or greater triggers the same panic as above.

I then set it to 10 to see what would happen. The load ran okay, but I
did get three "uvm_mapent_alloc: out of static map entries" messages on
the console, which seem to coincide with the end of one of the three
pgbench runs and the start of the next.

So I set it to 11 and got this:

ddb{2}> show panic
malloc: out of space in kmem_map
ddb{2}> trace
Debugger(3fff,c,d488a000,4,4000) at Debugger+0x4
panic(d0752c20,0,4000,0,0) at panic+0x55
malloc(4000,7f,0,da8980b4,70000) at malloc+0x76
vfs_getcwd_scandir(e002eee8,e002eecc,e002eed0,d4e80400,da24f184) at vfs_getcwd_scandir+0x123
vfs_getcwd_common(da898154,0,e002ef10,d4e80400,200,1,da24f184,22) at vfs_getcwd_common+0x1f0
sys___getcwd(da24f184,e002ef68,e002ef58,da24f184) at sys___getcwd+0x62
syscall() at syscall+0x12b
--- syscall (number 304) ---
0x1c028f25:

I reported a similar panic back in December
        http://kerneltrap.org/mailarchive/openbsd-misc/2009/12/14/6309363
and was told I'd twisted the knobs too hard ;-)

Here are the sysctl values I'm currently using:

kern.maxproc=10240
kern.maxfiles=20480
kern.maxvnodes=6000

kern.shminfo.shmseg=32
kern.seminfo.semmni=256
kern.seminfo.semmns=2048
kern.shminfo.shmall=512000
kern.shminfo.shmmax=768000000
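
(For scale, as I understand the units: shmall is counted in 4KB pages,
so 512000 allows about 2GB of SysV shared memory in total, while shmmax
is in bytes, so 768000000 caps a single segment at roughly 732MB.)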

About that time Owain Ainsworth sent his version of a fix to bus_dma.c,
so I applied that and built a new kernel. I still get panics when I
adjust kern.bufcachepercent above 15 or so.

Here's the latest panic, trace and ps with kern.bufcachepercent set to 20:
ddb{0}> show panic
pmap_enter: no pv entries available
ddb{0}> trace
Debugger(b6f30,dff02e1c,2000,c,0) at Debugger+0x4
panic(d077d7a0,0,0,d0e7a980,1) at panic+0x55
pmap_enter(d08f3520,e002d000,b6f30000,7,13) at pmap_enter+0x2e5
uvm_km_alloc1(d08ad720,2000,0,1) at uvm_km_alloc1+0xd5
fork1(da8c5834,14,1,0,0) at fork1+0x100
sys_fork(da8c5834,dff02f68,dff02f58,da8c5834) at sys_fork+0x38
syscall() at syscall+0x12b
--- syscall (number 2) ---
ddb{0}> ps
    PID   PPID   PGRP    UID  S       FLAGS  WAIT          COMMAND
  18032   3768  18032    503  3   0x2000008  flt_pmfail2   postgres
  28815   3768  28815    503  3   0x2000008  inode         postgres
  16262   3768  16262    503  2   0x2000008                postgres
  15301   3768  15301    503  3   0x2000008  inode         postgres
  16712   3768  16712    503  2   0x2000008                postgres
   5959   3768   5959    503  3   0x2000008  flt_pmfail2   postgres
  24166   3768  24166    503  3   0x2000008  flt_pmfail2   postgres
  23692   3768  23692    503  3   0x2000008  flt_pmfail2   postgres
  21841   3768  21841    503  3   0x2000008  flt_pmfail2   postgres
  23838   3768  23838    503  3   0x2000008  flt_pmfail2   postgres
  25423   3768  25423    503  3   0x2000008  inode         postgres
   5075   3768   5075    503  3   0x2000008  inode         postgres
  23008   3768  23008    503  3   0x2000008  inode         postgres
  10527   3768  10527    503  3   0x2000008  flt_pmfail2   postgres
  17391   3768  17391    503  3   0x2000008  flt_pmfail2   postgres
  27363   3768  27363    503  3   0x2000008  flt_pmfail2   postgres
   4858   3768   4858    503  2   0x2000008                postgres
  18716   3768  18716    503  2   0x2000008                postgres
   8073   3768   8073    503  2   0x2000008                postgres
  30893   3768  30893    503  3   0x2000008  flt_pmfail2   postgres
  13741   3768  13741    503  3   0x2000008  flt_pmfail2   postgres
  14272   3768  14272    503  2   0x2000008                postgres
   1962   3768   1962    503  3   0x2000008  inode         postgres
   4988   3768   4988    503  3   0x2000008  flt_pmfail2   postgres
   8452   3768   8452    503  3   0x2000008  inode         postgres
  15633   3768  15633    503  3   0x2000008  flt_pmfail2   postgres
  17648   3768  17648    503  3   0x2000008  flt_pmfail2   postgres
  27751   3768  27751    503  3   0x2000008  inode         postgres
  12932   3768  12932    503  7   0x2000008                postgres
  21367   3768  21367    503  2   0x2000008                postgres
  13583   3768  13583    503  3   0x2000008  flt_pmfail2   postgres
  11813   3768  11813    503  3   0x2000008  inode         postgres
   6094   3768   6094    503  3   0x2000008  inode         postgres
  13686   3768  13686    503  3   0x2000008  flt_pmfail2   postgres
  10011   3768  10011    503  3   0x2000008  flt_pmfail2   postgres
  30652   3768  30652    503  7   0x2000008                postgres
   5358   3768   5358    503  3   0x2000008  flt_pmfail2   postgres
  24385   3768  24385    503  3   0x2000008  inode         postgres
  29773   3768  29773    503  3   0x2000008  flt_pmfail2   postgres
  22599   3768  22599    503  3   0x2000008  flt_pmfail2   postgres
  14231   3768  14231    503  3   0x2000008  flt_pmfail2   postgres
  14083   3768  14083    503  3   0x2000008  flt_pmfail2   postgres
   9729   3768   9729    503  2   0x2000008                postgres
  11748   3768  11748    503  3   0x2000008  inode         postgres
   9097   3768   9097    503  3   0x2000008  flt_pmfail2   postgres
  16057   3768  16057    503  3   0x2000008  inode         postgres
   1067   3768   1067    503  3   0x2000008  flt_pmfail2   postgres
  27445   3768  27445    503  3   0x2000008  flt_pmfail2   postgres
  24037   3768  24037    503  3   0x2000008  flt_pmfail2   postgres
   7000   3768   7000    503  2   0x2000008                postgres
  29367   3768  29367    503  3   0x2000008  inode         postgres
  25020   3768  25020    503  3   0x2000008  flt_pmfail2   postgres
  20564   3768  20564    503  3   0x2000008  flt_pmfail2   postgres
   4090   3768   4090    503  3   0x2000008  flt_pmfail2   postgres
   8865   3768   8865    503  3   0x2000008  flt_pmfail2   postgres
  29631   3768  29631    503  3   0x2000008  inode         postgres
  16278   3768  16278    503  7   0x2000008                postgres
  20854   3768  20854    503  2   0x2000008                postgres
  31943   3768  31943    503  3   0x2000008  flt_pmfail2   postgres
  18137   3768  18137    503  2   0x2000008                postgres
  10523   4345  10523      0  3   0x2044182  poll          systat
   4345      1   4345      0  3   0x2004082  pause         ksh
   2520      1   2520      0  3   0x2004082  ttyin         getty
  32389      1  32389      0  3   0x2004082  ttyin         getty
  21553      1  21553      0  3   0x2000080  select        cron
  22285      1  22285      0  3   0x2000080  select        sshd
  19843      1  19843      0  3   0x2000180  select        inetd
  24948  27412  17184     83  3   0x2000180  poll          ntpd
  27412  17184  17184     83  3   0x2000180  poll          ntpd
  17184      1  17184      0  3   0x2000080  poll          ntpd
   7640  25372  25372     74  3   0x2000180  bpf           pflogd
  25372      1  25372      0  3   0x2000080  netio         pflogd
  25643  12522  12522     73  3   0x2000180  poll          syslogd
  12522      1  12522      0  3   0x2000088  netio         syslogd
  31132   3768  31132    503  3   0x2000088  poll          postgres
   6731   3768   6731    503  3   0x2000088  select        postgres
   3796   3768   3796    503  2         0x8                postgres
  10178   3768  10178    503  3   0x2000088  select        postgres
  23616   3768  23616    503  3   0x2000088  select        postgres
   1109  22145  22399   7794  3   0x2004082  piperd        qmail-clean
   4633  22145  22399   7795  3   0x2004082  select        qmail-rspawn
  26222  22145  22399      0  3   0x2004082  select        qmail-lspawn
* 3768  27937  22399    503  7   0x200400a                postgres
   3493  26189  22399     73  3   0x2004082  piperd        multilog
  19747  10749  22399   7792  3   0x2004082  piperd        multilog
  31915  18366  22399   1002  3   0x2004082  piperd        multilog
    611  19270  22399     73  3   0x2004182  netio         socklog
   6710      1   6710      0  3   0x2004082  ttyin         getty
  30305      1  30305      0  3   0x2004082  ttyin         getty
   1774      1   1774      0  3   0x2004082  ttyin         getty
  25968  31940  22399   1001  3   0x2004182  poll          dnscache
  30323   3543  22399     73  3   0x2004082  piperd        multilog
  22145  15017  22399   7796  3   0x2004082  select        qmail-send
  26189  30365  22399      0  3   0x2004082  poll          supervise
  27937  30365  22399      0  3   0x2004082  poll          supervise
   3543  30365  22399      0  3   0x2004082  poll          supervise
  19270  30365  22399      0  3   0x2004082  poll          supervise
  10749  30365  22399      0  3   0x2004082  poll          supervise
  15017  30365  22399      0  3   0x2004082  poll          supervise
  18366  30365  22399      0  3   0x2004082  poll          supervise
  31940  30365  22399      0  3   0x2004082  poll          supervise
  30671  22399  22399      0  3   0x2004082  piperd        readproctitle
  30365  22399  22399      0  3   0x2004082  nanosleep     svscan
  22399      1  22399      0  3   0x2004082  pause         sh
     19      0      0      0  3   0x2100200  bored         crypto
     18      0      0      0  3   0x2100200  aiodoned      aiodoned
     17      0      0      0  3   0x2100200  syncer        update
     16      0      0      0  3   0x2100200  cleaner       cleaner
     15      0      0      0  3    0x100200  reaper        reaper
     14      0      0      0  2   0x2100200                pagedaemon
     13      0      0      0  3   0x2100200  pftm          pfpurge
     12      0      0      0  3   0x2100200  usbevt        usb2
     11      0      0      0  3   0x2100200  usbevt        usb1
     10      0      0      0  3   0x2100200  usbtsk        usbtask
      9      0      0      0  3   0x2100200  usbevt        usb0
      8      0      0      0  3   0x2100200  acpi_idle     acpi0
      7      0      0      0  3  0x40100200                idle3
      6      0      0      0  3  0x40100200                idle2
      5      0      0      0  3  0x40100200                idle1
      4      0      0      0  3   0x2100200  bored         syswq
      3      0      0      0  3  0x40100200                idle0
      2      0      0      0  3   0x2100200  kmalloc       kmthread
      1      0      1      0  3   0x2004080  wait          init
      0     -1      0      0  3   0x2080200  scheduler     swapper

  Anybody got an idea what I can test from here?

Maybe the shm maps shared between all the postgres processes cause
trouble. How many connections do you allow the server to have? How much
shm does postgres use?
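(ipcs -m on the box should show the segments postgres has actually
allocated and their sizes.)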


In sysctl I have this:

kern.shminfo.shmall=512000
kern.shminfo.shmmax=768000000

I'm only allowing 100 connections to PostgreSQL; however, these panics
are happening with roughly 60 clients connected.

The amount of shared memory multiplied by the number of postgres
processes mapping it gives the number of pv-entries required to keep
track of the mappings. Each pv-entry is 16 bytes (4 pointers of 4 bytes
each). That total probably overflows the kmem_map.
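
Roughly (reconstructing from memory, so take the exact field names as
an approximation), the i386 pmap keeps one of these for every pmap that
maps a given physical page:

    struct pv_entry {
            struct pv_entry *pv_next;  /* next mapping of this page */
            struct pmap     *pv_pmap;  /* pmap holding the mapping  */
            vaddr_t          pv_va;    /* virtual address used      */
            struct vm_page  *pv_ptp;   /* page table page, if any   */
    };                                 /* 4 * 4 bytes = 16 bytes    */

Shared pages therefore multiply quickly. As a very rough illustration,
if each of your ~60 backends touched a whole shmmax-sized (~732MB)
segment:

    732MB / 4KB per page          ~    187,000 pages
    187,000 pages * 60 processes  ~ 11,000,000 pv-entries
    11,000,000 * 16 bytes         ~      180MB of kernel memory

hence my question about vm.nkmempages.
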
What's the value of sysctl vm.nkmempages?


vm.nkmempages=16384
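(16384 pages * 4KB per page = 64MB of kmem_map, so if the pv-entry
estimate above is anywhere near right, it would blow past that several
times over.)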

Adapting any of these values should keep you from hitting the problem.
Simply making the I/O fail is not a good solution, I think; it would
only cause the buffer layer to be unable to sync after a while.

Can you try adapting the values above to confirm my analysis?

That is what I've been trying to do. This is (someday) destined to be a
web server driving PostgreSQL-backed Drupal web sites. The PostgreSQL
performance guidelines don't seem to apply to OpenBSD; for example,
they tell you to set shared_buffers in postgresql.conf to 1/4 of the
available memory in the server. I tried setting mine to 980MB (the box
has 4GB of RAM), which meant setting kern.shminfo.shmmax to 1GB just to
get postgres to start. Later I found out that each process on an i386
is limited to 1GB, so I backed that off to 600MB.

What I will try next is setting postgres's shared_buffers to 1/4 of 1GB, since that is really the amount of available memory I have per postmaster, and dropping kern.shminfo.shmmax accordingly. I know that will run but not terribly fast, unless maybe kicking kern.bufcachepercent back up to 90% will help and not cause a panic.
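
Concretely, the numbers I have in mind as a starting point (just a
guess for now) are something like:

shared_buffers = 256MB            (in postgresql.conf)
kern.shminfo.shmmax=300000000     (leaving headroom beyond
                                   shared_buffers for the rest of
                                   postgres's shared segment)

That should also keep the amount of shm mapped into every backend, and
hence the pv-entry load, much smaller.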

Ciao,

Thanks so much!

Jeff
