Re: dynamically calculating NKPT [was: Re: huge ktr buffer]

2013-02-05 Thread Alan Cox
On 02/05/2013 09:45, m...@freebsd.org wrote:
 On Tue, Feb 5, 2013 at 7:14 AM, Konstantin Belousov kostik...@gmail.com 
 wrote:
 On Mon, Feb 04, 2013 at 03:05:15PM -0800, Neel Natu wrote:
 Hi,

 I have a patch to dynamically calculate NKPT for amd64 kernels. This
 should fix the various issues that people pointed out in the email
 thread.

 Please review and let me know if there are any objections to committing 
 this.

 Also, thanks to Alan (alc@) for reviewing and providing feedback on
 the initial version of the patch.

 Patch (also available at 
 http://people.freebsd.org/~neel/patches/nkpt_diff.txt):

 Index: sys/amd64/include/pmap.h
 ===
 --- sys/amd64/include/pmap.h  (revision 246277)
 +++ sys/amd64/include/pmap.h  (working copy)
 @@ -113,13 +113,7 @@
 	 ((unsigned long)(l2) << PDRSHIFT) | \
 	 ((unsigned long)(l1) << PAGE_SHIFT))

 -/* Initial number of kernel page tables. */
 -#ifndef NKPT
 -#define	NKPT		32
 -#endif
 -
  #define	NKPML4E		1	/* number of kernel PML4 slots */
 -#define	NKPDPE		howmany(NKPT, NPDEPG)	/* number of kernel PDP slots */

  #define	NUPML4E		(NPML4EPG/2)	/* number of userland PML4 pages */
  #define	NUPDPE		(NUPML4E*NPDPEPG)	/* number of userland PDP pages */
 @@ -181,6 +175,7 @@
  #define	PML4map		((pd_entry_t *)(addr_PML4map))
  #define	PML4pml4e	((pd_entry_t *)(addr_PML4pml4e))

 +extern int nkpt;	/* Initial number of kernel page tables */
  extern u_int64_t KPDPphys;	/* physical address of kernel level 3 */
  extern u_int64_t KPML4phys;	/* physical address of kernel level 4 */

 Index: sys/amd64/amd64/minidump_machdep.c
 ===
 --- sys/amd64/amd64/minidump_machdep.c(revision 246277)
 +++ sys/amd64/amd64/minidump_machdep.c(working copy)
 @@ -232,7 +232,7 @@
   /* Walk page table pages, set bits in vm_page_dump */
   pmapsize = 0;
   pdp = (uint64_t *)PHYS_TO_DMAP(KPDPphys);
 - for (va = VM_MIN_KERNEL_ADDRESS; va < MAX(KERNBASE + NKPT * NBPDR,
 + for (va = VM_MIN_KERNEL_ADDRESS; va < MAX(KERNBASE + nkpt * NBPDR,
 	    kernel_vm_end); ) {
   /*
* We always write a page, even if it is zero. Each
 @@ -364,7 +364,7 @@
   /* Dump kernel page directory pages */
   bzero(fakepd, sizeof(fakepd));
   pdp = (uint64_t *)PHYS_TO_DMAP(KPDPphys);
 - for (va = VM_MIN_KERNEL_ADDRESS; va < MAX(KERNBASE + NKPT * NBPDR,
 + for (va = VM_MIN_KERNEL_ADDRESS; va < MAX(KERNBASE + nkpt * NBPDR,
 	    kernel_vm_end); va += NBPDP) {
 	i = (va >> PDPSHIFT) & ((1ul << NPDPEPGSHIFT) - 1);

 Index: sys/amd64/amd64/pmap.c
 ===
 --- sys/amd64/amd64/pmap.c(revision 246277)
 +++ sys/amd64/amd64/pmap.c(working copy)
 @@ -202,6 +202,10 @@
  vm_offset_t virtual_avail;   /* VA of first avail page (after kernel bss) 
 */
  vm_offset_t virtual_end; /* VA of last avail page (end of kernel AS) */

 +int nkpt;
 +SYSCTL_INT(_machdep, OID_AUTO, nkpt, CTLFLAG_RD, &nkpt, 0,
 +    "Number of kernel page table pages allocated on bootup");
 +
  static int ndmpdp;
  static vm_paddr_t dmaplimit;
  vm_offset_t kernel_vm_end = VM_MIN_KERNEL_ADDRESS;
 @@ -495,17 +499,42 @@

  CTASSERT(powerof2(NDMPML4E));

 +/* number of kernel PDP slots */
 +#define  NKPDPE(ptpgs)   howmany((ptpgs), NPDEPG)
 +
  static void
 +nkpt_init(vm_paddr_t addr)
 +{
 + int pt_pages;
 +
 +#ifdef NKPT
 + pt_pages = NKPT;
 +#else
 + pt_pages = howmany(addr, 1 << PDRSHIFT);
 + pt_pages += NKPDPE(pt_pages);
 +
 + /*
 +  * Add some slop beyond the bare minimum required for bootstrapping
 +  * the kernel.
 +  *
 +  * This is quite important when allocating KVA for kernel modules.
 +  * The modules are required to be linked in the negative 2GB of
 +  * the address space.  If we run out of KVA in this region then
 +  * pmap_growkernel() will need to allocate page table pages to map
 +  * the entire 512GB of KVA space which is an unnecessary tax on
 +  * physical memory.
 +  */
 + pt_pages += 4;  /* 8MB additional slop for kernel modules */
 8MB might be too low. I just checked one of my machines with a fully
 modularized kernel; it takes slightly more than 6 MB to load 50 modules.
 I think that 16MB would be safer, but it probably needs to be scaled
 down based on the available phys memory. The amd64 kernel can still be
 booted on a 128MB machine.
 Is there no way to not map the entire 512GB?  Otherwise this patch
 could really hose some vendors.  E.g. the kernel module for the OneFS
 file system is around 8MB all by itself.

Mapping the entire 512 GB from the start would require the preallocation
of 1 GB of memory for page table pages.
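
To make that 1 GB figure concrete (a back-of-the-envelope check, not from
the thread): one page table page maps 512 PTEs * 4 KB = 2 MB of KVA, so
covering the whole 512 GB takes 512 GB / 2 MB = 262144 page table pages,
i.e. 1 GB of physical memory.

#include <stdio.h>

int
main(void)
{
	unsigned long kva = 512UL << 30;	/* 512 GB kernel VA region */
	unsigned long perpt = 2UL << 20;	/* 2 MB of KVA per PT page */
	unsigned long ptpages = kva / perpt;	/* PT pages needed */

	/* Prints: 262144 PT pages = 1024 MB of RAM */
	printf("%lu PT pages = %lu MB of RAM\n",
	    ptpages, ptpages * 4096 / (1UL << 20));
	return (0);
}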

 I found when we moved from 

Re: dynamically calculating NKPT [was: Re: huge ktr buffer]

2013-02-05 Thread Alan Cox
On 02/05/2013 10:13, Konstantin Belousov wrote:
 On Tue, Feb 05, 2013 at 07:45:24AM -0800, m...@freebsd.org wrote:
 On Tue, Feb 5, 2013 at 7:14 AM, Konstantin Belousov kostik...@gmail.com 
 wrote:
 On Mon, Feb 04, 2013 at 03:05:15PM -0800, Neel Natu wrote:
 Hi,

 I have a patch to dynamically calculate NKPT for amd64 kernels. This
 should fix the various issues that people pointed out in the email
 thread.

 Please review and let me know if there are any objections to committing 
 this.

 Also, thanks to Alan (alc@) for reviewing and providing feedback on
 the initial version of the patch.

 Patch (also available at 
 http://people.freebsd.org/~neel/patches/nkpt_diff.txt):

 [... patch snipped; identical to the version quoted in full in the
 previous message ...]
 8MB might be too low. I just checked one of my machines with a fully
 modularized kernel; it takes slightly more than 6 MB to load 50 modules.
 I think that 16MB would be safer, but it probably needs to be scaled
 down based on the available phys memory. The amd64 kernel can still be
 booted on a 128MB machine.
 Is there no way to not map the entire 512GB?  Otherwise this patch
 could really hose some vendors.  E.g. the kernel module for the OneFS
 file system is around 8MB all by itself.
 No, I do not think that this patch would hose somebody with the 8MB
 

Re: kmem_map auto-sizing and size dependencies

2013-01-18 Thread Alan Cox
I'll follow up with detailed answers to your questions over the weekend.
For now, I will, however, point out that you've misinterpreted the
tunables.  In fact, they say that your kmem map can hold up to 16GB and the
current used space is about 58MB.  Like other things, the kmem map is
auto-sized based on the available physical memory and capped so as not to
consume too much of the overall kernel address space.
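
A stand-alone model of that auto-sizing (a paraphrase of kern_malloc.c's
kmeminit() of this era, not the literal code; the parameters correspond to
the vm.kmem_size_* tunables quoted below):

#include <stddef.h>
#include <stdio.h>

static size_t
kmem_size_model(size_t physmem_bytes, size_t scale, size_t size_min,
    size_t size_max)
{
	size_t kmem;

	kmem = physmem_bytes / scale;	/* vm.kmem_size_scale, default 1 */
	if (size_min != 0 && kmem < size_min)
		kmem = size_min;	/* vm.kmem_size_min floor */
	if (kmem > size_max)
		kmem = size_max;	/* vm.kmem_size_max caps it to a
					 * fraction of the kernel AS */
	return (kmem);
}

int
main(void)
{
	/* The 16GB amd64 box below: reproduces the observed vm.kmem_size. */
	printf("%zu\n", kmem_size_model(16594300928UL, 1, 0, 329853485875UL));
	return (0);
}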

Regards,
Alan

On Fri, Jan 18, 2013 at 9:29 AM, Andre Oppermann an...@freebsd.org wrote:

 The autotuning work is reaching into many places of the kernel, and
 while trying to tie up all loose ends I've got stuck in the kmem_map
 and how it works or what its limitations are.

 During startup the VM is initialized and an initial kernel virtual
 memory map is setup in kmem_init() covering the entire KVM address
 range.  Only the kernel itself is actually allocated within that
 map.  A bit later on a number of other submaps are allocated (clean_map,
 buffer_map, pager_map, exec_map).  Also in kmeminit() (in kern_malloc.c,
 different from kmem_init) the kmem_map is allocated.

 The (initial?) size of the kmem_map is determined by some voodoo magic,
 a sprinkle of nmbclusters * PAGE_SIZE incrementor and lots of tunables.
 However it seems to work out to an effective kmem_map_size of about 58MB
 on my 16GB AMD64 dev machine:

 vm.kvm_size: 549755809792
 vm.kvm_free: 530233421824
 vm.kmem_size: 16,594,300,928
 vm.kmem_size_min: 0
 vm.kmem_size_max: 329,853,485,875
 vm.kmem_size_scale: 1
 vm.kmem_map_size: 59,518,976
 vm.kmem_map_free: 16,534,777,856

 The kmem_map serves kernel malloc (via UMA), contigmalloc and everything
 else that uses UMA for memory allocation.

 Mbuf memory too is managed by UMA which obtains the backing kernel memory
 from the kmem_map.  The limits of the various mbuf memory types have
 been considerably raised recently and may make use of 50-75% of all
 physically present memory, or available KVM space, whichever is smaller.

 Now my questions/comments are:

  Does the kmem_map automatically extend itself if more memory is requested?

  Should it be set to a larger initial value based on min(physical,KVM)
  space available?

  The use of nmbclusters for the initial kmem_map size calculation isn't
  appropriate anymore due to it being set up later, and nmbclusters isn't
  the only relevant mbuf type.  We make significant use of page-sized mbuf
  clusters too.

  The naming and output of the various vm.kmem_* and vm.kvm_* sysctls is
  confusing and not easy to reconcile.  Either we need more values detailing
  more aspects, or fewer.  Plus perhaps sysctl subtrees to better describe
  the hierarchy of the maps.

  Why are separate kmem submaps being used?  Is it to limit memory usage of
  certain subsystems?  Are those limits actually enforced?

 --
 Andre


Re: huge ktr buffer

2012-12-06 Thread Alan Cox
On 12/06/2012 09:43, Davide Italiano wrote:
 On Thu, Dec 6, 2012 at 4:18 PM, Andriy Gapon a...@freebsd.org wrote:
 So I configured a kernel with the following option:
 options   KTR_ENTRIES=(1024UL*1024)
 then booted the kernel and did
 $ sysctl debug.ktr.clear=1
 and got an insta-reboot.

 No panic, nothing, just a reset.

 I suspect that the huge static buffer resulting from the above option could 
 be a
 cause.  But I would like to understand the details, if possible.

 Also, perhaps ktr could be a little bit more sophisticated with its buffer 
 than
 just using a static array.

 --
 Andriy Gapon
 ___
 freebsd-hackers@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
 To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
 It was a while ago, but running r238886 built using the following
 kernel configuration file:
 http://people.freebsd.org/~davide/DEBUG I found a similar issue.
  The machine panicked: fatal trap 12 with interrupts disabled in early boot
  (even before the appearance of the Beastie logo).
 Basically, my configuration file is just GENERIC with slight
 modifications, in particular debugging options (WITNESS, INVARIANTS,
 etc..) turned on and the following KTR options enabled:

 options KTR
 options KTR_COMPILE=(KTR_CALLOUT|KTR_PROC)
 options KTR_MASK=(KTR_CALLOUT|KTR_PROC)
 options KTR_ENTRIES=524288

  It seems the issue is related to KTR itself, and in particular to the
  value of KTR_ENTRIES. As long as this value is small (e.g. 2048)
  everything works fine and the boot sequence completes. If I choose 524288
  (the value you can also see from the kernel conf file) the fatal trap
  occurs.

  Even though it was really difficult for me to get information
  because the failure happens too early, I put some printf() calls in the
  code and isolated the point at which the kernel dies:
  (sys/amd64/amd64/machdep.c, in getmemsize())

  1540	/*
  1541	 * map page into kernel: valid, read/write, non-cacheable
  1542	 */
  1543	*pte = pa | PG_V | PG_RW | PG_N;


 As also Alan suggested, a way to workaround the problem is to increase
 NKPT value (e.g. from 32 to 64). Obviously, this is not a proper fix.
 For a proper fix the kernel needs to be able to dynamically set the
 size of NKPT.  In this particular case, this wouldn't be too hard, but
 there is a different case, where people preload a large memory disk
 image at boot time that isn't so easy to fix.


Andriy makes a good suggestion.  One that I think should be easy to
implement.  The KTR code already supports the use of a dynamically
allocated KTR buffer.  (See sysctl_debug_ktr_entries().)  Let's take
advantage of this.  Place a cap on the size of the (compile-time)
statically allocated buffer.  However, use this buffer early in the
kernel initialization process, specifically, up until SI_SUB_KMEM has
completed.  At that point, switch to a dynamically allocated buffer and
copy over the entries from the statically allocated buffer.  Relatively
speaking, SI_SUB_KMEM is early enough in the boot process that I
doubt many people wanting an enormous KTR buffer will be impacted by the
cap.  In fact, I think you could implement overflow detection without
pessimizing the KTR code.

Alan

P.S. There are other reasons that having an enormous statically
allocated array in the kernel is undesirable.  The first that comes to
mind is that it eats up memory at low physical addresses, which is
sometimes needed for special purposes.  So, I think there are good
reasons besides the NKPT issue to shift the KTR code to dynamic allocation.
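
A minimal sketch of the switch-over described above (an illustration, not
code from the thread: KTR_BOOT_ENTRIES, ktr_requested_entries, and M_KTRBUF
are made-up names, and real code would also have to handle ktr_idx and
concurrent writers):

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/ktr.h>
#include <sys/malloc.h>
#include <sys/systm.h>

#define	KTR_BOOT_ENTRIES	1024	/* capped static buffer */

static MALLOC_DEFINE(M_KTRBUF, "ktrbuf", "KTR trace buffer");

static struct ktr_entry ktr_buf_init[KTR_BOOT_ENTRIES];
static struct ktr_entry *ktr_buf = ktr_buf_init;
static int ktr_entries = KTR_BOOT_ENTRIES;
static int ktr_requested_entries = KTR_ENTRIES;	/* from the kernel config */

static void
ktr_buf_switch(void *arg __unused)
{
	struct ktr_entry *new;

	/* The allocator is up now, so an enormous buffer no longer has
	 * to be a static array in the kernel image. */
	new = malloc(sizeof(*new) * ktr_requested_entries, M_KTRBUF,
	    M_WAITOK | M_ZERO);
	/* Preserve whatever was logged during early boot. */
	bcopy(ktr_buf_init, new, sizeof(ktr_buf_init));
	ktr_buf = new;
	ktr_entries = ktr_requested_entries;
}
SYSINIT(ktr_buf_switch, SI_SUB_KMEM, SI_ORDER_ANY, ktr_buf_switch, NULL);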



Re: Memory reserves or lack thereof

2012-11-15 Thread Alan Cox
On 11/13/2012 05:54, Konstantin Belousov wrote:
 On Mon, Nov 12, 2012 at 05:10:01PM -0600, Alan Cox wrote:
 On 11/12/2012 3:48 PM, Konstantin Belousov wrote:
 On Mon, Nov 12, 2012 at 01:28:02PM -0800, Sushanth Rai wrote:
 This patch still doesn't address the issue of M_NOWAIT calls driving
 the memory all the way down to 2 pages, right?  It would be nice to
 have M_NOWAIT just be a non-sleep version of M_WAITOK, with an
 M_USE_RESERVE flag to dig deep.
 This is out of scope of the change. But it is required for any further
 adjustments.
 I would suggest a somewhat different response:

 The patch does make M_NOWAIT into a non-sleep version of M_WAITOK and 
 does reintroduce M_USE_RESERVE as a way to specify "dig deep".

 Currently, both M_NOWAIT and M_WAITOK can drive the cache/free memory 
 down to two pages.  The effect of the patch is to stop M_NOWAIT at two 
 pages rather than allowing it to continue to zero pages.

 When you say, "This is out of scope ...", I believe that you are 
 referring to changing two pages into something larger.  I agree that 
 this is out of scope for the current change.
 I referred exactly to the difference between M_USE_RESERVE set or not.
 IMO this is what was asked by the question's author. So yes, my meaning
 of 'out of scope' is about tweaking the 'two pages reserve' in some
 way.

Since M_USE_RESERVE is no longer deprecated in HEAD, here is my proposed
man page update to malloc(9):

Index: share/man/man9/malloc.9
===
--- share/man/man9/malloc.9 (revision 243091)
+++ share/man/man9/malloc.9 (working copy)
@@ -29,7 +29,7 @@
 .\" $NetBSD: malloc.9,v 1.3 1996/11/11 00:05:11 lukem Exp $
 .\" $FreeBSD$
 .\"
-.Dd January 28, 2012
+.Dd November 15, 2012
 .Dt MALLOC 9
 .Os
 .Sh NAME
@@ -153,13 +153,12 @@ if
 .Dv M_WAITOK
 is specified.
 .It Dv M_USE_RESERVE
-Indicates that the system can dig into its reserve in order to obtain the
-requested memory.
-This option used to be called
-.Dv M_KERNEL
-but has been renamed to something more obvious.
-This option has been deprecated and is slowly being removed from the kernel,
-and so should not be used with any new programming.
+Indicates that the system can use its reserve of memory to satisfy the
+request.
+This option should only be used in combination with
+.Dv M_NOWAIT
+when an allocation failure cannot be tolerated by the caller without
+catastrophic effects on the system.
 .El
 .Pp
 Exactly one of either
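
As an illustration of that guidance (an example, not from the man page or
this thread; critical_alloc() is a made-up caller):

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <sys/systm.h>

static void *
critical_alloc(size_t size)
{
	void *p;

	/* This path cannot sleep, and an allocation failure would be
	 * catastrophic: dig into the reserve rather than returning
	 * NULL under memory pressure. */
	p = malloc(size, M_DEVBUF, M_NOWAIT | M_USE_RESERVE);
	if (p == NULL)
		panic("critical_alloc: reserve exhausted");
	return (p);
}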



Re: Memory reserves or lack thereof

2012-11-15 Thread Alan Cox
On 11/15/2012 12:21, Konstantin Belousov wrote:
 On Thu, Nov 15, 2012 at 11:32:18AM -0600, Alan Cox wrote:
 On 11/13/2012 05:54, Konstantin Belousov wrote:
 On Mon, Nov 12, 2012 at 05:10:01PM -0600, Alan Cox wrote:
 On 11/12/2012 3:48 PM, Konstantin Belousov wrote:
 On Mon, Nov 12, 2012 at 01:28:02PM -0800, Sushanth Rai wrote:
  This patch still doesn't address the issue of M_NOWAIT calls driving
  the memory all the way down to 2 pages, right?  It would be nice to
  have M_NOWAIT just be a non-sleep version of M_WAITOK, with an
  M_USE_RESERVE flag to dig deep.
  This is out of scope of the change. But it is required for any further
  adjustments.
 I would suggest a somewhat different response:

 The patch does make M_NOWAIT into a non-sleep version of M_WAITOK and 
  does reintroduce M_USE_RESERVE as a way to specify "dig deep".

 Currently, both M_NOWAIT and M_WAITOK can drive the cache/free memory 
 down to two pages.  The effect of the patch is to stop M_NOWAIT at two 
 pages rather than allowing it to continue to zero pages.

  When you say, "This is out of scope ...", I believe that you are 
 referring to changing two pages into something larger.  I agree that 
 this is out of scope for the current change.
  I referred exactly to the difference between M_USE_RESERVE set or not.
  IMO this is what was asked by the question's author. So yes, my meaning
  of 'out of scope' is about tweaking the 'two pages reserve' in some
  way.
 Since M_USE_RESERVE is no longer deprecated in HEAD, here is my proposed
 man page update to malloc(9):

  [... man page diff snipped; identical to the version quoted in full in
  the previous message ...]
 The text looks fine. Shouldn't the requirement for M_USE_RESERVE be also
 expressed in KASSERT, like this:

 diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h
 index d9e4692..f8a4f70 100644
 --- a/sys/vm/vm_page.h
 +++ b/sys/vm/vm_page.h
  @@ -353,6 +351,9 @@ malloc2vm_flags(int malloc_flags)
   {
  	int pflags;
  
  +	KASSERT((malloc_flags & M_USE_RESERVE) == 0 ||
  +	    (malloc_flags & M_NOWAIT) != 0,
  +	    ("M_USE_RESERVE requires M_NOWAIT"));
  	pflags = (malloc_flags & M_USE_RESERVE) != 0 ? VM_ALLOC_INTERRUPT :
  	    VM_ALLOC_SYSTEM;
  	if ((malloc_flags & M_ZERO) != 0)

  I understand that this could be added at the allocators' entry points,
  but I think that the page allocation layer is fine too.

Yes, please do that.

Alan
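
An illustration (not from the thread) of what the assertion catches: under
INVARIANTS, a sleepable allocation that also asks for the reserve can now
panic when the request reaches the page allocator, instead of silently
succeeding.

#include <sys/param.h>
#include <sys/malloc.h>
#include <sys/systm.h>

static void
bad_caller(void)
{
	void *p;

	/* Trips the new KASSERT ("M_USE_RESERVE requires M_NOWAIT"):
	 * a caller that is allowed to sleep has no business digging
	 * into the last-resort reserve. */
	p = malloc(PAGE_SIZE, M_TEMP, M_WAITOK | M_USE_RESERVE);
	free(p, M_TEMP);
}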



Re: Memory reserves or lack thereof

2012-11-13 Thread Alan Cox
On 11/12/2012 11:35, Alan Cox wrote:
 On 11/12/2012 07:36, Konstantin Belousov wrote:
 On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote:
 On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov 
 kostik...@gmail.com wrote:

 [... quoted discussion snipped; it appears in full in the message of
 2012-11-11 later in this digest ...]

 P.S. I suspect that we should also increase the size of the page reserve
 that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*().  How
 many legitimate users of a new M_USE_RESERVE-like flag in today's kernel
 could actually be satisfied by two pages?
  I am almost sure that most people who pass the M_NOWAIT flag do not
  know about the 'allow the deeper drain of free queue' effect. As such, I
  believe we should flip the meaning of M_NOWAIT/M_USE_RESERVE. My only
  expectation of problematic places would be in the swapout path.

 I found a single explicit use of M_USE_RESERVE in the kernel,
 so the flip is relatively simple

Re: Memory reserves or lack thereof

2012-11-12 Thread Alan Cox
On 11/12/2012 07:36, Konstantin Belousov wrote:
 On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote:
 On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov 
 kostik...@gmail.com wrote:

 [... quoted discussion snipped; it appears in full in the message of
 2012-11-11 later in this digest ...]

 P.S. I suspect that we should also increase the size of the page reserve
 that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*().  How
 many legitimate users of a new M_USE_RESERVE-like flag in today's kernel
 could actually be satisfied by two pages?
 I am almost sure that most people who pass the M_NOWAIT flag do not
 know about the 'allow the deeper drain of free queue' effect. As such, I
 believe we should flip the meaning of M_NOWAIT/M_USE_RESERVE. My only
 expectation of problematic places would be in the swapout path.

 I found a single explicit use of M_USE_RESERVE in the kernel,
 so the flip is relatively simple.

Agreed.  Most recently I eliminated several

Re: Memory reserves or lack thereof

2012-11-12 Thread Alan Cox

On 11/12/2012 3:48 PM, Konstantin Belousov wrote:

On Mon, Nov 12, 2012 at 01:28:02PM -0800, Sushanth Rai wrote:

This patch still doesn't address the issue of M_NOWAIT calls driving
the memory all the way down to 2 pages, right?  It would be nice to
have M_NOWAIT just be a non-sleep version of M_WAITOK, with an
M_USE_RESERVE flag to dig deep.

This is out of scope of the change. But it is required for any further
adjustments.


I would suggest a somewhat different response:

The patch does make M_NOWAIT into a non-sleep version of M_WAITOK and 
does reintroduce M_USE_RESERVE as a way to specify "dig deep".


Currently, both M_NOWAIT and M_WAITOK can drive the cache/free memory 
down to two pages.  The effect of the patch is to stop M_NOWAIT at two 
pages rather than allowing it to continue to zero pages.


When you say, "This is out of scope ...", I believe that you are 
referring to changing two pages into something larger.  I agree that 
this is out of scope for the current change.


Alan



Re: Memory reserves or lack thereof

2012-11-12 Thread Alan Cox

On 11/12/2012 5:24 PM, Adrian Chadd wrote:

.. wait, so what exactly would the difference be between M_NOWAIT and M_WAITOK?


Whether or not the allocation can sleep until memory becomes available.
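
To spell out the contrast with a sketch (not from the thread):

#include <sys/param.h>
#include <sys/errno.h>
#include <sys/malloc.h>
#include <sys/systm.h>

static int
alloc_examples(void **a, void **b)
{
	/* M_WAITOK: may sleep until memory becomes available, so the
	 * result can be assumed non-NULL; it must not be used where
	 * sleeping is forbidden (e.g. holding a non-sleepable lock). */
	*a = malloc(PAGE_SIZE, M_TEMP, M_WAITOK);

	/* M_NOWAIT: returns immediately; the caller must handle NULL. */
	*b = malloc(PAGE_SIZE, M_TEMP, M_NOWAIT);
	if (*b == NULL) {
		free(*a, M_TEMP);
		return (ENOMEM);
	}
	return (0);
}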



Re: Memory reserves or lack thereof

2012-11-11 Thread Alan Cox
On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov kostik...@gmail.com wrote:

 On Fri, Nov 09, 2012 at 07:10:04PM +0000, Sears, Steven wrote:
  I have a memory subsystem design question that I'm hoping someone can
 answer.
 
  I've been looking at a machine that is completely out of memory, as in
 
   v_free_count = 0,
   v_cache_count = 0,
 
  I wondered how a machine could completely run out of memory like this,
 especially after finding a lack of interrupt storms or other pathologies
 that would tend to overcommit memory. So I started investigating.
 
  Most allocators come down to vm_page_alloc(), which has this guard:
 
    if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
        page_req = VM_ALLOC_SYSTEM;
    };
 
    if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
        (page_req == VM_ALLOC_SYSTEM &&
        cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) ||
        (page_req == VM_ALLOC_INTERRUPT &&
        cnt.v_free_count + cnt.v_cache_count > 0)) {
 
  The key observation is if VM_ALLOC_INTERRUPT is set, it will allocate
 every last page.
 
  From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare,
 perhaps only used from interrupt threads. Not so, see kmem_malloc() or
 uma_small_alloc() which both contain this mapping:
 
    if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT)
        pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
    else
        pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;
 
  Note that M_USE_RESERVE has been deprecated and is used in just a
 handful of places. Also note that lots of code paths come through these
 routines.
 
  What this means is essentially _any_ allocation using M_NOWAIT will
 bypass whatever reserves have been held back and will take every last page
 available.
 
  There is no documentation stating M_NOWAIT has this side effect of
 essentially being privileged, so any innocuous piece of code that can't
 block will use it. And of course M_NOWAIT is literally used all over.
 
  It looks to me like the design goal of the BSD allocators is on
 recovery; it will give all pages away knowing it can recover.
 
  Am I missing anything? I would have expected some small number of pages
 to be held in reserve just in case. And I didn't expect M_NOWAIT to be a
 sort of back door for grabbing memory.
 

 Your analysis is right, there is nothing to add or correct.
 This is the reason to strongly prefer M_WAITOK.


Agreed.  Once upon a time, before SMPng, M_NOWAIT was rarely used.  It was
well understood that it should only be used by interrupt handlers.

The trouble is that M_NOWAIT conflates two orthogonal things.  The obvious
being that the allocation shouldn't sleep.  The other being how far we're
willing to deplete the cache/free page queues.

When fine-grained locking got sprinkled throughout the kernel, we all too
often found ourselves wanting to do allocations without the possibility of
blocking.  So, M_NOWAIT became commonplace, where it wasn't before.

This had the unintended consequence of introducing a lot of memory
allocations in the top-half of the kernel, i.e., non-interrupt handling
code, that were digging deep into the cache/free page queues.

Also, ironically, in today's kernel an M_NOWAIT | M_USE_RESERVE
allocation is less likely to succeed than an M_NOWAIT allocation.
However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it
could only allocate a free page.  M_USE_RESERVE said that it was ok to
allocate a cached page even though M_NOWAIT was specified.  Consequently,
the system wouldn't dig as far into the free page queue if M_USE_RESERVE
was specified, because it was allowed to reclaim a cached page.

In conclusion, I think it's time that we change M_NOWAIT so that it doesn't
dig any deeper into the cache/free page queues than M_WAITOK does and
reintroduce a M_USE_RESERVE-like flag that says "dig deep into the
cache/free page queues".  The trouble is that we then need to identify all
of those places that are implicitly depending on the current behavior of
M_NOWAIT also digging deep into the cache/free page queues so that we can
add an explicit M_USE_RESERVE.

Alan

P.S. I suspect that we should also increase the size of the page reserve
that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*().  How
many legitimate users of a new M_USE_RESERVE-like flag in today's kernel
could actually be satisfied by two pages?
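
A sketch (an editor's rendering, not a committed diff) of what the proposed
flip would look like at the kmem_malloc()/uma_small_alloc() mapping quoted
above: only an explicit M_USE_RESERVE gets the interrupt-level reserve, and
a bare M_NOWAIT becomes merely the non-sleeping version of M_WAITOK.

	if ((flags & M_USE_RESERVE) != 0)
		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
	else
		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;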


Re: Threaded 6.4 code compiled under 9.0 uses a lot more memory?..

2012-11-03 Thread Alan Cox
On Wed, Oct 31, 2012 at 2:06 PM, Konstantin Belousov kostik...@gmail.com wrote:

 On Wed, Oct 31, 2012 at 11:52:06AM -0700, Adrian Chadd wrote:
  On 31 October 2012 11:20, Ian Lepore free...@damnhippie.dyndns.org
 wrote:
   I think there are some things we should be investigating about the
   growth of memory usage.  I just noticed this:
  
   Freebsd 6.2 on an arm processor:
  
 369 root 1   8  -88  1752K   748K nanslp   3:00  0.00% watchdogd
  
   Freebsd 10.0 on the same system:
  
 367 root 1 -52   r0 10232K 10160K nanslp  10:04  0.00% watchdogd
  
   The 10.0 system is built with MALLOC_PRODUCTION (without that defined
   the system won't even boot, it only has 64MB of ram).  That's a crazy
   amount of growth for a relatively simple daemon.
 
  Would you please, _please_ do some digging into this?
 
  It's quite possible there's something in the libraries that are
  allocating some memory upon first call invocation - yes, that's
  jemalloc, but it could also be other things like stdio.
 
  We really, really need to fix this userland bloat; it's terribly
  ridiculous at this point. There's no reason a watchdog daemon should
  take 10megabytes of RAM.
 Watchdogd was recently changed to mlock its memory. This is the cause
 of the RSS increase.
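
For context (an illustration; the actual watchdogd code may differ), the
kind of call in question, and why it inflates RSS: mlockall() wires every
page mapped now or in the future, including all shared library text, so
the resident set grows to the full mapped size.

#include <sys/mman.h>
#include <err.h>

int
main(void)
{
	/* Wire everything so the watchdog can never stall on a page-in;
	 * the cost is that RSS now covers every mapped page. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE) == -1)
		err(1, "mlockall");

	/* ... daemon main loop would run here ... */
	return (0);
}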


Is it also statically linked?

Alan


Re: contigmalloc() breaking Xorg

2012-07-13 Thread Alan Cox

On 07/12/2012 07:26, John Baldwin wrote:

[ Adding alc@ for VM stuff, Warner for arm/mips bus dma brokenness ]


When the code underlying contigmalloc() fails in its initial attempt to 
allocate memory and proceeds to launder and reclaim pages, it should 
almost certainly do as the page daemon does and invoke the vm_lowmem 
handlers.  In particular, this should coax the ZFS ARC into releasing 
some of its hoard of wired memory.  Try this:


Index: vm/vm_contig.c
===
--- vm/vm_contig.c  (revision 238372)
+++ vm/vm_contig.c  (working copy)
@@ -192,6 +192,18 @@ vm_contig_grow_cache(int tries, vm_paddr_t low, vm
 {
int actl, actmax, inactl, inactmax;

+	if (tries > 0) {
+   /*
+* Decrease registered cache sizes.
+*/
+   EVENTHANDLER_INVOKE(vm_lowmem, 0);
+
+   /*
+* We do this explicitly after the caches have been drained
+* above.
+*/
+   uma_reclaim();
+   }
vm_page_lock_queues();
inactl = 0;
	inactmax = tries < 1 ? 0 : cnt.v_inactive_count;
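
The reason invoking vm_lowmem here helps is that large wired-memory
consumers register handlers on that event (the ZFS ARC hooks arc_lowmem up
in essentially this way).  A sketch of such a consumer, with a hypothetical
cache and a made-up mycache_shrink():

#include <sys/param.h>
#include <sys/eventhandler.h>
#include <sys/kernel.h>

static eventhandler_tag mycache_lowmem_tag;

static void
mycache_shrink(void)
{
	/* Hypothetical: release cached, wired pages back to the system. */
}

static void
mycache_lowmem(void *arg __unused, int howto __unused)
{
	mycache_shrink();
}

static void
mycache_init(void *arg __unused)
{
	mycache_lowmem_tag = EVENTHANDLER_REGISTER(vm_lowmem,
	    mycache_lowmem, NULL, EVENTHANDLER_PRI_FIRST);
}
SYSINIT(mycache, SI_SUB_DRIVERS, SI_ORDER_ANY, mycache_init, NULL);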




Re: Rtld object tasting [Was: Re: wired memory - again!]

2012-06-13 Thread Alan Cox
On Wed, Jun 13, 2012 at 2:12 PM, Konstantin Belousov kostik...@gmail.com wrote:

 On Wed, Jun 13, 2012 at 07:14:09AM -0600, Ian Lepore wrote:
  http://lists.freebsd.org/pipermail/freebsd-arm/2012-January/003288.html

 The map_object.c patch is a step in almost the right direction; I have
 wanted to remove the static page-sized buffer from get_elf_header for a
 long time. It works because rtld always holds bind_lock exclusively while
 loading an object. There is no need to copy the first page after it is
 mapped.

 commit 0f6f8629af1345acded7c0c685d3ff7b4d9180d6
 Author: Konstantin Belousov k...@freebsd.org
 Date:   Wed Jun 13 22:04:18 2012 +0300

    Eliminate the static buffer used to read the first page of the mapped
    object, and eliminate the pread(2) call as well. Mmap the first page
    of the object temporarily, and unmap it on error or last use.

    Fix several cases where the whole mapping of the object leaked on error.

    Potentially, this leaves a one-page gap between succeeding dlopen(3)
    calls, but there are other mmap(2) consumers as well.


I suggest adding MAP_PREFAULT_READ to the mmap(2) call.  A heuristic in
vm_map_pmap_enter() triggers automatic prefaulting for small files, but if
the object file is larger than 96 pages then you need to specify
MAP_PREFAULT_READ explicitly.
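
A userland illustration (not rtld's actual code) of requesting the prefault
explicitly for a large object file:

#include <sys/mman.h>
#include <sys/stat.h>
#include <err.h>
#include <fcntl.h>

void *
map_object(const char *path)
{
	struct stat sb;
	void *base;
	int fd;

	if ((fd = open(path, O_RDONLY)) == -1)
		err(1, "open");
	if (fstat(fd, &sb) == -1)
		err(1, "fstat");
	/* MAP_PREFAULT_READ asks the kernel to enter the file's resident
	 * pages up front even when the file exceeds the 96-page heuristic. */
	base = mmap(NULL, sb.st_size, PROT_READ,
	    MAP_PRIVATE | MAP_PREFAULT_READ, fd, 0);
	if (base == MAP_FAILED)
		err(1, "mmap");
	return (base);
}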

Alan


Re: superpages and kmem on amd64

2012-05-20 Thread Alan Cox
On Sun, May 20, 2012 at 2:01 AM, Marko Zec z...@fer.hr wrote:

 Hi all,

 I'm playing with an algorithm which makes use of large contiguous blocks of
 kernel memory (ranging from 1M to 1G in size), so it would be nice if those
 could be somehow forcibly mapped to superpages.  I was hoping that the VM
 system would automagically map (merge) contiguous 4k pages to superpages,
 but
 apparently it doesn't:

 vm.pmap.pdpe.demotions: 2
 vm.pmap.pde.promotions: 543
 vm.pmap.pde.p_failures: 266253
 vm.pmap.pde.mappings: 0
 vm.pmap.pde.demotions: 31


No, your conclusion is incorrect.  These counts show that 543 superpage
mappings were created by promotion.

Alan


Re: superpages and kmem on amd64

2012-05-20 Thread Alan Cox

On 05/20/2012 09:43, Marko Zec wrote:

On Sunday 20 May 2012 09:25:59 Alan Cox wrote:

On Sun, May 20, 2012 at 2:01 AM, Marko Zec z...@fer.hr wrote:

Hi all,

I'm playing with an algorithm which makes use of large contiguous blocks
of kernel memory (ranging from 1M to 1G in size), so it would be nice if
those could be somehow forcibly mapped to superpages.  I was hoping that
the VM system would automagically map (merge) contiguous 4k pages to
superpages, but
apparently it doesn't:

vm.pmap.pdpe.demotions: 2
vm.pmap.pde.promotions: 543
vm.pmap.pde.p_failures: 266253
vm.pmap.pde.mappings: 0
vm.pmap.pde.demotions: 31

No, your conclusion is incorrect.  These counts show that 543 superpage
mappings were created by promotion.

OK, that sounds promising.  Does the "created by promotion" count reflect
historic / cumulative stats, or is vm.pmap.pde.promotions the actual number
of superpages active?  Or should we subtract vm.pmap.pde.demotions from it
to get the current value?


The count is cumulative.  There is no instantaneous count.  Subtracting 
demotions from promotions plus mappings is not a reliable way to get the 
instantaneous total because a superpage mapping can be destroyed without 
first being demoted.



In any case, I wish to be certain that a particular kmem virtual address range
is mapped to superpages - how can I enforce that at malloc time, and / or
find out later if I really got my kmem mapped to superpages?  Perhaps
vm_map_lookup() could provide more info, but I'm wondering if someone already
wrote a wrapper function for that, which takes only the base virtual address
as a single argument?


Try using pmap_mincore() to verify that the mappings are superpages.


BTW, apparently malloc(size, M_TEMP, M_NOWAIT) requests fail for size > 1G,
even at boot time.  Any ideas how to circumvent that (8.3-STABLE, amd64, 4G
physical RAM)?


I suspect that you need to increase the size of your kmem map.

Alan



Re: superpages and kmem on amd64

2012-05-20 Thread Alan Cox

On 05/20/2012 17:48, Marko Zec wrote:

On Sunday 20 May 2012 19:34:26 Alan Cox wrote:
...

In any case, I wish to be certain that a particular kmem virtual address
range is mapped to superpages - how can I enforce that at malloc time,
and / or find out later if I really got my kmem mapped to superpages?
Perhaps vm_map_lookup() could provide more info, but I'm wondering if
someone already wrote a wrapper function for that, which takes only the
base virtual address as a single argument?

Try using pmap_mincore() to verify that the mappings are superpages.

flags = pmap_mincore(vmspace_pmap(curthread->td_proc->p_vmspace),
    (vm_offset_t)addr);

OK, that works, and now I know my kmem chunk is on a superpage, hooray!!!
Thanks!
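
(A note not from the thread: pmap_mincore() returns the same MINCORE_* bits
that mincore(2) reports, from sys/mman.h, so the superpage test on the
flags variable above is simply:)

	if (flags & MINCORE_SUPER)
		printf("backed by a superpage mapping\n");
	else if (flags & MINCORE_INCORE)
		printf("resident, but mapped with 4K pages\n");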


BTW, apparently malloc(size, M_TEMP, M_NOWAIT) requests fail for size >
1G, even at boot time.  Any ideas how to circumvent that (8.3-STABLE,
amd64, 4G physical RAM)?

I suspect that you need to increase the size of your kmem map.

Huh, any hints on how I should achieve that?  In desperation I placed

vm.kmem_size=8G

in /boot/loader.conf and got this:

vm.kmem_map_free: 8123924480
vm.kmem_map_size: 8364032
vm.kmem_size_scale: 1
vm.kmem_size_max: 329853485875
vm.kmem_size_min: 0
vm.kmem_size: 8132288512

but malloc(2G) still fails...


Here is at least one reason why it fails:

void *
uma_large_malloc(int size, int wait)

Note the type of size.  Can you malloc 1GB?
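
The point (an illustration): uma_large_malloc() takes the size as an int,
so a 2G request wraps to a negative value before the allocator ever sees
it, while 1G still fits.

#include <stdio.h>

int
main(void)
{
	int one_g = 1024 * 1024 * 1024;	/* 2^30 fits in a 32-bit int */
	long two_g = 2147483648L;	/* 2^31 does not ... */
	int truncated = (int)two_g;	/* ... it wraps negative on amd64 */

	printf("%d %ld %d\n", one_g, two_g, truncated);
	/* prints: 1073741824 2147483648 -2147483648 */
	return (0);
}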




Re: problems with mmap() and disk caching

2012-04-29 Thread Alan Cox

On 04/11/2012 01:07, Andrey Zonov wrote:

On 10.04.2012 20:19, Alan Cox wrote:

On 04/09/2012 10:26, John Baldwin wrote:

On Thursday, April 05, 2012 11:54:31 am Alan Cox wrote:

On 04/04/2012 02:17, Konstantin Belousov wrote:

On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote:

Hi,

I open the file, then call mmap() on the whole file and get a pointer,
then I work with this pointer. I expect that a page should be touched only
once to get it into the memory (disk cache?), but this doesn't work!

I wrote the test (attached) and ran it for the 1G file generated from
/dev/random; the result is the following:

Prepare file:
# swapoff -a
# newfs /dev/ada0b
# mount /dev/ada0b /mnt
# dd if=/dev/random of=/mnt/random-1024 bs=1m count=1024

Purge cache:
# umount /mnt
# mount /dev/ada0b /mnt

Run test:
$ ./mmap /mnt/random-1024 30
mmap: 1 pass took: 7.431046 (none: 262112; res: 32; super: 0; other: 0)
mmap: 2 pass took: 7.356670 (none: 261648; res: 496; super: 0; other: 0)
mmap: 3 pass took: 7.307094 (none: 260521; res: 1623; super: 0; other: 0)
mmap: 4 pass took: 7.350239 (none: 258904; res: 3240; super: 0; other: 0)
mmap: 5 pass took: 7.392480 (none: 257286; res: 4858; super: 0; other: 0)
mmap: 6 pass took: 7.292069 (none: 255584; res: 6560; super: 0; other: 0)
mmap: 7 pass took: 7.048980 (none: 251142; res: 11002; super: 0; other: 0)
mmap: 8 pass took: 6.899387 (none: 247584; res: 14560; super: 0; other: 0)
mmap: 9 pass took: 7.190579 (none: 242992; res: 19152; super: 0; other: 0)
mmap: 10 pass took: 6.915482 (none: 239308; res: 22836; super: 0; other: 0)
mmap: 11 pass took: 6.565909 (none: 232835; res: 29309; super: 0; other: 0)
mmap: 12 pass took: 6.423945 (none: 226160; res: 35984; super: 0; other: 0)
mmap: 13 pass took: 6.315385 (none: 208555; res: 53589; super: 0; other: 0)
mmap: 14 pass took: 6.760780 (none: 192805; res: 69339; super: 0; other: 0)
mmap: 15 pass took: 5.721513 (none: 174497; res: 87647; super: 0; other: 0)
mmap: 16 pass took: 5.004424 (none: 155938; res: 106206; super: 0; other: 0)
mmap: 17 pass took: 4.224926 (none: 135639; res: 126505; super: 0; other: 0)
mmap: 18 pass took: 3.749608 (none: 117952; res: 144192; super: 0; other: 0)
mmap: 19 pass took: 3.398084 (none: 99066; res: 163078; super: 0; other: 0)
mmap: 20 pass took: 3.029557 (none: 74994; res: 187150; super: 0; other: 0)
mmap: 21 pass took: 2.379430 (none: 55231; res: 206913; super: 0; other: 0)
mmap: 22 pass took: 2.046521 (none: 40786; res: 221358; super: 0; other: 0)
mmap: 23 pass took: 1.152797 (none: 30311; res: 231833; super: 0; other: 0)
mmap: 24 pass took: 0.972617 (none: 16196; res: 245948; super: 0; other: 0)
mmap: 25 pass took: 0.577515 (none: 8286; res: 253858; super: 0; other: 0)
mmap: 26 pass took: 0.380738 (none: 3712; res: 258432; super: 0; other: 0)
mmap: 27 pass took: 0.253583 (none: 1193; res: 260951; super: 0; other: 0)
mmap: 28 pass took: 0.157508 (none: 0; res: 262144; super: 0; other: 0)
mmap: 29 pass took: 0.156169 (none: 0; res: 262144; super: 0; other: 0)
mmap: 30 pass took: 0.156550 (none: 0; res: 262144; super: 0; other: 0)

If I ran this:
$ cat /mnt/random-1024 /dev/null
before test, when result is the following:

$ ./mmap /mnt/random-1024 5
mmap: 1 pass took: 0.337657 (none: 0; res: 262144; super: 0; other: 0)
mmap: 2 pass took: 0.186137 (none: 0; res: 262144; super: 0; other: 0)
mmap: 3 pass took: 0.186132 (none: 0; res: 262144; super: 0; other: 0)
mmap: 4 pass took: 0.186535 (none: 0; res: 262144; super: 0; other: 0)
mmap: 5 pass took: 0.190353 (none: 0; res: 262144; super: 0; other: 0)

This is what I expect. But why doesn't this work without reading the file
manually?

Issue seems to be in some change of the behaviour of the reserv or
phys allocator. I Cc:ed Alan.

I'm pretty sure that the behavior here hasn't significantly changed in
about twelve years. Otherwise, I agree with your analysis.

On more than one occasion, I've been tempted to change:

pmap_remove_all(mt);
if (mt->dirty != 0)
	vm_page_deactivate(mt);
else
	vm_page_cache(mt);

to:

vm_page_dontneed(mt);

because I suspect that the current code does more harm than good. In
theory, it saves activations of the page daemon. However, more often
than not, I suspect that we are spending more on page reactivations than
we are saving on page daemon activations. The sequential access
detection heuristic is just too easily triggered. For example, I've
seen it triggered by demand paging of the gcc text segment. Also, I
think that pmap_remove_all() and especially vm_page_cache() are too
severe for a detection heuristic that is so easily triggered.

Are you planning to commit this?



Not yet. I did some tests with a file that was several times larger than
DRAM, and I didn't like what I saw. Initially, everything behaved as
expected, but about halfway through the test the bulk of the pages were
active. Despite the call to pmap_clear_reference() in
vm_page_dontneed(), the page daemon is finding the pages to be
referenced and reactivating

Re: mlockall() on freebsd 7.2 + amd64 returns EAGAIN

2012-04-21 Thread Alan Cox

On 04/13/2012 16:45, Konstantin Belousov wrote:

On Fri, Apr 13, 2012 at 11:37:44AM -0700, Sushanth Rai wrote:

I've attached the simple program that creates 5 threads. Following is the
output of /proc/pid/map while this program is running. Note that I modified
sys/fs/procfs/procfs_map.c to print whether a region is wired. As you can
see from this output, none of the stack areas get wired.

0x40 0x401000 1 0 0xff002d943bd0 r-x 1 0 0x1000 COW NC wired vnode 
/var/tmp/thread1
0x50 0x501000 1 0 0xff002dd13e58 rw- 2 0 0x3100 NCOW NNC wired default -
0x501000 0x60 255 0 0xff002dd13e58 rwx 2 0 0x3100 NCOW NNC wired 
default -
0x80050 0x800526000 38 0 0xff0025574000 r-x 192 46 0x1004 COW NC wired 
vnode /libexec/ld-elf.so.1
0x800526000 0x800537000 17 0 0xff002d9f81b0 rw- 1 0 0x3100 NCOW NNC wired 
default -
0x800626000 0x80062d000 7 0 0xff002dd13bd0 rw- 1 0 0x3100 COW NNC wired 
vnode /libexec/ld-elf.so.1
0x80062d000 0x800633000 6 0 0xff002dd145e8 rw- 1 0 0x3100 NCOW NNC wired 
default -
0x800633000 0x800645000 18 0 0xff00256d71b0 r-x 63 42 0x4 COW NC wired 
vnode /lib/libthr.so.3
0x800645000 0x800646000 1 0 0xff002d975510 r-x 1 0 0x3100 COW NNC wired 
vnode /lib/libthr.so.3
0x800646000 0x800746000 0 0 0xff002dc5cca8 --- 4 0 0x3100 NCOW NNC 
not-wired default -
0x800746000 0x80074a000 4 0 0xff002572a288 rw- 1 0 0x3100 COW NNC wired 
vnode /lib/libthr.so.3
0x80074a000 0x80074c000 2 0 0xff002dc5cca8 rw- 4 0 0x3100 NCOW NNC wired 
default -
0x80074c000 0x80083e000 242 0 0xff001cd226c0 r-x 238 92 0x1004 COW NC wired 
vnode /lib/libc.so.7
0x80083e000 0x80083f000 1 0 0xff002dd12000 r-x 1 0 0x3100 COW NNC wired 
vnode /lib/libc.so.7
0x80083f000 0x80093e000 0 0 0xff002dc5cca8 --- 4 0 0x3100 NCOW NNC 
not-wired default -
0x80093e000 0x80095d000 31 0 0xff002dddc360 rw- 1 0 0x3100 COW NNC wired 
vnode /lib/libc.so.7
0x80095d000 0x800974000 23 0 0xff002dc5cca8 rw- 4 0 0x3100 NCOW NNC wired 
default -
0x800a0 0x800b0 256 0 0xff002dbd1798 rw- 1 0 0x3100 NCOW NNC wired 
default -
0x800b0 0x800c0 256 0 0xff002dd14948 rw- 1 0 0x3100 NCOW NNC wired 
default -
0x7f3db000 0x7f3fb000 1 0 0xff002dbb4360 rw- 1 0 0x3100 NCOW NNC 
not-wired default -
0x7f5dc000 0x7f5fc000 1 0 0xff002dc66af8 rw- 1 0 0x3100 NCOW NNC 
not-wired default -
0x7f7dd000 0x7f7fd000 1 0 0xff002dbea438 rw- 1 0 0x3100 NCOW NNC 
not-wired default -
0x7f9de000 0x7f9fe000 1 0 0xff002dd7fd80 rw- 1 0 0x3100 NCOW NNC 
not-wired default -
0x7fbdf000 0x7fbff000 1 0 0xff002dbe9438 rw- 1 0 0x3100 NCOW NNC 
not-wired default -
0x7fbff000 0x7fc0 0 0 0 --- 0 0 0x0 NCOW NNC not-wired none -
0x7ffe 0x8000 32 0 0xff002dd125e8 rwx 1 0 0x3100 NCOW NNC 
wired default -

--- On Fri, 4/13/12, Konstantin Belousovkostik...@gmail.com  wrote:


From: Konstantin Belousovkostik...@gmail.com
Subject: Re: mlockall() on freebsd 7.2 + amd64 returns EAGAIN
To: Sushanth Raisushanth_...@yahoo.com
Cc: freebsd-hackers@freebsd.org
Date: Friday, April 13, 2012, 1:11 AM
On Thu, Apr 12, 2012 at 08:10:26PM
-0700, Sushanth Rai wrote:

Then it should be fixed in r190885.


Thanks. That works like a charm.

mlockall() mostly works now. There is still an issue in
wiring the stacks of a multithreaded program when the program
uses the default stack allocation scheme. The thread library
allocates a stack for each thread by calling mmap(), passing the
address and size to be mapped. The kernel adjusts
the start address by sgrowsiz in vm_map_stack() and
maps at the adjusted address. But the subsequent wiring is
done using the original address, which fails.

Oh, I see. The problem is the VM_MAP_WIRE_NOHOLES flag. Since we
map only the initial stack fragment even for the MCL_WIREFUTURE maps,
there is a hole in the stack region.

In fact, for MCL_WIREFUTURE, we probably should map the whole
stack at once, prefaulting all pages.

Below are two patches. The change for vm_mmap.c would fix your immediate
problem by allowing holes in wired region.

The change for vm_map.c prefaults the whole stack instead of the
initial fragment. The single-threaded programs still get a fault
on stack growth.


The vm_mmap.c change looks ok to me.  Please commit it.  I haven't yet 
had a chance to think about the other change.


Alan


diff --git a/sys/vm/vm_map.c b/sys/vm/vm_map.c
index 6198629..2fd18d1 100644
--- a/sys/vm/vm_map.c
+++ b/sys/vm/vm_map.c
@@ -3259,7 +3259,10 @@ vm_map_stack(vm_map_t map, vm_offset_t addrbos, vm_size_t max_ssize,
 	    addrbos + max_ssize < addrbos)
 		return (KERN_NO_SPACE);
 
-	init_ssize = (max_ssize < sgrowsiz) ? max_ssize : sgrowsiz;
+	if (map->flags & MAP_WIREFUTURE)
+		init_ssize = max_ssize;
+	else
+		init_ssize = (max_ssize < sgrowsiz) ? max_ssize : sgrowsiz;
 
 	PROC_LOCK(curthread->td_proc);
 	vmemlim = lim_cur(curthread->td_proc, RLIMIT_VMEM);
diff --git 

Re: Corrupted pmap pm_vlist -> pmap_remove_pte()

2012-04-17 Thread Alan Cox

On 4/17/2012 4:48 AM, Konstantin Belousov wrote:

On Mon, Apr 16, 2012 at 03:08:25PM -0400, Ewart Tempest wrote:

In FreeBSD 6.*, we have been seeing crashes in pmap_remove_pages() that only 
seem to occur in scaling scenarios:

2564 #ifdef PMAP_REMOVE_PAGES_CURPROC_ONLY
2565         pte = vtopte(pv->pv_va);
2566 #else
2567         pte = pmap_pte(pmap, pv->pv_va);
2568 #endif
2569         tpte = *pte;    <== page fault here

The suspicion is that the pmap's pm_pvlist list is getting corrupted. To this 
end, I have a question on the following logic in pmap_remove_pte() (see in-line 
comment):

1533 static int
1534 pmap_remove_pte(pmap_t pmap, pt_entry_t *ptq, vm_offset_t va,
     pd_entry_t ptepde)
1535 {
1536         pt_entry_t oldpte;
1537         vm_page_t m;
1538
1539         PMAP_LOCK_ASSERT(pmap, MA_OWNED);
1540         oldpte = pte_load_clear(ptq);
1541         if (oldpte & PG_W)
1542                 pmap->pm_stats.wired_count -= 1;
1543         /*
1544          * Machines that don't support invlpg, also don't support
1545          * PG_G.
1546          */
1547         if (oldpte & PG_G)
1548                 pmap_invalidate_page(kernel_pmap, va);
1549         pmap->pm_stats.resident_count -= 1;
1550         if (oldpte & PG_MANAGED) {
1551                 m = PHYS_TO_VM_PAGE(oldpte & PG_FRAME);
1552                 if (oldpte & PG_M) {
1553 #if defined(PMAP_DIAGNOSTIC)
1554                         if (pmap_nw_modified((pt_entry_t) oldpte)) {
1555                                 printf(
1556         "pmap_remove: modified page not writable: va: 0x%lx, pte: 0x%lx\n",
1557                                     va, oldpte);
1558                         }
1559 #endif
1560                         if (pmap_track_modified(va))
1561                                 vm_page_dirty(m);
1562                 }
1563                 if (oldpte & PG_A)
1564                         vm_page_flag_set(m, PG_REFERENCED);
1565                 pmap_remove_entry(pmap, m, va);
1566         }
1567         return (pmap_unuse_pt(pmap, va, ptepde)); <=== *** under what
circumstances is it valid to free the page but not remove it from the pmap's
pm_pvlist? Even the code comment for pmap_unuse_pt() commences "After removing
a page table entry ...". ***

It is valid to not remove pv_entry when no pv_entry exists for the mapping.
The pv_entry is created if the page is managed, see pmap_enter() code.
The block above the return is executed when the page is managed, or at
least pmap thinks so.

The HEAD code will panic in pmap_pvh_free() if pmap_pvh_remove() cannot
find the pv entry for the given page and the given pmap/va.


1568 }

If the tail end of the above function is changed as follows:

1565                 pmap_remove_entry(pmap, m, va);
1565.5               return (pmap_unuse_pt(pmap, va, ptepde));
1566         }
1567         return (0);

Then we don't see any crashes ... but is it the right thing to do?

It should not be. Try to test this with some unmanaged mapping, like
/dev/mem pages mapped into the exiting process address space.
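
A sketch of such a test, under the stated assumptions that /dev/mem is
readable and that physical offset 0 is mappable on the machine at hand:

/*
 * Create an unmanaged mapping (device memory gets no pv_entry), then
 * exit, so address-space teardown walks the mapping in the pmap.
 */
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	volatile char *p;
	int fd;

	fd = open("/dev/mem", O_RDONLY);
	if (fd == -1) {
		perror("open(/dev/mem)");
		exit(1);
	}
	p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
	(void)p[0];	/* fault the unmanaged page in */
	return (0);	/* exiting tears the mapping down */
}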

I am too new to know about any nuances of the RELENG_6 code.


The RELENG_6 code is doing essentially the same things as newer 
versions.   Crashes in this specific place are usually caused by DRAM 
errors.


Alan

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: problems with mmap() and disk caching

2012-04-10 Thread Alan Cox

On 04/09/2012 10:26, John Baldwin wrote:

On Thursday, April 05, 2012 11:54:31 am Alan Cox wrote:

On 04/04/2012 02:17, Konstantin Belousov wrote:

On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote:

Hi,

I open the file, then call mmap() on the whole file and get a pointer,
then I work with this pointer.  I expect that a page should be touched
only once to get it into memory (disk cache?), but this doesn't work!

I wrote the test (attached) and ran it for the 1G file generated from
/dev/random, the result is the following:

Prepare file:
# swapoff -a
# newfs /dev/ada0b
# mount /dev/ada0b /mnt
# dd if=/dev/random of=/mnt/random-1024 bs=1m count=1024

Purge cache:
# umount /mnt
# mount /dev/ada0b /mnt

Run test:
$ ./mmap /mnt/random-1024 30
mmap:  1 pass took:   7.431046 (none: 262112; res: 32; super:
0; other:  0)
mmap:  2 pass took:   7.356670 (none: 261648; res:496; super:
0; other:  0)
mmap:  3 pass took:   7.307094 (none: 260521; res:   1623; super:
0; other:  0)
mmap:  4 pass took:   7.350239 (none: 258904; res:   3240; super:
0; other:  0)
mmap:  5 pass took:   7.392480 (none: 257286; res:   4858; super:
0; other:  0)
mmap:  6 pass took:   7.292069 (none: 255584; res:   6560; super:
0; other:  0)
mmap:  7 pass took:   7.048980 (none: 251142; res:  11002; super:
0; other:  0)
mmap:  8 pass took:   6.899387 (none: 247584; res:  14560; super:
0; other:  0)
mmap:  9 pass took:   7.190579 (none: 242992; res:  19152; super:
0; other:  0)
mmap: 10 pass took:   6.915482 (none: 239308; res:  22836; super:
0; other:  0)
mmap: 11 pass took:   6.565909 (none: 232835; res:  29309; super:
0; other:  0)
mmap: 12 pass took:   6.423945 (none: 226160; res:  35984; super:
0; other:  0)
mmap: 13 pass took:   6.315385 (none: 208555; res:  53589; super:
0; other:  0)
mmap: 14 pass took:   6.760780 (none: 192805; res:  69339; super:
0; other:  0)
mmap: 15 pass took:   5.721513 (none: 174497; res:  87647; super:
0; other:  0)
mmap: 16 pass took:   5.004424 (none: 155938; res: 106206; super:
0; other:  0)
mmap: 17 pass took:   4.224926 (none: 135639; res: 126505; super:
0; other:  0)
mmap: 18 pass took:   3.749608 (none: 117952; res: 144192; super:
0; other:  0)
mmap: 19 pass took:   3.398084 (none:  99066; res: 163078; super:
0; other:  0)
mmap: 20 pass took:   3.029557 (none:  74994; res: 187150; super:
0; other:  0)
mmap: 21 pass took:   2.379430 (none:  55231; res: 206913; super:
0; other:  0)
mmap: 22 pass took:   2.046521 (none:  40786; res: 221358; super:
0; other:  0)
mmap: 23 pass took:   1.152797 (none:  30311; res: 231833; super:
0; other:  0)
mmap: 24 pass took:   0.972617 (none:  16196; res: 245948; super:
0; other:  0)
mmap: 25 pass took:   0.577515 (none:   8286; res: 253858; super:
0; other:  0)
mmap: 26 pass took:   0.380738 (none:   3712; res: 258432; super:
0; other:  0)
mmap: 27 pass took:   0.253583 (none:   1193; res: 260951; super:
0; other:  0)
mmap: 28 pass took:   0.157508 (none:  0; res: 262144; super:
0; other:  0)
mmap: 29 pass took:   0.156169 (none:  0; res: 262144; super:
0; other:  0)
mmap: 30 pass took:   0.156550 (none:  0; res: 262144; super:
0; other:  0)

If I ran this:
$ cat /mnt/random-1024 > /dev/null
before test, when result is the following:

$ ./mmap /mnt/random-1024 5
mmap:  1 pass took:   0.337657 (none:  0; res: 262144; super:
0; other:  0)
mmap:  2 pass took:   0.186137 (none:  0; res: 262144; super:
0; other:  0)
mmap:  3 pass took:   0.186132 (none:  0; res: 262144; super:
0; other:  0)
mmap:  4 pass took:   0.186535 (none:  0; res: 262144; super:
0; other:  0)
mmap:  5 pass took:   0.190353 (none:  0; res: 262144; super:
0; other:  0)

This is what I expect.  But why doesn't this work without reading the file
manually?

Issue seems to be in some change of the behaviour of the reserv or
phys allocator. I Cc:ed Alan.

I'm pretty sure that the behavior here hasn't significantly changed in
about twelve years.  Otherwise, I agree with your analysis.

On more than one occasion, I've been tempted to change:

	pmap_remove_all(mt);
	if (mt->dirty != 0)
		vm_page_deactivate(mt);
	else
		vm_page_cache(mt);

to:

  vm_page_dontneed(mt);

because I suspect that the current code does more harm than good.  In
theory, it saves activations of the page daemon.  However, more often
than not, I suspect that we are spending more on page reactivations than
we are saving on page daemon activations.  The sequential access
detection heuristic is just too easily triggered.  For example, I've
seen it triggered by demand paging of the gcc text segment.  Also, I

Re: problems with mmap() and disk caching

2012-04-06 Thread Alan Cox

On 04/04/2012 02:17, Konstantin Belousov wrote:

On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote:

[Andrey's test description and per-pass timing output, identical to the copy
quoted in full earlier in this archive, trimmed.]

Issue seems to be in some change of the behaviour of the reserv or
phys allocator. I Cc:ed Alan.

What happens is that the fault handler deactivates or caches the pages
previous to the one which would satisfy the fault. See the if()
statement starting at line 463 of vm/vm_fault.c. Since all pages
of the object in your test are clean, the pages are cached.

The next fault then needs to allocate some more pages for a different index
of the same object. What I see is that vm_reserv_alloc_page() returns a
page that is in the cache for the same object, but a different pindex.
As an obvious result, the page is invalidated and repurposed. When the next
loop starts, the page is not resident anymore, so it has to be re-read
from disk.


I'm pretty sure that the pages aren't being repurposed this quickly.  
Instead, I believe that the explanation is to be found in mincore().  
mincore() is only reporting pages that are in the object's memq as 
resident.  It is not reporting cache pages as resident.
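
For reference, a hedged sketch of how such a residency count can be taken
with mincore(2); the actual test program is not shown in this thread, so the
names here are illustrative. A page sitting on the cache queue gets no flags
in the vector, so it counts as "none" even though its contents are still in
RAM:

#include <sys/types.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void
count_resident(void *base, size_t len)
{
	size_t i, npages, none, res;
	char *vec;

	npages = len / (size_t)getpagesize();
	vec = malloc(npages);
	if (vec == NULL || mincore(base, len, vec) == -1) {
		perror("mincore");
		free(vec);
		return;
	}
	none = res = 0;
	for (i = 0; i < npages; i++) {
		if (vec[i] & MINCORE_INCORE)
			res++;	/* on the object's memq */
		else
			none++;	/* absent, or only on the cache queue */
	}
	printf("none: %zu; res: %zu\n", none, res);
	free(vec);
}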



The behaviour of the allocator is not consistent, so some pages are not
reused, allowing the test to converge and to collect all pages of the
object eventually.

Calling madvise(MADV_RANDOM) fixes 

Re: problems with mmap() and disk caching

2012-04-06 Thread Alan Cox

On 04/06/2012 03:38, Konstantin Belousov wrote:

On Thu, Apr 05, 2012 at 01:25:49PM -0500, Alan Cox wrote:

On 04/05/2012 12:31, Konstantin Belousov wrote:

On Thu, Apr 05, 2012 at 10:54:31AM -0500, Alan Cox wrote:

On 04/04/2012 02:17, Konstantin Belousov wrote:

On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote:

[Andrey's test description and per-pass timing output, identical to the copy
quoted in full earlier in this archive, trimmed.]

Issue seems to be in some change of the behaviour of the reserv or
phys allocator. I Cc:ed Alan.

I'm pretty sure that the behavior here hasn't significantly changed in
about twelve years.  Otherwise, I agree with your analysis.

On more than one occasion, I've been tempted to change:

	pmap_remove_all(mt);
	if (mt->dirty != 0)
		vm_page_deactivate(mt);
	else
		vm_page_cache(mt);

to:

 vm_page_dontneed(mt);

because I suspect that the current code does more harm than good.  In
theory, it saves activations of the page daemon.  However, more often
than not, I suspect that we are spending more on page reactivations than
we are saving on page daemon activations.  The sequential access
detection heuristic is just

Re: problems with mmap() and disk caching

2012-04-05 Thread Alan Cox

On 04/04/2012 02:17, Konstantin Belousov wrote:

On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote:

[Andrey's test description and per-pass timing output, identical to the copy
quoted in full earlier in this archive, trimmed.]

Issue seems to be in some change of the behaviour of the reserv or
phys allocator. I Cc:ed Alan.


I'm pretty sure that the behavior here hasn't significantly changed in 
about twelve years.  Otherwise, I agree with your analysis.


On more than one occasion, I've been tempted to change:

	pmap_remove_all(mt);
	if (mt->dirty != 0)
		vm_page_deactivate(mt);
	else
		vm_page_cache(mt);

to:

vm_page_dontneed(mt);

because I suspect that the current code does more harm than good.  In 
theory, it saves activations of the page daemon.  However, more often 
than not, I suspect that we are spending more on page reactivations than 
we are saving on page daemon activations.  The sequential access 
detection heuristic is just too easily triggered.  For example, I've 
seen it triggered by demand paging of the gcc text segment.  Also, I 
think that pmap_remove_all() and especially vm_page_cache() are too 
severe for a detection heuristic 

Re: problems with mmap() and disk caching

2012-04-05 Thread Alan Cox

On 04/04/2012 04:36, Andrey Zonov wrote:

On 04.04.2012 11:17, Konstantin Belousov wrote:


Calling madvise(MADV_RANDOM) fixes the issue, because the code to
deactivate/cache the pages is turned off. On the other hand, it also
turns off read-ahead for faulting, and the first loop becomes eternally
long.


Now it takes 5 times longer.  Anyway, thanks for the explanation.



Indeed, doing MADV_WILLNEED does not fix the problem, since willneed
reactivates the pages of the object at the time of the call. To use
MADV_WILLNEED, you would need to call it between faults/memcpy.



I played with it, but no luck so far.



I've also never seen super pages; how do I make them work?

They just work, at least for me. Look at the output of procstat -v
after enough loops finished to not cause disk activity.



The problem was in my test program.  I fixed it, now I see super pages 
but I'm still not satisfied.  There are several tests below:


1. With madvise(MADV_RANDOM) I see almost all super pages:
$ ./mmap /mnt/random-1024 5
mmap:  1 pass took:  26.438535 (none:  0; res: 262144; super: 511; 
other:  0)
mmap:  2 pass took:   0.187311 (none:  0; res: 262144; super: 511; 
other:  0)
mmap:  3 pass took:   0.184953 (none:  0; res: 262144; super: 511; 
other:  0)
mmap:  4 pass took:   0.186007 (none:  0; res: 262144; super: 511; 
other:  0)
mmap:  5 pass took:   0.185790 (none:  0; res: 262144; super: 511; 
other:  0)


Should it be 512?



Check the starting virtual address.  It is probably not aligned on a 
superpage boundary.  Hence, a few pages at the start and end of your 
mapped region are not in a superpage.
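
That arithmetic, as a small sketch (the 2 MB size is the amd64 PDE mapping
size; the function name here is made up):

#include <stdint.h>
#include <stdio.h>

#define SUPERPAGE (2UL * 1024 * 1024)	/* amd64 PDE mapping size */

/* How much of [start, start + len) can sit inside 2 MB superpages? */
static void
superpage_coverage(uintptr_t start, size_t len)
{
	uintptr_t first = (start + SUPERPAGE - 1) & ~(SUPERPAGE - 1);
	uintptr_t last = (start + len) & ~(SUPERPAGE - 1);

	if (last > first)
		printf("at most %lu superpages; %lu bytes of head/tail in "
		    "small pages\n",
		    (unsigned long)((last - first) / SUPERPAGE),
		    (unsigned long)((first - start) + (start + len - last)));
	else
		printf("no superpage-aligned region inside the mapping\n");
}

For a 1 GB mapping whose start is not 2 MB-aligned, this yields 511
superpages, matching the count reported above.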



2. Without madvise(MADV_RANDOM):
$ ./mmap /mnt/random-1024 50
mmap:  1 pass took:   7.629745 (none: 262112; res: 32; super: 0; 
other:  0)
mmap:  2 pass took:   7.301720 (none: 261202; res:942; super: 0; 
other:  0)
mmap:  3 pass took:   7.261416 (none: 260226; res:   1918; super: 1; 
other:  0)

[skip]
mmap: 49 pass took:   0.155368 (none:  0; res: 262144; super: 323; 
other:  0)
mmap: 50 pass took:   0.155438 (none:  0; res: 262144; super: 323; 
other:  0)


Only 323 pages.

3. If I just re-run test I don't see super pages with any size of 
block.


$ ./mmap /mnt/random-1024 5 $((1<<30))
mmap:  1 pass took:   1.013939 (none:  0; res: 262144; super: 0; 
other:  0)
mmap:  2 pass took:   0.267082 (none:  0; res: 262144; super: 0; 
other:  0)
mmap:  3 pass took:   0.270711 (none:  0; res: 262144; super: 0; 
other:  0)
mmap:  4 pass took:   0.268940 (none:  0; res: 262144; super: 0; 
other:  0)
mmap:  5 pass took:   0.269634 (none:  0; res: 262144; super: 0; 
other:  0)


4. If I activate madvise(MADV_WILLNEED) in the copy loop and re-run the 
test then I see super pages only if I use a block greater than 2Mb.


$ ./mmap /mnt/random-1024 1 $((1<<21))
mmap:  1 pass took:   0.299722 (none:  0; res: 262144; super: 0; 
other:  0)

$ ./mmap /mnt/random-1024 1 $((1<<22))
mmap:  1 pass took:   0.271828 (none:  0; res: 262144; super: 170; 
other:  0)

$ ./mmap /mnt/random-1024 1 $((1<<23))
mmap:  1 pass took:   0.333188 (none:  0; res: 262144; super: 258; 
other:  0)

$ ./mmap /mnt/random-1024 1 $((1<<24))
mmap:  1 pass took:   0.339250 (none:  0; res: 262144; super: 303; 
other:  0)

$ ./mmap /mnt/random-1024 1 $((1<<25))
mmap:  1 pass took:   0.418812 (none:  0; res: 262144; super: 324; 
other:  0)

$ ./mmap /mnt/random-1024 1 $((1<<26))
mmap:  1 pass took:   0.360892 (none:  0; res: 262144; super: 335; 
other:  0)

$ ./mmap /mnt/random-1024 1 $((1<<27))
mmap:  1 pass took:   0.401122 (none:  0; res: 262144; super: 342; 
other:  0)

$ ./mmap /mnt/random-1024 1 $((1<<28))
mmap:  1 pass took:   0.478764 (none:  0; res: 262144; super: 345; 
other:  0)

$ ./mmap /mnt/random-1024 1 $((1<<29))
mmap:  1 pass took:   0.607266 (none:  0; res: 262144; super: 346; 
other:  0)

$ ./mmap /mnt/random-1024 1 $((1<<30))
mmap:  1 pass took:   0.901269 (none:  0; res: 262144; super: 347; 
other:  0)


5. If I activate madvise(MADV_WILLNEED) immediately after mmap() then 
I see some number of super pages (the number from test #2).


$ ./mmap /mnt/random-1024 5
mmap:  1 pass took:   0.178666 (none:  0; res: 262144; super: 323; 
other:  0)
mmap:  2 pass took:   0.158889 (none:  0; res: 262144; super: 323; 
other:  0)
mmap:  3 pass took:   0.157229 (none:  0; res: 262144; super: 323; 
other:  0)
mmap:  4 pass took:   0.156895 (none:  0; res: 262144; super: 323; 
other:  0)
mmap:  5 pass took:   0.162938 (none:  0; res: 262144; super: 323; 
other:  0)


6. If I read file manually before test then I don't see super pages 
with any size of block and madvise(MADV_WILLNEED) doesn't help.


$ ./mmap /mnt/random-1024 5 $((1<<30))
mmap:  1 pass took:   0.996767 (none:  0; res: 262144; super: 0; 
other:  0)
mmap:  2 pass took:   0.311129 (none:  

Re: problems with mmap() and disk caching

2012-04-05 Thread Alan Cox

On 04/05/2012 12:31, Konstantin Belousov wrote:

On Thu, Apr 05, 2012 at 10:54:31AM -0500, Alan Cox wrote:

On 04/04/2012 02:17, Konstantin Belousov wrote:

On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote:

[Andrey's test description and per-pass timing output, identical to the copy
quoted in full earlier in this archive, trimmed.]

Issue seems to be in some change of the behaviour of the reserv or
phys allocator. I Cc:ed Alan.

I'm pretty sure that the behavior here hasn't significantly changed in
about twelve years.  Otherwise, I agree with your analysis.

On more than one occasion, I've been tempted to change:

	pmap_remove_all(mt);
	if (mt->dirty != 0)
		vm_page_deactivate(mt);
	else
		vm_page_cache(mt);

to:

 vm_page_dontneed(mt);

because I suspect that the current code does more harm than good.  In
theory, it saves activations of the page daemon.  However, more often
than not, I suspect that we are spending more on page reactivations than
we are saving on page daemon activations.  The sequential access
detection heuristic is just too easily triggered.  For example, I've
seen it triggered by demand paging of the gcc text segment.  Also, I

Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash

2012-03-29 Thread Alan Cox
On Thu, Mar 29, 2012 at 11:27 AM, Mark Felder <f...@feld.me> wrote:

 On Thu, 29 Mar 2012 10:55:36 -0500, Hans Petter Selasky <hsela...@c2i.net>
 wrote:


 It almost sounds like the lost interrupt issue I've seen with USB EHCI
 devices, though disk I/O should have a retry timeout?

 What does vmstat -i output?

 --HPS



 Here's a server that has a week of uptime and is due for a crash any hour now:

 root@server:/# vmstat -i
 interrupt                          total       rate
 irq1: atkbd0                          34          0
 irq6: fdc0                             9          0
 irq15: ata1                           34          0
 irq16: em1                        778061          1
 irq17: mpt0                     19217711         31
 irq18: em0                     283674769        460
 cpu0: timer                    246571507        400
 Total                          550242125        892



Not so long ago, VMware implemented a clever scheme for reducing the
overhead of virtualized interrupts that must be delivered by at least some
(if not all) of their emulated storage controllers:

http://static.usenix.org/events/atc11/tech/techAbstracts.html#Ahmad

Perhaps there is a bad interaction between this scheme and FreeBSD's mpt
driver.

Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: mmap performance and memory use

2011-10-28 Thread Alan Cox

On 10/26/2011 06:23, Svatopluk Kraus wrote:

Hi,

well, I'm working on a new port (arm11 mpcore) and pmap_enter_object()
is what I'm debugging right now. And I did not find any way in
userland to force the kernel to call pmap_enter_object(), which makes
a SUPERPAGE mapping without promotion. I tried to call mmap() with
MAP_PREFAULT_READ without success. I tried to call madvise() with
MADV_WILLNEED without success too.



mmap() should call pmap_enter_object() if MAP_PREFAULT_READ was 
specified.  I'm surprised to hear that it's not happening for you.



To make a SUPERPAGE mapping, it's obvious that all physical pages under
the SUPERPAGE must be allocated in the vm_object. And the SUPERPAGE
mapping must be made before the first access to them, otherwise a
promotion is on the way. MAP_PREFAULT_READ does nothing with it. If
madvise() is used, vm_object_madvise() is called, but only cached pages
are allocated in advance. Of course, allocating all the physical memory
behind a virtual address space in advance is not preferred in most
situations.

For example, I want to do some computation on a 4M memory space (I know
that each byte will be accessed) and want to utilize a SUPERPAGE mapping
without promotion, to save a 4K page table (i386 machine). However,
malloc() leads to promotion, mmap() with MAP_PREFAULT_READ does nothing
so the SUPERPAGE mapping is promoted, and madvise() with MADV_WILLNEED
calls vm_object_madvise(), but because the pages are not cached (how can
they be on anonymous memory), it does not work without promotion either.

So, SUPERPAGE mapping without promotions is fine, but it can be done
only if physical memory being mapped is already allocated. Is it
really possible to force that in userland?



To force the allocation of the physical memory?  Right now, the only way 
is for your program to touch the pages.
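
For instance, a trivial sketch of the touching loop (nothing here is
FreeBSD-specific; the function name is made up):

#include <unistd.h>

/* Write one byte per small page so each physical page gets allocated. */
static void
touch_pages(volatile char *base, size_t len)
{
	size_t off, pagesz;

	pagesz = (size_t)getpagesize();
	for (off = 0; off < len; off += pagesz)
		base[off] = 1;
}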



Moreover, the SUPERPAGE mapping is made read-only at first. So, even if
I have a SUPERPAGE mapping without promotion, the mapping is demoted
after the first write, and promoted again after all underlying pages are
accessed by write. The 4K page table saving is then gone.



Yes, that is all true.  It is possible to change things so that the page 
table pages are reclaimed after a time, and not kept around 
indefinitely.  However, this is not high on my personal priority list.  
Before that, it is more likely that I will add an option to avoid the 
demotion on write, if we don't have to copy the entire superpage to do so.


Alan

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: mmap performance and memory use

2011-10-25 Thread Alan Cox

On 10/10/2011 4:28 PM, Wojciech Puchar wrote:


Notice that vm.pmap.pde.promotions increased by 31.  This means that 
31 superpage mappings were created by promotion from small page 
mappings.


thank you. i looked at .mappings as it seemed logical for me that it 
shows the total.


In contrast, vm.pmap.pde.mappings counts superpage mappings that are 
created directly and not by promotion from small page mappings.  For 
example, if a large executable, such as gcc, is resident in memory, 
the text segment will be pre-mapped using superpage mappings, 
avoiding soft fault and promotion overhead.  Similarly, mmap(..., 
MAP_PREFAULT_READ) on a large, memory resident file may pre-map the 
file using superpage mappings.


your options are not described in the mmap manpage nor in madvise 
(MAP_PREFAULT_READ).


where can i find the up-to-date manpage or description?



A few minutes ago, I merged the changes to support and document 
MAP_PREFAULT_READ into 8-STABLE.  So, now it exists in HEAD, 9.0, and 
8-STABLE.


Alan



___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: mmap performance and memory use

2011-10-12 Thread Alan Cox

On 10/11/2011 12:36, Mark Tinguely wrote:

On 10/11/2011 11:12 AM, Alan Cox wrote:

On 10/10/2011 16:28, Wojciech Puchar wrote:
is it possible to force VM subsystem to operate on superpages when 
possible - i mean swapping in 2MB chunks?




Currently, no.  For some applications, like the Sun/Oracle JVM, that 
have code to explicitly manage large pages, there could be some 
benefit in the form of reduced overhead.  So, it's on my to-do 
list, but nowhere near the top of that list.


Alan



Am I correct in remembering that super-pages have to be aligned on the 
super-page boundary and be contiguous?




Yes.  However, if you allocate (or mmap(2)) a large range of virtual 
memory, e.g., 10 MB, and the start of that range is not aligned on a 
superpage boundary, the virtual memory system can still promote the four 
2 MB sized superpages in the middle of that range.
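
(As a quick check of that count: a 10 MB range holds five 2 MB spans, but an
unaligned start consumes up to 2 MB of head plus tail, leaving four fully
aligned 2 MB regions that can be promoted.)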


If so, in the mmap(), he may want to include the 'MAP_FIXED' flag with 
an address that is on a super-page boundary. Right now, the 
VMFS_ALIGNED_SPACE that does the VA super-page alignment is only 
used for device pagers.




Yes.  More precisely, the second, third, etc. mmap(2) should duplicate 
the alignment of the first mmap(2).  In fact, this is what 
VMFS_ALIGNED_SPACE does.  It looks at the alignment of the pages already 
allocated to the file (or vm object) and attempts to duplicate that 
alignment.


Sooner or later, I will probably make VMFS_ALIGNED_SPACE the default for 
file types other than devices.


Similarly, if the allocated physical pages for the object are not 
contiguous, then MAP_PREFAULT_READ will not result in a super-page 
promotion.




As described in my earlier e-mail on this topic, in this case, I call 
these superpage mappings and not superpage promotions, because the 
virtual memory system creates a large page mapping, e.g., a 2 MB page 
table entry, from the start.  It does not create small page mappings and 
then promote them to a large page mapping.


Alan

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: mmap performance and memory use

2011-10-11 Thread Alan Cox

On 10/10/2011 16:28, Wojciech Puchar wrote:


Notice that vm.pmap.pde.promotions increased by 31.  This means that 
31 superpage mappings were created by promotion from small page 
mappings.


thank you. i looked at .mappings as it seemed logical for me that it 
shows the total.


In contrast, vm.pmap.pde.mappings counts superpage mappings that are 
created directly and not by promotion from small page mappings.  For 
example, if a large executable, such as gcc, is resident in memory, 
the text segment will be pre-mapped using superpage mappings, 
avoiding soft fault and promotion overhead.  Similarly, mmap(..., 
MAP_PREFAULT_READ) on a large, memory resident file may pre-map the 
file using superpage mappings.


your options are not described in the mmap manpage nor in madvise 
(MAP_PREFAULT_READ).


where can i find the up-to-date manpage or description?



It is documented in mmap(2) on HEAD and 9.x:

     MAP_PREFAULT_READ
             Immediately update the calling process's lowest-level
             virtual address translation structures, such as its
             page table, so that every memory resident page within
             the region is mapped for read access.  Ordinarily these
             structures are updated lazily.  The effect of this
             option is to eliminate any soft faults that would oth-
             erwise occur on the initial read accesses to the
             region.  Although this option does not preclude prot
             from including PROT_WRITE, it does not eliminate soft
             faults on the initial write accesses to the region.

I don't believe that this feature was merged into 8.x.  However, 
there is no technical reason that it can't be merged.
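
A usage sketch, for what it's worth (the path argument is a placeholder and
error handling is minimal; it assumes the file is already memory resident,
since MAP_PREFAULT_READ only pre-maps resident pages):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Map a file read-only, pre-faulting its resident pages. */
static void *
map_prefaulted(const char *path, size_t *lenp)
{
	struct stat sb;
	void *p;
	int fd;

	fd = open(path, O_RDONLY);
	if (fd == -1)
		return (NULL);
	if (fstat(fd, &sb) == -1) {
		close(fd);
		return (NULL);
	}
	p = mmap(NULL, (size_t)sb.st_size, PROT_READ,
	    MAP_SHARED | MAP_PREFAULT_READ, fd, 0);
	close(fd);
	if (p == MAP_FAILED)
		return (NULL);
	*lenp = (size_t)sb.st_size;
	return (p);
}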




is it possible to force VM subsystem to operate on superpages when 
possible - i mean swapping in 2MB chunks?




Currently, no.  For some applications, like the Sun/Oracle JVM, that 
have code to explicitly manage large pages, there could be some benefit 
in the form of reduced overhead.  So, it's on my to-do list, but 
nowhere near the top of that list.


Alan

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: mmap performance and memory use

2011-10-10 Thread Alan Cox

On 10/07/2011 12:23, Wojciech Puchar wrote:


You are correct about the page table page.  However, a superpage 
mapping consumes a single PV entry, in place of 512 or 1024 PV
entries.  This winds up saving about three physical pages worth of 
memory for every superpage mapping.

does it actually work?



Yes, the sysctl output shows that it is working.  You can also verify 
this with mincore(2).



simple test

before (on an otherwise idle system with 2GB RAM, mostly free)

vm.pmap.pde.promotions: 921
vm.pmap.pde.p_failures: 21398
vm.pmap.pde.mappings: 299
vm.pmap.pde.demotions: 596
vm.pmap.shpgperproc: 200
vm.pmap.pv_entry_max: 696561
vm.pmap.pg_ps_enabled: 1
vm.pmap.pat_works: 1


and with that program running (==sleeping)

#include <unistd.h>
int a[1<<24];
int main() {
 int b;
 for(b=0;b<(1<<24);b++) a[b]=b;
 sleep(1000);
 return 0;
}


vm.pmap.pdpe.demotions: 0
vm.pmap.pde.promotions: 952
vm.pmap.pde.p_failures: 21398
vm.pmap.pde.mappings: 299
vm.pmap.pde.demotions: 596
vm.pmap.shpgperproc: 200
vm.pmap.pv_entry_max: 696561
vm.pmap.pg_ps_enabled: 1
vm.pmap.pat_works: 1



seems like i don't understand what these sysctl things mean (i did 
sysctl -d) or it doesn't really work. with a program allocating and 
using a linear 64MB chunk there should be 31 or 32 more mappings in 
vm.pmap.pde.mappings.

there is zero difference.


Notice that vm.pmap.pde.promotions increased by 31.  This means that 31 
superpage mappings were created by promotion from small page mappings.


In contrast, vm.pmap.pde.mappings counts superpage mappings that are 
created directly and not by promotion from small page mappings.  For 
example, if a large executable, such as gcc, is resident in memory, the 
text segment will be pre-mapped using superpage mappings, avoiding soft 
fault and promotion overhead.  Similarly, mmap(..., MAP_PREFAULT_READ) 
on a large, memory resident file may pre-map the file using superpage 
mappings.


Alan


___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: mmap performance and memory use

2011-10-07 Thread Alan Cox
On Thu, Oct 6, 2011 at 11:01 AM, Kostik Belousov <kostik...@gmail.com> wrote:

 On Thu, Oct 06, 2011 at 04:41:45PM +0200, Wojciech Puchar wrote:
  i have a few questions.
 
  1) suppose i map 1TB of address space as anonymous and touch just one
  page. how much memory is used to manage this?
 I am not sure how deep the enumeration you want to know, but the first
 approximation will be:
 one struct vm_map_entry
 one struct vm_object
 one pv_entry

 Page table structures need four pages for directories and page table
 proper.
 
  2) suppose we have a 1TB file on disk without holes and 10 processes
  mmap this file into their address space. are just pages shared or can
  pagetables be shared too? how much memory is used to manage such a
  situation?
 Only pages are shared. Pagetables are not.

 For one thing, this indeed causes more memory use for the OS. This is
 somewhat mitigated by automatic use of superpages. Superpage promotion
 still keeps the 4KB page table around, so most savings from the
 superpages are due to more efficient use of TLB.


You are correct about the page table page.  However, a superpage mapping
consumes a single PV entry, in place of 512 or 1024 PV entries.  This winds
up saving about three physical pages worth of memory for every superpage
mapping.
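
To put rough numbers on that (the 24-byte pv_entry size used below is an
assumption about the amd64 layout of that era, not something stated above):
a 2 MB superpage covers 512 small pages, so mapping the same range with
small pages takes 512 pv entries, and 512 * 24 bytes = 12288 bytes, i.e.
three 4 KB pages, versus a single pv entry for the superpage mapping.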

Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Memory allocation in kernel -- what to use in which situation? What is the best for page-sized allocations?

2011-10-02 Thread Alan Cox
On Sun, Oct 2, 2011 at 1:21 PM, m...@freebsd.org wrote:

 2011/10/2 Lev Serebryakov <l...@freebsd.org>:
  Hello, Freebsd-hackers.
 
  There are several memory-allocation mechanisms in the kernel. The two
  I'm aware of are MALLOC_DEFINE()/malloc()/free() and uma_* (zone(9)).
 
  As far as I understand, malloc() is general-purpose, but it has a
  fixed transaction cost (in terms of memory consumption) for each
  block allocated, and is not very suitable for allocation of many small
  blocks, as lots of memory will be wasted for bookkeeping.
 
  The zone(9) allocator, on the other hand, has a very low cost for each
  allocated block, but can allocate only pre-configured fixed-size
  blocks, and is ideal for allocating tons of small objects (and provides
  an API for reusing them, too!).
 
   Am I right?

 No one has quite answered this question, IMO, so here's my 2 cents.

 malloc(9) on smaller sizes (= PAGE_SIZE) uses uma(9) under the
 covers.  There are a set of uma zones for 16, 32, 64, 128, ...
 PAGE_SIZE bytes and malloc(9) looks up the malloc size in a small
 array to determine which uma zone to allocate from.

 So malloc(9) on small sizes doesn't have overhead of bookkeeping, but
 it does have overhead of rounding to the next highest malloc uma
 bucket.  At $WORK we found, for example, that 48 bytes and 96 bytes
 were very common sizes and so I added uma zones there (and few other
  odd sizes determined by using the malloc statistics option).
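
A sketch of that lookup, with illustrative size classes rather than the
kernel's actual tables (zone_for() and its arrays are made-up names):

#include <stddef.h>

#define NCLASSES 9

static const size_t class_size[NCLASSES] = {
	16, 32, 64, 128, 256, 512, 1024, 2048, 4096
};
static void *class_zone[NCLASSES];	/* uma_zone_t's in the kernel */

/* Round the request up to the next size class and pick its zone. */
static void *
zone_for(size_t size)
{
	int i;

	for (i = 0; i < NCLASSES; i++)
		if (size <= class_size[i])
			return (class_zone[i]);
	return (NULL);	/* larger requests go to uma_large_malloc() */
}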

But what if I need to allocate a lot (say, 16K-32K) of page-sized
  blocks? Not in one chunk, for sure, but over the lifetime of my kernel
  module. Which allocator should I use? It seems the best one would be
  a very low-level, only-page-sized allocator. Is there any in the kernel?

 4k allocations, as has been pointed out, get a single kernel page in
 both the virtual space and physical space.  They (like all the large
 allocations) use a field in the vm_page for the physical page backing
 the virtual address to record info about the allocation.

 Any allocation PAGE_SIZE and larger will round up to the next multiple
 of pages and allocate whole pages.  IMO the problems here are (1) as
 was pointed out, TLB shootdown on free(9), and (2) the current
 algorithm for finding space in a kmem_map is a linear search and
 doesn't track where there are fragmented chunks, so it's not terribly
  efficient when finding larger sizes, and the PAGE_SIZE allocations will
 not fill in fragmented areas.


Regarding #2, no, it is not linear; it is an amortized logarithmic first
fit.  Every node in every vm map, including the kmem map, is augmented with
free space information.  This is used by the first fit traversal to skip
entire subtrees that contain insufficient space.
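
The idea, reduced to a sketch (the field and function names are invented;
the real code lives in sys/vm/vm_map.c):

#include <stddef.h>

struct map_node {
	size_t gap_after;		/* free space just after this entry */
	size_t max_free;		/* largest gap in this whole subtree */
	struct map_node *left, *right;	/* children, ordered by address */
};

/* First fit: skip any subtree whose cached max_free is too small. */
static struct map_node *
first_fit(struct map_node *n, size_t need)
{
	struct map_node *hit;

	if (n == NULL || n->max_free < need)
		return (NULL);
	hit = first_fit(n->left, need);	/* prefer lower addresses */
	if (hit != NULL)
		return (hit);
	if (n->gap_after >= need)
		return (n);
	return (first_fit(n->right, need));
}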

Regards,
Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: UMA large allocations issues

2011-07-23 Thread Alan Cox
On Fri, Jul 22, 2011 at 9:07 PM, Davide Italiano
<davide.itali...@gmail.com> wrote:

 Hi.
 I'm a student and some time ago I started investigating a bit about
 the performance/fragmentation issue of large allocations within the
 UMA allocator.
  Benchmarks showed that these performance problems are mainly
  related to the fact that every call to uma_large_malloc() results in a
  call to kmem_malloc(), and this behaviour is really inefficient.

  I started doing some work. Here's something:
 First of all, I tried to define larger zones and let uma do it all as
 a first step.
 UMA can allocate slabs of more than one page. So I tried to define
 zones of 1,2,4,8 pages, moving ZMEM_KMAX up.
  I tested the solution w/ raidtest. Here are some numbers.

 Here's the workload characterization:


 set mediasize=`diskinfo /dev/zvol/tank/vol | awk '{print $3}'`
 set sectorsize=`diskinfo /dev/zvol/tank/vol | awk '{print $2}'`
 raidtest genfile -s $mediasize -S $sectorsize -n 5

 # $mediasize = 10737418240
 # $sectorsize = 512

 Number of READ requests: 24924
 Number of WRITE requests: 25076
 Numbers of bytes to transmit: 3305292800


 raidtest test -d /dev/zvol/tank/vol -n 4
 ## tested using 4 cores, 1.5 GB Ram

 Results:
 Number of processes: 4
 Bytes per second: 10146896
 Requests per second: 153

 Results: (4* PAGE_SIZE)
 Number of processes: 4
 Bytes per second: 14793969
 Requests per second: 223

 Results: (8* PAGE_SIZE)
 Number of processes: 4
 Bytes per second: 6855779
 Requests per second: 103


  The result of these tests is that defining larger zones is useful as
  long as the size of these zones is not too big. Beyond some size,
  performance decreases significantly.

  As a second step, alc@ proposed to create a new layer that sits between
  UMA and the VM subsystem. This layer can manage a pool of chunks that
  can be used to satisfy requests from uma_large_malloc(), avoiding the
  overhead due to kmem_malloc() calls.

  I've recently started developing a patch (not yet fully working) that
  implements this layer. First of all I'd like to concentrate my
  attention on the performance problem rather than the fragmentation
  one. So the patch I actually started to write doesn't care about
  fragmentation aspects.

 http://davit.altervista.org/uma_large_allocations.patch

  There are some questions I wasn't able to answer (for
  example, when it's safe to call kmem_malloc() to request memory).


In this context, there is really only one restriction.  Your
page_alloc_new() should never call kmem_malloc() with M_WAITOK if your
bitmap_mtx lock is held.  It may only call kmem_malloc() with M_NOWAIT if
your bitmap_mtx lock is held.

That said, I would try to structure the code so that you're not doing any
kmem_malloc() calls with the bitmap_mtx lock held.
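
Concretely, the safe shape looks something like this sketch (bitmap_mtx is
the lock from Davide's patch; the function itself and the map used are
illustrative, and this is kernel code, not a standalone program):

/* Grow the pool without sleeping while the bitmap lock is held. */
static vm_offset_t
pool_grow(vm_size_t bytes)
{
	vm_offset_t chunk;

	/* Sleepable allocation: do it before taking the mutex. */
	chunk = kmem_malloc(kmem_map, bytes, M_WAITOK);
	if (chunk == 0)
		return (0);

	mtx_lock(&bitmap_mtx);
	/* ... record the new chunk in the pool's bitmap ... */
	mtx_unlock(&bitmap_mtx);
	return (chunk);
}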

So, at the end of the day I'm asking for your opinion about this issue
 and I'm looking for a mentor (some kind of guidance) to continue
  this project. If someone is interested in helping, it would be very
  appreciated.


I will take a closer look at your patch later today, and send you comments.

Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: SMP question w.r.t. reading kernel variables

2011-04-20 Thread Alan Cox
On Wed, Apr 20, 2011 at 7:42 AM, Rick Macklem <rmack...@uoguelph.ca> wrote:

  On Tue, Apr 19, 2011 at 12:00:29PM +,
  freebsd-hackers-requ...@freebsd.org wrote:
   Subject: Re: SMP question w.r.t. reading kernel variables
   To: Rick Macklem <rmack...@uoguelph.ca>
   Cc: freebsd-hackers@freebsd.org
   Message-ID: 201104181712.14457@freebsd.org
 
  [John Baldwin]
   On Monday, April 18, 2011 4:22:37 pm Rick Macklem wrote:
 On Sunday, April 17, 2011 3:49:48 pm Rick Macklem wrote:
  ...
All of this makes sense. What I was concerned about was memory
cache
consistency and what (if anything) has to be done to make sure a
thread
doesn't see a stale cached value for the memory location.
   
Here's a generic example of what I was thinking of:
(assume x is a global int and y is a local int on the thread's
stack)
- time proceeds down the screen
thread X on CPU 0              thread Y on CPU 1
x = 0;
                               x = 0; /* 0 for x's location
                                         in CPU 1's memory cache */
x = 1;
                               y = x;
-- now, is y guaranteed to be 1 or can it get the stale cached
0 value?
if not, what needs to be done to guarantee it?
  
   Well, the bigger problem is getting the CPU and compiler to order
   the
   instructions such that they don't execute out of order, etc. Because
   of that,
    even if your code has 'x = 0; x = 1;' as adjacent statements in thread
    X,
   the 'x = 1' may actually execute a good bit after the 'y = x' on CPU
   1.
 
  Actually, as I recall the rules for C, it's worse than that. For
  this (admittedly simplified scenario), x=0; in thread X may never
  execute unless it's declared volatile, as the compiler may optimize it
  out and emit no code for it.
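
As a userland illustration of what does give the guarantee (C11 atomics,
purely for illustration; the kernel would use the atomic(9) primitives or
locks instead):

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static _Atomic int x;

static void *
thread_y(void *arg)
{
	int y;

	(void)arg;
	/* Spin until thread X's release store becomes visible. */
	while ((y = atomic_load_explicit(&x, memory_order_acquire)) != 1)
		;
	printf("y = %d\n", y);	/* guaranteed to print 1 */
	return (NULL);
}

int
main(void)
{
	pthread_t td;

	pthread_create(&td, NULL, thread_y, NULL);
	atomic_store_explicit(&x, 0, memory_order_relaxed);
	atomic_store_explicit(&x, 1, memory_order_release);
	pthread_join(td, NULL);
	return (0);
}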
 
 
    Locks force that to synchronize as the CPUs coordinate around the
   lock cookie
   (e.g. the 'mtx_lock' member of 'struct mutex').
  
Also, I see cases of:
  mtx_lock(np);
  np->n_attrstamp = 0;
  mtx_unlock(np);
in the regular NFS client. Why is the assignment mutex locked? (I
had assumed
it was related to the above memory caching issue, but now I'm not
so sure.)
  
   In general I think writes to data that are protected by locks should
   always be
   protected by locks. In some cases you may be able to read data using
   weaker
   locking (where no locking can be a form of weaker locking, but
   also a
   read/shared lock is weak, and if a variable is protected by multiple
   locks,
    then any single lock is weak, but sufficient for reading while all of
   the
   associated locks must be held for writing) than writing, but writing
   generally
   requires full locking (write locks, etc.).
 
 Oops, I now see that you've differentiated between writing and reading.
 (I mistakenly just stated that you had recommended a lock for reading.
  Sorry about my misinterpretation of the above on the first quick read.)

  What he said. In addition to all that, lock operations generate
  atomic barriers which a compiler or optimizer is prevented from
  moving code across.
 
 All good and useful comments, thanks.

 The above example was meant to be contrived, to indicate what I was
 worried about w.r.t. memory caches.
 Here's a somewhat simplified version of what my actual problem is:
 (Mostly fyi, in case you are interested.)

 Thread X is doing a forced dismount of an NFS volume, it (in dounmount()):
 - sets MNTK_UNMOUNTF
 - calls VFS_SYNC()/nfs_sync()
  - so this doesn't get hung on an unresponsive server it must test
 for MNTK_UNMOUNTF and return an error if it is set. This seems fine,
since it is the same thread and in a called function. (I can't
imagine that the optimizer could move setting of a global flag
to after a function call which might use it.)
 - calls VFS_UNMOUNT()/nfs_unmount()
  - now the fun begins...
  after some other stuff, it calls nfscl_umount() to get rid of the
  state info (opens/locks...)
  nfscl_umount() - synchronizes with other threads that will use this
state (see below) using the combination of a mutex and a
shared/exclusive sleep lock. (Because of various quirks in the
code, this shared/exclusive lock is a locally coded version and
 I happened to call the shared case a refcnt and the exclusive
case just a lock.)

 Other threads that will use state info (open/lock...) will:
 -call nfscl_getcl()
  - this function does two things that are relevant
  1 - it allocates a new clientid, as required, while holding the mutex
  - this case needs to check for MNTK_UNMOUNTF and return error, in
case the clientid has already been deleted by nfscl_umount() above.
  (This happens before #2 because the sleep lock is in the clientid
 structure.)
 -- it must see the MNTK_UNMOUNTF set if it happens after (in a temporal
 sense)
  being set by dounmount()
  2 - while holding the mutex, it acquires the shared lock

Re: Question about Reverse Mappings in FreeBSD.

2011-03-20 Thread Alan Cox
On Fri, Mar 18, 2011 at 7:30 PM, J L <dimitar9...@gmail.com> wrote:

 I read an article about the Reverse Mappings technique in memory management.
 It improved a lot from Linux 2.4 to 2.6. I am wondering, does FreeBSD
 also have this feature? Which source files should I look at to find it? I
 want to do some study on this.
 I wish someone could enlighten me. Thank you.



Reverse mappings are implemented by the machine-dependent layer of the
virtual memory system, which is called the pmap.  Look for files named
pmap.c in the source tree, such as sys/amd64/amd64/pmap.c.  In particular,
look for the code that manages pv entries.

Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Analyzing wired memory?

2011-02-08 Thread Alan Cox
On Tue, Feb 8, 2011 at 6:20 AM, Ivan Voras ivo...@freebsd.org wrote:

 Is it possible to track by some way what kernel system, process or thread
 has wired memory? (including data exists but needs code to extract it)


No.


 I'd like to analyze a system where there is a lot of memory wired but not
 accounted for in the output of vmstat -m and vmstat -z. There are no user
 processes which would lock memory themselves.

 Any pointers?


Have you accounted for the buffer cache?

Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Analyzing wired memory?

2011-02-08 Thread Alan Cox

On 2/8/2011 12:27 PM, Robert Watson wrote:


On Tue, 8 Feb 2011, Alan Cox wrote:


On Tue, Feb 8, 2011 at 6:20 AM, Ivan Voras ivo...@freebsd.org wrote:

Is it possible to track by some way what kernel system, process or 
thread has wired memory? (including data exists but needs code to 
extract it)



No.


I'd like to analyze a system where there is a lot of memory wired 
but not accounted for in the output of vmstat -m and vmstat -z. 
There are no user processes which would lock memory themselves.


Any pointers?


Have you accounted for the buffer cache?


John and I have occasionally talked about making procstat -v work on 
the kernel; conceivably it could also export a wired page count for 
mappings where it makes sense.  Ideally procstat would drill in a bit 
and allow you to see things at least at the granularty of this page 
range was allocated to UMA.


I would certainly have found this useful on a few occasions, and would 
gladly help out with implementing it.  For example, it would help us in 
understanding the kmem_map fragmentation caused by ZFS.  That said, I'm 
not sure how you will represent the case where UMA allocates physical 
memory directly and uses the direct map to access it.


Alan

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: [rfc] allow to boot with = 256GB physmem

2011-01-21 Thread Alan Cox
On Fri, Jan 21, 2011 at 11:44 AM, John Baldwin j...@freebsd.org wrote:

 On Friday, January 21, 2011 11:09:10 am Sergey Kandaurov wrote:
  Hello.
 
  Some time ago I faced with a problem booting with 400GB physmem.
  The problem is that vm.max_proc_mmap type overflows with
  such high value, and that results in a broken mmap() syscall.
  The max_proc_mmap value is a signed int and roughly calculated
  at vmmapentry_rsrc_init() as u_long vm_kmem_size quotient:
  vm_kmem_size / sizeof(struct vm_map_entry) / 100.
 
  Although at the time it was introduced at svn r57263 the value
  was quite low (f.e. the related commit log stands:
  The value defaults to around 9000 for a 128MB machine.),
  the problem is observed on amd64 where KVA space after
  r212784 is factually bound to the only physical memory size.
 
  With INT_MAX here is 0x7fff, and sizeof(struct vm_map_entry)
  is 120, it's enough to have sligthly less than 256GB to be able
  to reproduce the problem.
 
  I rewrote vmmapentry_rsrc_init() to set large enough limit for
  max_proc_mmap just to protect from integer type overflow.
  As it's also possible to live tune this value, I also added a
  simple anti-shoot constraint to its sysctl handler.
  I'm not sure though if it's worth to commit the second part.
 
  As this patch may cause some bikeshedding,
  I'd like to hear your comments before I will commit it.
 
  http://plukky.net/~pluknet/patches/max_proc_mmap.diffhttp://plukky.net/%7Epluknet/patches/max_proc_mmap.diff

 Is there any reason we can't just make this variable and sysctl a long?


Or just delete it.

1. Contrary to what the commit message says, this sysctl does not
effectively limit the number of vm map entries.  It only limits the number
that are created by one system call, mmap().  Other system calls create vm
map entries just as easily, for example, mprotect(), madvise(), mlock(), and
minherit().  Basically, anything that alters the properties of a mapping.
Thus, in 2000, after this sysctl was added, the same resource exhaustion
induced crash could have been reproduced by trivially changing the program
in PR/16573 to do an mprotect() or two.

In a nutshell, if you want to really limit the number of vm map entries that
a process can allocate, the implementation is a bit more involved than what
was done for this sysctl.

2. UMA implements M_WAITOK, whereas the old zone allocator in 2000 did not.
Moreover, vm map entries for user maps are allocated with M_WAITOK.  So, the
exact crash reported in PR/16573 couldn't happen any longer.

3. We now have the vmemoryuse resource limit.  When this sysctl was
defined, we didn't.  Limiting the virtual memory indirectly but effectively
limits the number of vm map entries that a process can allocate.

In summary, I would do a little due diligence, for example, run the program
from PR/16573 with the limit disabled.  If you can't reproduce the crash, in
other words, nothing contradicts point #2 above, then I would just delete
this sysctl.

Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: [rfc] allow to boot with = 256GB physmem

2011-01-21 Thread Alan Cox
On Fri, Jan 21, 2011 at 2:58 PM, Alan Cox alan.l@gmail.com wrote:

 On Fri, Jan 21, 2011 at 11:44 AM, John Baldwin j...@freebsd.org wrote:

 On Friday, January 21, 2011 11:09:10 am Sergey Kandaurov wrote:
  Hello.
 
  Some time ago I faced with a problem booting with 400GB physmem.
  The problem is that vm.max_proc_mmap type overflows with
  such high value, and that results in a broken mmap() syscall.
  The max_proc_mmap value is a signed int and roughly calculated
  at vmmapentry_rsrc_init() as u_long vm_kmem_size quotient:
  vm_kmem_size / sizeof(struct vm_map_entry) / 100.
 
  Although at the time it was introduced at svn r57263 the value
  was quite low (f.e. the related commit log stands:
  The value defaults to around 9000 for a 128MB machine.),
  the problem is observed on amd64 where KVA space after
  r212784 is factually bound to the only physical memory size.
 
  With INT_MAX here is 0x7fff, and sizeof(struct vm_map_entry)
  is 120, it's enough to have sligthly less than 256GB to be able
  to reproduce the problem.
 
  I rewrote vmmapentry_rsrc_init() to set large enough limit for
  max_proc_mmap just to protect from integer type overflow.
  As it's also possible to live tune this value, I also added a
  simple anti-shoot constraint to its sysctl handler.
  I'm not sure though if it's worth to commit the second part.
 
  As this patch may cause some bikeshedding,
  I'd like to hear your comments before I will commit it.
 
  http://plukky.net/~pluknet/patches/max_proc_mmap.diffhttp://plukky.net/%7Epluknet/patches/max_proc_mmap.diff

 Is there any reason we can't just make this variable and sysctl a long?


 Or just delete it.

 1. Contrary to what the commit message says, this sysctl does not
 effectively limit the number of vm map entries.  It only limits the number
 that are created by one system call, mmap().  Other system calls create vm
 map entries just as easily, for example, mprotect(), madvise(), mlock(), and
 minherit().  Basically, anything that alters the properties of a mapping.
 Thus, in 2000, after this sysctl was added, the same resource exhaustion
 induced crash could have been reproduced by trivially changing the program
 in PR/16573 to do an mprotect() or two.

 In a nutshell, if you want to really limit the number of vm map entries
 that a process can allocate, the implementation is a bit more involved than
 what was done for this sysctl.

 2. UMA implements M_WAITOK, whereas the old zone allocator in 2000 did
 not.  Moreover, vm map entries for user maps are allocated with M_WAITOK.
 So, the exact crash reported in PR/16573 couldn't happen any longer.


Actually, I take back part of what I said here.  The old zone allocator did
implement something like M_WAITOK, and that appears to have been used for
user maps.  However, the crash described in PR/16573 was actually on the
allocation of a vm map entry within the *kernel* address space for a process
U area.  This type of allocation did not use the old zone allocator's
equivalent to M_WAITOK.  However, we no longer have U areas, so the exact
crash scenario is clearly no longer possible.  Interestingly, the sysctl in
question has no direct effect on the allocation of kernel vm map entries.

So, I remain skeptical that this sysctl is preventing any resource
exhaustion based panics in the current kernel.  Again, I would be thrilled
to see one or more people do some testing, such as rerunning the program
from PR/16573.


3. We now have the vmemoryuse resource limit.  When this sysctl was
 defined, we didn't.  Limiting the virtual memory indirectly but effectively
 limits the number of vm map entries that a process can allocate.

 In summary, I would do a little due diligence, for example, run the program
 from PR/16573 with the limit disabled.  If you can't reproduce the crash, in
 other words, nothing contradicts point #2 above, then I would just delete
 this sysctl.

 Alan


___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: vm_map_findspace space search?

2010-12-16 Thread Alan Cox
On Wed, Dec 15, 2010 at 7:37 PM, Venkatesh Srinivas 
vsrini...@dragonflybsd.org wrote:

 Hi,

 In svn r133636, there was a commit to convert the linear search in
 vm_map_findspace() to use the vm_map splay tree. Just curious, were
 there any discussions about that particular change? Any measurements
 other than the ones noted in the commit log? Any notes on why that
 design was used rather than any other?

 I've seen the 'Some mmap observations...' thread from about a year
 earlier and was wondering about some of the possible designs discussed
 there. In particular the Bonwick/Adams vmem allocator was brought up;
 I think that something inspired by it (and strangely close to the
 QuickFit malloc) would be appropriate:

 Here's how I see it working:
 * Add a series of power-of-two or logarithmic sized freelists to the
  vm_map structure; they would point to vm_map_entries immediately to the
  left of free space holes of a given size.
 * finding free space would just pop an entry off of the free space list
  and split in the usual way; deletion could coalesce in constant time.
 * Unlike the vmem allocator, we would not need to allocate external
  boundary tags; the vm_map_entries themselves would be the tags.

 At least from looking at the pattern of vm_map_findspace()s on DFly,
 the most common requests were for 1 page, 4 page, and 16 page-sized
 holes (iirc these combined for 75% of requests there; I imagine the
 pattern in FreeBSD would be very similar). The fragmentation concerns
 from this would be pretty minor with that pattern...


I'm afraid that the pattern is is not always so simple.  Sometimes external
fragmentation is, in fact, a problem.  For example, search for ZFS ARC
kmem_map fragmentation.  I recall there being at least one particularly
detailed e-mail that quantified the fragmentation being caused by the ZFS
ARC.  There are also microbenchmarks that simulate an mmap() based web
server, which will show a different pattern than you describe.

If you're interested in working on something in this general area, I can
suggest something.

Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: i386 pmap_zero_page() late sched_pin()?

2010-12-12 Thread Alan Cox
On Sun, Dec 12, 2010 at 10:40 AM, Venkatesh Srinivas 
vsrini...@dragonflybsd.org wrote:

 Hi,

 In the i386 pmap's pmap_zero_page(), there is a fragment...

sysmaps = sysmaps_pcpu[PCPU_GET(cpuid)];
mtx_lock(sysmaps-lock);
 *   sched_pin();
/*map the page we mean to zero at sysmaps-CADDR2*/
pagezero(sysmaps-CADDR2);
sched_unpin();

 I don't know this bit of code too well, so I don't know if the sched_pin()
 being where it is is okay or not. My first reading says its not okay; if a
 thread is moved to another CPU before it is able to pin, it will use the
 wrong sysmaps structure. Is this the case? Is it alright that the wrong
 sysmap structure is used?

 Oh, Nathaniel Filardo (n...@cs.jhu.edu) first spotted this, not I.


This isn't a bug.  There is nothing about the code that mandates that
processor i must always use sysmap entry i.   In the unlikely event that the
thread migrates from processor X to processor Y before the sched_pin(), the
mutex on sysmap entry X will prevent it from being used by processor X until
processor Y is done with it.  So, it doesn't matter to correctness that the
wrong sysmap entry is used, and it is extremely unlikely to matter to
performance.

Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: vm_object ref_count question

2010-10-28 Thread Alan Cox
On Thu, Oct 28, 2010 at 2:48 AM, Eknath Venkataramani eknath.i...@gmail.com
 wrote:

 ref_count is defined inside struct vm_object.
 and it is incremented everytime the object is referenced
 How is the page reference logged then? rather in which variable?


There is no per-page reference.  There is, however, a
garbage-collection-like process performed by vm_object_collapse().

Regards,
Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Contiguous physical memory

2010-10-27 Thread Alan Cox
On Wed, Oct 27, 2010 at 2:17 PM, Dr. Baud drb...@yahoo.com wrote:


Can anyone suggest a method/formula to monitor contiguous physical
 memory
 allocations such that one could predict when contigmalloc(), make that
 bus_dmamem_alloc might fail?


From the command line you can obtain this information with sysctl
vm.phys_free.

Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: page table fault, which should map kernel virtual address space

2010-10-03 Thread Alan Cox
On Thu, Sep 30, 2010 at 6:28 AM, Svatopluk Kraus onw...@gmail.com wrote:

 On Tue, Sep 21, 2010 at 7:38 PM, Alan Cox alan.l@gmail.com wrote:
  On Mon, Sep 20, 2010 at 9:32 AM, Svatopluk Kraus onw...@gmail.com
 wrote:
  Beyond 'kernel_map', some submaps of 'kernel_map' (buffer_map,
  pager_map,...) exist as result of 'kmem_suballoc' function call.
  When this submaps are used (for example 'kmem_alloc_nofault'
  function) and its virtual address subspace is at the end of
  used kernel virtual address space at the moment (and above 'NKPT'
  preallocation), then missing page tables are not allocated
  and double fault can happen.
 
 
  No, the page tables are allocated.  If you create a submap X of the
 kernel
  map using kmem_suballoc(), then a vm_map_findspace() is performed by
  vm_map_find() on the kernel map to find space for the submap X.  As you
 note
  above, the call to vm_map_findspace() on the kernel map will call
  pmap_growkernel() if needed to extend the kernel page table.
 
  If you create another submap X' of X, then that submap X' can only map
  addresses that fall within the range for X.  So, any necessary page table
  pages were allocated when X was created.

 You are right. Mea culpa. I was focused on a solution and made
 too quick conclusion. The page table fault hitted in 'pager_map',
 which is submap of 'clean_map' and when I debugged the problem
 I didn't see a submap stuff as a whole.

  That said, there may actually be a problem with the implementation of the
  superpage_align parameter to kmem_suballoc().  If a submap is created
 with
  superpage_align equal to TRUE, but the submap's size is not a multiple of
  the superpage size, then vm_map_find() may not allocate a page table page
  for the last megabyte or so of the submap.
 
  There are only a few places where kmem_suballoc() is called with
  superpage_align set to TRUE.  If you changed them to FALSE, that is an
 easy
  way to test this hypothesis.

 Yes, it helps.

 My story is that the problem started up when I updated a project
 ('coldfire' port)
 based on FreeBSD 8.0. to FreeBSD current version. In the current version
 the 'clean_map' submap is created with superpage_align set to TRUE.

 I have looked at vm_map_find() and debugged the page table fault once
 again.
 IMO, it looks that a do-while loop does not work in the function as
 intended.
 A vm_map_findspace() finds a space and calls pmap_growkernel() if needed.
 A pmap_align_superpage() arrange the space but never call
 pmap_growkernel().
 A vm_map_insert() inserts the aligned space into a map without error
 and never call pmap_growkernel() and does not invoke loop iteration.

 I don't know too much about an idea how a virtual memory model is
 implemented
 and used in other modules. But it seems that it could be more reasonable to
 align address space in vm_map_findspace() internally and not to loop
 externally.

 I have tried to add a check in vm_map_insert() that checks the 'end'
 parameter
 against 'kernel_vm_end' variable and returns KERN_NO_SPACE error if needed.
 In this case the loop in vm_map_find() works and I have no problem with
 the page table fault. But 'kernel_vm_end' variable must be initializated
 properly before first use of vm_map_insert(). The 'kernel_vm_end' variable
 can be self-initializated in pmap_growkernel() in FreeBSD 8.0 (it is too
 late),
 but it was changed in current version ('i386' port).

 Thanks for your answer, but I'm still looking for permanent
 and approved solution.


I have a patch that implements one possible fix for this problem.  I'll
probably commit that patch in the next day or two.

Regards,
Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Examining the VM splay tree effectiveness

2010-09-30 Thread Alan Cox
On Thu, Sep 30, 2010 at 12:37 PM, Andre Oppermann an...@freebsd.org wrote:

 On 30.09.2010 18:37, Andre Oppermann wrote:

 Just for the kick of it I decided to take a closer look at the use of
 splay trees (inherited from Mach if I read the history correctly) in
 the FreeBSD VM system suspecting an interesting journey.


 Correcting myself regarding the history: The splay tree for vmmap was
 done about 8 years ago by alc@ to replace a simple linked list and was
 a huge improvement.  The change in vmpage from a hash to the same splay
 tree as in vmmap was committed by dillon@ about 7.5 years ago with some
 involvement of a...@.
 ss


Yes, and there is a substantial difference in the degree of locality of
access to these different structures, and thus the effectiveness of a splay
tree.  When I did the last round of changes to the locking on the vm map, I
made some measurements of the splay tree's performance on a JVM running a
moderately large bioinformatics application.  The upshot was that the
average number of map entries visited on an access to the vm map's splay
tree was less than the expected depth of a node in a perfectly balanced
tree.

I teach class shortly.  I'll provide more details later.

Regards,
Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: zfs + uma

2010-09-21 Thread Alan Cox
On Tue, Sep 21, 2010 at 1:39 AM, Jeff Roberson jrober...@jroberson.netwrote:

 On Tue, 21 Sep 2010, Andriy Gapon wrote:

  on 19/09/2010 11:42 Andriy Gapon said the following:

 on 19/09/2010 11:27 Jeff Roberson said the following:

 I don't like this because even with very large buffers you can still
 have high
 enough turnover to require per-cpu caching.  Kip specifically added UMA
 support
 to address this issue in zfs.  If you have allocations which don't
 require
 per-cpu caching and are very large why even use UMA?


 Good point.
 Right now I am running with 4 items/bucket limit for items larger than
 32KB.


 But I also have two counter-points actually :)
 1. Uniformity.  E.g. you can handle all ZFS I/O buffers via the same
 mechanism
 regardless of buffer size.
 2. (Open)Solaris does that for a while and it seems to suit them well.
  Not
 saying that they are perfect, or the best, or an example to follow, but
 still
 that means quite a bit (for me).


 I'm afraid there is not enough context here for me to know what 'the same
 mechanism' is or what solaris does.  Can you elaborate?

 I prefer not to take the weight of specific examples too heavily when
 considering the allocator as it must handle many cases and many types of
 systems.  I believe there are cases where you want large allocations to be
 handled by per-cpu caches, regardless of whether ZFS is one such case.  If
 ZFS does not need them, then it should simply allocate directly from the VM.
  However, I don't want to introduce some maximum constraint unless it can be
 shown that adequate behavior is not generated from some more adaptable
 algorithm.


Actually, I think that there is a middle ground between per-cpu caches and
directly from the VM that we are missing.  When I've looked at the default
configuration of ZFS (without the extra UMA zones enabled), there is an
incredible amount of churn on the kmem map caused by the implementation of
uma_large_malloc() and uma_large_free() going directly to the kmem map.  Not
only are the obvious things happening, like allocating and freeing kernel
virtual addresses and underlying physical pages on every call, but also
system-wide TLB shootdowns and sometimes superpage demotions are occurring.

I have some trouble believing that the large allocations being performed by
ZFS really need per-CPU caching, but I can certainly believe that they could
benefit from not going directly to the kmem map on every uma_large_malloc()
and uma_large_free().  In other words, I think it would make a lot of sense
to have a thin layer between UMA and the kmem map that caches allocated but
unused ranges of pages.

Regards,
Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: page table fault, which should map kernel virtual address space

2010-09-21 Thread Alan Cox
On Mon, Sep 20, 2010 at 9:32 AM, Svatopluk Kraus onw...@gmail.com wrote:


 Hallo,

 this is about 'NKPT' definition, 'kernel_map' submaps,
 and 'vm_map_findspace' function.

 Variable 'kernel_map' is used to manage kernel virtual address
 space. When 'vm_map_findspace' function deals with 'kernel_map'
 then 'pmap_growkernel' function is called.

 At least in 'i386' architecture, pmap implementation uses
 'pmap_growkernel' function to allocate missing page tables.
 Missing page tables are problem, because no one checks
 'pte' pointer for validity after use of 'vtopte' macro.

 'NKPT' definition defines a number of preallocated
 page tables during system boot.

 Beyond 'kernel_map', some submaps of 'kernel_map' (buffer_map,
 pager_map,...) exist as result of 'kmem_suballoc' function call.
 When this submaps are used (for example 'kmem_alloc_nofault'
 function) and its virtual address subspace is at the end of
 used kernel virtual address space at the moment (and above 'NKPT'
 preallocation), then missing page tables are not allocated
 and double fault can happen.


No, the page tables are allocated.  If you create a submap X of the kernel
map using kmem_suballoc(), then a vm_map_findspace() is performed by
vm_map_find() on the kernel map to find space for the submap X.  As you note
above, the call to vm_map_findspace() on the kernel map will call
pmap_growkernel() if needed to extend the kernel page table.

If you create another submap X' of X, then that submap X' can only map
addresses that fall within the range for X.  So, any necessary page table
pages were allocated when X was created.

That said, there may actually be a problem with the implementation of the
superpage_align parameter to kmem_suballoc().  If a submap is created with
superpage_align equal to TRUE, but the submap's size is not a multiple of
the superpage size, then vm_map_find() may not allocate a page table page
for the last megabyte or so of the submap.

There are only a few places where kmem_suballoc() is called with
superpage_align set to TRUE.  If you changed them to FALSE, that is an easy
way to test this hypothesis.

Regards,
Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: UMA allocations from a specific physical range

2010-09-06 Thread Alan Cox

m...@freebsd.org wrote:
[snip]

IIRC the memory from vm_phys_alloc_contig() can be released like any
other page; the interface should just be fetching a specific page.
How far off is the page wire count?  I'm assuming it's hitting the
assert that it's  1?

I think vm_page_free() is the right interface to free the page again,
so the wire count being off presumably means someone else wired it on
you; do you know what code did it?  If no one else has a reference to
the page anymore then setting the wire count to 1, while a hack,
should be safe.

  


Yes, vm_page_free() can be used to free a single page that was returned 
by vm_phys_alloc_contig().


Alan

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Intel TurboBoost in practice

2010-07-27 Thread Alan Cox
On Mon, Jul 26, 2010 at 9:11 AM, Alexander Motin m...@freebsd.org wrote:

 Robert Watson wrote:
  On Sun, 25 Jul 2010, Alexander Motin wrote:
  The numbers that you are showing doesn't show much difference. Have
  you tried buildworld?
 
  If you mean relative difference -- as I have told, it's mostly because
  of my CPU. It's maximal boost is 266MHz (8.3%), but 133MHz of them is
  enabled most of time if CPU is not overheated. It probably doesn't, as
  it works on clear table under air conditioner. So maximal effect I can
  expect on is 4.2%. In such situation 2.8% probably not so bad to
  illustrate that feature works and there is space for further
  improvements. If I had Core i5-750S I would expect 33% boost.
 
  Can I recommend the use of ministat(1) and sample sizes of at least 8
  runs per configuration?

 Thanks for pushing me to do it right. :) Here is 3*15 runs with fresh
 kernel with disabled debug. Results are quite close to original: -2.73%
 and -2.19% of time.
 x C1
 + C2
 * C3
 +-+
 |+*  x|
 |+*  x|
 |+*  x|
 |+*  x|
 |+*  x|
 |+*  x|
 |+*  x|
 |+   **  x|
 |+ + ** xx|
 |+ + ** **  xx   x|
 | |__M_A| |
 |A|   |
 ||A|  |
 +-+
NMinMax Median   AvgStddev
 x  15  12.68  12.84  12.69 12.698667   0.039254966
 +  15  12.35  12.36  12.35 12.351333  0.0035186578
 Difference at 95.0% confidence
-0.347333 +/- 0.0208409
-2.7352% +/- 0.164119%
(Student's t, pooled s = 0.0278687)
 *  15  12.41  12.44  12.42 12.42  0.0075592895
 Difference at 95.0% confidence
-0.278667 +/- 0.0211391
-2.19446% +/- 0.166467%
(Student's t, pooled s = 0.0282674)

 I also checked one more aspect -- TurboBoost works only when CPU runs at
 highest EIST frequency (P0 state). I've reduced dev.cpu.0.freq from 3201
 to 3067 and repeated the test:
 x C1
 + C2
 * C3
 +-+
 | x   +  *|
 | x   +  *|
 | x   +  *|
 | x   +  *   *|
 | x  x+  *   *|
 | x  x+  +   *   *|
 | x  x+  +   *   *|
 | x  x+  +   *   *|
 | x  x+   +  +   +   *   *|
 ||MA| |
 |   |_MA_||
 |M_A_||
 +-+
NMinMax Median   AvgStddev
 x  15  13.72  13.73  13.72 13.72  0.0048795004
 +  15  13.79  13.82   13.8 13.80  0.0072374686
 Difference at 95.0% confidence
0.08 +/- 0.00461567
0.582949% +/- 0.0336337%
(Student's t, pooled s = 0.00617213)
 *  15  13.89   13.9  13.8913.894  0.0050709255
 Difference at 95.0% confidence
0.170667 +/- 0.00372127
1.24362% +/- 0.0271164%
(Student's t, pooled s = 0.00497613)

 In that case using C2 or C3 predictably caused small performance reduce,
 as after falling to sleep, CPU needs time to wakeup. Even if tested CPU0
 won't ever sleep during test, it's TLB shutdown IPIs to other cores
 still probably could suffer from waiting other cores' wakeup.


In the deeper sleep states, are the TLB contents actually maintained while
the processor sleeps?  (I notice that in some configurations, we actually
flush dirty data from the cache before sleeping.)

Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to 

Re: Intel TurboBoost in practice

2010-07-24 Thread Alan Cox
2010/7/24 Alexander Motin m...@freebsd.org

 Hi.

 I've make small observations of Intel TurboBoost technology under
 FreeBSD. This technology allows Intel Core i5/i7 CPUs to rise frequency
 of some cores if other cores are idle and power/thermal conditions
 permit. CPU core counted as idle, if it has been put into C3 or deeper
 power state (may reflect ACPI C2/C3 states). So to reach maximal
 effectiveness, some tuning may be needed.


[snip]



 PPS: I expect even better effect achieved by further reducing interrupt
 rates on idle CPUs.


I'm currently testing a patch that eliminates another 31% of the global TLB
shootdowns for a buildworld on an amd64 machine.  So, you can expect
improvement in this area.

Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: disk I/O, VFS hirunningspace

2010-07-16 Thread Alan Cox

Peter Jeremy wrote:

Regarding vfs.lorunningspace and vfs.hirunningspace...

On 2010-Jul-15 13:52:43 -0500, Alan Coxalan.l@gmail.com  wrote:
   

Keep in mind that we still run on some fairly small systems with limited I/O
capabilities, e.g., a typical arm platform.  More generally, with the range
of systems that FreeBSD runs on today, any particular choice of constants is
going to perform poorly for someone.  If nothing else, making these sysctls
a function of the buffer cache size is probably better than any particular
constants.
 


That sounds reasonable but brings up a related issue - the buffer
cache.  Given the unified VM system no longer needs a traditional Unix
buffer cache, what is the buffer cache still used for?


Today, it is essentially a mapping cache.  So, what does that mean?

After you've set aside a modest amount of physical memory for the kernel 
to hold its own internal data structures, all of the remaining physical 
memory can potentially be used to cache file data.  However, on many 
architectures this is far more memory than the kernel can 
instantaneously access.  Consider i386.  You might have 4+ GB of 
physical memory, but the kernel address space is (by default) only 1 
GB.  So, at any instant in time, only a fraction of the physical memory 
is instantaneously accessible to the kernel.  In general, to access an 
arbitrary physical page, the kernel is going to have to replace an 
existing virtual-to-physical mapping in its address space with one for 
the desired page.  (Generally speaking, on most architectures, even the 
kernel can't directly access physical memory that isn't mapped by a 
virtual address.)


The buffer cache is essentially a region of the kernel address space 
that is dedicated to mappings to physical pages containing cached file 
data.  As applications access files, the kernel dynamically maps (and 
unmaps) physical pages containing cached file data into this region.  
Once the desired pages are mapped, then read(2) and write(2) can 
essentially bcopy from the buffer cache mapping to the application's 
buffer.  (Understand that this buffer cache mapping is a prerequisite 
for the copy out to occur.)


So, why did I call it a mapping cache?  There is generally locality in 
the access to file data.  So, rather than map and unmap the desired 
physical pages on every read and write, the mappings to file data are 
allowed to persist and are managed much like many other kinds of 
caches.  When the kernel needs to map a new set of file pages, it finds 
an older, not-so-recently used mapping and destroys it, allowing those 
kernel virtual addresses to be remapped to the new pages.


So far, I've used i386 as a motivating example.  What of other 
architectures?  Most 64-bit machines take advantage of their large 
address space by implementing some form of direct map that provides 
instantaneous access to all of physical memory.  (Again, I use 
instantaneous to mean that the kernel doesn't have to dynamically 
create a virtual-to-physical mapping before being able to access the 
data.)  On these machines, you could, in principle, use the direct map 
to implement the bcopy to the application's buffer.  So, what is the 
point of the buffer cache on these machines?


A trivial benefit is that the file pages are mapped contiguously in the 
buffer cache.  Even though the underlying physical pages may be 
scattered throughout the physical address space, they are mapped 
contiguously.  So, the bcopy doesn't need to worry about every page 
boundary, only buffer boundaries.


The buffer cache also plays a role in the page replacement mechanism.  
Once mapped into the buffer cache, a page is wired, that is, it 
removed from the paging lists, where the page daemon could reclaim it.  
However, a page in the buffer cache should really be thought of as being 
active.  In fact, when a page is unmapped from the buffer cache, it is 
placed at the tail of the virtual memory system's inactive list.  The 
same place where the virtual memory system would place a physical page 
that it is transitioning from active to inactive.  If an application 
later performs a read(2) from or write(2) to the same page, that page 
will be removed from the inactive list and mapped back into the buffer 
cache.  So, the mapping and unmapping process contributes to creating an 
LRU-ordered inactive queue.


Finally, the buffer cache limits the amount of dirty file system data 
that is cached in memory.



...  Is the current
tuning formula still reasonable (for virtually all current systems
it's basically 10MB + 10% RAM)?


It's probably still good enough.  However, this is not a statement for 
which I have supporting data.  So, I reserve the right to change my 
opinion.  :-)


Consider what the buffer cache now does.  It's just a mapping cache.  
Increasing the buffer cache size doesn't affect (much) the amount of 
physical memory available for caching file data.  So, unlike ancient 
times, 

Re: disk I/O, VFS hirunningspace

2010-07-15 Thread Alan Cox
On Thu, Jul 15, 2010 at 8:01 AM, Ivan Voras ivo...@freebsd.org wrote:

 On 07/14/10 18:27, Jerry Toung wrote:
  On Wed, Jul 14, 2010 at 12:04 AM, Gary Jennejohn
  gljennj...@googlemail.comwrote:
 
 
 
  Rather than commenting out the code try setting the sysctl
  vfs.hirunningspace to various powers-of-two.  Default seems to be
  1MB.  I just changed it on the command line as a test to 2MB.
 
  You can do this in /etc/sysctl.conf.
 
 
  thank you all, that did it. The settings that Matt recommended are giving
  the same numbers

 Any objections to raising the defaults to 8 MB / 1 MB in HEAD?



Keep in mind that we still run on some fairly small systems with limited I/O
capabilities, e.g., a typical arm platform.  More generally, with the range
of systems that FreeBSD runs on today, any particular choice of constants is
going to perform poorly for someone.  If nothing else, making these sysctls
a function of the buffer cache size is probably better than any particular
constants.

Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Strange problem with 8-stable, VMWare vSphere 4 AMD CPUs (unexpected shutdowns)

2010-02-11 Thread Alan Cox
On Thu, Feb 11, 2010 at 7:13 AM, John Baldwin j...@freebsd.org wrote:

 On Wednesday 10 February 2010 1:38:37 pm Ivan Voras wrote:
  On 10 February 2010 19:35, Andriy Gapon a...@icyb.net.ua wrote:
   on 10/02/2010 20:26 Ivan Voras said the following:
   On 10 February 2010 19:10, Andriy Gapon a...@icyb.net.ua wrote:
   on 10/02/2010 20:03 Ivan Voras said the following:
   When you say very unique is it in the it is not Linux or Windows
   sense or do we do something nonstandard?
   The former - neither Linux, Windows or OpenSolaris seem to have what
 we
 have.
  
   I can't find the exact documents but I think both Windows
   MegaUltimateServer (the highest priced version of Windows Server,
   whatever it's called today) and Linux (though disabled and marked
   Experimental) have it, or have some kind of support for large pages
   that might not be as pervasive (maybe they use it for kernel only?). I
   have no idea about (Open)Solaris.
  
   I haven't said that those OSes do not use large pages.
   I've said what I've said :-)
 
  Ok :)
 
  Is there a difference between large pages as they are commonly known
  and superpages as in FreeBSD ? In other words - are you referencing
  some specific mechanism, like automatic promotion / demotion of the
  large pages or maybe something else?

 Yes, the automatic promotion / demotion.  That is a far-less common
 feature.
 FreeBSD/i386 has used large pages for the kernel text as far back as at
 least
 4.x, but that is not the same as superpages.  Linux does not have automatic
 promotion / demotion to my knowledge.  I do not know about other OS's.


A comparison of current large page support among Unix-like and Windows
operating systems has two dimensions: (1) whether or not the creation of
large pages for applications is automatic and (2) whether or not the machine
administrator has to statically partition the machine's physical memory
between large and small pages at boot time.

For FreeBSD, large pages are created automatically and there is not a static
partitioning of physical memory.  In contrast, Linux does not create large
pages automatically and does require a static partitioning.  Specifically,
Linux requires the administrator to explicitly and statically partition the
machine's physical memory at boot time into two parts, one that is dedicated
to large pages and another for general use.  To utilize large pages an
application has to explicitly request memory from the dedicated large pages
pool.  However, to make this somewhat easier, but not automatic, there do
exist re-implementations of malloc that you can explicitly link with your
application.

In Solaris, the application has to explicitly request the use of large
pages, either via explicit kernel calls in the program or from the command
line with support from a library.  However, there is not a static
partitioning of physical memory.  So, for example, when you run the Sun jdk
on Solaris, it explicitly requests large pages for much of its data, and
this works without administrator having to configure the machine for large
page usage.

To the best of my knowledge, Windows is just like Solaris.

Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Superpages on amd64 FreeBSD 7.2-STABLE

2009-12-12 Thread Alan Cox
On Thu, Dec 10, 2009 at 8:50 AM, Bernd Walter ti...@cicely7.cicely.dewrote:

 On Wed, Dec 09, 2009 at 09:07:33AM -0500, John Baldwin wrote:
  On Thursday 26 November 2009 10:14:20 am Linda Messerschmidt wrote:
   It's not clear to me if this might be a problem with the superpages
   implementation, or if squid does something particularly horrible to
   its memory when it forks to cause this, but I wanted to ask about it
   on the list in case somebody who understands it better might know
   whats going on. :-)
 
  I talked with Alan Cox some about this off-list and there is a case that
 can
  cause this behavior if the parent squid process takes write faults on a
  superpage before the child process has called exec() then it can result
 in
  superpages being fragmented and never reassembled.  Using vfork() should
  prevent this from happening.  It is a known issue, but it will probably
 be
  some time before it is addressed.  There is lower hanging fruit in other
 areas
  in the VM that will probably be worked on first.

 For me the whole threads puzzles me.
 Especially because vfork is often called a solution.

 Scenario A
 Parent with super page
 fork/exec
 This problem can happen because there is a race.
 The parent now has it's super pages fragmented permanently!?
 the child throws away his pages because of the exec!?

 Scenario B
 Parent with super page
 vfork/exec
 This problem won't happen because the child has no pseudo copy of the
 parents memory and then starts with a completely new map.

 Scenario C
 Parent with super page
 fork/ no exec
 The problem can happen because the child shares the same memory over
 it's complete lifetime.
 The parent can get it's super pages fragmented over time.


I'm not sure how you are defining problem.  If we define problem as I
would, i.e., that re-promotion can never occur, then Scenario C is not
a problem scenario, only Scenario A is.

The source of the problem in Scenario A is basically that we have two ways
of handling copy-on-write faults.  Before the exec() occurs, copy-on-write
faults are handled as you might intuit from the name, a new physical copy is
made.  If the entirety of the 2MB region is written to before the exec(),
then
this region will be promoted to a superpage.  However, once the exec()
occurs,
copy-on-write faults are optimized.  Specifically, the kernel recognizes
that
the underlying physical page is no longer shared with the child and simply
restores write access to it.  It is the combination of these two methods
that
effectively blocks re-promotion because the underlying 4KB physical pages
within a 2MB region are no longer contiguous.

In other words, once the first page within a region has been copied, you
have
a choice to make: Do you perform avoidable copies or do you abandon the
possibility of ever creating a superpage.  The former has a significant
one-time cost and the latter has a small recurring cost.  Not knowing how
much the latter will add up to, I chose the former.  However, that choice
may change in time, particularly, if I find an effective heuristic for
choosing
between the two options.

Anyway, please keep trying superpages with large memory applications like
this.  Reports like this help me to prioritize my efforts.

Regards,
Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: mmap(2) with MAP_ANON honouring offset although it shouldn't

2009-11-04 Thread Alan Cox

Ed Schouten wrote:

* John Baldwin j...@freebsd.org wrote:
  

Note that the spec doesn't cover MAP_ANON at all FWIW.



Yes. I've noticed Linux also uses MAP_ANONYMOUS instead of MAP_ANON.
They do provide MAP_ANON for compatibility, if I remember correctly.

  


For what it's worth, I believe that Solaris does the exact opposite.  
They provide MAP_ANONYMOUS for compatibility.  It seems like a good idea 
for us to do the same.


We also have an unimplemented option MAP_RENAME defined for 
compatibility with Sun that is nowhere mentioned in modern Solaris 
documentation.


Alan


___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: mmap(2) with MAP_ANON honouring offset although it shouldn't

2009-11-04 Thread Alan Cox

Ed Schouten wrote:

* Alan Cox a...@cs.rice.edu wrote:
  

For what it's worth, I believe that Solaris does the exact opposite.
They provide MAP_ANONYMOUS for compatibility.  It seems like a good
idea for us to do the same.



Something like this?

Index: mman.h
===
--- mman.h  (revision 198919)
+++ mman.h  (working copy)
@@ -82,6 +82,9 @@
  */
 #defineMAP_FILE 0x /* map from file (default) */
 #defineMAP_ANON 0x1000 /* allocated from memory, swap space */
+#ifndef _KERNEL
+#defineMAP_ANONYMOUSMAP_ANON /* For compatibility. */
+#endif /* !_KERNEL */
 
 /*

  * Extended flags

  


Yes.  If no one objects in the next day or so, then please commit this 
change.


Alan

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: mmap(2) with MAP_ANON honouring offset although it shouldn't

2009-10-21 Thread Alan Cox
On Wed, Oct 21, 2009 at 10:51 AM, Alexander Best 
alexbes...@math.uni-muenster.de wrote:

 although the mmap(2) manual states in section MAP_ANON:

 The offset argument is ignored.

 this doesn't seem to be true. running

 printf(%p\n, mmap((void*)0x1000, 0x1000, PROT_NONE, MAP_ANON, -1,
 0x12345678));

 and

 printf(%p\n, mmap((void*)0x1000, 0x1000, PROT_NONE, MAP_ANON, -1, 0));

 produces different outputs. i've attached a patch to solve the problem. the
 patch is similar to the one proposed in this PR, but should apply cleanly
 to
 CURRENT: http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/71258


The standards for mmap(2) actually disallow values of off that are not a
multiple of the page size.

See http://www.opengroup.org/onlinepubs/95399/functions/mmap.html for
the following:
[EINVAL]The *addr* argument (if MAP_FIXED was specified) or *off* is not a
multiple of the page size as returned by
*sysconf*()http://www.opengroup.org/onlinepubs/95399/functions/sysconf.html,
or is considered invalid by the implementation.Both Solaris and Linux
enforce this restriction.

I'm not convinced that the ability to specify a value for off that is not
a multiple of the page size is a useful differentiating feature of FreeBSD
versus Solaris or Linux.  Does anyone have a compelling argument (or use
case) to motivate us being different in this respect?

If you disallow values for off that are not a multiple of the page size,
then you are effectively ignoring off for MAP_ANON.

Regards,
Alan
___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: mmap/munmap with zero length

2009-07-14 Thread Alan Cox

John Baldwin wrote:

On Monday 13 July 2009 3:33:51 pm Tijl Coosemans wrote:
  

On Monday 13 July 2009 20:28:08 John Baldwin wrote:


On Sunday 05 July 2009 3:32:25 am Alexander Best wrote:
  

so mmap differs from the POSIX recommendation right. the malloc.conf
option seems more like a workaround/hack. imo it's confusing to have
mmap und munmap deal differently with len=0. being able to
succesfully alocate memory which cannot be removed doesn't seem
logical to me.


This should fix it:

--- //depot/user/jhb/acpipci/vm/vm_mmap.c
+++ /home/jhb/work/p4/acpipci/vm/vm_mmap.c
@@ -229,7 +229,7 @@

fp = NULL;
/* make sure mapping fits into numeric range etc */
-   if ((ssize_t) uap-len  0 ||
+   if ((ssize_t) uap-len = 0 ||
((flags  MAP_ANON)  uap-fd != -1))
return (EINVAL);
  

Why not uap-len == 0? Sizes of 2GiB and more (32bit) shouldn't cause
an error.



I don't actually disagree and know of locally modified versions of FreeBSD 
that remove this check for precisely that reason.


  


I have no objections to uap-len == 0 (without the cast).

Alan

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: Problem with vm.pmap.shpgperproc and vm.pmap.pv_entry_max

2009-07-05 Thread Alan Cox
On Fri, Jul 3, 2009 at 8:18 AM, c0re dumped ez.c...@gmail.com wrote:

 So, I never had problem with this server, but recently it starts to
 giv me the following messages *every* minute :

 Jul  3 10:04:00 squid kernel: Approaching the limit on PV entries,
 consider increasing either the vm.pmap.shpgperproc or the
 vm.pmap.pv_entry_max tunable.
 Jul  3 10:05:00 squid kernel: Approaching the limit on PV entries,
 consider increasing either the vm.pmap.shpgperproc or the
 vm.pmap.pv_entry_max tunable.
 Jul  3 10:06:00 squid kernel: Approaching the limit on PV entries,
 consider increasing either the vm.pmap.shpgperproc or the
 vm.pmap.pv_entry_max tunable.
 Jul  3 10:07:01 squid kernel: Approaching the limit on PV entries,
 consider increasing either the vm.pmap.shpgperproc or the
 vm.pmap.pv_entry_max tunable.
 Jul  3 10:08:01 squid kernel: Approaching the limit on PV entries,
 consider increasing either the vm.pmap.shpgperproc or the
 vm.pmap.pv_entry_max tunable.
 Jul  3 10:09:01 squid kernel: Approaching the limit on PV entries,
 consider increasing either the vm.pmap.shpgperproc or the
 vm.pmap.pv_entry_max tunable.
 Jul  3 10:10:01 squid kernel: Approaching the limit on PV entries,
 consider increasing either the vm.pmap.shpgperproc or the
 vm.pmap.pv_entry_max tunable.
 Jul  3 10:11:01 squid kernel: Approaching the limit on PV entries,
 consider increasing either the vm.pmap.shpgperproc or the
 vm.pmap.pv_entry_max tunable.

 This server is running Squid + dansguardian. The users are complaining
 about slow navigation and they are driving me crazy !

 Have anyone faced this problem before ?

 Some infos:

 # uname -a
 FreeBSD squid 7.2-RELEASE FreeBSD 7.2-RELEASE #0: Fri May  1 08:49:13
 UTC 2009 r...@walker.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC
 i386

 # sysctl vm
 vm.vmtotal:
 System wide totals computed every five seconds: (values in kilobytes)
 ===
 Processes:  (RUNQ: 1 Disk Wait: 1 Page Wait: 0 Sleep: 230)
 Virtual Memory: (Total: 19174412K, Active 9902152K)
 Real Memory:(Total: 1908080K Active 1715908K)
 Shared Virtual Memory:  (Total: 647372K Active: 10724K)
 Shared Real Memory: (Total: 68092K Active: 4436K)
 Free Memory Pages:  88372K

 vm.loadavg: { 0.96 0.96 1.13 }
 vm.v_free_min: 4896
 vm.v_free_target: 20635
 vm.v_free_reserved: 1051
 vm.v_inactive_target: 30952
 vm.v_cache_min: 20635
 vm.v_cache_max: 41270
 vm.v_pageout_free_min: 34
 vm.pageout_algorithm: 0
 vm.swap_enabled: 1
 vm.kmem_size_scale: 3
 vm.kmem_size_max: 335544320
 vm.kmem_size_min: 0
 vm.kmem_size: 335544320
 vm.nswapdev: 1
 vm.dmmax: 32
 vm.swap_async_max: 4
 vm.zone_count: 84
 vm.swap_idle_threshold2: 10
 vm.swap_idle_threshold1: 2
 vm.exec_map_entries: 16
 vm.stats.misc.zero_page_count: 0
 vm.stats.misc.cnt_prezero: 0
 vm.stats.vm.v_kthreadpages: 0
 vm.stats.vm.v_rforkpages: 0
 vm.stats.vm.v_vforkpages: 340091
 vm.stats.vm.v_forkpages: 3604123
 vm.stats.vm.v_kthreads: 53
 vm.stats.vm.v_rforks: 0
 vm.stats.vm.v_vforks: 2251
 vm.stats.vm.v_forks: 19295
 vm.stats.vm.v_interrupt_free_min: 2
 vm.stats.vm.v_pageout_free_min: 34
 vm.stats.vm.v_cache_max: 41270
 vm.stats.vm.v_cache_min: 20635
 vm.stats.vm.v_cache_count: 5734
 vm.stats.vm.v_inactive_count: 242259
 vm.stats.vm.v_inactive_target: 30952
 vm.stats.vm.v_active_count: 445958
 vm.stats.vm.v_wire_count: 58879
 vm.stats.vm.v_free_count: 16335
 vm.stats.vm.v_free_min: 4896
 vm.stats.vm.v_free_target: 20635
 vm.stats.vm.v_free_reserved: 1051
 vm.stats.vm.v_page_count: 769244
 vm.stats.vm.v_page_size: 4096
 vm.stats.vm.v_tfree: 12442098
 vm.stats.vm.v_pfree: 1657776
 vm.stats.vm.v_dfree: 0
 vm.stats.vm.v_tcached: 253415
 vm.stats.vm.v_pdpages: 254373
 vm.stats.vm.v_pdwakeups: 14
 vm.stats.vm.v_reactivated: 414
 vm.stats.vm.v_intrans: 1912
 vm.stats.vm.v_vnodepgsout: 0
 vm.stats.vm.v_vnodepgsin: 6593
 vm.stats.vm.v_vnodeout: 0
 vm.stats.vm.v_vnodein: 891
 vm.stats.vm.v_swappgsout: 0
 vm.stats.vm.v_swappgsin: 0
 vm.stats.vm.v_swapout: 0
 vm.stats.vm.v_swapin: 0
 vm.stats.vm.v_ozfod: 56314
 vm.stats.vm.v_zfod: 2016628
 vm.stats.vm.v_cow_optim: 1959
 vm.stats.vm.v_cow_faults: 584331
 vm.stats.vm.v_vm_faults: 3661086
 vm.stats.sys.v_soft: 23280645
 vm.stats.sys.v_intr: 18528397
 vm.stats.sys.v_syscall: 1990471112
 vm.stats.sys.v_trap: 8079878
 vm.stats.sys.v_swtch: 105613021
 vm.stats.object.bypasses: 14893
 vm.stats.object.collapses: 55259
 vm.v_free_severe: 2973
 vm.max_proc_mmap: 49344
 vm.old_msync: 0
 vm.msync_flush_flags: 3
 vm.boot_pages: 48
 vm.max_wired: 255475
 vm.pageout_lock_miss: 0
 vm.disable_swapspace_pageouts: 0
 vm.defer_swapspace_pageouts: 0
 vm.swap_idle_enabled: 0
 vm.pageout_stats_interval: 5
 vm.pageout_full_stats_interval: 20
 vm.pageout_stats_max: 20635
 vm.max_launder: 32
 vm.phys_segs:
 SEGMENT 0:

 start: 0x1000
 end:   0x9a000
 free list: 0xc0cca168

 SEGMENT 1:

 start: 0x10
 end:   0x40
 free list: 0xc0cca168

 SEGMENT 2:

 start:

Re: large pages (amd64)

2009-07-03 Thread Alan Cox

Robert Watson wrote:


On Tue, 30 Jun 2009, Mel Flynn wrote:

It looks like sys/kern/kern_proc.c could call mincore around the 
loop at line 1601 (rev 194498), but I know nothing about the vm 
subsystem to know the implications or locking involved. There's 
still 16 bytes of spare to consume, in the kve_vminfo struct though ;)


Yes, to start with, you could replace the call to pmap_extract() 
with a call to pmap_mincore() and export a Boolean to user space 
that says, This region of the address space contains one or more 
superpage mappings.


How about attached?


I like the idea -- there are some style nits that need fixing though. 
Assuming Alan is happy with the VM side of things, I can do the 
cleanup and get it in the tree.


Aside from the style nits, it looks good to me.

Alan

___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: large pages (amd64)

2009-06-30 Thread Alan Cox

Mel Flynn wrote:

On Sunday 28 June 2009 15:41:49 Alan Cox wrote:
  

Wojciech Puchar wrote:



  

how can i check how much (or maybe - what processes) 2MB pages are
actually allocated?
  

I'm afraid that you can't with great precision.  For a given program
execution, on an otherwise idle machine, you can only estimate the
number by looking at the change in the quantity promotions + mappings -
demotions before, during, and after the program execution.

A program can call mincore(2) in order to determine if a virtual address
is part of a 2 or 4MB virtual page.



Would it be possible to expose the super page count as kve_super in the 
kinfo_vmentry struct so that procstat can show this information? If only to 
determine if one is using the feature and possibly benefiting from it.


  


Yes, I think so. 

It looks like sys/kern/kern_proc.c could call mincore around the loop at line 
1601 (rev 194498), but I know nothing about the vm subsystem to know the 
implications or locking involved. There's still 16 bytes of spare to consume, 
in the kve_vminfo struct though ;)
  


Yes, to start with, you could replace the call to pmap_extract() with a 
call to pmap_mincore() and export a Boolean to user space that says, 
This region of the address space contains one or more superpage mappings.


Counting the number of superpages is a little trickier.  The problem 
being that pmap_mincore() doesn't tell you how large the underlying page 
is.  So, the loop at line 1601 can't easily tell where one superpage 
ends and the next 4KB page or superpage begins, making counting the 
number of superpages in a region a little tricky.  One possibility is to 
change pmap_mincore() to return the page size (or the logarithm of the 
page size) rather than a single bit.


If you want to give it a try, I'll be happy to help.  There aren't 
really any implications or synchronization issues that you need to 
worry about.


Alan



___
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org


Re: large pages (amd64)

2009-06-28 Thread Alan Cox
On Sun, Jun 28, 2009 at 12:36 PM, Wojciech Puchar 
woj...@wojtek.tensor.gdynia.pl wrote:

 i enabled
 vm.pmap.pg_ps_enabled: 1


 could you please explain what exactly these values mean?
 because i don't understand why promotions-demotions!=mappings


mappings is not what you seem to think it is.  vm.pmap.pde.mappings is the
number of 2/4MB page mappings that are created directly and not through the
incremental promotion process.  For example, it counts the 2/4MB page
mappings that are created when the text segment of a large executable, e.g.,
gcc, is pre-faulted at startup or when a graphics card's frame buffer is
mmap()ed.

Moreover, not every promoted mapping is demoted.  A promoted mapping may be
destroyed without demotion, for example, when a process exits.  This is, in
fact, the ideal case because it is cheaper to destroy a single 2/4MB page
mapping instead of 512 or 1024 4KB page mappings.
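
A rough userland sketch of that estimate, reading the counters quoted 
below through sysctlbyname(3) (the u_long width is an assumption, and 
error handling is elided):

#include <sys/types.h>
#include <sys/sysctl.h>

static u_long
read_pde_counter(const char *name)
{
        u_long val = 0;
        size_t len = sizeof(val);

        sysctlbyname(name, &val, &len, NULL, 0);
        return (val);
}

static u_long
superpage_snapshot(void)
{
        /* The quantity described above: promotions + mappings - demotions. */
        return (read_pde_counter("vm.pmap.pde.promotions") +
            read_pde_counter("vm.pmap.pde.mappings") -
            read_pde_counter("vm.pmap.pde.demotions"));
}

One snapshot before the program runs and one after; the difference is a 
rough count of the 2/4MB mappings the run created and kept.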



 vm.pmap.pde.promotions: 2703
 vm.pmap.pde.p_failures: 6290
 vm.pmap.pde.mappings: 610
 vm.pmap.pde.demotions: 289




 other question - tried enabling it on my i386 laptop (256 megs ram), always
 mappings==0, while promotions > demotions > 0.


The default starting address for executables on i386 is not aligned to a
2/4MB page boundary.  Hence, mappings are much less likely to occur.


 certainly there are apps that could be put on big pages, gimp editing 40MB
 bitmap for example


Regards,
Alan



Re: large pages (amd64)

2009-06-28 Thread Alan Cox

Wojciech Puchar wrote:


  other question - tried enabling it on my i386 laptop (256 megs 
ram), always

  mappings==0, while promotions > demotions > 0.


The default starting address for executables on i386 is not aligned 
to a 2/4MB page boundary. Hence, mappings are much less likely to 
occur.



  certainly there are apps that could be put on big pages, gimp 
editing 40MB bitmap for

  example


Regards,
Alan



how can i check how many (or maybe - what processes) 2MB pages are 
actually allocated?


I'm afraid that you can't with great precision.  For a given program 
execution, on an otherwise idle machine, you can only estimate the 
number by looking at the change in the quantity promotions + mappings - 
demotions before, during, and after the program execution.


A program can call mincore(2) in order to determine if a virtual address 
is part of a 2 or 4MB virtual page.


Alan



Re: large pages (amd64)

2009-06-28 Thread Alan Cox
On Sun, Jun 28, 2009 at 7:14 PM, Nathanael Hoyle nho...@hoyletech.com wrote:

 Wojciech Puchar wrote:

 i enabled
 vm.pmap.pg_ps_enabled: 1


  could you please explain what exactly these values mean?
 because i don't understand why promotions-demotions!=mappings

 vm.pmap.pde.promotions: 2703
 vm.pmap.pde.p_failures: 6290
 vm.pmap.pde.mappings: 610
 vm.pmap.pde.demotions: 289




 other question - tried enabling it on my i386 laptop (256 megs ram),
  always mappings==0, while promotions > demotions > 0.

 certainly there are apps that could be put on big pages, gimp editing 40MB
 bitmap for example


  Just to be clear, since you say i386 (I presume you mean the architecture),
  I believe the Physical Address Extensions which allowed the 2MB Page Size
  bit to be set were introduced with the Pentium Pro. Processors prior to
  this were limited to standard 4KB pages.


No.  Many of those processors supported 4MB pages.

Regards,
Alan


Re: large pages (amd64)

2009-06-28 Thread Alan Cox

Nathanael Hoyle wrote:
[snip]
Having been corrected by both you and Joerg (thank you!), I went back 
to re-verify my understanding. It appears that while I was slightly 
mixing PAE in with PSE, PSE support for 4MB pages was introduced 
'silently' with the Pentium, and documented first with the Pentium 
Pro.  I haven't found anything that points to earlier inclusion.  The 
80386 specifically, I am fairly confident, was limited to 4KB pages.


Agreed? Or are you aware of earlier usage than the Pentium for 4MB pages?



Yes, I agree.  I'm not aware of 4MB page support before the Pentium.

Regards,
Alan



Re: Increasing KVM on amd64

2008-09-10 Thread Alan Cox

Artem Belevich wrote:


Alan,

Thanks a lot for the patch. I've applied it to RELENG_7 and it seems
to work great - make -j8 buildworld succeeds, linux emulation seems
to work well enough to run linux-sun-jdk14 binaries, ZFS ARC size is
bigger, too. So far I didn't see any ZFS-related KVM shortages either.

The only problem is that everything is fine as long as vm.kmem_size is
set to less than or equal to 4096M. As soon as I set it to 4100M or
anything larger, the kernel crashes on startup. I'm unable to capture
the exact crash messages as they keep scrolling really fast on the screen
for a few seconds until the box reboots. Unfortunately the box does
not have built-in serial ports, so the messages are gone before I can
see them. :-(

 



There are two underlying causes.  First, the size of the kmem map, which 
holds the kernel's heap, is recorded in a 32-bit int.  So, setting 
vm.kmem_size to 4100M is leading to integer overflow.  The following 
change addresses this issue:


sys/kern/kern_malloc.c, revision 1.167 (SVN r180308), committed
Sat Jul 5 19:34:33 2008 UTC by alc:
http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/kern/kern_malloc.c?rev=1.167

Changes since revision 1.166: +11 -11 lines

Enable the creation of a kmem map larger than 4GB.
Submitted by: Tz-Huan Huang

Make several variables related to kmem map auto-sizing static.
Found by: CScout


Second, there is no room for a kmem map greater than 4GB unless the 
overall KVM size is greater than 6GB.  Specifically, a kmem map greater 
than 4GB isn't possible with 6GB KVM because the kmem map would overlap 
the kernel's code, data, and bss segment.
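
The first cause is easy to see in isolation.  A minimal standalone sketch 
of the wraparound (the kernel variable in question was a 32-bit int):

#include <stdint.h>
#include <stdio.h>

int
main(void)
{
        uint64_t requested = 4100ULL * 1024 * 1024;     /* vm.kmem_size=4100M */
        uint32_t stored = (uint32_t)requested;          /* what a 32-bit int keeps */

        /* 4100M is larger than 2^32, so the stored value has wrapped. */
        printf("requested %llu, stored %u\n",
            (unsigned long long)requested, stored);
        return (0);
}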


If you're able to apply the above kern_malloc.c change to your kernel, 
then I should be able to describe how to increase your KVM beyond 6GB.


Regards,
Alan



Re: Increasing KVM on amd64

2008-06-08 Thread Alan Cox

Tz-Huan Huang wrote:


On Sun, Jun 8, 2008 at 7:59 AM, Alan Cox [EMAIL PROTECTED] wrote:
 


You can download a patch from
http://www.cs.rice.edu/~alc/amd64_kvm_6GB.patch that increases amd64's
kernel virtual address space to 6GB.  This patch also increases the default
for the kmem map to almost 2GB.  I believe that kernel loadable modules
still work.  However, I suspect that mini-dumps are broken.

I don't plan on committing this patch in its current form.  Some of the
changes are done in a hackish way.  I am, however, curious to hear whether
or not it works for you.
   



Thanks for the patch. I applied it on 7-stable but it failed on pmap.c.

 


[snip]


We have no machine running 8-current with more than 6G memory now...

 



Sorry, at this point the patch is only applicable to HEAD.  That said, 
the failed chunk is probably easily applied by hand to RELENG_7.


Thanks,
Alan



Increasing KVM on amd64

2008-06-07 Thread Alan Cox
You can download a patch from 
http://www.cs.rice.edu/~alc/amd64_kvm_6GB.patch that increases amd64's 
kernel virtual address space to 6GB.  This patch also increases the 
default for the kmem map to almost 2GB.  I believe that kernel loadable 
modules still work.  However, I suspect that mini-dumps are broken.


I don't plan on committing this patch in its current form.  Some of the 
changes are done in a hackish way.  I am, however, curious to hear 
whether or not it works for you.


Alan



Re: Increasing KVM on amd64

2008-06-07 Thread Alan Cox

Kostik Belousov wrote:


On Sat, Jun 07, 2008 at 06:59:35PM -0500, Alan Cox wrote:
 

You can download a patch from 
http://www.cs.rice.edu/~alc/amd64_kvm_6GB.patch that increases amd64's 
kernel virtual address space to 6GB.  This patch also increases the 
default for the kmem map to almost 2GB.  I believe that kernel loadable 
modules still work.  However, I suspect that mini-dumps are broken.
   

The amd64 modules' text/data/bss virtual addresses are allocated from the 
kernel_map (link_elf_obj.c, link_elf_load_file). Now, the lower end

of the kernel_map is the top of the address space minus 6GB.

Kernel code (both the kernel proper and modules) is compiled for the
kernel code model; according to the gcc info:
`-mcmodel=kernel'
Generate code for the kernel code model.  The kernel runs in the
negative 2 GB of the address space.  This model has to be used for
Linux kernel code.

I suspect we have a problem there.

 



The change to link_elf_obj.c is supposed to ensure allocation of an 
address in the upper 2GB of the kernel map for the module.
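
For concreteness, the constraint the gcc info describes can be written as 
a tiny standalone check: under -mcmodel=kernel, code and data must be 
reachable through sign-extended 32-bit addresses, i.e. live in the top 
(negative) 2GB of the address space:

#include <stdbool.h>
#include <stdint.h>

/*
 * An address is addressable under -mcmodel=kernel only if it fits in a
 * sign-extended 32-bit immediate, i.e. is at or above 0xffffffff80000000.
 */
static bool
reachable_by_kernel_code_model(uint64_t va)
{
        return (va >= 0xffffffff80000000ULL);
}

This is why module text/data/bss must come from the upper 2GB of the 
kernel map once the map's lower end drops below that boundary.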


Alan



Re: possibly missed wakeup in swap_pager_getpages()

2007-03-09 Thread Alan Cox

Divacky Roman wrote:


hi

 


[snip]


is my analysis correct? if so, can the race be mitigated by moving the flag 
setting (hence
also the locking) after the msleep()?

 



No.  When the wakeup is performed, the VPO_SWAPINPROG flag is also 
cleared.  Both operations occur in the I/O completion handler with the 
object locked.  Thus, if the I/O completion handler runs first, the 
msleep on the page within the while loop will not occur because the 
page's VPO_SWAPINPROG flag is already cleared.
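
As a simplified sketch of that pattern (not the actual swap_pager code; 
the lock macro names follow the 7.x-era object lock interface):

/* Waiter side: flag test and sleep both happen under the object lock;
 * msleep() releases the lock atomically and reacquires it on wakeup. */
VM_OBJECT_LOCK(object);
while ((m->oflags & VPO_SWAPINPROG) != 0)
        msleep(m, VM_OBJECT_MTX(object), PVM, "swread", 0);
VM_OBJECT_UNLOCK(object);

/* I/O completion handler, also under the object lock: */
m->oflags &= ~VPO_SWAPINPROG;
wakeup(m);

Because both sides hold the same lock, the completion handler cannot clear 
the flag and issue the wakeup between the waiter's test and its sleep.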


Regards,
Alan



Re: Possible problems with mmap/munmap on FreeBSD ...

2005-03-30 Thread Alan Cox
On Tue, Mar 29, 2005 at 09:18:32PM -0500, David Schultz wrote:
 On Tue, Mar 29, 2005, Richard Sharpe wrote:
  Hi,
  
  I am having some problems with the tdb package on FreeBSD 4.6.2 and 4.10.
  
  One of the things the above package does is:
  
 mmap the tdb file to a region of memory
 store stuff in the region (memmov etc).
 when it needs to extend the size of the region {
   munmap the region
   write data at the end of the file
   mmap the region again with a larger size
 }
  
  What I am seeing is that after the munmap the data written to the region
  is gone.
  
  However, if I insert an msync before the munmap, everything is nicely
  coherent. This seems odd (in the sense that it works without the msync
  under Linux).
  
  The region is mmapped with:
  
  mmap(NULL, tdb->map_size,
   PROT_READ|(tdb->read_only? 0:PROT_WRITE),
   MAP_SHARED|MAP_FILE, tdb->fd, 0);
  
  What I notice is that all the calls to mmap return the same address.
  
  A careful reading of the man pages for mmap and munmap does not suggest
  that I am doing anything wrong.
  
  Is it possible that FreeBSD is deferring flushing the dirty data, and then
  forgets to do it when the same starting address is used etc?
 
 It looks like all of the underlying pages are getting invalidated
 in vm_object_page_remove().  This is clearly the right thing to do
 for private mappings, but it seems wrong for shared mappings.
 Perhaps Alan has some insight.

Hmm.  In this code path we don't call vm_object_page_remove() on
vnode-backed objects, only default/swap-backed objects that have no
other mappings that reference them.
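
Richard's empirically working sequence, written out as a sketch 
(remap_larger is a hypothetical helper; on 4.x the msync() before the 
munmap() is what keeps the data coherent):

#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

static void *
remap_larger(void *old, size_t oldlen, int fd, off_t newlen)
{
        if (old != NULL) {
                msync(old, oldlen, MS_SYNC);    /* flush dirty pages first */
                munmap(old, oldlen);
        }
        if (ftruncate(fd, newlen) != 0)         /* extend the file */
                return (MAP_FAILED);
        return (mmap(NULL, (size_t)newlen, PROT_READ | PROT_WRITE,
            MAP_SHARED | MAP_FILE, fd, 0));
}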

Regards,
Alan


Re: contigmalloc(9) rewrite

2004-06-18 Thread Alan Cox
On Fri, Jun 18, 2004 at 04:51:15PM -0400, Brian Fundakowski Feldman wrote:
 On Tue, Jun 15, 2004 at 03:57:09PM -0400, Brian Fundakowski Feldman wrote:
  The patch, which applies to 5-CURRENT, can be found here:
  http://green.homeunix.org/~green/contigmalloc2.patch
  The default is to use the old contigmalloc().  You can set the
  sysctl or loader tunable vm.old_contigmalloc to 0 to enable it.
  
  For anyone that normally runs into failed allocations hot-plugging
  hardware, please try this and see if it helps out.
 
 By the way, I have updated it further to split apart contigmalloc()
 into a separate vm_page_alloc_contig() and mapping function as per
 feedback from Alan Cox and Hiten Pandya.  The operation is still the
 same except for now being able to see memory allocated with it
 in your vmstat(8) -m output. The patch is still at the same location,
 and requires sysctl vm.old_contigmalloc=0 to enable.
 

Why don't you commit the part that makes allocation of physical memory
start from high addresses?

Alan


Re: Update: Debox sendfile modifications

2003-11-05 Thread Alan Cox
On Wed, Nov 05, 2003 at 01:25:43AM -0500, Vivek Pai wrote:
 Mike Silbersack wrote:
 On Tue, 4 Nov 2003, Vivek Pai wrote:
 The one other aspect of this is that sf_bufs mappings are maintained for
 a configurable amount of time, reducing the number of TLB ops. You can
 specify the parameter for how long, ranging from -1 (no coalescing at
 all), 0 (coalesce, but free immediately after last holder release), to
 any other time. Obviously, any value above 0 will increase the amount of
 wired memory at any given point in time, but it's configurable.
 
 Ah, I missed that point.  Did your testing show the caching part of the
 functionality to be significant?
 
 I think it buys us a small gain (a few percent) under static-content
 workloads, and a little less under SpecWeb99, where more time is spent
 in dynamic content. However, it's almost free - the additional
 complexity beyond just coalescing is hooking into the timer to free
 unused mappings.

I think it's reasonable to expect a more pronounced effect on i386
SMP.  In order to maintain TLB coherence, we issue two interprocessor
interrupts _per_page_ transmitted by sendfile(2).

Alan


Re: [Fwd: Some mmap observations compared to Linux 2.6/OpenBSD]

2003-10-26 Thread Alan Cox
On Wed, Oct 22, 2003 at 01:34:01PM +1000, Q wrote:
 It has been suggested that I should direct this question to the VM
 system guru's. Your comments on this would be appreciated.
 

An address space is represented by a data structure that we call a
vm_map.  A vm_map is, in the abstract, an ordered collection of in-use
address ranges.

FreeBSD 4.x implements the vm_map as a doubly-linked list of address
ranges with a hint pointing to the last accessed entry.  Thus, if the
hint fails, the expected time to lookup an in-use address is O(n).
FreeBSD 5.x overlays a balanced binary search tree over this
structure.  This accelerates the lookup to an (amortized) O(log n).
In fact, for the kind of balanced binary search tree that we use, the
last accessed entry is at the root of the tree.  Thus, any locality in
the pattern of lookups will produce even better results.

Linux and OpenBSD are similar to FreeBSD 5.x.  The differences lie in
the details, like the kind of balanced binary tree that is used.

That said, having a balanced binary tree to represent the address
space does NOT inherently make finding unallocated space any faster.
Instead, OpenBSD augments an address space entry with extra
information to speed up this process:

struct vm_map_entry {
...
vaddr_t ownspace;   /* free space after */
vaddr_t space;  /* space in subtree */
...

The same could be done in FreeBSD, and you don't need a red-black tree
in order to do it.
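
A sketch of that augmentation under assumed names (a plain binary search 
tree of entries; ownspace mirrors the OpenBSD field above, maxspace is 
the per-subtree summary):

typedef unsigned long vaddr_t;

struct map_entry {
        struct map_entry *left, *right;
        vaddr_t start, end;     /* in-use range [start, end) */
        vaddr_t ownspace;       /* free space after this entry */
        vaddr_t maxspace;       /* largest free gap in this subtree */
};

static vaddr_t
subtree_maxspace(struct map_entry *e)
{
        return (e != NULL ? e->maxspace : 0);
}

/* Recompute e->maxspace from its children; call on the path back up
 * to the root after any insert, delete, or rotation. */
static void
update_maxspace(struct map_entry *e)
{
        vaddr_t m = e->ownspace;

        if (subtree_maxspace(e->left) > m)
                m = subtree_maxspace(e->left);
        if (subtree_maxspace(e->right) > m)
                m = subtree_maxspace(e->right);
        e->maxspace = m;
}

With maxspace in hand, a vm_map_findspace()-style lookup can descend 
toward the first subtree whose maxspace is large enough, turning the 
O(n) list walk into an O(log n) descent.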

If someone, especially a junior kernel hacker with a commit bit, is
serious about trying this, I'll gladly mentor them.  Recognize,
however, that this approach may produce great results for a
microbenchmark but pessimize other, more interesting workloads such as
buildworld, making it a poor choice.  Nonetheless, I think
we should strive to get better results in this area.

Regards,
Alan

 
 -Forwarded Message-
 From: Q [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Subject: Some mmap observations compared to Linux 2.6/OpenBSD
 Date: Wed, 22 Oct 2003 12:22:35 +1000
 
 As an effort to get more acquainted with the FreeBSD kernel, I have been
 looking through how mmap works. I don't yet understand how it all fits
 together, or of the exact implications things may have in the wild, but
 I have noticed under some synthetic conditions, ie. mmaping small
 non-contiguous pages of a file, mmap allocation scales much more poorly
 on FreeBSD than on OpenBSD and Linux 2.6.
 
 After investigating this further I have observed that vm_map_findspace()
 traverses a linked list to find the next region (O(n) cost), whereas
 OpenBSD and Linux 2.6 both use Red-Black trees for the same purpose
 (O(log n) cost). Profiling the FreeBSD kernel appears to confirm this.
 
 Can someone comment on whether this is something that has been done
 intentionally, or avoided in favour of some other yet to be implemented
 solution? Or is it still on someones todo list.
 
 -- 
 Seeya...Q
 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 
   Quinton Dolan - [EMAIL PROTECTED]
   Gold Coast, QLD, Australia
   Ph: +61 419 729 806
 


Re: why doesn't aio use at_exit(9)?

2000-12-01 Thread Alan Cox

On Fri, Dec 01, 2000 at 02:02:58AM -0800, Alfred Perlstein wrote:
 why doesn't aio use at_exit(9) instead of requiring an explicit
 call in kern_exit.c for aio_rundown?
 

There's no reason that I'm aware of.  Unless you're in a hurry,
I'll add that change to a cleanup patch that I have in the pipe.
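
For reference, a sketch of the at_exit(9) route, assuming the era's 
exitlist_fn interface (a callback taking the exiting struct proc); the 
header choices and the aio_rundown() signature here are assumptions:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/proc.h>

/* Callback invoked for every exiting process. */
static void
aio_proc_exit(struct proc *p)
{
        aio_rundown(p);         /* the cleanup kern_exit.c calls explicitly */
}

/* At AIO initialization, instead of the hardwired call in kern_exit.c: */
static void
aio_register_exit_hook(void)
{
        at_exit(aio_proc_exit);
}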

Alan





Re: why doesn't aio use at_exit(9)?

2000-12-01 Thread Alan Cox

On Fri, Dec 01, 2000 at 12:08:48PM -0800, Alfred Perlstein wrote:
 * Alan Cox [EMAIL PROTECTED] [001201 11:56] wrote:
  On Fri, Dec 01, 2000 at 02:02:58AM -0800, Alfred Perlstein wrote:
   why doesn't aio use at_exit(9) instead of requiring an explicit
   call in kern_exit.c for aio_rundown?
   
  
  There's no reason that I'm aware of.  Unless you're in a hurry,
  I'll add that change to a cleanup patch that I have in the pipe.
 
 Er, how much of a cleanup do you have?  The only work I've done
 so far is to remove all the #ifdef VFS_AIO's in the file, if you
 could commit your cleanup soon it would help. :)
 

If you're already working on converting aio to use at_exit,
go ahead.  It won't interfere with my work.

Alan







Re: page coloring

2000-11-23 Thread Alan Cox

On Thu, Nov 23, 2000 at 12:48:09PM -0800, Mike Smith wrote:
   Isn't the page coloring algoritm in _vm_page_list_find totally bogus?
   
  
  No, it's not.  The comment is, however, misplaced.  It describes
  the behavior of an inline function in vm_page.h, and not the function
  it precedes.
 
 Hrm.  My comment was based on John Dyson's own observations on its 
 behaviour, and other discussions which concluded that the code wasn't 
 flexible enough (hardcoded assumptions on cache organisation, size etc.)
 

Yes, it would be nice if it were "auto-configuring", because many people
who use it misconfigure it.  They configure the number of colors based
upon the size of their cache rather than cache size divided by the degree
of associativity.  (Having too many colors makes it less likely that
you'll get a page of the right color when you ask for one because that
queue will be empty.)
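
In arithmetic terms, with assumed cache figures for illustration:

/* colors = cache size / (associativity * page size) */
#define CACHE_SIZE      (512 * 1024)    /* 512KB L2, assumed */
#define ASSOCIATIVITY   4               /* 4-way, assumed */
#define PAGE_SIZE_BYTES 4096

#define PQ_COLORS (CACHE_SIZE / (ASSOCIATIVITY * PAGE_SIZE_BYTES))  /* 32 */

/* The common misconfiguration, CACHE_SIZE / PAGE_SIZE_BYTES, gives 128
 * colors -- four times too many, leaving most color queues empty. */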

Overall, it's my experience that people have absurd expectations
of page coloring.  They think of it as an optimization.  It's not.
It's better thought of as a necessary evil: Suppose you're writing
a numerical application and either you or your compiler goes
to a lot of trouble to "block" the algorithm for the L2 cache
size.  If the underlying OS doesn't do page coloring, it's likely
that your program will still experience conflict misses despite
your efforts to avoid them.  Furthermore, you'll see a different
number of conflict misses each time you run the program (assuming
the typical random page allocation strategies).  So, the execution
time will vary.  In short, page coloring simply provides a machine
whose performance is more deterministic/predictable.

Alan

P.S.  I noticed that I mistakenly cc'ed my previous message to -current
rather than -hackers.  I've changed it back to -hackers.





Re: minherit(2) API

2000-07-24 Thread Alan Cox

 I think I can change it in NetBSD -- anyone willing to do the
 necessary changes in FreeBSD and OpenBSD?

I'll do it.

Alan





Re: clearing pages in the idle loop

2000-07-19 Thread Alan Cox

Last year, I tried to reproduce some of the claims/results
in this paper on FreeBSD/x86 and couldn't.  I also tried
limiting the idle loop to clearing pages of one particular
color at a time.  That way, the cache lines replaced by
the second page you clear are the cache lines holding
the first page you cleared, and so on for the third,
fourth, ... pages cleared.  Again, I saw no measurable
effect on tests like "buildworld", which is a similar
workload to the paper's if I recall correctly.

Finally, it's possible that having these pre-zeroed pages
in your L2 cache might be beneficial if they get allocated
and used right away.  FreeBSD's idle loop zeroes the pages
that are next in line for allocation.

Alan





Re: Panic in pmap_remove_pages on 4.0-current

1999-08-28 Thread Alan Cox

This exact problem came up last month.  pmap_remove_pages is tripping
over a corrupted page table entry (pte).  Unfortunately, by the time the panic
occurs, pmap_remove_pages has overwritten the corrupted pte with zero.

Earlier this month, I added a KASSERT to detect this problem (and
panic) before the corrupted pte is overwritten.  This KASSERT seems 
to be missing from your kernel.  Could you turn on assertion checking
in your kernel configuration and/or update to a newer kernel?

Alan





Re: patch for behavior changes and madvise MADV_DONTNEED

1999-07-30 Thread Alan Cox

On Fri, Jul 30, 1999 at 09:51:35AM -0700, Matthew Dillon wrote:

 I'm not sure I see how MADV_FREE could slow performance down unless,
 perhaps, it is the overhead of the system call itself.  e.g. if malloc
 is calling it on a page-by-page basis and not implementing any hysteresis.


It's the system call overhead.  Adding (more) hysteresis would reduce
the overhead by some factor, but we'd still be making unnecessary
MADV_FREE calls.  Calling MADV_FREE accomplishes nothing unless
the system is actually short of memory.  The right way to address
this problem is likely to add a mechanism to the VM system that
sends a signal to the process when MADV_FREE would actually be
beneficial.
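
To make the hysteresis point concrete, here is a sketch of what a batching 
allocator could do; the names and the threshold are illustrative, not the 
actual malloc internals:

#include <sys/mman.h>

#define FREE_BATCH      32      /* pages to accumulate before madvise */

static void *free_run_start;
static size_t free_run_len;

static void
page_cache_release(void *page, size_t pagesize)
{
        if (free_run_start != NULL &&
            (char *)free_run_start + free_run_len == (char *)page) {
                /* Contiguous with the current run: just extend it. */
                free_run_len += pagesize;
        } else {
                /* Non-contiguous: flush the old run, start a new one. */
                if (free_run_start != NULL)
                        madvise(free_run_start, free_run_len, MADV_FREE);
                free_run_start = page;
                free_run_len = pagesize;
        }
        if (free_run_len >= FREE_BATCH * pagesize) {
                madvise(free_run_start, free_run_len, MADV_FREE);
                free_run_start = NULL;
                free_run_len = 0;
        }
}

This trades one system call per page for one per batch, but as noted 
above it still makes calls that accomplish nothing when memory isn't 
scarce.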

 2% isn't a big deal.  MADV_FREE theoretically makes a big impact on 
 paging performance in a heavily loaded  paging system.  If your tests
 were run on a system that wasn't paging, you weren't testing the right
 thing.
 

99% of our user base, whose machines aren't thrashing during a "make world"
or other normal operation, shouldn't pay a 2% penalty to produce
a theoretical improvement for the 1% that are.  At best, that's "optimizing"
the infrequent case at the expense of the frequent, not a good trade-off.

In any case, the man page for malloc/free explains how to change
the default, if you're a part of the "oppressed" 1%.

Alan



Re: patch for behavior changes and madvise MADV_DONTNEED

1999-07-29 Thread Alan Cox
Your new behavior flag isn't known by vm_map_simplify_entry, so
map simplification could drop the behavior setting on the floor.
I would prefer that the behavior be folded into eflags.

Overall, I agree with the direction.  Behavior is correctly a map
attribute rather than an object attribute.

Alan

P.S.  The MADV_FREE's by malloc/free were disabled by default in -CURRENT
and -STABLE prior to the release of 3.2.  They were a performance
pessimization, slowing down make world by ~2%.





Re: Improving the Unix API

1999-06-28 Thread Alan Cox
 As far as sysctl goes, FreeBSD deprecates the use of numbers for OIDs and
 has a string-based mechanism for exploring the sysctl tree.

So we are actually both going the same way: Linus with /proc/sys and his
official dislike of sysctl (oh well, I think sysctl using number spaces is
the right idea - like snmp is), and BSD going to names.







Re: Improving the Unix API

1999-06-28 Thread Alan Cox
 As far as I know, only FreeBSD has a string-based sysctl implementation.

Nod.

 Something which always confused me about Linux' procfs - what have all
 these kernel variables got to do with process state?  We used to have a
 kernfs which was intended for this kind of thing but it rotted after
 people started extending sysctl for the purpose.

About as much as having a /usr/bin for the slower binaries on the 40Mbyte
moving head disk has to do with /usr nowadays. /proc is basically
both process and machine state in Linux. It got expanded on.

Alan








Re: 3.2 Freeze date

1999-05-10 Thread Alan Cox
On Mon, May 10, 1999 at 07:33:28PM -0700, Matthew Dillon wrote:
 
 My main interest is the NFS/TCP fixes, which Alan now has a -stable patch
 for.  But it's already the tenth, so if it goes in now the source will 
 then have to be reviewed by more gurus (post-commit).
 

The NFS/TCP realignment patch was checked into -stable last Sat
morning.  Is there anything else?

Alan

