Re: dynamically calculating NKPT [was: Re: huge ktr buffer]
On 02/05/2013 09:45, m...@freebsd.org wrote: On Tue, Feb 5, 2013 at 7:14 AM, Konstantin Belousov kostik...@gmail.com wrote: On Mon, Feb 04, 2013 at 03:05:15PM -0800, Neel Natu wrote: Hi, I have a patch to dynamically calculate NKPT for amd64 kernels. This should fix the various issues that people pointed out in the email thread. Please review and let me know if there are any objections to committing this. Also, thanks to Alan (alc@) for reviewing and providing feedback on the initial version of the patch. Patch (also available at http://people.freebsd.org/~neel/patches/nkpt_diff.txt): Index: sys/amd64/include/pmap.h === --- sys/amd64/include/pmap.h (revision 246277) +++ sys/amd64/include/pmap.h (working copy) @@ -113,13 +113,7 @@ ((unsigned long)(l2) << PDRSHIFT) | \ ((unsigned long)(l1) << PAGE_SHIFT)) -/* Initial number of kernel page tables. */ -#ifndef NKPT -#define NKPT 32 -#endif - #define NKPML4E 1 /* number of kernel PML4 slots */ -#define NKPDPE howmany(NKPT, NPDEPG) /* number of kernel PDP slots */ #define NUPML4E (NPML4EPG/2) /* number of userland PML4 pages */ #define NUPDPE (NUPML4E*NPDPEPG) /* number of userland PDP pages */ @@ -181,6 +175,7 @@ #define PML4map ((pd_entry_t *)(addr_PML4map)) #define PML4pml4e ((pd_entry_t *)(addr_PML4pml4e)) +extern int nkpt; /* Initial number of kernel page tables */ extern u_int64_t KPDPphys; /* physical address of kernel level 3 */ extern u_int64_t KPML4phys; /* physical address of kernel level 4 */ Index: sys/amd64/amd64/minidump_machdep.c === --- sys/amd64/amd64/minidump_machdep.c (revision 246277) +++ sys/amd64/amd64/minidump_machdep.c (working copy) @@ -232,7 +232,7 @@ /* Walk page table pages, set bits in vm_page_dump */ pmapsize = 0; pdp = (uint64_t *)PHYS_TO_DMAP(KPDPphys); - for (va = VM_MIN_KERNEL_ADDRESS; va < MAX(KERNBASE + NKPT * NBPDR, + for (va = VM_MIN_KERNEL_ADDRESS; va < MAX(KERNBASE + nkpt * NBPDR, kernel_vm_end); ) { /* * We always write a page, even if it is zero.
Each @@ -364,7 +364,7 @@ /* Dump kernel page directory pages */ bzero(fakepd, sizeof(fakepd)); pdp = (uint64_t *)PHYS_TO_DMAP(KPDPphys); - for (va = VM_MIN_KERNEL_ADDRESS; va < MAX(KERNBASE + NKPT * NBPDR, + for (va = VM_MIN_KERNEL_ADDRESS; va < MAX(KERNBASE + nkpt * NBPDR, kernel_vm_end); va += NBPDP) { i = (va >> PDPSHIFT) & ((1ul << NPDPEPGSHIFT) - 1); Index: sys/amd64/amd64/pmap.c === --- sys/amd64/amd64/pmap.c (revision 246277) +++ sys/amd64/amd64/pmap.c (working copy) @@ -202,6 +202,10 @@ vm_offset_t virtual_avail; /* VA of first avail page (after kernel bss) */ vm_offset_t virtual_end; /* VA of last avail page (end of kernel AS) */ +int nkpt; +SYSCTL_INT(_machdep, OID_AUTO, nkpt, CTLFLAG_RD, &nkpt, 0, +    "Number of kernel page table pages allocated on bootup"); + static int ndmpdp; static vm_paddr_t dmaplimit; vm_offset_t kernel_vm_end = VM_MIN_KERNEL_ADDRESS; @@ -495,17 +499,42 @@ CTASSERT(powerof2(NDMPML4E)); +/* number of kernel PDP slots */ +#define NKPDPE(ptpgs) howmany((ptpgs), NPDEPG) + static void +nkpt_init(vm_paddr_t addr) +{ + int pt_pages; + +#ifdef NKPT + pt_pages = NKPT; +#else + pt_pages = howmany(addr, 1 << PDRSHIFT); + pt_pages += NKPDPE(pt_pages); + + /* + * Add some slop beyond the bare minimum required for bootstrapping + * the kernel. + * + * This is quite important when allocating KVA for kernel modules. + * The modules are required to be linked in the negative 2GB of + * the address space. If we run out of KVA in this region then + * pmap_growkernel() will need to allocate page table pages to map + * the entire 512GB of KVA space which is an unnecessary tax on + * physical memory. + */ + pt_pages += 4; /* 8MB additional slop for kernel modules */ 8MB might be too low. I just checked one of my machines with a fully modularized kernel; it takes slightly more than 6 MB to load 50 modules. I think that 16MB would be safer, but it probably needs to be scaled down based on the available physical memory. An amd64 kernel can still be booted on a 128MB machine.
Is there no way to not map the entire 512GB? Otherwise this patch could really hose some vendors. E.g. the kernel module for the OneFS file system is around 8MB all by itself. Mapping the entire 512 GB from the start would require the preallocation of 1 GB of memory for page table pages. I found when we moved from
Re: dynamically calculating NKPT [was: Re: huge ktr buffer]
On 02/05/2013 10:13, Konstantin Belousov wrote: On Tue, Feb 05, 2013 at 07:45:24AM -0800, m...@freebsd.org wrote: On Tue, Feb 5, 2013 at 7:14 AM, Konstantin Belousov kostik...@gmail.com wrote: On Mon, Feb 04, 2013 at 03:05:15PM -0800, Neel Natu wrote: Hi, I have a patch to dynamically calculate NKPT for amd64 kernels. This should fix the various issues that people pointed out in the email thread. Please review and let me know if there are any objections to committing this. Also, thanks to Alan (alc@) for reviewing and providing feedback on the initial version of the patch. Patch (also available at http://people.freebsd.org/~neel/patches/nkpt_diff.txt): Index: sys/amd64/include/pmap.h === --- sys/amd64/include/pmap.h (revision 246277) +++ sys/amd64/include/pmap.h (working copy) @@ -113,13 +113,7 @@ ((unsigned long)(l2) << PDRSHIFT) | \ ((unsigned long)(l1) << PAGE_SHIFT)) -/* Initial number of kernel page tables. */ -#ifndef NKPT -#define NKPT 32 -#endif - #define NKPML4E 1 /* number of kernel PML4 slots */ -#define NKPDPE howmany(NKPT, NPDEPG) /* number of kernel PDP slots */ #define NUPML4E (NPML4EPG/2) /* number of userland PML4 pages */ #define NUPDPE (NUPML4E*NPDPEPG) /* number of userland PDP pages */ @@ -181,6 +175,7 @@ #define PML4map ((pd_entry_t *)(addr_PML4map)) #define PML4pml4e ((pd_entry_t *)(addr_PML4pml4e)) +extern int nkpt; /* Initial number of kernel page tables */ extern u_int64_t KPDPphys; /* physical address of kernel level 3 */ extern u_int64_t KPML4phys; /* physical address of kernel level 4 */ Index: sys/amd64/amd64/minidump_machdep.c === --- sys/amd64/amd64/minidump_machdep.c (revision 246277) +++ sys/amd64/amd64/minidump_machdep.c (working copy) @@ -232,7 +232,7 @@ /* Walk page table pages, set bits in vm_page_dump */ pmapsize = 0; pdp = (uint64_t *)PHYS_TO_DMAP(KPDPphys); - for (va = VM_MIN_KERNEL_ADDRESS; va < MAX(KERNBASE + NKPT * NBPDR, + for (va = VM_MIN_KERNEL_ADDRESS; va < MAX(KERNBASE + nkpt * NBPDR, kernel_vm_end); ) { /* * We always write a page,
even if it is zero. Each @@ -364,7 +364,7 @@ /* Dump kernel page directory pages */ bzero(fakepd, sizeof(fakepd)); pdp = (uint64_t *)PHYS_TO_DMAP(KPDPphys); - for (va = VM_MIN_KERNEL_ADDRESS; va < MAX(KERNBASE + NKPT * NBPDR, + for (va = VM_MIN_KERNEL_ADDRESS; va < MAX(KERNBASE + nkpt * NBPDR, kernel_vm_end); va += NBPDP) { i = (va >> PDPSHIFT) & ((1ul << NPDPEPGSHIFT) - 1); Index: sys/amd64/amd64/pmap.c === --- sys/amd64/amd64/pmap.c (revision 246277) +++ sys/amd64/amd64/pmap.c (working copy) @@ -202,6 +202,10 @@ vm_offset_t virtual_avail; /* VA of first avail page (after kernel bss) */ vm_offset_t virtual_end; /* VA of last avail page (end of kernel AS) */ +int nkpt; +SYSCTL_INT(_machdep, OID_AUTO, nkpt, CTLFLAG_RD, &nkpt, 0, +    "Number of kernel page table pages allocated on bootup"); + static int ndmpdp; static vm_paddr_t dmaplimit; vm_offset_t kernel_vm_end = VM_MIN_KERNEL_ADDRESS; @@ -495,17 +499,42 @@ CTASSERT(powerof2(NDMPML4E)); +/* number of kernel PDP slots */ +#define NKPDPE(ptpgs) howmany((ptpgs), NPDEPG) + static void +nkpt_init(vm_paddr_t addr) +{ + int pt_pages; + +#ifdef NKPT + pt_pages = NKPT; +#else + pt_pages = howmany(addr, 1 << PDRSHIFT); + pt_pages += NKPDPE(pt_pages); + + /* + * Add some slop beyond the bare minimum required for bootstrapping + * the kernel. + * + * This is quite important when allocating KVA for kernel modules. + * The modules are required to be linked in the negative 2GB of + * the address space. If we run out of KVA in this region then + * pmap_growkernel() will need to allocate page table pages to map + * the entire 512GB of KVA space which is an unnecessary tax on + * physical memory. + */ + pt_pages += 4; /* 8MB additional slop for kernel modules */ 8MB might be too low. I just checked one of my machines with a fully modularized kernel; it takes slightly more than 6 MB to load 50 modules. I think that 16MB would be safer, but it probably needs to be scaled down based on the available physical memory.
An amd64 kernel can still be booted on a 128MB machine. Is there no way to not map the entire 512GB? Otherwise this patch could really hose some vendors. E.g. the kernel module for the OneFS file system is around 8MB all by itself. No, I do not think that this patch would hose somebody with the 8MB
Re: kmem_map auto-sizing and size dependencies
I'll follow up with detailed answers to your questions over the weekend. For now, I will, however, point out that you've misinterpreted the tunables. In fact, they say that your kmem map can hold up to 16GB and the current used space is about 58MB. Like other things, the kmem map is auto-sized based on the available physical memory and capped so as not to consume too much of the overall kernel address space. Regards, Alan On Fri, Jan 18, 2013 at 9:29 AM, Andre Oppermann an...@freebsd.org wrote: The autotuning work is reaching into many places of the kernel and while trying to tie up all loose ends I've got stuck in the kmem_map and how it works or what its limitations are. During startup the VM is initialized and an initial kernel virtual memory map is set up in kmem_init() covering the entire KVM address range. Only the kernel itself is actually allocated within that map. A bit later on a number of other submaps are allocated (clean_map, buffer_map, pager_map, exec_map). Also in kmeminit() (in kern_malloc.c, different from kmem_init) the kmem_map is allocated. The (initial?) size of the kmem_map is determined by some voodoo magic, a sprinkle of nmbclusters * PAGE_SIZE incrementor and lots of tunables. However it seems to work out to an effective kmem_map_size of about 58MB on my 16GB AMD64 dev machine: vm.kvm_size: 549755809792 vm.kvm_free: 530233421824 vm.kmem_size: 16,594,300,928 vm.kmem_size_min: 0 vm.kmem_size_max: 329,853,485,875 vm.kmem_size_scale: 1 vm.kmem_map_size: 59,518,976 vm.kmem_map_free: 16,534,777,856 The kmem_map serves kernel malloc (via UMA), contigmalloc and everything else that uses UMA for memory allocation. Mbuf memory too is managed by UMA which obtains the backing kernel memory from the kmem_map. The limits of the various mbuf memory types have been considerably raised recently and may make use of 50-75% of all physically present memory, or available KVM space, whichever is smaller.
Now my questions/comments are: Does the kmem_map automatically extend itself if more memory is requested? Should it be set to a larger initial value based on min(physical,KVM) space available? The use of nmbclusters for the initial kmem_map size calculation isn't appropriate anymore due to it being set up later, and nmbclusters isn't the only relevant mbuf type. We make significant use of page sized mbuf clusters too. The naming and output of the various vm.kmem_* and vm.kvm_* sysctls is confusing and not easy to reconcile. Either we need more detail on more aspects, or less. Plus perhaps sysctl subtrees to better describe the hierarchy of the maps. Why are separate kmem submaps being used? Is it to limit memory usage of certain subsystems? Are those limits actually enforced? -- Andre ___ freebsd-current@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-current To unsubscribe, send any mail to freebsd-current-unsubscribe@freebsd.org ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscribe@freebsd.org
Re: huge ktr buffer
On 12/06/2012 09:43, Davide Italiano wrote: On Thu, Dec 6, 2012 at 4:18 PM, Andriy Gapon a...@freebsd.org wrote: So I configured a kernel with the following option: options KTR_ENTRIES=(1024UL*1024) then booted the kernel and did $ sysctl debug.ktr.clear=1 and got an insta-reboot. No panic, nothing, just a reset. I suspect that the huge static buffer resulting from the above option could be a cause. But I would like to understand the details, if possible. Also, perhaps ktr could be a little bit more sophisticated with its buffer than just using a static array. -- Andriy Gapon It was a while ago, but running r238886 built using the following kernel configuration file: http://people.freebsd.org/~davide/DEBUG I found a similar issue. The machine panicked: fatal trap 12 with interrupts disabled in early boot (even before the appearance of the Berkeley logo). Basically, my configuration file is just GENERIC with slight modifications, in particular debugging options (WITNESS, INVARIANTS, etc..) turned on and the following KTR options enabled: options KTR options KTR_COMPILE=(KTR_CALLOUT|KTR_PROC) options KTR_MASK=(KTR_CALLOUT|KTR_PROC) options KTR_ENTRIES=524288 It seems the issue is related to KTR itself, and in particular to the value of KTR_ENTRIES. As long as this value is small (e.g. 2048) everything works fine and the boot sequence completes. If I choose 524288 (the value you can also see from the kernel conf file) the fatal trap occurs.
Even though it was really difficult for me to get much information because the failure happens so early, I put some printf() calls within the code and I isolated the point at which the kernel dies: (sys/amd64/amd64/machdep.c, in getmemsize()) 1540 /* 1541 * map page into kernel: valid, read/write, non-cacheable 1542 */ 1543 *pte = pa | PG_V | PG_RW | PG_N; As Alan also suggested, a way to work around the problem is to increase the NKPT value (e.g. from 32 to 64). Obviously, this is not a proper fix. For a proper fix the kernel needs to be able to dynamically set the size of NKPT. In this particular case, this wouldn't be too hard, but there is a different case, where people preload a large memory disk image at boot time, that isn't so easy to fix. Andriy makes a good suggestion. One that I think should be easy to implement. The KTR code already supports the use of a dynamically allocated KTR buffer. (See sysctl_debug_ktr_entries().) Let's take advantage of this. Place a cap on the size of the (compile-time) statically allocated buffer. However, use this buffer early in the kernel initialization process, specifically, up until SI_ORDER_KMEM has completed. At that point, switch to a dynamically allocated buffer and copy over the entries from the statically allocated buffer. Relatively speaking, SI_ORDER_KMEM is early enough in the boot process that I doubt many people wanting an enormous KTR buffer will be impacted by the cap. In fact, I think you could implement overflow detection without pessimizing the KTR code. Alan P.S. There are other reasons that having an enormous statically allocated array in the kernel is undesirable. The first that comes to mind is that it eats up memory at low physical addresses, which is sometimes needed for special purposes. So, I think there are good reasons besides the NKPT issue to shift the KTR code to dynamic allocation.
Re: Memory reserves or lack thereof
On 11/13/2012 05:54, Konstantin Belousov wrote: On Mon, Nov 12, 2012 at 05:10:01PM -0600, Alan Cox wrote: On 11/12/2012 3:48 PM, Konstantin Belousov wrote: On Mon, Nov 12, 2012 at 01:28:02PM -0800, Sushanth Rai wrote: This patch still doesn't address the issue of M_NOWAIT calls driving memory all the way down to 2 pages, right? It would be nice to have M_NOWAIT just do a non-sleep version of M_WAITOK and an M_USE_RESERVE flag to dig deep. This is out of scope of the change. But it is required for any further adjustments. I would suggest a somewhat different response: The patch does make M_NOWAIT into a non-sleep version of M_WAITOK and does reintroduce M_USE_RESERVE as a way to specify dig deep. Currently, both M_NOWAIT and M_WAITOK can drive the cache/free memory down to two pages. The effect of the patch is to stop M_NOWAIT at two pages rather than allowing it to continue to zero pages. When you say, This is out of scope ..., I believe that you are referring to changing two pages into something larger. I agree that this is out of scope for the current change. I referred exactly to the difference between M_USE_RESERVE set or not. IMO this is what was asked by the question author. So yes, what I mean by 'out of scope' is about tweaking the 'two pages reserve' in some way. Since M_USE_RESERVE is no longer deprecated in HEAD, here is my proposed man page update to malloc(9): Index: share/man/man9/malloc.9 === --- share/man/man9/malloc.9 (revision 243091) +++ share/man/man9/malloc.9 (working copy) @@ -29,7 +29,7 @@ .\" $NetBSD: malloc.9,v 1.3 1996/11/11 00:05:11 lukem Exp $ .\" $FreeBSD$ .\" -.Dd January 28, 2012 +.Dd November 15, 2012 .Dt MALLOC 9 .Os .Sh NAME @@ -153,13 +153,12 @@ if .Dv M_WAITOK is specified. .It Dv M_USE_RESERVE -Indicates that the system can dig into its reserve in order to obtain the -requested memory. -This option used to be called -.Dv M_KERNEL -but has been renamed to something more obvious.
-This option has been deprecated and is slowly being removed from the kernel, -and so should not be used with any new programming. +Indicates that the system can use its reserve of memory to satisfy the +request. +This option should only be used in combination with +.Dv M_NOWAIT +when an allocation failure cannot be tolerated by the caller without +catastrophic effects on the system. .El .Pp Exactly one of either
Re: Memory reserves or lack thereof
On 11/15/2012 12:21, Konstantin Belousov wrote: On Thu, Nov 15, 2012 at 11:32:18AM -0600, Alan Cox wrote: On 11/13/2012 05:54, Konstantin Belousov wrote: On Mon, Nov 12, 2012 at 05:10:01PM -0600, Alan Cox wrote: On 11/12/2012 3:48 PM, Konstantin Belousov wrote: On Mon, Nov 12, 2012 at 01:28:02PM -0800, Sushanth Rai wrote: This patch still doesn't address the issue of M_NOWAIT calls driving memory all the way down to 2 pages, right? It would be nice to have M_NOWAIT just do a non-sleep version of M_WAITOK and an M_USE_RESERVE flag to dig deep. This is out of scope of the change. But it is required for any further adjustments. I would suggest a somewhat different response: The patch does make M_NOWAIT into a non-sleep version of M_WAITOK and does reintroduce M_USE_RESERVE as a way to specify dig deep. Currently, both M_NOWAIT and M_WAITOK can drive the cache/free memory down to two pages. The effect of the patch is to stop M_NOWAIT at two pages rather than allowing it to continue to zero pages. When you say, This is out of scope ..., I believe that you are referring to changing two pages into something larger. I agree that this is out of scope for the current change. I referred exactly to the difference between M_USE_RESERVE set or not. IMO this is what was asked by the question author. So yes, what I mean by 'out of scope' is about tweaking the 'two pages reserve' in some way. Since M_USE_RESERVE is no longer deprecated in HEAD, here is my proposed man page update to malloc(9): Index: share/man/man9/malloc.9 === --- share/man/man9/malloc.9 (revision 243091) +++ share/man/man9/malloc.9 (working copy) @@ -29,7 +29,7 @@ .\" $NetBSD: malloc.9,v 1.3 1996/11/11 00:05:11 lukem Exp $ .\" $FreeBSD$ .\" -.Dd January 28, 2012 +.Dd November 15, 2012 .Dt MALLOC 9 .Os .Sh NAME @@ -153,13 +153,12 @@ if .Dv M_WAITOK is specified. .It Dv M_USE_RESERVE -Indicates that the system can dig into its reserve in order to obtain the -requested memory.
-This option used to be called -.Dv M_KERNEL -but has been renamed to something more obvious. -This option has been deprecated and is slowly being removed from the kernel, -and so should not be used with any new programming. +Indicates that the system can use its reserve of memory to satisfy the +request. +This option should only be used in combination with +.Dv M_NOWAIT +when an allocation failure cannot be tolerated by the caller without +catastrophic effects on the system. .El .Pp Exactly one of either The text looks fine. Shouldn't the requirement for M_USE_RESERVE be also expressed in a KASSERT, like this: diff --git a/sys/vm/vm_page.h b/sys/vm/vm_page.h index d9e4692..f8a4f70 100644 --- a/sys/vm/vm_page.h +++ b/sys/vm/vm_page.h @@ -353,6 +351,9 @@ malloc2vm_flags(int malloc_flags) { int pflags; + KASSERT((malloc_flags & M_USE_RESERVE) == 0 || + (malloc_flags & M_NOWAIT) != 0, + ("M_USE_RESERVE requires M_NOWAIT")); pflags = (malloc_flags & M_USE_RESERVE) != 0 ? VM_ALLOC_INTERRUPT : VM_ALLOC_SYSTEM; if ((malloc_flags & M_ZERO) != 0) I understand that this could be added at the allocator entry points, but I think that the page allocation path is fine too. Yes, please do that. Alan
Re: Memory reserves or lack thereof
On 11/12/2012 11:35, Alan Cox wrote: On 11/12/2012 07:36, Konstantin Belousov wrote: On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote: On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov kostik...@gmail.com wrote: On Fri, Nov 09, 2012 at 07:10:04PM +0000, Sears, Steven wrote: I have a memory subsystem design question that I'm hoping someone can answer. I've been looking at a machine that is completely out of memory, as in v_free_count = 0, v_cache_count = 0. I wondered how a machine could completely run out of memory like this, especially after finding a lack of interrupt storms or other pathologies that would tend to overcommit memory. So I started investigating. Most allocators come down to vm_page_alloc(), which has this guard: if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) { page_req = VM_ALLOC_SYSTEM; }; if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved || (page_req == VM_ALLOC_SYSTEM && cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) || (page_req == VM_ALLOC_INTERRUPT && cnt.v_free_count + cnt.v_cache_count > 0)) { The key observation is if VM_ALLOC_INTERRUPT is set, it will allocate every last page. From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare, perhaps only used from interrupt threads. Not so, see kmem_malloc() or uma_small_alloc() which both contain this mapping: if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT) pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED; else pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED; Note that M_USE_RESERVE has been deprecated and is used in just a handful of places. Also note that lots of code paths come through these routines. What this means is essentially _any_ allocation using M_NOWAIT will bypass whatever reserves have been held back and will take every last page available. There is no documentation stating M_NOWAIT has this side effect of essentially being privileged, so any innocuous piece of code that can't block will use it.
And of course M_NOWAIT is literally used all over. It looks to me like the design goal of the BSD allocators is on recovery; it will give all pages away knowing it can recover. Am I missing anything? I would have expected some small number of pages to be held in reserve just in case. And I didn't expect M_NOWAIT to be a sort of back door for grabbing memory. Your analysis is right, there is nothing to add or correct. This is the reason to strongly prefer M_WAITOK. Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was well understood that it should only be used by interrupt handlers. The trouble is that M_NOWAIT conflates two orthogonal things. The obvious being that the allocation shouldn't sleep. The other being how far we're willing to deplete the cache/free page queues. When fine-grained locking got sprinkled throughout the kernel, we all too often found ourselves wanting to do allocations without the possibility of blocking. So, M_NOWAIT became commonplace, where it wasn't before. This had the unintended consequence of introducing a lot of memory allocations in the top-half of the kernel, i.e., non-interrupt handling code, that were digging deep into the cache/free page queues. Also, ironically, in today's kernel an M_NOWAIT | M_USE_RESERVE allocation is less likely to succeed than an M_NOWAIT allocation. However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it could only allocate a free page. M_USE_RESERVE said that it was ok to allocate a cached page even though M_NOWAIT was specified. Consequently, the system wouldn't dig as far into the free page queue if M_USE_RESERVE was specified, because it was allowed to reclaim a cached page. In conclusion, I think it's time that we change M_NOWAIT so that it doesn't dig any deeper into the cache/free page queues than M_WAITOK does and reintroduce a M_USE_RESERVE-like flag that says dig deep into the cache/free page queues.
The trouble is that we then need to identify all of those places that are implicitly depending on the current behavior of M_NOWAIT also digging deep into the cache/free page queues so that we can add an explicit M_USE_RESERVE. Alan P.S. I suspect that we should also increase the size of the page reserve that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*(). How many legitimate users of a new M_USE_RESERVE-like flag in today's kernel could actually be satisfied by two pages? I am almost sure that most people who use the M_NOWAIT flag do not know about the 'allow the deeper drain of the free queue' effect. As such, I believe we should flip the meaning of M_NOWAIT/M_USE_RESERVE. The only problematic places I expect would be in the swapout path. I found a single explicit use of M_USE_RESERVE in the kernel, so the flip is relatively simple
Re: Memory reserves or lack thereof
On 11/12/2012 07:36, Konstantin Belousov wrote: On Sun, Nov 11, 2012 at 03:40:24PM -0600, Alan Cox wrote: On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov kostik...@gmail.com wrote: On Fri, Nov 09, 2012 at 07:10:04PM +0000, Sears, Steven wrote: I have a memory subsystem design question that I'm hoping someone can answer. I've been looking at a machine that is completely out of memory, as in v_free_count = 0, v_cache_count = 0. I wondered how a machine could completely run out of memory like this, especially after finding a lack of interrupt storms or other pathologies that would tend to overcommit memory. So I started investigating. Most allocators come down to vm_page_alloc(), which has this guard: if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) { page_req = VM_ALLOC_SYSTEM; }; if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved || (page_req == VM_ALLOC_SYSTEM && cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) || (page_req == VM_ALLOC_INTERRUPT && cnt.v_free_count + cnt.v_cache_count > 0)) { The key observation is if VM_ALLOC_INTERRUPT is set, it will allocate every last page. From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare, perhaps only used from interrupt threads. Not so, see kmem_malloc() or uma_small_alloc() which both contain this mapping: if ((flags & (M_NOWAIT|M_USE_RESERVE)) == M_NOWAIT) pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED; else pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED; Note that M_USE_RESERVE has been deprecated and is used in just a handful of places. Also note that lots of code paths come through these routines. What this means is essentially _any_ allocation using M_NOWAIT will bypass whatever reserves have been held back and will take every last page available. There is no documentation stating M_NOWAIT has this side effect of essentially being privileged, so any innocuous piece of code that can't block will use it. And of course M_NOWAIT is literally used all over.
It looks to me like the design goal of the BSD allocators is on recovery; it will give all pages away knowing it can recover. Am I missing anything? I would have expected some small number of pages to be held in reserve just in case. And I didn't expect M_NOWAIT to be a sort of back door for grabbing memory. Your analysis is right, there is nothing to add or correct. This is the reason to strongly prefer M_WAITOK. Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was well understood that it should only be used by interrupt handlers. The trouble is that M_NOWAIT conflates two orthogonal things. The obvious being that the allocation shouldn't sleep. The other being how far we're willing to deplete the cache/free page queues. When fine-grained locking got sprinkled throughout the kernel, we all too often found ourselves wanting to do allocations without the possibility of blocking. So, M_NOWAIT became commonplace, where it wasn't before. This had the unintended consequence of introducing a lot of memory allocations in the top-half of the kernel, i.e., non-interrupt handling code, that were digging deep into the cache/free page queues. Also, ironically, in today's kernel an M_NOWAIT | M_USE_RESERVE allocation is less likely to succeed than an M_NOWAIT allocation. However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it could only allocate a free page. M_USE_RESERVE said that it was ok to allocate a cached page even though M_NOWAIT was specified. Consequently, the system wouldn't dig as far into the free page queue if M_USE_RESERVE was specified, because it was allowed to reclaim a cached page. In conclusion, I think it's time that we change M_NOWAIT so that it doesn't dig any deeper into the cache/free page queues than M_WAITOK does and reintroduce a M_USE_RESERVE-like flag that says dig deep into the cache/free page queues.
The trouble is that we then need to identify all of those places that are implicitly depending on the current behavior of M_NOWAIT also digging deep into the cache/free page queues so that we can add an explicit M_USE_RESERVE. Alan P.S. I suspect that we should also increase the size of the page reserve that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*(). How many legitimate users of a new M_USE_RESERVE-like flag in today's kernel could actually be satisfied by two pages? I am almost sure that most people who use the M_NOWAIT flag do not know about the 'allow the deeper drain of the free queue' effect. As such, I believe we should flip the meaning of M_NOWAIT/M_USE_RESERVE. The only problematic places I expect would be in the swapout path. I found a single explicit use of M_USE_RESERVE in the kernel, so the flip is relatively simple. Agreed. Most recently I eliminated several
Re: Memory reserves or lack thereof
On 11/12/2012 3:48 PM, Konstantin Belousov wrote: On Mon, Nov 12, 2012 at 01:28:02PM -0800, Sushanth Rai wrote: This patch still doesn't address the issue of M_NOWAIT calls driving memory all the way down to 2 pages, right? It would be nice to have M_NOWAIT just do a non-sleep version of M_WAITOK and an M_USE_RESERVE flag to dig deep. This is out of scope of the change. But it is required for any further adjustments. I would suggest a somewhat different response: The patch does make M_NOWAIT into a non-sleep version of M_WAITOK and does reintroduce M_USE_RESERVE as a way to specify dig deep. Currently, both M_NOWAIT and M_WAITOK can drive the cache/free memory down to two pages. The effect of the patch is to stop M_NOWAIT at two pages rather than allowing it to continue to zero pages. When you say, This is out of scope ..., I believe that you are referring to changing two pages into something larger. I agree that this is out of scope for the current change. Alan
Re: Memory reserves or lack thereof
On 11/12/2012 5:24 PM, Adrian Chadd wrote: .. wait, so what exactly would the difference be between M_NOWAIT and M_WAITOK? Whether or not the allocation can sleep until memory becomes available.
Re: Memory reserves or lack thereof
On Sat, Nov 10, 2012 at 7:20 AM, Konstantin Belousov kostik...@gmail.com wrote: On Fri, Nov 09, 2012 at 07:10:04PM +0000, Sears, Steven wrote: I have a memory subsystem design question that I'm hoping someone can answer. I've been looking at a machine that is completely out of memory, as in v_free_count = 0, v_cache_count = 0. I wondered how a machine could completely run out of memory like this, especially after finding a lack of interrupt storms or other pathologies that would tend to overcommit memory. So I started investigating. Most allocators come down to vm_page_alloc(), which has this guard:

	if ((curproc == pageproc) && (page_req != VM_ALLOC_INTERRUPT)) {
		page_req = VM_ALLOC_SYSTEM;
	};

	if (cnt.v_free_count + cnt.v_cache_count > cnt.v_free_reserved ||
	    (page_req == VM_ALLOC_SYSTEM &&
	    cnt.v_free_count + cnt.v_cache_count > cnt.v_interrupt_free_min) ||
	    (page_req == VM_ALLOC_INTERRUPT &&
	    cnt.v_free_count + cnt.v_cache_count > 0)) {

The key observation is if VM_ALLOC_INTERRUPT is set, it will allocate every last page. From the name one might expect VM_ALLOC_INTERRUPT to be somewhat rare, perhaps only used from interrupt threads. Not so; see kmem_malloc() or uma_small_alloc(), which both contain this mapping:

	if ((flags & (M_NOWAIT | M_USE_RESERVE)) == M_NOWAIT)
		pflags = VM_ALLOC_INTERRUPT | VM_ALLOC_WIRED;
	else
		pflags = VM_ALLOC_SYSTEM | VM_ALLOC_WIRED;

Note that M_USE_RESERVE has been deprecated and is used in just a handful of places. Also note that lots of code paths come through these routines. What this means is essentially _any_ allocation using M_NOWAIT will bypass whatever reserves have been held back and will take every last page available. There is no documentation stating M_NOWAIT has this side effect of essentially being privileged, so any innocuous piece of code that can't block will use it. And of course M_NOWAIT is literally used all over. It looks to me like the design goal of the BSD allocators is recovery; it will give all pages away knowing it can recover.
Am I missing anything? I would have expected some small number of pages to be held in reserve just in case. And I didn't expect M_NOWAIT to be a sort of back door for grabbing memory. Your analysis is right, there is nothing to add or correct. This is the reason to strongly prefer M_WAITOK. Agreed. Once upon a time, before SMPng, M_NOWAIT was rarely used. It was well understood that it should only be used by interrupt handlers. The trouble is that M_NOWAIT conflates two orthogonal things. The obvious being that the allocation shouldn't sleep. The other being how far we're willing to deplete the cache/free page queues. When fine-grained locking got sprinkled throughout the kernel, we all too often found ourselves wanting to do allocations without the possibility of blocking. So, M_NOWAIT became commonplace, where it wasn't before. This had the unintended consequence of introducing a lot of memory allocations in the top half of the kernel, i.e., non-interrupt handling code, that were digging deep into the cache/free page queues. Also, ironically, in today's kernel an M_NOWAIT | M_USE_RESERVE allocation is less likely to succeed than an M_NOWAIT allocation. However, prior to FreeBSD 7.x, M_NOWAIT couldn't allocate a cached page; it could only allocate a free page. M_USE_RESERVE said that it was OK to allocate a cached page even though M_NOWAIT was specified. Consequently, the system wouldn't dig as far into the free page queue if M_USE_RESERVE was specified, because it was allowed to reclaim a cached page. In conclusion, I think it's time that we change M_NOWAIT so that it doesn't dig any deeper into the cache/free page queues than M_WAITOK does and reintroduce an M_USE_RESERVE-like flag that says dig deep into the cache/free page queues. The trouble is that we then need to identify all of those places that are implicitly depending on the current behavior of M_NOWAIT also digging deep into the cache/free page queues so that we can add an explicit M_USE_RESERVE. Alan P.S.
I suspect that we should also increase the size of the page reserve that is kept for VM_ALLOC_INTERRUPT allocations in vm_page_alloc*(). How many legitimate users of a new M_USE_RESERVE-like flag in today's kernel could actually be satisfied by two pages?
Re: Threaded 6.4 code compiled under 9.0 uses a lot more memory?..
On Wed, Oct 31, 2012 at 2:06 PM, Konstantin Belousov kostik...@gmail.com wrote: On Wed, Oct 31, 2012 at 11:52:06AM -0700, Adrian Chadd wrote: On 31 October 2012 11:20, Ian Lepore free...@damnhippie.dyndns.org wrote: I think there are some things we should be investigating about the growth of memory usage. I just noticed this: FreeBSD 6.2 on an arm processor: 369 root 1 8 -88 1752K 748K nanslp 3:00 0.00% watchdogd FreeBSD 10.0 on the same system: 367 root 1 -52 r0 10232K 10160K nanslp 10:04 0.00% watchdogd The 10.0 system is built with MALLOC_PRODUCTION (without that defined the system won't even boot; it only has 64MB of RAM). That's a crazy amount of growth for a relatively simple daemon. Would you please, _please_ do some digging into this? It's quite possible there's something in the libraries that is allocating some memory upon first call invocation - yes, that's jemalloc, but it could also be other things like stdio. We really, really need to fix this userland bloat; it's terribly ridiculous at this point. There's no reason a watchdog daemon should take 10 megabytes of RAM. Watchdogd was recently changed to mlock its memory. This is the cause of the RSS increase. Is it also statically linked? Alan
Re: contigmalloc() breaking Xorg
On 07/12/2012 07:26, John Baldwin wrote: [ Adding alc@ for VM stuff, Warner for arm/mips bus dma brokenness ] When the code underlying contigmalloc() fails in its initial attempt to allocate memory and proceeds to launder and reclaim pages, it should almost certainly do as the page daemon does and invoke the vm_lowmem handlers. In particular, this should coax the ZFS ARC into releasing some of its hoard of wired memory. Try this:

Index: vm/vm_contig.c
===================================================================
--- vm/vm_contig.c	(revision 238372)
+++ vm/vm_contig.c	(working copy)
@@ -192,6 +192,18 @@ vm_contig_grow_cache(int tries, vm_paddr_t low, vm
 {
 	int actl, actmax, inactl, inactmax;
 
+	if (tries > 0) {
+		/*
+		 * Decrease registered cache sizes.
+		 */
+		EVENTHANDLER_INVOKE(vm_lowmem, 0);
+
+		/*
+		 * We do this explicitly after the caches have been drained
+		 * above.
+		 */
+		uma_reclaim();
+	}
 	vm_page_lock_queues();
 	inactl = 0;
 	inactmax = tries < 1 ? 0 : cnt.v_inactive_count;
Re: Rtld object tasting [Was: Re: wired memory - again!]
On Wed, Jun 13, 2012 at 2:12 PM, Konstantin Belousov kostik...@gmail.com wrote: On Wed, Jun 13, 2012 at 07:14:09AM -0600, Ian Lepore wrote: http://lists.freebsd.org/pipermail/freebsd-arm/2012-January/003288.html The map_object.c patch is a step in almost the right direction; I have wanted to remove the static page-sized buffer from get_elf_header() for a long time. It works because rtld always holds bind_lock exclusively while loading an object. There is no need to copy the first page after it is mapped. commit 0f6f8629af1345acded7c0c685d3ff7b4d9180d6 Author: Konstantin Belousov k...@freebsd.org Date: Wed Jun 13 22:04:18 2012 +0300 Eliminate the static buffer used to read the first page of the mapped object, and eliminate the pread(2) call as well. Mmap the first page of the object temporarily, and unmap it on error or last use. Fix several cases where the whole mapping of the object leaked on error. Potentially, this leaves a one-page gap between succeeding dlopen(3) calls, but there are other mmap(2) consumers as well. I suggest adding MAP_PREFAULT_READ to the mmap(2) call. A heuristic in vm_map_pmap_enter() would trigger automatic mapping for small files, but if the object file is larger than 96 pages then you need to explicitly specify MAP_PREFAULT_READ. Alan
Re: superpages and kmem on amd64
On Sun, May 20, 2012 at 2:01 AM, Marko Zec z...@fer.hr wrote: Hi all, I'm playing with an algorithm which makes use of large contiguous blocks of kernel memory (ranging from 1M to 1G in size), so it would be nice if those could be somehow forcibly mapped to superpages. I was hoping that the VM system would automagically map (merge) contiguous 4k pages to superpages, but apparently it doesn't: vm.pmap.pdpe.demotions: 2 vm.pmap.pde.promotions: 543 vm.pmap.pde.p_failures: 266253 vm.pmap.pde.mappings: 0 vm.pmap.pde.demotions: 31 No, your conclusion is incorrect. These counts show that 543 superpage mappings were created by promotion. Alan
Re: superpages and kmem on amd64
On 05/20/2012 09:43, Marko Zec wrote: On Sunday 20 May 2012 09:25:59 Alan Cox wrote: On Sun, May 20, 2012 at 2:01 AM, Marko Zec z...@fer.hr wrote: Hi all, I'm playing with an algorithm which makes use of large contiguous blocks of kernel memory (ranging from 1M to 1G in size), so it would be nice if those could be somehow forcibly mapped to superpages. I was hoping that the VM system would automagically map (merge) contiguous 4k pages to superpages, but apparently it doesn't: vm.pmap.pdpe.demotions: 2 vm.pmap.pde.promotions: 543 vm.pmap.pde.p_failures: 266253 vm.pmap.pde.mappings: 0 vm.pmap.pde.demotions: 31 No, your conclusion is incorrect. These counts show that 543 superpage mappings were created by promotion. OK, that sounds promising. Does the "created by promotion" count reflect historic / cumulative stats, or is vm.pmap.pde.promotions the actual number of superpages active? Or should we subtract vm.pmap.pde.demotions from it to get the current value? The count is cumulative. There is no instantaneous count. Subtracting demotions from promotions plus mappings is not a reliable way to get the instantaneous total, because a superpage mapping can be destroyed without first being demoted. In any case, I wish to be certain that a particular kmem virtual address range is mapped to superpages - how can I enforce that at malloc time, and / or find out later if I really got my kmem mapped to superpages? Perhaps vm_map_lookup() could provide more info, but I'm wondering if someone already wrote a wrapper function for that, which takes only the base virtual address as a single argument? Try using pmap_mincore() to verify that the mappings are superpages. BTW, apparently malloc(size, M_TEMP, M_NOWAIT) requests fail for size > 1G, even at boot time. Any ideas how to circumvent that (8.3-STABLE, amd64, 4G physical RAM)? I suspect that you need to increase the size of your kmem map.
Alan
Re: superpages and kmem on amd64
On 05/20/2012 17:48, Marko Zec wrote: On Sunday 20 May 2012 19:34:26 Alan Cox wrote: ... In any case, I wish to be certain that a particular kmem virtual address range is mapped to superpages - how can I enforce that at malloc time, and / or find out later if I really got my kmem mapped to superpages? Perhaps vm_map_lookup() could provide more info, but I'm wondering if someone already wrote a wrapper function for that, which takes only the base virtual address as a single argument? Try using pmap_mincore() to verify that the mappings are superpages.

flags = pmap_mincore(vmspace_pmap(curthread->td_proc->p_vmspace), (vm_offset_t)addr);

OK, that works, and now I know my kmem chunk is on a superpage, hooray!!! Thanks! BTW, apparently malloc(size, M_TEMP, M_NOWAIT) requests fail for size > 1G, even at boot time. Any ideas how to circumvent that (8.3-STABLE, amd64, 4G physical RAM)? I suspect that you need to increase the size of your kmem map. Huh, any hints on how I should achieve that? In desperation I placed vm.kmem_size=8G in /boot/loader.conf and got this: vm.kmem_map_free: 8123924480 vm.kmem_map_size: 8364032 vm.kmem_size_scale: 1 vm.kmem_size_max: 329853485875 vm.kmem_size_min: 0 vm.kmem_size: 8132288512 but malloc(2G) still fails... Here is at least one reason why it fails:

void *
uma_large_malloc(int size, int wait)

Note the type of size. Can you malloc 1GB?
Re: problems with mmap() and disk caching
On 04/11/2012 01:07, Andrey Zonov wrote: On 10.04.2012 20:19, Alan Cox wrote: On 04/09/2012 10:26, John Baldwin wrote: On Thursday, April 05, 2012 11:54:31 am Alan Cox wrote: On 04/04/2012 02:17, Konstantin Belousov wrote: On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote: Hi, I open the file, then call mmap() on the whole file and get pointer, then I work with this pointer. I expect that page should be only once touched to get it into the memory (disk cache?), but this doesn't work! I wrote the test (attached) and ran it for the 1G file generated from /dev/random, the result is the following: Prepare file: # swapoff -a # newfs /dev/ada0b # mount /dev/ada0b /mnt # dd if=/dev/random of=/mnt/random-1024 bs=1m count=1024 Purge cache: # umount /mnt # mount /dev/ada0b /mnt Run test: $ ./mmap /mnt/random-1024 30 mmap: 1 pass took: 7.431046 (none: 262112; res: 32; super: 0; other: 0) mmap: 2 pass took: 7.356670 (none: 261648; res: 496; super: 0; other: 0) mmap: 3 pass took: 7.307094 (none: 260521; res: 1623; super: 0; other: 0) mmap: 4 pass took: 7.350239 (none: 258904; res: 3240; super: 0; other: 0) mmap: 5 pass took: 7.392480 (none: 257286; res: 4858; super: 0; other: 0) mmap: 6 pass took: 7.292069 (none: 255584; res: 6560; super: 0; other: 0) mmap: 7 pass took: 7.048980 (none: 251142; res: 11002; super: 0; other: 0) mmap: 8 pass took: 6.899387 (none: 247584; res: 14560; super: 0; other: 0) mmap: 9 pass took: 7.190579 (none: 242992; res: 19152; super: 0; other: 0) mmap: 10 pass took: 6.915482 (none: 239308; res: 22836; super: 0; other: 0) mmap: 11 pass took: 6.565909 (none: 232835; res: 29309; super: 0; other: 0) mmap: 12 pass took: 6.423945 (none: 226160; res: 35984; super: 0; other: 0) mmap: 13 pass took: 6.315385 (none: 208555; res: 53589; super: 0; other: 0) mmap: 14 pass took: 6.760780 (none: 192805; res: 69339; super: 0; other: 0) mmap: 15 pass took: 5.721513 (none: 174497; res: 87647; super: 0; other: 0) mmap: 16 pass took: 5.004424 (none: 155938; res: 
106206; super: 0; other: 0) mmap: 17 pass took: 4.224926 (none: 135639; res: 126505; super: 0; other: 0) mmap: 18 pass took: 3.749608 (none: 117952; res: 144192; super: 0; other: 0) mmap: 19 pass took: 3.398084 (none: 99066; res: 163078; super: 0; other: 0) mmap: 20 pass took: 3.029557 (none: 74994; res: 187150; super: 0; other: 0) mmap: 21 pass took: 2.379430 (none: 55231; res: 206913; super: 0; other: 0) mmap: 22 pass took: 2.046521 (none: 40786; res: 221358; super: 0; other: 0) mmap: 23 pass took: 1.152797 (none: 30311; res: 231833; super: 0; other: 0) mmap: 24 pass took: 0.972617 (none: 16196; res: 245948; super: 0; other: 0) mmap: 25 pass took: 0.577515 (none: 8286; res: 253858; super: 0; other: 0) mmap: 26 pass took: 0.380738 (none: 3712; res: 258432; super: 0; other: 0) mmap: 27 pass took: 0.253583 (none: 1193; res: 260951; super: 0; other: 0) mmap: 28 pass took: 0.157508 (none: 0; res: 262144; super: 0; other: 0) mmap: 29 pass took: 0.156169 (none: 0; res: 262144; super: 0; other: 0) mmap: 30 pass took: 0.156550 (none: 0; res: 262144; super: 0; other: 0) If I ran this: $ cat /mnt/random-1024 /dev/null before test, when result is the following: $ ./mmap /mnt/random-1024 5 mmap: 1 pass took: 0.337657 (none: 0; res: 262144; super: 0; other: 0) mmap: 2 pass took: 0.186137 (none: 0; res: 262144; super: 0; other: 0) mmap: 3 pass took: 0.186132 (none: 0; res: 262144; super: 0; other: 0) mmap: 4 pass took: 0.186535 (none: 0; res: 262144; super: 0; other: 0) mmap: 5 pass took: 0.190353 (none: 0; res: 262144; super: 0; other: 0) This is what I expect. But why this doesn't work without reading file manually? Issue seems to be in some change of the behaviour of the reserv or phys allocator. I Cc:ed Alan. I'm pretty sure that the behavior here hasn't significantly changed in about twelve years. Otherwise, I agree with your analysis. 
On more than one occasion, I've been tempted to change:

	pmap_remove_all(mt);
	if (mt->dirty != 0)
		vm_page_deactivate(mt);
	else
		vm_page_cache(mt);

to:

	vm_page_dontneed(mt);

because I suspect that the current code does more harm than good. In theory, it saves activations of the page daemon. However, more often than not, I suspect that we are spending more on page reactivations than we are saving on page daemon activations. The sequential access detection heuristic is just too easily triggered. For example, I've seen it triggered by demand paging of the gcc text segment. Also, I think that pmap_remove_all() and especially vm_page_cache() are too severe for a detection heuristic that is so easily triggered. Are you planning to commit this? Not yet. I did some tests with a file that was several times larger than DRAM, and I didn't like what I saw. Initially, everything behaved as expected, but about halfway through the test the bulk of the pages were active. Despite the call to pmap_clear_reference() in vm_page_dontneed(), the page daemon is finding the pages to be referenced and reactivating
Re: mlockall() on freebsd 7.2 + amd64 returns EAGAIN
On 04/13/2012 16:45, Konstantin Belousov wrote: On Fri, Apr 13, 2012 at 11:37:44AM -0700, Sushanth Rai wrote: I've attached the simple program that creates 5 threads. Following is the o/p of /proc/pid/map when this program is running. Note that I modified sys/fs/procfs/procfs_map.c to print whether a region is wired. As you can see from this o/p, none of stack areas get wired. 0x40 0x401000 1 0 0xff002d943bd0 r-x 1 0 0x1000 COW NC wired vnode /var/tmp/thread1 0x50 0x501000 1 0 0xff002dd13e58 rw- 2 0 0x3100 NCOW NNC wired default - 0x501000 0x60 255 0 0xff002dd13e58 rwx 2 0 0x3100 NCOW NNC wired default - 0x80050 0x800526000 38 0 0xff0025574000 r-x 192 46 0x1004 COW NC wired vnode /libexec/ld-elf.so.1 0x800526000 0x800537000 17 0 0xff002d9f81b0 rw- 1 0 0x3100 NCOW NNC wired default - 0x800626000 0x80062d000 7 0 0xff002dd13bd0 rw- 1 0 0x3100 COW NNC wired vnode /libexec/ld-elf.so.1 0x80062d000 0x800633000 6 0 0xff002dd145e8 rw- 1 0 0x3100 NCOW NNC wired default - 0x800633000 0x800645000 18 0 0xff00256d71b0 r-x 63 42 0x4 COW NC wired vnode /lib/libthr.so.3 0x800645000 0x800646000 1 0 0xff002d975510 r-x 1 0 0x3100 COW NNC wired vnode /lib/libthr.so.3 0x800646000 0x800746000 0 0 0xff002dc5cca8 --- 4 0 0x3100 NCOW NNC not-wired default - 0x800746000 0x80074a000 4 0 0xff002572a288 rw- 1 0 0x3100 COW NNC wired vnode /lib/libthr.so.3 0x80074a000 0x80074c000 2 0 0xff002dc5cca8 rw- 4 0 0x3100 NCOW NNC wired default - 0x80074c000 0x80083e000 242 0 0xff001cd226c0 r-x 238 92 0x1004 COW NC wired vnode /lib/libc.so.7 0x80083e000 0x80083f000 1 0 0xff002dd12000 r-x 1 0 0x3100 COW NNC wired vnode /lib/libc.so.7 0x80083f000 0x80093e000 0 0 0xff002dc5cca8 --- 4 0 0x3100 NCOW NNC not-wired default - 0x80093e000 0x80095d000 31 0 0xff002dddc360 rw- 1 0 0x3100 COW NNC wired vnode /lib/libc.so.7 0x80095d000 0x800974000 23 0 0xff002dc5cca8 rw- 4 0 0x3100 NCOW NNC wired default - 0x800a0 0x800b0 256 0 0xff002dbd1798 rw- 1 0 0x3100 NCOW NNC wired default - 0x800b0 0x800c0 256 0 0xff002dd14948 
rw- 1 0 0x3100 NCOW NNC wired default - 0x7f3db000 0x7f3fb000 1 0 0xff002dbb4360 rw- 1 0 0x3100 NCOW NNC not-wired default - 0x7f5dc000 0x7f5fc000 1 0 0xff002dc66af8 rw- 1 0 0x3100 NCOW NNC not-wired default - 0x7f7dd000 0x7f7fd000 1 0 0xff002dbea438 rw- 1 0 0x3100 NCOW NNC not-wired default - 0x7f9de000 0x7f9fe000 1 0 0xff002dd7fd80 rw- 1 0 0x3100 NCOW NNC not-wired default - 0x7fbdf000 0x7fbff000 1 0 0xff002dbe9438 rw- 1 0 0x3100 NCOW NNC not-wired default - 0x7fbff000 0x7fc0 0 0 0 --- 0 0 0x0 NCOW NNC not-wired none - 0x7ffe 0x8000 32 0 0xff002dd125e8 rwx 1 0 0x3100 NCOW NNC wired default - --- On Fri, 4/13/12, Konstantin Belousovkostik...@gmail.com wrote: From: Konstantin Belousovkostik...@gmail.com Subject: Re: mlockall() on freebsd 7.2 + amd64 returns EAGAIN To: Sushanth Raisushanth_...@yahoo.com Cc: freebsd-hackers@freebsd.org Date: Friday, April 13, 2012, 1:11 AM On Thu, Apr 12, 2012 at 08:10:26PM -0700, Sushanth Rai wrote: Then it should be fixed in r190885. Thanks. That works like a charm. mlockall() mostly works now. There is still a, issue in wiring the stacks of multithreaded program when the program uses default stack allocation scheme. Thread library allocates stack for each thread by calling mmap() and sending address and size to be mapped. The kernel adjusts the start address to sgrowsz in vm_map_stack() and maps at the adjusted address. But the subsequent wiring is done using the original address, which fails. Oh, I see. The problem is the VM_MAP_WIRE_NOHOLES flag. Since we map only the initial stack fragment even for the MCL_WIREFUTURE maps, there is a hole in the stack region. In fact, for MCL_WIREFUTURE, we probably should map the whole stack at once, prefaulting all pages. Below are two patches. The change for vm_mmap.c would fix your immediate problem by allowing holes in wired region. The change for vm_map.c prefaults the whole stack instead of the initial fragment. The single-threaded programs still get a fault on stack growth. 
The vm_mmap.c change looks OK to me. Please commit it. I haven't yet had a chance to think about the other change. Alan

diff --git a/sys/vm/vm_map.c b/sys/vm/vm_map.c
index 6198629..2fd18d1 100644
--- a/sys/vm/vm_map.c
+++ b/sys/vm/vm_map.c
@@ -3259,7 +3259,10 @@ vm_map_stack(vm_map_t map, vm_offset_t addrbos, vm_size_t max_ssize,
 	    addrbos + max_ssize < addrbos)
 		return (KERN_NO_SPACE);
 
-	init_ssize = (max_ssize < sgrowsiz) ? max_ssize : sgrowsiz;
+	if (map->flags & MAP_WIREFUTURE)
+		init_ssize = max_ssize;
+	else
+		init_ssize = (max_ssize < sgrowsiz) ? max_ssize : sgrowsiz;
 
 	PROC_LOCK(curthread->td_proc);
 	vmemlim = lim_cur(curthread->td_proc, RLIMIT_VMEM);
diff --git
Re: Corrupted pmap pm_vlist - pmap_remove_pte()
On 4/17/2012 4:48 AM, Konstantin Belousov wrote: On Mon, Apr 16, 2012 at 03:08:25PM -0400, Ewart Tempest wrote: In FreeBSD 6.*, we have been seeing crashes in pmap_remove_pages() that only seem to occur in scaling scenarios:

2564	#ifdef PMAP_REMOVE_PAGES_CURPROC_ONLY
2565		pte = vtopte(pv->pv_va);
2566	#else
2567		pte = pmap_pte(pmap, pv->pv_va);
2568	#endif
2569		tpte = *pte;	<= page fault here

The suspicion is that the pmap's pm_pvlist is getting corrupted. To this end, I have a question about the following logic in pmap_remove_pte() (see the in-line comment):

1533	static int
1534	pmap_remove_pte(pmap_t pmap, pt_entry_t *ptq, vm_offset_t va, pd_entry_t ptepde)
1535	{
1536		pt_entry_t oldpte;
1537		vm_page_t m;
1538	
1539		PMAP_LOCK_ASSERT(pmap, MA_OWNED);
1540		oldpte = pte_load_clear(ptq);
1541		if (oldpte & PG_W)
1542			pmap->pm_stats.wired_count -= 1;
1543		/*
1544		 * Machines that don't support invlpg, also don't support
1545		 * PG_G.
1546		 */
1547		if (oldpte & PG_G)
1548			pmap_invalidate_page(kernel_pmap, va);
1549		pmap->pm_stats.resident_count -= 1;
1550		if (oldpte & PG_MANAGED) {
1551			m = PHYS_TO_VM_PAGE(oldpte & PG_FRAME);
1552			if (oldpte & PG_M) {
1553	#if defined(PMAP_DIAGNOSTIC)
1554				if (pmap_nw_modified((pt_entry_t) oldpte)) {
1555					printf(
1556	"pmap_remove: modified page not writable: va: 0x%lx, pte: 0x%lx\n",
1557					    va, oldpte);
1558				}
1559	#endif
1560				if (pmap_track_modified(va))
1561					vm_page_dirty(m);
1562			}
1563			if (oldpte & PG_A)
1564				vm_page_flag_set(m, PG_REFERENCED);
1565			pmap_remove_entry(pmap, m, va);
1566		}
1567		return (pmap_unuse_pt(pmap, va, ptepde));	<=== *** under what circumstances is it valid to free the page but not remove it from the pmap's pm_pvlist? Even the code comment for pmap_unuse_pt() commences "After removing a page table entry ...". ***

It is valid to not remove the pv_entry when no pv_entry exists for the mapping. The pv_entry is created if the page is managed, see the pmap_enter() code. The block above the return is executed when the page is managed, or at least pmap thinks so.
The HEAD code will panic in pmap_pvh_free() if pmap_pvh_remove() cannot find the pv entry for the given page and given pmap/va.

1568	}

If the tail end of the above function is changed as follows:

1565		pmap_remove_entry(pmap, m, va);
1565.5		return (pmap_unuse_pt(pmap, va, ptepde));
1566	}
1567	return (0);

then we don't see any crashes ... but is it the right thing to do? It should not be. Try to test this with some unmanaged mapping, like /dev/mem pages mapped into the existing process address space. I am too new to know about any nuances of the RELENG_6 code. The RELENG_6 code is doing essentially the same things as newer versions. Crashes in this specific place are usually caused by DRAM errors. Alan
Re: problems with mmap() and disk caching
On 04/09/2012 10:26, John Baldwin wrote: On Thursday, April 05, 2012 11:54:31 am Alan Cox wrote: On 04/04/2012 02:17, Konstantin Belousov wrote: On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote: Hi, I open the file, then call mmap() on the whole file and get pointer, then I work with this pointer. I expect that page should be only once touched to get it into the memory (disk cache?), but this doesn't work! I wrote the test (attached) and ran it for the 1G file generated from /dev/random, the result is the following: Prepare file: # swapoff -a # newfs /dev/ada0b # mount /dev/ada0b /mnt # dd if=/dev/random of=/mnt/random-1024 bs=1m count=1024 Purge cache: # umount /mnt # mount /dev/ada0b /mnt Run test: $ ./mmap /mnt/random-1024 30 mmap: 1 pass took: 7.431046 (none: 262112; res: 32; super: 0; other: 0) mmap: 2 pass took: 7.356670 (none: 261648; res:496; super: 0; other: 0) mmap: 3 pass took: 7.307094 (none: 260521; res: 1623; super: 0; other: 0) mmap: 4 pass took: 7.350239 (none: 258904; res: 3240; super: 0; other: 0) mmap: 5 pass took: 7.392480 (none: 257286; res: 4858; super: 0; other: 0) mmap: 6 pass took: 7.292069 (none: 255584; res: 6560; super: 0; other: 0) mmap: 7 pass took: 7.048980 (none: 251142; res: 11002; super: 0; other: 0) mmap: 8 pass took: 6.899387 (none: 247584; res: 14560; super: 0; other: 0) mmap: 9 pass took: 7.190579 (none: 242992; res: 19152; super: 0; other: 0) mmap: 10 pass took: 6.915482 (none: 239308; res: 22836; super: 0; other: 0) mmap: 11 pass took: 6.565909 (none: 232835; res: 29309; super: 0; other: 0) mmap: 12 pass took: 6.423945 (none: 226160; res: 35984; super: 0; other: 0) mmap: 13 pass took: 6.315385 (none: 208555; res: 53589; super: 0; other: 0) mmap: 14 pass took: 6.760780 (none: 192805; res: 69339; super: 0; other: 0) mmap: 15 pass took: 5.721513 (none: 174497; res: 87647; super: 0; other: 0) mmap: 16 pass took: 5.004424 (none: 155938; res: 106206; super: 0; other: 0) mmap: 17 pass took: 4.224926 (none: 135639; res: 
126505; super: 0; other: 0) mmap: 18 pass took: 3.749608 (none: 117952; res: 144192; super: 0; other: 0) mmap: 19 pass took: 3.398084 (none: 99066; res: 163078; super: 0; other: 0) mmap: 20 pass took: 3.029557 (none: 74994; res: 187150; super: 0; other: 0) mmap: 21 pass took: 2.379430 (none: 55231; res: 206913; super: 0; other: 0) mmap: 22 pass took: 2.046521 (none: 40786; res: 221358; super: 0; other: 0) mmap: 23 pass took: 1.152797 (none: 30311; res: 231833; super: 0; other: 0) mmap: 24 pass took: 0.972617 (none: 16196; res: 245948; super: 0; other: 0) mmap: 25 pass took: 0.577515 (none: 8286; res: 253858; super: 0; other: 0) mmap: 26 pass took: 0.380738 (none: 3712; res: 258432; super: 0; other: 0) mmap: 27 pass took: 0.253583 (none: 1193; res: 260951; super: 0; other: 0) mmap: 28 pass took: 0.157508 (none: 0; res: 262144; super: 0; other: 0) mmap: 29 pass took: 0.156169 (none: 0; res: 262144; super: 0; other: 0) mmap: 30 pass took: 0.156550 (none: 0; res: 262144; super: 0; other: 0) If I ran this: $ cat /mnt/random-1024 /dev/null before test, when result is the following: $ ./mmap /mnt/random-1024 5 mmap: 1 pass took: 0.337657 (none: 0; res: 262144; super: 0; other: 0) mmap: 2 pass took: 0.186137 (none: 0; res: 262144; super: 0; other: 0) mmap: 3 pass took: 0.186132 (none: 0; res: 262144; super: 0; other: 0) mmap: 4 pass took: 0.186535 (none: 0; res: 262144; super: 0; other: 0) mmap: 5 pass took: 0.190353 (none: 0; res: 262144; super: 0; other: 0) This is what I expect. But why this doesn't work without reading file manually? Issue seems to be in some change of the behaviour of the reserv or phys allocator. I Cc:ed Alan. I'm pretty sure that the behavior here hasn't significantly changed in about twelve years. Otherwise, I agree with your analysis. 
On more than one occasion, I've been tempted to change:

	pmap_remove_all(mt);
	if (mt->dirty != 0)
		vm_page_deactivate(mt);
	else
		vm_page_cache(mt);

to:

	vm_page_dontneed(mt);

because I suspect that the current code does more harm than good. In theory, it saves activations of the page daemon. However, more often than not, I suspect that we are spending more on page reactivations than we are saving on page daemon activations. The sequential access detection heuristic is just too easily triggered. For example, I've seen it triggered by demand paging of the gcc text segment. Also, I
Re: problems with mmap() and disk caching
On 04/04/2012 02:17, Konstantin Belousov wrote: On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote: Hi, I open the file, then call mmap() on the whole file and get pointer, then I work with this pointer. I expect that page should be only once touched to get it into the memory (disk cache?), but this doesn't work! I wrote the test (attached) and ran it for the 1G file generated from /dev/random, the result is the following: Prepare file: # swapoff -a # newfs /dev/ada0b # mount /dev/ada0b /mnt # dd if=/dev/random of=/mnt/random-1024 bs=1m count=1024 Purge cache: # umount /mnt # mount /dev/ada0b /mnt Run test: $ ./mmap /mnt/random-1024 30 mmap: 1 pass took: 7.431046 (none: 262112; res: 32; super: 0; other: 0) mmap: 2 pass took: 7.356670 (none: 261648; res:496; super: 0; other: 0) mmap: 3 pass took: 7.307094 (none: 260521; res: 1623; super: 0; other: 0) mmap: 4 pass took: 7.350239 (none: 258904; res: 3240; super: 0; other: 0) mmap: 5 pass took: 7.392480 (none: 257286; res: 4858; super: 0; other: 0) mmap: 6 pass took: 7.292069 (none: 255584; res: 6560; super: 0; other: 0) mmap: 7 pass took: 7.048980 (none: 251142; res: 11002; super: 0; other: 0) mmap: 8 pass took: 6.899387 (none: 247584; res: 14560; super: 0; other: 0) mmap: 9 pass took: 7.190579 (none: 242992; res: 19152; super: 0; other: 0) mmap: 10 pass took: 6.915482 (none: 239308; res: 22836; super: 0; other: 0) mmap: 11 pass took: 6.565909 (none: 232835; res: 29309; super: 0; other: 0) mmap: 12 pass took: 6.423945 (none: 226160; res: 35984; super: 0; other: 0) mmap: 13 pass took: 6.315385 (none: 208555; res: 53589; super: 0; other: 0) mmap: 14 pass took: 6.760780 (none: 192805; res: 69339; super: 0; other: 0) mmap: 15 pass took: 5.721513 (none: 174497; res: 87647; super: 0; other: 0) mmap: 16 pass took: 5.004424 (none: 155938; res: 106206; super: 0; other: 0) mmap: 17 pass took: 4.224926 (none: 135639; res: 126505; super: 0; other: 0) mmap: 18 pass took: 3.749608 (none: 117952; res: 144192; super: 0; 
other: 0) mmap: 19 pass took: 3.398084 (none: 99066; res: 163078; super: 0; other: 0) mmap: 20 pass took: 3.029557 (none: 74994; res: 187150; super: 0; other: 0) mmap: 21 pass took: 2.379430 (none: 55231; res: 206913; super: 0; other: 0) mmap: 22 pass took: 2.046521 (none: 40786; res: 221358; super: 0; other: 0) mmap: 23 pass took: 1.152797 (none: 30311; res: 231833; super: 0; other: 0) mmap: 24 pass took: 0.972617 (none: 16196; res: 245948; super: 0; other: 0) mmap: 25 pass took: 0.577515 (none: 8286; res: 253858; super: 0; other: 0) mmap: 26 pass took: 0.380738 (none: 3712; res: 258432; super: 0; other: 0) mmap: 27 pass took: 0.253583 (none: 1193; res: 260951; super: 0; other: 0) mmap: 28 pass took: 0.157508 (none: 0; res: 262144; super: 0; other: 0) mmap: 29 pass took: 0.156169 (none: 0; res: 262144; super: 0; other: 0) mmap: 30 pass took: 0.156550 (none: 0; res: 262144; super: 0; other: 0) If I run this: $ cat /mnt/random-1024 > /dev/null before the test, then the result is the following: $ ./mmap /mnt/random-1024 5 mmap: 1 pass took: 0.337657 (none: 0; res: 262144; super: 0; other: 0) mmap: 2 pass took: 0.186137 (none: 0; res: 262144; super: 0; other: 0) mmap: 3 pass took: 0.186132 (none: 0; res: 262144; super: 0; other: 0) mmap: 4 pass took: 0.186535 (none: 0; res: 262144; super: 0; other: 0) mmap: 5 pass took: 0.190353 (none: 0; res: 262144; super: 0; other: 0) This is what I expect. But why doesn't this work without reading the file manually? The issue seems to be in some change of the behaviour of the reserv or phys allocator. I Cc:ed Alan. What happens is that the fault handler deactivates or caches the pages previous to the one which would satisfy the fault. See the if() statement starting at line 463 of vm/vm_fault.c. Since all pages of the object in your test are clean, the pages are cached. The next fault would need to allocate some more pages for a different index of the same object.
What I see is that vm_reserv_alloc_page() returns a page that is from the cache for the same object, but a different pindex. As an obvious result, the page is invalidated and repurposed. When the next loop starts, the page is no longer resident, so it has to be re-read from disk. I'm pretty sure that the pages aren't being repurposed this quickly. Instead, I believe that the explanation is to be found in mincore(). mincore() is only reporting pages that are on the object's memq as resident. It is not reporting cache pages as resident. The behaviour of the allocator is not consistent, so some pages are not reused, allowing the test to converge and eventually collect all pages of the object. Calling madvise(MADV_RANDOM) fixes
Re: problems with mmap() and disk caching
On 04/06/2012 03:38, Konstantin Belousov wrote: On Thu, Apr 05, 2012 at 01:25:49PM -0500, Alan Cox wrote: On 04/05/2012 12:31, Konstantin Belousov wrote: On Thu, Apr 05, 2012 at 10:54:31AM -0500, Alan Cox wrote: On 04/04/2012 02:17, Konstantin Belousov wrote: On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote: [quoted test description and per-pass results trimmed; identical to the output in the first message above] Issue seems to be in some change of the behaviour of the reserv or phys allocator. I Cc:ed Alan. I'm pretty sure that the behavior here hasn't significantly changed in about twelve years. Otherwise, I agree with your analysis.
On more than one occasion, I've been tempted to change: pmap_remove_all(mt); if (mt->dirty != 0) vm_page_deactivate(mt); else vm_page_cache(mt); to: vm_page_dontneed(mt); because I suspect that the current code does more harm than good. In theory, it saves activations of the page daemon. However, more often than not, I suspect that we are spending more on page reactivations than we are saving on page daemon activations. The sequential access detection heuristic is just
Re: problems with mmap() and disk caching
On 04/04/2012 02:17, Konstantin Belousov wrote: On Tue, Apr 03, 2012 at 11:02:53PM +0400, Andrey Zonov wrote: [quoted test description and per-pass results trimmed; identical to the output in the first message above] This is what I expect. But why doesn't this work without reading the file manually? Issue seems to be in some change of the behaviour of the reserv or phys allocator. I Cc:ed Alan. I'm pretty sure that the behavior here hasn't significantly changed in about twelve years. Otherwise, I agree with your analysis. On more than one occasion, I've been tempted to change: pmap_remove_all(mt); if (mt->dirty != 0) vm_page_deactivate(mt); else vm_page_cache(mt); to: vm_page_dontneed(mt); because I suspect that the current code does more harm than good. In theory, it saves activations of the page daemon.
However, more often than not, I suspect that we are spending more on page reactivations than we are saving on page daemon activations. The sequential access detection heuristic is just too easily triggered. For example, I've seen it triggered by demand paging of the gcc text segment. Also, I think that pmap_remove_all() and especially vm_page_cache() are too severe for a detection heuristic
Re: problems with mmap() and disk caching
On 04/04/2012 04:36, Andrey Zonov wrote: On 04.04.2012 11:17, Konstantin Belousov wrote: Calling madvise(MADV_RANDOM) fixes the issue, because the code to deactivate/cache the pages is turned off. On the other hand, it also turns off read-ahead for faulting, and the first loop becomes eternally long. Now it takes 5 times longer. Anyway, thanks for the explanation. Doing MADV_WILLNEED does not fix the problem indeed, since willneed reactivates the pages of the object at the time of the call. To use MADV_WILLNEED, you would need to call it between the faults/memcpy. I played with it, but no luck so far. I've also never seen super pages; how do I make them work? They just work, at least for me. Look at the output of procstat -v after enough loops have finished to not cause disk activity. The problem was in my test program. I fixed it, and now I see super pages, but I'm still not satisfied. There are several tests below: 1. With madvise(MADV_RANDOM) I see almost all super pages: $ ./mmap /mnt/random-1024 5 mmap: 1 pass took: 26.438535 (none: 0; res: 262144; super: 511; other: 0) mmap: 2 pass took: 0.187311 (none: 0; res: 262144; super: 511; other: 0) mmap: 3 pass took: 0.184953 (none: 0; res: 262144; super: 511; other: 0) mmap: 4 pass took: 0.186007 (none: 0; res: 262144; super: 511; other: 0) mmap: 5 pass took: 0.185790 (none: 0; res: 262144; super: 511; other: 0) Shouldn't it be 512? Check the starting virtual address. It is probably not aligned on a superpage boundary. Hence, a few pages at the start and end of your mapped region are not in a superpage. 2.
Without madvise(MADV_RANDOM): $ ./mmap /mnt/random-1024 50 mmap: 1 pass took: 7.629745 (none: 262112; res: 32; super: 0; other: 0) mmap: 2 pass took: 7.301720 (none: 261202; res: 942; super: 0; other: 0) mmap: 3 pass took: 7.261416 (none: 260226; res: 1918; super: 1; other: 0) [skip] mmap: 49 pass took: 0.155368 (none: 0; res: 262144; super: 323; other: 0) mmap: 50 pass took: 0.155438 (none: 0; res: 262144; super: 323; other: 0) Only 323 super pages. 3. If I just re-run the test I don't see super pages with any size of block. $ ./mmap /mnt/random-1024 5 $((1<<30)) mmap: 1 pass took: 1.013939 (none: 0; res: 262144; super: 0; other: 0) mmap: 2 pass took: 0.267082 (none: 0; res: 262144; super: 0; other: 0) mmap: 3 pass took: 0.270711 (none: 0; res: 262144; super: 0; other: 0) mmap: 4 pass took: 0.268940 (none: 0; res: 262144; super: 0; other: 0) mmap: 5 pass took: 0.269634 (none: 0; res: 262144; super: 0; other: 0) 4. If I activate madvise(MADV_WILLNEED) in the copy loop and re-run the test then I see super pages only if I use a block greater than 2MB.
$ ./mmap /mnt/random-1024 1 $((1<<21)) mmap: 1 pass took: 0.299722 (none: 0; res: 262144; super: 0; other: 0) $ ./mmap /mnt/random-1024 1 $((1<<22)) mmap: 1 pass took: 0.271828 (none: 0; res: 262144; super: 170; other: 0) $ ./mmap /mnt/random-1024 1 $((1<<23)) mmap: 1 pass took: 0.333188 (none: 0; res: 262144; super: 258; other: 0) $ ./mmap /mnt/random-1024 1 $((1<<24)) mmap: 1 pass took: 0.339250 (none: 0; res: 262144; super: 303; other: 0) $ ./mmap /mnt/random-1024 1 $((1<<25)) mmap: 1 pass took: 0.418812 (none: 0; res: 262144; super: 324; other: 0) $ ./mmap /mnt/random-1024 1 $((1<<26)) mmap: 1 pass took: 0.360892 (none: 0; res: 262144; super: 335; other: 0) $ ./mmap /mnt/random-1024 1 $((1<<27)) mmap: 1 pass took: 0.401122 (none: 0; res: 262144; super: 342; other: 0) $ ./mmap /mnt/random-1024 1 $((1<<28)) mmap: 1 pass took: 0.478764 (none: 0; res: 262144; super: 345; other: 0) $ ./mmap /mnt/random-1024 1 $((1<<29)) mmap: 1 pass took: 0.607266 (none: 0; res: 262144; super: 346; other: 0) $ ./mmap /mnt/random-1024 1 $((1<<30)) mmap: 1 pass took: 0.901269 (none: 0; res: 262144; super: 347; other: 0) 5. If I activate madvise(MADV_WILLNEED) immediately after mmap() then I see some number of super pages (the number from test #2). $ ./mmap /mnt/random-1024 5 mmap: 1 pass took: 0.178666 (none: 0; res: 262144; super: 323; other: 0) mmap: 2 pass took: 0.158889 (none: 0; res: 262144; super: 323; other: 0) mmap: 3 pass took: 0.157229 (none: 0; res: 262144; super: 323; other: 0) mmap: 4 pass took: 0.156895 (none: 0; res: 262144; super: 323; other: 0) mmap: 5 pass took: 0.162938 (none: 0; res: 262144; super: 323; other: 0) 6. If I read the file manually before the test then I don't see super pages with any size of block, and madvise(MADV_WILLNEED) doesn't help. $ ./mmap /mnt/random-1024 5 $((1<<30)) mmap: 1 pass took: 0.996767 (none: 0; res: 262144; super: 0; other: 0) mmap: 2 pass took: 0.311129 (none:
Re: Please help me diagnose this crazy VMWare/FreeBSD 8.x crash
On Thu, Mar 29, 2012 at 11:27 AM, Mark Felder f...@feld.me wrote: On Thu, 29 Mar 2012 10:55:36 -0500, Hans Petter Selasky hsela...@c2i.net wrote: It almost sounds like the lost interrupt issue I've seen with USB EHCI devices, though disk I/O should have a retry timeout? What does vmstat -i output? --HPS Here's a server that has a week of uptime and is due for a crash any hour now: root@server:/# vmstat -i interrupt total rate irq1: atkbd0 34 0 irq6: fdc0 9 0 irq15: ata1 34 0 irq16: em1 778061 1 irq17: mpt0 19217711 31 irq18: em0 283674769 460 cpu0: timer 246571507 400 Total 550242125 892 Not so long ago, VMware implemented a clever scheme for reducing the overhead of virtualized interrupts that must be delivered by at least some (if not all) of their emulated storage controllers: http://static.usenix.org/events/atc11/tech/techAbstracts.html#Ahmad Perhaps there is a bad interaction between this scheme and FreeBSD's mpt driver. Alan ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: mmap performance and memory use
On 10/26/2011 06:23, Svatopluk Kraus wrote: Hi, well, I'm working on a new port (arm11 mpcore) and pmap_enter_object() is what I'm debugging right now. And I did not find any way in userland to force the kernel to call pmap_enter_object(), which makes a SUPERPAGE mapping without promotion. I tried to call mmap() with MAP_PREFAULT_READ without success. I tried to call madvise() with MADV_WILLNEED without success too. mmap() should call pmap_enter_object() if MAP_PREFAULT_READ was specified. I'm surprised to hear that it's not happening for you. To make a SUPERPAGE mapping, it's obvious that all physical pages under the SUPERPAGE must be allocated in the vm_object. And the SUPERPAGE mapping must be done before the first access to them, otherwise a promotion is on the way. MAP_PREFAULT_READ does nothing with it. If madvise() is used, vm_object_madvise() is called, but only cached pages are allocated in advance. Of course, an allocation of all physical memory behind a virtual address space in advance is not preferred in most situations. For example, I want to do some computation on a 4M memory space (I know that each byte will be accessed) and want to utilize SUPERPAGE mapping without promotion, to save a 4K page table (i386 machine). However, malloc() leads to promotion, mmap() with MAP_PREFAULT_READ does nothing so the SUPERPAGE mapping is promoted, and madvise() with MADV_WILLNEED calls vm_object_madvise(), but because the pages are not cached (how can they be on anonymous memory), it does not work without promotion either. So, SUPERPAGE mapping without promotions is fine, but it can be done only if the physical memory being mapped is already allocated. Is it really possible to force that in userland? To force the allocation of the physical memory? Right now, the only way is for your program to touch the pages. Moreover, the SUPERPAGE mapping is made read-only at first.
So, even if I have a SUPERPAGE mapping without promotion, the mapping is demoted after the first write, and promoted again after all underlying pages are accessed by write. There is no 4K page-table saving any longer. Yes, that is all true. It is possible to change things so that the page table pages are reclaimed after a time, and not kept around indefinitely. However, this is not high on my personal priority list. Before that, it is more likely that I will add an option to avoid the demotion on write, if we don't have to copy the entire superpage to do so. Alan ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: mmap performance and memory use
On 10/10/2011 4:28 PM, Wojciech Puchar wrote: Notice that vm.pmap.pde.promotions increased by 31. This means that 31 superpage mappings were created by promotion from small page mappings. thank you. i looked at .mappings as it seemed logical to me that it shows the total. In contrast, vm.pmap.pde.mappings counts superpage mappings that are created directly and not by promotion from small page mappings. For example, if a large executable, such as gcc, is resident in memory, the text segment will be pre-mapped using superpage mappings, avoiding soft fault and promotion overhead. Similarly, mmap(..., MAP_PREFAULT_READ) on a large, memory resident file may pre-map the file using superpage mappings. your options are not described in the mmap manpage nor in madvise (MAP_PREFAULT_READ). where can i find the up-to-date manpage or description? A few minutes ago, I merged the changes to support and document MAP_PREFAULT_READ into 8-STABLE. So, now it exists in HEAD, 9.0, and 8-STABLE. Alan ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: mmap performance and memory use
On 10/11/2011 12:36, Mark Tinguely wrote: On 10/11/2011 11:12 AM, Alan Cox wrote: On 10/10/2011 16:28, Wojciech Puchar wrote: is it possible to force VM subsystem to operate on superpages when possible - i mean swapping in 2MB chunks? Currently, no. For some applications, like the Sun/Oracle JVM, that have code to explicitly manage large pages, there could be some benefit in the form of reduced overhead. So, it's on my to do list, but no where near the top of that list. Alan Am I correct in remembering that super-pages have to be aligned on the super-page boundary and be contiguous? Yes. However, if you allocate (or mmap(2)) a large range of virtual memory, e.g., 10 MB, and the start of that range is not aligned on a superpage boundary, the virtual memory system can still promote the four 2 MB sized superpages in the middle of that range. If so, in the mmap(), he may want to include the 'MAP_FIXED' flag with an address that is on a super-page boundary. Right now, the VMFS_ALIGNED_SPACE that does the VA super-page alignment is only used for device pagers. Yes. More precisely, the second, third, etc. mmap(2) should duplicate the alignment of the first mmap(2). In fact, this is what VMFS_ALIGNED_SPACE does. It looks at the alignment of the pages already allocated to the file (or vm object) and attempts to duplicate that alignment. Sooner or later, I will probably make VMFS_ALIGNED_SPACE the default for file types other than devices. Similarly, if the allocated physical pages for the object are not contiguous, then MAP_PREFAULT_READ will not result in a super-page promotion. As described in my earlier e-mail on this topic, in this case, I call these superpage mappings and not superpage promotions, because the virtual system creates a large page mapping, e.g., a 2 MB page table entry, from the start. It does not create small page mappings and then promote them to a large page mapping. 
Alan ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: mmap performance and memory use
On 10/10/2011 16:28, Wojciech Puchar wrote: [quoted exchange about the vm.pmap.pde counters and pre-mapped superpage mappings trimmed; identical to the start of the previous message] your options are not described in the mmap manpage nor in madvise (MAP_PREFAULT_READ). where can i find the up-to-date manpage or description? It is documented in mmap(2) on HEAD and 9.x: MAP_PREFAULT_READ Immediately update the calling process's lowest-level virtual address translation structures, such as its page table, so that every memory resident page within the region is mapped for read access. Ordinarily these structures are updated lazily. The effect of this option is to eliminate any soft faults that would otherwise occur on the initial read accesses to the region. Although this option does not preclude prot from including PROT_WRITE, it does not eliminate soft faults on the initial write accesses to the region. I don't believe that this feature was merged into 8.x. However, there is no technical reason that it can't be merged. is it possible to force the VM subsystem to operate on superpages when possible - i mean swapping in 2MB chunks? Currently, no. For some applications, like the Sun/Oracle JVM, that have code to explicitly manage large pages, there could be some benefit in the form of reduced overhead. So, it's on my to do list, but nowhere near the top of that list.
Alan ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: mmap performance and memory use
On 10/07/2011 12:23, Wojciech Puchar wrote: You are correct about the page table page. However, a superpage mapping consumes a single PV entry, in place of 512 or 1024 PV entries. This winds up saving about three physical pages worth of memory for every superpage mapping. does it actually work? Yes, the sysctl output shows that it is working. You can also verify this with mincore(2). simple test before (only an idle system with 2GB RAM and most of it free) vm.pmap.pde.promotions: 921 vm.pmap.pde.p_failures: 21398 vm.pmap.pde.mappings: 299 vm.pmap.pde.demotions: 596 vm.pmap.shpgperproc: 200 vm.pmap.pv_entry_max: 696561 vm.pmap.pg_ps_enabled: 1 vm.pmap.pat_works: 1 and with that program running (== sleeping) #include <unistd.h> int a[1<<24]; int main() { int b; for (b = 0; b < (1<<24); b++) a[b] = b; sleep(1000); } vm.pmap.pdpe.demotions: 0 vm.pmap.pde.promotions: 952 vm.pmap.pde.p_failures: 21398 vm.pmap.pde.mappings: 299 vm.pmap.pde.demotions: 596 vm.pmap.shpgperproc: 200 vm.pmap.pv_entry_max: 696561 vm.pmap.pg_ps_enabled: 1 vm.pmap.pat_works: 1 it seems like i don't understand what these sysctl things mean (i did sysctl -d) or it doesn't really work. with a program allocating and using a linear 64MB chunk there should be 31 or 32 more mappings in vm.pmap.pde.mappings; there is zero difference. Notice that vm.pmap.pde.promotions increased by 31. This means that 31 superpage mappings were created by promotion from small page mappings. In contrast, vm.pmap.pde.mappings counts superpage mappings that are created directly and not by promotion from small page mappings. For example, if a large executable, such as gcc, is resident in memory, the text segment will be pre-mapped using superpage mappings, avoiding soft fault and promotion overhead. Similarly, mmap(..., MAP_PREFAULT_READ) on a large, memory resident file may pre-map the file using superpage mappings.
Alan ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: mmap performance and memory use
On Thu, Oct 6, 2011 at 11:01 AM, Kostik Belousov kostik...@gmail.com wrote: On Thu, Oct 06, 2011 at 04:41:45PM +0200, Wojciech Puchar wrote: i have a few questions. 1) suppose i map 1TB of address space as anonymous and touch just one page. how much memory is used to manage this? I am not sure how deep the enumeration you want to know, but the first approximation will be: one struct vm_map_entry, one struct vm_object, one pv_entry. Page table structures need four pages for the directories and the page table proper. 2) suppose we have a 1TB file on disk without holes and 10 processes mmap this file into their address space. are just pages shared or can pagetables be shared too? how much memory is used to manage such a situation? Only pages are shared. Pagetables are not. For one thing, this indeed causes more memory use for the OS. This is somewhat mitigated by the automatic use of superpages. Superpage promotion still keeps the 4KB page table around, so most savings from the superpages are due to more efficient use of the TLB. You are correct about the page table page. However, a superpage mapping consumes a single PV entry, in place of 512 or 1024 PV entries. This winds up saving about three physical pages worth of memory for every superpage mapping. Alan ___ freebsd-hackers@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-hackers To unsubscribe, send any mail to freebsd-hackers-unsubscr...@freebsd.org
Re: Memory allocation in kernel -- what to use in which situation? What is the best for page-sized allocations?
On Sun, Oct 2, 2011 at 1:21 PM, m...@freebsd.org wrote: 2011/10/2 Lev Serebryakov l...@freebsd.org: Hello, Freebsd-hackers. There are several memory-allocation mechanisms in the kernel. The two I'm aware of are MALLOC_DEFINE()/malloc()/free() and uma_* (zone(9)). As far as I understand, malloc() is general-purpose, but it has a fixed transaction cost (in terms of memory consumption) for each block allocated, and is not very suitable for allocation of many small blocks, as lots of memory will be wasted on bookkeeping. The zone(9) allocator, on the other hand, has a very low cost per allocated block, but can allocate only pre-configured fixed-size blocks, and is ideal for allocating tons of small objects (and provides an API for reusing them, too!). Am I right? No one has quite answered this question, IMO, so here's my 2 cents. malloc(9) on smaller sizes (<= PAGE_SIZE) uses uma(9) under the covers. There is a set of uma zones for 16, 32, 64, 128, ... PAGE_SIZE bytes and malloc(9) looks up the malloc size in a small array to determine which uma zone to allocate from. So malloc(9) on small sizes doesn't have the overhead of bookkeeping, but it does have the overhead of rounding to the next highest malloc uma bucket. At $WORK we found, for example, that 48 bytes and 96 bytes were very common sizes and so I added uma zones there (and a few other odd sizes determined by using the malloc statistics option). But what if I need to allocate a lot (say, 16K-32K) of page-sized blocks? Not in one chunk, for sure, but over the lifetime of my kernel module. Which allocator should I use? It seems the best one would be a very low-level only-page-sized allocator. Is there any in the kernel? 4k allocations, as has been pointed out, get a single kernel page in both the virtual space and physical space. They (like all the large allocations) use a field in the vm_page for the physical page backing the virtual address to record info about the allocation. 
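The size-class lookup described above can be sketched like this (a simplification: the real malloc(9) uses a precomputed lookup array indexed by request size rather than a loop, and the exact zone list shown here is an assumption):

```c
#include <stddef.h>

/* assumed zone list; the kernel's actual set differs and is tunable */
static const size_t zone_sizes[] = { 16, 32, 64, 128, 256, 512, 1024, 2048, 4096 };

/* return the uma zone size a malloc(9) request would round up to,
   or 0 if the request would take the large-allocation path instead */
static size_t malloc_zone_size(size_t req)
{
    for (size_t i = 0; i < sizeof(zone_sizes) / sizeof(zone_sizes[0]); i++)
        if (req <= zone_sizes[i])
            return zone_sizes[i];
    return 0;
}
```

This also shows why the 48- and 96-byte zones mentioned above pay off: without them, every 48-byte request rounds up to the 64-byte zone and every 96-byte request to the 128-byte zone, wasting a third of each allocation.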
Any allocation PAGE_SIZE and larger will round up to the next multiple of pages and allocate whole pages. IMO the problems here are (1) as was pointed out, TLB shootdown on free(9), and (2) the current algorithm for finding space in a kmem_map is a linear search and doesn't track where there are fragmented chunks, so it's not terribly efficient when finding larger sizes, and the PAGE_SIZE allocations will not fill in fragmented areas. Regarding #2, no, it is not linear; it is an amortized logarithmic first fit. Every node in every vm map, including the kmem map, is augmented with free space information. This is used by the first fit traversal to skip entire subtrees that contain insufficient space. Regards, Alan
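Alan's description of the amortized logarithmic first fit can be illustrated with a toy version (assumed structure and field names, not the real vm_map code): each entry records the free gap after it, and max_free summarizes the largest gap anywhere in the entry's subtree, so the search can skip whole subtrees that cannot satisfy the request.

```c
#include <stddef.h>

/* toy stand-in for an augmented vm_map_entry */
struct entry {
    size_t start, end;      /* mapped range, in pages */
    size_t gap_after;       /* free space between this entry and the next */
    size_t max_free;        /* largest gap_after over this entry's subtree */
    struct entry *left, *right;
};

/* first-fit: return the lowest-addressed entry whose following gap
   holds `len` pages, or NULL; subtrees with max_free < len are skipped
   without being visited, which is where the speedup comes from */
static const struct entry *first_fit(const struct entry *e, size_t len)
{
    if (e == NULL || e->max_free < len)
        return NULL;                    /* entire subtree too fragmented */
    const struct entry *l = first_fit(e->left, len);
    if (l != NULL)
        return l;                       /* lowest address wins */
    if (e->gap_after >= len)
        return e;
    return first_fit(e->right, len);
}
```

In the real tree the max_free augmentation is maintained incrementally on insert, delete, and splay, so a lookup is amortized logarithmic rather than a linear scan of the entry list.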
Re: UMA large allocations issues
On Fri, Jul 22, 2011 at 9:07 PM, Davide Italiano davide.itali...@gmail.com wrote: Hi. I'm a student and some time ago I started investigating a bit the performance/fragmentation issues of large allocations within the UMA allocator. Benchmarks showed that these performance problems are mainly related to the fact that every call to uma_large_malloc() results in a call to kmem_malloc(), and this behaviour is really inefficient. I started doing some work. Here's something: First of all, I tried to define larger zones and let uma do it all as a first step. UMA can allocate slabs of more than one page. So I tried to define zones of 1,2,4,8 pages, moving ZMEM_KMAX up. I tested the solution w/ raidtest. Here are some numbers. Here's the workload characterization: set mediasize=`diskinfo /dev/zvol/tank/vol | awk '{print $3}'` set sectorsize=`diskinfo /dev/zvol/tank/vol | awk '{print $2}'` raidtest genfile -s $mediasize -S $sectorsize -n 50000 # $mediasize = 10737418240 # $sectorsize = 512 Number of READ requests: 24924 Number of WRITE requests: 25076 Number of bytes to transmit: 3305292800 raidtest test -d /dev/zvol/tank/vol -n 4 ## tested using 4 cores, 1.5 GB RAM Results: Number of processes: 4 Bytes per second: 10146896 Requests per second: 153 Results: (4* PAGE_SIZE) Number of processes: 4 Bytes per second: 14793969 Requests per second: 223 Results: (8* PAGE_SIZE) Number of processes: 4 Bytes per second: 6855779 Requests per second: 103 The result of these tests is that defining larger zones is useful until the size of these zones gets too big. Past some size, performance decreases significantly. As a second step, alc@ proposed to create a new layer that sits between UMA and the VM subsystem. This layer can manage a pool of chunks that can be used to satisfy requests from uma_large_malloc(), avoiding the overhead due to kmem_malloc() calls. I've recently started developing a patch (not yet fully working) that implements this layer. 
First of all I'd like to concentrate my attention on the performance problem rather than the fragmentation one. So the patch that I actually started to write doesn't care about fragmentation aspects. http://davit.altervista.org/uma_large_allocations.patch There are some questions which I wasn't able to answer (for example, when it's safe to call kmem_malloc() to request memory). In this context, there is really only one restriction. Your page_alloc_new() should never call kmem_malloc() with M_WAITOK if your bitmap_mtx lock is held. It may only call kmem_malloc() with M_NOWAIT if your bitmap_mtx lock is held. That said, I would try to structure the code so that you're not doing any kmem_malloc() calls with the bitmap_mtx lock held. So, at the end of the day I'm asking for your opinion about this issue and I'm looking for a mentor (some kind of guidance) to continue this project. If someone is interested in helping, it would be very appreciated. I will take a closer look at your patch later today, and send you comments. Alan
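The locking rule Alan states can be sketched in userspace, with a pthread mutex standing in for bitmap_mtx and malloc(3) standing in for kmem_malloc() (pool_get/pool_put are hypothetical names, not from the patch): the potentially sleeping allocation only ever happens with the lock dropped.

```c
#include <pthread.h>
#include <stdlib.h>

#define CHUNK_SIZE 4096
#define POOL_MAX   8

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
static void *pool_free[POOL_MAX];
static int pool_nfree;

void *pool_get(void)
{
    void *p = NULL;

    pthread_mutex_lock(&pool_lock);
    if (pool_nfree > 0)
        p = pool_free[--pool_nfree];    /* fast path: reuse a cached chunk */
    pthread_mutex_unlock(&pool_lock);

    if (p == NULL)
        p = malloc(CHUNK_SIZE);         /* sleeping allocation: lock NOT held */
    return p;
}

void pool_put(void *p)
{
    pthread_mutex_lock(&pool_lock);
    if (pool_nfree < POOL_MAX) {
        pool_free[pool_nfree++] = p;    /* cache it for the next pool_get() */
        p = NULL;
    }
    pthread_mutex_unlock(&pool_lock);

    if (p != NULL)
        free(p);                        /* pool full; release outside the lock */
}
```

The equivalent in-kernel structure would either use M_NOWAIT under the lock and fall back, or, as Alan suggests, arrange the code so kmem_malloc() is never reached with the lock held at all.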
Re: SMP question w.r.t. reading kernel variables
On Wed, Apr 20, 2011 at 7:42 AM, Rick Macklem rmack...@uoguelph.ca wrote: On Tue, Apr 19, 2011 at 12:00:29PM +, freebsd-hackers-requ...@freebsd.org wrote: Subject: Re: SMP question w.r.t. reading kernel variables To: Rick Macklem rmack...@uoguelph.ca Cc: freebsd-hackers@freebsd.org Message-ID: 201104181712.14457@freebsd.org [John Baldwin] On Monday, April 18, 2011 4:22:37 pm Rick Macklem wrote: On Sunday, April 17, 2011 3:49:48 pm Rick Macklem wrote: ... All of this makes sense. What I was concerned about was memory cache consistency and what (if anything) has to be done to make sure a thread doesn't see a stale cached value for the memory location. Here's a generic example of what I was thinking of: (assume x is a global int and y is a local int on the thread's stack) - time proceeds down the screen thread X on CPU 0 thread Y on CPU 1 x = 0; x = 0; /* 0 for x's location in CPU 1's memory cache */ x = 1; y = x; -- now, is y guaranteed to be 1 or can it get the stale cached 0 value? if not, what needs to be done to guarantee it? Well, the bigger problem is getting the CPU and compiler to order the instructions such that they don't execute out of order, etc. Because of that, even if your code has 'x = 0; x = 1;' as adjacent statements in thread X, the 'x = 1' may actually execute a good bit after the 'y = x' on CPU 1. Actually, as I recall the rules for C, it's worse than that. For this (admittedly simplified) scenario, x=0; in thread X may never execute unless it's declared volatile, as the compiler may optimize it out and emit no code for it. Locks force that to synchronize as the CPUs coordinate around the lock cookie (e.g. the 'mtx_lock' member of 'struct mutex'). Also, I see cases of: mtx_lock(np); np->n_attrstamp = 0; mtx_unlock(np); in the regular NFS client. Why is the assignment mutex locked? (I had assumed it was related to the above memory caching issue, but now I'm not so sure.) 
In general I think writes to data that are protected by locks should always be protected by locks. In some cases you may be able to read data using weaker locking than writing (where no locking can be a form of weaker locking, but also a read/shared lock is weak, and if a variable is protected by multiple locks, then any single lock is weak, but sufficient for reading, while all of the associated locks must be held for writing), but writing generally requires full locking (write locks, etc.). Oops, I now see that you've differentiated between writing and reading. (I mistakenly just stated that you had recommended a lock for reading. Sorry about my misinterpretation of the above on the first quick read.) What he said. In addition to all that, lock operations generate atomic barriers which a compiler or optimizer is prevented from moving code across. All good and useful comments, thanks. The above example was meant to be contrived, to indicate what I was worried about w.r.t. memory caches. Here's a somewhat simplified version of what my actual problem is: (Mostly fyi, in case you are interested.) Thread X is doing a forced dismount of an NFS volume, it (in dounmount()): - sets MNTK_UNMOUNTF - calls VFS_SYNC()/nfs_sync() - so this doesn't get hung on an unresponsive server it must test for MNTK_UNMOUNTF and return an error if it is set. This seems fine, since it is the same thread and in a called function. (I can't imagine that the optimizer could move setting of a global flag to after a function call which might use it.) - calls VFS_UNMOUNT()/nfs_unmount() - now the fun begins... after some other stuff, it calls nfscl_umount() to get rid of the state info (opens/locks...) nfscl_umount() - synchronizes with other threads that will use this state (see below) using the combination of a mutex and a shared/exclusive sleep lock. 
(Because of various quirks in the code, this shared/exclusive lock is a locally coded version and I happened to call the shared case a refcnt and the exclusive case just a lock.) Other threads that will use state info (open/lock...) will: - call nfscl_getcl() - this function does two things that are relevant 1 - it allocates a new clientid, as required, while holding the mutex - this case needs to check for MNTK_UNMOUNTF and return an error, in case the clientid has already been deleted by nfscl_umount() above. (This happens before #2 because the sleep lock is in the clientid structure.) -- it must see MNTK_UNMOUNTF set if it happens after (in a temporal sense) being set by dounmount() 2 - while holding the mutex, it acquires the shared lock
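The visibility question that opened this thread (is y guaranteed to see x = 1?) maps onto the C11 memory model; a minimal sketch of the publish/observe pattern that lock acquire/release provides implicitly, using explicit release/acquire atomics:

```c
#include <pthread.h>
#include <stdatomic.h>

static int data;            /* plain variable, published via the flag below */
static atomic_int ready;    /* release/acquire pair orders the accesses */

static void *writer(void *arg)
{
    (void)arg;
    data = 42;                                               /* 1: write payload */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* 2: publish */
    return NULL;
}

static int reader(void)
{
    /* spin until the writer publishes; the acquire load pairs with the
       release store, so the read of `data` below cannot see a stale 0 */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;
    return data;
}
```

A mutex gives the same guarantee for free: unlock is a release, lock is an acquire, which is exactly why the mtx_lock()/assignment/mtx_unlock() pattern in the NFS client is sufficient, and why a plain unlocked read of a plain variable is not.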
Re: Question about Reverse Mappings in FreeBSD.
On Fri, Mar 18, 2011 at 7:30 PM, J L dimitar9...@gmail.com wrote: I read an article about Reverse Mappings technique in memory management part. It improves a lot from Linux 2.4 to 2.6. I am wondering is FreeBSD also have this feature? Which source files should I go to find these? I want to do some study on this. Wish someone can enlighten me. Thank you. Reverse mappings are implemented by the machine-dependent layer of the virtual memory system, which is called the pmap. Look for files named pmap.c in the source tree, such as sys/amd64/amd64/pmap.c. In particular, look for the code that manages pv entries. Alan
Re: Analyzing wired memory?
On Tue, Feb 8, 2011 at 6:20 AM, Ivan Voras ivo...@freebsd.org wrote: Is it possible to track by some way what kernel system, process or thread has wired memory? (including data exists but needs code to extract it) No. I'd like to analyze a system where there is a lot of memory wired but not accounted for in the output of vmstat -m and vmstat -z. There are no user processes which would lock memory themselves. Any pointers? Have you accounted for the buffer cache? Alan
Re: Analyzing wired memory?
On 2/8/2011 12:27 PM, Robert Watson wrote: On Tue, 8 Feb 2011, Alan Cox wrote: On Tue, Feb 8, 2011 at 6:20 AM, Ivan Voras ivo...@freebsd.org wrote: Is it possible to track by some way what kernel system, process or thread has wired memory? (including data exists but needs code to extract it) No. I'd like to analyze a system where there is a lot of memory wired but not accounted for in the output of vmstat -m and vmstat -z. There are no user processes which would lock memory themselves. Any pointers? Have you accounted for the buffer cache? John and I have occasionally talked about making procstat -v work on the kernel; conceivably it could also export a wired page count for mappings where it makes sense. Ideally procstat would drill in a bit and allow you to see things at least at the granularity of "this page range was allocated to UMA". I would certainly have found this useful on a few occasions, and would gladly help out with implementing it. For example, it would help us in understanding the kmem_map fragmentation caused by ZFS. That said, I'm not sure how you will represent the case where UMA allocates physical memory directly and uses the direct map to access it. Alan
Re: [rfc] allow to boot with = 256GB physmem
On Fri, Jan 21, 2011 at 11:44 AM, John Baldwin j...@freebsd.org wrote: On Friday, January 21, 2011 11:09:10 am Sergey Kandaurov wrote: Hello. Some time ago I faced a problem booting with 400GB physmem. The problem is that the vm.max_proc_mmap type overflows with such a high value, and that results in a broken mmap() syscall. The max_proc_mmap value is a signed int and roughly calculated at vmmapentry_rsrc_init() as a u_long vm_kmem_size quotient: vm_kmem_size / sizeof(struct vm_map_entry) / 100. Although at the time it was introduced in svn r57263 the value was quite low (f.e. the related commit log states: The value defaults to around 9000 for a 128MB machine.), the problem is observed on amd64 where KVA space after r212784 is effectively bound to the physical memory size. With INT_MAX here being 0x7fffffff, and sizeof(struct vm_map_entry) being 120, it's enough to have slightly less than 256GB to be able to reproduce the problem. I rewrote vmmapentry_rsrc_init() to set a large enough limit for max_proc_mmap just to protect from integer type overflow. As it's also possible to live tune this value, I also added a simple anti-shoot constraint to its sysctl handler. I'm not sure though if it's worth committing the second part. As this patch may cause some bikeshedding, I'd like to hear your comments before I commit it. http://plukky.net/~pluknet/patches/max_proc_mmap.diff Is there any reason we can't just make this variable and sysctl a long? Or just delete it. 1. Contrary to what the commit message says, this sysctl does not effectively limit the number of vm map entries. It only limits the number that are created by one system call, mmap(). Other system calls create vm map entries just as easily, for example, mprotect(), madvise(), mlock(), and minherit(). Basically, anything that alters the properties of a mapping. 
Thus, in 2000, after this sysctl was added, the same resource exhaustion induced crash could have been reproduced by trivially changing the program in PR/16573 to do an mprotect() or two. In a nutshell, if you want to really limit the number of vm map entries that a process can allocate, the implementation is a bit more involved than what was done for this sysctl. 2. UMA implements M_WAITOK, whereas the old zone allocator in 2000 did not. Moreover, vm map entries for user maps are allocated with M_WAITOK. So, the exact crash reported in PR/16573 couldn't happen any longer. 3. We now have the vmemoryuse resource limit. When this sysctl was defined, we didn't. Limiting the virtual memory indirectly but effectively limits the number of vm map entries that a process can allocate. In summary, I would do a little due diligence, for example, run the program from PR/16573 with the limit disabled. If you can't reproduce the crash, in other words, nothing contradicts point #2 above, then I would just delete this sysctl. Alan
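For reference, the overflow Sergey describes can be reproduced in a few lines (VM_MAP_ENTRY_SIZE and both function names are assumptions for illustration; the wraparound behavior of the out-of-range conversion is implementation-defined, though consistent on common compilers). With a 120-byte vm_map_entry, the intermediate quotient exceeds INT_MAX once vm_kmem_size passes roughly 240GB, matching the "slightly less than 256GB" observation.

```c
#include <stdint.h>

#define VM_MAP_ENTRY_SIZE 120   /* sizeof(struct vm_map_entry) on amd64, per the report */

/* the old shape of the calculation: the intermediate quotient is stored
   in a signed int, which wraps once vm_kmem_size / 120 exceeds INT_MAX */
static int broken_limit(uint64_t vm_kmem_size)
{
    int max_proc_mmap = (int)(vm_kmem_size / VM_MAP_ENTRY_SIZE); /* wraps here */
    max_proc_mmap /= 100;
    return max_proc_mmap;
}

/* doing the whole computation in a 64-bit type avoids the overflow,
   which is essentially jhb's "just make it a long" suggestion */
static long fixed_limit(uint64_t vm_kmem_size)
{
    return (long)(vm_kmem_size / VM_MAP_ENTRY_SIZE / 100);
}
```

At 400GB, fixed_limit() yields 35791394 while broken_limit() wraps to a garbage (negative) value, which is what broke mmap() on the 400GB machine.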
Re: [rfc] allow to boot with = 256GB physmem
On Fri, Jan 21, 2011 at 2:58 PM, Alan Cox alan.l@gmail.com wrote: On Fri, Jan 21, 2011 at 11:44 AM, John Baldwin j...@freebsd.org wrote: On Friday, January 21, 2011 11:09:10 am Sergey Kandaurov wrote: Hello. Some time ago I faced a problem booting with 400GB physmem. The problem is that the vm.max_proc_mmap type overflows with such a high value, and that results in a broken mmap() syscall. The max_proc_mmap value is a signed int and roughly calculated at vmmapentry_rsrc_init() as a u_long vm_kmem_size quotient: vm_kmem_size / sizeof(struct vm_map_entry) / 100. Although at the time it was introduced in svn r57263 the value was quite low (f.e. the related commit log states: The value defaults to around 9000 for a 128MB machine.), the problem is observed on amd64 where KVA space after r212784 is effectively bound to the physical memory size. With INT_MAX here being 0x7fffffff, and sizeof(struct vm_map_entry) being 120, it's enough to have slightly less than 256GB to be able to reproduce the problem. I rewrote vmmapentry_rsrc_init() to set a large enough limit for max_proc_mmap just to protect from integer type overflow. As it's also possible to live tune this value, I also added a simple anti-shoot constraint to its sysctl handler. I'm not sure though if it's worth committing the second part. As this patch may cause some bikeshedding, I'd like to hear your comments before I commit it. http://plukky.net/~pluknet/patches/max_proc_mmap.diff Is there any reason we can't just make this variable and sysctl a long? Or just delete it. 1. Contrary to what the commit message says, this sysctl does not effectively limit the number of vm map entries. It only limits the number that are created by one system call, mmap(). Other system calls create vm map entries just as easily, for example, mprotect(), madvise(), mlock(), and minherit(). Basically, anything that alters the properties of a mapping. 
Thus, in 2000, after this sysctl was added, the same resource exhaustion induced crash could have been reproduced by trivially changing the program in PR/16573 to do an mprotect() or two. In a nutshell, if you want to really limit the number of vm map entries that a process can allocate, the implementation is a bit more involved than what was done for this sysctl. 2. UMA implements M_WAITOK, whereas the old zone allocator in 2000 did not. Moreover, vm map entries for user maps are allocated with M_WAITOK. So, the exact crash reported in PR/16573 couldn't happen any longer. Actually, I take back part of what I said here. The old zone allocator did implement something like M_WAITOK, and that appears to have been used for user maps. However, the crash described in PR/16573 was actually on the allocation of a vm map entry within the *kernel* address space for a process U area. This type of allocation did not use the old zone allocator's equivalent to M_WAITOK. However, we no longer have U areas, so the exact crash scenario is clearly no longer possible. Interestingly, the sysctl in question has no direct effect on the allocation of kernel vm map entries. So, I remain skeptical that this sysctl is preventing any resource exhaustion based panics in the current kernel. Again, I would be thrilled to see one or more people do some testing, such as rerunning the program from PR/16573. 3. We now have the vmemoryuse resource limit. When this sysctl was defined, we didn't. Limiting the virtual memory indirectly but effectively limits the number of vm map entries that a process can allocate. In summary, I would do a little due diligence, for example, run the program from PR/16573 with the limit disabled. If you can't reproduce the crash, in other words, nothing contradicts point #2 above, then I would just delete this sysctl. 
Alan
Re: vm_map_findspace space search?
On Wed, Dec 15, 2010 at 7:37 PM, Venkatesh Srinivas vsrini...@dragonflybsd.org wrote: Hi, In svn r133636, there was a commit to convert the linear search in vm_map_findspace() to use the vm_map splay tree. Just curious, were there any discussions about that particular change? Any measurements other than the ones noted in the commit log? Any notes on why that design was used rather than any other? I've seen the 'Some mmap observations...' thread from about a year earlier and was wondering about some of the possible designs discussed there. In particular the Bonwick/Adams vmem allocator was brought up; I think that something inspired by it (and strangely close to the QuickFit malloc) would be appropriate. Here's how I see it working: * Add a series of power-of-two or logarithmic sized freelists to the vm_map structure; they would point to vm_map_entries immediately to the left of free space holes of a given size. * Finding free space would just pop an entry off of the free space list and split in the usual way; deletion could coalesce in constant time. * Unlike the vmem allocator, we would not need to allocate external boundary tags; the vm_map_entries themselves would be the tags. At least from looking at the pattern of vm_map_findspace()s on DFly, the most common requests were for 1 page, 4 page, and 16 page-sized holes (iirc these combined for 75% of requests there; I imagine the pattern in FreeBSD would be very similar). The fragmentation concerns from this would be pretty minor with that pattern... I'm afraid that the pattern is not always so simple. Sometimes external fragmentation is, in fact, a problem. For example, search for ZFS ARC kmem_map fragmentation. I recall there being at least one particularly detailed e-mail that quantified the fragmentation being caused by the ZFS ARC. There are also microbenchmarks that simulate an mmap() based web server, which will show a different pattern than you describe. 
If you're interested in working on something in this general area, I can suggest something. Alan
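A minimal sketch of the QuickFit-style scheme Venkatesh proposes (assumed names and a fixed list count; the real proposal would hang these lists off struct vm_map and use the vm_map_entries themselves as the boundary tags):

```c
#include <stddef.h>

#define NLISTS 8                /* lists for holes of ~1, 2, 4, ... pages */

struct hole {
    size_t start;               /* first free page of the hole */
    size_t npages;              /* size of the hole in pages */
    struct hole *next;
};

static struct hole *freelists[NLISTS];

/* size class of a hole of n pages: largest i with 2^i <= n (floor log2) */
static int idx_floor(size_t n)
{
    int i = 0;
    while ((2UL << i) <= n && i < NLISTS - 1)
        i++;
    return i;
}

static void hole_insert(struct hole *h)
{
    int i = idx_floor(h->npages);
    h->next = freelists[i];
    freelists[i] = h;
}

/* pop a hole of at least n pages, scanning from the smallest viable
   class upward; the caller would split off the remainder and reinsert
   it, and coalescing on deletion runs the same machinery in reverse */
static struct hole *hole_find(size_t n)
{
    for (int i = idx_floor(n); i < NLISTS; i++) {
        for (struct hole **hp = &freelists[i]; *hp != NULL; hp = &(*hp)->next) {
            if ((*hp)->npages >= n) {
                struct hole *h = *hp;
                *hp = h->next;          /* unlink and hand back */
                return h;
            }
        }
    }
    return NULL;
}
```

With the 1/4/16-page request pattern described above, almost every lookup is a constant-time pop from the head of one list, which is the appeal of the design; Alan's reply is about the workloads where this size distribution assumption breaks down.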
Re: i386 pmap_zero_page() late sched_pin()?
On Sun, Dec 12, 2010 at 10:40 AM, Venkatesh Srinivas vsrini...@dragonflybsd.org wrote: Hi, In the i386 pmap's pmap_zero_page(), there is a fragment... sysmaps = sysmaps_pcpu[PCPU_GET(cpuid)]; mtx_lock(&sysmaps->lock); sched_pin(); /* map the page we mean to zero at sysmaps->CADDR2 */ pagezero(sysmaps->CADDR2); sched_unpin(); I don't know this bit of code too well, so I don't know if the sched_pin() being where it is is okay or not. My first reading says it's not okay; if a thread is moved to another CPU before it is able to pin, it will use the wrong sysmaps structure. Is this the case? Is it alright that the wrong sysmaps structure is used? Oh, Nathaniel Filardo (n...@cs.jhu.edu) first spotted this, not I. This isn't a bug. There is nothing about the code that mandates that processor i must always use sysmap entry i. In the unlikely event that the thread migrates from processor X to processor Y before the sched_pin(), the mutex on sysmap entry X will prevent it from being used by processor X until processor Y is done with it. So, it doesn't matter to correctness that the wrong sysmap entry is used, and it is extremely unlikely to matter to performance. Alan
Re: vm_object ref_count question
On Thu, Oct 28, 2010 at 2:48 AM, Eknath Venkataramani eknath.i...@gmail.com wrote: ref_count is defined inside struct vm_object, and it is incremented every time the object is referenced. How is the page reference logged then? rather, in which variable? There is no per-page reference. There is, however, a garbage-collection-like process performed by vm_object_collapse(). Regards, Alan
Re: Contiguous physical memory
On Wed, Oct 27, 2010 at 2:17 PM, Dr. Baud drb...@yahoo.com wrote: Can anyone suggest a method/formula to monitor contiguous physical memory allocations such that one could predict when contigmalloc(), make that bus_dmamem_alloc, might fail? From the command line you can obtain this information with sysctl vm.phys_free. Alan
Re: page table fault, which should map kernel virtual address space
On Thu, Sep 30, 2010 at 6:28 AM, Svatopluk Kraus onw...@gmail.com wrote: On Tue, Sep 21, 2010 at 7:38 PM, Alan Cox alan.l@gmail.com wrote: On Mon, Sep 20, 2010 at 9:32 AM, Svatopluk Kraus onw...@gmail.com wrote: Beyond 'kernel_map', some submaps of 'kernel_map' (buffer_map, pager_map,...) exist as a result of 'kmem_suballoc' function calls. When these submaps are used (for example by the 'kmem_alloc_nofault' function) and their virtual address subspace is at the end of the used kernel virtual address space at the moment (and above the 'NKPT' preallocation), then missing page tables are not allocated and a double fault can happen. No, the page tables are allocated. If you create a submap X of the kernel map using kmem_suballoc(), then a vm_map_findspace() is performed by vm_map_find() on the kernel map to find space for the submap X. As you note above, the call to vm_map_findspace() on the kernel map will call pmap_growkernel() if needed to extend the kernel page table. If you create another submap X' of X, then that submap X' can only map addresses that fall within the range for X. So, any necessary page table pages were allocated when X was created. You are right. Mea culpa. I was focused on a solution and made too quick a conclusion. The page table fault hit in 'pager_map', which is a submap of 'clean_map', and when I debugged the problem I didn't see the submap stuff as a whole. That said, there may actually be a problem with the implementation of the superpage_align parameter to kmem_suballoc(). If a submap is created with superpage_align equal to TRUE, but the submap's size is not a multiple of the superpage size, then vm_map_find() may not allocate a page table page for the last megabyte or so of the submap. There are only a few places where kmem_suballoc() is called with superpage_align set to TRUE. If you changed them to FALSE, that is an easy way to test this hypothesis. Yes, it helps. 
My story is that the problem started up when I updated a project (the 'coldfire' port) based on FreeBSD 8.0 to the FreeBSD current version. In the current version the 'clean_map' submap is created with superpage_align set to TRUE. I have looked at vm_map_find() and debugged the page table fault once again. IMO, it looks like the do-while loop does not work in the function as intended. vm_map_findspace() finds a space and calls pmap_growkernel() if needed. pmap_align_superpage() arranges the space but never calls pmap_growkernel(). vm_map_insert() inserts the aligned space into the map without error, never calls pmap_growkernel(), and does not trigger another loop iteration. I don't know too much about how the virtual memory model is implemented and used in other modules. But it seems that it could be more reasonable to align the address space in vm_map_findspace() internally and not to loop externally. I have tried to add a check in vm_map_insert() that checks the 'end' parameter against the 'kernel_vm_end' variable and returns a KERN_NO_SPACE error if needed. In this case the loop in vm_map_find() works and I have no problem with the page table fault. But the 'kernel_vm_end' variable must be initialized properly before the first use of vm_map_insert(). The 'kernel_vm_end' variable can be self-initialized in pmap_growkernel() in FreeBSD 8.0 (it is too late), but it was changed in the current version (the 'i386' port). Thanks for your answer, but I'm still looking for a permanent and approved solution. I have a patch that implements one possible fix for this problem. I'll probably commit that patch in the next day or two. Regards, Alan
Re: Examining the VM splay tree effectiveness
On Thu, Sep 30, 2010 at 12:37 PM, Andre Oppermann an...@freebsd.org wrote: On 30.09.2010 18:37, Andre Oppermann wrote: Just for the kick of it I decided to take a closer look at the use of splay trees (inherited from Mach if I read the history correctly) in the FreeBSD VM system, suspecting an interesting journey. Correcting myself regarding the history: The splay tree for vmmap was done about 8 years ago by alc@ to replace a simple linked list and was a huge improvement. The change in vmpage from a hash to the same splay tree as in vmmap was committed by dillon@ about 7.5 years ago with some involvement of a...@. Yes, and there is a substantial difference in the degree of locality of access to these different structures, and thus the effectiveness of a splay tree. When I did the last round of changes to the locking on the vm map, I made some measurements of the splay tree's performance on a JVM running a moderately large bioinformatics application. The upshot was that the average number of map entries visited on an access to the vm map's splay tree was less than the expected depth of a node in a perfectly balanced tree. I teach class shortly. I'll provide more details later. Regards, Alan
Re: zfs + uma
On Tue, Sep 21, 2010 at 1:39 AM, Jeff Roberson jrober...@jroberson.net wrote: On Tue, 21 Sep 2010, Andriy Gapon wrote: on 19/09/2010 11:42 Andriy Gapon said the following: on 19/09/2010 11:27 Jeff Roberson said the following: I don't like this because even with very large buffers you can still have high enough turnover to require per-cpu caching. Kip specifically added UMA support to address this issue in zfs. If you have allocations which don't require per-cpu caching and are very large why even use UMA? Good point. Right now I am running with a 4 items/bucket limit for items larger than 32KB. But I also have two counter-points actually :) 1. Uniformity. E.g. you can handle all ZFS I/O buffers via the same mechanism regardless of buffer size. 2. (Open)Solaris does that for a while and it seems to suit them well. Not saying that they are perfect, or the best, or an example to follow, but still that means quite a bit (for me). I'm afraid there is not enough context here for me to know what 'the same mechanism' is or what Solaris does. Can you elaborate? I prefer not to take the weight of specific examples too heavily when considering the allocator as it must handle many cases and many types of systems. I believe there are cases where you want large allocations to be handled by per-cpu caches, regardless of whether ZFS is one such case. If ZFS does not need them, then it should simply allocate directly from the VM. However, I don't want to introduce some maximum constraint unless it can be shown that adequate behavior is not generated from some more adaptable algorithm. Actually, I think that there is a middle ground between per-cpu caches and directly from the VM that we are missing. When I've looked at the default configuration of ZFS (without the extra UMA zones enabled), there is an incredible amount of churn on the kmem map caused by the implementation of uma_large_malloc() and uma_large_free() going directly to the kmem map. 
Not only are the obvious things happening, like allocating and freeing kernel virtual addresses and underlying physical pages on every call, but also system-wide TLB shootdowns and sometimes superpage demotions are occurring. I have some trouble believing that the large allocations being performed by ZFS really need per-CPU caching, but I can certainly believe that they could benefit from not going directly to the kmem map on every uma_large_malloc() and uma_large_free(). In other words, I think it would make a lot of sense to have a thin layer between UMA and the kmem map that caches allocated but unused ranges of pages. Regards, Alan
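The "thin layer" Alan proposes is essentially a cache of released address ranges sitting between the consumer and the backing allocator, so that a freed range can be handed back out without a new mapping (and without the TLB shootdown that tearing one down implies). A minimal userspace model of the idea — all names here (range_cache, rc_alloc, rc_free, backend_alloc) are invented for illustration, and malloc() stands in for the kmem map:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define RC_CLASSES 8
#define RC_SHIFT   16            /* size classes in 64KB steps */

struct range_cache {
    void *slot[RC_CLASSES];      /* one cached range per size class */
    size_t hits, misses;
};

/* Stand-in for allocating fresh KVA + pages from the kmem map. */
static void *backend_alloc(size_t sz) { return malloc(sz); }

static void *rc_alloc(struct range_cache *rc, size_t sz)
{
    size_t c = sz >> RC_SHIFT;
    if (c < RC_CLASSES && rc->slot[c] != NULL) {
        void *p = rc->slot[c];   /* reuse a still-mapped range: */
        rc->slot[c] = NULL;      /* no new mapping, no TLB shootdown */
        rc->hits++;
        return p;
    }
    rc->misses++;
    return backend_alloc(sz);
}

static void rc_free(struct range_cache *rc, void *p, size_t sz)
{
    size_t c = sz >> RC_SHIFT;
    if (c < RC_CLASSES && rc->slot[c] == NULL) {
        rc->slot[c] = p;         /* keep the range mapped for reuse */
        return;
    }
    free(p);                     /* cache occupied: release to backend */
}
```

A real kernel version would bound the total cached KVA and drain the cache under address-space pressure; the point of the sketch is only that a hit avoids the expensive map/unmap round trip Alan describes.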
Re: page table fault, which should map kernel virtual address space
On Mon, Sep 20, 2010 at 9:32 AM, Svatopluk Kraus onw...@gmail.com wrote: Hello, this is about the 'NKPT' definition, 'kernel_map' submaps, and the 'vm_map_findspace' function. The 'kernel_map' variable is used to manage kernel virtual address space. When the 'vm_map_findspace' function deals with 'kernel_map', the 'pmap_growkernel' function is called. At least on the 'i386' architecture, the pmap implementation uses the 'pmap_growkernel' function to allocate missing page tables. Missing page tables are a problem, because no one checks the 'pte' pointer for validity after use of the 'vtopte' macro. The 'NKPT' definition defines the number of page tables preallocated during system boot. Beyond 'kernel_map', some submaps of 'kernel_map' (buffer_map, pager_map, ...) exist as the result of 'kmem_suballoc' function calls. When these submaps are used (for example by the 'kmem_alloc_nofault' function) and their virtual address subspace is at the end of the used kernel virtual address space at the moment (and above the 'NKPT' preallocation), then missing page tables are not allocated and a double fault can happen. No, the page tables are allocated. If you create a submap X of the kernel map using kmem_suballoc(), then a vm_map_findspace() is performed by vm_map_find() on the kernel map to find space for the submap X. As you note above, the call to vm_map_findspace() on the kernel map will call pmap_growkernel() if needed to extend the kernel page table. If you create another submap X' of X, then that submap X' can only map addresses that fall within the range for X. So, any necessary page table pages were allocated when X was created. That said, there may actually be a problem with the implementation of the superpage_align parameter to kmem_suballoc(). If a submap is created with superpage_align equal to TRUE, but the submap's size is not a multiple of the superpage size, then vm_map_find() may not allocate a page table page for the last megabyte or so of the submap.
There are only a few places where kmem_suballoc() is called with superpage_align set to TRUE. If you changed them to FALSE, that is an easy way to test this hypothesis. Regards, Alan
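The bound that NKPT (now the nkpt variable, per the patch at the top of this digest) places on safe vtopte() use can be sketched as a single range check; the constants mirror amd64, and the helper name is invented for illustration:

```c
#include <assert.h>

#define NBPDR    (1UL << 21)              /* bytes mapped per PDE on amd64 (2MB) */
#define KERNBASE 0xffffffff80000000UL

/* Illustrative sketch: before any pmap_growkernel() call, a kernel VA
 * has a preallocated page table (so vtopte() yields a valid pointer)
 * only while it lies below KERNBASE + nkpt * NBPDR — the same bound
 * the minidump code walks in the patch above. */
static int pte_preallocated(unsigned long va, int nkpt)
{
    return (va >= KERNBASE &&
        va < KERNBASE + (unsigned long)nkpt * NBPDR);
}
```

With the historical NKPT of 32, that covers only the first 64MB above KERNBASE — which is exactly why dereferencing a vtopte() result for an address beyond the preallocated (or grown) range faults.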
Re: UMA allocations from a specific physical range
m...@freebsd.org wrote: [snip] IIRC the memory from vm_phys_alloc_contig() can be released like any other page; the interface should just be fetching a specific page. How far off is the page wire count? I'm assuming it's hitting the assert that it's 1? I think vm_page_free() is the right interface to free the page again, so the wire count being off presumably means someone else wired it on you; do you know what code did it? If no one else has a reference to the page anymore then setting the wire count to 1, while a hack, should be safe. Yes, vm_page_free() can be used to free a single page that was returned by vm_phys_alloc_contig(). Alan
Re: Intel TurboBoost in practice
On Mon, Jul 26, 2010 at 9:11 AM, Alexander Motin m...@freebsd.org wrote: Robert Watson wrote: On Sun, 25 Jul 2010, Alexander Motin wrote: The numbers that you are showing don't show much difference. Have you tried buildworld? If you mean relative difference -- as I have told, it's mostly because of my CPU. Its maximal boost is 266MHz (8.3%), but 133MHz of that is enabled most of the time if the CPU is not overheated. It probably doesn't overheat, as it works on a clear table under an air conditioner. So the maximal effect I can expect is 4.2%. In such a situation, 2.8% is probably not so bad to illustrate that the feature works and that there is space for further improvements. If I had a Core i5-750S I would expect a 33% boost. Can I recommend the use of ministat(1) and sample sizes of at least 8 runs per configuration? Thanks for pushing me to do it right. :) Here are 3*15 runs with a fresh kernel with debugging disabled. Results are quite close to the original: -2.73% and -2.19% of time. x = C1, + = C2, * = C3 [ministat distribution plot omitted; summary statistics follow]
    N      Min      Max   Median        Avg        Stddev
x  15    12.68    12.84    12.69  12.698667   0.039254966
+  15    12.35    12.36    12.35  12.351333  0.0035186578
Difference at 95.0% confidence: -0.347333 +/- 0.0208409, -2.7352% +/- 0.164119% (Student's t, pooled s = 0.0278687)
*  15    12.41    12.44    12.42      12.42  0.0075592895
Difference at 95.0% confidence: -0.278667 +/- 0.0211391, -2.19446% +/- 0.166467% (Student's t, pooled s = 0.0282674)
I also checked one more aspect -- TurboBoost works only when the CPU runs at its highest EIST frequency (P0 state).
I've reduced dev.cpu.0.freq from 3201 to 3067 and repeated the test: x = C1, + = C2, * = C3 [ministat distribution plot omitted; summary statistics follow]
    N      Min      Max   Median        Avg        Stddev
x  15    13.72    13.73    13.72      13.72  0.0048795004
+  15    13.79    13.82     13.8      13.80  0.0072374686
Difference at 95.0% confidence: 0.08 +/- 0.00461567, 0.582949% +/- 0.0336337% (Student's t, pooled s = 0.00617213)
*  15    13.89     13.9    13.89     13.894  0.0050709255
Difference at 95.0% confidence: 0.170667 +/- 0.00372127, 1.24362% +/- 0.0271164% (Student's t, pooled s = 0.00497613)
In that case, using C2 or C3 predictably caused a small performance reduction, since after falling asleep a CPU needs time to wake up. Even if the tested CPU0 never sleeps during the test, its TLB shootdown IPIs to the other cores could still suffer from waiting for the other cores to wake up. In the deeper sleep states, are the TLB contents actually maintained while the processor sleeps? (I notice that in some configurations, we actually flush dirty data from the cache before sleeping.) Alan
Re: Intel TurboBoost in practice
2010/7/24 Alexander Motin m...@freebsd.org Hi. I've made some small observations of the Intel TurboBoost technology under FreeBSD. This technology allows Intel Core i5/i7 CPUs to raise the frequency of some cores if the other cores are idle and power/thermal conditions permit. A CPU core is counted as idle if it has been put into the C3 or a deeper power state (which may reflect ACPI C2/C3 states). So to reach maximal effectiveness, some tuning may be needed. [snip] PPS: I expect an even better effect from further reducing interrupt rates on idle CPUs. I'm currently testing a patch that eliminates another 31% of the global TLB shootdowns for a buildworld on an amd64 machine. So, you can expect improvement in this area. Alan
Re: disk I/O, VFS hirunningspace
Peter Jeremy wrote: Regarding vfs.lorunningspace and vfs.hirunningspace... On 2010-Jul-15 13:52:43 -0500, Alan Cox alan.l@gmail.com wrote: Keep in mind that we still run on some fairly small systems with limited I/O capabilities, e.g., a typical arm platform. More generally, with the range of systems that FreeBSD runs on today, any particular choice of constants is going to perform poorly for someone. If nothing else, making these sysctls a function of the buffer cache size is probably better than any particular constants. That sounds reasonable but brings up a related issue - the buffer cache. Given the unified VM system no longer needs a traditional Unix buffer cache, what is the buffer cache still used for? Today, it is essentially a mapping cache. So, what does that mean? After you've set aside a modest amount of physical memory for the kernel to hold its own internal data structures, all of the remaining physical memory can potentially be used to cache file data. However, on many architectures this is far more memory than the kernel can instantaneously access. Consider i386. You might have 4+ GB of physical memory, but the kernel address space is (by default) only 1 GB. So, at any instant in time, only a fraction of the physical memory is instantaneously accessible to the kernel. In general, to access an arbitrary physical page, the kernel is going to have to replace an existing virtual-to-physical mapping in its address space with one for the desired page. (Generally speaking, on most architectures, even the kernel can't directly access physical memory that isn't mapped by a virtual address.) The buffer cache is essentially a region of the kernel address space that is dedicated to mappings to physical pages containing cached file data. As applications access files, the kernel dynamically maps (and unmaps) physical pages containing cached file data into this region.
Once the desired pages are mapped, then read(2) and write(2) can essentially bcopy from the buffer cache mapping to the application's buffer. (Understand that this buffer cache mapping is a prerequisite for the copy out to occur.) So, why did I call it a mapping cache? There is generally locality in the access to file data. So, rather than map and unmap the desired physical pages on every read and write, the mappings to file data are allowed to persist and are managed much like many other kinds of caches. When the kernel needs to map a new set of file pages, it finds an older, not-so-recently used mapping and destroys it, allowing those kernel virtual addresses to be remapped to the new pages. So far, I've used i386 as a motivating example. What of other architectures? Most 64-bit machines take advantage of their large address space by implementing some form of direct map that provides instantaneous access to all of physical memory. (Again, I use instantaneous to mean that the kernel doesn't have to dynamically create a virtual-to-physical mapping before being able to access the data.) On these machines, you could, in principle, use the direct map to implement the bcopy to the application's buffer. So, what is the point of the buffer cache on these machines? A trivial benefit is that the file pages are mapped contiguously in the buffer cache. Even though the underlying physical pages may be scattered throughout the physical address space, they are mapped contiguously. So, the bcopy doesn't need to worry about every page boundary, only buffer boundaries. The buffer cache also plays a role in the page replacement mechanism. Once mapped into the buffer cache, a page is wired, that is, it is removed from the paging lists, where the page daemon could reclaim it. However, a page in the buffer cache should really be thought of as being active. In fact, when a page is unmapped from the buffer cache, it is placed at the tail of the virtual memory system's inactive list.
The same place where the virtual memory system would place a physical page that it is transitioning from active to inactive. If an application later performs a read(2) from or write(2) to the same page, that page will be removed from the inactive list and mapped back into the buffer cache. So, the mapping and unmapping process contributes to creating an LRU-ordered inactive queue. Finally, the buffer cache limits the amount of dirty file system data that is cached in memory. ... Is the current tuning formula still reasonable (for virtually all current systems it's basically 10MB + 10% RAM)? It's probably still good enough. However, this is not a statement for which I have supporting data. So, I reserve the right to change my opinion. :-) Consider what the buffer cache now does. It's just a mapping cache. Increasing the buffer cache size doesn't affect (much) the amount of physical memory available for caching file data. So, unlike ancient times,
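Alan's picture of the buffer cache as a mapping cache can be modeled in a few lines. Everything below (mapcache, mc_map, the slot count) is an invented illustration, not kernel code: a fixed pool of virtual-address slots, least-recently-used eviction, and an evicted page going to the "inactive list" rather than being discarded.

```c
#include <assert.h>
#include <string.h>

#define NSLOTS 4                 /* tiny pool of kernel VA slots */

struct mapcache {
    int page[NSLOTS];            /* page id mapped in each slot, -1 = empty */
    unsigned stamp[NSLOTS];      /* last-use clock, for LRU selection */
    unsigned clock;
    int evicted;                 /* last page pushed to the inactive list */
};

static void mc_init(struct mapcache *mc)
{
    memset(mc, 0, sizeof(*mc));
    for (int i = 0; i < NSLOTS; i++) mc->page[i] = -1;
    mc->evicted = -1;
}

/* Return the slot through which `page` can be copied to user space,
 * mapping it first if needed (and evicting the LRU mapping). */
static int mc_map(struct mapcache *mc, int page)
{
    int lru = 0;
    for (int i = 0; i < NSLOTS; i++) {
        if (mc->page[i] == page) {       /* mapping cache hit */
            mc->stamp[i] = ++mc->clock;
            return i;
        }
        if (mc->stamp[i] < mc->stamp[lru])
            lru = i;
    }
    mc->evicted = mc->page[lru];         /* unmapped page becomes inactive */
    mc->page[lru] = page;
    mc->stamp[lru] = ++mc->clock;
    return lru;
}
```

The map/unmap churn this models is exactly the recurring cost the thread discusses; the eviction order is what makes the inactive queue approximately LRU.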
Re: disk I/O, VFS hirunningspace
On Thu, Jul 15, 2010 at 8:01 AM, Ivan Voras ivo...@freebsd.org wrote: On 07/14/10 18:27, Jerry Toung wrote: On Wed, Jul 14, 2010 at 12:04 AM, Gary Jennejohn gljennj...@googlemail.com wrote: Rather than commenting out the code, try setting the sysctl vfs.hirunningspace to various powers of two. The default seems to be 1MB. I just changed it on the command line as a test to 2MB. You can do this in /etc/sysctl.conf. thank you all, that did it. The settings that Matt recommended are giving the same numbers. Any objections to raising the defaults to 8 MB / 1 MB in HEAD? Keep in mind that we still run on some fairly small systems with limited I/O capabilities, e.g., a typical arm platform. More generally, with the range of systems that FreeBSD runs on today, any particular choice of constants is going to perform poorly for someone. If nothing else, making these sysctls a function of the buffer cache size is probably better than any particular constants. Alan
Re: Strange problem with 8-stable, VMWare vSphere 4 AMD CPUs (unexpected shutdowns)
On Thu, Feb 11, 2010 at 7:13 AM, John Baldwin j...@freebsd.org wrote: On Wednesday 10 February 2010 1:38:37 pm Ivan Voras wrote: On 10 February 2010 19:35, Andriy Gapon a...@icyb.net.ua wrote: on 10/02/2010 20:26 Ivan Voras said the following: On 10 February 2010 19:10, Andriy Gapon a...@icyb.net.ua wrote: on 10/02/2010 20:03 Ivan Voras said the following: When you say very unique is it in the it is not Linux or Windows sense or do we do something nonstandard? The former - neither Linux, Windows or OpenSolaris seem to have what we have. I can't find the exact documents but I think both Windows MegaUltimateServer (the highest priced version of Windows Server, whatever it's called today) and Linux (though disabled and marked Experimental) have it, or have some kind of support for large pages that might not be as pervasive (maybe they use it for kernel only?). I have no idea about (Open)Solaris. I haven't said that those OSes do not use large pages. I've said what I've said :-) Ok :) Is there a difference between large pages as they are commonly known and superpages as in FreeBSD ? In other words - are you referencing some specific mechanism, like automatic promotion / demotion of the large pages or maybe something else? Yes, the automatic promotion / demotion. That is a far-less common feature. FreeBSD/i386 has used large pages for the kernel text as far back as at least 4.x, but that is not the same as superpages. Linux does not have automatic promotion / demotion to my knowledge. I do not know about other OS's. A comparison of current large page support among Unix-like and Windows operating systems has two dimensions: (1) whether or not the creation of large pages for applications is automatic and (2) whether or not the machine administrator has to statically partition the machine's physical memory between large and small pages at boot time. For FreeBSD, large pages are created automatically and there is not a static partitioning of physical memory. 
In contrast, Linux does not create large pages automatically and does require a static partitioning. Specifically, Linux requires the administrator to explicitly and statically partition the machine's physical memory at boot time into two parts, one that is dedicated to large pages and another for general use. To utilize large pages, an application has to explicitly request memory from the dedicated large-page pool. However, to make this somewhat easier, but not automatic, there do exist re-implementations of malloc that you can explicitly link with your application. In Solaris, the application has to explicitly request the use of large pages, either via explicit kernel calls in the program or from the command line with support from a library. However, there is not a static partitioning of physical memory. So, for example, when you run the Sun JDK on Solaris, it explicitly requests large pages for much of its data, and this works without the administrator having to configure the machine for large page usage. To the best of my knowledge, Windows is just like Solaris. Alan
Re: Superpages on amd64 FreeBSD 7.2-STABLE
On Thu, Dec 10, 2009 at 8:50 AM, Bernd Walter ti...@cicely7.cicely.de wrote: On Wed, Dec 09, 2009 at 09:07:33AM -0500, John Baldwin wrote: On Thursday 26 November 2009 10:14:20 am Linda Messerschmidt wrote: It's not clear to me if this might be a problem with the superpages implementation, or if squid does something particularly horrible to its memory when it forks to cause this, but I wanted to ask about it on the list in case somebody who understands it better might know what's going on. :-) I talked with Alan Cox some about this off-list, and there is a case that can cause this behavior: if the parent squid process takes write faults on a superpage before the child process has called exec(), then it can result in superpages being fragmented and never reassembled. Using vfork() should prevent this from happening. It is a known issue, but it will probably be some time before it is addressed. There is lower-hanging fruit in other areas of the VM that will probably be worked on first. For me the whole thread puzzles me. Especially because vfork is often called a solution. Scenario A: Parent with superpage, fork/exec. This problem can happen because there is a race. The parent now has its superpages fragmented permanently!? The child throws away its pages because of the exec!? Scenario B: Parent with superpage, vfork/exec. This problem won't happen because the child has no pseudo-copy of the parent's memory and then starts with a completely new map. Scenario C: Parent with superpage, fork / no exec. The problem can happen because the child shares the same memory over its complete lifetime. The parent can get its superpages fragmented over time. I'm not sure how you are defining problem. If we define problem as I would, i.e., that re-promotion can never occur, then Scenario C is not a problem scenario, only Scenario A is. The source of the problem in Scenario A is basically that we have two ways of handling copy-on-write faults.
Before the exec() occurs, copy-on-write faults are handled as you might intuit from the name: a new physical copy is made. If the entirety of the 2MB region is written to before the exec(), then this region will be promoted to a superpage. However, once the exec() occurs, copy-on-write faults are optimized. Specifically, the kernel recognizes that the underlying physical page is no longer shared with the child and simply restores write access to it. It is the combination of these two methods that effectively blocks re-promotion, because the underlying 4KB physical pages within a 2MB region are no longer contiguous. In other words, once the first page within a region has been copied, you have a choice to make: do you perform avoidable copies, or do you abandon the possibility of ever creating a superpage? The former has a significant one-time cost and the latter has a small recurring cost. Not knowing how much the latter will add up to, I chose the former. However, that choice may change in time, particularly if I find an effective heuristic for choosing between the two options. Anyway, please keep trying superpages with large memory applications like this. Reports like this help me to prioritize my efforts. Regards, Alan
Re: mmap(2) with MAP_ANON honouring offset although it shouldn't
Ed Schouten wrote: * John Baldwin j...@freebsd.org wrote: Note that the spec doesn't cover MAP_ANON at all FWIW. Yes. I've noticed Linux also uses MAP_ANONYMOUS instead of MAP_ANON. They do provide MAP_ANON for compatibility, if I remember correctly. For what it's worth, I believe that Solaris does the exact opposite. They provide MAP_ANONYMOUS for compatibility. It seems like a good idea for us to do the same. We also have an unimplemented option MAP_RENAME defined for compatibility with Sun that is nowhere mentioned in modern Solaris documentation. Alan
Re: mmap(2) with MAP_ANON honouring offset although it shouldn't
Ed Schouten wrote: * Alan Cox a...@cs.rice.edu wrote: For what it's worth, I believe that Solaris does the exact opposite. They provide MAP_ANONYMOUS for compatibility. It seems like a good idea for us to do the same. Something like this?
Index: mman.h
===
--- mman.h (revision 198919)
+++ mman.h (working copy)
@@ -82,6 +82,9 @@
  */
 #define MAP_FILE 0x0000 /* map from file (default) */
 #define MAP_ANON 0x1000 /* allocated from memory, swap space */
+#ifndef _KERNEL
+#define MAP_ANONYMOUS MAP_ANON /* For compatibility. */
+#endif /* !_KERNEL */
 
 /*
  * Extended flags
Yes. If no one objects in the next day or so, then please commit this change. Alan
Re: mmap(2) with MAP_ANON honouring offset although it shouldn't
On Wed, Oct 21, 2009 at 10:51 AM, Alexander Best alexbes...@math.uni-muenster.de wrote: although the mmap(2) manual states in section MAP_ANON: The offset argument is ignored. this doesn't seem to be true. running printf("%p\n", mmap((void*)0x1000, 0x1000, PROT_NONE, MAP_ANON, -1, 0x12345678)); and printf("%p\n", mmap((void*)0x1000, 0x1000, PROT_NONE, MAP_ANON, -1, 0)); produces different outputs. i've attached a patch to solve the problem. the patch is similar to the one proposed in this PR, but should apply cleanly to CURRENT: http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/71258 The standards for mmap(2) actually disallow values of off that are not a multiple of the page size. See http://www.opengroup.org/onlinepubs/95399/functions/mmap.html for the following: [EINVAL] The addr argument (if MAP_FIXED was specified) or off is not a multiple of the page size as returned by sysconf(), or is considered invalid by the implementation. Both Solaris and Linux enforce this restriction. I'm not convinced that the ability to specify a value for off that is not a multiple of the page size is a useful differentiating feature of FreeBSD versus Solaris or Linux. Does anyone have a compelling argument (or use case) to motivate us being different in this respect? If you disallow values for off that are not a multiple of the page size, then you are effectively ignoring off for MAP_ANON. Regards, Alan
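The restriction Alan quotes reduces to a single mask test. A sketch — the helper name is invented, and the 4096-byte PAGE_SIZE is an assumption matching i386/amd64:

```c
#include <assert.h>
#include <errno.h>

#define PAGE_SIZE 4096UL
#define PAGE_MASK (PAGE_SIZE - 1)

/* POSIX: off must be a multiple of the page size; otherwise EINVAL.
 * Under this rule, the MAP_ANON off of 0x12345678 from the test
 * program above would be rejected rather than silently honoured. */
static int check_mmap_off(unsigned long off)
{
    return (off & PAGE_MASK) != 0 ? EINVAL : 0;
}
```

Since MAP_ANON has no file behind it, any page-aligned off carries no information, which is why enforcing the alignment rule is equivalent to ignoring off for anonymous mappings.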
Re: mmap/munmap with zero length
John Baldwin wrote: On Monday 13 July 2009 3:33:51 pm Tijl Coosemans wrote: On Monday 13 July 2009 20:28:08 John Baldwin wrote: On Sunday 05 July 2009 3:32:25 am Alexander Best wrote: so mmap differs from the POSIX recommendation, right. the malloc.conf option seems more like a workaround/hack. imo it's confusing to have mmap and munmap deal differently with len=0. being able to successfully allocate memory which cannot be removed doesn't seem logical to me. This should fix it:
--- //depot/user/jhb/acpipci/vm/vm_mmap.c
+++ /home/jhb/work/p4/acpipci/vm/vm_mmap.c
@@ -229,7 +229,7 @@
 	fp = NULL;
 	/* make sure mapping fits into numeric range etc */
-	if ((ssize_t) uap->len < 0 ||
+	if ((ssize_t) uap->len <= 0 ||
 	    ((flags & MAP_ANON) && uap->fd != -1))
 		return (EINVAL);
Why not uap->len == 0? Sizes of 2GiB and more (32bit) shouldn't cause an error. I don't actually disagree and know of locally modified versions of FreeBSD that remove this check for precisely that reason. I have no objections to uap->len == 0 (without the cast). Alan
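Tijl's objection can be demonstrated directly: with a 32-bit size type, the cast-based test rejects any length of 2GiB or more (which reinterprets as negative), while an explicit zero test does not. A sketch, with invented function names and uint32_t modeling the 32-bit case:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MAP_ANON 0x1000

/* The cast-based check from the patch: the signed cast folds both
 * len == 0 and len >= 2GiB into the same rejection. */
static int check_cast(uint32_t len, int flags, int fd)
{
    if ((int32_t)len <= 0 || ((flags & MAP_ANON) && fd != -1))
        return EINVAL;
    return 0;
}

/* The alternative Tijl suggests: reject only the empty mapping. */
static int check_exact(uint32_t len, int flags, int fd)
{
    if (len == 0 || ((flags & MAP_ANON) && fd != -1))
        return EINVAL;
    return 0;
}
```

This is why Alan agrees with `uap->len == 0` without the cast: it fixes the len=0 inconsistency with munmap() without forbidding legitimate huge mappings on 32-bit platforms.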
Re: Problem with vm.pmap.shpgperproc and vm.pmap.pv_entry_max
On Fri, Jul 3, 2009 at 8:18 AM, c0re dumped ez.c...@gmail.com wrote: So, I never had a problem with this server, but recently it started to give me the following message *every* minute: Jul 3 10:04:00 squid kernel: Approaching the limit on PV entries, consider increasing either the vm.pmap.shpgperproc or the vm.pmap.pv_entry_max tunable. [the identical message repeats at 10:05:00, 10:06:00, 10:07:01, 10:08:01, 10:09:01, 10:10:01, and 10:11:01] This server is running Squid + dansguardian. The users are complaining about slow navigation and they are driving me crazy! Has anyone faced this problem before?
Some infos: # uname -a FreeBSD squid 7.2-RELEASE FreeBSD 7.2-RELEASE #0: Fri May 1 08:49:13 UTC 2009 r...@walker.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC i386 # sysctl vm vm.vmtotal: System wide totals computed every five seconds: (values in kilobytes) === Processes: (RUNQ: 1 Disk Wait: 1 Page Wait: 0 Sleep: 230) Virtual Memory: (Total: 19174412K, Active 9902152K) Real Memory:(Total: 1908080K Active 1715908K) Shared Virtual Memory: (Total: 647372K Active: 10724K) Shared Real Memory: (Total: 68092K Active: 4436K) Free Memory Pages: 88372K vm.loadavg: { 0.96 0.96 1.13 } vm.v_free_min: 4896 vm.v_free_target: 20635 vm.v_free_reserved: 1051 vm.v_inactive_target: 30952 vm.v_cache_min: 20635 vm.v_cache_max: 41270 vm.v_pageout_free_min: 34 vm.pageout_algorithm: 0 vm.swap_enabled: 1 vm.kmem_size_scale: 3 vm.kmem_size_max: 335544320 vm.kmem_size_min: 0 vm.kmem_size: 335544320 vm.nswapdev: 1 vm.dmmax: 32 vm.swap_async_max: 4 vm.zone_count: 84 vm.swap_idle_threshold2: 10 vm.swap_idle_threshold1: 2 vm.exec_map_entries: 16 vm.stats.misc.zero_page_count: 0 vm.stats.misc.cnt_prezero: 0 vm.stats.vm.v_kthreadpages: 0 vm.stats.vm.v_rforkpages: 0 vm.stats.vm.v_vforkpages: 340091 vm.stats.vm.v_forkpages: 3604123 vm.stats.vm.v_kthreads: 53 vm.stats.vm.v_rforks: 0 vm.stats.vm.v_vforks: 2251 vm.stats.vm.v_forks: 19295 vm.stats.vm.v_interrupt_free_min: 2 vm.stats.vm.v_pageout_free_min: 34 vm.stats.vm.v_cache_max: 41270 vm.stats.vm.v_cache_min: 20635 vm.stats.vm.v_cache_count: 5734 vm.stats.vm.v_inactive_count: 242259 vm.stats.vm.v_inactive_target: 30952 vm.stats.vm.v_active_count: 445958 vm.stats.vm.v_wire_count: 58879 vm.stats.vm.v_free_count: 16335 vm.stats.vm.v_free_min: 4896 vm.stats.vm.v_free_target: 20635 vm.stats.vm.v_free_reserved: 1051 vm.stats.vm.v_page_count: 769244 vm.stats.vm.v_page_size: 4096 vm.stats.vm.v_tfree: 12442098 vm.stats.vm.v_pfree: 1657776 vm.stats.vm.v_dfree: 0 vm.stats.vm.v_tcached: 253415 vm.stats.vm.v_pdpages: 254373 vm.stats.vm.v_pdwakeups: 14 
vm.stats.vm.v_reactivated: 414 vm.stats.vm.v_intrans: 1912 vm.stats.vm.v_vnodepgsout: 0 vm.stats.vm.v_vnodepgsin: 6593 vm.stats.vm.v_vnodeout: 0 vm.stats.vm.v_vnodein: 891 vm.stats.vm.v_swappgsout: 0 vm.stats.vm.v_swappgsin: 0 vm.stats.vm.v_swapout: 0 vm.stats.vm.v_swapin: 0 vm.stats.vm.v_ozfod: 56314 vm.stats.vm.v_zfod: 2016628 vm.stats.vm.v_cow_optim: 1959 vm.stats.vm.v_cow_faults: 584331 vm.stats.vm.v_vm_faults: 3661086 vm.stats.sys.v_soft: 23280645 vm.stats.sys.v_intr: 18528397 vm.stats.sys.v_syscall: 1990471112 vm.stats.sys.v_trap: 8079878 vm.stats.sys.v_swtch: 105613021 vm.stats.object.bypasses: 14893 vm.stats.object.collapses: 55259 vm.v_free_severe: 2973 vm.max_proc_mmap: 49344 vm.old_msync: 0 vm.msync_flush_flags: 3 vm.boot_pages: 48 vm.max_wired: 255475 vm.pageout_lock_miss: 0 vm.disable_swapspace_pageouts: 0 vm.defer_swapspace_pageouts: 0 vm.swap_idle_enabled: 0 vm.pageout_stats_interval: 5 vm.pageout_full_stats_interval: 20 vm.pageout_stats_max: 20635 vm.max_launder: 32 vm.phys_segs: SEGMENT 0: start: 0x1000 end: 0x9a000 free list: 0xc0cca168 SEGMENT 1: start: 0x10 end: 0x40 free list: 0xc0cca168 SEGMENT 2: start:
Re: large pages (amd64)
Robert Watson wrote: On Tue, 30 Jun 2009, Mel Flynn wrote: It looks like sys/kern/kern_proc.c could call mincore around the loop at line 1601 (rev 194498), but I know nothing about the vm subsystem to know the implications or locking involved. There's still 16 bytes of spare to consume in the kve_vminfo struct though ;) Yes, to start with, you could replace the call to pmap_extract() with a call to pmap_mincore() and export a Boolean to user space that says, This region of the address space contains one or more superpage mappings. How about attached? I like the idea -- there are some style nits that need fixing though. Assuming Alan is happy with the VM side of things, I can do the cleanup and get it in the tree. Aside from the style nits, it looks good to me. Alan
Re: large pages (amd64)
Mel Flynn wrote: On Sunday 28 June 2009 15:41:49 Alan Cox wrote: Wojciech Puchar wrote: how can i check how much (or maybe - what processes) 2MB pages are actually allocated? I'm afraid that you can't with great precision. For a given program execution, on an otherwise idle machine, you can only estimate the number by looking at the change in the quantity promotions + mappings - demotions before, during, and after the program execution. A program can call mincore(2) in order to determine if a virtual address is part of a 2 or 4MB virtual page. Would it be possible to expose the super page count as kve_super in the kinfo_vmentry struct so that procstat can show this information? If only to determine if one is using the feature and possibly benefiting from it. Yes, I think so. It looks like sys/kern/kern_proc.c could call mincore around the loop at line 1601 (rev 194498), but I know nothing about the vm subsystem to know the implications or locking involved. There's still 16 bytes of spare to consume, in the kve_vminfo struct though ;) Yes, to start with, you could replace the call to pmap_extract() with a call to pmap_mincore() and export a Boolean to user space that says, This region of the address space contains one or more superpage mappings. Counting the number of superpages is a little trickier. The problem being that pmap_mincore() doesn't tell you how large the underlying page is. So, the loop at line 1601 can't easily tell where one superpage ends and the next 4KB page or superpage begins, making counting the number of superpages in a region a little tricky. One possibility is to change pmap_mincore() to return the page size (or the logarithm of the page size) rather than a single bit. If you want to give it a try, I'll be happy to help. There aren't really any implications or synchronization issues that you need to worry about. 
Alan
Re: large pages (amd64)
On Sun, Jun 28, 2009 at 12:36 PM, Wojciech Puchar woj...@wojtek.tensor.gdynia.pl wrote: i enabled vm.pmap.pg_ps_enabled: 1 could you please explain what exactly this value means? because i don't understand why promotions - demotions != mappings mappings is not what you seem to think it is. vm.pmap.pde.mappings is the number of 2/4MB page mappings that are created directly and not through the incremental promotion process. For example, it counts the 2/4MB page mappings that are created when the text segment of a large executable, e.g., gcc, is pre-faulted at startup or when a graphics card's frame buffer is mmap()ed. Moreover, not every promoted mapping is demoted. A promoted mapping may be destroyed without demotion, for example, when a process exits. This is, in fact, the ideal case because it is cheaper to destroy a single 2/4MB page mapping instead of 512 or 1024 4KB page mappings. vm.pmap.pde.promotions: 2703 vm.pmap.pde.p_failures: 6290 vm.pmap.pde.mappings: 610 vm.pmap.pde.demotions: 289 other question - tried enabling it on my i386 laptop (256 megs ram), always mappings==0, while promotions > demotions > 0. The default starting address for executables on i386 is not aligned to a 2/4MB page boundary. Hence, mappings are much less likely to occur. certainly there are apps that could be put on big pages, gimp editing 40MB bitmap for example Regards, Alan
Re: large pages (amd64)
Wojciech Puchar wrote: other question - tried enabling it on my i386 laptop (256 megs ram), always mappings==0, while promotions > demotions > 0. The default starting address for executables on i386 is not aligned to a 2/4MB page boundary. Hence, mappings are much less likely to occur. certainly there are apps that could be put on big pages, gimp editing 40MB bitmap for example Regards, Alan how can i check how much (or maybe - what processes) 2MB pages are actually allocated? I'm afraid that you can't with great precision. For a given program execution, on an otherwise idle machine, you can only estimate the number by looking at the change in the quantity promotions + mappings - demotions before, during, and after the program execution. A program can call mincore(2) in order to determine if a virtual address is part of a 2 or 4MB virtual page. Alan
Re: large pages (amd64)
On Sun, Jun 28, 2009 at 7:14 PM, Nathanael Hoyle nho...@hoyletech.com wrote: Wojciech Puchar wrote: i enabled vm.pmap.pg_ps_enabled: 1 could you please explain what exactly this value means? because i don't understand why promotions - demotions != mappings vm.pmap.pde.promotions: 2703 vm.pmap.pde.p_failures: 6290 vm.pmap.pde.mappings: 610 vm.pmap.pde.demotions: 289 other question - tried enabling it on my i386 laptop (256 megs ram), always mappings==0, while promotions > demotions > 0. certainly there are apps that could be put on big pages, gimp editing 40MB bitmap for example Just to be clear, since you say i386 (I presume you mean architecture), I believe the Physical Address Extensions which allowed the 2MB Page Size bit to be set were introduced with the Pentium Pro. Processors prior to this were limited to standard 4KB pages. No. Many of those processors supported 4MB pages. Regards, Alan
Re: large pages (amd64)
Nathanael Hoyle wrote: [snip] Having been corrected by both you and Joerg (thank you!), I went back to re-verify my understanding. It appears that while I was slightly mixing PAE in with PSE, PSE support for 4MB pages was introduced 'silently' with the Pentium, and documented first with the Pentium Pro. I haven't found anything that points to earlier inclusion. Certainly the 80386 processor specifically, I am fairly confident would be limited to the 4KB pages. Agreed? Or are you aware of earlier usage than the Pentium for 4MB pages? Yes, I agree. I'm not aware of 4MB page support before the Pentium. Regards, Alan
Re: Increasing KVM on amd64
Artem Belevich wrote: Alan, Thanks a lot for the patch. I've applied it to RELENG_7 and it seems to work great - make -j8 buildworld succeeds, linux emulation seems to work well enough to run linux-sun-jdk14 binaries, ZFS ARC size is bigger, too. So far I didn't see any ZFS-related KVM shortages either. The only problem is that everything is fine as long as vm.kmem_size is set to less or equal to 4096M. As soon as I set it to 4100M or anything larger, kernel crashes on startup. I'm unable to capture exact crash messages as they keep scrolling really fast on the screen for a few seconds until the box reboots. Unfortunately the box does not have built-in serial ports, so the messages are gone before I can see them. :-( There are two underlying causes. First, the size of the kmem map, which holds the kernel's heap, is recorded in a 32-bit int. So, setting vm.kmem_size to 4100M is leading to integer overflow. The following change addresses this issue:
sys/kern/kern_malloc.c revision 1.167 (SVN rev 180308), committed 2008-07-05 by alc: "Enable the creation of a kmem map larger than 4GB. Submitted by: Tz-Huan Huang. Make several variables related to kmem map auto-sizing static. Found by: CScout." (Changes since revision 1.166: +11 -11 lines.)
Second, there is no room for a kmem map greater than 4GB unless the overall KVM size is greater than 6GB. Specifically, a 4GB kmem map isn't possible with 6GB KVM because the kmem map would overlap the kernel's code, data, and bss segment. If you're able to apply the above kern_malloc.c change to your kernel, then I should be able to describe how to increase your KVM beyond 6GB. Regards, Alan
Re: Increasing KVM on amd64
Tz-Huan Huang wrote: On Sun, Jun 8, 2008 at 7:59 AM, Alan Cox [EMAIL PROTECTED] wrote: You can download a patch from http://www.cs.rice.edu/~alc/amd64_kvm_6GB.patch that increases amd64's kernel virtual address space to 6GB. This patch also increases the default for the kmem map to almost 2GB. I believe that kernel loadable modules still work. However, I suspect that mini-dumps are broken. I don't plan on committing this patch in its current form. Some of the changes are done in a hackish way. I am, however, curious to hear whether or not it works for you. Thanks for the patch. I applied it on 7-stable but it failed on pmap.c. [snip] We have no machine running 8-current with more than 6G memory now... Sorry, at this point the patch is only applicable to HEAD. That said, the failed chunk is probably easily applied by hand to RELENG_7. Thanks, Alan
Increasing KVM on amd64
You can download a patch from http://www.cs.rice.edu/~alc/amd64_kvm_6GB.patch that increases amd64's kernel virtual address space to 6GB. This patch also increases the default for the kmem map to almost 2GB. I believe that kernel loadable modules still work. However, I suspect that mini-dumps are broken. I don't plan on committing this patch in its current form. Some of the changes are done in a hackish way. I am, however, curious to hear whether or not it works for you. Alan
Re: Increasing KVM on amd64
Kostik Belousov wrote: On Sat, Jun 07, 2008 at 06:59:35PM -0500, Alan Cox wrote: You can download a patch from http://www.cs.rice.edu/~alc/amd64_kvm_6GB.patch that increases amd64's kernel virtual address space to 6GB. This patch also increases the default for the kmem map to almost 2GB. I believe that kernel loadable modules still work. However, I suspect that mini-dumps are broken. The amd64 modules text/data/bss virtual addresses are allocated from the kernel_map (link_elf_obj.c, link_elf_load_file). Now, the lower end of the kernel_map is top-6Gb. Kernel code (both kernel proper and modules) is compiled for kernel memory model, according to the gcc info: `-mcmodel=kernel' Generate code for the kernel code model. The kernel runs in the negative 2 GB of the address space. This model has to be used for Linux kernel code. I suspect we have a problem there. The change to link_elf_obj.c is supposed to ensure allocation of an address in the upper 2GB of the kernel map for the module. Alan
Re: possibly missed wakeup in swap_pager_getpages()
Divacky Roman wrote: hi [snip] is my analysis correct? if so, can the race be mitigated by moving the flag setting (hence also the locking) after the msleep()? No. When the wakeup is performed, the VPO_SWAPINPROG flag is also cleared. Both operations occur in the I/O completion handler with the object locked. Thus, if the I/O completion handler runs first, the msleep on the page within the while loop will not occur because the page's VPO_SWAPINPROG flag is already cleared. Regards, Alan
Re: Possible problems with mmap/munmap on FreeBSD ...
On Tue, Mar 29, 2005 at 09:18:32PM -0500, David Schultz wrote: On Tue, Mar 29, 2005, Richard Sharpe wrote: Hi, I am having some problems with the tdb package on FreeBSD 4.6.2 and 4.10. One of the things the above package does is: mmap the tdb file to a region of memory, store stuff in the region (memmove etc.), and when it needs to extend the size of the region { munmap the region; write data at the end of the file; mmap the region again with a larger size }. What I am seeing is that after the munmap the data written to the region is gone. However, if I insert an msync before the munmap, everything is nicely coherent. This seems odd (in the sense that it works without the msync under Linux). The region is mmapped with: mmap(NULL, tdb->map_size, PROT_READ|(tdb->read_only ? 0 : PROT_WRITE), MAP_SHARED|MAP_FILE, tdb->fd, 0); What I notice is that all the calls to mmap return the same address. A careful reading of the man pages for mmap and munmap does not suggest that I am doing anything wrong. Is it possible that FreeBSD is deferring flushing the dirty data, and then forgets to do it when the same starting address is used etc? It looks like all of the underlying pages are getting invalidated in vm_object_page_remove(). This is clearly the right thing to do for private mappings, but it seems wrong for shared mappings. Perhaps Alan has some insight. Hmm. In this code path we don't call vm_object_page_remove() on vnode-backed objects, only default/swap-backed objects that have no other mappings that reference them. Regards, Alan
Re: contigmalloc(9) rewrite
On Fri, Jun 18, 2004 at 04:51:15PM -0400, Brian Fundakowski Feldman wrote: On Tue, Jun 15, 2004 at 03:57:09PM -0400, Brian Fundakowski Feldman wrote: The patch, which applies to 5-CURRENT, can be found here: http://green.homeunix.org/~green/contigmalloc2.patch The default is to use the old contigmalloc(). You can set the sysctl or loader tunable vm.old_contigmalloc to 0 to enable it. For anyone that normally runs into failed allocations hot-plugging hardware, please try this and see if it helps out. By the way, I have updated it further to split apart contigmalloc() into a separate vm_page_alloc_contig() and mapping function as per feedback from Alan Cox and Hiten Pandya. The operation is still the same except for now being able to see memory allocated with it in your vmstat(8) -m output. The patch is still at the same location, and requires sysctl vm.old_contigmalloc=0 to enable. Why don't you commit the part that makes allocation of physical memory start from high addresses? Alan
Re: Update: Debox sendfile modifications
On Wed, Nov 05, 2003 at 01:25:43AM -0500, Vivek Pai wrote: Mike Silbersack wrote: On Tue, 4 Nov 2003, Vivek Pai wrote: The one other aspect of this is that sf_bufs mappings are maintained for a configurable amount of time, reducing the number of TLB ops. You can specify the parameter for how long, ranging from -1 (no coalescing at all), 0 (coalesce, but free immediately after last holder release), to any other time. Obviously, any value above 0 will increase the amount of wired memory at any given point in time, but it's configurable. Ah, I missed that point. Did your testing show the caching part of the functionality to be significant? I think it buys us a small gain (a few percent) under static-content workloads, and a little less under SpecWeb99, where more time is spent in dynamic content. However, it's almost free - the additional complexity beyond just coalescing is hooking into the timer to free unused mappings. I think it's reasonable to expect a more pronounced effect on i386 SMP. In order to maintain TLB coherence, we issue two interprocessor interrupts _per_page_ transmitted by sendfile(2). Alan
Re: [Fwd: Some mmap observations compared to Linux 2.6/OpenBSD]
On Wed, Oct 22, 2003 at 01:34:01PM +1000, Q wrote: It has been suggested that I should direct this question to the VM system gurus. Your comments on this would be appreciated. An address space is represented by a data structure that we call a vm_map. A vm_map is, in the abstract, an ordered collection of in-use address ranges. FreeBSD 4.x implements the vm_map as a doubly-linked list of address ranges with a hint pointing to the last accessed entry. Thus, if the hint fails, the expected time to look up an in-use address is O(n). FreeBSD 5.x overlays a balanced binary search tree over this structure. This accelerates the lookup to an (amortized) O(log n). In fact, for the kind of balanced binary search tree that we use, the last accessed entry is at the root of the tree. Thus, any locality in the pattern of lookups will produce even better results. Linux and OpenBSD are similar to FreeBSD 5.x. The differences lie in the details, like the kind of balanced binary tree that is used. That said, having a balanced binary tree to represent the address space does NOT inherently make finding unallocated space any faster. Instead, OpenBSD augments an address space entry with extra information to speed up this process: struct vm_map_entry { ... vaddr_t ownspace; /* free space after */ vaddr_t space; /* space in subtree */ ... The same could be done in FreeBSD, and you don't need a red-black tree in order to do it. If someone, especially a junior kernel hacker with a commit bit, is serious about trying this, I'll gladly mentor them. Recognize, however, that this approach may produce great results for a microbenchmark, but pessimize other more interesting workloads, like say, buildworld, making it a poor choice. Nonetheless, I think we should strive to get better results in this area.
Regards, Alan -Forwarded Message- From: Q [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: Some mmap observations compared to Linux 2.6/OpenBSD Date: Wed, 22 Oct 2003 12:22:35 +1000 As an effort to get more acquainted with the FreeBSD kernel, I have been looking through how mmap works. I don't yet understand how it all fits together, or of the exact implications things may have in the wild, but I have noticed under some synthetic conditions, ie. mmaping small non-contiguous pages of a file, mmap allocation scales much more poorly on FreeBSD than on OpenBSD and Linux 2.6. After investigating this further I have observed that vm_map_findspace() traverses a linked list to find the next region (O(n) cost), whereas OpenBSD and Linux 2.6 both use Red-Black trees for the same purpose (O(log n) cost). Profiling the FreeBSD kernel appears to confirm this. Can someone comment on whether this is something that has been done intentionally, or avoided in favour of some other yet to be implemented solution? Or is it still on someone's todo list. -- Quinton Dolan - [EMAIL PROTECTED], Gold Coast, QLD, Australia, Ph: +61 419 729 806
Re: why doesn't aio use at_exit(9)?
On Fri, Dec 01, 2000 at 02:02:58AM -0800, Alfred Perlstein wrote: why doesn't aio use at_exit(9) instead of requiring an explicit call in kern_exit.c for aio_rundown? There's no reason that I'm aware of. Unless you're in a hurry, I'll add that change to a cleanup patch that I have in the pipe. Alan
Re: why doesn't aio use at_exit(9)?
On Fri, Dec 01, 2000 at 12:08:48PM -0800, Alfred Perlstein wrote: * Alan Cox [EMAIL PROTECTED] [001201 11:56] wrote: On Fri, Dec 01, 2000 at 02:02:58AM -0800, Alfred Perlstein wrote: why doesn't aio use at_exit(9) instead of requiring an explicit call in kern_exit.c for aio_rundown? There's no reason that I'm aware of. Unless you're in a hurry, I'll add that change to a cleanup patch that I have in the pipe. Er, how much of a cleanup do you have? The only work I've done so far is to remove all the #ifdef VFS_AIO's in the file, if you could commit your cleanup soon it would help. :) If you're already working on converting aio to use at_exit, go ahead. It won't interfere with my work. Alan
Re: page coloring
On Thu, Nov 23, 2000 at 12:48:09PM -0800, Mike Smith wrote: Isn't the page coloring algorithm in _vm_page_list_find totally bogus? No, it's not. The comment is, however, misplaced. It describes the behavior of an inline function in vm_page.h, and not the function it precedes. Hrm. My comment was based on John Dyson's own observations on its behaviour, and other discussions which concluded that the code wasn't flexible enough (hardcoded assumptions on cache organisation, size etc.) Yes, it would be nice if it was "auto-configuring", because many people who use it misconfigure it. They configure the number of colors based upon the size of their cache rather than cache size/degree of associativity. (Having too many colors makes it less likely that you'll get a page of the right color when you ask for one because that queue will be empty.) Overall, it's my experience that people have absurd expectations of page coloring. They think of it as an optimization. It's not. It's better thought of as a necessary evil: Suppose you're writing a numerical application and either you or your compiler goes to a lot of trouble to "block" the algorithm for the L2 cache size. If the underlying OS doesn't do page coloring, it's likely that your program will still experience conflict misses despite your efforts to avoid them. Furthermore, you'll see a different number of conflict misses each time you run the program (assuming the typical random page allocation strategies). So, the execution time will vary. In short, page coloring simply provides a machine whose performance is more deterministic/predictable. Alan P.S. I noticed that I mistakenly cc'ed my previous message to -current rather than -hackers. I've changed it back to -hackers.
Re: minherit(2) API
I think I can change it in NetBSD -- anyone willing to do the necessary changes in FreeBSD and OpenBSD? I'll do it. Alan
Re: clearing pages in the idle loop
Last year, I tried to reproduce some of the claims/results in this paper on FreeBSD/x86 and couldn't. I also tried limiting the idle loop to clearing pages of one particular color at a time. That way, the cache lines replaced by the second page you clear are the cache lines holding the first page you cleared, and so on for the third, fourth, ... pages cleared. Again, I saw no measurable effect on tests like "buildworld", which is a similar workload to the paper's if I recall correctly. Finally, it's possible that having these pre-zeroed pages in your L2 cache might be beneficial if they get allocated and used right away. FreeBSD's idle loop zeroes the pages that are next in line for allocation. Alan
Re: Panic in pmap_remove_pages on 4.0-current
This exact problem came up last month. pmap_remove_pages is tripping over a corrupted page table entry (pte). Unfortunately, by the time the panic occurs, pmap_remove_pages has overwritten the corrupted pte with zero. Earlier this month, I added a KASSERT to detect this problem (and panic) before the corrupted pte is overwritten. This KASSERT seems to be missing from your kernel. Could you turn on assertion checking in your kernel configuration and/or update to a newer kernel. Alan
Re: patch for behavior changes and madvise MADV_DONTNEED
On Fri, Jul 30, 1999 at 09:51:35AM -0700, Matthew Dillon wrote: I'm not sure I see how MADV_FREE could slow performance down unless, perhaps, it is the overhead of the system call itself. e.g. if malloc is calling it on a page-by-page basis and not implementing any hysteresis. It's the system call overhead. Adding (more) hysteresis would reduce the overhead by some factor, but we'd still be making unnecessary MADV_FREE calls. Calling MADV_FREE accomplishes nothing unless the system is actually short of memory. The right way to address this problem is likely to add a mechanism to the VM system that sends a signal to the process when MADV_FREE would actually be beneficial. 2% isn't a big deal. MADV_FREE theoretically makes a big impact on paging performance in a heavily loaded paging system. If your tests were run on a system that wasn't paging, you weren't testing the right thing. 99% of our user base, whose machines aren't thrashing during a "make world" or other normal operation, shouldn't pay a 2% penalty to produce a theoretical improvement for the 1% that are. At best, that's "optimizing" the infrequent case at the expense of the frequent, not a good trade-off. In any case, the man page for malloc/free explains how to change the default, if you're a part of the "oppressed" 1%. Alan
Re: patch for behavior changes and madvise MADV_DONTNEED
Your new behavior flag isn't known by vm_map_simplify_entry, so map simplification could drop the behavior setting on the floor. I would prefer that the behavior is folded into eflags. Overall, I agree with the direction. Behavior is correctly a map attribute rather than an object attribute. Alan P.S. The MADV_FREE's by malloc/free were disabled by default in -CURRENT and -STABLE prior to the release of 3.2. They were a performance pessimization, slowing down make world by ~2%.
Re: Improving the Unix API
As far as sysctl goes, FreeBSD deprecates the use of numbers for OIDs and has a string-based mechanism for exploring the sysctl tree. So we are actually both going the same way: Linus with /proc/sys and his official dislike of sysctl ("Oh well I think sysctl using number spaces is the right idea - like snmp is"), and BSD going to names.
Re: Improving the Unix API
As far as I know, only FreeBSD has a string-based sysctl implementation. Nod. Something which always confused me about Linux' procfs - what have all these kernel variables got to do with process state? We used to have a kernfs which was intended for this kind of thing but it rotted after people started extending sysctl for the purpose. About as much as having a /usr/bin for the slower binaries on the 40Mbyte moving head disk has a relationship to /usr nowadays. /proc is basically both process and machine state in Linux. It got expanded on. Alan
Re: 3.2 Freeze date
On Mon, May 10, 1999 at 07:33:28PM -0700, Matthew Dillon wrote: My main interest is the NFS/TCP fixes, which Alan now has a -stable patch for. But it's already the tenth, so if it goes in now the source will then have to be reviewed by more gurus (post-commit). The NFS/TCP realignment patch was checked into -stable last Sat morning. Is there anything else? Alan