http://patchwork.ozlabs.org/patch/1186/

Submitter Suresh Siddha
Date 2008-09-23 21:00:38
Message ID <[EMAIL PROTECTED]>
Permalink /patch/1186/
State New

Comments

Suresh Siddha - 2008-09-23 21:00:38
In the first pass, the kernel physical mapping is set up using large or
small pages, but with the same PTE attributes as those set up by the
early boot code in head_[32|64].S.

After flushing the TLBs, we go through the second pass, which sets up the
direct-mapped PTEs with the appropriate attributes (like NX, GLOBAL etc.)
that are runtime detectable.

This two pass mechanism conforms to the TLB app note which says:

"Software should not write to a paging-structure entry in a way that would
 change, for any linear address, both the page size and either the page frame
 or attributes."
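
Condensed, the flow this adds looks like the following (excerpted and
simplified from the 64-bit hunk of the patch below; not literal code):

	u64 cached_supported_pte_mask = __supported_pte_mask;

	/* Pass 1: same large/small page layout, but with the conservative
	 * early attributes (NX masked off for now). */
	__supported_pte_mask &= ~_PAGE_NX;
	physical_mapping_iter = 1;

repeat:
	/* ... walk the pgd/pud/pmd/pte levels and set_pte() the entries ... */
	__flush_tlb_all();

	if (physical_mapping_iter == 1) {
		/* Pass 2: identical layout, now with the runtime-detected
		 * attributes (NX etc.) restored. */
		__supported_pte_mask = cached_supported_pte_mask;
		physical_mapping_iter = 2;
		goto repeat;
	}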

Signed-off-by: Suresh Siddha <[EMAIL PROTECTED]>
---
Jeremy Fitzhardinge - 2008-10-06 20:48:13
Suresh Siddha wrote:
> In the first pass, the kernel physical mapping is set up using large or
> small pages, but with the same PTE attributes as those set up by the
> early boot code in head_[32|64].S.
>
> After flushing the TLBs, we go through the second pass, which sets up the
> direct-mapped PTEs with the appropriate attributes (like NX, GLOBAL etc.)
> that are runtime detectable.
>
> This two pass mechanism conforms to the TLB app note which says:
>
> "Software should not write to a paging-structure entry in a way that would
>  change, for any linear address, both the page size and either the page frame
>  or attributes."
>   

I'd noticed that current tip/master hasn't been booting under Xen, and I 
just got around to bisecting it down to this change.

This patch is causing Xen to fail various pagetable updates because it 
ends up remapping pagetables to RW, which Xen explicitly prohibits (as 
that would allow guests to make arbitrary changes to pagetables, rather 
than have them mediated by the hypervisor).

A few things strike me about this patch:

   1. It's high time we unified the physical memory mapping code, and it
      would have been better to do so before making a change of this
      kind to the code.
   2. The existing code already avoided overwriting a pagetable entry
      unless the page size changed.  Wouldn't it be easier to construct
      the mappings first, using the old code, then do a CPA call to set
      the NX bit appropriately (roughly as sketched after this list)?
   3. The actual implementation is pretty ugly; adding a global variable
      and hopping about with goto does not improve this code.
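
A rough sketch of the alternative in point 2, assuming set_memory_nx()
were used once the mm is far enough along (hypothetical helper and call
site, not code from this thread):

/* Hypothetical: after the direct map has been built with the early
 * (executable) attributes by the old code, clear X on a whole range
 * with one CPA call instead of rewriting PTEs during the mapping pass. */
static void __init mark_directmap_nx(unsigned long start_pfn,
				     unsigned long end_pfn)
{
	unsigned long addr = (unsigned long)__va(start_pfn << PAGE_SHIFT);

	if (__supported_pte_mask & _PAGE_NX)
		set_memory_nx(addr, end_pfn - start_pfn);
}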

What are the downsides of not following the TLB app note's advice?  Does 
it cause real failures?  Could we revert this patch and address the 
problem some other way?  Which app note is this, BTW?  The one I have on 
hand, "TLBs, Paging-Structure Caches, and Their Invalidation", Apr 2007, 
does not seem to mention this restriction.

As it is, I suspect it will take a non-trivial amount of work to restore 
Xen with this code in place (touching this code is always non-trivial).  
I haven't looked into it in depth yet, but there's a few stand out "bad 
for Xen" pieces of code here.  (And I haven't tested 32-bit yet.)

Quick rules for keeping Xen happy here:

   1. Xen provides its own initial pagetable; the head_64.S one is
      unused when booting under Xen.
   2. Xen requires that any pagetable page must always be mapped RO, so
      we're careful to not replace an existing mapping with a new one,
      in case the existing mapping is a pagetable one.
   3. Xen never uses large pages, and the hypervisor will fail any
      attempt to do so.


> Index: tip/arch/x86/mm/init_64.c
> ===================================================================
> --- tip.orig/arch/x86/mm/init_64.c	2008-09-22 15:59:31.000000000 -0700
> +++ tip/arch/x86/mm/init_64.c	2008-09-22 15:59:37.000000000 -0700
> @@ -323,6 +323,8 @@
>  	early_iounmap(adr, PAGE_SIZE);
>  }
>  
> +static int physical_mapping_iter;
> +
>  static unsigned long __meminit
>  phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end)
>  {
> @@ -343,16 +345,19 @@
>  		}
>  
>  		if (pte_val(*pte))
> -			continue;
> +			goto repeat_set_pte;
>   

This looks troublesome.  The code was explicitly avoiding resetting a 
pte which had already been set.  This change will make it overwrite the 
mapping with PAGE_KERNEL, which will break Xen if the mapping was 
previously RO.
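
One way to avoid that (a hypothetical sketch of the idea, assuming
pte_pgprot() is usable here; not the fix that eventually went in) is to
re-use the existing entry's protections when the PTE is already
populated, instead of forcing PAGE_KERNEL:

		if (pte_val(*pte)) {
			/* Keep whatever protections the existing mapping
			 * already carries (it may be RO under Xen if the
			 * page is itself a pagetable), refreshing only the
			 * frame for this address. */
			set_pte(pte, pfn_pte(addr >> PAGE_SHIFT,
					     pte_pgprot(*pte)));
			last_map_addr = (addr & PAGE_MASK) + PAGE_SIZE;
			continue;
		}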

>  
>  		if (0)
>  			printk("   pte=%p addr=%lx pte=%016lx\n",
>  			       pte, addr, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL).pte);
> +		pages++;
> +repeat_set_pte:
>  		set_pte(pte, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL));
>  		last_map_addr = (addr & PAGE_MASK) + PAGE_SIZE;
> -		pages++;
>  	}
> -	update_page_count(PG_LEVEL_4K, pages);
> +
> +	if (physical_mapping_iter == 1)
> +		update_page_count(PG_LEVEL_4K, pages);
>  
>  	return last_map_addr;
>  }
> @@ -371,7 +376,6 @@
>  {
>  	unsigned long pages = 0;
>  	unsigned long last_map_addr = end;
> -	unsigned long start = address;
>  
>  	int i = pmd_index(address);
>  
> @@ -394,15 +398,14 @@
>  				last_map_addr = phys_pte_update(pmd, address,
>  								end);
>  				spin_unlock(&init_mm.page_table_lock);
> +				continue;
>  			}
> -			/* Count entries we're using from level2_ident_pgt */
> -			if (start == 0)
> -				pages++;
> -			continue;
> +			goto repeat_set_pte;
>  		}
>  
>  		if (page_size_mask & (1<<PG_LEVEL_2M)) {
>  			pages++;
> +repeat_set_pte:
>  			spin_lock(&init_mm.page_table_lock);
>  			set_pte((pte_t *)pmd,
>  				pfn_pte(address >> PAGE_SHIFT, PAGE_KERNEL_LARGE));
> @@ -419,7 +422,8 @@
>  		pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
>  		spin_unlock(&init_mm.page_table_lock);
>  	}
> -	update_page_count(PG_LEVEL_2M, pages);
> +	if (physical_mapping_iter == 1)
> +		update_page_count(PG_LEVEL_2M, pages);
>  	return last_map_addr;
>  }
>  
> @@ -458,14 +462,18 @@
>  		}
>  
>  		if (pud_val(*pud)) {
> -			if (!pud_large(*pud))
> +			if (!pud_large(*pud)) {
>  				last_map_addr = phys_pmd_update(pud, addr, end,
>  							 page_size_mask);
> -			continue;
> +				continue;
> +			}
> +
> +			goto repeat_set_pte;
>  		}
>  
>  		if (page_size_mask & (1<<PG_LEVEL_1G)) {
>  			pages++;
> +repeat_set_pte:
>  			spin_lock(&init_mm.page_table_lock);
>  			set_pte((pte_t *)pud,
>  				pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_LARGE));
> @@ -483,7 +491,9 @@
>  		spin_unlock(&init_mm.page_table_lock);
>  	}
>  	__flush_tlb_all();
> -	update_page_count(PG_LEVEL_1G, pages);
> +
> +	if (physical_mapping_iter == 1)
> +		update_page_count(PG_LEVEL_1G, pages);
>  
>  	return last_map_addr;
>  }
> @@ -547,15 +557,54 @@
>  		direct_gbpages = 0;
>  }
>  
> +static int is_kernel(unsigned long pfn)
> +{
> +	unsigned long pg_addresss = pfn << PAGE_SHIFT;
> +
> +	if (pg_addresss >= (unsigned long) __pa(_text) &&
> +	    pg_addresss <= (unsigned long) __pa(_end))
> +		return 1;
> +
> +	return 0;
> +}
> +
>  static unsigned long __init kernel_physical_mapping_init(unsigned long start,
>  						unsigned long end,
>  						unsigned long page_size_mask)
>  {
>  
> -	unsigned long next, last_map_addr = end;
> +	unsigned long next, last_map_addr;
> +	u64 cached_supported_pte_mask = __supported_pte_mask;
> +	unsigned long cache_start = start;
> +	unsigned long cache_end = end;
> +
> +	/*
> +	 * First iteration will setup identity mapping using large/small pages
> +	 * based on page_size_mask, with other attributes same as set by
> +	 * the early code in head_64.S
>   

We can't assume here that the pagetables we're modifying are necessarily 
the head_64.S ones.

> +	 *
> +	 * Second iteration will setup the appropriate attributes
> +	 * as desired for the kernel identity mapping.
> +	 *
> +	 * This two pass mechanism conforms to the TLB app note which says:
> +	 *
> +	 *     "Software should not write to a paging-structure entry in a way
> +	 *      that would change, for any linear address, both the page size
> +	 *      and either the page frame or attributes."
> +	 *
> +	 * For now, only difference between very early PTE attributes used in
> +	 * head_64.S and here is _PAGE_NX.
> +	 */
> +	BUILD_BUG_ON((__PAGE_KERNEL_LARGE & ~__PAGE_KERNEL_IDENT_LARGE_EXEC)
> +		     != _PAGE_NX);
> +	__supported_pte_mask &= ~(_PAGE_NX);
> +	physical_mapping_iter = 1;
>  
> -	start = (unsigned long)__va(start);
> -	end = (unsigned long)__va(end);
> +repeat:
> +	last_map_addr = cache_end;
> +
> +	start = (unsigned long)__va(cache_start);
> +	end = (unsigned long)__va(cache_end);
>  
>  	for (; start < end; start = next) {
>  		pgd_t *pgd = pgd_offset_k(start);
> @@ -567,11 +616,21 @@
>  			next = end;
>  
>  		if (pgd_val(*pgd)) {
> +			/*
> +			 * Static identity mappings will be overwritten
> +			 * with run-time mappings. For example, this allows
> +			 * the static 0-1GB identity mapping to be mapped
> +			 * non-executable with this.
> +			 */
> +			if (is_kernel(pte_pfn(*((pte_t *) pgd))))
> +				goto realloc;
>   

This is definitely a Xen-breaker, but removing this is not sufficient on 
its own.  Is this actually related to the rest of the patch, or a 
gratuitous throw-in change?

> +
>  			last_map_addr = phys_pud_update(pgd, __pa(start),
>  						 __pa(end), page_size_mask);
>  			continue;
>  		}
>  
> +realloc:
>  		pud = alloc_low_page(&pud_phys);
>  		last_map_addr = phys_pud_init(pud, __pa(start), __pa(next),
>  						 page_size_mask);
> @@ -581,6 +640,16 @@
>  		pgd_populate(&init_mm, pgd, __va(pud_phys));
>  		spin_unlock(&init_mm.page_table_lock);
>  	}
> +	__flush_tlb_all();
> +
> +	if (physical_mapping_iter == 1) {
> +		physical_mapping_iter = 2;
> +		/*
> +		 * Second iteration will set the actual desired PTE attributes.
> +		 */
> +		__supported_pte_mask = cached_supported_pte_mask;
> +		goto repeat;
> +	}
>  
>  	return last_map_addr;
>  }
>
>   

Thanks,
    J
Jeremy Fitzhardinge - 2008-10-06 23:09:55
Jeremy Fitzhardinge wrote:
> As it is, I suspect it will take a non-trivial amount of work to 
> restore Xen with this code in place (touching this code is always 
> non-trivial).  I haven't looked into it in depth yet, but there's a 
> few stand out "bad for Xen" pieces of code here.  (And I haven't 
> tested 32-bit yet.)

32-bit Xen is OK with this patch.  Reverting it restores 64-bit Xen.

    J
Suresh Siddha - 2008-10-07 01:58:35
On Mon, Oct 06, 2008 at 01:48:13PM -0700, Jeremy Fitzhardinge wrote:
> Suresh Siddha wrote:
> > In the first pass, the kernel physical mapping is set up using large or
> > small pages, but with the same PTE attributes as those set up by the
> > early boot code in head_[32|64].S.
> >
> > After flushing the TLBs, we go through the second pass, which sets up the
> > direct-mapped PTEs with the appropriate attributes (like NX, GLOBAL etc.)
> > that are runtime detectable.
> >
> > This two pass mechanism conforms to the TLB app note which says:
> >
> > "Software should not write to a paging-structure entry in a way that would
> >  change, for any linear address, both the page size and either the page frame
> >  or attributes."
> >
> 
> I'd noticed that current tip/master hasn't been booting under Xen, and I
> just got around to bisecting it down to this change.
> 
> This patch is causing Xen to fail various pagetable updates because it
> ends up remapping pagetables to RW, which Xen explicitly prohibits (as
> that would allow guests to make arbitrary changes to pagetables, rather
> than have them mediated by the hypervisor).

Jeremy, hi. This dependency is not documented or explicitly called out anywhere
in the mm/init_64.c code. I would have expected to see a big comment near this
kind of code :(

	if (pte_val(*pte))
		continue;
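
Something along these lines next to that check would have captured the
constraint (illustrative wording only, based on the rules Jeremy lists
above):

		/*
		 * Never overwrite an already-populated entry: under Xen the
		 * initial pagetables come from the hypervisor, and any page
		 * that is itself a pagetable must stay mapped read-only, so
		 * blindly rewriting it as PAGE_KERNEL (RW) will be refused.
		 */
		if (pte_val(*pte))
			continue;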

> A few things strike me about this patch:
> 
>    1. It's high time we unified the physical memory mapping code, and it
>       would have been better to do so before making a change of this
>       kind to the code.
>    2. The existing code already avoided overwriting a pagetable entry
>       unless the page size changed. Wouldn't it be easier to construct
>       the mappings first, using the old code, then do a CPA call to set
>       the NX bit appropriately?

It is not just the NX bit that we change. For DEBUG_PAGEALLOC, we want
to use 4k pages instead of large page mappings for the identity mapping
(this lets us clean up some of the cpa pool code, avoiding the cpa calls
 and hence the page allocations needed to split big pages from interrupt
 context). In that case we will be splitting the static large page mappings.
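
A sketch of that idea (an assumed mechanism, not necessarily the exact
code in the series): mask out the large-page sizes up front, so the
identity map is built with 4k pages only when DEBUG_PAGEALLOC is on.

#ifdef CONFIG_DEBUG_PAGEALLOC
	/* Map the direct mapping with 4k pages only, so later permission
	 * changes never have to split a large page (possibly from
	 * interrupt context). */
	page_size_mask &= ~((1 << PG_LEVEL_2M) | (1 << PG_LEVEL_1G));
#endif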

>    3. The actual implementation is pretty ugly; adding a global variable
>       and hopping about with goto does not improve this code.

This is very early init code, and I can't do anything fancy like calling cpa(),
which needs the mm to be up and running. Also, doing cpa's on individual chunks
of the entire identity mapping will make boot slow.

Now that I am aware of this Xen failure, I will try to clean this up in a better
fashion.

> What are the downsides of not following the TLB app note's advice?  Does

The app note says that cpu behavior is undefined. We will probably see more
real issues with attribute changes like UC/WB etc.; as far as the other
attributes are concerned, we are being paranoid and wanted to fix all the
violations while we are at it.

> it cause real failures?  Could we revert this patch and address the
> problem some other way?  Which app note is this, BTW?  The one I have on
> hand, "TLBs, Paging-Structure Caches, and Their Invalidation", Apr 2007,
> does not seem to mention this restriction.

http://developer.intel.com/design/processor/applnots/317080.pdf
Section 6 page 26

> As it is, I suspect it will take a non-trivial amount of work to restore

I didn't get much time today to think about this. Let me think a bit
more tonight and I will get back to you tomorrow with either a patch fixing
this or a request to Ingo to revert it (if we revert, we have to revert
the whole patchset; otherwise we will break DEBUG_PAGEALLOC etc.).

> Xen with this code in place (touching this code is always non-trivial).
> I haven't looked into it in depth yet, but there's a few stand out "bad
> for Xen" pieces of code here.  (And I haven't tested 32-bit yet.)
> 
> Quick rules for keeping Xen happy here:
> 
>    1. Xen provides its own initial pagetable; the head_64.S one is
>       unused when booting under Xen.
>    2. Xen requires that any pagetable page must always be mapped RO, so
>       we're careful to not replace an existing mapping with a new one,
>       in case the existing mapping is a pagetable one.
>    3. Xen never uses large pages, and the hypervisor will fail any
>       attempt to do so.

Thanks for this info. Will get back to you tomorrow.

thanks,
suresh
Jeremy Fitzhardinge - 2008-10-07 15:28:08
Suresh Siddha wrote:
> Jeremy, hi. This dependency is not documented or explicitly called anywhere
> in the mm/init_64.c code. I would have expected to see a big comment near this
> kind of code :(
>   

Indeed yes.  I've explained it in various places, including commit 
comments, but there should be a comment right there in the code.

> It is not just the NX bit that we change. For DEBUG_PAGEALLOC, we want
> to use 4k pages instead of large page mappings for the identity mapping
> (this lets us clean up some of the cpa pool code, avoiding the cpa calls
>  and hence the page allocations needed to split big pages from interrupt
>  context). In that case we will be splitting the static large page mappings.
>   

Well, that's OK.  We just need to preserve the original page permissions 
when fragmenting the large mappings.  (This isn't a case that affects 
Xen, because it will already be 4k mappings.)
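
A hedged sketch of what "preserve the original permissions" could look
like when fragmenting a 2M mapping (hypothetical helper, assuming
pte_pgprot()/pte_pfn() may be applied to the pmd entry via a cast, as
the existing code already does):

static void __meminit split_large_pmd_preserve(pmd_t *pmd, pte_t *pte_page)
{
	/* Carry the 2M mapping's flags over to each new 4k PTE, dropping
	 * only the PSE bit, so an RO or NX region stays RO/NX after the
	 * split. */
	pgprot_t prot = pte_pgprot(*(pte_t *)pmd);
	unsigned long pfn = pte_pfn(*(pte_t *)pmd);
	int i;

	pgprot_val(prot) &= ~_PAGE_PSE;

	for (i = 0; i < PTRS_PER_PTE; i++, pfn++)
		set_pte(&pte_page[i], pfn_pte(pfn, prot));

	pmd_populate_kernel(&init_mm, pmd, pte_page);
}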

>>    3. The actual implementation is pretty ugly; adding a global variable
>>       and hopping about with goto does not improve this code.
>>     
>
> This is very early init code, and I can't do anything fancy like calling cpa(),
> which needs the mm to be up and running.

Well, is there any urgency to set NX that early?  It might catch some 
early bugs, but there's no urgent need.

> Also, doing cpa's on individual chunks
> of the entire identity mapping will make boot slow.
>   

Really?  Why?  How slow?


>> it cause real failures?  Could we revert this patch and address the
>> problem some other way?  Which app note is this, BTW?  The one I have on
>> hand, "TLBs, Paging-Structure Caches, and Their Invalidation", Apr 2007,
>> does not seem to mention this restriction.
>>     
>
> http://developer.intel.com/design/processor/applnots/317080.pdf
> Section 6 page 26
>   

Ah, OK.  I have the first version of this document which does not 
mention this.  It would be good to explicitly cite this document by name 
in the comments.
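
For instance, the comment over the two-pass code could name it
explicitly (illustrative wording only):

	/*
	 * Two-pass setup, per Intel application note 317080,
	 * "TLBs, Paging-Structure Caches, and Their Invalidation"
	 * (section 6): don't change both the page size and the
	 * frame/attributes of a paging-structure entry in one write.
	 */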

>> Xen with this code in place (touching this code is always non-trivial).
>> I haven't looked into it in depth yet, but there's a few stand out "bad
>> for Xen" pieces of code here.  (And I haven't tested 32-bit yet.)
>>
>> Quick rules for keeping Xen happy here:
>>
>>    1. Xen provides its own initial pagetable; the head_64.S one is
>>       unused when booting under Xen.
>>    2. Xen requires that any pagetable page must always be mapped RO, so
>>       we're careful to not replace an existing mapping with a new one,
>>       in case the existing mapping is a pagetable one.
>>    3. Xen never uses large pages, and the hypervisor will fail any
>>       attempt to do so.
>>     
>
> Thanks for this info. Will get back to you tomorrow.
>   

Great.  Also, do you think you'll have a chance to look at unifying the 
32 and 64 bit code (where 32 uses the 64-bit version)?


Thanks,
    J

Patch

Index: tip/arch/x86/mm/init_32.c
===================================================================
--- tip.orig/arch/x86/mm/init_32.c	2008-09-22 15:59:31.000000000 -0700
+++ tip/arch/x86/mm/init_32.c	2008-09-22 15:59:37.000000000 -0700
@@ -196,11 +196,30 @@ 
 	pgd_t *pgd;
 	pmd_t *pmd;
 	pte_t *pte;
-	unsigned pages_2m = 0, pages_4k = 0;
+	unsigned pages_2m, pages_4k;
+	int mapping_iter;
+
+	/*
+	 * First iteration will setup identity mapping using large/small pages
+	 * based on use_pse, with other attributes same as set by
+	 * the early code in head_32.S
+	 *
+	 * Second iteration will setup the appropriate attributes (NX, GLOBAL..)
+	 * as desired for the kernel identity mapping.
+	 *
+	 * This two pass mechanism conforms to the TLB app note which says:
+	 *
+	 *     "Software should not write to a paging-structure entry in a way
+	 *      that would change, for any linear address, both the page size
+	 *      and either the page frame or attributes."
+	 */
+	mapping_iter = 1;
 
 	if (!cpu_has_pse)
 		use_pse = 0;
 
+repeat:
+	pages_2m = pages_4k = 0;
 	pfn = start_pfn;
 	pgd_idx = pgd_index((pfn<<PAGE_SHIFT) + PAGE_OFFSET);
 	pgd = pgd_base + pgd_idx;
@@ -226,6 +245,13 @@ 
 			if (use_pse) {
 				unsigned int addr2;
 				pgprot_t prot = PAGE_KERNEL_LARGE;
+				/*
+				 * first pass will use the same initial
+				 * identity mapping attribute + _PAGE_PSE.
+				 */
+				pgprot_t init_prot =
+					__pgprot(PTE_IDENT_ATTR |
+						 _PAGE_PSE);
 
 				addr2 = (pfn + PTRS_PER_PTE-1) * PAGE_SIZE +
 					PAGE_OFFSET + PAGE_SIZE-1;
@@ -235,7 +261,10 @@ 
 					prot = PAGE_KERNEL_LARGE_EXEC;
 
 				pages_2m++;
-				set_pmd(pmd, pfn_pmd(pfn, prot));
+				if (mapping_iter == 1)
+					set_pmd(pmd, pfn_pmd(pfn, init_prot));
+				else
+					set_pmd(pmd, pfn_pmd(pfn, prot));
 
 				pfn += PTRS_PER_PTE;
 				continue;
@@ -247,17 +276,43 @@ 
 			for (; pte_ofs < PTRS_PER_PTE && pfn < end_pfn;
 			     pte++, pfn++, pte_ofs++, addr += PAGE_SIZE) {
 				pgprot_t prot = PAGE_KERNEL;
+				/*
+				 * first pass will use the same initial
+				 * identity mapping attribute.
+				 */
+				pgprot_t init_prot = __pgprot(PTE_IDENT_ATTR);
 
 				if (is_kernel_text(addr))
 					prot = PAGE_KERNEL_EXEC;
 
 				pages_4k++;
-				set_pte(pte, pfn_pte(pfn, prot));
+				if (mapping_iter == 1)
+					set_pte(pte, pfn_pte(pfn, init_prot));
+				else
+					set_pte(pte, pfn_pte(pfn, prot));
 			}
 		}
 	}
-	update_page_count(PG_LEVEL_2M, pages_2m);
-	update_page_count(PG_LEVEL_4K, pages_4k);
+	if (mapping_iter == 1) {
+		/*
+		 * update direct mapping page count only in the first
+		 * iteration.
+		 */
+		update_page_count(PG_LEVEL_2M, pages_2m);
+		update_page_count(PG_LEVEL_4K, pages_4k);
+
+		/*
+		 * local global flush tlb, which will flush the previous
+		 * mappings present in both small and large page TLB's.
+		 */
+		__flush_tlb_all();
+
+		/*
+		 * Second iteration will set the actual desired PTE attributes.
+		 */
+		mapping_iter = 2;
+		goto repeat;
+	}
 }
 
 /*
Index: tip/arch/x86/mm/init_64.c
===================================================================
--- tip.orig/arch/x86/mm/init_64.c	2008-09-22 15:59:31.000000000 -0700
+++ tip/arch/x86/mm/init_64.c	2008-09-22 15:59:37.000000000 -0700
@@ -323,6 +323,8 @@ 
 	early_iounmap(adr, PAGE_SIZE);
 }
 
+static int physical_mapping_iter;
+
 static unsigned long __meminit
 phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end)
 {
@@ -343,16 +345,19 @@ 
 		}
 
 		if (pte_val(*pte))
-			continue;
+			goto repeat_set_pte;
 
 		if (0)
 			printk("   pte=%p addr=%lx pte=%016lx\n",
 			       pte, addr, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL).pte);
+		pages++;
+repeat_set_pte:
 		set_pte(pte, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL));
 		last_map_addr = (addr & PAGE_MASK) + PAGE_SIZE;
-		pages++;
 	}
-	update_page_count(PG_LEVEL_4K, pages);
+
+	if (physical_mapping_iter == 1)
+		update_page_count(PG_LEVEL_4K, pages);
 
 	return last_map_addr;
 }
@@ -371,7 +376,6 @@ 
 {
 	unsigned long pages = 0;
 	unsigned long last_map_addr = end;
-	unsigned long start = address;
 
 	int i = pmd_index(address);
 
@@ -394,15 +398,14 @@ 
 				last_map_addr = phys_pte_update(pmd, address,
 								end);
 				spin_unlock(&init_mm.page_table_lock);
+				continue;
 			}
-			/* Count entries we're using from level2_ident_pgt */
-			if (start == 0)
-				pages++;
-			continue;
+			goto repeat_set_pte;
 		}
 
 		if (page_size_mask & (1<<PG_LEVEL_2M)) {
 			pages++;
+repeat_set_pte:
 			spin_lock(&init_mm.page_table_lock);
 			set_pte((pte_t *)pmd,
 				pfn_pte(address >> PAGE_SHIFT, PAGE_KERNEL_LARGE));
@@ -419,7 +422,8 @@ 
 		pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
 		spin_unlock(&init_mm.page_table_lock);
 	}
-	update_page_count(PG_LEVEL_2M, pages);
+	if (physical_mapping_iter == 1)
+		update_page_count(PG_LEVEL_2M, pages);
 	return last_map_addr;
 }
 
@@ -458,14 +462,18 @@ 
 		}
 
 		if (pud_val(*pud)) {
-			if (!pud_large(*pud))
+			if (!pud_large(*pud)) {
 				last_map_addr = phys_pmd_update(pud, addr, end,
 							 page_size_mask);
-			continue;
+				continue;
+			}
+
+			goto repeat_set_pte;
 		}
 
 		if (page_size_mask & (1<<PG_LEVEL_1G)) {
 			pages++;
+repeat_set_pte:
 			spin_lock(&init_mm.page_table_lock);
 			set_pte((pte_t *)pud,
 				pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_LARGE));
@@ -483,7 +491,9 @@ 
 		spin_unlock(&init_mm.page_table_lock);
 	}
 	__flush_tlb_all();
-	update_page_count(PG_LEVEL_1G, pages);
+
+	if (physical_mapping_iter == 1)
+		update_page_count(PG_LEVEL_1G, pages);
 
 	return last_map_addr;
 }
@@ -547,15 +557,54 @@ 
 		direct_gbpages = 0;
 }
 
+static int is_kernel(unsigned long pfn)
+{
+	unsigned long pg_addresss = pfn << PAGE_SHIFT;
+
+	if (pg_addresss >= (unsigned long) __pa(_text) &&
+	    pg_addresss <= (unsigned long) __pa(_end))
+		return 1;
+
+	return 0;
+}
+
 static unsigned long __init kernel_physical_mapping_init(unsigned long start,
 						unsigned long end,
 						unsigned long page_size_mask)
 {
 
-	unsigned long next, last_map_addr = end;
+	unsigned long next, last_map_addr;
+	u64 cached_supported_pte_mask = __supported_pte_mask;
+	unsigned long cache_start = start;
+	unsigned long cache_end = end;
+
+	/*
+	 * First iteration will setup identity mapping using large/small pages
+	 * based on page_size_mask, with other attributes same as set by
+	 * the early code in head_64.S
+	 *
+	 * Second iteration will setup the appropriate attributes
+	 * as desired for the kernel identity mapping.
+	 *
+	 * This two pass mechanism conforms to the TLB app note which says:
+	 *
+	 *     "Software should not write to a paging-structure entry in a way
+	 *      that would change, for any linear address, both the page size
+	 *      and either the page frame or attributes."
+	 *
+	 * For now, only difference between very early PTE attributes used in
+	 * head_64.S and here is _PAGE_NX.
+	 */
+	BUILD_BUG_ON((__PAGE_KERNEL_LARGE & ~__PAGE_KERNEL_IDENT_LARGE_EXEC)
+		     != _PAGE_NX);
+	__supported_pte_mask &= ~(_PAGE_NX);
+	physical_mapping_iter = 1;
 
-	start = (unsigned long)__va(start);
-	end = (unsigned long)__va(end);
+repeat:
+	last_map_addr = cache_end;
+
+	start = (unsigned long)__va(cache_start);
+	end = (unsigned long)__va(cache_end);
 
 	for (; start < end; start = next) {
 		pgd_t *pgd = pgd_offset_k(start);
@@ -567,11 +616,21 @@ 
 			next = end;
 
 		if (pgd_val(*pgd)) {
+			/*
+			 * Static identity mappings will be overwritten
+			 * with run-time mappings. For example, this allows
+			 * the static 0-1GB identity mapping to be mapped
+			 * non-executable with this.
+			 */
+			if (is_kernel(pte_pfn(*((pte_t *) pgd))))
+				goto realloc;
+
 			last_map_addr = phys_pud_update(pgd, __pa(start),
 						 __pa(end), page_size_mask);
 			continue;
 		}
 
+realloc:
 		pud = alloc_low_page(&pud_phys);
 		last_map_addr = phys_pud_init(pud, __pa(start), __pa(next),
 						 page_size_mask);
@@ -581,6 +640,16 @@ 
 		pgd_populate(&init_mm, pgd, __va(pud_phys));
 		spin_unlock(&init_mm.page_table_lock);
 	}
+	__flush_tlb_all();
+
+	if (physical_mapping_iter == 1) {
+		physical_mapping_iter = 2;
+		/*
+		 * Second iteration will set the actual desired PTE attributes.
+		 */
+		__supported_pte_mask = cached_supported_pte_mask;
+		goto repeat;
+	}
 
 	return last_map_addr;
 }

