Re: [PATCH 0/5] Bring the BusLogic host bus adapter driver up to Y2021

2021-04-19 Thread Khalid Aziz
On 4/19/21 10:01 AM, Maciej W. Rozycki wrote:
> On Mon, 19 Apr 2021, Khalid Aziz wrote:
> 
>> On 4/18/21 2:21 PM, Ondrej Zary wrote:
>>>
>>> Found the 3000763 document here:
>>> https://doc.lagout.org/science/0_Computer Science/0_Computer History/old-hardware/buslogic/3000763_PCI_EISA_Wide_SCSI_Tech_Ref_Dec94.pdf
>>>
>>> There's also 3002593 there:
>>> https://doc.lagout.org/science/0_Computer Science/0_Computer History/old-hardware/buslogic/
>>>
>>
>> Thanks!!!
> 
>  Ondrej: Thanks a lot indeed!  These documents seem to have the essential 
> interface details all covered, except for Fast-20 SCSI adapters, which I 
> guess are a minor modification from the software's point of view.
> 
>  Khalid: I have skimmed over these documents and I infer 24-bit addressing 
> can be verified with any MultiMaster adapter, including ones that do have 
> 32-bit addressing implemented, by using the legacy Initialize Mailbox HBA 
> command.  That could be used to stop Christoph's recent changes for older 
> adapter support removal and replace them with proper fixes for whatever 
> has become broken.  Is that something you'd be willing as the driver's 
> maintainer to look into, or shall I?
> 

Hi Maciej,

Do you mean using OpCode 01 (INITIALIZE MAILBOX) to set a 24-bit mailbox
address in place of OpCode 81? Verifying the change would be a
challenge. Do you have an old adapter to test it with? If you do, go
ahead and make the changes and I will be happy to review them. I have
only a BT-757 adapter.
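
For reference, here is a rough sketch of what using the legacy command
could look like. The BLOGIC_INIT_MBOX opcode name, the 24-bit parameter
layout and the blogic_cmd() call below are assumptions based on the
MultiMaster documents, not tested code:

/*
 * Hypothetical sketch only: initialize the mailboxes with the legacy
 * 24-bit Initialize Mailbox command (opcode 0x01) instead of the
 * extended 32-bit command (opcode 0x81).
 */
struct blogic_mbox_req24 {
	unsigned char count;	/* number of outgoing/incoming mailboxes */
	unsigned char addr[3];	/* 24-bit mailbox address, MSB first (assumed) */
} __packed;

static int blogic_init_mbox24(struct blogic_adapter *adapter,
			      dma_addr_t mbox_phys, unsigned char count)
{
	struct blogic_mbox_req24 req;

	/* The legacy command can only address the low 16 MB. */
	if (mbox_phys >= (1UL << 24))
		return -ENOMEM;

	req.count = count;
	req.addr[0] = (mbox_phys >> 16) & 0xff;
	req.addr[1] = (mbox_phys >> 8) & 0xff;
	req.addr[2] = mbox_phys & 0xff;

	/* assumes blogic_cmd(adapter, opcode, param, paramlen, reply, replylen) */
	return blogic_cmd(adapter, BLOGIC_INIT_MBOX, &req, sizeof(req),
			  NULL, 0);
}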

Thanks,
Khalid


Re: [PATCH 0/5] Bring the BusLogic host bus adapter driver up to Y2021

2021-04-19 Thread Khalid Aziz
On 4/18/21 2:21 PM, Ondrej Zary wrote:
> On Friday 16 April 2021 23:25:18 Maciej W. Rozycki wrote:
>> On Fri, 16 Apr 2021, Khalid Aziz wrote:
>>
>>>>  Sadly I didn't get to these resources while they were still there, and 
>>>> neither did archive.org, and now they no longer appear to be available anywhere 
>>>> online.  I'm sure Leonard had this all, but, alas, he is long gone too.
>>>
>>> These documents were all gone by the time I started working on this
>>> driver in 2013.
>>
>>  According to my e-mail archives I got my BT-958 directly from Mylex brand 
>> new as KT-958 back in early 1998 (the rest of the system is a bit older).  
>> It wasn't until 2003, when I was caught by the issue with the LOG SENSE 
>> command, that I got interested in the programming details of the adapter.  
>>
>>  At that time Mylex was in flux already, having been bought by LSI shortly 
>> before.  Support advised me what was there at Leonard's www.dandelion.com 
>> site was all that was available (I have a personal copy of the site) and 
>> they would suggest to switch to their current products.  So it was too 
>> late already ten years before you got at the driver.
>>
>>  I'll yet double-check the contents of the KT-958 kit which I have kept, 
>> but if there was any technical documentation supplied there on a CD (which 
>> I doubt), I would have surely discovered it earlier.  It's away along with 
>> the server, remotely managed, ~160km/100mi from here, so it'll be some 
>> time before I get at it though.
>>
>>  Still, maybe one of the SCSI old-timers has that stuff stashed somewhere.  
>> I have plenty of technical documentation going back to early to mid 1990s 
>> (some in the hard copy form), not necessarily readily available nowadays. 
>> Sadly lots of such stuff goes offline or is completely lost to the mist of 
>> time.
>>
>>   Maciej
>>
> 
> Found the 3000763 document here:
> https://doc.lagout.org/science/0_Computer Science/0_Computer History/old-hardware/buslogic/3000763_PCI_EISA_Wide_SCSI_Tech_Ref_Dec94.pdf
> 
> There's also 3002593 there:
> https://doc.lagout.org/science/0_Computer Science/0_Computer History/old-hardware/buslogic/
> 

Thanks!!!

--
Khalid


Re: [PATCH 1/5] scsi: BusLogic: Fix missing `pr_cont' use

2021-04-16 Thread Khalid Aziz
On 4/15/21 8:08 PM, Joe Perches wrote:
> And while it's a lot more code, I'd prefer a solution that looks more
> like the other commonly used kernel logging extension mechanisms
> where adapter is placed before the format, ... in the argument list.

Hi Joe,

I don't mind making these changes. It is quite a bit of code, but
consistency with other kernel code is useful. Would you like to finalize
this patch, or would you prefer that I take this patch as a starting
point and finalize it?

Thanks,
Khalid

> 
> Today it's:
> 
> void blogic_msg(enum, fmt, adapter, ...);
> 
> without the __printf marking so there is one format/arg mismatch.
> 
> fyi: in the suggested patch below it's
> - blogic_info("BIOS Address: 0x%lX, ", adapter,
> - adapter->bios_addr);
> + blogic_cont(adapter, "BIOS Address: 0x%X, ",
> + adapter->bios_addr);
> 
> I'd prefer
> __printf(3, 4)
> void blogic_msg(enum, adapter, fmt, ...)
> 
> (or maybe void blogic_msg(adapter, enum, fmt, ...))
> 
> And there's a simple addition of a blogic_cont macro and extension
> to blogic_msg to simplify the logic and obviousness of the logging
> extension lines too.
> 
> I suggest this done with coccinelle and a little typing:
> ---
>  drivers/scsi/BusLogic.c | 496 
> +++-
>  drivers/scsi/BusLogic.h |  32 ++--
>  2 files changed, 341 insertions(+), 187 deletions(-)
> 
> diff --git a/drivers/scsi/BusLogic.c b/drivers/scsi/BusLogic.c
> index ccb061ab0a0a..7a52371b5ab6 100644
> --- a/drivers/scsi/BusLogic.c
> +++ b/drivers/scsi/BusLogic.c
> @@ -134,8 +134,10 @@ static char *blogic_cmd_failure_reason;
>  
>  static void blogic_announce_drvr(struct blogic_adapter *adapter)
>  {
> - blogic_announce("* BusLogic SCSI Driver Version " 
> blogic_drvr_version " of " blogic_drvr_date " *\n", adapter);
> - blogic_announce("Copyright 1995-1998 by Leonard N. Zubkoff 
> \n", adapter);
> + blogic_announce(adapter,
> + "* BusLogic SCSI Driver Version " 
> blogic_drvr_version " of " blogic_drvr_date " *\n");
> + blogic_announce(adapter,
> + "Copyright 1995-1998 by Leonard N. Zubkoff 
> \n");
>  }
>  
>  
> @@ -198,8 +200,7 @@ static bool __init blogic_create_initccbs(struct 
> blogic_adapter *adapter)
>   blk_pointer = dma_alloc_coherent(>pci_device->dev,
>   blk_size, , GFP_KERNEL);
>   if (blk_pointer == NULL) {
> - blogic_err("UNABLE TO ALLOCATE CCB GROUP - DETACHING\n",
> - adapter);
> + blogic_err(adapter, "UNABLE TO ALLOCATE CCB GROUP - 
> DETACHING\n");
>   return false;
>   }
>   blogic_init_ccbs(adapter, blk_pointer, blk_size, blkp);
> @@ -259,10 +260,13 @@ static void blogic_create_addlccbs(struct 
> blogic_adapter *adapter,
>   }
>   if (adapter->alloc_ccbs > prev_alloc) {
>   if (print_success)
> - blogic_notice("Allocated %d additional CCBs (total now 
> %d)\n", adapter, adapter->alloc_ccbs - prev_alloc, adapter->alloc_ccbs);
> + blogic_notice(adapter,
> +   "Allocated %d additional CCBs (total now 
> %d)\n",
> +   adapter->alloc_ccbs - prev_alloc,
> +   adapter->alloc_ccbs);
>   return;
>   }
> - blogic_notice("Failed to allocate additional CCBs\n", adapter);
> + blogic_notice(adapter, "Failed to allocate additional CCBs\n");
>   if (adapter->drvr_qdepth > adapter->alloc_ccbs - adapter->tgt_count) {
>   adapter->drvr_qdepth = adapter->alloc_ccbs - adapter->tgt_count;
>   adapter->scsi_host->can_queue = adapter->drvr_qdepth;
> @@ -441,7 +445,9 @@ static int blogic_cmd(struct blogic_adapter *adapter, 
> enum blogic_opcode opcode,
>   goto done;
>   }
>   if (blogic_global_options.trace_config)
> - blogic_notice("blogic_cmd(%02X) Status = %02X: (Modify 
> I/O Address)\n", adapter, opcode, statusreg.all);
> + blogic_notice(adapter,
> +   "blogic_cmd(%02X) Status = %02X: (Modify 
> I/O Address)\n",
> +   opcode, statusreg.all);
>   result = 0;
>   goto done;
>   }
> @@ -499,15 +505,16 @@ static int blogic_cmd(struct blogic_adapter *adapter, 
> enum blogic_opcode opcode,
>*/
>   if (blogic_global_options.trace_config) {
>   int i;
> - blogic_notice("blogic_cmd(%02X) Status = %02X: %2d ==> %2d:",
> - adapter, opcode, statusreg.all, replylen,
> + blogic_notice(adapter,
> +   "blogic_cmd(%02X) Status = %02X: 

Re: [PATCH 2/5] scsi: BusLogic: Avoid unbounded `vsprintf' use

2021-04-16 Thread Khalid Aziz
On 4/14/21 4:39 PM, Maciej W. Rozycki wrote:
> Existing `blogic_msg' invocations do not appear to overrun its internal 
> buffer of a fixed length of 100, which would cause stack corruption, but 
> it's easy to miss with possible further updates and a fix is cheap in 
> performance terms, so limit the output produced into the buffer by using 
> `vsnprintf' rather than `vsprintf'.
> 
> Signed-off-by: Maciej W. Rozycki 
> ---
>  drivers/scsi/BusLogic.c |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> linux-buslogic-vsnprintf.diff
> Index: linux-macro-ide/drivers/scsi/BusLogic.c
> ===
> --- linux-macro-ide.orig/drivers/scsi/BusLogic.c
> +++ linux-macro-ide/drivers/scsi/BusLogic.c
> @@ -3588,7 +3588,7 @@ static void blogic_msg(enum blogic_msgle
>   int len = 0;
>  
>   va_start(args, adapter);
> - len = vsprintf(buf, fmt, args);
> + len = vsnprintf(buf, sizeof(buf), fmt, args);
>   va_end(args);
>   if (msglevel == BLOGIC_ANNOUNCE_LEVEL) {
>   static int msglines = 0;
> 

As Maciej explained in another email, snprintf() does null-terminate
the string, so I think this change is fine.
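
For illustration, the property being relied on here is that the
snprintf()/vsnprintf() family truncates the output to the buffer size,
still NUL-terminates it, and returns the length that would have been
written, so truncation is detectable (plain userspace sketch):

#include <stdio.h>

int main(void)
{
	char buf[8];
	int len = snprintf(buf, sizeof(buf), "%s",
			   "a string longer than eight bytes");

	/* buf now holds "a strin" plus a terminating NUL;
	 * len >= sizeof(buf) signals that truncation happened. */
	printf("len=%d buf=\"%s\"\n", len, buf);
	return 0;
}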

Acked-by: Khalid Aziz 


Re: [PATCH 0/5] Bring the BusLogic host bus adapter driver up to Y2021

2021-04-16 Thread Khalid Aziz
On 4/14/21 4:38 PM, Maciej W. Rozycki wrote:
> Hi,
> 
>  First of all, does anyone have a copy of: "MultiMaster UltraSCSI Host 
> Adapters for PCI Systems: Technical Reference Manual" (pub. 3002493-E)?  
> It used to live in the "Mylex Manuals and Documentation Archives" section 
> of the Mylex web site , 
> specifically at: .
> 
>  Another useful document might be: "Wide SCSI Host Adapters for PCI and 
> EISA Systems: Technical Reference Manual" (pub. 3000763-A), which used to 
> live at: , linked from the 
> same place.
> 
>  Sadly I didn't get to these resources while they were still there, and 
> neither did archive.org, and now they no longer appear to be available anywhere 
> online.  I'm sure Leonard had this all, but, alas, he is long gone too.

These documents were all gone by the time I started working on this
driver in 2013.

--
Khalid




Re: [PATCH 1/5] scsi: BusLogic: Fix missing `pr_cont' use

2021-04-16 Thread Khalid Aziz
if (adapter != NULL && adapter->adapter_initd)
> @@ -3611,7 +3611,7 @@ static void blogic_msg(enum blogic_msgle
>   else
>   printk("%s%s", blogic_msglevelmap[msglevel], 
> buf);
>   } else
> - printk("%s", buf);
> + pr_cont("%s", buf);
>   }
>   begin = (buf[len - 1] == '\n');
>  }
> 

Looks good.

Acked-by: Khalid Aziz 


Re: [PATCH 4/8] scsi: FlashPoint: Remove unused variable 'TID' from 'FlashPoint_AbortCCB()'

2021-03-17 Thread Khalid Aziz
On 3/17/21 3:11 AM, Lee Jones wrote:
> Fixes the following W=1 kernel build warning(s):
> 
>  drivers/scsi/FlashPoint.c: In function ‘FlashPoint_AbortCCB’:
>  drivers/scsi/FlashPoint.c:1618:16: warning: variable ‘TID’ set but not used 
> [-Wunused-but-set-variable]
> 
> Cc: Khalid Aziz 
> Cc: "James E.J. Bottomley" 
> Cc: "Martin K. Petersen" 
> Cc: linux-s...@vger.kernel.org
> Signed-off-by: Lee Jones 
> ---
>  drivers/scsi/FlashPoint.c | 4 
>  1 file changed, 4 deletions(-)
> 
> diff --git a/drivers/scsi/FlashPoint.c b/drivers/scsi/FlashPoint.c
> index f479c542e787c..0464e37c806a4 100644
> --- a/drivers/scsi/FlashPoint.c
> +++ b/drivers/scsi/FlashPoint.c
> @@ -1615,7 +1615,6 @@ static int FlashPoint_AbortCCB(void *pCurrCard, struct 
> sccb *p_Sccb)
>  
>   unsigned char thisCard;
>   CALL_BK_FN callback;
> - unsigned char TID;
>   struct sccb *pSaveSCCB;
>   struct sccb_mgr_tar_info *currTar_Info;
>  
> @@ -1652,9 +1651,6 @@ static int FlashPoint_AbortCCB(void *pCurrCard, struct 
> sccb *p_Sccb)
>   }
>  
>   else {
> -
> - TID = p_Sccb->TargID;
> -
>   if (p_Sccb->Sccb_tag) {
>       MDISABLE_INT(ioport);
>   if (((struct sccb_card *)pCurrCard)->
> 

Acked-by: Khalid Aziz 


Re: [PATCH 1/8] scsi: BusLogic: Supply __printf(x, y) formatting for blogic_msg()

2021-03-17 Thread Khalid Aziz
On 3/17/21 3:11 AM, Lee Jones wrote:
> Fixes the following W=1 kernel build warning(s):
> 
>  In file included from drivers/scsi/BusLogic.c:51:
>  drivers/scsi/BusLogic.c: In function ‘blogic_msg’:
>  drivers/scsi/BusLogic.c:3591:2: warning: function ‘blogic_msg’ might be a 
> candidate for ‘gnu_printf’ format attribute [-Wsuggest-attribute=format]
> 
> Cc: Khalid Aziz 
> Cc: "James E.J. Bottomley" 
> Cc: "Martin K. Petersen" 
> Cc: "Leonard N. Zubkoff" 
> Cc: linux-s...@vger.kernel.org
> Signed-off-by: Lee Jones 
> ---
>  drivers/scsi/BusLogic.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/BusLogic.c b/drivers/scsi/BusLogic.c
> index ccb061ab0a0ad..0ac3f713fc212 100644
> --- a/drivers/scsi/BusLogic.c
> +++ b/drivers/scsi/BusLogic.c
> @@ -3578,7 +3578,7 @@ Target  Requested Completed  Requested Completed  
> Requested Completed\n\
>  /*
>blogic_msg prints Driver Messages.
>  */
> -
> +__printf(2, 4)
>  static void blogic_msg(enum blogic_msglevel msglevel, char *fmt,
>   struct blogic_adapter *adapter, ...)
>  {
> 

Acked-by: Khalid Aziz 


Re: [PATCH] scsi: FlashPoint: Fix typo issue

2021-03-04 Thread Khalid Aziz
On 3/3/21 10:51 PM, zuoqil...@163.com wrote:
> From: zuoqilin 
> 
> Change 'defualt' to 'default'.
> 
> Signed-off-by: zuoqilin 
> ---
>  drivers/scsi/FlashPoint.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/scsi/FlashPoint.c b/drivers/scsi/FlashPoint.c
> index 24ace18..f479c54 100644
> --- a/drivers/scsi/FlashPoint.c
> +++ b/drivers/scsi/FlashPoint.c
> @@ -4534,7 +4534,7 @@ static void FPT_phaseBusFree(u32 port, unsigned char 
> p_card)
>   *
>   * Function: Auto Load Default Map
>   *
> - * Description: Load the Automation RAM with the defualt map values.
> + * Description: Load the Automation RAM with the default map values.
>   *
>   *-*/
>  static void FPT_autoLoadDefaultMap(u32 p_port)
> 

Acked-by: Khalid Aziz 

I will accept this patch this time, but I really would like patches that
fix only typos to fix more than just one typo, preferably all the typos
in the file. There are more typos in that file.

Thanks,
Khalid


Re: [PATCH V3] mm/compaction: correct deferral logic for proactive compaction

2021-01-19 Thread Khalid Aziz

On 1/18/21 10:12 AM, Charan Teja Reddy wrote:

should_proactive_compact_node() returns true when sum of the
weighted fragmentation score of all the zones in the node is greater
than the wmark_high of compaction, which then triggers the proactive
compaction that operates on the individual zones of the node. But
proactive compaction runs on the zone only when its weighted
fragmentation score is greater than wmark_low(=wmark_high - 10).

This means that the sum of the weighted fragmentation scores of all the
zones can exceed wmark_high while the individual weighted zone
fragmentation scores are still less than wmark_low, which triggers
proactive compaction unnecessarily, only for it to return having done
nothing.

The issue with proactive compaction returning without even trying is its
deferral. It is simply deferred for 1 << COMPACT_MAX_DEFER_SHIFT if the
scores across the proactive compaction are the same, on the assumption
that compaction made no progress, when in reality it did not even try.
With a delay of 500msec between successive retries for proactive
compaction, this can result in deferral for ~30sec without even trying
proactive compaction.

The test scenario is: compaction_proactiveness=50, thus wmark_low = 50
and wmark_high = 60. The system has 2 zones (Normal and Movable) with
sizes 5GB and 6GB respectively. After opening some apps on Android, the
weighted fragmentation scores of these zones are 47 and 49 respectively.
The sum of these fragmentation scores is above wmark_high, which
triggers proactive compaction, but since the individual zones' weighted
fragmentation scores are below wmark_low, it returns without trying
proactive compaction. As a result the weighted fragmentation scores of
the zones are still 47 and 49, which makes the existing logic defer
compaction, thinking that no progress was made.

Fix this by checking just the zone's fragmentation score, not the
weighted one, in __compact_finished(), and use the zone's weighted
fragmentation score in fragmentation_score_node(). In the test case
above, if the weighted average is above wmark_high, then the individual
(unweighted) score of at least one zone has to be above wmark_high. This
avoids the unnecessary triggering and deferral of proactive compaction.

Fix-suggested-by: Vlastimil Babka 
Signed-off-by: Charan Teja Reddy 
---

Changes in V3: Addressed suggestions from Vlastimil

Changes in V2: https://lore.kernel.org/patchwork/patch/1366862/

Changes in V1: https://lore.kernel.org/patchwork/patch/1364646/

  mm/compaction.c | 20 ++--
  1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index e5acb97..ccddb3a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1925,20 +1925,28 @@ static bool kswapd_is_running(pg_data_t *pgdat)
  
  /*

   * A zone's fragmentation score is the external fragmentation wrt to the
- * COMPACTION_HPAGE_ORDER scaled by the zone's size. It returns a value
- * in the range [0, 100].
+ * COMPACTION_HPAGE_ORDER. It returns a value in the range [0, 100].
+ */
+static unsigned int fragmentation_score_zone(struct zone *zone)
+{
+   return extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
+}
+
+/*
+ * A weighted zone's fragmentation score is the external fragmentation
+ * wrt to the COMPACTION_HPAGE_ORDER scaled by the zone's size. It
+ * returns a value in the range [0, 100].
   *
   * The scaling factor ensures that proactive compaction focuses on larger
   * zones like ZONE_NORMAL, rather than smaller, specialized zones like
   * ZONE_DMA32. For smaller zones, the score value remains close to zero,
   * and thus never exceeds the high threshold for proactive compaction.
   */
-static unsigned int fragmentation_score_zone(struct zone *zone)
+static unsigned int fragmentation_score_zone_weighted(struct zone *zone)
  {
unsigned long score;
  
-	score = zone->present_pages *

-   extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
+   score = zone->present_pages * fragmentation_score_zone(zone);
return div64_ul(score, zone->zone_pgdat->node_present_pages + 1);
  }
  
@@ -1958,7 +1966,7 @@ static unsigned int fragmentation_score_node(pg_data_t *pgdat)

struct zone *zone;
  
  		zone = >node_zones[zoneid];

-   score += fragmentation_score_zone(zone);
+   score += fragmentation_score_zone_weighted(zone);
}
  
  	return score;




Looks good.

Reviewed-by: Khalid Aziz 



Re: [PATCH] sparc64: Use arch_validate_flags() to validate ADI flag

2020-11-24 Thread Khalid Aziz

On 11/20/20 11:01 AM, Catalin Marinas wrote:

Hi Khalid,

On Fri, Oct 23, 2020 at 11:56:11AM -0600, Khalid Aziz wrote:

diff --git a/arch/sparc/include/asm/mman.h b/arch/sparc/include/asm/mman.h
index f94532f25db1..274217e7ed70 100644
--- a/arch/sparc/include/asm/mman.h
+++ b/arch/sparc/include/asm/mman.h
@@ -57,35 +57,39 @@ static inline int sparc_validate_prot(unsigned long prot, 
unsigned long addr)
  {
if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM | PROT_ADI))
return 0;
-   if (prot & PROT_ADI) {
-   if (!adi_capable())
-   return 0;
+   return 1;
+}


We kept the equivalent of !adi_capable() check in the arm64
arch_validate_prot() and left arch_validate_flags() more relaxed. I.e.
you can pass PROT_MTE to mmap() even if the hardware doesn't support
MTE. This is in line with the pre-MTE ABI where unknown mmap() flags
would be ignored while mprotect() would reject them. This discrepancy
isn't nice but we decided to preserve the pre-MTE mmap ABI behaviour.
Anyway, it's up to you if you want to change the sparc behaviour, I
don't think it matters in practice.


Hi Catalin,

Thanks for taking a look at this patch. I felt that mmap() silently 
accepting PROT_ADI without really enabling the protection can be 
dangerous, since it leaves the end user under the false impression that 
they have protected the memory. I chose to treat PROT_ADI as a known 
flag and provide definite feedback to the user on whether it can be 
honored or not.
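
To illustrate the resulting behaviour from userspace, a minimal sketch 
(the PROT_ADI value here is the sparc64 one and is assumed for 
illustration):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef PROT_ADI
#define PROT_ADI	0x10	/* sparc64 value, assumed for illustration */
#endif

int main(void)
{
	/*
	 * With the checks moved to arch_validate_flags(), mmap() on a
	 * machine without ADI support fails outright instead of silently
	 * ignoring PROT_ADI.
	 */
	void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_ADI,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		printf("mmap(PROT_ADI) rejected: %s\n", strerror(errno));
	else
		printf("mmap(PROT_ADI) succeeded at %p\n", p);
	return 0;
}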




I think with this patch, arch_validate_prot() no longer needs the 'addr'
argument. Maybe you can submit an additional patch to remove them (not
urgent, the compiler should get rid of them).


Yes, 'addr' is an unused argument now. On the other hand, I suspect that 
with additional protections being implemented in hardware for memory 
regions, sooner or later someone will see a need to validate protection 
bits in the context of the memory region they are being applied to. The 
address alone is not going to be enough information; we are most likely 
going to need the size of the memory region being operated upon as well. 
That means this code is likely to need a patch to add a size argument. So 
it is reasonable to remove 'addr' for now and reintroduce a more complete 
version, with the size as well, in a later patch when the need comes up.




  
-		if (addr) {

-   struct vm_area_struct *vma;
+#define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags)
+/* arch_validate_flags() - Ensure combination of flags is valid for a
+ * VMA.
+ */
+static inline bool arch_validate_flags(unsigned long vm_flags)
+{
+   /* If ADI is being enabled on this VMA, check for ADI
+* capability on the platform and ensure VMA is suitable
+* for ADI
+*/
+   if (vm_flags & VM_SPARC_ADI) {
+   if (!adi_capable())
+   return false;
  
-			vma = find_vma(current->mm, addr);

-   if (vma) {
-   /* ADI can not be enabled on PFN
-* mapped pages
-*/
-   if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
-   return 0;
+   /* ADI can not be enabled on PFN mapped pages */
+   if (vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+   return false;
  
-/* Mergeable pages can become unmergeable

-* if ADI is enabled on them even if they
-* have identical data on them. This can be
-* because ADI enabled pages with identical
-* data may still not have identical ADI
-* tags on them. Disallow ADI on mergeable
-* pages.
-*/
-   if (vma->vm_flags & VM_MERGEABLE)
-   return 0;
-   }
-   }
+   /* Mergeable pages can become unmergeable
+* if ADI is enabled on them even if they
+* have identical data on them. This can be
+* because ADI enabled pages with identical
+* data may still not have identical ADI
+* tags on them. Disallow ADI on mergeable
+* pages.
+*/
+   if (vm_flags & VM_MERGEABLE)
+   return false;


Ah, you added a check to the madvise(MADV_MERGEABLE) path to ignore the
flag if VM_SPARC_ADI. On arm64 we intercept memcmp_pages() but we have a
PG_arch_2 flag to mark a page as containing tags. Either way should
work.

FWIW, if you are happy with the mmap() rejecting PROT_ADI on
!adi_capable() hardware:

Reviewed-by: Catalin Marinas 



Thanks!

--
Khalid


[PATCH] sparc64: Use arch_validate_flags() to validate ADI flag

2020-10-23 Thread Khalid Aziz
When userspace calls mprotect() to enable ADI on an address range,
do_mprotect_pkey() calls arch_validate_prot() to validate new
protection flags. arch_validate_prot() for sparc looks at the first
VMA associated with the address range to verify if ADI can indeed be
enabled on this address range. This has two issues: (1) the address
range might cover multiple VMAs while arch_validate_prot() looks at
only the first VMA, and (2) arch_validate_prot() peeks at the VMA
without holding the mmap lock, which can result in a race condition.

arch_validate_flags() from commit c462ac288f2c ("mm: Introduce
arch_validate_flags()") allows for VMA flags to be validated for all
VMAs that cover the address range given by the user, while holding the
mmap lock. This patch updates the sparc code to move the VMA check from
arch_validate_prot() to arch_validate_flags() to fix the above two
issues.

Suggested-by: Jann Horn 
Suggested-by: Christoph Hellwig 
Suggested-by: Catalin Marinas 
Signed-off-by: Khalid Aziz 
---
 arch/sparc/include/asm/mman.h | 54 +++
 1 file changed, 29 insertions(+), 25 deletions(-)

diff --git a/arch/sparc/include/asm/mman.h b/arch/sparc/include/asm/mman.h
index f94532f25db1..274217e7ed70 100644
--- a/arch/sparc/include/asm/mman.h
+++ b/arch/sparc/include/asm/mman.h
@@ -57,35 +57,39 @@ static inline int sparc_validate_prot(unsigned long prot, 
unsigned long addr)
 {
if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM | PROT_ADI))
return 0;
-   if (prot & PROT_ADI) {
-   if (!adi_capable())
-   return 0;
+   return 1;
+}
 
-   if (addr) {
-   struct vm_area_struct *vma;
+#define arch_validate_flags(vm_flags) arch_validate_flags(vm_flags)
+/* arch_validate_flags() - Ensure combination of flags is valid for a
+ * VMA.
+ */
+static inline bool arch_validate_flags(unsigned long vm_flags)
+{
+   /* If ADI is being enabled on this VMA, check for ADI
+* capability on the platform and ensure VMA is suitable
+* for ADI
+*/
+   if (vm_flags & VM_SPARC_ADI) {
+   if (!adi_capable())
+   return false;
 
-   vma = find_vma(current->mm, addr);
-   if (vma) {
-   /* ADI can not be enabled on PFN
-* mapped pages
-*/
-   if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
-   return 0;
+   /* ADI can not be enabled on PFN mapped pages */
+   if (vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
+   return false;
 
-   /* Mergeable pages can become unmergeable
-* if ADI is enabled on them even if they
-* have identical data on them. This can be
-* because ADI enabled pages with identical
-* data may still not have identical ADI
-* tags on them. Disallow ADI on mergeable
-* pages.
-*/
-   if (vma->vm_flags & VM_MERGEABLE)
-   return 0;
-   }
-   }
+   /* Mergeable pages can become unmergeable
+* if ADI is enabled on them even if they
+* have identical data on them. This can be
+* because ADI enabled pages with identical
+* data may still not have identical ADI
+* tags on them. Disallow ADI on mergeable
+* pages.
+*/
+   if (vm_flags & VM_MERGEABLE)
+   return false;
}
-   return 1;
+   return true;
 }
 #endif /* CONFIG_SPARC64 */
 
-- 
2.25.1



Re: [PATCH 1/2] mm/mprotect: Call arch_validate_prot under mmap_lock and with length

2020-10-15 Thread Khalid Aziz
On 10/15/20 3:05 AM, Catalin Marinas wrote:
> On Wed, Oct 14, 2020 at 03:21:16PM -0600, Khalid Aziz wrote:
>> What FreeBSD does seems like a reasonable thing to do. Anyway, the first
>> thing to do is to update sparc to use arch_validate_flags() and update
>> sparc_validate_prot() to not peek into the vma without the lock.
> 
> If you go for arch_validate_flags(), I think sparc_validate_prot()
> doesn't need the vma at all.

Yes, the plan is to move the vma flag check from sparc_validate_prot() to
arch_validate_flags().

> 
> BTW, on the ADI topic, I think you have a race in do_swap_page() since
> set_pte_at() is called before arch_do_swap_page(). So a thread in the
> same process would see the new mapping but the tags have not been
> updated yet. Unless sparc relies on the new user pte to be set, I think
> you can just swap the two calls.
> 

Thanks for pointing that out. I will take a look at it.
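
For reference, the reordering suggested above would look roughly like
this in do_swap_page() (a sketch of the idea only, untested):

	/*
	 * Sketch only: let the architecture restore its per-page state
	 * (e.g. ADI tags) before the new PTE becomes visible to other
	 * threads of the process.
	 */
	arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
	set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);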

--
Khalid




Re: [PATCH 1/2] mm/mprotect: Call arch_validate_prot under mmap_lock and with length

2020-10-14 Thread Khalid Aziz
On 10/13/20 3:16 AM, Catalin Marinas wrote:
> On Mon, Oct 12, 2020 at 01:14:50PM -0600, Khalid Aziz wrote:
>> On 10/12/20 11:22 AM, Catalin Marinas wrote:
>>> On Mon, Oct 12, 2020 at 11:03:33AM -0600, Khalid Aziz wrote:
>>>> On 10/10/20 5:09 AM, Catalin Marinas wrote:
>>>>> On Wed, Oct 07, 2020 at 02:14:09PM -0600, Khalid Aziz wrote:
>>>>>> On 10/7/20 1:39 AM, Jann Horn wrote:
>>>>>>> arch_validate_prot() is a hook that can validate whether a given set of
>>>>>>> protection flags is valid in an mprotect() operation. It is given the 
>>>>>>> set
>>>>>>> of protection flags and the address being modified.
>>>>>>>
>>>>>>> However, the address being modified can currently not actually be used 
>>>>>>> in
>>>>>>> a meaningful way because:
>>>>>>>
>>>>>>> 1. Only the address is given, but not the length, and the operation can
>>>>>>>span multiple VMAs. Therefore, the callee can't actually tell which
>>>>>>>virtual address range, or which VMAs, are being targeted.
>>>>>>> 2. The mmap_lock is not held, meaning that if the callee were to check
>>>>>>>the VMA at @addr, that VMA would be unrelated to the one the
>>>>>>>operation is performed on.
>>>>>>>
>>>>>>> Currently, custom arch_validate_prot() handlers are defined by
>>>>>>> arm64, powerpc and sparc.
>>>>>>> arm64 and powerpc don't care about the address range, they just check 
>>>>>>> the
>>>>>>> flags against CPU support masks.
>>>>>>> sparc's arch_validate_prot() attempts to look at the VMA, but doesn't 
>>>>>>> take
>>>>>>> the mmap_lock.
>>>>>>>
>>>>>>> Change the function signature to also take a length, and move the
>>>>>>> arch_validate_prot() call in mm/mprotect.c down into the locked region.
>>>>> [...]
>>>>>> As Chris pointed out, the call to arch_validate_prot() from do_mmap2()
>>>>>> is made without holding mmap_lock. Lock is not acquired until
>>>>>> vm_mmap_pgoff(). This variance is uncomfortable but I am more
>>>>>> uncomfortable forcing all implementations of validate_prot to require
>>>>>> mmap_lock be held when non-sparc implementations do not have such need
>>>>>> yet. Since do_mmap2() is in powerpc specific code, for now this patch
>>>>>> solves a current problem.
>>>>>
>>>>> I still think sparc should avoid walking the vmas in
>>>>> arch_validate_prot(). The core code already has the vmas, though not
>>>>> when calling arch_validate_prot(). That's one of the reasons I added
>>>>> arch_validate_flags() with the MTE patches. For sparc, this could be
>>>>> (untested, just copied the arch_validate_prot() code):
>>>>
>>>> I am little uncomfortable with the idea of validating protection bits
>>>> inside the VMA walk loop in do_mprotect_pkey(). When ADI is being
>>>> enabled across multiple VMAs and arch_validate_flags() fails on a VMA
>>>> later, do_mprotect_pkey() will bail out with error leaving ADI enabled
>>>> on earlier VMAs. This will apply to protection bits other than ADI as
>>>> well of course. This becomes a partial failure of mprotect() call. I
>>>> think it should be all or nothing with mprotect() - when one calls
>>>> mprotect() from userspace, either the entire address range passed in
>>>> gets its protection bits updated or none of it does. That requires
>>>> validating protection bits upfront or undoing what earlier iterations of
>>>> VMA walk loop might have done.
>>>
>>> I thought the same initially but mprotect() already does this with the
>>> VM_MAY* flag checking. If you ask it for an mprotect() that crosses
>>> multiple vmas and one of them fails, it doesn't roll back the changes to
>>> the prior ones. I considered that a similar approach is fine for MTE
>>> (it's most likely a user error).
>>
>> You are right about the current behavior with VM_MAY* flags, but that is
>> not the right behavior. Adding more cases to this just perpetuates
>> incorrect behavior. It is not easy to roll back changes after VMAs have
>> potentially been split/

Re: [PATCH 1/2] mm/mprotect: Call arch_validate_prot under mmap_lock and with length

2020-10-12 Thread Khalid Aziz
On 10/12/20 11:22 AM, Catalin Marinas wrote:
> On Mon, Oct 12, 2020 at 11:03:33AM -0600, Khalid Aziz wrote:
>> On 10/10/20 5:09 AM, Catalin Marinas wrote:
>>> On Wed, Oct 07, 2020 at 02:14:09PM -0600, Khalid Aziz wrote:
>>>> On 10/7/20 1:39 AM, Jann Horn wrote:
>>>>> arch_validate_prot() is a hook that can validate whether a given set of
>>>>> protection flags is valid in an mprotect() operation. It is given the set
>>>>> of protection flags and the address being modified.
>>>>>
>>>>> However, the address being modified can currently not actually be used in
>>>>> a meaningful way because:
>>>>>
>>>>> 1. Only the address is given, but not the length, and the operation can
>>>>>span multiple VMAs. Therefore, the callee can't actually tell which
>>>>>virtual address range, or which VMAs, are being targeted.
>>>>> 2. The mmap_lock is not held, meaning that if the callee were to check
>>>>>the VMA at @addr, that VMA would be unrelated to the one the
>>>>>operation is performed on.
>>>>>
>>>>> Currently, custom arch_validate_prot() handlers are defined by
>>>>> arm64, powerpc and sparc.
>>>>> arm64 and powerpc don't care about the address range, they just check the
>>>>> flags against CPU support masks.
>>>>> sparc's arch_validate_prot() attempts to look at the VMA, but doesn't take
>>>>> the mmap_lock.
>>>>>
>>>>> Change the function signature to also take a length, and move the
>>>>> arch_validate_prot() call in mm/mprotect.c down into the locked region.
>>> [...]
>>>> As Chris pointed out, the call to arch_validate_prot() from do_mmap2()
>>>> is made without holding mmap_lock. Lock is not acquired until
>>>> vm_mmap_pgoff(). This variance is uncomfortable but I am more
>>>> uncomfortable forcing all implementations of validate_prot to require
>>>> mmap_lock be held when non-sparc implementations do not have such need
>>>> yet. Since do_mmap2() is in powerpc specific code, for now this patch
>>>> solves a current problem.
>>>
>>> I still think sparc should avoid walking the vmas in
>>> arch_validate_prot(). The core code already has the vmas, though not
>>> when calling arch_validate_prot(). That's one of the reasons I added
>>> arch_validate_flags() with the MTE patches. For sparc, this could be
>>> (untested, just copied the arch_validate_prot() code):
>>
>> I am little uncomfortable with the idea of validating protection bits
>> inside the VMA walk loop in do_mprotect_pkey(). When ADI is being
>> enabled across multiple VMAs and arch_validate_flags() fails on a VMA
>> later, do_mprotect_pkey() will bail out with error leaving ADI enabled
>> on earlier VMAs. This will apply to protection bits other than ADI as
>> well of course. This becomes a partial failure of mprotect() call. I
>> think it should be all or nothing with mprotect() - when one calls
>> mprotect() from userspace, either the entire address range passed in
>> gets its protection bits updated or none of it does. That requires
>> validating protection bits upfront or undoing what earlier iterations of
>> VMA walk loop might have done.
> 
> I thought the same initially but mprotect() already does this with the
> VM_MAY* flag checking. If you ask it for an mprotect() that crosses
> multiple vmas and one of them fails, it doesn't roll back the changes to
> the prior ones. I considered that a similar approach is fine for MTE
> (it's most likely a user error).
> 

You are right about the current behavior with VM_MAY* flags, but that is
not the right behavior. Adding more cases to this just perpetuates
incorrect behavior. It is not easy to roll back changes after VMAs have
potentially been split/merged, which is probably why the current code
simply throws in the towel and returns with a partially modified address
space. It is a lot easier to do all the checks upfront and then proceed
or not proceed with modifying VMAs. One approach might be to call
arch_validate_flags() in a loop before modifying VMAs, walking all the
VMAs with a read lock held. The current code also bails out with ENOMEM
if it finds a hole in the address range and leaves any modifications
already made in place. This is another case where a hole could have been
detected earlier.
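
A rough sketch of that upfront validation idea (not existing kernel
code; new_vm_flags_for() is a placeholder for however the caller would
derive the prospective vm_flags):

/*
 * Sketch only: walk every VMA covering [start, end) under the mmap lock
 * and validate the prospective flags before modifying anything, so a
 * failure cannot leave the address space partially modified.
 */
static int validate_range_upfront(struct mm_struct *mm, unsigned long start,
				  unsigned long end, unsigned long newflags)
{
	struct vm_area_struct *vma = find_vma(mm, start);
	unsigned long expect = start;

	for (; vma && vma->vm_start < end; vma = vma->vm_next) {
		if (vma->vm_start > expect)	/* hole in the range */
			return -ENOMEM;
		if (!arch_validate_flags(new_vm_flags_for(vma, newflags)))
			return -EINVAL;
		expect = vma->vm_end;
	}
	/* hole at (or running past) the end of the range */
	return expect < end ? -ENOMEM : 0;
}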

--
Khalid



Re: [PATCH 1/2] mm/mprotect: Call arch_validate_prot under mmap_lock and with length

2020-10-12 Thread Khalid Aziz
On 10/10/20 5:09 AM, Catalin Marinas wrote:
> Hi Khalid,
> 
> On Wed, Oct 07, 2020 at 02:14:09PM -0600, Khalid Aziz wrote:
>> On 10/7/20 1:39 AM, Jann Horn wrote:
>>> arch_validate_prot() is a hook that can validate whether a given set of
>>> protection flags is valid in an mprotect() operation. It is given the set
>>> of protection flags and the address being modified.
>>>
>>> However, the address being modified can currently not actually be used in
>>> a meaningful way because:
>>>
>>> 1. Only the address is given, but not the length, and the operation can
>>>span multiple VMAs. Therefore, the callee can't actually tell which
>>>virtual address range, or which VMAs, are being targeted.
>>> 2. The mmap_lock is not held, meaning that if the callee were to check
>>>the VMA at @addr, that VMA would be unrelated to the one the
>>>operation is performed on.
>>>
>>> Currently, custom arch_validate_prot() handlers are defined by
>>> arm64, powerpc and sparc.
>>> arm64 and powerpc don't care about the address range, they just check the
>>> flags against CPU support masks.
>>> sparc's arch_validate_prot() attempts to look at the VMA, but doesn't take
>>> the mmap_lock.
>>>
>>> Change the function signature to also take a length, and move the
>>> arch_validate_prot() call in mm/mprotect.c down into the locked region.
> [...]
>> As Chris pointed out, the call to arch_validate_prot() from do_mmap2()
>> is made without holding mmap_lock. Lock is not acquired until
>> vm_mmap_pgoff(). This variance is uncomfortable but I am more
>> uncomfortable forcing all implementations of validate_prot to require
>> mmap_lock be held when non-sparc implementations do not have such need
>> yet. Since do_mmap2() is in powerpc specific code, for now this patch
>> solves a current problem.
> 
> I still think sparc should avoid walking the vmas in
> arch_validate_prot(). The core code already has the vmas, though not
> when calling arch_validate_prot(). That's one of the reasons I added
> arch_validate_flags() with the MTE patches. For sparc, this could be
> (untested, just copied the arch_validate_prot() code):

I am little uncomfortable with the idea of validating protection bits
inside the VMA walk loop in do_mprotect_pkey(). When ADI is being
enabled across multiple VMAs and arch_validate_flags() fails on a VMA
later, do_mprotect_pkey() will bail out with error leaving ADI enabled
on earlier VMAs. This will apply to protection bits other than ADI as
well of course. This becomes a partial failure of mprotect() call. I
think it should be all or nothing with mprotect() - when one calls
mprotect() from userspace, either the entire address range passed in
gets its protection bits updated or none of it does. That requires
validating protection bits upfront or undoing what earlier iterations of
VMA walk loop might have done.

--
Khalid

> 
> static inline bool arch_validate_flags(unsigned long vm_flags)
> {
>   if (!(vm_flags & VM_SPARC_ADI))
>   return true;
> 
>   if (!adi_capable())
>   return false;
> 
>   /* ADI can not be enabled on PFN mapped pages */
>   if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
>   return false;
> 
>   /*
>* Mergeable pages can become unmergeable if ADI is enabled on
>* them even if they have identical data on them. This can be
>* because ADI enabled pages with identical data may still not
>* have identical ADI tags on them. Disallow ADI on mergeable
>* pages.
>*/
>   if (vma->vm_flags & VM_MERGEABLE)
>   return false;
> 
>   return true;
> }
> 
>> That leaves open the question of should
>> generic mmap call arch_validate_prot and return EINVAL for invalid
>> combination of protection bits, but that is better addressed in a
>> separate patch.
> 
> The above would cover mmap() as well.
> 
> The current sparc_validate_prot() relies on finding the vma for the
> corresponding address. However, if you call this early in the mmap()
> path, there's no such vma. It is only created later in mmap_region()
> which no longer has the original PROT_* flags (all converted to VM_*
> flags).
> 
> Calling arch_validate_flags() on mmap() has a small side-effect on the
> user ABI: if the CPU is not adi_capable(), PROT_ADI is currently ignored
> on mmap() but rejected by sparc_validate_prot(). Powerpc already does
> this already and I think it should be fine for arm64 (it needs checking
> though as we have another flag, PROT_BTI, hopefully dynamic loaders
> don't pass this flag unconditionally).
> 
> However, as I said above, it doesn't solve the mmap() PROT_ADI checking
> for sparc since there's no vma yet. I'd strongly recommend the
> arch_validate_flags() approach and reverting the "start" parameter added
> to arch_validate_prot() if you go for the flags route.
> 
> Thanks.
> 




Re: [PATCH 2/2] sparc: Check VMA range in sparc_validate_prot()

2020-10-07 Thread Khalid Aziz
On 10/7/20 1:39 AM, Jann Horn wrote:
> sparc_validate_prot() is called from do_mprotect_pkey() as
> arch_validate_prot(); it tries to ensure that an mprotect() call can't
> enable ADI on incompatible VMAs.
> The current implementation only checks that the VMA at the start address
> matches the rules for ADI mappings; instead, check all VMAs that will be
> affected by mprotect().
> 
> (This hook is called before mprotect() makes sure that the specified range
> is actually covered by VMAs, and mprotect() returns specific error codes
> when that's not the case. In order for mprotect() to still generate the
> same error codes for mprotect(, , ...|PROT_ADI), we need
> to *accept* cases where the range is not fully covered by VMAs.)
> 
> Cc: sta...@vger.kernel.org
> Fixes: 74a04967482f ("sparc64: Add support for ADI (Application Data 
> Integrity)")
> Signed-off-by: Jann Horn 
> ---
> compile-tested only, I don't have a Sparc ADI setup - might be nice if some
> Sparc person could test this?
> 
>  arch/sparc/include/asm/mman.h | 50 +--
>  1 file changed, 30 insertions(+), 20 deletions(-)


Looks good to me.

Reviewed-by: Khalid Aziz 


> 
> diff --git a/arch/sparc/include/asm/mman.h b/arch/sparc/include/asm/mman.h
> index e85222c76585..6dced75567c3 100644
> --- a/arch/sparc/include/asm/mman.h
> +++ b/arch/sparc/include/asm/mman.h
> @@ -60,31 +60,41 @@ static inline int sparc_validate_prot(unsigned long prot, 
> unsigned long addr,
>   if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM | PROT_ADI))
>   return 0;
>   if (prot & PROT_ADI) {
> + struct vm_area_struct *vma, *next;
> +
>   if (!adi_capable())
>   return 0;
>  
> - if (addr) {
> - struct vm_area_struct *vma;
> + vma = find_vma(current->mm, addr);
> + /* if @addr is unmapped, let mprotect() deal with it */
> + if (!vma || vma->vm_start > addr)
> + return 1;
> + while (1) {
> + /* ADI can not be enabled on PFN
> +  * mapped pages
> +  */
> + if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> + return 0;
>  
> - vma = find_vma(current->mm, addr);
> - if (vma) {
> - /* ADI can not be enabled on PFN
> -  * mapped pages
> -  */
> - if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> - return 0;
> + /* Mergeable pages can become unmergeable
> +  * if ADI is enabled on them even if they
> +  * have identical data on them. This can be
> +  * because ADI enabled pages with identical
> +  * data may still not have identical ADI
> +  * tags on them. Disallow ADI on mergeable
> +  * pages.
> +  */
> + if (vma->vm_flags & VM_MERGEABLE)
> + return 0;
>  
> - /* Mergeable pages can become unmergeable
> -  * if ADI is enabled on them even if they
> -  * have identical data on them. This can be
> -  * because ADI enabled pages with identical
> -  * data may still not have identical ADI
> -  * tags on them. Disallow ADI on mergeable
> -  * pages.
> -  */
> - if (vma->vm_flags & VM_MERGEABLE)
> - return 0;
> - }
> + /* reached the end of the range without errors? */
> + if (addr+len <= vma->vm_end)
> + return 1;
> + next = vma->vm_next;
> + /* if a VMA hole follows, let mprotect() deal with it */
> + if (!next || next->vm_start != vma->vm_end)
> + return 1;
> + vma = next;
>   }
>   }
>   return 1;
> 




Re: [PATCH 1/2] mm/mprotect: Call arch_validate_prot under mmap_lock and with length

2020-10-07 Thread Khalid Aziz
On 10/7/20 1:39 AM, Jann Horn wrote:
> arch_validate_prot() is a hook that can validate whether a given set of
> protection flags is valid in an mprotect() operation. It is given the set
> of protection flags and the address being modified.
> 
> However, the address being modified can currently not actually be used in
> a meaningful way because:
> 
> 1. Only the address is given, but not the length, and the operation can
>span multiple VMAs. Therefore, the callee can't actually tell which
>virtual address range, or which VMAs, are being targeted.
> 2. The mmap_lock is not held, meaning that if the callee were to check
>the VMA at @addr, that VMA would be unrelated to the one the
>operation is performed on.
> 
> Currently, custom arch_validate_prot() handlers are defined by
> arm64, powerpc and sparc.
> arm64 and powerpc don't care about the address range, they just check the
> flags against CPU support masks.
> sparc's arch_validate_prot() attempts to look at the VMA, but doesn't take
> the mmap_lock.
> 
> Change the function signature to also take a length, and move the
> arch_validate_prot() call in mm/mprotect.c down into the locked region.
> 
> Cc: sta...@vger.kernel.org
> Fixes: 9035cf9a97e4 ("mm: Add address parameter to arch_validate_prot()")
> Suggested-by: Khalid Aziz 
> Suggested-by: Christoph Hellwig 
> Signed-off-by: Jann Horn 
> ---
>  arch/arm64/include/asm/mman.h   | 4 ++--
>  arch/powerpc/include/asm/mman.h | 3 ++-
>  arch/powerpc/kernel/syscalls.c  | 2 +-
>  arch/sparc/include/asm/mman.h   | 6 --
>  include/linux/mman.h| 3 ++-
>  mm/mprotect.c   | 6 --
>  6 files changed, 15 insertions(+), 9 deletions(-)


This looks good to me.

As Chris pointed out, the call to arch_validate_prot() from do_mmap2()
is made without holding mmap_lock. Lock is not acquired until
vm_mmap_pgoff(). This variance is uncomfortable but I am more
uncomfortable forcing all implementations of validate_prot to require
mmap_lock be held when non-sparc implementations do not have such need
yet. Since do_mmap2() is in powerpc specific code, for now this patch
solves a current problem. That leaves open the question of should
generic mmap call arch_validate_prot and return EINVAL for invalid
combination of protection bits, but that is better addressed in a
separate patch.

Reviewed-by: Khalid Aziz 

> 
> diff --git a/arch/arm64/include/asm/mman.h b/arch/arm64/include/asm/mman.h
> index 081ec8de9ea6..0876a87986dc 100644
> --- a/arch/arm64/include/asm/mman.h
> +++ b/arch/arm64/include/asm/mman.h
> @@ -23,7 +23,7 @@ static inline pgprot_t arch_vm_get_page_prot(unsigned long 
> vm_flags)
>  #define arch_vm_get_page_prot(vm_flags) arch_vm_get_page_prot(vm_flags)
>  
>  static inline bool arch_validate_prot(unsigned long prot,
> - unsigned long addr __always_unused)
> + unsigned long addr __always_unused, unsigned long len __always_unused)
>  {
>   unsigned long supported = PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM;
>  
> @@ -32,6 +32,6 @@ static inline bool arch_validate_prot(unsigned long prot,
>  
>   return (prot & ~supported) == 0;
>  }
> -#define arch_validate_prot(prot, addr) arch_validate_prot(prot, addr)
> +#define arch_validate_prot(prot, addr, len) arch_validate_prot(prot, addr, 
> len)
>  
>  #endif /* ! __ASM_MMAN_H__ */
> diff --git a/arch/powerpc/include/asm/mman.h b/arch/powerpc/include/asm/mman.h
> index 7cb6d18f5cd6..65dd9b594985 100644
> --- a/arch/powerpc/include/asm/mman.h
> +++ b/arch/powerpc/include/asm/mman.h
> @@ -36,7 +36,8 @@ static inline pgprot_t arch_vm_get_page_prot(unsigned long 
> vm_flags)
>  }
>  #define arch_vm_get_page_prot(vm_flags) arch_vm_get_page_prot(vm_flags)
>  
> -static inline bool arch_validate_prot(unsigned long prot, unsigned long addr)
> +static inline bool arch_validate_prot(unsigned long prot, unsigned long addr,
> +   unsigned long len)
>  {
>   if (prot & ~(PROT_READ | PROT_WRITE | PROT_EXEC | PROT_SEM | PROT_SAO))
>   return false;
> diff --git a/arch/powerpc/kernel/syscalls.c b/arch/powerpc/kernel/syscalls.c
> index 078608ec2e92..b1fabb97d138 100644
> --- a/arch/powerpc/kernel/syscalls.c
> +++ b/arch/powerpc/kernel/syscalls.c
> @@ -43,7 +43,7 @@ static inline long do_mmap2(unsigned long addr, size_t len,
>  {
>   long ret = -EINVAL;
>  
> - if (!arch_validate_prot(prot, addr))
> + if (!arch_validate_prot(prot, addr, len))
>   goto out;
>  
>   if (shift) {
> diff --git a/arch/sparc/include/asm/mman.h b/arch/sparc/include/asm/mman.h
> index f94532f25db1..e85222c76585 100644
> --- a/arch/sparc/include/asm/mman.h
> +++ b/arch/sparc/

Re: SPARC version of arch_validate_prot() looks broken (UAF read)

2020-09-29 Thread Khalid Aziz
On 9/28/20 6:14 AM, Jann Horn wrote:
> From what I can tell from looking at the code:
> 
> SPARC's arch_validate_prot() looks up the VMA and peeks at it; that's
> not permitted though. do_mprotect_pkey() calls arch_validate_prot()
> before taking the mmap lock, so we can hit use-after-free reads if
> someone concurrently deletes a VMA we're looking at.

That makes sense. It will be a good idea to encapsulate vma access
inside sparc_validate_prot() between mmap_read_lock() and
mmap_read_unlock().
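
Roughly along these lines, as a sketch of the locking shape only (not a
complete fix):

	if (prot & PROT_ADI) {
		int ret = 1;

		if (!adi_capable())
			return 0;
		if (addr) {
			struct vm_area_struct *vma;

			/* Take the mmap lock around the VMA peek. */
			mmap_read_lock(current->mm);
			vma = find_vma(current->mm, addr);
			if (vma && (vma->vm_flags &
				    (VM_PFNMAP | VM_MIXEDMAP | VM_MERGEABLE)))
				ret = 0;
			mmap_read_unlock(current->mm);
		}
		return ret;
	}
	return 1;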

> 
> Additionally, arch_validate_prot() currently only accepts the start
> address as a parameter, but the SPARC code probably should be checking
> the entire given range, which might consist of multiple VMAs?
> 
> I'm not sure what the best fix is here; it kinda seems like what SPARC
> really wants is a separate hook that is called from inside the loop in
> do_mprotect_pkey() that iterates over the VMAs? So maybe commit
> 9035cf9a97e4 ("mm: Add address parameter to arch_validate_prot()")
> should be reverted, and a separate hook should be created?
> 
> (Luckily the ordering of the vmacache operations works out such that
> AFAICS, despite calling find_vma() without holding the mmap_sem, we
> can never end up establishing a vmacache entry with a dangling pointer
> that might be considered valid on a subsequent call. So this should be
> limited to a rather boring UAF data read, and not be exploitable for a
> UAF write or UAF function pointer read.)
> 

I think arch_validate_prot() is still the right hook to validate the
protection bits. sparc_validate_prot() can iterate over the VMAs with
the read lock held. This will, of course, require the range to be passed
to arch_validate_prot() as well.

Thanks,
Khalid



Re: [RFC PATCH 1/1] usb: ehci: Remove erroneous return of EPROTO upon detection of stall

2020-09-04 Thread Khalid Aziz
On 9/4/20 9:19 AM, Greg KH wrote:
> On Mon, Aug 31, 2020 at 10:08:43AM -0600, Khalid Aziz wrote:
>> With the USB 3.0/3.1 controller on MSI B450-A Pro Max motherboard,
>> full speed and low speed devices see constant resets making
>> keyboards and mouse unreliable and unusable. These resets are caused
>> by detection of stall in qtd_copy_status() and returning EPROTO
>> which in turn results in TT buffers in hub being cleared. Hubs do
>> not seem to repsond well to this and seem to hang which causes
>> further USB transactions to time out. A reset finally clears the
>> issue until we repeat the cycle all over again.
>>
>> Signed-off-by: Khalid Aziz 
>> Cc: Khalid Aziz 
>> ---
>>  drivers/usb/host/ehci-q.c | 4 
>>  1 file changed, 4 deletions(-)
>>
>> diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
>> index 8a5c9b3ebe1e..7d4b2bc4633c 100644
>> --- a/drivers/usb/host/ehci-q.c
>> +++ b/drivers/usb/host/ehci-q.c
>> @@ -214,10 +214,6 @@ static int qtd_copy_status (
>>   * When MMF is active and PID Code is IN, queue is halted.
>>   * EHCI Specification, Table 4-13.
>>   */
>> -} else if ((token & QTD_STS_MMF) &&
>> -(QTD_PID(token) == PID_CODE_IN)) {
>> -status = -EPROTO;
>> -/* CERR nonzero + halt --> stall */
>>  } else if (QTD_CERR(token)) {
>>  status = -EPIPE;
>>  
> 
> Removing this check is not a good idea, any chance you can come up with
> some other test instead for this broken hardware?
> 
> What about getting a USB hub that works?  :)
> 

I agree that removing that check is not the right way to fix this
problem. It just so happens that the USB resets disappear when that
check is removed. It is more likely that the check needs to be refined
further to differentiate between a hub that was unplugged (the reason
for the original commit) and a hub that is seeing split transaction
errors on full/low speed devices.

I am not sure the hardware is broken. I am currently using one of the
four hubs I have in a working configuration. The hub I was using before
the motherboard replacement on my desktop stopped working with the new
motherboard. Suspecting a hardware defect on the motherboard, I bought a
PCI plug-in USB 2.0 card, but that showed the same failure. So I got two
more USB hubs just in case my existing hubs were broken. In all I tried
seven combinations of hardware and five of them failed the same way.
Every one of these hubs, keyboards, the mouse and the tablet works with
no problems on my laptop. All high speed and super speed devices
(various storage devices I have) work flawlessly on my desktop plugged
into any port or any hub. My desktop is a Ryzen 5 3600X in an MSI B450-A
Pro Max motherboard. The previous motherboard in my desktop was an
ASRock Z77 Extreme motherboard with an Intel Core i7-3770. My laptop is
an Intel i5-7300U in a Lenovo ThinkPad. Somehow the hubs are getting set
up differently for split transactions to full/low speed devices on the
two machines.

Since I now have a working hardware configuration, my next step is to
keep using my desktop with that configuration and then go deeper into
USB debugging to find out what is wrong with the non-working
configurations.

Thanks,
Khalid




Re: [RFC RESEND PATCH 0/1] USB EHCI: repeated resets on full and low speed devices

2020-09-01 Thread Khalid Aziz
On 9/1/20 1:51 PM, Alan Stern wrote:
> On Tue, Sep 01, 2020 at 11:00:16AM -0600, Khalid Aziz wrote:
>> On 9/1/20 10:36 AM, Alan Stern wrote:
>>> On Tue, Sep 01, 2020 at 09:15:46AM -0700, Khalid Aziz wrote:
>>>> On 8/31/20 8:31 PM, Alan Stern wrote:
>>>>> Can you collect a usbmon trace showing an example of this problem?
>>>>>
>>>>
>>>> I have attached usbmon traces for when USB hub with keyboards and mouse
>>>> is plugged into USB 2.0 port and when it is plugged into the NEC USB 3.0
>>>> port.
>>>
>>> The usbmon traces show lots of errors, but no Clear-TT events.  The 
>>> large number of errors suggests that you've got a hardware problem; 
>>> either a bad hub or bad USB connections.
>>
>> That is what I thought initially which is why I got additional hubs and
>> a USB 2.0 PCI card to test. I am seeing errors across 3 USB controllers,
>> 4 USB hubs and 4 slow/full speed devices. All of the hubs and slow/full
>> devices work with zero errors on my laptop. My keyboard/mouse devices
>> and 2 of my USB hubs predate motherboard update and they all worked
>> flawlessly before the motherboard upgrade. Some combinations of these
>> also work with no errors on my desktop with the new motherboard that I had
>> listed in my original email:
> 
> It's a very puzzling situation.
> 
> One thing which probably would work well, surprisingly, would be to buy 
> an old USB-1.1 hub and plug it into the PCI card.  That combination is 
> likely to be similar to what you see when plugging the devices directly 
> into the PCI card.  It might even work okay with the USB-3 controllers.
> 
>> 2. USB 2.0 controller - WORKS
>> 5. USB 3.0/3.1 controller -> Bus powered USB 2.0 hub - WORKS
>>
>> I am not seeing a common failure here that would point to any specific
>> hardware being bad. Besides, that one code change (which I still can't
>> say is the right code change) in ehci-q.c makes USB 2.0 controller work
>> reliably with all my devices.
> 
> The USB and EHCI designs are flawed in that under the circumstances 
> you're seeing, they don't have any way to tell the difference between a 
> STALL and a host timing error.  The current code treats these situations 
> as timing/transmission errors (resulting in device resets); your change 
> causes them to be treated as STALLs.  However, there are known, common 
> situations in which those same symptoms really are caused by 
> transmission errors, so we don't want to start treating them as STALLs.
> 
> Besides, I suspect that your code change does _not_ make the USB-2 
> controller work reliably with your devices.  You should collect a usbmon 
> trace under those conditions; I predict it will be full of STALLs.  And 
> furthermore, I believe these STALLs will not show up in a usbmon trace 
> made with the devices plugged directly into the PCI card.  If I'm right 
> about these things, the errors are still present even with your patch; 
> all it does is hide them.
> 
> Short of a USB bus analyzer, however, there's no way to tell what's 
> really going on.

I have managed to find a hardware combination that seems to work, so for
now at least my machine is usable. I will figure out how to interpret
usbmon output and run more experiments. There seems to be a real problem
in the driver somewhere and it should be solved.

Thanks,
Khalid




Re: [RFC RESEND PATCH 0/1] USB EHCI: repeated resets on full and low speed devices

2020-09-01 Thread Khalid Aziz
On 9/1/20 10:36 AM, Alan Stern wrote:
> On Tue, Sep 01, 2020 at 09:15:46AM -0700, Khalid Aziz wrote:
>> On 8/31/20 8:31 PM, Alan Stern wrote:
>>> Can you collect a usbmon trace showing an example of this problem?
>>>
>>
>> I have attached usbmon traces for when USB hub with keyboards and mouse
>> is plugged into USB 2.0 port and when it is plugged into the NEC USB 3.0
>> port.
> 
> The usbmon traces show lots of errors, but no Clear-TT events.  The 
> large number of errors suggests that you've got a hardware problem; 
> either a bad hub or bad USB connections.

That is what I thought initially which is why I got additional hubs and
a USB 2.0 PCI card to test. I am seeing errors across 3 USB controllers,
4 USB hubs and 4 slow/full speed devices. All of the hubs and slow/full
devices work with zero errors on my laptop. My keyboard/mouse devices
and 2 of my USB hubs predate motherboard update and they all worked
flawlessly before the motherboard upgrade. Some combinations of these
also work with no errors on my desktop with the new motherboard that I had
listed in my original email:

2. USB 2.0 controller - WORKS
5. USB 3.0/3.1 controller -> Bus powered USB 2.0 hub - WORKS

I am not seeing a common failure here that would point to any specific
hardware being bad. Besides, that one code change (which I still can't
say is the right code change) in ehci-q.c makes USB 2.0 controller work
reliably with all my devices.

--
Khalid



Re: [RFC RESEND PATCH 0/1] USB EHCI: repeated resets on full and low speed devices

2020-09-01 Thread Khalid Aziz
On 8/31/20 8:31 PM, Alan Stern wrote:
> On Mon, Aug 31, 2020 at 10:23:30AM -0600, Khalid Aziz wrote:
>> [Resending since I screwed up linux-usb mailing list address in
>> cut-n-paste in original email]
>>
>>
>> I recently replaced the motherboard on my desktop with an MSI B450-A
>> Pro Max motherboard. Since then my keyboards, mouse and tablet have
>> become very unreliable. I see messages like this over and over in
>> dmesg:
>>
>> Aug 23 00:01:49 rhapsody kernel: [198769.314732] usb 1-2.4: reset full-speed USB device number 27 using ehci-pci
>> Aug 23 00:01:49 rhapsody kernel: [198769.562234] usb 1-2.1: reset full-speed USB device number 28 using ehci-pci
>> Aug 23 00:01:52 rhapsody kernel: [198772.570704] usb 1-2.1: reset full-speed USB device number 28 using ehci-pci
>> Aug 23 00:02:02 rhapsody kernel: [198782.526669] usb 1-2.4: reset full-speed USB device number 27 using ehci-pci
>> Aug 23 00:02:03 rhapsody kernel: [198782.714660] usb 1-2.1: reset full-speed USB device number 28 using ehci-pci
>> Aug 23 00:02:04 rhapsody kernel: [198784.210171] usb 1-2.3: reset low-speed USB device number 26 using ehci-pci
>> Aug 23 00:02:06 rhapsody kernel: [198786.110181] usb 1-2.4: reset full-speed USB device number 27 using ehci-pci
>> Aug 23 00:02:08 rhapsody kernel: [198787.726158] usb 1-2.4: reset full-speed USB device number 27 using ehci-pci
>> Aug 23 00:02:10 rhapsody kernel: [198790.126628] usb 1-2.1: reset full-speed USB device number 28 using ehci-pci
>> Aug 23 00:02:10 rhapsody kernel: [198790.314141] usb 1-2.4: reset full-speed USB device number 27 using ehci-pci
>> Aug 23 00:02:12 rhapsody kernel: [198792.518765] usb 1-2.4: reset full-speed USB device number 27 using ehci-pci
>>
>> The devices I am using are:
>>
>> - Logitech K360 wireless keyboard
>> - Wired Lenovo USB keyboard
>> - Wired Lenovo USB mouse
>> - Wired Wacom Intuos tablet
>>
>> After a reset, the wireless keyboard simply stops working. Rest of
>> the devices keep seeing intermittent failure.
>>
>> I tried various combinations of hubs and USB controllers to see what
>> works. MSI B450-A motherboard has USB 3.0 and USB 3.1 controllers. I
>> added a USB 2.0 PCI card as well for this test:
>>
>> 03:00.0 USB controller: Advanced Micro Devices, Inc. [AMD] 400 Series 
>> Chipset USB 3.1 XHCI Controller (rev 01)
>> 29:01.0 USB controller: NEC Corporation OHCI USB Controller (rev 43)
>> 29:01.1 USB controller: NEC Corporation OHCI USB Controller (rev 43)
>> 29:01.2 USB controller: NEC Corporation uPD72010x USB 2.0 Controller (rev 04)
>> 2c:00.3 USB controller: Advanced Micro Devices, Inc. [AMD] Matisse USB 3.0 
>> Host Controller
>>
>> I have a bus powered USB 3.0 hub, a bus powered USB 2.0 hub and a
>> self powered USB 2.0 hub built into my monitor.
>>
>> I have connected my devices directly into the ports on motherboard
>> and PCI card as well as into external hub. Here are the results I
>> saw when devices were plugged into various combinations of ports:
>>
>> 1. USB 3.0/3.1 controller - does NOT work
>> 2. USB 2.0 controller - WORKS
>> 3. USB 3.0/3.1 controller -> Self powered USB 2.0 hub in monitor - does
>>NOT work
>> 4. USB 3.0/3.1 controller -> bus powered USB 3.0 hub - does NOT work
>> 5. USB 3.0/3.1 controller -> Bus powered USB 2.0 hub - WORKS
>> 7. USB 2.0 controller -> Bus powered USB 3.0 hub - does NOT work
>> 8. USB 2.0 controller -> Bus powered 2.0 hub - Does not work
> 
> The error messages in your log extract all refer to ehci-pci, which is 
> the driver for a USB-2 controller.  They are completely unrelated to any 
> problems you may be having with USB-3 controllers.

I just happened to cut and paste the messages from when I was testing
with the USB 2.0 controller. Here are the messages when I ran the test
with USB 3.0 controller:

Aug 13 14:25:48 rhapsody kernel: [78779.868354] usb 1-9.4: reset full-speed USB device number 38 using xhci_hcd
Aug 13 14:26:18 rhapsody kernel: [78809.939457] usb 1-9.4: reset full-speed USB device number 38 using xhci_hcd
Aug 13 14:26:39 rhapsody kernel: [78830.899982] usb 1-9.4: reset full-speed USB device number 38 using xhci_hcd
Aug 13 14:26:39 rhapsody kernel: [78831.379883] usb 1-9.2: reset low-speed USB device number 36 using xhci_hcd
Aug 13 14:26:40 rhapsody kernel: [78832.043900] usb 1-9.3: reset low-speed USB device number 37 using xhci_hcd
Aug 13 14:26:47 rhapsody kernel: [78839.520211] usb 1-9.4: reset full-speed USB device number 38 using xhci_hcd
Au

[RFC PATCH 1/1] usb: ehci: Remove erroneous return of EPROTO upon detection of stall

2020-08-31 Thread Khalid Aziz
With the USB 3.0/3.1 controller on MSI B450-A Pro Max motherboard,
full speed and low speed devices see constant resets making
keyboards and mouse unreliable and unusable. These resets are caused
by detection of stall in qtd_copy_status() and returning EPROTO
which in turn results in TT buffers in hub being cleared. Hubs do
not seem to respond well to this and seem to hang, which causes
further USB transactions to time out. A reset finally clears the
issue until we repeat the cycle all over again.

Signed-off-by: Khalid Aziz 
Cc: Khalid Aziz 
---
 drivers/usb/host/ehci-q.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
index 8a5c9b3ebe1e..7d4b2bc4633c 100644
--- a/drivers/usb/host/ehci-q.c
+++ b/drivers/usb/host/ehci-q.c
@@ -214,10 +214,6 @@ static int qtd_copy_status (
 * When MMF is active and PID Code is IN, queue is halted.
 * EHCI Specification, Table 4-13.
 */
-   } else if ((token & QTD_STS_MMF) &&
-   (QTD_PID(token) == PID_CODE_IN)) {
-   status = -EPROTO;
-   /* CERR nonzero + halt --> stall */
} else if (QTD_CERR(token)) {
status = -EPIPE;
 
-- 
2.25.1



[RFC PATCH 0/1] USB EHCI: repeated resets on full and low speed devices

2020-08-31 Thread Khalid Aziz
 * buffer in this case.  Strictly speaking this
 484  * is a violation of the spec.
 485  */
 486 if (last_status != -EPIPE)
 487 ehci_clear_tt_buffer(ehci, qh, urb,
 488 token);
 489 }

It seems like clearing the TT buffers in this case results in the hub
hanging. A USB reset gets it going again until we repeat the cycle all
over again. The comment in this code says "The TT's in some hubs
malfunction when they receive this request following a STALL (they
stop sending isochronous packets)". That may be what is happening.
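
Distilled down, the decision that appears to matter here is roughly the
following (a simplified, self-contained sketch of my reading of the code
above, not the actual ehci-q.c logic):

#include <errno.h>

/* The TT buffer is cleared for any error completion except a STALL
 * (-EPIPE). With the MMF + PID IN case mapped to -EPROTO, a halted
 * full/low speed transfer on this hardware takes the clear-TT path,
 * and that seems to be when the hub stops responding.
 */
static int should_clear_tt(int last_status)
{
	if (last_status == -EINPROGRESS || last_status == -EREMOTEIO)
		return 0;	/* not an error completion */

	return last_status != -EPIPE;	/* -EPROTO ends up here */
}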

Removing the code that returns EPROTO for such a case solves the
problem on my machine (as in the RFC patch) but that probably is not
the right solution. I do not understand USB protocol well enough to
propose a better solution. Does anyone have a better idea?


Khalid Aziz (1):
  usb: ehci: Remove erroneous return of EPROTO upon detection of stall 

 drivers/usb/host/ehci-q.c | 4 ----
 1 file changed, 4 deletions(-)

-- 
2.25.1



Re: [PATCH v6] mm: Proactive compaction

2020-06-09 Thread Khalid Aziz
seTransparentHugePages
> -XX:+AlwaysPreTouch
> 
> The above command allocates 700G of Java heap using hugepages.
> 
> - With vanilla 5.6.0-rc3
> 
> 17.39user 1666.48system 27:37.89elapsed
> 
> - With 5.6.0-rc3 + this patch, with proactiveness=20
> 
> 8.35user 194.58system 3:19.62elapsed
> 
> Elapsed time remains around 3:15, as proactiveness is further
> increased.
> 
> Note that proactive compaction happens throughout the runtime of
> these
> workloads. The situation of one-time compaction, sufficient to supply
> hugepages for following allocation stream, can probably happen for
> more
> extreme proactiveness values, like 80 or 90.
> 
> In the above Java workload, proactiveness is set to 20. The test
> starts
> with a node's score of 80 or higher, depending on the delay between
> the
> fragmentation step and starting the benchmark, which gives more-or-
> less
> time for the initial round of compaction. As the benchmark
> consumes
> hugepages, node's score quickly rises above the high threshold (90)
> and
> proactive compaction starts again, which brings down the score to the
> low threshold level (80).  Repeat.
> 
> bpftrace also confirms proactive compaction running 20+ times during
> the
> runtime of this Java benchmark. kcompactd threads consume 100% of one
> of
> the CPUs while it tries to bring a node's score within thresholds.
> 
> Backoff behavior
> 
> 
> Above workloads produce a memory state which is easy to compact.
> However, if memory is filled with unmovable pages, proactive
> compaction
> should essentially back off. To test this aspect:
> 
> - Created a kernel driver that allocates almost all memory as
> hugepages
>   followed by freeing first 3/4 of each hugepage.
> - Set proactiveness=40
> - Note that proactive_compact_node() is deferred maximum number of
> times
>   with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
>   (=> ~30 seconds between retries).
> 
> [1] https://patchwork.kernel.org/patch/11098289/
> [2] 
> https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/
> [3] https://lwn.net/Articles/817905/
> 
> Signed-off-by: Nitin Gupta 
> Reviewed-by: Vlastimil Babka 
> To: Mel Gorman 
> To: Michal Hocko 
> To: Vlastimil Babka 
> CC: Matthew Wilcox 
> CC: Andrew Morton 
> CC: Mike Kravetz 
> CC: Joonsoo Kim 
> CC: David Rientjes 
> CC: Nitin Gupta 
> CC: linux-kernel 
> CC: linux-mm 
> CC: Linux API 
> 
> ---
> Changelog v6 vs v5:
>  - Fallback to HUGETLB_PAGE_ORDER if HPAGE_PMD_ORDER is not defined,
> and
>some cleanups (Vlastimil)
>  - Cap min threshold to avoid excess compaction load in case user
> sets
>extreme values like 100 for `vm.compaction_proactiveness` sysctl
> (Khalid)
>  - Add some more explanation about the effect of tunable on
> compaction
>behavior in user guide (Khalid)
> 
> Changelog v5 vs v4:
>  - Change tunable from sysfs to sysctl (Vlastimil)
>  - Replace HUGETLB_PAGE_ORDER with HPAGE_PMD_ORDER (Vlastimil)
>  - Minor cleanups (remove redundant initializations, ...)
> 
> Changelog v4 vs v3:
>  - Document various functions.
>  - Added admin-guide for the new tunable `proactiveness`.
>  - Rename proactive_compaction_score to fragmentation_score for
> clarity.
> 
> Changelog v3 vs v2:
>  - Make proactiveness a global tunable and not per-node. Also
> upadated
> the
>patch description to reflect the same (Vlastimil Babka).
>  - Don't start proactive compaction if kswapd is running (Vlastimil
> Babka).
>  - Clarified in the description that compaction runs in parallel with
>the workload, instead of a one-time compaction followed by a
> stream
> of
>hugepage allocations.
> 
> Changelog v2 vs v1:
>  - Introduce per-node and per-zone "proactive compaction score". This
>score is compared against watermarks which are set according to
>user provided proactiveness value.
>  - Separate code-paths for proactive compaction from targeted
> compaction
>i.e. where pgdat->kcompactd_max_order is non-zero.
>  - Renamed hpage_compaction_effort -> proactiveness. In future we may
>use more than extfrag wrt hugepage size to determine proactive
>compaction score.
> ---
>  Documentation/admin-guide/sysctl/vm.rst |  15 ++
>  include/linux/compaction.h  |   2 +
>  kernel/sysctl.c |   9 ++
>  mm/compaction.c | 183
> +++-
>  mm/internal.h   |   1 +
>  mm/vmstat.c |  18 +++
>  6 files changed, 223 insertions(+), 5 deletions(-)


Looks good to me.

Reviewed-by: Khalid Aziz 



Re: [PATCH v5] mm: Proactive compaction

2020-05-28 Thread Khalid Aziz
This looks good to me. I like the idea overall of controlling
aggressiveness of compaction with a single tunable for the whole
system. I wonder how an end user would arrive at a reasonable value
for this based upon their workload. More comments below.

On Mon, 2020-05-18 at 11:14 -0700, Nitin Gupta wrote:
> For some applications, we need to allocate almost all memory as
> hugepages. However, on a running system, higher-order allocations can
> fail if the memory is fragmented. Linux kernel currently does on-
> demand
> compaction as we request more hugepages, but this style of compaction
> incurs very high latency. Experiments with one-time full memory
> compaction (followed by hugepage allocations) show that kernel is
> able
> to restore a highly fragmented memory state to a fairly compacted
> memory
> state within <1 sec for a 32G system. Such data suggests that a more
> proactive compaction can help us allocate a large fraction of memory
> as
> hugepages keeping allocation latencies low.
> 
> For a more proactive compaction, the approach taken here is to define
> a new tunable called 'proactiveness' which dictates bounds for
> external
> fragmentation wrt HUGETLB_PAGE_ORDER order which kcompactd tries to
> maintain.
> 
> The tunable is exposed through sysctl:
>   /proc/sys/vm/compaction_proactiveness
> 
> It takes value in range [0, 100], with a default of 20.

Looking at the code, setting this to 100 would mean the system would
continuously strive to drive the level of fragmentation down to 0, which is
not reasonable and would bog the system down. A cap lower than 100
might be a good idea to keep kcompactd from dragging the system down.

> 
> Note that a previous version of this patch [1] was found to introduce
> too
> many tunables (per-order extfrag{low, high}), but this one reduces
> them
> to just one (proactiveness). Also, the new tunable is an opaque value
> instead of asking for specific bounds of "external fragmentation",
> which
> would have been difficult to estimate. The internal interpretation of
> this opaque value allows for future fine-tuning.
> 
> Currently, we use a simple translation from this tunable to [low,
> high]
> "fragmentation score" thresholds (low=100-proactiveness,
> high=low+10%).
> The score for a node is defined as weighted mean of per-zone external
> fragmentation wrt HUGETLB_PAGE_ORDER order. A zone's present_pages
> determines its weight.
> 
> To periodically check per-node score, we reuse per-node kcompactd
> threads, which are woken up every 500 milliseconds to check the same.
> If
> a node's score exceeds its high threshold (as derived from user-
> provided
> proactiveness value), proactive compaction is started until its score
> reaches its low threshold value. By default, proactiveness is set to
> 20,
> which implies threshold values of low=80 and high=90.
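
For reference, the threshold derivation described above boils down to
something like this (an illustrative sketch only; the names are mine,
not the code from this patch):

/*
 * Node score = weighted mean of per-zone external fragmentation at
 * HUGETLB_PAGE_ORDER, with present_pages as the weight:
 *
 *   score(node) = sum_z extfrag(z) * present_pages(z) / present_pages(node)
 *
 * Proactiveness then just places a [low, high] band on that 0-100 scale.
 */
static unsigned int frag_score_low(unsigned int proactiveness)
{
	return 100 - proactiveness;		/* proactiveness=20 -> low=80 */
}

static unsigned int frag_score_high(unsigned int proactiveness)
{
	return frag_score_low(proactiveness) + 10;	/* -> high=90 */
}

Proactive compaction then kicks in when a node's score rises above the
high threshold and backs off once it drops below the low one.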
> 
> This patch is largely based on ideas from Michal Hocko posted here:
> https://lore.kernel.org/linux-mm/20161230131412.gi13...@dhcp22.suse.cz/
> 
> Performance data
> 
> 
> System: x64_64, 1T RAM, 80 CPU threads.
> Kernel: 5.6.0-rc3 + this patch
> 
> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
> 
> Before starting the driver, the system was fragmented from a
> userspace
> program that allocates all memory and then for each 2M aligned
> section,
> frees 3/4 of base pages using munmap. The workload is mainly
> anonymous
> userspace pages, which are easy to move around. I intentionally
> avoided
> unmovable pages in this test to see how much latency we incur when
> hugepage allocations hit direct compaction.
> 
> 1. Kernel hugepage allocation latencies
> 
> With the system in such a fragmented state, a kernel driver then
> allocates
> as many hugepages as possible and measures allocation latency:
> 
> (all latency values are in microseconds)
> 
> - With vanilla 5.6.0-rc3
> 
> echo 0 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness
  

This is not needed here since there will be no
/proc/sys/vm/compaction_proactiveness without this patch on vanilla
kernel.

> 
>   percentile latency
>   –––––––––– –––––––
>            5    7894
>           10    9496
>           25   12561
>           30   15295
>           40   18244
>           50   21229
>           60   27556
>           75   30147
>           80   31047
>           90   32859
>           95   33799
> 
> Total 2M hugepages allocated = 383859 (749G worth of hugepages out of
> 762G total free => 98% of free memory could be allocated as
> hugepages)
> 
> - With 5.6.0-rc3 + this patch, with proactiveness=20
> 
> echo 20 | sudo tee /sys/kernel/mm/compaction/node-*/proactiveness

Should be "echo 20 | sudo tee /proc/sys/vm/compaction_proactiveness"

> 
>   percentile latency
>   –––––––––– –––––––
>            5       2
>           10       2
>           25       3
>           30       3
>           40       3
> 

Re: [PATCH] pcdp: Replace zero-length array with flexible-array

2020-05-08 Thread Khalid Aziz
On 5/7/20 1:05 PM, Gustavo A. R. Silva wrote:
> The current codebase makes use of the zero-length array language
> extension to the C90 standard, but the preferred mechanism to declare
> variable-length types such as these ones is a flexible array member[1][2],
> introduced in C99:
> 
> struct foo {
> int stuff;
> struct boo array[];
> };
> 
> By making use of the mechanism above, we will get a compiler warning
> in case the flexible array does not occur last in the structure, which
> will help us prevent some kind of undefined behavior bugs from being
> inadvertently introduced[3] to the codebase from now on.
> 
> Also, notice that, dynamic memory allocations won't be affected by
> this change:
> 
> "Flexible array members have incomplete type, and so the sizeof operator
> may not be applied. As a quirk of the original implementation of
> zero-length arrays, sizeof evaluates to zero."[1]
> 
> sizeof(flexible-array-member) triggers a warning because flexible array
> members have incomplete type[1]. There are some instances of code in
> which the sizeof operator is being incorrectly/erroneously applied to
> zero-length arrays and the result is zero. Such instances may be hiding
> some bugs. So, this work (flexible-array member conversions) will also
> help to get completely rid of those sorts of issues.
> 
> This issue was found with the help of Coccinelle.
> 
> [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
> [2] https://github.com/KSPP/linux/issues/21
> [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")
> 
> Signed-off-by: Gustavo A. R. Silva 
> ---
>  drivers/firmware/pcdp.h |2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/firmware/pcdp.h b/drivers/firmware/pcdp.h
> index ce75d1da9e84..e02540571c52 100644
> --- a/drivers/firmware/pcdp.h
> +++ b/drivers/firmware/pcdp.h
> @@ -103,6 +103,6 @@ struct pcdp {
>   u8  creator_id[4];
>   u32 creator_rev;
>   u32 num_uarts;
> - struct pcdp_uart    uart[0];    /* actual size is num_uarts */
> + struct pcdp_uart    uart[]; /* actual size is num_uarts */
>   /* remainder of table is pcdp_device structures */
>  } __attribute__((packed));
> 

Looks good to me.

Acked-by: Khalid Aziz 


Re: [RFC] mm: Proactive compaction

2019-09-24 Thread Khalid Aziz
On 9/24/19 7:39 AM, Vlastimil Babka wrote:
> On 9/20/19 1:37 AM, Nitin Gupta wrote:
>> On Tue, 2019-08-20 at 10:46 +0200, Vlastimil Babka wrote:
>>>
>>> That's a lot of control knobs - how is an admin supposed to tune them to
>>> their
>>> needs?
>>
>>
>> Yes, it's difficult for an admin to get so many tunable right unless
>> targeting a very specific workload.
>>
>> How about a simpler solution where we exposed just one tunable per-node:
>>/sys/.../node-x/compaction_effort
>> which accepts [0, 100]
>>
>> This parallels /proc/sys/vm/swappiness but for compaction. With this
>> single number, we can estimate per-order [low, high] watermarks for external
>> fragmentation like this:
>>  - For now, map this range to [low, medium, high] which correponds to 
>> specific
>> low, high thresholds for extfrag.
>>  - Apply more relaxed thresholds for higher-order than for lower orders.
>>
>> With this single tunable we remove the burden of setting per-order explicit
>> [low, high] thresholds and it should be easier to experiment with.
> 
> What about instead autotuning by the numbers of allocations hitting
> direct compaction recently? IIRC there were attempts in the past (myself
> included) and recently Khalid's that was quite elaborated.
> 

I do think the right way forward with this longstanding problem is to
take the burden of managing free memory away from the end user and let the
kernel autotune itself to the demands of the workload. We can start with a
simpler algorithm in the kernel that adapts to the workload and refine it as
we move forward. As long as the initial implementation performs at least as
well as the current free page management, we have a workable path for
improvements. I am moving the implementation I put together in the kernel
into a userspace daemon just to test it out on a larger variety of workloads.
Userspace is more limited, with restricted access to the statistics the
algorithm needs for trend analysis, so I would rather be doing
this in the kernel.

--
Khalid



Re: [RFC PATCH 0/2] Add predictive memory reclamation and compaction

2019-09-03 Thread Khalid Aziz
On 9/2/19 2:02 AM, Michal Hocko wrote:
> On Fri 30-08-19 15:35:06, Khalid Aziz wrote:
> [...]
>> - Kernel is not self-tuning and is dependent upon a userspace tool to
>> perform well in a fundamental area of memory management.
> 
> You keep bringing this up without an actual analysis of a wider range of
> workloads that would prove that the default behavior is really
> suboptimal. You are making some assumptions based on a very specific DB
> workload which might benefit from a more aggressive background workload.
> If you really want to sell any changes to auto tuning then you really
> need to come up with more workloads and an actual theory why an early
> and more aggressive reclaim pays off.
> 

Hi Michal,

Fair enough. I have seen DB and cloud server workloads suffer under
default behavior of reclaim/compaction. It manifests itself as prolonged
delays in populating a new database and in launching new cloud
applications. It is fair to ask for the predictive algorithm to be
proven before pulling something like this in kernel. I will implement
this same algorithm in userspace and use existing knobs to tune kernel
dynamically. Running that with large number of workloads will provide
data on how often does this help. If I find any useful tunables missing,
I will be sure to bring it up.

Thanks,
Khalid



Re: [RFC PATCH 0/2] Add predictive memory reclamation and compaction

2019-08-30 Thread Khalid Aziz
On 8/27/19 12:16 AM, Michal Hocko wrote:
> On Tue 27-08-19 02:14:20, Bharath Vedartham wrote:
>> Hi Michal,
>>
>> Here are some of my thoughts,
>> On Wed, Aug 21, 2019 at 04:06:32PM +0200, Michal Hocko wrote:
>>> On Thu 15-08-19 14:51:04, Khalid Aziz wrote:
>>>> Hi Michal,
>>>>
>>>> The smarts for tuning these knobs can be implemented in userspace and
>>>> more knobs added to allow for what is missing today, but we get back to
>>>> the same issue as before. That does nothing to make kernel self-tuning
>>>> and adds possibly even more knobs to userspace. Something so fundamental
>>>> to kernel memory management as making free pages available when they are
>>>> needed really should be taken care of in the kernel itself. Moving it to
>>>> userspace just means the kernel is hobbled unless one installs and tunes
>>>> a userspace package correctly.
>>>
>>> From my past experience the existing autotunig works mostly ok for a
>>> vast variety of workloads. A more clever tuning is possible and people
>>> are doing that already. Especially for cases when the machine is heavily
>>> overcommited. There are different ways to achieve that. Your new
>>> in-kernel auto tuning would have to be tested on a large variety of
>>> workloads to be proven and riskless. So I am quite skeptical to be
>>> honest.
>> Could you give some references to such works regarding tuning the kernel? 
> 
> Talk to Facebook guys and their usage of PSI to control the memory
> distribution and OOM situations.
> 
>> Essentially, Our idea here is to foresee potential memory exhaustion.
>> This foreseeing is done by observing the workload, observing the memory
>> usage of the workload. Based on this observations, we make a prediction
>> whether or not memory exhaustion could occur.
> 
> I understand that and I am not disputing this can be useful. All I do
> argue here is that there is unlikely a good "crystall ball" for most/all
> workloads that would justify its inclusion into the kernel and that this
> is something better done in the userspace where you can experiment and
> tune the behavior for a particular workload of your interest.
> 
> Therefore I would like to shift the discussion towards existing APIs and
> whether they are suitable for such an advance auto-tuning. I haven't
> heard any arguments about missing pieces.
> 

We seem to be in agreement that dynamic tuning is a useful tool. The
question is does that tuning belong in the kernel or in userspace. I see
your point that putting it in userspace allows for faster evolution of
such predictive algorithm than it would be for in-kernel algorithm. I
see following pros and cons with that approach:

+ Keeps complexity of predictive algorithms out of kernel and allows for
faster evolution of these algorithms in userspace.

+ Tuning algorithm can be fine-tuned to specific workloads as appropriate

- Kernel is not self-tuning and is dependent upon a userspace tool to
perform well in a fundamental area of memory management.

- More knobs get added to already crowded field of knobs to allow for
userspace to tweak mm subsystem for better performance.

As for adding predictive algorithm to kernel, I see following pros and cons:

+ Kernel becomes self-tuning and can respond to varying workloads better.

+ Allows for number of user visible tuning knobs to be reduced.

- Getting predictive algorithm right is important to ensure none of the
users see worse performance than today.

- Adds a certain level of complexity to mm subsystem

Pushing the burden of tuning the kernel to userspace is no different from
where we are today and we still have allocation stall issues after years
of tuning from userspace. Adding more knobs to aid tuning from userspace
just makes the kernel look even more complex to the users. In my
opinion, a self tuning kernel should be the base for long term solution.
We can still export knobs to userspace to allow for users with specific
needs to further fine-tune but the base kernel should work well enough
for majority of users. We are not there at this point. We can discuss
what are the missing pieces to support further tuning from userspace, but
is continuing to tweak from userspace the right long-term strategy?

Assuming we want to continue to support tuning from userspace instead, I
can't say more knobs are needed right now. We may have enough knobs and
monitors available between /proc/buddyinfo, /sys/devices/system/node and
/proc/sys/vm. Right values for these knobs and their interaction is not
always clear. Maybe we need to simplify these knobs into something more
understandable for average user as opposed to adding more knobs.

--
Khalid






Re: [RFC] mm: Proactive compaction

2019-08-24 Thread Khalid Aziz
On 8/20/19 2:46 AM, Vlastimil Babka wrote:
> +CC Khalid Aziz who proposed a different approach:
> https://lore.kernel.org/linux-mm/20190813014012.30232-1-khalid.a...@oracle.com/T/#u
> 
> On 8/16/19 11:43 PM, Nitin Gupta wrote:
>> The patch has plenty of rough edges but posting it early to see if I'm
>> going in the right direction and to get some early feedback.
> 
> That's a lot of control knobs - how is an admin supposed to tune them to their
> needs?
> 

At a high level, this idea makes sense and is similar to the idea of
watermarks for free pages. My concern is the same. We now have more
knobs to tune and that increases complexity for sys admins as well as
the chances of a misconfigured system.

--
Khalid




Re: [RFC PATCH 0/2] Add predictive memory reclamation and compaction

2019-08-15 Thread Khalid Aziz
On 8/15/19 11:02 AM, Michal Hocko wrote:
> On Thu 15-08-19 10:27:26, Khalid Aziz wrote:
>> On 8/14/19 2:58 AM, Michal Hocko wrote:
>>> On Tue 13-08-19 09:20:51, Khalid Aziz wrote:
>>>> On 8/13/19 8:05 AM, Michal Hocko wrote:
>>>>> On Mon 12-08-19 19:40:10, Khalid Aziz wrote:
>>>>> [...]
>>>>>> Patch 1 adds code to maintain a sliding lookback window of (time, number
>>>>>> of free pages) points which can be updated continuously and adds code to
>>>>>> compute best fit line across these points. It also adds code to use the
>>>>>> best fit lines to determine if kernel must start reclamation or
>>>>>> compaction.
>>>>>>
>>>>>> Patch 2 adds code to collect data points on free pages of various orders
>>>>>> at different points in time, uses code in patch 1 to update sliding
>>>>>> lookback window with these points and kicks off reclamation or
>>>>>> compaction based upon the results it gets.
>>>>>
>>>>> An important piece of information missing in your description is why
>>>>> do we need to keep that logic in the kernel. In other words, we have
>>>>> the background reclaim that acts on a wmark range and those are tunable
>>>>> from the userspace. The primary point of this background reclaim is to
>>>>> keep balance and prevent from direct reclaim. Why cannot you implement
>>>>> this or any other dynamic trend watching watchdog and tune watermarks
>>>>> accordingly? Something similar applies to kcompactd although we might be
>>>>> lacking a good interface.
>>>>>
>>>>
>>>> Hi Michal,
>>>>
>>>> That is a very good question. As a matter of fact the initial prototype
>>>> to assess the feasibility of this approach was written in userspace for
>>>> a very limited application. We wrote the initial prototype to monitor
>>>> fragmentation and used /sys/devices/system/node/node*/compact to trigger
>>>> compaction. The prototype demonstrated this approach has merits.
>>>>
>>>> The primary reason to implement this logic in the kernel is to make the
>>>> kernel self-tuning.
>>>
>>> What makes this particular self-tuning an universal win? In other words
>>> there are many ways to analyze the memory pressure and feedback it back
>>> that I can think of. It is quite likely that very specific workloads
>>> would have very specific demands there. I have seen cases where are
>>> trivial increase of min_free_kbytes to normally insane value worked
>>> really great for a DB workload because the wasted memory didn't matter
>>> for example.
>>
>> Hi Michal,
>>
>> The problem is not so much as do we have enough knobs available, rather
>> how do we tweak them dynamically to avoid allocation stalls. Knobs like
>> watermarks and min_free_kbytes are set once typically and left alone.
> 
> Does anything prevent from tuning these knobs more dynamically based on
> already exported metrics?

Hi Michal,

The smarts for tuning these knobs can be implemented in userspace and
more knobs added to allow for what is missing today, but we get back to
the same issue as before. That does nothing to make kernel self-tuning
and adds possibly even more knobs to userspace. Something so fundamental
to kernel memory management as making free pages available when they are
needed really should be taken care of in the kernel itself. Moving it to
userspace just means the kernel is hobbled unless one installs and tunes
a userspace package correctly.

> 
>> Allocation stalls show up even on much smaller scale than large DB or
>> cloud platforms. I have seen it on a desktop class machine running a few
>> services in the background. Desktop is running gnome3, I would lock the
>> screen and come back to unlock it a day or two later. In that time most
>> of memory has been consumed by buffer/page cache. Just unlocking the
>> screen can take 30+ seconds while system reclaims pages to be able swap
>> back in all the processes that were inactive so far.
> 
> This sounds like a bug to me.

Quite possibly. I had seen that behavior with 4.17, 4.18 and 4.19
kernels. I then just moved enough tasks off of my machine to other
machines to make the problem go away. So I can't say if the problem has
persisted past 4.19.

> 
>> It is true different workloads will have different requirements and that
>> is what I am attempting to address here. Instead of tweaking the knobs
>> 

Re: [RFC PATCH 0/2] Add predictive memory reclamation and compaction

2019-08-15 Thread Khalid Aziz
On 8/14/19 2:58 AM, Michal Hocko wrote:
> On Tue 13-08-19 09:20:51, Khalid Aziz wrote:
>> On 8/13/19 8:05 AM, Michal Hocko wrote:
>>> On Mon 12-08-19 19:40:10, Khalid Aziz wrote:
>>> [...]
>>>> Patch 1 adds code to maintain a sliding lookback window of (time, number
>>>> of free pages) points which can be updated continuously and adds code to
>>>> compute best fit line across these points. It also adds code to use the
>>>> best fit lines to determine if kernel must start reclamation or
>>>> compaction.
>>>>
>>>> Patch 2 adds code to collect data points on free pages of various orders
>>>> at different points in time, uses code in patch 1 to update sliding
>>>> lookback window with these points and kicks off reclamation or
>>>> compaction based upon the results it gets.
>>>
>>> An important piece of information missing in your description is why
>>> do we need to keep that logic in the kernel. In other words, we have
>>> the background reclaim that acts on a wmark range and those are tunable
>>> from the userspace. The primary point of this background reclaim is to
>>> keep balance and prevent from direct reclaim. Why cannot you implement
>>> this or any other dynamic trend watching watchdog and tune watermarks
>>> accordingly? Something similar applies to kcompactd although we might be
>>> lacking a good interface.
>>>
>>
>> Hi Michal,
>>
>> That is a very good question. As a matter of fact the initial prototype
>> to assess the feasibility of this approach was written in userspace for
>> a very limited application. We wrote the initial prototype to monitor
>> fragmentation and used /sys/devices/system/node/node*/compact to trigger
>> compaction. The prototype demonstrated this approach has merits.
>>
>> The primary reason to implement this logic in the kernel is to make the
>> kernel self-tuning.
> 
> What makes this particular self-tuning an universal win? In other words
> there are many ways to analyze the memory pressure and feedback it back
> that I can think of. It is quite likely that very specific workloads
> would have very specific demands there. I have seen cases where are
> trivial increase of min_free_kbytes to normally insane value worked
> really great for a DB workload because the wasted memory didn't matter
> for example.

Hi Michal,

The problem is not so much whether we have enough knobs available, but rather
how we tweak them dynamically to avoid allocation stalls. Knobs like
watermarks and min_free_kbytes are set once typically and left alone.
Allocation stalls show up even on much smaller scale than large DB or
cloud platforms. I have seen it on a desktop class machine running a few
services in the background. Desktop is running gnome3, I would lock the
screen and come back to unlock it a day or two later. In that time most
of memory has been consumed by buffer/page cache. Just unlocking the
screen can take 30+ seconds while system reclaims pages to be able swap
back in all the processes that were inactive so far.

It is true different workloads will have different requirements and that
is what I am attempting to address here. Instead of tweaking the knobs
statically based upon one workload requirements, I am looking at the
trend of memory consumption instead. A best fit line showing recent
trend can be quite indicative of what the workload is doing in terms of
memory. For instance, a cloud server might be running a certain number
of instances for a few days and it can end up using any memory not used
up by tasks, for buffer/page cache. Now the sys admin gets a request to
launch another instance, and when they try to do that, the system starts
to allocate pages and soon runs out of free pages. We are now in the direct
reclaim path, and it can take a significant amount of time to find all the
free pages the new task needs. If the kernel were watching the memory
consumption trend instead, it could see that the trend line shows a
complete exhaustion of free pages or 100% fragmentation in the near future,
irrespective of what the workload is. This allows the kernel to start
reclamation/compaction before we actually hit the point of complete free
page exhaustion or fragmentation. This could avoid direct
reclamation/compaction, or at least cut down its severity enough. That is
what makes it a win in a large number of cases. The least squares algorithm
is lightweight enough to not add to system load or complexity. If you have
come across a better algorithm, I certainly would look into using that.
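
To make the "best fit line" part concrete, the computation involved is
roughly the following (a minimal, self-contained userspace sketch; the
function name and the use of floating point are mine for brevity and not
taken from the patches, which work on integer per-zone samples):

#include <stdbool.h>

#define LOOKBACK 8

/* x[] holds sample timestamps, y[] the free page counts at those times */
static bool exhaustion_predicted(const unsigned long x[LOOKBACK],
				 const unsigned long y[LOOKBACK],
				 unsigned long now, unsigned long lookahead)
{
	double sx = 0, sy = 0, sxx = 0, sxy = 0;
	double m, b, denom;
	int i;

	for (i = 0; i < LOOKBACK; i++) {
		sx  += x[i];
		sy  += y[i];
		sxx += (double)x[i] * x[i];
		sxy += (double)x[i] * y[i];
	}

	denom = LOOKBACK * sxx - sx * sx;
	if (denom == 0.0)
		return false;		/* degenerate window, no usable trend */

	m = (LOOKBACK * sxy - sx * sy) / denom;	/* slope of the best fit line */
	b = (sy - m * sx) / LOOKBACK;		/* intercept */

	if (m >= 0.0)
		return false;		/* free pages are not trending down */

	/* the fitted line reaches zero free pages at x = -b/m */
	return -b / m <= (double)(now + lookahead);
}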

> 
>> The more knobs we have externally, the more complex
>> it becomes to tune the kernel externally.
> 
> I agree on this point. Is the current set of tunning sufficient? What
> 

Re: [RFC PATCH 0/2] Add predictive memory reclamation and compaction

2019-08-13 Thread Khalid Aziz
On 8/13/19 8:05 AM, Michal Hocko wrote:
> On Mon 12-08-19 19:40:10, Khalid Aziz wrote:
> [...]
>> Patch 1 adds code to maintain a sliding lookback window of (time, number
>> of free pages) points which can be updated continuously and adds code to
>> compute best fit line across these points. It also adds code to use the
>> best fit lines to determine if kernel must start reclamation or
>> compaction.
>>
>> Patch 2 adds code to collect data points on free pages of various orders
>> at different points in time, uses code in patch 1 to update sliding
>> lookback window with these points and kicks off reclamation or
>> compaction based upon the results it gets.
> 
> An important piece of information missing in your description is why
> do we need to keep that logic in the kernel. In other words, we have
> the background reclaim that acts on a wmark range and those are tunable
> from the userspace. The primary point of this background reclaim is to
> keep balance and prevent from direct reclaim. Why cannot you implement
> this or any other dynamic trend watching watchdog and tune watermarks
> accordingly? Something similar applies to kcompactd although we might be
> lacking a good interface.
> 

Hi Michal,

That is a very good question. As a matter of fact the initial prototype
to assess the feasibility of this approach was written in userspace for
a very limited application. We wrote the initial prototype to monitor
fragmentation and used /sys/devices/system/node/node*/compact to trigger
compaction. The prototype demonstrated this approach has merits.

The primary reason to implement this logic in the kernel is to make the
kernel self-tuning. The more knobs we have externally, the more complex
it becomes to tune the kernel externally. If we can make the kernel
self-tuning, we can actually eliminate external knobs and simplify
kernel admin. In spite of the availability of tuning knobs and a large number
of tuning guides for databases and cloud platforms, allocation stalls are
a routinely occurring problem on customer deployments. A best fit line
algorithm has no measurable impact on system performance yet provides
measurable improvement and room for further refinement. Makes sense?

Thanks,
Khalid



[RFC PATCH 2/2] mm/vmscan: Add fragmentation and page starvation prediction to kswapd

2019-08-12 Thread Khalid Aziz
This patch adds proactive memory reclamation to kswapd using the
free page exhaustion/fragmentation prediction based upon memory
consumption trend. It uses the least squares fit algorithm introduced
earlier for this prediction. A new function node_trend_analysis()
iterates through all zones and updates trend data in the lookback
window for least square fit algorithm. At the same time it flags any
zones that have potential for exhaustion/fragmentation by setting
ZONE_POTENTIAL_FRAG flag.

prepare_kswapd_sleep() calls node_trend_analysis() to check if the
node has potential exhaustion/fragmentation. If so, kswapd will
continue reclamataion. balance_pgdat has been modified to take
potential fragmentation into account when deciding when to wake
kcompactd up. Any zones that have potential severe fragmentation get
watermark boosted to reclaim and compact free pages proactively.

Signed-off-by: Khalid Aziz 
Signed-off-by: Bharath Vedartham 
Tested-by: Vandana BN 
---
 include/linux/mmzone.h |  38 ++
 mm/page_alloc.c        |  27 --
 mm/vmscan.c            | 116 ++---
 3 files changed, 148 insertions(+), 33 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 9a0e5cab7171..a523476b5ce1 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -587,6 +587,12 @@ struct zone {
 
 	bool			contiguous;
 
+   /*
+* Structures to use for memory consumption prediction for
+* each order
+*/
+   struct lsq_struct   mem_prediction[MAX_ORDER];
+
ZONE_PADDING(_pad3_)
/* Zone statistics */
atomic_long_t   vm_stat[NR_VM_ZONE_STAT_ITEMS];
@@ -611,6 +617,9 @@ enum zone_flags {
ZONE_BOOSTED_WATERMARK, /* zone recently boosted watermarks.
 * Cleared when kswapd is woken.
 */
+   ZONE_POTENTIAL_FRAG,/* zone detected with a potential
+* external fragmentation event.
+*/
 };
 
 extern int mem_predict(struct frag_info *frag_vec, struct zone *zone);
@@ -1130,6 +1139,35 @@ static inline struct zoneref *first_zones_zonelist(struct zonelist *zonelist,
 #define for_each_zone_zonelist(zone, z, zlist, highidx) \
for_each_zone_zonelist_nodemask(zone, z, zlist, highidx, NULL)
 
+extern int watermark_boost_factor;
+
+static inline void boost_watermark(struct zone *zone)
+{
+   unsigned long max_boost;
+
+   if (!watermark_boost_factor)
+   return;
+
+   max_boost = mult_frac(zone->_watermark[WMARK_HIGH],
+   watermark_boost_factor, 1);
+
+   /*
+* high watermark may be uninitialised if fragmentation occurs
+* very early in boot so do not boost. We do not fall
+* through and boost by pageblock_nr_pages as failing
+* allocations that early means that reclaim is not going
+* to help and it may even be impossible to reclaim the
+* boosted watermark resulting in a hang.
+*/
+   if (!max_boost)
+   return;
+
+   max_boost = max(pageblock_nr_pages, max_boost);
+
+   zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
+   max_boost);
+}
+
 #ifdef CONFIG_SPARSEMEM
 #include 
 #endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 272c6de1bf4e..1b4e6ba16f1c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2351,33 +2351,6 @@ static bool can_steal_fallback(unsigned int order, int start_mt)
return false;
 }
 
-static inline void boost_watermark(struct zone *zone)
-{
-   unsigned long max_boost;
-
-   if (!watermark_boost_factor)
-   return;
-
-   max_boost = mult_frac(zone->_watermark[WMARK_HIGH],
-   watermark_boost_factor, 1);
-
-   /*
-* high watermark may be uninitialised if fragmentation occurs
-* very early in boot so do not boost. We do not fall
-* through and boost by pageblock_nr_pages as failing
-* allocations that early means that reclaim is not going
-* to help and it may even be impossible to reclaim the
-* boosted watermark resulting in a hang.
-*/
-   if (!max_boost)
-   return;
-
-   max_boost = max(pageblock_nr_pages, max_boost);
-
-   zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
-   max_boost);
-}
-
 /*
  * This function implements actual steal behaviour. If order is large enough,
  * we can steal whole pageblock. If not, we first move freepages in this
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 44df66a98f2a..b9cf6658c83d 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -51,6 +51,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -3397,14 +3398,82 @@ static void cle

[RFC PATCH 0/2] Add predictive memory reclamation and compaction

2019-08-12 Thread Khalid Aziz
 compaction.
get_page_from_freelist() might be a better place to gather data points
and make decision on starting reclamation or comapction but it can also
impact page allocation latency. Another possibility is to create a
separate kernel thread that gathers page usage data periodically and
wakes up kswapd or kcompactd as needed based upon trend analysis. This
is something that can be finalized before final implementation of this
proposal.
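
For illustration, the separate sampling thread option would look roughly
like this (a sketch only; the helper functions are hypothetical
placeholders, not code from these patches):

#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/mmzone.h>	/* MEMPREDICT_* bits added by patch 1 */

/* hypothetical helpers standing in for the real sampling/wakeup logic */
extern int sample_zones_and_predict(void);
extern void wake_kswapd_for_prediction(void);
extern void wake_kcompactd_for_prediction(void);

static int mem_trend_sampler(void *unused)
{
	while (!kthread_should_stop()) {
		int action = sample_zones_and_predict();

		if (action & MEMPREDICT_RECLAIM)
			wake_kswapd_for_prediction();
		if (action & MEMPREDICT_COMPACT)
			wake_kcompactd_for_prediction();

		msleep_interruptible(500);	/* sampling period */
	}
	return 0;
}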

Impact of this implementation was measured using two sets of tests.
First test consists of three concurrent dd processes writing large
amounts of data (66 GB, 131 GB and 262 GB) to three different SSDs
causing large number of free pages to be used up for buffer/page cache.
Number of cumulative allocation stalls as reported by /proc/vmstat were
recorded for 5 runs of this test.

5.3-rc2
---

allocstall_dma 0
allocstall_dma32 0
allocstall_normal 15
allocstall_movable 1629
compact_stall 0

Total = 1644


5.3-rc2 + this patch series
---

allocstall_dma 0
allocstall_dma32 0
allocstall_normal 182
allocstall_movable 1266
compact_stall 0

Total = 1544

There was no significant change in system time between these runs. This
was a ~6.5% improvement in number of allocation stalls.

A second test used was the parallel dd test from mmtests. The average number
of stalls over 4 runs with the unpatched 5.3-rc2 kernel was 6057. The average
number of stalls over 4 runs after applying these patches was 5584. This
was an ~8% improvement in number of allocation stalls.

This work is complementary to other allocation/compaction stall
improvements. It attempts to address potential stalls proactively before
they happen and will make use of any improvements made to the
reclamation/compaction code.

Any feedback on this proposal and associated implementation will be
greatly appreciated. This is work in progress.

Khalid Aziz (2):
  mm: Add trend based prediction algorithm for memory usage
  mm/vmscan: Add fragmentation prediction to kswapd

 include/linux/mmzone.h |  72 +++
 mm/Makefile            |   2 +-
 mm/lsq.c               | 273 +
 mm/page_alloc.c        |  27 
 mm/vmscan.c            | 116 -
 5 files changed, 456 insertions(+), 34 deletions(-)
 create mode 100644 mm/lsq.c

-- 
2.20.1



[RFC PATCH 1/2] mm: Add trend based prediction algorithm for memory usage

2019-08-12 Thread Khalid Aziz
Direct page reclamation and compaction have high and unpredictable
latency costs for applications. This patch adds code to predict whether the
system is about to run out of free memory by watching historical
memory consumption trends. It computes a best fit line to this
historical data using the method of least squares. It can then compute whether
the system will run out of memory if the current trend continues.
Historical data is held in a new data structure lsq_struct for each
zone and each order within the zone. Size of the window for historical
data is given by LSQ_LOOKBACK.

Signed-off-by: Khalid Aziz 
Signed-off-by: Bharath Vedartham 
Reviewed-by: Vandana BN 
---
 include/linux/mmzone.h |  34 +
 mm/Makefile            |   2 +-
 mm/lsq.c               | 273 +
 3 files changed, 308 insertions(+), 1 deletion(-)
 create mode 100644 mm/lsq.c

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d77d717c620c..9a0e5cab7171 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -355,6 +355,38 @@ struct per_cpu_nodestat {
 
 #endif /* !__GENERATING_BOUNDS.H */
 
+/*
+ * Size of lookback window for the free memory exhaustion prediction
+ * algorithm. Keep it to less than 16 to keep data manageable
+ */
+#define LSQ_LOOKBACK 8
+
+/*
+ * How far forward to look when determining if memory exhaustion would
+ * become an issue.
+ */
+extern unsigned long mempredict_threshold;
+
+/*
+ * Structure to keep track of current values required to compute the best
+ * fit line using method of least squares
+ */
+struct lsq_struct {
+   bool ready;
+   int next;
+   u64 x[LSQ_LOOKBACK];
+   unsigned long y[LSQ_LOOKBACK];
+};
+
+struct frag_info {
+   unsigned long free_pages;
+   unsigned long time;
+};
+
+/* Possible bits to be set by mem_predict in its return value */
+#define MEMPREDICT_RECLAIM 0x01
+#define MEMPREDICT_COMPACT 0x02
+
 enum zone_type {
 #ifdef CONFIG_ZONE_DMA
/*
@@ -581,6 +613,8 @@ enum zone_flags {
 */
 };
 
+extern int mem_predict(struct frag_info *frag_vec, struct zone *zone);
+
 static inline unsigned long zone_managed_pages(struct zone *zone)
 {
 	return (unsigned long)atomic_long_read(&zone->managed_pages);
diff --git a/mm/Makefile b/mm/Makefile
index 338e528ad436..fb7b3c19dd13 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -39,7 +39,7 @@ obj-y := filemap.o mempool.o oom_kill.o fadvise.o \
   mm_init.o mmu_context.o percpu.o slab_common.o \
   compaction.o vmacache.o \
   interval_tree.o list_lru.o workingset.o \
-  debug.o gup.o $(mmu-y)
+  debug.o gup.o lsq.o $(mmu-y)
 
 # Give 'page_alloc' its own module-parameter namespace
 page-alloc-y := page_alloc.o
diff --git a/mm/lsq.c b/mm/lsq.c
new file mode 100644
index ..6005a2b2f44d
--- /dev/null
+++ b/mm/lsq.c
@@ -0,0 +1,273 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * lsq.c: Provide a prediction on whether free memory exhaustion is
+ * imminent or not by using a best fit line based upon method of
+ * least squares. Best fit line is based upon recent historical
+ * data. This historical data forms the lookback window for the
+ * algorithm.
+ *
+ *
+ * Author: Robert Harris
+ * Author: Khalid Aziz 
+ *
+ * Copyright (c) 2019, Oracle and/or its affiliates. All rights reserved.
+ * DO NOT ALTER OR REMOVE COPYRIGHT NOTICES OR THIS FILE HEADER.
+ *
+ * This code is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 only, as
+ * published by the Free Software Foundation.
+ *
+ * This code is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License
+ * version 2 for more details (a copy is included in the LICENSE file that
+ * accompanied this code).
+ *
+ * Please contact Oracle, 500 Oracle Parkway, Redwood Shores, CA 94065 USA
+ * or visit www.oracle.com if you need additional information or have any
+ * questions.
+ *
+ */
+
+#include 
+#include 
+#include 
+
+/*
+ * How far forward to look when determining if fragmentation would
+ * become an issue. The unit for this is same as the unit for the
+ * x-axis of graph where sample points for memory utilization are being
+ * plotted. We start with a default value of 1000 units but can tweak it
+ * dynamically to get better prediction results. With data points for
+ * memory being gathered with granularity of milliseconds, this translates
+ * to a look ahead of 1 second. If system is 1 second away from severe
+ * fragmentation, start compaction now to avoid direct compaction.
+ */
+unsigned long mempredict_threshold = 1000;
+
+/*
+ * Threshold for number of free pages that should trigger reclamat

Re: [PATCH 09/16] sparc64: use the generic get_user_pages_fast code

2019-07-26 Thread Khalid Aziz
On 7/17/19 3:59 PM, Dmitry V. Levin wrote:
> Hi,
> 
> On Tue, Jun 25, 2019 at 04:37:08PM +0200, Christoph Hellwig wrote:
>> The sparc64 code is mostly equivalent to the generic one, minus various
>> bugfixes and two arch overrides that this patch adds to pgtable.h.
>>
>> Signed-off-by: Christoph Hellwig 
>> Reviewed-by: Khalid Aziz 
>> ---
>>  arch/sparc/Kconfig  |   1 +
>>  arch/sparc/include/asm/pgtable_64.h |  18 ++
>>  arch/sparc/mm/Makefile  |   2 +-
>>  arch/sparc/mm/gup.c | 340 
>>  4 files changed, 20 insertions(+), 341 deletions(-)
>>  delete mode 100644 arch/sparc/mm/gup.c
> 
> So this ended up as commit 7b9afb86b6328f10dc2cad9223d7def12d60e505
> (thanks to Anatoly for bisecting) and introduced a regression: 
> futex.test from the strace test suite now causes an Oops on sparc64
> in futex syscall
> 

I have been working on reproducing this problem but ran into a different
problem. I found 5.1 and newer kernels no longer boot on an S7 server or
in an ldom on a T7 server (kernel hangs after "crc32c_sparc64: Using
sparc64 crc32c opcode optimized CRC32C implementation" on console). A
long git bisect session between 5.0 and 5.1 pointed to commit
73a66023c937 ("sparc64: fix sparc_ipc type conversion") but that makes
no sense. I will keep working on finding root cause. I wonder if
Anatoly's git bisect result is also suspect.

--
Khalid



Re: [PATCH 01/16] mm: use untagged_addr() for get_user_pages_fast addresses

2019-06-21 Thread Khalid Aziz
On 6/21/19 7:39 AM, Jason Gunthorpe wrote:
> On Tue, Jun 11, 2019 at 04:40:47PM +0200, Christoph Hellwig wrote:
>> This will allow sparc64 to override its ADI tags for
>> get_user_pages and get_user_pages_fast.
>>
>> Signed-off-by: Christoph Hellwig 
>>  mm/gup.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/gup.c b/mm/gup.c
>> index ddde097cf9e4..6bb521db67ec 100644
>> +++ b/mm/gup.c
>> @@ -2146,7 +2146,7 @@ int __get_user_pages_fast(unsigned long start, int 
>> nr_pages, int write,
>>  unsigned long flags;
>>  int nr = 0;
>>  
>> -start &= PAGE_MASK;
>> +start = untagged_addr(start) & PAGE_MASK;
>>  len = (unsigned long) nr_pages << PAGE_SHIFT;
>>  end = start + len;
> 
> Hmm, this function, and the other, goes on to do:
> 
> if (unlikely(!access_ok((void __user *)start, len)))
> return 0;
> 
> and I thought that access_ok takes in the tagged pointer?
> 
> How about re-order it a bit?

access_ok() can handle tagged or untagged pointers. It just strips the
tag bits from the top of the address. The current order doesn't really matter
from a functionality point of view. There might be a minor gain in delaying
untagging in __get_user_pages_fast(), but I could go either way.
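
(Conceptually, the tag stripping amounts to masking off the tag bits at
the top of the address. The snippet below is only an illustration with a
made-up tag layout, not the actual arch implementation of
untagged_addr():)

/* drop an 8-bit tag stored in the top byte of a 64-bit user address */
#define TAG_SHIFT	56

static inline unsigned long untag(unsigned long addr)
{
	return addr & ((1UL << TAG_SHIFT) - 1);
}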

--
Khalid

> 
> diff --git a/mm/gup.c b/mm/gup.c
> index ddde097cf9e410..f48747ced4723b 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2148,11 +2148,12 @@ int __get_user_pages_fast(unsigned long start, int 
> nr_pages, int write,
>  
>   start &= PAGE_MASK;
>   len = (unsigned long) nr_pages << PAGE_SHIFT;
> - end = start + len;
> -
>   if (unlikely(!access_ok((void __user *)start, len)))
>   return 0;
>  
> + start = untagged_ptr(start);
> + end = start + len;
> +
>   /*
>* Disable interrupts.  We use the nested form as we can already have
>* interrupts disabled by get_futex_key.
> 




Re: [PATCH v17 04/15] mm, arm64: untag user pointers passed to memory syscalls

2019-06-19 Thread Khalid Aziz
On 6/19/19 9:55 AM, Khalid Aziz wrote:
> On 6/12/19 5:43 AM, Andrey Konovalov wrote:
>> This patch is a part of a series that extends arm64 kernel ABI to allow to
>> pass tagged user pointers (with the top byte set to something else other
>> than 0x00) as syscall arguments.
>>
>> This patch allows tagged pointers to be passed to the following memory
>> syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
>> mremap, msync, munlock, move_pages.
>>
>> The mmap and mremap syscalls do not currently accept tagged addresses.
>> Architectures may interpret the tag as a background colour for the
>> corresponding vma.
>>
>> Reviewed-by: Catalin Marinas 
>> Reviewed-by: Kees Cook 
>> Signed-off-by: Andrey Konovalov 
>> ---
> 
> Reviewed-by: Khalid Aziz 
> 
> 

I would also recommend updating the commit log for all the patches in this
series that change files under mm/ (as opposed to arch/arm64) so that they do
not reference the arm64 kernel ABI, since the change applies to every
architecture. So something along the lines of "This patch is part of a
series that extends kernel ABI to allow..."

--
Khalid


>>  mm/madvise.c   | 2 ++
>>  mm/mempolicy.c | 3 +++
>>  mm/migrate.c   | 2 +-
>>  mm/mincore.c   | 2 ++
>>  mm/mlock.c | 4 
>>  mm/mprotect.c  | 2 ++
>>  mm/mremap.c| 7 +++
>>  mm/msync.c | 2 ++
>>  8 files changed, 23 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 628022e674a7..39b82f8a698f 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -810,6 +810,8 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, 
>> len_in, int, behavior)
>>  size_t len;
>>  struct blk_plug plug;
>>  
>> +start = untagged_addr(start);
>> +
>>  if (!madvise_behavior_valid(behavior))
>>  return error;
>>  
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index 01600d80ae01..78e0a88b2680 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -1360,6 +1360,7 @@ static long kernel_mbind(unsigned long start, unsigned 
>> long len,
>>  int err;
>>  unsigned short mode_flags;
>>  
>> +start = untagged_addr(start);
>>  mode_flags = mode & MPOL_MODE_FLAGS;
>>  mode &= ~MPOL_MODE_FLAGS;
>>  if (mode >= MPOL_MAX)
>> @@ -1517,6 +1518,8 @@ static int kernel_get_mempolicy(int __user *policy,
>>  int uninitialized_var(pval);
>>  nodemask_t nodes;
>>  
>> +addr = untagged_addr(addr);
>> +
>>  if (nmask != NULL && maxnode < nr_node_ids)
>>  return -EINVAL;
>>  
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index f2ecc2855a12..d22c45cf36b2 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -1616,7 +1616,7 @@ static int do_pages_move(struct mm_struct *mm, 
>> nodemask_t task_nodes,
>>  goto out_flush;
>>  if (get_user(node, nodes + i))
>>  goto out_flush;
>> -addr = (unsigned long)p;
>> +addr = (unsigned long)untagged_addr(p);
>>  
>>  err = -ENODEV;
>>  if (node < 0 || node >= MAX_NUMNODES)
>> diff --git a/mm/mincore.c b/mm/mincore.c
>> index c3f058bd0faf..64c322ed845c 100644
>> --- a/mm/mincore.c
>> +++ b/mm/mincore.c
>> @@ -249,6 +249,8 @@ SYSCALL_DEFINE3(mincore, unsigned long, start, size_t, 
>> len,
>>  unsigned long pages;
>>  unsigned char *tmp;
>>  
>> +start = untagged_addr(start);
>> +
>>  /* Check the start address: needs to be page-aligned.. */
>>  if (start & ~PAGE_MASK)
>>  return -EINVAL;
>> diff --git a/mm/mlock.c b/mm/mlock.c
>> index 080f3b36415b..e82609eaa428 100644
>> --- a/mm/mlock.c
>> +++ b/mm/mlock.c
>> @@ -674,6 +674,8 @@ static __must_check int do_mlock(unsigned long start, 
>> size_t len, vm_flags_t fla
>>  unsigned long lock_limit;
>>  int error = -ENOMEM;
>>  
>> +start = untagged_addr(start);
>> +
>>  if (!can_do_mlock())
>>  return -EPERM;
>>  
>> @@ -735,6 +737,8 @@ SYSCALL_DEFINE2(munlock, unsigned long, start, size_t, 
>> len)
>>  {
>>  int ret;
>>  
>> +start = untagged_addr(start);
>> +
>>  len = PAGE_ALIGN(len + (offset_in_page(start)));
>>  start &= PAGE_MASK;
>>  
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>

Re: [PATCH 09/16] sparc64: use the generic get_user_pages_fast code

2019-06-11 Thread Khalid Aziz
On 6/11/19 8:40 AM, Christoph Hellwig wrote:
> The sparc64 code is mostly equivalent to the generic one, minus various
> bugfixes and two arch overrides that this patch adds to pgtable.h.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/sparc/Kconfig  |   1 +
>  arch/sparc/include/asm/pgtable_64.h |  18 ++
>  arch/sparc/mm/Makefile  |   2 +-
>  arch/sparc/mm/gup.c | 340 
>  4 files changed, 20 insertions(+), 341 deletions(-)
>  delete mode 100644 arch/sparc/mm/gup.c
> 

Reviewed-by: Khalid Aziz 




Re: [PATCH 10/16] mm: rename CONFIG_HAVE_GENERIC_GUP to CONFIG_HAVE_FAST_GUP

2019-06-11 Thread Khalid Aziz
On 6/11/19 8:40 AM, Christoph Hellwig wrote:
> We only support the generic GUP now, so rename the config option to
> be more clear, and always use the mm/Kconfig definition of the
> symbol and select it from the arch Kconfigs.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/arm/Kconfig | 5 +
>  arch/arm64/Kconfig   | 4 +---
>  arch/mips/Kconfig| 2 +-
>  arch/powerpc/Kconfig | 2 +-
>  arch/s390/Kconfig| 2 +-
>  arch/sh/Kconfig  | 2 +-
>  arch/sparc/Kconfig   | 2 +-
>  arch/x86/Kconfig | 4 +---
>  mm/Kconfig   | 2 +-
>  mm/gup.c | 4 ++--
>  10 files changed, 11 insertions(+), 18 deletions(-)
> 

Looks good.

Reviewed-by: Khalid Aziz 




Re: [PATCH 08/16] sparc64: define untagged_addr()

2019-06-11 Thread Khalid Aziz
On 6/11/19 8:40 AM, Christoph Hellwig wrote:
> Add a helper to untag a user pointer.  This is needed for ADI support
> in get_user_pages_fast.
> 
> Signed-off-by: Christoph Hellwig 
> ---
>  arch/sparc/include/asm/pgtable_64.h | 22 ++
>  1 file changed, 22 insertions(+)

Looks good to me.

Reviewed-by: Khalid Aziz 

> 
> diff --git a/arch/sparc/include/asm/pgtable_64.h 
> b/arch/sparc/include/asm/pgtable_64.h
> index f0dcf991d27f..1904782dcd39 100644
> --- a/arch/sparc/include/asm/pgtable_64.h
> +++ b/arch/sparc/include/asm/pgtable_64.h
> @@ -1076,6 +1076,28 @@ static inline int io_remap_pfn_range(struct 
> vm_area_struct *vma,
>  }
>  #define io_remap_pfn_range io_remap_pfn_range 
>  
> +static inline unsigned long untagged_addr(unsigned long start)
> +{
> + if (adi_capable()) {
> + long addr = start;
> +
> + /* If userspace has passed a versioned address, kernel
> +  * will not find it in the VMAs since it does not store
> +  * the version tags in the list of VMAs. Storing version
> +  * tags in list of VMAs is impractical since they can be
> +  * changed any time from userspace without dropping into
> +  * kernel. Any address search in VMAs will be done with
> +  * non-versioned addresses. Ensure the ADI version bits
> +  * are dropped here by sign extending the last bit before
> +  * ADI bits. IOMMU does not implement version tags.
> +  */
> + return (addr << (long)adi_nbits()) >> (long)adi_nbits();
> + }
> +
> + return start;
> +}
> +#define untagged_addr untagged_addr
> +
>  #include 
>  #include 
>  
> 
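As an aside, the shift pair in the quoted untagged_addr() can be
illustrated with a small standalone sketch. The 4-bit tag width and the
example address below are assumptions for illustration only; the real
width comes from adi_nbits():

#include <stdio.h>

/* Assumed tag width, for illustration only; sparc64 queries adi_nbits(). */
#define TAG_BITS 4

static unsigned long untag(unsigned long addr)
{
        /*
         * Shift the version tag out of the top, then shift back
         * arithmetically so the bit just below the tag field is
         * sign-extended over it -- the same trick as above.
         */
        return (unsigned long)(((long)addr << TAG_BITS) >> TAG_BITS);
}

int main(void)
{
        unsigned long tagged = 0xa00007f001234000UL; /* hypothetical tagged address */

        printf("tagged:   %016lx\n", tagged);
        printf("untagged: %016lx\n", untag(tagged)); /* top nibble 0xa dropped */
        return 0;
}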




Re: [PATCH 01/16] mm: use untagged_addr() for get_user_pages_fast addresses

2019-06-11 Thread Khalid Aziz
On 6/11/19 8:40 AM, Christoph Hellwig wrote:
> This will allow sparc64 to override its ADI tags for
> get_user_pages and get_user_pages_fast.
> 
> Signed-off-by: Christoph Hellwig 
> ---

The commit message is sparc64-specific, but the goal here is to allow
any architecture with memory tagging to use this, so I would suggest
rewording the commit log. Other than that:

Reviewed-by: Khalid Aziz 

>  mm/gup.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/gup.c b/mm/gup.c
> index ddde097cf9e4..6bb521db67ec 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2146,7 +2146,7 @@ int __get_user_pages_fast(unsigned long start, int 
> nr_pages, int write,
>   unsigned long flags;
>   int nr = 0;
>  
> - start &= PAGE_MASK;
> + start = untagged_addr(start) & PAGE_MASK;
>   len = (unsigned long) nr_pages << PAGE_SHIFT;
>   end = start + len;
>  
> @@ -2219,7 +2219,7 @@ int get_user_pages_fast(unsigned long start, int 
> nr_pages,
>   unsigned long addr, len, end;
>   int nr = 0, ret = 0;
>  
> - start &= PAGE_MASK;
> + start = untagged_addr(start) & PAGE_MASK;
>   addr = start;
>   len = (unsigned long) nr_pages << PAGE_SHIFT;
>   end = start + len;
> 




Re: [PATCH 01/16] uaccess: add untagged_addr definition for other arches

2019-06-03 Thread Khalid Aziz
On 6/1/19 1:49 AM, Christoph Hellwig wrote:
> From: Andrey Konovalov 
> 
> To allow arm64 syscalls to accept tagged pointers from userspace, we must
> untag them when they are passed to the kernel. Since untagging is done in
> generic parts of the kernel, the untagged_addr macro needs to be defined
> for all architectures.
> 
> Define it as a noop for architectures other than arm64.

Could you reword the above sentence? We are already starting off with
untagged_addr() not being a no-op for arm64 and sparc64, and that set
will potentially expand further. So something more along the lines of
"Define it as a no-op for architectures that do not support memory
tagging". The first paragraph of the log can also be rewritten so it is
not specific to arm64.

--
Khalid

> 
> Acked-by: Catalin Marinas 
> Signed-off-by: Andrey Konovalov 
> Signed-off-by: Christoph Hellwig 
> ---
>  include/linux/mm.h | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 0e8834ac32b7..949d43e9c0b6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -99,6 +99,10 @@ extern int mmap_rnd_compat_bits __read_mostly;
>  #include 
>  #include 
>  
> +#ifndef untagged_addr
> +#define untagged_addr(addr) (addr)
> +#endif
> +
>  #ifndef __pa_symbol
>  #define __pa_symbol(x)  __pa(RELOC_HIDE((unsigned long)(x), 0))
>  #endif
> 



Re: [PATCH v15 01/17] uaccess: add untagged_addr definition for other arches

2019-05-29 Thread Khalid Aziz
On Mon, 2019-05-06 at 18:30 +0200, Andrey Konovalov wrote:
> To allow arm64 syscalls to accept tagged pointers from userspace, we
> must
> untag them when they are passed to the kernel. Since untagging is
> done in
> generic parts of the kernel, the untagged_addr macro needs to be
> defined
> for all architectures.
> 
> Define it as a noop for architectures other than arm64.
> 
> Acked-by: Catalin Marinas 
> Signed-off-by: Andrey Konovalov 
> ---
>  include/linux/mm.h | 4 
>  1 file changed, 4 insertions(+)

As discussed in the other thread Chris started, there is a generic need
to untag addresses in the kernel, and this patch gets us ready for that.

Reviewed-by: Khalid Aziz 

> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 6b10c21630f5..44041df804a6 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -99,6 +99,10 @@ extern int mmap_rnd_compat_bits __read_mostly;
>  #include 
>  #include 
>  
> +#ifndef untagged_addr
> +#define untagged_addr(addr) (addr)
> +#endif
> +
>  #ifndef __pa_symbol
>  #define __pa_symbol(x)  __pa(RELOC_HIDE((unsigned long)(x), 0))
>  #endif



Re: [PATCH 4/6] mm: add a gup_fixup_start_addr hook

2019-05-28 Thread Khalid Aziz
On 5/25/19 11:05 AM, Linus Torvalds wrote:
> [ Adding Khalid, who added the sparc64 code ]
> 
> On Sat, May 25, 2019 at 6:32 AM Christoph Hellwig  wrote:
>>
>> This will allow sparc64 to override its ADI tags for
>> get_user_pages and get_user_pages_fast.  I have no idea why this
>> is not required for plain old get_user_pages, but it keeps the
>> existing sparc64 behavior.
> 
> This is actually generic. ARM64 has tagged pointers too. Right now the
> system call interfaces are all supposed to mask off the tags, but
> there's been noise about having the kernel understand them.
> 
> That said:
> 
>> +#ifndef gup_fixup_start_addr
>> +#define gup_fixup_start_addr(start)(start)
>> +#endif
> 
> I'd rather name this much more specifically (ie make it very much
> about "clean up pointer tags") and I'm also not clear on why sparc64
> actually wants this. I thought the sparc64 rules were the same as the
> (current) arm64 rules: any addresses passed to the kernel have to be
> the non-tagged ones.
> 
> As you say, nothing *else* in the kernel does that address cleanup,
> why should get_user_pages_fast() do it?
> 
> David? Khalid? Why does sparc64 actually need this? It looks like the
> generic get_user_pages() doesn't do it.
> 


There is another discussion going on about tagged pointers on ARM64 and
its intersection with the sparc64 code. I agree there is a generic need
to mask off tags for kernel use now that ARM64 is also looking into
supporting memory tagging. The need comes from sparc64 not storing
tagged addresses in VMAs. It is not practical to store tagged addresses
in VMAs because manipulation of address tags is done entirely in
userspace on sparc64. Userspace is free to change the tags on an address
range at any time without involving the kernel, and constantly rotating
tags is actually a security feature. This makes it impractical for the
kernel to try to keep up with constantly changing tagged addresses in
VMAs. Untagged addresses in VMAs mean that find_vma() and its brethren
need to be passed untagged addresses.

On sparc64, my intent was to support address tagging only for
dynamically allocated data buffers (malloc, mmap and shm specifically)
and not for generic system calls, which limited the scope and amount of
untagging needed in the kernel. ARM64 is working to add transparent
tagged address support at the C library level. Adding tagged addresses
to the C library requires every possible call into the kernel to either
handle tagged addresses or untag the address at some point. Andrey found
out it is not as easy as untagging addresses in the functions that
search through VMAs: callers of find_vma() and friends tend to do
address arithmetic on the address stored in the returned VMA. This
requires a more complex solution than just stripping tags in the VMA
lookup routines.
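
To make that concrete, here is a minimal sketch of the pattern being
discussed (not code from either series; addr_in_vma() is a made-up
helper, and the caller is assumed to hold mmap_sem for read):

#include <linux/mm.h>

/*
 * Sketch only: VMAs store untagged ranges, so both the lookup and any
 * arithmetic against vm_start/vm_end must use the untagged form of the
 * user pointer.
 */
static bool addr_in_vma(struct mm_struct *mm, unsigned long tagged_addr)
{
        unsigned long addr = untagged_addr(tagged_addr); /* strip the tag up front */
        struct vm_area_struct *vma = find_vma(mm, addr);

        return vma && addr >= vma->vm_start && addr < vma->vm_end;
}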

Since untagging addresses is a generic need that goes far beyond gup, I
prefer the way Andrey wrote it.


--
Khalid





Re: [PATCH] mm: remove unused variable

2019-03-12 Thread Khalid Aziz
On 3/12/19 7:28 AM, Bartosz Golaszewski wrote:
> From: Bartosz Golaszewski 
> 
> The mm variable is set but unused. Remove it.

It is used. Look further down for calls to set_pte_at().

--
Khalid

> 
> Signed-off-by: Bartosz Golaszewski 
> ---
>  mm/mprotect.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 028c724dcb1a..130dac3ad04f 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -39,7 +39,6 @@ static unsigned long change_pte_range(struct vm_area_struct 
> *vma, pmd_t *pmd,
>   unsigned long addr, unsigned long end, pgprot_t newprot,
>   int dirty_accountable, int prot_numa)
>  {
> - struct mm_struct *mm = vma->vm_mm;
>   pte_t *pte, oldpte;
>   spinlock_t *ptl;
>   unsigned long pages = 0;
> 




Re: [RFC PATCH v8 08/14] arm64/mm: disable section/contiguous mappings if XPFO is enabled

2019-02-15 Thread Khalid Aziz
On 2/15/19 6:09 AM, Mark Rutland wrote:
> Hi,
> 
> On Wed, Feb 13, 2019 at 05:01:31PM -0700, Khalid Aziz wrote:
>> From: Tycho Andersen 
>>
>> XPFO doesn't support section/contiguous mappings yet, so let's disable it
>> if XPFO is turned on.
>>
>> Thanks to Laura Abbot for the simplification from v5, and Mark Rutland for
>> pointing out we need NO_CONT_MAPPINGS too.
>>
>> CC: linux-arm-ker...@lists.infradead.org
>> Signed-off-by: Tycho Andersen 
>> Reviewed-by: Khalid Aziz 
> 
> There should be no point in this series where it's possible to enable a
> broken XPFO. Either this patch should be merged into the rest of the
> arm64 bits, or it should be placed before the rest of the arm64 bits.
> 
> That's a pre-requisite for merging, and it significantly reduces the
> burden on reviewers.
> 
> In general, a patch series should bisect cleanly. Could you please
> restructure the series to that effect?
> 
> Thanks,
> Mark.

That sounds reasonable to me. I will merge this with patch 5 ("arm64/mm:
Add support for XPFO") for the next version unless there are objections.

Thanks,
Khalid

> 
>> ---
>>  arch/arm64/mm/mmu.c  | 2 +-
>>  include/linux/xpfo.h | 4 
>>  mm/xpfo.c| 6 ++
>>  3 files changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
>> index d1d6601b385d..f4dd27073006 100644
>> --- a/arch/arm64/mm/mmu.c
>> +++ b/arch/arm64/mm/mmu.c
>> @@ -451,7 +451,7 @@ static void __init map_mem(pgd_t *pgdp)
>>  struct memblock_region *reg;
>>  int flags = 0;
>>  
>> -if (debug_pagealloc_enabled())
>> +if (debug_pagealloc_enabled() || xpfo_enabled())
>>  flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
>>  
>>  /*
>> diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
>> index 1ae05756344d..8b029918a958 100644
>> --- a/include/linux/xpfo.h
>> +++ b/include/linux/xpfo.h
>> @@ -47,6 +47,8 @@ void xpfo_temp_map(const void *addr, size_t size, void 
>> **mapping,
>>  void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
>>   size_t mapping_len);
>>  
>> +bool xpfo_enabled(void);
>> +
>>  #else /* !CONFIG_XPFO */
>>  
>>  static inline void xpfo_kmap(void *kaddr, struct page *page) { }
>> @@ -69,6 +71,8 @@ static inline void xpfo_temp_unmap(const void *addr, 
>> size_t size,
>>  }
>>  
>>  
>> +static inline bool xpfo_enabled(void) { return false; }
>> +
>>  #endif /* CONFIG_XPFO */
>>  
>>  #endif /* _LINUX_XPFO_H */
>> diff --git a/mm/xpfo.c b/mm/xpfo.c
>> index 92ca6d1baf06..150784ae0f08 100644
>> --- a/mm/xpfo.c
>> +++ b/mm/xpfo.c
>> @@ -71,6 +71,12 @@ struct page_ext_operations page_xpfo_ops = {
>>  .init = init_xpfo,
>>  };
>>  
>> +bool __init xpfo_enabled(void)
>> +{
>> +return !xpfo_disabled;
>> +}
>> +EXPORT_SYMBOL(xpfo_enabled);
>> +
>>  static inline struct xpfo *lookup_xpfo(struct page *page)
>>  {
>>  struct page_ext *page_ext = lookup_page_ext(page);
>> -- 
>> 2.17.1
>>




Re: [RFC PATCH v8 03/14] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)

2019-02-14 Thread Khalid Aziz
On 2/14/19 12:08 PM, Peter Zijlstra wrote:
> On Thu, Feb 14, 2019 at 10:13:54AM -0700, Khalid Aziz wrote:
> 
>> Patch 11 ("xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION") cleans
>> all this up. If the original authors of these two patches, Juerg
>> Haefliger and Julian Stecklina, are ok with it, I would like to combine
>> the two patches in one.
> 
> Don't preserve broken patches because of different authorship or
> whatever.
> 
> If you care you can say things like:
> 
>  Based-on-code-from:
>  Co-developed-by:
>  Originally-from:
> 
> or whatever other things there are. But individual patches should be
> correct and complete.
> 

That sounds reasonable. I will merge these two patches in the next version.

Thanks,
Khalid



Re: [RFC PATCH v8 13/14] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

2019-02-14 Thread Khalid Aziz
On 2/14/19 10:42 AM, Dave Hansen wrote:
>>  #endif
>> +
>> +/* If there is a pending TLB flush for this CPU due to XPFO
>> + * flush, do it now.
>> + */
> 
> Don't forget CodingStyle in all this, please.

Of course. I will fix that.

> 
>> +if (cpumask_test_and_clear_cpu(cpu, _xpfo_flush)) {
>> +count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
>> +__flush_tlb_all();
>> +}
> 
> This seems to exist in parallel with all of the cpu_tlbstate
> infrastructure.  Shouldn't it go in there?

That sounds like a good idea. On the other hand, the pending flush needs
to be tracked entirely within arch/x86/mm/tlb.c, and using a local
variable whose scope is limited to just that file feels like a
lighter-weight implementation. I could go either way.

> 
> Also, if we're doing full flushes like this, it seems a bit wasteful to
> then go and do later things like invalidate_user_asid() when we *know*
> that the asid would have been flushed by this operation.  I'm pretty
> sure this isn't the only __flush_tlb_all() callsite that does this, so
> it's not really criticism of this patch specifically.  It's more of a
> structural issue.
> 
> 

That is a good point. It is not just wasteful, it is bound to have a
performance impact, even if a slight one.

>> +void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
>> +{
> 
> This is a bit lightly commented.  Please give this some good
> descriptions about the logic behind the implementation and the tradeoffs
> that are in play.
> 
> This is doing a local flush, but deferring the flushes on all other
> processors, right?  Can you explain the logic behind that in a comment
> here, please?  This also has to be called with preemption disabled, right?
> 
>> +struct cpumask tmp_mask;
>> +
>> +/* Balance as user space task's flush, a bit conservative */
>> +if (end == TLB_FLUSH_ALL ||
>> +(end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
>> +do_flush_tlb_all(NULL);
>> +} else {
>> +struct flush_tlb_info info;
>> +
>> +info.start = start;
>> +info.end = end;
>> +do_kernel_range_flush();
>> +}
>> +cpumask_setall(_mask);
>> +cpumask_clear_cpu(smp_processor_id(), _mask);
>> +cpumask_or(_xpfo_flush, _xpfo_flush, _mask);
>> +}
> 
> Fun.  cpumask_setall() is non-atomic while cpumask_clear_cpu() and
> cpumask_or() *are* atomic.  The cpumask_clear_cpu() is operating on
> thread-local storage and doesn't need to be atomic.  Please make it
> __cpumask_clear_cpu().
> 

I will fix that. Thanks!
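
For reference, a sketch of one shape that fix could take, following
Dave's suggestion (the mask and function names are assumptions, since
the archive stripped the identifiers from the quoted hunk; the caller is
expected to run with preemption disabled, as in the flush path):

#include <linux/cpumask.h>
#include <linux/smp.h>

/* Assumed name for the file-local pending-flush mask. */
static struct cpumask pending_xpfo_flush;

static void xpfo_mark_remote_flush_pending(void)
{
        struct cpumask tmp_mask;

        cpumask_setall(&tmp_mask);
        /* tmp_mask is on our stack, so the non-atomic helper is enough. */
        __cpumask_clear_cpu(smp_processor_id(), &tmp_mask);
        /* The shared mask update stays as in the quoted hunk. */
        cpumask_or(&pending_xpfo_flush, &pending_xpfo_flush, &tmp_mask);
}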

--
Khalid



Re: [RFC PATCH v8 04/14] swiotlb: Map the buffer if it was unmapped by XPFO

2019-02-14 Thread Khalid Aziz
On 2/14/19 10:44 AM, Christoph Hellwig wrote:
> On Thu, Feb 14, 2019 at 09:56:24AM -0700, Khalid Aziz wrote:
>> On 2/14/19 12:47 AM, Christoph Hellwig wrote:
>>> On Wed, Feb 13, 2019 at 05:01:27PM -0700, Khalid Aziz wrote:
>>>> +++ b/kernel/dma/swiotlb.c
>>>> @@ -396,8 +396,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, 
>>>> phys_addr_t tlb_addr,
>>>>  {
>>>>unsigned long pfn = PFN_DOWN(orig_addr);
>>>>unsigned char *vaddr = phys_to_virt(tlb_addr);
>>>> +  struct page *page = pfn_to_page(pfn);
>>>>  
>>>> -  if (PageHighMem(pfn_to_page(pfn))) {
>>>> +  if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
>>>
>>> I think this just wants a page_unmapped or similar helper instead of
>>> needing the xpfo_page_is_unmapped check.  We actually have quite
>>> a few similar construct in the arch dma mapping code for architectures
>>> that require cache flushing.
>>
>> As I am not the original author of this patch, I am interpreting the
>> original intent. I think xpfo_page_is_unmapped() was added to account
>> for kernel build without CONFIG_XPFO. xpfo_page_is_unmapped() has an
>> alternate definition to return false if CONFIG_XPFO is not defined.
>> xpfo_is_unmapped() is cleaned up further in patch 11 ("xpfo, mm: remove
>> dependency on CONFIG_PAGE_EXTENSION") to a one-liner "return
>> PageXpfoUnmapped(page);". xpfo_is_unmapped() can be eliminated entirely
>> by adding an else clause to the following code added by that patch:
> 
> The point I'm making it that just about every PageHighMem() check
> before code that does a kmap* later needs to account for xpfo as well.
> 
> So instead of opencoding the above, be that using xpfo_page_is_unmapped
> or PageXpfoUnmapped, we really need one self-describing helper that
> checks if a page is unmapped for any reason and needs a kmap to access
> it.
> 

Understood. XpfoUnmapped is the state of a page while it is a free page.
When this page is allocated to userspace and userspace passes it back to
the kernel in a syscall, the kernel will always go through kmap to map
it temporarily anyway. When the page is freed back to the kernel, its
mapping in the physmap is restored. If the free page is allocated to the
kernel, its physmap entry is preserved. So I am inclined to say that a
page being XpfoUnmapped should not affect the need, or lack of need, for
kmap elsewhere. Does that make sense?
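
Restated as a compressed timeline, this is roughly the lifecycle
(pseudocode sketch only; xpfo_alloc_pages() and xpfo_free_pages() are
invoked from the page allocator in this series, not called directly like
this):

#include <linux/gfp.h>
#include <linux/highmem.h>
#include <linux/xpfo.h>

static void xpfo_lifecycle_sketch(struct page *page)
{
        void *kaddr;

        /* 1. Page goes to userspace: it is marked XpfoUser and dropped
         *    from the kernel linear map by the page allocator. */
        xpfo_alloc_pages(page, 0, GFP_HIGHUSER);

        /* 2. Userspace passes the page back in a syscall: the kernel only
         *    touches it through kmap, which restores a mapping temporarily. */
        kaddr = kmap_atomic(page);
        /* ... access the user data through kaddr ... */
        kunmap_atomic(kaddr);

        /* 3. Page is freed back to the kernel: its physmap entry is
         *    restored, and a later kernel allocation keeps it mapped. */
        xpfo_free_pages(page, 0);
}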

Thanks,
Khalid



Re: [RFC PATCH v8 07/14] arm64/mm, xpfo: temporarily map dcache regions

2019-02-14 Thread Khalid Aziz
On 2/14/19 8:54 AM, Tycho Andersen wrote:
> Hi,
> 
> On Wed, Feb 13, 2019 at 05:01:30PM -0700, Khalid Aziz wrote:
>> From: Juerg Haefliger 
>>
>> If the page is unmapped by XPFO, a data cache flush results in a fatal
>> page fault, so let's temporarily map the region, flush the cache, and then
>> unmap it.
>>
>> v6: actually flush in the face of xpfo, and temporarily map the underlying
>> memory so it can be flushed correctly
>>
>> CC: linux-arm-ker...@lists.infradead.org
>> Signed-off-by: Juerg Haefliger 
>> Signed-off-by: Tycho Andersen 
>> ---
>>  arch/arm64/mm/flush.c | 7 +++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
>> index 30695a868107..fad09aafd9d5 100644
>> --- a/arch/arm64/mm/flush.c
>> +++ b/arch/arm64/mm/flush.c
>> @@ -20,6 +20,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -28,9 +29,15 @@
>>  void sync_icache_aliases(void *kaddr, unsigned long len)
>>  {
>>  unsigned long addr = (unsigned long)kaddr;
>> +unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
>> +void *mapping[num_pages];
> 
> What version does this build on? Presumably -Wvla will cause an error
> here, but,
> 
>>  if (icache_is_aliasing()) {
>> +xpfo_temp_map(kaddr, len, mapping,
>> +  sizeof(mapping[0]) * num_pages);
>>  __clean_dcache_area_pou(kaddr, len);
> 
> Here, we map the pages to some random address via xpfo_temp_map(),
> then pass the *original* address (which may not have been mapped) to
> __clean_dcache_area_pou(). So I think this whole approach is wrong.
> 
> If we want to do it this way, it may be that we need some
> xpfo_map_contiguous() type thing, but since we're just going to flush
> it anyway, that seems a little crazy. Maybe someone who knows more
> about arm64 knows a better way?
> 
> Tycho
> 

Hi Tycho,

You are right. Things don't quite look right with this patch. I don't
know arm64 well enough either, so I will wait for someone more
knowledgeable to make a recommendation here.

On a side note, do you mind if I update your address in your
signed-off-by from ty...@docker.com when I send the next version of this
series?

Thanks,
Khalid




Re: [RFC PATCH v8 03/14] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)

2019-02-14 Thread Khalid Aziz
On 2/14/19 9:15 AM, Borislav Petkov wrote:
> On Thu, Feb 14, 2019 at 11:56:31AM +0100, Peter Zijlstra wrote:
>>> +EXPORT_SYMBOL(xpfo_kunmap);
>>
>> And these here things are most definitely not IRQ-safe.
> 
> Should also be EXPORT_SYMBOL_GPL.
> 

Agreed. On the other hand, is there even a need to export this? It
should only be called from kunmap() or kunmap_atomic() and not from any
module directly. Same for xpfo_kmap.

Thanks,
Khalid




Re: [RFC PATCH v8 03/14] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)

2019-02-14 Thread Khalid Aziz
On 2/14/19 3:56 AM, Peter Zijlstra wrote:
> On Wed, Feb 13, 2019 at 05:01:26PM -0700, Khalid Aziz wrote:
>>  static inline void *kmap_atomic(struct page *page)
>>  {
>> +void *kaddr;
>> +
>>  preempt_disable();
>>  pagefault_disable();
>> +kaddr = page_address(page);
>> +xpfo_kmap(kaddr, page);
>> +return kaddr;
>>  }
>>  #define kmap_atomic_prot(page, prot)kmap_atomic(page)
>>  
>>  static inline void __kunmap_atomic(void *addr)
>>  {
>> +xpfo_kunmap(addr, virt_to_page(addr));
>>  pagefault_enable();
>>  preempt_enable();
>>  }
> 
> How is that supposed to work; IIRC kmap_atomic was supposed to be
> IRQ-safe.
> 

Ah, the spin_lock in xpfo_kmap() can be problematic in interrupt
context. I will see if I can fix that.

Juerg, you wrote the original code and understand what you were trying
to do here. If you have ideas on how to tackle this, I would very much
appreciate it.
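
One possible direction, sketched here only as a starting point rather
than the committed fix, is to switch the per-page lock to the irqsave
variants so the map path stays usable from kmap_atomic() in interrupt
context (identifiers reconstructed from the quoted hunk, where the
archive dropped the '&' characters):

/* Possible shape of an IRQ-safe xpfo_kmap(); sketch only. */
void xpfo_kmap(void *kaddr, struct page *page)
{
        struct xpfo *xpfo;
        unsigned long flags;

        if (!static_branch_unlikely(&xpfo_inited))
                return;

        xpfo = lookup_xpfo(page);
        if (!xpfo || unlikely(!xpfo->inited) ||
            !test_bit(XPFO_PAGE_USER, &xpfo->flags))
                return;

        spin_lock_irqsave(&xpfo->maplock, flags);
        /* Map the page back into the kernel; no TLB flush required. */
        if ((atomic_inc_return(&xpfo->mapcount) == 1) &&
            test_and_clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags))
                set_kpte(kaddr, page, PAGE_KERNEL);
        spin_unlock_irqrestore(&xpfo->maplock, flags);
}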

>> +/* Per-page XPFO house-keeping data */
>> +struct xpfo {
>> +unsigned long flags;/* Page state */
>> +bool inited;/* Map counter and lock initialized */
> 
> What's sizeof(_Bool) ? Why can't you use a bit in that flags word?
> 
>> +atomic_t mapcount;  /* Counter for balancing map/unmap requests */
>> +spinlock_t maplock; /* Lock to serialize map/unmap requests */
>> +};
> 
> Without that bool, the structure would be 16 bytes on 64bit, which seems
> like a good number.
> 

Patch 11 ("xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION") cleans
all this up. If the original authors of these two patches, Juerg
Haefliger and Julian Stecklina, are ok with it, I would like to combine
the two patches in one.

>> +void xpfo_kmap(void *kaddr, struct page *page)
>> +{
>> +struct xpfo *xpfo;
>> +
>> +if (!static_branch_unlikely(_inited))
>> +return;
>> +
>> +xpfo = lookup_xpfo(page);
>> +
>> +/*
>> + * The page was allocated before page_ext was initialized (which means
>> + * it's a kernel page) or it's allocated to the kernel, so nothing to
>> + * do.
>> + */
>> +if (!xpfo || unlikely(!xpfo->inited) ||
>> +!test_bit(XPFO_PAGE_USER, >flags))
>> +return;
>> +
>> +spin_lock(>maplock);
>> +
>> +/*
>> + * The page was previously allocated to user space, so map it back
>> + * into the kernel. No TLB flush required.
>> + */
>> +if ((atomic_inc_return(>mapcount) == 1) &&
>> +test_and_clear_bit(XPFO_PAGE_UNMAPPED, >flags))
>> +set_kpte(kaddr, page, PAGE_KERNEL);
>> +
>> +spin_unlock(>maplock);
>> +}
>> +EXPORT_SYMBOL(xpfo_kmap);
>> +
>> +void xpfo_kunmap(void *kaddr, struct page *page)
>> +{
>> +struct xpfo *xpfo;
>> +
>> +if (!static_branch_unlikely(_inited))
>> +return;
>> +
>> +xpfo = lookup_xpfo(page);
>> +
>> +/*
>> + * The page was allocated before page_ext was initialized (which means
>> + * it's a kernel page) or it's allocated to the kernel, so nothing to
>> + * do.
>> + */
>> +if (!xpfo || unlikely(!xpfo->inited) ||
>> +!test_bit(XPFO_PAGE_USER, >flags))
>> +return;
>> +
>> +spin_lock(>maplock);
>> +
>> +/*
>> + * The page is to be allocated back to user space, so unmap it from the
>> + * kernel, flush the TLB and tag it as a user page.
>> + */
>> +if (atomic_dec_return(>mapcount) == 0) {
>> +WARN(test_bit(XPFO_PAGE_UNMAPPED, >flags),
>> + "xpfo: unmapping already unmapped page\n");
>> +set_bit(XPFO_PAGE_UNMAPPED, >flags);
>> +set_kpte(kaddr, page, __pgprot(0));
>> +xpfo_flush_kernel_tlb(page, 0);
>> +}
>> +
>> +spin_unlock(>maplock);
>> +}
>> +EXPORT_SYMBOL(xpfo_kunmap);
> 
> And these here things are most definitely not IRQ-safe.
> 

Got it. I will work on this.

Thanks,
Khalid





Re: [RFC PATCH v8 04/14] swiotlb: Map the buffer if it was unmapped by XPFO

2019-02-14 Thread Khalid Aziz
On 2/14/19 12:47 AM, Christoph Hellwig wrote:
> On Wed, Feb 13, 2019 at 05:01:27PM -0700, Khalid Aziz wrote:
>> +++ b/kernel/dma/swiotlb.c
>> @@ -396,8 +396,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, 
>> phys_addr_t tlb_addr,
>>  {
>>  unsigned long pfn = PFN_DOWN(orig_addr);
>>  unsigned char *vaddr = phys_to_virt(tlb_addr);
>> +struct page *page = pfn_to_page(pfn);
>>  
>> -if (PageHighMem(pfn_to_page(pfn))) {
>> +if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
> 
> I think this just wants a page_unmapped or similar helper instead of
> needing the xpfo_page_is_unmapped check.  We actually have quite
> a few similar construct in the arch dma mapping code for architectures
> that require cache flushing.

As I am not the original author of this patch, I am interpreting the
original intent. I think xpfo_page_is_unmapped() was added to account
for kernel build without CONFIG_XPFO. xpfo_page_is_unmapped() has an
alternate definition to return false if CONFIG_XPFO is not defined.
xpfo_is_unmapped() is cleaned up further in patch 11 ("xpfo, mm: remove
dependency on CONFIG_PAGE_EXTENSION") to a one-liner "return
PageXpfoUnmapped(page);". xpfo_is_unmapped() can be eliminated entirely
by adding an else clause to the following code added by that patch:

--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -398,6 +402,15 @@ TESTCLEARFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif

+#ifdef CONFIG_XPFO
+PAGEFLAG(XpfoUser, xpfo_user, PF_ANY)
+TESTCLEARFLAG(XpfoUser, xpfo_user, PF_ANY)
+TESTSETFLAG(XpfoUser, xpfo_user, PF_ANY)
+PAGEFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+TESTCLEARFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+TESTSETFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+#endif
+
 /*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;


Adding the following #else clause to the above conditional:

#else
TESTPAGEFLAG_FALSE(XpfoUser)
TESTPAGEFLAG_FALSE(XpfoUnmapped)

should allow us to eliminate xpfo_is_unmapped(). Right?
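
With those fallbacks in place, the swiotlb hunk quoted earlier could
test the page flag through one small self-describing helper along the
lines you suggest (the helper name below is made up for this sketch):

#include <linux/page-flags.h>

/*
 * Sketch: PageXpfoUnmapped() compiles to 'false' when CONFIG_XPFO is off
 * (via the TESTPAGEFLAG_FALSE() fallback above), so no separate
 * xpfo_page_is_unmapped() wrapper is needed.
 */
static inline bool page_needs_kmap(struct page *page)
{
        return PageHighMem(page) || PageXpfoUnmapped(page);
}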

Thanks,
Khalid

> 
>> +bool xpfo_page_is_unmapped(struct page *page)
>> +{
>> +struct xpfo *xpfo;
>> +
>> +if (!static_branch_unlikely(_inited))
>> +return false;
>> +
>> +xpfo = lookup_xpfo(page);
>> +if (unlikely(!xpfo) && !xpfo->inited)
>> +return false;
>> +
>> +return test_bit(XPFO_PAGE_UNMAPPED, >flags);
>> +}
>> +EXPORT_SYMBOL(xpfo_page_is_unmapped);
> 
> And at least for swiotlb there is no need to export this helper,
> as it is always built in.
> 





[RFC PATCH v8 01/14] mm: add MAP_HUGETLB support to vm_mmap

2019-02-13 Thread Khalid Aziz
From: Tycho Andersen 

vm_mmap is exported, which means kernel modules can use it. In particular,
for testing XPFO support, we want to use it with the MAP_HUGETLB flag, so
let's support it via vm_mmap.

Signed-off-by: Tycho Andersen 
Tested-by: Marco Benatto 
Tested-by: Khalid Aziz 
---
 include/linux/mm.h |  2 ++
 mm/mmap.c  | 19 +--
 mm/util.c  | 32 
 3 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..30bddc7b3c75 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2361,6 +2361,8 @@ struct vm_unmapped_area_info {
 extern unsigned long unmapped_area(struct vm_unmapped_area_info *info);
 extern unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info);
 
+struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags);
+
 /*
  * Search for an unmapped address range.
  *
diff --git a/mm/mmap.c b/mm/mmap.c
index 6c04292e16a7..c668d7d27c2b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1582,24 +1582,7 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, 
unsigned long len,
if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
goto out_fput;
} else if (flags & MAP_HUGETLB) {
-   struct user_struct *user = NULL;
-   struct hstate *hs;
-
-   hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
-   if (!hs)
-   return -EINVAL;
-
-   len = ALIGN(len, huge_page_size(hs));
-   /*
-* VM_NORESERVE is used because the reservations will be
-* taken when vm_ops->mmap() is called
-* A dummy user value is used because we are not locking
-* memory so no accounting is necessary
-*/
-   file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
-   VM_NORESERVE,
-   , HUGETLB_ANONHUGE_INODE,
-   (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+   file = map_hugetlb_setup(, flags);
if (IS_ERR(file))
return PTR_ERR(file);
}
diff --git a/mm/util.c b/mm/util.c
index 8bf08b5b5760..536c14cf88ba 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -357,6 +357,29 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned 
long addr,
return ret;
 }
 
+struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags)
+{
+   struct user_struct *user = NULL;
+   struct hstate *hs;
+
+   hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+   if (!hs)
+   return ERR_PTR(-EINVAL);
+
+   *len = ALIGN(*len, huge_page_size(hs));
+
+   /*
+* VM_NORESERVE is used because the reservations will be
+* taken when vm_ops->mmap() is called
+* A dummy user value is used because we are not locking
+* memory so no accounting is necessary
+*/
+   return hugetlb_file_setup(HUGETLB_ANON_FILE, *len,
+   VM_NORESERVE,
+   , HUGETLB_ANONHUGE_INODE,
+   (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+}
+
 unsigned long vm_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long offset)
@@ -366,6 +389,15 @@ unsigned long vm_mmap(struct file *file, unsigned long 
addr,
if (unlikely(offset_in_page(offset)))
return -EINVAL;
 
+   if (flag & MAP_HUGETLB) {
+   if (file)
+   return -EINVAL;
+
+   file = map_hugetlb_setup(, flag);
+   if (IS_ERR(file))
+   return PTR_ERR(file);
+   }
+
return vm_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT);
 }
 EXPORT_SYMBOL(vm_mmap);
-- 
2.17.1



[RFC PATCH v8 03/14] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)

2019-02-13 Thread Khalid Aziz
From: Juerg Haefliger 

This patch adds support for XPFO which protects against 'ret2dir' kernel
attacks. The basic idea is to enforce exclusive ownership of page frames
by either the kernel or userspace, unless explicitly requested by the
kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (the kernel's page table). When such a page is
reclaimed from userspace, it is mapped back to physmap.

Additional fields in the page_ext struct are used for XPFO housekeeping,
specifically:
  - two flags to distinguish user vs. kernel pages and to tag unmapped
pages.
  - a reference counter to balance kmap/kunmap operations.
  - a lock to serialize access to the XPFO fields.

This patch is based on the work of Vasileios P. Kemerlis et al. who
published their work in this paper:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf

v6: * use flush_tlb_kernel_range() instead of __flush_tlb_one, so we flush
  the tlb entry on all CPUs when unmapping it in kunmap
* handle lookup_page_ext()/lookup_xpfo() returning NULL
* drop lots of BUG()s in favor of WARN()
* don't disable irqs in xpfo_kmap/xpfo_kunmap, export
  __split_large_page so we can do our own alloc_pages(GFP_ATOMIC) to
  pass it

CC: x...@kernel.org
Suggested-by: Vasileios P. Kemerlis 
Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
Signed-off-by: Marco Benatto 
[jstec...@amazon.de: rebased from v4.13 to v4.19]
Signed-off-by: Julian Stecklina 
Reviewed-by: Khalid Aziz 
---
 .../admin-guide/kernel-parameters.txt |   2 +
 arch/x86/Kconfig  |   1 +
 arch/x86/include/asm/pgtable.h|  26 ++
 arch/x86/mm/Makefile  |   2 +
 arch/x86/mm/pageattr.c|  23 +-
 arch/x86/mm/xpfo.c| 119 ++
 include/linux/highmem.h   |  15 +-
 include/linux/xpfo.h  |  48 
 mm/Makefile   |   1 +
 mm/page_alloc.c   |   2 +
 mm/page_ext.c |   4 +
 mm/xpfo.c | 223 ++
 security/Kconfig  |  19 ++
 13 files changed, 463 insertions(+), 22 deletions(-)
 create mode 100644 arch/x86/mm/xpfo.c
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index aefd358a5ca3..c4c62599f216 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2982,6 +2982,8 @@
 
nox2apic[X86-64,APIC] Do not enable x2APIC mode.
 
+   noxpfo  [X86-64] Disable XPFO when CONFIG_XPFO is on.
+
cpu0_hotplug[X86] Turn on CPU0 hotplug feature when
CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
Some features depend on CPU0. Known dependencies are:
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8689e794a43c..d69d8cc6e57e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -207,6 +207,7 @@ config X86
select USER_STACKTRACE_SUPPORT
select VIRT_TO_BUS
select X86_FEATURE_NAMESif PROC_FS
+   select ARCH_SUPPORTS_XPFO   if X86_64
 
 config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 40616e805292..f6eeb75c8a21 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1437,6 +1437,32 @@ static inline bool arch_has_pfn_modify_check(void)
return boot_cpu_has_bug(X86_BUG_L1TF);
 }
 
+/*
+ * The current flushing context - we pass it instead of 5 arguments:
+ */
+struct cpa_data {
+   unsigned long   *vaddr;
+   pgd_t   *pgd;
+   pgprot_tmask_set;
+   pgprot_tmask_clr;
+   unsigned long   numpages;
+   int flags;
+   unsigned long   pfn;
+   unsignedforce_split : 1,
+   force_static_prot   : 1;
+   int curpage;
+   struct page **pages;
+};
+
+
+int
+should_split_large_page(pte_t *kpte, unsigned long address,
+   struct cpa_data *cpa);
+extern spinlock_t cpa_lock;
+int
+__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
+  struct page *base);
+
 #include 
 #endif /* __ASSEMBLY__ */
 
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 4b101dd6e52f..93b0fdaf4a99 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -53,3 +53,5 @@ obj-$(CONFIG_PAGE_TABLE_ISOLATION)+= pti.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt_boot.o
+
+obj-$(CONFIG_XPFO

[RFC PATCH v8 11/14] xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION

2019-02-13 Thread Khalid Aziz
From: Julian Stecklina 

Instead of using the page extension debug feature, encode all the
information we need for XPFO in struct page. This allows us to get rid
of some checks in the hot paths, and there are also no longer any pages
that are allocated before XPFO is enabled.

Also make the debugging aids configurable for maximum performance.

Signed-off-by: Julian Stecklina 
Cc: x...@kernel.org
Cc: kernel-harden...@lists.openwall.com
Cc: Vasileios P. Kemerlis 
Cc: Juerg Haefliger 
Cc: Tycho Andersen 
Cc: Marco Benatto 
Cc: David Woodhouse 
Reviewed-by: Khalid Aziz 
---
 include/linux/mm_types.h   |   8 ++
 include/linux/page-flags.h |  13 +++
 include/linux/xpfo.h   |   3 +-
 include/trace/events/mmflags.h |  10 ++-
 mm/page_alloc.c|   3 +-
 mm/page_ext.c  |   4 -
 mm/xpfo.c  | 159 -
 security/Kconfig   |  12 ++-
 8 files changed, 80 insertions(+), 132 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2c471a2c43fa..d17d33f36a01 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -204,6 +204,14 @@ struct page {
 #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
int _last_cpupid;
 #endif
+
+#ifdef CONFIG_XPFO
+   /* Counts the number of times this page has been kmapped. */
+   atomic_t xpfo_mapcount;
+
+   /* Serialize kmap/kunmap of this page */
+   spinlock_t xpfo_lock;
+#endif
 } _struct_page_alignment;
 
 /*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 50ce1bddaf56..a532063f27b5 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -101,6 +101,10 @@ enum pageflags {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
PG_young,
PG_idle,
+#endif
+#ifdef CONFIG_XPFO
+   PG_xpfo_user,   /* Page is allocated to user-space */
+   PG_xpfo_unmapped,   /* Page is unmapped from the linear map */
 #endif
__NR_PAGEFLAGS,
 
@@ -398,6 +402,15 @@ TESTCLEARFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
+#ifdef CONFIG_XPFO
+PAGEFLAG(XpfoUser, xpfo_user, PF_ANY)
+TESTCLEARFLAG(XpfoUser, xpfo_user, PF_ANY)
+TESTSETFLAG(XpfoUser, xpfo_user, PF_ANY)
+PAGEFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+TESTCLEARFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+TESTSETFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+#endif
+
 /*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 117869991d5b..1dd590ff1a1f 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -28,7 +28,7 @@ struct page;
 
 #include 
 
-extern struct page_ext_operations page_xpfo_ops;
+void xpfo_init_single_page(struct page *page);
 
 void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
 void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,
@@ -57,6 +57,7 @@ phys_addr_t user_virt_to_phys(unsigned long addr);
 
 #else /* !CONFIG_XPFO */
 
+static inline void xpfo_init_single_page(struct page *page) { }
 static inline void xpfo_kmap(void *kaddr, struct page *page) { }
 static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
 static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { 
}
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a1675d43777e..6bb000bb366f 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -79,6 +79,12 @@
 #define IF_HAVE_PG_IDLE(flag,string)
 #endif
 
+#ifdef CONFIG_XPFO
+#define IF_HAVE_PG_XPFO(flag,string) ,{1UL << flag, string}
+#else
+#define IF_HAVE_PG_XPFO(flag,string)
+#endif
+
 #define __def_pageflag_names   \
{1UL << PG_locked,  "locked"},  \
{1UL << PG_waiters, "waiters"   },  \
@@ -105,7 +111,9 @@ IF_HAVE_PG_MLOCK(PG_mlocked,"mlocked"   
)   \
 IF_HAVE_PG_UNCACHED(PG_uncached,   "uncached"  )   \
 IF_HAVE_PG_HWPOISON(PG_hwpoison,   "hwpoison"  )   \
 IF_HAVE_PG_IDLE(PG_young,  "young" )   \
-IF_HAVE_PG_IDLE(PG_idle,   "idle"  )
+IF_HAVE_PG_IDLE(PG_idle,   "idle"  )   \
+IF_HAVE_PG_XPFO(PG_xpfo_user,  "xpfo_user" )   \
+IF_HAVE_PG_XPFO(PG_xpfo_unmapped,  "xpfo_unmapped" )   \
 
 #define show_page_flags(flags) \
(flags) ? __print_flags(flags, "|", \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 08e277790b5f..d00382b20001 100644
--- a/mm/page_alloc.c
+++ b/mm/pa

[RFC PATCH v8 14/14] xpfo, mm: Optimize XPFO TLB flushes by batching them together

2019-02-13 Thread Khalid Aziz
When XPFO forces a TLB flush on all cores, the performance impact is
very significant. Batching as many of these TLB updates as
possible can help lower this impact. When userspace allocates a
page, the kernel tries to get that page from the per-cpu free list.
This free list is replenished in bulk when it runs low. The point where
the free list is replenished for future allocation to userspace is a
good opportunity to update TLB entries in batch and reduce the
impact of multiple TLB flushes later. This patch adds new tags for
the page so a page can be marked as available for userspace
allocation and unmapped from the kernel address space. All such pages
are removed from the kernel address space in bulk at the time they are
added to the per-cpu free list. This patch, when combined with deferred
TLB flushes, improves performance further. Using the same benchmark
as before of building the kernel in parallel, here are the system
times on two differently sized systems:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

4.20                                      950.966s
4.20+XPFO                               25073.169s   26.366x
4.20+XPFO+Deferred flush                 1372.874s    1.44x
4.20+XPFO+Deferred flush+Batch update    1255.021s    1.32x

Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

4.20                                      607.671s
4.20+XPFO                                1588.646s    2.614x
4.20+XPFO+Deferred flush                  803.989s    1.32x
4.20+XPFO+Deferred flush+Batch update     795.728s    1.31x

Signed-off-by: Khalid Aziz 
Signed-off-by: Tycho Andersen 
---
 arch/x86/mm/xpfo.c |  5 +
 include/linux/page-flags.h |  5 -
 include/linux/xpfo.h   |  8 
 mm/page_alloc.c|  4 
 mm/xpfo.c  | 35 +--
 5 files changed, 54 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
index d3833532bfdc..fb06bb3cb718 100644
--- a/arch/x86/mm/xpfo.c
+++ b/arch/x86/mm/xpfo.c
@@ -87,6 +87,11 @@ inline void set_kpte(void *kaddr, struct page *page, 
pgprot_t prot)
 
 }
 
+void xpfo_flush_tlb_all(void)
+{
+   xpfo_flush_tlb_kernel_range(0, TLB_FLUSH_ALL);
+}
+
 inline void xpfo_flush_kernel_tlb(struct page *page, int order)
 {
int level;
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index a532063f27b5..fdf7e14cbc96 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -406,9 +406,11 @@ PAGEFLAG(Idle, idle, PF_ANY)
 PAGEFLAG(XpfoUser, xpfo_user, PF_ANY)
 TESTCLEARFLAG(XpfoUser, xpfo_user, PF_ANY)
 TESTSETFLAG(XpfoUser, xpfo_user, PF_ANY)
+#define __PG_XPFO_USER (1UL << PG_xpfo_user)
 PAGEFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
 TESTCLEARFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
 TESTSETFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+#define __PG_XPFO_UNMAPPED (1UL << PG_xpfo_unmapped)
 #endif
 
 /*
@@ -787,7 +789,8 @@ static inline void ClearPageSlabPfmemalloc(struct page 
*page)
  * alloc-free cycle to prevent from reusing the page.
  */
 #define PAGE_FLAGS_CHECK_AT_PREP   \
-   (((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON)
+   (((1UL << NR_PAGEFLAGS) - 1) & ~__PG_HWPOISON & ~__PG_XPFO_USER & \
+   ~__PG_XPFO_UNMAPPED)
 
 #define PAGE_FLAGS_PRIVATE \
(1UL << PG_private | 1UL << PG_private_2)
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 1dd590ff1a1f..c4f6c99e7380 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -34,6 +34,7 @@ void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
 void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,
enum dma_data_direction dir);
 void xpfo_flush_kernel_tlb(struct page *page, int order);
+void xpfo_flush_tlb_all(void);
 
 void xpfo_kmap(void *kaddr, struct page *page);
 void xpfo_kunmap(void *kaddr, struct page *page);
@@ -55,6 +56,8 @@ bool xpfo_enabled(void);
 
 phys_addr_t user_virt_to_phys(unsigned long addr);
 
+bool xpfo_pcp_refill(struct page *page, enum migratetype migratetype,
+int order);
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_init_single_page(struct page *page) { }
@@ -82,6 +85,11 @@ static inline bool xpfo_enabled(void) { return false; }
 
 static inline phys_addr_t user_virt_to_phys(unsigned long addr) { return 0; }
 
+static inline bool xpfo_pcp_refill(struct page *page,
+  enum migratetype migratetype, int order)
+{
+}
+
 #endif /* CONFIG_XPFO */
 
 #endif /* _LINUX_XPFO_H */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d00382b20001..5702b6fa435c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2478,6 +2478,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int 
order,
int migratetype)
 {
int i, alloced = 0;
+

[RFC PATCH v8 12/14] xpfo, mm: optimize spinlock usage in xpfo_kunmap

2019-02-13 Thread Khalid Aziz
From: Julian Stecklina 

Only the xpfo_kunmap call that actually needs to unmap the page
needs to be serialized. We need to be careful to handle the case
where, after the atomic decrement of the mapcount, an xpfo_kmap
has increased the mapcount again. In that case, we can safely skip
modifying the page table.

Model-checked with up to 4 concurrent callers with Spin.

Signed-off-by: Julian Stecklina 
Signed-off-by: Khalid Aziz 
Cc: x...@kernel.org
Cc: kernel-harden...@lists.openwall.com
Cc: Vasileios P. Kemerlis 
Cc: Juerg Haefliger 
Cc: Tycho Andersen 
Cc: Marco Benatto 
Cc: David Woodhouse 
---
 mm/xpfo.c | 25 -
 1 file changed, 16 insertions(+), 9 deletions(-)

diff --git a/mm/xpfo.c b/mm/xpfo.c
index dc03c423c52f..5157cbebce4b 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -124,28 +124,35 @@ EXPORT_SYMBOL(xpfo_kmap);
 
 void xpfo_kunmap(void *kaddr, struct page *page)
 {
+   bool flush_tlb = false;
+
if (!static_branch_unlikely(_inited))
return;
 
if (!PageXpfoUser(page))
return;
 
-   spin_lock(>xpfo_lock);
-
/*
 * The page is to be allocated back to user space, so unmap it from the
 * kernel, flush the TLB and tag it as a user page.
 */
if (atomic_dec_return(>xpfo_mapcount) == 0) {
-#ifdef CONFIG_XPFO_DEBUG
-   BUG_ON(PageXpfoUnmapped(page));
-#endif
-   SetPageXpfoUnmapped(page);
-   set_kpte(kaddr, page, __pgprot(0));
-   xpfo_flush_kernel_tlb(page, 0);
+   spin_lock(>xpfo_lock);
+
+   /*
+* In the case, where we raced with kmap after the
+* atomic_dec_return, we must not nuke the mapping.
+*/
+   if (atomic_read(>xpfo_mapcount) == 0) {
+   SetPageXpfoUnmapped(page);
+   set_kpte(kaddr, page, __pgprot(0));
+   flush_tlb = true;
+   }
+   spin_unlock(>xpfo_lock);
}
 
-   spin_unlock(>xpfo_lock);
+   if (flush_tlb)
+   xpfo_flush_kernel_tlb(page, 0);
 }
 EXPORT_SYMBOL(xpfo_kunmap);
 
-- 
2.17.1



[RFC PATCH v8 02/14] x86: always set IF before oopsing from page fault

2019-02-13 Thread Khalid Aziz
From: Tycho Andersen 

Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
that might sleep:

Aug 23 19:30:27 xpfo kernel: [   38.302714] BUG: sleeping function called from 
invalid context at ./include/linux/percpu-rwsem.h:33
Aug 23 19:30:27 xpfo kernel: [   38.303837] in_atomic(): 0, irqs_disabled(): 1, 
pid: 1970, name: lkdtm_xpfo_test
Aug 23 19:30:27 xpfo kernel: [   38.304758] CPU: 3 PID: 1970 Comm: 
lkdtm_xpfo_test Tainted: G  D 4.13.0-rc5+ #228
Aug 23 19:30:27 xpfo kernel: [   38.305813] Hardware name: QEMU Standard PC 
(i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
Aug 23 19:30:27 xpfo kernel: [   38.306926] Call Trace:
Aug 23 19:30:27 xpfo kernel: [   38.307243]  dump_stack+0x63/0x8b
Aug 23 19:30:27 xpfo kernel: [   38.307665]  ___might_sleep+0xec/0x110
Aug 23 19:30:27 xpfo kernel: [   38.308139]  __might_sleep+0x45/0x80
Aug 23 19:30:27 xpfo kernel: [   38.308593]  exit_signals+0x21/0x1c0
Aug 23 19:30:27 xpfo kernel: [   38.309046]  ? 
blocking_notifier_call_chain+0x11/0x20
Aug 23 19:30:27 xpfo kernel: [   38.309677]  do_exit+0x98/0xbf0
Aug 23 19:30:27 xpfo kernel: [   38.310078]  ? smp_reader+0x27/0x40 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.310604]  ? kthread+0x10f/0x150
Aug 23 19:30:27 xpfo kernel: [   38.311045]  ? read_user_with_flags+0x60/0x60 
[lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.311680]  rewind_stack_do_exit+0x17/0x20

To be safe, let's just always enable irqs.

The particular case I'm hitting is:

Aug 23 19:30:27 xpfo kernel: [   38.278615]  __bad_area_nosemaphore+0x1a9/0x1d0
Aug 23 19:30:27 xpfo kernel: [   38.278617]  bad_area_nosemaphore+0xf/0x20
Aug 23 19:30:27 xpfo kernel: [   38.278618]  __do_page_fault+0xd1/0x540
Aug 23 19:30:27 xpfo kernel: [   38.278620]  ? irq_work_queue+0x9b/0xb0
Aug 23 19:30:27 xpfo kernel: [   38.278623]  ? wake_up_klogd+0x36/0x40
Aug 23 19:30:27 xpfo kernel: [   38.278624]  trace_do_page_fault+0x3c/0xf0
Aug 23 19:30:27 xpfo kernel: [   38.278625]  do_async_page_fault+0x14/0x60
Aug 23 19:30:27 xpfo kernel: [   38.278627]  async_page_fault+0x28/0x30

When a fault is in kernel space which has been triggered by XPFO.

Signed-off-by: Tycho Andersen 
CC: x...@kernel.org
Tested-by: Khalid Aziz 
---
 arch/x86/mm/fault.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 71d4b9d4d43f..ba51652fbd33 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -748,6 +748,12 @@ no_context(struct pt_regs *regs, unsigned long error_code,
/* Executive summary in case the body of the oops scrolled away */
printk(KERN_DEFAULT "CR2: %016lx\n", address);
 
+   /*
+* We're about to oops, which might kill the task. Make sure we're
+* allowed to sleep.
+*/
+   flags |= X86_EFLAGS_IF;
+
oops_end(flags, regs, sig);
 }
 
-- 
2.17.1



[RFC PATCH v8 04/14] swiotlb: Map the buffer if it was unmapped by XPFO

2019-02-13 Thread Khalid Aziz
From: Juerg Haefliger 

v6: * guard against lookup_xpfo() returning NULL

CC: Konrad Rzeszutek Wilk 
Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
Reviewed-by: Khalid Aziz 
Reviewed-by: Konrad Rzeszutek Wilk 
---
 include/linux/xpfo.h |  4 
 kernel/dma/swiotlb.c |  3 ++-
 mm/xpfo.c| 15 +++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index b15234745fb4..cba37ffb09b1 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -36,6 +36,8 @@ void xpfo_kunmap(void *kaddr, struct page *page);
 void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp);
 void xpfo_free_pages(struct page *page, int order);
 
+bool xpfo_page_is_unmapped(struct page *page);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -43,6 +45,8 @@ static inline void xpfo_kunmap(void *kaddr, struct page 
*page) { }
 static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { 
}
 static inline void xpfo_free_pages(struct page *page, int order) { }
 
+static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
+
 #endif /* CONFIG_XPFO */
 
 #endif /* _LINUX_XPFO_H */
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 045930e32c0e..820a54b57491 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -396,8 +396,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, 
phys_addr_t tlb_addr,
 {
unsigned long pfn = PFN_DOWN(orig_addr);
unsigned char *vaddr = phys_to_virt(tlb_addr);
+   struct page *page = pfn_to_page(pfn);
 
-   if (PageHighMem(pfn_to_page(pfn))) {
+   if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
/* The buffer does not have a mapping.  Map it in and copy */
unsigned int offset = orig_addr & ~PAGE_MASK;
char *buffer;
diff --git a/mm/xpfo.c b/mm/xpfo.c
index 24b33d3c20cb..67884736bebe 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -221,3 +221,18 @@ void xpfo_kunmap(void *kaddr, struct page *page)
spin_unlock(>maplock);
 }
 EXPORT_SYMBOL(xpfo_kunmap);
+
+bool xpfo_page_is_unmapped(struct page *page)
+{
+   struct xpfo *xpfo;
+
+   if (!static_branch_unlikely(_inited))
+   return false;
+
+   xpfo = lookup_xpfo(page);
+   if (unlikely(!xpfo) && !xpfo->inited)
+   return false;
+
+   return test_bit(XPFO_PAGE_UNMAPPED, >flags);
+}
+EXPORT_SYMBOL(xpfo_page_is_unmapped);
-- 
2.17.1



[RFC PATCH v8 08/14] arm64/mm: disable section/contiguous mappings if XPFO is enabled

2019-02-13 Thread Khalid Aziz
From: Tycho Andersen 

XPFO doesn't support section/contiguous mappings yet, so let's disable it
if XPFO is turned on.

Thanks to Laura Abbot for the simplification from v5, and Mark Rutland for
pointing out we need NO_CONT_MAPPINGS too.

CC: linux-arm-ker...@lists.infradead.org
Signed-off-by: Tycho Andersen 
Reviewed-by: Khalid Aziz 
---
 arch/arm64/mm/mmu.c  | 2 +-
 include/linux/xpfo.h | 4 
 mm/xpfo.c| 6 ++
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d1d6601b385d..f4dd27073006 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -451,7 +451,7 @@ static void __init map_mem(pgd_t *pgdp)
struct memblock_region *reg;
int flags = 0;
 
-   if (debug_pagealloc_enabled())
+   if (debug_pagealloc_enabled() || xpfo_enabled())
flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
/*
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 1ae05756344d..8b029918a958 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -47,6 +47,8 @@ void xpfo_temp_map(const void *addr, size_t size, void 
**mapping,
 void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
 size_t mapping_len);
 
+bool xpfo_enabled(void);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -69,6 +71,8 @@ static inline void xpfo_temp_unmap(const void *addr, size_t 
size,
 }
 
 
+static inline bool xpfo_enabled(void) { return false; }
+
 #endif /* CONFIG_XPFO */
 
 #endif /* _LINUX_XPFO_H */
diff --git a/mm/xpfo.c b/mm/xpfo.c
index 92ca6d1baf06..150784ae0f08 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -71,6 +71,12 @@ struct page_ext_operations page_xpfo_ops = {
.init = init_xpfo,
 };
 
+bool __init xpfo_enabled(void)
+{
+   return !xpfo_disabled;
+}
+EXPORT_SYMBOL(xpfo_enabled);
+
 static inline struct xpfo *lookup_xpfo(struct page *page)
 {
struct page_ext *page_ext = lookup_page_ext(page);
-- 
2.17.1



[RFC PATCH v8 06/14] xpfo: add primitives for mapping underlying memory

2019-02-13 Thread Khalid Aziz
From: Tycho Andersen 

In some cases (on arm64 DMA and data cache flushes) we may have unmapped
the underlying pages needed for something via XPFO. Here are some
primitives useful for ensuring the underlying memory is mapped/unmapped in
the face of xpfo.

Signed-off-by: Tycho Andersen 
Reviewed-by: Khalid Aziz 
---
 include/linux/xpfo.h | 22 ++
 mm/xpfo.c| 30 ++
 2 files changed, 52 insertions(+)

diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index cba37ffb09b1..1ae05756344d 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -38,6 +38,15 @@ void xpfo_free_pages(struct page *page, int order);
 
 bool xpfo_page_is_unmapped(struct page *page);
 
+#define XPFO_NUM_PAGES(addr, size) \
+   (PFN_UP((unsigned long) (addr) + (size)) - \
+   PFN_DOWN((unsigned long) (addr)))
+
+void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+  size_t mapping_len);
+void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
+size_t mapping_len);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -47,6 +56,19 @@ static inline void xpfo_free_pages(struct page *page, int 
order) { }
 
 static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
 
+#define XPFO_NUM_PAGES(addr, size) 0
+
+static inline void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+size_t mapping_len)
+{
+}
+
+static inline void xpfo_temp_unmap(const void *addr, size_t size,
+  void **mapping, size_t mapping_len)
+{
+}
+
+
 #endif /* CONFIG_XPFO */
 
 #endif /* _LINUX_XPFO_H */
diff --git a/mm/xpfo.c b/mm/xpfo.c
index 67884736bebe..92ca6d1baf06 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -14,6 +14,7 @@
  * the Free Software Foundation.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -236,3 +237,32 @@ bool xpfo_page_is_unmapped(struct page *page)
return test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
 }
 EXPORT_SYMBOL(xpfo_page_is_unmapped);
+
+void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+  size_t mapping_len)
+{
+   struct page *page = virt_to_page(addr);
+   int i, num_pages = mapping_len / sizeof(mapping[0]);
+
+   memset(mapping, 0, mapping_len);
+
+   for (i = 0; i < num_pages; i++) {
+   if (page_to_virt(page + i) >= addr + size)
+   break;
+
+   if (xpfo_page_is_unmapped(page + i))
+   mapping[i] = kmap_atomic(page + i);
+   }
+}
+EXPORT_SYMBOL(xpfo_temp_map);
+
+void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
+size_t mapping_len)
+{
+   int i, num_pages = mapping_len / sizeof(mapping[0]);
+
+   for (i = 0; i < num_pages; i++)
+   if (mapping[i])
+   kunmap_atomic(mapping[i]);
+}
+EXPORT_SYMBOL(xpfo_temp_unmap);
-- 
2.17.1



[RFC PATCH v8 10/14] lkdtm: Add test for XPFO

2019-02-13 Thread Khalid Aziz
From: Juerg Haefliger 

This test simply reads from userspace memory via the kernel's linear
map.

v6: * drop an #ifdef, just let the test fail if XPFO is not supported
* add XPFO_SMP test to try and test the case when one CPU does an xpfo
  unmap of an address, that it can't be used accidentally by other
  CPUs.

Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
Tested-by: Marco Benatto 
[jstec...@amazon.de: rebased from v4.13 to v4.19]
Signed-off-by: Julian Stecklina 
Tested-by: Khalid Aziz 
---
 drivers/misc/lkdtm/Makefile |   1 +
 drivers/misc/lkdtm/core.c   |   3 +
 drivers/misc/lkdtm/lkdtm.h  |   5 +
 drivers/misc/lkdtm/xpfo.c   | 194 
 4 files changed, 203 insertions(+)
 create mode 100644 drivers/misc/lkdtm/xpfo.c

diff --git a/drivers/misc/lkdtm/Makefile b/drivers/misc/lkdtm/Makefile
index 951c984de61a..97c6b7818cce 100644
--- a/drivers/misc/lkdtm/Makefile
+++ b/drivers/misc/lkdtm/Makefile
@@ -9,6 +9,7 @@ lkdtm-$(CONFIG_LKDTM)   += refcount.o
 lkdtm-$(CONFIG_LKDTM)  += rodata_objcopy.o
 lkdtm-$(CONFIG_LKDTM)  += usercopy.o
 lkdtm-$(CONFIG_LKDTM)  += stackleak.o
+lkdtm-$(CONFIG_LKDTM)  += xpfo.o
 
 KASAN_SANITIZE_stackleak.o := n
 KCOV_INSTRUMENT_rodata.o   := n
diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index 2837dc77478e..25f4ab4ebf50 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -185,6 +185,9 @@ static const struct crashtype crashtypes[] = {
CRASHTYPE(USERCOPY_KERNEL),
CRASHTYPE(USERCOPY_KERNEL_DS),
CRASHTYPE(STACKLEAK_ERASING),
+   CRASHTYPE(XPFO_READ_USER),
+   CRASHTYPE(XPFO_READ_USER_HUGE),
+   CRASHTYPE(XPFO_SMP),
 };
 
 
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 3c6fd327e166..6b31ff0c7f8f 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -87,4 +87,9 @@ void lkdtm_USERCOPY_KERNEL_DS(void);
 /* lkdtm_stackleak.c */
 void lkdtm_STACKLEAK_ERASING(void);
 
+/* lkdtm_xpfo.c */
+void lkdtm_XPFO_READ_USER(void);
+void lkdtm_XPFO_READ_USER_HUGE(void);
+void lkdtm_XPFO_SMP(void);
+
 #endif
diff --git a/drivers/misc/lkdtm/xpfo.c b/drivers/misc/lkdtm/xpfo.c
new file mode 100644
index ..d903063bdd0b
--- /dev/null
+++ b/drivers/misc/lkdtm/xpfo.c
@@ -0,0 +1,194 @@
+/*
+ * This is for all the tests related to XPFO (eXclusive Page Frame Ownership).
+ */
+
+#include "lkdtm.h"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#define XPFO_DATA 0xdeadbeef
+
+static unsigned long do_map(unsigned long flags)
+{
+   unsigned long user_addr, user_data = XPFO_DATA;
+
+   user_addr = vm_mmap(NULL, 0, PAGE_SIZE,
+   PROT_READ | PROT_WRITE | PROT_EXEC,
+   flags, 0);
+   if (user_addr >= TASK_SIZE) {
+   pr_warn("Failed to allocate user memory\n");
+   return 0;
+   }
+
+   if (copy_to_user((void __user *)user_addr, &user_data,
+sizeof(user_data))) {
+   pr_warn("copy_to_user failed\n");
+   goto free_user;
+   }
+
+   return user_addr;
+
+free_user:
+   vm_munmap(user_addr, PAGE_SIZE);
+   return 0;
+}
+
+static unsigned long *user_to_kernel(unsigned long user_addr)
+{
+   phys_addr_t phys_addr;
+   void *virt_addr;
+
+   phys_addr = user_virt_to_phys(user_addr);
+   if (!phys_addr) {
+   pr_warn("Failed to get physical address of user memory\n");
+   return NULL;
+   }
+
+   virt_addr = phys_to_virt(phys_addr);
+   if (phys_addr != virt_to_phys(virt_addr)) {
+   pr_warn("Physical address of user memory seems incorrect\n");
+   return NULL;
+   }
+
+   return virt_addr;
+}
+
+static void read_map(unsigned long *virt_addr)
+{
+   pr_info("Attempting bad read from kernel address %p\n", virt_addr);
+   if (*(unsigned long *)virt_addr == XPFO_DATA)
+   pr_err("FAIL: Bad read succeeded?!\n");
+   else
+   pr_err("FAIL: Bad read didn't fail but data is incorrect?!\n");
+}
+
+static void read_user_with_flags(unsigned long flags)
+{
+   unsigned long user_addr, *kernel;
+
+   user_addr = do_map(flags);
+   if (!user_addr) {
+   pr_err("FAIL: map failed\n");
+   return;
+   }
+
+   kernel = user_to_kernel(user_addr);
+   if (!kernel) {
+   pr_err("FAIL: user to kernel conversion failed\n");
+   goto free_user;
+   }
+
+   read_map(kernel);
+
+free_user:
+   vm_munmap(user_addr, PAGE_SIZE);
+}
+
+/* Read from userspace via the kernel's linear map. */
+void lkdtm_XPFO_READ_USER(void)
+{
+   read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS);
+}
+
+void lkdtm_XPFO_READ_USER_HUGE

[RFC PATCH v8 09/14] mm: add a user_virt_to_phys symbol

2019-02-13 Thread Khalid Aziz
From: Tycho Andersen 

We need something like this for testing XPFO. Since it's architecture
specific, putting it in the test code is slightly awkward, so let's make it
an arch-specific symbol and export it for use in LKDTM.

CC: linux-arm-ker...@lists.infradead.org
CC: x...@kernel.org
Signed-off-by: Tycho Andersen 
Tested-by: Marco Benatto 
Signed-off-by: Khalid Aziz 
---
v6: * add a definition of user_virt_to_phys in the !CONFIG_XPFO case
v7: * make user_virt_to_phys a GPL symbol

 arch/x86/mm/xpfo.c   | 57 
 include/linux/xpfo.h |  8 +++
 2 files changed, 65 insertions(+)

diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
index 6c7502993351..e13b99019c47 100644
--- a/arch/x86/mm/xpfo.c
+++ b/arch/x86/mm/xpfo.c
@@ -117,3 +117,60 @@ inline void xpfo_flush_kernel_tlb(struct page *page, int 
order)
 
flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
 }
+
+/* Convert a user space virtual address to a physical address.
+ * Shamelessly copied from slow_virt_to_phys() and lookup_address() in
+ * arch/x86/mm/pageattr.c
+ */
+phys_addr_t user_virt_to_phys(unsigned long addr)
+{
+   phys_addr_t phys_addr;
+   unsigned long offset;
+   pgd_t *pgd;
+   p4d_t *p4d;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+
+   pgd = pgd_offset(current->mm, addr);
+   if (pgd_none(*pgd))
+   return 0;
+
+   p4d = p4d_offset(pgd, addr);
+   if (p4d_none(*p4d))
+   return 0;
+
+   if (p4d_large(*p4d) || !p4d_present(*p4d)) {
+   phys_addr = (unsigned long)p4d_pfn(*p4d) << PAGE_SHIFT;
+   offset = addr & ~P4D_MASK;
+   goto out;
+   }
+
+   pud = pud_offset(p4d, addr);
+   if (pud_none(*pud))
+   return 0;
+
+   if (pud_large(*pud) || !pud_present(*pud)) {
+   phys_addr = (unsigned long)pud_pfn(*pud) << PAGE_SHIFT;
+   offset = addr & ~PUD_MASK;
+   goto out;
+   }
+
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return 0;
+
+   if (pmd_large(*pmd) || !pmd_present(*pmd)) {
+   phys_addr = (unsigned long)pmd_pfn(*pmd) << PAGE_SHIFT;
+   offset = addr & ~PMD_MASK;
+   goto out;
+   }
+
+   pte =  pte_offset_kernel(pmd, addr);
+   phys_addr = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
+   offset = addr & ~PAGE_MASK;
+
+out:
+   return (phys_addr_t)(phys_addr | offset);
+}
+EXPORT_SYMBOL_GPL(user_virt_to_phys);
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 8b029918a958..117869991d5b 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -24,6 +24,10 @@ struct page;
 
 #ifdef CONFIG_XPFO
 
+#include 
+
+#include 
+
 extern struct page_ext_operations page_xpfo_ops;
 
 void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
@@ -49,6 +53,8 @@ void xpfo_temp_unmap(const void *addr, size_t size, void 
**mapping,
 
 bool xpfo_enabled(void);
 
+phys_addr_t user_virt_to_phys(unsigned long addr);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -73,6 +79,8 @@ static inline void xpfo_temp_unmap(const void *addr, size_t 
size,
 
 static inline bool xpfo_enabled(void) { return false; }
 
+static inline phys_addr_t user_virt_to_phys(unsigned long addr) { return 0; }
+
 #endif /* CONFIG_XPFO */
 
 #endif /* _LINUX_XPFO_H */
-- 
2.17.1



[RFC PATCH v8 05/14] arm64/mm: Add support for XPFO

2019-02-13 Thread Khalid Aziz
From: Juerg Haefliger 

Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
provide a hook for updating a single kernel page table entry (which is
required by the generic XPFO code).

CC: linux-arm-ker...@lists.infradead.org
Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
Signed-off-by: Khalid Aziz 
---
v6:
- use flush_tlb_kernel_range() instead of __flush_tlb_one()

v8:
- Add check for NULL pte in set_kpte()

 arch/arm64/Kconfig |  1 +
 arch/arm64/mm/Makefile |  2 ++
 arch/arm64/mm/xpfo.c   | 64 ++
 3 files changed, 67 insertions(+)
 create mode 100644 arch/arm64/mm/xpfo.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index ea2ab0330e3a..f0a9c0007d23 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -171,6 +171,7 @@ config ARM64
select SWIOTLB
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
+   select ARCH_SUPPORTS_XPFO
help
  ARM 64-bit (AArch64) Linux support.
 
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 849c1df3d214..cca3808d9776 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -12,3 +12,5 @@ KASAN_SANITIZE_physaddr.o += n
 
 obj-$(CONFIG_KASAN)+= kasan_init.o
 KASAN_SANITIZE_kasan_init.o:= n
+
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
new file mode 100644
index ..1f790f7746ad
--- /dev/null
+++ b/arch/arm64/mm/xpfo.c
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger 
+ *   Vasileios P. Kemerlis 
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include 
+#include 
+
+#include 
+
+/*
+ * Lookup the page table entry for a virtual address and return a pointer to
+ * the entry. Based on x86 tree.
+ */
+static pte_t *lookup_address(unsigned long addr)
+{
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+
+   pgd = pgd_offset_k(addr);
+   if (pgd_none(*pgd))
+   return NULL;
+
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud))
+   return NULL;
+
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return NULL;
+
+   return pte_offset_kernel(pmd, addr);
+}
+
+/* Update a single kernel page table entry */
+inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
+{
+   pte_t *pte = lookup_address((unsigned long)kaddr);
+
+   if (unlikely(!pte)) {
+   WARN(1, "xpfo: invalid address %p\n", kaddr);
+   return;
+   }
+
+   set_pte(pte, pfn_pte(page_to_pfn(page), prot));
+}
+
+inline void xpfo_flush_kernel_tlb(struct page *page, int order)
+{
+   unsigned long kaddr = (unsigned long)page_address(page);
+   unsigned long size = PAGE_SIZE;
+
+   flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+}
-- 
2.17.1



[RFC PATCH v8 13/14] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

2019-02-13 Thread Khalid Aziz
XPFO flushes kernel space TLB entries for pages that are now mapped
in userspace on not only the current CPU but also all other CPUs
synchronously. Processes on each core allocating pages cause a
flood of IPI messages to all other cores to flush TLB entries.
Many of these messages are to flush the entire TLB on the core if
the number of entries being flushed from the local core exceeds
tlb_single_page_flush_ceiling. The cost of the TLB flushes caused by
unmapping pages from the physmap goes up dramatically on machines with
high core counts.

This patch flushes the relevant TLB entries for the current process, or
the entire TLB, depending upon the number of entries for the current CPU,
and posts a pending TLB flush on all other CPUs when a page is
unmapped from kernel space and mapped in userspace. Each core
checks the pending TLB flush flag for itself on every context
switch, flushes its TLB if the flag is set and clears it.
This patch potentially aggregates multiple TLB flushes into one.
This has a very significant impact, especially on machines with large
core counts. To illustrate this, the kernel was compiled with -j on
two classes of machines - a server with high core count and large
amount of memory, and a desktop class machine with more modest
specs. System time from "make -j" from vanilla 4.20 kernel, 4.20
with XPFO patches before applying this patch and after applying
this patch are below:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

4.20                          950.966s
4.20+XPFO                   25073.169s   26.366x
4.20+XPFO+Deferred flush     1372.874s    1.44x

Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

4.20                          607.671s
4.20+XPFO                    1588.646s    2.614x
4.20+XPFO+Deferred flush      803.989s    1.32x

This patch could use more optimization. Batching more TLB entry
flushes, as was suggested for an earlier version of these patches,
can help reduce the number of full flushes. This same code should be implemented
for other architectures as well once finalized.

Signed-off-by: Khalid Aziz 
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c   | 38 +
 arch/x86/mm/xpfo.c  |  2 +-
 3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index f4204bf377fc..92d23629d01d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -561,6 +561,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, 
unsigned long start,
unsigned long end, unsigned int stride_shift,
bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long 
end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 03b6b4c2238d..c907b643eecb 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -35,6 +35,15 @@
  */
 #define LAST_USER_MM_IBPB  0x1UL
 
+/*
+ * When a full TLB flush is needed to flush stale TLB entries
+ * for pages that have been mapped into userspace and unmapped
+ * from kernel space, this TLB flush will be delayed until the
+ * task is scheduled on that CPU. Keep track of CPUs with
+ * pending full TLB flush forced by xpfo.
+ */
+static cpumask_t pending_xpfo_flush;
+
 /*
  * We get here when we do something requiring a TLB invalidation
  * but could not go invalidate all of the contexts.  We do the
@@ -319,6 +328,15 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
__flush_tlb_all();
}
 #endif
+
+   /* If there is a pending TLB flush for this CPU due to XPFO
+* flush, do it now.
+*/
+   if (cpumask_test_and_clear_cpu(cpu, &pending_xpfo_flush)) {
+   count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+   __flush_tlb_all();
+   }
+
this_cpu_write(cpu_tlbstate.is_lazy, false);
 
/*
@@ -801,6 +819,26 @@ void flush_tlb_kernel_range(unsigned long start, unsigned 
long end)
}
 }
 
+void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+   struct cpumask tmp_mask;
+
+   /* Balance as user space task's flush, a bit conservative */
+   if (end == TLB_FLUSH_ALL ||
+   (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
+   do_flush_tlb_all(NULL);
+   } else {
+   struct flush_tlb_info info;
+
+   info.start = start;
+   info.end = end;
+   do_kernel_range_flush(&info);
+   }
+   cpumask_setall(&tmp_mask);
+   cpumask_clear_cpu(smp_processor_id(), &tmp_mask);
+   cpumask_or(&pending_xpfo_flush, &pending_xpfo_flush, &tmp_mask);
+}
+
 void arch_tlbbatch_flush(struct arch_tlbflush

[RFC PATCH v8 00/14] Add support for eXclusive Page Frame Ownership

2019-02-13 Thread Khalid Aziz
I am continuing to build on the work Juerg, Tycho and Julian have
done on XPFO. After the last round of updates, we were seeing very
significant performance penalties when stale TLB entries were
flushed actively after an XPFO TLB update.  The benchmark for measuring
performance is a kernel build using parallel make. To get full
protection from ret2dir attacks, we must flush stale TLB entries.
Performance penalty from flushing stale TLB entries goes up as the
number of cores goes up. On a desktop class machine with only 4
cores, enabling TLB flush for stale entries causes system time for
"make -j4" to go up by a factor of 2.61x but on a larger machine
with 96 cores, system time with "make -j60" goes up by a factor of
26.37x!  I have been working on reducing this performance penalty.

I implemented two solutions to reduce performance penalty and that
has had large impact. XPFO code flushes TLB every time a page is
allocated to userspace. It does so by sending IPIs to all processors
to flush TLB. Back-to-back allocations of pages to userspace on
multiple processors result in a storm of IPIs.  Each one of these
incoming IPIs is handled by a processor by flushing its TLB. To
reduce this IPI storm, I have added a per CPU flag that can be set
to tell a processor to flush its TLB. A processor checks this flag
on every context switch. If the flag is set, it flushes its TLB and
clears the flag. This allows for multiple TLB flush requests to a
single CPU to be combined into a single request. A kernel TLB entry
for a page that has been allocated to userspace is flushed on all
processors unlike the previous version of this patch. A processor
could hold a stale kernel TLB entry that was removed on another
processor until the next context switch. A local userspace page
allocation by the currently running process could force the TLB
flush earlier for such entries.

The other solution reduces the number of TLB flushes required by
performing a TLB flush for multiple pages at one time when pages are
refilled on the per-cpu freelist. If the pages being added to the
per-cpu freelist are marked for userspace allocation, TLB entries
for these pages can be flushed upfront and the pages tagged as currently
unmapped. When any such page is allocated to userspace, there is no
need to perform a TLB flush at that time any more. This batching of
TLB flushes reduces the performance impact further.
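
To make the idea concrete, here is a rough sketch of what flushing a
whole refill batch up front could look like. This is an illustration
only, not the actual "Batch update" patch; xpfo_mark_unmapped() is a
hypothetical helper and the real patch may structure this differently:

static void xpfo_flush_freelist_batch(struct list_head *pages)
{
	struct page *page;
	unsigned long start = ULONG_MAX, end = 0;

	list_for_each_entry(page, pages, lru) {
		unsigned long kaddr = (unsigned long)page_address(page);

		/*
		 * Drop the physmap mapping while refilling the per-cpu
		 * freelist instead of at allocation time.
		 */
		set_kpte(page_address(page), page, __pgprot(0));
		xpfo_mark_unmapped(page);	/* hypothetical helper */

		start = min(start, kaddr);
		end = max(end, kaddr + PAGE_SIZE);
	}

	/*
	 * One ranged kernel TLB flush covers the whole batch, so the
	 * allocation fast path can skip its per-page flush.
	 */
	if (start < end)
		xpfo_flush_tlb_kernel_range(start, end);
}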

I measured system time for parallel make with unmodified 4.20
kernel, 4.20 with XPFO patches before these patches and then again
after applying each of these patches. Here are the results:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

4.20                                       950.966s
4.20+XPFO                                25073.169s   26.37x
4.20+XPFO+Deferred flush                  1372.874s    1.44x
4.20+XPFO+Deferred flush+Batch update     1255.021s    1.32x


Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

4.20                                       607.671s
4.20+XPFO                                 1588.646s    2.61x
4.20+XPFO+Deferred flush                   803.989s    1.32x
4.20+XPFO+Deferred flush+Batch update      795.728s    1.31x

30+% overhead is still very high and there is room for improvement.

Performance with this patch set is good enough to use these as
starting point for further refinement before we merge it into main
kernel, hence RFC.

I have dropped the patch "mm, x86: omit TLB flushing by default for
XPFO page table modifications" since not flushing TLB leaves kernel
wide open to attack and there is no point in enabling XPFO without
flushing TLB every time kernel TLB entries for pages are removed. I
also dropped the patch "EXPERIMENTAL: xpfo, mm: optimize spin lock
usage in xpfo_kmap". There was not a measurable improvement in
performance with this patch and it introduced a possibility for
deadlock that Laura found.

What remains to be done beyond this patch series:

1. Performance improvements: Ideas to explore - (1) Add a freshly
   freed page to per cpu freelist and not make a kernel TLB entry
   for it, (2) kernel mappings private to an mm, (3) Any others??
2. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
   from Juerg. I dropped it for now since swiotlb code for ARM has
   changed a lot in 4.20.
3. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
   CPUs" to other architectures besides x86.


-

Juerg Haefliger (5):
  mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
  swiotlb: Map the buffer if it was unmapped by XPFO
  arm64/mm: Add support for XPFO
  arm64/mm, xpfo: temporarily map dcache regions
  lkdtm: Add test for XPFO

Julian Stecklina (2):
  xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION
  xpfo, mm: optimize spinlock usage in xpfo_kunmap

Khalid Aziz (2):
  xpfo, mm: Defer TLB flushes f

[RFC PATCH v8 07/14] arm64/mm, xpfo: temporarily map dcache regions

2019-02-13 Thread Khalid Aziz
From: Juerg Haefliger 

If the page is unmapped by XPFO, a data cache flush results in a fatal
page fault, so let's temporarily map the region, flush the cache, and then
unmap it.

v6: actually flush in the face of xpfo, and temporarily map the underlying
memory so it can be flushed correctly

CC: linux-arm-ker...@lists.infradead.org
Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
---
 arch/arm64/mm/flush.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index 30695a868107..fad09aafd9d5 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -28,9 +29,15 @@
 void sync_icache_aliases(void *kaddr, unsigned long len)
 {
unsigned long addr = (unsigned long)kaddr;
+   unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
+   void *mapping[num_pages];
 
if (icache_is_aliasing()) {
+   xpfo_temp_map(kaddr, len, mapping,
+ sizeof(mapping[0]) * num_pages);
__clean_dcache_area_pou(kaddr, len);
+   xpfo_temp_unmap(kaddr, len, mapping,
+   sizeof(mapping[0]) * num_pages);
__flush_icache_all();
} else {
flush_icache_range(addr, addr + len);
-- 
2.17.1



Re: [RFC PATCH v7 05/16] arm64/mm: Add support for XPFO

2019-02-12 Thread Khalid Aziz
On 2/12/19 1:01 PM, Laura Abbott wrote:
> On 2/12/19 7:52 AM, Khalid Aziz wrote:
>> On 1/23/19 7:24 AM, Konrad Rzeszutek Wilk wrote:
>>> On Thu, Jan 10, 2019 at 02:09:37PM -0700, Khalid Aziz wrote:
>>>> From: Juerg Haefliger 
>>>>
>>>> Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
>>>> provide a hook for updating a single kernel page table entry (which is
>>>> required by the generic XPFO code).
>>>>
>>>> v6: use flush_tlb_kernel_range() instead of __flush_tlb_one()
>>>>
>>>> CC: linux-arm-ker...@lists.infradead.org
>>>> Signed-off-by: Juerg Haefliger 
>>>> Signed-off-by: Tycho Andersen 
>>>> Signed-off-by: Khalid Aziz 
>>>> ---
>>>>   arch/arm64/Kconfig |  1 +
>>>>   arch/arm64/mm/Makefile |  2 ++
>>>>   arch/arm64/mm/xpfo.c   | 58
>>>> ++
>>>>   3 files changed, 61 insertions(+)
>>>>   create mode 100644 arch/arm64/mm/xpfo.c
>>>>
>>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>>> index ea2ab0330e3a..f0a9c0007d23 100644
>>>> --- a/arch/arm64/Kconfig
>>>> +++ b/arch/arm64/Kconfig
>>>> @@ -171,6 +171,7 @@ config ARM64
>>>>   select SWIOTLB
>>>>   select SYSCTL_EXCEPTION_TRACE
>>>>   select THREAD_INFO_IN_TASK
>>>> +    select ARCH_SUPPORTS_XPFO
>>>>   help
>>>>     ARM 64-bit (AArch64) Linux support.
>>>>   diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>>>> index 849c1df3d214..cca3808d9776 100644
>>>> --- a/arch/arm64/mm/Makefile
>>>> +++ b/arch/arm64/mm/Makefile
>>>> @@ -12,3 +12,5 @@ KASAN_SANITIZE_physaddr.o    += n
>>>>     obj-$(CONFIG_KASAN)    += kasan_init.o
>>>>   KASAN_SANITIZE_kasan_init.o    := n
>>>> +
>>>> +obj-$(CONFIG_XPFO)    += xpfo.o
>>>> diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
>>>> new file mode 100644
>>>> index ..678e2be848eb
>>>> --- /dev/null
>>>> +++ b/arch/arm64/mm/xpfo.c
>>>> @@ -0,0 +1,58 @@
>>>> +/*
>>>> + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
>>>> + * Copyright (C) 2016 Brown University. All rights reserved.
>>>> + *
>>>> + * Authors:
>>>> + *   Juerg Haefliger 
>>>> + *   Vasileios P. Kemerlis 
>>>> + *
>>>> + * This program is free software; you can redistribute it and/or
>>>> modify it
>>>> + * under the terms of the GNU General Public License version 2 as
>>>> published by
>>>> + * the Free Software Foundation.
>>>> + */
>>>> +
>>>> +#include 
>>>> +#include 
>>>> +
>>>> +#include 
>>>> +
>>>> +/*
>>>> + * Lookup the page table entry for a virtual address and return a
>>>> pointer to
>>>> + * the entry. Based on x86 tree.
>>>> + */
>>>> +static pte_t *lookup_address(unsigned long addr)
>>>> +{
>>>> +    pgd_t *pgd;
>>>> +    pud_t *pud;
>>>> +    pmd_t *pmd;
>>>> +
>>>> +    pgd = pgd_offset_k(addr);
>>>> +    if (pgd_none(*pgd))
>>>> +    return NULL;
>>>> +
>>>> +    pud = pud_offset(pgd, addr);
>>>> +    if (pud_none(*pud))
>>>> +    return NULL;
>>>> +
>>>> +    pmd = pmd_offset(pud, addr);
>>>> +    if (pmd_none(*pmd))
>>>> +    return NULL;
>>>> +
>>>> +    return pte_offset_kernel(pmd, addr);
>>>> +}
>>>> +
>>>> +/* Update a single kernel page table entry */
>>>> +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
>>>> +{
>>>> +    pte_t *pte = lookup_address((unsigned long)kaddr);
>>>> +
>>>> +    set_pte(pte, pfn_pte(page_to_pfn(page), prot));
>>>
>>> Thought on the other hand.. what if the page is PMD? Do you really want
>>> to do this?
>>>
>>> What if 'pte' is NULL?
>>>> +}
>>>> +
>>>> +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
>>>> +{
>>>> +    unsigned long kaddr = (unsigned long)page_address(page);
>>>> +    unsigned long size = PAGE_SIZE

Re: [RFC PATCH v7 05/16] arm64/mm: Add support for XPFO

2019-02-12 Thread Khalid Aziz
On 1/23/19 7:24 AM, Konrad Rzeszutek Wilk wrote:
> On Thu, Jan 10, 2019 at 02:09:37PM -0700, Khalid Aziz wrote:
>> From: Juerg Haefliger 
>>
>> Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
>> provide a hook for updating a single kernel page table entry (which is
>> required by the generic XPFO code).
>>
>> v6: use flush_tlb_kernel_range() instead of __flush_tlb_one()
>>
>> CC: linux-arm-ker...@lists.infradead.org
>> Signed-off-by: Juerg Haefliger 
>> Signed-off-by: Tycho Andersen 
>> Signed-off-by: Khalid Aziz 
>> ---
>>  arch/arm64/Kconfig |  1 +
>>  arch/arm64/mm/Makefile |  2 ++
>>  arch/arm64/mm/xpfo.c   | 58 ++
>>  3 files changed, 61 insertions(+)
>>  create mode 100644 arch/arm64/mm/xpfo.c
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index ea2ab0330e3a..f0a9c0007d23 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -171,6 +171,7 @@ config ARM64
>>  select SWIOTLB
>>  select SYSCTL_EXCEPTION_TRACE
>>  select THREAD_INFO_IN_TASK
>> +select ARCH_SUPPORTS_XPFO
>>  help
>>ARM 64-bit (AArch64) Linux support.
>>  
>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>> index 849c1df3d214..cca3808d9776 100644
>> --- a/arch/arm64/mm/Makefile
>> +++ b/arch/arm64/mm/Makefile
>> @@ -12,3 +12,5 @@ KASAN_SANITIZE_physaddr.o  += n
>>  
>>  obj-$(CONFIG_KASAN) += kasan_init.o
>>  KASAN_SANITIZE_kasan_init.o := n
>> +
>> +obj-$(CONFIG_XPFO)  += xpfo.o
>> diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
>> new file mode 100644
>> index ..678e2be848eb
>> --- /dev/null
>> +++ b/arch/arm64/mm/xpfo.c
>> @@ -0,0 +1,58 @@
>> +/*
>> + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
>> + * Copyright (C) 2016 Brown University. All rights reserved.
>> + *
>> + * Authors:
>> + *   Juerg Haefliger 
>> + *   Vasileios P. Kemerlis 
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published 
>> by
>> + * the Free Software Foundation.
>> + */
>> +
>> +#include 
>> +#include 
>> +
>> +#include 
>> +
>> +/*
>> + * Lookup the page table entry for a virtual address and return a pointer to
>> + * the entry. Based on x86 tree.
>> + */
>> +static pte_t *lookup_address(unsigned long addr)
>> +{
>> +pgd_t *pgd;
>> +pud_t *pud;
>> +pmd_t *pmd;
>> +
>> +pgd = pgd_offset_k(addr);
>> +if (pgd_none(*pgd))
>> +return NULL;
>> +
>> +pud = pud_offset(pgd, addr);
>> +if (pud_none(*pud))
>> +return NULL;
>> +
>> +pmd = pmd_offset(pud, addr);
>> +if (pmd_none(*pmd))
>> +return NULL;
>> +
>> +return pte_offset_kernel(pmd, addr);
>> +}
>> +
>> +/* Update a single kernel page table entry */
>> +inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
>> +{
>> +pte_t *pte = lookup_address((unsigned long)kaddr);
>> +
>> +set_pte(pte, pfn_pte(page_to_pfn(page), prot));
> 
> Thought on the other hand.. what if the page is PMD? Do you really want
> to do this?
> 
> What if 'pte' is NULL?
>> +}
>> +
>> +inline void xpfo_flush_kernel_tlb(struct page *page, int order)
>> +{
>> +unsigned long kaddr = (unsigned long)page_address(page);
>> +unsigned long size = PAGE_SIZE;
>> +
>> +flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
> 
> Ditto here. You are assuming it is PTE, but it may be PMD or such.
> Or worts - the lookup_address could be NULL.
> 
>> +}
>> -- 
>> 2.17.1
>>

Hi Konrad,

This makes sense. The x86 version of set_kpte() checks the pte for NULL and
also checks whether the page is a PMD. Now what you said about adding a
level argument to lookup_address() for arm makes more sense.

Can someone with knowledge of arm64 mmu make recommendations here?

Thanks,
Khalid




Re: [RFC PATCH v7 05/16] arm64/mm: Add support for XPFO

2019-02-12 Thread Khalid Aziz
On 1/23/19 7:20 AM, Konrad Rzeszutek Wilk wrote:
> On Thu, Jan 10, 2019 at 02:09:37PM -0700, Khalid Aziz wrote:
>> From: Juerg Haefliger 
>>
>> Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
>> provide a hook for updating a single kernel page table entry (which is
>> required by the generic XPFO code).
>>
>> v6: use flush_tlb_kernel_range() instead of __flush_tlb_one()
>>
>> CC: linux-arm-ker...@lists.infradead.org
>> Signed-off-by: Juerg Haefliger 
>> Signed-off-by: Tycho Andersen 
>> Signed-off-by: Khalid Aziz 
>> ---
>>  arch/arm64/Kconfig |  1 +
>>  arch/arm64/mm/Makefile |  2 ++
>>  arch/arm64/mm/xpfo.c   | 58 ++
>>  3 files changed, 61 insertions(+)
>>  create mode 100644 arch/arm64/mm/xpfo.c
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index ea2ab0330e3a..f0a9c0007d23 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -171,6 +171,7 @@ config ARM64
>>  select SWIOTLB
>>  select SYSCTL_EXCEPTION_TRACE
>>  select THREAD_INFO_IN_TASK
>> +select ARCH_SUPPORTS_XPFO
>>  help
>>ARM 64-bit (AArch64) Linux support.
>>  
>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>> index 849c1df3d214..cca3808d9776 100644
>> --- a/arch/arm64/mm/Makefile
>> +++ b/arch/arm64/mm/Makefile
>> @@ -12,3 +12,5 @@ KASAN_SANITIZE_physaddr.o  += n
>>  
>>  obj-$(CONFIG_KASAN) += kasan_init.o
>>  KASAN_SANITIZE_kasan_init.o := n
>> +
>> +obj-$(CONFIG_XPFO)  += xpfo.o
>> diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
>> new file mode 100644
>> index ..678e2be848eb
>> --- /dev/null
>> +++ b/arch/arm64/mm/xpfo.c
>> @@ -0,0 +1,58 @@
>> +/*
>> + * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
>> + * Copyright (C) 2016 Brown University. All rights reserved.
>> + *
>> + * Authors:
>> + *   Juerg Haefliger 
>> + *   Vasileios P. Kemerlis 
>> + *
>> + * This program is free software; you can redistribute it and/or modify it
>> + * under the terms of the GNU General Public License version 2 as published 
>> by
>> + * the Free Software Foundation.
>> + */
>> +
>> +#include 
>> +#include 
>> +
>> +#include 
>> +
>> +/*
>> + * Lookup the page table entry for a virtual address and return a pointer to
>> + * the entry. Based on x86 tree.
>> + */
>> +static pte_t *lookup_address(unsigned long addr)
> 
> The x86 also has level. Would it make sense to include that in here?
> 

Possibly. ARM64 does not define page levels (as in the enum for page
levels) at this time, but that can be added easily. Adding a level
argument to lookup_address() for arm will make it uniform with x86, but is
there any other rationale besides that? Do you see a future use for this
information? The only other architecture I could see that defines
lookup_address() is sh, but it uses it for trapped I/O only.

Thanks,
Khalid
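
For reference, a minimal sketch of what an x86-like, level-reporting
lookup_address() could look like on arm64. The enum and names below are
illustrative assumptions, not code from this series or from the arm64
tree; set_kpte() could use the reported level to refuse to touch
section mappings:

enum arm64_pg_level {
	ARM64_PG_LEVEL_NONE,
	ARM64_PG_LEVEL_PTE,
	ARM64_PG_LEVEL_PMD,
	ARM64_PG_LEVEL_PUD,
};

static pte_t *lookup_address_level(unsigned long addr,
				   enum arm64_pg_level *level)
{
	pgd_t *pgd = pgd_offset_k(addr);
	pud_t *pud;
	pmd_t *pmd;

	*level = ARM64_PG_LEVEL_NONE;
	if (pgd_none(*pgd))
		return NULL;

	pud = pud_offset(pgd, addr);
	if (pud_none(*pud))
		return NULL;
	if (pud_sect(*pud)) {
		/* Level-1 block mapping: no pte to return. */
		*level = ARM64_PG_LEVEL_PUD;
		return (pte_t *)pud;
	}

	pmd = pmd_offset(pud, addr);
	if (pmd_none(*pmd))
		return NULL;
	if (pmd_sect(*pmd)) {
		/* Level-2 section mapping. */
		*level = ARM64_PG_LEVEL_PMD;
		return (pte_t *)pmd;
	}

	*level = ARM64_PG_LEVEL_PTE;
	return pte_offset_kernel(pmd, addr);
}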




Re: [RFC PATCH v7 14/16] EXPERIMENTAL: xpfo, mm: optimize spin lock usage in xpfo_kmap

2019-01-17 Thread Khalid Aziz
On 1/16/19 5:18 PM, Laura Abbott wrote:
> On 1/10/19 1:09 PM, Khalid Aziz wrote:
>> From: Julian Stecklina 
>>
>> We can reduce spin lock usage in xpfo_kmap to the 0->1 transition of
>> the mapcount. This means that xpfo_kmap() can now race and that we
>> get spurious page faults.
>>
>> The page fault handler helps the system make forward progress by
>> fixing the page table instead of allowing repeated page faults until
>> the right xpfo_kmap went through.
>>
>> Model-checked with up to 4 concurrent callers with Spin.
>>
> 
> This needs the spurious check for arm64 as well. This at
> least gets me booting but could probably use more review:
> 
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 7d9571f4ae3d..8f425848cbb9 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -32,6 +32,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  #include 
> @@ -289,6 +290,9 @@ static void __do_kernel_fault(unsigned long addr,
> unsigned int esr,
>     if (!is_el1_instruction_abort(esr) && fixup_exception(regs))
>     return;
>  
> +   if (xpfo_spurious_fault(addr))
> +   return;
> +
>     if (is_el1_permission_fault(addr, esr, regs)) {
>     if (esr & ESR_ELx_WNR)
>     msg = "write to read-only memory";
> 
> 

That makes sense. Thanks for debugging this. I will add this to patch 14
("EXPERIMENTAL: xpfo, mm: optimize spin lock usage in xpfo_kmap").

Thanks,
Khalid


Re: [RFC PATCH v7 00/16] Add support for eXclusive Page Frame Ownership

2019-01-16 Thread Khalid Aziz
On 1/16/19 7:56 AM, Julian Stecklina wrote:
> Khalid Aziz  writes:
> 
>> I am continuing to build on the work Juerg, Tycho and Julian have done
>> on XPFO.
> 
> Awesome!
> 
>> A rogue process can launch a ret2dir attack only from a CPU that has
>> dual mapping for its pages in physmap in its TLB. We can hence defer
>> TLB flush on a CPU until a process that would have caused a TLB flush
>> is scheduled on that CPU.
> 
> Assuming the attacker already has the ability to execute arbitrary code
> in userspace, they can just create a second process and thus avoid the
> TLB flush. Am I getting this wrong?

No, you got it right. The patch I wrote closes the security hole when
the attack is launched from the same process but still leaves a window open
when the attack is launched from another process. I am working on figuring
out how to close that hole while keeping the performance the same as it
is now. Synchronous TLB flush across all cores is the most secure but
performance impact is horrendous.

--
Khalid





Re: [RFC PATCH v7 00/16] Add support for eXclusive Page Frame Ownership

2019-01-11 Thread Khalid Aziz
On 1/11/19 2:06 PM, Andy Lutomirski wrote:
> On Fri, Jan 11, 2019 at 12:42 PM Dave Hansen  wrote:
>>
 The second process could easily have the page's old TLB entry.  It could
 abuse that entry as long as that CPU doesn't context switch
 (switch_mm_irqs_off()) or otherwise flush the TLB entry.
>>>
>>> That is an interesting scenario. Working through this scenario, physmap
>>> TLB entry for a page is flushed on the local processor when the page is
>>> allocated to userspace, in xpfo_alloc_pages(). When the userspace passes
>>> page back into kernel, that page is mapped into kernel space using a va
>>> from kmap pool in xpfo_kmap() which can be different for each new
>>> mapping of the same page. The physical page is unmapped from kernel on
>>> the way back from kernel to userspace by xpfo_kunmap(). So two processes
>>> on different CPUs sharing same physical page might not be seeing the
>>> same virtual address for that page while they are in the kernel, as long
>>> as it is an address from kmap pool. ret2dir attack relies upon being
>>> able to craft a predictable virtual address in the kernel physmap for a
>>> physical page and redirect execution to that address. Does that sound right?
>>
>> All processes share one set of kernel page tables.  Or, did your patches
>> change that somehow that I missed?
>>
>> Since they share the page tables, they implicitly share kmap*()
>> mappings.  kmap_atomic() is not *used* by more than one CPU, but the
>> mapping is accessible and at least exists for all processors.
>>
>> I'm basically assuming that any entry mapped in a shared page table is
>> exploitable on any CPU regardless of where we logically *want* it to be
>> used.
>>
>>
> 
> We can, very easily, have kernel mappings that are private to a given
> mm.  Maybe this is useful here.
> 

That sounds like an interesting idea. kmap mappings would be a good
candidate for that. Those are temporary mappings and should only be
valid for one process.

--
Khalid




Re: [RFC PATCH v7 00/16] Add support for eXclusive Page Frame Ownership

2019-01-11 Thread Khalid Aziz
On 1/11/19 1:42 PM, Dave Hansen wrote:
>>> The second process could easily have the page's old TLB entry.  It could
>>> abuse that entry as long as that CPU doesn't context switch
>>> (switch_mm_irqs_off()) or otherwise flush the TLB entry.
>>
>> That is an interesting scenario. Working through this scenario, physmap
>> TLB entry for a page is flushed on the local processor when the page is
>> allocated to userspace, in xpfo_alloc_pages(). When the userspace passes
>> page back into kernel, that page is mapped into kernel space using a va
>> from kmap pool in xpfo_kmap() which can be different for each new
>> mapping of the same page. The physical page is unmapped from kernel on
>> the way back from kernel to userspace by xpfo_kunmap(). So two processes
>> on different CPUs sharing same physical page might not be seeing the
>> same virtual address for that page while they are in the kernel, as long
>> as it is an address from kmap pool. ret2dir attack relies upon being
>> able to craft a predictable virtual address in the kernel physmap for a
>> physical page and redirect execution to that address. Does that sound right?
> 
> All processes share one set of kernel page tables.  Or, did your patches
> change that somehow that I missed?
> 
> Since they share the page tables, they implicitly share kmap*()
> mappings.  kmap_atomic() is not *used* by more than one CPU, but the
> mapping is accessible and at least exists for all processors.
> 
> I'm basically assuming that any entry mapped in a shared page table is
> exploitable on any CPU regardless of where we logically *want* it to be
> used.
> 
> 

Ah, I see what you are saying. A virtual address mapped on one processor is
visible on the other processors as well, and one process could communicate
that va to the other process in some way so it could be exploited by the
other process. This va is exploitable only between the kmap and the matching
kunmap, but the window exists. I am trying to understand your scenario,
so I can address it correctly.

--
Khalid






Re: [RFC PATCH v7 00/16] Add support for eXclusive Page Frame Ownership

2019-01-11 Thread Khalid Aziz
On 1/10/19 5:44 PM, Andy Lutomirski wrote:
> On Thu, Jan 10, 2019 at 3:07 PM Kees Cook  wrote:
>>
>> On Thu, Jan 10, 2019 at 1:10 PM Khalid Aziz  wrote:
>>> I implemented a solution to reduce performance penalty and
>>> that has had large impact. When XPFO code flushes stale TLB entries,
>>> it does so for all CPUs on the system which may include CPUs that
>>> may not have any matching TLB entries or may never be scheduled to
>>> run the userspace task causing TLB flush. Problem is made worse by
>>> the fact that if number of entries being flushed exceeds
>>> tlb_single_page_flush_ceiling, it results in a full TLB flush on
>>> every CPU. A rogue process can launch a ret2dir attack only from a
>>> CPU that has dual mapping for its pages in physmap in its TLB. We
>>> can hence defer TLB flush on a CPU until a process that would have
>>> caused a TLB flush is scheduled on that CPU. I have added a cpumask
>>> to task_struct which is then used to post pending TLB flush on CPUs
>>> other than the one a process is running on. This cpumask is checked
>>> when a process migrates to a new CPU and TLB is flushed at that
>>> time. I measured system time for parallel make with unmodified 4.20
>>> kernel, 4.20 with XPFO patches before this optimization and then
>>> again after applying this optimization. Here are the results:
> 
> I wasn't cc'd on the patch, so I don't know the exact details.
> 
> I'm assuming that "ret2dir" means that you corrupt the kernel into
> using a direct-map page as its stack.  If so, then I don't see why the
> task in whose context the attack is launched needs to be the same
> process as the one that has the page mapped for user access.

You are right. More work is needed to refine delayed TLB flush to close
this gap.

> 
> My advice would be to attempt an entirely different optimization: try
> to avoid putting pages *back* into the direct map when they're freed
> until there is an actual need to use them for kernel purposes.

I had thought about that, but it turns out the performance impact happens
on the initial allocation of the page and the resulting TLB flushes, not
from putting the pages back into the direct map. The way we could benefit
from not adding pages back to the direct map is if we change page allocation
to prefer pages not in the direct map. That way we incur the cost of TLB
flushes initially but then satisfy multiple allocation requests after
that from those "xpfo cost" free pages. More changes will be needed to
pick which of these pages can be added back to the direct map without
degenerating into the worst-case scenario of a page bouncing constantly
between this list of preferred pages and direct mapped pages. It started
to get complex enough that I decided to put this in my back pocket and
attempt simpler approaches first :)

> 
> How are you handing page cache?  Presumably MAP_SHARED PROT_WRITE
> pages are still in the direct map so that IO works.
> 

Since Juerg wrote the actual implementation of XPFO, he probably
understands it better. XPFO tackles only the page allocation requests
from userspace and does not touch page cache pages.

--
Khalid




Re: [RFC PATCH v7 07/16] arm64/mm, xpfo: temporarily map dcache regions

2019-01-11 Thread Khalid Aziz
On 1/11/19 7:54 AM, Tycho Andersen wrote:
> On Thu, Jan 10, 2019 at 02:09:39PM -0700, Khalid Aziz wrote:
>> From: Juerg Haefliger 
>>
>> If the page is unmapped by XPFO, a data cache flush results in a fatal
>> page fault, so let's temporarily map the region, flush the cache, and then
>> unmap it.
>>
>> v6: actually flush in the face of xpfo, and temporarily map the underlying
>> memory so it can be flushed correctly
>>
>> CC: linux-arm-ker...@lists.infradead.org
>> Signed-off-by: Juerg Haefliger 
>> Signed-off-by: Tycho Andersen 
>> Signed-off-by: Khalid Aziz 
>> ---
>>  arch/arm64/mm/flush.c | 7 +++
>>  1 file changed, 7 insertions(+)
>>
>> diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
>> index 30695a868107..f12f26b60319 100644
>> --- a/arch/arm64/mm/flush.c
>> +++ b/arch/arm64/mm/flush.c
>> @@ -20,6 +20,7 @@
>>  #include 
>>  #include 
>>  #include 
>> +#include 
>>  
>>  #include 
>>  #include 
>> @@ -28,9 +29,15 @@
>>  void sync_icache_aliases(void *kaddr, unsigned long len)
>>  {
>>  unsigned long addr = (unsigned long)kaddr;
>> +unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
>> +void *mapping[num_pages];
> 
> Does this still compile with -Wvla? It was a bad hack on my part, and
> we should probably just drop it and come up with something else :)

I will make a note of it. I hope someone with better knowledge of arm64
than me can come up with a better solution ;)

--
Khalid

> 
> Tycho
> 
>>  if (icache_is_aliasing()) {
>> +xpfo_temp_map(kaddr, len, mapping,
>> +  sizeof(mapping[0]) * num_pages);
>>  __clean_dcache_area_pou(kaddr, len);
>> +xpfo_temp_unmap(kaddr, len, mapping,
>> +sizeof(mapping[0]) * num_pages);
>>  __flush_icache_all();
>>  } else {
>>  flush_icache_range(addr, addr + len);
>> -- 
>> 2.17.1
>>
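
For what it's worth, one way to avoid the on-stack VLA entirely would be
to walk the range one page at a time and only map the pages XPFO has
actually unmapped. The sketch below is illustrative only; it assumes, as
elsewhere in this series, that kmap_atomic() on an XPFO-unmapped page
restores its linear-map entry so the kaddr-based cache op works:

static void sync_icache_aliases_pagewise(void *kaddr, unsigned long len)
{
	unsigned long addr = (unsigned long)kaddr;

	if (!icache_is_aliasing()) {
		flush_icache_range(addr, addr + len);
		return;
	}

	while (len) {
		/* Portion of the current page covered by the range. */
		unsigned long chunk = min(len,
					  PAGE_SIZE - (addr & ~PAGE_MASK));
		struct page *page = virt_to_page((void *)addr);
		void *mapped = NULL;

		/* Re-establish the linear mapping only if XPFO removed it. */
		if (xpfo_page_is_unmapped(page))
			mapped = kmap_atomic(page);

		__clean_dcache_area_pou((void *)addr, chunk);

		if (mapped)
			kunmap_atomic(mapped);

		addr += chunk;
		len -= chunk;
	}

	__flush_icache_all();
}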





Re: [RFC PATCH v7 00/16] Add support for eXclusive Page Frame Ownership

2019-01-11 Thread Khalid Aziz
Hi Dave,

Thanks for looking at this and providing feedback.

On 1/10/19 4:40 PM, Dave Hansen wrote:
> First of all, thanks for picking this back up.  It looks to be going in
> a very positive direction!
> 
> On 1/10/19 1:09 PM, Khalid Aziz wrote:
>> I implemented a solution to reduce performance penalty and
>> that has had large impact. When XPFO code flushes stale TLB entries,
>> it does so for all CPUs on the system which may include CPUs that
>> may not have any matching TLB entries or may never be scheduled to
>> run the userspace task causing TLB flush.
> ...
>> A rogue process can launch a ret2dir attack only from a CPU that has 
>> dual mapping for its pages in physmap in its TLB. We can hence defer 
>> TLB flush on a CPU until a process that would have caused a TLB
>> flush is scheduled on that CPU.
> 
> This logic is a bit suspect to me.  Imagine a situation where we have
> two attacker processes: one which is causing page to go from
> kernel->user (and be unmapped from the kernel) and a second process that
> *was* accessing that page.
> 
> The second process could easily have the page's old TLB entry.  It could
> abuse that entry as long as that CPU doesn't context switch
> (switch_mm_irqs_off()) or otherwise flush the TLB entry.

That is an interesting scenario. Working through this scenario, physmap
TLB entry for a page is flushed on the local processor when the page is
allocated to userspace, in xpfo_alloc_pages(). When the userspace passes
page back into kernel, that page is mapped into kernel space using a va
from kmap pool in xpfo_kmap() which can be different for each new
mapping of the same page. The physical page is unmapped from kernel on
the way back from kernel to userspace by xpfo_kunmap(). So two processes
on different CPUs sharing same physical page might not be seeing the
same virtual address for that page while they are in the kernel, as long
as it is an address from kmap pool. ret2dir attack relies upon being
able to craft a predictable virtual address in the kernel physmap for a
physical page and redirect execution to that address. Does that sound right?

Now what happens if only one of these cooperating processes allocates
the page, places a malicious payload on that page and passes the address
of this page to the other process, which can deduce the physmap address for
the page through /proc and exploit the physmap entry for the page on its
CPU? That must be the scenario you are referring to.

> 
> As for where to flush the TLB...  As you know, using synchronous IPIs is
> obviously the most bulletproof from a mitigation perspective.  If you
> can batch the IPIs, you can get the overhead down, but you need to do
> the flushes for a bunch of pages at once, which I think is what you were
> exploring but haven't gotten working yet.
> 
> Anything else you do will have *some* reduced mitigation value, which
> isn't a deal-breaker (to me at least).  Some ideas:

Even without batched IPIs working reliably, I was able to measure the
performance impact of this partially working solution. With just batched
IPIs and no delayed TLB flushes, performance improved by a factor of 2.
The 26x system time went down to 12x-13x but it was still too high and a
non-starter. Combining batched IPI with delayed TLB flushes improved
performance to about 1.1x as opposed to 1.33x with delayed TLB flush
alone. Those numbers are very rough since the batching implementation is
incomplete.

> 
> Take a look at the SWITCH_TO_KERNEL_CR3 in head_64.S.  Every time that
> gets called, we've (potentially) just done a user->kernel transition and
> might benefit from flushing the TLB.  We're always doing a CR3 write (on
> Meltdown-vulnerable hardware) and it can do a full TLB flush based on if
> X86_CR3_PCID_NOFLUSH_BIT is set.  So, when you need a TLB flush, you
> would set a bit that ADJUST_KERNEL_CR3 would see on the next
> user->kernel transition on *each* CPU.  Potentially, multiple TLB
> flushes could be coalesced this way.  The downside of this is that
> you're exposed to the old TLB entries if a flush is needed while you are
> already *in* the kernel.
> 
> You could also potentially do this from C code, like in the syscall
> entry code, or in sensitive places, like when you're returning from a
> guest after a VMEXIT in the kvm code.
> 

Good suggestions. Thanks.

I think the benefit will be highest from batching TLB flushes. I see a lot
of time consumed by full TLB flushes on other processors when the local
processor did only a limited TLB flush. I will continue to debug the
batched TLB updates.

--
Khalid




Re: [RFC PATCH v7 00/16] Add support for eXclusive Page Frame Ownership

2019-01-10 Thread Khalid Aziz
Thanks for looking this over.

On 1/10/19 4:07 PM, Kees Cook wrote:
> On Thu, Jan 10, 2019 at 1:10 PM Khalid Aziz  wrote:
>> I implemented a solution to reduce performance penalty and
>> that has had large impact. When XPFO code flushes stale TLB entries,
>> it does so for all CPUs on the system which may include CPUs that
>> may not have any matching TLB entries or may never be scheduled to
>> run the userspace task causing TLB flush. Problem is made worse by
>> the fact that if number of entries being flushed exceeds
>> tlb_single_page_flush_ceiling, it results in a full TLB flush on
>> every CPU. A rogue process can launch a ret2dir attack only from a
>> CPU that has dual mapping for its pages in physmap in its TLB. We
>> can hence defer TLB flush on a CPU until a process that would have
>> caused a TLB flush is scheduled on that CPU. I have added a cpumask
>> to task_struct which is then used to post pending TLB flush on CPUs
>> other than the one a process is running on. This cpumask is checked
>> when a process migrates to a new CPU and TLB is flushed at that
>> time. I measured system time for parallel make with unmodified 4.20
>> kernel, 4.20 with XPFO patches before this optimization and then
>> again after applying this optimization. Here are the results:
>>
>> Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
>> make -j60 all
>>
>> 4.20                          915.183s
>> 4.20+XPFO                   24129.354s   26.366x
>> 4.20+XPFO+Deferred flush     1216.987s    1.330x
>>
>>
>> Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
>> make -j4 all
>>
>> 4.20                          607.671s
>> 4.20+XPFO                    1588.646s    2.614x
>> 4.20+XPFO+Deferred flush      794.473s    1.307x
> 
> Well that's an impressive improvement! Nice work. :)
> 
> (Are the cpumask improvements possible to be extended to other TLB
> flushing needs? i.e. could there be other performance gains with that
> code even for a non-XPFO system?)

It may be usable for other situations as well but I have not given it
any thought yet. I will take a look.

> 
>> 30+% overhead is still very high and there is room for improvement.
>> Dave Hansen had suggested batch updating TLB entries and Tycho had
>> created an initial implementation but I have not been able to get
>> that to work correctly. I am still working on it and I suspect we
>> will see a noticeable improvement in performance with that. In the
>> code I added, I post a pending full TLB flush to all other CPUs even
>> when number of TLB entries being flushed on current CPU does not
>> exceed tlb_single_page_flush_ceiling. There has to be a better way
>> to do this. I just haven't found an efficient way to implemented
>> delayed limited TLB flush on other CPUs.
>>
>> I am not entirely sure if switch_mm_irqs_off() is indeed the right
>> place to perform the pending TLB flush for a CPU. Any feedback on
>> that will be very helpful. Delaying full TLB flushes on other CPUs
>> seems to help tremendously, so if there is a better way to implement
>> the same thing than what I have done in patch 16, I am open to
>> ideas.
> 
> Dave, Andy, Ingo, Thomas, does anyone have time to look this over?
> 
>> Performance with this patch set is good enough to use these as
>> starting point for further refinement before we merge it into main
>> kernel, hence RFC.
>>
>> Since not flushing stale TLB entries creates a false sense of
>> security, I would recommend making TLB flush mandatory and eliminate
>> the "xpfotlbflush" kernel parameter (patch "mm, x86: omit TLB
>> flushing by default for XPFO page table modifications").
> 
> At this point, yes, that does seem to make sense.
> 
>> What remains to be done beyond this patch series:
>>
>> 1. Performance improvements
>> 2. Remove xpfotlbflush parameter
>> 3. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
>>from Juerg. I dropped it for now since swiotlb code for ARM has
>>changed a lot in 4.20.
>> 4. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
>>CPUs" to other architectures besides x86.
> 
> This seems like a good plan.
> 
> I've put this series in one of my tree so that 0day will find it and
> grind tests...
> https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/log/?h=kspp/xpfo/v7

Thanks for doing that!

--
Khalid




[RFC PATCH v7 10/16] lkdtm: Add test for XPFO

2019-01-10 Thread Khalid Aziz
From: Juerg Haefliger 

This test simply reads from userspace memory via the kernel's linear
map.

v6: * drop an #ifdef, just let the test fail if XPFO is not supported
* add XPFO_SMP test to try and test the case when one CPU does an xpfo
  unmap of an address, that it can't be used accidentally by other
  CPUs.

Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
Tested-by: Marco Benatto 
[jstec...@amazon.de: rebased from v4.13 to v4.19]
Signed-off-by: Julian Stecklina 
Signed-off-by: Khalid Aziz 
---
 drivers/misc/lkdtm/Makefile |   1 +
 drivers/misc/lkdtm/core.c   |   3 +
 drivers/misc/lkdtm/lkdtm.h  |   5 +
 drivers/misc/lkdtm/xpfo.c   | 194 
 4 files changed, 203 insertions(+)
 create mode 100644 drivers/misc/lkdtm/xpfo.c

diff --git a/drivers/misc/lkdtm/Makefile b/drivers/misc/lkdtm/Makefile
index 951c984de61a..97c6b7818cce 100644
--- a/drivers/misc/lkdtm/Makefile
+++ b/drivers/misc/lkdtm/Makefile
@@ -9,6 +9,7 @@ lkdtm-$(CONFIG_LKDTM)   += refcount.o
 lkdtm-$(CONFIG_LKDTM)  += rodata_objcopy.o
 lkdtm-$(CONFIG_LKDTM)  += usercopy.o
 lkdtm-$(CONFIG_LKDTM)  += stackleak.o
+lkdtm-$(CONFIG_LKDTM)  += xpfo.o
 
 KASAN_SANITIZE_stackleak.o := n
 KCOV_INSTRUMENT_rodata.o   := n
diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index 2837dc77478e..25f4ab4ebf50 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -185,6 +185,9 @@ static const struct crashtype crashtypes[] = {
CRASHTYPE(USERCOPY_KERNEL),
CRASHTYPE(USERCOPY_KERNEL_DS),
CRASHTYPE(STACKLEAK_ERASING),
+   CRASHTYPE(XPFO_READ_USER),
+   CRASHTYPE(XPFO_READ_USER_HUGE),
+   CRASHTYPE(XPFO_SMP),
 };
 
 
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 3c6fd327e166..6b31ff0c7f8f 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -87,4 +87,9 @@ void lkdtm_USERCOPY_KERNEL_DS(void);
 /* lkdtm_stackleak.c */
 void lkdtm_STACKLEAK_ERASING(void);
 
+/* lkdtm_xpfo.c */
+void lkdtm_XPFO_READ_USER(void);
+void lkdtm_XPFO_READ_USER_HUGE(void);
+void lkdtm_XPFO_SMP(void);
+
 #endif
diff --git a/drivers/misc/lkdtm/xpfo.c b/drivers/misc/lkdtm/xpfo.c
new file mode 100644
index ..d903063bdd0b
--- /dev/null
+++ b/drivers/misc/lkdtm/xpfo.c
@@ -0,0 +1,194 @@
+/*
+ * This is for all the tests related to XPFO (eXclusive Page Frame Ownership).
+ */
+
+#include "lkdtm.h"
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#define XPFO_DATA 0xdeadbeef
+
+static unsigned long do_map(unsigned long flags)
+{
+   unsigned long user_addr, user_data = XPFO_DATA;
+
+   user_addr = vm_mmap(NULL, 0, PAGE_SIZE,
+   PROT_READ | PROT_WRITE | PROT_EXEC,
+   flags, 0);
+   if (user_addr >= TASK_SIZE) {
+   pr_warn("Failed to allocate user memory\n");
+   return 0;
+   }
+
+   if (copy_to_user((void __user *)user_addr, &user_data,
+sizeof(user_data))) {
+   pr_warn("copy_to_user failed\n");
+   goto free_user;
+   }
+
+   return user_addr;
+
+free_user:
+   vm_munmap(user_addr, PAGE_SIZE);
+   return 0;
+}
+
+static unsigned long *user_to_kernel(unsigned long user_addr)
+{
+   phys_addr_t phys_addr;
+   void *virt_addr;
+
+   phys_addr = user_virt_to_phys(user_addr);
+   if (!phys_addr) {
+   pr_warn("Failed to get physical address of user memory\n");
+   return NULL;
+   }
+
+   virt_addr = phys_to_virt(phys_addr);
+   if (phys_addr != virt_to_phys(virt_addr)) {
+   pr_warn("Physical address of user memory seems incorrect\n");
+   return NULL;
+   }
+
+   return virt_addr;
+}
+
+static void read_map(unsigned long *virt_addr)
+{
+   pr_info("Attempting bad read from kernel address %p\n", virt_addr);
+   if (*(unsigned long *)virt_addr == XPFO_DATA)
+   pr_err("FAIL: Bad read succeeded?!\n");
+   else
+   pr_err("FAIL: Bad read didn't fail but data is incorrect?!\n");
+}
+
+static void read_user_with_flags(unsigned long flags)
+{
+   unsigned long user_addr, *kernel;
+
+   user_addr = do_map(flags);
+   if (!user_addr) {
+   pr_err("FAIL: map failed\n");
+   return;
+   }
+
+   kernel = user_to_kernel(user_addr);
+   if (!kernel) {
+   pr_err("FAIL: user to kernel conversion failed\n");
+   goto free_user;
+   }
+
+   read_map(kernel);
+
+free_user:
+   vm_munmap(user_addr, PAGE_SIZE);
+}
+
+/* Read from userspace via the kernel's linear map. */
+void lkdtm_XPFO_READ_USER(void)
+{
+   read_user_with_flags(MAP_PRIVATE | MAP_ANONYMOUS);
+}
+
+void lkdtm_XPFO_READ_U

[RFC PATCH v7 13/16] xpfo, mm: optimize spinlock usage in xpfo_kunmap

2019-01-10 Thread Khalid Aziz
From: Julian Stecklina 

Only the xpfo_kunmap call that needs to actually unmap the page
needs to be serialized. We need to be careful to handle the case
where, after the atomic decrement of the mapcount, an xpfo_kmap()
has increased the mapcount again. In this case, we can safely skip
modifying the page table.

Model-checked with up to 4 concurrent callers with Spin.

Signed-off-by: Julian Stecklina 
Cc: x...@kernel.org
Cc: kernel-harden...@lists.openwall.com
Cc: Vasileios P. Kemerlis 
Cc: Juerg Haefliger 
Cc: Tycho Andersen 
Cc: Marco Benatto 
Cc: David Woodhouse 
Signed-off-by: Khalid Aziz 
---
 mm/xpfo.c | 22 --
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/mm/xpfo.c b/mm/xpfo.c
index cbfeafc2f10f..dbf20efb0499 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -149,22 +149,24 @@ void xpfo_kunmap(void *kaddr, struct page *page)
if (!PageXpfoUser(page))
return;
 
-   spin_lock(&page->xpfo_lock);
-
/*
 * The page is to be allocated back to user space, so unmap it from the
 * kernel, flush the TLB and tag it as a user page.
 */
	if (atomic_dec_return(&page->xpfo_mapcount) == 0) {
-#ifdef CONFIG_XPFO_DEBUG
-   BUG_ON(PageXpfoUnmapped(page));
-#endif
-   SetPageXpfoUnmapped(page);
-   set_kpte(kaddr, page, __pgprot(0));
-   xpfo_cond_flush_kernel_tlb(page, 0);
-   }
+   spin_lock(&page->xpfo_lock);
 
-   spin_unlock(&page->xpfo_lock);
+   /*
+* In the case, where we raced with kmap after the
+* atomic_dec_return, we must not nuke the mapping.
+*/
+   if (atomic_read(&page->xpfo_mapcount) == 0) {
+   SetPageXpfoUnmapped(page);
+   set_kpte(kaddr, page, __pgprot(0));
+   xpfo_cond_flush_kernel_tlb(page, 0);
+   }
+   spin_unlock(&page->xpfo_lock);
+   }
 }
 EXPORT_SYMBOL(xpfo_kunmap);
 
-- 
2.17.1



[RFC PATCH v7 07/16] arm64/mm, xpfo: temporarily map dcache regions

2019-01-10 Thread Khalid Aziz
From: Juerg Haefliger 

If the page is unmapped by XPFO, a data cache flush results in a fatal
page fault, so let's temporarily map the region, flush the cache, and then
unmap it.

v6: actually flush in the face of xpfo, and temporarily map the underlying
memory so it can be flushed correctly

CC: linux-arm-ker...@lists.infradead.org
Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
Signed-off-by: Khalid Aziz 
---
 arch/arm64/mm/flush.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/arch/arm64/mm/flush.c b/arch/arm64/mm/flush.c
index 30695a868107..f12f26b60319 100644
--- a/arch/arm64/mm/flush.c
+++ b/arch/arm64/mm/flush.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -28,9 +29,15 @@
 void sync_icache_aliases(void *kaddr, unsigned long len)
 {
unsigned long addr = (unsigned long)kaddr;
+   unsigned long num_pages = XPFO_NUM_PAGES(addr, len);
+   void *mapping[num_pages];
 
if (icache_is_aliasing()) {
+   xpfo_temp_map(kaddr, len, mapping,
+ sizeof(mapping[0]) * num_pages);
__clean_dcache_area_pou(kaddr, len);
+   xpfo_temp_unmap(kaddr, len, mapping,
+   sizeof(mapping[0]) * num_pages);
__flush_icache_all();
} else {
flush_icache_range(addr, addr + len);
-- 
2.17.1



[RFC PATCH v7 11/16] mm, x86: omit TLB flushing by default for XPFO page table modifications

2019-01-10 Thread Khalid Aziz
From: Julian Stecklina 

XPFO carries a large performance overhead. In my tests, I saw >40%
overhead for compiling a Linux kernel with XPFO enabled. The
frequent TLB flushes that XPFO performs are the root cause of much
of this overhead.

TLB flushing is required for full paranoia mode where we don't want
TLB entries of physmap pages to stick around potentially
indefinitely. In reality, though, these TLB entries are going to be
evicted pretty rapidly even without explicit flushing. That means
omitting TLB flushes only marginally lowers the security benefits of
XPFO. For kernel compile, omitting TLB flushes pushes the overhead
below 3%.

Change the default in XPFO to not flush TLBs unless the user
explicitly requests to do so using a kernel parameter.

Signed-off-by: Julian Stecklina 
Cc: x...@kernel.org
Cc: kernel-harden...@lists.openwall.com
Cc: Vasileios P. Kemerlis 
Cc: Juerg Haefliger 
Cc: Tycho Andersen 
Cc: Marco Benatto 
Cc: David Woodhouse 
Signed-off-by: Khalid Aziz 
---
 mm/xpfo.c | 37 +
 1 file changed, 29 insertions(+), 8 deletions(-)

diff --git a/mm/xpfo.c b/mm/xpfo.c
index 25fba05d01bd..e80374b0c78e 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -36,6 +36,7 @@ struct xpfo {
 };
 
 DEFINE_STATIC_KEY_FALSE(xpfo_inited);
+DEFINE_STATIC_KEY_FALSE(xpfo_do_tlb_flush);
 
 static bool xpfo_disabled __initdata;
 
@@ -46,7 +47,15 @@ static int __init noxpfo_param(char *str)
return 0;
 }
 
+static int __init xpfotlbflush_param(char *str)
+{
+   static_branch_enable(&xpfo_do_tlb_flush);
+
+   return 0;
+}
+
 early_param("noxpfo", noxpfo_param);
+early_param("xpfotlbflush", xpfotlbflush_param);
 
 static bool __init need_xpfo(void)
 {
@@ -76,6 +85,13 @@ bool __init xpfo_enabled(void)
 }
 EXPORT_SYMBOL(xpfo_enabled);
 
+
+static void xpfo_cond_flush_kernel_tlb(struct page *page, int order)
+{
+   if (static_branch_unlikely(&xpfo_do_tlb_flush))
+   xpfo_flush_kernel_tlb(page, order);
+}
+
 static inline struct xpfo *lookup_xpfo(struct page *page)
 {
struct page_ext *page_ext = lookup_page_ext(page);
@@ -114,12 +130,17 @@ void xpfo_alloc_pages(struct page *page, int order, gfp_t 
gfp)
 "xpfo: already mapped page being allocated\n");
 
if ((gfp & GFP_HIGHUSER) == GFP_HIGHUSER) {
-   /*
-* Tag the page as a user page and flush the TLB if it
-* was previously allocated to the kernel.
-*/
-   if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
-   flush_tlb = 1;
+   if (static_branch_unlikely(&xpfo_do_tlb_flush)) {
+   /*
+* Tag the page as a user page and flush the TLB if it
+* was previously allocated to the kernel.
+*/
+   if (!test_and_set_bit(XPFO_PAGE_USER, &xpfo->flags))
+   flush_tlb = 1;
+   } else {
+   set_bit(XPFO_PAGE_USER, &xpfo->flags);
+   }
+
} else {
/* Tag the page as a non-user (kernel) page */
		clear_bit(XPFO_PAGE_USER, &xpfo->flags);
@@ -127,7 +148,7 @@ void xpfo_alloc_pages(struct page *page, int order, gfp_t 
gfp)
}
 
if (flush_tlb)
-   xpfo_flush_kernel_tlb(page, order);
+   xpfo_cond_flush_kernel_tlb(page, order);
 }
 
 void xpfo_free_pages(struct page *page, int order)
@@ -221,7 +242,7 @@ void xpfo_kunmap(void *kaddr, struct page *page)
 "xpfo: unmapping already unmapped page\n");
	set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
set_kpte(kaddr, page, __pgprot(0));
-   xpfo_flush_kernel_tlb(page, 0);
+   xpfo_cond_flush_kernel_tlb(page, 0);
}
 
	spin_unlock(&xpfo->maplock);
-- 
2.17.1



[RFC PATCH v7 12/16] xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION

2019-01-10 Thread Khalid Aziz
From: Julian Stecklina 

Instead of using the page extension debug feature, encode all the
information we need for XPFO in struct page. This allows us to get rid
of some checks in the hot paths, and there are no longer any pages that
are allocated before XPFO is enabled.

Also make debugging aids configurable for maximum performance.

Signed-off-by: Julian Stecklina 
Cc: x...@kernel.org
Cc: kernel-harden...@lists.openwall.com
Cc: Vasileios P. Kemerlis 
Cc: Juerg Haefliger 
Cc: Tycho Andersen 
Cc: Marco Benatto 
Cc: David Woodhouse 
Signed-off-by: Khalid Aziz 
---
 include/linux/mm_types.h   |   8 ++
 include/linux/page-flags.h |  13 +++
 include/linux/xpfo.h   |   3 +-
 include/trace/events/mmflags.h |  10 +-
 mm/page_alloc.c|   3 +-
 mm/page_ext.c  |   4 -
 mm/xpfo.c  | 162 -
 security/Kconfig   |  12 ++-
 8 files changed, 81 insertions(+), 134 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 2c471a2c43fa..d17d33f36a01 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -204,6 +204,14 @@ struct page {
 #ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
int _last_cpupid;
 #endif
+
+#ifdef CONFIG_XPFO
+   /* Counts the number of times this page has been kmapped. */
+   atomic_t xpfo_mapcount;
+
+   /* Serialize kmap/kunmap of this page */
+   spinlock_t xpfo_lock;
+#endif
 } _struct_page_alignment;
 
 /*
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 50ce1bddaf56..a532063f27b5 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -101,6 +101,10 @@ enum pageflags {
 #if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
PG_young,
PG_idle,
+#endif
+#ifdef CONFIG_XPFO
+   PG_xpfo_user,   /* Page is allocated to user-space */
+   PG_xpfo_unmapped,   /* Page is unmapped from the linear map */
 #endif
__NR_PAGEFLAGS,
 
@@ -398,6 +402,15 @@ TESTCLEARFLAG(Young, young, PF_ANY)
 PAGEFLAG(Idle, idle, PF_ANY)
 #endif
 
+#ifdef CONFIG_XPFO
+PAGEFLAG(XpfoUser, xpfo_user, PF_ANY)
+TESTCLEARFLAG(XpfoUser, xpfo_user, PF_ANY)
+TESTSETFLAG(XpfoUser, xpfo_user, PF_ANY)
+PAGEFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+TESTCLEARFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+TESTSETFLAG(XpfoUnmapped, xpfo_unmapped, PF_ANY)
+#endif
+
 /*
  * On an anonymous page mapped into a user virtual memory area,
  * page->mapping points to its anon_vma, not to a struct address_space;
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index d4b38ab8a633..ea512f49 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -27,7 +27,7 @@ struct page;
 
 #include 
 
-extern struct page_ext_operations page_xpfo_ops;
+void xpfo_init_single_page(struct page *page);
 
 void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
 void xpfo_dma_map_unmap_area(bool map, const void *addr, size_t size,
@@ -56,6 +56,7 @@ phys_addr_t user_virt_to_phys(unsigned long addr);
 
 #else /* !CONFIG_XPFO */
 
+static inline void xpfo_init_single_page(struct page *page) { }
 static inline void xpfo_kmap(void *kaddr, struct page *page) { }
 static inline void xpfo_kunmap(void *kaddr, struct page *page) { }
 static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { 
}
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index a1675d43777e..6bb000bb366f 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -79,6 +79,12 @@
 #define IF_HAVE_PG_IDLE(flag,string)
 #endif
 
+#ifdef CONFIG_XPFO
+#define IF_HAVE_PG_XPFO(flag,string) ,{1UL << flag, string}
+#else
+#define IF_HAVE_PG_XPFO(flag,string)
+#endif
+
 #define __def_pageflag_names   \
{1UL << PG_locked,  "locked"},  \
{1UL << PG_waiters, "waiters"   },  \
@@ -105,7 +111,9 @@ IF_HAVE_PG_MLOCK(PG_mlocked,"mlocked"   
)   \
 IF_HAVE_PG_UNCACHED(PG_uncached,   "uncached"  )   \
 IF_HAVE_PG_HWPOISON(PG_hwpoison,   "hwpoison"  )   \
 IF_HAVE_PG_IDLE(PG_young,  "young" )   \
-IF_HAVE_PG_IDLE(PG_idle,   "idle"  )
+IF_HAVE_PG_IDLE(PG_idle,   "idle"  )   \
+IF_HAVE_PG_XPFO(PG_xpfo_user,  "xpfo_user" )   \
+IF_HAVE_PG_XPFO(PG_xpfo_unmapped,  "xpfo_unmapped" )   \
 
 #define show_page_flags(flags) \
(flags) ? __print_flags(flags, "|", \
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 08e277790b5f..d00382b20001 100644
--- a/mm/

[RFC PATCH v7 15/16] xpfo, mm: Fix hang when booting with "xpfotlbflush"

2019-01-10 Thread Khalid Aziz
Kernel hangs when booted up with the "xpfotlbflush" option. This is caused
by xpfo_kunmap() flushing the TLB while holding the xpfo lock, starving other
tasks waiting for the lock. This patch moves the TLB flush outside of the
code holding the xpfo lock.

Signed-off-by: Khalid Aziz 
---
 mm/xpfo.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/xpfo.c b/mm/xpfo.c
index 85079377c91d..79ffdba6af69 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -148,6 +148,8 @@ EXPORT_SYMBOL(xpfo_kmap);
 
 void xpfo_kunmap(void *kaddr, struct page *page)
 {
+   bool flush_tlb = false;
+
	if (!static_branch_unlikely(&xpfo_inited))
return;
 
@@ -168,10 +170,13 @@ void xpfo_kunmap(void *kaddr, struct page *page)
	if (atomic_read(&page->xpfo_mapcount) == 0) {
SetPageXpfoUnmapped(page);
set_kpte(kaddr, page, __pgprot(0));
-   xpfo_cond_flush_kernel_tlb(page, 0);
+   flush_tlb = true;
}
	spin_unlock(&page->xpfo_lock);
}
+
+   if (flush_tlb)
+   xpfo_cond_flush_kernel_tlb(page, 0);
 }
 EXPORT_SYMBOL(xpfo_kunmap);
 
-- 
2.17.1



[RFC PATCH v7 08/16] arm64/mm: disable section/contiguous mappings if XPFO is enabled

2019-01-10 Thread Khalid Aziz
From: Tycho Andersen 

XPFO doesn't support section/contiguous mappings yet, so let's disable it
if XPFO is turned on.

Thanks to Laura Abbot for the simplification from v5, and Mark Rutland for
pointing out we need NO_CONT_MAPPINGS too.

CC: linux-arm-ker...@lists.infradead.org
Signed-off-by: Tycho Andersen 
Signed-off-by: Khalid Aziz 
---
 arch/arm64/mm/mmu.c  | 2 +-
 include/linux/xpfo.h | 4 
 mm/xpfo.c| 6 ++
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index d1d6601b385d..f4dd27073006 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -451,7 +451,7 @@ static void __init map_mem(pgd_t *pgdp)
struct memblock_region *reg;
int flags = 0;
 
-   if (debug_pagealloc_enabled())
+   if (debug_pagealloc_enabled() || xpfo_enabled())
flags = NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;
 
/*
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 2682a00ebbcb..0c26836a24e1 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -46,6 +46,8 @@ void xpfo_temp_map(const void *addr, size_t size, void 
**mapping,
 void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
 size_t mapping_len);
 
+bool xpfo_enabled(void);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -68,6 +70,8 @@ static inline void xpfo_temp_unmap(const void *addr, size_t 
size,
 }
 
 
+static inline bool xpfo_enabled(void) { return false; }
+
 #endif /* CONFIG_XPFO */
 
 #endif /* _LINUX_XPFO_H */
diff --git a/mm/xpfo.c b/mm/xpfo.c
index f79075bf7d65..25fba05d01bd 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -70,6 +70,12 @@ struct page_ext_operations page_xpfo_ops = {
.init = init_xpfo,
 };
 
+bool __init xpfo_enabled(void)
+{
+   return !xpfo_disabled;
+}
+EXPORT_SYMBOL(xpfo_enabled);
+
 static inline struct xpfo *lookup_xpfo(struct page *page)
 {
struct page_ext *page_ext = lookup_page_ext(page);
-- 
2.17.1



[RFC PATCH v7 05/16] arm64/mm: Add support for XPFO

2019-01-10 Thread Khalid Aziz
From: Juerg Haefliger 

Enable support for eXclusive Page Frame Ownership (XPFO) for arm64 and
provide a hook for updating a single kernel page table entry (which is
required by the generic XPFO code).

v6: use flush_tlb_kernel_range() instead of __flush_tlb_one()

CC: linux-arm-ker...@lists.infradead.org
Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
Signed-off-by: Khalid Aziz 
---
 arch/arm64/Kconfig |  1 +
 arch/arm64/mm/Makefile |  2 ++
 arch/arm64/mm/xpfo.c   | 58 ++
 3 files changed, 61 insertions(+)
 create mode 100644 arch/arm64/mm/xpfo.c

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index ea2ab0330e3a..f0a9c0007d23 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -171,6 +171,7 @@ config ARM64
select SWIOTLB
select SYSCTL_EXCEPTION_TRACE
select THREAD_INFO_IN_TASK
+   select ARCH_SUPPORTS_XPFO
help
  ARM 64-bit (AArch64) Linux support.
 
diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
index 849c1df3d214..cca3808d9776 100644
--- a/arch/arm64/mm/Makefile
+++ b/arch/arm64/mm/Makefile
@@ -12,3 +12,5 @@ KASAN_SANITIZE_physaddr.o += n
 
 obj-$(CONFIG_KASAN)+= kasan_init.o
 KASAN_SANITIZE_kasan_init.o:= n
+
+obj-$(CONFIG_XPFO) += xpfo.o
diff --git a/arch/arm64/mm/xpfo.c b/arch/arm64/mm/xpfo.c
new file mode 100644
index ..678e2be848eb
--- /dev/null
+++ b/arch/arm64/mm/xpfo.c
@@ -0,0 +1,58 @@
+/*
+ * Copyright (C) 2017 Hewlett Packard Enterprise Development, L.P.
+ * Copyright (C) 2016 Brown University. All rights reserved.
+ *
+ * Authors:
+ *   Juerg Haefliger 
+ *   Vasileios P. Kemerlis 
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ */
+
+#include 
+#include 
+
+#include 
+
+/*
+ * Lookup the page table entry for a virtual address and return a pointer to
+ * the entry. Based on x86 tree.
+ */
+static pte_t *lookup_address(unsigned long addr)
+{
+   pgd_t *pgd;
+   pud_t *pud;
+   pmd_t *pmd;
+
+   pgd = pgd_offset_k(addr);
+   if (pgd_none(*pgd))
+   return NULL;
+
+   pud = pud_offset(pgd, addr);
+   if (pud_none(*pud))
+   return NULL;
+
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return NULL;
+
+   return pte_offset_kernel(pmd, addr);
+}
+
+/* Update a single kernel page table entry */
+inline void set_kpte(void *kaddr, struct page *page, pgprot_t prot)
+{
+   pte_t *pte = lookup_address((unsigned long)kaddr);
+
+   set_pte(pte, pfn_pte(page_to_pfn(page), prot));
+}
+
+inline void xpfo_flush_kernel_tlb(struct page *page, int order)
+{
+   unsigned long kaddr = (unsigned long)page_address(page);
+   unsigned long size = PAGE_SIZE;
+
+   flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
+}
-- 
2.17.1



[RFC PATCH v7 16/16] xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

2019-01-10 Thread Khalid Aziz
XPFO flushes kernel space TLB entries for pages that are now mapped
in userspace on not only the current CPU but also all other CPUs.
If the number of TLB entries to flush exceeds
tlb_single_page_flush_ceiling, this results in the entire TLB being
flushed on all CPUs. A malicious userspace app can exploit the
dual mapping of a physical page caused by physmap only on the CPU
it is running on. There is no good reason to incur the very high
cost of TLB flush on CPUs that may never run the malicious app or
do not have any TLB entries for the malicious app. The cost of
full TLB flush goes up dramatically on machines with high core
count.

This patch flushes the relevant TLB entries for the current process,
or the entire TLB depending upon the number of entries, on the current
CPU and posts a pending TLB flush on all other CPUs when a page is
unmapped from kernel space and mapped in userspace. This pending
TLB flush is posted for each task separately and TLB is flushed on
a CPU when a task is scheduled on it that has a pending TLB flush
posted for that CPU. This patch does two things - (1) it
potentially aggregates multiple TLB flushes into one, and (2) it
avoids TLB flush on CPUs that never run the task that caused a TLB
flush. This has very significant impact especially on machines
with large core counts. To illustrate this, kernel was compiled
with -j on two classes of machines - a server with high core count
and large amount of memory, and a desktop class machine with more
modest specs. System time from "make -j" from vanilla 4.20 kernel,
4.20 with XPFO patches before applying this patch and after
applying this patch are below:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

4.20                            915.183s
4.20+XPFO                       24129.354s      26.366x
4.20+XPFO+Deferred flush        1216.987s       1.330x

Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

4.20                            607.671s
4.20+XPFO                       1588.646s       2.614x
4.20+XPFO+Deferred flush        794.473s        1.307x

This patch could use more optimization. For instance, it posts a
pending full TLB flush for other CPUs even when the number of TLB
entries being flushed does not exceed tlb_single_page_flush_ceiling.
Batching more TLB entry flushes, as was suggested for earlier
version of these patches, can help reduce these cases. This same
code should be implemented for other architectures as well once
finalized.

Signed-off-by: Khalid Aziz 
---
 arch/x86/include/asm/tlbflush.h |  1 +
 arch/x86/mm/tlb.c   | 27 +++
 arch/x86/mm/xpfo.c  |  2 +-
 include/linux/sched.h   |  9 +
 4 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index f4204bf377fc..92d23629d01d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -561,6 +561,7 @@ extern void flush_tlb_mm_range(struct mm_struct *mm, 
unsigned long start,
unsigned long end, unsigned int stride_shift,
bool freed_tables);
 extern void flush_tlb_kernel_range(unsigned long start, unsigned long end);
+extern void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end);
 
 static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 03b6b4c2238d..b04a501c850b 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -319,6 +319,15 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
__flush_tlb_all();
}
 #endif
+
+   /* If there is a pending TLB flush for this CPU due to XPFO
+* flush, do it now.
+*/
+   if (tsk && cpumask_test_and_clear_cpu(cpu, &tsk->pending_xpfo_flush)) {
+   count_vm_tlb_event(NR_TLB_REMOTE_FLUSH_RECEIVED);
+   __flush_tlb_all();
+   }
+
this_cpu_write(cpu_tlbstate.is_lazy, false);
 
/*
@@ -801,6 +810,24 @@ void flush_tlb_kernel_range(unsigned long start, unsigned 
long end)
}
 }
 
+void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
+{
+
+   /* Balance as user space task's flush, a bit conservative */
+   if (end == TLB_FLUSH_ALL ||
+   (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
+   do_flush_tlb_all(NULL);
+   } else {
+   struct flush_tlb_info info;
+
+   info.start = start;
+   info.end = end;
+   do_kernel_range_flush(&info);
+   }
+   cpumask_setall(&current->pending_xpfo_flush);
+   cpumask_clear_cpu(smp_processor_id(), &current->pending_xpfo_flush);
+}
+
 void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
 {
struct flush_tlb_info info = {
diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
ind

[RFC PATCH v7 01/16] mm: add MAP_HUGETLB support to vm_mmap

2019-01-10 Thread Khalid Aziz
From: Tycho Andersen 

vm_mmap is exported, which means kernel modules can use it. In particular,
for testing XPFO support, we want to use it with the MAP_HUGETLB flag, so
let's support it via vm_mmap.

Signed-off-by: Tycho Andersen 
Tested-by: Marco Benatto 
Signed-off-by: Khalid Aziz 
---
 include/linux/mm.h |  2 ++
 mm/mmap.c  | 19 +--
 mm/util.c  | 32 
 3 files changed, 35 insertions(+), 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 5411de93a363..30bddc7b3c75 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2361,6 +2361,8 @@ struct vm_unmapped_area_info {
 extern unsigned long unmapped_area(struct vm_unmapped_area_info *info);
 extern unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info);
 
+struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags);
+
 /*
  * Search for an unmapped address range.
  *
diff --git a/mm/mmap.c b/mm/mmap.c
index 6c04292e16a7..c668d7d27c2b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1582,24 +1582,7 @@ unsigned long ksys_mmap_pgoff(unsigned long addr, 
unsigned long len,
if (unlikely(flags & MAP_HUGETLB && !is_file_hugepages(file)))
goto out_fput;
} else if (flags & MAP_HUGETLB) {
-   struct user_struct *user = NULL;
-   struct hstate *hs;
-
-   hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
-   if (!hs)
-   return -EINVAL;
-
-   len = ALIGN(len, huge_page_size(hs));
-   /*
-* VM_NORESERVE is used because the reservations will be
-* taken when vm_ops->mmap() is called
-* A dummy user value is used because we are not locking
-* memory so no accounting is necessary
-*/
-   file = hugetlb_file_setup(HUGETLB_ANON_FILE, len,
-   VM_NORESERVE,
-   &user, HUGETLB_ANONHUGE_INODE,
-   (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+   file = map_hugetlb_setup(&len, flags);
if (IS_ERR(file))
return PTR_ERR(file);
}
diff --git a/mm/util.c b/mm/util.c
index 8bf08b5b5760..536c14cf88ba 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -357,6 +357,29 @@ unsigned long vm_mmap_pgoff(struct file *file, unsigned 
long addr,
return ret;
 }
 
+struct file *map_hugetlb_setup(unsigned long *len, unsigned long flags)
+{
+   struct user_struct *user = NULL;
+   struct hstate *hs;
+
+   hs = hstate_sizelog((flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+   if (!hs)
+   return ERR_PTR(-EINVAL);
+
+   *len = ALIGN(*len, huge_page_size(hs));
+
+   /*
+* VM_NORESERVE is used because the reservations will be
+* taken when vm_ops->mmap() is called
+* A dummy user value is used because we are not locking
+* memory so no accounting is necessary
+*/
+   return hugetlb_file_setup(HUGETLB_ANON_FILE, *len,
+   VM_NORESERVE,
+   &user, HUGETLB_ANONHUGE_INODE,
+   (flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK);
+}
+
 unsigned long vm_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot,
unsigned long flag, unsigned long offset)
@@ -366,6 +389,15 @@ unsigned long vm_mmap(struct file *file, unsigned long 
addr,
if (unlikely(offset_in_page(offset)))
return -EINVAL;
 
+   if (flag & MAP_HUGETLB) {
+   if (file)
+   return -EINVAL;
+
+   file = map_hugetlb_setup(&len, flag);
+   if (IS_ERR(file))
+   return PTR_ERR(file);
+   }
+
return vm_mmap_pgoff(file, addr, len, prot, flag, offset >> PAGE_SHIFT);
 }
 EXPORT_SYMBOL(vm_mmap);
-- 
2.17.1



[RFC PATCH v7 14/16] EXPERIMENTAL: xpfo, mm: optimize spin lock usage in xpfo_kmap

2019-01-10 Thread Khalid Aziz
From: Julian Stecklina 

We can reduce spin lock usage in xpfo_kmap to the 0->1 transition of
the mapcount. This means that xpfo_kmap() can now race and that we
get spurious page faults.

The page fault handler helps the system make forward progress by
fixing the page table instead of allowing repeated page faults until
the right xpfo_kmap went through.

Model-checked with up to 4 concurrent callers with Spin.

Signed-off-by: Julian Stecklina 
Cc: x...@kernel.org
Cc: kernel-harden...@lists.openwall.com
Cc: Vasileios P. Kemerlis 
Cc: Juerg Haefliger 
Cc: Tycho Andersen 
Cc: Marco Benatto 
Cc: David Woodhouse 
Signed-off-by: Khalid Aziz 
---
 arch/x86/mm/fault.c  |  4 
 include/linux/xpfo.h |  4 
 mm/xpfo.c| 50 +---
 3 files changed, 51 insertions(+), 7 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index ba51652fbd33..207081dcd572 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -18,6 +18,7 @@
 #include  /* faulthandler_disabled()  */
 #include  /* efi_recover_from_page_fault()*/
 #include 
+#include 
 
 #include /* boot_cpu_has, ...*/
 #include  /* dotraplinkage, ...   */
@@ -1218,6 +1219,9 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long 
hw_error_code,
if (kprobes_fault(regs))
return;
 
+   if (xpfo_spurious_fault(address))
+   return;
+
/*
 * Note, despite being a "bad area", there are quite a few
 * acceptable reasons to get here, such as erratum fixups
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index ea512f49..58dd243637d2 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -54,6 +54,8 @@ bool xpfo_enabled(void);
 
 phys_addr_t user_virt_to_phys(unsigned long addr);
 
+bool xpfo_spurious_fault(unsigned long addr);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_init_single_page(struct page *page) { }
@@ -81,6 +83,8 @@ static inline bool xpfo_enabled(void) { return false; }
 
 static inline phys_addr_t user_virt_to_phys(unsigned long addr) { return 0; }
 
+static inline bool xpfo_spurious_fault(unsigned long addr) { return false; }
+
 #endif /* CONFIG_XPFO */
 
 #endif /* _LINUX_XPFO_H */
diff --git a/mm/xpfo.c b/mm/xpfo.c
index dbf20efb0499..85079377c91d 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -119,6 +119,16 @@ void xpfo_free_pages(struct page *page, int order)
}
 }
 
+static void xpfo_do_map(void *kaddr, struct page *page)
+{
+   spin_lock(&page->xpfo_lock);
+   if (PageXpfoUnmapped(page)) {
+   set_kpte(kaddr, page, PAGE_KERNEL);
+   ClearPageXpfoUnmapped(page);
+   }
+   spin_unlock(&page->xpfo_lock);
+}
+
 void xpfo_kmap(void *kaddr, struct page *page)
 {
	if (!static_branch_unlikely(&xpfo_inited))
@@ -127,17 +137,12 @@ void xpfo_kmap(void *kaddr, struct page *page)
if (!PageXpfoUser(page))
return;
 
-   spin_lock(&page->xpfo_lock);
-
/*
 * The page was previously allocated to user space, so map it back
 * into the kernel. No TLB flush required.
 */
-   if ((atomic_inc_return(&page->xpfo_mapcount) == 1) &&
-   TestClearPageXpfoUnmapped(page))
-   set_kpte(kaddr, page, PAGE_KERNEL);
-
-   spin_unlock(&page->xpfo_lock);
+   if (atomic_inc_return(&page->xpfo_mapcount) == 1)
+   xpfo_do_map(kaddr, page);
 }
 EXPORT_SYMBOL(xpfo_kmap);
 
@@ -204,3 +209,34 @@ void xpfo_temp_unmap(const void *addr, size_t size, void 
**mapping,
kunmap_atomic(mapping[i]);
 }
 EXPORT_SYMBOL(xpfo_temp_unmap);
+
+bool xpfo_spurious_fault(unsigned long addr)
+{
+   struct page *page;
+   bool spurious;
+   int mapcount;
+
+   if (!static_branch_unlikely(&xpfo_inited))
+   return false;
+
+   /* XXX Is this sufficient to guard against calling virt_to_page() on a
+* virtual address that has no corresponding struct page? */
+   if (!virt_addr_valid(addr))
+   return false;
+
+   page = virt_to_page(addr);
+   mapcount = atomic_read(&page->xpfo_mapcount);
+   spurious = PageXpfoUser(page) && mapcount;
+
+   /* Guarantee forward progress in case xpfo_kmap() raced. */
+   if (spurious && PageXpfoUnmapped(page)) {
+   xpfo_do_map((void *)(addr & PAGE_MASK), page);
+   }
+
+   if (unlikely(!spurious))
+   printk("XPFO non-spurious fault %lx user=%d unmapped=%d 
mapcount=%d\n",
+   addr, PageXpfoUser(page), PageXpfoUnmapped(page),
+   mapcount);
+
+   return spurious;
+}
-- 
2.17.1



[RFC PATCH v7 04/16] swiotlb: Map the buffer if it was unmapped by XPFO

2019-01-10 Thread Khalid Aziz
From: Juerg Haefliger 

v6: * guard against lookup_xpfo() returning NULL

CC: Konrad Rzeszutek Wilk 
Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
Signed-off-by: Khalid Aziz 
---
 include/linux/xpfo.h |  4 
 kernel/dma/swiotlb.c |  3 ++-
 mm/xpfo.c| 15 +++
 3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index a39259ce0174..e38b823f44e3 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -35,6 +35,8 @@ void xpfo_kunmap(void *kaddr, struct page *page);
 void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp);
 void xpfo_free_pages(struct page *page, int order);
 
+bool xpfo_page_is_unmapped(struct page *page);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -42,6 +44,8 @@ static inline void xpfo_kunmap(void *kaddr, struct page 
*page) { }
 static inline void xpfo_alloc_pages(struct page *page, int order, gfp_t gfp) { 
}
 static inline void xpfo_free_pages(struct page *page, int order) { }
 
+static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
+
 #endif /* CONFIG_XPFO */
 
 #endif /* _LINUX_XPFO_H */
diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index 045930e32c0e..820a54b57491 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -396,8 +396,9 @@ static void swiotlb_bounce(phys_addr_t orig_addr, 
phys_addr_t tlb_addr,
 {
unsigned long pfn = PFN_DOWN(orig_addr);
unsigned char *vaddr = phys_to_virt(tlb_addr);
+   struct page *page = pfn_to_page(pfn);
 
-   if (PageHighMem(pfn_to_page(pfn))) {
+   if (PageHighMem(page) || xpfo_page_is_unmapped(page)) {
/* The buffer does not have a mapping.  Map it in and copy */
unsigned int offset = orig_addr & ~PAGE_MASK;
char *buffer;
diff --git a/mm/xpfo.c b/mm/xpfo.c
index bff24afcaa2e..cdbcbac582d5 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -220,3 +220,18 @@ void xpfo_kunmap(void *kaddr, struct page *page)
	spin_unlock(&xpfo->maplock);
 }
 EXPORT_SYMBOL(xpfo_kunmap);
+
+bool xpfo_page_is_unmapped(struct page *page)
+{
+   struct xpfo *xpfo;
+
	if (!static_branch_unlikely(&xpfo_inited))
+   return false;
+
+   xpfo = lookup_xpfo(page);
+   if (unlikely(!xpfo) && !xpfo->inited)
+   return false;
+
+   return test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
+}
+EXPORT_SYMBOL(xpfo_page_is_unmapped);
-- 
2.17.1



[RFC PATCH v7 06/16] xpfo: add primitives for mapping underlying memory

2019-01-10 Thread Khalid Aziz
From: Tycho Andersen 

In some cases (on arm64 DMA and data cache flushes) we may have unmapped
the underlying pages needed for something via XPFO. Here are some
primitives useful for ensuring the underlying memory is mapped/unmapped in
the face of xpfo.

Signed-off-by: Tycho Andersen 
Signed-off-by: Khalid Aziz 
---
 include/linux/xpfo.h | 22 ++
 mm/xpfo.c| 30 ++
 2 files changed, 52 insertions(+)

diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index e38b823f44e3..2682a00ebbcb 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -37,6 +37,15 @@ void xpfo_free_pages(struct page *page, int order);
 
 bool xpfo_page_is_unmapped(struct page *page);
 
+#define XPFO_NUM_PAGES(addr, size) \
+   (PFN_UP((unsigned long) (addr) + (size)) - \
+   PFN_DOWN((unsigned long) (addr)))
+
+void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+  size_t mapping_len);
+void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
+size_t mapping_len);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -46,6 +55,19 @@ static inline void xpfo_free_pages(struct page *page, int 
order) { }
 
 static inline bool xpfo_page_is_unmapped(struct page *page) { return false; }
 
+#define XPFO_NUM_PAGES(addr, size) 0
+
+static inline void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+size_t mapping_len)
+{
+}
+
+static inline void xpfo_temp_unmap(const void *addr, size_t size,
+  void **mapping, size_t mapping_len)
+{
+}
+
+
 #endif /* CONFIG_XPFO */
 
 #endif /* _LINUX_XPFO_H */
diff --git a/mm/xpfo.c b/mm/xpfo.c
index cdbcbac582d5..f79075bf7d65 100644
--- a/mm/xpfo.c
+++ b/mm/xpfo.c
@@ -13,6 +13,7 @@
  * the Free Software Foundation.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -235,3 +236,32 @@ bool xpfo_page_is_unmapped(struct page *page)
	return test_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
 }
 EXPORT_SYMBOL(xpfo_page_is_unmapped);
+
+void xpfo_temp_map(const void *addr, size_t size, void **mapping,
+  size_t mapping_len)
+{
+   struct page *page = virt_to_page(addr);
+   int i, num_pages = mapping_len / sizeof(mapping[0]);
+
+   memset(mapping, 0, mapping_len);
+
+   for (i = 0; i < num_pages; i++) {
+   if (page_to_virt(page + i) >= addr + size)
+   break;
+
+   if (xpfo_page_is_unmapped(page + i))
+   mapping[i] = kmap_atomic(page + i);
+   }
+}
+EXPORT_SYMBOL(xpfo_temp_map);
+
+void xpfo_temp_unmap(const void *addr, size_t size, void **mapping,
+size_t mapping_len)
+{
+   int i, num_pages = mapping_len / sizeof(mapping[0]);
+
+   for (i = 0; i < num_pages; i++)
+   if (mapping[i])
+   kunmap_atomic(mapping[i]);
+}
+EXPORT_SYMBOL(xpfo_temp_unmap);
-- 
2.17.1



[RFC PATCH v7 00/16] Add support for eXclusive Page Frame Ownership

2019-01-10 Thread Khalid Aziz
I am continuing to build on the work Juerg, Tycho and Julian have done
on XPFO. After the last round of updates, we were seeing very
significant performance penalties when stale TLB entries were flushed
actively after an XPFO TLB update. Benchmark for measuring performance
is kernel build using parallel make. To get full protection from
ret2dir attacks, we must flush stale TLB entries. Performance
penalty from flushing stale TLB entries goes up as the number of
cores goes up. On a desktop class machine with only 4 cores,
enabling TLB flush for stale entries causes system time for "make
-j4" to go up by a factor of 2.614x but on a larger machine with 96
cores, system time with "make -j60" goes up by a factor of 26.366x!
I have been working on reducing this performance penalty.

I implemented a solution to reduce performance penalty and
that has had large impact. When XPFO code flushes stale TLB entries,
it does so for all CPUs on the system which may include CPUs that
may not have any matching TLB entries or may never be scheduled to
run the userspace task causing the TLB flush. The problem is made
worse by the fact that if the number of entries being flushed exceeds
tlb_single_page_flush_ceiling, it results in a full TLB flush on
every CPU. A rogue process can launch a ret2dir attack only from a
CPU that has dual mapping for its pages in physmap in its TLB. We
can hence defer TLB flush on a CPU until a process that would have
caused a TLB flush is scheduled on that CPU. I have added a cpumask
to task_struct which is then used to post pending TLB flush on CPUs
other than the one a process is running on. This cpumask is checked
when a process migrates to a new CPU and TLB is flushed at that
time. I measured system time for parallel make with unmodified 4.20
kernel, 4.20 with XPFO patches before this optimization and then
again after applying this optimization. Here are the results:

Hardware: 96-core Intel Xeon Platinum 8160 CPU @ 2.10GHz, 768 GB RAM
make -j60 all

4.20                            915.183s
4.20+XPFO                       24129.354s      26.366x
4.20+XPFO+Deferred flush        1216.987s       1.330x


Hardware: 4-core Intel Core i5-3550 CPU @ 3.30GHz, 8G RAM
make -j4 all

4.20                            607.671s
4.20+XPFO                       1588.646s       2.614x
4.20+XPFO+Deferred flush        794.473s        1.307x

30+% overhead is still very high and there is room for improvement.
Dave Hansen had suggested batch updating TLB entries and Tycho had
created an initial implementation but I have not been able to get
that to work correctly. I am still working on it and I suspect we
will see a noticeable improvement in performance with that. In the
code I added, I post a pending full TLB flush to all other CPUs even
when number of TLB entries being flushed on current CPU does not
exceed tlb_single_page_flush_ceiling. There has to be a better way
to do this. I just haven't found an efficient way to implement
delayed, limited TLB flushes on other CPUs.

I am not entirely sure if switch_mm_irqs_off() is indeed the right
place to perform the pending TLB flush for a CPU. Any feedback on
that will be very helpful. Delaying full TLB flushes on other CPUs
seems to help tremendously, so if there is a better way to implement
the same thing than what I have done in patch 16, I am open to
ideas.
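
To make the mechanism easier to review without digging through the diff,
here is a condensed sketch of what patch 16 does (pending_xpfo_flush is
the cpumask this series adds to task_struct; the helper name
xpfo_consume_pending_flush below is just for illustration, the real code
open-codes the check in switch_mm_irqs_off()):

    /* Posting side: called when XPFO unmaps a page from the kernel. */
    void xpfo_flush_tlb_kernel_range(unsigned long start, unsigned long end)
    {
            /* Flush the affected entries (or everything) on this CPU now. */
            if (end == TLB_FLUSH_ALL ||
                (end - start) > tlb_single_page_flush_ceiling << PAGE_SHIFT) {
                    do_flush_tlb_all(NULL);
            } else {
                    struct flush_tlb_info info = { .start = start, .end = end };

                    do_kernel_range_flush(&info);
            }

            /* Post a pending flush for all other CPUs instead of an IPI. */
            cpumask_setall(&current->pending_xpfo_flush);
            cpumask_clear_cpu(smp_processor_id(), &current->pending_xpfo_flush);
    }

    /* Consuming side: run from switch_mm_irqs_off() when a task is
     * scheduled in; a posted flush is honored at most once per CPU.
     */
    static void xpfo_consume_pending_flush(struct task_struct *tsk, int cpu)
    {
            if (tsk && cpumask_test_and_clear_cpu(cpu, &tsk->pending_xpfo_flush))
                    __flush_tlb_all();
    }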

Performance with this patch set is good enough to use these as
starting point for further refinement before we merge it into main
kernel, hence RFC.

Since not flushing stale TLB entries creates a false sense of
security, I would recommend making TLB flush mandatory and eliminate
the "xpfotlbflush" kernel parameter (patch "mm, x86: omit TLB
flushing by default for XPFO page table modifications").

What remains to be done beyond this patch series:

1. Performance improvements
2. Remove xpfotlbflush parameter
3. Re-evaluate the patch "arm64/mm: Add support for XPFO to swiotlb"
   from Juerg. I dropped it for now since swiotlb code for ARM has
   changed a lot in 4.20.
4. Extend the patch "xpfo, mm: Defer TLB flushes for non-current
   CPUs" to other architectures besides x86.


-

Juerg Haefliger (5):
  mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)
  swiotlb: Map the buffer if it was unmapped by XPFO
  arm64/mm: Add support for XPFO
  arm64/mm, xpfo: temporarily map dcache regions
  lkdtm: Add test for XPFO

Julian Stecklina (4):
  mm, x86: omit TLB flushing by default for XPFO page table
modifications
  xpfo, mm: remove dependency on CONFIG_PAGE_EXTENSION
  xpfo, mm: optimize spinlock usage in xpfo_kunmap
  EXPERIMENTAL: xpfo, mm: optimize spin lock usage in xpfo_kmap

Khalid Aziz (2):
  xpfo, mm: Fix hang when booting with "xpfotlbflush"
  xpfo, mm: Defer TLB flushes for non-current CPUs (x86 only)

Tycho Andersen (5):
  mm: add MAP_HUGETLB support to vm_mmap
  x86: always se

[RFC PATCH v7 03/16] mm, x86: Add support for eXclusive Page Frame Ownership (XPFO)

2019-01-10 Thread Khalid Aziz
From: Juerg Haefliger 

This patch adds support for XPFO which protects against 'ret2dir' kernel
attacks. The basic idea is to enforce exclusive ownership of page frames
by either the kernel or userspace, unless explicitly requested by the
kernel. Whenever a page destined for userspace is allocated, it is
unmapped from physmap (the kernel's page table). When such a page is
reclaimed from userspace, it is mapped back to physmap.

Additional fields in the page_ext struct are used for XPFO housekeeping,
specifically:
  - two flags to distinguish user vs. kernel pages and to tag unmapped
pages.
  - a reference counter to balance kmap/kunmap operations.
  - a lock to serialize access to the XPFO fields.
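
For orientation while reading the diff below, the housekeeping state has
roughly the following shape (a sketch reconstructed from the hunks in this
series; the field and flag names here are illustrative, the real
definitions live in include/linux/xpfo.h and mm/xpfo.c):

    /* Per-page XPFO state, kept in page_ext in this version of the series. */
    enum xpfo_flags {
            XPFO_PAGE_USER,         /* page is allocated to user space */
            XPFO_PAGE_UNMAPPED,     /* page is unmapped from the physmap */
    };

    struct xpfo {
            unsigned long flags;    /* bits for the flags above */
            atomic_t mapcount;      /* balances kmap/kunmap operations */
            spinlock_t maplock;     /* serializes access to the XPFO fields */
            bool inited;            /* page_ext entry has been initialized */
    };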

This patch is based on the work of Vasileios P. Kemerlis et al. who
published their work in this paper:
  http://www.cs.columbia.edu/~vpk/papers/ret2dir.sec14.pdf
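
In terms of the fields above, the kmap/kunmap side of the patch boils down
to the following (schematic only; the real hunks also handle
lookup_page_ext() failures, WARNs, the xpfo_inited static branch and the
large-page split in set_kpte()):

    void xpfo_kmap(void *kaddr, struct page *page)
    {
            struct xpfo *xpfo = lookup_xpfo(page);

            if (!xpfo || !test_bit(XPFO_PAGE_USER, &xpfo->flags))
                    return;

            /* First mapper brings the page back into the physmap. */
            spin_lock(&xpfo->maplock);
            if (atomic_inc_return(&xpfo->mapcount) == 1 &&
                test_and_clear_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags))
                    set_kpte(kaddr, page, PAGE_KERNEL);
            spin_unlock(&xpfo->maplock);
    }

    void xpfo_kunmap(void *kaddr, struct page *page)
    {
            struct xpfo *xpfo = lookup_xpfo(page);

            if (!xpfo || !test_bit(XPFO_PAGE_USER, &xpfo->flags))
                    return;

            /* Last unmapper removes the page from the physmap again. */
            spin_lock(&xpfo->maplock);
            if (atomic_dec_return(&xpfo->mapcount) == 0) {
                    set_bit(XPFO_PAGE_UNMAPPED, &xpfo->flags);
                    set_kpte(kaddr, page, __pgprot(0));
                    xpfo_flush_kernel_tlb(page, 0);
            }
            spin_unlock(&xpfo->maplock);
    }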

v6: * use flush_tlb_kernel_range() instead of __flush_tlb_one, so we flush
  the tlb entry on all CPUs when unmapping it in kunmap
* handle lookup_page_ext()/lookup_xpfo() returning NULL
* drop lots of BUG()s in favor of WARN()
* don't disable irqs in xpfo_kmap/xpfo_kunmap, export
  __split_large_page so we can do our own alloc_pages(GFP_ATOMIC) to
  pass it

CC: x...@kernel.org
Suggested-by: Vasileios P. Kemerlis 
Signed-off-by: Juerg Haefliger 
Signed-off-by: Tycho Andersen 
Signed-off-by: Marco Benatto 
[jstec...@amazon.de: rebased from v4.13 to v4.19]
Signed-off-by: Julian Stecklina 
Signed-off-by: Khalid Aziz 
---
 .../admin-guide/kernel-parameters.txt |   2 +
 arch/x86/Kconfig  |   1 +
 arch/x86/include/asm/pgtable.h|  26 ++
 arch/x86/mm/Makefile  |   2 +
 arch/x86/mm/pageattr.c|  23 +-
 arch/x86/mm/xpfo.c| 114 +
 include/linux/highmem.h   |  15 +-
 include/linux/xpfo.h  |  47 
 mm/Makefile   |   1 +
 mm/page_alloc.c   |   2 +
 mm/page_ext.c |   4 +
 mm/xpfo.c | 222 ++
 security/Kconfig  |  19 ++
 13 files changed, 456 insertions(+), 22 deletions(-)
 create mode 100644 arch/x86/mm/xpfo.c
 create mode 100644 include/linux/xpfo.h
 create mode 100644 mm/xpfo.c

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index aefd358a5ca3..c4c62599f216 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2982,6 +2982,8 @@
 
nox2apic[X86-64,APIC] Do not enable x2APIC mode.
 
+   noxpfo  [X86-64] Disable XPFO when CONFIG_XPFO is on.
+
cpu0_hotplug[X86] Turn on CPU0 hotplug feature when
CONFIG_BOOTPARAM_HOTPLUG_CPU0 is off.
Some features depend on CPU0. Known dependencies are:
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 8689e794a43c..d69d8cc6e57e 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -207,6 +207,7 @@ config X86
select USER_STACKTRACE_SUPPORT
select VIRT_TO_BUS
select X86_FEATURE_NAMESif PROC_FS
+   select ARCH_SUPPORTS_XPFO   if X86_64
 
 config INSTRUCTION_DECODER
def_bool y
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 40616e805292..ad2d1792939d 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1437,6 +1437,32 @@ static inline bool arch_has_pfn_modify_check(void)
return boot_cpu_has_bug(X86_BUG_L1TF);
 }
 
+/*
+ * The current flushing context - we pass it instead of 5 arguments:
+ */
+struct cpa_data {
+   unsigned long   *vaddr;
+   pgd_t   *pgd;
+   pgprot_tmask_set;
+   pgprot_tmask_clr;
+   unsigned long   numpages;
+   int flags;
+   unsigned long   pfn;
+   unsignedforce_split : 1,
+   force_static_prot   : 1;
+   int curpage;
+   struct page **pages;
+};
+
+
+int
+should_split_large_page(pte_t *kpte, unsigned long address,
+   struct cpa_data *cpa);
+extern spinlock_t cpa_lock;
+int
+__split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
+  struct page *base);
+
 #include 
 #endif /* __ASSEMBLY__ */
 
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 4b101dd6e52f..93b0fdaf4a99 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -53,3 +53,5 @@ obj-$(CONFIG_PAGE_TABLE_ISOLATION)+= pti.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt_identity.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)  += mem_encrypt_boot.o
+
+obj-$(CONFIG_XPFO

[RFC PATCH v7 09/16] mm: add a user_virt_to_phys symbol

2019-01-10 Thread Khalid Aziz
From: Tycho Andersen 

We need something like this for testing XPFO. Since it's architecture
specific, putting it in the test code is slightly awkward, so let's make it
an arch-specific symbol and export it for use in LKDTM.

v6: * add a definition of user_virt_to_phys in the !CONFIG_XPFO case

CC: linux-arm-ker...@lists.infradead.org
CC: x...@kernel.org
Signed-off-by: Tycho Andersen 
Tested-by: Marco Benatto 
Signed-off-by: Khalid Aziz 
---
 arch/x86/mm/xpfo.c   | 57 
 include/linux/xpfo.h |  8 +++
 2 files changed, 65 insertions(+)

diff --git a/arch/x86/mm/xpfo.c b/arch/x86/mm/xpfo.c
index d1f04ea533cd..bcdb2f2089d2 100644
--- a/arch/x86/mm/xpfo.c
+++ b/arch/x86/mm/xpfo.c
@@ -112,3 +112,60 @@ inline void xpfo_flush_kernel_tlb(struct page *page, int 
order)
 
flush_tlb_kernel_range(kaddr, kaddr + (1 << order) * size);
 }
+
+/* Convert a user space virtual address to a physical address.
+ * Shamelessly copied from slow_virt_to_phys() and lookup_address() in
+ * arch/x86/mm/pageattr.c
+ */
+phys_addr_t user_virt_to_phys(unsigned long addr)
+{
+   phys_addr_t phys_addr;
+   unsigned long offset;
+   pgd_t *pgd;
+   p4d_t *p4d;
+   pud_t *pud;
+   pmd_t *pmd;
+   pte_t *pte;
+
+   pgd = pgd_offset(current->mm, addr);
+   if (pgd_none(*pgd))
+   return 0;
+
+   p4d = p4d_offset(pgd, addr);
+   if (p4d_none(*p4d))
+   return 0;
+
+   if (p4d_large(*p4d) || !p4d_present(*p4d)) {
+   phys_addr = (unsigned long)p4d_pfn(*p4d) << PAGE_SHIFT;
+   offset = addr & ~P4D_MASK;
+   goto out;
+   }
+
+   pud = pud_offset(p4d, addr);
+   if (pud_none(*pud))
+   return 0;
+
+   if (pud_large(*pud) || !pud_present(*pud)) {
+   phys_addr = (unsigned long)pud_pfn(*pud) << PAGE_SHIFT;
+   offset = addr & ~PUD_MASK;
+   goto out;
+   }
+
+   pmd = pmd_offset(pud, addr);
+   if (pmd_none(*pmd))
+   return 0;
+
+   if (pmd_large(*pmd) || !pmd_present(*pmd)) {
+   phys_addr = (unsigned long)pmd_pfn(*pmd) << PAGE_SHIFT;
+   offset = addr & ~PMD_MASK;
+   goto out;
+   }
+
+   pte =  pte_offset_kernel(pmd, addr);
+   phys_addr = (phys_addr_t)pte_pfn(*pte) << PAGE_SHIFT;
+   offset = addr & ~PAGE_MASK;
+
+out:
+   return (phys_addr_t)(phys_addr | offset);
+}
+EXPORT_SYMBOL(user_virt_to_phys);
diff --git a/include/linux/xpfo.h b/include/linux/xpfo.h
index 0c26836a24e1..d4b38ab8a633 100644
--- a/include/linux/xpfo.h
+++ b/include/linux/xpfo.h
@@ -23,6 +23,10 @@ struct page;
 
 #ifdef CONFIG_XPFO
 
+#include 
+
+#include 
+
 extern struct page_ext_operations page_xpfo_ops;
 
 void set_kpte(void *kaddr, struct page *page, pgprot_t prot);
@@ -48,6 +52,8 @@ void xpfo_temp_unmap(const void *addr, size_t size, void 
**mapping,
 
 bool xpfo_enabled(void);
 
+phys_addr_t user_virt_to_phys(unsigned long addr);
+
 #else /* !CONFIG_XPFO */
 
 static inline void xpfo_kmap(void *kaddr, struct page *page) { }
@@ -72,6 +78,8 @@ static inline void xpfo_temp_unmap(const void *addr, size_t 
size,
 
 static inline bool xpfo_enabled(void) { return false; }
 
+static inline phys_addr_t user_virt_to_phys(unsigned long addr) { return 0; }
+
 #endif /* CONFIG_XPFO */
 
 #endif /* _LINUX_XPFO_H */
-- 
2.17.1



[RFC PATCH v7 02/16] x86: always set IF before oopsing from page fault

2019-01-10 Thread Khalid Aziz
From: Tycho Andersen 

Oopsing might kill the task, via rewind_stack_do_exit() at the bottom, and
that might sleep:

Aug 23 19:30:27 xpfo kernel: [   38.302714] BUG: sleeping function called from 
invalid context at ./include/linux/percpu-rwsem.h:33
Aug 23 19:30:27 xpfo kernel: [   38.303837] in_atomic(): 0, irqs_disabled(): 1, 
pid: 1970, name: lkdtm_xpfo_test
Aug 23 19:30:27 xpfo kernel: [   38.304758] CPU: 3 PID: 1970 Comm: 
lkdtm_xpfo_test Tainted: G  D 4.13.0-rc5+ #228
Aug 23 19:30:27 xpfo kernel: [   38.305813] Hardware name: QEMU Standard PC 
(i440FX + PIIX, 1996), BIOS 1.10.1-1ubuntu1 04/01/2014
Aug 23 19:30:27 xpfo kernel: [   38.306926] Call Trace:
Aug 23 19:30:27 xpfo kernel: [   38.307243]  dump_stack+0x63/0x8b
Aug 23 19:30:27 xpfo kernel: [   38.307665]  ___might_sleep+0xec/0x110
Aug 23 19:30:27 xpfo kernel: [   38.308139]  __might_sleep+0x45/0x80
Aug 23 19:30:27 xpfo kernel: [   38.308593]  exit_signals+0x21/0x1c0
Aug 23 19:30:27 xpfo kernel: [   38.309046]  ? 
blocking_notifier_call_chain+0x11/0x20
Aug 23 19:30:27 xpfo kernel: [   38.309677]  do_exit+0x98/0xbf0
Aug 23 19:30:27 xpfo kernel: [   38.310078]  ? smp_reader+0x27/0x40 [lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.310604]  ? kthread+0x10f/0x150
Aug 23 19:30:27 xpfo kernel: [   38.311045]  ? read_user_with_flags+0x60/0x60 
[lkdtm]
Aug 23 19:30:27 xpfo kernel: [   38.311680]  rewind_stack_do_exit+0x17/0x20

To be safe, let's just always enable irqs.

The particular case I'm hitting is:

Aug 23 19:30:27 xpfo kernel: [   38.278615]  __bad_area_nosemaphore+0x1a9/0x1d0
Aug 23 19:30:27 xpfo kernel: [   38.278617]  bad_area_nosemaphore+0xf/0x20
Aug 23 19:30:27 xpfo kernel: [   38.278618]  __do_page_fault+0xd1/0x540
Aug 23 19:30:27 xpfo kernel: [   38.278620]  ? irq_work_queue+0x9b/0xb0
Aug 23 19:30:27 xpfo kernel: [   38.278623]  ? wake_up_klogd+0x36/0x40
Aug 23 19:30:27 xpfo kernel: [   38.278624]  trace_do_page_fault+0x3c/0xf0
Aug 23 19:30:27 xpfo kernel: [   38.278625]  do_async_page_fault+0x14/0x60
Aug 23 19:30:27 xpfo kernel: [   38.278627]  async_page_fault+0x28/0x30

This happens when a fault in kernel space has been triggered by XPFO.

Signed-off-by: Tycho Andersen 
CC: x...@kernel.org
Signed-off-by: Khalid Aziz 
---
 arch/x86/mm/fault.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 71d4b9d4d43f..ba51652fbd33 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -748,6 +748,12 @@ no_context(struct pt_regs *regs, unsigned long error_code,
/* Executive summary in case the body of the oops scrolled away */
printk(KERN_DEFAULT "CR2: %016lx\n", address);
 
+   /*
+* We're about to oops, which might kill the task. Make sure we're
+* allowed to sleep.
+*/
+   flags |= X86_EFLAGS_IF;
+
oops_end(flags, regs, sig);
 }
 
-- 
2.17.1


