Re: [PATCH 3.18 32/32] mm: larger stack guard gap, between vmas

2017-06-20 Thread Willy Tarreau
On Tue, Jun 20, 2017 at 10:49:16PM -0700, Hugh Dickins wrote:
> On Mon, 19 Jun 2017, Greg Kroah-Hartman wrote:
> 
> > 3.18-stable review patch.  If anyone has any objections, please let me know.
> > 
> > --
> > 
> > From: Hugh Dickins 
> > 
> > commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.
> 
> Here's a few adjustments to the 3.18 patch: no doubt you'll have
> already sorted out any build errors (and I have to confess that
> I haven't even tried to build this); and the VM_WARN_ON line (as
> in 4.4) only fixes a highly unlikely error; but those FOLL_MLOCK
> lines in mm/gup.c were mistaken, and do need to be deleted.

Are you sure? The test on the FOLL_MLOCK flag remains present in
4.11 and mainline:

/* mlock all present pages, but do not fault in new pages */
if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK)
return -ENOENT;

And this test was present, although different, in 3.18 as well:

/* For mlock, just skip the stack guard page. */
if ((*flags & FOLL_MLOCK) &&
(stack_guard_page_start(vma, address) ||
 stack_guard_page_end(vma, address + PAGE_SIZE)))

So by removing it we're totally removing any test on FOLL_MLOCK. That
might be the correct fix, but I'm just a bit surprised since the mainline
patch doesn't remove it, and only removes the test on FOLL_POPULATE.

Thanks,
Willy


[PATCH] [RFC] binfmt_elf: Use ELF_ET_DYN_BASE only for PIE

2017-06-20 Thread Kees Cook
The ELF_ET_DYN_BASE position was originally intended to keep loaders
away from ET_EXEC binaries. (For example, running "/lib/ld-linux.so.2
/bin/cat" might cause the subsequent load of /bin/cat into where the
loader had been loaded.) With the advent of PIE (ET_DYN binaries with
an INTERP Program Header), ELF_ET_DYN_BASE continued to be used since
the kernel was only looking at ET_DYN. However, since ELF_ET_DYN_BASE
is traditionally set at the top 1/3rd of the TASK_SIZE, a substantial
portion of the address space is unused.

For 32-bit tasks when RLIMIT_STACK is set to RLIM_INFINITY, programs
are loaded below the mmap region. This means they can be made to collide
(CVE-2017-1000370) or nearly collide (CVE-2017-1000371) with pathological
stack regions. Lowering ELF_ET_DYN_BASE solves both by moving programs
above the mmap region in all cases, and will now additionally avoid
programs falling back to the mmap region by enforcing MAP_FIXED for
program loads (i.e. if it would have collided with the stack, now it
will fail to load instead of falling back to the mmap region).

To allow for a lower ELF_ET_DYN_BASE, loaders (ET_DYN without INTERP)
are loaded into the mmap region, leaving space available for either an
ET_EXEC binary with a fixed location or PIE being loaded into mmap by the
loader. Only PIE programs are loaded offset from ELF_ET_DYN_BASE, which
means architectures can now safely lower their values without risk of
loaders colliding with their subsequently loaded programs.

Thanks go to PaX for inspiration on how to approach this solution.

Fixes: d1fd836dcf00 ("mm: split ET_DYN ASLR from mmap ASLR")
Signed-off-by: Kees Cook 
---
 arch/x86/include/asm/elf.h |  8 ++--
 fs/binfmt_elf.c| 38 +++---
 2 files changed, 33 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index e8ab9a46bc68..46549973ea98 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -245,12 +245,8 @@ extern int force_personality32;
 #define CORE_DUMP_USE_REGSET
 #define ELF_EXEC_PAGESIZE  4096
 
-/* This is the location that an ET_DYN program is loaded if exec'ed.  Typical
-   use of this is to invoke "./ld.so someprog" to test out a new version of
-   the loader.  We need to make sure that it is out of the way of the program
-   that it will "exec", and that there is sufficient room for the brk.  */
-
-#define ELF_ET_DYN_BASE		(TASK_SIZE / 3 * 2)
+/* This is the base location for PIE (ET_DYN with INTERP) loads. */
+#define ELF_ET_DYN_BASE		0x400000UL
 
 /* This yields a mask that user programs can use to figure out what
instruction set this CPU supports.  This could be done in user space,
diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 5075fd5c62c8..a998c7251d1c 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -930,13 +930,37 @@ static int load_elf_binary(struct linux_binprm *bprm)
if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
elf_flags |= MAP_FIXED;
} else if (loc->elf_ex.e_type == ET_DYN) {
-   /* Try and get dynamic programs out of the way of the
-* default mmap base, as well as whatever program they
-* might try to exec.  This is because the brk will
-* follow the loader, and is not movable.  */
-   load_bias = ELF_ET_DYN_BASE - vaddr;
-   if (current->flags & PF_RANDOMIZE)
-   load_bias += arch_mmap_rnd();
+   /*
+* There are effectively two types of ET_DYN
+* binaries: programs (i.e. PIE: ET_DYN with INTERP)
+* and loaders (ET_DYN without INTERP, since they
+* _are_ the ELF interpreter). The loaders must
+* be loaded away from programs since the program
+* may otherwise collide with the loader (especially
+* for ET_EXEC which does not have a randomized
+* position). For example to handle invocations of
+* "./ld.so someprog" to test out a new version of
+* the loader, the subsequent program that the
+* loader loads must avoid the loader itself, so
+* they cannot share the same load range. Sufficient
+* room for the brk must be allocated with the
+* loader as well, since brk must be available with
+* the loader.
+*
+* Therefore, programs are loaded offset from
+* ELF_ET_DYN_BASE and loaders are loaded into the
+* independently randomized mmap region (0 
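
The archived diff is cut off above. A rough sketch of the load-bias selection
the changelog describes (variable names follow load_elf_binary(); this is an
illustration of the description, not the literal patch text): PIE, i.e. ET_DYN
with INTERP, is pinned at ELF_ET_DYN_BASE with MAP_FIXED, while a bare loader
keeps load_bias 0 and lands in the independently randomized mmap region.

	} else if (loc->elf_ex.e_type == ET_DYN) {
		if (elf_interpreter) {
			/* PIE: place the program at ELF_ET_DYN_BASE and
			 * insist on that placement, so it can no longer
			 * fall back into the mmap region near the stack. */
			load_bias = ELF_ET_DYN_BASE;
			if (current->flags & PF_RANDOMIZE)
				load_bias += arch_mmap_rnd();
			elf_flags |= MAP_FIXED;
		} else {
			/* Bare loader (no INTERP): let it land in the
			 * independently randomized mmap region. */
			load_bias = 0;
		}
		load_bias = ELF_PAGESTART(load_bias - vaddr);
	}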

linux-next: manual merge of the scsi tree with the block tree

2017-06-20 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the scsi tree got a conflict in:

  drivers/scsi/scsi_lib.c

between commit:

  ca18d6f769d2 ("block: Make most scsi_req_init() calls implicit")

from the block tree and commit:

  2dd6fb5957a7 ("scsi: Only add commands to the device command list if required 
by the LLD")

from the scsi tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/scsi/scsi_lib.c
index 550e29f903b7,41c19c75dab4..
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@@ -1116,20 -1124,35 +1106,49 @@@ err_exit
  }
  EXPORT_SYMBOL(scsi_init_io);
  
+ /* Add a command to the list used by the aacraid and dpt_i2o drivers */
+ void scsi_add_cmd_to_list(struct scsi_cmnd *cmd)
+ {
+   struct scsi_device *sdev = cmd->device;
+   struct Scsi_Host *shost = sdev->host;
+   unsigned long flags;
+ 
+   if (shost->use_cmd_list) {
+   spin_lock_irqsave(&sdev->list_lock, flags);
+   list_add_tail(&cmd->list, &sdev->cmd_list);
+   spin_unlock_irqrestore(&sdev->list_lock, flags);
+   }
+ }
+ 
+ /* Remove a command from the list used by the aacraid and dpt_i2o drivers */
+ void scsi_del_cmd_from_list(struct scsi_cmnd *cmd)
+ {
+   struct scsi_device *sdev = cmd->device;
+   struct Scsi_Host *shost = sdev->host;
+   unsigned long flags;
+ 
+   if (shost->use_cmd_list) {
+   spin_lock_irqsave(&sdev->list_lock, flags);
+   BUG_ON(list_empty(&cmd->list));
+   list_del_init(&cmd->list);
+   spin_unlock_irqrestore(&sdev->list_lock, flags);
+   }
+ }
+ 
 +/**
 + * scsi_initialize_rq - initialize struct scsi_cmnd.req
 + *
 + * Called from inside blk_get_request().
 + */
 +void scsi_initialize_rq(struct request *rq)
 +{
 +  struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
 +
 +  scsi_req_init(&cmd->req);
 +}
 +EXPORT_SYMBOL(scsi_initialize_rq);
 +
 +/* Called after a request has been started. */
  void scsi_init_command(struct scsi_device *dev, struct scsi_cmnd *cmd)
  {
void *buf = cmd->sense_buffer;
@@@ -2974,10 -2989,7 +2989,7 @@@ int scsi_internal_device_block_nowait(s
 * request queue. 
 */
if (q->mq_ops) {
-   if (wait)
-   blk_mq_quiesce_queue(q);
-   else
-   blk_mq_quiesce_queue_nowait(q);
 -  blk_mq_stop_hw_queues(q);
++  blk_mq_quiesce_queue_nowait(q);
} else {
spin_lock_irqsave(q->queue_lock, flags);
blk_stop_queue(q);
@@@ -2988,31 -2998,77 +2998,77 @@@
  
return 0;
  }
- EXPORT_SYMBOL_GPL(scsi_internal_device_block);
-  
+ EXPORT_SYMBOL_GPL(scsi_internal_device_block_nowait);
+ 
  /**
-  * scsi_internal_device_unblock - resume a device after a block request
-  * @sdev: device to resume
-  * @new_state:state to set devices to after unblocking
+  * scsi_internal_device_block - try to transition to the SDEV_BLOCK state
+  * @sdev: device to block
+  *
+  * Pause SCSI command processing on the specified device and wait until all
+  * ongoing scsi_request_fn() / scsi_queue_rq() calls have finished. May sleep.
   *
-  * Called by scsi lld's or the midlayer to restart the device queue
-  * for the previously suspended scsi device.  Called from interrupt or
-  * normal process context.
+  * Returns zero if successful or a negative error code upon failure.
   *
-  * Returns zero if successful or error if not.
+  * Note:
+  * This routine transitions the device to the SDEV_BLOCK state (which must be
+  * a legal transition). When the device is in this state, command processing
+  * is paused until the device leaves the SDEV_BLOCK state. See also
+  * scsi_internal_device_unblock().
   *
-  * Notes:   
-  *This routine transitions the device to the SDEV_RUNNING state
-  *or to one of the offline states (which must be a legal transition)
-  *allowing the midlayer to goose the queue for this device.
+  * To do: avoid that scsi_send_eh_cmnd() calls queuecommand() after
+  * scsi_internal_device_block() has blocked a SCSI device and also
+  * remove the rport mutex lock and unlock calls from srp_queuecommand().
   */
- int
- scsi_internal_device_unblock(struct scsi_device *sdev,
-enum scsi_device_state new_state)
+ static int scsi_internal_device_block(struct scsi_device *sdev)
  {
-   struct request_queue *q = sdev->request_queue; 
+   struct request_queue *q = sdev->request_queue;
+   int err;
+ 
+   mutex_lock(&sdev->state_mutex);
+   err = scsi_internal_device_block_nowait(sdev);
+   if (err == 0) {
+   if 

Re: [PATCH 3.18 32/32] mm: larger stack guard gap, between vmas

2017-06-20 Thread Hugh Dickins
On Mon, 19 Jun 2017, Greg Kroah-Hartman wrote:

> 3.18-stable review patch.  If anyone has any objections, please let me know.
> 
> --
> 
> From: Hugh Dickins 
> 
> commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.

Here's a few adjustments to the 3.18 patch: no doubt you'll have
already sorted out any build errors (and I have to confess that
I haven't even tried to build this); and the VM_WARN_ON line (as
in 4.4) only fixes a highly unlikely error; but those FOLL_MLOCK
lines in mm/gup.c were mistaken, and do need to be deleted.

Hugh

diff -purN 318n/include/linux/mm.h 318h/include/linux/mm.h
--- 318n/include/linux/mm.h 2017-06-20 16:48:54.050429965 -0700
+++ 318h/include/linux/mm.h 2017-06-20 19:13:05.842191061 -0700
@@ -1242,6 +1242,9 @@ int set_page_dirty_lock(struct page *pag
 int clear_page_dirty_for_io(struct page *page);
 int get_cmdline(struct task_struct *task, char *buffer, int buflen);
 
+extern struct task_struct *task_of_stack(struct task_struct *task,
+   struct vm_area_struct *vma, bool in_group);
+
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long old_addr, struct vm_area_struct *new_vma,
unsigned long new_addr, unsigned long len,
@@ -1897,8 +1900,9 @@ void page_cache_async_readahead(struct a
pgoff_t offset,
unsigned long size);
 
-extern unsigned long stack_guard_gap;
+unsigned long max_sane_readahead(unsigned long nr);
 
+extern unsigned long stack_guard_gap;
 /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
 
diff -purN 318n/mm/gup.c 318h/mm/gup.c
--- 318n/mm/gup.c   2017-06-20 16:48:54.054429927 -0700
+++ 318h/mm/gup.c   2017-06-20 19:18:19.579275331 -0700
@@ -275,9 +275,6 @@ static int faultin_page(struct task_stru
unsigned int fault_flags = 0;
int ret;
 
-   /* mlock all present pages, but do not fault in new pages */
-   if (*flags & FOLL_MLOCK)
-   return -ENOENT;
if (*flags & FOLL_WRITE)
fault_flags |= FAULT_FLAG_WRITE;
if (nonblocking)
diff -purN 318n/mm/mmap.c 318h/mm/mmap.c
--- 318n/mm/mmap.c  2017-06-20 16:48:54.054429927 -0700
+++ 318h/mm/mmap.c  2017-06-20 19:43:16.945345744 -0700
@@ -931,7 +931,7 @@ again:  remove_next = 1 + (end > next->
else if (next)
vma_gap_update(next);
else
-   mm->highest_vm_end = end;
+   VM_WARN_ON(mm->highest_vm_end != vm_end_gap(vma));
}
if (insert && file)
uprobe_mmap(insert);


[PATCH v2] brcmfmac: Fix a memory leak in error handling path in 'brcmf_cfg80211_attach'

2017-06-20 Thread Christophe JAILLET
If 'wiphy_new()' fails, we leak 'ops'. Add a new label in the error
handling path to free it in such a case.

Cc: sta...@vger.kernel.org
Fixes: 5c22fb85102a7 ("brcmfmac: add wowl gtk rekeying offload support")
Signed-off-by: Christophe JAILLET 
---
v2: Add CC tag
Change prefix
---
 drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c 
b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
index 2443c71a202f..032d823c53c2 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
@@ -6861,7 +6861,7 @@ struct brcmf_cfg80211_info *brcmf_cfg80211_attach(struct 
brcmf_pub *drvr,
wiphy = wiphy_new(ops, sizeof(struct brcmf_cfg80211_info));
if (!wiphy) {
brcmf_err("Could not allocate wiphy device\n");
-   return NULL;
+   goto ops_out;
}
memcpy(wiphy->perm_addr, drvr->mac, ETH_ALEN);
set_wiphy_dev(wiphy, busdev);
@@ -7012,6 +7012,7 @@ struct brcmf_cfg80211_info *brcmf_cfg80211_attach(struct 
brcmf_pub *drvr,
ifp->vif = NULL;
 wiphy_out:
brcmf_free_wiphy(wiphy);
+ops_out:
kfree(ops);
return NULL;
 }
-- 
2.11.0



Re: [PATCH 4.4 30/30] mm: larger stack guard gap, between vmas

2017-06-20 Thread Hugh Dickins
On Mon, 19 Jun 2017, Greg Kroah-Hartman wrote:

> 4.4-stable review patch.  If anyone has any objections, please let me know.
> 
> --
> 
> From: Hugh Dickins 
> 
> commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.

The 4.11 and 4.9 patches are fine, but I noticed a couple of corrections
to this 4.4 one.  There are very unlikely circumstances (on VM_GROWSUP
architectures only) in which that "highest_vm_end = end" line would be
wrong: several different ways to fix it, I'm tending to go with Andrea's
observation that the only need for update has already been done above,
so just VM_WARN_ON if it's unexpected.

Hugh

diff -purN 404n/include/linux/mm.h 404h/include/linux/mm.h
--- 404n/include/linux/mm.h 2017-06-20 16:48:17.162770068 -0700
+++ 404h/include/linux/mm.h 2017-06-20 17:36:35.871975695 -0700
@@ -1283,6 +1283,8 @@ static inline bool vma_is_anonymous(stru
return !vma->vm_ops;
 }
 
+int vma_is_stack_for_task(struct vm_area_struct *vma, struct task_struct *t);
+
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
unsigned long old_addr, struct vm_area_struct *new_vma,
unsigned long new_addr, unsigned long len,
diff -purN 404n/mm/mmap.c 404h/mm/mmap.c
--- 404n/mm/mmap.c  2017-06-20 16:48:17.166770032 -0700
+++ 404h/mm/mmap.c  2017-06-20 17:42:31.312682716 -0700
@@ -923,7 +923,7 @@ again:  remove_next = 1 + (end > next->
else if (next)
vma_gap_update(next);
else
-   mm->highest_vm_end = end;
+   VM_WARN_ON(mm->highest_vm_end != vm_end_gap(vma));
}
if (insert && file)
uprobe_mmap(insert);


Re: [PATCH 2/6] drivers base/arch_topology: frequency-invariant load-tracking support

2017-06-20 Thread Viresh Kumar
On 20-06-17, 17:31, Saravana Kannan wrote:
> On 06/19/2017 11:17 PM, Viresh Kumar wrote:
> >On Thu, Jun 8, 2017 at 1:25 PM, Dietmar Eggemann
> > wrote:
> >
> >>diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
> >
> >>  static int __init register_cpufreq_notifier(void)
> >>  {
> >>+   int ret;
> >>+
> >> /*
> >>  * on ACPI-based systems we need to use the default cpu capacity
> >>  * until we have the necessary code to parse the cpu capacity, so
> >>@@ -225,8 +265,14 @@ static int __init register_cpufreq_notifier(void)
> >>
> >> cpumask_copy(cpus_to_visit, cpu_possible_mask);
> >>
> >>-   return cpufreq_register_notifier(&init_cpu_capacity_notifier,
> >>-CPUFREQ_POLICY_NOTIFIER);
> >>+   ret = cpufreq_register_notifier(&init_cpu_capacity_notifier,
> >>+   CPUFREQ_POLICY_NOTIFIER);
> >
> >Wanted to make sure that we all understand the constraints this is going to 
> >add
> >for the ARM64 platforms.
> >
> >With the introduction of this transition notifier, we would not be able to 
> >use
> >the fast-switch path in the schedutil governor. I am not sure if there are 
> >any
> >ARM platforms that can actually use the fast-switch path in future or not
> >though. The requirement of fast-switch path is that the freq can be changed
> >without sleeping in the hot-path.
> >
> >So, will we ever want fast-switching for ARM platforms ?
> >
> 
> I don't think we should go down a path that'll prevent ARM platform from
> switching over to fast-switching in the future.

Yeah, that's why I brought attention to this stuff.

I think this patch doesn't really need to go down the notifiers way.

We can do something like this in the implementation of
topology_get_freq_scale():

return (policy->cur << SCHED_CAPACITY_SHIFT) / max;

Though, we would be required to take care of policy structure in this
case somehow.
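
A minimal sketch of that idea (hypothetical names, not the actual
arch_topology.c code): cache the ratio from wherever the frequency change is
observed, so the scheduler-side accessor is just a per-CPU read and nothing
heavy runs in the hot path.

static DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;

/* Called from wherever the current frequency is updated (not the hot path). */
static void update_freq_scale(int cpu, unsigned long cur_khz, unsigned long max_khz)
{
	per_cpu(freq_scale, cpu) = (cur_khz << SCHED_CAPACITY_SHIFT) / max_khz;
}

unsigned long topology_get_freq_scale(int cpu)
{
	return per_cpu(freq_scale, cpu);
}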

> Having said that, I'm not sure I fully agree with the decision to completely
> disable notifiers in the fast-switching case. How many of the current users
> of notifiers truly need support for sleeping in the notifier?

It's not just about sleeping here. We do not wish to call too much
stuff from the scheduler hot path, even if it doesn't sleep.

> Why not make
> all the transition notifiers atomic? Or at least add atomic transition
> notifiers that can be registered for separately if the client doesn't need
> the ability to sleep?
> 
> Most of the clients don't seem like ones that'll need to sleep.

Only if the scheduler maintainers agree to getting these notifiers
called from hot path, which I don't think is going to happen.

> There are a bunch of generic off-tree drivers (can't upstream them yet
> because it depends on the bus scaling framework) that also depend on CPUfreq
> transition notifiers that are going to stop working if fast switching
> becomes available in the future. So, this decision to disallow transition
> notifiers is painful for other reasons too.

I think it's kind of fine to work without fast switch in those cases,
as we are anyway ready to call notifiers which may end up taking any
amount of time.

This case was special as it affects an entire arch here, and so I
pointed it out.

-- 
viresh


Re: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.

2017-06-20 Thread Anshuman Khandual
On 06/21/2017 04:53 AM, Ram Pai wrote:
> On Tue, Jun 20, 2017 at 03:50:25PM +0530, Anshuman Khandual wrote:
>> On 06/17/2017 09:22 AM, Ram Pai wrote:
>>> Rearrange 64K PTE bits to  free  up  bits 3, 4, 5  and  6
>>> in the 4K backed hpte pages. These bits continue to be used
>>> for 64K backed hpte pages in this patch, but will be freed
>>> up in the next patch.
>>
>> The counting 3, 4, 5 and 6 are in BE format I believe, I was
>> initially trying to see that from right to left as we normally
>> do in the kernel and was getting confused. So basically these
>> bits (which are only applicable for 64K mapping IIUC) are going
>> to be freed up from the PTE format.
>>
>> #define _RPAGE_RSV1  0x1000000000000000UL
>> #define _RPAGE_RSV2  0x0800000000000000UL
>> #define _RPAGE_RSV3  0x0400000000000000UL
>> #define _RPAGE_RSV4  0x0200000000000000UL
>>
>> As you have mentioned before this feature is available for 64K
>> page size only and not for 4K mappings. So I assume we support
>> both the combinations.
>>
>> * 64K mapping on 64K
>> * 64K mapping on 4K
> 
> yes.
> 
>>
>> These are the current users of the above bits
>>
>> #define H_PAGE_BUSY  _RPAGE_RSV1 /* software: PTE & hash are busy */
>> #define H_PAGE_F_SECOND  _RPAGE_RSV2 /* HPTE is in 2ndary 
>> HPTEG */
>> #define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44)
>> #define H_PAGE_HASHPTE   _RPAGE_RPN43/* PTE has associated 
>> HPTE */
>>
>>>
>>> The patch does the following change to the 64K PTE format
>>>
>>> H_PAGE_BUSY moves from bit 3 to bit 9
>>
>> and what is in there on bit 9 now ? This ?
>>
>> #define _RPAGE_SW2   0x00400
>>
>> which is used as 
>>
>> #define _PAGE_SPECIAL_RPAGE_SW2 /* software: special page */
>>
>> which will not be required any more ?
> 
> i think you are reading bit 9 from right to left. the bit 9 i refer to
> is from left to right. Using the same numbering convention the ISA3.0 uses.

Right, my bad. Then it would be this one.

'#define _RPAGE_RPN42   0x0040000000000000UL'

> I know it is confusing, will make a mention in the comment of this
> patch, to read it the big-endian way.

Right.

> 
> BTW: Bit 9 is not used currently. so using it in this patch. But this is
> a temporary move. the H_PAGE_BUSY will move to bit 7 in the next patch.
> 
> Had to keep at bit 9, because bit 7 is not yet entirely freed up. it is
> used by 64K PTE backed by 64k htpe.

Got it.

> 
>>
>>> H_PAGE_F_SECOND which occupied bit 4 moves to the second part
>>> of the pte.
>>> H_PAGE_F_GIX which  occupied bit 5, 6 and 7 also moves to the
>>> second part of the pte.
>>>
>>> the four  bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot
>>> is  initialized  to  0xF  indicating  an invalid  slot.  If  a hpte
>>> gets cached in a 0xF  slot(i.e  7th  slot  of  secondary),  it   is
>>> released immediately. In  other  words, even  though   0xF   is   a
>>
>> Release immediately means we attempt again for a new hash slot ?
> 
> yes.
> 
>>
>>> valid slot we discard  and consider it as an invalid
>>> slot;i.e hpte_soft_invalid(). This  gives  us  an opportunity to not
>>> depend on a bit in the primary PTE in order to determine the
>>> validity of a slot.
>>
>> So we have to see the slot number in the second half for each PTE to
>> figure out if it has got a valid slot in the hash page table.
> 
> yes.
> 
>>
>>>
>>> When we release a hpte in the 0xF slot we also release a
>>> legitimate primary slot and unmap that entry. This is to
>>> ensure that we do get a legitimate non-0xF slot the next time we
>>> retry for a slot.
>>
>> Okay.
>>
>>>
>>> Though treating the 0xF slot as invalid reduces the number of available
>>> slots and may have an effect on the performance, the probability
>>> of hitting a 0xF is extremely low.
>>
>> Why you say that ? I thought every slot number has the same probability
>> of hit from the hash function.
> 
> Every hash bucket has the same probability. But every slot within the
> hash bucket is filled in sequentially. so it takes 15 hptes to hash to
> the same bucket before we get to the 15th slot in the secondary.

Okay, would the last one be 16th instead ?

> 
>>
>>>
>>> Compared  to the current scheme, the above described scheme reduces
>>> the number of false hash table updates  significantly  and  has the
>>
>> How it reduces false hash table updates ?
> 
> earlier, we had 1 bit allocated in the first-part-of-the 64K-PTE 
> for four consecutive 4K hptes. If any one 4k hpte got hashed-in,
> the bit got set. Which means anytime it faulted on the remaining
> three 4k hpte, we saw the bit already set and tried to erroneously 
> update that hpte. So we had a 75% update error rate. Funcationally
> not bad, but bad from a performance point of view.

I am a bit out of sync regarding these PTE bits, after Aneesh's radix
changes went in :) Will look into this bit closer.

> 
> With 
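
A compact way to read the 0xF convention described in this thread (a sketch
based on the description above, not the exact patch): the four slot bits in
the second half of the PTE start out as 0xF, and a helper such as
hpte_soft_invalid() treats that value as "no valid HPTE cached".

static inline bool hpte_soft_invalid(unsigned long hidx)
{
	/* 0xF is a real secondary slot, but it is released on insertion,
	 * so it can safely double as the "invalid / not cached" marker. */
	return (hidx & 0xfUL) == 0xfUL;
}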

[PATCH v2] backlight: lm3630a: bump REG_MAX value to 0x50 instead of 0x1F

2017-06-20 Thread Bhushan Shah
In lm3630a_chip_init() we try to write to the 0x50 register, which is
a higher value than the max_register value; this resulted in regmap_write()
returning -EIO.

Fix this by bumping REG_MAX value to 0x50.

Signed-off-by: Bhushan Shah 
Suggested-by: Bjorn Andersson 
---

Changes since v1:
- Fix the lm3630a_write call to use the proper value (sent wrong patch earlier)

 drivers/video/backlight/lm3630a_bl.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/video/backlight/lm3630a_bl.c 
b/drivers/video/backlight/lm3630a_bl.c
index 60d6c2ac87aa..b641f706dbc9 100644
--- a/drivers/video/backlight/lm3630a_bl.c
+++ b/drivers/video/backlight/lm3630a_bl.c
@@ -31,7 +31,8 @@
 #define REG_FAULT  0x0B
 #define REG_PWM_OUTLOW 0x12
 #define REG_PWM_OUTHIGH0x13
-#define REG_MAX			0x1F
+#define REG_FLTR_STR		0x50
+#define REG_MAX			0x50
 
 #define INT_DEBOUNCE_MSEC  10
 struct lm3630a_chip {
@@ -80,7 +81,7 @@ static int lm3630a_chip_init(struct lm3630a_chip *pchip)
 
usleep_range(1000, 2000);
/* set Filter Strength Register */
-   rval = lm3630a_write(pchip, 0x50, 0x03);
+   rval = lm3630a_write(pchip, REG_FLTR_STR, 0x03);
/* set Cofig. register */
rval |= lm3630a_update(pchip, REG_CONFIG, 0x07, pdata->pwm_ctrl);
/* set boost control */
-- 
2.13.0
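
For context on the -EIO: regmap refuses accesses beyond the register map's
declared size. A sketch of the relationship, assuming the driver's existing
regmap_config caps the map with REG_MAX (as the changelog implies):

static const struct regmap_config lm3630a_regmap = {
	.reg_bits = 8,
	.val_bits = 8,
	.max_register = REG_MAX,	/* 0x1F rejected writes to 0x50 with -EIO */
};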



[PATCH] backlight: lm3630a: bump REG_MAX value to 0x50 instead of 0x1F

2017-06-20 Thread Bhushan Shah
In lm3630a_chip_init() we try to write to the 0x50 register, which is
a higher value than the max_register value; this resulted in regmap_write()
returning -EIO.

Fix this by bumping REG_MAX value to 0x50.

Signed-off-by: Bhushan Shah 
Suggested-by: Bjorn Andersson 
---
 drivers/video/backlight/lm3630a_bl.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/video/backlight/lm3630a_bl.c 
b/drivers/video/backlight/lm3630a_bl.c
index 60d6c2ac87aa..42496b6c09d0 100644
--- a/drivers/video/backlight/lm3630a_bl.c
+++ b/drivers/video/backlight/lm3630a_bl.c
@@ -31,7 +31,8 @@
 #define REG_FAULT  0x0B
 #define REG_PWM_OUTLOW 0x12
 #define REG_PWM_OUTHIGH0x13
-#define REG_MAX			0x1F
+#define REG_FLTR_STR		0x50
+#define REG_MAX			0x50
 
 #define INT_DEBOUNCE_MSEC  10
 struct lm3630a_chip {
@@ -80,7 +81,7 @@ static int lm3630a_chip_init(struct lm3630a_chip *pchip)
 
usleep_range(1000, 2000);
/* set Filter Strength Register */
-   rval = lm3630a_write(pchip, 0x50, 0x03);
+   rval = lm3630a_write(pchip, FLTR_STR, 0x03);
/* set Cofig. register */
rval |= lm3630a_update(pchip, REG_CONFIG, 0x07, pdata->pwm_ctrl);
/* set boost control */
-- 
2.13.0



[PATCH] MAINTAINERS: Add Qualcomm pinctrl drivers section

2017-06-20 Thread Bjorn Andersson
Document the maintainership of the Qualcomm pinctrl drivers to improve the
hitrate.

Signed-off-by: Bjorn Andersson 
---
 MAINTAINERS | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 75cd0dc8e93f..c2e3a349ca7c 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1666,7 +1666,6 @@ F:arch/arm/mach-qcom/
 F: arch/arm64/boot/dts/qcom/*
 F: drivers/i2c/busses/i2c-qup.c
 F: drivers/clk/qcom/
-F: drivers/pinctrl/qcom/
 F: drivers/dma/qcom/
 F: drivers/soc/qcom/
 F: drivers/spi/spi-qup.c
@@ -10183,6 +10182,13 @@ M: Heikki Krogerus 

 S: Maintained
 F: drivers/pinctrl/intel/
 
+PIN CONTROLLER - QUALCOMM
+M: Bjorn Andersson 
+S: Maintained
+L: linux-arm-...@vger.kernel.org
+F: Documentation/devicetree/bindings/pinctrl/qcom,*.txt
+F: drivers/pinctrl/qcom/
+
 PIN CONTROLLER - RENESAS
 M: Laurent Pinchart 
 M: Geert Uytterhoeven 
-- 
2.12.0



[PATCH v3 02/11] x86/ldt: Simplify LDT switching logic

2017-06-20 Thread Andy Lutomirski
Originally, Linux reloaded the LDT whenever the prev mm or the next
mm had an LDT.  It was changed in 0bbed3beb4f2 ("[PATCH]
Thread-Local Storage (TLS) support") (from the historical tree) like
this:

-   /* load_LDT, if either the previous or next thread
-* has a non-default LDT.
+   /*
+* load the LDT, if the LDT is different:
 */
-   if (next->context.size+prev->context.size)
+   if (unlikely(prev->context.ldt != next->context.ldt))
load_LDT(&next->context);

The current code is unlikely to avoid any LDT reloads, since different
mms won't share an LDT.

When we redo lazy mode to stop flush IPIs without switching to
init_mm, though, the current logic would become incorrect: it will
be possible to have real_prev == next but nonetheless have a stale
LDT descriptor.

Simplify the code to update LDTR if either the previous or the next
mm has an LDT, i.e. effectively restore the historical logic.
While we're at it, clean up the code by moving all the ifdeffery to
a header where it belongs.

Acked-by: Rik van Riel 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/include/asm/mmu_context.h | 26 ++
 arch/x86/mm/tlb.c  | 20 ++--
 2 files changed, 28 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h 
b/arch/x86/include/asm/mmu_context.h
index 1458f530948b..ecfcb6643c9b 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -93,6 +93,32 @@ static inline void load_mm_ldt(struct mm_struct *mm)
 #else
clear_LDT();
 #endif
+}
+
+static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
+{
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+   /*
+* Load the LDT if either the old or new mm had an LDT.
+*
+* An mm will never go from having an LDT to not having an LDT.  Two
+* mms never share an LDT, so we don't gain anything by checking to
+* see whether the LDT changed.  There's also no guarantee that
+* prev->context.ldt actually matches LDTR, but, if LDTR is non-NULL,
+* then prev->context.ldt will also be non-NULL.
+*
+* If we really cared, we could optimize the case where prev == next
+* and we're exiting lazy mode.  Most of the time, if this happens,
+* we don't actually need to reload LDTR, but modify_ldt() is mostly
+* used by legacy code and emulators where we don't need this level of
+* performance.
+*
+* This uses | instead of || because it generates better code.
+*/
+   if (unlikely((unsigned long)prev->context.ldt |
+(unsigned long)next->context.ldt))
+   load_mm_ldt(next);
+#endif
 
DEBUG_LOCKS_WARN_ON(preemptible());
 }
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index f06239c6919f..fd593833a854 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -148,25 +148,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 real_prev != _mm);
cpumask_clear_cpu(cpu, mm_cpumask(real_prev));
 
-   /* Load per-mm CR4 state */
+   /* Load per-mm CR4 and LDTR state */
load_mm_cr4(next);
-
-#ifdef CONFIG_MODIFY_LDT_SYSCALL
-   /*
-* Load the LDT, if the LDT is different.
-*
-* It's possible that prev->context.ldt doesn't match
-* the LDT register.  This can happen if leave_mm(prev)
-* was called and then modify_ldt changed
-* prev->context.ldt but suppressed an IPI to this CPU.
-* In this case, prev->context.ldt != NULL, because we
-* never set context.ldt to NULL while the mm still
-* exists.  That means that next->context.ldt !=
-* prev->context.ldt, because mms never share an LDT.
-*/
-   if (unlikely(real_prev->context.ldt != next->context.ldt))
-   load_mm_ldt(next);
-#endif
+   switch_ldt(real_prev, next);
 }
 
 /*
-- 
2.9.4



[PATCH 2/4] time: Add warning about imminent deprecation of CONFIG_GENERIC_TIME_VSYSCALL_OLD

2017-06-20 Thread John Stultz
CONFIG_GENERIC_TIME_VSYSCALL_OLD was introduced five years ago
to allow a transition from the old vsyscall implementations to
the new method (which simplified internal accounting and made
timekeeping more precise).

However, PPC and IA64 have yet to make the transition, despite
my sending test patches in some cases to try to help it along.

http://patches.linaro.org/patch/30501/
http://patches.linaro.org/patch/35412/

If it's helpful, my last pass at the patches can be found here:
https://git.linaro.org/people/john.stultz/linux.git dev/oldvsyscall-cleanup

So I think it's time to set a deadline and make it clear this
is going away. So this patch adds warnings about this
functionality being dropped. Likely to be in v4.15.

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Miroslav Lichvar 
Cc: Richard Cochran 
Cc: Prarit Bhargava 
Cc: Marcelo Tosatti 
Cc: Paul Mackerras 
Cc: Anton Blanchard 
Cc: Benjamin Herrenschmidt 
Cc: Tony Luck 
Cc: Michael Ellerman 
Cc: Fenghua Yu 
Signed-off-by: John Stultz 
---
 kernel/time/timekeeping.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 0454bfa..cedafa0 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -516,6 +516,7 @@ static void halt_fast_timekeeper(struct timekeeper *tk)
 }
 
 #ifdef CONFIG_GENERIC_TIME_VSYSCALL_OLD
+#warning Please contact your maintainers, as GENERIC_TIME_VSYSCALL_OLD 
compatibility will disappear soon.
 
 static inline void update_vsyscall(struct timekeeper *tk)
 {
-- 
2.7.4



[PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm

2017-06-20 Thread Andy Lutomirski
There are two kernel features that would benefit from tracking
how up-to-date each CPU's TLB is in the case where IPIs aren't keeping
it up to date in real time:

 - Lazy mm switching currently works by switching to init_mm when
   it would otherwise flush.  This is wasteful: there isn't fundamentally
   any need to update CR3 at all when going lazy or when returning from
   lazy mode, nor is there any need to receive flush IPIs at all.  Instead,
   we should just stop trying to keep the TLB coherent when we go lazy and,
   when unlazying, check whether we missed any flushes.

 - PCID will let us keep recent user contexts alive in the TLB.  If we
   start doing this, we need a way to decide whether those contexts are
   up to date.

On some paravirt systems, remote TLBs can be flushed without IPIs.
This won't update the target CPUs' tlb_gens, which may cause
unnecessary local flushes later on.  We can address this if it becomes
a problem by carefully updating the target CPU's tlb_gen directly.

By itself, this patch is a very minor optimization that avoids
unnecessary flushes when multiple TLB flushes targeting the same CPU
race.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/include/asm/tlbflush.h | 37 +++
 arch/x86/mm/tlb.c   | 79 +
 2 files changed, 109 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 1eb946c0507e..4f6c30d6ec39 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,11 @@ static inline u64 bump_mm_tlb_gen(struct mm_struct *mm)
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif
 
+struct tlb_context {
+   u64 ctx_id;
+   u64 tlb_gen;
+};
+
 struct tlb_state {
/*
 * cpu_tlbstate.loaded_mm should match CR3 whenever interrupts
@@ -97,6 +102,21 @@ struct tlb_state {
 * disabling interrupts when modifying either one.
 */
unsigned long cr4;
+
+   /*
+* This is a list of all contexts that might exist in the TLB.
+* Since we don't yet use PCID, there is only one context.
+*
+* For each context, ctx_id indicates which mm the TLB's user
+* entries came from.  As an invariant, the TLB will never
+* contain entries that are out-of-date as when that mm reached
+* the tlb_gen in the list.
+*
+* To be clear, this means that it's legal for the TLB code to
+* flush the TLB without updating tlb_gen.  This can happen
+* (for now, at least) due to paravirt remote flushes.
+*/
+   struct tlb_context ctxs[1];
 };
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
 
@@ -248,9 +268,26 @@ static inline void __flush_tlb_one(unsigned long addr)
  * and page-granular flushes are available only on i486 and up.
  */
 struct flush_tlb_info {
+   /*
+* We support several kinds of flushes.
+*
+* - Fully flush a single mm.  flush_mm will be set, flush_end will be
+*   TLB_FLUSH_ALL, and new_tlb_gen will be the tlb_gen to which the
+*   IPI sender is trying to catch us up.
+*
+* - Partially flush a single mm.  flush_mm will be set, flush_start
+*   and flush_end will indicate the range, and new_tlb_gen will be
+*   set such that the changes between generation new_tlb_gen-1 and
+*   new_tlb_gen are entirely contained in the indicated range.
+*
+* - Fully flush all mms whose tlb_gens have been updated.  flush_mm
+*   will be NULL, flush_end will be TLB_FLUSH_ALL, and new_tlb_gen
+*   will be zero.
+*/
struct mm_struct *mm;
unsigned long start;
unsigned long end;
+   u64 new_tlb_gen;
 };
 
 #define local_flush_tlb() __flush_tlb()
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 6d9d37323a43..9f5ef7a5e74a 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -105,6 +105,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
}
 
this_cpu_write(cpu_tlbstate.loaded_mm, next);
+   this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id);
+   this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
+  atomic64_read(>context.tlb_gen));
 
WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next)));
cpumask_set_cpu(cpu, mm_cpumask(next));
@@ -194,20 +197,73 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 static void flush_tlb_func_common(const struct flush_tlb_info *f,
  bool local, enum tlb_flush_reason reason)
 {
+   struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
+
+   /*
+* Our memory ordering requirement is that any TLB fills that
+* happen after we flush the TLB are ordered after we read
+* active_mm's tlb_gen.  We don't

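The flush_tlb_func_common() hunk above is cut off in this archive. As a
rough, self-contained model of the generation tracking the commit message
describes (this is not the kernel code; mm_ctx, cpu_tlb, request_flush()
and handle_flush() are invented for the example), the scheme amounts to:

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

struct mm_ctx {
    uint64_t ctx_id;            /* unique id of the address space */
    _Atomic uint64_t tlb_gen;   /* bumped each time a flush is requested */
};

struct cpu_tlb {
    uint64_t ctx_id;            /* which mm the cached entries belong to */
    uint64_t tlb_gen;           /* generation the TLB is caught up to */
};

/* A flusher bumps the mm's generation and asks CPUs to catch up to it. */
static uint64_t request_flush(struct mm_ctx *mm)
{
    return atomic_fetch_add(&mm->tlb_gen, 1) + 1;
}

/* A CPU flushes only if its cached state is stale for this mm. */
static void handle_flush(struct cpu_tlb *cpu, struct mm_ctx *mm,
                         uint64_t new_tlb_gen)
{
    if (cpu->ctx_id != mm->ctx_id)
        return;                     /* TLB holds some other mm */
    if (cpu->tlb_gen >= new_tlb_gen) {
        puts("already caught up, skip the flush");
        return;                     /* a racing flush already covered this */
    }
    puts("flush the local TLB");
    cpu->tlb_gen = atomic_load(&mm->tlb_gen);
}

int main(void)
{
    struct mm_ctx mm = { .ctx_id = 1, .tlb_gen = 1 };
    struct cpu_tlb cpu = { .ctx_id = 1, .tlb_gen = 1 };
    uint64_t gen = request_flush(&mm);

    handle_flush(&cpu, &mm, gen);   /* stale: flushes */
    handle_flush(&cpu, &mm, gen);   /* duplicate request: skipped */
    return 0;
}

The second, racing request with the same generation is recognized as
already satisfied, which is the "multiple TLB flushes targeting the same
CPU race" case the message refers to.
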
[PATCH v3 03/11] x86/mm: Remove reset_lazy_tlbstate()

2017-06-20 Thread Andy Lutomirski
The only call site also calls idle_task_exit(), and idle_task_exit()
puts us into a clean state by explicitly switching to init_mm.

Reviewed-by: Rik van Riel 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/include/asm/tlbflush.h | 8 
 arch/x86/kernel/smpboot.c   | 1 -
 2 files changed, 9 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 5f78c6a77578..50ea3482e1d1 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -259,14 +259,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 #define TLBSTATE_OK1
 #define TLBSTATE_LAZY  2
 
-static inline void reset_lazy_tlbstate(void)
-{
-   this_cpu_write(cpu_tlbstate.state, 0);
-   this_cpu_write(cpu_tlbstate.loaded_mm, _mm);
-
-   WARN_ON(read_cr3_pa() != __pa_symbol(swapper_pg_dir));
-}
-
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch 
*batch,
struct mm_struct *mm)
 {
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index f04479a8f74f..6169a56aab49 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -1589,7 +1589,6 @@ void native_cpu_die(unsigned int cpu)
 void play_dead_common(void)
 {
idle_task_exit();
-   reset_lazy_tlbstate();
 
/* Ack it */
(void)cpu_report_death();
-- 
2.9.4



[PATCH v3 09/11] x86/mm: Add nopcid to turn off PCID

2017-06-20 Thread Andy Lutomirski
The parameter is only present on x86_64 systems to save a few bytes,
as PCID is always disabled on x86_32.

Signed-off-by: Andy Lutomirski 
---
 Documentation/admin-guide/kernel-parameters.txt |  2 ++
 arch/x86/kernel/cpu/common.c| 18 ++
 2 files changed, 20 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt 
b/Documentation/admin-guide/kernel-parameters.txt
index 0f5c3b4347c6..aa385109ae58 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2648,6 +2648,8 @@
nopat   [X86] Disable PAT (page attribute table extension of
pagetables) support.
 
+   nopcid  [X86-64] Disable the PCID cpu feature.
+
norandmaps  Don't use address space randomization.  Equivalent to
echo 0 > /proc/sys/kernel/randomize_va_space
 
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c8b39870f33e..904485e7b230 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -168,6 +168,24 @@ static int __init x86_mpx_setup(char *s)
 }
 __setup("nompx", x86_mpx_setup);
 
+#ifdef CONFIG_X86_64
+static int __init x86_pcid_setup(char *s)
+{
+   /* require an exact match without trailing characters */
+   if (strlen(s))
+   return 0;
+
+   /* do not emit a message if the feature is not present */
+   if (!boot_cpu_has(X86_FEATURE_PCID))
+   return 1;
+
+   setup_clear_cpu_cap(X86_FEATURE_PCID);
+   pr_info("nopcid: PCID feature disabled\n");
+   return 1;
+}
+__setup("nopcid", x86_pcid_setup);
+#endif
+
 static int __init x86_noinvpcid_setup(char *s)
 {
/* noinvpcid doesn't accept parameters */
-- 
2.9.4



[PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID

2017-06-20 Thread Andy Lutomirski
PCID is a "process context ID" -- it's what other architectures call
an address space ID.  Every non-global TLB entry is tagged with a
PCID, only TLB entries that match the currently selected PCID are
used, and we can switch PGDs without flushing the TLB.  x86's
PCID is 12 bits.

This is an unorthodox approach to using PCID.  x86's PCID is far too
short to uniquely identify a process, and we can't even really
uniquely identify a running process because there are monster
systems with over 4096 CPUs.  To make matters worse, past attempts
to use all 12 PCID bits have resulted in slowdowns instead of
speedups.

This patch uses PCID differently.  We use a PCID to identify a
recently-used mm on a per-cpu basis.  An mm has no fixed PCID
binding at all; instead, we give it a fresh PCID each time it's
loaded except in cases where we want to preserve the TLB, in which
case we reuse a recent value.

This seems to save about 100ns on context switches between mms.

Signed-off-by: Andy Lutomirski 
---
 arch/x86/include/asm/mmu_context.h |  3 ++
 arch/x86/include/asm/processor-flags.h |  2 +
 arch/x86/include/asm/tlbflush.h| 18 +++-
 arch/x86/mm/init.c |  1 +
 arch/x86/mm/tlb.c  | 82 ++
 5 files changed, 86 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h 
b/arch/x86/include/asm/mmu_context.h
index 69a4f1ee86ac..2537ec03c9b7 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -299,6 +299,9 @@ static inline unsigned long __get_current_cr3_fast(void)
 {
unsigned long cr3 = __pa(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd);
 
+   if (static_cpu_has(X86_FEATURE_PCID))
+   cr3 |= this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+
/* For now, be very restrictive about when this can be called. */
VM_WARN_ON(in_nmi() || !in_atomic());
 
diff --git a/arch/x86/include/asm/processor-flags.h 
b/arch/x86/include/asm/processor-flags.h
index 79aa2f98398d..791b60199aa4 100644
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -35,6 +35,7 @@
 /* Mask off the address space ID bits. */
 #define CR3_ADDR_MASK 0x7000ull
 #define CR3_PCID_MASK 0xFFFull
+#define CR3_NOFLUSH (1UL << 63)
 #else
 /*
  * CR3_ADDR_MASK needs at least bits 31:5 set on PAE systems, and we save
@@ -42,6 +43,7 @@
  */
 #define CR3_ADDR_MASK 0xull
 #define CR3_PCID_MASK 0ull
+#define CR3_NOFLUSH 0
 #endif
 
 #endif /* _ASM_X86_PROCESSOR_FLAGS_H */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 57b305e13c4c..a9a5aa6f45f7 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -82,6 +82,12 @@ static inline u64 bump_mm_tlb_gen(struct mm_struct *mm)
 #define __flush_tlb_single(addr) __native_flush_tlb_single(addr)
 #endif
 
+/*
+ * 6 because 6 should be plenty and struct tlb_state will fit in
+ * two cache lines.
+ */
+#define NR_DYNAMIC_ASIDS 6
+
 struct tlb_context {
u64 ctx_id;
u64 tlb_gen;
@@ -95,6 +101,8 @@ struct tlb_state {
 * mode even if we've already switched back to swapper_pg_dir.
 */
struct mm_struct *loaded_mm;
+   u16 loaded_mm_asid;
+   u16 next_asid;
 
/*
 * Access to this CR4 shadow and to H/W CR4 is protected by
@@ -104,7 +112,8 @@ struct tlb_state {
 
/*
 * This is a list of all contexts that might exist in the TLB.
-* Since we don't yet use PCID, there is only one context.
+* There is one per ASID that we use, and the ASID (what the
+* CPU calls PCID) is the index into ctxts.
 *
 * For each context, ctx_id indicates which mm the TLB's user
 * entries came from.  As an invariant, the TLB will never
@@ -114,8 +123,13 @@ struct tlb_state {
 * To be clear, this means that it's legal for the TLB code to
 * flush the TLB without updating tlb_gen.  This can happen
 * (for now, at least) due to paravirt remote flushes.
+*
+* NB: context 0 is a bit special, since it's also used by
+* various bits of init code.  This is fine -- code that
+* isn't aware of PCID will end up harmlessly flushing
+* context 0.
 */
-   struct tlb_context ctxs[1];
+   struct tlb_context ctxs[NR_DYNAMIC_ASIDS];
 };
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate);
 
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 7d6fa4676af9..9c9570d300ba 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -812,6 +812,7 @@ void __init zone_sizes_init(void)
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
.loaded_mm = _mm,
+   .next_asid = 1,
.cr4 = ~0UL,/* fail hard if we screw up cr4 shadow initialization */
 };
 EXPORT_SYMBOL_GPL(cpu_tlbstate);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c

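The mm/tlb.c hunk of this patch is truncated above. As a minimal sketch of
the per-CPU ASID recycling the message describes (NR_DYNAMIC_ASIDS and the
ctxs[] array come from the visible hunks; choose_asid() and the rest are
illustrative names, not the kernel's code):

#include <stdbool.h>
#include <stdint.h>

#define NR_DYNAMIC_ASIDS 6

struct tlb_ctx {
    uint64_t ctx_id;                    /* which mm this ASID last held */
    uint64_t tlb_gen;                   /* how fresh its entries are */
};

struct cpu_state {
    uint16_t next_asid;                 /* round-robin replacement pointer */
    struct tlb_ctx ctxs[NR_DYNAMIC_ASIDS];
};

/*
 * Pick an ASID for the mm identified by ctx_id.  Reuse a slot that
 * already holds this mm, flushing only if its generation is stale;
 * otherwise evict the next slot round-robin and flush unconditionally.
 */
uint16_t choose_asid(struct cpu_state *cpu, uint64_t ctx_id,
                     uint64_t next_gen, bool *need_flush)
{
    for (uint16_t asid = 0; asid < NR_DYNAMIC_ASIDS; asid++) {
        if (cpu->ctxs[asid].ctx_id != ctx_id)
            continue;
        *need_flush = cpu->ctxs[asid].tlb_gen < next_gen;
        cpu->ctxs[asid].tlb_gen = next_gen;
        return asid;                    /* old TLB entries may survive */
    }

    uint16_t asid = cpu->next_asid;

    cpu->next_asid = (uint16_t)((cpu->next_asid + 1) % NR_DYNAMIC_ASIDS);
    cpu->ctxs[asid].ctx_id = ctx_id;
    cpu->ctxs[asid].tlb_gen = next_gen;
    *need_flush = true;                 /* fresh ASID: must flush */
    return asid;
}

A recently used mm gets its old ASID (and possibly its surviving TLB
entries) back, which is where the roughly 100ns saving on context switches
quoted above comes from; an mm with no slot takes the next ASID
round-robin and a full flush.
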
Re: [GIT PULL][PATCH 0/4] Timekeeping items for 4.13

2017-06-20 Thread John Stultz
On Tue, Jun 20, 2017 at 10:21 PM, John Stultz  wrote:
> Just a small set of changes, the biggest changes being the
> MONOTONIC_RAW handling cleanup, and a new kselftest from
> Miroslav. Also a clear warning deprecating
> CONFIG_GENERIC_TIME_VSYSCALL_OLD, which affects ppc and ia64.

I forgot to note (though it can be intuited from the pull request) that
while these are targeted for tip/timers/core, this patchset is based
on the current tip/timers/urgent, as it depends on patches there.

thanks
-john


[PATCH v3 10/11] x86/mm: Enable CR4.PCIDE on supported systems

2017-06-20 Thread Andy Lutomirski
We can use PCID if the CPU has PCID and PGE and we're not on Xen.

By itself, this has no effect.  The next patch will start using
PCID.

Cc: Juergen Gross 
Cc: Boris Ostrovsky 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/include/asm/tlbflush.h |  8 
 arch/x86/kernel/cpu/common.c| 15 +++
 arch/x86/xen/enlighten_pv.c |  6 ++
 3 files changed, 29 insertions(+)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 87b13e51e867..57b305e13c4c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -243,6 +243,14 @@ static inline void __flush_tlb_all(void)
__flush_tlb_global();
else
__flush_tlb();
+
+   /*
+* Note: if we somehow had PCID but not PGE, then this wouldn't work --
+* we'd end up flushing kernel translations for the current ASID but
+* we might fail to flush kernel translations for other cached ASIDs.
+*
+* To avoid this issue, we force PCID off if PGE is off.
+*/
 }
 
 static inline void __flush_tlb_one(unsigned long addr)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 904485e7b230..01caf66b270f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1143,6 +1143,21 @@ static void identify_cpu(struct cpuinfo_x86 *c)
setup_smep(c);
setup_smap(c);
 
+   /* Set up PCID */
+   if (cpu_has(c, X86_FEATURE_PCID)) {
+   if (cpu_has(c, X86_FEATURE_PGE)) {
+   cr4_set_bits(X86_CR4_PCIDE);
+   } else {
+   /*
+* flush_tlb_all(), as currently implemented, won't
+* work if PCID is on but PGE is not.  Since that
+* combination doesn't exist on real hardware, there's
+* no reason to try to fully support it.
+*/
+   clear_cpu_cap(c, X86_FEATURE_PCID);
+   }
+   }
+
/*
 * The vendor-specific functions might have changed features.
 * Now we do "generic changes."
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index f33eef4ebd12..a136aac543c3 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -295,6 +295,12 @@ static void __init xen_init_capabilities(void)
setup_clear_cpu_cap(X86_FEATURE_ACC);
setup_clear_cpu_cap(X86_FEATURE_X2APIC);
 
+   /*
+* Xen PV would need some work to support PCID: CR3 handling as well
+* as xen_flush_tlb_others() would need updating.
+*/
+   setup_clear_cpu_cap(X86_FEATURE_PCID);
+
if (!xen_initial_domain())
setup_clear_cpu_cap(X86_FEATURE_ACPI);
 
-- 
2.9.4



[PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking

2017-06-20 Thread Andy Lutomirski
x86's lazy TLB mode used to be fairly weak -- it would switch to
init_mm the first time it tried to flush a lazy TLB.  This meant an
unnecessary CR3 write and, if the flush was remote, an unnecessary
IPI.

Rewrite it entirely.  When we enter lazy mode, we simply remove the
cpu from mm_cpumask.  This means that we need a way to figure out
whether we've missed a flush when we switch back out of lazy mode.
I use the tlb_gen machinery to track whether a context is up to
date.

Note to reviewers: this patch, by itself, looks a bit odd.  I'm
using an array of length 1 containing (ctx_id, tlb_gen) rather than
just storing tlb_gen, and making it an array isn't necessary yet.
I'm doing this because the next few patches add PCID support, and,
with PCID, we need ctx_id, and the array will end up with a length
greater than 1.  Making it an array now means that there will be
less churn and therefore less stress on your eyeballs.

NB: This is dubious but, AFAICT, still correct on Xen and UV.
xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this
patch changes the way that mm_cpumask() works.  This should be okay,
since Xen *also* iterates all online CPUs to find all the CPUs it
needs to twiddle.

The UV tlbflush code is rather dated and should be changed.

Cc: Andrew Banman 
Cc: Mike Travis 
Cc: Dimitri Sivanich 
Cc: Juergen Gross 
Cc: Boris Ostrovsky 
Signed-off-by: Andy Lutomirski 
---
 arch/x86/include/asm/mmu_context.h |   6 +-
 arch/x86/include/asm/tlbflush.h|   4 -
 arch/x86/mm/init.c |   1 -
 arch/x86/mm/tlb.c  | 227 +++--
 arch/x86/xen/mmu_pv.c  |   3 +-
 5 files changed, 119 insertions(+), 122 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h 
b/arch/x86/include/asm/mmu_context.h
index e5295d485899..69a4f1ee86ac 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -125,8 +125,10 @@ static inline void switch_ldt(struct mm_struct *prev, 
struct mm_struct *next)
 
 static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct 
*tsk)
 {
-   if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
-   this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
+   int cpu = smp_processor_id();
+
+   if (cpumask_test_cpu(cpu, mm_cpumask(mm)))
+   cpumask_clear_cpu(cpu, mm_cpumask(mm));
 }
 
 extern atomic64_t last_mm_ctx_id;
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4f6c30d6ec39..87b13e51e867 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -95,7 +95,6 @@ struct tlb_state {
 * mode even if we've already switched back to swapper_pg_dir.
 */
struct mm_struct *loaded_mm;
-   int state;
 
/*
 * Access to this CR4 shadow and to H/W CR4 is protected by
@@ -310,9 +309,6 @@ static inline void flush_tlb_page(struct vm_area_struct 
*vma, unsigned long a)
 void native_flush_tlb_others(const struct cpumask *cpumask,
 const struct flush_tlb_info *info);
 
-#define TLBSTATE_OK1
-#define TLBSTATE_LAZY  2
-
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch 
*batch,
struct mm_struct *mm)
 {
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 88ee942cb47d..7d6fa4676af9 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -812,7 +812,6 @@ void __init zone_sizes_init(void)
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
.loaded_mm = _mm,
-   .state = 0,
.cr4 = ~0UL,/* fail hard if we screw up cr4 shadow initialization */
 };
 EXPORT_SYMBOL_GPL(cpu_tlbstate);
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 9f5ef7a5e74a..fea2b07ac7d8 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -45,8 +45,8 @@ void leave_mm(int cpu)
if (loaded_mm == _mm)
return;
 
-   if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
-   BUG();
+   /* Warn if we're not lazy. */
+   WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm)));
 
switch_mm(NULL, _mm, NULL);
 }
@@ -67,133 +67,118 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 {
unsigned cpu = smp_processor_id();
struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm);
+   u64 next_tlb_gen;
 
/*
-* NB: The scheduler will call us with prev == next when
-* switching from lazy TLB mode to normal mode if active_mm
-* isn't changing.  When this happens, there is no guarantee
-* that CR3 (and hence cpu_tlbstate.loaded_mm) matches next.
+* NB: The scheduler will call us with prev == next when switching
+* from lazy TLB mode to normal mode if active_mm isn't changing.
+* When this happens, we don't assume that CR3 (and hence

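The rest of the switch_mm_irqs_off() rework is cut off above. A toy model
of the lazy handling the commit message describes (invented names, not the
kernel's): going lazy only clears the CPU's bit in the mm's cpumask, and
leaving lazy mode compares generations to see whether a flush was missed.

#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 4

struct mm {
    uint64_t tlb_gen;               /* bumped by every flush request */
    bool cpumask[NR_CPUS];          /* CPUs that must receive flush IPIs */
};

struct cpu {
    int id;
    struct mm *loaded_mm;
    uint64_t local_tlb_gen;         /* generation the local TLB reflects */
};

void enter_lazy(struct cpu *cpu)
{
    /* Stop receiving flush IPIs; the local TLB may now go stale. */
    cpu->loaded_mm->cpumask[cpu->id] = false;
}

void exit_lazy(struct cpu *cpu)
{
    struct mm *mm = cpu->loaded_mm;

    mm->cpumask[cpu->id] = true;    /* take flush IPIs again */
    if (cpu->local_tlb_gen < mm->tlb_gen) {
        /* At least one flush was missed while lazy: catch up now,
         * e.g. by flushing the local TLB, then record the new gen. */
        cpu->local_tlb_gen = mm->tlb_gen;
    }
}
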
[PATCH v3 01/11] x86/mm: Don't reenter flush_tlb_func_common()

2017-06-20 Thread Andy Lutomirski
It was historically possible to have two concurrent TLB flushes
targeting the same CPU: one initiated locally and one initiated
remotely.  This can now cause an OOPS in leave_mm() at
arch/x86/mm/tlb.c:47:

if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
BUG();

with this call trace:
 flush_tlb_func_local arch/x86/mm/tlb.c:239 [inline]
 flush_tlb_mm_range+0x26d/0x370 arch/x86/mm/tlb.c:317

Without reentrancy, this OOPS is impossible: leave_mm() is only
called if we're not in TLBSTATE_OK, but then we're unexpectedly
in TLBSTATE_OK in leave_mm().

This can be caused by flush_tlb_func_remote() happening between
the two checks and calling leave_mm(), resulting in two consecutive
leave_mm() calls on the same CPU with no intervening switch_mm()
calls.

We never saw this OOPS before because the old leave_mm()
implementation didn't put us back in TLBSTATE_OK, so the assertion
didn't fire.

Nadav noticed the reentrancy issue in a different context, but
neither of us realized that it caused a problem yet.

Cc: Nadav Amit 
Cc: Dave Hansen 
Reported-by: "Levin, Alexander (Sasha Levin)" 
Fixes: 3d28ebceaffa ("x86/mm: Rework lazy TLB to track the actual loaded mm")
Signed-off-by: Andy Lutomirski 
---
 arch/x86/mm/tlb.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2a5e851f2035..f06239c6919f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -208,6 +208,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 static void flush_tlb_func_common(const struct flush_tlb_info *f,
  bool local, enum tlb_flush_reason reason)
 {
+   /* This code cannot presently handle being reentered. */
+   VM_WARN_ON(!irqs_disabled());
+
if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
leave_mm(smp_processor_id());
return;
@@ -313,8 +316,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned 
long start,
info.end = TLB_FLUSH_ALL;
}
 
-   if (mm == this_cpu_read(cpu_tlbstate.loaded_mm))
+   if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
+   local_irq_disable();
flush_tlb_func_local(, TLB_LOCAL_MM_SHOOTDOWN);
+   local_irq_enable();
+   }
+
if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
flush_tlb_others(mm_cpumask(mm), );
put_cpu();
@@ -370,8 +377,12 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch 
*batch)
 
int cpu = get_cpu();
 
-   if (cpumask_test_cpu(cpu, >cpumask))
+   if (cpumask_test_cpu(cpu, >cpumask)) {
+   local_irq_disable();
flush_tlb_func_local(, TLB_LOCAL_SHOOTDOWN);
+   local_irq_enable();
+   }
+
if (cpumask_any_but(>cpumask, cpu) < nr_cpu_ids)
flush_tlb_others(>cpumask, );
cpumask_clear(>cpumask);
-- 
2.9.4



[PATCH 3/4] kselftests: timers: Fix inconsistency-check to not ignore first timestamp

2017-06-20 Thread John Stultz
From: Miroslav Lichvar 

When the first timestamp in the list of clock readings was later than
the second timestamp and all other timestamps were in order, the
inconsistency was not reported because the index of the out-of-order
timestamp was equal to the default value.

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Miroslav Lichvar 
Cc: Richard Cochran 
Cc: Prarit Bhargava 
Cc: Stephen Boyd 
Cc: Shuah Khan 
Signed-off-by: Miroslav Lichvar 
Signed-off-by: John Stultz 
---
 tools/testing/selftests/timers/inconsistency-check.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/timers/inconsistency-check.c 
b/tools/testing/selftests/timers/inconsistency-check.c
index caf1bc9..74c60e8 100644
--- a/tools/testing/selftests/timers/inconsistency-check.c
+++ b/tools/testing/selftests/timers/inconsistency-check.c
@@ -118,7 +118,7 @@ int consistency_test(int clock_type, unsigned long seconds)
start_str = ctime();
 
while (seconds == -1 || now - then < seconds) {
-   inconsistent = 0;
+   inconsistent = -1;
 
/* Fill list */
for (i = 0; i < CALLS_PER_LOOP; i++)
@@ -130,7 +130,7 @@ int consistency_test(int clock_type, unsigned long seconds)
inconsistent = i;
 
/* display inconsistency */
-   if (inconsistent) {
+   if (inconsistent >= 0) {
unsigned long long delta;
 
printf("\%s\n", start_str);
-- 
2.7.4



[PATCH v3 01/11] x86/mm: Don't reenter flush_tlb_func_common()

2017-06-20 Thread Andy Lutomirski
It was historically possible to have two concurrent TLB flushes
targeting the same CPU: one initiated locally and one initiated
remotely.  This can now cause an OOPS in leave_mm() at
arch/x86/mm/tlb.c:47:

if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK)
BUG();

with this call trace:
 flush_tlb_func_local arch/x86/mm/tlb.c:239 [inline]
 flush_tlb_mm_range+0x26d/0x370 arch/x86/mm/tlb.c:317

Without reentrancy, this OOPS is impossible: leave_mm() is only
called if we're not in TLBSTATE_OK, but then we're unexpectedly
in TLBSTATE_OK in leave_mm().

This can be caused by flush_tlb_func_remote() happening between
the two checks and calling leave_mm(), resulting in two consecutive
leave_mm() calls on the same CPU with no intervening switch_mm()
calls.

We never saw this OOPS before because the old leave_mm()
implementation didn't put us back in TLBSTATE_OK, so the assertion
didn't fire.

Nadav noticed the reentrancy issue in a different context, but
neither of us realized that it caused a problem yet.

Cc: Nadav Amit 
Cc: Dave Hansen 
Reported-by: "Levin, Alexander (Sasha Levin)" 
Fixes: 3d28ebceaffa ("x86/mm: Rework lazy TLB to track the actual loaded mm")
Signed-off-by: Andy Lutomirski 
---
 arch/x86/mm/tlb.c | 15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 2a5e851f2035..f06239c6919f 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -208,6 +208,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
 static void flush_tlb_func_common(const struct flush_tlb_info *f,
  bool local, enum tlb_flush_reason reason)
 {
+   /* This code cannot presently handle being reentered. */
+   VM_WARN_ON(!irqs_disabled());
+
if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) {
leave_mm(smp_processor_id());
return;
@@ -313,8 +316,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned 
long start,
info.end = TLB_FLUSH_ALL;
}
 
-   if (mm == this_cpu_read(cpu_tlbstate.loaded_mm))
+   if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) {
+   local_irq_disable();
flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN);
+   local_irq_enable();
+   }
+
if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids)
flush_tlb_others(mm_cpumask(mm), &info);
put_cpu();
@@ -370,8 +377,12 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch 
*batch)
 
int cpu = get_cpu();
 
-   if (cpumask_test_cpu(cpu, &batch->cpumask))
+   if (cpumask_test_cpu(cpu, &batch->cpumask)) {
+   local_irq_disable();
flush_tlb_func_local(&info, TLB_LOCAL_SHOOTDOWN);
+   local_irq_enable();
+   }
+
if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids)
flush_tlb_others(&batch->cpumask, &info);
cpumask_clear(&batch->cpumask);
-- 
2.9.4



[PATCH v3 07/11] x86/mm: Stop calling leave_mm() in idle code

2017-06-20 Thread Andy Lutomirski
Now that lazy TLB suppresses all flush IPIs (as opposed to all but
the first), there's no need to leave_mm() when going idle.

This means we can get rid of the rcuidle hack in
switch_mm_irqs_off() and we can unexport leave_mm().

This also removes acpi_unlazy_tlb() from the x86 and ia64 headers,
since it has no callers any more.
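
As a minimal sketch (illustrative, not part of this patch) of why the
_rcuidle variant was needed at all: tracepoints use RCU, which is not
watching on the idle path, so a probe fired from there had to go through
the generated _rcuidle wrapper or an explicit RCU_NONIDLE(), roughly:

        /* illustrative only: firing a tracepoint from an RCU-idle context */
        RCU_NONIDLE(trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL));

Once leave_mm() is no longer reachable from idle, the plain tracepoint is
enough, which is what the hunks below switch to.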

Signed-off-by: Andy Lutomirski 
---
 arch/ia64/include/asm/acpi.h  |  2 --
 arch/x86/include/asm/acpi.h   |  2 --
 arch/x86/mm/tlb.c | 19 +++
 drivers/acpi/processor_idle.c |  2 --
 drivers/idle/intel_idle.c |  9 -
 5 files changed, 7 insertions(+), 27 deletions(-)

diff --git a/arch/ia64/include/asm/acpi.h b/arch/ia64/include/asm/acpi.h
index a3d0211970e9..c86a947f5368 100644
--- a/arch/ia64/include/asm/acpi.h
+++ b/arch/ia64/include/asm/acpi.h
@@ -112,8 +112,6 @@ static inline void arch_acpi_set_pdc_bits(u32 *buf)
buf[2] |= ACPI_PDC_EST_CAPABILITY_SMP;
 }
 
-#define acpi_unlazy_tlb(x)
-
 #ifdef CONFIG_ACPI_NUMA
 extern cpumask_t early_cpu_possible_map;
 #define for_each_possible_early_cpu(cpu)  \
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index 2efc768e4362..562286fa151f 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -150,8 +150,6 @@ static inline void disable_acpi(void) { }
 extern int x86_acpi_numa_init(void);
 #endif /* CONFIG_ACPI_NUMA */
 
-#define acpi_unlazy_tlb(x) leave_mm(x)
-
 #ifdef CONFIG_ACPI_APEI
 static inline pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr)
 {
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index fea2b07ac7d8..5f932fd80881 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -50,7 +50,6 @@ void leave_mm(int cpu)
 
switch_mm(NULL, &init_mm, NULL);
 }
-EXPORT_SYMBOL_GPL(leave_mm);
 
 void switch_mm(struct mm_struct *prev, struct mm_struct *next,
   struct task_struct *tsk)
@@ -113,14 +112,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen,
   next_tlb_gen);
write_cr3(__pa(next->pgd));
-   /*
-* This gets called via leave_mm() in the idle path
-* where RCU functions differently.  Tracing normally
-* uses RCU, so we have to call the tracepoint
-* specially here.
-*/
-   trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
-   TLB_FLUSH_ALL);
+   trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
+   TLB_FLUSH_ALL);
}
 
/*
@@ -166,13 +159,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct 
mm_struct *next,
this_cpu_write(cpu_tlbstate.loaded_mm, next);
write_cr3(__pa(next->pgd));
 
-   /*
-* This gets called via leave_mm() in the idle path where RCU
-* functions differently.  Tracing normally uses RCU, so we
-* have to call the tracepoint specially here.
-*/
-   trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH,
-   TLB_FLUSH_ALL);
+   trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
}
 
load_mm_cr4(next);
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 5c8aa9cf62d7..fe3d2a40f311 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -708,8 +708,6 @@ static DEFINE_RAW_SPINLOCK(c3_lock);
 static void acpi_idle_enter_bm(struct acpi_processor *pr,
   struct acpi_processor_cx *cx, bool timer_bc)
 {
-   acpi_unlazy_tlb(smp_processor_id());
-
/*
 * Must be done before busmaster disable as we might need to
 * access HPET !
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 216d7ec88c0c..2ae43f59091d 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -912,16 +912,15 @@ static __cpuidle int intel_idle(struct cpuidle_device 
*dev,
struct cpuidle_state *state = &drv->states[index];
unsigned long eax = flg2MWAIT(state->flags);
unsigned int cstate;
-   int cpu = smp_processor_id();
 
cstate = (((eax) >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1;
 
/*
-* leave_mm() to avoid costly and often unnecessary wakeups
-* for flushing the user TLB's associated with the active mm.
+* NB: if CPUIDLE_FLAG_TLB_FLUSHED is set, this idle transition
+* will probably flush the TLB.  It's not guaranteed to flush
+* the TLB, though, so it's not clear that we can do anything
+* useful with this knowledge.
 */
-   if (state->flags & CPUIDLE_FLAG_TLB_FLUSHED)
-   

[PATCH v3 00/11] PCID and improved laziness

2017-06-20 Thread Andy Lutomirski
There are three performance benefits here:

1. TLB flushing is slow.  (I.e. the flush itself takes a while.)
   This avoids many of them when switching tasks by using PCID.  In
   a stupid little benchmark I did, it saves about 100ns on my laptop
   per context switch.  I'll try to improve that benchmark.

2. Mms that have been used recently on a given CPU might get to keep
   their TLB entries alive across process switches with this patch
   set.  TLB fills are pretty fast on modern CPUs, but they're even
   faster when they don't happen.

3. Lazy TLB is way better.  We used to do two stupid things when we
   ran kernel threads: we'd send IPIs to flush user contexts on their
   CPUs and then we'd write to CR3 for no particular reason as an excuse
   to stop further IPIs.  With this patch, we do neither.

This will, in general, perform suboptimally if paravirt TLB flushing
is in use (currently just Xen, I think, but Hyper-V is in the works).
The code is structured so we could fix it in one of two ways: we
could take a spinlock when touching the percpu state so we can update
it remotely after a paravirt flush, or we could be more careful about
exactly how we access the state and use cmpxchg16b to do atomic
remote updates.  (On SMP systems without cmpxchg16b, we'd just skip
the optimization entirely.)
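
For readers unfamiliar with PCID, a minimal sketch of the mechanism the
series builds on (illustrative only; constants are from the SDM, the names
below are not the kernel's):

        /*
         * With CR4.PCIDE=1, the low 12 bits of CR3 select a PCID, and bit 63
         * on a CR3 write asks the CPU not to flush that PCID's TLB entries,
         * which is what lets old translations survive a context switch.
         */
        #define EX_CR3_PCID_MASK        0x0fffULL
        #define EX_CR3_NOFLUSH          (1ULL << 63)

        static unsigned long ex_build_cr3(unsigned long pgd_pa,
                                          unsigned int pcid, bool noflush)
        {
                return pgd_pa | (pcid & EX_CR3_PCID_MASK) |
                       (noflush ? EX_CR3_NOFLUSH : 0);
        }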

This is based on tip:x86/mm.  The branch is here if you want to play:
https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/pcid

Changes from v2:
 - Add some Acks
 - Move the reentrancy issue to the beginning.
   (I also sent the same patch as a standalone fix -- it's just in here
so that this series applies to x86/mm.)
 - Fix some comments.

Changes from RFC:
 - flush_tlb_func_common() no longer gets reentered (Nadav)
 - Fix ASID corruption on unlazying (kbuild bot)
 - Move Xen init to the right place
 - Misc cleanups

Andy Lutomirski (11):
  x86/mm: Don't reenter flush_tlb_func_common()
  x86/ldt: Simplify LDT switching logic
  x86/mm: Remove reset_lazy_tlbstate()
  x86/mm: Give each mm TLB flush generation a unique ID
  x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
  x86/mm: Rework lazy TLB mode and TLB freshness tracking
  x86/mm: Stop calling leave_mm() in idle code
  x86/mm: Disable PCID on 32-bit kernels
  x86/mm: Add nopcid to turn off PCID
  x86/mm: Enable CR4.PCIDE on supported systems
  x86/mm: Try to preserve old TLB entries using PCID

 Documentation/admin-guide/kernel-parameters.txt |   2 +
 arch/ia64/include/asm/acpi.h|   2 -
 arch/x86/include/asm/acpi.h |   2 -
 arch/x86/include/asm/disabled-features.h|   4 +-
 arch/x86/include/asm/mmu.h  |  25 +-
 arch/x86/include/asm/mmu_context.h  |  40 ++-
 arch/x86/include/asm/processor-flags.h  |   2 +
 arch/x86/include/asm/tlbflush.h |  89 +-
 arch/x86/kernel/cpu/bugs.c  |   8 +
 arch/x86/kernel/cpu/common.c|  33 +++
 arch/x86/kernel/smpboot.c   |   1 -
 arch/x86/mm/init.c  |   2 +-
 arch/x86/mm/tlb.c   | 368 +++-
 arch/x86/xen/enlighten_pv.c |   6 +
 arch/x86/xen/mmu_pv.c   |   3 +-
 drivers/acpi/processor_idle.c   |   2 -
 drivers/idle/intel_idle.c   |   9 +-
 17 files changed, 430 insertions(+), 168 deletions(-)

-- 
2.9.4



[GIT PULL][PATCH 0/4] Timekeeping items for 4.13

2017-06-20 Thread John Stultz
Just a small set of changes, the biggest changes being the
MONOTONIC_RAW handling cleanup, and a new kselftest from
Miroslav. Also a clear warning deprecating
CONFIG_GENERIC_TIME_VSYSCALL_OLD, which affects ppc and ia64.

Let me know if you have any thoughts or objections!

thanks
-john

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Miroslav Lichvar 
Cc: Richard Cochran 
Cc: Prarit Bhargava 
Cc: Stephen Boyd 
Cc: Shuah Khan 

The following changes since commit 8e6cec1c7c5afa489687c90be15d6ed82c742975:

  Merge branch 'clockevents/4.12-fixes' of 
https://git.linaro.org/people/daniel.lezcano/linux into timers/urgent 
(2017-06-20 12:50:32 +0200)

are available in the git repository at:

  https://git.linaro.org/people/john.stultz/linux.git fortglx/4.13/time

for you to fetch changes up to 767392565a3e618950fe1a5ff1ba11295f6332f4:

  kselftests: timers: Add test for frequency step (2017-06-20 22:14:45 -0700)


John Stultz (2):
  time: Clean up CLOCK_MONOTONIC_RAW time handling
  time: Add warning about imminent deprecation of
CONFIG_GENERIC_TIME_VSYSCALL_OLD

Miroslav Lichvar (2):
  kselftests: timers: Fix inconsistency-check to not ignore first
timestamp
  kselftests: timers: Add test for frequency step

 arch/arm64/kernel/vdso.c   |   6 +-
 include/linux/timekeeper_internal.h|   4 +-
 kernel/time/timekeeping.c  |  46 ++--
 tools/testing/selftests/timers/Makefile|   5 +-
 tools/testing/selftests/timers/freq-step.c | 268 +
 .../testing/selftests/timers/inconsistency-check.c |   4 +-
 6 files changed, 303 insertions(+), 30 deletions(-)
 create mode 100644 tools/testing/selftests/timers/freq-step.c

-- 
2.7.4



[PATCH v3 08/11] x86/mm: Disable PCID on 32-bit kernels

2017-06-20 Thread Andy Lutomirski
32-bit kernels on new hardware will see PCID in CPUID, but PCID can
only be used in 64-bit mode.  Rather than making all PCID code
conditional, just disable the feature on 32-bit builds.
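
A minimal sketch of how the disabled-features machinery makes this work
(illustrative only; setup_pcid() here is a hypothetical helper, not part of
this patch):

        /*
         * With DISABLE_PCID set in DISABLED_MASK4 on 32-bit builds,
         * cpu_feature_enabled() evaluates to a compile-time 0 and the
         * whole branch is dead-code eliminated; 64-bit builds still get
         * the normal runtime CPUID check.
         */
        if (cpu_feature_enabled(X86_FEATURE_PCID))
                setup_pcid();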

Signed-off-by: Andy Lutomirski 
---
 arch/x86/include/asm/disabled-features.h | 4 +++-
 arch/x86/kernel/cpu/bugs.c   | 8 
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/disabled-features.h 
b/arch/x86/include/asm/disabled-features.h
index 5dff775af7cd..c10c9128f54e 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -21,11 +21,13 @@
 # define DISABLE_K6_MTRR   (1<<(X86_FEATURE_K6_MTRR & 31))
 # define DISABLE_CYRIX_ARR (1<<(X86_FEATURE_CYRIX_ARR & 31))
 # define DISABLE_CENTAUR_MCR   (1<<(X86_FEATURE_CENTAUR_MCR & 31))
+# define DISABLE_PCID  0
 #else
 # define DISABLE_VME   0
 # define DISABLE_K6_MTRR   0
 # define DISABLE_CYRIX_ARR 0
 # define DISABLE_CENTAUR_MCR   0
+# define DISABLE_PCID  (1<<(X86_FEATURE_PCID & 31))
 #endif /* CONFIG_X86_64 */
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
@@ -49,7 +51,7 @@
 #define DISABLED_MASK1 0
 #define DISABLED_MASK2 0
 #define DISABLED_MASK3 (DISABLE_CYRIX_ARR|DISABLE_CENTAUR_MCR|DISABLE_K6_MTRR)
-#define DISABLED_MASK4 0
+#define DISABLED_MASK4 (DISABLE_PCID)
 #define DISABLED_MASK5 0
 #define DISABLED_MASK6 0
 #define DISABLED_MASK7 0
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 0af86d9242da..db684880d74a 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -21,6 +21,14 @@
 
 void __init check_bugs(void)
 {
+#ifdef CONFIG_X86_32
+   /*
+* Regardless of whether PCID is enumerated, the SDM says
+* that it can't be enabled in 32-bit mode.
+*/
+   setup_clear_cpu_cap(X86_FEATURE_PCID);
+#endif
+
identify_boot_cpu();
 
if (!IS_ENABLED(CONFIG_SMP)) {
-- 
2.9.4



[PATCH v3 04/11] x86/mm: Give each mm TLB flush generation a unique ID

2017-06-20 Thread Andy Lutomirski
This adds two new variables to mmu_context_t: ctx_id and tlb_gen.
ctx_id uniquely identifies the mm_struct and will never be reused.
For a given mm_struct (and hence ctx_id), tlb_gen is a monotonic
count of the number of times that a TLB flush has been requested.
The pair (ctx_id, tlb_gen) can be used as an identifier for TLB
flush actions and will be used in subsequent patches to reliably
determine whether all needed TLB flushes have occurred on a given
CPU.

This patch is split out for ease of review.  By itself, it has no
real effect other than creating and updating the new variables.
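
A minimal sketch of how the pair is meant to be consumed by later patches
(illustrative only; the structure and helper below are hypothetical):

        /*
         * A CPU records the (ctx_id, tlb_gen) it has flushed up to; it is
         * stale iff it is tracking a different mm generation stream or has
         * not yet caught up with that mm's latest tlb_gen.
         */
        struct ex_cpu_tlb_state {
                u64 last_ctx_id;
                u64 last_tlb_gen;
        };

        static bool ex_needs_flush(struct ex_cpu_tlb_state *local,
                                   mm_context_t *ctx)
        {
                return local->last_ctx_id != ctx->ctx_id ||
                       local->last_tlb_gen < atomic64_read(&ctx->tlb_gen);
        }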

Signed-off-by: Andy Lutomirski 
---
 arch/x86/include/asm/mmu.h | 25 +++--
 arch/x86/include/asm/mmu_context.h |  5 +
 arch/x86/include/asm/tlbflush.h| 18 ++
 arch/x86/mm/tlb.c  |  6 --
 4 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 79b647a7ebd0..bb8c597c2248 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -3,12 +3,28 @@
 
 #include <linux/spinlock.h>
 #include <linux/mutex.h>
+#include <linux/atomic.h>
 
 /*
- * The x86 doesn't have a mmu context, but
- * we put the segment information here.
+ * x86 has arch-specific MMU state beyond what lives in mm_struct.
  */
 typedef struct {
+   /*
+* ctx_id uniquely identifies this mm_struct.  A ctx_id will never
+* be reused, and zero is not a valid ctx_id.
+*/
+   u64 ctx_id;
+
+   /*
+* Any code that needs to do any sort of TLB flushing for this
+* mm will first make its changes to the page tables, then
+* increment tlb_gen, then flush.  This lets the low-level
+* flushing code keep track of what needs flushing.
+*
+* This is not used on Xen PV.
+*/
+   atomic64_t tlb_gen;
+
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
struct ldt_struct *ldt;
 #endif
@@ -37,6 +53,11 @@ typedef struct {
 #endif
 } mm_context_t;
 
+#define INIT_MM_CONTEXT(mm)\
+   .context = {\
+   .ctx_id = 1,\
+   }
+
 void leave_mm(int cpu);
 
 #endif /* _ASM_X86_MMU_H */
diff --git a/arch/x86/include/asm/mmu_context.h 
b/arch/x86/include/asm/mmu_context.h
index ecfcb6643c9b..e5295d485899 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -129,9 +129,14 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, 
struct task_struct *tsk)
this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
 }
 
+extern atomic64_t last_mm_ctx_id;
+
 static inline int init_new_context(struct task_struct *tsk,
   struct mm_struct *mm)
 {
+   mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);
+   atomic64_set(&mm->context.tlb_gen, 0);
+
#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
/* pkey 0 is the default and always allocated */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 50ea3482e1d1..1eb946c0507e 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -57,6 +57,23 @@ static inline void invpcid_flush_all_nonglobals(void)
__invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL);
 }
 
+static inline u64 bump_mm_tlb_gen(struct mm_struct *mm)
+{
+   u64 new_tlb_gen;
+
+   /*
+* Bump the generation count.  This also serves as a full barrier
+* that synchronizes with switch_mm: callers are required to order
+* their read of mm_cpumask after their writes to the paging
+* structures.
+*/
+   smp_mb__before_atomic();
+   new_tlb_gen = atomic64_inc_return(&mm->context.tlb_gen);
+   smp_mb__after_atomic();
+
+   return new_tlb_gen;
+}
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #else
@@ -262,6 +279,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch 
*batch,
struct mm_struct *mm)
 {
+   bump_mm_tlb_gen(mm);
cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
 }
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index fd593833a854..6d9d37323a43 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,6 +28,8 @@
  * Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
  */
 
+atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
+
 void leave_mm(int cpu)
 {
struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -286,8 +288,8 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long 
start,
 
cpu = get_cpu();
 
-   /* Synchronize with switch_mm. */
-   smp_mb();
+   /* This is also a barrier that synchronizes with switch_mm(). */
+   bump_mm_tlb_gen(mm);
 
/* Should we flush just 

[PATCH 4/4] kselftests: timers: Add test for frequency step

2017-06-20 Thread John Stultz
From: Miroslav Lichvar 

This test checks the response of the system clock to frequency
steps made with adjtimex(). The frequency error and stability of
the CLOCK_MONOTONIC clock relative to the CLOCK_MONOTONIC_RAW clock
is measured in two intervals following the step. The test fails if
values from the second interval exceed specified limits.
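
For reference, a frequency step via adjtimex() looks roughly like this
(illustrative sketch, not the test's actual set_frequency() code;
ADJ_FREQUENCY takes the offset in parts per million scaled by 2^16):

        #include <sys/timex.h>

        /* illustrative only: apply a frequency offset of `ppm` ppm */
        static int ex_set_freq_ppm(double ppm)
        {
                struct timex txc = {
                        .modes = ADJ_FREQUENCY,
                        .freq  = (long)(ppm * 65536.0),
                };

                return adjtimex(&txc);
        }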

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Miroslav Lichvar 
Cc: Richard Cochran 
Cc: Prarit Bhargava 
Cc: Stephen Boyd 
Cc: Shuah Khan 
Signed-off-by: Miroslav Lichvar 
Signed-off-by: John Stultz 
---
 tools/testing/selftests/timers/Makefile|   5 +-
 tools/testing/selftests/timers/freq-step.c | 268 +
 2 files changed, 271 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/timers/freq-step.c

diff --git a/tools/testing/selftests/timers/Makefile 
b/tools/testing/selftests/timers/Makefile
index 5fa1d7e9..5801bbe 100644
--- a/tools/testing/selftests/timers/Makefile
+++ b/tools/testing/selftests/timers/Makefile
@@ -1,6 +1,6 @@
 BUILD_FLAGS = -DKTEST
 CFLAGS += -O3 -Wl,-no-as-needed -Wall $(BUILD_FLAGS)
-LDFLAGS += -lrt -lpthread
+LDFLAGS += -lrt -lpthread -lm
 
 # these are all "safe" tests that don't modify
 # system time or require escalated privileges
@@ -8,7 +8,7 @@ TEST_GEN_PROGS = posix_timers nanosleep nsleep-lat 
set-timer-lat mqueue-lat \
 inconsistency-check raw_skew threadtest rtctest
 
 TEST_GEN_PROGS_EXTENDED = alarmtimer-suspend valid-adjtimex adjtick 
change_skew \
- skew_consistency clocksource-switch leap-a-day \
+ skew_consistency clocksource-switch freq-step leap-a-day \
  leapcrash set-tai set-2038 set-tz
 
 
@@ -24,6 +24,7 @@ run_destructive_tests: run_tests
./change_skew
./skew_consistency
./clocksource-switch
+   ./freq-step
./leap-a-day -s -i 10
./leapcrash
./set-tz
diff --git a/tools/testing/selftests/timers/freq-step.c 
b/tools/testing/selftests/timers/freq-step.c
new file mode 100644
index 000..e8c6183
--- /dev/null
+++ b/tools/testing/selftests/timers/freq-step.c
@@ -0,0 +1,268 @@
+/*
+ * This test checks the response of the system clock to frequency
+ * steps made with adjtimex(). The frequency error and stability of
+ * the CLOCK_MONOTONIC clock relative to the CLOCK_MONOTONIC_RAW clock
+ * is measured in two intervals following the step. The test fails if
+ * values from the second interval exceed specified limits.
+ *
+ * Copyright (C) Miroslav Lichvar   2017
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "../kselftest.h"
+
+#define SAMPLES 100
+#define SAMPLE_READINGS 10
+#define MEAN_SAMPLE_INTERVAL 0.1
+#define STEP_INTERVAL 1.0
+#define MAX_PRECISION 100e-9
+#define MAX_FREQ_ERROR 10e-6
+#define MAX_STDDEV 1000e-9
+
+struct sample {
+   double offset;
+   double time;
+};
+
+static time_t mono_raw_base;
+static time_t mono_base;
+static long user_hz;
+static double precision;
+static double mono_freq_offset;
+
+static double diff_timespec(struct timespec *ts1, struct timespec *ts2)
+{
+   return ts1->tv_sec - ts2->tv_sec + (ts1->tv_nsec - ts2->tv_nsec) / 1e9;
+}
+
+static double get_sample(struct sample *sample)
+{
+   double delay, mindelay = 0.0;
+   struct timespec ts1, ts2, ts3;
+   int i;
+
+   for (i = 0; i < SAMPLE_READINGS; i++) {
+   clock_gettime(CLOCK_MONOTONIC_RAW, &ts1);
+   clock_gettime(CLOCK_MONOTONIC, &ts2);
+   clock_gettime(CLOCK_MONOTONIC_RAW, &ts3);
+
+   ts1.tv_sec -= mono_raw_base;
+   ts2.tv_sec -= mono_base;
+   ts3.tv_sec -= mono_raw_base;
+
+   delay = diff_timespec(&ts3, &ts1);
+   if (delay <= 1e-9) {
+   i--;
+   continue;
+   }
+
+   if (!i || delay < mindelay) {
+   sample->offset = diff_timespec(&ts2, &ts1);
+   sample->offset -= delay / 2.0;
+   sample->time = ts1.tv_sec + ts1.tv_nsec / 1e9;
+   mindelay = delay;
+   }
+   }
+
+   return mindelay;
+}
+
+static void reset_ntp_error(void)
+{
+   struct timex txc;
+
+   txc.modes = ADJ_SETOFFSET;
+   txc.time.tv_sec = 0;
+   txc.time.tv_usec = 0;
+
+   if (adjtimex(&txc) < 0) {
+   perror("[FAIL] adjtimex");
+   ksft_exit_fail();
+   }
+}
+
+static void set_frequency(double freq)
+{
+   struct timex txc;

[PATCH 1/4] time: Clean up CLOCK_MONOTONIC_RAW time handling

2017-06-20 Thread John Stultz
Now that we fixed the sub-ns handling for CLOCK_MONOTONIC_RAW,
remove the duplicative tk->raw_time.tv_nsec, which can be
stored in tk->tkr_raw.xtime_nsec (similarly to how it's handled
for monotonic time).
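
A short sketch of the representation this moves to (illustrative only):
tkr_raw.xtime_nsec holds nanoseconds left-shifted by tkr_raw.shift, so a
raw timespec is recovered as

        /* illustrative only; ignores the unaccumulated clocksource delta */
        ts.tv_sec  = tk->raw_sec;
        ts.tv_nsec = tk->tkr_raw.xtime_nsec >> tk->tkr_raw.shift;

which is the same scheme already used for the monotonic clock.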

Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Miroslav Lichvar 
Cc: Richard Cochran 
Cc: Prarit Bhargava 
Cc: Stephen Boyd 
Cc: Kevin Brodsky 
Cc: Will Deacon 
Cc: Daniel Mentz 
Tested-by: Daniel Mentz 
Signed-off-by: John Stultz 
---
 arch/arm64/kernel/vdso.c|  6 ++---
 include/linux/timekeeper_internal.h |  4 ++--
 kernel/time/timekeeping.c   | 45 -
 3 files changed, 29 insertions(+), 26 deletions(-)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index d0cb007..7492d90 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -220,10 +220,8 @@ void update_vsyscall(struct timekeeper *tk)
if (!use_syscall) {
/* tkr_mono.cycle_last == tkr_raw.cycle_last */
vdso_data->cs_cycle_last= tk->tkr_mono.cycle_last;
-   vdso_data->raw_time_sec = tk->raw_time.tv_sec;
-   vdso_data->raw_time_nsec= (tk->raw_time.tv_nsec <<
-  tk->tkr_raw.shift) +
- tk->tkr_raw.xtime_nsec;
+   vdso_data->raw_time_sec = tk->raw_sec;
+   vdso_data->raw_time_nsec= tk->tkr_raw.xtime_nsec;
vdso_data->xtime_clock_sec  = tk->xtime_sec;
vdso_data->xtime_clock_nsec = tk->tkr_mono.xtime_nsec;
vdso_data->cs_mono_mult = tk->tkr_mono.mult;
diff --git a/include/linux/timekeeper_internal.h 
b/include/linux/timekeeper_internal.h
index f7043cc..0a0a53d 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -51,7 +51,7 @@ struct tk_read_base {
  * @clock_was_set_seq: The sequence number of clock was set events
  * @cs_was_changed_seq:The sequence number of clocksource change events
  * @next_leap_ktime:   CLOCK_MONOTONIC time value of a pending leap-second
- * @raw_time:  Monotonic raw base time in timespec64 format
+ * @raw_sec:   CLOCK_MONOTONIC_RAW  time in seconds
  * @cycle_interval:Number of clock cycles in one NTP interval
  * @xtime_interval:Number of clock shifted nano seconds in one NTP
  * interval.
@@ -93,7 +93,7 @@ struct timekeeper {
unsigned intclock_was_set_seq;
u8  cs_was_changed_seq;
ktime_t next_leap_ktime;
-   struct timespec64   raw_time;
+   u64 raw_sec;
 
/* The following members are for timekeeping internal use */
u64 cycle_interval;
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index b602c48..0454bfa 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -72,6 +72,10 @@ static inline void tk_normalize_xtime(struct timekeeper *tk)
tk->tkr_mono.xtime_nsec -= (u64)NSEC_PER_SEC << tk->tkr_mono.shift;
tk->xtime_sec++;
}
+   while (tk->tkr_raw.xtime_nsec >= ((u64)NSEC_PER_SEC << tk->tkr_raw.shift)) {
+   tk->tkr_raw.xtime_nsec -= (u64)NSEC_PER_SEC << tk->tkr_raw.shift;
+   tk->raw_sec++;
+   }
 }
 
 static inline struct timespec64 tk_xtime(struct timekeeper *tk)
@@ -285,12 +289,14 @@ static void tk_setup_internals(struct timekeeper *tk, 
struct clocksource *clock)
 /* if changing clocks, convert xtime_nsec shift units */
if (old_clock) {
int shift_change = clock->shift - old_clock->shift;
-   if (shift_change < 0)
+   if (shift_change < 0) {
tk->tkr_mono.xtime_nsec >>= -shift_change;
-   else
+   tk->tkr_raw.xtime_nsec >>= -shift_change;
+   } else {
tk->tkr_mono.xtime_nsec <<= shift_change;
+   tk->tkr_raw.xtime_nsec <<= shift_change;
+   }
}
-   tk->tkr_raw.xtime_nsec = 0;
 
tk->tkr_mono.shift = clock->shift;
tk->tkr_raw.shift = clock->shift;
@@ -619,9 +625,6 @@ static inline void tk_update_ktime_data(struct timekeeper 
*tk)
nsec = (u32) tk->wall_to_monotonic.tv_nsec;
tk->tkr_mono.base = ns_to_ktime(seconds * NSEC_PER_SEC + nsec);
 
-   /* Update the monotonic raw base */
-   tk->tkr_raw.base = timespec64_to_ktime(tk->raw_time);
-
/*
 * The sum of the nanoseconds portions of xtime and
 * wall_to_monotonic can be greater/equal one second. Take
@@ -631,6 +634,11 @@ static inline void tk_update_ktime_data(struct timekeeper 
*tk)
if (nsec >= NSEC_PER_SEC)
seconds++;
tk->ktime_sec = seconds;
+
+   /* Update the monotonic raw base */
+   

linux-next: build warning after merge of the percpu tree

2017-06-20 Thread Stephen Rothwell
Hi all,

After merging the percpu tree, today's linux-next build (arm
multi_v7_defconfig) produced this warning:

In file included from include/linux/kernel.h:13:0,
 from include/linux/bitmap.h:9,
 from mm/percpu.c:58:
mm/percpu.c: In function 'pcpu_alloc':
include/linux/printk.h:303:2: warning: 'err' may be used uninitialized in this 
function [-Wmaybe-uninitialized]
  printk(KERN_WARNING pr_fmt(fmt), ##__VA_ARGS__)
  ^
mm/percpu.c:864:14: note: 'err' was declared here
  const char *err;
  ^

I am not sure which commit introduced this, although the

if (is_atomic)
goto fail;

in pcpu_alloc() looks suspicious.
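
Reduced to a sketch, the pattern gcc is complaining about is the following
(illustrative, not the actual pcpu_alloc() code; ex_try_alloc() is a
hypothetical helper):

        static void *ex_alloc(size_t size, bool is_atomic)
        {
                const char *err;
                void *p;

                if (is_atomic)
                        goto fail;      /* reaches 'fail' with err unset */

                p = ex_try_alloc(size);
                if (p)
                        return p;
                err = "allocation failed";
        fail:
                pr_warn("ex_alloc: %s\n", err); /* may be uninitialized */
                return NULL;
        }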

-- 
Cheers,
Stephen Rothwell


Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem

2017-06-20 Thread Andy Lutomirski
On Tue, Jun 20, 2017 at 6:40 PM, Dave Chinner  wrote:
> You're mangling terminology here. We don't "break COW" - we *use*
> copy-on-write to break *extent sharing*. We can break extent sharing
> in page_mkwrite - that's exactly what we do for normal pagecache
> based mmap writes, and it's done in page_mkwrite.

Right, my bad.

>
> It hasn't been enabled for DAX yet because it simply hasn't been
> robustly tested yet.
>
>> A per-inode
>> count of the number of live DAX mappings or of the number of struct
>> file instances that have requested DAX would work here.
>
> What purpose does this serve? The reflink invalidates all the
> existing mappings, so the next write access causes a fault and then
> page_mkwrite is called and the shared extent will get COWed

The same purpose as XFS's FS_XFLAG_DAX (assuming I'm understanding it
right), except that IMO an API that doesn't involve making a change to
an inode that sticks around would be nice.  The inode flag has the
unfortunate property that, if two different programs each try to set
the flag, mmap, write, and clear the flag, they'll stomp on each other
and risk data corruption.

I admit I'm now thoroughly confused as to exactly what XFS does here
-- does FS_XFLAG_DAX persist across unmount/mount?

>
>>  - Trying to use DAX on a file that is already reflinked.  The order
>> of operations doesn't matter hugely, except that breaking COW for the
>> entire range in question all at once would be faster and result in
>> better allocation.
>
> We have COW extent size hints for that. i.e. if you want to COW a
> huge page at a time, set the COW extent size hint to the huge page
> size...

Nifty.

> Apparently it is. There are people telling us that mtime
> updates in page faults introduce too much unpredictable latency and
> that screws over their low latency real time applications.

I was one of those, and I even wrote patches.  I should try to dust them off.

>
> Those same people are telling us that dirty tracking in page faults
> for msync/fsync on DAX is too heavyweight and calling msync is too
> onerous and has unpredictable latencies because it might result in
> having to sync tens of thousands of unrelated dirty objects. Hence
> they want to use userspace data sync primitives to avoid this
> overhead and so filesystems need to make it possible to provide this
> userspace data sync capability.

If I were using DAX in production, I'd have exactly this issue.  Let
me quote myself:

On Tue, Jun 20, 2017 at 9:14 AM, Andy Lutomirski  wrote:
> 3. (Not strictly related to DAX.) A way to tell the kernel "I have
> this file mmapped for write.  Please go out of your way to avoid
> future page faults."  I've wanted this for ordinary files on ext4.
> The kernel could, but presently does not, use hardware dirty tracking
> instead of software dirty tracking to decide when to write the page
> back.  The kernel could also, in principle, write back dirty pages
> without ever write protecting them.  For DAX, this might change
> behavior to prevent any operation that would relocate blocks or to
> have the kernel go out of its way to only do such operations when
> absolutely necessary and to immediately update and unwriteprotect the
> relevant pages.

I agree that this is a real issue, but it's not limited to DAX.  I've
wanted a mode where I tell the kernel "I'm a high-performance
application mmapping this file and I'm going to write to it a lot.  Do
your best to avoid any page faults, even if it adversely affects the
performance of the system."  This mode could do lots of things.  It
could cause the system to leave the page writable even after writeback
and, if possible, to use hardware dirty tracking.  It could cause the
system to make a copy of the page and write back from the copy if
there is anything in play that could need stable pages during
writeback.  And, for DAX, it could tell the system to keep the page
pinned and disallow moving it and reflinking it.

(Of course, the above requires that we either deal with mtime like my
patches do or that this heavyweight mechanism disable mtime updates.
I prefer the former.)

Here's the overall point I'm trying to make: unprivileged programs
that want to write to DAX files with userspace commit mechanisms
(CLFLUSHOPT;SFENCE, etc) should be able to do so reliably, without
privilege, and with reasonably clean APIs.  Ideally they could do this
to any file they have write access to.  Programs that want to write to
mmapped files, DAX or otherwise, without latency spikes due to
.page_mkwrite should be able to opt in to a heavier weight mechanism.
But these two issues are somewhat independent, and I think they should
be solved separately.
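
For concreteness, the userspace commit mechanism referred to above is
roughly the following on x86_64 (an illustrative sketch for a DAX mapping;
real code would pick CLWB/CLFLUSHOPT/CLFLUSH based on CPUID):

        #include <stddef.h>
        #include <stdint.h>

        /* illustrative only: persist a range of a DAX mapping */
        static void ex_persist(const void *addr, size_t len)
        {
                uintptr_t p   = (uintptr_t)addr & ~(uintptr_t)63;
                uintptr_t end = (uintptr_t)addr + len;

                for (; p < end; p += 64)
                        asm volatile("clflushopt %0"
                                     : "+m" (*(volatile char *)p));
                asm volatile("sfence" ::: "memory");
        }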


Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem

2017-06-20 Thread Andy Lutomirski
On Tue, Jun 20, 2017 at 6:40 PM, Dave Chinner  wrote:
> You're mangling terminology here. We don't "break COW" - we *use*
> copy-on-write to break *extent sharing*. We can break extent sharing
> in page_mkwrite - that's exactly what we do for normal pagecache
> based mmap writes, and it's done in page_mkwrite.

Right, my bad.

>
> It hasn't been enabled for DAX yet because it simply hasn't been
> robustly tested yet.
>
>> A per-inode
>> count of the number of live DAX mappings or of the number of struct
>> file instances that have requested DAX would work here.
>
> What purpose does this serve? The reflink invalidates all the
> existing mappings, so the next write access causes a fault and then
> page_mkwrite is called and the shared extent will get COWed

The same purpose as XFS's FS_XFLAG_DAX (assuming I'm understanding it
right), except that IMO an API that doesn't involve making a change to
an inode that sticks around would be nice.  The inode flag has the
unfortunate property that, if two different programs each try to set
the flag, mmap, write, and clear the flag, they'll stomp on each other
and risk data corruption.

I admit I'm now thoroughly confused as to exactly what XFS does here
-- does FS_XFLAG_DAX persist across unmount/mount?

>
>>  - Trying to use DAX on a file that is already reflinked.  The order
>> of operations doesn't matter hugely, except that breaking COW for the
>> entire range in question all at once would be faster and result in
>> better allocation.
>
> We have COW extent size hints for that. i.e. if you want to COW a
> huge page at a time, set the COW extent size hint to the huge page
> size...

Nifty.

> Apparently it is. There are people telling us that mtime
> updates in page faults introduce too much unpredictable latency and
> that screws over their low latency real time applications.

I was one of those, and I even wrote patches.  I should try to dust them off.

>
> Those same people are telling use that dirty tracking in page faults
> for msync/fsync on DAX is too heavyweight and calling msync is too
> onerous and has unpredictable latencies because it might result in
> having to sync tens of thousands of unrelated dirty objects. Hence
> they want to use userspace data sync primitives to avoid this
> overhead and so filesystems need to make it possible to provide this
> userspace idata sync capability.

If I were using DAX in production, I'd have exactly this issue.  Let
me quote myself:

On Tue, Jun 20, 2017 at 9:14 AM, Andy Lutomirski  wrote:
> 3. (Not strictly related to DAX.) A way to tell the kernel "I have
> this file mmapped for write.  Please go out of your way to avoid
> future page faults."  I've wanted this for ordinary files on ext4.
> The kernel could, but presently does not, use hardware dirty tracking
> instead of software dirty tracking to decide when to write the page
> back.  The kernel could also, in principle, write back dirty pages
> without ever write protecting them.  For DAX, this might change
> behavior to prevent any operation that would relocate blocks or to
> have the kernel go out of its way to only do such operations when
> absolutely necessary and to immediately update and unwriteprotect the
> relevant pages.

I agree that this is a real issue, but it's not limited to DAX.  I've
wanted a mode where I tell the kernel "I'm a high-performance
application mmapping this file and I'm going to write to it a lot.  Do
your best to avoid any page faults, even if it adversely affects the
performance of the system."  This mode could do lots of things.  It
could cause the system to leave the page writable even after writeback
and, if possible, to use hardware dirty tracking.  It could cause the
system to make a copy of the page and write back from the copy if
there is anything in play that could need stable pages during
writeback.  And, for DAX, it could tell the system to keep the page
pinned and disallow moving it and reflinking it.

(Of course, the above requires that we either deal with mtime like my
patches do or that this heavyweight mechanism disable mtime updates.
I prefer the former.)

Here's the overall point I'm trying to make: unprivileged programs
that want to write to DAX files with userspace commit mechanisms
(CLFLUSHOPT;SFENCE, etc) should be able to do so reliably, without
privilege, and with reasonably clean APIs.  Ideally they could do this
to any file they have write access to.  Programs that want to write to
mmapped files, DAX or otherwise, without latency spikes due to
.page_mkwrite should be able to opt in to a heavier weight mechanism.
But these two issues are someone independent, and I think they should
be solved separately.


Re: [PATCH 32/51] rtc: pcap: stop using rtc deprecated functions

2017-06-20 Thread kbuild test robot
Hi Benjamin,

[auto build test ERROR on abelloni/rtc-next]
[also build test ERROR on v4.12-rc6 next-20170620]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Benjamin-Gaignard/rtc-stop-using-rtc-deprecated-functions/20170621-044455
base:   https://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux.git 
rtc-next
config: arm-ezx_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm 

All errors (new ones prefixed by >>):

   drivers/built-in.o: In function `pcap_rtc_set_mmss64':
>> hid-generic.c:(.text+0xac5b8): undefined reference to `__aeabi_ldivmod'
   hid-generic.c:(.text+0xac5dc): undefined reference to `__aeabi_ldivmod'
   drivers/built-in.o: In function `pcap_rtc_set_alarm':
>> hid-generic.c:(.text+0xac61c): undefined reference to `__aeabi_uldivmod'
   hid-generic.c:(.text+0xac640): undefined reference to `__aeabi_uldivmod'

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip
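
For context: on 32-bit ARM, __aeabi_ldivmod and __aeabi_uldivmod are the EABI
helpers the compiler emits for plain C division/modulo on 64-bit values, and the
kernel does not provide them. The errors above therefore suggest the converted
pcap driver now divides a 64-bit time value directly; the usual cure (a sketch
under that assumption, not taken from the driver) is to route such divisions
through the kernel's 64-bit helpers instead of the '/' and '%' operators:

#include <linux/math64.h>

/* Hypothetical helper, shown only to illustrate the pattern: compute the
 * seconds-within-day from a 64-bit seconds counter without emitting
 * __aeabi_uldivmod on 32-bit ARM. */
static u32 seconds_in_day(u64 secs)
{
	u32 rem;

	div_u64_rem(secs, 86400, &rem);	/* library call, not the '/' operator */
	return rem;
}

Signed 64-bit divisions (the __aeabi_ldivmod case) have the matching div_s64()
and div_s64_rem() helpers, and do_div() remains the classic in-place variant.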


Re: [Ksummit-discuss] Next steps and plans for the 2017 Maintainer and Kernel Summits

2017-06-20 Thread Theodore Ts'o
On Wed, Jun 14, 2017 at 09:29:20AM -0400, Theodore Ts'o wrote:
> We will use the rest of names that Linus had mentioned as being part
> of his secondary list to preseed the list of people to be considered.
> People who suggest topics that should be discussed on the Maintainer's
> summit will be added to the list.  Please prefix [MAINTAINERS SUMMIT]
> to the subject line, and remember that we only have a half-day, and
> they should be appropriate for the Maintainer's summit --- that is,
> they should be focused on the development process.

If you are interested in attending the Maintainer's Summit, I'd
appreciate it if you could send in any topic proposals by Monday, June
26th.  Many thanks!!

- Ted


Re: [PATCH V1 09/15] spmi: pmic-arb: check apid enabled before calling the handler

2017-06-20 Thread kgunda

On 2017-06-17 02:41, Stephen Boyd wrote:

On 06/14, kgu...@codeaurora.org wrote:

On 2017-06-01 02:09, Stephen Boyd wrote:
>On 05/30, Kiran Gunda wrote:
>>From: Abhijeet Dharmapurikar 
>>
>>The driver currently invokes the apid handler (periph_handler())
>
>You mean periph_interrupt()?
>
Yes.
>>once it sees that the summary status bit for that apid is set.
>>
>>However the hardware is designed to set that bit even if the apid
>>interrupts are disabled. The driver should check whether the apid
>>is indeed enabled before calling the apid handler.
>
>Really? Wow that is awful. Or is this because ACC_ENABLE bit is
>always set now and never cleared?
>
Yes. It is awful. It is not because the ACC_ENABLE bit is set.
>>
>>Signed-off-by: Abhijeet Dharmapurikar 
>>Signed-off-by: Kiran Gunda 
>>---
>> drivers/spmi/spmi-pmic-arb.c | 10 +++---
>> 1 file changed, 7 insertions(+), 3 deletions(-)
>>
>>diff --git a/drivers/spmi/spmi-pmic-arb.c
>>b/drivers/spmi/spmi-pmic-arb.c
>>index ad34491..f8638fa 100644
>>--- a/drivers/spmi/spmi-pmic-arb.c
>>+++ b/drivers/spmi/spmi-pmic-arb.c
>>@@ -536,8 +536,8 @@ static void pmic_arb_chained_irq(struct
>>irq_desc *desc)
>>void __iomem *intr = pa->intr;
>>int first = pa->min_apid >> 5;
>>int last = pa->max_apid >> 5;
>>-   u32 status;
>>-   int i, id;
>>+   u32 status, enable;
>>+   int i, id, apid;
>>
>>chained_irq_enter(chip, desc);
>>
>>@@ -547,7 +547,11 @@ static void pmic_arb_chained_irq(struct
>>irq_desc *desc)
>>while (status) {
>>id = ffs(status) - 1;
>>status &= ~BIT(id);
>>-   periph_interrupt(pa, id + i * 32);
>>+   apid = id + i * 32;
>>+   enable = readl_relaxed(intr +
>>+   pa->ver_ops->acc_enable(apid));
>
>Do we need to read the hardware to figure this out? After earlier
>patches in this series we would never clear the
>SPMI_PIC_ACC_ENABLE_BIT after one of the irqs in a peripheral is
>unmasked for the first time (which looks to be fixing a bug in
>the existing driver BTW). So in practice, this should almost
>always be true.
>
yes. We have removed clearing the SPMI_PIC_ACC_ENABLE_BIT from the
irq_mask,
because if we disable this BIT it disables all the peripheral IRQs,
which we don't want.


Right, we could reference count it though and only clear and set
the bit when we mask and unmask the last irq in the peripheral.

Actually we are disabling the interrupt at the peripheral level, so I believe
disabling the PIC_ACC_EN_BIT is not mandatory.


Once the peripheral fires the interrupt the summary status bit for
that peripheral
is set even though the SPMI_PIC_ACC_ENABLE_BIT is not enabled.
That's why we have to
read this register to not service the interrupt that is not
requested/enabled yet.
This SPMI_PIC_ACC_ENABLE_BIT is enabled during the irq_unmask which
is called from request_irq.


Ok. So this is again about handling the case where an interrupt
is pending out of the bootloader?


Yes.


>In the one case that it isn't true, we'll be handling some other
>irq for another peripheral and then hardware will tell us there's
>an interrupt for a peripheral that doesn't have any interrupts
>unmasked. We would call periph_interrupt() and then that
>shouldn't see any interrupts in the status register for that
>APID. So we do some more work, but nothing happens still. Did I
>miss something? What is this fixing?

Yes. As you said, this fixes the issue of calling periph_interrupt()
for some other irq that has not been requested and enabled yet.



Hmm. I seemed to miss the fact that periph_interrupt() will see
an unmasked interrupt and think it's valid. I thought that only
SPMI_PIC_ACC_ENABLE_BIT was broken, but you're saying that the
status register for a particular peripheral will always latch
interrupts even if we haven't enabled them?

Yes. Correct. Both SPMI_PIC_OWNERm_ACC_STATUSn and SPMI_PIC_IRQ_STATUSn
are getting set when the peripheral fires the interrupt, even though we
haven't enabled that particular peripheral bit in the SPMI_PIC_ACC_ENABLEn 
register.
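
Putting the quoted fragments together, the flow being proposed for
pmic_arb_chained_irq() looks roughly like the sketch below (the declarations are
the ones visible in the quoted hunk; the SPMI_PIC_ACC_ENABLE_BIT test itself is
inferred from the discussion rather than copied from the patch):

	/* per 32-bit summary status word i */
	while (status) {
		id = ffs(status) - 1;
		status &= ~BIT(id);
		apid = id + i * 32;
		enable = readl_relaxed(intr + pa->ver_ops->acc_enable(apid));
		/* ignore APIDs whose interrupts were never unmasked; the
		 * summary/status bits latch even for disabled peripherals */
		if (enable & SPMI_PIC_ACC_ENABLE_BIT)
			periph_interrupt(pa, apid);
	}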


[GVT-g] [Intel-gfx] [ANNOUNCE] 2017-Q2 release of XenGT (Intel GVT-g for Xen)

2017-06-20 Thread Xu, Terrence
Hi all,

We are pleased to announce an update of Intel GVT-g for Xen.

Intel GVT-g is a full GPU virtualization solution with mediated pass-through, 
starting from 4th generation Intel Core(TM) processors with Intel processor 
graphics. A virtual GPU instance is maintained for each VM, with part of 
performance critical resources directly assigned. The capability of running 
native graphics driver inside a VM, without hypervisor intervention in 
performance critical paths, achieves a good balance among performance, feature, 
and sharing capability. GVT-g for Xen hypervisor is XenGT.


Repositories
-Xen :  https://github.com/01org/igvtg-xen (tag: 2017-q2-xengt-stable-4.9)
-Kernel: https://github.com/01org/gvt-linux/ (tag: 2017-q2-gvt-stable-4.11)
-Qemu: https://github.com/01org/igvtg-qemu (tag: 2017-q2-stable-2.9.0)


This update consists of:
-GVT-g upgraded to new architecture which has been upstreamed from kernel 
4.10.
-QEMU version upgraded to 2.9 from 2.3.
-Xen version upgraded to 4.9 from 4.6.
-Supported server platforms: Intel(r) Xeon(r) E3_v4 and E3_v5 with Intel 
Graphics processor. 
-Supported client platforms: Intel(r) Core(tm) 5th generation (code name: 
Broadwell) and 6th generation (code name: Skylake). 
-Validated Guest OS: Windows7 32bit, Windows7 64bit, Windows8.1 64bit, 
Windows10 64bit and Linux.
-GVT-g only supports remote display, not local display, in this release. 
-Remote protocol: only the guest-side remoting protocol is supported; a host-side 
remoting connection like SPICE is a work in progress. For example, the user can 
use X11VNC for a Guest Linux VM or TightVNC for a Guest Windows VM.


Limitations or known issues:
-GVT-g can support a maximum of 7 Guest VMs due to host graphics resource 
limitations. When the user runs 7 VMs simultaneously, the host OS can only run in 
text mode.
-In order to support a Guest Windows7 32bit VM, the user is recommended to add 
vgt_low_gm_sz=128 in the HVM file because a Guest Windows7 32bit VM needs more 
graphics resources than other Guest VMs.
-In order to support high Guest VM resolutions and adjustable screen resolution 
in Guest Windows8.1 64bit and Windows10 64bit VMs, the user is recommended to 
configure vgt_low_gm_sz=64 / 128 / 256 / 512 in the HVM file to get a larger VM 
aperture size.
-In a corner case, a Guest Windows 8.1 64bit VM may be killed automatically by 
Xen when the Guest VM runs into a TDR. This issue happens only on the Broadwell 
platform.
-In a corner case, Guest Windows VMs may be killed automatically by Xen when 
the user shuts down the last launched Guest Windows VM. This issue happens only 
on the Broadwell platform.
-In a corner case, the Guest Linux VM virtual display screen may freeze and not 
be able to recover when the Guest VM runs into a TDR, but the Guest VM is still 
running and alive and can be accessed through SSH.


Setup guide:
https://github.com/01org/gvt-linux/wiki/GVTg_Setup_Guide


This is the first GVT-g community release based on new Upstream architecture 
design, refer to the following document for new architecture introduction:
https://01.org/igvt-g/documentation/intel-gvt-g-new-architecture-introduction


Please subscribe to join the mailing list if you want to learn more about GVT-g 
project: 
https://lists.01.org/mailman/listinfo/igvt-g
Please subscribe to join the mailing list if you want to contribute/review 
latest GVT-g upstream patches:
https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev


Official GVT-g portal: 
https://01.org/igvt-g


More information about background, architecture and others about Intel GVT-g, 
can be found at:
https://01.org/igvt-g
https://www.usenix.org/conference/atc14/technical-sessions/presentation/tian
http://events.linuxfoundation.org/sites/events/files/slides/XenGT-Xen%20Summit-v7_0.pdf
http://events.linuxfoundation.org/sites/events/files/slides/XenGT-Xen%20Summit-REWRITE%203RD%20v4.pdf
https://01.org/xen/blogs/srclarkx/2013/graphics-virtualization-xengt


Note: 
The XenGT project should be considered a work in progress. As such it is not a 
complete product nor should it be considered one. Extra care should be taken 
when testing and configuring a system to use the XenGT project.



Thanks
Terrence
Tel: +86-21-6116 5390
MP: +86-1356 4367 024
Mail: terrence...@intel.com

___
iGVT-g mailing list
igv...@lists.01.org
https://lists.01.org/mailman/listinfo/igvt-g



[GVT-g] [Intel-gfx] [ANNOUNCE] 2017-Q2 release of KVMGT (Intel GVT-g for KVM)

2017-06-20 Thread Xu, Terrence
Hi all,

We are pleased to announce an update of Intel GVT-g for KVM.

Intel GVT-g for KVM (a.k.a. KVMGT) is a full GPU virtualization solution with 
mediated pass-through, starting from 5th generation Intel Core(TM) processors 
with Intel processor graphics.  A virtual GPU instance is maintained for each 
VM, with part of performance critical resources directly assigned. The 
capability of running native graphics driver inside a VM, without hypervisor 
intervention in performance critical paths, achieves a good balance among 
performance, feature, and sharing capability.


Repositories:
-Kernel: https://github.com/01org/gvt-linux/  (tag: 2017-q2-gvt-stable-4.11)
-Qemu: https://github.com/01org/igvtg-qemu  (tag: 2017-q2-stable-2.9.0)


This update consists of:
-GVT-g upgraded to new architecture which has been upstreamed from kernel 
4.10.
-QEMU version upgraded to 2.9 from 2.3 (there are no KVMGT-specific code 
changes in QEMU, so people can directly use upstream QEMU as an 
alternative).
-Supported server platforms: Intel(r) Xeon(r) E3_v4 and E3_v5 with Intel 
Graphics processor. 
-Supported client platforms: Intel(r) Core(tm) 5th generation (code name: 
Broadwell) and 6th generation (code name: Skylake). 
-Validated Guest OS: Windows7 32bit, Windows7 64bit, Windows8.1 64bit, 
Windows10 64bit and Linux.
-GVT-g only supports remote display, not local display, in this release. 
-Remote protocol: only the guest-side remoting protocol is supported; a host-side 
remoting connection like SPICE is a work in progress. For example, the user can 
use X11VNC for a Guest Linux VM or TightVNC for a Guest Windows VM.


Limitations or known issues:
-GVT-g can support a maximum of 7 Guest VMs due to host graphics resource 
limitations. When the user runs 7 VMs simultaneously, the host OS can only run in 
text mode.
-In order to support a Guest Windows7 32bit VM, the user can only use vGPU 
type1, type2, or type4 (not type8) because a Guest Windows7 32bit VM needs more 
graphics resources than other Guest VMs.
-Some 3rd-party applications/tools like GPU_Z and Passmark 9.0 may read/write 
GPU MSRs directly, which will trigger a VM BSOD since those MSRs are unhandled 
registers in KVMGT. The workaround is to set the MSR read/write ignore flag to 1 
in the host grub file by adding "kvm.ignore_msrs=1".
-In a corner case, the Guest Linux VM virtual display screen may freeze and not 
be able to recover when the Guest VM runs into a TDR, but the Guest VM is still 
running and alive and can be accessed through SSH.



Setup guide:
https://github.com/01org/gvt-linux/wiki/GVTg_Setup_Guide


This is the first GVT-g community release based on new Upstream architecture 
design, refer to the following document for new architecture introduction:
https://01.org/igvt-g/documentation/intel-gvt-g-new-architecture-introduction 


Please subscribe to join the mailing list if you want to learn more about GVT-g 
project: 
https://lists.01.org/mailman/listinfo/igvt-g
Please subscribe to join the mailing list if you want to contribute/review 
latest GVT-g upstream patches:
https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev


Official GVT-g portal:
https://01.org/igvt-g


More information about background, architecture and others about Intel GVT-g, 
can be found at:
http://www.linux-kvm.org/images/f/f3/01x08b-KVMGT-a.pdf
https://www.usenix.org/conference/atc14/technical-sessions/presentation/tian


Note:
The KVMGT project should be considered a work in progress. As such it is not a 
complete product nor should it be considered one. Extra care should be taken 
when testing and configuring a system to use the KVMGT project.



Thanks
Terrence
Tel: +86-21-6116 5390
MP: +86-1356 4367 024
Mail: terrence...@intel.com

___
iGVT-g mailing list
igv...@lists.01.org
https://lists.01.org/mailman/listinfo/igvt-g



Re: [PATCH] PM / OPP: Add dev_pm_opp_{set|put}_clkname()

2017-06-20 Thread Viresh Kumar
On 20-06-17, 14:08, Stephen Boyd wrote:
> On 06/20, Viresh Kumar wrote:
> > + */
> > +struct opp_table *dev_pm_opp_set_clkname(struct device *dev, const char 
> > *name)
> > +{
> > +   struct opp_table *opp_table;
> > +   int ret;
> > +
> > +   opp_table = dev_pm_opp_get_opp_table(dev);
> > +   if (!opp_table)
> > +   return ERR_PTR(-ENOMEM);
> > +
> > +   /* This should be called before OPPs are initialized */
> > +   if (WARN_ON(!list_empty(&opp_table->opp_list))) {
> > +   ret = -EBUSY;
> > +   goto err;
> > +   }
> > +
> > +   /* Already have clkname set */
> > +   if (opp_table->clk_name) {
> > +   ret = -EBUSY;
> > +   goto err;
> > +   }
> > +
> > +   opp_table->clk_name = kstrdup(name, GFP_KERNEL);
> > +   if (!opp_table->clk_name) {
> 
> Is there a reason to duplicate clk_name instead of using the clk
> structure returned from clk_get()? Is it because we may already
> have opp_table->clk set from default init? Why can't we always
> clk_put() the clk structure if it's !IS_ERR() and then allow
> dev_pm_opp_set_clkname() to be called many times in succession?
> Long story short, I don't see the benefit to allocating the name
> again here just to use it as a mechanism to know if the APIs have
> been called symmetrically.

Yeah, it was kind of required in what I was trying to do earlier, but
not anymore.

-- 
viresh


[PATCH V2] PM / OPP: Add dev_pm_opp_{set|put}_clkname()

2017-06-20 Thread Viresh Kumar
In order to support OPP switching, OPP layer needs to get pointer to the
clock for the device. Simple cases work fine without using the routines
added by this patch (i.e.  by passing connection-id as NULL), but for a
device with multiple clocks available, the OPP core needs to know the
exact name of the clk to use.

Add a new set of APIs to get that done.

Tested-by: Rajendra Nayak 
Signed-off-by: Viresh Kumar 
---
V1->V2:
- No need to save clk_name, as we wouldn't use it again.

 drivers/base/power/opp/core.c | 67 +++
 include/linux/pm_opp.h|  9 ++
 2 files changed, 76 insertions(+)

diff --git a/drivers/base/power/opp/core.c b/drivers/base/power/opp/core.c
index 5ee7aadf0abf..a8cc14fd8ae4 100644
--- a/drivers/base/power/opp/core.c
+++ b/drivers/base/power/opp/core.c
@@ -1363,6 +1363,73 @@ void dev_pm_opp_put_regulators(struct opp_table 
*opp_table)
 EXPORT_SYMBOL_GPL(dev_pm_opp_put_regulators);
 
 /**
+ * dev_pm_opp_set_clkname() - Set clk name for the device
+ * @dev: Device for which clk name is being set.
+ * @name: Clk name.
+ *
+ * In order to support OPP switching, OPP layer needs to get pointer to the
+ * clock for the device. Simple cases work fine without using this routine 
(i.e.
+ * by passing connection-id as NULL), but for a device with multiple clocks
+ * available, the OPP core needs to know the exact name of the clk to use.
+ *
+ * This must be called before any OPPs are initialized for the device.
+ */
+struct opp_table *dev_pm_opp_set_clkname(struct device *dev, const char *name)
+{
+   struct opp_table *opp_table;
+   int ret;
+
+   opp_table = dev_pm_opp_get_opp_table(dev);
+   if (!opp_table)
+   return ERR_PTR(-ENOMEM);
+
+   /* This should be called before OPPs are initialized */
+   if (WARN_ON(!list_empty(&opp_table->opp_list))) {
+   ret = -EBUSY;
+   goto err;
+   }
+
+   /* Already have default clk set, free it */
+   if (!IS_ERR(opp_table->clk))
+   clk_put(opp_table->clk);
+
+   /* Find clk for the device */
+   opp_table->clk = clk_get(dev, name);
+   if (IS_ERR(opp_table->clk)) {
+   ret = PTR_ERR(opp_table->clk);
+   if (ret != -EPROBE_DEFER) {
+   dev_err(dev, "%s: Couldn't find clock: %d\n", __func__,
+   ret);
+   }
+   goto err;
+   }
+
+   return opp_table;
+
+err:
+   dev_pm_opp_put_opp_table(opp_table);
+
+   return ERR_PTR(ret);
+}
+EXPORT_SYMBOL_GPL(dev_pm_opp_set_clkname);
+
+/**
+ * dev_pm_opp_put_clkname() - Releases resources blocked for clk.
+ * @opp_table: OPP table returned from dev_pm_opp_set_clkname().
+ */
+void dev_pm_opp_put_clkname(struct opp_table *opp_table)
+{
+   /* Make sure there are no concurrent readers while updating opp_table */
+   WARN_ON(!list_empty(&opp_table->opp_list));
+
+   clk_put(opp_table->clk);
+   opp_table->clk = ERR_PTR(-EINVAL);
+
+   dev_pm_opp_put_opp_table(opp_table);
+}
+EXPORT_SYMBOL_GPL(dev_pm_opp_put_clkname);
+
+/**
  * dev_pm_opp_register_set_opp_helper() - Register custom set OPP helper
  * @dev: Device for which the helper is getting registered.
  * @set_opp: Custom set OPP helper.
diff --git a/include/linux/pm_opp.h b/include/linux/pm_opp.h
index a6685b3dde26..51ec727b4824 100644
--- a/include/linux/pm_opp.h
+++ b/include/linux/pm_opp.h
@@ -121,6 +121,8 @@ struct opp_table *dev_pm_opp_set_prop_name(struct device 
*dev, const char *name)
 void dev_pm_opp_put_prop_name(struct opp_table *opp_table);
 struct opp_table *dev_pm_opp_set_regulators(struct device *dev, const char * 
const names[], unsigned int count);
 void dev_pm_opp_put_regulators(struct opp_table *opp_table);
+struct opp_table *dev_pm_opp_set_clkname(struct device *dev, const char * 
name);
+void dev_pm_opp_put_clkname(struct opp_table *opp_table);
 struct opp_table *dev_pm_opp_register_set_opp_helper(struct device *dev, int 
(*set_opp)(struct dev_pm_set_opp_data *data));
 void dev_pm_opp_register_put_opp_helper(struct opp_table *opp_table);
 int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq);
@@ -257,6 +259,13 @@ static inline struct opp_table 
*dev_pm_opp_set_regulators(struct device *dev, co
 
 static inline void dev_pm_opp_put_regulators(struct opp_table *opp_table) {}
 
+static inline struct opp_table *dev_pm_opp_set_clkname(struct device *dev, 
const char * name)
+{
+   return ERR_PTR(-ENOTSUPP);
+}
+
+static inline void dev_pm_opp_put_clkname(struct opp_table *opp_table) {}
+
 static inline int dev_pm_opp_set_rate(struct device *dev, unsigned long 
target_freq)
 {
return -ENOTSUPP;
-- 
2.13.0.71.gd7076ec9c9cb
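
For reference, a consumer of the new pair would look roughly like this (an
illustrative sketch only: the driver name, the "cpu_opp" clock name and the
probe structure are invented, not taken from the patch):

#include <linux/err.h>
#include <linux/platform_device.h>
#include <linux/pm_opp.h>

static int foo_probe(struct platform_device *pdev)
{
	struct opp_table *opp_table;
	int ret;

	/* must run before any OPPs are added for the device */
	opp_table = dev_pm_opp_set_clkname(&pdev->dev, "cpu_opp");
	if (IS_ERR(opp_table))
		return PTR_ERR(opp_table);

	ret = dev_pm_opp_of_add_table(&pdev->dev);
	if (ret) {
		dev_pm_opp_put_clkname(opp_table);
		return ret;
	}

	return 0;
}

On teardown the OPP table has to be removed (dev_pm_opp_of_remove_table())
before dev_pm_opp_put_clkname() is called, since the latter warns if the OPP
list is still populated.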



Re: [PATCH v6 1/4] of: remove *phandle properties from expanded device tree

2017-06-20 Thread Michael Ellerman
Hi Frank,

frowand.l...@gmail.com writes:
> From: Frank Rowand 
>
> Remove "phandle", "linux,phandle", and "ibm,phandle" properties from
> the internal device tree.  The phandle will still be in the struct
> device_node phandle field and will still be displayed as if it is
> a property in /proc/device_tree.
>
> This is to resolve the issue found by Stephen Boyd [1] when he changed
> the type of struct property.value from void * to const void *.  As
> a result of the type change, the overlay code had compile errors
> where the resolver updates phandle values.
>
>   [1] http://lkml.iu.edu/hypermail/linux/kernel/1702.1/04160.html
>
> - Add sysfs infrastructure to report np->phandle, as if it was a property.
> - Do not create "phandle" "ibm,phandle", and "linux,phandle" properties
>   in the expanded device tree.
> - Remove phandle properties in of_attach_node(), for nodes dynamically
>   attached to the live tree.  Add the phandle sysfs entry for these nodes.
> - When creating an overlay changeset, duplicate the node phandle in
>   __of_node_dup().
> - Remove no longer needed checks to exclude "phandle" and "linux,phandle"
>   properties in several locations.
> - A side effect of these changes is that the obsolete "linux,phandle" and
>   "ibm,phandle" properties will no longer appear in /proc/device-tree (they
>   will appear as "phandle").

Sorry but I don't think that can work for us.

Our DLPAR (ie. CPU/memory/device hotplug) stuff on PowerVM uses
"ibm,phandle", and it's not the same thing as "phandle" /
"linux,phandle".

I don't know the code well myself, but the spec (PAPR) says:

  Note: If the “ibm,phandle” property exists, there are two “phandle”
  namespaces which must be kept separate. One is that actually used by
  the OF client interface, the other is properties in the device tree
  making reference to device tree nodes. These requirements are written
  to maintain backward compatibility with older FW versions predating
  these requirements; if the “ibm,phandle” property is not present, the
  OS may assume that any device tree properties which refer to this node
  will have a phandle value matching that returned by client interface
  services.

I have systems here that still use "ibm,phandle". I also see at least
some of the userspace code that looks for "ibm,phandle", and nothing
else.

The note above actually implies that the current Linux code is wrong,
when it uses "ibm,phandle" as the value of np->phandle.

So sorry that's a big mess, but we can't just rip out those properties.

I think the minimal change would be to treat "ibm,phandle" like a normal
property, I think that would allow our tools to keep working?


The other thing that worries me is that by renaming (effectively)
"linux,phandle" to "phandle", we lose the ability to accurately
regenerate the device tree from /proc/device-tree. In theory it
shouldn't matter, but I worry that in practice something will break.

What if we just kept a single bit flag somewhere indicating if the name of
the phandle property we found was "phandle" or "linux,phandle", and
create the sysfs phandle using that name?

cheers
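
The "single bit flag" idea could look something like the sketch below (all names
are made up for illustration; OF_PHANDLE_LEGACY_NAME is not an existing flag):

	/* hypothetical node flag alongside OF_DYNAMIC, OF_DETACHED, ... */
	#define OF_PHANDLE_LEGACY_NAME	5

	/* while parsing, remember which property name carried the phandle */
	if (!of_prop_cmp(pp->name, "linux,phandle"))
		of_node_set_flag(np, OF_PHANDLE_LEGACY_NAME);

	/* when exposing np->phandle in sysfs, reuse the original name */
	name = of_node_check_flag(np, OF_PHANDLE_LEGACY_NAME) ?
			"linux,phandle" : "phandle";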


Re: [PATCH v3 28/31] quota: add get_inode_usage callback to transfer multi-inode charges

2017-06-20 Thread Theodore Ts'o
Tahsin, when you think we've closed on the reviews, could you send out
a complete set of all of the patches on a new mail thread, using git
send-email so I can make sure I'm grabbing the final version of all of
the patches in this patch series?

It's great that you are using separate versions for each patch, since
it makes it easy to track changes in each patch, but when it comes
time for me to get a complete set of patches to apply, it does make
it harder to figure out that I need v5 of this patch, and v3 of that
patch

Many thanks!

- Ted



[PATCH V2 1/5] arch_topology: Get rid of "cap_parsing_done"

2017-06-20 Thread Viresh Kumar
We can reuse "cap_parsing_failed" instead of keeping an additional
variable here.

Signed-off-by: Viresh Kumar 
---
 drivers/base/arch_topology.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index d1c33a85059e..8239a6232808 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -161,7 +161,6 @@ int __init topology_parse_cpu_capacity(struct device_node 
*cpu_node, int cpu)
 
 #ifdef CONFIG_CPU_FREQ
 static cpumask_var_t cpus_to_visit;
-static bool cap_parsing_done;
 static void parsing_done_workfn(struct work_struct *work);
 static DECLARE_WORK(parsing_done_work, parsing_done_workfn);
 
@@ -173,7 +172,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
struct cpufreq_policy *policy = data;
int cpu;
 
-   if (cap_parsing_failed || cap_parsing_done)
+   if (cap_parsing_failed)
return 0;
 
switch (val) {
@@ -193,7 +192,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
topology_normalize_cpu_scale();
kfree(raw_capacity);
pr_debug("cpu_capacity: parsing done\n");
-   cap_parsing_done = true;
+   cap_parsing_failed = true;
schedule_work(&parsing_done_work);
}
}
-- 
2.13.0.71.gd7076ec9c9cb



[PATCH V2 0/5] arch_topology: Minor cleanups

2017-06-20 Thread Viresh Kumar
Hi Greg,

You weren't included in the first [1] version of this series, as it was
targeting arch/arm*/ directories then.

Here are some cleanups for the arch_topology core.

Tested on ARM64 Hikey board by setting following in cpu nodes in DT:
capacity-dmips-mhz = <1000>;

V1->V2:
- based over linux-next/master (to get Juri's recent changes)
- More cleanups included. V1 only had the first patch.
- Rename of cap_parsing_failed isn't required anymore (as it is
  localized to a single routine now).

--
viresh

[1] https://marc.info/?l=linux-arm-kernel=149795553024005

Viresh Kumar (5):
  arch_topology: Get rid of "cap_parsing_done"
  arch_topology: Don't break lines unnecessarily
  arch_topology: Convert switch block to if block
  arch_topology: Return 0 or -ve errors from
topology_parse_cpu_capacity()
  arch_topology: Localize cap_parsing_failed to
topology_parse_cpu_capacity()

 arch/arm/kernel/topology.c   |  2 +-
 drivers/base/arch_topology.c | 78 ++--
 2 files changed, 40 insertions(+), 40 deletions(-)

-- 
2.13.0.71.gd7076ec9c9cb



[PATCH V2 4/5] arch_topology: Return 0 or -ve errors from topology_parse_cpu_capacity()

2017-06-20 Thread Viresh Kumar
Use the standard way of returning errors (0 or a negative error code)
instead of returning 0 (failure) or 1 (success), which is hard to read.

Signed-off-by: Viresh Kumar 
---
 arch/arm/kernel/topology.c   | 2 +-
 drivers/base/arch_topology.c | 8 
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index bf949a763dbe..a7ef4c35855e 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -111,7 +111,7 @@ static void __init parse_dt_topology(void)
continue;
}
 
-   if (topology_parse_cpu_capacity(cn, cpu)) {
+   if (!topology_parse_cpu_capacity(cn, cpu)) {
of_node_put(cn);
continue;
}
diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 07784ba666a7..ff8713b83090 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -121,11 +121,11 @@ void topology_normalize_cpu_scale(void)
 
 int __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
 {
-   int ret = 1;
+   int ret;
u32 cpu_capacity;
 
if (cap_parsing_failed)
-   return !ret;
+   return -EINVAL;
 
ret = of_property_read_u32(cpu_node, "capacity-dmips-mhz",
   &cpu_capacity);
@@ -137,7 +137,7 @@ int __init topology_parse_cpu_capacity(struct device_node 
*cpu_node, int cpu)
if (!raw_capacity) {
pr_err("cpu_capacity: failed to allocate memory 
for raw capacities\n");
cap_parsing_failed = true;
-   return 0;
+   return -ENOMEM;
}
}
capacity_scale = max(cpu_capacity, capacity_scale);
@@ -154,7 +154,7 @@ int __init topology_parse_cpu_capacity(struct device_node 
*cpu_node, int cpu)
kfree(raw_capacity);
}
 
-   return !ret;
+   return ret;
 }
 
 #ifdef CONFIG_CPU_FREQ
-- 
2.13.0.71.gd7076ec9c9cb
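
For illustration, a hypothetical caller under the new 0/-errno convention
could look like the sketch below (not part of this series; the pr_warn()
message and the fallback behaviour are made up for the example):

	int ret = topology_parse_cpu_capacity(cpu_node, cpu);

	/* non-zero now unambiguously means "no usable capacity information" */
	if (ret)
		pr_warn("cpu%d: capacity-dmips-mhz not usable (%d), keeping default\n",
			cpu, ret);

With 0/-errno a non-zero return always means failure, which is the
convention the rest of the kernel follows.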



[PATCH V2 3/5] arch_topology: Convert switch block to if block

2017-06-20 Thread Viresh Kumar
We only need to take care of one case here (CPUFREQ_NOTIFY), and there is
no need to add an extra level of indentation to the case-specific code
by using a switch block. Use an if block instead.

Also add some blank lines to make the code look better.

Signed-off-by: Viresh Kumar 
---
 drivers/base/arch_topology.c | 41 ++---
 1 file changed, 22 insertions(+), 19 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 2f1d9921ee54..07784ba666a7 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -173,26 +173,29 @@ init_cpu_capacity_callback(struct notifier_block *nb,
if (cap_parsing_failed)
return 0;
 
-   switch (val) {
-   case CPUFREQ_NOTIFY:
-   pr_debug("cpu_capacity: init cpu capacity for CPUs [%*pbl] 
(to_visit=%*pbl)\n",
-   cpumask_pr_args(policy->related_cpus),
-   cpumask_pr_args(cpus_to_visit));
-   cpumask_andnot(cpus_to_visit, cpus_to_visit,
-  policy->related_cpus);
-   for_each_cpu(cpu, policy->related_cpus) {
-   raw_capacity[cpu] = topology_get_cpu_scale(NULL, cpu) *
-   policy->cpuinfo.max_freq / 1000UL;
-   capacity_scale = max(raw_capacity[cpu], capacity_scale);
-   }
-   if (cpumask_empty(cpus_to_visit)) {
-   topology_normalize_cpu_scale();
-   kfree(raw_capacity);
-   pr_debug("cpu_capacity: parsing done\n");
-   cap_parsing_failed = true;
-   schedule_work(&parsing_done_work);
-   }
+   if (val != CPUFREQ_NOTIFY)
+   return 0;
+
+   pr_debug("cpu_capacity: init cpu capacity for CPUs [%*pbl] 
(to_visit=%*pbl)\n",
+cpumask_pr_args(policy->related_cpus),
+cpumask_pr_args(cpus_to_visit));
+
+   cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->related_cpus);
+
+   for_each_cpu(cpu, policy->related_cpus) {
+   raw_capacity[cpu] = topology_get_cpu_scale(NULL, cpu) *
+   policy->cpuinfo.max_freq / 1000UL;
+   capacity_scale = max(raw_capacity[cpu], capacity_scale);
}
+
+   if (cpumask_empty(cpus_to_visit)) {
+   topology_normalize_cpu_scale();
+   kfree(raw_capacity);
+   pr_debug("cpu_capacity: parsing done\n");
+   cap_parsing_failed = true;
+   schedule_work(&parsing_done_work);
+   }
+
return 0;
 }
 
-- 
2.13.0.71.gd7076ec9c9cb



[PATCH V2 5/5] arch_topology: Localize cap_parsing_failed to topology_parse_cpu_capacity()

2017-06-20 Thread Viresh Kumar
cap_parsing_failed is only required in topology_parse_cpu_capacity() to
know if we have already tried to allocate raw_capacity and failed, or if
at least one of the CPU nodes didn't have the required
"capacity-dmips-mhz" property.

All other users can use raw_capacity instead of cap_parsing_failed.

Make sure we set raw_capacity to NULL after we free it.

Signed-off-by: Viresh Kumar 
---
 drivers/base/arch_topology.c | 25 +
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index ff8713b83090..d8923f89724f 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -95,14 +95,21 @@ subsys_initcall(register_cpu_capacity_sysctl);
 
 static u32 capacity_scale;
 static u32 *raw_capacity;
-static bool cap_parsing_failed;
+
+static int __init free_raw_capacity(void)
+{
+   kfree(raw_capacity);
+   raw_capacity = NULL;
+
+   return 0;
+}
 
 void topology_normalize_cpu_scale(void)
 {
u64 capacity;
int cpu;
 
-   if (!raw_capacity || cap_parsing_failed)
+   if (!raw_capacity)
return;
 
pr_debug("cpu_capacity: capacity_scale=%u\n", capacity_scale);
@@ -121,6 +128,7 @@ void topology_normalize_cpu_scale(void)
 
 int __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
 {
+   static bool cap_parsing_failed;
int ret;
u32 cpu_capacity;
 
@@ -151,7 +159,7 @@ int __init topology_parse_cpu_capacity(struct device_node 
*cpu_node, int cpu)
pr_err("cpu_capacity: partial information: fallback to 
1024 for all CPUs\n");
}
cap_parsing_failed = true;
-   kfree(raw_capacity);
+   free_raw_capacity();
}
 
return ret;
@@ -170,7 +178,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
struct cpufreq_policy *policy = data;
int cpu;
 
-   if (cap_parsing_failed)
+   if (!raw_capacity)
return 0;
 
if (val != CPUFREQ_NOTIFY)
@@ -190,9 +198,8 @@ init_cpu_capacity_callback(struct notifier_block *nb,
 
if (cpumask_empty(cpus_to_visit)) {
topology_normalize_cpu_scale();
-   kfree(raw_capacity);
+   free_raw_capacity();
pr_debug("cpu_capacity: parsing done\n");
-   cap_parsing_failed = true;
schedule_work(&parsing_done_work);
}
 
@@ -232,11 +239,5 @@ static void parsing_done_workfn(struct work_struct *work)
 }
 
 #else
-static int __init free_raw_capacity(void)
-{
-   kfree(raw_capacity);
-
-   return 0;
-}
 core_initcall(free_raw_capacity);
 #endif
-- 
2.13.0.71.gd7076ec9c9cb



[PATCH V2 2/5] arch_topology: Don't break lines unnecessarily

2017-06-20 Thread Viresh Kumar
There is no need for line breaks in a few places; avoid them.

Signed-off-by: Viresh Kumar 
---
 drivers/base/arch_topology.c | 9 +++--
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 8239a6232808..2f1d9921ee54 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -41,8 +41,7 @@ static ssize_t cpu_capacity_show(struct device *dev,
 {
struct cpu *cpu = container_of(dev, struct cpu, dev);
 
-   return sprintf(buf, "%lu\n",
-   topology_get_cpu_scale(NULL, cpu->dev.id));
+   return sprintf(buf, "%lu\n", topology_get_cpu_scale(NULL, cpu->dev.id));
 }
 
 static ssize_t cpu_capacity_store(struct device *dev,
@@ -128,8 +127,7 @@ int __init topology_parse_cpu_capacity(struct device_node 
*cpu_node, int cpu)
if (cap_parsing_failed)
return !ret;
 
-   ret = of_property_read_u32(cpu_node,
-  "capacity-dmips-mhz",
+   ret = of_property_read_u32(cpu_node, "capacity-dmips-mhz",
   &cpu_capacity);
if (!ret) {
if (!raw_capacity) {
@@ -180,8 +178,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
pr_debug("cpu_capacity: init cpu capacity for CPUs [%*pbl] 
(to_visit=%*pbl)\n",
cpumask_pr_args(policy->related_cpus),
cpumask_pr_args(cpus_to_visit));
-   cpumask_andnot(cpus_to_visit,
-  cpus_to_visit,
+   cpumask_andnot(cpus_to_visit, cpus_to_visit,
   policy->related_cpus);
for_each_cpu(cpu, policy->related_cpus) {
raw_capacity[cpu] = topology_get_cpu_scale(NULL, cpu) *
-- 
2.13.0.71.gd7076ec9c9cb



Re: [PATCH 32/51] rtc: pcap: stop using rtc deprecated functions

2017-06-20 Thread kbuild test robot
Hi Benjamin,

[auto build test ERROR on abelloni/rtc-next]
[also build test ERROR on v4.12-rc6 next-20170620]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Benjamin-Gaignard/rtc-stop-using-rtc-deprecated-functions/20170621-044455
base:   https://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux.git 
rtc-next
config: i386-randconfig-b0-06210824 (attached as .config)
compiler: gcc-5 (Debian 5.4.1-2) 5.4.1 20160904
reproduce:
# save the attached .config to linux build tree
make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/built-in.o: In function `pcap_rtc_set_mmss64':
>> rtc-pcap.c:(.text+0x18a7e0): undefined reference to `__moddi3'
>> rtc-pcap.c:(.text+0x18a800): undefined reference to `__divdi3'
   drivers/built-in.o: In function `pcap_rtc_set_alarm':
>> rtc-pcap.c:(.text+0x18a83e): undefined reference to `__umoddi3'
>> rtc-pcap.c:(.text+0x18a85e): undefined reference to `__udivdi3'

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip
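
These undefined references are what gcc emits on 32-bit x86 when 64-bit '/'
and '%' are open-coded: the compiler lowers them to the libgcc helpers
__divdi3/__moddi3/__udivdi3/__umoddi3, which the kernel does not link
against. The usual in-kernel way around this is div_u64()/div_u64_rem()
from <linux/math64.h>. A minimal sketch (illustrative only, not the fix
applied in this series, and assuming a non-negative 64-bit second count):

	#include <linux/math64.h>

	u32 rem;
	/* days = secs / 86400, rem = secs % 86400, with no libgcc call */
	u64 days = div_u64_rem(secs, 86400, &rem);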


Re: [PATCH 13/51] rtc: cpcap: stop using rtc deprecated functions

2017-06-20 Thread kbuild test robot
Hi Benjamin,

[auto build test ERROR on abelloni/rtc-next]
[also build test ERROR on v4.12-rc6 next-20170620]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Benjamin-Gaignard/rtc-stop-using-rtc-deprecated-functions/20170621-044455
base:   https://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux.git 
rtc-next
config: arm-omap2plus_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
wget 
https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm 

All errors (new ones prefixed by >>):

>> ERROR: "__aeabi_uldivmod" [drivers/rtc/rtc-cpcap.ko] undefined!

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip
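
Same root cause as the rtc-pcap report above: __aeabi_uldivmod is the
libgcc routine 32-bit ARM gcc calls for an open-coded 64-bit division, and
it is not provided inside the kernel. Besides div_u64(), the classic
alternative is do_div(), which divides in place and returns the remainder.
A small sketch (illustrative only, not the actual fix in this series):

	#include <asm/div64.h>

	u64 mins = secs;		/* secs: 64-bit second count */
	u32 rem = do_div(mins, 60);	/* now mins = secs / 60, rem = secs % 60 */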


[PATCH 1/3] ASoC: fsl: mpc5200_dma: remove unused psc_dma

2017-06-20 Thread Kuninori Morimoto

linux/sound/soc/fsl/mpc5200_dma.c:305:18: warning: unused variable
'psc_dma' [-Wunused-variable]

Signed-off-by: Kuninori Morimoto 
---
 sound/soc/fsl/mpc5200_dma.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/sound/soc/fsl/mpc5200_dma.c b/sound/soc/fsl/mpc5200_dma.c
index 93885d9..cd8ccaba 100644
--- a/sound/soc/fsl/mpc5200_dma.c
+++ b/sound/soc/fsl/mpc5200_dma.c
@@ -302,7 +302,6 @@ static int psc_dma_new(struct snd_soc_pcm_runtime *rtd)
struct snd_card *card = rtd->card->snd_card;
struct snd_soc_dai *dai = rtd->cpu_dai;
struct snd_pcm *pcm = rtd->pcm;
-   struct psc_dma *psc_dma = snd_soc_dai_get_drvdata(rtd->cpu_dai);
size_t size = psc_dma_hardware.buffer_bytes_max;
int rc;
 
-- 
1.9.1



Re: [PATCH] acer-wmi: Using zero as the first WMI instance number

2017-06-20 Thread joeyli
On Tue, Jun 20, 2017 at 02:45:56PM -0700, Darren Hart wrote:
> On Tue, Jun 20, 2017 at 10:46:12PM +0200, Pali Rohár wrote:
> > On Tuesday 20 June 2017 19:22:46 Andy Shevchenko wrote:
> > > On Tue, Jun 20, 2017 at 7:48 PM, Pali Rohár 
> > > wrote:
> > > > On Tuesday 20 June 2017 17:06:23 Lee, Chun-Yi wrote:
> > > >> Pali Rohár found that there is some wmi query/evaluation
> > > >> code that uses 'one' as the first WMI instance number.
> > > >> But the number is indexed from zero, so it must be less than
> > > >> the instance_count in _WDG.
> > > >> 
> > > >> This patch changes those instance numbers from one to zero.
> > > >> 
> > > >> Cc: Darren Hart 
> > > >> Cc: Andy Shevchenko 
> > > >> Cc: Pali Rohár 
> > > >> Signed-off-by: "Lee, Chun-Yi" 
> > > > 
> > > > Looks good, Reviewed-by: Pali Rohár 
> > > 
> > > Unfortunately patchwork ignores this tag.
> > > So, in the future please:
> > > - put a tag on a separate line
> > > - do _not_ prepend it by any characters including white spaces
> > > (except new line :-) )

I see! I will also follow the rules in the future.

> > 
> > I have not known that those lines are parsed automatically.
> > Will do it in future!
> 
> :-) Thanks!
> 
> For reference, Patchwork tallies such things in the patch list view:
> https://patchwork.kernel.org/project/platform-driver-x86/list/
> 
> And, selfishly, the more normalized these are, the less likely Andy and I are 
> to
> make mistakes manipulating them into the patches.
>

Joey Lee
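
In other words, a tag patchwork will pick up is simply the bare trailer at
the start of its own line, for example (placeholder name and address):

	Reviewed-by: Some Reviewer <reviewer@example.com>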

