Re: [PATCH 3.18 32/32] mm: larger stack guard gap, between vmas
On Tue, Jun 20, 2017 at 10:49:16PM -0700, Hugh Dickins wrote:
> On Mon, 19 Jun 2017, Greg Kroah-Hartman wrote:
> > 3.18-stable review patch. If anyone has any objections, please let me know.
> >
> > --
> >
> > From: Hugh Dickins
> >
> > commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.
>
> Here's a few adjustments to the 3.18 patch: no doubt you'll have
> already sorted out any build errors (and I have to confess that
> I haven't even tried to build this); and the VM_WARN_ON line (as
> in 4.4) only fixes a highly unlikely error; but those FOLL_MLOCK
> lines in mm/gup.c were mistaken, and do need to be deleted.

Are you sure? The test on the FOLL_MLOCK flag remains present in 4.11
and mainline:

	/* mlock all present pages, but do not fault in new pages */
	if ((*flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK)
		return -ENOENT;

And this test was present, although different, in 3.18 as well:

	/* For mlock, just skip the stack guard page. */
	if ((*flags & FOLL_MLOCK) &&
	    (stack_guard_page_start(vma, address) ||
	     stack_guard_page_end(vma, address + PAGE_SIZE)))

So by removing it we're totally removing any test on FOLL_MLOCK. That
might be the correct fix, but I'm just a bit surprised since the
mainline patch doesn't remove it, and only removes the test on
FOLL_POPULATE.

Thanks,
Willy
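For readers following the flag logic, here is a small Python sketch of the combined-mask test Willy quotes from 4.11/mainline. The flag values are illustrative placeholders, not the kernel's actual bit assignments; the point is that the mask test skips faulting only when FOLL_MLOCK is set and FOLL_POPULATE is clear:

```python
# Illustrative flag values -- NOT the kernel's actual bit assignments.
FOLL_MLOCK = 0x1
FOLL_POPULATE = 0x2

def faultin_skipped(flags):
    """Mimic the 4.11/mainline faultin_page() test: faulting is skipped
    (the -ENOENT return) only when FOLL_MLOCK is set and FOLL_POPULATE
    is clear -- not whenever FOLL_MLOCK alone happens to be set."""
    return (flags & (FOLL_POPULATE | FOLL_MLOCK)) == FOLL_MLOCK

print(faultin_skipped(FOLL_MLOCK))                  # True: mlock of present pages
print(faultin_skipped(FOLL_MLOCK | FOLL_POPULATE))  # False: populate path faults
```

This is exactly why deleting the whole test, rather than just the FOLL_POPULATE half, changes behavior for plain mlock callers.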
[PATCH] [RFC] binfmt_elf: Use ELF_ET_DYN_BASE only for PIE
The ELF_ET_DYN_BASE position was originally intended to keep loaders
away from ET_EXEC binaries. (For example, running "/lib/ld-linux.so.2
/bin/cat" might cause the subsequent load of /bin/cat into where the
loader had been loaded.) With the advent of PIE (ET_DYN binaries with
an INTERP Program Header), ELF_ET_DYN_BASE continued to be used since
the kernel was only looking at ET_DYN. However, since ELF_ET_DYN_BASE
is traditionally set at the top 1/3rd of the TASK_SIZE, a substantial
portion of the address space is unused.

For 32-bit tasks when RLIMIT_STACK is set to RLIM_INFINITY, programs
are loaded below the mmap region. This means they can be made to
collide (CVE-2017-1000370) or nearly collide (CVE-2017-1000371) with
pathological stack regions. Lowering ELF_ET_DYN_BASE solves both by
moving programs above the mmap region in all cases, and will now
additionally avoid programs falling back to the mmap region by
enforcing MAP_FIXED for program loads (i.e. if it would have collided
with the stack, now it will fail to load instead of falling back to
the mmap region).

To allow for a lower ELF_ET_DYN_BASE, loaders (ET_DYN without INTERP)
are loaded into the mmap region, leaving space available for either an
ET_EXEC binary with a fixed location or PIE being loaded into mmap by
the loader. Only PIE programs are loaded offset from ELF_ET_DYN_BASE,
which means architectures can now safely lower their values without
risk of loaders colliding with their subsequently loaded programs.

Thanks go to PaX for inspiration on how to approach this solution.
Fixes: d1fd836dcf00 ("mm: split ET_DYN ASLR from mmap ASLR")
Signed-off-by: Kees Cook
---
 arch/x86/include/asm/elf.h |  8 ++--
 fs/binfmt_elf.c            | 38 +++---
 2 files changed, 33 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index e8ab9a46bc68..46549973ea98 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -245,12 -245,8 @@ extern int force_personality32;
 #define CORE_DUMP_USE_REGSET
 #define ELF_EXEC_PAGESIZE	4096

-/* This is the location that an ET_DYN program is loaded if exec'ed. Typical
-   use of this is to invoke "./ld.so someprog" to test out a new version of
-   the loader. We need to make sure that it is out of the way of the program
-   that it will "exec", and that there is sufficient room for the brk. */
-
-#define ELF_ET_DYN_BASE	(TASK_SIZE / 3 * 2)
+/* This is the base location for PIE (ET_DYN with INTERP) loads. */
+#define ELF_ET_DYN_BASE	0x40UL

 /* This yields a mask that user programs can use to figure out what
    instruction set this CPU supports. This could be done in user space,

diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c
index 5075fd5c62c8..a998c7251d1c 100644
--- a/fs/binfmt_elf.c
+++ b/fs/binfmt_elf.c
@@ -930,13 +930,37 @@ static int load_elf_binary(struct linux_binprm *bprm)
 	if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
 		elf_flags |= MAP_FIXED;
 	} else if (loc->elf_ex.e_type == ET_DYN) {
-		/* Try and get dynamic programs out of the way of the
-		 * default mmap base, as well as whatever program they
-		 * might try to exec. This is because the brk will
-		 * follow the loader, and is not movable. */
-		load_bias = ELF_ET_DYN_BASE - vaddr;
-		if (current->flags & PF_RANDOMIZE)
-			load_bias += arch_mmap_rnd();
+		/*
+		 * There are effectively two types of ET_DYN
+		 * binaries: programs (i.e. PIE: ET_DYN with INTERP)
+		 * and loaders (ET_DYN without INTERP, since they
+		 * _are_ the ELF interpreter). The loaders must
+		 * be loaded away from programs since the program
+		 * may otherwise collide with the loader (especially
+		 * for ET_EXEC which does not have a randomized
+		 * position). For example to handle invocations of
+		 * "./ld.so someprog" to test out a new version of
+		 * the loader, the subsequent program that the
+		 * loader loads must avoid the loader itself, so
+		 * they cannot share the same load range. Sufficient
+		 * room for the brk must be allocated with the
+		 * loader as well, since brk must be available with
+		 * the loader.
+		 *
+		 * Therefore, programs are loaded offset from
+		 * ELF_ET_DYN_BASE and loaders are loaded into the
+		 * independently randomized mmap region (0
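As a side note on the ET_EXEC/ET_DYN distinction the patch relies on, the e_type field can be read straight out of an ELF header. Below is a minimal, self-contained Python sketch that parses a hand-built header (so it does not depend on any file on disk); distinguishing a PIE from a loader would additionally require scanning the program headers for PT_INTERP, which is omitted here:

```python
import struct

ET_EXEC, ET_DYN = 2, 3  # e_type values from the ELF specification

def elf_type(data):
    """Read e_type from a little-endian ELF image: a u16 at offset 16,
    immediately after the 16-byte e_ident block."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    return struct.unpack_from("<H", data, 16)[0]

# Hand-built minimal header: magic, then class/data/version/osabi,
# then abiversion + padding, then e_type claiming ET_DYN (as a PIE
# or a loader would). Only the fields we read are meaningful here.
hdr = b"\x7fELF" + bytes([2, 1, 1, 0]) + b"\x00" * 8 + struct.pack("<H", ET_DYN)

print(elf_type(hdr) == ET_DYN)  # True
```

On a live system the same function applied to the first bytes of /bin/cat would report ET_EXEC or ET_DYN depending on whether the distribution builds it as PIE.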
linux-next: manual merge of the scsi tree with the block tree
Hi all,

Today's linux-next merge of the scsi tree got a conflict in:

  drivers/scsi/scsi_lib.c

between commit:

  ca18d6f769d2 ("block: Make most scsi_req_init() calls implicit")

from the block tree and commit:

  2dd6fb5957a7 ("scsi: Only add commands to the device command list if required by the LLD")

from the scsi tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your
tree is submitted for merging. You may also want to consider
cooperating with the maintainer of the conflicting tree to minimise
any particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/scsi/scsi_lib.c
index 550e29f903b7,41c19c75dab4..
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@@ -1116,20 -1124,35 +1106,49 @@@ err_exit
  }
  EXPORT_SYMBOL(scsi_init_io);

+ /* Add a command to the list used by the aacraid and dpt_i2o drivers */
+ void scsi_add_cmd_to_list(struct scsi_cmnd *cmd)
+ {
+ 	struct scsi_device *sdev = cmd->device;
+ 	struct Scsi_Host *shost = sdev->host;
+ 	unsigned long flags;
+
+ 	if (shost->use_cmd_list) {
+ 		spin_lock_irqsave(&sdev->list_lock, flags);
+ 		list_add_tail(&cmd->list, &sdev->cmd_list);
+ 		spin_unlock_irqrestore(&sdev->list_lock, flags);
+ 	}
+ }
+
+ /* Remove a command from the list used by the aacraid and dpt_i2o drivers */
+ void scsi_del_cmd_from_list(struct scsi_cmnd *cmd)
+ {
+ 	struct scsi_device *sdev = cmd->device;
+ 	struct Scsi_Host *shost = sdev->host;
+ 	unsigned long flags;
+
+ 	if (shost->use_cmd_list) {
+ 		spin_lock_irqsave(&sdev->list_lock, flags);
+ 		BUG_ON(list_empty(&cmd->list));
+ 		list_del_init(&cmd->list);
+ 		spin_unlock_irqrestore(&sdev->list_lock, flags);
+ 	}
+ }
+
+/**
+ * scsi_initialize_rq - initialize struct scsi_cmnd.req
+ *
+ * Called from inside blk_get_request().
+ */
+void scsi_initialize_rq(struct request *rq)
+{
+	struct scsi_cmnd *cmd = blk_mq_rq_to_pdu(rq);
+
+	scsi_req_init(&cmd->req);
+}
+EXPORT_SYMBOL(scsi_initialize_rq);
+
+/* Called after a request has been started. */
 void scsi_init_command(struct scsi_device *dev, struct scsi_cmnd *cmd)
 {
 	void *buf = cmd->sense_buffer;
@@@ -2974,10 -2989,7 +2989,7 @@@ int scsi_internal_device_block_nowait(s
 	 * request queue.
 	 */
 	if (q->mq_ops) {
-		if (wait)
-			blk_mq_quiesce_queue(q);
-		else
-			blk_mq_quiesce_queue_nowait(q);
- 		blk_mq_stop_hw_queues(q);
++		blk_mq_quiesce_queue_nowait(q);
 	} else {
 		spin_lock_irqsave(q->queue_lock, flags);
 		blk_stop_queue(q);
@@@ -2988,31 -2998,77 +2998,77 @@@
 	return 0;
 }
- EXPORT_SYMBOL_GPL(scsi_internal_device_block);
-
+ EXPORT_SYMBOL_GPL(scsi_internal_device_block_nowait);
+
 /**
- * scsi_internal_device_unblock - resume a device after a block request
- * @sdev:	device to resume
- * @new_state:	state to set devices to after unblocking
+ * scsi_internal_device_block - try to transition to the SDEV_BLOCK state
+ * @sdev:	device to block
+ *
+ * Pause SCSI command processing on the specified device and wait until all
+ * ongoing scsi_request_fn() / scsi_queue_rq() calls have finished. May sleep.
 *
- * Called by scsi lld's or the midlayer to restart the device queue
- * for the previously suspended scsi device. Called from interrupt or
- * normal process context.
+ * Returns zero if successful or a negative error code upon failure.
 *
- * Returns zero if successful or error if not.
+ * Note:
+ * This routine transitions the device to the SDEV_BLOCK state (which must be
+ * a legal transition). When the device is in this state, command processing
+ * is paused until the device leaves the SDEV_BLOCK state. See also
+ * scsi_internal_device_unblock().
 *
- * Notes:
- *	This routine transitions the device to the SDEV_RUNNING state
- *	or to one of the offline states (which must be a legal transition)
- *	allowing the midlayer to goose the queue for this device.
+ * To do: avoid that scsi_send_eh_cmnd() calls queuecommand() after
+ * scsi_internal_device_block() has blocked a SCSI device and also
+ * remove the rport mutex lock and unlock calls from srp_queuecommand().
 */
- int
- scsi_internal_device_unblock(struct scsi_device *sdev,
-			     enum scsi_device_state new_state)
+ static int scsi_internal_device_block(struct scsi_device *sdev)
 {
- 	struct request_queue *q = sdev->request_queue;
+ 	struct request_queue *q = sdev->request_queue;
+ 	int err;
+
+ 	mutex_lock(&sdev->state_mutex);
+ 	err = scsi_internal_device_block_nowait(sdev);
+ 	if (err == 0) {
+ 		if
Re: [PATCH 3.18 32/32] mm: larger stack guard gap, between vmas
On Mon, 19 Jun 2017, Greg Kroah-Hartman wrote:
> 3.18-stable review patch. If anyone has any objections, please let me know.
>
> --
>
> From: Hugh Dickins
>
> commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.

Here's a few adjustments to the 3.18 patch: no doubt you'll have
already sorted out any build errors (and I have to confess that
I haven't even tried to build this); and the VM_WARN_ON line (as
in 4.4) only fixes a highly unlikely error; but those FOLL_MLOCK
lines in mm/gup.c were mistaken, and do need to be deleted.

Hugh

diff -purN 318n/include/linux/mm.h 318h/include/linux/mm.h
--- 318n/include/linux/mm.h	2017-06-20 16:48:54.050429965 -0700
+++ 318h/include/linux/mm.h	2017-06-20 19:13:05.842191061 -0700
@@ -1242,6 +1242,9 @@ int set_page_dirty_lock(struct page *pag
 int clear_page_dirty_for_io(struct page *page);
 int get_cmdline(struct task_struct *task, char *buffer, int buflen);

+extern struct task_struct *task_of_stack(struct task_struct *task,
+				struct vm_area_struct *vma, bool in_group);
+
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,
@@ -1897,8 +1900,9 @@ void page_cache_async_readahead(struct a
 			pgoff_t offset,
 			unsigned long size);

-extern unsigned long stack_guard_gap;
+unsigned long max_sane_readahead(unsigned long nr);

+extern unsigned long stack_guard_gap;
 /* Generic expand stack which grows the stack according to GROWS{UP,DOWN} */
 extern int expand_stack(struct vm_area_struct *vma, unsigned long address);

diff -purN 318n/mm/gup.c 318h/mm/gup.c
--- 318n/mm/gup.c	2017-06-20 16:48:54.054429927 -0700
+++ 318h/mm/gup.c	2017-06-20 19:18:19.579275331 -0700
@@ -275,9 +275,6 @@ static int faultin_page(struct task_stru
 	unsigned int fault_flags = 0;
 	int ret;

-	/* mlock all present pages, but do not fault in new pages */
-	if (*flags & FOLL_MLOCK)
-		return -ENOENT;
 	if (*flags & FOLL_WRITE)
 		fault_flags |= FAULT_FLAG_WRITE;
 	if (nonblocking)

diff -purN 318n/mm/mmap.c 318h/mm/mmap.c
--- 318n/mm/mmap.c	2017-06-20 16:48:54.054429927 -0700
+++ 318h/mm/mmap.c	2017-06-20 19:43:16.945345744 -0700
@@ -931,7 +931,7 @@ again: remove_next = 1 + (end > next->
 		else if (next)
 			vma_gap_update(next);
 		else
-			mm->highest_vm_end = end;
+			VM_WARN_ON(mm->highest_vm_end != vm_end_gap(vma));
 	}
 	if (insert && file)
 		uprobe_mmap(insert);
[PATCH v2] brcmfmac: Fix a memory leak in error handling path in 'brcmf_cfg80211_attach'
If 'wiphy_new()' fails, we leak 'ops'. Add a new label in the error
handling path to free it in such a case.

Cc: sta...@vger.kernel.org
Fixes: 5c22fb85102a7 ("brcmfmac: add wowl gtk rekeying offload support")
Signed-off-by: Christophe JAILLET
---
v2: Add CC tag
    Change prefix
---
 drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
index 2443c71a202f..032d823c53c2 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
@@ -6861,7 +6861,7 @@ struct brcmf_cfg80211_info *brcmf_cfg80211_attach(struct brcmf_pub *drvr,
 	wiphy = wiphy_new(ops, sizeof(struct brcmf_cfg80211_info));
 	if (!wiphy) {
 		brcmf_err("Could not allocate wiphy device\n");
-		return NULL;
+		goto ops_out;
 	}
 	memcpy(wiphy->perm_addr, drvr->mac, ETH_ALEN);
 	set_wiphy_dev(wiphy, busdev);
@@ -7012,6 +7012,7 @@ struct brcmf_cfg80211_info *brcmf_cfg80211_attach(struct brcmf_pub *drvr,
 	ifp->vif = NULL;
 wiphy_out:
 	brcmf_free_wiphy(wiphy);
+ops_out:
 	kfree(ops);
 	return NULL;
 }
-- 
2.11.0
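The fix follows the kernel's usual goto-unwind idiom: each failure point jumps to a label that frees exactly what was allocated before it. A rough Python model of that ordering (the helper names and the 'log' list are hypothetical, for illustration only; 'log' records what the new ops_out: label would free):

```python
def attach(make_ops, make_wiphy, log):
    """Model of brcmf_cfg80211_attach()'s unwind ordering: a failure
    after 'ops' is allocated but before 'wiphy' exists must free 'ops'
    alone -- the patch's new 'ops_out:' label. Before the fix, this
    path returned without freeing, leaking 'ops'."""
    ops = make_ops()
    if ops is None:
        return None                  # nothing allocated yet, nothing to free
    wiphy = make_wiphy(ops)
    if wiphy is None:
        log.append("kfree(ops)")     # goto ops_out; pre-fix: leaked here
        return None
    return (wiphy, ops)              # success: caller owns both objects

failures = []
print(attach(lambda: "ops", lambda o: None, failures))  # None
print(failures)                                         # ['kfree(ops)']
```

The general rule the patch restores: error labels must mirror the allocation order in reverse, so every early exit releases precisely the resources acquired so far.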
Re: [PATCH 4.4 30/30] mm: larger stack guard gap, between vmas
On Mon, 19 Jun 2017, Greg Kroah-Hartman wrote:
> 4.4-stable review patch. If anyone has any objections, please let me know.
>
> --
>
> From: Hugh Dickins
>
> commit 1be7107fbe18eed3e319a6c3e83c78254b693acb upstream.

The 4.11 and 4.9 patches are fine, but I noticed a couple of
corrections to this 4.4 one. There are very unlikely circumstances
(on VM_GROWSUP architectures only) in which that "highest_vm_end = end"
line would be wrong: several different ways to fix it, I'm tending to
go with Andrea's observation that the only need for update has already
been done above, so just VM_WARN_ON if it's unexpected.

Hugh

diff -purN 404n/include/linux/mm.h 404h/include/linux/mm.h
--- 404n/include/linux/mm.h	2017-06-20 16:48:17.162770068 -0700
+++ 404h/include/linux/mm.h	2017-06-20 17:36:35.871975695 -0700
@@ -1283,6 +1283,8 @@ static inline bool vma_is_anonymous(stru
 	return !vma->vm_ops;
 }

+int vma_is_stack_for_task(struct vm_area_struct *vma, struct task_struct *t);
+
 extern unsigned long move_page_tables(struct vm_area_struct *vma,
 		unsigned long old_addr, struct vm_area_struct *new_vma,
 		unsigned long new_addr, unsigned long len,

diff -purN 404n/mm/mmap.c 404h/mm/mmap.c
--- 404n/mm/mmap.c	2017-06-20 16:48:17.166770032 -0700
+++ 404h/mm/mmap.c	2017-06-20 17:42:31.312682716 -0700
@@ -923,7 +923,7 @@ again: remove_next = 1 + (end > next->
 		else if (next)
 			vma_gap_update(next);
 		else
-			mm->highest_vm_end = end;
+			VM_WARN_ON(mm->highest_vm_end != vm_end_gap(vma));
 	}
 	if (insert && file)
 		uprobe_mmap(insert);
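For context on what the new VM_WARN_ON compares: vm_end_gap() extends an upward-growing stack's end by the guard gap this commit introduces (256 pages by default). A small Python model of that arithmetic, as a sketch rather than kernel code — the constants assume 4 KB pages and a 64-bit address space:

```python
PAGE_SHIFT = 12
PAGE_SIZE = 1 << PAGE_SHIFT
STACK_GUARD_GAP = 256 << PAGE_SHIFT   # commit default: 256 pages (1 MB)
ADDR_MASK = (1 << 64) - 1             # model 64-bit address wraparound

def vm_end_gap(vm_end, growsup):
    """Mirror the commit's vm_end_gap(): a VM_GROWSUP stack's effective
    end includes the guard gap, clamped if the addition overflows."""
    if not growsup:
        return vm_end
    end = (vm_end + STACK_GUARD_GAP) & ADDR_MASK
    if end < vm_end:                  # wrapped past the top of the space
        end = (-PAGE_SIZE) & ADDR_MASK
    return end

print(hex(vm_end_gap(0x100000, True)))  # 0x200000: end plus the 1 MB gap
```

This makes Hugh's point concrete: on VM_GROWSUP architectures highest_vm_end must track the gap-extended end, so the old "highest_vm_end = end" assignment could be wrong, and the warning asserts the correct value was already in place.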
Re: [PATCH 2/6] drivers base/arch_topology: frequency-invariant load-tracking support
On 20-06-17, 17:31, Saravana Kannan wrote: > On 06/19/2017 11:17 PM, Viresh Kumar wrote: > >On Thu, Jun 8, 2017 at 1:25 PM, Dietmar Eggemann > >wrote: > > > >>diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c > > > >> static int __init register_cpufreq_notifier(void) > >> { > >>+ int ret; > >>+ > >> /* > >> * on ACPI-based systems we need to use the default cpu capacity > >> * until we have the necessary code to parse the cpu capacity, so > >>@@ -225,8 +265,14 @@ static int __init register_cpufreq_notifier(void) > >> > >> cpumask_copy(cpus_to_visit, cpu_possible_mask); > >> > >>- return cpufreq_register_notifier(_cpu_capacity_notifier, > >>-CPUFREQ_POLICY_NOTIFIER); > >>+ ret = cpufreq_register_notifier(_cpu_capacity_notifier, > >>+ CPUFREQ_POLICY_NOTIFIER); > > > >Wanted to make sure that we all understand the constraints this is going to > >add > >for the ARM64 platforms. > > > >With the introduction of this transition notifier, we would not be able to > >use > >the fast-switch path in the schedutil governor. I am not sure if there are > >any > >ARM platforms that can actually use the fast-switch path in future or not > >though. The requirement of fast-switch path is that the freq can be changed > >without sleeping in the hot-path. > > > >So, will we ever want fast-switching for ARM platforms ? > > > > I don't think we should go down a path that'll prevent ARM platform from > switching over to fast-switching in the future. Yeah, that's why brought attention to this stuff. I think this patch doesn't really need to go down the notifiers way. We can do something like this in the implementation of topology_get_freq_scale(): return (policy->cur << SCHED_CAPACITY_SHIFT) / max; Though, we would be required to take care of policy structure in this case somehow. > Having said that, I'm not sure I fully agree with the decision to completely > disable notifiers in the fast-switching case. 
How many of the current users > of notifiers truly need support for sleeping in the notifier? It's not just about sleeping here. We do not wish to call too much stuff from the scheduler hot path, even if it doesn't sleep. > Why not make > all the transition notifiers atomic? Or at least add atomic transition > notifiers that can be registered for separately if the client doesn't need > the ability to sleep? > > Most of the clients don't seem like ones that'll need to sleep. Only if the scheduler maintainers agree to getting these notifiers called from the hot path, which I don't think is going to happen. > There are a bunch of generic off-tree drivers (can't upstream them yet > because it depends on the bus scaling framework) that also depend on CPUfreq > transition notifiers that are going to stop working if fast switching > becomes available in the future. So, this decision to disallow transition > notifiers is painful for other reasons too. I think it's kind of fine to work without fast switch in those cases, as we are anyway ready to call notifiers which may end up taking any amount of time. This case was special as it is affecting an entire arch here, and so I pointed it out. -- viresh
Re: [RFC v2 01/12] powerpc: Free up four 64K PTE bits in 4K backed hpte pages.
On 06/21/2017 04:53 AM, Ram Pai wrote: > On Tue, Jun 20, 2017 at 03:50:25PM +0530, Anshuman Khandual wrote: >> On 06/17/2017 09:22 AM, Ram Pai wrote: >>> Rearrange 64K PTE bits to free up bits 3, 4, 5 and 6 >>> in the 4K backed hpte pages. These bits continue to be used >>> for 64K backed hpte pages in this patch, but will be freed >>> up in the next patch. >> >> The counting 3, 4, 5 and 6 are in BE format I believe, I was >> initially trying to see that from right to left as we normally >> do in the kernel and was getting confused. So basically these >> bits (which are only applicable for 64K mapping IIUC) are going >> to be freed up from the PTE format. >> >> #define _RPAGE_RSV1 0x1000UL >> #define _RPAGE_RSV2 0x0800UL >> #define _RPAGE_RSV3 0x0400UL >> #define _RPAGE_RSV4 0x0200UL >> >> As you have mentioned before this feature is available for 64K >> page size only and not for 4K mappings. So I assume we support >> both the combinations. >> >> * 64K mapping on 64K >> * 64K mapping on 4K > > yes. > >> >> These are the current users of the above bits >> >> #define H_PAGE_BUSY _RPAGE_RSV1 /* software: PTE & hash are busy */ >> #define H_PAGE_F_SECOND _RPAGE_RSV2 /* HPTE is in 2ndary >> HPTEG */ >> #define H_PAGE_F_GIX (_RPAGE_RSV3 | _RPAGE_RSV4 | _RPAGE_RPN44) >> #define H_PAGE_HASHPTE _RPAGE_RPN43 /* PTE has associated >> HPTE */ >> >>> >>> The patch does the following change to the 64K PTE format >>> >>> H_PAGE_BUSY moves from bit 3 to bit 9 >> >> and what is in there on bit 9 now ? This ? >> >> #define _RPAGE_SW2 0x00400 >> >> which is used as >> >> #define _PAGE_SPECIAL _RPAGE_SW2 /* software: special page */ >> >> which will not be required any more ? > > i think you are reading bit 9 from right to left. the bit 9 i refer to > is from left to right. Using the same numbering convention the ISA3.0 uses. Right, my bad. Then it would be this one. 
'#define _RPAGE_RPN42 0x0040UL' > I know it is confusing, will make a mention in the comment of this > patch, to read it the big-endian way. Right. > > BTW: Bit 9 is not used currently. so using it in this patch. But this is > a temporary move. the H_PAGE_BUSY will move to bit 7 in the next patch. > > Had to keep at bit 9, because bit 7 is not yet entirely freed up. it is > used by 64K PTE backed by 64k htpe. Got it. > >> >>> H_PAGE_F_SECOND which occupied bit 4 moves to the second part >>> of the pte. >>> H_PAGE_F_GIX which occupied bit 5, 6 and 7 also moves to the >>> second part of the pte. >>> >>> the four bits((H_PAGE_F_SECOND|H_PAGE_F_GIX) that represent a slot >>> is initialized to 0xF indicating an invalid slot. If a hpte >>> gets cached in a 0xF slot(i.e 7th slot of secondary), it is >>> released immediately. In other words, even though 0xF is a >> >> Release immediately means we attempt again for a new hash slot ? > > yes. > >> >>> valid slot we discard and consider it as an invalid >>> slot;i.e hpte_soft_invalid(). This gives us an opportunity to not >>> depend on a bit in the primary PTE in order to determine the >>> validity of a slot. >> >> So we have to see the slot number in the second half for each PTE to >> figure out if it has got a valid slot in the hash page table. > > yes. > >> >>> >>> When we release a hpte in the 0xF slot we also release a >>> legitimate primary slot and unmap that entry. This is to >>> ensure that we do get a legimate non-0xF slot the next time we >>> retry for a slot. >> >> Okay. >> >>> >>> Though treating 0xF slot as invalid reduces the number of available >>> slots and may have an effect on the performance, the probabilty >>> of hitting a 0xF is extermely low. >> >> Why you say that ? I thought every slot number has the same probability >> of hit from the hash function. > > Every hash bucket has the same probability. But every slot within the > hash bucket is filled in sequentially. 
so it takes 15 hptes to hash to > the same bucket before we get to the 15th slot in the secondary. Okay, would the last one be 16th instead ? > >> >>> >>> Compared to the current scheme, the above described scheme reduces >>> the number of false hash table updates significantly and has the >> >> How it reduces false hash table updates ? > > earlier, we had 1 bit allocated in the first-part-of-the 64K-PTE > for four consecutive 4K hptes. If any one 4k hpte got hashed-in, > the bit got set. Which means anytime it faulted on the remaining > three 4k hpte, we saw the bit already set and tried to erroneously > update that hpte. So we had a 75% update error rate. Funcationally > not bad, but bad from a performance point of view. I am a bit out of sync regarding these PTE bits, after Aneesh's radix changes went in :) Will look into this bit closer. > > With
[PATCH v2] backlight: lm3630a: bump REG_MAX value to 0x50 instead of 0x1F
In lm3630a_chip_init we try to write to the 0x50 register, which is a higher value than the max_register value; this resulted in regmap_write returning -EIO. Fix this by bumping the REG_MAX value to 0x50. Signed-off-by: Bhushan Shah Suggested-by: Bjorn Andersson --- Changes since v1: - Fix the lm3630a_write call to use the proper value (sent wrong patch earlier) drivers/video/backlight/lm3630a_bl.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/video/backlight/lm3630a_bl.c b/drivers/video/backlight/lm3630a_bl.c index 60d6c2ac87aa..b641f706dbc9 100644 --- a/drivers/video/backlight/lm3630a_bl.c +++ b/drivers/video/backlight/lm3630a_bl.c @@ -31,7 +31,8 @@ #define REG_FAULT 0x0B #define REG_PWM_OUTLOW 0x12 #define REG_PWM_OUTHIGH 0x13 -#define REG_MAX 0x1F +#define REG_FLTR_STR 0x50 +#define REG_MAX 0x50 #define INT_DEBOUNCE_MSEC 10 struct lm3630a_chip { @@ -80,7 +81,7 @@ static int lm3630a_chip_init(struct lm3630a_chip *pchip) usleep_range(1000, 2000); /* set Filter Strength Register */ - rval = lm3630a_write(pchip, 0x50, 0x03); + rval = lm3630a_write(pchip, REG_FLTR_STR, 0x03); /* set Cofig. register */ rval |= lm3630a_update(pchip, REG_CONFIG, 0x07, pdata->pwm_ctrl); /* set boost control */ -- 2.13.0
[PATCH] backlight: lm3630a: bump REG_MAX value to 0x50 instead of 0x1F
In lm3630a_chip_init we try to write to the 0x50 register, which is a higher value than the max_register value; this resulted in regmap_write returning -EIO. Fix this by bumping the REG_MAX value to 0x50. Signed-off-by: Bhushan Shah Suggested-by: Bjorn Andersson --- drivers/video/backlight/lm3630a_bl.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/video/backlight/lm3630a_bl.c b/drivers/video/backlight/lm3630a_bl.c index 60d6c2ac87aa..42496b6c09d0 100644 --- a/drivers/video/backlight/lm3630a_bl.c +++ b/drivers/video/backlight/lm3630a_bl.c @@ -31,7 +31,8 @@ #define REG_FAULT 0x0B #define REG_PWM_OUTLOW 0x12 #define REG_PWM_OUTHIGH 0x13 -#define REG_MAX 0x1F +#define REG_FLTR_STR 0x50 +#define REG_MAX 0x50 #define INT_DEBOUNCE_MSEC 10 struct lm3630a_chip { @@ -80,7 +81,7 @@ static int lm3630a_chip_init(struct lm3630a_chip *pchip) usleep_range(1000, 2000); /* set Filter Strength Register */ - rval = lm3630a_write(pchip, 0x50, 0x03); + rval = lm3630a_write(pchip, FLTR_STR, 0x03); /* set Cofig. register */ rval |= lm3630a_update(pchip, REG_CONFIG, 0x07, pdata->pwm_ctrl); /* set boost control */ -- 2.13.0
[PATCH] MAINTAINERS: Add Qualcomm pinctrl drivers section
Document the maintainership of the Qualcomm pinctrl drivers to improve the hitrate. Signed-off-by: Bjorn Andersson --- MAINTAINERS | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/MAINTAINERS b/MAINTAINERS index 75cd0dc8e93f..c2e3a349ca7c 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -1666,7 +1666,6 @@ F: arch/arm/mach-qcom/ F: arch/arm64/boot/dts/qcom/* F: drivers/i2c/busses/i2c-qup.c F: drivers/clk/qcom/ -F: drivers/pinctrl/qcom/ F: drivers/dma/qcom/ F: drivers/soc/qcom/ F: drivers/spi/spi-qup.c @@ -10183,6 +10182,13 @@ M: Heikki Krogerus S: Maintained F: drivers/pinctrl/intel/ +PIN CONTROLLER - QUALCOMM +M: Bjorn Andersson +S: Maintained +L: linux-arm-...@vger.kernel.org +F: Documentation/devicetree/bindings/pinctrl/qcom,*.txt +F: drivers/pinctrl/qcom/ + PIN CONTROLLER - RENESAS M: Laurent Pinchart M: Geert Uytterhoeven -- 2.12.0
[PATCH v3 02/11] x86/ldt: Simplify LDT switching logic
Originally, Linux reloaded the LDT whenever the prev mm or the next mm had an LDT. It was changed in 0bbed3beb4f2 ("[PATCH] Thread-Local Storage (TLS) support") (from the historical tree) like this: - /* load_LDT, if either the previous or next thread -* has a non-default LDT. + /* +* load the LDT, if the LDT is different: */ - if (next->context.size+prev->context.size) + if (unlikely(prev->context.ldt != next->context.ldt)) load_LDT(&next->context); The current code is unlikely to avoid any LDT reloads, since different mms won't share an LDT. When we redo lazy mode to stop flush IPIs without switching to init_mm, though, the current logic would become incorrect: it will be possible to have real_prev == next but nonetheless have a stale LDT descriptor. Simplify the code to update LDTR if either the previous or the next mm has an LDT, i.e. effectively restore the historical logic. While we're at it, clean up the code by moving all the ifdeffery to a header where it belongs. Acked-by: Rik van Riel Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/mmu_context.h | 26 ++ arch/x86/mm/tlb.c | 20 ++-- 2 files changed, 28 insertions(+), 18 deletions(-) diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 1458f530948b..ecfcb6643c9b 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -93,6 +93,32 @@ static inline void load_mm_ldt(struct mm_struct *mm) #else clear_LDT(); #endif +} + +static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next) +{ +#ifdef CONFIG_MODIFY_LDT_SYSCALL + /* +* Load the LDT if either the old or new mm had an LDT. +* +* An mm will never go from having an LDT to not having an LDT. Two +* mms never share an LDT, so we don't gain anything by checking to +* see whether the LDT changed. There's also no guarantee that +* prev->context.ldt actually matches LDTR, but, if LDTR is non-NULL, +* then prev->context.ldt will also be non-NULL. 
+* +* If we really cared, we could optimize the case where prev == next +* and we're exiting lazy mode. Most of the time, if this happens, +* we don't actually need to reload LDTR, but modify_ldt() is mostly +* used by legacy code and emulators where we don't need this level of +* performance. +* +* This uses | instead of || because it generates better code. +*/ + if (unlikely((unsigned long)prev->context.ldt | +(unsigned long)next->context.ldt)) + load_mm_ldt(next); +#endif DEBUG_LOCKS_WARN_ON(preemptible()); } diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index f06239c6919f..fd593833a854 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -148,25 +148,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, real_prev != &init_mm); cpumask_clear_cpu(cpu, mm_cpumask(real_prev)); - /* Load per-mm CR4 state */ + /* Load per-mm CR4 and LDTR state */ load_mm_cr4(next); - -#ifdef CONFIG_MODIFY_LDT_SYSCALL - /* -* Load the LDT, if the LDT is different. -* -* It's possible that prev->context.ldt doesn't match -* the LDT register. This can happen if leave_mm(prev) -* was called and then modify_ldt changed -* prev->context.ldt but suppressed an IPI to this CPU. -* In this case, prev->context.ldt != NULL, because we -* never set context.ldt to NULL while the mm still -* exists. That means that next->context.ldt != -* prev->context.ldt, because mms never share an LDT. -*/ - if (unlikely(real_prev->context.ldt != next->context.ldt)) - load_mm_ldt(next); -#endif + switch_ldt(real_prev, next); } /* -- 2.9.4
[PATCH 2/4] time: Add warning about imminent deprecation of CONFIG_GENERIC_TIME_VSYSCALL_OLD
CONFIG_GENERIC_TIME_VSYSCALL_OLD was introduced five years ago to allow a transition from the old vsyscall implementations to the new method (which simplified internal accounting and made timekeeping more precise). However, PPC and IA64 have yet to make the transition, despite in some cases me sending test patches to try to help it along. http://patches.linaro.org/patch/30501/ http://patches.linaro.org/patch/35412/ If it's helpful, my last pass at the patches can be found here: https://git.linaro.org/people/john.stultz/linux.git dev/oldvsyscall-cleanup So I think it's time to set a deadline and make it clear this is going away. So this patch adds warnings about this functionality being dropped, likely in v4.15. Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Miroslav Lichvar Cc: Richard Cochran Cc: Prarit Bhargava Cc: Marcelo Tosatti Cc: Paul Mackerras Cc: Anton Blanchard Cc: Benjamin Herrenschmidt Cc: Tony Luck Cc: Michael Ellerman Cc: Fenghua Yu Signed-off-by: John Stultz --- kernel/time/timekeeping.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 0454bfa..cedafa0 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -516,6 +516,7 @@ static void halt_fast_timekeeper(struct timekeeper *tk) } #ifdef CONFIG_GENERIC_TIME_VSYSCALL_OLD +#warning Please contact your maintainers, as GENERIC_TIME_VSYSCALL_OLD compatibity will disappear soon. static inline void update_vsyscall(struct timekeeper *tk) { -- 2.7.4
[PATCH v3 05/11] x86/mm: Track the TLB's tlb_gen and update the flushing algorithm
There are two kernel features that would benefit from tracking how up-to-date each CPU's TLB is in the case where IPIs aren't keeping it up to date in real time: - Lazy mm switching currently works by switching to init_mm when it would otherwise flush. This is wasteful: there isn't fundamentally any need to update CR3 at all when going lazy or when returning from lazy mode, nor is there any need to receive flush IPIs at all. Instead, we should just stop trying to keep the TLB coherent when we go lazy and, when unlazying, check whether we missed any flushes. - PCID will let us keep recent user contexts alive in the TLB. If we start doing this, we need a way to decide whether those contexts are up to date. On some paravirt systems, remote TLBs can be flushed without IPIs. This won't update the target CPUs' tlb_gens, which may cause unnecessary local flushes later on. We can address this if it becomes a problem by carefully updating the target CPU's tlb_gen directly. By itself, this patch is a very minor optimization that avoids unnecessary flushes when multiple TLB flushes targeting the same CPU race. Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/tlbflush.h | 37 +++ arch/x86/mm/tlb.c | 79 + 2 files changed, 109 insertions(+), 7 deletions(-) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 1eb946c0507e..4f6c30d6ec39 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -82,6 +82,11 @@ static inline u64 bump_mm_tlb_gen(struct mm_struct *mm) #define __flush_tlb_single(addr) __native_flush_tlb_single(addr) #endif +struct tlb_context { + u64 ctx_id; + u64 tlb_gen; +}; + struct tlb_state { /* * cpu_tlbstate.loaded_mm should match CR3 whenever interrupts @@ -97,6 +102,21 @@ struct tlb_state { * disabling interrupts when modifying either one. */ unsigned long cr4; + + /* +* This is a list of all contexts that might exist in the TLB. +* Since we don't yet use PCID, there is only one context. 
+* +* For each context, ctx_id indicates which mm the TLB's user +* entries came from. As an invariant, the TLB will never +* contain entries that are out-of-date as when that mm reached +* the tlb_gen in the list. +* +* To be clear, this means that it's legal for the TLB code to +* flush the TLB without updating tlb_gen. This can happen +* (for now, at least) due to paravirt remote flushes. +*/ + struct tlb_context ctxs[1]; }; DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate); @@ -248,9 +268,26 @@ static inline void __flush_tlb_one(unsigned long addr) * and page-granular flushes are available only on i486 and up. */ struct flush_tlb_info { + /* +* We support several kinds of flushes. +* +* - Fully flush a single mm. flush_mm will be set, flush_end will be +* TLB_FLUSH_ALL, and new_tlb_gen will be the tlb_gen to which the +* IPI sender is trying to catch us up. +* +* - Partially flush a single mm. flush_mm will be set, flush_start +* and flush_end will indicate the range, and new_tlb_gen will be +* set such that the changes between generation new_tlb_gen-1 and +* new_tlb_gen are entirely contained in the indicated range. +* +* - Fully flush all mms whose tlb_gens have been updated. flush_mm +* will be NULL, flush_end will be TLB_FLUSH_ALL, and new_tlb_gen +* will be zero. 
+*/ struct mm_struct *mm; unsigned long start; unsigned long end; + u64 new_tlb_gen; }; #define local_flush_tlb() __flush_tlb() diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 6d9d37323a43..9f5ef7a5e74a 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -105,6 +105,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, } this_cpu_write(cpu_tlbstate.loaded_mm, next); + this_cpu_write(cpu_tlbstate.ctxs[0].ctx_id, next->context.ctx_id); + this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, + atomic64_read(&next->context.tlb_gen)); WARN_ON_ONCE(cpumask_test_cpu(cpu, mm_cpumask(next))); cpumask_set_cpu(cpu, mm_cpumask(next)); @@ -194,20 +197,73 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, static void flush_tlb_func_common(const struct flush_tlb_info *f, bool local, enum tlb_flush_reason reason) { + struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm); + + /* +* Our memory ordering requirement is that any TLB fills that +* happen after we flush the TLB are ordered after we read +* active_mm's tlb_gen.
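The generation-counter bookkeeping described above can be modelled in a few lines of userspace C. This is a sketch, not kernel code: the struct and function names mirror the patch (`bump_mm_tlb_gen`, the tlb_gen check in `flush_tlb_func_common()`), but the types and the `flush_if_stale` helper are illustrative stand-ins. It shows how the second of two racing flush IPIs becomes a no-op once the first has caught the CPU up.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model: one (ctx_id, tlb_gen) pair per mm, and a per-CPU
 * cached copy as kept in cpu_tlbstate.ctxs[]. */
struct mm_model { uint64_t ctx_id; uint64_t tlb_gen; };
struct cpu_ctx  { uint64_t ctx_id; uint64_t tlb_gen; };

/* A munmap/mprotect-style change bumps the mm-wide generation and
 * returns the value the IPI sender will ask remote CPUs to reach. */
uint64_t bump_mm_tlb_gen(struct mm_model *mm)
{
    return ++mm->tlb_gen;
}

/* Core idea of the reworked flush_tlb_func_common(): flush only if this
 * CPU's cached generation is behind new_tlb_gen; otherwise a racing
 * flush already covered us.  Returns 1 if a (modelled) flush happened. */
int flush_if_stale(struct cpu_ctx *ctx, uint64_t new_tlb_gen)
{
    if (ctx->tlb_gen >= new_tlb_gen)
        return 0;                   /* already up to date: skip */
    ctx->tlb_gen = new_tlb_gen;     /* model the flush + bookkeeping */
    return 1;
}
```

Two flushes that race toward the same CPU coalesce: whichever IPI arrives first does the work, and the other observes an up-to-date generation and returns early, which is exactly the "very minor optimization" the changelog claims.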
[PATCH v3 03/11] x86/mm: Remove reset_lazy_tlbstate()
The only call site also calls idle_task_exit(), and idle_task_exit() puts us into a clean state by explicitly switching to init_mm. Reviewed-by: Rik van Riel Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/tlbflush.h | 8 arch/x86/kernel/smpboot.c | 1 - 2 files changed, 9 deletions(-) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 5f78c6a77578..50ea3482e1d1 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -259,14 +259,6 @@ void native_flush_tlb_others(const struct cpumask *cpumask, #define TLBSTATE_OK 1 #define TLBSTATE_LAZY 2 -static inline void reset_lazy_tlbstate(void) -{ - this_cpu_write(cpu_tlbstate.state, 0); - this_cpu_write(cpu_tlbstate.loaded_mm, &init_mm); - - WARN_ON(read_cr3_pa() != __pa_symbol(swapper_pg_dir)); -} - static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch, struct mm_struct *mm) { diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index f04479a8f74f..6169a56aab49 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1589,7 +1589,6 @@ void native_cpu_die(unsigned int cpu) void play_dead_common(void) { idle_task_exit(); - reset_lazy_tlbstate(); /* Ack it */ (void)cpu_report_death(); -- 2.9.4
[PATCH v3 09/11] x86/mm: Add nopcid to turn off PCID
The parameter is only present on x86_64 systems to save a few bytes, as PCID is always disabled on x86_32. Signed-off-by: Andy Lutomirski --- Documentation/admin-guide/kernel-parameters.txt | 2 ++ arch/x86/kernel/cpu/common.c | 18 ++ 2 files changed, 20 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 0f5c3b4347c6..aa385109ae58 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -2648,6 +2648,8 @@ nopat [X86] Disable PAT (page attribute table extension of pagetables) support. + nopcid [X86-64] Disable the PCID cpu feature. + norandmaps Don't use address space randomization. Equivalent to echo 0 > /proc/sys/kernel/randomize_va_space diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index c8b39870f33e..904485e7b230 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -168,6 +168,24 @@ static int __init x86_mpx_setup(char *s) } __setup("nompx", x86_mpx_setup); +#ifdef CONFIG_X86_64 +static int __init x86_pcid_setup(char *s) +{ + /* require an exact match without trailing characters */ + if (strlen(s)) + return 0; + + /* do not emit a message if the feature is not present */ + if (!boot_cpu_has(X86_FEATURE_PCID)) + return 1; + + setup_clear_cpu_cap(X86_FEATURE_PCID); + pr_info("nopcid: PCID feature disabled\n"); + return 1; +} +__setup("nopcid", x86_pcid_setup); +#endif + static int __init x86_noinvpcid_setup(char *s) { /* noinvpcid doesn't accept parameters */ -- 2.9.4
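The argument handling in the hunk above is subtle: `__setup()` handlers receive whatever text follows the parameter name, so an exact `nopcid` sees an empty string, and returning 0 means "option not consumed." A hedged userspace model of that logic (the `feature_pcid` flag stands in for the `X86_FEATURE_PCID` capability bit; the function name is the patch's, the rest is illustrative):

```c
#include <string.h>

/* Model of the nopcid __setup() handler.  feature_pcid models the
 * X86_FEATURE_PCID cap bit that setup_clear_cpu_cap() would clear. */
static int feature_pcid = 1;

int x86_pcid_setup_model(const char *s)
{
    if (strlen(s))          /* "nopcidfoo" etc.: not our parameter */
        return 0;
    if (!feature_pcid)      /* feature absent: nothing to clear */
        return 1;
    feature_pcid = 0;       /* models setup_clear_cpu_cap() */
    return 1;
}
```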
[PATCH v3 11/11] x86/mm: Try to preserve old TLB entries using PCID
PCID is a "process context ID" -- it's what other architectures call an address space ID. Every non-global TLB entry is tagged with a PCID, only TLB entries that match the currently selected PCID are used, and we can switch PGDs without flushing the TLB. x86's PCID is 12 bits. This is an unorthodox approach to using PCID. x86's PCID is far too short to uniquely identify a process, and we can't even really uniquely identify a running process because there are monster systems with over 4096 CPUs. To make matters worse, past attempts to use all 12 PCID bits have resulted in slowdowns instead of speedups. This patch uses PCID differently. We use a PCID to identify a recently-used mm on a per-cpu basis. An mm has no fixed PCID binding at all; instead, we give it a fresh PCID each time it's loaded except in cases where we want to preserve the TLB, in which case we reuse a recent value. This seems to save about 100ns on context switches between mms. Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/mmu_context.h | 3 ++ arch/x86/include/asm/processor-flags.h | 2 + arch/x86/include/asm/tlbflush.h | 18 +++- arch/x86/mm/init.c | 1 + arch/x86/mm/tlb.c | 82 ++ 5 files changed, 86 insertions(+), 20 deletions(-) diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index 69a4f1ee86ac..2537ec03c9b7 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -299,6 +299,9 @@ static inline unsigned long __get_current_cr3_fast(void) { unsigned long cr3 = __pa(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd); + if (static_cpu_has(X86_FEATURE_PCID)) + cr3 |= this_cpu_read(cpu_tlbstate.loaded_mm_asid); + /* For now, be very restrictive about when this can be called.
*/ VM_WARN_ON(in_nmi() || !in_atomic()); diff --git a/arch/x86/include/asm/processor-flags.h b/arch/x86/include/asm/processor-flags.h index 79aa2f98398d..791b60199aa4 100644 --- a/arch/x86/include/asm/processor-flags.h +++ b/arch/x86/include/asm/processor-flags.h @@ -35,6 +35,7 @@ /* Mask off the address space ID bits. */ #define CR3_ADDR_MASK 0x7FFFFFFFFFFFF000ull #define CR3_PCID_MASK 0xFFFull +#define CR3_NOFLUSH (1UL << 63) #else /* * CR3_ADDR_MASK needs at least bits 31:5 set on PAE systems, and we save @@ -42,6 +43,7 @@ */ #define CR3_ADDR_MASK 0xFFFFFFFFull #define CR3_PCID_MASK 0ull +#define CR3_NOFLUSH 0 #endif #endif /* _ASM_X86_PROCESSOR_FLAGS_H */ diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 57b305e13c4c..a9a5aa6f45f7 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -82,6 +82,12 @@ static inline u64 bump_mm_tlb_gen(struct mm_struct *mm) #define __flush_tlb_single(addr) __native_flush_tlb_single(addr) #endif +/* + * 6 because 6 should be plenty and struct tlb_state will fit in + * two cache lines. + */ +#define NR_DYNAMIC_ASIDS 6 + struct tlb_context { u64 ctx_id; u64 tlb_gen; @@ -95,6 +101,8 @@ struct tlb_state { * mode even if we've already switched back to swapper_pg_dir. */ struct mm_struct *loaded_mm; + u16 loaded_mm_asid; + u16 next_asid; /* * Access to this CR4 shadow and to H/W CR4 is protected by @@ -104,7 +112,8 @@ struct tlb_state { /* * This is a list of all contexts that might exist in the TLB. -* Since we don't yet use PCID, there is only one context. +* There is one per ASID that we use, and the ASID (what the +* CPU calls PCID) is the index into ctxts. * * For each context, ctx_id indicates which mm the TLB's user * entries came from. As an invariant, the TLB will never @@ -114,8 +123,13 @@ struct tlb_state { * To be clear, this means that it's legal for the TLB code to * flush the TLB without updating tlb_gen. This can happen * (for now, at least) due to paravirt remote flushes.
+* +* NB: context 0 is a bit special, since it's also used by +* various bits of init code. This is fine -- code that +* isn't aware of PCID will end up harmlessly flushing +* context 0. */ - struct tlb_context ctxs[1]; + struct tlb_context ctxs[NR_DYNAMIC_ASIDS]; }; DECLARE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate); diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index 7d6fa4676af9..9c9570d300ba 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -812,6 +812,7 @@ void __init zone_sizes_init(void) DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = { .loaded_mm = &init_mm, + .next_asid = 1, .cr4 = ~0UL, /* fail hard if we screw up cr4 shadow initialization */ }; EXPORT_SYMBOL_GPL(cpu_tlbstate); diff --git a/arch/x86/mm/tlb.c
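The quoted message is cut off before the tlb.c changes, but the per-cpu ASID-selection idea the changelog describes ("give it a fresh PCID each time it's loaded except ... we reuse a recent value") can be sketched in userspace C. This is a hedged model: field names and `NR_DYNAMIC_ASIDS` follow the patch, `choose_asid` is an illustrative stand-in for the real helper (which additionally consults `tlb_gen`), and real mm ctx_ids never collide with the zero-initialized slots here by construction.

```c
#include <assert.h>
#include <stdint.h>

#define NR_DYNAMIC_ASIDS 6

struct tlb_context_model { uint64_t ctx_id; uint64_t tlb_gen; };

struct cpu_state {
    struct tlb_context_model ctxs[NR_DYNAMIC_ASIDS];
    uint16_t next_asid;
};

/* Reuse an ASID whose cached ctx_id matches the incoming mm, keeping
 * its TLB entries (no flush); otherwise steal the round-robin
 * next_asid slot, which costs a flushing CR3 write. */
uint16_t choose_asid(struct cpu_state *cpu, uint64_t next_ctx_id,
                     int *need_flush)
{
    for (uint16_t asid = 0; asid < NR_DYNAMIC_ASIDS; asid++) {
        if (cpu->ctxs[asid].ctx_id == next_ctx_id) {
            *need_flush = 0;        /* cached context: entries survive */
            return asid;
        }
    }
    uint16_t asid = cpu->next_asid;
    cpu->next_asid = (uint16_t)((asid + 1) % NR_DYNAMIC_ASIDS);
    cpu->ctxs[asid].ctx_id = next_ctx_id;
    *need_flush = 1;                /* fresh ASID: must flush */
    return asid;
}
```

Switching back to a recently-used mm therefore avoids both the flush and the ~100ns cost the changelog measures, as long as the mm is still among the last `NR_DYNAMIC_ASIDS` contexts loaded on that CPU.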
Re: [GIT PULL][PATCH 0/4] Timekeeping items for 4.13
On Tue, Jun 20, 2017 at 10:21 PM, John Stultz wrote: > Just a small set of changes, the biggest changes being the > MONOTONIC_RAW handling cleanup, and a new kselftest from > Miroslav. Also a clear warning deprecating > CONFIG_GENERIC_TIME_VSYSCALL_OLD, which affects ppc and ia64. I forgot to note (though it can be intuited from the pull request) that while these are targeted for tip/timers/core, this patchset is based on the current tip/timers/urgent, as it depends on patches there. thanks -john
[PATCH v3 10/11] x86/mm: Enable CR4.PCIDE on supported systems
We can use PCID if the CPU has PCID and PGE and we're not on Xen. By itself, this has no effect. The next patch will start using PCID. Cc: Juergen Gross Cc: Boris Ostrovsky Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/tlbflush.h | 8 arch/x86/kernel/cpu/common.c | 15 +++ arch/x86/xen/enlighten_pv.c | 6 ++ 3 files changed, 29 insertions(+) diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 87b13e51e867..57b305e13c4c 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -243,6 +243,14 @@ static inline void __flush_tlb_all(void) __flush_tlb_global(); else __flush_tlb(); + + /* +* Note: if we somehow had PCID but not PGE, then this wouldn't work -- +* we'd end up flushing kernel translations for the current ASID but +* we might fail to flush kernel translations for other cached ASIDs. +* +* To avoid this issue, we force PCID off if PGE is off. +*/ } static inline void __flush_tlb_one(unsigned long addr) diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c index 904485e7b230..01caf66b270f 100644 --- a/arch/x86/kernel/cpu/common.c +++ b/arch/x86/kernel/cpu/common.c @@ -1143,6 +1143,21 @@ static void identify_cpu(struct cpuinfo_x86 *c) setup_smep(c); setup_smap(c); + /* Set up PCID */ + if (cpu_has(c, X86_FEATURE_PCID)) { + if (cpu_has(c, X86_FEATURE_PGE)) { + cr4_set_bits(X86_CR4_PCIDE); + } else { + /* +* flush_tlb_all(), as currently implemented, won't +* work if PCID is on but PGE is not. Since that +* combination doesn't exist on real hardware, there's +* no reason to try to fully support it. +*/ + clear_cpu_cap(c, X86_FEATURE_PCID); + } + } + /* * The vendor-specific functions might have changed features. * Now we do "generic changes."
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c index f33eef4ebd12..a136aac543c3 100644 --- a/arch/x86/xen/enlighten_pv.c +++ b/arch/x86/xen/enlighten_pv.c @@ -295,6 +295,12 @@ static void __init xen_init_capabilities(void) setup_clear_cpu_cap(X86_FEATURE_ACC); setup_clear_cpu_cap(X86_FEATURE_X2APIC); + /* +* Xen PV would need some work to support PCID: CR3 handling as well +* as xen_flush_tlb_others() would need updating. +*/ + setup_clear_cpu_cap(X86_FEATURE_PCID); + if (!xen_initial_domain()) setup_clear_cpu_cap(X86_FEATURE_ACPI); -- 2.9.4
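The identify_cpu() hunk above encodes a small decision table: enable CR4.PCIDE only when both PCID and PGE are present, and force PCID off otherwise so that `__flush_tlb_all()` (which relies on toggling global pages) stays correct. A minimal userspace model of that policy, with illustrative flags rather than the real cpufeature/CR4 bit encodings:

```c
/* Model of the PCID gating policy in identify_cpu().  The flags are
 * illustrative stand-ins for X86_FEATURE_PCID, X86_FEATURE_PGE and
 * CR4.PCIDE; the branch structure mirrors the patch. */
struct cpu_feats { int pcid; int pge; int cr4_pcide; };

void setup_pcid_model(struct cpu_feats *c)
{
    if (c->pcid) {
        if (c->pge)
            c->cr4_pcide = 1;   /* safe: __flush_tlb_all() can flush */
        else
            c->pcid = 0;        /* models clear_cpu_cap(): PCID off */
    }
}
```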
[PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking
x86's lazy TLB mode used to be fairly weak -- it would switch to init_mm the first time it tried to flush a lazy TLB. This meant an unnecessary CR3 write and, if the flush was remote, an unnecessary IPI. Rewrite it entirely. When we enter lazy mode, we simply remove the cpu from mm_cpumask. This means that we need a way to figure out whether we've missed a flush when we switch back out of lazy mode. I use the tlb_gen machinery to track whether a context is up to date. Note to reviewers: this patch, by itself, looks a bit odd. I'm using an array of length 1 containing (ctx_id, tlb_gen) rather than just storing tlb_gen, and making it an array isn't necessary yet. I'm doing this because the next few patches add PCID support, and, with PCID, we need ctx_id, and the array will end up with a length greater than 1. Making it an array now means that there will be less churn and therefore less stress on your eyeballs. NB: This is dubious but, AFAICT, still correct on Xen and UV. xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this patch changes the way that mm_cpumask() works. This should be okay, since Xen *also* iterates all online CPUs to find all the CPUs it needs to twiddle. The UV tlbflush code is rather dated and should be changed.
Cc: Andrew Banman Cc: Mike Travis Cc: Dimitri Sivanich Cc: Juergen Gross Cc: Boris Ostrovsky Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/mmu_context.h | 6 +- arch/x86/include/asm/tlbflush.h | 4 - arch/x86/mm/init.c | 1 - arch/x86/mm/tlb.c | 227 +++-- arch/x86/xen/mmu_pv.c | 3 +- 5 files changed, 119 insertions(+), 122 deletions(-) diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index e5295d485899..69a4f1ee86ac 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -125,8 +125,10 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next) static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) { - if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) - this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY); + int cpu = smp_processor_id(); + + if (cpumask_test_cpu(cpu, mm_cpumask(mm))) + cpumask_clear_cpu(cpu, mm_cpumask(mm)); } extern atomic64_t last_mm_ctx_id; diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 4f6c30d6ec39..87b13e51e867 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -95,7 +95,6 @@ struct tlb_state { * mode even if we've already switched back to swapper_pg_dir.
*/ struct mm_struct *loaded_mm; - int state; /* * Access to this CR4 shadow and to H/W CR4 is protected by @@ -310,9 +309,6 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a) void native_flush_tlb_others(const struct cpumask *cpumask, const struct flush_tlb_info *info); -#define TLBSTATE_OK 1 -#define TLBSTATE_LAZY 2 - static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch, struct mm_struct *mm) { diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index 88ee942cb47d..7d6fa4676af9 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -812,7 +812,6 @@ void __init zone_sizes_init(void) DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = { .loaded_mm = &init_mm, - .state = 0, .cr4 = ~0UL, /* fail hard if we screw up cr4 shadow initialization */ }; EXPORT_SYMBOL_GPL(cpu_tlbstate); diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 9f5ef7a5e74a..fea2b07ac7d8 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -45,8 +45,8 @@ void leave_mm(int cpu) if (loaded_mm == &init_mm) return; - if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) - BUG(); + /* Warn if we're not lazy. */ + WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))); switch_mm(NULL, &init_mm, NULL); } @@ -67,133 +67,118 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, { unsigned cpu = smp_processor_id(); struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm); + u64 next_tlb_gen; /* -* NB: The scheduler will call us with prev == next when -* switching from lazy TLB mode to normal mode if active_mm -* isn't changing. When this happens, there is no guarantee -* that CR3 (and hence cpu_tlbstate.loaded_mm) matches next. +* NB: The scheduler will call us with prev == next when switching +* from lazy TLB mode to normal mode if active_mm
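The message is truncated, but the reworked lazy-mode protocol it describes can be sketched compactly: going lazy just means dropping out of `mm_cpumask` (so flush IPIs stop arriving), and unlazying means rejoining the mask and comparing the CPU's saved `tlb_gen` against the mm's to see whether a flush was missed. This is a userspace model under stated assumptions -- a plain 64-bit mask stands in for `mm_cpumask()`, and the helper names are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace model of the reworked lazy TLB mode: cpumask is a plain
 * bitmask standing in for mm_cpumask(), tlb_gen is the mm-wide flush
 * generation from the earlier patches in the series. */
struct mm_model { uint64_t cpumask; uint64_t tlb_gen; };

void enter_lazy(struct mm_model *mm, int cpu)
{
    mm->cpumask &= ~(1ull << cpu);   /* stop receiving flush IPIs */
}

/* On switching back: rejoin the mask, then flush only if the mm's
 * generation moved while we weren't listening.  Returns 1 on flush. */
int unlazy_needs_flush(struct mm_model *mm, int cpu, uint64_t *local_gen)
{
    mm->cpumask |= 1ull << cpu;      /* resume receiving IPIs */
    if (*local_gen == mm->tlb_gen)
        return 0;                    /* missed nothing while lazy */
    *local_gen = mm->tlb_gen;        /* catch up and (model a) flush */
    return 1;
}
```

This captures why the rework removes the TLBSTATE_OK/TLBSTATE_LAZY state machine: the cpumask membership plus the generation comparison carry all the information the old state flag encoded, without the forced CR3 write to init_mm.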
[PATCH v3 06/11] x86/mm: Rework lazy TLB mode and TLB freshness tracking
x86's lazy TLB mode used to be fairly weak -- it would switch to init_mm the first time it tried to flush a lazy TLB. This meant an unnecessary CR3 write and, if the flush was remote, an unnecessary IPI. Rewrite it entirely. When we enter lazy mode, we simply remove the cpu from mm_cpumask. This means that we need a way to figure out whether we've missed a flush when we switch back out of lazy mode. I use the tlb_gen machinery to track whether a context is up to date. Note to reviewers: this patch, by itself, looks a bit odd. I'm using an array of length 1 containing (ctx_id, tlb_gen) rather than just storing tlb_gen, and making it an array isn't necessary yet. I'm doing this because the next few patches add PCID support, and, with PCID, we need ctx_id, and the array will end up with a length greater than 1. Making it an array now means that there will be less churn and therefore less stress on your eyeballs. NB: This is dubious but, AFAICT, still correct on Xen and UV. xen_exit_mmap() uses mm_cpumask() for nefarious purposes and this patch changes the way that mm_cpumask() works. This should be okay, since Xen *also* iterates all online CPUs to find all the CPUs it needs to twiddle. The UV tlbflush code is rather dated and should be changed.
Cc: Andrew Banman Cc: Mike Travis Cc: Dimitri Sivanich Cc: Juergen Gross Cc: Boris Ostrovsky Signed-off-by: Andy Lutomirski --- arch/x86/include/asm/mmu_context.h | 6 +- arch/x86/include/asm/tlbflush.h| 4 - arch/x86/mm/init.c | 1 - arch/x86/mm/tlb.c | 227 +++-- arch/x86/xen/mmu_pv.c | 3 +- 5 files changed, 119 insertions(+), 122 deletions(-) diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h index e5295d485899..69a4f1ee86ac 100644 --- a/arch/x86/include/asm/mmu_context.h +++ b/arch/x86/include/asm/mmu_context.h @@ -125,8 +125,10 @@ static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next) static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk) { - if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) - this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY); + int cpu = smp_processor_id(); + + if (cpumask_test_cpu(cpu, mm_cpumask(mm))) + cpumask_clear_cpu(cpu, mm_cpumask(mm)); } extern atomic64_t last_mm_ctx_id; diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h index 4f6c30d6ec39..87b13e51e867 100644 --- a/arch/x86/include/asm/tlbflush.h +++ b/arch/x86/include/asm/tlbflush.h @@ -95,7 +95,6 @@ struct tlb_state { * mode even if we've already switched back to swapper_pg_dir. 
*/ struct mm_struct *loaded_mm; - int state; /* * Access to this CR4 shadow and to H/W CR4 is protected by @@ -310,9 +309,6 @@ static inline void flush_tlb_page(struct vm_area_struct *vma, unsigned long a) void native_flush_tlb_others(const struct cpumask *cpumask, const struct flush_tlb_info *info); -#define TLBSTATE_OK 1 -#define TLBSTATE_LAZY 2 - static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch, struct mm_struct *mm) { diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c index 88ee942cb47d..7d6fa4676af9 100644 --- a/arch/x86/mm/init.c +++ b/arch/x86/mm/init.c @@ -812,7 +812,6 @@ void __init zone_sizes_init(void) DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = { .loaded_mm = &init_mm, - .state = 0, .cr4 = ~0UL,/* fail hard if we screw up cr4 shadow initialization */ }; EXPORT_SYMBOL_GPL(cpu_tlbstate); diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 9f5ef7a5e74a..fea2b07ac7d8 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -45,8 +45,8 @@ void leave_mm(int cpu) if (loaded_mm == &init_mm) return; - if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) - BUG(); + /* Warn if we're not lazy. */ + WARN_ON(cpumask_test_cpu(smp_processor_id(), mm_cpumask(loaded_mm))); switch_mm(NULL, &init_mm, NULL); } @@ -67,133 +67,118 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, { unsigned cpu = smp_processor_id(); struct mm_struct *real_prev = this_cpu_read(cpu_tlbstate.loaded_mm); + u64 next_tlb_gen; /* -* NB: The scheduler will call us with prev == next when switching +* from lazy TLB mode to normal mode if active_mm isn't changing. +* When this happens, we don't assume that CR3 (and hence +*
[PATCH v3 01/11] x86/mm: Don't reenter flush_tlb_func_common()
It was historically possible to have two concurrent TLB flushes targeting the same CPU: one initiated locally and one initiated remotely. This can now cause an OOPS in leave_mm() at arch/x86/mm/tlb.c:47: if (this_cpu_read(cpu_tlbstate.state) == TLBSTATE_OK) BUG(); with this call trace: flush_tlb_func_local arch/x86/mm/tlb.c:239 [inline] flush_tlb_mm_range+0x26d/0x370 arch/x86/mm/tlb.c:317 Without reentrancy, this OOPS is impossible: leave_mm() is only called if we're not in TLBSTATE_OK, but then we're unexpectedly in TLBSTATE_OK in leave_mm(). This can be caused by flush_tlb_func_remote() happening between the two checks and calling leave_mm(), resulting in two consecutive leave_mm() calls on the same CPU with no intervening switch_mm() calls. We never saw this OOPS before because the old leave_mm() implementation didn't put us back in TLBSTATE_OK, so the assertion didn't fire. Nadav noticed the reentrancy issue in a different context, but neither of us realized that it caused a problem yet. Cc: Nadav Amit Cc: Dave Hansen Reported-by: "Levin, Alexander (Sasha Levin)" Fixes: 3d28ebceaffa ("x86/mm: Rework lazy TLB to track the actual loaded mm") Signed-off-by: Andy Lutomirski --- arch/x86/mm/tlb.c | 15 +-- 1 file changed, 13 insertions(+), 2 deletions(-) diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index 2a5e851f2035..f06239c6919f 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -208,6 +208,9 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, static void flush_tlb_func_common(const struct flush_tlb_info *f, bool local, enum tlb_flush_reason reason) { + /* This code cannot presently handle being reentered. 
*/ + VM_WARN_ON(!irqs_disabled()); + if (this_cpu_read(cpu_tlbstate.state) != TLBSTATE_OK) { leave_mm(smp_processor_id()); return; @@ -313,8 +316,12 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start, info.end = TLB_FLUSH_ALL; } - if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) + if (mm == this_cpu_read(cpu_tlbstate.loaded_mm)) { + local_irq_disable(); flush_tlb_func_local(&info, TLB_LOCAL_MM_SHOOTDOWN); + local_irq_enable(); + } + if (cpumask_any_but(mm_cpumask(mm), cpu) < nr_cpu_ids) flush_tlb_others(mm_cpumask(mm), &info); put_cpu(); @@ -370,8 +377,12 @@ void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch) int cpu = get_cpu(); - if (cpumask_test_cpu(cpu, &batch->cpumask)) + if (cpumask_test_cpu(cpu, &batch->cpumask)) { + local_irq_disable(); flush_tlb_func_local(&info, TLB_LOCAL_SHOOTDOWN); + local_irq_enable(); + } + if (cpumask_any_but(&batch->cpumask, cpu) < nr_cpu_ids) flush_tlb_others(&batch->cpumask, &info); cpumask_clear(&batch->cpumask); -- 2.9.4
[PATCH 3/4] kselftests: timers: Fix inconsistency-check to not ignore first timestamp
From: Miroslav Lichvar When the first timestamp in the list of clock readings was later than the second timestamp and all other timestamps were in order, the inconsistency was not reported because the index of the out-of-order timestamp was equal to the default value. Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Miroslav Lichvar Cc: Richard Cochran Cc: Prarit Bhargava Cc: Stephen Boyd Cc: Shuah Khan Signed-off-by: Miroslav Lichvar Signed-off-by: John Stultz --- tools/testing/selftests/timers/inconsistency-check.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/tools/testing/selftests/timers/inconsistency-check.c b/tools/testing/selftests/timers/inconsistency-check.c index caf1bc9..74c60e8 100644 --- a/tools/testing/selftests/timers/inconsistency-check.c +++ b/tools/testing/selftests/timers/inconsistency-check.c @@ -118,7 +118,7 @@ int consistency_test(int clock_type, unsigned long seconds) start_str = ctime(); while (seconds == -1 || now - then < seconds) { - inconsistent = 0; + inconsistent = -1; /* Fill list */ for (i = 0; i < CALLS_PER_LOOP; i++) @@ -130,7 +130,7 @@ int consistency_test(int clock_type, unsigned long seconds) inconsistent = i; /* display inconsistency */ - if (inconsistent >= 0) { unsigned long long delta; printf("%s\n", start_str); -- 2.7.4
[PATCH v3 07/11] x86/mm: Stop calling leave_mm() in idle code
Now that lazy TLB suppresses all flush IPIs (as opposed to all but the first), there's no need to leave_mm() when going idle. This means we can get rid of the rcuidle hack in switch_mm_irqs_off() and we can unexport leave_mm(). This also removes acpi_unlazy_tlb() from the x86 and ia64 headers, since it has no callers any more. Signed-off-by: Andy Lutomirski --- arch/ia64/include/asm/acpi.h | 2 -- arch/x86/include/asm/acpi.h | 2 -- arch/x86/mm/tlb.c | 19 +++ drivers/acpi/processor_idle.c | 2 -- drivers/idle/intel_idle.c | 9 - 5 files changed, 7 insertions(+), 27 deletions(-) diff --git a/arch/ia64/include/asm/acpi.h b/arch/ia64/include/asm/acpi.h index a3d0211970e9..c86a947f5368 100644 --- a/arch/ia64/include/asm/acpi.h +++ b/arch/ia64/include/asm/acpi.h @@ -112,8 +112,6 @@ static inline void arch_acpi_set_pdc_bits(u32 *buf) buf[2] |= ACPI_PDC_EST_CAPABILITY_SMP; } -#define acpi_unlazy_tlb(x) - #ifdef CONFIG_ACPI_NUMA extern cpumask_t early_cpu_possible_map; #define for_each_possible_early_cpu(cpu) \ diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h index 2efc768e4362..562286fa151f 100644 --- a/arch/x86/include/asm/acpi.h +++ b/arch/x86/include/asm/acpi.h @@ -150,8 +150,6 @@ static inline void disable_acpi(void) { } extern int x86_acpi_numa_init(void); #endif /* CONFIG_ACPI_NUMA */ -#define acpi_unlazy_tlb(x) leave_mm(x) - #ifdef CONFIG_ACPI_APEI static inline pgprot_t arch_apei_get_mem_attribute(phys_addr_t addr) { diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c index fea2b07ac7d8..5f932fd80881 100644 --- a/arch/x86/mm/tlb.c +++ b/arch/x86/mm/tlb.c @@ -50,7 +50,6 @@ void leave_mm(int cpu) switch_mm(NULL, &init_mm, NULL); } -EXPORT_SYMBOL_GPL(leave_mm); void switch_mm(struct mm_struct *prev, struct mm_struct *next, struct task_struct *tsk) @@ -113,14 +112,8 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, this_cpu_write(cpu_tlbstate.ctxs[0].tlb_gen, next_tlb_gen); write_cr3(__pa(next->pgd)); - /* -* This gets called via 
leave_mm() in the idle path -* where RCU functions differently. Tracing normally -* uses RCU, so we have to call the tracepoint -* specially here. -*/ - trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, - TLB_FLUSH_ALL); + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, + TLB_FLUSH_ALL); } /* @@ -166,13 +159,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next, this_cpu_write(cpu_tlbstate.loaded_mm, next); write_cr3(__pa(next->pgd)); - /* -* This gets called via leave_mm() in the idle path where RCU -* functions differently. Tracing normally uses RCU, so we -* have to call the tracepoint specially here. -*/ - trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, - TLB_FLUSH_ALL); + trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL); } load_mm_cr4(next); diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c index 5c8aa9cf62d7..fe3d2a40f311 100644 --- a/drivers/acpi/processor_idle.c +++ b/drivers/acpi/processor_idle.c @@ -708,8 +708,6 @@ static DEFINE_RAW_SPINLOCK(c3_lock); static void acpi_idle_enter_bm(struct acpi_processor *pr, struct acpi_processor_cx *cx, bool timer_bc) { - acpi_unlazy_tlb(smp_processor_id()); - /* * Must be done before busmaster disable as we might need to * access HPET ! diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c index 216d7ec88c0c..2ae43f59091d 100644 --- a/drivers/idle/intel_idle.c +++ b/drivers/idle/intel_idle.c @@ -912,16 +912,15 @@ static __cpuidle int intel_idle(struct cpuidle_device *dev, struct cpuidle_state *state = &drv->states[index]; unsigned long eax = flg2MWAIT(state->flags); unsigned int cstate; - int cpu = smp_processor_id(); cstate = (((eax) >> MWAIT_SUBSTATE_SIZE) & MWAIT_CSTATE_MASK) + 1; /* -* leave_mm() to avoid costly and often unnecessary wakeups -* for flushing the user TLB's associated with the active mm. +* NB: if CPUIDLE_FLAG_TLB_FLUSHED is set, this idle transition +* will probably flush the TLB. 
It's not guaranteed to flush +* the TLB, though, so it's not clear that we can do anything +* useful with this knowledge. */ - if (state->flags & CPUIDLE_FLAG_TLB_FLUSHED) -
[PATCH v3 00/11] PCID and improved laziness
There are three performance benefits here: 1. TLB flushing is slow. (I.e. the flush itself takes a while.) This avoids many of them when switching tasks by using PCID. In a stupid little benchmark I did, it saves about 100ns on my laptop per context switch. I'll try to improve that benchmark. 2. Mms that have been used recently on a given CPU might get to keep their TLB entries alive across process switches with this patch set. TLB fills are pretty fast on modern CPUs, but they're even faster when they don't happen. 3. Lazy TLB is way better. We used to do two stupid things when we ran kernel threads: we'd send IPIs to flush user contexts on their CPUs and then we'd write to CR3 for no particular reason as an excuse to stop further IPIs. With this patch, we do neither. This will, in general, perform suboptimally if paravirt TLB flushing is in use (currently just Xen, I think, but Hyper-V is in the works). The code is structured so we could fix it in one of two ways: we could take a spinlock when touching the percpu state so we can update it remotely after a paravirt flush, or we could be more careful about exactly how we access the state and use cmpxchg16b to do atomic remote updates. (On SMP systems without cmpxchg16b, we'd just skip the optimization entirely.) This is based on tip:x86/mm. The branch is here if you want to play: https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/pcid Changes from v2: - Add some Acks - Move the reentrancy issue to the beginning. (I also sent the same patch as a standalone fix -- it's just in here so that this series applies to x86/mm.) - Fix some comments. 
Changes from RFC: - flush_tlb_func_common() no longer gets reentered (Nadav) - Fix ASID corruption on unlazying (kbuild bot) - Move Xen init to the right place - Misc cleanups Andy Lutomirski (11): x86/mm: Don't reenter flush_tlb_func_common() x86/ldt: Simplify LDT switching logic x86/mm: Remove reset_lazy_tlbstate() x86/mm: Give each mm TLB flush generation a unique ID x86/mm: Track the TLB's tlb_gen and update the flushing algorithm x86/mm: Rework lazy TLB mode and TLB freshness tracking x86/mm: Stop calling leave_mm() in idle code x86/mm: Disable PCID on 32-bit kernels x86/mm: Add nopcid to turn off PCID x86/mm: Enable CR4.PCIDE on supported systems x86/mm: Try to preserve old TLB entries using PCID Documentation/admin-guide/kernel-parameters.txt | 2 + arch/ia64/include/asm/acpi.h| 2 - arch/x86/include/asm/acpi.h | 2 - arch/x86/include/asm/disabled-features.h| 4 +- arch/x86/include/asm/mmu.h | 25 +- arch/x86/include/asm/mmu_context.h | 40 ++- arch/x86/include/asm/processor-flags.h | 2 + arch/x86/include/asm/tlbflush.h | 89 +- arch/x86/kernel/cpu/bugs.c | 8 + arch/x86/kernel/cpu/common.c| 33 +++ arch/x86/kernel/smpboot.c | 1 - arch/x86/mm/init.c | 2 +- arch/x86/mm/tlb.c | 368 +++- arch/x86/xen/enlighten_pv.c | 6 + arch/x86/xen/mmu_pv.c | 3 +- drivers/acpi/processor_idle.c | 2 - drivers/idle/intel_idle.c | 9 +- 17 files changed, 430 insertions(+), 168 deletions(-) -- 2.9.4
[GIT PULL][PATCH 0/4] Timekeeping items for 4.13
Just a small set of changes, the biggest changes being the MONOTONIC_RAW handling cleanup, and a new kselftest from Miroslav. Also a clear warning deprecating CONFIG_GENERIC_TIME_VSYSCALL_OLD, which affects ppc and ia64. Let me know if you have any thoughts or objections! thanks -john Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Miroslav Lichvar Cc: Richard Cochran Cc: Prarit Bhargava Cc: Stephen Boyd Cc: Shuah Khan The following changes since commit 8e6cec1c7c5afa489687c90be15d6ed82c742975: Merge branch 'clockevents/4.12-fixes' of https://git.linaro.org/people/daniel.lezcano/linux into timers/urgent (2017-06-20 12:50:32 +0200) are available in the git repository at: https://git.linaro.org/people/john.stultz/linux.git fortglx/4.13/time for you to fetch changes up to 767392565a3e618950fe1a5ff1ba11295f6332f4: kselftests: timers: Add test for frequency step (2017-06-20 22:14:45 -0700) John Stultz (2): time: Clean up CLOCK_MONOTONIC_RAW time handling time: Add warning about imminent deprecation of CONFIG_GENERIC_TIME_VSYSCALL_OLD Miroslav Lichvar (2): kselftests: timers: Fix inconsistency-check to not ignore first timestamp kselftests: timers: Add test for frequency step arch/arm64/kernel/vdso.c | 6 +- include/linux/timekeeper_internal.h| 4 +- kernel/time/timekeeping.c | 46 ++-- tools/testing/selftests/timers/Makefile| 5 +- tools/testing/selftests/timers/freq-step.c | 268 + .../testing/selftests/timers/inconsistency-check.c | 4 +- 6 files changed, 303 insertions(+), 30 deletions(-) create mode 100644 tools/testing/selftests/timers/freq-step.c -- 2.7.4
[PATCH v3 04/11] x86/mm: Give each mm TLB flush generation a unique ID
This adds two new variables to mmu_context_t: ctx_id and tlb_gen.
ctx_id uniquely identifies the mm_struct and will never be reused.
For a given mm_struct (and hence ctx_id), tlb_gen is a monotonic
count of the number of times that a TLB flush has been requested.
The pair (ctx_id, tlb_gen) can be used as an identifier for TLB
flush actions and will be used in subsequent patches to reliably
determine whether all needed TLB flushes have occurred on a given
CPU.

This patch is split out for ease of review.  By itself, it has no
real effect other than creating and updating the new variables.

Signed-off-by: Andy Lutomirski
---
 arch/x86/include/asm/mmu.h         | 25 +++-
 arch/x86/include/asm/mmu_context.h |  5 +
 arch/x86/include/asm/tlbflush.h    | 18 ++
 arch/x86/mm/tlb.c                  |  6 --
 4 files changed, 50 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 79b647a7ebd0..bb8c597c2248 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -3,12 +3,28 @@
 #include
 #include
+#include
 
 /*
- * The x86 doesn't have a mmu context, but
- * we put the segment information here.
+ * x86 has arch-specific MMU state beyond what lives in mm_struct.
  */
 typedef struct {
+	/*
+	 * ctx_id uniquely identifies this mm_struct.  A ctx_id will never
+	 * be reused, and zero is not a valid ctx_id.
+	 */
+	u64 ctx_id;
+
+	/*
+	 * Any code that needs to do any sort of TLB flushing for this
+	 * mm will first make its changes to the page tables, then
+	 * increment tlb_gen, then flush.  This lets the low-level
+	 * flushing code keep track of what needs flushing.
+	 *
+	 * This is not used on Xen PV.
+	 */
+	atomic64_t tlb_gen;
+
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
 	struct ldt_struct *ldt;
 #endif
@@ -37,6 +53,11 @@ typedef struct {
 #endif
 } mm_context_t;
 
+#define INIT_MM_CONTEXT(mm)						\
+	.context = {							\
+		.ctx_id = 1,						\
+	}
+
 void leave_mm(int cpu);
 
 #endif /* _ASM_X86_MMU_H */
diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index ecfcb6643c9b..e5295d485899 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -129,9 +129,14 @@ static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
 		this_cpu_write(cpu_tlbstate.state, TLBSTATE_LAZY);
 }
 
+extern atomic64_t last_mm_ctx_id;
+
 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)
 {
+	mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);
+	atomic64_set(&mm->context.tlb_gen, 0);
+
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 	if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
 		/* pkey 0 is the default and always allocated */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 50ea3482e1d1..1eb946c0507e 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -57,6 +57,23 @@ static inline void invpcid_flush_all_nonglobals(void)
 	__invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL);
 }
 
+static inline u64 bump_mm_tlb_gen(struct mm_struct *mm)
+{
+	u64 new_tlb_gen;
+
+	/*
+	 * Bump the generation count.  This also serves as a full barrier
+	 * that synchronizes with switch_mm: callers are required to order
+	 * their read of mm_cpumask after their writes to the paging
+	 * structures.
+	 */
+	smp_mb__before_atomic();
+	new_tlb_gen = atomic64_inc_return(&mm->context.tlb_gen);
+	smp_mb__after_atomic();
+
+	return new_tlb_gen;
+}
+
 #ifdef CONFIG_PARAVIRT
 #include
 #else
@@ -262,6 +279,7 @@ void native_flush_tlb_others(const struct cpumask *cpumask,
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 					struct mm_struct *mm)
 {
+	bump_mm_tlb_gen(mm);
 	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
 }
 
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index fd593833a854..6d9d37323a43 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,6 +28,8 @@
  *	Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
  */
 
+atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
+
 void leave_mm(int cpu)
 {
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -286,8 +288,8 @@ void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
 
 	cpu = get_cpu();
 
-	/* Synchronize with switch_mm. */
-	smp_mb();
+	/* This is also a barrier that synchronizes with switch_mm(). */
+	bump_mm_tlb_gen(mm);
 
 	/*
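For readers skimming the patch, the bookkeeping it introduces can be modeled in ordinary C11 atomics. This is a hedged sketch, not kernel code: the names (`mm_model`, the `*_model` functions, `flushes_done`) are invented for illustration, and only the (ctx_id, tlb_gen) logic mirrors the patch.

```c
/*
 * Userspace model of the (ctx_id, tlb_gen) bookkeeping this patch
 * introduces.  All names here are invented for illustration; only the
 * logic mirrors the patch.
 */
#include <stdatomic.h>
#include <stdint.h>

static atomic_uint_fast64_t last_mm_ctx_id = 1;	/* like the kernel's */

struct mm_model {
	uint64_t ctx_id;		/* never reused, never zero */
	atomic_uint_fast64_t tlb_gen;	/* bumped before every flush */
};

static void init_new_context_model(struct mm_model *mm)
{
	/* atomic64_inc_return() returns the new value: old + 1 */
	mm->ctx_id = atomic_fetch_add(&last_mm_ctx_id, 1) + 1;
	atomic_store(&mm->tlb_gen, 0);
}

static uint64_t bump_mm_tlb_gen_model(struct mm_model *mm)
{
	/* in the kernel this RMW also acts as a full memory barrier */
	return atomic_fetch_add(&mm->tlb_gen, 1) + 1;
}

/*
 * A CPU is up to date iff the (ctx_id, tlb_gen) pair it last flushed
 * matches this mm and has reached the mm's current generation.
 */
static int flushes_done(struct mm_model *mm, uint64_t seen_ctx_id,
			uint64_t seen_gen)
{
	return seen_ctx_id == mm->ctx_id &&
	       seen_gen >= atomic_load(&mm->tlb_gen);
}
```

The point of the pair is exactly what `flushes_done()` expresses: a stale ctx_id means "different mm entirely, flush everything", and a stale tlb_gen means "same mm, but flushes were requested that this CPU has not performed yet".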
[PATCH v3 08/11] x86/mm: Disable PCID on 32-bit kernels
32-bit kernels on new hardware will see PCID in CPUID, but PCID can
only be used in 64-bit mode.  Rather than making all PCID code
conditional, just disable the feature on 32-bit builds.

Signed-off-by: Andy Lutomirski
---
 arch/x86/include/asm/disabled-features.h | 4 +++-
 arch/x86/kernel/cpu/bugs.c               | 8
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/disabled-features.h b/arch/x86/include/asm/disabled-features.h
index 5dff775af7cd..c10c9128f54e 100644
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -21,11 +21,13 @@
 # define DISABLE_K6_MTRR	(1<<(X86_FEATURE_K6_MTRR & 31))
 # define DISABLE_CYRIX_ARR	(1<<(X86_FEATURE_CYRIX_ARR & 31))
 # define DISABLE_CENTAUR_MCR	(1<<(X86_FEATURE_CENTAUR_MCR & 31))
+# define DISABLE_PCID		0
 #else
 # define DISABLE_VME		0
 # define DISABLE_K6_MTRR	0
 # define DISABLE_CYRIX_ARR	0
 # define DISABLE_CENTAUR_MCR	0
+# define DISABLE_PCID		(1<<(X86_FEATURE_PCID & 31))
 #endif /* CONFIG_X86_64 */
 
 #ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
@@ -49,7 +51,7 @@
 #define DISABLED_MASK1	0
 #define DISABLED_MASK2	0
 #define DISABLED_MASK3	(DISABLE_CYRIX_ARR|DISABLE_CENTAUR_MCR|DISABLE_K6_MTRR)
-#define DISABLED_MASK4	0
+#define DISABLED_MASK4	(DISABLE_PCID)
 #define DISABLED_MASK5	0
 #define DISABLED_MASK6	0
 #define DISABLED_MASK7	0
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 0af86d9242da..db684880d74a 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -21,6 +21,14 @@
 
 void __init check_bugs(void)
 {
+#ifdef CONFIG_X86_32
+	/*
+	 * Regardless of whether PCID is enumerated, the SDM says
+	 * that it can't be enabled in 32-bit mode.
+	 */
+	setup_clear_cpu_cap(X86_FEATURE_PCID);
+#endif
+
 	identify_boot_cpu();
 
 	if (!IS_ENABLED(CONFIG_SMP)) {
-- 
2.9.4
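For anyone puzzling over the `& 31` arithmetic in the patch: x86 feature bits are grouped into 32-bit words, DISABLED_MASK4 masks word 4, and X86_FEATURE_PCID is word 4, bit 17 (CPUID.01H:ECX.PCID). A minimal sketch of that scheme, where `feature_enabled()` and the `BUILD_32BIT` macro are invented stand-ins, not kernel API:

```c
/*
 * Sketch of the disabled-features mechanism: a feature is usable only
 * if it is enumerated by CPUID *and* not compile-time disabled in the
 * DISABLED_MASK word it belongs to.  BUILD_32BIT stands in for
 * !CONFIG_X86_64; feature_enabled() is invented for illustration.
 */
#include <stdint.h>

#define X86_FEATURE_PCID	(4 * 32 + 17)	/* word 4, bit 17 */

#ifdef BUILD_32BIT
# define DISABLE_PCID	(1u << (X86_FEATURE_PCID & 31))
#else
# define DISABLE_PCID	0
#endif

static int feature_enabled(uint32_t cpuid_word4, uint32_t disabled_mask4,
			   int feature)
{
	uint32_t bit = 1u << (feature & 31);	/* position within the word */

	return (cpuid_word4 & bit) && !(disabled_mask4 & bit);
}
```

Because the mask is a compile-time constant, `cpu_feature_enabled()` checks against it fold away entirely on 64-bit builds, which is why the patch prefers this over runtime conditionals.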
[PATCH 4/4] kselftests: timers: Add test for frequency step
From: Miroslav Lichvar

This test checks the response of the system clock to frequency steps
made with adjtimex().  The frequency error and stability of the
CLOCK_MONOTONIC clock relative to the CLOCK_MONOTONIC_RAW clock is
measured in two intervals following the step.  The test fails if
values from the second interval exceed specified limits.

Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Miroslav Lichvar
Cc: Richard Cochran
Cc: Prarit Bhargava
Cc: Stephen Boyd
Cc: Shuah Khan
Signed-off-by: Miroslav Lichvar
Signed-off-by: John Stultz
---
 tools/testing/selftests/timers/Makefile    |   5 +-
 tools/testing/selftests/timers/freq-step.c | 268 +
 2 files changed, 271 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/timers/freq-step.c

diff --git a/tools/testing/selftests/timers/Makefile b/tools/testing/selftests/timers/Makefile
index 5fa1d7e9..5801bbe 100644
--- a/tools/testing/selftests/timers/Makefile
+++ b/tools/testing/selftests/timers/Makefile
@@ -1,6 +1,6 @@
 BUILD_FLAGS = -DKTEST
 CFLAGS += -O3 -Wl,-no-as-needed -Wall $(BUILD_FLAGS)
-LDFLAGS += -lrt -lpthread
+LDFLAGS += -lrt -lpthread -lm
 
 # these are all "safe" tests that don't modify
 # system time or require escalated privileges
@@ -8,7 +8,7 @@ TEST_GEN_PROGS = posix_timers nanosleep nsleep-lat set-timer-lat mqueue-lat \
 	     inconsistency-check raw_skew threadtest rtctest
 
 TEST_GEN_PROGS_EXTENDED = alarmtimer-suspend valid-adjtimex adjtick change_skew \
-		      skew_consistency clocksource-switch leap-a-day \
+		      skew_consistency clocksource-switch freq-step leap-a-day \
 		      leapcrash set-tai set-2038 set-tz
 
 
@@ -24,6 +24,7 @@ run_destructive_tests: run_tests
 	./change_skew
 	./skew_consistency
 	./clocksource-switch
+	./freq-step
 	./leap-a-day -s -i 10
 	./leapcrash
 	./set-tz
diff --git a/tools/testing/selftests/timers/freq-step.c b/tools/testing/selftests/timers/freq-step.c
new file mode 100644
index 000..e8c6183
--- /dev/null
+++ b/tools/testing/selftests/timers/freq-step.c
@@ -0,0 +1,268 @@
+/*
+ * This test checks the response of the system clock to frequency
+ * steps made with adjtimex(). The frequency error and stability of
+ * the CLOCK_MONOTONIC clock relative to the CLOCK_MONOTONIC_RAW clock
+ * is measured in two intervals following the step. The test fails if
+ * values from the second interval exceed specified limits.
+ *
+ * Copyright (C) Miroslav Lichvar 2017
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include
+#include
+#include
+#include
+#include
+
+#include "../kselftest.h"
+
+#define SAMPLES 100
+#define SAMPLE_READINGS 10
+#define MEAN_SAMPLE_INTERVAL 0.1
+#define STEP_INTERVAL 1.0
+#define MAX_PRECISION 100e-9
+#define MAX_FREQ_ERROR 10e-6
+#define MAX_STDDEV 1000e-9
+
+struct sample {
+	double offset;
+	double time;
+};
+
+static time_t mono_raw_base;
+static time_t mono_base;
+static long user_hz;
+static double precision;
+static double mono_freq_offset;
+
+static double diff_timespec(struct timespec *ts1, struct timespec *ts2)
+{
+	return ts1->tv_sec - ts2->tv_sec +
+		(ts1->tv_nsec - ts2->tv_nsec) / 1e9;
+}
+
+static double get_sample(struct sample *sample)
+{
+	double delay, mindelay = 0.0;
+	struct timespec ts1, ts2, ts3;
+	int i;
+
+	for (i = 0; i < SAMPLE_READINGS; i++) {
+		clock_gettime(CLOCK_MONOTONIC_RAW, &ts1);
+		clock_gettime(CLOCK_MONOTONIC, &ts2);
+		clock_gettime(CLOCK_MONOTONIC_RAW, &ts3);
+
+		ts1.tv_sec -= mono_raw_base;
+		ts2.tv_sec -= mono_base;
+		ts3.tv_sec -= mono_raw_base;
+
+		delay = diff_timespec(&ts3, &ts1);
+		if (delay <= 1e-9) {
+			i--;
+			continue;
+		}
+
+		if (!i || delay < mindelay) {
+			sample->offset = diff_timespec(&ts2, &ts1);
+			sample->offset -= delay / 2.0;
+			sample->time = ts1.tv_sec + ts1.tv_nsec / 1e9;
+			mindelay = delay;
+		}
+	}
+
+	return mindelay;
+}
+
+static void reset_ntp_error(void)
+{
+	struct timex txc;
+
+	txc.modes = ADJ_SETOFFSET;
+	txc.time.tv_sec = 0;
+	txc.time.tv_usec = 0;
+
+	if (adjtimex(&txc) < 0) {
+		perror("[FAIL] adjtimex");
+		ksft_exit_fail();
+	}
+}
+
+static void set_frequency(double freq)
+{
+	struct timex txc;
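One unit detail worth calling out for readers of the test: adjtimex() expresses frequency as "scaled ppm", parts per million with a 16-bit binary fraction, so a fractional offset like the test's MAX_FREQ_ERROR (10e-6, i.e. 10 ppm) has to be multiplied by 65536e6 before it can go into `txc.freq`. The helper names below are invented for illustration, not taken from the test:

```c
/*
 * adjtimex() "scaled ppm" conversion sketch (helper names invented):
 * txc.freq = ppm * 65536, so a dimensionless frequency offset f maps
 * to f * 1e6 * 65536 = f * 65536e6.
 */

/* fractional frequency offset (e.g. 10e-6 for 10 ppm) -> txc.freq;
 * rounds to nearest for non-negative input */
static long freq_to_scaled_ppm(double freq)
{
	return (long)(freq * 65536e6 + 0.5);
}

static double scaled_ppm_to_freq(long scaled)
{
	return scaled / 65536e6;
}
```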
[PATCH 1/4] time: Clean up CLOCK_MONOTONIC_RAW time handling
Now that we fixed the sub-ns handling for CLOCK_MONOTONIC_RAW, remove
the duplicative tk->raw_time.tv_nsec, which can be stored in
tk->tkr_raw.xtime_nsec (similarly to how it's handled for monotonic
time).

Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: Miroslav Lichvar
Cc: Richard Cochran
Cc: Prarit Bhargava
Cc: Stephen Boyd
Cc: Kevin Brodsky
Cc: Will Deacon
Cc: Daniel Mentz
Tested-by: Daniel Mentz
Signed-off-by: John Stultz
---
 arch/arm64/kernel/vdso.c            |  6 ++---
 include/linux/timekeeper_internal.h |  4 ++--
 kernel/time/timekeeping.c           | 45 -
 3 files changed, 29 insertions(+), 26 deletions(-)

diff --git a/arch/arm64/kernel/vdso.c b/arch/arm64/kernel/vdso.c
index d0cb007..7492d90 100644
--- a/arch/arm64/kernel/vdso.c
+++ b/arch/arm64/kernel/vdso.c
@@ -220,10 +220,8 @@ void update_vsyscall(struct timekeeper *tk)
 	if (!use_syscall) {
 		/* tkr_mono.cycle_last == tkr_raw.cycle_last */
 		vdso_data->cs_cycle_last	= tk->tkr_mono.cycle_last;
-		vdso_data->raw_time_sec		= tk->raw_time.tv_sec;
-		vdso_data->raw_time_nsec	= (tk->raw_time.tv_nsec <<
-						   tk->tkr_raw.shift) +
-						  tk->tkr_raw.xtime_nsec;
+		vdso_data->raw_time_sec		= tk->raw_sec;
+		vdso_data->raw_time_nsec	= tk->tkr_raw.xtime_nsec;
 		vdso_data->xtime_clock_sec	= tk->xtime_sec;
 		vdso_data->xtime_clock_nsec	= tk->tkr_mono.xtime_nsec;
 		vdso_data->cs_mono_mult		= tk->tkr_mono.mult;
diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index f7043cc..0a0a53d 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -51,7 +51,7 @@ struct tk_read_base {
  * @clock_was_set_seq:	The sequence number of clock was set events
  * @cs_was_changed_seq:	The sequence number of clocksource change events
  * @next_leap_ktime:	CLOCK_MONOTONIC time value of a pending leap-second
- * @raw_time:		Monotonic raw base time in timespec64 format
+ * @raw_sec:		CLOCK_MONOTONIC_RAW time in seconds
  * @cycle_interval:	Number of clock cycles in one NTP interval
  * @xtime_interval:	Number of clock shifted nano seconds in one NTP
  *			interval.
@@ -93,7 +93,7 @@ struct timekeeper {
 	unsigned int		clock_was_set_seq;
 	u8			cs_was_changed_seq;
 	ktime_t			next_leap_ktime;
-	struct timespec64	raw_time;
+	u64			raw_sec;
 
 	/* The following members are for timekeeping internal use */
 	u64			cycle_interval;
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index b602c48..0454bfa 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -72,6 +72,10 @@ static inline void tk_normalize_xtime(struct timekeeper *tk)
 		tk->tkr_mono.xtime_nsec -= (u64)NSEC_PER_SEC << tk->tkr_mono.shift;
 		tk->xtime_sec++;
 	}
+	while (tk->tkr_raw.xtime_nsec >= ((u64)NSEC_PER_SEC << tk->tkr_raw.shift)) {
+		tk->tkr_raw.xtime_nsec -= (u64)NSEC_PER_SEC << tk->tkr_raw.shift;
+		tk->raw_sec++;
+	}
 }
 
 static inline struct timespec64 tk_xtime(struct timekeeper *tk)
@@ -285,12 +289,14 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
 	/* if changing clocks, convert xtime_nsec shift units */
 	if (old_clock) {
 		int shift_change = clock->shift - old_clock->shift;
-		if (shift_change < 0)
+		if (shift_change < 0) {
 			tk->tkr_mono.xtime_nsec >>= -shift_change;
-		else
+			tk->tkr_raw.xtime_nsec >>= -shift_change;
+		} else {
 			tk->tkr_mono.xtime_nsec <<= shift_change;
+			tk->tkr_raw.xtime_nsec <<= shift_change;
+		}
 	}
-	tk->tkr_raw.xtime_nsec = 0;
 
 	tk->tkr_mono.shift = clock->shift;
 	tk->tkr_raw.shift = clock->shift;
@@ -619,9 +625,6 @@ static inline void tk_update_ktime_data(struct timekeeper *tk)
 	nsec = (u32) tk->wall_to_monotonic.tv_nsec;
 	tk->tkr_mono.base = ns_to_ktime(seconds * NSEC_PER_SEC + nsec);
 
-	/* Update the monotonic raw base */
-	tk->tkr_raw.base = timespec64_to_ktime(tk->raw_time);
-
 	/*
 	 * The sum of the nanoseconds portions of xtime and
 	 * wall_to_monotonic can be greater/equal one second. Take
@@ -631,6 +634,11 @@ static inline void tk_update_ktime_data(struct timekeeper *tk)
 	if (nsec >= NSEC_PER_SEC)
 		seconds++;
 	tk->ktime_sec = seconds;
+
+	/* Update the monotonic raw base */
+
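The normalization loop the patch adds for the raw clock follows the same shifted-nanosecond pattern already used for xtime: xtime_nsec holds nanoseconds left-shifted by the clocksource shift, and whole seconds are carried out into a separate seconds counter. A standalone model of that invariant (struct and function names invented, not kernel code):

```c
/*
 * Model (names invented) of timekeeper shifted-nanosecond
 * normalization: xtime_nsec stores ns << shift, so "one second" in
 * those units is NSEC_PER_SEC << shift, and anything above that is
 * carried into the seconds counter (like tk->raw_sec in the patch).
 */
#include <stdint.h>

#define NSEC_PER_SEC	1000000000ULL

struct tkr_model {
	uint64_t xtime_nsec;	/* nanoseconds << shift */
	uint32_t shift;
	uint64_t sec;		/* like tk->raw_sec */
};

static void tk_normalize_model(struct tkr_model *tkr)
{
	uint64_t one_sec = NSEC_PER_SEC << tkr->shift;

	while (tkr->xtime_nsec >= one_sec) {
		tkr->xtime_nsec -= one_sec;
		tkr->sec++;
	}
}
```

Keeping the sub-second part in shifted units is also why tk_setup_internals() above must rescale tkr_raw.xtime_nsec on a clocksource change rather than zeroing it, which was the lossy behavior this patch removes.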
linux-next: build warning after merge of the percpu tree
Hi all,

After merging the percpu tree, today's linux-next build (arm multi_v7_defconfig) produced this warning:

In file included from include/linux/kernel.h:13:0,
                 from include/linux/bitmap.h:9,
                 from mm/percpu.c:58:
mm/percpu.c: In function 'pcpu_alloc':
include/linux/printk.h:303:2: warning: 'err' may be used uninitialized in this function [-Wmaybe-uninitialized]
  printk(KERN_WARNING pr_fmt(fmt), ##__VA_ARGS__)
  ^
mm/percpu.c:864:14: note: 'err' was declared here
  const char *err;
              ^

I am not sure which commit introduced this, although the

	if (is_atomic)
		goto fail;

in pcpu_alloc() looks suspicious.

-- 
Cheers,
Stephen Rothwell
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Tue, Jun 20, 2017 at 6:40 PM, Dave Chinner wrote:
> You're mangling terminology here. We don't "break COW" - we *use*
> copy-on-write to break *extent sharing*. We can break extent sharing
> in page_mkwrite - that's exactly what we do for normal pagecache
> based mmap writes, and it's done in page_mkwrite.

Right, my bad.

> It hasn't been enabled for DAX yet because it simply hasn't been
> robustly tested yet.
>
>> A per-inode count of the number of live DAX mappings or of the
>> number of struct file instances that have requested DAX would work
>> here.
>
> For what purpose does this serve? The reflink invalidates all the
> existing mappings, so the next write access causes a fault and then
> page_mkwrite is called and the shared extent will get COWed

The same purpose as XFS's FS_XFLAG_DAX (assuming I'm understanding it
right), except that IMO an API that doesn't involve making a change to
an inode that sticks around would be nice. The inode flag has the
unfortunate property that, if two different programs each try to set
the flag, mmap, write, and clear the flag, they'll stomp on each other
and risk data corruption.

I admit I'm now thoroughly confused as to exactly what XFS does here --
does FS_XFLAG_DAX persist across unmount/mount?

>> - Trying to use DAX on a file that is already reflinked. The order
>> of operations doesn't matter hugely, except that breaking COW for
>> the entire range in question all at once would be faster and result
>> in better allocation.
>
> We have COW extent size hints for that. i.e. if you want to COW a
> huge page at a time, set the COW extent size hint to the huge page
> size...

Nifty.

> Apparently it is. There are people telling us that mtime
> updates in page faults introduce too much unpredictable latency and
> that screws over their low latency real time applications.

I was one of those, and I even wrote patches. I should try to dust
them off.

> Those same people are telling us that dirty tracking in page faults
> for msync/fsync on DAX is too heavyweight and calling msync is too
> onerous and has unpredictable latencies because it might result in
> having to sync tens of thousands of unrelated dirty objects. Hence
> they want to use userspace data sync primitives to avoid this
> overhead and so filesystems need to make it possible to provide this
> userspace data sync capability.

If I were using DAX in production, I'd have exactly this issue. Let me
quote myself:

On Tue, Jun 20, 2017 at 9:14 AM, Andy Lutomirski wrote:
> 3. (Not strictly related to DAX.) A way to tell the kernel "I have
> this file mmapped for write. Please go out of your way to avoid
> future page faults." I've wanted this for ordinary files on ext4.
> The kernel could, but presently does not, use hardware dirty tracking
> instead of software dirty tracking to decide when to write the page
> back. The kernel could also, in principle, write back dirty pages
> without ever write protecting them. For DAX, this might change
> behavior to prevent any operation that would relocate blocks or to
> have the kernel go out of its way to only do such operations when
> absolutely necessary and to immediately update and unwriteprotect the
> relevant pages.

I agree that this is a real issue, but it's not limited to DAX. I've
wanted a mode where I tell the kernel "I'm a high-performance
application mmapping this file and I'm going to write to it a lot. Do
your best to avoid any page faults, even if it adversely affects the
performance of the system." This mode could do lots of things. It
could cause the system to leave the page writable even after writeback
and, if possible, to use hardware dirty tracking. It could cause the
system to make a copy of the page and write back from the copy if
there is anything in play that could need stable pages during
writeback. And, for DAX, it could tell the system to keep the page
pinned and disallow moving it and reflinking it. (Of course, the above
requires that we either deal with mtime like my patches do or that
this heavyweight mechanism disable mtime updates. I prefer the
former.)

Here's the overall point I'm trying to make: unprivileged programs
that want to write to DAX files with userspace commit mechanisms
(CLFLUSHOPT;SFENCE, etc) should be able to do so reliably, without
privilege, and with reasonably clean APIs. Ideally they could do this
to any file they have write access to. Programs that want to write to
mmapped files, DAX or otherwise, without latency spikes due to
.page_mkwrite should be able to opt in to a heavier weight mechanism.
But these two issues are somewhat independent, and I think they should
be solved separately.
Re: [RFC PATCH 2/2] mm, fs: daxfile, an interface for byte-addressable updates to pmem
On Tue, Jun 20, 2017 at 6:40 PM, Dave Chinner wrote:
> You're mangling terminology here. We don't "break COW" - we *use*
> copy-on-write to break *extent sharing*. We can break extent sharing
> in page_mkwrite - that's exactly what we do for normal pagecache
> based mmap writes, and it's done in page_mkwrite.

Right, my bad.

>
> It hasn't been enabled for DAX yet because it simply hasn't been
> robustly tested yet.
>
>> A per-inode
>> count of the number of live DAX mappings or of the number of struct
>> file instances that have requested DAX would work here.
>
> For what purpose does this serve? The reflink invalidates all the
> existing mappings, so the next write access causes a fault and then
> page_mkwrite is called and the shared extent will get COWed

The same purpose as XFS's FS_XFLAG_DAX (assuming I'm understanding it right), except that IMO an API that doesn't involve making a change to an inode that sticks around would be nice. The inode flag has the unfortunate property that, if two different programs each try to set the flag, mmap, write, and clear the flag, they'll stomp on each other and risk data corruption. I admit I'm now thoroughly confused as to exactly what XFS does here -- does FS_XFLAG_DAX persist across unmount/mount?

>
>> - Trying to use DAX on a file that is already reflinked. The order
>> of operations doesn't matter hugely, except that breaking COW for the
>> entire range in question all at once would be faster and result in
>> better allocation.
>
> We have COW extent size hints for that. i.e. if you want to COW a
> huge page at a time, set the COW extent size hint to the huge page
> size...

Nifty.

> Apparently it is. There are people telling us that mtime
> updates in page faults introduce too much unpredictable latency and
> that screws over their low latency real time applications.

I was one of those, and I even wrote patches. I should try to dust them off.
>
> Those same people are telling us that dirty tracking in page faults
> for msync/fsync on DAX is too heavyweight and calling msync is too
> onerous and has unpredictable latencies because it might result in
> having to sync tens of thousands of unrelated dirty objects. Hence
> they want to use userspace data sync primitives to avoid this
> overhead and so filesystems need to make it possible to provide this
> userspace data sync capability.

If I were using DAX in production, I'd have exactly this issue. Let me quote myself:

On Tue, Jun 20, 2017 at 9:14 AM, Andy Lutomirski wrote:
> 3. (Not strictly related to DAX.) A way to tell the kernel "I have
> this file mmapped for write. Please go out of your way to avoid
> future page faults." I've wanted this for ordinary files on ext4.
> The kernel could, but presently does not, use hardware dirty tracking
> instead of software dirty tracking to decide when to write the page
> back. The kernel could also, in principle, write back dirty pages
> without ever write protecting them. For DAX, this might change
> behavior to prevent any operation that would relocate blocks or to
> have the kernel go out of its way to only do such operations when
> absolutely necessary and to immediately update and unwriteprotect the
> relevant pages.

I agree that this is a real issue, but it's not limited to DAX. I've wanted a mode where I tell the kernel "I'm a high-performance application mmapping this file and I'm going to write to it a lot. Do your best to avoid any page faults, even if it adversely affects the performance of the system." This mode could do lots of things. It could cause the system to leave the page writable even after writeback and, if possible, to use hardware dirty tracking. It could cause the system to make a copy of the page and write back from the copy if there is anything in play that could need stable pages during writeback.
And, for DAX, it could tell the system to keep the page pinned and disallow moving it and reflinking it. (Of course, the above requires that we either deal with mtime like my patches do or that this heavyweight mechanism disable mtime updates. I prefer the former.)

Here's the overall point I'm trying to make: unprivileged programs that want to write to DAX files with userspace commit mechanisms (CLFLUSHOPT;SFENCE, etc) should be able to do so reliably, without privilege, and with reasonably clean APIs. Ideally they could do this to any file they have write access to. Programs that want to write to mmapped files, DAX or otherwise, without latency spikes due to .page_mkwrite should be able to opt in to a heavier-weight mechanism. But these two issues are somewhat independent, and I think they should be solved separately.
Re: [PATCH 32/51] rtc: pcap: stop using rtc deprecated functions
Hi Benjamin,

[auto build test ERROR on abelloni/rtc-next]
[also build test ERROR on v4.12-rc6 next-20170620]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url: https://github.com/0day-ci/linux/commits/Benjamin-Gaignard/rtc-stop-using-rtc-deprecated-functions/20170621-044455
base: https://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux.git rtc-next
config: arm-ezx_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm

All errors (new ones prefixed by >>):

   drivers/built-in.o: In function `pcap_rtc_set_mmss64':
>> hid-generic.c:(.text+0xac5b8): undefined reference to `__aeabi_ldivmod'
   hid-generic.c:(.text+0xac5dc): undefined reference to `__aeabi_ldivmod'
   drivers/built-in.o: In function `pcap_rtc_set_alarm':
>> hid-generic.c:(.text+0xac61c): undefined reference to `__aeabi_uldivmod'
   hid-generic.c:(.text+0xac640): undefined reference to `__aeabi_uldivmod'

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
Re: [Ksummit-discuss] Next steps and plans for the 2017 Maintainer and Kernel Summits
On Wed, Jun 14, 2017 at 09:29:20AM -0400, Theodore Ts'o wrote:
> We will use the rest of the names that Linus had mentioned as being part
> of his secondary list to pre-seed the list of people to be considered.
> People who suggest topics that should be discussed on the Maintainer's
> summit will be added to the list. Please prefix [MAINTAINERS SUMMIT]
> to the subject line, and remember that we only have a half-day, and
> they should be appropriate for the Maintainer's summit --- that is,
> they should be focused on the development process.

If you are interested in attending the Maintainer's Summit, I'd appreciate it if you could send in any topic proposals by Monday, June 26th.

Many thanks!!

- Ted
Re: [PATCH V1 09/15] spmi: pmic-arb: check apid enabled before calling the handler
On 2017-06-17 02:41, Stephen Boyd wrote:
On 06/14, kgu...@codeaurora.org wrote:
On 2017-06-01 02:09, Stephen Boyd wrote:
>On 05/30, Kiran Gunda wrote:
>>From: Abhijeet Dharmapurikar
>>
>>The driver currently invokes the apid handler (periph_handler())
>
>You mean periph_interrupt()?
>
Yes.
>>once it sees that the summary status bit for that apid is set.
>>
>>However the hardware is designed to set that bit even if the apid
>>interrupts are disabled. The driver should check whether the apid
>>is indeed enabled before calling the apid handler.
>
>Really? Wow that is awful. Or is this because ACC_ENABLE bit is
>always set now and never cleared?
>
Yes. It is awful. It is not because the ACC_ENABLE bit is set.
>>
>>Signed-off-by: Abhijeet Dharmapurikar
>>Signed-off-by: Kiran Gunda
>>---
>> drivers/spmi/spmi-pmic-arb.c | 10 +++---
>> 1 file changed, 7 insertions(+), 3 deletions(-)
>>
>>diff --git a/drivers/spmi/spmi-pmic-arb.c
>>b/drivers/spmi/spmi-pmic-arb.c
>>index ad34491..f8638fa 100644
>>--- a/drivers/spmi/spmi-pmic-arb.c
>>+++ b/drivers/spmi/spmi-pmic-arb.c
>>@@ -536,8 +536,8 @@ static void pmic_arb_chained_irq(struct
>>irq_desc *desc)
>>	void __iomem *intr = pa->intr;
>>	int first = pa->min_apid >> 5;
>>	int last = pa->max_apid >> 5;
>>-	u32 status;
>>-	int i, id;
>>+	u32 status, enable;
>>+	int i, id, apid;
>>
>>	chained_irq_enter(chip, desc);
>>
>>@@ -547,7 +547,11 @@ static void pmic_arb_chained_irq(struct
>>irq_desc *desc)
>>	while (status) {
>>		id = ffs(status) - 1;
>>		status &= ~BIT(id);
>>-		periph_interrupt(pa, id + i * 32);
>>+		apid = id + i * 32;
>>+		enable = readl_relaxed(intr +
>>+			pa->ver_ops->acc_enable(apid));
>
>Do we need to read the hardware to figure this out? After earlier
>patches in this series we would never clear the
>SPMI_PIC_ACC_ENABLE_BIT after one of the irqs in a peripheral is
>unmasked for the first time (which looks to be fixing a bug in
>the existing driver BTW). So in practice, this should almost
>always be true.
>
yes.
We have removed clearing the SPMI_PIC_ACC_ENABLE_BIT from the irq_mask, because if we disable this BIT it disables all the peripheral IRQs, which we don't want.

Right, we could reference count it though and only clear and set the bit when we mask and unmask the last irq in the peripheral.

Actually we are disabling the interrupt at peripheral level, so I believe disabling the PIC_ACC_EN_BIT is not mandatory. Once the peripheral fires the interrupt, the summary status bit for that peripheral is set even though the SPMI_PIC_ACC_ENABLE_BIT is not enabled. That's why we have to read this register to not service the interrupt that is not requested/enabled yet. This SPMI_PIC_ACC_ENABLE_BIT is enabled during the irq_unmask which is called from request_irq.

Ok. So this is again about handling the case where an interrupt is pending out of the bootloader?

Yes.

>In the one case that it isn't true, we'll be handling some other
>irq for another peripheral and then hardware will tell us there's
>an interrupt for a peripheral that doesn't have any interrupts
>unmasked. We would call periph_interrupt() and then that
>shouldn't see any interrupts in the status register for that
>APID. So we do some more work, but nothing happens still. Did I
>miss something? What is this fixing?

Yes. As you said, this fixes the issue of calling periph_interrupt for some other irq that has not been requested and enabled yet.

Hmm. I seemed to miss the fact that periph_interrupt() will see an unmasked interrupt and think it's valid. I thought that only SPMI_PIC_ACC_ENABLE_BIT was broken, but you're saying that the status register for a particular peripheral will always latch interrupts even if we haven't enabled them?

Yes. Correct. Both SPMI_PIC_OWNERm_ACC_STATUSn and SPMI_PIC_IRQ_STATUSn are getting set when the peripheral fires the interrupt, even though we haven't enabled that particular peripheral bit in the SPMI_PIC_ACC_ENABLEn register.
[GVT-g] [Intel-gfx] [ANNOUNCE] 2017-Q2 release of XenGT (Intel GVT-g for Xen)
Hi all,

We are pleased to announce an update of Intel GVT-g for Xen. Intel GVT-g is a full GPU virtualization solution with mediated pass-through, starting from 4th generation Intel Core(TM) processors with Intel processor graphics. A virtual GPU instance is maintained for each VM, with part of the performance-critical resources directly assigned. The capability of running a native graphics driver inside a VM, without hypervisor intervention in performance-critical paths, achieves a good balance among performance, feature, and sharing capability. GVT-g for the Xen hypervisor is XenGT.

Repositories
-Xen : https://github.com/01org/igvtg-xen (tag: 2017-q2-xengt-stable-4.9)
-Kernel: https://github.com/01org/gvt-linux/ (tag: 2017-q2-gvt-stable-4.11)
-Qemu: https://github.com/01org/igvtg-qemu (tag: 2017-q2-stable-2.9.0)

This update consists of:
-GVT-g upgraded to the new architecture which has been upstreamed from kernel 4.10.
-QEMU version upgraded to 2.9 from 2.3.
-Xen version upgraded to 4.9 from 4.6.
-Supported server platforms: Intel(r) Xeon(r) E3_v4 and E3_v5 with Intel Graphics processor.
-Supported client platforms: Intel(r) Core(tm) 5th generation (code name: Broadwell) and 6th generation (code name: Skylake).
-Validated Guest OS: Windows7 32bit, Windows7 64bit, Windows8.1 64bit, Windows10 64bit and Linux.
-GVT-g only supports remote display, not local display, as of this release.
-Remote protocol: only a guest-side remoting protocol is supported; a host-side remoting connection like SPICE is a work in progress. For example, users can use X11VNC for a Guest Linux VM or TightVNC for a Guest Windows VM.

Limitations or known issues:
-GVT-g can support a maximum of 7 Guest VMs due to host graphics resource limitations. When a user runs 7 VMs simultaneously, the host OS can only run in text mode.
-In order to support a Guest Windows7 32bit VM, users are recommended to add vgt_low_gm_sz=128 in the HVM file because a Guest Windows7 32bit VM needs more graphics resources than other Guest VMs.
-In order to support Guest VM high resolution and adjustable screen resolution in Guest Windows8.1 64bit VMs and Guest Windows10 64bit VMs, users are recommended to configure vgt_low_gm_sz=64 / 128 / 256 / 512 in the HVM file to get a larger VM aperture size.
-In a corner case, a Guest Windows 8.1 64bit VM may be killed automatically by Xen when the Guest VM runs into a TDR. This issue happens only on the Broadwell platform.
-In a corner case, Guest Windows VMs may be killed automatically by Xen when the user shuts down the last launched Guest Windows VM. This issue happens only on the Broadwell platform.
-In a corner case, the Guest Linux VM virtual display screen may freeze and not be able to recover when the Guest VM runs into a TDR, but the Guest VM is still running and alive and can be accessed through SSH.

Setup guide: https://github.com/01org/gvt-linux/wiki/GVTg_Setup_Guide

This is the first GVT-g community release based on the new upstream architecture design; refer to the following document for an introduction to the new architecture: https://01.org/igvt-g/documentation/intel-gvt-g-new-architecture-introduction

Please subscribe to join the mailing list if you want to learn more about the GVT-g project: https://lists.01.org/mailman/listinfo/igvt-g
Please subscribe to join the mailing list if you want to contribute/review the latest GVT-g upstream patches: https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev

Official GVT-g portal: https://01.org/igvt-g

More information about the background, architecture and other aspects of Intel GVT-g can be found at:
https://01.org/igvt-g
https://www.usenix.org/conference/atc14/technical-sessions/presentation/tian
http://events.linuxfoundation.org/sites/events/files/slides/XenGT-Xen%20Summit-v7_0.pdf
http://events.linuxfoundation.org/sites/events/files/slides/XenGT-Xen%20Summit-REWRITE%203RD%20v4.pdf
https://01.org/xen/blogs/srclarkx/2013/graphics-virtualization-xengt

Note: The XenGT project should be considered a work in progress.
As such it is not a complete product nor should it be considered one. Extra care should be taken when testing and configuring a system to use the XenGT project.

Thanks
Terrence
Tel: +86-21-6116 5390
MP: +86-1356 4367 024
Mail: terrence...@intel.com

___
iGVT-g mailing list
igv...@lists.01.org
https://lists.01.org/mailman/listinfo/igvt-g
[GVT-g] [Intel-gfx] [ANNOUNCE] 2017-Q2 release of KVMGT (Intel GVT-g for KVM)
Hi all,

We are pleased to announce an update of Intel GVT-g for KVM. Intel GVT-g for KVM (a.k.a. KVMGT) is a full GPU virtualization solution with mediated pass-through, starting from 5th generation Intel Core(TM) processors with Intel processor graphics. A virtual GPU instance is maintained for each VM, with part of the performance-critical resources directly assigned. The capability of running a native graphics driver inside a VM, without hypervisor intervention in performance-critical paths, achieves a good balance among performance, feature, and sharing capability.

Repositories:
-Kernel: https://github.com/01org/gvt-linux/ (tag: 2017-q2-gvt-stable-4.11)
-Qemu: https://github.com/01org/igvtg-qemu (tag: 2017-q2-stable-2.9.0)

This update consists of:
-GVT-g upgraded to the new architecture which has been upstreamed from kernel 4.10.
-QEMU version upgraded to 2.9 from 2.3 (there are no KVMGT-specific code changes in QEMU, so people can directly use upstream QEMU as an alternative).
-Supported server platforms: Intel(r) Xeon(r) E3_v4 and E3_v5 with Intel Graphics processor.
-Supported client platforms: Intel(r) Core(tm) 5th generation (code name: Broadwell) and 6th generation (code name: Skylake).
-Validated Guest OS: Windows7 32bit, Windows7 64bit, Windows8.1 64bit, Windows10 64bit and Linux.
-GVT-g only supports remote display, not local display, as of this release.
-Remote protocol: only a guest-side remoting protocol is supported; a host-side remoting connection like SPICE is a work in progress. For example, users can use X11VNC for a Guest Linux VM or TightVNC for a Guest Windows VM.

Limitations or known issues:
-GVT-g can support a maximum of 7 Guest VMs due to host graphics resource limitations. When a user runs 7 VMs simultaneously, the host OS can only run in text mode.
-In order to support a Guest Windows7 32bit VM, users can only use vGPU type1, type2, or type4, not type8, because a Guest Windows7 32bit VM needs more graphics resources than other Guest VMs.
-Some 3rd party applications/tools like GPU_Z and Passmark 9.0 may read/write GPU MSRs directly, which will trigger a VM BSOD since those MSRs are unhandled registers in KVMGT. The workaround is to set the MSR read/write ignore flag to 1 in the host grub file by adding "kvm.ignore_msrs=1".
-In a corner case, the Guest Linux VM virtual display screen may freeze and not be able to recover when the Guest VM runs into a TDR, but the Guest VM is still running and alive and can be accessed through SSH.

Setup guide: https://github.com/01org/gvt-linux/wiki/GVTg_Setup_Guide

This is the first GVT-g community release based on the new upstream architecture design; refer to the following document for an introduction to the new architecture: https://01.org/igvt-g/documentation/intel-gvt-g-new-architecture-introduction

Please subscribe to join the mailing list if you want to learn more about the GVT-g project: https://lists.01.org/mailman/listinfo/igvt-g
Please subscribe to join the mailing list if you want to contribute/review the latest GVT-g upstream patches: https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev

Official GVT-g portal: https://01.org/igvt-g

More information about the background, architecture and other aspects of Intel GVT-g can be found at:
http://www.linux-kvm.org/images/f/f3/01x08b-KVMGT-a.pdf
https://www.usenix.org/conference/atc14/technical-sessions/presentation/tian

Note: The KVMGT project should be considered a work in progress. As such it is not a complete product nor should it be considered one. Extra care should be taken when testing and configuring a system to use the KVMGT project.

Thanks
Terrence
Tel: +86-21-6116 5390
MP: +86-1356 4367 024
Mail: terrence...@intel.com

___
iGVT-g mailing list
igv...@lists.01.org
https://lists.01.org/mailman/listinfo/igvt-g
[GVT-g] [Intel-gfx] [ANNOUNCE] 2017-Q2 release of KVMGT (Intel GVT-g for KVM)
Hi all, We are pleased to announce an update of Intel GVT-g for KVM. Intel GVT-g for KVM (a.k.a. KVMGT) is a full GPU virtualization solution with mediated pass-through, starting from 5th generation Intel Core(TM) processors with Intel processor graphics. A virtual GPU instance is maintained for each VM, with part of performance critical resources directly assigned. The capability of running native graphics driver inside a VM, without hypervisor intervention in performance critical paths, achieves a good balance among performance, feature, and sharing capability. Repositories: -Kernel: https://github.com/01org/gvt-linux/ (tag: 2017-q2-gvt-stable-4.11) -Qemu: https://github.com/01org/igvtg-qemu (tag: 2017-q2-stable-2.9.0) This update consists of: -GVT-g upgraded to new architecture which has been upstreamed from kernel 4.10. -QEMU version upgraded to 2.9 from 2.3 (There is no any specific code changes in QEMU for KVMGT, so people could directly use upstream QEMU as an alternative). -Supported server platforms: Intel(r) Xeon(r) E3_v4 and E3_v5 with Intel Graphics processor. -Supported client platforms: Intel(r) Core(tm) 5th generation (code name: Broadwell) and 6th generation (code name: Skylake). -Validated Guest OS: Windows7 32bit, Window7 64bit, Windows8.1 64bit, Windows10 64bit and Linux. -GVT-g only supports remote display not local display by this release. -Remote protocol: only guest-side remoting protocol is supported, host-side remoting connection like SPICE is working in progress. For example, user can use X11VNC for Guest Linux VM or TightVNC for Guest Windows VM. Limitation or known issues: -GVT-g can support maximum 7 Guest VMs due to host graphics resource limitation. When user runs 7 VMs simultaneously, host OS can only run in text mode. -In order to support Guest Windows7 32bit VM, user can only uses vGPU type1, type2, type4 not type8 because Guest Windows7 32bit VM needs more graphics resource than other Guest VM. 
-Some 3rd party applications/tools like GPU_Z and Passmark 9.0 may read/write GPU MSRs directly, which will trigger a VM BSOD since those MSRs are unhandled registers in KVMGT. The workaround is to set the MSR read/write ignore flag in the host grub file by adding "kvm.ignore_msrs=1". -In a corner case, the Guest Linux VM virtual display screen may freeze and not be able to recover when the Guest VM runs into a TDR, but the Guest VM is still running and alive and can be accessed through SSH. Setup guide: https://github.com/01org/gvt-linux/wiki/GVTg_Setup_Guide This is the first GVT-g community release based on the new upstream architecture design; refer to the following document for an introduction to the new architecture: https://01.org/igvt-g/documentation/intel-gvt-g-new-architecture-introduction Please subscribe to the mailing list if you want to learn more about the GVT-g project: https://lists.01.org/mailman/listinfo/igvt-g Please subscribe to the mailing list if you want to contribute/review the latest GVT-g upstream patches: https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev Official GVT-g portal: https://01.org/igvt-g More information about the background, architecture, and other aspects of Intel GVT-g can be found at: http://www.linux-kvm.org/images/f/f3/01x08b-KVMGT-a.pdf https://www.usenix.org/conference/atc14/technical-sessions/presentation/tian Note: The KVMGT project should be considered a work in progress. As such it is not a complete product nor should it be considered one. Extra care should be taken when testing and configuring a system to use the KVMGT project. Thanks Terrence Tel: +86-21-6116 5390 MP: +86-1356 4367 024 Mail: terrence...@intel.com ___ iGVT-g mailing list igv...@lists.01.org https://lists.01.org/mailman/listinfo/igvt-g
Re: [PATCH] PM / OPP: Add dev_pm_opp_{set|put}_clkname()
On 20-06-17, 14:08, Stephen Boyd wrote: > On 06/20, Viresh Kumar wrote: > > + */ > > +struct opp_table *dev_pm_opp_set_clkname(struct device *dev, const char > > *name) > > +{ > > + struct opp_table *opp_table; > > + int ret; > > + > > + opp_table = dev_pm_opp_get_opp_table(dev); > > + if (!opp_table) > > + return ERR_PTR(-ENOMEM); > > + > > + /* This should be called before OPPs are initialized */ > > + if (WARN_ON(!list_empty(&opp_table->opp_list))) { > > + ret = -EBUSY; > > + goto err; > > + } > > + > > + /* Already have clkname set */ > > + if (opp_table->clk_name) { > > + ret = -EBUSY; > > + goto err; > > + } > > + > > + opp_table->clk_name = kstrdup(name, GFP_KERNEL); > > + if (!opp_table->clk_name) { > > Is there a reason to duplicate clk_name instead of using the clk > structure returned from clk_get()? Is it because we may already > have opp_table->clk set from default init? Why can't we always > clk_put() the clk structure if it's !IS_ERR() and then allow > dev_pm_opp_set_clkname() to be called many times in succession? > Long story short, I don't see the benefit to allocating the name > again here just to use it as a mechanism to know if the APIs have > been called symmetrically. Yeah, it was kind of required in what I was trying to do earlier, but not anymore. -- viresh
[PATCH V2] PM / OPP: Add dev_pm_opp_{set|put}_clkname()
In order to support OPP switching, OPP layer needs to get pointer to the clock for the device. Simple cases work fine without using the routines added by this patch (i.e. by passing connection-id as NULL), but for a device with multiple clocks available, the OPP core needs to know the exact name of the clk to use. Add a new set of APIs to get that done. Tested-by: Rajendra Nayak Signed-off-by: Viresh Kumar --- V1->V2: - No need to save clk_name, as we wouldn't use it again. drivers/base/power/opp/core.c | 67 +++ include/linux/pm_opp.h| 9 ++ 2 files changed, 76 insertions(+) diff --git a/drivers/base/power/opp/core.c b/drivers/base/power/opp/core.c index 5ee7aadf0abf..a8cc14fd8ae4 100644 --- a/drivers/base/power/opp/core.c +++ b/drivers/base/power/opp/core.c @@ -1363,6 +1363,73 @@ void dev_pm_opp_put_regulators(struct opp_table *opp_table) EXPORT_SYMBOL_GPL(dev_pm_opp_put_regulators); /** + * dev_pm_opp_set_clkname() - Set clk name for the device + * @dev: Device for which clk name is being set. + * @name: Clk name. + * + * In order to support OPP switching, OPP layer needs to get pointer to the + * clock for the device. Simple cases work fine without using this routine (i.e. + * by passing connection-id as NULL), but for a device with multiple clocks + * available, the OPP core needs to know the exact name of the clk to use. + * + * This must be called before any OPPs are initialized for the device. 
+ */ +struct opp_table *dev_pm_opp_set_clkname(struct device *dev, const char *name) +{ + struct opp_table *opp_table; + int ret; + + opp_table = dev_pm_opp_get_opp_table(dev); + if (!opp_table) + return ERR_PTR(-ENOMEM); + + /* This should be called before OPPs are initialized */ + if (WARN_ON(!list_empty(&opp_table->opp_list))) { + ret = -EBUSY; + goto err; + } + + /* Already have default clk set, free it */ + if (!IS_ERR(opp_table->clk)) + clk_put(opp_table->clk); + + /* Find clk for the device */ + opp_table->clk = clk_get(dev, name); + if (IS_ERR(opp_table->clk)) { + ret = PTR_ERR(opp_table->clk); + if (ret != -EPROBE_DEFER) { + dev_err(dev, "%s: Couldn't find clock: %d\n", __func__, + ret); + } + goto err; + } + + return opp_table; + +err: + dev_pm_opp_put_opp_table(opp_table); + + return ERR_PTR(ret); +} +EXPORT_SYMBOL_GPL(dev_pm_opp_set_clkname); + +/** + * dev_pm_opp_put_clkname() - Releases resources blocked for clk. + * @opp_table: OPP table returned from dev_pm_opp_set_clkname(). + */ +void dev_pm_opp_put_clkname(struct opp_table *opp_table) +{ + /* Make sure there are no concurrent readers while updating opp_table */ + WARN_ON(!list_empty(&opp_table->opp_list)); + + clk_put(opp_table->clk); + opp_table->clk = ERR_PTR(-EINVAL); + + dev_pm_opp_put_opp_table(opp_table); +} +EXPORT_SYMBOL_GPL(dev_pm_opp_put_clkname); + +/** * dev_pm_opp_register_set_opp_helper() - Register custom set OPP helper * @dev: Device for which the helper is getting registered. * @set_opp: Custom set OPP helper. 
diff --git a/include/linux/pm_opp.h b/include/linux/pm_opp.h index a6685b3dde26..51ec727b4824 100644 --- a/include/linux/pm_opp.h +++ b/include/linux/pm_opp.h @@ -121,6 +121,8 @@ struct opp_table *dev_pm_opp_set_prop_name(struct device *dev, const char *name) void dev_pm_opp_put_prop_name(struct opp_table *opp_table); struct opp_table *dev_pm_opp_set_regulators(struct device *dev, const char * const names[], unsigned int count); void dev_pm_opp_put_regulators(struct opp_table *opp_table); +struct opp_table *dev_pm_opp_set_clkname(struct device *dev, const char * name); +void dev_pm_opp_put_clkname(struct opp_table *opp_table); struct opp_table *dev_pm_opp_register_set_opp_helper(struct device *dev, int (*set_opp)(struct dev_pm_set_opp_data *data)); void dev_pm_opp_register_put_opp_helper(struct opp_table *opp_table); int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq); @@ -257,6 +259,13 @@ static inline struct opp_table *dev_pm_opp_set_regulators(struct device *dev, co static inline void dev_pm_opp_put_regulators(struct opp_table *opp_table) {} +static inline struct opp_table *dev_pm_opp_set_clkname(struct device *dev, const char * name) +{ + return ERR_PTR(-ENOTSUPP); +} + +static inline void dev_pm_opp_put_clkname(struct opp_table *opp_table) {} + static inline int dev_pm_opp_set_rate(struct device *dev, unsigned long target_freq) { return -ENOTSUPP; -- 2.13.0.71.gd7076ec9c9cb
Re: [PATCH v6 1/4] of: remove *phandle properties from expanded device tree
Hi Frank, frowand.l...@gmail.com writes: > From: Frank Rowand > > Remove "phandle", "linux,phandle", and "ibm,phandle" properties from > the internal device tree. The phandle will still be in the struct > device_node phandle field and will still be displayed as if it is > a property in /proc/device_tree. > > This is to resolve the issue found by Stephen Boyd [1] when he changed > the type of struct property.value from void * to const void *. As > a result of the type change, the overlay code had compile errors > where the resolver updates phandle values. > > [1] http://lkml.iu.edu/hypermail/linux/kernel/1702.1/04160.html > > - Add sysfs infrastructure to report np->phandle, as if it was a property. > - Do not create "phandle" "ibm,phandle", and "linux,phandle" properties > in the expanded device tree. > - Remove phandle properties in of_attach_node(), for nodes dynamically > attached to the live tree. Add the phandle sysfs entry for these nodes. > - When creating an overlay changeset, duplicate the node phandle in > __of_node_dup(). > - Remove no longer needed checks to exclude "phandle" and "linux,phandle" > properties in several locations. > - A side effect of these changes is that the obsolete "linux,phandle" and > "ibm,phandle" properties will no longer appear in /proc/device-tree (they > will appear as "phandle"). Sorry but I don't think that can work for us. Our DLPAR (ie. CPU/memory/device hotplug) stuff on PowerVM uses "ibm,phandle", and it's not the same thing as "phandle" / "linux,phandle". I don't know the code well myself, but the spec (PAPR) says: Note: If the “ibm,phandle” property exists, there are two “phandle” namespaces which must be kept separate. One is that actually used by the OF client interface, the other is properties in the device tree making reference to device tree nodes. 
These requirements are written to maintain backward compatibility with older FW versions predating these requirements; if the “ibm,phandle” property is not present, the OS may assume that any device tree properties which refer to this node will have a phandle value matching that returned by client interface services. I have systems here that still use "ibm,phandle". I also see at least some of the userspace code that looks for "ibm,phandle", and nothing else. The note above actually implies that the current Linux code is wrong, when it uses "ibm,phandle" as the value of np->phandle. So sorry that's a big mess, but we can't just rip out those properties. I think the minimal change would be to treat "ibm,phandle" like a normal property, I think that would allow our tools to keep working? The other thing that worries me is that by renaming (effectively) "linux,phandle" to "phandle", we lose the ability to accurately regenerate the device tree from /proc/device-tree. In theory it shouldn't matter, but I worry that in practice something will break. What if we just kept a single bit flag somewhere indicating if the name of the phandle property we found was "phandle" or "linux,phandle", and create the sysfs phandle using that name? cheers
Re: [PATCH v3 28/31] quota: add get_inode_usage callback to transfer multi-inode charges
Tahsin, when you think we've closed on the reviews, could you send out a complete set of all of the patches on a new mail thread, using git send-email, so I can make sure I'm grabbing the final version of all of the patches in this patch series? It's great that you are using separate versions for each patch, since it makes it easy to track changes in each patch, but when it comes time for me to get a complete set of patches to apply, it does make it harder to figure out that I need v5 of this patch, and v3 of that patch. Many thanks! - Ted
[PATCH V2 1/5] arch_topology: Get rid of "cap_parsing_done"
We can reuse "cap_parsing_failed" instead of keeping an additional variable here. Signed-off-by: Viresh Kumar --- drivers/base/arch_topology.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c index d1c33a85059e..8239a6232808 100644 --- a/drivers/base/arch_topology.c +++ b/drivers/base/arch_topology.c @@ -161,7 +161,6 @@ int __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu) #ifdef CONFIG_CPU_FREQ static cpumask_var_t cpus_to_visit; -static bool cap_parsing_done; static void parsing_done_workfn(struct work_struct *work); static DECLARE_WORK(parsing_done_work, parsing_done_workfn); @@ -173,7 +172,7 @@ init_cpu_capacity_callback(struct notifier_block *nb, struct cpufreq_policy *policy = data; int cpu; - if (cap_parsing_failed || cap_parsing_done) + if (cap_parsing_failed) return 0; switch (val) { @@ -193,7 +192,7 @@ init_cpu_capacity_callback(struct notifier_block *nb, topology_normalize_cpu_scale(); kfree(raw_capacity); pr_debug("cpu_capacity: parsing done\n"); - cap_parsing_done = true; + cap_parsing_failed = true; schedule_work(&parsing_done_work); } } -- 2.13.0.71.gd7076ec9c9cb
[PATCH V2 0/5] arch_topology: Minor cleanups
Hi Greg, You weren't included in the first [1] version of this series, as it was targeting arch/arm*/ directories then. Here are some cleanups for the arch_topology core. Tested on ARM64 Hikey board by setting the following in cpu nodes in DT: capacity-dmips-mhz = <1000>; V1->V2: - based over linux-next/master (to get Juri's recent changes) - More cleanups included. V1 only had the first patch. - Rename of cap_parsing_failed isn't required anymore (as it is localized to a single routine now). -- viresh [1] https://marc.info/?l=linux-arm-kernel&m=149795553024005 Viresh Kumar (5): arch_topology: Get rid of "cap_parsing_done" arch_topology: Don't break lines unnecessarily arch_topology: Convert switch block to if block arch_topology: Return 0 or -ve errors from topology_parse_cpu_capacity() arch_topology: Localize cap_parsing_failed to topology_parse_cpu_capacity() arch/arm/kernel/topology.c | 2 +- drivers/base/arch_topology.c | 78 ++-- 2 files changed, 40 insertions(+), 40 deletions(-) -- 2.13.0.71.gd7076ec9c9cb
[PATCH V2 4/5] arch_topology: Return 0 or -ve errors from topology_parse_cpu_capacity()
Use the standard way of returning errors instead of returning 0 (failure) or 1 (success), which makes it hard to read. Signed-off-by: Viresh Kumar --- arch/arm/kernel/topology.c | 2 +- drivers/base/arch_topology.c | 8 ++++---- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c index bf949a763dbe..a7ef4c35855e 100644 --- a/arch/arm/kernel/topology.c +++ b/arch/arm/kernel/topology.c @@ -111,7 +111,7 @@ static void __init parse_dt_topology(void) continue; } - if (topology_parse_cpu_capacity(cn, cpu)) { + if (!topology_parse_cpu_capacity(cn, cpu)) { of_node_put(cn); continue; } diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c index 07784ba666a7..ff8713b83090 100644 --- a/drivers/base/arch_topology.c +++ b/drivers/base/arch_topology.c @@ -121,11 +121,11 @@ void topology_normalize_cpu_scale(void) int __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu) { - int ret = 1; + int ret; u32 cpu_capacity; if (cap_parsing_failed) - return !ret; + return -EINVAL; ret = of_property_read_u32(cpu_node, "capacity-dmips-mhz", &cpu_capacity); @@ -137,7 +137,7 @@ int __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu) if (!raw_capacity) { pr_err("cpu_capacity: failed to allocate memory for raw capacities\n"); cap_parsing_failed = true; - return 0; + return -ENOMEM; } } capacity_scale = max(cpu_capacity, capacity_scale); @@ -154,7 +154,7 @@ int __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu) kfree(raw_capacity); } - return !ret; + return ret; } #ifdef CONFIG_CPU_FREQ -- 2.13.0.71.gd7076ec9c9cb
[PATCH V2 3/5] arch_topology: Convert switch block to if block
We only need to take care of one case here (CPUFREQ_NOTIFY) and there is no need to add an extra level of indentation to the case specific code by using a switch block. Use an if block instead. Also add some blank lines to make the code look better. Signed-off-by: Viresh Kumar --- drivers/base/arch_topology.c | 41 ++++++++++++++++++++++------------------- 1 file changed, 22 insertions(+), 19 deletions(-) diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c index 2f1d9921ee54..07784ba666a7 100644 --- a/drivers/base/arch_topology.c +++ b/drivers/base/arch_topology.c @@ -173,26 +173,29 @@ init_cpu_capacity_callback(struct notifier_block *nb, if (cap_parsing_failed) return 0; - switch (val) { - case CPUFREQ_NOTIFY: - pr_debug("cpu_capacity: init cpu capacity for CPUs [%*pbl] (to_visit=%*pbl)\n", - cpumask_pr_args(policy->related_cpus), - cpumask_pr_args(cpus_to_visit)); - cpumask_andnot(cpus_to_visit, cpus_to_visit, - policy->related_cpus); - for_each_cpu(cpu, policy->related_cpus) { - raw_capacity[cpu] = topology_get_cpu_scale(NULL, cpu) * - policy->cpuinfo.max_freq / 1000UL; - capacity_scale = max(raw_capacity[cpu], capacity_scale); - } - if (cpumask_empty(cpus_to_visit)) { - topology_normalize_cpu_scale(); - kfree(raw_capacity); - pr_debug("cpu_capacity: parsing done\n"); - cap_parsing_failed = true; - schedule_work(&parsing_done_work); - } + if (val != CPUFREQ_NOTIFY) + return 0; + + pr_debug("cpu_capacity: init cpu capacity for CPUs [%*pbl] (to_visit=%*pbl)\n", +cpumask_pr_args(policy->related_cpus), +cpumask_pr_args(cpus_to_visit)); + + cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->related_cpus); + + for_each_cpu(cpu, policy->related_cpus) { + raw_capacity[cpu] = topology_get_cpu_scale(NULL, cpu) * + policy->cpuinfo.max_freq / 1000UL; + capacity_scale = max(raw_capacity[cpu], capacity_scale); } + + if (cpumask_empty(cpus_to_visit)) { + topology_normalize_cpu_scale(); + kfree(raw_capacity); + pr_debug("cpu_capacity: parsing done\n"); + cap_parsing_failed = true; + 
schedule_work(&parsing_done_work); + } + return 0; } -- 2.13.0.71.gd7076ec9c9cb
[PATCH V2 5/5] arch_topology: Localize cap_parsing_failed to topology_parse_cpu_capacity()
cap_parsing_failed is only required in topology_parse_cpu_capacity() to know if we have already tried to allocate raw_capacity and failed, or if at least one of the cpu_nodes didn't have the required "capacity-dmips-mhz" property. All other users can use raw_capacity instead of cap_parsing_failed. Make sure we set raw_capacity to NULL after we free it. Signed-off-by: Viresh Kumar --- drivers/base/arch_topology.c | 25 +++++++++++++------------ 1 file changed, 13 insertions(+), 12 deletions(-) diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c index ff8713b83090..d8923f89724f 100644 --- a/drivers/base/arch_topology.c +++ b/drivers/base/arch_topology.c @@ -95,14 +95,21 @@ subsys_initcall(register_cpu_capacity_sysctl); static u32 capacity_scale; static u32 *raw_capacity; -static bool cap_parsing_failed; + +static int __init free_raw_capacity(void) +{ + kfree(raw_capacity); + raw_capacity = NULL; + + return 0; +} void topology_normalize_cpu_scale(void) { u64 capacity; int cpu; - if (!raw_capacity || cap_parsing_failed) + if (!raw_capacity) return; pr_debug("cpu_capacity: capacity_scale=%u\n", capacity_scale); @@ -121,6 +128,7 @@ void topology_normalize_cpu_scale(void) int __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu) { + static bool cap_parsing_failed; int ret; u32 cpu_capacity; @@ -151,7 +159,7 @@ int __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu) pr_err("cpu_capacity: partial information: fallback to 1024 for all CPUs\n"); } cap_parsing_failed = true; - kfree(raw_capacity); + free_raw_capacity(); } return ret; @@ -170,7 +178,7 @@ init_cpu_capacity_callback(struct notifier_block *nb, struct cpufreq_policy *policy = data; int cpu; - if (cap_parsing_failed) + if (!raw_capacity) return 0; if (val != CPUFREQ_NOTIFY) @@ -190,9 +198,8 @@ init_cpu_capacity_callback(struct notifier_block *nb, if (cpumask_empty(cpus_to_visit)) { topology_normalize_cpu_scale(); - kfree(raw_capacity); + free_raw_capacity(); pr_debug("cpu_capacity: 
parsing done\n"); - cap_parsing_failed = true; schedule_work(_done_work); } @@ -232,11 +239,5 @@ static void parsing_done_workfn(struct work_struct *work) } #else -static int __init free_raw_capacity(void) -{ - kfree(raw_capacity); - - return 0; -} core_initcall(free_raw_capacity); #endif -- 2.13.0.71.gd7076ec9c9cb
[PATCH V2 2/5] arch_topology: Don't break lines unnecessarily
There is no need for line breaks at a few places, avoid them.

Signed-off-by: Viresh Kumar

---
 drivers/base/arch_topology.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 8239a6232808..2f1d9921ee54 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -41,8 +41,7 @@ static ssize_t cpu_capacity_show(struct device *dev,
 {
 	struct cpu *cpu = container_of(dev, struct cpu, dev);
 
-	return sprintf(buf, "%lu\n",
-		       topology_get_cpu_scale(NULL, cpu->dev.id));
+	return sprintf(buf, "%lu\n", topology_get_cpu_scale(NULL, cpu->dev.id));
 }
 
 static ssize_t cpu_capacity_store(struct device *dev,
@@ -128,8 +127,7 @@ int __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
 	if (cap_parsing_failed)
 		return !ret;
 
-	ret = of_property_read_u32(cpu_node,
-				   "capacity-dmips-mhz",
+	ret = of_property_read_u32(cpu_node, "capacity-dmips-mhz",
 				   &cpu_capacity);
 	if (!ret) {
 		if (!raw_capacity) {
@@ -180,8 +178,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
 	pr_debug("cpu_capacity: init cpu capacity for CPUs [%*pbl] (to_visit=%*pbl)\n",
 		 cpumask_pr_args(policy->related_cpus),
 		 cpumask_pr_args(cpus_to_visit));
-	cpumask_andnot(cpus_to_visit,
-		       cpus_to_visit,
+	cpumask_andnot(cpus_to_visit, cpus_to_visit,
 		       policy->related_cpus);
 	for_each_cpu(cpu, policy->related_cpus) {
 		raw_capacity[cpu] = topology_get_cpu_scale(NULL, cpu) *
-- 
2.13.0.71.gd7076ec9c9cb
[PATCH V2 3/5] arch_topology: Convert switch block to if block
We only need to take care of one case here (CPUFREQ_NOTIFY) and there is
no need to add an extra level of indentation to the case specific code
by using a switch block. Use an if block instead.

Also add some blank lines to make the code look better.

Signed-off-by: Viresh Kumar

---
 drivers/base/arch_topology.c | 41 ++++++++++++++++++-------------------
 1 file changed, 22 insertions(+), 19 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 2f1d9921ee54..07784ba666a7 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -173,26 +173,29 @@ init_cpu_capacity_callback(struct notifier_block *nb,
 	if (cap_parsing_failed)
 		return 0;
 
-	switch (val) {
-	case CPUFREQ_NOTIFY:
-		pr_debug("cpu_capacity: init cpu capacity for CPUs [%*pbl] (to_visit=%*pbl)\n",
-			 cpumask_pr_args(policy->related_cpus),
-			 cpumask_pr_args(cpus_to_visit));
-		cpumask_andnot(cpus_to_visit, cpus_to_visit,
-			       policy->related_cpus);
-		for_each_cpu(cpu, policy->related_cpus) {
-			raw_capacity[cpu] = topology_get_cpu_scale(NULL, cpu) *
-					    policy->cpuinfo.max_freq / 1000UL;
-			capacity_scale = max(raw_capacity[cpu], capacity_scale);
-		}
-		if (cpumask_empty(cpus_to_visit)) {
-			topology_normalize_cpu_scale();
-			kfree(raw_capacity);
-			pr_debug("cpu_capacity: parsing done\n");
-			cap_parsing_failed = true;
-			schedule_work(&parsing_done_work);
-		}
+	if (val != CPUFREQ_NOTIFY)
+		return 0;
+
+	pr_debug("cpu_capacity: init cpu capacity for CPUs [%*pbl] (to_visit=%*pbl)\n",
+		 cpumask_pr_args(policy->related_cpus),
+		 cpumask_pr_args(cpus_to_visit));
+
+	cpumask_andnot(cpus_to_visit, cpus_to_visit, policy->related_cpus);
+
+	for_each_cpu(cpu, policy->related_cpus) {
+		raw_capacity[cpu] = topology_get_cpu_scale(NULL, cpu) *
+				    policy->cpuinfo.max_freq / 1000UL;
+		capacity_scale = max(raw_capacity[cpu], capacity_scale);
 	}
+
+	if (cpumask_empty(cpus_to_visit)) {
+		topology_normalize_cpu_scale();
+		kfree(raw_capacity);
+		pr_debug("cpu_capacity: parsing done\n");
+		cap_parsing_failed = true;
+		schedule_work(&parsing_done_work);
+	}
+
 	return 0;
 }
-- 
2.13.0.71.gd7076ec9c9cb
[PATCH V2 5/5] arch_topology: Localize cap_parsing_failed to topology_parse_cpu_capacity()
cap_parsing_failed is only required in topology_parse_cpu_capacity() to
know if we have already tried to allocate raw_capacity and failed, or
if at least one of the cpu_node didn't have the required
"capacity-dmips-mhz" property. All other users can use raw_capacity
instead of cap_parsing_failed.

Make sure we set raw_capacity to NULL after we free it.

Signed-off-by: Viresh Kumar

---
 drivers/base/arch_topology.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index ff8713b83090..d8923f89724f 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -95,14 +95,21 @@ subsys_initcall(register_cpu_capacity_sysctl);
 
 static u32 capacity_scale;
 static u32 *raw_capacity;
-static bool cap_parsing_failed;
+
+static int __init free_raw_capacity(void)
+{
+	kfree(raw_capacity);
+	raw_capacity = NULL;
+
+	return 0;
+}
 
 void topology_normalize_cpu_scale(void)
 {
 	u64 capacity;
 	int cpu;
 
-	if (!raw_capacity || cap_parsing_failed)
+	if (!raw_capacity)
 		return;
 
 	pr_debug("cpu_capacity: capacity_scale=%u\n", capacity_scale);
@@ -121,6 +128,7 @@ void topology_normalize_cpu_scale(void)
 
 int __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
 {
+	static bool cap_parsing_failed;
 	int ret;
 	u32 cpu_capacity;
 
@@ -151,7 +159,7 @@ int __init topology_parse_cpu_capacity(struct device_node *cpu_node, int cpu)
 			pr_err("cpu_capacity: partial information: fallback to 1024 for all CPUs\n");
 		}
 		cap_parsing_failed = true;
-		kfree(raw_capacity);
+		free_raw_capacity();
 	}
 
 	return ret;
@@ -170,7 +178,7 @@ init_cpu_capacity_callback(struct notifier_block *nb,
 	struct cpufreq_policy *policy = data;
 	int cpu;
 
-	if (cap_parsing_failed)
+	if (!raw_capacity)
 		return 0;
 
 	if (val != CPUFREQ_NOTIFY)
@@ -190,9 +198,8 @@ init_cpu_capacity_callback(struct notifier_block *nb,
 
 	if (cpumask_empty(cpus_to_visit)) {
 		topology_normalize_cpu_scale();
-		kfree(raw_capacity);
+		free_raw_capacity();
 		pr_debug("cpu_capacity: parsing done\n");
-		cap_parsing_failed = true;
 		schedule_work(&parsing_done_work);
 	}
 
@@ -232,11 +239,5 @@ static void parsing_done_workfn(struct work_struct *work)
 }
 #else
-static int __init free_raw_capacity(void)
-{
-	kfree(raw_capacity);
-
-	return 0;
-}
 core_initcall(free_raw_capacity);
 #endif
-- 
2.13.0.71.gd7076ec9c9cb
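[Editor's note: the free_raw_capacity() helper introduced above relies on a common idiom — resetting the pointer to NULL after freeing it. A userspace sketch of the same pattern (free() here stands in for kfree(), which likewise accepts NULL):]

```c
#include <assert.h>
#include <stdlib.h>

static unsigned int *raw_capacity;

/* free() (like kfree()) tolerates a NULL argument, so resetting the
 * pointer after freeing makes this helper safe to call any number of
 * times, and lets other code use "raw_capacity != NULL" as the
 * "capacity data is valid" test - exactly how the patch replaces the
 * cap_parsing_failed checks. */
static void free_raw_capacity(void)
{
	free(raw_capacity);
	raw_capacity = NULL;
}
```

Without the NULL reset, a second call would be a double free, and the `!raw_capacity` checks elsewhere would read a dangling pointer.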
Re: [PATCH 32/51] rtc: pcap: stop using rtc deprecated functions
Hi Benjamin,

[auto build test ERROR on abelloni/rtc-next]
[also build test ERROR on v4.12-rc6 next-20170620]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Benjamin-Gaignard/rtc-stop-using-rtc-deprecated-functions/20170621-044455
base:   https://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux.git rtc-next
config: i386-randconfig-b0-06210824 (attached as .config)
compiler: gcc-5 (Debian 5.4.1-2) 5.4.1 20160904
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386

All errors (new ones prefixed by >>):

   drivers/built-in.o: In function `pcap_rtc_set_mmss64':
>> rtc-pcap.c:(.text+0x18a7e0): undefined reference to `__moddi3'
>> rtc-pcap.c:(.text+0x18a800): undefined reference to `__divdi3'
   drivers/built-in.o: In function `pcap_rtc_set_alarm':
>> rtc-pcap.c:(.text+0x18a83e): undefined reference to `__umoddi3'
>> rtc-pcap.c:(.text+0x18a85e): undefined reference to `__udivdi3'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

.config.gz
Description: application/gzip
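[Editor's note: the `__moddi3`/`__divdi3`/`__umoddi3`/`__udivdi3` (and, in the cpcap report below, `__aeabi_uldivmod`) link errors are the classic symptom of using the C `/` and `%` operators directly on 64-bit values in 32-bit kernel builds: the compiler emits calls into libgcc helpers that the kernel does not link. In-kernel the fix is the `do_div()`/`div_u64()` family from `<linux/math64.h>`. A userspace sketch of the do_div() contract — quotient written back through the pointer, 32-bit remainder returned — the name is illustrative, not the kernel macro itself:]

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's do_div()/div_u64_rem()
 * semantics: divide the u64 pointed to by n by a u32 base, store the
 * quotient back through n, and return the 32-bit remainder. On 32-bit
 * targets the kernel implements this without libgcc, which is what
 * avoids the undefined __udivdi3/__aeabi_uldivmod references. */
static uint32_t do_div_sketch(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}
```

So e.g. splitting a 64-bit second count into minutes and leftover seconds would go through this helper rather than raw `%` and `/` on the u64.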
Re: [PATCH 13/51] rtc: cpcap: stop using rtc deprecated functions
Hi Benjamin,

[auto build test ERROR on abelloni/rtc-next]
[also build test ERROR on v4.12-rc6 next-20170620]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Benjamin-Gaignard/rtc-stop-using-rtc-deprecated-functions/20170621-044455
base:   https://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux.git rtc-next
config: arm-omap2plus_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://raw.githubusercontent.com/01org/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm

All errors (new ones prefixed by >>):

>> ERROR: "__aeabi_uldivmod" [drivers/rtc/rtc-cpcap.ko] undefined!

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

.config.gz
Description: application/gzip
[PATCH 1/3] ASoC: fsl: mpc5200_dma: remove unused psc_dma
linux/sound/soc/fsl/mpc5200_dma.c:305:18: warning: unused variable 'psc_dma' [-Wunused-variable]

Signed-off-by: Kuninori Morimoto

---
 sound/soc/fsl/mpc5200_dma.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/sound/soc/fsl/mpc5200_dma.c b/sound/soc/fsl/mpc5200_dma.c
index 93885d9..cd8ccaba 100644
--- a/sound/soc/fsl/mpc5200_dma.c
+++ b/sound/soc/fsl/mpc5200_dma.c
@@ -302,7 +302,6 @@ static int psc_dma_new(struct snd_soc_pcm_runtime *rtd)
 	struct snd_card *card = rtd->card->snd_card;
 	struct snd_soc_dai *dai = rtd->cpu_dai;
 	struct snd_pcm *pcm = rtd->pcm;
-	struct psc_dma *psc_dma = snd_soc_dai_get_drvdata(rtd->cpu_dai);
 	size_t size = psc_dma_hardware.buffer_bytes_max;
 	int rc;
-- 
1.9.1
Re: [PATCH] acer-wmi: Using zero as the first WMI instance number
On Tue, Jun 20, 2017 at 02:45:56PM -0700, Darren Hart wrote:
> On Tue, Jun 20, 2017 at 10:46:12PM +0200, Pali Rohár wrote:
> > On Tuesday 20 June 2017 19:22:46 Andy Shevchenko wrote:
> > > On Tue, Jun 20, 2017 at 7:48 PM, Pali Rohár wrote:
> > > > On Tuesday 20 June 2017 17:06:23 Lee, Chun-Yi wrote:
> > > >> Pali Rohár found that there have some wmi query/evaluation
> > > >> code that they used 'one' as the first WMI instance number.
> > > >> But the number is indexed from zero that it's must less than
> > > >> the instance_count in _WDG.
> > > >>
> > > >> This patch changes those instance number from one to zero.
> > > >>
> > > >> Cc: Darren Hart
> > > >> Cc: Andy Shevchenko
> > > >> Cc: Pali Rohár
> > > >> Signed-off-by: "Lee, Chun-Yi"
> > > >
> > > > Looks good,
> > > > Reviewed-by: Pali Rohár
> > >
> > > Unfortunately patchwork ignores this tag.
> > > So, in the future please:
> > > - put a tag on a separate line
> > > - do _not_ prepend it by any characters including white spaces
> > >   (except new line :-) )

I see! I will also follow the rules in the future.

> > I have not known that those lines are parsed automatically.
> > Will do it in future!
>
> :-) Thanks!
>
> For reference, Patchwork tally's such things in the patch list view:
> https://patchwork.kernel.org/project/platform-driver-x86/list/
>
> And, selfishly, the more normalized these are, the less likely Andy and I are to
> make mistakes manipulating them into the patches.
>

Joey Lee
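[Editor's note: the underlying bug discussed above is an off-by-one in the WMI instance index. ACPI-WMI instances are numbered 0 .. instance_count-1 (instance_count comes from the _WDG block), so passing 1 as "the first instance" is out of range whenever a GUID exposes only one instance. A tiny sketch of the zero-based validity check — the function name is illustrative, not a real kernel API:]

```c
#include <assert.h>
#include <stdint.h>

/* WMI instances are zero-indexed: valid instance numbers are
 * 0 .. instance_count - 1, where instance_count comes from the
 * ACPI _WDG data block for the GUID. */
static int wmi_instance_valid(uint8_t instance, uint8_t instance_count)
{
	return instance < instance_count;
}
```

With a single-instance GUID (instance_count == 1), only instance 0 passes this check, which is why the patch changes the query/evaluation calls from one to zero.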