[PATCH] pci: pcie-xilinx: fix a missing-check bug for __get_free_pages
In case __get_free_pages fails, the fix returns to avoid a NULL pointer
dereference.

Signed-off-by: Kangjie Lu
---
 drivers/pci/controller/pcie-xilinx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/controller/pcie-xilinx.c b/drivers/pci/controller/pcie-xilinx.c
index 9bd1a35cd5d8..b7083e995c45 100644
--- a/drivers/pci/controller/pcie-xilinx.c
+++ b/drivers/pci/controller/pcie-xilinx.c
@@ -341,6 +341,9 @@ static void xilinx_pcie_enable_msi(struct xilinx_pcie_port *port)
 	phys_addr_t msg_addr;
 
 	port->msi_pages = __get_free_pages(GFP_KERNEL, 0);
+	if (unlikely(!port->msi_pages))
+		return;
+
 	msg_addr = virt_to_phys((void *)port->msi_pages);
 	pcie_write(port, 0x0, XILINX_PCIE_REG_MSIBASE1);
 	pcie_write(port, msg_addr, XILINX_PCIE_REG_MSIBASE2);
-- 
2.17.1
[PATCH] pci: pci-tegra: fix a potential NULL pointer dereference
In case __get_free_pages fails and returns NULL, the fix returns -ENOMEM
and releases resources to avoid a NULL pointer dereference.

Signed-off-by: Kangjie Lu
---
 drivers/pci/controller/pci-tegra.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/pci/controller/pci-tegra.c b/drivers/pci/controller/pci-tegra.c
index f4f53d092e00..0bdc6ee904f3 100644
--- a/drivers/pci/controller/pci-tegra.c
+++ b/drivers/pci/controller/pci-tegra.c
@@ -1550,6 +1550,12 @@ static int tegra_pcie_msi_setup(struct tegra_pcie *pcie)
 	/* setup AFI/FPCI range */
 	msi->pages = __get_free_pages(GFP_KERNEL, 0);
+	if (!msi->pages) {
+		dev_err(dev, "failed to get free pages\n");
+		err = -ENOMEM;
+		goto err;
+	}
+
 	msi->phys = virt_to_phys((void *)msi->pages);
 	host->msi = &msi->chip;
-- 
2.17.1
[PATCH] hyperv: fix a potential NULL pointer dereference
In case alloc_page fails, the fix returns -ENOMEM to avoid the potential
NULL pointer dereference.

Signed-off-by: Kangjie Lu
---
 arch/x86/hyperv/hv_init.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 7abb09e2eeb8..dfdb4ce1ae9c 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -102,9 +102,13 @@ static int hv_cpu_init(unsigned int cpu)
 	u64 msr_vp_index;
 	struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
 	void **input_arg;
+	struct page *pg;
 
 	input_arg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
-	*input_arg = page_address(alloc_page(GFP_KERNEL));
+	pg = alloc_page(GFP_KERNEL);
+	if (unlikely(!pg))
+		return -ENOMEM;
+	*input_arg = page_address(pg);
 
 	hv_get_vp_index(msr_vp_index);
-- 
2.17.1
Re: [PATCH V3 07/10] mmc: tegra: add Tegra186 WAR for CQE
On 3/14/2019 3:15 AM, Sowjanya Komatineni wrote:

Tegra186 CQHCI host has a known bug where the CQHCI controller sets the
DATA_PRESENT_SELECT bit to 1 for DCMDs with an R1b response type. Since
a DCMD does not trigger any data transfer, DCMD task completion happens,
leaving the data FSM of the host controller waiting for data. This
affects the data transfer tasks issued after DCMDs with an R1b response
type, resulting in a timeout.

The SW WAR is to set CMD_TIMING to 1 in the DCMD task descriptor. This
bug and the SW WAR are applicable only to Tegra186, not Tegra194.

This patch implements the WAR through NVQUIRK_CQHCI_DCMD_R1B_CMD_TIMING
for Tegra186, and also implements the update_dcmd_desc hook of the
cqhci_host_ops interface to set the CMD_TIMING bit depending on the
NVQUIRK.

Tested-by: Jon Hunter
Signed-off-by: Sowjanya Komatineni

Thanks,
Reviewed-by: Ritesh Harjani

---
 drivers/mmc/host/sdhci-tegra.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/host/sdhci-tegra.c b/drivers/mmc/host/sdhci-tegra.c
index f1aa0591112a..2f08b6e480df 100644
--- a/drivers/mmc/host/sdhci-tegra.c
+++ b/drivers/mmc/host/sdhci-tegra.c
@@ -106,6 +106,7 @@
 #define NVQUIRK_HAS_PADCALIB			BIT(6)
 #define NVQUIRK_NEEDS_PAD_CONTROL		BIT(7)
 #define NVQUIRK_DIS_CARD_CLK_CONFIG_TAP		BIT(8)
+#define NVQUIRK_CQHCI_DCMD_R1B_CMD_TIMING	BIT(9)
 
 /* SDMMC CQE Base Address for Tegra Host Ver 4.1 and Higher */
 #define SDHCI_TEGRA_CQE_BASE_ADDR			0xF000
@@ -1123,6 +1124,18 @@ static void tegra_sdhci_voltage_switch(struct sdhci_host *host)
 		tegra_host->pad_calib_required = true;
 }
 
+static void sdhci_tegra_update_dcmd_desc(struct mmc_host *mmc,
+					 struct mmc_request *mrq, u64 *data)
+{
+	struct sdhci_pltfm_host *pltfm_host = sdhci_priv(mmc_priv(mmc));
+	struct sdhci_tegra *tegra_host = sdhci_pltfm_priv(pltfm_host);
+	const struct sdhci_tegra_soc_data *soc_data = tegra_host->soc_data;
+
+	if (soc_data->nvquirks & NVQUIRK_CQHCI_DCMD_R1B_CMD_TIMING &&
+	    mrq->cmd->flags & MMC_RSP_R1B)
+		*data |= CQHCI_CMD_TIMING(1);
+}
+
 static void sdhci_tegra_cqe_enable(struct mmc_host *mmc)
 {
 	struct cqhci_host *cq_host = mmc->cqe_private;
@@ -1164,6 +1177,7 @@ static const struct cqhci_host_ops sdhci_tegra_cqhci_ops = {
 	.enable	= sdhci_tegra_cqe_enable,
 	.disable = sdhci_cqe_disable,
 	.dumpregs = sdhci_tegra_dumpregs,
+	.update_dcmd_desc = sdhci_tegra_update_dcmd_desc,
 };
 
 static const struct sdhci_ops tegra_sdhci_ops = {
@@ -1345,7 +1359,8 @@ static const struct sdhci_tegra_soc_data soc_data_tegra186 = {
 		    NVQUIRK_HAS_PADCALIB |
 		    NVQUIRK_DIS_CARD_CLK_CONFIG_TAP |
 		    NVQUIRK_ENABLE_SDR50 |
-		    NVQUIRK_ENABLE_SDR104,
+		    NVQUIRK_ENABLE_SDR104 |
+		    NVQUIRK_CQHCI_DCMD_R1B_CMD_TIMING,
 	.min_tap_delay = 84,
 	.max_tap_delay = 136,
 };
[PATCH] infiniband: i40iw: fix potential NULL pointer dereferences
alloc_ordered_workqueue may fail and return NULL. The fix captures the
failure and handles it properly to avoid potential NULL pointer
dereferences.

Signed-off-by: Kangjie Lu
---
V2: add return value to capture the error code

 drivers/infiniband/hw/i40iw/i40iw.h      |  2 +-
 drivers/infiniband/hw/i40iw/i40iw_cm.c   | 19 +++++++++++++++---
 drivers/infiniband/hw/i40iw/i40iw_main.c |  5 ++++-
 3 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/i40iw/i40iw.h b/drivers/infiniband/hw/i40iw/i40iw.h
index 2f2b4426ded7..8feec35f95a7 100644
--- a/drivers/infiniband/hw/i40iw/i40iw.h
+++ b/drivers/infiniband/hw/i40iw/i40iw.h
@@ -552,7 +552,7 @@ enum i40iw_status_code i40iw_obj_aligned_mem(struct i40iw_device *iwdev,
 void i40iw_request_reset(struct i40iw_device *iwdev);
 void i40iw_destroy_rdma_device(struct i40iw_ib_device *iwibdev);
-void i40iw_setup_cm_core(struct i40iw_device *iwdev);
+int i40iw_setup_cm_core(struct i40iw_device *iwdev);
 void i40iw_cleanup_cm_core(struct i40iw_cm_core *cm_core);
 void i40iw_process_ceq(struct i40iw_device *, struct i40iw_ceq *iwceq);
 void i40iw_process_aeq(struct i40iw_device *);
diff --git a/drivers/infiniband/hw/i40iw/i40iw_cm.c b/drivers/infiniband/hw/i40iw/i40iw_cm.c
index 206cfb0016f8..dda24f44239b 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_cm.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_cm.c
@@ -3237,7 +3237,7 @@ void i40iw_receive_ilq(struct i40iw_sc_vsi *vsi, struct i40iw_puda_buf *rbuf)
  * core
  * @iwdev: iwarp device structure
  */
-void i40iw_setup_cm_core(struct i40iw_device *iwdev)
+int i40iw_setup_cm_core(struct i40iw_device *iwdev)
 {
 	struct i40iw_cm_core *cm_core = &iwdev->cm_core;
@@ -3256,9 +3256,20 @@ void i40iw_setup_cm_core(struct i40iw_device *iwdev)
 
 	cm_core->event_wq = alloc_ordered_workqueue("iwewq",
 						    WQ_MEM_RECLAIM);
+	if (!cm_core->event_wq)
+		goto error;
 
 	cm_core->disconn_wq = alloc_ordered_workqueue("iwdwq",
 						      WQ_MEM_RECLAIM);
+	if (!cm_core->disconn_wq)
+		goto error;
+
+	return 0;
+error:
+	i40iw_cleanup_cm_core(&iwdev->cm_core);
+	i40iw_pr_err("fail to setup CM core");
+
+	return -ENOMEM;
 }
 
 /**
@@ -3278,8 +3289,10 @@ void i40iw_cleanup_cm_core(struct i40iw_cm_core *cm_core)
 		del_timer_sync(&cm_core->tcp_timer);
 	spin_unlock_irqrestore(&cm_core->ht_lock, flags);
 
-	destroy_workqueue(cm_core->event_wq);
-	destroy_workqueue(cm_core->disconn_wq);
+	if (cm_core->event_wq)
+		destroy_workqueue(cm_core->event_wq);
+	if (cm_core->disconn_wq)
+		destroy_workqueue(cm_core->disconn_wq);
 }
 
 /**
diff --git a/drivers/infiniband/hw/i40iw/i40iw_main.c b/drivers/infiniband/hw/i40iw/i40iw_main.c
index 68095f00d08f..10932baee279 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_main.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_main.c
@@ -1641,7 +1641,10 @@ static int i40iw_open(struct i40e_info *ldev, struct i40e_client *client)
 	iwdev = &hdl->device;
 	iwdev->hdl = hdl;
 	dev = &iwdev->sc_dev;
-	i40iw_setup_cm_core(iwdev);
+	if (i40iw_setup_cm_core(iwdev)) {
+		kfree(iwdev->hdl);
+		return -ENOMEM;
+	}
 
 	dev->back_dev = (void *)iwdev;
 	iwdev->ldev = &hdl->ldev;
-- 
2.17.1
Re: [PATCH V3 06/10] mmc: cqhci: allow hosts to update dcmd cmd desc
On 3/14/2019 3:15 AM, Sowjanya Komatineni wrote:

This patch adds an update_dcmd_desc interface to cqhci_host_ops to allow
hosts to update any of the DCMD task descriptor attributes and
parameters.

Tested-by: Jon Hunter
Signed-off-by: Sowjanya Komatineni

Thanks,
Reviewed-by: Ritesh Harjani

---
 drivers/mmc/host/cqhci.c | 2 ++
 drivers/mmc/host/cqhci.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/drivers/mmc/host/cqhci.c b/drivers/mmc/host/cqhci.c
index a8af682a9182..d59cb0a51964 100644
--- a/drivers/mmc/host/cqhci.c
+++ b/drivers/mmc/host/cqhci.c
@@ -537,6 +537,8 @@ static void cqhci_prep_dcmd_desc(struct mmc_host *mmc,
 		 CQHCI_ACT(0x5) |
 		 CQHCI_CMD_INDEX(mrq->cmd->opcode) |
 		 CQHCI_CMD_TIMING(timing) | CQHCI_RESP_TYPE(resp_type));
+	if (cq_host->ops->update_dcmd_desc)
+		cq_host->ops->update_dcmd_desc(mmc, mrq, &data);
 	*task_desc |= data;
 	desc = (u8 *)task_desc;
 	pr_debug("%s: cqhci: dcmd: cmd: %d timing: %d resp: %d\n",
diff --git a/drivers/mmc/host/cqhci.h b/drivers/mmc/host/cqhci.h
index 9e68286a07b4..8c8ec6f01c45 100644
--- a/drivers/mmc/host/cqhci.h
+++ b/drivers/mmc/host/cqhci.h
@@ -210,6 +210,8 @@ struct cqhci_host_ops {
 	u32 (*read_l)(struct cqhci_host *host, int reg);
 	void (*enable)(struct mmc_host *mmc);
 	void (*disable)(struct mmc_host *mmc, bool recovery);
+	void (*update_dcmd_desc)(struct mmc_host *mmc, struct mmc_request *mrq,
+				 u64 *data);
 };
 
 static inline void cqhci_writel(struct cqhci_host *host, u32 val, int reg)
[PATCH v3 7/7] mm: Remove stale comment from page struct
We now use the slab_list list_head instead of the lru list_head. This
comment has become stale.

Remove the stale comment from the page struct slab_list list_head.

Reviewed-by: Roman Gushchin
Acked-by: Christoph Lameter
Signed-off-by: Tobin C. Harding
---
 include/linux/mm_types.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7eade9132f02..63a34e3d7c29 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -103,7 +103,7 @@ struct page {
 		};
 		struct {	/* slab, slob and slub */
 			union {
-				struct list_head slab_list;	/* uses lru */
+				struct list_head slab_list;
 				struct {	/* Partial pages */
 					struct page *next;
 #ifdef CONFIG_64BIT
-- 
2.21.0
[PATCH v3 4/7] slub: Add comments to endif pre-processor macros
SLUB allocator makes heavy use of ifdef/endif pre-processor macros. The
pairing of these statements is at times hard to follow, e.g. if the pair
are further than a screen apart or if there are nested pairs. We can
reduce cognitive load by adding a comment to the endif statement of the
form:

	#ifdef CONFIG_FOO
	...
	#endif /* CONFIG_FOO */

Add comments to endif pre-processor macros if the ifdef/endif pair is
not immediately apparent.

Reviewed-by: Roman Gushchin
Acked-by: Christoph Lameter
Signed-off-by: Tobin C. Harding
---
 mm/slub.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 1b08fbcb7e61..b282e22885cd 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1951,7 +1951,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 			}
 		}
 	} while (read_mems_allowed_retry(cpuset_mems_cookie));
-#endif
+#endif	/* CONFIG_NUMA */
 	return NULL;
 }
 
@@ -2249,7 +2249,7 @@ static void unfreeze_partials(struct kmem_cache *s,
 		discard_slab(s, page);
 		stat(s, FREE_SLAB);
 	}
-#endif
+#endif	/* CONFIG_SLUB_CPU_PARTIAL */
 }
 
 /*
@@ -2308,7 +2308,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 		local_irq_restore(flags);
 	}
 	preempt_enable();
-#endif
+#endif	/* CONFIG_SLUB_CPU_PARTIAL */
 }
 
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
@@ -2813,7 +2813,7 @@ void *kmem_cache_alloc_node_trace(struct kmem_cache *s,
 }
 EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
 #endif
-#endif
+#endif	/* CONFIG_NUMA */
 
 /*
  * Slow path handling. This may still be called frequently since objects
@@ -3845,7 +3845,7 @@ void *__kmalloc_node(size_t size, gfp_t flags, int node)
 	return ret;
 }
 EXPORT_SYMBOL(__kmalloc_node);
-#endif
+#endif	/* CONFIG_NUMA */
 
 #ifdef CONFIG_HARDENED_USERCOPY
 /*
@@ -4063,7 +4063,7 @@ void __kmemcg_cache_deactivate(struct kmem_cache *s)
 	 */
 	slab_deactivate_memcg_cache_rcu_sched(s, kmemcg_cache_deact_after_rcu);
 }
-#endif
+#endif	/* CONFIG_MEMCG */
 
 static int slab_mem_going_offline_callback(void *arg)
 {
@@ -4696,7 +4696,7 @@ static int list_locations(struct kmem_cache *s, char *buf,
 		len += sprintf(buf, "No data\n");
 	return len;
 }
-#endif
+#endif	/* CONFIG_SLUB_DEBUG */
 
 #ifdef SLUB_RESILIENCY_TEST
 static void __init resiliency_test(void)
@@ -4756,7 +4756,7 @@ static void __init resiliency_test(void)
 #ifdef CONFIG_SYSFS
 static void resiliency_test(void) {};
 #endif
-#endif
+#endif	/* SLUB_RESILIENCY_TEST */
 
 #ifdef CONFIG_SYSFS
 enum slab_stat_type {
@@ -5413,7 +5413,7 @@ STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
 STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
 STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
 STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
-#endif
+#endif	/* CONFIG_SLUB_STATS */
 
 static struct attribute *slab_attrs[] = {
 	&slab_size_attr.attr,
@@ -5614,7 +5614,7 @@ static void memcg_propagate_slab_attrs(struct kmem_cache *s)
 
 	if (buffer)
 		free_page((unsigned long)buffer);
-#endif
+#endif	/* CONFIG_MEMCG */
 }
 
 static void kmem_cache_release(struct kobject *k)
-- 
2.21.0
[PATCH v3 0/7] mm: Use slab_list list_head instead of lru
Currently the slab allocators (ab)use the struct page 'lru' list_head.
We have a list head for slab allocators to use, 'slab_list'.

During v2 it was noted by Christoph that the SLOB allocator was reaching
into a list_head; this version adds 2 patches to the front of the set to
fix that.

Clean up all three allocators by using the 'slab_list' list_head instead
of overloading the 'lru' list_head.

Patch 1 - Adds a function to rotate a list to a specified entry.

Patch 2 - Removes the code that reaches into list_head and instead uses
          the list_head API, including the newly defined function.

Patches 3-7 are unchanged since v2 (modulo renumbering):

Patch 3 (v2: patch 4) - Changes the SLOB allocator to use slab_list
                        instead of lru.

Patch 4 (v2: patch 1) - Makes no code changes, adds comments to #endif
                        statements.

Patch 5 (v2: patch 2) - Use slab_list instead of lru for SLUB allocator.

Patch 6 (v2: patch 3) - Use slab_list instead of lru for SLAB allocator.

Patch 7 (v2: patch 5) - Removes the now stale comment in the page struct
                        definition.

During v2 development, patches were checked to see if the object file
before and after was identical. Clearly this will no longer be possible
for mm/slob.o; however, this work is still of use to validate the change
from lru -> slab_list.

Patch 1 was tested with a module (creates and populates a list then
calls list_rotate_to_front() and verifies the new order):

	https://github.com/tcharding/ktest/tree/master/list_head

Patch 2 was tested with another module that does some basic slab
allocation and freeing to a newly created slab cache:

	https://github.com/tcharding/ktest/tree/master/slab

Tested on a kernel with this in the config:

	CONFIG_SLOB=y
	CONFIG_SLAB_MERGE_DEFAULT=y

Changes since v2:
 - Add list_rotate_to_front().
 - Fix slob to use list_head API.
 - Re-order patches to put the list.h changes up front.
 - Add acks from Christoph.

Changes since v1:
 - Verify object files are the same before and after the patch set is
   applied (suggested by Matthew).
 - Add extra explanation to the commit logs explaining why these changes
   are safe to make (suggested by Roman).
 - Remove stale comment (thanks Willy).

thanks,
Tobin.

Tobin C. Harding (7):
  list: Add function list_rotate_to_front()
  slob: Respect list_head abstraction layer
  slob: Use slab_list instead of lru
  slub: Add comments to endif pre-processor macros
  slub: Use slab_list instead of lru
  slab: Use slab_list instead of lru
  mm: Remove stale comment from page struct

 include/linux/list.h     | 18
 include/linux/mm_types.h |  2 +-
 mm/slab.c                | 49
 mm/slob.c                | 32 +
 mm/slub.c                | 60
 5 files changed, 94 insertions(+), 67 deletions(-)
-- 
2.21.0
[PATCH v3 1/7] list: Add function list_rotate_to_front()
Currently if we wish to rotate a list until a specific item is at the
front of the list we can call list_move_tail(head, list). Note that the
arguments are the reverse way to the usual use of list_move_tail(list,
head). This is a hack: it depends on the developer knowing how the
list_head operates internally, which violates the layer of abstraction
offered by the list_head. Also, it is not intuitive, so the next
developer to come along must study list.h in order to fully understand
what is meant by the call; while this is 'good for' the developer, it
makes reading the code harder. We should have a function, appropriately
named, that does this if there are users for it in tree.

By grep'ing the tree for list_move_tail() and list_tail() and attempting
to guess the argument order from the names, it seems there is only one
place currently in the tree that does this - the slob allocator.

Add function list_rotate_to_front() to rotate a list until the specified
item is at the front of the list.

Signed-off-by: Tobin C. Harding
---
 include/linux/list.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/include/linux/list.h b/include/linux/list.h
index 79626b5ab36c..8ead813e7f1c 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -270,6 +270,24 @@ static inline void list_rotate_left(struct list_head *head)
 	}
 }
 
+/**
+ * list_rotate_to_front() - Rotate list to specific item.
+ * @list: The desired new front of the list.
+ * @head: The head of the list.
+ *
+ * Rotates list so that @list becomes the new front of the list.
+ */
+static inline void list_rotate_to_front(struct list_head *list,
+					struct list_head *head)
+{
+	/*
+	 * Deletes the list head from the list denoted by @head and
+	 * places it as the tail of @list, this effectively rotates the
+	 * list so that @list is at the front.
+	 */
+	list_move_tail(head, list);
+}
+
 /**
  * list_is_singular - tests whether a list has just one entry.
  * @head: the list to test.
-- 
2.21.0
[PATCH v3 5/7] slub: Use slab_list instead of lru
Currently we use the page->lru list for maintaining lists of slabs. We
have a list in the page structure (slab_list) that can be used for this
purpose. Doing so makes the code cleaner since we are not overloading
the lru list.

Use the slab_list instead of the lru list for maintaining lists of
slabs.

Reviewed-by: Roman Gushchin
Acked-by: Christoph Lameter
Signed-off-by: Tobin C. Harding
---
 mm/slub.c | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index b282e22885cd..d692b5e0163d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1023,7 +1023,7 @@ static void add_full(struct kmem_cache *s,
 		return;
 
 	lockdep_assert_held(&n->list_lock);
-	list_add(&page->lru, &n->full);
+	list_add(&page->slab_list, &n->full);
 }
 
 static void remove_full(struct kmem_cache *s, struct kmem_cache_node *n, struct page *page)
@@ -1032,7 +1032,7 @@ static void remove_full(struct kmem_cache *s, struct kmem_cache_node *n, struct
 		return;
 
 	lockdep_assert_held(&n->list_lock);
-	list_del(&page->lru);
+	list_del(&page->slab_list);
 }
 
 /* Tracking of the number of slabs for debugging purposes */
@@ -1773,9 +1773,9 @@ __add_partial(struct kmem_cache_node *n, struct page *page, int tail)
 {
 	n->nr_partial++;
 	if (tail == DEACTIVATE_TO_TAIL)
-		list_add_tail(&page->lru, &n->partial);
+		list_add_tail(&page->slab_list, &n->partial);
 	else
-		list_add(&page->lru, &n->partial);
+		list_add(&page->slab_list, &n->partial);
 }
 
 static inline void add_partial(struct kmem_cache_node *n,
@@ -1789,7 +1789,7 @@ static inline void
 remove_partial(struct kmem_cache_node *n, struct page *page)
 {
 	lockdep_assert_held(&n->list_lock);
-	list_del(&page->lru);
+	list_del(&page->slab_list);
 	n->nr_partial--;
 }
 
@@ -1863,7 +1863,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 		return NULL;
 
 	spin_lock(&n->list_lock);
-	list_for_each_entry_safe(page, page2, &n->partial, lru) {
+	list_for_each_entry_safe(page, page2, &n->partial, slab_list) {
 		void *t;
 
 		if (!pfmemalloc_match(page, flags))
@@ -2407,7 +2407,7 @@ static unsigned long count_partial(struct kmem_cache_node *n,
 	struct page *page;
 
 	spin_lock_irqsave(&n->list_lock, flags);
-	list_for_each_entry(page, &n->partial, lru)
+	list_for_each_entry(page, &n->partial, slab_list)
 		x += get_count(page);
 	spin_unlock_irqrestore(&n->list_lock, flags);
 	return x;
@@ -3702,10 +3702,10 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
 
 	BUG_ON(irqs_disabled());
 	spin_lock_irq(&n->list_lock);
-	list_for_each_entry_safe(page, h, &n->partial, lru) {
+	list_for_each_entry_safe(page, h, &n->partial, slab_list) {
 		if (!page->inuse) {
 			remove_partial(n, page);
-			list_add(&page->lru, &discard);
+			list_add(&page->slab_list, &discard);
 		} else {
 			list_slab_objects(s, page,
 			"Objects remaining in %s on __kmem_cache_shutdown()");
@@ -3713,7 +3713,7 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
 	}
 	spin_unlock_irq(&n->list_lock);
 
-	list_for_each_entry_safe(page, h, &discard, lru)
+	list_for_each_entry_safe(page, h, &discard, slab_list)
 		discard_slab(s, page);
 }
 
@@ -3993,7 +3993,7 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 		 * Note that concurrent frees may occur while we hold the
 		 * list_lock. page->inuse here is the upper limit.
 		 */
-		list_for_each_entry_safe(page, t, &n->partial, lru) {
+		list_for_each_entry_safe(page, t, &n->partial, slab_list) {
 			int free = page->objects - page->inuse;
 
 			/* Do not reread page->inuse */
@@ -4003,10 +4003,10 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 			BUG_ON(free <= 0);
 
 			if (free == page->objects) {
-				list_move(&page->lru, &discard);
+				list_move(&page->slab_list, &discard);
 				n->nr_partial--;
 			} else if (free <= SHRINK_PROMOTE_MAX)
-				list_move(&page->lru, promote + free - 1);
+				list_move(&page->slab_list, promote + free - 1);
 		}
 
 		/*
@@ -4019,7 +4019,7 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 		spin_unlock_irqrestore(&n->list_lock, flags);
 
 		/* Release empty slabs */
-		list_for_each_entry_safe(page, t, &discard, lru)
+		list_for_each_entry_safe(page, t, &discard, slab_list)
 			discard_slab(s, page);
 
 		if
[PATCH v3 6/7] slab: Use slab_list instead of lru
Currently we use the page->lru list for maintaining lists of slabs. We
have a list in the page structure (slab_list) that can be used for this
purpose. Doing so makes the code cleaner since we are not overloading
the lru list.

Use the slab_list instead of the lru list for maintaining lists of
slabs.

Reviewed-by: Roman Gushchin
Signed-off-by: Tobin C. Harding
---
 mm/slab.c | 49 +++++++++++++++++++++++++------------------------
 1 file changed, 25 insertions(+), 24 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 28652e4218e0..09cc64ef9613 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1710,8 +1710,8 @@ static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list)
 {
 	struct page *page, *n;
 
-	list_for_each_entry_safe(page, n, list, lru) {
-		list_del(&page->lru);
+	list_for_each_entry_safe(page, n, list, slab_list) {
+		list_del(&page->slab_list);
 		slab_destroy(cachep, page);
 	}
 }
@@ -2265,8 +2265,8 @@ static int drain_freelist(struct kmem_cache *cache,
 		goto out;
 	}
 
-	page = list_entry(p, struct page, lru);
-	list_del(&page->lru);
+	page = list_entry(p, struct page, slab_list);
+	list_del(&page->slab_list);
 	n->free_slabs--;
 	n->total_slabs--;
 	/*
@@ -2726,13 +2726,13 @@ static void cache_grow_end(struct kmem_cache *cachep, struct page *page)
 	if (!page)
 		return;
 
-	INIT_LIST_HEAD(&page->lru);
+	INIT_LIST_HEAD(&page->slab_list);
 	n = get_node(cachep, page_to_nid(page));
 
 	spin_lock(&n->list_lock);
 	n->total_slabs++;
 	if (!page->active) {
-		list_add_tail(&page->lru, &(n->slabs_free));
+		list_add_tail(&page->slab_list, &n->slabs_free);
 		n->free_slabs++;
 	} else
 		fixup_slab_list(cachep, n, page, &list);
@@ -2841,9 +2841,9 @@ static inline void fixup_slab_list(struct kmem_cache *cachep,
 				   void **list)
 {
 	/* move slabp to correct slabp list: */
-	list_del(&page->lru);
+	list_del(&page->slab_list);
 	if (page->active == cachep->num) {
-		list_add(&page->lru, &n->slabs_full);
+		list_add(&page->slab_list, &n->slabs_full);
 		if (OBJFREELIST_SLAB(cachep)) {
 #if DEBUG
 			/* Poisoning will be done without holding the lock */
@@ -2857,7 +2857,7 @@ static inline void fixup_slab_list(struct kmem_cache *cachep,
 			page->freelist = NULL;
 		}
 	} else
-		list_add(&page->lru, &n->slabs_partial);
+		list_add(&page->slab_list, &n->slabs_partial);
 }
 
 /* Try to find non-pfmemalloc slab if needed */
@@ -2880,20 +2880,20 @@ static noinline struct page *get_valid_first_slab(struct kmem_cache_node *n,
 	}
 
 	/* Move pfmemalloc slab to the end of list to speed up next search */
-	list_del(&page->lru);
+	list_del(&page->slab_list);
 	if (!page->active) {
-		list_add_tail(&page->lru, &n->slabs_free);
+		list_add_tail(&page->slab_list, &n->slabs_free);
 		n->free_slabs++;
 	} else
-		list_add_tail(&page->lru, &n->slabs_partial);
+		list_add_tail(&page->slab_list, &n->slabs_partial);
 
-	list_for_each_entry(page, &n->slabs_partial, lru) {
+	list_for_each_entry(page, &n->slabs_partial, slab_list) {
 		if (!PageSlabPfmemalloc(page))
 			return page;
 	}
 
 	n->free_touched = 1;
-	list_for_each_entry(page, &n->slabs_free, lru) {
+	list_for_each_entry(page, &n->slabs_free, slab_list) {
 		if (!PageSlabPfmemalloc(page)) {
 			n->free_slabs--;
 			return page;
@@ -2908,11 +2908,12 @@ static struct page *get_first_slab(struct kmem_cache_node *n, bool pfmemalloc)
 	struct page *page;
 
 	assert_spin_locked(&n->list_lock);
-	page = list_first_entry_or_null(&n->slabs_partial, struct page, lru);
+	page = list_first_entry_or_null(&n->slabs_partial, struct page,
+					slab_list);
 	if (!page) {
 		n->free_touched = 1;
 		page = list_first_entry_or_null(&n->slabs_free, struct page,
-						lru);
+						slab_list);
 		if (page)
 			n->free_slabs--;
 	}
@@ -3413,29 +3414,29 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
 		objp = objpp[i];
 
 		page = virt_to_head_page(objp);
-		list_del(&page->lru);
+		list_del(&page->slab_list);
 		check_spinlock_acquired_node(cachep, node);
 		slab_put_obj(cachep, page, objp);
 		STATS_DEC_ACTIVE(cachep);
 
 		/* fixup slab chains */
 		if (page->active == 0) {
-
[PATCH v3 3/7] slob: Use slab_list instead of lru
Currently we use the page->lru list for maintaining lists of slabs. We
have a list_head in the page structure (slab_list) that can be used for
this purpose. Doing so makes the code cleaner since we are not
overloading the lru list.

The slab_list is part of a union within the page struct (included here
stripped down):

	union {
		struct {	/* Page cache and anonymous pages */
			struct list_head lru;
			...
		};
		struct {
			dma_addr_t dma_addr;
		};
		struct {	/* slab, slob and slub */
			union {
				struct list_head slab_list;
				struct {	/* Partial pages */
					struct page *next;
					int pages;	/* Nr of pages left */
					int pobjects;	/* Approximate count */
				};
			};
		...

Here we see that slab_list and lru are the same bits. We can verify that
this change is safe to do by examining the object file produced from
slob.c before and after this patch is applied.

Steps taken to verify:

	1. checkout current tip of Linus' tree

	   commit a667cb7a94d4 ("Merge branch 'akpm' (patches from Andrew)")

	2. configure and build (select SLOB allocator)

	   CONFIG_SLOB=y
	   CONFIG_SLAB_MERGE_DEFAULT=y

	3. disassemble object file `objdump -dr mm/slob.o > before.s`
	4. apply patch
	5. build
	6. disassemble object file `objdump -dr mm/slob.o > after.s`
	7. diff before.s after.s

Use slab_list list_head instead of the lru list_head for maintaining
lists of slabs.

Reviewed-by: Roman Gushchin
Signed-off-by: Tobin C. Harding
---
 mm/slob.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/slob.c b/mm/slob.c
index 39ad9217ffea..94486c32e0ff 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -112,13 +112,13 @@ static inline int slob_page_free(struct page *sp)
 
 static void set_slob_page_free(struct page *sp, struct list_head *list)
 {
-	list_add(&sp->lru, list);
+	list_add(&sp->slab_list, list);
 	__SetPageSlobFree(sp);
 }
 
 static inline void clear_slob_page_free(struct page *sp)
 {
-	list_del(&sp->lru);
+	list_del(&sp->slab_list);
 	__ClearPageSlobFree(sp);
 }
 
@@ -282,7 +282,7 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
 	spin_lock_irqsave(&slob_lock, flags);
 	/* Iterate through each partially free page, try to find room */
-	list_for_each_entry(sp, slob_list, lru) {
+	list_for_each_entry(sp, slob_list, slab_list) {
 #ifdef CONFIG_NUMA
 		/*
 		 * If there's a node specification, search for a partial
@@ -331,7 +331,7 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
 	spin_lock_irqsave(&slob_lock, flags);
 	sp->units = SLOB_UNITS(PAGE_SIZE);
 	sp->freelist = b;
-	INIT_LIST_HEAD(&sp->lru);
+	INIT_LIST_HEAD(&sp->slab_list);
 	set_slob(b, SLOB_UNITS(PAGE_SIZE), b + SLOB_UNITS(PAGE_SIZE));
 	set_slob_page_free(sp, slob_list);
 	b = slob_page_alloc(sp, size, align);
-- 
2.21.0
[PATCH v3 2/7] slob: Respect list_head abstraction layer
Currently we reach inside the list_head. This is a violation of the
layer of abstraction provided by the list_head. It makes the code
fragile. More importantly it makes the code wicked hard to understand.

The code logic is based on the page in which an allocation was made; we
want to modify the slob_list we are working on to have this page at the
front. We already have a function to check if an entry is at the front
of the list. Recently a function was added to list.h to do the list
rotation. We can use these two functions to reduce line count, reduce
code fragility, and reduce cognitive load required to read the code.

Use list_head functions to interact with lists, thereby maintaining the
abstraction provided by the list_head structure.

Signed-off-by: Tobin C. Harding
---
I verified the comment pointing to Knuth; the page number may be out of
date, but with this comment I was able to find the text that discusses
this. Left the comment as is (after fixing style).

 mm/slob.c | 24 ++++++++++++++----------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/mm/slob.c b/mm/slob.c
index 307c2c9feb44..39ad9217ffea 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -268,8 +268,7 @@ static void *slob_page_alloc(struct page *sp, size_t size, int align)
  */
 static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
 {
-	struct page *sp;
-	struct list_head *prev;
+	struct page *sp, *prev, *next;
 	struct list_head *slob_list;
 	slob_t *b = NULL;
 	unsigned long flags;
@@ -296,18 +295,27 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
 		if (sp->units < SLOB_UNITS(size))
 			continue;
 
+		/*
+		 * Cache previous entry because slob_page_alloc() may
+		 * remove sp from slob_list.
+		 */
+		prev = list_prev_entry(sp, lru);
+
 		/* Attempt to alloc */
-		prev = sp->lru.prev;
 		b = slob_page_alloc(sp, size, align);
 		if (!b)
 			continue;
 
-		/* Improve fragment distribution and reduce our average
+		next = list_next_entry(prev, lru); /* This may or may not be sp */
+
+		/*
+		 * Improve fragment distribution and reduce our average
 		 * search time by starting our next search here. (see
-		 * Knuth vol 1, sec 2.5, pg 449) */
-		if (prev != slob_list->prev &&
-		    slob_list->next != prev->next)
-			list_move_tail(slob_list, prev->next);
+		 * Knuth vol 1, sec 2.5, pg 449)
+		 */
+		if (!list_is_first(&next->lru, slob_list))
+			list_rotate_to_front(&next->lru, slob_list);
+
 		break;
 	}
 	spin_unlock_irqrestore(&slob_lock, flags);
-- 
2.21.0
Re: [RFC][PATCH 00/16] sched: Core scheduling
On Thu, Mar 14, 2019 at 8:35 AM Tim Chen wrote:
> >>
> >> One more NULL pointer dereference:
> >>
> >> Mar 12 02:24:46 aubrey-ivb kernel: [ 201.916741] core sched enabled
> >> [ 201.950203] BUG: unable to handle kernel NULL pointer dereference at 0008
> >> [ 201.950254] [ cut here ]
> >> [ 201.959045] #PF error: [normal kernel read fault]
> >> [ 201.964272] !se->on_rq
> >> [ 201.964287] WARNING: CPU: 22 PID: 2965 at kernel/sched/fair.c:6849 set_next_buddy+0x52/0x70
>
> Shouldn't the for_each_sched_entity(se) skip the code block for the !se
> case and have avoided a null pointer access of se?
>
> Since
>	#define for_each_sched_entity(se) \
>		for (; se; se = se->parent)
>
> Scratching my head a bit here on how your changes would have made
> a difference.

This NULL pointer dereference is not reproducible, which made me think
the change worked...

> In your original log, I wonder if the !se->on_rq warning on CPU 22 is
> mixed with the actual Oops?
> Saw also in your original log rb_insert_color. Wonder if that
> was actually the source of the Oops?

No chance to figure this out; I only saw this once, while the lockup
occurs more frequently.

Thanks,
-Aubrey
[PATCH v3] hid: logitech: check the return value of create_singlethread_workqueue
create_singlethread_workqueue may fail and return NULL. The fix checks if it is NULL to avoid a NULL pointer dereference. Also, the fix moves the call of create_singlethread_workqueue earlier to avoid resource-release issues.

--
V3: do not introduce memory leaks.

Signed-off-by: Kangjie Lu
---
 drivers/hid/hid-logitech-hidpp.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/hid/hid-logitech-hidpp.c b/drivers/hid/hid-logitech-hidpp.c
index 15ed6177a7a3..0a243247b231 100644
--- a/drivers/hid/hid-logitech-hidpp.c
+++ b/drivers/hid/hid-logitech-hidpp.c
@@ -2111,6 +2111,13 @@ static int hidpp_ff_init(struct hidpp_device *hidpp, u8 feature_index)
 		kfree(data);
 		return -ENOMEM;
 	}
+	data->wq = create_singlethread_workqueue("hidpp-ff-sendqueue");
+	if (!data->wq) {
+		kfree(data->effect_ids);
+		kfree(data);
+		return -ENOMEM;
+	}
+
 	data->hidpp = hidpp;
 	data->feature_index = feature_index;
 	data->version = version;
@@ -2155,7 +2162,6 @@ static int hidpp_ff_init(struct hidpp_device *hidpp, u8 feature_index)
 	/* ignore boost value at response.fap.params[2] */
 
 	/* init the hardware command queue */
-	data->wq = create_singlethread_workqueue("hidpp-ff-sendqueue");
 	atomic_set(&data->workqueue_size, 0);
 
 	/* initialize with zero autocenter to get wheel in usable state */
--
2.17.1
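The ordering argument of the fix (take every allocation that can fail up front, so each failure path frees exactly what has been allocated so far and never has to undo half-initialized device state) can be sketched outside the driver. The types and the workqueue stand-in below are hypothetical, not the hidpp code:

```c
#include <assert.h>
#include <stdlib.h>

struct ff_data { int *effect_ids; void *wq; };

/* Stand-in for create_singlethread_workqueue(); 'fail' forces the NULL
 * return path so the unwind can be exercised. */
static void *fake_create_workqueue(int fail)
{
	return fail ? NULL : malloc(1);
}

/* Allocate everything first; each failure branch releases precisely the
 * allocations made before it, mirroring the moved call in the patch. */
struct ff_data *ff_init(int wq_fails)
{
	struct ff_data *data = malloc(sizeof(*data));

	if (!data)
		return NULL;
	data->effect_ids = malloc(16 * sizeof(int));
	if (!data->effect_ids) {
		free(data);
		return NULL;
	}
	data->wq = fake_create_workqueue(wq_fails);
	if (!data->wq) {
		free(data->effect_ids);
		free(data);
		return NULL;
	}
	return data;
}
```

Had the workqueue call stayed where it was, a NULL return late in init would have required tearing down device registration as well, which is the "resource-release issue" the changelog alludes to.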
Re: [PATCH v4] lib/string.c: implement a basic bcmp
On Thu, Mar 14, 2019 at 6:13 AM Nick Desaulniers wrote: > > A recent optimization in Clang (r355672) lowers comparisons of the > return value of memcmp against zero to comparisons of the return value > of bcmp against zero. This helps some platforms that implement bcmp > more efficiently than memcmp. glibc simply aliases bcmp to memcmp, but > an optimized implementation is in the works. > > This results in linkage failures for all targets with Clang due to the > undefined symbol. For now, just implement bcmp as a tailcall to memcmp > to unbreak the build. This routine can be further optimized in the > future. > > Other ideas discussed: > * A weak alias was discussed, but breaks for architectures that define > their own implementations of memcmp since aliases to declarations are > not permitted (only definitions). Arch-specific memcmp implementations > typically declare memcmp in C headers, but implement them in assembly. > * -ffreestanding also is used sporadically throughout the kernel. > * -fno-builtin-bcmp doesn't work when doing LTO. > > Link: https://bugs.llvm.org/show_bug.cgi?id=41035 > Link: https://code.woboq.org/userspace/glibc/string/memcmp.c.html#bcmp > Link: > https://github.com/llvm/llvm-project/commit/8e16d73346f8091461319a7dfc4ddd18eedcff13 > Link: https://github.com/ClangBuiltLinux/linux/issues/416 > Cc: sta...@vger.kernel.org > Reported-by: Nathan Chancellor > Reported-by: Adhemerval Zanella > Suggested-by: Arnd Bergmann > Suggested-by: James Y Knight > Suggested-by: Masahiro Yamada > Suggested-by: Nathan Chancellor > Suggested-by: Rasmus Villemoes > Signed-off-by: Nick Desaulniers > Acked-by: Steven Rostedt (VMware) Reviewed-by: Masahiro Yamada > --- > Changes V3 -> V4: > * Include the entirety of Rasmus' suggestion, as per Steven. > * Change formal parameter identifiers to match the comment. > Changes V2 -> V3: > * Adjust comment as per Steven to Rasmus' suggestion. > * Pick up Steven's Ack.
> Changes V1 -> V2: > * Add declaration to include/linux/string.h. > * Reword comment above bcmp. > > include/linux/string.h | 3 +++ > lib/string.c | 20 > 2 files changed, 23 insertions(+) > > diff --git a/include/linux/string.h b/include/linux/string.h > index 7927b875f80c..6ab0a6fa512e 100644 > --- a/include/linux/string.h > +++ b/include/linux/string.h > @@ -150,6 +150,9 @@ extern void * memscan(void *,int,__kernel_size_t); > #ifndef __HAVE_ARCH_MEMCMP > extern int memcmp(const void *,const void *,__kernel_size_t); > #endif > +#ifndef __HAVE_ARCH_BCMP > +extern int bcmp(const void *,const void *,__kernel_size_t); > +#endif > #ifndef __HAVE_ARCH_MEMCHR > extern void * memchr(const void *,int,__kernel_size_t); > #endif > diff --git a/lib/string.c b/lib/string.c > index 38e4ca08e757..3ab861c1a857 100644 > --- a/lib/string.c > +++ b/lib/string.c > @@ -866,6 +866,26 @@ __visible int memcmp(const void *cs, const void *ct, > size_t count) > EXPORT_SYMBOL(memcmp); > #endif > > +#ifndef __HAVE_ARCH_BCMP > +/** > + * bcmp - returns 0 if and only if the buffers have identical contents. > + * @a: pointer to first buffer. > + * @b: pointer to second buffer. > + * @len: size of buffers. > + * > + * The sign or magnitude of a non-zero return value has no particular > + * meaning, and architectures may implement their own more efficient bcmp(). > So > + * while this particular implementation is a simple (tail) call to memcmp, do > + * not rely on anything but whether the return value is zero or non-zero. > + */ > +#undef bcmp > +int bcmp(const void *a, const void *b, size_t len) > +{ > + return memcmp(a, b, len); > +} > +EXPORT_SYMBOL(bcmp); > +#endif > + > #ifndef __HAVE_ARCH_MEMSCAN > /** > * memscan - Find a character in an area of memory. > -- > 2.21.0.360.g471c308f928-goog > -- Best Regards Masahiro Yamada
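The whole patch boils down to one forwarding function. A userspace sketch (named my_bcmp here to avoid colliding with a libc-provided bcmp) shows why forwarding satisfies the contract: bcmp only promises zero versus non-zero, which memcmp's stronger ordered result trivially provides.

```c
#include <string.h>

/* bcmp's contract: 0 iff the first len bytes of a and b are identical.
 * The sign/magnitude of a non-zero result carries no meaning, so a
 * (tail) call to memcmp is a valid, if unoptimized, implementation. */
int my_bcmp(const void *a, const void *b, size_t len)
{
	return memcmp(a, b, len);
}
```

Callers that were relying on memcmp's ordering would break if Clang's lowering fired on them, but the lowering only applies to comparisons against zero, exactly the cases where the two functions agree.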
[RFC PATCH v2 6/7] tracing: Use tracing error_log with probe events
Use tracing error_log with probe events for logging errors more precisely. This also makes all parse errors return -EINVAL (except for -ENOMEM), because the user can now see a better error message in the error_log file.

Signed-off-by: Masami Hiramatsu
---
 Changes in v2:
  - Update error message according to Steve's comment
  - Add uprobe reference counter errors
  - Adjust some error positions
---
 kernel/trace/trace_kprobe.c |  77 +++-
 kernel/trace/trace_probe.c  | 274 +++
 kernel/trace/trace_probe.h  |  77
 kernel/trace/trace_uprobe.c |  44 +--
 4 files changed, 348 insertions(+), 124 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 56324c231688..b5945e1c3359 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -441,13 +441,8 @@ static int __register_trace_kprobe(struct trace_kprobe *tk)
 	else
 		ret = register_kprobe(&tk->rp.kp);
 
-	if (ret == 0) {
+	if (ret == 0)
 		tk->tp.flags |= TP_FLAG_REGISTERED;
-	} else if (ret == -EILSEQ) {
-		pr_warn("Probing address(0x%p) is not an instruction boundary.\n",
-			tk->rp.kp.addr);
-		ret = -EINVAL;
-	}
 	return ret;
 }
 
@@ -591,7 +586,7 @@ static int trace_kprobe_create(int argc, const char *argv[])
 	 * Type of args:
 	 *  FETCHARG:TYPE : use TYPE instead of unsigned long.
 	 */
-	struct trace_kprobe *tk;
+	struct trace_kprobe *tk = NULL;
 	int i, len, ret = 0;
 	bool is_return = false;
 	char *symbol = NULL, *tmp = NULL;
@@ -615,44 +610,50 @@ static int trace_kprobe_create(int argc, const char *argv[])
 	if (argc < 2)
 		return -ECANCELED;
 
+	trace_probe_log_init("trace_kprobe", argc, argv);
+
 	event = strchr(&argv[0][1], ':');
 	if (event)
 		event++;
 
 	if (isdigit(argv[0][1])) {
 		if (!is_return) {
-			pr_info("Maxactive is not for kprobe");
-			return -EINVAL;
+			trace_probe_log_err(1, MAXACT_NO_KPROBE);
+			goto parse_error;
 		}
 		if (event)
 			len = event - &argv[0][1] - 1;
 		else
 			len = strlen(&argv[0][1]);
-		if (len > MAX_EVENT_NAME_LEN - 1)
-			return -E2BIG;
+		if (len > MAX_EVENT_NAME_LEN - 1) {
+			trace_probe_log_err(1, BAD_MAXACT);
+			goto parse_error;
+		}
 		memcpy(buf, &argv[0][1], len);
 		buf[len] = '\0';
 		ret = kstrtouint(buf, 0, &maxactive);
 		if (ret || !maxactive) {
-			pr_info("Invalid maxactive number\n");
-			return ret;
+			trace_probe_log_err(1, BAD_MAXACT);
+			goto parse_error;
 		}
 		/* kretprobes instances are iterated over via a list. The
 		 * maximum should stay reasonable.
 		 */
 		if (maxactive > KRETPROBE_MAXACTIVE_MAX) {
-			pr_info("Maxactive is too big (%d > %d).\n",
-				maxactive, KRETPROBE_MAXACTIVE_MAX);
-			return -E2BIG;
+			trace_probe_log_err(1, MAXACT_TOO_BIG);
+			goto parse_error;
 		}
 	}
 
 	/* try to parse an address. if that fails, try to read the
 	 * input as a symbol.
 	 */
 	if (kstrtoul(argv[1], 0, (unsigned long *)&addr)) {
+		trace_probe_log_set_index(1);
 		/* Check whether uprobe event specified */
-		if (strchr(argv[1], '/') && strchr(argv[1], ':'))
-			return -ECANCELED;
+		if (strchr(argv[1], '/') && strchr(argv[1], ':')) {
+			ret = -ECANCELED;
+			goto error;
+		}
 		/* a symbol specified */
 		symbol = kstrdup(argv[1], GFP_KERNEL);
 		if (!symbol)
@@ -660,23 +661,23 @@ static int trace_kprobe_create(int argc, const char *argv[])
 		/* TODO: support .init module functions */
 		ret = traceprobe_split_symbol_offset(symbol, &offset);
 		if (ret || offset < 0 || offset > UINT_MAX) {
-			pr_info("Failed to parse either an address or a symbol.\n");
-			goto out;
+			trace_probe_log_err(0, BAD_PROBE_ADDR);
+			goto parse_error;
 		}
 		if (kprobe_on_func_entry(NULL, symbol, offset))
 			flags |= TPARG_FL_FENTRY;
 		if (offset && is_return && !(flags & TPARG_FL_FENTRY)) {
-			pr_info("Given offset is not valid for return
[RFC PATCH v2 7/7] selftests/ftrace: Add error_log testcase for probe errors
Add error_log testcase for error logs on probe events. This tests most of error cases and checks the error position is correct. Signed-off-by: Masami Hiramatsu --- Changes in v2: - Specify error position in command string by "^" - Clear error_log right before writing command - Add uprobe syntax error testcase --- .../ftrace/test.d/kprobe/kprobe_syntax_errors.tc | 93 .../ftrace/test.d/kprobe/uprobe_syntax_errors.tc | 31 +++ 2 files changed, 124 insertions(+) create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc new file mode 100644 index ..281665b1348c --- /dev/null +++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc @@ -0,0 +1,93 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# description: Kprobe event parser error log check + +[ -f kprobe_events ] || exit_unsupported # this is configurable + +[ -f error_log ] || exit_unsupported + +check_error() { # command-with-error-pos-by-^ +pos=$(echo -n "${1%^*}" | wc -c) # error position +command=$(echo "$1" | tr -d ^) +echo "Test command: $command" +echo > error_log +(! 
echo "$command" > kprobe_events ) >& /dev/null +grep "trace_kprobe: error:" -A 3 error_log +N=$(tail -n 1 error_log | wc -c) +# " Command: " and "^\n" => 13 +test $(expr 13 + $pos) -eq $N +} + +if grep -q 'r\[maxactive\]' README; then +check_error 'p^100 vfs_read' # MAXACT_NO_KPROBE +check_error 'r^1a111 vfs_read' # BAD_MAXACT +check_error 'r^10 vfs_read'# MAXACT_TOO_BIG +fi + +check_error 'p ^non_exist_func'# BAD_PROBE_ADDR (enoent) +check_error 'p ^hoge-fuga' # BAD_PROBE_ADDR (bad syntax) +check_error 'p ^hoge+1000-1000'# BAD_PROBE_ADDR (bad syntax) +check_error 'r ^vfs_read+10' # BAD_RETPROBE +check_error 'p:^/bar vfs_read' # NO_GROUP_NAME +check_error 'p:^12345678901234567890123456789012345678901234567890123456789012345/bar vfs_read'# GROUP_TOO_LONG + +check_error 'p:^foo.1/bar vfs_read'# BAD_GROUP_NAME +check_error 'p:foo/^ vfs_read' # NO_EVENT_NAME +check_error 'p:foo/^12345678901234567890123456789012345678901234567890123456789012345 vfs_read'# EVENT_TOO_LONG +check_error 'p:foo/^bar.1 vfs_read'# BAD_EVENT_NAME + +check_error 'p vfs_read ^$retval' # RETVAL_ON_PROBE +check_error 'p vfs_read ^$stack1' # BAD_STACK_NUM + +if grep -q '$arg' README; then +check_error 'p vfs_read ^$arg1'# BAD_ARG_NUM +fi + +check_error 'p vfs_read ^$none_var'# BAD_VAR + +check_error 'p vfs_read ^%none_reg'# BAD_REG_NAME +check_error 'p vfs_read ^@12345678abcde' # BAD_MEM_ADDR +check_error 'p vfs_read ^@+10' # FILE_ON_KPROBE + +check_error 'p vfs_read ^+0@0)'# DEREF_NEED_BRACE +check_error 'p vfs_read ^+0ab1(@0)'# BAD_DEREF_OFFS +check_error 'p vfs_read +0(+0(@0^)'# DEREF_OPEN_BRACE + +if grep -A1 "fetcharg:" README | grep -q '\$comm' ; then +check_error 'p vfs_read +0(^$comm)'# COMM_CANT_DEREF +fi + +check_error 'p vfs_read ^&1' # BAD_FETCH_ARG + + +# We've introduced this limitation with array support +if grep -q ' \\\[\\\]' README; then +check_error 'p vfs_read +0(^+0(+0(+0(+0(+0(+0(+0(+0(+0(+0(+0(+0(+0(@0))' # TOO_MANY_OPS? 
+check_error 'p vfs_read +0(@11):u8[10^'# ARRAY_NO_CLOSE +check_error 'p vfs_read +0(@11):u8[10]^a' # BAD_ARRAY_SUFFIX +check_error 'p vfs_read +0(@11):u8[^10a]' # BAD_ARRAY_NUM +check_error 'p vfs_read +0(@11):u8[^256]' # ARRAY_TOO_BIG +fi + +check_error 'p vfs_read @11:^unknown_type' # BAD_TYPE +check_error 'p vfs_read $stack0:^string' # BAD_STRING +check_error 'p vfs_read @11:^b10@a/16' # BAD_BITFIELD + +check_error 'p vfs_read ^arg123456789012345678901234567890=@11'# ARG_NAME_TOO_LOG +check_error 'p vfs_read ^=@11' # NO_ARG_NAME +check_error 'p vfs_read ^var.1=@11'# BAD_ARG_NAME +check_error 'p vfs_read var1=@11 ^var1=@12'# USED_ARG_NAME +check_error 'p vfs_read ^+1234567(+1234567(+1234567(+1234567(+1234567(+1234567(@1234))'# ARG_TOO_LONG +check_error 'p vfs_read arg1=^'# NO_ARG_BODY + +# instruction boundary check is valid on x86 (at this moment) +case $(uname -m) in + x86_64|i[3456]86) +echo 'p vfs_read' > kprobe_events +if grep -q FTRACE ../kprobes/list ; then + check_error 'p ^vfs_read+3' # BAD_INSN_BNDRY (only if function-tracer is enabled) +fi +;; +esac + +exit 0 diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc new file mode 100644 index
[RFC PATCH v2 5/7] tracing/probe: Verify alloc_trace_*probe() result
Since alloc_trace_*probe() returns -EINVAL only if !event && !group, it should not happen in trace_*probe_create(). If we catch that case there is a bug. So use WARN_ON_ONCE() instead of pr_info(). Signed-off-by: Masami Hiramatsu --- kernel/trace/trace_kprobe.c |4 ++-- kernel/trace/trace_uprobe.c |3 ++- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 5222fd82e7e4..56324c231688 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -693,9 +693,9 @@ static int trace_kprobe_create(int argc, const char *argv[]) tk = alloc_trace_kprobe(group, event, addr, symbol, offset, maxactive, argc, is_return); if (IS_ERR(tk)) { - pr_info("Failed to allocate trace_probe.(%d)\n", - (int)PTR_ERR(tk)); ret = PTR_ERR(tk); + /* This must return -ENOMEM otherwise there is a bug */ + WARN_ON_ONCE(ret != -ENOMEM); goto out; } diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index 52f033489377..b54137ec7810 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -514,8 +514,9 @@ static int trace_uprobe_create(int argc, const char **argv) tu = alloc_trace_uprobe(group, event, argc, is_return); if (IS_ERR(tu)) { - pr_info("Failed to allocate trace_uprobe.(%d)\n", (int)PTR_ERR(tu)); ret = PTR_ERR(tu); + /* This must return -ENOMEM otherwise there is a bug */ + WARN_ON_ONCE(ret != -ENOMEM); goto fail_address_parse; } tu->offset = offset;
[RFC PATCH v2 2/7] tracing/probe: Check event name length correctly
Ensure the given event name is not too long when parsing it, and update the event name offset correctly when a group name is given. For example, this makes the probe event check the "p:foo/" error case correctly.

Signed-off-by: Masami Hiramatsu
---
 kernel/trace/trace_probe.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 89da34b326e3..4cd50913cb5d 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -159,6 +159,7 @@ int traceprobe_parse_event_name(const char **pevent, const char **pgroup,
 				char *buf)
 {
 	const char *slash, *event = *pevent;
+	int len;
 
 	slash = strchr(event, '/');
 	if (slash) {
@@ -173,10 +174,15 @@ int traceprobe_parse_event_name(const char **pevent, const char **pgroup,
 		strlcpy(buf, event, slash - event + 1);
 		*pgroup = buf;
 		*pevent = slash + 1;
+		event = *pevent;
 	}
-	if (strlen(event) == 0) {
+	len = strlen(event);
+	if (len == 0) {
 		pr_info("Event name is not specified\n");
 		return -EINVAL;
+	} else if (len > MAX_EVENT_NAME_LEN) {
+		pr_info("Event name is too long\n");
+		return -E2BIG;
 	}
 	return 0;
 }
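The bug being fixed is subtle: once "group/" has been consumed, the length check must apply to the part after the slash, which is what the added `event = *pevent;` line achieves. A userspace sketch of the corrected ordering, with simplified error codes and a locally defined MAX_EVENT_NAME_LEN (the kernel's value, restated here as an assumption):

```c
#include <string.h>

#define MAX_EVENT_NAME_LEN 64	/* assumed; matches the kernel's constant */

/* Returns 0 on success, -1 for an empty event name (the "p:foo/" case),
 * -2 for an over-long event name. */
int check_event_name(const char *spec)
{
	const char *event = spec;
	const char *slash = strchr(spec, '/');
	size_t len;

	if (slash)
		event = slash + 1;	/* advance past the group first */

	len = strlen(event);	/* ...so the length check sees the event part */
	if (len == 0)
		return -1;
	if (len > MAX_EVENT_NAME_LEN)
		return -2;
	return 0;
}
```

Without the advance, `strlen(event)` would measure the whole "group/event" string, so "foo/" would wrongly pass the empty-name check.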
[RFC PATCH v2 3/7] tracing/probe: Check the size of argument name and body
Check that the size of the argument name and expression is non-zero and smaller than the maximum length.

Signed-off-by: Masami Hiramatsu
---
 kernel/trace/trace_probe.c | 2 ++
 kernel/trace/trace_probe.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 4cd50913cb5d..feae03056f0b 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -554,6 +554,8 @@ int traceprobe_parse_probe_arg(struct trace_probe *tp, int i, char *arg,
 
 	body = strchr(arg, '=');
 	if (body) {
+		if (body - arg > MAX_ARG_NAME_LEN || body == arg)
+			return -EINVAL;
 		parg->name = kmemdup_nul(arg, body - arg, GFP_KERNEL);
 		body++;
 	} else {
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 8a63f8bc01bc..2177c206de15 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -32,6 +32,7 @@
 #define MAX_TRACE_ARGS		128
 #define MAX_ARGSTR_LEN		63
 #define MAX_ARRAY_LEN		64
+#define MAX_ARG_NAME_LEN	32
 #define MAX_STRING_SIZE		PATH_MAX
 
 /* Reserved field names */
[RFC PATCH v2 1/7] tracing/probe: Check maxactive error cases
Check maxactive on the kprobe error case, because maxactive is only for kretprobes, not for kprobes. Also, maxactive should not be 0; it should be at least 1.

Signed-off-by: Masami Hiramatsu
---
 kernel/trace/trace_kprobe.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index d5fb09ebba8b..d47e12596f12 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -624,7 +624,11 @@ static int trace_kprobe_create(int argc, const char *argv[])
 	if (event)
 		event++;
 
-	if (is_return && isdigit(argv[0][1])) {
+	if (isdigit(argv[0][1])) {
+		if (!is_return) {
+			pr_info("Maxactive is not for kprobe");
+			return -EINVAL;
+		}
 		if (event)
 			len = event - &argv[0][1] - 1;
 		else
@@ -634,8 +638,8 @@ static int trace_kprobe_create(int argc, const char *argv[])
 	memcpy(buf, &argv[0][1], len);
 	buf[len] = '\0';
 	ret = kstrtouint(buf, 0, &maxactive);
-	if (ret) {
-		pr_info("Failed to parse maxactive.\n");
+	if (ret || !maxactive) {
+		pr_info("Invalid maxactive number\n");
 		return ret;
 	}
 	/* kretprobes instances are iterated over via a list. The
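The two tightened checks can be sketched in isolation. This is a simplified userspace model, not the kernel parser: it assumes the first argument is just the leading "p<NUM>"/"r<NUM>" token, uses strtoul in place of kstrtouint (both reject trailing garbage here), and restates KRETPROBE_MAXACTIVE_MAX locally.

```c
#include <ctype.h>
#include <stdlib.h>

#define KRETPROBE_MAXACTIVE_MAX 4096	/* assumed; the kernel's limit */

/* Validate an optional maxactive suffix: only legal on 'r' (kretprobe)
 * events, must parse fully as a number, must be neither zero nor huge.
 * Returns 0 if the prefix is acceptable, -1 otherwise. */
int check_maxactive(const char *arg0, int is_return)
{
	char *end;
	unsigned long maxactive;

	if (!isdigit((unsigned char)arg0[1]))
		return 0;		/* no maxactive given at all */
	if (!is_return)
		return -1;		/* maxactive is not for kprobe */
	maxactive = strtoul(&arg0[1], &end, 10);
	if (*end != '\0' || maxactive == 0)
		return -1;		/* invalid maxactive number */
	if (maxactive > KRETPROBE_MAXACTIVE_MAX)
		return -1;		/* too big for the kretprobe list */
	return 0;
}
```

Note how "p10" now fails fast: before the patch, the digit suffix on a plain kprobe was silently ignored because the `is_return &&` guard skipped the whole block.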
[RFC PATCH v2 4/7] tracing/probe: Check event/group naming rule at parsing
Check event and group naming rule at parsing it instead of allocating probes. Signed-off-by: Masami Hiramatsu --- Changes in v2: - Update error message according to Steve's comment. --- kernel/trace/trace_kprobe.c |7 +-- kernel/trace/trace_probe.c |8 kernel/trace/trace_uprobe.c |5 + 3 files changed, 10 insertions(+), 10 deletions(-) diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index d47e12596f12..5222fd82e7e4 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -221,7 +221,7 @@ static struct trace_kprobe *alloc_trace_kprobe(const char *group, tk->rp.maxactive = maxactive; - if (!event || !is_good_name(event)) { + if (!event || !group) { ret = -EINVAL; goto error; } @@ -231,11 +231,6 @@ static struct trace_kprobe *alloc_trace_kprobe(const char *group, if (!tk->tp.call.name) goto error; - if (!group || !is_good_name(group)) { - ret = -EINVAL; - goto error; - } - tk->tp.class.system = kstrdup(group, GFP_KERNEL); if (!tk->tp.class.system) goto error; diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c index feae03056f0b..d2b73628f1e1 100644 --- a/kernel/trace/trace_probe.c +++ b/kernel/trace/trace_probe.c @@ -172,6 +172,10 @@ int traceprobe_parse_event_name(const char **pevent, const char **pgroup, return -E2BIG; } strlcpy(buf, event, slash - event + 1); + if (!is_good_name(buf)) { + pr_info("Group name must follow the same rules as C identifiers\n"); + return -EINVAL; + } *pgroup = buf; *pevent = slash + 1; event = *pevent; @@ -184,6 +188,10 @@ int traceprobe_parse_event_name(const char **pevent, const char **pgroup, pr_info("Event name is too long\n"); return -E2BIG; } + if (!is_good_name(event)) { + pr_info("Event name must follow the same rules as C identifiers\n"); + return -EINVAL; + } return 0; } diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index e335576b9411..52f033489377 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -266,10 +266,7 @@ 
alloc_trace_uprobe(const char *group, const char *event, int nargs, bool is_ret) { struct trace_uprobe *tu; - if (!event || !is_good_name(event)) - return ERR_PTR(-EINVAL); - - if (!group || !is_good_name(group)) + if (!event || !group) return ERR_PTR(-EINVAL); tu = kzalloc(SIZEOF_TRACE_UPROBE(nargs), GFP_KERNEL);
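The naming rule the patch moves to parse time is the kernel's is_good_name() check: event and group names must have C-identifier shape (a leading non-digit, then alphanumerics or underscore). A userspace restatement of that rule, for illustration:

```c
#include <ctype.h>

/* Sketch of the kernel's is_good_name() predicate: returns non-zero iff
 * name looks like a C identifier. An empty string fails the first test. */
int is_good_name(const char *name)
{
	if (!isalpha((unsigned char)*name) && *name != '_')
		return 0;
	while (*++name != '\0') {
		if (!isalnum((unsigned char)*name) && *name != '_')
			return 0;
	}
	return 1;
}
```

Rejecting bad names at parse time means the allocator only ever sees names that can become valid tracefs entries, which is why the later alloc_trace_*probe() checks collapse to a simple NULL test.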
[RFC PATCH v2 0/7] tracing: Use common error_log with probe events
Hi, Here is the 2nd version of using common error_log with probe events. Previous version is here. http://lkml.kernel.org/r/155248005229.10815.334731901778152247.stgit@devnote2 In this version, I've updated some error messages according to Steve's comment, adjust some error position, and update testcase to simplify a bit. - [4/7]: Update error message according to Steve's comment (Thanks!) - [6/7]: Update error message, adjust error positions, and add uprobe errors - [7/7]: Specify error position in command string by "^". Clear error_log right before writing command. Add uprobe syntax error checker Thank you, --- Masami Hiramatsu (7): tracing/probe: Check maxactive error cases tracing/probe: Check event name length correctly tracing/probe: Check the size of argument name and body tracing/probe: Check event/group naming rule at parsing tracing/probe: Verify alloc_trace_*probe() result tracing: Use tracing error_log with probe events selftests/ftrace: Add error_log testcase for probe errors kernel/trace/trace_kprobe.c| 90 -- kernel/trace/trace_probe.c | 282 +++- kernel/trace/trace_probe.h | 78 +- kernel/trace/trace_uprobe.c| 52 ++-- .../ftrace/test.d/kprobe/kprobe_syntax_errors.tc | 93 +++ .../ftrace/test.d/kprobe/uprobe_syntax_errors.tc | 31 ++ 6 files changed, 494 insertions(+), 132 deletions(-) create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc -- Masami Hiramatsu
Compiling error if CONFIG_CPU_SUP_INTEL is disabled
If CONFIG_CPU_SUP_INTEL is disabled with either the 5.0.2 or 4.20.16 kernel, it errors out right away: In file included from arch/x86/events/amd/core.c:8: arch/x86/events/amd/../perf_event.h:1035:45: warning: ‘struct cpu_hw_event’ declared inside parameter list will not be visible outside of this definition or declaration static inline int intel_cpuc_prepare(struct cpu_hw_event *cpuc, int cpu) ^~~~ arch/x86/events/amd/../perf_event.h:1040:45: warning: ‘struct cpu_hw_event’ declared inside parameter list will not be visible outside of this definition or declaration static inline void intel_cpuc_finish(struct cpu_hw_event *cpuc) ^~~~ CC arch/x86/events/amd/uncore.o CC arch/x86/events/amd/ibs.o AR kernel/bpf/built-in.a CC kernel/cgroup/cgroup.o In file included from arch/x86/events/amd/ibs.c:19: arch/x86/events/amd/../perf_event.h:1035:45: warning: ‘struct cpu_hw_event’ declared inside parameter list will not be visible outside of this definition or declaration static inline int intel_cpuc_prepare(struct cpu_hw_event *cpuc, int cpu) ^~~~ arch/x86/events/amd/../perf_event.h:1040:45: warning: ‘struct cpu_hw_event’ declared inside parameter list will not be visible outside of this definition or declaration static inline void intel_cpuc_finish(struct cpu_hw_event *cpuc) ^~~~ AR arch/x86/events/amd/built-in.a CC arch/x86/events/core.o In file included from arch/x86/events/core.c:44: arch/x86/events/perf_event.h:1035:45: warning: ‘struct cpu_hw_event’ declared inside parameter list will not be visible outside of this definition or declaration static inline int intel_cpuc_prepare(struct cpu_hw_event *cpuc, int cpu) ^~~~ arch/x86/events/perf_event.h:1040:45: warning: ‘struct cpu_hw_event’ declared inside parameter list will not be visible outside of this definition or declaration static inline void intel_cpuc_finish(struct cpu_hw_event *cpuc) ^~~~ arch/x86/events/core.c: In function ‘free_fake_cpuc’: arch/x86/events/core.c:1998:20: error: passing argument 1 of 
‘intel_cpuc_finish’ from incompatible pointer type [-Werror=incompatible-pointer-types] intel_cpuc_finish(cpuc); ^~~~ In file included from arch/x86/events/core.c:44: arch/x86/events/perf_event.h:1040:59: note: expected ‘struct cpu_hw_event *’ but argument is of type ‘struct cpu_hw_events *’ static inline void intel_cpuc_finish(struct cpu_hw_event *cpuc) ~^~~~ arch/x86/events/core.c: In function ‘allocate_fake_cpuc’: arch/x86/events/core.c:2012:25: error: passing argument 1 of ‘intel_cpuc_prepare’ from incompatible pointer type [-Werror=incompatible-pointer-types] if (intel_cpuc_prepare(cpuc, cpu)) ^~~~ In file included from arch/x86/events/core.c:44: arch/x86/events/perf_event.h:1035:59: note: expected ‘struct cpu_hw_event *’ but argument is of type ‘struct cpu_hw_events *’ static inline int intel_cpuc_prepare(struct cpu_hw_event *cpuc, int cpu) ~^~~~ cc1: some warnings being treated as errors make[2]: *** [scripts/Makefile.build:291: arch/x86/events/core.o] Error 1 make[1]: *** [scripts/Makefile.build:516: arch/x86/events] Error 2 make: *** [Makefile:1058: arch/x86] Error 2
Re: [PATCH 0/3] userfaultfd: allow to forbid unprivileged users
On 3/13/19 4:55 PM, Andrea Arcangeli wrote: > On Wed, Mar 13, 2019 at 01:01:40PM -0700, Mike Kravetz wrote: >> On 3/13/19 11:52 AM, Andrea Arcangeli wrote: >>> Unless somebody suggests a consistent way to make hugetlbfs "just >>> work" (like we could achieve clean with CRIU and KVM), I think Oracle >>> will need a one liner change in the Oracle setup to echo into that >>> file in addition of running the hugetlbfs mount. >> >> I think you are suggesting the DB setup process enable uffd for all users. >> Correct? > > Yes. In addition of the hugetlbfs setup, various apps requires to also > increase fs.inotify.max_user_watches or file-max and other tweaks, > this would be one of those tweaks. Yes, I agree. It is just that unprivileged_userfaultfd disabled would likely to be the default set by distros. Or, perhaps 'kvm'? Then, if you want to run the DB, the admin (or the DB) will need to set it to enabled. And, this results in it being enabled for everyone. I think I understand the scope of any security exposure this would cause from a technical perspective. However, I can imagine people being concerned about enabling everywhere if this is not the default setting. If it is OK to disable everywhere, why not just use disable for the kvm use case as well? :) >> This may be too simple, and I don't really like group access, but how about >> just defining a uffd group? If you are in the group you can make uffd >> system calls. > > Everything is possible, I'm just afraid it gets too complex. > > So you suggest to echo a gid into the file? That is what I was thinking. But, I was mostly thinking of that because Peter's earlier comment made me go and check hugetlbfs code. There is a sysctl_hugetlb_shm_group variable that does this, even though it is mostly unused in the hugetlbfs code. I know the kvm dev open scheme works for kvm. Was just trying to think of something more general that would work for hugetlbfs/DB or other use cases. -- Mike Kravetz
Re: [PATCH v2 4/5] slob: Use slab_list instead of lru
On Wed, Mar 13, 2019 at 07:05:02PM +, Christopher Lameter wrote: > On Wed, 13 Mar 2019, Tobin C. Harding wrote: > > > @@ -297,7 +297,7 @@ static void *slob_alloc(size_t size, gfp_t gfp, int > > align, int node) > > continue; > > > > /* Attempt to alloc */ > > - prev = sp->lru.prev; > > + prev = sp->slab_list.prev; > > b = slob_page_alloc(sp, size, align); > > if (!b) > > continue; > > Hmmm... Is there a way to use a macro or so to avoid referencing the field > within the slab_list? Thanks for the review. Next version includes a fix for this. Tobin.
[PATCH v2 1/1] initrd: move initrd_start calculate within linear mapping range check
Previously, initrd_start and initrd_end could be successfully calculated even when (base < memblock_start_of_DRAM()) or (base + size > memblock_start_of_DRAM() + linear_region_size). That means that even if the linear mapping range check fails for initrd_start and initrd_end, they still get virtual addresses. Here we move the calculation of initrd_start/initrd_end so it happens only when the linear mapping check passes.

Fixes: c756c592e442 ("arm64: Utilize phys_initrd_start/phys_initrd_size")
Reviewed-by: Steven Price
Signed-off-by: pierre Kuo
---
Changes in v2:
- add Fixes tag

 arch/arm64/mm/init.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 7205a9085b4d..1adf418de685 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -425,6 +425,9 @@ void __init arm64_memblock_init(void)
 			memblock_remove(base, size); /* clear MEMBLOCK_ flags */
 			memblock_add(base, size);
 			memblock_reserve(base, size);
+			/* the generic initrd code expects virtual addresses */
+			initrd_start = __phys_to_virt(phys_initrd_start);
+			initrd_end = initrd_start + phys_initrd_size;
 		}
 	}
 
@@ -450,11 +453,6 @@ void __init arm64_memblock_init(void)
 	 * pagetables with memblock.
 	 */
 	memblock_reserve(__pa_symbol(_text), _end - _text);
-	if (IS_ENABLED(CONFIG_BLK_DEV_INITRD) && phys_initrd_size) {
-		/* the generic initrd code expects virtual addresses */
-		initrd_start = __phys_to_virt(phys_initrd_start);
-		initrd_end = initrd_start + phys_initrd_size;
-	}
 
 	early_init_fdt_scan_reserved_mem();
--
2.17.1
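The reordering can be modeled in userspace. Every constant below (DRAM base, linear-map size, phys-to-virt offset) is made up for illustration; the point is only that the virtual addresses are computed on the branch where the range check has passed, and never otherwise.

```c
#include <stdint.h>

#define DRAM_START	0x40000000ULL	/* hypothetical DRAM base */
#define LINEAR_SIZE	0x40000000ULL	/* hypothetical linear-map span */
#define PHYS_TO_VIRT_OFF 0xffff000000000000ULL	/* hypothetical offset */

uint64_t initrd_start, initrd_end;

/* Toy model of the fixed flow: reject an initrd outside the linear map
 * before any virtual address is derived from it. */
int map_initrd(uint64_t base, uint64_t size)
{
	if (base < DRAM_START || base + size > DRAM_START + LINEAR_SIZE)
		return -1;	/* not linearly mapped: leave both unset */

	/* the generic initrd code expects virtual addresses */
	initrd_start = base + PHYS_TO_VIRT_OFF;
	initrd_end = initrd_start + size;
	return 0;
}
```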
Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions
On 3/13/19 7:49 AM, Ira Weiny wrote: On Tue, Mar 12, 2019 at 05:38:55PM -0700, John Hubbard wrote: On 3/12/19 8:30 AM, Ira Weiny wrote: On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubb...@gmail.com wrote: From: John Hubbard Introduces put_user_page(), which simply calls put_page(). This provides a way to update all get_user_pages*() callers, so that they call put_user_page(), instead of put_page(). So I've been running with these patches for a while but today while ramping up my testing I hit the following: [ 1355.557819] [ cut here ] [ 1355.563436] get_user_pages pin count overflowed Hi Ira, Thanks for reporting this. That overflow, at face value, means that we've used more than the 22 bits worth of gup pin counts, so about 4 million pins of the same page... This is my bug in the patches I'm playing with. Somehow I'm causing more puts than gets... I'm not sure how but this is for sure my problem. Backing off to your patch set the numbers are good. Now that's a welcome bit of good news! Sorry for the noise. With the testing I've done today I feel comfortable adding Tested-by: Ira Weiny For the main GUP and InfiniBand patches. Ira OK, I'll add your tested-by tag to patches 1, 2, 4, 5 (the numbering refers to the "RFC v2: mm: gup/dma tracking" posting [1]) in my repo [2], and they'll show up in the next posting. (Patch 3 is already upstream, and patch 6 is documentation that needs to be rewritten entirely.) [1] https://lore.kernel.org/r/20190204052135.25784-1-jhubb...@nvidia.com [2] https://github.com/johnhubbard/linux/tree/gup_dma_core thanks, -- John Hubbard NVIDIA
Re:Re:Re: Re:[PATCH 4.9 81/96] MIPS:Loongson: Introduce and use loongson_llsc_mb()
Hi, Greg, Patch for 4.9 (and below 4.9) is here: https://patchwork.linux-mips.org/patch/21375/ Huacai -- Original -- From: "陈华才"; Date: Thu, Mar 14, 2019 06:55 AM To: "gregkh"; Cc: "linux-kernel"; "stable"; "huangpei"; "Paul Burton"; "Ralf Baechle"; "ambrosehua"; "Steven J . Hill"; "linux-mips"; "Fuxin Zhang"; "wuzhangjin"; "Li Xuefeng"; "Xu Chenghua"; "Sasha Levin"; Subject: Re:Re: Re:[PATCH 4.9 81/96] MIPS:Loongson: Introduce and use loongson_llsc_mb() Hi, Greg, Just 4.9 (and below) needs to modify spinlock.h; 4.14 and 4.19 do not need this because they have been converted to qspinlock. And sorry for my poor reply, because I can only use a mobile phone now. --- Original Message --- From: "Greg Kroah-Hartman" Date: Thursday, March 14, 2019, 4:58 AM To: "陈华才"; Subject: Re:Re: Re:[PATCH 4.9 81/96] MIPS:Loongson: Introduce and use loongson_llsc_mb() On Wed, Mar 13, 2019 at 09:17:15PM +0800, 陈华才 wrote: > Hi, GREG, > > 4.9 need to modify spinlock.h, please wait my patch. > > > > --- Original Message --- > From: "Greg Kroah-Hartman" > Date: Wednesday, March 13, 2019, 1:10 AM > To: "linux-kernel"; > Subject: [PATCH 4.9 81/96] MIPS: Loongson: Introduce and use loongson_llsc_mb() > 4.9-stable review patch. If anyone has any objections, please let me know. > > -- > > [ Upstream commit e02e07e3127d8aec1f4bcdfb2fc52a2d99b4859e ] > > On the Loongson-2G/2H/3A/3B there is a hardware flaw that ll/sc and > lld/scd is very weak ordering. We should add sync instructions "before > each ll/lld" and "at the branch-target between ll/sc" to workaround. > Otherwise, this flaw will cause deadlock occasionally (e.g. when doing > heavy load test with LTP). > > Below is the explanation of CPU designer: > > "For Loongson 3 family, when a memory access instruction (load, store, > or prefetch)'s executing occurs between the execution of LL and SC, the > success or failure of SC is not predictable.
> Although programmer would
> not insert memory access instructions between LL and SC, the memory
> instructions before LL in program-order, may dynamically executed
> between the execution of LL/SC, so a memory fence (SYNC) is needed
> before LL/LLD to avoid this situation.
>
> Since Loongson-3A R2 (3A2000), we have improved our hardware design to
> handle this case. But we later deduce a rarely circumstance that some
> speculatively executed memory instructions due to branch misprediction
> between LL/SC still fall into the above case, so a memory fence (SYNC)
> at branch-target (if its target is not between LL/SC) is needed for
> Loongson 3A1000, 3B1500, 3A2000 and 3A3000.
>
> Our processor is continually evolving and we aim to remove all these
> workaround-SYNCs around LL/SC for new-come processor."
>
> Here is an example:
>
> Both cpu1 and cpu2 simultaneously run atomic_add by 1 on same atomic var,
> this bug cause both 'sc' run by two cpus (in atomic_add) succeed at same
> time ('sc' return 1), and the variable is only *added by 1*, sometimes,
> which is wrong and unacceptable (it should be added by 2).
>
> Why disable fix-loongson3-llsc in compiler?
> Because the compiler fix will cause problems in kernel's __ex_table section.
>
> This patch fix all the cases in kernel, but:
>
> +. the fix at the end of futex_atomic_cmpxchg_inatomic is for branch-target
> of 'bne', there other cases which smp_mb__before_llsc() and smp_llsc_mb() fix
> the ll and branch-target coincidently such as atomic_sub_if_positive/
> cmpxchg/xchg, just like this one.
>
> +. Loongson 3 does not support CONFIG_EDAC_ATOMIC_SCRUB, so no need to touch
> edac.h
>
> +. local_ops and cmpxchg_local should not be affected by this bug since
> only the owner can write.
>
> +.
> mips_atomic_set for syscall.c is deprecated and rarely used, just let
> it go
>
> Signed-off-by: Huacai Chen
> Signed-off-by: Huang Pei
> [paul.bur...@mips.com:
> - Simplify the addition of -mno-fix-loongson3-llsc to cflags, and add
> a comment describing why it's here.
> - Make loongson_llsc_mb() a no-op when
> CONFIG_CPU_LOONGSON3_WORKAROUNDS=n, rather than a compiler memory
> barrier.
> - Add a comment describing the bug & how loongson_llsc_mb() helps
> in asm/barrier.h.]
> Signed-off-by: Paul Burton
> Cc: Ralf Baechle
> Cc: ambrose...@gmail.com
> Cc: Steven J . Hill
> Cc: linux-m...@linux-mips.org
> Cc: Fuxin Zhang
> Cc: Zhangjin Wu
> Cc: Li Xuefeng
> Cc: Xu Chenghua
> Signed-off-by: Sasha Levin
> ---
> arch/mips/Kconfig | 15 ++
> arch/mips/include/asm/atomic.h | 6 ++
> arch/mips/include/asm/barrier.h | 36 +
> arch/mips/include/asm/bitops.h | 5 +
> arch/mips/include/asm/futex.h | 3 +++
> arch/mips/include/asm/pgtable.h | 2 ++
> arch/mips/loongson64/Platform | 23 +
> arch/mips/mm/tlbex.c | 10 +
> 8 files changed, 100 insertions(+)

Ok, I will go drop this from all stable queues now, thanks!

greg k-h
Re: [PATCH v4] lib/string.c: implement a basic bcmp
On Wed, Mar 13, 2019 at 02:13:31PM -0700, Nick Desaulniers wrote:
> A recent optimization in Clang (r355672) lowers comparisons of the
> return value of memcmp against zero to comparisons of the return value
> of bcmp against zero. This helps some platforms that implement bcmp
> more efficiently than memcmp. glibc simply aliases bcmp to memcmp, but
> an optimized implementation is in the works.
>
> This results in linkage failures for all targets with Clang due to the
> undefined symbol. For now, just implement bcmp as a tail call to memcmp
> to unbreak the build. This routine can be further optimized in the
> future.
>
> Other ideas discussed:
> * A weak alias was discussed, but breaks for architectures that define
> their own implementations of memcmp since aliases to declarations are
> not permitted (only definitions). Arch-specific memcmp implementations
> typically declare memcmp in C headers, but implement them in assembly.
> * -ffreestanding also is used sporadically throughout the kernel.
> * -fno-builtin-bcmp doesn't work when doing LTO.
>
> Link: https://bugs.llvm.org/show_bug.cgi?id=41035
> Link: https://code.woboq.org/userspace/glibc/string/memcmp.c.html#bcmp
> Link: https://github.com/llvm/llvm-project/commit/8e16d73346f8091461319a7dfc4ddd18eedcff13
> Link: https://github.com/ClangBuiltLinux/linux/issues/416
> Cc: sta...@vger.kernel.org
> Reported-by: Nathan Chancellor
> Reported-by: Adhemerval Zanella
> Suggested-by: Arnd Bergmann
> Suggested-by: James Y Knight
> Suggested-by: Masahiro Yamada
> Suggested-by: Nathan Chancellor
> Suggested-by: Rasmus Villemoes
> Signed-off-by: Nick Desaulniers
> Acked-by: Steven Rostedt (VMware)

Reviewed-by: Nathan Chancellor
Tested-by: Nathan Chancellor

Thanks for fixing this better than I did (or tried to)!

> ---
> Changes V3 -> V4:
> * Include the entirety of Rasmus' suggestion, as per Steven.
> * Change formal parameter identifiers to match the comment.
> Changes V2 -> V3:
> * Adjust comment as per Steven to Rasmus' suggestion.
> * Pick up Steven's Ack.
> Changes V1 -> V2:
> * Add declaration to include/linux/string.h.
> * Reword comment above bcmp.
>
> include/linux/string.h | 3 +++
> lib/string.c | 20
> 2 files changed, 23 insertions(+)
>
> diff --git a/include/linux/string.h b/include/linux/string.h
> index 7927b875f80c..6ab0a6fa512e 100644
> --- a/include/linux/string.h
> +++ b/include/linux/string.h
> @@ -150,6 +150,9 @@ extern void * memscan(void *,int,__kernel_size_t);
> #ifndef __HAVE_ARCH_MEMCMP
> extern int memcmp(const void *,const void *,__kernel_size_t);
> #endif
> +#ifndef __HAVE_ARCH_BCMP
> +extern int bcmp(const void *,const void *,__kernel_size_t);
> +#endif
> #ifndef __HAVE_ARCH_MEMCHR
> extern void * memchr(const void *,int,__kernel_size_t);
> #endif
> diff --git a/lib/string.c b/lib/string.c
> index 38e4ca08e757..3ab861c1a857 100644
> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -866,6 +866,26 @@ __visible int memcmp(const void *cs, const void *ct, size_t count)
> EXPORT_SYMBOL(memcmp);
> #endif
>
> +#ifndef __HAVE_ARCH_BCMP
> +/**
> + * bcmp - returns 0 if and only if the buffers have identical contents.
> + * @a: pointer to first buffer.
> + * @b: pointer to second buffer.
> + * @len: size of buffers.
> + *
> + * The sign or magnitude of a non-zero return value has no particular
> + * meaning, and architectures may implement their own more efficient bcmp(). So
> + * while this particular implementation is a simple (tail) call to memcmp, do
> + * not rely on anything but whether the return value is zero or non-zero.
> + */
> +#undef bcmp
> +int bcmp(const void *a, const void *b, size_t len)
> +{
> +	return memcmp(a, b, len);
> +}
> +EXPORT_SYMBOL(bcmp);
> +#endif
> +
> #ifndef __HAVE_ARCH_MEMSCAN
> /**
> * memscan - Find a character in an area of memory.
> --
> 2.21.0.360.g471c308f928-goog
>
Re: lib/test_overflow.c causes WARNING and tainted kernel
Hi! On Wed, Mar 13, 2019 at 2:29 PM Randy Dunlap wrote: > > This is v5.0-11053-gebc551f2b8f9, MAR-12 around 4:00pm PT. > > In the first test_kmalloc() in test_overflow_allocation(): > > [54375.073895] test_overflow: ok: (s64)(0 << 63) == 0 > [54375.074228] WARNING: CPU: 2 PID: 5462 at ../mm/page_alloc.c:4584 > __alloc_pages_nodemask+0x33f/0x540 > [...] > [54375.079236] ---[ end trace 754acb68d8d1a1cb ]--- > [54375.079313] test_overflow: kmalloc detected saturation Yup! This is expected and operating as intended: it is exercising the allocator's detection of insane allocation sizes. :) If we want to make it less noisy, perhaps we could add a global flag the allocators could check before doing their WARNs? -Kees -- Kees Cook
Re: [PATCH v2 8/9] rtc: mt6397: fix alarm register overwrite
Hi,

On Mon, 2019-03-11 at 13:50 -0700, Sean Wang wrote:
> Hi,
>
> On Sun, Mar 10, 2019 at 8:49 PM Hsin-Hsiung Wang wrote:
> >
> > From: Ran Bi
> >
> > The alarm registers' high byte was reserved for other functions.
> > This adds masks to the alarm register operation functions.
> > This also fixes an error condition in the interrupt handler.
> >
> > Fixes: fc2979118f3f ("rtc: mediatek: Add MT6397 RTC driver")
>
> add a Cc: sta...@vger.kernel.org here to apply to all stable kernels
> for the critical fixup patch.
>
> > Signed-off-by: Ran Bi
> > ---
> > drivers/rtc/rtc-mt6397.c | 47 +--
> > 1 file changed, 33 insertions(+), 14 deletions(-)
> >
> > diff --git a/drivers/rtc/rtc-mt6397.c b/drivers/rtc/rtc-mt6397.c
> > index e9a25ec..f85f1fc 100644
> > --- a/drivers/rtc/rtc-mt6397.c
> > +++ b/drivers/rtc/rtc-mt6397.c
> > @@ -55,6 +55,14 @@
> >
> > #define RTC_AL_SEC 0x0018
> >
> > +#define RTC_AL_SEC_MASK 0x003f
> > +#define RTC_AL_MIN_MASK 0x003f
> > +#define RTC_AL_HOU_MASK 0x001f
> > +#define RTC_AL_DOM_MASK 0x001f
> > +#define RTC_AL_DOW_MASK 0x0007
> > +#define RTC_AL_MTH_MASK 0x000f
> > +#define RTC_AL_YEA_MASK 0x007f
> > +
> > #define RTC_PDN2 0x002e
> > #define RTC_PDN2_PWRON_ALARM BIT(4)
> >
> > @@ -111,7 +119,7 @@ static irqreturn_t mtk_rtc_irq_handler_thread(int irq, void *data)
> > irqen = irqsta & ~RTC_IRQ_EN_AL;
> > mutex_lock(&rtc->lock);
> > if (regmap_write(rtc->regmap, rtc->addr_base + RTC_IRQ_EN,
> > - irqen) < 0)
> > + irqen) == 0)
> > mtk_rtc_write_trigger(rtc);
> > mutex_unlock(&rtc->lock);
> >
> > @@ -233,12 +241,12 @@ static int mtk_rtc_read_alarm(struct device *dev, struct rtc_wkalrm *alm)
> > alm->pending = !!(pdn2 & RTC_PDN2_PWRON_ALARM);
> > mutex_unlock(&rtc->lock);
> >
> > - tm->tm_sec = data[RTC_OFFSET_SEC];
> > - tm->tm_min = data[RTC_OFFSET_MIN];
> > - tm->tm_hour = data[RTC_OFFSET_HOUR];
> > - tm->tm_mday = data[RTC_OFFSET_DOM];
> > - tm->tm_mon = data[RTC_OFFSET_MTH];
> > - tm->tm_year = data[RTC_OFFSET_YEAR];
> > + tm->tm_sec = data[RTC_OFFSET_SEC] & RTC_AL_SEC_MASK;
> > + tm->tm_min = data[RTC_OFFSET_MIN] & RTC_AL_MIN_MASK;
> > + tm->tm_hour = data[RTC_OFFSET_HOUR] & RTC_AL_HOU_MASK;
> > + tm->tm_mday = data[RTC_OFFSET_DOM] & RTC_AL_DOM_MASK;
> > + tm->tm_mon = data[RTC_OFFSET_MTH] & RTC_AL_MTH_MASK;
> > + tm->tm_year = data[RTC_OFFSET_YEAR] & RTC_AL_YEA_MASK;
> >
> > tm->tm_year += RTC_MIN_YEAR_OFFSET;
> > tm->tm_mon--;
> > @@ -259,14 +267,25 @@ static int mtk_rtc_set_alarm(struct device *dev, struct rtc_wkalrm *alm)
> > tm->tm_year -= RTC_MIN_YEAR_OFFSET;
> > tm->tm_mon++;
> >
> > - data[RTC_OFFSET_SEC] = tm->tm_sec;
> > - data[RTC_OFFSET_MIN] = tm->tm_min;
> > - data[RTC_OFFSET_HOUR] = tm->tm_hour;
> > - data[RTC_OFFSET_DOM] = tm->tm_mday;
> > - data[RTC_OFFSET_MTH] = tm->tm_mon;
> > - data[RTC_OFFSET_YEAR] = tm->tm_year;
> > -
> > mutex_lock(&rtc->lock);
> > + ret = regmap_bulk_read(rtc->regmap, rtc->addr_base + RTC_AL_SEC,
> > + data, RTC_OFFSET_COUNT);
> > + if (ret < 0)
> > + goto exit;
> > +
>
> Why did we need an additional regmap_bulk_read here? I would suppose
> data[RTC_OFFSET_*] & ~(RTC_AL_*_MASK) is always equal to 0.

In the bootloader, the RTC init code needs to operate on the RTC_AL_* registers' high byte for RTC operating-mode settings and data storage, while the RTC_AL_* low byte is used as the alarm register here. So we need an additional regmap_bulk_read here to keep the RTC settings correct.

> It might be another fixup since the change is not mentioned in the
> git message.
> > + data[RTC_OFFSET_SEC] = ((data[RTC_OFFSET_SEC] & ~(RTC_AL_SEC_MASK)) |
> > + (tm->tm_sec & RTC_AL_SEC_MASK));
> > + data[RTC_OFFSET_MIN] = ((data[RTC_OFFSET_MIN] & ~(RTC_AL_MIN_MASK)) |
> > + (tm->tm_min & RTC_AL_MIN_MASK));
> > + data[RTC_OFFSET_HOUR] = ((data[RTC_OFFSET_HOUR] & ~(RTC_AL_HOU_MASK)) |
> > + (tm->tm_hour & RTC_AL_HOU_MASK));
> > + data[RTC_OFFSET_DOM] = ((data[RTC_OFFSET_DOM] & ~(RTC_AL_DOM_MASK)) |
> > + (tm->tm_mday & RTC_AL_DOM_MASK));
> > + data[RTC_OFFSET_MTH] = ((data[RTC_OFFSET_MTH] & ~(RTC_AL_MTH_MASK)) |
> > + (tm->tm_mon & RTC_AL_MTH_MASK));
> > + data[RTC_OFFSET_YEAR] = ((data[RTC_OFFSET_YEAR] & ~(RTC_AL_YEA_MASK)) |
> > + (tm->tm_year & RTC_AL_YEA_MASK));
> > +
> > if
Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
On 19 Feb 2019, at 20:38, Anshuman Khandual wrote:

On 02/19/2019 06:26 PM, Matthew Wilcox wrote:

On Tue, Feb 19, 2019 at 01:12:07PM +0530, Anshuman Khandual wrote:

But the location of this temp page matters as well because you would like to saturate the inter-node interface. It needs to be either of the nodes where the source or destination page belongs. Any other node would generate two inter-node copy processes, which is not what you intend here, I guess.

That makes no sense. It should be allocated on the local node of the CPU performing the copy. If the CPU is in node A, the destination is in node B and the source is in node C, then you're doing 4k worth of reads from node C, 4k worth of reads from node B, 4k worth of writes to node C followed by 4k worth of writes to node B. Eventually the 4k of dirty cachelines on node A will be written back from cache to the local memory (... or not, if that page gets reused for some other purpose first). If you allocate the page on node B or node C, that's an extra 4k of writes to be sent across the inter-node link.

That's right, there will be an extra remote write. My assumption was that the CPU performing the copy belongs to either node B or node C.

I have some interesting throughput results for exchange per u64 and exchange per 4KB page. What I discovered is that using a 4KB page as the temporary storage for exchanging 2MB THPs does not improve the throughput. On the contrary, when we are exchanging more than 2^4=16 THPs, exchanging per 4KB page has lower throughput than exchanging per u64. Please see the results below.

The experiments are done on a two-socket machine with two Intel Xeon E5-2640 v3 CPUs. All exchanges are done via the QPI link across the two sockets.
Results
===
Throughput (GB/s) of exchanging 2 order-N 2MB pages between two NUMA nodes

2mb_page_order |    0 |    1 |    2 |    3 |    4 |    5 |    6 |    7 |    8 |    9
u64            | 5.31 | 5.58 | 5.89 | 5.69 | 8.97 | 9.51 | 9.21 | 9.50 | 9.57 | 9.62
per_page       | 5.85 | 6.48 | 6.20 | 5.26 | 7.22 | 7.25 | 7.28 | 7.30 | 7.32 | 7.31

Normalized throughput (to per_page)

2mb_page_order |    0 |    1 |    2 |    3 |    4 |    5 |    6 |    7 |    8 |    9
u64            | 0.90 | 0.86 | 0.94 | 1.08 | 1.24 | 1.31 | 1.26 | 1.30 | 1.30 | 1.31

Exchange page code
===
For exchanging per u64, I use the following function:

static void exchange_page(char *to, char *from)
{
	u64 tmp;
	int i;

	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
		tmp = *((u64 *)(from + i));
		*((u64 *)(from + i)) = *((u64 *)(to + i));
		*((u64 *)(to + i)) = tmp;
	}
}

For exchange per 4KB, I use the following function:

static void exchange_page2(char *to, char *from)
{
	int cpu = smp_processor_id();

	VM_BUG_ON(!in_atomic());
	if (!page_tmp[cpu]) {
		int nid = cpu_to_node(cpu);
		struct page *page_tmp_page = alloc_pages_node(nid, GFP_KERNEL, 0);

		if (!page_tmp_page) {
			exchange_page(to, from);
			return;
		}
		page_tmp[cpu] = kmap(page_tmp_page);
	}
	copy_page(page_tmp[cpu], to);
	copy_page(to, from);
	copy_page(from, page_tmp[cpu]);
}

where page_tmp is pre-allocated local to each CPU and alloc_pages_node() above is for hot-added CPUs, which is not used in the tests.

The kernel is available at: https://gitlab.com/ziy/linux-contig-mem-rfc

To do a comparison, you can clone this repo: https://gitlab.com/ziy/thp-migration-bench, then make, ./run_test.sh, and ./get_results.sh using the kernel from above.

Let me know if I missed anything or did something wrong. Thanks.

--
Best Regards,
Yan Zi
Re: [LSF/MM TOPIC] Using XArray to manage the VMA
On Wed, Mar 13, 2019 at 02:06:03PM -0700, Davidlohr Bueso wrote: > On Wed, 13 Mar 2019, Laurent Dufour wrote: > > If this is not too late and if there is still place available, I would > > like to attend the MM track and propose a topic about using the XArray > > to replace the VMA's RB tree and list. > > > > Using the XArray in place of the VMA's tree and list seems to be a first > > step to the long way of removing/replacing the mmap_sem. > > So threaded (not as in threads of execution) rbtrees are another > alternative to deal with the two data structure approach we currently > have. Having O(1) rb_prev/next() calls allows us to basically get rid of > the vma list at the cost of an extra check for each node we visit on > the way down when inserting. It's probably worth listing the advantages of the Maple Tree over the rbtree. - Shallower tree. A 1000-entry rbtree is 10 levels deep. A 1000-entry Maple Tree is 5 levels deep (I did a more detailed analysis in an earlier email thread with Laurent and I can present it if needed). - O(1) prev/next - Lookups under the RCU lock There're some second-order effects too; by using externally allocated nodes, we avoid disturbing other VMAs when inserting/deleting, and we avoid bouncing cachelines around (eg the VMA which happens to end up at the head of the tree is accessed by every lookup in the tree because it's on the way to every other node).
Re: [PATCH] spi: mediatek: Attempt to address style issues in spi-mt7621.c
On Thu, Mar 14 2019, Chuanhong Guo wrote:
> Hi!
> On Thu, Mar 14, 2019 at 6:14 AM NeilBrown wrote:
>>
>> [...]
>> My only small concern is that this driver was backported to openwrt
>> (4.14 based) and then reverted
>>
>> https://github.com/openwrt/openwrt/commit/749a29f76ca780d8df70a5163d43bbdc6f13ba3f
>>
>> "This breaks some mt7621 devices."
>>
>> Possibly it was backported badly, or possibly there is a problem.
> Last time I backported the version with broken SPI modes so it got
> broken SPI CS1 support. There is one mt7621 router using SPI CS1 in
> OpenWrt and the backported driver broke it.
> It has been fixed by my two "drop broken spi modes" patches.

Ahh, thanks for the clarification. Good to know all known problems are fixed.

Thanks,
NeilBrown

>>
>> John: do you have any more details of the problem other than what is in
>> the commit message?
>>
>> Thanks,
>> NeilBrown
>
> Regards,
> Chuanhong Guo
must hold the driver_input_lock in hid_debug_rdesc_show
We see the below kernel panic logs when running a suspend/resume test with a USB mouse and a USB keyboard connected. The scenario is that userspace calls hid_debug_rdesc_show() to dump the input device while the device is removed. The patch holds the driver_input_lock to avoid the race.

[ 5381.757295] selinux: SELinux: Could not stat /sys/devices/pci:00/:00:15.0/usb1/1-2/1-2:1.0/0003:03F0:0325.0320/input/input960/input960::scrolllock: No such file or directory.
[ 5382.636498] BUG: unable to handle kernel paging request at 000783316040
[ 5382.651950] CPU: 1 PID: 1512 Comm: getevent Tainted: G U O 4.19.20-quilt-2e5dc0ac-00029-gc455a447dd55 #1
[ 5382.663797] RIP: 0010:hid_dump_device+0x9b/0x160
[ 5382.758853] Call Trace:
[ 5382.761581] hid_debug_rdesc_show+0x72/0x1d0
[ 5382.766343] seq_read+0xe0/0x410
[ 5382.769941] full_proxy_read+0x5f/0x90
[ 5382.774121] __vfs_read+0x3a/0x170
[ 5382.788392] vfs_read+0xa0/0x150
[ 5382.791984] ksys_read+0x58/0xc0
[ 5382.801404] __x64_sys_read+0x1a/0x20
[ 5382.805483] do_syscall_64+0x55/0x110
[ 5382.809559] entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: he, bo
Signed-off-by: "Zhang, Jun"
---
drivers/hid/hid-debug.c | 5 +
1 file changed, 5 insertions(+)

diff --git a/drivers/hid/hid-debug.c b/drivers/hid/hid-debug.c
index ebc9ffde41e9..a353a011fbdf 100644
--- a/drivers/hid/hid-debug.c
+++ b/drivers/hid/hid-debug.c
@@ -1060,10 +1060,15 @@ static int hid_debug_rdesc_show(struct seq_file *f, void *p)
 	seq_printf(f, "\n\n");

 	/* dump parsed data and input mappings */
+	if (down_interruptible(&hdev->driver_input_lock))
+		return 0;
+
 	hid_dump_device(hdev, f);
 	seq_printf(f, "\n");
 	hid_dump_input_mapping(hdev, f);

+	up(&hdev->driver_input_lock);
+
 	return 0;
 }
--
2.20.1
Re: [PATCH] spi: mediatek: Attempt to address style issues in spi-mt7621.c
Hi! On Thu, Mar 14, 2019 at 6:14 AM NeilBrown wrote: > > [...] > My only small concern is that this driver was backported to openwrt > (4.14 based) and then reverted > > https://github.com/openwrt/openwrt/commit/749a29f76ca780d8df70a5163d43bbdc6f13ba3f > > "This breaks some mt7621 devices." > > Possibly it was backported badly, or possibly there is a problem. Last time I backported the version with broken SPI modes so it got broken SPI CS1 support. There is one mt7621 router using SPI CS1 in OpenWrt and the backported driver broke it. It has been fixed by my two "drop broken spi modes" patches. > > John: do you have any more details of the problem other than what is in > the commit message? > > Thanks, > NeilBrown Regards, Chuanhong Guo
RE: [PATCH RESEND V2 1/4] dt-bindings: fsl: scu: add watchdog binding
Hi, Rob

Best Regards!
Anson Huang

> -----Original Message-----
> From: Rob Herring [mailto:r...@kernel.org]
> Sent: March 12, 2019 5:26
> To: Aisheng Dong
> Cc: Anson Huang ; Guenter Roeck us.net>; mark.rutl...@arm.com; ulf.hans...@linaro.org; he...@sntech.de; catalin.mari...@arm.com; will.dea...@arm.com; bjorn.anders...@linaro.org; feste...@gmail.com; ja...@amarulasolutions.com; Andy Gross ; dl-linux-imx ; devicet...@vger.kernel.org; linux-watch...@vger.kernel.org; a...@arndb.de; marc.w.gonza...@free.fr; s.ha...@pengutronix.de; enric.balle...@collabora.com; horms+rene...@verge.net.au; w...@linux-watchdog.org; Daniel Baluta ; linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org; ker...@pengutronix.de; o...@lixom.net; shawn...@kernel.org; Jens Wiklander
> Subject: Re: [PATCH RESEND V2 1/4] dt-bindings: fsl: scu: add watchdog binding
>
> +Jens W
>
> On Thu, Mar 7, 2019 at 6:22 AM Aisheng Dong wrote:
> >
> > Hi Rob,
> >
> > > > > I think Rob suggested that the SCU parent driver should
> > > > > instantiate the watchdog without explicit watchdog node. That
> > > > > would be possible, but it currently uses
> > > > > devm_of_platform_populate() to do the instantiation, and
> > > > > changing that would be a mess. Besides, it does seem to me that
> > > > > your suggested node would describe the hardware, so I am not
> > > > > sure I understand the reasoning.
> > >
> > > It would just be a call to create a platform device instead. How is that a mess?
> > >
> > > It's describing firmware. We have DT for describing h/w we've failed
> > > to make discoverable. We should not repeat that and just describe firmware in DT.
> > > Make the firmware discoverable! Though there are cases like firmware
> > > provided clocks where we still need something in DT, but this is not
> > > one of them.
> >
> > The watchdog node here in question actually is not using SCU firmware call.
> > Due to security requirement by SCU, watchdog can only be accessed in > > security mode, for IMX case, via ARM Trust Firmware. That means the > > watchdog used in Linux actually is using ARM SMC call and does not > > depend SCU driver. So It would be strange for SCU driver to instantiate it. > > > > For this situation, do you think we can move watchdog out of scu node? > > Maybe rename the compatible string like "fsl,imx8qxp-sip-watchdog" > > because it's actually a watchdog serviced by ATF firmware. > > Yes, but that creates more questions. What exactly does ATF talk to for the > watchdog? The SCU firmware? Yes, ATF talks to SCU firmware directly, Linux kernel watchdog driver call SMC instructions to send command to ATF, and ATF will call SCU firmware API to finish the operation requested by Linux kernel watchdog driver. > > Maybe ATF should define and provide a standard watchdog interface? It is > still a question of making the firmware discoverable rather than trying to > describe the firmware in DT. The SMC call by Linux kernel watchdog already follow the SIP(silicon provider) standard, each SoC can define its own protocol for SIP. ATF does NOT have a standard common watchdog interface now, since it is more like a platform specific feature, most of platforms can control watchdog directly from Linux kernel I think. So, do you have suggestion for this case? Either find a place in DT to put watchdog node, or make it a build-in device inside SCU driver? Or you have other better ideas? Thanks, Anson. > > Rob
Re: [PATCH 5/5] lib/list_sort: Optimize number of calls to comparison function
On Thu, 14 Mar 2019 at 00:28:16 +0100, Rasmus Villemoes wrote: > On 05/03/2019 06.58, George Spelvin wrote: >> This patch avoids badly unbalanced merges on unlucky input sizes. >> It slightly increases the code size, but saves an average of 0.2*n >> calls to cmp(). >> > [snip] >> >> (I confess to being more than a little bit proud of how clean this >> code turned out. It took a lot of thinking, but the resultant inner >> loop is very simple and efficient.) > > This is beautiful. So no comments on the patch itself. One thing that > might be nice would be to see the reduction in number of cmp callbacks > explicitly; it should be trivial to use the priv element for that in the > list_sort_test module. But to really see it one would of course have to > extend that test to do a lot more different sizes of lists. And you'd have to compile and run two different kernels, because you can't run the algorithms side-by-side. Not a good way to do it. I used a user-space test harness for testing. The output is ugly, but here are lots of numbers for various list sizes. The first group is single sort invocations. The list is sorted by (random) key, then again by address (the inverse), and the number of comparisons printed. The numbers in parens are the linear coefficient K in comparisons = n*log2(n) - K*n, i.e. it's log2(n) - (comparisons / n). Higher is better. The three lines for each size are the original list_sort, a top-down version I wrote as a "perfect" value for reference, and the posted code. (1034 and 1040 are particularly bad cases for the old code.) 
1: 0 (0.00) 0 (0.00)
1: 0 (0.00) 0 (0.00)
1: 0 (0.00) 0 (0.00)
2: 1 (0.50) 1 (0.50)
2: 1 (0.50) 1 (0.50)
2: 1 (0.50) 1 (0.50)
3: 3 (0.584963) 3 (0.584963)
3: 2 (0.918296) 2 (0.918296)
3: 3 (0.584963) 3 (0.584963)
4: 5 (0.75) 5 (0.75)
4: 5 (0.75) 5 (0.75)
4: 5 (0.75) 5 (0.75)
5: 8 (0.721928) 8 (0.721928)
5: 8 (0.721928) 8 (0.721928)
5: 8 (0.721928) 8 (0.721928)
6: 10 (0.918296) 10 (0.918296)
6: 10 (0.918296) 10 (0.918296)
6: 10 (0.918296) 10 (0.918296)
7: 11 (1.235926) 12 (1.093069)
7: 13 (0.950212) 12 (1.093069)
7: 11 (1.235926) 12 (1.093069)
8: 17 (0.875000) 17 (0.875000)
8: 17 (0.875000) 17 (0.875000)
8: 17 (0.875000) 17 (0.875000)
9: 18 (1.169925) 22 (0.725481)
9: 21 (0.836592) 20 (0.947703)
9: 20 (0.947703) 20 (0.947703)
10: 20 (1.321928) 25 (0.821928)
10: 25 (0.821928) 24 (0.921928)
10: 22 (1.121928) 24 (0.921928)
13: 35 (1.008132) 35 (1.008132)
13: 33 (1.161978) 33 (1.161978)
13: 36 (0.931209) 35 (1.008132)
21: 71 (1.011365) 74 (0.868508)
21: 68 (1.154222) 71 (1.011365)
21: 69 (1.106603) 72 (0.963746)
31: 117 (1.180003) 111 (1.373551)
31: 114 (1.276777) 114 (1.276777)
31: 117 (1.180003) 111 (1.373551)
32: 123 (1.156250) 121 (1.218750)
32: 123 (1.156250) 121 (1.218750)
32: 123 (1.156250) 121 (1.218750)
33: 158 (0.256515) 149 (0.529243)
33: 130 (1.105000) 131 (1.074697)
33: 131 (1.074697) 131 (1.074697)
34: 142 (0.910992) 135 (1.116875)
34: 133 (1.175698) 135 (1.116875)
34: 131 (1.234522) 134 (1.146286)
55: 244 (1.344996) 249 (1.254087)
55: 246 (1.308632) 241 (1.399542)
55: 249 (1.254087) 250 (1.235905)
89: 484 (1.037531) 490 (0.970115)
89: 464 (1.262250) 477 (1.116183)
89: 479 (1.093711) 482 (1.060003)
127: 729 (1.248527) 727 (1.264275)
127: 734 (1.209157) 724 (1.287897)
127: 729 (1.248527) 727 (1.264275)
128: 744 (1.187500) 733 (1.273438)
128: 744 (1.187500) 733 (1.273438)
128: 744 (1.187500) 733 (1.273438)
129: 752 (1.181770) 853 (0.398824)
129: 746 (1.228282) 740 (1.274793)
129: 747 (1.220530) 741 (1.267041)
144: 926 (0.739369) 928 (0.725481)
144: 851 (1.260203) 866 (1.156036)
144: 865 (1.162981) 872 (1.114369)
233: 1556 (1.186075) 1541 (1.250452)
233: 1545 (1.233285) 1527 (1.310538)
233: 1550 (1.211826) 1534 (1.280495)
377: 2787 (1.165848) 2790 (1.157890)
377: 2752 (1.258686) 2771 (1.208288)
377: 2778 (1.189720) 2782 (1.179110)
610: 5115 (0.867420) 5115 (0.867420)
610: 4891 (1.234633) 4883 (1.247747)
610: 4909 (1.205124) 4930 (1.170698)
642: 5385 (0.938579) 5428 (0.871601)
642: 5166 (1.279701) 5185 (1.250105)
642: 5205 (1.218953) 5201 (1.225183)
987: 8574 (1.259976) 8620 (1.213370)
987: 8564 (1.270108) 8599 (1.234647)
987: 8565 (1.269095) 8614 (1.219449)
1022: 8937 (1.252561) 8913 (1.276044)
1022: 8916 (1.273109) 8928 (1.261367)
1022: 8937 (1.252561) 8913 (1.276044)
1023: 8959 (1.241015) 8909 (1.289891)
1023: 8927 (1.272295) 8918 (1.281093)
1023: 8959 (1.241015) 8909 (1.289891)
1024: 8970 (1.240234) 8916 (1.292969)
1024: 8966 (1.244141) 8912 (1.296875)
1024: 8966 (1.244141) 8912 (1.296875)
1025: 9548 (0.686286) 9724 (0.514579)
1025: 8971 (1.249213) 8943 (1.276530)
1025: 8970 (1.250189) 8944 (1.27)
1026: 9771 (0.479423) 9745 (0.504764)
1026: 8936 (1.293263) 8978 (1.252328)
1026: 8942 (1.287415) 8993 (1.237708)
1028: 9643 (0.625274) 9985 (0.292590)
1028: 8980 (1.270216) 9006 (1.244924)
1028: 8980 (1.270216) 9005 (1.245897)
1032: 9894 (0.424018) 10004
Re: [LKP] [ext4] fde872682e: fsmark.files_per_sec -38.0% regression
Theodore Ts'o writes: > On Wed, Mar 13, 2019 at 03:26:39PM +0800, huang ying wrote: >> > >> > >> > commit: fde872682e175743e0c3ef939c89e3c6008a1529 ("ext4: force inode >> > writes when nfsd calls commit_metadata()") >> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master >> >> It appears that this is a performance regression caused by a >> functionality fixing. So we should ignore this? > > Yes, this is a correctness issue that we discovered while tracking > down user data loss issue after a crash of the NFS server, so this is > a change we have to keep. When the NFS folks added the > commit_metadata() hook, they didn't realize that the fallback path in > nfsd/vfs.c using sync_inode_metadata() doesn't work on all file > systems --- and in particular doesn't work for ext3 and ext4 because > of how we do journalling. > > It only applies to NFS serving, not local ext4 use cases, so most ext4 > users won't be impacted on it; only those who export those file > systems using NFS. > > I do have some plans on how to claw back the performance hit. The > good news is that it won't require an on-disk format change; the bad > news is that it's a non-trivial change to how journalling works, and > it's not something we can backport to the stable kernel series. It's > something we're going to have to leave to a distribution who is > willing to do a lot of careful regression testing, once the change is > available, maybe in 3 months or so. > > - Ted > > P.S. I *believe* all other file systems should be OK, and I didn't > want to impose a performance tax on all other file systems (such as > btrfs), so I fixed it in an ext4-specific way. The more > general/conservative change would be to fall back to using fsync in > nfs/vfs.c:commit_metadata() unless the file system specifically set a > superblock flag indicating that using sync_inode_metadata is safe. > OTOH we lived with this flaw in ext3/ext4 for *years* without anyone > noticing or complaining, so Got it! 
Thanks for your detailed explanation! Best Regards, Huang, Ying
Re: [PATCH v3] lib/siphash.c: annotate implicit fall throughs
On Wed, 2019-03-13 at 22:12 +0100, Mathieu Malaterre wrote: > There is a plan to build the kernel with -Wimplicit-fallthrough and > these places in the code produced warnings (W=1). Fix them up. > > This commit remove the following warnings: > > lib/siphash.c:71:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > lib/siphash.c:72:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > lib/siphash.c:73:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > lib/siphash.c:75:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > lib/siphash.c:108:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > lib/siphash.c:109:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > lib/siphash.c:110:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > lib/siphash.c:112:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > lib/siphash.c:434:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > lib/siphash.c:462:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > > Move the break statement onto the next line to match the fall-through > comment pattern. Also move the trailing statement onto the next line to > pass checkpatch verification. [] > diff --git a/lib/siphash.c b/lib/siphash.c []. > @@ -68,13 +68,26 @@ u64 __siphash_aligned(const void *data, size_t len, const > siphash_key_t *key) > bytemask_from_count(left))); > #else > switch (left) { > - case 7: b |= ((u64)end[6]) << 48; > - case 6: b |= ((u64)end[5]) << 40; > - case 5: b |= ((u64)end[4]) << 32; It might also be worth not casting to u64 then shift as that can be moderately expensive on 32 bit systems and instead use ((char *))[]. 
> - case 4: b |= le32_to_cpup(data); break; > - case 3: b |= ((u64)end[2]) << 16; Perhaps an unnecessary cast before shift > - case 2: b |= le16_to_cpup(data); break; > - case 1: b |= end[0]; [] > @@ -101,13 +114,26 @@ u64 __siphash_unaligned(const void *data, size_t len, > const siphash_key_t *key) > bytemask_from_count(left))); > #else > switch (left) { > - case 7: b |= ((u64)end[6]) << 48; > - case 6: b |= ((u64)end[5]) << 40; > - case 5: b |= ((u64)end[4]) << 32; etc... > - case 4: b |= get_unaligned_le32(end); break; > - case 3: b |= ((u64)end[2]) << 16; > - case 2: b |= get_unaligned_le16(end); break; > - case 1: b |= end[0];
Re: [PATCH v1 2/2] interconnect: qcom: Add tagging and wake/sleep support for sdm845
On 3/8/2019 10:35 AM, Evan Green wrote: On Fri, Feb 8, 2019 at 9:22 AM Georgi Djakov wrote: From: David Dai Add support for wake and sleep commands by using a tag to indicate whether or not the aggregate and set requests are active only or dual context for a particular path. Signed-off-by: David Dai Signed-off-by: Georgi Djakov --- drivers/interconnect/qcom/sdm845.c | 101 + 1 file changed, 75 insertions(+), 26 deletions(-) diff --git a/drivers/interconnect/qcom/sdm845.c b/drivers/interconnect/qcom/sdm845.c index fb526004c82e..13499f681160 100644 --- a/drivers/interconnect/qcom/sdm845.c +++ b/drivers/interconnect/qcom/sdm845.c @@ -65,6 +65,12 @@ struct bcm_db { #define SDM845_MAX_BCMS 30 #define SDM845_MAX_BCM_PER_NODE 2 #define SDM845_MAX_VCD 10 +#define SDM845_MAX_CTX 2 +#define SDM845_EE_STATE 2 +#define EE_STATE_WAKE 0 +#define EE_STATE_SLEEP 1 +#define AO_CTX 0 +#define DUAL_CTX 1 I get really lost with these two sets of numbers here. I think this needs some explanation in the comments. From staring at this what I've understood so far is: 1) Clients pass in a tag that buckets their requests into either AO or DUAL within the node. (Although, the requests all seem to get aggregated together no matter what the tag is, more on that later). 2) During icc_set(), we go through the many-to-many mapping of BCMs to nodes. For each BCM, we aggregate all the nodes it has, bucketed again by AO or DUAL. 3) The actual votes sent to RPMh are in terms of WAKE and SLEEP. 4) The mapping from AO/DUAL to WAKE/SLEEP for a particular BCM is: WAKE = AO+DUAL, SLEEP=DUAL. 5) qcom_icc_set() sends EE_STATE_WAKE stuff to RPMH_ACTIVE_ONLY_STATE. 6) If there's any difference between SLEEP and WAKE, then the EE_STATE_WAKE commands are gathered together and sent to RPMH_WAKE_ONLY_STATE, and all the EE_STATE_SLEEP commands are sent to RPMH_SLEEP_STATE. 
So ultimately the muxing ends up like RPMH_ACTIVE_ONLY_STATE <-- EE_STATE_WAKE <-- AO+DUAL RPMH_SLEEP_STATE <-- EE_STATE_SLEEP <-- DUAL RPMH_WAKE_ONLY_STATE <-- EE_STATE_WAKE <-- AO+DUAL This muxing explanation is correct, there's also an inherent limitation in this muxing where RPMH_ACTIVE cannot differ from RPMH_WAKE. Why do we need this complicated muxing to happen? Is it because we're trying to avoid requiring drivers to specify three different paths in the simple case, when all they care about is "use this bandwidth all the time"? That's a part of it, I don't have strong justification for this legacy way of supporting wake/sleep, so I'm open to new suggestions. What I think would make more sense would be to use the "tag" as a bitfield instead. So you'd have #define QCOM_ICC_TAG_ACTIVE_ONLY 0x0001 #define QCOM_ICC_TAG_SLEEP 0x0002 #define QCOM_ICC_TAG_WAKE 0x0004 #define QCOM_ICC_TAG_ALL_TIMES (QCOM_ICC_TAG_ACTIVE_ONLY | QCOM_ICC_TAG_SLEEP | QCOM_ICC_TAG_WAKE) Drivers that don't care about sleep/wake sets would just not set a tag, or set a tag of QCOM_ICC_TAG_ALL_TIMES. Then in qcom_icc_aggregate(), you aggregate the same values up to three times, one for each bit they have set. Finally in qcom_icc_set, you can pass the votes directly down in their buckets, without doing a weird mapping from AO/DUAL to SLEEP/WAKE/ACTIVE_ONLY. Works for me, I'd still want to optimize out the WAKE/SLEEP commands that are equal to each other though. The sticky part about this is that now rather than one set of sum/peak values, there are now three independent ones, corresponding to each of the tag bits. It seems like the interconnect core wouldn't want to mandate whether providers use the tag as a bitfield or a value. So the provider would need to keep not only the official set of aggregates that are returned back to the core (probably ACTIVE_ONLY, or maybe a max of the three), but also the two shadow copies internally for SLEEP and WAKE. 
I would ideally like to return the most accurate current state of the system, which may be ACTIVE_ONLY or WAKE. We might also run into some confusion when printing out the summaries for each node. In a case where the votes for WAKE != ACTIVE (assuming one or more consumers are intentionally doing this with multiple paths), the state of the system will change depending on whether or not we've completed a sleep cycle (if it's prior to sleep, the ACTIVE_ONLY vote will be accurate, and after waking up the WAKE request will be correct), but this is more of a book-keeping concern than anything else, as long as consumers understand that this is the expected behavior. On a side note, I don't like the ACTIVE_ONLY name designation for this, which was brought over from the rpmh driver; it really means "set the current state immediately" and it has no association with the execution environment state, unlike the other WAKE/SLEEP types. If you organized the tag that way, the only extra thing
(.text+0x79c): warning: Using 'getpwuid' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: fa3d493f7a573b4e4e2538486e912093a0161c1b commit: 0053102a869f1b909904b1b85ac282e2744deaab um: Include sys/uio.h to have writev() date: 3 months ago config: um-allmodconfig (attached as .config) compiler: gcc-7 (Debian 7.3.0-1) 7.3.0 reproduce: git checkout 0053102a869f1b909904b1b85ac282e2744deaab # save the attached .config to linux build tree make ARCH=um All warnings (new ones prefixed by >>): arch/um/drivers/vde.o: In function `vde_open_real': (.text+0x951): warning: Using 'getgrnam' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking >> (.text+0x79c): warning: Using 'getpwuid' in statically linked applications >> requires at runtime the shared libraries from the glibc version used for >> linking arch/um/drivers/vector_user.o: In function `user_init_socket_fds': vector_user.c:(.text+0x369): warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking arch/um/drivers/pcap.o: In function `pcap_nametoaddr': (.text+0xdee5): warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking arch/um/drivers/pcap.o: In function `pcap_nametonetaddr': (.text+0xdf85): warning: Using 'getnetbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking arch/um/drivers/pcap.o: In function `pcap_nametoproto': (.text+0xe1a5): warning: Using 'getprotobyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking arch/um/drivers/pcap.o: In function `pcap_nametoport': (.text+0xdfd7): warning: Using 'getservbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking arch/um/drivers/vde.o: In 
function `vde_open_real': (.text+0x951): warning: Using 'getgrnam' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
Re: [RFC][PATCH 00/16] sched: Core scheduling
>> >> One more NULL pointer dereference: >> >> Mar 12 02:24:46 aubrey-ivb kernel: [ 201.916741] core sched enabled >> [ 201.950203] BUG: unable to handle kernel NULL pointer dereference >> at 0008 >> [ 201.950254] [ cut here ] >> [ 201.959045] #PF error: [normal kernel read fault] >> [ 201.964272] !se->on_rq >> [ 201.964287] WARNING: CPU: 22 PID: 2965 at kernel/sched/fair.c:6849 >> set_next_buddy+0x52/0x70 > > A quick workaround below: > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 1d0dac4fd94f..ef6acfe2cf7d 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6834,7 +6834,7 @@ static void set_last_buddy(struct sched_entity *se) > return; > > for_each_sched_entity(se) { > - if (SCHED_WARN_ON(!se->on_rq)) > + if (SCHED_WARN_ON(!(se && se->on_rq))) > return; > cfs_rq_of(se)->last = se; > } > @@ -6846,7 +6846,7 @@ static void set_next_buddy(struct sched_entity *se) > return; > > for_each_sched_entity(se) { > - if (SCHED_WARN_ON(!se->on_rq)) > + if (SCHED_WARN_ON(!(se && se->on_rq))) Shouldn't for_each_sched_entity(se) skipping the code block for the !se case have avoided the NULL pointer access of se? Since #define for_each_sched_entity(se) \ for (; se; se = se->parent) Scratching my head a bit here on how your changes would have made a difference. In your original log, I wonder if the !se->on_rq warning on CPU 22 is mixed with the actual Oops? I also saw rb_insert_color in your original log. Wonder if that was actually the source of the Oops? 
[ 202.078674] RIP: 0010:set_next_buddy+0x52/0x70 [ 202.090135] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.99.99.x058.082120120902 08/21/2012 [ 202.090144] RIP: 0010:rb_insert_color+0x17/0x190 [ 202.101623] Code: 48 85 ff 74 10 8b 47 40 85 c0 75 e2 80 3d 9e e5 6a 01 00 74 02 f3 c3 48 c7 c7 5c 05 2c 82 c6 05 8c e5 6a 01 01 e8 2e bb fb ff <0f> 0b c3 83 bf 04 03 0e [ 202.113216] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 48 8b 17 48 85 d2 0f 84 4d 01 00 00 48 8b 02 a8 01 0f 85 6d 01 00 00 <48> 8b 48 08 49 89 c0 44 [ 202.118263] RSP: 0018:c9000a5cbbb0 EFLAGS: 00010086 [ 202.129858] RSP: 0018:c9000a463cc0 EFLAGS: 00010046 [ 202.135102] RAX: RBX: 88980047e800 RCX: [ 202.135105] RDX: 888be28caa40 RSI: 0001 RDI: 8110c3fa [ 202.156251] RAX: RBX: 888bfeb8 RCX: 888bfeb80 Thanks. Tim
Re: [PATCH v7 02/15] sched/core: uclamp: Enforce last task UCLAMP_MAX
On Wed, Mar 13, 2019 at 9:16 AM Patrick Bellasi wrote: > > On 13-Mar 15:12, Peter Zijlstra wrote: > > On Fri, Feb 08, 2019 at 10:05:41AM +, Patrick Bellasi wrote: > > > +static inline void uclamp_idle_reset(struct rq *rq, unsigned int > > > clamp_id, > > > +unsigned int clamp_value) > > > +{ > > > + /* Reset max-clamp retention only on idle exit */ > > > + if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE)) > > > + return; > > > + > > > + WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value); > > > + > > > + /* > > > +* This function is called for both UCLAMP_MIN (before) and UCLAMP_MAX > > > +* (after). The idle flag is reset only the second time, when we know > > > +* that UCLAMP_MIN has been already updated. > > > > Why do we care? That is, what is this comment trying to tell us. > > Right, the code is clear enough, I'll remove this comment. It would be probably even clearer if rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE is done from inside uclamp_rq_inc after uclamp_rq_inc_id for both clamps is called. > > > > > +*/ > > > + if (clamp_id == UCLAMP_MAX) > > > + rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE; > > > +} > > -- > #include > > Patrick Bellasi
Re: [PATCH 2/5] lib/sort: Use more efficient bottom-up heapsort variant
On 14/03/2019 01.03, George Spelvin wrote: > On Wed, 13 Mar 2019 at 23:29:40 +0100, Rasmus Villemoes wrote: > >> Nice! > > Thank you. May I translate that into Acked-by? > Sort-of. I prefer first seeing the full rerolled series for context etc., even if (the important parts of) this patch would be unchanged.
Re: [PATCH 2/5] lib/sort: Use more efficient bottom-up heapsort variant
On Wed, 13 Mar 2019 at 23:29:40 +0100, Rasmus Villemoes wrote: > On 21/02/2019 09.21, George Spelvin wrote: >> +/** >> + * parent - given the offset of the child, find the offset of the parent. >> + * @i: the offset of the heap element whose parent is sought. Non-zero. >> + * @lsbit: a precomputed 1-bit mask, equal to "size & -size" >> + * @size: size of each element >> + * >> + * In terms of array indexes, the parent of element j = i/size is simply >> + * (j-1)/2. But when working in byte offsets, we can't use implicit >> + * truncation of integer divides. >> + * >> + * Fortunately, we only need one bit of the quotient, not the full divide. >> + * size has a least significant bit. That bit will be clear if i is >> + * an even multiple of size, and set if it's an odd multiple. >> + * >> + * Logically, we're doing "if (i & lsbit) i -= size;", but since the >> + * branch is unpredictable, it's done with a bit of clever branch-free >> + * code instead. >> + */ >> +__attribute_const__ __always_inline >> +static size_t parent(size_t i, unsigned int lsbit, size_t size) >> +{ >> +i -= size; >> +i -= size & -(i & lsbit); >> +return i / 2; >> +} >> + > > Really nice :) I had to work through this by hand, but it's solid. Thank you! Yes, the way the mask doesn't include the low-order bits that don't matter anyway is a bit subtle. When the code is subtle, use lots of comments. The entire reason for making this a separate helper function is to leave room for the large comment. >> +unsigned const lsbit = size & -size;/* Used to find parent */ > > Nit: qualifier before type, "const unsigned". And this sets ZF, so a > paranoid check for zero size (cf. the other mail) by doing "if (!lsbit) > return;" is practically free. Though it's probably a bit obscure doing > it that way... Actually, this is a personal style thing which I can ignore for the sake of the kernel, but I believe that it's better to put the qualifier *after* the type. This is due to C's pointer declaration syntax. 
The standard example of the issue is: typedef char *pointer; const char *a; char const *b; char * const c; const pointer d; pointer const e; Now, which variables are the same types? The answer is that a & b are the same (mutable pointer to const char), and c, d & e are the same (const pointer to mutable char). If you make a habit of putting the qualifier *after* the type, then a simple "textual substitution" mental model for the typedef works, and it's clear that c and e are the same. It's also clear that b cannot be represented by the typedef because the const is between "char" and "*", and you obviously can't do that with the typedef. But if you put the qualifier first, it's annoying to remember why a and d are not the same type. So I've deliberately cultivated the style of putting the qualifier after the type. But if the kernel prefers it before... >> +if (!n) >> +return; > > I'd make that n <= 1. Shouldn't be much more costly. (Actually, it's "num <= 1"; n is the pre-multiplied form so n <= 1 can only happen when sorting one one-byte value.) I actually thought about this and decided not to bother. I did it this way during development to stress the general-case code. But should I change it? === NEVER MIND === I had written a long reply justifying leaving it alone to save one instruction when the light dawned: I can do *both* tests in one step with size_t n = num * size, a = (num/2) * size; unsigned const lsbit = size & -size;/* Used to find parent */ if (!a) /* num < 2 || size == 0 */ return; So now everyone's happy. > Nice! Thank you. May I translate that into Acked-by?
Re: [PATCH v1] Bluetooth: hci_qca: Enable the ldisc for ROME for x86 platforms.
On Wed, Mar 13, 2019 at 05:43:14PM +0800, rjl...@codeaurora.org wrote: > 在 2019-03-12 23:52,Matthias Kaehlcke 写道: > > Hi Rocky, > > > > On Tue, Mar 12, 2019 at 05:01:59PM +0800, rjl...@codeaurora.org wrote: > > > 在 2019-03-09 02:52,Matthias Kaehlcke 写道: > > > > On Fri, Mar 08, 2019 at 10:43:14AM +0530, Balakrishna Godavarthi wrote: > > > > > Hi Matthias, > > > > > > > > > > On 2019-03-08 02:12, Matthias Kaehlcke wrote: > > > > > > Hi Balakrishna, > > > > > > > > > > > > On Thu, Mar 07, 2019 at 03:47:22PM +0530, Balakrishna Godavarthi > > > > > > wrote: > > > > > > > When using btattach to setup Rome over ldisc we observed a crash > > > > > > > in qca_setup as it will try to access the serdev which is not > > > > > > > available in the ldisc proto. This patch will fix the crash by > > > > > > > support both the ldisc and serdev way in the qca hci_uart driver. > > > > > > > > > > > > > > Signed-off-by: Balakrishna Godavarthi > > > > > > > > > > > > Oh, I wasn't aware of the instantiation through ldisc and was > > > > > > actually > > > > > > considering to *remove* some of the seemingly unnecessary serdev > > > > > > checks. 
> > > > > > > > > > > > > --- > > > > > > > drivers/bluetooth/hci_qca.c | 47 > > > > > > > ++--- > > > > > > > 1 file changed, 28 insertions(+), 19 deletions(-) > > > > > > > > > > > > > > diff --git a/drivers/bluetooth/hci_qca.c > > > > > > > b/drivers/bluetooth/hci_qca.c > > > > > > > index 237aea34b69f..0a5c98d46864 100644 > > > > > > > --- a/drivers/bluetooth/hci_qca.c > > > > > > > +++ b/drivers/bluetooth/hci_qca.c > > > > > > > @@ -963,7 +963,7 @@ static int qca_set_baudrate(struct hci_dev > > > > > > > *hdev, uint8_t baudrate) > > > > > > > { > > > > > > > struct hci_uart *hu = hci_get_drvdata(hdev); > > > > > > > struct qca_data *qca = hu->priv; > > > > > > > - struct qca_serdev *qcadev; > > > > > > > + struct qca_serdev *qcadev = NULL; > > > > > > > > > > > > In many cases the only field that is accessed is > > > > > > qcadev->btsoc_type. I > > > > > > think something like 'qca_get_soc_type(struct hci_dev *hdev / struct > > > > > > hci_uart *hu)' would make things more readable. > > > > > > > > > > > [Bala]: sure will update this in other patch once this change is > > > > > landed as > > > > > this has to > > > > > go in priority as we have crash coming. > > > > > > > > That's not how things should work, especially for fairly trivial > > > > changes. It requires reviewers to first spent time to review the patch > > > > that adds clutter and later spend more time to review the one that > > > > removes it. It's also easier to get a clean patch merged in the first > > > > place, rather than a noisy one. > > > > > > > > Anyway, here is my take at it: > > > > https://lore.kernel.org/patchwork/patch/1049014/ > > > > > > > > Please help with testing for ROME, unless you disagree with the > > > > approach. 
> > > > > > > > Thanks > > > > > > > > Matthias > > > > > > Hi Matthias, > > > > > > I will test your patch and update to you, and you are correct that > > > AR3002 is > > > not part of Rome family, you should use QCA_ROME as the default > > > return of > > > qca_soc_type. > > > > Thanks for the confirmation! > > > > > Could you also loop me in > > > https://lore.kernel.org/patchwork/patch/1049014/? > > > > This patch has been superseded by a newer version: > > > > https://lore.kernel.org/patchwork/patch/1049696/ > > > > It already landed in bluetooth-next. > > > > Testing with Rome and ldisc would still be appreciated, since I don't > > have such a configuration. > > > > Thanks > > > > Matthias > > Hi Matthias, > > I verified your change and found there is another deference to serdev in the > qca_set_baudrate() function while running the ldisc proto, it will cause a > crash and need to add a check as below. Could you help to add this change? > > < serdev_device_wait_until_sent(hu->serdev, > < msecs_to_jiffies(CMD_TRANS_TIMEOUT_MS)); > --- > > if (hu->serdev) > > serdev_device_wait_until_sent(hu->serdev, > > msecs_to_jiffies(CMD_TRANS_TIMEOUT_MS)); argh, I shouldn't have missed this, thanks for testing! Here is a fix: https://lore.kernel.org/patchwork/patch/1050594/
Re: [PATCH 0/3] userfaultfd: allow to forbid unprivileged users
On Wed, Mar 13, 2019 at 01:01:40PM -0700, Mike Kravetz wrote: > On 3/13/19 11:52 AM, Andrea Arcangeli wrote: > > > > hugetlbfs is more complicated to detect, because even if you inherit > > it from fork(), the services that mounts the fs may be in a different > > container than the one that Oracle that uses userfaultfd later on down > > the road from a different context. And I don't think it would be ok to > > allow running userfaultfd just because you can open a file in an > > hugetlbfs file system. With /dev/kvm it's a bit different, that's > > chmod o-r by default.. no luser should be able to open it. > > > > Unless somebody suggests a consistent way to make hugetlbfs "just > > work" (like we could achieve clean with CRIU and KVM), I think Oracle > > will need a one liner change in the Oracle setup to echo into that > > file in addition of running the hugetlbfs mount. > > I think you are suggesting the DB setup process enable uffd for all users. > Correct? Yes. In addition of the hugetlbfs setup, various apps requires to also increase fs.inotify.max_user_watches or file-max and other tweaks, this would be one of those tweaks. > This may be too simple, and I don't really like group access, but how about > just defining a uffd group? If you are in the group you can make uffd > system calls. Everything is possible, I'm just afraid it gets too complex. So you suggest to echo a gid into the file?
[PATCH] Bluetooth: hci_qca: Fix crash with non-serdev devices
qca_set_baudrate() calls serdev_device_wait_until_sent() assuming that the HCI is always associated with a serdev device. This isn't true for ROME controllers instantiated through ldisc, where the call causes a crash due to a NULL pointer dereference. Only call the function when we have a serdev device. The timeout for ROME devices at the end of qca_set_baudrate() is long enough to be reasonably sure that the command was sent. Fixes: fa9ad876b8e0 ("Bluetooth: hci_qca: Add support for Qualcomm Bluetooth chip wcn3990") Reported-by: Balakrishna Godavarthi Reported-by: Rocky Liao Signed-off-by: Matthias Kaehlcke --- drivers/bluetooth/hci_qca.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c index 4ea995d610d2..714a6a16f9d5 100644 --- a/drivers/bluetooth/hci_qca.c +++ b/drivers/bluetooth/hci_qca.c @@ -1004,7 +1004,8 @@ static int qca_set_baudrate(struct hci_dev *hdev, uint8_t baudrate) while (!skb_queue_empty(&qca->txq)) usleep_range(100, 200); - serdev_device_wait_until_sent(hu->serdev, + if (hu->serdev) + serdev_device_wait_until_sent(hu->serdev, msecs_to_jiffies(CMD_TRANS_TIMEOUT_MS)); /* Give the controller time to process the request */ -- 2.21.0.360.g471c308f928-goog
Re: [PATCH 0/3] userfaultfd: allow to forbid unprivileged users
Hi Paolo, On Wed, Mar 13, 2019 at 03:12:28PM -0400, Paolo Bonzini wrote: > > > On Wed, Mar 13, 2019 at 09:22:31AM +0100, Paolo Bonzini wrote: > > Unless somebody suggests a consistent way to make hugetlbfs "just > > work" (like we could achieve clean with CRIU and KVM), I think Oracle > > will need a one liner change in the Oracle setup to echo into that > > file in addition of running the hugetlbfs mount. > > Hi Andrea, can you explain more in detail the risks of enabling > userfaultfd for unprivileged users? There's no more risk than in allowing mremap (direct memory overwrite after underflow) or O_DIRECT (dirtycow) for unprivileged users. If there was risk in allowing userfaultfd for unprivileged users, CAP_SYS_PTRACE would be mandatory and there wouldn't be an option to allow userfaultfd to all unprivileged users. This is just an hardening policy in kernel (for those that don't run everything under seccomp) that may even be removed later. Unlike mremap and O_DIRECT, because we've only an handful of (important) applications using userfaultfd so far, we can do like the bpf syscall: SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size) { union bpf_attr attr = {}; int err; if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN)) return -EPERM; Except we picked CAP_SYS_PTRACE because CRIU already has to run with such capability for other reasons. So this is intended as the "sysctl_unprivileged_bpf_disabled" trick applied to the bpf syscall, also applied to the userfaultfd syscall, nothing more nothing less. Then I thought we can add a tristate so an open of /dev/kvm would also allow the syscall to make things more user friendly because unprivileged containers ideally should have writable mounts done with nodev and no matter the privilege they shouldn't ever get an hold on the KVM driver (and those who do, like kubevirt, will then just work). 
There has been one complaint because userfaultfd can also be used to facilitate exploiting use-after-free bugs in other kernel code: https://cyseclabs.com/blog/linux-kernel-heap-spray This isn't a bug in userfaultfd; the only bug there is some kernel code doing a use-after-free in a reproducible place, and userfaultfd only allows stopping copy-user where the exploit needs it to be stopped. This isn't particularly concerning, because you can achieve the same objective with FUSE. In fact even if you set CONFIG_USERFAULTFD=n in the kernel config and CONFIG_FUSE_FS=n, a security bug like that can still be exploited eventually. It's just less reproducible if you can't stop copy-user. Restricting userfaultfd to privileged processes won't make such a kernel bug become harmless; it'll just require more attempts to exploit, as far as I can tell. To put things in perspective, the exploits for the most serious security bugs like the mremap missing underflow check, dirtycow or the missing stack_guard_gap would not get any facilitation from userfaultfd. I also thought we were randomizing all slab heaps by now to avoid issues like the above; I don't know if the randomized slab freelist order (CONFIG_SLAB_FREELIST_RANDOM) and the hardened freelist pointer (CONFIG_SLAB_FREELIST_HARDENED) were added precisely to avoid exploits like the above. It's not like you can disable those two options even if you set CONFIG_USERFAULTFD=n. I wonder if in that blog post the exploit was run on a kernel with those two options enabled. In any case, not allowing unprivileged processes to run the userfaultfd syscall will also provide some hardening against this kind of concern. Thanks, Andrea
Re: INFO: rcu detected stall in sys_sendfile64 (2)
On Wed, Mar 13, 2019 at 07:43:38AM +0100, 'Dmitry Vyukov' via syzkaller-bugs wrote: > > Also, humans can sometimes find more simpler C reproducers from syzbot > > provided > > reproducers. It would be nice if syzbot can accept and use a user defined C > > reproducer for testing. > > It would be more useful to accept patches that make syzkaller create > better reproducers from these people. Manual work is not scalable. We > would need 10 reproducers per day for a dozen of OSes (incl some > private kernels/branches). Anybody is free to run syzkaller manually > and do full manual (perfect) reporting. But for us it become clear > very early that it won't work. Then see above, while that human is > sleeping/on weekend/vacation, syzbot will already bisect own > reproducer. Adding manual reproducer later won't help in any way. > syzkaller already does lots of smart work for reproducers. Let's not > give up on the last mile and switch back to all manual work. > Well, it's very tough and not many people are familiar with the syzkaller codebase, let alone have time to contribute. But having simplified a lot of the syzkaller reproducers manually, the main things I do are: - Replace bare system calls with proper C library calls. For example: #include syscall(__NR_socket, 0xa, 6, 0); becomes: #include socket(AF_INET, SOCK_DCCP, 0); - Do the same for structs. Use the appropriate C header rather than filling in each struct manually. 
  For example:

	*(uint16_t*)0x2000 = 0xa;
	*(uint16_t*)0x2002 = htobe16(0x4e20);
	*(uint32_t*)0x2004 = 0;
	*(uint8_t*)0x2008 = 0;
	*(uint8_t*)0x2009 = 0;
	*(uint8_t*)0x200a = 0;
	*(uint8_t*)0x200b = 0;
	*(uint8_t*)0x200c = 0;
	*(uint8_t*)0x200d = 0;
	*(uint8_t*)0x200e = 0;
	*(uint8_t*)0x200f = 0;
	*(uint8_t*)0x2010 = 0;
	*(uint8_t*)0x2011 = 0;
	*(uint8_t*)0x2012 = 0;
	*(uint8_t*)0x2013 = 0;
	*(uint8_t*)0x2014 = 0;
	*(uint8_t*)0x2015 = 0;
	*(uint8_t*)0x2016 = 0;
	*(uint8_t*)0x2017 = 0;
	*(uint32_t*)0x2018 = 0;

  becomes:

	struct sockaddr_in6 addr = {
		.sin6_family = AF_INET6,
		.sin6_port = htobe16(0x4e20)
	};

- Put arguments on the stack rather than in a mmap'd region, if possible.

- Simplify any calls to the helper functions that syzkaller emits, e.g. syz_open_dev(), syz_kvm_setup_vcpu(), or the networking setup stuff. Usually the reproducer needs a small subset of the functionality to work.

- For multithreaded reproducers, try to incrementally simplify the threading strategy. For example, reduce the number of threads by combining operations. Also try running the operations in loops. Also, using fork() can often result in a simpler reproducer than pthreads.

- Instead of using the 'r[]' array to hold all integer return values, give them appropriate names.

- Remove duplicate #includes.

- Considering the actual kernel code and the bug, if possible find a different way to trigger the same bug that's simpler or more reliable. If the problem is obvious it may be possible to jump right to this step from the beginning.

Some gotchas:

- fault-nth injections are fragile, since the number of memory allocations in a particular system call varies by kernel config and kernel version. Incrementing n starting from 1 is more reliable.

- Some of the perf_event_open() reproducers are fragile because they hardcode a trace event ID, which can change in every kernel version. Reading the trace event ID from /sys/kernel/debug/tracing/events/ is more reliable.

- Reproducers using the KVM API sometimes only work on certain processors (e.g. Intel but not AMD) or even depend on the host kernel.

- Reproducers that access the local filesystem sometimes assume that it's ext4.
[PATCH v2] MAINTAINERS: Add KVM selftests to existing KVM entry
It's safe to assume Paolo and Radim are maintaining the KVM selftests given that the vast majority of commits have their SOBs. Play nice with get_maintainers and make it official.

Signed-off-by: Sean Christopherson
---
v2: add all KVM subdirectories

 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 8347269aa386..5e5b5424fd93 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8295,6 +8295,8 @@ F:	include/linux/kvm*
 F:	include/kvm/iodev.h
 F:	virt/kvm/*
 F:	tools/kvm/
+F:	tools/testing/selftests/kvm/
+F:	tools/testing/selftests/kvm/*/
 
 KERNEL VIRTUAL MACHINE FOR AMD-V (KVM/amd)
 M:	Joerg Roedel
-- 
2.21.0
Re: [PATCH 5/5] lib/list_sort: Optimize number of calls to comparison function
On 05/03/2019 06.58, George Spelvin wrote: > CONFIG_RETPOLINE has severely degraded indirect function call > performance, so it's worth putting some effort into reducing > the number of times cmp() is called. > > This patch avoids badly unbalanced merges on unlucky input sizes. > It slightly increases the code size, but saves an average of 0.2*n > calls to cmp(). > [snip] > > (I confess to being more than a little bit proud of how clean this > code turned out. It took a lot of thinking, but the resultant inner > loop is very simple and efficient.) > > Refs: > Bottom-up Mergesort: A Detailed Analysis > Wolfgang Panny, Helmut Prodinger > Algorithmica 14(4):340--354, October 1995 > https://doi.org/10.1007/BF01294131 > https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.5260 > > The cost distribution of queue-mergesort, optimal mergesorts, and > power-of-two rules > Wei-Mei Chen, Hsien-Kuei Hwang, Gen-Huey Chen > Journal of Algorithms 30(2); Pages 423--448, February 1999 > https://doi.org/10.1006/jagm.1998.0986 > https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.5380 > > Queue-Mergesort > Mordecai J. Golin, Robert Sedgewick > Information Processing Letters, 48(5):253--259, 10 December 1993 > https://doi.org/10.1016/0020-0190(93)90088-q > https://sci-hub.tw/10.1016/0020-0190(93)90088-Q This is beautiful. So no comments on the patch itself. One thing that might be nice would be to see the reduction in number of cmp callbacks explicitly; it should be trivial to use the priv element for that in the list_sort_test module. But to really see it one would of course have to extend that test to do a lot more different sizes of lists. And looking at the actual cmp functions used made me think we might do something similar to what you did for the swap functions for sort(), though it will require updating call sites. Suppose we have some struct foo { ... struct list_head list; ... some_integer_type key; ... 
};

and a trivial cmp function that just compares the ->key members. What if we do something like

#define LIST_SORT_SIMPLE_CMP(type, list, key) ({ \
	/* encode the size and signedness of type.key along with
	 * offsetof(type, key) - offsetof(type, list) into the low, say,
	 * 24 bits - a signed 20 bit number should be sufficient for the
	 * offsetof diff, and we can bail at compile time if too big */ \
})

then do

int do_cmp(void *priv, list_head *a, list_head *b, cmp_func cmp)
{
	if ((long)cmp & high_bits) /* it's a kernel pointer */
		return cmp(priv, a, b);
	int diff = extract_diff(cmp);
	int type = extract_type(cmp);
	void *keya = (void*)a + diff;
	void *keyb = (void*)b + diff;
	if (type == s32)
		return *(s32*)keya > *(s32*)keyb;
	if (type == u32)
		return *(u32*)keya > *(u32*)keyb;
	...
}

In practice, we'd probably only have to support 4- and 8-byte signed and unsigned versions (and we can check at compile time whether key has a supported type). Similarly one could do a SORT_SIMPLE_CMP() for when sorting an array of structs according to a single numeric member. That sort is not stable, so the comparison functions would have to do a full -1,0,1 return, of course, but we'd still avoid indirect calls.

Rasmus
Re: [PATCHv8 07/10] acpi/hmat: Register processor domain to its memory
On Mon, Mar 11, 2019 at 9:55 PM Keith Busch wrote: > > If the HMAT Subsystem Address Range provides a valid processor proximity > domain for a memory domain, or a processor domain matches the performance > access of the valid processor proximity domain, register the memory > target with that initiator so this relationship will be visible under > the node's sysfs directory. > > Since HMAT requires valid address ranges have an equivalent SRAT entry, > verify each memory target satisfies this requirement. > > Reviewed-by: Jonathan Cameron > Signed-off-by: Keith Busch Acked-by: Rafael J. Wysocki > --- > drivers/acpi/hmat/Kconfig | 3 +- > drivers/acpi/hmat/hmat.c | 392 > +- > 2 files changed, 393 insertions(+), 2 deletions(-) > > diff --git a/drivers/acpi/hmat/Kconfig b/drivers/acpi/hmat/Kconfig > index 2f7111b7af62..13cddd612a52 100644 > --- a/drivers/acpi/hmat/Kconfig > +++ b/drivers/acpi/hmat/Kconfig > @@ -4,4 +4,5 @@ config ACPI_HMAT > depends on ACPI_NUMA > help > If set, this option has the kernel parse and report the > -platform's ACPI HMAT (Heterogeneous Memory Attributes Table). > +platform's ACPI HMAT (Heterogeneous Memory Attributes Table), > +and register memory initiators with their targets. > diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c > index 4758beb3b2c1..01a6eddac6f7 100644 > --- a/drivers/acpi/hmat/hmat.c > +++ b/drivers/acpi/hmat/hmat.c > @@ -13,11 +13,105 @@ > #include > #include > #include > +#include > #include > #include > > static __initdata u8 hmat_revision; > > +static __initdata LIST_HEAD(targets); > +static __initdata LIST_HEAD(initiators); > +static __initdata LIST_HEAD(localities); > + > +/* > + * The defined enum order is used to prioritize attributes to break ties when > + * selecting the best performing node. 
> + */
> +enum locality_types {
> +	WRITE_LATENCY,
> +	READ_LATENCY,
> +	WRITE_BANDWIDTH,
> +	READ_BANDWIDTH,
> +};
> +
> +static struct memory_locality *localities_types[4];
> +
> +struct memory_target {
> +	struct list_head node;
> +	unsigned int memory_pxm;
> +	unsigned int processor_pxm;
> +	struct node_hmem_attrs hmem_attrs;
> +};
> +
> +struct memory_initiator {
> +	struct list_head node;
> +	unsigned int processor_pxm;
> +};
> +
> +struct memory_locality {
> +	struct list_head node;
> +	struct acpi_hmat_locality *hmat_loc;
> +};
> +
> +static __init struct memory_initiator *find_mem_initiator(unsigned int cpu_pxm)
> +{
> +	struct memory_initiator *initiator;
> +
> +	list_for_each_entry(initiator, &initiators, node)
> +		if (initiator->processor_pxm == cpu_pxm)
> +			return initiator;
> +	return NULL;
> +}
> +
> +static __init struct memory_target *find_mem_target(unsigned int mem_pxm)
> +{
> +	struct memory_target *target;
> +
> +	list_for_each_entry(target, &targets, node)
> +		if (target->memory_pxm == mem_pxm)
> +			return target;
> +	return NULL;
> +}
> +
> +static __init void alloc_memory_initiator(unsigned int cpu_pxm)
> +{
> +	struct memory_initiator *initiator;
> +
> +	if (pxm_to_node(cpu_pxm) == NUMA_NO_NODE)
> +		return;
> +
> +	initiator = find_mem_initiator(cpu_pxm);
> +	if (initiator)
> +		return;
> +
> +	initiator = kzalloc(sizeof(*initiator), GFP_KERNEL);
> +	if (!initiator)
> +		return;
> +
> +	initiator->processor_pxm = cpu_pxm;
> +	list_add_tail(&initiator->node, &initiators);
> +}
> +
> +static __init void alloc_memory_target(unsigned int mem_pxm)
> +{
> +	struct memory_target *target;
> +
> +	if (pxm_to_node(mem_pxm) == NUMA_NO_NODE)
> +		return;
> +
> +	target = find_mem_target(mem_pxm);
> +	if (target)
> +		return;
> +
> +	target = kzalloc(sizeof(*target), GFP_KERNEL);
> +	if (!target)
> +		return;
> +
> +	target->memory_pxm = mem_pxm;
> +	target->processor_pxm = PXM_INVAL;
> +	list_add_tail(&target->node, &targets);
> +}
> +
>  static __init const char *hmat_data_type(u8 type)
>  {
>  	switch (type) {
> 
> @@ -89,14 +183,83 @@ static __init u32 hmat_normalize(u16 entry, u64 base, u8 type)
>  	return value;
>  }
> 
> +static __init void hmat_update_target_access(struct memory_target *target,
> +					     u8 type, u32 value)
> +{
> +	switch (type) {
> +	case ACPI_HMAT_ACCESS_LATENCY:
> +		target->hmem_attrs.read_latency = value;
> +		target->hmem_attrs.write_latency = value;
> +		break;
> +	case ACPI_HMAT_READ_LATENCY:
> +		target->hmem_attrs.read_latency = value;
> +		break;
> +	case ACPI_HMAT_WRITE_LATENCY:
> +		target->hmem_attrs.write_latency = value;
> +
Re: Regression causes a hang on boot with a Comtrol PCI card
Hi Jesse,

On Wed, Mar 13, 2019 at 11:50:07AM -0500, Jesse Hathaway wrote:
> Two regressions cause Linux to hang on boot when a Comtrol PCI card
> is present.
>
> If I revert the following two commits, I can boot again and the card
> operates without issue:
>
> 1302fcf0d03e (refs/bisect/bad) PCI: Configure *all* devices, not just hot-added ones
> 1c3c5eab1715 sched/core: Enable might_sleep() and smp_processor_id() checks early

I'm very sorry about the regression, but thank you very much for narrowing it down and reporting it! How did you narrow it down to *two* commits, and do you have to revert both of them to avoid the hang? Usually a bisection identifies a single commit, and the two you mention aren't related.

> ; lspci -vs 82:00.0
> 82:00.0 Multiport serial controller: Comtrol Corporation Device 0061
>         Subsystem: Comtrol Corporation Device 0061
>         Flags: 66MHz, medium devsel, IRQ 35, NUMA node 1
>         Memory at c8004000 (32-bit, non-prefetchable) [size=4K]
>         Memory at c800 (32-bit, non-prefetchable) [size=16K]
>         Capabilities: [40] Hot-plug capable
>         Capabilities: [48] Power Management version 2
>         Kernel driver in use: rp2
>         Kernel modules: rp2
>
> Is it possible that the problem is that the card claims to support
> Hot-plug, but does not?
>
> I would love to help fix this issue, please let me know what other
> information would be helpful to provide.

Can you collect a complete dmesg log (with a working kernel) and output of "sudo lspci -vvxxx"? You can open a bug report at https://bugzilla.kernel.org, attach the logs there, and respond here with the URL.

Where does the hang happen? Is it when we configure the Comtrol card?

Bjorn
Re: [RFC][Patch v9 2/6] KVM: Enables the kernel to isolate guest free pages
On 13.03.19 23:54, Alexander Duyck wrote: > On Wed, Mar 13, 2019 at 9:39 AM David Hildenbrand wrote: >> >> On 13.03.19 17:37, Alexander Duyck wrote: >>> On Wed, Mar 13, 2019 at 5:18 AM David Hildenbrand wrote: On 13.03.19 12:54, Nitesh Narayan Lal wrote: > > On 3/12/19 5:13 PM, Alexander Duyck wrote: >> On Tue, Mar 12, 2019 at 12:46 PM Nitesh Narayan Lal >> wrote: >>> On 3/8/19 4:39 PM, Alexander Duyck wrote: On Fri, Mar 8, 2019 at 11:39 AM Nitesh Narayan Lal wrote: > On 3/8/19 2:25 PM, Alexander Duyck wrote: >> On Fri, Mar 8, 2019 at 11:10 AM Nitesh Narayan Lal >> wrote: >>> On 3/8/19 1:06 PM, Alexander Duyck wrote: On Thu, Mar 7, 2019 at 6:32 PM Michael S. Tsirkin wrote: > On Thu, Mar 07, 2019 at 02:35:53PM -0800, Alexander Duyck wrote: >> The only other thing I still want to try and see if I can do is >> to add >> a jiffies value to the page private data in the case of the buddy >> pages. > Actually there's one extra thing I think we should do, and that > is make > sure we do not leave less than X% off the free memory at a time. > This way chances of triggering an OOM are lower. If nothing else we could probably look at doing a watermark of some sort so we have to have X amount of memory free but not hinted before we will start providing the hints. It would just be a matter of tracking how much memory we have hinted on versus the amount of memory that has been pulled from that pool. >>> This is to avoid false OOM in the guest? >> Partially, though it would still be possible. Basically it would just >> be a way of determining when we have hinted "enough". Basically it >> doesn't do us much good to be hinting on free memory if the guest is >> already constrained and just going to reallocate the memory shortly >> after we hinted on it. The idea is with a watermark we can avoid >> hinting until we start having pages that are actually going to stay >> free for a while. 
>> It is another reason why we probably want a bit in the buddy pages somewhere to indicate if a page has been hinted or not as we can then use that to determine if we have to account for it in the statistics. >>> The one benefit which I can see of having an explicit bit is that it >>> will help us to have a single hook away from the hot path within >>> buddy >>> merging code (just like your arch_merge_page) and still avoid >>> duplicate >>> hints while releasing pages. >>> >>> I still have to check PG_idle and PG_young which you mentioned but I >>> don't think we can reuse any existing bits. >> Those are bits that are already there for 64b. I think those exist in >> the page extension for 32b systems. If I am not mistaken they are >> only >> used in VMA mapped memory. What I was getting at is that those are >> the >> bits we could think about reusing. >> >>> If we really want to have something like a watermark, then can't we >>> use >>> zone->free_pages before isolating to see how many free pages are >>> there >>> and put a threshold on it? (__isolate_free_page() does a similar >>> thing >>> but it does that on per request basis). >> Right. That is only part of it though since that tells you how many >> free pages are there. But how many of those free pages are hinted? >> That is the part we would need to track separately and then then >> compare to free_pages to determine if we need to start hinting on >> more >> memory or not. > Only pages which are isolated will be hinted, and once a page is > isolated it will not be counted in the zone free pages. > Feel free to correct me if I am wrong. You are correct up to here. When we isolate the page it isn't counted against the free pages. However after we complete the hint we end up taking it out of isolation and returning it to the "free" state, so it will be counted against the free pages. > If I am understanding it correctly you only want to hint the idle > pages, > is that right? 
Getting back to the ideas from our earlier discussion, we had 3 stages for things. Free but not hinted, isolated due to hinting, and free and hinted. So what we would need to do is identify the size of the first pool that is free and not hinted by knowing the total number of free
Re: [PATCHv8 06/10] node: Add memory-side caching attributes
On Mon, Mar 11, 2019 at 9:55 PM Keith Busch wrote: > > System memory may have caches to help improve access speed to frequently > requested address ranges. While the system provided cache is transparent > to the software accessing these memory ranges, applications can optimize > their own access based on cache attributes. > > Provide a new API for the kernel to register these memory-side caches > under the memory node that provides it. > > The new sysfs representation is modeled from the existing cpu cacheinfo > attributes, as seen from /sys/devices/system/cpu//cache/. Unlike CPU > cacheinfo though, the node cache level is reported from the view of the > memory. A higher level number is nearer to the CPU, while lower levels > are closer to the last level memory. > > The exported attributes are the cache size, the line size, associativity > indexing, and write back policy, and add the attributes for the system > memory caches to sysfs stable documentation. > > Signed-off-by: Keith Busch Reviewed-by: Rafael J. Wysocki > --- > Documentation/ABI/stable/sysfs-devices-node | 34 +++ > drivers/base/node.c | 151 > > include/linux/node.h| 39 +++ > 3 files changed, 224 insertions(+) > > diff --git a/Documentation/ABI/stable/sysfs-devices-node > b/Documentation/ABI/stable/sysfs-devices-node > index 735a40a3f9b2..f7ce68fbd4b9 100644 > --- a/Documentation/ABI/stable/sysfs-devices-node > +++ b/Documentation/ABI/stable/sysfs-devices-node > @@ -142,3 +142,37 @@ Contact: Keith Busch > Description: > This node's write latency in nanoseconds when access > from nodes found in this class's linked initiators. > + > +What: /sys/devices/system/node/nodeX/memory_side_cache/indexY/ > +Date: December 2018 > +Contact: Keith Busch > +Description: > + The directory containing attributes for the memory-side cache > + level 'Y'. 
> + > +What: > /sys/devices/system/node/nodeX/memory_side_cache/indexY/indexing > +Date: December 2018 > +Contact: Keith Busch > +Description: > + The caches associativity indexing: 0 for direct mapped, > + non-zero if indexed. > + > +What: > /sys/devices/system/node/nodeX/memory_side_cache/indexY/line_size > +Date: December 2018 > +Contact: Keith Busch > +Description: > + The number of bytes accessed from the next cache level on a > + cache miss. > + > +What: /sys/devices/system/node/nodeX/memory_side_cache/indexY/size > +Date: December 2018 > +Contact: Keith Busch > +Description: > + The size of this memory side cache in bytes. > + > +What: > /sys/devices/system/node/nodeX/memory_side_cache/indexY/write_policy > +Date: December 2018 > +Contact: Keith Busch > +Description: > + The cache write policy: 0 for write-back, 1 for write-through, > + other or unknown. > diff --git a/drivers/base/node.c b/drivers/base/node.c > index 2de546a040a5..8598fcbd2a17 100644 > --- a/drivers/base/node.c > +++ b/drivers/base/node.c > @@ -205,6 +205,155 @@ void node_set_perf_attrs(unsigned int nid, struct > node_hmem_attrs *hmem_attrs, > } > } > } > + > +/** > + * struct node_cache_info - Internal tracking for memory node caches > + * @dev: Device represeting the cache level > + * @node: List element for tracking in the node > + * @cache_attrs:Attributes for this cache level > + */ > +struct node_cache_info { > + struct device dev; > + struct list_head node; > + struct node_cache_attrs cache_attrs; > +}; > +#define to_cache_info(device) container_of(device, struct node_cache_info, > dev) > + > +#define CACHE_ATTR(name, fmt) \ > +static ssize_t name##_show(struct device *dev, \ > + struct device_attribute *attr, \ > + char *buf) \ > +{ \ > + return sprintf(buf, fmt "\n", to_cache_info(dev)->cache_attrs.name);\ > +} \ > +DEVICE_ATTR_RO(name); > + > +CACHE_ATTR(size, "%llu") > +CACHE_ATTR(line_size, "%u") > +CACHE_ATTR(indexing, "%u") > +CACHE_ATTR(write_policy, "%u") > + > +static 
struct attribute *cache_attrs[] = {
> +	&dev_attr_indexing.attr,
> +	&dev_attr_size.attr,
> +	&dev_attr_line_size.attr,
> +	&dev_attr_write_policy.attr,
> +	NULL,
> +};
> +ATTRIBUTE_GROUPS(cache);
> +
> +static void node_cache_release(struct device *dev)
> +{
> +	kfree(dev);
> +}
> +
> +static void node_cacheinfo_release(struct device *dev)
> +{
> +	struct node_cache_info *info =
Re: [PATCH 1/5] lib/sort: Make swap functions more generic
Thank you for your thoughtful comments!

On Wed, 13 Mar 2019 at 23:23:44 +0100, Rasmus Villemoes wrote:
> On 21/02/2019 07.30, George Spelvin wrote:
>> + * @align: required aignment (typically 4 or 8)
>
> typo aLignment

Thanks; fixed!

>> + * Returns true if elements can be copied using word loads and stores.
>> + * The size must be a multiple of the alignment, and the base address must
>> + * be if we do not have CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.
>
> I wonder if we shouldn't unconditionally require the base to be aligned.
> It of course depends on what exactly 'efficient' means, but if the base
> is not aligned, we know that every single load and store will be
> unaligned, and we're doing lots of them. IIRC even on x86 it's usually
> two cycles instead of one, and even more when we cross a cache line
> boundary (which we do rather often - e.g. in your 40 byte example,
> almost every swap will hit at least one). One could also have some data
> that is naturally 4-byte aligned with an element size that happens to be
> a multiple of 8 bytes; it would be a bit sad to use 8-byte accesses when
> 4-byte would all have been aligned.

Well, a 2-cycle unaligned access is still lots faster than 4 byte accesses. I think that's a decent interpretation of "EFFICIENT_UNALIGNED_ACCESS": faster than doing it a byte at a time. So the change is a net win; we're just wondering if it could be optimized more.

The 4/8 issue is similar, but not as extreme. Is one unaligned 64-bit access faster than two aligned 32-bit accesses? As you note, it's usually only twice the cost, so it's no slower, and there's less loop overhead. So I don't think it's a bad thing here, either.

Re your comments on cache lines, ISTR that x86 can do an unaligned load with minimal penalty as long as it doesn't cross a cache line:
https://software.intel.com/en-us/articles/reducing-the-impact-of-misaligned-memory-accesses/
So you win on most of the accesses, hopefully enough to pay for the one unaligned access.
Apparently in processors more recent than the P4 example Intel used above, it's even less:
https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/

On 32-bit machines, it's actually a 4-byte swap, unrolled twice; there are no 64-bit memory accesses. So the concern is only about 8-byte alignment on 64-bit machines. The great majority of call sites sort structures with pointer or long members, so are aligned and the question is moot. I don't think it's worth overthinking the issue on behalf of the performance of some rare corner cases. I have considered doing away with the two word-copy loops and just having one machine-word-sized loop plus a byte fallback.

>> +	unsigned int lsbits = (unsigned int)size;
>
> Drop that cast.

Actually, in response to an earlier comment, I changed it (and the type of lsbits) to (unsigned char), to emphasize that I *really* only care about a handful of low bits. I know gcc doesn't warn about implicit narrowing casts, but I prefer to make them explicit for documentation reasons. "Yes, I really mean to throw away the high bits." Compilers on machines without native byte operations are very good at using larger registers and optimizing away mask operations. (Remember that this is inlined, and "align" is a constant 4 or 8.)

>> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>> +	(void)base;
>> +#else
>> +	lsbits |= (unsigned int)(size_t)base;
>
> The kernel usually casts pointers to long or unsigned long. If some
> oddball arch has size_t something other than unsigned long, this would
> give a "warning: cast from pointer to integer of different size". So
> just cast to unsigned long, and drop the cast to unsigned int.

I changed this to uintptr_t and called it good.
>> static void u32_swap(void *a, void *b, int size)
>> {
>> -	u32 t = *(u32 *)a;
>> -	*(u32 *)a = *(u32 *)b;
>> -	*(u32 *)b = t;
>> +	size_t n = size;
>> +
>
> Since the callback has int in its prototype (we should fix that globally
> at some point...), might as well use a four-byte local variable, to
> avoid rex prefix on x86-64. So just make that "unsigned" or "u32".

Agreed about the eventual global fix. If you care, the only places a non-NULL swap function is passed in are:

- arch/arc/kernel/unwind.c
- arch/powerpc/kernel/module_32.c
- arch/powerpc/kernel/module_64.c
! arch/x86/kernel/unwind_orc.c (2x)
- fs/ocfs2/dir.c
- fs/ocfs2/refcounttree.c (3x)
- fs/ocfs2/xattr.c (3x)
- fs/ubifs/find.c
! kernel/jump_label.c
! lib/extable.c

The ones marked with "-" are simple memory swaps that could (should!) be replaced with NULL. The ones marked with "!" actually do something non-trivial.

Actually, I deliberately used a pointer-sized index, since the loop uses (a + n) addressing. Hardware indexed addressing on 64-bit machines generally (I vaguely recall Aarch64 is an exception) requires a 64-bit index; using a smaller index requires some masking or sign-extending. I thought it was safer to bet that the
Re: [PATCHv8 05/10] node: Add heterogenous memory access attributes
On Mon, Mar 11, 2019 at 9:55 PM Keith Busch wrote: > > Heterogeneous memory systems provide memory nodes with different latency > and bandwidth performance attributes. Provide a new kernel interface > for subsystems to register the attributes under the memory target > node's initiator access class. If the system provides this information, > applications may query these attributes when deciding which node to > request memory. > > The following example shows the new sysfs hierarchy for a node exporting > performance attributes: > > # tree -P "read*|write*"/sys/devices/system/node/nodeY/accessZ/initiators/ > /sys/devices/system/node/nodeY/accessZ/initiators/ > |-- read_bandwidth > |-- read_latency > |-- write_bandwidth > `-- write_latency > > The bandwidth is exported as MB/s and latency is reported in > nanoseconds. The values are taken from the platform as reported by the > manufacturer. > > Memory accesses from an initiator node that is not one of the memory's > access "Z" initiator nodes linked in the same directory may observe > different performance than reported here. When a subsystem makes use > of this interface, initiators of a different access number may not have > the same performance relative to initiators in other access numbers, or > omitted from the any access class' initiators. > > Descriptions for memory access initiator performance access attributes > are added to sysfs stable documentation. > > Acked-by: Jonathan Cameron > Tested-by: Jonathan Cameron > Signed-off-by: Keith Busch Reviewed-by: Rafael J. 
Wysocki > --- > Documentation/ABI/stable/sysfs-devices-node | 28 ++ > drivers/base/Kconfig| 8 > drivers/base/node.c | 59 > + > include/linux/node.h| 26 + > 4 files changed, 121 insertions(+) > > diff --git a/Documentation/ABI/stable/sysfs-devices-node > b/Documentation/ABI/stable/sysfs-devices-node > index 433bcc04e542..735a40a3f9b2 100644 > --- a/Documentation/ABI/stable/sysfs-devices-node > +++ b/Documentation/ABI/stable/sysfs-devices-node > @@ -114,3 +114,31 @@ Contact: Keith Busch > Description: > The directory containing symlinks to memory targets that > this initiator node has class "Y" access. > + > +What: > /sys/devices/system/node/nodeX/accessY/initiators/read_bandwidth > +Date: December 2018 > +Contact: Keith Busch > +Description: > + This node's read bandwidth in MB/s when accessed from > + nodes found in this access class's linked initiators. > + > +What: /sys/devices/system/node/nodeX/accessY/initiators/read_latency > +Date: December 2018 > +Contact: Keith Busch > +Description: > + This node's read latency in nanoseconds when accessed > + from nodes found in this access class's linked initiators. > + > +What: > /sys/devices/system/node/nodeX/accessY/initiators/write_bandwidth > +Date: December 2018 > +Contact: Keith Busch > +Description: > + This node's write bandwidth in MB/s when accessed from > + found in this access class's linked initiators. > + > +What: > /sys/devices/system/node/nodeX/accessY/initiators/write_latency > +Date: December 2018 > +Contact: Keith Busch > +Description: > + This node's write latency in nanoseconds when access > + from nodes found in this class's linked initiators. > diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig > index 059700ea3521..a7438a58c250 100644 > --- a/drivers/base/Kconfig > +++ b/drivers/base/Kconfig > @@ -149,6 +149,14 @@ config DEBUG_TEST_DRIVER_REMOVE > unusable. You should say N here unless you are explicitly looking to > test this functionality. 
> > +config HMEM_REPORTING > + bool > + default n > + depends on NUMA > + help > + Enable reporting for heterogenous memory access attributes under > + their non-uniform memory nodes. > + > source "drivers/base/test/Kconfig" > > config SYS_HYPERVISOR > diff --git a/drivers/base/node.c b/drivers/base/node.c > index 6f4097680580..2de546a040a5 100644 > --- a/drivers/base/node.c > +++ b/drivers/base/node.c > @@ -71,6 +71,9 @@ struct node_access_nodes { > struct device dev; > struct list_headlist_node; > unsignedaccess; > +#ifdef CONFIG_HMEM_REPORTING > + struct node_hmem_attrs hmem_attrs; > +#endif > }; > #define to_access_nodes(dev) container_of(dev, struct node_access_nodes, dev) > > @@ -148,6 +151,62 @@ static struct node_access_nodes > *node_init_node_access(struct node *node, > return NULL; > } > > +#ifdef CONFIG_HMEM_REPORTING > +#define ACCESS_ATTR(name) \ > +static ssize_t name##_show(struct
Re: [PATCHv8 04/10] node: Link memory nodes to their compute nodes
On Mon, Mar 11, 2019 at 9:55 PM Keith Busch wrote: > > Systems may be constructed with various specialized nodes. Some nodes > may provide memory, some provide compute devices that access and use > that memory, and others may provide both. Nodes that provide memory are > referred to as memory targets, and nodes that can initiate memory access > are referred to as memory initiators. > > Memory targets will often have varying access characteristics from > different initiators, and platforms may have ways to express those > relationships. In preparation for these systems, provide interfaces for > the kernel to export the memory relationship among different nodes memory > targets and their initiators with symlinks to each other. > > If a system provides access locality for each initiator-target pair, nodes > may be grouped into ranked access classes relative to other nodes. The > new interface allows a subsystem to register relationships of varying > classes if available and desired to be exported. > > A memory initiator may have multiple memory targets in the same access > class. The target memory's initiators in a given class indicate the > nodes access characteristics share the same performance relative to other > linked initiator nodes. Each target within an initiator's access class, > though, do not necessarily perform the same as each other. > > A memory target node may have multiple memory initiators. All linked > initiators in a target's class have the same access characteristics to > that target. 
> > The following example show the nodes' new sysfs hierarchy for a memory > target node 'Y' with access class 0 from initiator node 'X': > > # symlinks -v /sys/devices/system/node/nodeX/access0/ > relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> > ../../nodeY > > # symlinks -v /sys/devices/system/node/nodeY/access0/ > relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> > ../../nodeX > > The new attributes are added to the sysfs stable documentation. > > Reviewed-by: Jonathan Cameron > Signed-off-by: Keith Busch Reviewed-by: Rafael J. Wysocki > --- > Documentation/ABI/stable/sysfs-devices-node | 25 - > drivers/base/node.c | 142 > +++- > include/linux/node.h| 6 ++ > 3 files changed, 171 insertions(+), 2 deletions(-) > > diff --git a/Documentation/ABI/stable/sysfs-devices-node > b/Documentation/ABI/stable/sysfs-devices-node > index 3e90e1f3bf0a..433bcc04e542 100644 > --- a/Documentation/ABI/stable/sysfs-devices-node > +++ b/Documentation/ABI/stable/sysfs-devices-node > @@ -90,4 +90,27 @@ Date:December 2009 > Contact: Lee Schermerhorn > Description: > The node's huge page size control/query attributes. > - See Documentation/admin-guide/mm/hugetlbpage.rst > \ No newline at end of file > + See Documentation/admin-guide/mm/hugetlbpage.rst > + > +What: /sys/devices/system/node/nodeX/accessY/ > +Date: December 2018 > +Contact: Keith Busch > +Description: > + The node's relationship to other nodes for access class "Y". > + > +What: /sys/devices/system/node/nodeX/accessY/initiators/ > +Date: December 2018 > +Contact: Keith Busch > +Description: > + The directory containing symlinks to memory initiator > + nodes that have class "Y" access to this target node's > + memory. CPUs and other memory initiators in nodes not in > + the list accessing this node's memory may have different > + performance. 
> + > +What: /sys/devices/system/node/nodeX/accessY/targets/ > +Date: December 2018 > +Contact: Keith Busch > +Description: > + The directory containing symlinks to memory targets that > + this initiator node has class "Y" access. > diff --git a/drivers/base/node.c b/drivers/base/node.c > index 86d6cd92ce3d..6f4097680580 100644 > --- a/drivers/base/node.c > +++ b/drivers/base/node.c > @@ -17,6 +17,7 @@ > #include > #include > #include > +#include > #include > #include > > @@ -59,6 +60,94 @@ static inline ssize_t node_read_cpulist(struct device *dev, > static DEVICE_ATTR(cpumap, S_IRUGO, node_read_cpumask, NULL); > static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL); > > +/** > + * struct node_access_nodes - Access class device to hold user visible > + * relationships to other nodes. > + * @dev: Device for this memory access class > + * @list_node: List element in the node's access list > + * @access:The access class rank > + */ > +struct node_access_nodes { > + struct device dev; > + struct list_headlist_node; > + unsignedaccess; > +}; > +#define to_access_nodes(dev) container_of(dev, struct node_access_nodes, dev) > + > +static
Re: [PATCH] percpu/module reservation: change reservation size iff X86_VSMP is set
On Wed, Mar 13, 2019 at 02:29:12PM -0700, Tejun Heo wrote: > Hello, > > On Wed, Mar 13, 2019 at 02:22:55PM -0700, Paul E. McKenney wrote: > > Should I define DEFINE_SRCU() and DEFINE_STATIC_SRCU() only if > > !defined(MODULE)? > > Yeah, that sounds like a great idea with comments explaining why it's > like that. Like this? * Build-time srcu_struct definition is not allowed in modules because * otherwise it is necessary to increase the size of the reserved region * each time a DEFINE_SRCU() or DEFINE_STATIC_SRCU() are added to a * kernel module. Kernel modules should instead declare an srcu_struct * and then invoke init_srcu_struct() from their module_init function and * cleanup_srcu_struct() from their module_exit function. Note that modules * using call_srcu() will also need to invoke srcu_barrier() from their * module_exit function. Also, it looks like Barret beat me to this suggestion. ;-) In addition, rcutorture and rcuperf needed to be updated because they used to use DEFINE_STATIC_SRCU() whether built in or built as a loadable module. How does the (very lightly tested) patch below look to you all? Thanx, Paul commit 34f67df09cc0c6bf082a7cfca435373caeeb8d82 Author: Paul E. McKenney Date: Wed Mar 13 16:06:22 2019 -0700 srcu: Forbid DEFINE{,_STATIC}_SRCU() from modules Adding DEFINE_SRCU() or DEFINE_STATIC_SRCU() to a loadable module requires that the size of the reserved region be increased, which is not something we want to be doing all that often. Instead, loadable modules should define an srcu_struct and invoke init_srcu_struct() from their module_init function and cleanup_srcu_struct() from their module_exit function. Note that modules using call_srcu() will also need to invoke srcu_barrier() from their module_exit function. This commit enforces this advice by refusing to define DEFINE_SRCU() and DEFINE_STATIC_SRCU() within loadable modules. Suggested-by: Barret Rhoden Signed-off-by: Paul E. 
McKenney diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h index 7f7c8c050f63..ac5ea1c72e97 100644 --- a/include/linux/srcutree.h +++ b/include/linux/srcutree.h @@ -105,6 +105,15 @@ struct srcu_struct { * Define and initialize a srcu struct at build time. * Do -not- call init_srcu_struct() nor cleanup_srcu_struct() on it. * + * Build-time srcu_struct definition is not allowed in modules because + * otherwise it is necessary to increase the size of the reserved region + * each time a DEFINE_SRCU() or DEFINE_STATIC_SRCU() are added to a + * kernel module. Kernel modules should instead declare an srcu_struct + * and then invoke init_srcu_struct() from their module_init function and + * cleanup_srcu_struct() from their module_exit function. Note that modules + * using call_srcu() will also need to invoke srcu_barrier() from their + * module_exit function. + * * Note that although DEFINE_STATIC_SRCU() hides the name from other * files, the per-CPU variable rules nevertheless require that the * chosen name be globally unique. These rules also prohibit use of @@ -120,11 +129,13 @@ struct srcu_struct { * * See include/linux/percpu-defs.h for the rules on per-CPU variables. 
*/ -#define __DEFINE_SRCU(name, is_static) \ - static DEFINE_PER_CPU(struct srcu_data, name##_srcu_data);\ +#ifndef MODULE +# define __DEFINE_SRCU(name, is_static) \ + static DEFINE_PER_CPU(struct srcu_data, name##_srcu_data); \ is_static struct srcu_struct name = __SRCU_STRUCT_INIT(name, name##_srcu_data) -#define DEFINE_SRCU(name) __DEFINE_SRCU(name, /* not static */) -#define DEFINE_STATIC_SRCU(name) __DEFINE_SRCU(name, static) +# define DEFINE_SRCU(name) __DEFINE_SRCU(name, /* not static */) +# define DEFINE_STATIC_SRCU(name) __DEFINE_SRCU(name, static) +#endif void synchronize_srcu_expedited(struct srcu_struct *ssp); void srcu_barrier(struct srcu_struct *ssp); diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c index c29761152874..b44208b3bf95 100644 --- a/kernel/rcu/rcuperf.c +++ b/kernel/rcu/rcuperf.c @@ -139,6 +139,7 @@ struct rcu_perf_ops { void (*sync)(void); void (*exp_sync)(void); const char *name; + const char *altname; }; static struct rcu_perf_ops *cur_ops; @@ -186,8 +187,16 @@ static struct rcu_perf_ops rcu_ops = { * Definitions for srcu perf testing. */ +static struct srcu_struct *srcu_ctlp; + +#ifndef MODULE DEFINE_STATIC_SRCU(srcu_ctl_perf); -static struct srcu_struct *srcu_ctlp = &srcu_ctl_perf; + +static void srcu_sync_perf_init(void) +{ + srcu_ctlp = &srcu_ctl_perf; +} +#endif static int srcu_perf_read_lock(void) __acquires(srcu_ctlp) { @@ -224,9 +233,10 @@ static void
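The usage discipline the patch enforces can be sketched in user space. All names below (srcu_like, init_srcu_like, and so on) are hypothetical and exist only to illustrate the built-in-vs-module split: built-in code may use a build-time definition, while code compiled with MODULE defined gets no such macro and must pair a runtime init with a runtime cleanup.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for srcu_struct; illustrative only, not kernel code. */
struct srcu_like {
	bool initialized;
	int  readers;
};

#ifndef MODULE
/* Built-in code path: a build-time definition is allowed. */
# define DEFINE_SRCU_LIKE(name) struct srcu_like name = { .initialized = true }
DEFINE_SRCU_LIKE(builtin_srcu);
#endif

/* Module code path: no convenience macro, so the runtime pair must be used,
 * mirroring init_srcu_struct()/cleanup_srcu_struct() in the advice above. */
static int init_srcu_like(struct srcu_like *s)
{
	s->initialized = true;
	s->readers = 0;
	return 0;
}

static void cleanup_srcu_like(struct srcu_like *s)
{
	assert(s->readers == 0);	/* like srcu_barrier() before teardown */
	s->initialized = false;
}
```

A module would call init_srcu_like() from its module_init function and cleanup_srcu_like() from its module_exit function, which is exactly the pattern the commit message prescribes.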
Re: [PATCH] EDAC, {skx|i10nm}_edac: Fix randconfig build error
On Wed, Mar 06, 2019 at 09:15:13PM +0100, Arnd Bergmann wrote: > On Wed, Mar 6, 2019 at 6:58 PM Luck, Tony wrote: > > From: Qiuxu Zhuo > > > > This seems cleaner than adding all the EXPORTs to skx_common.c > > I also tried a build with the 0x8A152468-config.gz that Arnd > > supplied. > > It's still a bit fragile since you do something that kbuild doesn't > expect with having a file in both a module and built-in code > in some configurations. I'm fairly sure this version works today, > but it would break the next time that file gets changed to include > a reference to THIS_MODULE, or anything else that is different > between built-in and modular code. > > Another alternative would be to mark all symbols in this file > 'static' and then include skx_common.c from the other two files. > Of course that is also ugly and it increases the overall code size, > so I don't see a way to win this. Boris, So where should we go? Proposed solutions are piling up: 1) Make skx_common a module [downside: have to EXPORT everything in it] 2) Move module-ish bits out of skx_common [downside: perhaps fragile] 3) #include skx_common.c into {skx,i10nm}_edac.c [downside: no patch yet, bigger code size] -Tony
Re: [PATCH 2/2] x86/mmap: handle worst-case heap randomization in mmap_base
On Tue, Mar 12, 2019 at 10:33 AM Ali Saidi wrote: > > Increase mmap_base by the worst-case brk randomization so that > the stack and heap remain apart. > > In Linux 4.13 a change was committed that special cased the kernel ELF > loader when the loader is invoked directly (eab09532d400; binfmt_elf: use > ELF_ET_DYN_BASE only for PIE). Generally, the loader isn’t invoked > directly and this issue is limited to cases where it is, (e.g to set a > non-inheritable LD_LIBRARY_PATH, testing new versions of the loader). In > those rare cases, the loader doesn't take into account the amount of brk > randomization that will be applied by arch_randomize_brk(). This can > lead to the stack and heap being arbitrarily close to each other. In the case of using the loader directly, brk (so helpfully identified as "[heap]") is allocated with the _loader_ not the binary. For example, with ASLR entirely disabled, you can see this more clearly: $ /bin/cat /proc/self/maps 4000-c000 r-xp fd:02 34603015 /bin/cat 5575b000-5575c000 r--p 7000 fd:02 34603015 /bin/cat 5575c000-5575d000 rw-p 8000 fd:02 34603015 /bin/cat 5575d000-5577e000 rw-p 00:00 0 [heap] ... 77ff7000-77ffa000 r--p 00:00 0 [vvar] 77ffa000-77ffc000 r-xp 00:00 0 [vdso] 77ffc000-77ffd000 r--p 00027000 fd:02 49287483 /lib/x86_64-linux-gnu/ld-2.27.so 77ffd000-77ffe000 rw-p 00028000 fd:02 49287483 /lib/x86_64-linux-gnu/ld-2.27.so 77ffe000-77fff000 rw-p 00:00 0 7ffde000-7000 rw-p 00:00 0 [stack] $ /lib/x86_64-linux-gnu/ld-2.27.so /bin/cat /proc/self/maps ... 
77bcc000-77bd4000 r-xp fd:02 34603015 /bin/cat 77bd4000-77dd3000 ---p 8000 fd:02 34603015 /bin/cat 77dd3000-77dd4000 r--p 7000 fd:02 34603015 /bin/cat 77dd4000-77dd5000 rw-p 8000 fd:02 34603015 /bin/cat 77dd5000-77dfc000 r-xp fd:02 49287483 /lib/x86_64-linux-gnu/ld-2.27.so 77fb2000-77fd6000 rw-p 00:00 0 77ff7000-77ffa000 r--p 00:00 0 [vvar] 77ffa000-77ffc000 r-xp 00:00 0 [vdso] 77ffc000-77ffd000 r--p 00027000 fd:02 49287483 /lib/x86_64-linux-gnu/ld-2.27.so 77ffd000-77ffe000 rw-p 00028000 fd:02 49287483 /lib/x86_64-linux-gnu/ld-2.27.so 77ffe000-7802 rw-p 00:00 0 [heap] 7ffde000-7000 rw-p 00:00 0 [stack] So I think changing this globally isn't the right solution (normally brk is between text and mmap). Adjusting the mmap base padding means we lose even more memory space. Perhaps it would be better if brk allocation would be placed before the mmap region (i.e. use ELF_ET_DYN_BASE). This seems to work for me: diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 7d09d125f148..cdaa33f4a3ef 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -1131,6 +1131,15 @@ static int load_elf_binary(struct linux_binprm *bprm) current->mm->end_data = end_data; current->mm->start_stack = bprm->p; + /* +* When executing a loader directly (ET_DYN without Interp), move +* the brk area out of the mmap region (since it grows up, and may +* collide early with the stack growing down), and into the unused +* ELF_ET_DYN_BASE region. +*/ + if (!elf_interpreter) + current->mm->brk = current->mm->start_brk = ELF_ET_DYN_BASE; + if ((current->flags & PF_RANDOMIZE) && (randomize_va_space > 1)) { current->mm->brk = current->mm->start_brk = arch_randomize_brk(current->mm); $ /lib/x86_64-linux-gnu/ld-2.27.so /bin/cat /proc/self/maps 56de3000-56e04000 rw-p 00:00 0 [heap] 7f8467da9000-7f8467f9 r-xp fd:01 399295 /lib/x86_64-linux-gnu/libc-2.27.so ... 7f846819a000-7f84681a2000 r-xp fd:01 263229 /bin/cat ... 
7f84685cb000-7f84685cc000 rw-p 00028000 fd:01 399286 /lib/x86_64-linux-gnu/ld-2.27.so 7f84685cc000-7f84685cd000 rw-p 00:00 0 7ffce68f8000-7ffce6919000 rw-p 00:00 0 [stack] 7ffce69f-7ffce69f3000 r--p 00:00 0 [vvar] 7ffce69f3000-7ffce69f4000 r-xp 00:00 0 [vdso] Does anyone see problems with this? (Note that ET_EXEC base is 0x40, so no collision there...) -- Kees Cook
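The map dumps quoted in this thread all come from /proc/self/maps. A minimal Linux-only sketch of how to locate a named region such as "[heap]" or "[stack]" programmatically (find_region and its layout are illustrative, not from the thread):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Find the address range of a named region (e.g. "[heap]" or "[stack]")
 * in this process's own /proc/self/maps -- the same file the dumps in
 * the discussion above were taken from.  Returns 1 on success. */
static int find_region(const char *name, unsigned long *start,
		       unsigned long *end)
{
	char line[512];
	FILE *f = fopen("/proc/self/maps", "r");
	int found = 0;

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		/* The region name is the last field; match it as a substring
		 * and parse the leading "start-end" hex range. */
		if (strstr(line, name) &&
		    sscanf(line, "%lx-%lx", start, end) == 2) {
			found = 1;
			break;
		}
	}
	fclose(f);
	return found;
}
```

Comparing the ranges returned for "[heap]" and "[stack]" is one way to observe, as the discussion does, how close the two regions end up under a given brk placement.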
Re: overlayfs vs. fscrypt
On Wed, Mar 13, 2019 at 03:29:43PM -0700, James Bottomley wrote: > On Wed, 2019-03-13 at 15:13 -0700, Eric Biggers wrote: > > On Wed, Mar 13, 2019 at 02:04:29PM -0700, James Bottomley wrote: > > > On Wed, 2019-03-13 at 13:25 -0700, Eric Biggers wrote: > > > > On Wed, Mar 13, 2019 at 01:06:06PM -0700, James Bottomley wrote: > > > > > On Wed, 2019-03-13 at 12:57 -0700, Eric Biggers wrote: > > > > > > [...] > > > > > > fscrypt would allow the data to be stored encrypted on the > > > > > > local disk, so it's protected against offline compromise of > > > > > > the disk. > > > > > > > > > > Container images are essentially tars of the overlays. They > > > > > only become actual filesystems when instantiated at > > > > > runtime. The current encrypted container image is an overlay > > > > > or set of overlays which is tarred then encrypted. So to > > > > > instantiate it is decrypted then untarred. > > > > > > > > > > The thing I was wondering about was whether instead of a tar > > > > > encrypt we could instead produce an encrypted image from a > > > > > fscrypt filesystem. > > > > > > > > > > > > > Why do you care whether the container image is encrypted on the > > > > local disk, when you're extracting it in plaintext onto the local > > > > disk anyway each time it runs? Even after the runtime files are > > > > "deleted", they may still be recoverable from the disk. Are you > > > > using shred and BLKSECDISCARD, and a non-COW filesystem? > > > > > > > > Now, if you wanted to avoid writing the plaintext to disk > > > > entirely (and thereby use encryption to actually achieve a useful > > > > security property that can't be achieved through file > > > > permissions), fscrypt is a good solution for that. > > > > > > OK let's start with a cloud and container 101: A container is an > > > exactly transportable IaaS environment containing an > > > application. 
The format for the exact transport is the "container > > > image" I've been describing (layered tar file set deployed with > > > overlays). These images are usually stored in cloud based > > > registries which may or may not have useful access controls. I > > > take it the reason for image encryption to protect confidentiality > > > within the registry is obvious. > > > > > > Because of the exact transport, the deployment may be on my laptop, > > > on my test system or in some type of public or private cloud. In > > > all cases bar the laptop, I won't actually own the physical system > > > which ends up deploying the container. So in exchange for security > > > guarantees from the physical system owner, I agree to turn over my > > > decryption key and possibly a cash payment. One of these > > > guarantees is usually that they shred the key after use and that > > > they deploy a useful key escrow system like vault or keyprotect to > > > guard it even while the decryption is being done. > > > > > > > Another is that all traces of the container be shredded after the > > > execution is finished. > > > > Well, sounds like that's not the case currently even with an > > encrypted container image, because the actual runtime files are not > > encrypted on disk. > > Shredding means destroying all trace including in the on-disk image. > However, one problem with the current implementation is there's a > window between container run and container stop where the unencrypted > files are in memory and on local disk. Access or cockup in that window > can leak confidential data. Well, another problem is that it almost certainly doesn't actually work, because some plaintext will still be recoverable from disk or the raw flash. Actually erasing data on modern filesystems and storage devices is extremely difficult. To start, you'd have to run BLKSECDISCARD on every single block ever written to disk to prepare the container, or written by the container during its execution. 
That's one reason to use storage encryption: it reduces the secure deletion problem to just erasing the encryption key. > > > Encrypting the runtime files using fscrypt with an ephemeral key > > would be useful here. IOW, randomly generate an encryption key when > > the container starts, never store it anywhere, and wipe it when the > > container stops. > > > > Note that this is separate from the container *image* encryption. > > Actually, that was my original thought: it needn't be. If fscrypt can > usefully add runtime security, then we could have the encrypted layer > be simply an fscrypt image ... I presume without the key we can create > a tar image of an fscrypt that is encrypted and would still be visible > on untar if we did have the key? So the encrypted layer would be a tar > of the fscrypt filesystem without the key. > fscrypt doesn't support backup and restore without the key, so no you can't do that. But there wouldn't be much benefit for doing it that way anyway, since you'd still have to untar the container image each time the container is started, then
Re: [RFC][Patch v9 2/6] KVM: Enables the kernel to isolate guest free pages
On Wed, Mar 13, 2019 at 9:39 AM David Hildenbrand wrote: > > On 13.03.19 17:37, Alexander Duyck wrote: > > On Wed, Mar 13, 2019 at 5:18 AM David Hildenbrand wrote: > >> > >> On 13.03.19 12:54, Nitesh Narayan Lal wrote: > >>> > >>> On 3/12/19 5:13 PM, Alexander Duyck wrote: > On Tue, Mar 12, 2019 at 12:46 PM Nitesh Narayan Lal > wrote: > > On 3/8/19 4:39 PM, Alexander Duyck wrote: > >> On Fri, Mar 8, 2019 at 11:39 AM Nitesh Narayan Lal > >> wrote: > >>> On 3/8/19 2:25 PM, Alexander Duyck wrote: > On Fri, Mar 8, 2019 at 11:10 AM Nitesh Narayan Lal > wrote: > > On 3/8/19 1:06 PM, Alexander Duyck wrote: > >> On Thu, Mar 7, 2019 at 6:32 PM Michael S. Tsirkin > >> wrote: > >>> On Thu, Mar 07, 2019 at 02:35:53PM -0800, Alexander Duyck wrote: > The only other thing I still want to try and see if I can do is > to add > a jiffies value to the page private data in the case of the buddy > pages. > >>> Actually there's one extra thing I think we should do, and that > >>> is make > >>> sure we do not leave less than X% off the free memory at a time. > >>> This way chances of triggering an OOM are lower. > >> If nothing else we could probably look at doing a watermark of some > >> sort so we have to have X amount of memory free but not hinted > >> before > >> we will start providing the hints. It would just be a matter of > >> tracking how much memory we have hinted on versus the amount of > >> memory > >> that has been pulled from that pool. > > This is to avoid false OOM in the guest? > Partially, though it would still be possible. Basically it would just > be a way of determining when we have hinted "enough". Basically it > doesn't do us much good to be hinting on free memory if the guest is > already constrained and just going to reallocate the memory shortly > after we hinted on it. The idea is with a watermark we can avoid > hinting until we start having pages that are actually going to stay > free for a while. 
> > >> It is another reason why we > >> probably want a bit in the buddy pages somewhere to indicate if a > >> page > >> has been hinted or not as we can then use that to determine if we > >> have > >> to account for it in the statistics. > > The one benefit which I can see of having an explicit bit is that it > > will help us to have a single hook away from the hot path within > > buddy > > merging code (just like your arch_merge_page) and still avoid > > duplicate > > hints while releasing pages. > > > > I still have to check PG_idle and PG_young which you mentioned but I > > don't think we can reuse any existing bits. > Those are bits that are already there for 64b. I think those exist in > the page extension for 32b systems. If I am not mistaken they are > only > used in VMA mapped memory. What I was getting at is that those are > the > bits we could think about reusing. > > > If we really want to have something like a watermark, then can't we > > use > > zone->free_pages before isolating to see how many free pages are > > there > > and put a threshold on it? (__isolate_free_page() does a similar > > thing > > but it does that on per request basis). > Right. That is only part of it though since that tells you how many > free pages are there. But how many of those free pages are hinted? > That is the part we would need to track separately and then then > compare to free_pages to determine if we need to start hinting on > more > memory or not. > >>> Only pages which are isolated will be hinted, and once a page is > >>> isolated it will not be counted in the zone free pages. > >>> Feel free to correct me if I am wrong. > >> You are correct up to here. When we isolate the page it isn't counted > >> against the free pages. However after we complete the hint we end up > >> taking it out of isolation and returning it to the "free" state, so it > >> will be counted against the free pages. 
> >> > >>> If I am understanding it correctly you only want to hint the idle > >>> pages, > >>> is that right? > >> Getting back to the ideas from our earlier discussion, we had 3 stages > >> for things. Free but not hinted, isolated due to hinting, and free and > >> hinted. So what we would need to do is identify the size of the first > >> pool that is free and not hinted by knowing the total number of free > >> pages, and then subtract the size of the pages
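The watermark idea being debated above can be sketched as a pure accounting check. All names and fields here (hint_stats, should_start_hinting, and so on) are hypothetical, purely to illustrate the free/hinted bookkeeping; this is not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Three-pool model from the thread: free-but-unhinted, isolated for
 * hinting, and free-and-hinted.  Isolated pages are not counted in
 * free_pages; hinted pages return to the free count after hinting. */
struct hint_stats {
	unsigned long free_pages;	/* free pages counted by the zone */
	unsigned long hinted_pages;	/* subset of those already hinted */
};

/* Hinting is only worthwhile while the free-but-unhinted pool sits
 * above a watermark; otherwise the guest is likely to reallocate the
 * memory shortly after it was hinted. */
static bool should_start_hinting(const struct hint_stats *s,
				 unsigned long watermark)
{
	if (s->hinted_pages >= s->free_pages)
		return false;	/* everything free has been hinted already */
	return s->free_pages - s->hinted_pages > watermark;
}
```

This captures why tracking hinted pages separately matters: free_pages alone cannot tell the caller how much of the free pool still needs hinting.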
Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions
On Tue, Mar 12, 2019 at 05:38:55PM -0700, John Hubbard wrote: > On 3/12/19 8:30 AM, Ira Weiny wrote: > > On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubb...@gmail.com wrote: > > > From: John Hubbard > > > > > > Introduces put_user_page(), which simply calls put_page(). > > > This provides a way to update all get_user_pages*() callers, > > > so that they call put_user_page(), instead of put_page(). > > > > So I've been running with these patches for a while but today while ramping > > up > > my testing I hit the following: > > > > [ 1355.557819] [ cut here ] > > [ 1355.563436] get_user_pages pin count overflowed > > Hi Ira, > > Thanks for reporting this. That overflow, at face value, means that we've > used more than the 22 bits worth of gup pin counts, so about 4 million pins > of the same page... This is my bug in the patches I'm playing with. Somehow I'm causing more puts than gets... I'm not sure how but this is for sure my problem. Backing off to your patch set the numbers are good. Sorry for the noise. With the testing I've done today I feel comfortable adding Tested-by: Ira Weiny For the main GUP and InfiniBand patches. 
Ira > > > [ 1355.563446] WARNING: CPU: 1 PID: 1740 at mm/gup.c:73 > > get_gup_pin_page+0xa5/0xb0 > > [ 1355.577391] Modules linked in: ib_isert iscsi_target_mod ib_srpt > > target_core_mod ib_srp scsi_transpo > > rt_srp ext4 mbcache jbd2 mlx4_ib opa_vnic rpcrdma sunrpc rdma_ucm ib_iser > > rdma_cm ib_umad iw_cm libiscs > > i ib_ipoib scsi_transport_iscsi ib_cm sb_edac x86_pkg_temp_thermal > > intel_powerclamp coretemp kvm irqbyp > > ass snd_hda_codec_realtek ib_uverbs snd_hda_codec_generic crct10dif_pclmul > > ledtrig_audio snd_hda_intel > > crc32_pclmul snd_hda_codec snd_hda_core ghash_clmulni_intel snd_hwdep > > snd_pcm aesni_intel crypto_simd s > > nd_timer ib_core cryptd snd glue_helper dax_pmem soundcore nd_pmem ipmi_si > > device_dax nd_btt ioatdma nd > > _e820 ipmi_devintf ipmi_msghandler iTCO_wdt i2c_i801 iTCO_vendor_support > > libnvdimm pcspkr lpc_ich mei_m > > e mei mfd_core wmi pcc_cpufreq acpi_cpufreq sch_fq_codel xfs libcrc32c > > mlx4_en sr_mod cdrom sd_mod mgag > > 200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx4_core > > ttm crc32c_intel igb isci ah > > ci dca libsas firewire_ohci drm i2c_algo_bit libahci scsi_transport_sas > > [ 1355.577429] firewire_core crc_itu_t i2c_core libata dm_mod [last > > unloaded: rdmavt] > > [ 1355.686703] CPU: 1 PID: 1740 Comm: reg-mr Not tainted 5.0.0+ #10 > > [ 1355.693851] Hardware name: Intel Corporation W2600CR/W2600CR, BIOS > > SE5C600.86B.02.04.0003.1023201411 > > 38 10/23/2014 > > [ 1355.705750] RIP: 0010:get_gup_pin_page+0xa5/0xb0 > > [ 1355.711348] Code: e8 40 02 ff ff 80 3d ba a2 fb 00 00 b8 b5 ff ff ff 75 > > bb 48 c7 c7 48 0a e9 81 89 4 > > 4 24 04 c6 05 a1 a2 fb 00 01 e8 35 63 e8 ff <0f> 0b 8b 44 24 04 eb 9c 0f 1f > > 00 66 66 66 66 90 41 57 49 > > bf 00 00 > > [ 1355.733244] RSP: 0018:c90005a23b30 EFLAGS: 00010286 > > [ 1355.739536] RAX: RBX: ea001422 RCX: > > > > [ 1355.748005] RDX: 0003 RSI: 827d94a3 RDI: > > 0246 > > [ 1355.756453] RBP: ea001422 R08: 0002 R09: > > 00022400 > > [ 
1355.764907] R10: 0009ccf0ad0c4203 R11: 0001 R12: > > 00010207 > > [ 1355.773369] R13: 8884130b7040 R14: fff00fff R15: > > 000fffe0 > > [ 1355.781836] FS: 7f2680d0d740() GS:88842e84() > > knlGS: > > [ 1355.791384] CS: 0010 DS: ES: CR0: 80050033 > > [ 1355.798319] CR2: 00589000 CR3: 00040b05e004 CR4: > > 000606e0 > > [ 1355.806809] Call Trace: > > [ 1355.810078] follow_page_pte+0x4f3/0x5c0 > > [ 1355.814987] __get_user_pages+0x1eb/0x730 > > [ 1355.820020] get_user_pages+0x3e/0x50 > > [ 1355.824657] ib_umem_get+0x283/0x500 [ib_uverbs] > > [ 1355.830340] ? _cond_resched+0x15/0x30 > > [ 1355.835065] mlx4_ib_reg_user_mr+0x75/0x1e0 [mlx4_ib] > > [ 1355.841235] ib_uverbs_reg_mr+0x10c/0x220 [ib_uverbs] > > [ 1355.847400] ib_uverbs_write+0x2f9/0x4d0 [ib_uverbs] > > [ 1355.853473] __vfs_write+0x36/0x1b0 > > [ 1355.857904] ? selinux_file_permission+0xf0/0x130 > > [ 1355.863702] ? security_file_permission+0x2e/0xe0 > > [ 1355.869503] vfs_write+0xa5/0x1a0 > > [ 1355.873751] ksys_write+0x4f/0xb0 > > [ 1355.878009] do_syscall_64+0x5b/0x180 > > [ 1355.882656] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > [ 1355.62] RIP: 0033:0x7f2680ec3ed8 > > [ 1355.893420] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 > > f3 0f 1e fa 48 8d 05 45 78 0 > > d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f > > 1f 80 00 00 00 00 41 54 49 > > 89 d4 55 > > [ 1355.915573] RSP: 002b:7ffe65d50bc8 EFLAGS: 0246 ORIG_RAX: > > 0001 > > [ 1355.924621] RAX: ffda RBX: 7ffe65d50c74 RCX: > >
Re: [RFC PATCH 4/7] tracing/probe: Check event/group naming rule at parsing
On Wed, 13 Mar 2019 11:23:46 -0400 Steven Rostedt wrote: > On Thu, 14 Mar 2019 00:04:02 +0900 > Masami Hiramatsu wrote: > > > > > strlcpy(buf, event, slash - event + 1); > > > > + if (!is_good_name(buf)) { > > > > + pr_info("Group name must follow the rule of C > > > > identifier\n"); > > > > > > What do you mean by "C identifier"? > > > > I meant "the naming rules of C language identifiers". It means > > that the name has to start with alphabet or "_" and only contain > > alphanumeric characters and "_". Does it clear? > > I couldn't think of a good word. maybe "naming convention" > > does not fit... > > Ah, OK that now makes sense. > > > > > Would you have any idea? > > I think I was more confused by the way it was stated. What about saying: > > "Group names must follow the same rules as C identifiers" OK, I will update it. > Saying "the rule of C identifiers" made me think more of "the rule of > law", and thought that this had something to do with rules of C > zealots, and those that identify with "C" ;-) Oh, I got it. :-) Thank you! > > -- Steve -- Masami Hiramatsu
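The rule under discussion — group and event names must follow the same rules as C identifiers, i.e. `[A-Za-z_][A-Za-z0-9_]*` — matches what the kernel's is_good_name() helper checks. A minimal user-space sketch of that check (layout illustrative, not necessarily the kernel's exact source):

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Accept only names that follow the same rules as C identifiers:
 * first character alphabetic or '_', remainder alphanumeric or '_'. */
static bool is_good_name(const char *name)
{
	if (!isalpha((unsigned char)*name) && *name != '_')
		return false;
	while (*++name != '\0') {
		if (!isalnum((unsigned char)*name) && *name != '_')
			return false;
	}
	return true;
}
```

Under this check, "my_group" and "_event1" pass, while "1bad", "bad-name", and the empty string are rejected — exactly the cases the proposed error message is meant to explain.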
Re: [RFC PATCH 1/7] tracing/probe: Check maxactive error cases
On Wed, 13 Mar 2019 10:51:56 -0400 Steven Rostedt wrote: > On Wed, 13 Mar 2019 23:37:39 +0900 > Masami Hiramatsu wrote: > > > > > So now 'p1:..." will error out and not just be ignored? > > > > Yes, I think it is better to tell user "your command has a > > meaningless option, maybe you made a mistake" than ignore that. > > > > OK, just making sure. Hope nobody complains about it ;-) > > Are these OK to add for the next merge window, or do you want them in > now? I could probably get them ready for -rc1. Yes, I think [1/7] to [3/7] would be better to go to 5.1. Thank you, -- Masami Hiramatsu
Re: overlayfs vs. fscrypt
On Wednesday, March 13, 2019, 23:26:11 CET, Eric Biggers wrote: > On Wed, Mar 13, 2019 at 09:33:10PM +0100, Richard Weinberger wrote: > > On Wednesday, March 13, 2019, 15:26:54 CET, Amir Goldstein wrote: > > > IMO, the best thing for UBIFS to do would be to modify fscrypt to support > > > opting out of the revalidate behavior, IOW, sanitize your hack to an API. > > > > Given the WTF/s rate this thread has, this might be a good option. > > Actually people already asked me how to disable this feature because > > they saw no use of it. > > Being able to delete encrypted files looks good on the feature list but in > > reality it has very few users but causes confusion, IMHO. > > > > I propose a new fscrypt_operations flag, FS_CFLG_NO_CRYPT_FNAMES. > > If this flag is set, a) fscrypt_setup_filename() will return -EPERM if > > no key is found. > > And b) __fscrypt_prepare_lookup() will not attach fscrypt_d_ops to the > > dentry. > > > > Eric, what do you think? > > > > Thanks, > > //richard > > > > What specifically is wrong with supporting the ciphertext "view" of encrypted > directories, and why do you want to opt UBIFS out of it specifically but not > ext4 and f2fs? (The fscrypt_operations are per-filesystem type, not > per-filesystem instance, so I assume that's what you had in mind.) Note that > we > can't unconditionally remove it because people need it to delete files without > the key. We could add a mount option to disable it, but why exactly? You are right, fscrypt_operations is the wrong structure. My plan was having it per filesystem instance. So a mount-option seems like a good option. Of course for all filesystems that support fscrypt, not just UBIFS. Over the last year I've converted many embedded systems to fscrypt and it happened more than once that users, and more importantly, applications got confused that you can mount and walk the filesystem even if you don't have the key loaded yet. 
For them it felt more natural that you cannot even readdir if you don't have the key. In my opinion having such a mount option is useful to match these expectations. And it is also useful because you can enable only the features you actually need. On embedded systems that I have in mind you never delete files without having the key and since fscrypt is used for the whole filesystem you can just recreate it if you really lost the key. Thanks, //richard
Re: [PATCH 2/5] lib/sort: Use more efficient bottom-up heapsort variant
On 21/02/2019 09.21, George Spelvin wrote: > > +/** > + * parent - given the offset of the child, find the offset of the parent. > + * @i: the offset of the heap element whose parent is sought. Non-zero. > + * @lsbit: a precomputed 1-bit mask, equal to "size & -size" > + * @size: size of each element > + * > + * In terms of array indexes, the parent of element j = i/size is simply > + * (j-1)/2. But when working in byte offsets, we can't use implicit > + * truncation of integer divides. > + * > + * Fortunately, we only need one bit of the quotient, not the full divide. > + * size has a least significant bit. That bit will be clear if i is > + * an even multiple of size, and set if it's an odd multiple. > + * > + * Logically, we're doing "if (i & lsbit) i -= size;", but since the > + * branch is unpredictable, it's done with a bit of clever branch-free > + * code instead. > + */ > +__attribute_const__ __always_inline > +static size_t parent(size_t i, unsigned int lsbit, size_t size) > +{ > + i -= size; > + i -= size & -(i & lsbit); > + return i / 2; > +} > + Really nice :) I had to work through this by hand, but it's solid. > /** > * sort - sort an array of elements > * @base: pointer to data to sort > @@ -125,21 +151,26 @@ static void generic_swap(void *a, void *b, int size) > * @cmp_func: pointer to comparison function > * @swap_func: pointer to swap function or NULL > * > - * This function does a heapsort on the given array. You may provide a > - * swap_func function optimized to your element type. > + * This function does a heapsort on the given array. You may provide a > + * swap_func function if you need to do something more than a memory copy > + * (e.g. fix up pointers or auxiliary data), but the built-in swap isn't > + * usually a bottleneck. > * > * Sorting time is O(n log n) both on average and worst-case. 
While > * qsort is about 20% faster on average, it suffers from exploitable > * O(n*n) worst-case behavior and extra memory requirements that make > * it less suitable for kernel use. > */ > - > void sort(void *base, size_t num, size_t size, > int (*cmp_func)(const void *, const void *), > void (*swap_func)(void *, void *, int size)) > { > /* pre-scale counters for performance */ > - int i = (num/2 - 1) * size, n = num * size, c, r; > + size_t n = num * size, a = (num/2) * size; > + unsigned const lsbit = size & -size;/* Used to find parent */ > + Nit: qualifier before type, "const unsigned". And this sets ZF, so a paranoid check for zero size (cf. the other mail) by doing "if (!lsbit) return;" is practically free. Though it's probably a bit obscure doing it that way... > + if (!n) > + return; I'd make that n <= 1. Shouldn't be much more costly. > - } > - } > + /* > + * Loop invariants: > + * 1. elements [a,n) satisfy the heap property (compare greater than > + *all of their children), > + * 2. elements [n,num*size) are sorted, and > + * 3. a <= b <= c <= d <= n (whenever they are valid). > + */ > + for (;;) { > + size_t b, c, d; > > + if (a) /* Building heap: sift down --a */ > + a -= size; > + else if (n -= size) /* Sorting: Extract root to --n */ > + swap_func(base, base + n, size); > + else/* Sort complete */ > + break; > + > + /* > + * Sift element at "a" down into heap. This is the > + * "bottom-up" variant, which significantly reduces > + * calls to cmp_func(): we find the sift-down path all > + * the way to the leaves (one compare per level), then > + * backtrack to find where to insert the target element. > + * > + * Because elements tend to sift down close to the leaves, > + * this uses fewer compares than doing two per level > + * on the way down. (A bit more than half as many on > + * average, 3/4 worst-case.) > + */ > + for (b = a; c = 2*b + size, (d = c + size) < n;) > + b = cmp_func(base + c, base + d) >= 0 ? 
c : d; > + if (d == n) /* Special case last leaf with no sibling */ > + b = c; > + > + /* Now backtrack from "b" to the correct location for "a" */ > + while (b != a && cmp_func(base + a, base + b) >= 0) > + b = parent(b, lsbit, size); > + c = b; /* Where "a" belongs */ > + while (b != a) {/* Shift it into place */ > + b = parent(b, lsbit, size); > + swap_func(base + b, base + c, size); > } > } > } > Nice! Rasmus
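The branch-free parent() helper reviewed above can be checked in user space against the straightforward index formula. Below, parent() is copied from the patch, while parent_ref() is a reference version added here only for comparison (it is not part of the patch):

```c
#include <assert.h>
#include <stddef.h>

/* Branch-free parent offset from the patch: given a child's byte offset i
 * (a non-zero multiple of size) and lsbit == size & -size, return the byte
 * offset of its heap parent.  Logically "i -= size; if (i & lsbit) i -= size;
 * return i / 2;", written without the unpredictable branch. */
static size_t parent(size_t i, unsigned int lsbit, size_t size)
{
	i -= size;
	i -= size & -(i & lsbit);
	return i / 2;
}

/* Reference version working in element indexes: the parent of element j
 * is (j - 1) / 2, converted back to a byte offset. */
static size_t parent_ref(size_t i, size_t size)
{
	return ((i / size - 1) / 2) * size;
}
```

The subtraction of size before the masking step is what makes the single lsbit test sufficient: after "i -= size", the lsbit of i is set exactly when the child's element index was even, which is the case where one extra subtraction of size is needed before the halving.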
Re: overlayfs vs. fscrypt
On Wed, 2019-03-13 at 15:13 -0700, Eric Biggers wrote:
> On Wed, Mar 13, 2019 at 02:04:29PM -0700, James Bottomley wrote:
> > On Wed, 2019-03-13 at 13:25 -0700, Eric Biggers wrote:
> > > On Wed, Mar 13, 2019 at 01:06:06PM -0700, James Bottomley wrote:
> > > > On Wed, 2019-03-13 at 12:57 -0700, Eric Biggers wrote:
> > > > [...]
> > > > > fscrypt would allow the data to be stored encrypted on the
> > > > > local disk, so it's protected against offline compromise of
> > > > > the disk.
> > > >
> > > > Container images are essentially tars of the overlays. They
> > > > only become actual filesystems when instantiated at runtime.
> > > > The current encrypted container image is an overlay or set of
> > > > overlays which is tarred then encrypted. So to instantiate it
> > > > is decrypted then untarred.
> > > >
> > > > The thing I was wondering about was whether instead of a tar
> > > > encrypt we could instead produce an encrypted image from a
> > > > fscrypt filesystem.
> > >
> > > Why do you care whether the container image is encrypted on the
> > > local disk, when you're extracting it in plaintext onto the local
> > > disk anyway each time it runs? Even after the runtime files are
> > > "deleted", they may still be recoverable from the disk. Are you
> > > using shred and BLKSECDISCARD, and a non-COW filesystem?
> > >
> > > Now, if you wanted to avoid writing the plaintext to disk
> > > entirely (and thereby use encryption to actually achieve a useful
> > > security property that can't be achieved through file
> > > permissions), fscrypt is a good solution for that.
> >
> > OK, let's start with a cloud and container 101: A container is an
> > exactly transportable IaaS environment containing an application.
> > The format for the exact transport is the "container image" I've
> > been describing (layered tar file set deployed with overlays).
> > These images are usually stored in cloud based registries which
> > may or may not have useful access controls. I take it the reason
> > for image encryption to protect confidentiality within the
> > registry is obvious.
> >
> > Because of the exact transport, the deployment may be on my laptop,
> > on my test system or in some type of public or private cloud. In
> > all cases bar the laptop, I won't actually own the physical system
> > which ends up deploying the container. So in exchange for security
> > guarantees from the physical system owner, I agree to turn over my
> > decryption key and possibly a cash payment. One of these
> > guarantees is usually that they shred the key after use and that
> > they deploy a useful key escrow system like vault or keyprotect to
> > guard it even while the decryption is being done.
> >
> > Another is that all traces of the container be shredded after the
> > execution is finished.
>
> Well, sounds like that's not the case currently even with an
> encrypted container image, because the actual runtime files are not
> encrypted on disk.

Shredding means destroying all trace including in the on-disk image. However, one problem with the current implementation is there's a window between container run and container stop where the unencrypted files are in memory and on local disk. Access or cockup in that window can leak confidential data.

> Encrypting the runtime files using fscrypt with an ephemeral key
> would be useful here. IOW, randomly generate an encryption key when
> the container starts, never store it anywhere, and wipe it when the
> container stops.
>
> Note that this is separate from the container *image* encryption.

Actually, that was my original thought: it needn't be. If fscrypt can usefully add runtime security, then we could have the encrypted layer be simply an fscrypt image ... I presume without the key we can create a tar image of an fscrypt filesystem that is encrypted and would still be visible on untar if we did have the key? So the encrypted layer would be a tar of the fscrypt filesystem without the key.

> > considering is could I be protected against either cloud provider
> > cockups that might leak the image (the misconfigured backup
> > scenario I suggested) or malicious actions of other tenants.
>
> If the container image is encrypted with a key not on the system,
> then its confidentiality is protected from anything that may happen
> on that system.
>
> But if the container image encryption key *is* on the system, your
> container image may be leaked either accidentally or maliciously.

Well, yes, but that's like saying if you don't want to pick up a virus from your network, unplug it. We have to look at ways of deploying the filesystem and the key such that it's hard to exfiltrate ... which seems to be similar to your Android fscrypt use case.

James
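The ephemeral-key scheme Eric describes can be sketched in shell. This is a sketch under assumptions, not a tested recipe: it presumes root, a kernel with CONFIG_FS_ENCRYPTION, an ext4/f2fs filesystem with the "encrypt" feature enabled mounted at /mnt/scratch, and a recent fscryptctl providing add_key/set_policy/remove_key subcommands (v2 policies); all paths are illustrative.

```shell
# Ephemeral fscrypt key for a container's runtime directory (sketch).

# 1. Feed 64 random bytes straight into the kernel keyring for this
#    filesystem; the raw key never touches the disk. fscryptctl prints
#    a key identifier, which we keep only in a shell variable.
KEYID=$(head -c 64 /dev/urandom | fscryptctl add_key /mnt/scratch)

# 2. Bind an empty directory to that key: everything created beneath it
#    is transparently encrypted while the key is loaded.
mkdir /mnt/scratch/runtime
fscryptctl set_policy "$KEYID" /mnt/scratch/runtime

# ... run the container with /mnt/scratch/runtime as its writable layer ...

# 3. On container stop, evict the key. The directory's contents become
#    unreadable ciphertext immediately, even before deletion; removing
#    the files afterwards only discards ciphertext.
fscryptctl remove_key "$KEYID" /mnt/scratch
rm -rf /mnt/scratch/runtime
```

The final rm works without the key only because fscrypt supports operating on the ciphertext names of keyless directories, which is exactly the "ciphertext view" feature debated elsewhere in this thread.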
Re: [PATCH v2] PCI: qcom: Use default config space read function
On 13/03/2019 22:52, Srinivas Kandagatla wrote:
> On 13/03/2019 20:39, Marc Gonzalez wrote:
>
>> Could you pastebin the output of lspci -vvv -n in the working case?
>
> This was already in my original reply:
> working without patch : https://paste.ubuntu.com/p/TJm4hgjGW4/

I needed to see the numeric values, hence the -n flag ;-)

lspci -vvv -n

Regards.
Re: overlayfs vs. fscrypt
On Wed, Mar 13, 2019 at 09:33:10PM +0100, Richard Weinberger wrote:
> On Wednesday, 13 March 2019 at 15:26:54 CET, Amir Goldstein wrote:
> > IMO, the best thing for UBIFS to do would be to modify fscrypt to support
> > opting out of the revalidate behavior, IOW, sanitize your hack to an API.
>
> Given the WTF/s rate this thread has, this might be a good option.
> Actually people already asked me how to disable this feature because
> they saw no use of it.
> Being able to delete encrypted files looks good on the feature list, but in
> reality it has very few users and causes confusion, IMHO.
>
> I propose a new fscrypt_operations flag, FS_CFLG_NO_CRYPT_FNAMES.
> If this flag is set, a) fscrypt_setup_filename() will return -EPERM if
> no key is found.
> And b) __fscrypt_prepare_lookup() will not attach fscrypt_d_ops to the dentry.
>
> Eric, what do you think?
>
> Thanks,
> //richard

What specifically is wrong with supporting the ciphertext "view" of encrypted directories, and why do you want to opt UBIFS out of it specifically but not ext4 and f2fs? (The fscrypt_operations are per-filesystem type, not per-filesystem instance, so I assume that's what you had in mind.)

Note that we can't unconditionally remove it because people need it to delete files without the key. We could add a mount option to disable it, but why exactly?

By the way, I suggest that people read Documentation/filesystems/fscrypt.rst for more information about what fscrypt is supposed to do, as there seems to be a lot of misconceptions.

- Eric
[PATCH v5 05/12] mtd: rawnand: ingenic: Use SPDX license notifiers
Use SPDX license notifiers instead of GPLv2 license text in the headers. Signed-off-by: Paul Cercueil Reviewed-by: Boris Brezillon --- Notes: v2: No changes v3: No changes v4: No changes v5: No changes drivers/mtd/nand/raw/ingenic/jz4780_bch.c | 5 + drivers/mtd/nand/raw/ingenic/jz4780_bch.h | 5 + drivers/mtd/nand/raw/ingenic/jz4780_nand.c | 5 + 3 files changed, 3 insertions(+), 12 deletions(-) diff --git a/drivers/mtd/nand/raw/ingenic/jz4780_bch.c b/drivers/mtd/nand/raw/ingenic/jz4780_bch.c index c5f74ed85862..fdf483a49f7b 100644 --- a/drivers/mtd/nand/raw/ingenic/jz4780_bch.c +++ b/drivers/mtd/nand/raw/ingenic/jz4780_bch.c @@ -1,12 +1,9 @@ +// SPDX-License-Identifier: GPL-2.0 /* * JZ4780 BCH controller * * Copyright (c) 2015 Imagination Technologies * Author: Alex Smith - * - * This program is free software; you can redistribute it and/or modify it - * under the terms of the GNU General Public License version 2 as published - * by the Free Software Foundation. */ #include diff --git a/drivers/mtd/nand/raw/ingenic/jz4780_bch.h b/drivers/mtd/nand/raw/ingenic/jz4780_bch.h index bf4718088a3a..451e0c770160 100644 --- a/drivers/mtd/nand/raw/ingenic/jz4780_bch.h +++ b/drivers/mtd/nand/raw/ingenic/jz4780_bch.h @@ -1,12 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0 */ /* * JZ4780 BCH controller * * Copyright (c) 2015 Imagination Technologies * Author: Alex Smith - * - * This program is free software; you can redistribute it and/or modify it - * under the terms of the GNU General Public License version 2 as published - * by the Free Software Foundation. 
*/ #ifndef __DRIVERS_MTD_NAND_JZ4780_BCH_H__ diff --git a/drivers/mtd/nand/raw/ingenic/jz4780_nand.c b/drivers/mtd/nand/raw/ingenic/jz4780_nand.c index 22e58975f0d5..7f55358b860f 100644 --- a/drivers/mtd/nand/raw/ingenic/jz4780_nand.c +++ b/drivers/mtd/nand/raw/ingenic/jz4780_nand.c @@ -1,12 +1,9 @@ +// SPDX-License-Identifier: GPL-2.0 /* * JZ4780 NAND driver * * Copyright (c) 2015 Imagination Technologies * Author: Alex Smith - * - * This program is free software; you can redistribute it and/or modify it - * under the terms of the GNU General Public License version 2 as published - * by the Free Software Foundation. */ #include -- 2.11.0
[PATCH v5 07/12] mtd: rawnand: ingenic: Rename jz4780_bch_init to jz4780_bch_reset
The jz4780_bch_init name was confusing, as it suggested that its content should be executed once at init time, whereas what the function really does is reset the hardware for a new ECC operation. Signed-off-by: Paul Cercueil --- Notes: v5: New patch drivers/mtd/nand/raw/ingenic/jz4780_bch.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/mtd/nand/raw/ingenic/jz4780_bch.c b/drivers/mtd/nand/raw/ingenic/jz4780_bch.c index fdf483a49f7b..7635753bd7ea 100644 --- a/drivers/mtd/nand/raw/ingenic/jz4780_bch.c +++ b/drivers/mtd/nand/raw/ingenic/jz4780_bch.c @@ -69,8 +69,8 @@ struct jz4780_bch { struct mutex lock; }; -static void jz4780_bch_init(struct jz4780_bch *bch, - struct jz4780_bch_params *params, bool encode) +static void jz4780_bch_reset(struct jz4780_bch *bch, +struct jz4780_bch_params *params, bool encode) { u32 reg; @@ -183,7 +183,8 @@ int jz4780_bch_calculate(struct jz4780_bch *bch, struct jz4780_bch_params *param int ret = 0; mutex_lock(>lock); - jz4780_bch_init(bch, params, true); + + jz4780_bch_reset(bch, params, true); jz4780_bch_write_data(bch, buf, params->size); if (jz4780_bch_wait_complete(bch, BCH_BHINT_ENCF, NULL)) { @@ -220,7 +221,7 @@ int jz4780_bch_correct(struct jz4780_bch *bch, struct jz4780_bch_params *params, mutex_lock(>lock); - jz4780_bch_init(bch, params, false); + jz4780_bch_reset(bch, params, false); jz4780_bch_write_data(bch, buf, params->size); jz4780_bch_write_data(bch, ecc_code, params->bytes); -- 2.11.0
[PATCH v5 03/12] dt-bindings: mtd: ingenic: Use standard ecc-engine property
The 'ingenic,bch-controller' property is now deprecated and the 'ecc-engine' property should be used instead. Signed-off-by: Paul Cercueil --- Notes: v5: New patch Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt b/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt index 5a45cc54f46d..c02259353327 100644 --- a/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt +++ b/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt @@ -14,7 +14,7 @@ Required NAND controller device properties: an offset of 0 and a size of 0x100 (i.e. the whole NEMC bank). Optional NAND controller device properties: -- ingenic,bch-controller: To make use of the hardware ECC controller, this +- ecc-engine: To make use of the hardware ECC controller, this property must contain a phandle for the ECC controller node. The required properties for this node are described below. If this is not specified, software ECC will be used instead. @@ -48,7 +48,7 @@ nemc: nemc@1341 { #address-cells = <1>; #size-cells = <0>; - ingenic,bch-controller = <>; + ecc-engine = <>; nand@1 { reg = <1>; -- 2.11.0
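In a board devicetree, the change amounts to renaming the phandle property on the NAND controller node. A sketch mirroring the binding's example (the bch label and unit address are illustrative):

```
nemc: nemc@13410000 {
	#address-cells = <1>;
	#size-cells = <0>;

	/* Preferred standard property; the old "ingenic,bch-controller"
	 * spelling is deprecated but points at the same controller node. */
	ecc-engine = <&bch>;

	nand@1 {
		reg = <1>;
	};
};
```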
[PATCH v5 12/12] mtd: rawnand: ingenic: Add ooblayout for the Qi Ben Nanonote
The Ben Nanonote from Qi Hardware expects a specific OOB layout on its NAND. Signed-off-by: Paul Cercueil --- Notes: v2: New patch v3: Use the qi,lb60 layout unconditionally if we detect that we're running on that board. v4: No change v5: No change drivers/mtd/nand/raw/ingenic/ingenic_nand.c | 41 - 1 file changed, 40 insertions(+), 1 deletion(-) diff --git a/drivers/mtd/nand/raw/ingenic/ingenic_nand.c b/drivers/mtd/nand/raw/ingenic/ingenic_nand.c index a2450e10932a..d32d68ef0dc7 100644 --- a/drivers/mtd/nand/raw/ingenic/ingenic_nand.c +++ b/drivers/mtd/nand/raw/ingenic/ingenic_nand.c @@ -72,6 +72,41 @@ static inline struct ingenic_nfc *to_ingenic_nfc(struct nand_controller *ctrl) return container_of(ctrl, struct ingenic_nfc, controller); } +static int qi_lb60_ooblayout_ecc(struct mtd_info *mtd, int section, +struct mtd_oob_region *oobregion) +{ + struct nand_chip *chip = mtd_to_nand(mtd); + struct nand_ecc_ctrl *ecc = >ecc; + + if (section || !ecc->total) + return -ERANGE; + + oobregion->length = ecc->total; + oobregion->offset = 12; + + return 0; +} + +static int qi_lb60_ooblayout_free(struct mtd_info *mtd, int section, + struct mtd_oob_region *oobregion) +{ + struct nand_chip *chip = mtd_to_nand(mtd); + struct nand_ecc_ctrl *ecc = >ecc; + + if (section) + return -ERANGE; + + oobregion->length = mtd->oobsize - ecc->total - 12; + oobregion->offset = 12 + ecc->total; + + return 0; +} + +const struct mtd_ooblayout_ops qi_lb60_ooblayout_ops = { + .ecc = qi_lb60_ooblayout_ecc, + .free = qi_lb60_ooblayout_free, +}; + static int jz4725b_ooblayout_ecc(struct mtd_info *mtd, int section, struct mtd_oob_region *oobregion) { @@ -247,7 +282,11 @@ static int ingenic_nand_attach_chip(struct nand_chip *chip) return -EINVAL; } - mtd_set_ooblayout(mtd, nfc->soc_info->oob_layout); + /* For legacy reasons we use a different layout on the qi,lb60 board. 
*/ + if (of_machine_is_compatible("qi,lb60")) + mtd_set_ooblayout(mtd, _lb60_ooblayout_ops); + else + mtd_set_ooblayout(mtd, nfc->soc_info->oob_layout); return 0; } -- 2.11.0
[PATCH v5 11/12] mtd: rawnand: ingenic: Add support for the JZ4725B
The boot ROM of the JZ4725B SoC expects a specific OOB layout on the NAND, so we use it unconditionally in the ingenic-nand driver. Also add the jz4725b-bch driver to support the JZ4725B-specific BCH hardware. Signed-off-by: Paul Cercueil --- Notes: v2: Instead of forcing the OOB layout, leave it to the board code or devicetree to decide if the jz4725b-specific layout should be used or not. v3: - Revert the change in v2, as the previous behaviour was correct. - Also add support for the hardware BCH of the JZ4725B in this patch. v4: - Add MODULE_* macros - Add tweaks suggested by upstream feedback v5: - Rename jz4740_ecc_init to jz4740_ecc_reset drivers/mtd/nand/raw/ingenic/Kconfig| 10 + drivers/mtd/nand/raw/ingenic/Makefile | 1 + drivers/mtd/nand/raw/ingenic/ingenic_nand.c | 48 - drivers/mtd/nand/raw/ingenic/jz4725b_bch.c | 293 4 files changed, 351 insertions(+), 1 deletion(-) create mode 100644 drivers/mtd/nand/raw/ingenic/jz4725b_bch.c diff --git a/drivers/mtd/nand/raw/ingenic/Kconfig b/drivers/mtd/nand/raw/ingenic/Kconfig index cc663cc15119..497b46b144f2 100644 --- a/drivers/mtd/nand/raw/ingenic/Kconfig +++ b/drivers/mtd/nand/raw/ingenic/Kconfig @@ -27,6 +27,16 @@ config MTD_NAND_JZ4740_ECC This driver can also be built as a module. If so, the module will be called jz4740-ecc. +config MTD_NAND_JZ4725B_BCH + tristate "Hardware BCH support for JZ4725B SoC" + select MTD_NAND_INGENIC_ECC + help + Enable this driver to support the BCH error-correction hardware + present on the JZ4725B SoC from Ingenic. + + This driver can also be built as a module. If so, the module + will be called jz4725b-bch. 
+ config MTD_NAND_JZ4780_BCH tristate "Hardware BCH support for JZ4780 SoC" select MTD_NAND_INGENIC_ECC diff --git a/drivers/mtd/nand/raw/ingenic/Makefile b/drivers/mtd/nand/raw/ingenic/Makefile index 563b7effcf59..ab2c5f47e5b7 100644 --- a/drivers/mtd/nand/raw/ingenic/Makefile +++ b/drivers/mtd/nand/raw/ingenic/Makefile @@ -3,4 +3,5 @@ obj-$(CONFIG_MTD_NAND_JZ4780) += ingenic_nand.o obj-$(CONFIG_MTD_NAND_INGENIC_ECC) += ingenic_ecc.o obj-$(CONFIG_MTD_NAND_JZ4740_ECC) += jz4740_ecc.o +obj-$(CONFIG_MTD_NAND_JZ4725B_BCH) += jz4725b_bch.o obj-$(CONFIG_MTD_NAND_JZ4780_BCH) += jz4780_bch.o diff --git a/drivers/mtd/nand/raw/ingenic/ingenic_nand.c b/drivers/mtd/nand/raw/ingenic/ingenic_nand.c index 1bd1f18d4532..a2450e10932a 100644 --- a/drivers/mtd/nand/raw/ingenic/ingenic_nand.c +++ b/drivers/mtd/nand/raw/ingenic/ingenic_nand.c @@ -34,6 +34,7 @@ struct jz_soc_info { unsigned long data_offset; unsigned long addr_offset; unsigned long cmd_offset; + const struct mtd_ooblayout_ops *oob_layout; }; struct ingenic_nand_cs { @@ -71,6 +72,41 @@ static inline struct ingenic_nfc *to_ingenic_nfc(struct nand_controller *ctrl) return container_of(ctrl, struct ingenic_nfc, controller); } +static int jz4725b_ooblayout_ecc(struct mtd_info *mtd, int section, +struct mtd_oob_region *oobregion) +{ + struct nand_chip *chip = mtd_to_nand(mtd); + struct nand_ecc_ctrl *ecc = >ecc; + + if (section || !ecc->total) + return -ERANGE; + + oobregion->length = ecc->total; + oobregion->offset = 3; + + return 0; +} + +static int jz4725b_ooblayout_free(struct mtd_info *mtd, int section, + struct mtd_oob_region *oobregion) +{ + struct nand_chip *chip = mtd_to_nand(mtd); + struct nand_ecc_ctrl *ecc = >ecc; + + if (section) + return -ERANGE; + + oobregion->length = mtd->oobsize - ecc->total - 3; + oobregion->offset = 3 + ecc->total; + + return 0; +} + +const struct mtd_ooblayout_ops jz4725b_ooblayout_ops = { + .ecc = jz4725b_ooblayout_ecc, + .free = jz4725b_ooblayout_free, +}; + static void 
ingenic_nand_select_chip(struct nand_chip *chip, int chipnr) { struct ingenic_nand *nand = to_ingenic_nand(nand_to_mtd(chip)); @@ -211,7 +247,7 @@ static int ingenic_nand_attach_chip(struct nand_chip *chip) return -EINVAL; } - mtd_set_ooblayout(mtd, _ooblayout_lp_ops); + mtd_set_ooblayout(mtd, nfc->soc_info->oob_layout); return 0; } @@ -407,16 +443,26 @@ static const struct jz_soc_info jz4740_soc_info = { .data_offset = 0x, .cmd_offset = 0x8000, .addr_offset = 0x0001, + .oob_layout = _ooblayout_lp_ops, +}; + +static const struct jz_soc_info jz4725b_soc_info = { + .data_offset = 0x, + .cmd_offset = 0x8000, + .addr_offset = 0x0001, + .oob_layout = _ooblayout_ops, }; static const struct jz_soc_info jz4780_soc_info = { .data_offset = 0x, .cmd_offset =
[PATCH v5 10/12] mtd: rawnand: ingenic: Add support for the JZ4740
Add support for probing the ingenic-nand driver on the JZ4740 SoC from Ingenic, and the jz4740-ecc driver to support the JZ4740-specific ECC hardware. Signed-off-by: Paul Cercueil --- Notes: v2: New patch v3: Also add support for the hardware ECC of the JZ4740 in this patch v4: - Fix formatting issues - Add MODULE_* macros v5: - Rename jz4740_ecc_init to jz4740_ecc_reset - Fix comments blocks not following the kernel style drivers/mtd/nand/raw/ingenic/Kconfig| 10 ++ drivers/mtd/nand/raw/ingenic/Makefile | 1 + drivers/mtd/nand/raw/ingenic/ingenic_nand.c | 48 +-- drivers/mtd/nand/raw/ingenic/jz4740_ecc.c | 197 4 files changed, 245 insertions(+), 11 deletions(-) create mode 100644 drivers/mtd/nand/raw/ingenic/jz4740_ecc.c diff --git a/drivers/mtd/nand/raw/ingenic/Kconfig b/drivers/mtd/nand/raw/ingenic/Kconfig index 4bf7d7af3b0a..cc663cc15119 100644 --- a/drivers/mtd/nand/raw/ingenic/Kconfig +++ b/drivers/mtd/nand/raw/ingenic/Kconfig @@ -17,6 +17,16 @@ if MTD_NAND_JZ4780 config MTD_NAND_INGENIC_ECC tristate +config MTD_NAND_JZ4740_ECC + tristate "Hardware BCH support for JZ4740 SoC" + select MTD_NAND_INGENIC_ECC + help + Enable this driver to support the Reed-Solomon error-correction + hardware present on the JZ4740 SoC from Ingenic. + + This driver can also be built as a module. If so, the module + will be called jz4740-ecc. 
+ config MTD_NAND_JZ4780_BCH tristate "Hardware BCH support for JZ4780 SoC" select MTD_NAND_INGENIC_ECC diff --git a/drivers/mtd/nand/raw/ingenic/Makefile b/drivers/mtd/nand/raw/ingenic/Makefile index f3c3c0f230b0..563b7effcf59 100644 --- a/drivers/mtd/nand/raw/ingenic/Makefile +++ b/drivers/mtd/nand/raw/ingenic/Makefile @@ -2,4 +2,5 @@ obj-$(CONFIG_MTD_NAND_JZ4740) += jz4740_nand.o obj-$(CONFIG_MTD_NAND_JZ4780) += ingenic_nand.o obj-$(CONFIG_MTD_NAND_INGENIC_ECC) += ingenic_ecc.o +obj-$(CONFIG_MTD_NAND_JZ4740_ECC) += jz4740_ecc.o obj-$(CONFIG_MTD_NAND_JZ4780_BCH) += jz4780_bch.o diff --git a/drivers/mtd/nand/raw/ingenic/ingenic_nand.c b/drivers/mtd/nand/raw/ingenic/ingenic_nand.c index a5f2f21c5198..1bd1f18d4532 100644 --- a/drivers/mtd/nand/raw/ingenic/ingenic_nand.c +++ b/drivers/mtd/nand/raw/ingenic/ingenic_nand.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include @@ -26,13 +27,15 @@ #define DRV_NAME "ingenic-nand" -#define OFFSET_DATA0x -#define OFFSET_CMD 0x0040 -#define OFFSET_ADDR0x0080 - /* Command delay when there is no R/B pin. 
*/ #define RB_DELAY_US100 +struct jz_soc_info { + unsigned long data_offset; + unsigned long addr_offset; + unsigned long cmd_offset; +}; + struct ingenic_nand_cs { unsigned int bank; void __iomem *base; @@ -41,6 +44,7 @@ struct ingenic_nand_cs { struct ingenic_nfc { struct device *dev; struct ingenic_ecc *ecc; + const struct jz_soc_info *soc_info; struct nand_controller controller; unsigned int num_banks; struct list_head chips; @@ -100,9 +104,9 @@ static void ingenic_nand_cmd_ctrl(struct nand_chip *chip, int cmd, return; if (ctrl & NAND_ALE) - writeb(cmd, cs->base + OFFSET_ADDR); + writeb(cmd, cs->base + nfc->soc_info->addr_offset); else if (ctrl & NAND_CLE) - writeb(cmd, cs->base + OFFSET_CMD); + writeb(cmd, cs->base + nfc->soc_info->cmd_offset); } static int ingenic_nand_dev_ready(struct nand_chip *chip) @@ -160,8 +164,13 @@ static int ingenic_nand_attach_chip(struct nand_chip *chip) struct ingenic_nfc *nfc = to_ingenic_nfc(chip->controller); int eccbytes; - chip->ecc.bytes = fls((1 + 8) * chip->ecc.size) * - (chip->ecc.strength / 8); + if (chip->ecc.strength == 4) { + /* JZ4740 uses 9 bytes of ECC to correct maximum 4 errors */ + chip->ecc.bytes = 9; + } else { + chip->ecc.bytes = fls((1 + 8) * chip->ecc.size) * + (chip->ecc.strength / 8); + } switch (chip->ecc.mode) { case NAND_ECC_HW: @@ -270,8 +279,8 @@ static int ingenic_nand_init_chip(struct platform_device *pdev, return -ENOMEM; mtd->dev.parent = dev; - chip->legacy.IO_ADDR_R = cs->base + OFFSET_DATA; - chip->legacy.IO_ADDR_W = cs->base + OFFSET_DATA; + chip->legacy.IO_ADDR_R = cs->base + nfc->soc_info->data_offset; + chip->legacy.IO_ADDR_W = cs->base + nfc->soc_info->data_offset; chip->legacy.chip_delay = RB_DELAY_US; chip->options = NAND_NO_SUBPAGE_WRITE; chip->legacy.select_chip = ingenic_nand_select_chip; @@ -353,6 +362,10 @@ static int ingenic_nand_probe(struct platform_device *pdev) if
[PATCH v5 06/12] mtd: rawnand: ingenic: Rename jz4780_nand driver to ingenic_nand
The jz4780_nand driver will be modified to handle all the Ingenic JZ47xx SoCs that the upstream Linux kernel supports (JZ4740, JZ4725B, JZ4770, JZ4780), so it makes sense to rename it. Signed-off-by: Paul Cercueil --- Notes: v3: New patch v4: No changes v5: No changes drivers/mtd/nand/raw/ingenic/Makefile | 2 +- .../raw/ingenic/{jz4780_nand.c => ingenic_nand.c} | 146 ++--- 2 files changed, 74 insertions(+), 74 deletions(-) rename drivers/mtd/nand/raw/ingenic/{jz4780_nand.c => ingenic_nand.c} (63%) diff --git a/drivers/mtd/nand/raw/ingenic/Makefile b/drivers/mtd/nand/raw/ingenic/Makefile index 44c2ca053d24..af2d5de21f60 100644 --- a/drivers/mtd/nand/raw/ingenic/Makefile +++ b/drivers/mtd/nand/raw/ingenic/Makefile @@ -1,2 +1,2 @@ obj-$(CONFIG_MTD_NAND_JZ4740) += jz4740_nand.o -obj-$(CONFIG_MTD_NAND_JZ4780) += jz4780_nand.o jz4780_bch.o +obj-$(CONFIG_MTD_NAND_JZ4780) += ingenic_nand.o jz4780_bch.o diff --git a/drivers/mtd/nand/raw/ingenic/jz4780_nand.c b/drivers/mtd/nand/raw/ingenic/ingenic_nand.c similarity index 63% rename from drivers/mtd/nand/raw/ingenic/jz4780_nand.c rename to drivers/mtd/nand/raw/ingenic/ingenic_nand.c index 7f55358b860f..8c73f7c5be9a 100644 --- a/drivers/mtd/nand/raw/ingenic/jz4780_nand.c +++ b/drivers/mtd/nand/raw/ingenic/ingenic_nand.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 /* - * JZ4780 NAND driver + * Ingenic JZ47xx NAND driver * * Copyright (c) 2015 Imagination Technologies * Author: Alex Smith @@ -24,7 +24,7 @@ #include "jz4780_bch.h" -#define DRV_NAME "jz4780-nand" +#define DRV_NAME "ingenic-nand" #define OFFSET_DATA0x #define OFFSET_CMD 0x0040 @@ -33,22 +33,22 @@ /* Command delay when there is no R/B pin. 
*/ #define RB_DELAY_US100 -struct jz4780_nand_cs { +struct ingenic_nand_cs { unsigned int bank; void __iomem *base; }; -struct jz4780_nand_controller { +struct ingenic_nfc { struct device *dev; struct jz4780_bch *bch; struct nand_controller controller; unsigned int num_banks; struct list_head chips; int selected; - struct jz4780_nand_cs cs[]; + struct ingenic_nand_cs cs[]; }; -struct jz4780_nand_chip { +struct ingenic_nand { struct nand_chip chip; struct list_head chip_list; @@ -57,22 +57,21 @@ struct jz4780_nand_chip { unsigned int reading: 1; }; -static inline struct jz4780_nand_chip *to_jz4780_nand_chip(struct mtd_info *mtd) +static inline struct ingenic_nand *to_ingenic_nand(struct mtd_info *mtd) { - return container_of(mtd_to_nand(mtd), struct jz4780_nand_chip, chip); + return container_of(mtd_to_nand(mtd), struct ingenic_nand, chip); } -static inline struct jz4780_nand_controller -*to_jz4780_nand_controller(struct nand_controller *ctrl) +static inline struct ingenic_nfc *to_ingenic_nfc(struct nand_controller *ctrl) { - return container_of(ctrl, struct jz4780_nand_controller, controller); + return container_of(ctrl, struct ingenic_nfc, controller); } -static void jz4780_nand_select_chip(struct nand_chip *chip, int chipnr) +static void ingenic_nand_select_chip(struct nand_chip *chip, int chipnr) { - struct jz4780_nand_chip *nand = to_jz4780_nand_chip(nand_to_mtd(chip)); - struct jz4780_nand_controller *nfc = to_jz4780_nand_controller(nand->chip.controller); - struct jz4780_nand_cs *cs; + struct ingenic_nand *nand = to_ingenic_nand(nand_to_mtd(chip)); + struct ingenic_nfc *nfc = to_ingenic_nfc(nand->chip.controller); + struct ingenic_nand_cs *cs; /* Ensure the currently selected chip is deasserted. 
*/ if (chipnr == -1 && nfc->selected >= 0) { @@ -83,12 +82,12 @@ static void jz4780_nand_select_chip(struct nand_chip *chip, int chipnr) nfc->selected = chipnr; } -static void jz4780_nand_cmd_ctrl(struct nand_chip *chip, int cmd, -unsigned int ctrl) +static void ingenic_nand_cmd_ctrl(struct nand_chip *chip, int cmd, + unsigned int ctrl) { - struct jz4780_nand_chip *nand = to_jz4780_nand_chip(nand_to_mtd(chip)); - struct jz4780_nand_controller *nfc = to_jz4780_nand_controller(nand->chip.controller); - struct jz4780_nand_cs *cs; + struct ingenic_nand *nand = to_ingenic_nand(nand_to_mtd(chip)); + struct ingenic_nfc *nfc = to_ingenic_nfc(nand->chip.controller); + struct ingenic_nand_cs *cs; if (WARN_ON(nfc->selected < 0)) return; @@ -106,25 +105,25 @@ static void jz4780_nand_cmd_ctrl(struct nand_chip *chip, int cmd, writeb(cmd, cs->base + OFFSET_CMD); } -static int jz4780_nand_dev_ready(struct nand_chip *chip) +static int ingenic_nand_dev_ready(struct nand_chip *chip) { - struct jz4780_nand_chip *nand = to_jz4780_nand_chip(nand_to_mtd(chip)); + struct ingenic_nand *nand =
[PATCH v5 02/12] dt-bindings: mtd: ingenic: Change 'BCH' to 'ECC' in documentation
The JZ4740 ECC hardware is not BCH but Reed-Solomon, so it makes more sense to use the more generic ECC term. Signed-off-by: Paul Cercueil Reviewed-by: Rob Herring --- Notes: v3: New patch v4: No change v5: No change .../devicetree/bindings/mtd/ingenic,jz4780-nand.txt| 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt b/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt index a5b940f18bf6..5a45cc54f46d 100644 --- a/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt +++ b/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt @@ -1,4 +1,4 @@ -* Ingenic JZ4780 NAND/BCH +* Ingenic JZ4780 NAND/ECC This file documents the device tree bindings for NAND flash devices on the JZ4780. NAND devices are connected to the NEMC controller (described in @@ -14,10 +14,10 @@ Required NAND controller device properties: an offset of 0 and a size of 0x100 (i.e. the whole NEMC bank). Optional NAND controller device properties: -- ingenic,bch-controller: To make use of the hardware BCH controller, this - property must contain a phandle for the BCH controller node. The required +- ingenic,bch-controller: To make use of the hardware ECC controller, this + property must contain a phandle for the ECC controller node. The required properties for this node are described below. If this is not specified, - software BCH will be used instead. + software ECC will be used instead. Optional children nodes: - Individual NAND chips are children of the NAND controller node. @@ -70,17 +70,17 @@ nemc: nemc@1341 { }; }; -The BCH controller is a separate SoC component used for error correction on +The ECC controller is a separate SoC component used for error correction on NAND devices. The following is a description of the device properties for a -BCH controller. +ECC controller. 
-Required BCH properties: +Required ECC properties: - compatible: Should be one of: * ingenic,jz4740-ecc * ingenic,jz4725b-bch * ingenic,jz4780-bch -- reg: Should specify the BCH controller registers location and length. -- clocks: Clock for the BCH controller. +- reg: Should specify the ECC controller registers location and length. +- clocks: Clock for the ECC controller. Example: -- 2.11.0
[PATCH v5 08/12] mtd: rawnand: ingenic: Separate top-level and SoC specific code
The ingenic-nand driver uses an API provided by the jz4780-bch driver. This makes it difficult to support other SoCs in the jz4780-bch driver. To work around this, we separate the API functions from the SoC-specific code, so that these API functions are SoC-agnostic. Signed-off-by: Paul Cercueil --- Notes: v2: Add an optional .probe() callback. It is used for instance to set the clock rate in the JZ4780 backend. v3: The common code is now inside the ingenic-ecc module. Each SoC-specific ECC code is now in its own module, which leaves to the user the choice of which (if any) ECC code should be supported. v4: No change v5: Remove .probe callback. Instead, call ingenic_ecc_probe from jz4780_bch_probe and set the clock rate afterwards. drivers/mtd/nand/raw/ingenic/Kconfig| 17 +++ drivers/mtd/nand/raw/ingenic/Makefile | 5 +- drivers/mtd/nand/raw/ingenic/ingenic_ecc.c | 153 + drivers/mtd/nand/raw/ingenic/ingenic_ecc.h | 83 ++ drivers/mtd/nand/raw/ingenic/ingenic_nand.c | 40 +++ drivers/mtd/nand/raw/ingenic/jz4780_bch.c | 172 +--- drivers/mtd/nand/raw/ingenic/jz4780_bch.h | 40 --- 7 files changed, 307 insertions(+), 203 deletions(-) create mode 100644 drivers/mtd/nand/raw/ingenic/ingenic_ecc.c create mode 100644 drivers/mtd/nand/raw/ingenic/ingenic_ecc.h delete mode 100644 drivers/mtd/nand/raw/ingenic/jz4780_bch.h diff --git a/drivers/mtd/nand/raw/ingenic/Kconfig b/drivers/mtd/nand/raw/ingenic/Kconfig index 67806c87b2c4..4bf7d7af3b0a 100644 --- a/drivers/mtd/nand/raw/ingenic/Kconfig +++ b/drivers/mtd/nand/raw/ingenic/Kconfig @@ -11,3 +11,20 @@ config MTD_NAND_JZ4780 help Enables support for NAND Flash connected to the NEMC on JZ4780 SoC based boards, using the BCH controller for hardware error correction. 
+ +if MTD_NAND_JZ4780 + +config MTD_NAND_INGENIC_ECC + tristate + +config MTD_NAND_JZ4780_BCH + tristate "Hardware BCH support for JZ4780 SoC" + select MTD_NAND_INGENIC_ECC + help + Enable this driver to support the BCH error-correction hardware + present on the JZ4780 SoC from Ingenic. + + This driver can also be built as a module. If so, the module + will be called jz4780-bch. + +endif # MTD_NAND_JZ4780 diff --git a/drivers/mtd/nand/raw/ingenic/Makefile b/drivers/mtd/nand/raw/ingenic/Makefile index af2d5de21f60..f3c3c0f230b0 100644 --- a/drivers/mtd/nand/raw/ingenic/Makefile +++ b/drivers/mtd/nand/raw/ingenic/Makefile @@ -1,2 +1,5 @@ obj-$(CONFIG_MTD_NAND_JZ4740) += jz4740_nand.o -obj-$(CONFIG_MTD_NAND_JZ4780) += ingenic_nand.o jz4780_bch.o +obj-$(CONFIG_MTD_NAND_JZ4780) += ingenic_nand.o + +obj-$(CONFIG_MTD_NAND_INGENIC_ECC) += ingenic_ecc.o +obj-$(CONFIG_MTD_NAND_JZ4780_BCH) += jz4780_bch.o diff --git a/drivers/mtd/nand/raw/ingenic/ingenic_ecc.c b/drivers/mtd/nand/raw/ingenic/ingenic_ecc.c new file mode 100644 index ..d7f3a8c3abea --- /dev/null +++ b/drivers/mtd/nand/raw/ingenic/ingenic_ecc.c @@ -0,0 +1,153 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * JZ47xx ECC common code + * + * Copyright (c) 2015 Imagination Technologies + * Author: Alex Smith + */ + +#include +#include +#include +#include + +#include "ingenic_ecc.h" + +/** + * ingenic_ecc_calculate() - calculate ECC for a data buffer + * @ecc: ECC device. + * @params: ECC parameters. + * @buf: input buffer with raw data. + * @ecc_code: output buffer with ECC. + * + * Return: 0 on success, -ETIMEDOUT if timed out while waiting for ECC + * controller. + */ +int ingenic_ecc_calculate(struct ingenic_ecc *ecc, + struct ingenic_ecc_params *params, + const u8 *buf, u8 *ecc_code) +{ + return ecc->ops->calculate(ecc, params, buf, ecc_code); +} +EXPORT_SYMBOL(ingenic_ecc_calculate); + +/** + * ingenic_ecc_correct() - detect and correct bit errors + * @ecc: ECC device. + * @params: ECC parameters. 
+ * @buf: raw data read from the chip. + * @ecc_code: ECC read from the chip. + * + * Given the raw data and the ECC read from the NAND device, detects and + * corrects errors in the data. + * + * Return: the number of bit errors corrected, -EBADMSG if there are too many + * errors to correct or -ETIMEDOUT if we timed out waiting for the controller. + */ +int ingenic_ecc_correct(struct ingenic_ecc *ecc, + struct ingenic_ecc_params *params, + u8 *buf, u8 *ecc_code) +{ + return ecc->ops->correct(ecc, params, buf, ecc_code); +} +EXPORT_SYMBOL(ingenic_ecc_correct); + +/** + * ingenic_ecc_get() - get the ECC controller device + * @np: ECC device tree node. + * + * Gets the ECC controller device from the specified device tree node. The + * device must be released with ingenic_ecc_release() when it is no longer being + * used. + * + * Return: a pointer to ingenic_ecc, errors are encoded into the
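The SoC-agnostic wrappers above are one-line delegations through a per-SoC ops table. The following is a minimal userspace sketch of that pattern, not the actual ingenic_ecc structures: the type names, the toy XOR "backend", and the 4-byte buffer size are all hypothetical, chosen only to make the delegation testable.

```c
#include <assert.h>
#include <stddef.h>

struct ecc_params { int strength; };
struct ecc_dev;

/* Per-SoC backends fill in this ops table; the common layer never
 * touches hardware itself. */
struct ecc_ops {
	int (*calculate)(struct ecc_dev *ecc, struct ecc_params *p,
			 const unsigned char *buf, unsigned char *code);
	int (*correct)(struct ecc_dev *ecc, struct ecc_params *p,
		       unsigned char *buf, unsigned char *code);
};

struct ecc_dev { const struct ecc_ops *ops; };

/* SoC-agnostic API, mirroring the shape of ingenic_ecc_calculate(). */
static int ecc_calculate(struct ecc_dev *ecc, struct ecc_params *p,
			 const unsigned char *buf, unsigned char *code)
{
	return ecc->ops->calculate(ecc, p, buf, code);
}

/* Toy backend: XOR four data bytes into one parity byte. */
static int toy_calculate(struct ecc_dev *ecc, struct ecc_params *p,
			 const unsigned char *buf, unsigned char *code)
{
	unsigned char parity = 0;
	int i;

	(void)ecc;
	(void)p;
	for (i = 0; i < 4; i++)
		parity ^= buf[i];
	code[0] = parity;
	return 0;
}

static const struct ecc_ops toy_ops = { .calculate = toy_calculate };
```

The point of the split is that ingenic_nand only ever sees the wrapper, so a new SoC just supplies a new ops table in its own module.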
[PATCH v5 09/12] mtd: rawnand: ingenic: Make use of ecc-engine property
Use the 'ecc-engine' standard property instead of the custom 'ingenic,bch-controller' property, which is now deprecated. Signed-off-by: Paul Cercueil --- Notes: v5: New patch drivers/mtd/nand/raw/ingenic/ingenic_ecc.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/drivers/mtd/nand/raw/ingenic/ingenic_ecc.c b/drivers/mtd/nand/raw/ingenic/ingenic_ecc.c index d7f3a8c3abea..30436ca6628a 100644 --- a/drivers/mtd/nand/raw/ingenic/ingenic_ecc.c +++ b/drivers/mtd/nand/raw/ingenic/ingenic_ecc.c @@ -82,9 +82,9 @@ static struct ingenic_ecc *ingenic_ecc_get(struct device_node *np) /** * of_ingenic_ecc_get() - get the ECC controller from a DT node - * @of_node: the node that contains a bch-controller property. + * @of_node: the node that contains a ecc-engine property. * - * Get the bch-controller property from the given device tree + * Get the ecc-engine property from the given device tree * node and pass it to ingenic_ecc_get to do the work. * * Return: a pointer to ingenic_ecc, errors are encoded into the pointer. @@ -95,7 +95,14 @@ struct ingenic_ecc *of_ingenic_ecc_get(struct device_node *of_node) struct ingenic_ecc *ecc = NULL; struct device_node *np; - np = of_parse_phandle(of_node, "ingenic,bch-controller", 0); + np = of_parse_phandle(of_node, "ecc-engine", 0); + + /* +* If the ecc-engine property is not found, check for the deprecated +* ingenic,bch-controller property +*/ + if (!np) + np = of_parse_phandle(of_node, "ingenic,bch-controller", 0); if (np) { ecc = ingenic_ecc_get(np); -- 2.11.0
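The lookup order the patch introduces — standard property first, deprecated property as fallback — can be modeled in userspace. This sketch substitutes a two-entry string table for of_parse_phandle(); `fake_node` and the helper names are invented for the test and are not DT APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct fake_node {
	const char *props[2];	/* property names present on the node */
};

/* Stand-in for of_parse_phandle(): returns the property if present. */
static const char *find_prop(const struct fake_node *np, const char *name)
{
	size_t i;

	for (i = 0; i < 2; i++)
		if (np->props[i] && !strcmp(np->props[i], name))
			return np->props[i];
	return NULL;
}

/* Mirrors the patched of_ingenic_ecc_get() control flow: prefer the
 * standard "ecc-engine" binding, fall back to the deprecated one. */
static const char *ecc_engine_prop(const struct fake_node *np)
{
	const char *prop = find_prop(np, "ecc-engine");

	if (!prop)
		prop = find_prop(np, "ingenic,bch-controller");
	return prop;
}
```

Keeping the deprecated lookup after the standard one means old device trees keep working while new ones use the generic binding.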
[PATCH v5 04/12] mtd: rawnand: Move drivers for Ingenic SoCs to subfolder
Before adding support for more SoCs and seeing the number of files for these drivers grow, we move them to their own subfolder to keep it tidy. Signed-off-by: Paul Cercueil --- Notes: v2: New patch v3: No change v4: No change v5: No change drivers/mtd/nand/raw/Kconfig | 14 +- drivers/mtd/nand/raw/Makefile| 3 +-- drivers/mtd/nand/raw/ingenic/Kconfig | 13 + drivers/mtd/nand/raw/ingenic/Makefile| 2 ++ drivers/mtd/nand/raw/{ => ingenic}/jz4740_nand.c | 0 drivers/mtd/nand/raw/{ => ingenic}/jz4780_bch.c | 0 drivers/mtd/nand/raw/{ => ingenic}/jz4780_bch.h | 0 drivers/mtd/nand/raw/{ => ingenic}/jz4780_nand.c | 0 8 files changed, 17 insertions(+), 15 deletions(-) create mode 100644 drivers/mtd/nand/raw/ingenic/Kconfig create mode 100644 drivers/mtd/nand/raw/ingenic/Makefile rename drivers/mtd/nand/raw/{ => ingenic}/jz4740_nand.c (100%) rename drivers/mtd/nand/raw/{ => ingenic}/jz4780_bch.c (100%) rename drivers/mtd/nand/raw/{ => ingenic}/jz4780_bch.h (100%) rename drivers/mtd/nand/raw/{ => ingenic}/jz4780_nand.c (100%) diff --git a/drivers/mtd/nand/raw/Kconfig b/drivers/mtd/nand/raw/Kconfig index e604625e2dfa..705863c6709a 100644 --- a/drivers/mtd/nand/raw/Kconfig +++ b/drivers/mtd/nand/raw/Kconfig @@ -470,19 +470,7 @@ config MTD_NAND_NUC900 This enables the driver for the NAND Flash on evaluation board based on w90p910 / NUC9xx. -config MTD_NAND_JZ4740 - tristate "Support for JZ4740 SoC NAND controller" - depends on MACH_JZ4740 || COMPILE_TEST - depends on HAS_IOMEM - help - Enables support for NAND Flash on JZ4740 SoC based boards. - -config MTD_NAND_JZ4780 - tristate "Support for NAND on JZ4780 SoC" - depends on JZ4780_NEMC - help - Enables support for NAND Flash connected to the NEMC on JZ4780 SoC - based boards, using the BCH controller for hardware error correction. 
+source "drivers/mtd/nand/raw/ingenic/Kconfig" config MTD_NAND_FSMC tristate "Support for NAND on ST Micros FSMC" diff --git a/drivers/mtd/nand/raw/Makefile b/drivers/mtd/nand/raw/Makefile index 5a5a72f0793e..5552c4a9f2cf 100644 --- a/drivers/mtd/nand/raw/Makefile +++ b/drivers/mtd/nand/raw/Makefile @@ -45,8 +45,7 @@ obj-$(CONFIG_MTD_NAND_NUC900) += nuc900_nand.o obj-$(CONFIG_MTD_NAND_MPC5121_NFC) += mpc5121_nfc.o obj-$(CONFIG_MTD_NAND_VF610_NFC) += vf610_nfc.o obj-$(CONFIG_MTD_NAND_RICOH) += r852.o -obj-$(CONFIG_MTD_NAND_JZ4740) += jz4740_nand.o -obj-$(CONFIG_MTD_NAND_JZ4780) += jz4780_nand.o jz4780_bch.o +obj-y += ingenic/ obj-$(CONFIG_MTD_NAND_GPMI_NAND) += gpmi-nand/ obj-$(CONFIG_MTD_NAND_XWAY)+= xway_nand.o obj-$(CONFIG_MTD_NAND_BCM47XXNFLASH) += bcm47xxnflash/ diff --git a/drivers/mtd/nand/raw/ingenic/Kconfig b/drivers/mtd/nand/raw/ingenic/Kconfig new file mode 100644 index ..67806c87b2c4 --- /dev/null +++ b/drivers/mtd/nand/raw/ingenic/Kconfig @@ -0,0 +1,13 @@ +config MTD_NAND_JZ4740 + tristate "Support for JZ4740 SoC NAND controller" + depends on MACH_JZ4740 || COMPILE_TEST + depends on HAS_IOMEM + help + Enables support for NAND Flash on JZ4740 SoC based boards. + +config MTD_NAND_JZ4780 + tristate "Support for NAND on JZ4780 SoC" + depends on JZ4780_NEMC + help + Enables support for NAND Flash connected to the NEMC on JZ4780 SoC + based boards, using the BCH controller for hardware error correction. 
diff --git a/drivers/mtd/nand/raw/ingenic/Makefile b/drivers/mtd/nand/raw/ingenic/Makefile new file mode 100644 index ..44c2ca053d24 --- /dev/null +++ b/drivers/mtd/nand/raw/ingenic/Makefile @@ -0,0 +1,2 @@ +obj-$(CONFIG_MTD_NAND_JZ4740) += jz4740_nand.o +obj-$(CONFIG_MTD_NAND_JZ4780) += jz4780_nand.o jz4780_bch.o diff --git a/drivers/mtd/nand/raw/jz4740_nand.c b/drivers/mtd/nand/raw/ingenic/jz4740_nand.c similarity index 100% rename from drivers/mtd/nand/raw/jz4740_nand.c rename to drivers/mtd/nand/raw/ingenic/jz4740_nand.c diff --git a/drivers/mtd/nand/raw/jz4780_bch.c b/drivers/mtd/nand/raw/ingenic/jz4780_bch.c similarity index 100% rename from drivers/mtd/nand/raw/jz4780_bch.c rename to drivers/mtd/nand/raw/ingenic/jz4780_bch.c diff --git a/drivers/mtd/nand/raw/jz4780_bch.h b/drivers/mtd/nand/raw/ingenic/jz4780_bch.h similarity index 100% rename from drivers/mtd/nand/raw/jz4780_bch.h rename to drivers/mtd/nand/raw/ingenic/jz4780_bch.h diff --git a/drivers/mtd/nand/raw/jz4780_nand.c b/drivers/mtd/nand/raw/ingenic/jz4780_nand.c similarity index 100% rename from drivers/mtd/nand/raw/jz4780_nand.c rename to drivers/mtd/nand/raw/ingenic/jz4780_nand.c -- 2.11.0
[PATCH v5 01/12] dt-bindings: mtd: ingenic: Add compatible strings for JZ4740 and JZ4725B
Add compatible strings to probe the jz4780-nand and jz4780-bch drivers from devicetree on the JZ4725B and JZ4740 SoCs from Ingenic. Signed-off-by: Paul Cercueil Reviewed-by: Rob Herring --- Notes: v2: - Change 'ingenic,jz4725b-nand' compatible string to 'ingenic,jz4740-nand' to reflect driver change - Add 'ingenic,jz4740-bch' compatible string - Document 'ingenic,oob-layout' property v3: - Removed 'ingenic,oob-layout' property - Update compatible strings to what the driver supports v4: No change v5: No change Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt b/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt index 29ea5853ca91..a5b940f18bf6 100644 --- a/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt +++ b/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt @@ -6,7 +6,10 @@ memory-controllers/ingenic,jz4780-nemc.txt), and thus NAND device nodes must be children of the NEMC node. Required NAND controller device properties: -- compatible: Should be set to "ingenic,jz4780-nand". +- compatible: Should be one of: + * ingenic,jz4740-nand + * ingenic,jz4725b-nand + * ingenic,jz4780-nand - reg: For each bank with a NAND chip attached, should specify a bank number, an offset of 0 and a size of 0x100 (i.e. the whole NEMC bank). @@ -72,7 +75,10 @@ NAND devices. The following is a description of the device properties for a BCH controller. Required BCH properties: -- compatible: Should be set to "ingenic,jz4780-bch". +- compatible: Should be one of: + * ingenic,jz4740-ecc + * ingenic,jz4725b-bch + * ingenic,jz4780-bch - reg: Should specify the BCH controller registers location and length. - clocks: Clock for the BCH controller. -- 2.11.0
[PATCH 1/2] platform/x86: intel_pmc_core: Convert to a platform_driver
Convert the intel_pmc_core driver to a platform driver. There is no functional change. Some code that tries to determine what kind of CPU this is has been moved from pmc_core_probe() to pmc_core_init(). Signed-off-by: Rajat Jain --- This is rebased off git://git.infradead.org/linux-platform-drivers-x86.git/for-next drivers/platform/x86/intel_pmc_core.c | 93 --- 1 file changed, 68 insertions(+), 25 deletions(-) diff --git a/drivers/platform/x86/intel_pmc_core.c b/drivers/platform/x86/intel_pmc_core.c index f2c621b55f49..55578d07610c 100644 --- a/drivers/platform/x86/intel_pmc_core.c +++ b/drivers/platform/x86/intel_pmc_core.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include @@ -854,12 +855,59 @@ static const struct dmi_system_id pmc_core_dmi_table[] = { {} }; -static int __init pmc_core_probe(void) +static int pmc_core_probe(struct platform_device *pdev) { - struct pmc_dev *pmcdev = &pmc; + struct pmc_dev *pmcdev = platform_get_drvdata(pdev); + int err; + + pmcdev->regbase = ioremap(pmcdev->base_addr, + pmcdev->map->regmap_length); + if (!pmcdev->regbase) + return -ENOMEM; + + mutex_init(&pmcdev->lock); + pmcdev->pmc_xram_read_bit = pmc_core_check_read_lock_bit(); + + err = pmc_core_dbgfs_register(pmcdev); + if (err < 0) { + dev_warn(&pdev->dev, "debugfs register failed.\n"); + iounmap(pmcdev->regbase); + return err; + } + + dmi_check_system(pmc_core_dmi_table); + dev_info(&pdev->dev, " initialized\n"); + return 0; +} + +static int pmc_core_remove(struct platform_device *pdev) +{ + struct pmc_dev *pmcdev = platform_get_drvdata(pdev); + + pmc_core_dbgfs_unregister(pmcdev); + mutex_destroy(&pmcdev->lock); + iounmap(pmcdev->regbase); + return 0; +} + +static struct platform_driver pmc_core_driver = { + .driver = { + .name = "pmc_core", + }, + .probe = pmc_core_probe, + .remove = pmc_core_remove, +}; + +static struct platform_device pmc_core_device = { + .name = "pmc_core", +}; + +static int __init pmc_core_init(void) +{ + int ret; const struct x86_cpu_id *cpu_id;
+ struct pmc_dev *pmcdev = &pmc; u64 slp_s0_addr; - int err; cpu_id = x86_match_cpu(intel_pmc_core_ids); if (!cpu_id) @@ -880,36 +928,31 @@ static int __init pmc_core_probe(void) else pmcdev->base_addr = slp_s0_addr - pmcdev->map->slp_s0_offset; - pmcdev->regbase = ioremap(pmcdev->base_addr, - pmcdev->map->regmap_length); - if (!pmcdev->regbase) - return -ENOMEM; + platform_set_drvdata(&pmc_core_device, pmcdev); - mutex_init(&pmcdev->lock); - pmcdev->pmc_xram_read_bit = pmc_core_check_read_lock_bit(); + ret = platform_device_register(&pmc_core_device); + if (ret) + return ret; - err = pmc_core_dbgfs_register(pmcdev); - if (err < 0) { - pr_warn(" debugfs register failed.\n"); - iounmap(pmcdev->regbase); - return err; - } + ret = platform_driver_register(&pmc_core_driver); + if (ret) + goto out_remove_dev; - dmi_check_system(pmc_core_dmi_table); - pr_info(" initialized\n"); return 0; + +out_remove_dev: + platform_device_unregister(&pmc_core_device); + return ret; } -module_init(pmc_core_probe) -static void __exit pmc_core_remove(void) +static void __init pmc_core_exit(void) { - struct pmc_dev *pmcdev = &pmc; - - pmc_core_dbgfs_unregister(pmcdev); - mutex_destroy(&pmcdev->lock); - iounmap(pmcdev->regbase); + platform_driver_unregister(&pmc_core_driver); + platform_device_unregister(&pmc_core_device); } -module_exit(pmc_core_remove) + +module_init(pmc_core_init); +module_exit(pmc_core_exit); MODULE_LICENSE("GPL v2"); MODULE_DESCRIPTION("Intel PMC Core Driver"); -- 2.21.0.360.g471c308f928-goog
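The init path in the patch registers a platform device and then a platform driver, and must unregister the device if driver registration fails. Stripped of the platform-bus plumbing, that unwind pattern looks like the sketch below; the `demo_*` names and the `fail_driver` test knob are hypothetical, only the goto-unwind shape matches the patch.

```c
#include <assert.h>
#include <stdbool.h>

static bool device_registered, driver_registered, fail_driver;

static int register_device(void) { device_registered = true; return 0; }
static void unregister_device(void) { device_registered = false; }

static int register_driver(void)
{
	if (fail_driver)
		return -1;	/* stands in for -ENOMEM etc. */
	driver_registered = true;
	return 0;
}

/* Same shape as pmc_core_init(): second failure undoes the first step. */
static int demo_init(void)
{
	int ret = register_device();

	if (ret)
		return ret;

	ret = register_driver();
	if (ret)
		goto out_remove_dev;	/* same label name as the patch */

	return 0;

out_remove_dev:
	unregister_device();
	return ret;
}
```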
[PATCH 2/2] platform/x86: intel_pmc_core: Allow to dump debug registers on S0ix failure
Add a module parameter which, when enabled, will check on resume whether the last S0ix attempt was successful. If not, the driver would provide helpful debug information (which gets latched during the failed suspend attempt) to debug the S0ix failure. This information is very useful to debug S0ix failures, especially since the latched debug information will be lost (over-written) if the system attempts to go into runtime (or imminent) S0ix again after that failed suspend attempt. Signed-off-by: Rajat Jain --- drivers/platform/x86/intel_pmc_core.c | 86 +++ drivers/platform/x86/intel_pmc_core.h | 7 +++ 2 files changed, 93 insertions(+) diff --git a/drivers/platform/x86/intel_pmc_core.c b/drivers/platform/x86/intel_pmc_core.c index 55578d07610c..b1f4405a27ce 100644 --- a/drivers/platform/x86/intel_pmc_core.c +++ b/drivers/platform/x86/intel_pmc_core.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include @@ -890,9 +891,94 @@ static int pmc_core_remove(struct platform_device *pdev) return 0; } +#ifdef CONFIG_PM_SLEEP + +static bool warn_on_s0ix_failures; +module_param(warn_on_s0ix_failures, bool, 0644); +MODULE_PARM_DESC(warn_on_s0ix_failures, "Check and warn for S0ix failures"); + +static int pmc_core_suspend(struct device *dev) +{ + struct pmc_dev *pmcdev = dev_get_drvdata(dev); + + /* Save PC10 and S0ix residency for checking later */ + if (warn_on_s0ix_failures && + !rdmsrl_safe(MSR_PKG_C10_RESIDENCY, &pmcdev->pc10_counter) && + !pmc_core_dev_state_get(pmcdev, &pmcdev->s0ix_counter)) + pmcdev->check_counters = true; + else + pmcdev->check_counters = false; + + return 0; +} + +static inline bool pc10_failed(struct pmc_dev *pmcdev) +{ + u64 pc10_counter; + + if (!rdmsrl_safe(MSR_PKG_C10_RESIDENCY, &pc10_counter) && + pc10_counter == pmcdev->pc10_counter) + return true; + else + return false; +} + +static inline bool s0ix_failed(struct pmc_dev *pmcdev) +{ + u64 s0ix_counter; + + if (!pmc_core_dev_state_get(pmcdev, &s0ix_counter) && + s0ix_counter == pmcdev->s0ix_counter) + return 
true; + else + return false; +} + +static int pmc_core_resume(struct device *dev) +{ + struct pmc_dev *pmcdev = dev_get_drvdata(dev); + + if (!pmcdev->check_counters) + return 0; + + if (pc10_failed(pmcdev)) { + dev_info(dev, "PC10 entry had failed (PC10 cnt=0x%llx)\n", +pmcdev->pc10_counter); + } else if (s0ix_failed(pmcdev)) { + + const struct pmc_bit_map **maps = pmcdev->map->slps0_dbg_maps; + const struct pmc_bit_map *map; + int offset = pmcdev->map->slps0_dbg_offset; + u32 data; + + dev_warn(dev, "S0ix entry had failed (S0ix cnt=%llu)\n", +pmcdev->s0ix_counter); + while (*maps) { + map = *maps; + data = pmc_core_reg_read(pmcdev, offset); + offset += 4; + while (map->name) { + dev_warn(dev, "SLP_S0_DBG: %-32s\tState: %s\n", +map->name, +data & map->bit_mask ? "Yes" : "No"); + ++map; + } + ++maps; + } + } + return 0; +} + +#endif + +const struct dev_pm_ops pmc_core_pm_ops = { + SET_LATE_SYSTEM_SLEEP_PM_OPS(pmc_core_suspend, pmc_core_resume) +}; + static struct platform_driver pmc_core_driver = { .driver = { .name = "pmc_core", + .pm = &pmc_core_pm_ops }, .probe = pmc_core_probe, .remove = pmc_core_remove, diff --git a/drivers/platform/x86/intel_pmc_core.h b/drivers/platform/x86/intel_pmc_core.h index 88d9c0653a5f..fdee5772e532 100644 --- a/drivers/platform/x86/intel_pmc_core.h +++ b/drivers/platform/x86/intel_pmc_core.h @@ -241,6 +241,9 @@ struct pmc_reg_map { * @pmc_xram_read_bit: flag to indicate whether PMC XRAM shadow registers * used to read MPHY PG and PLL status are available * @mutex_lock:mutex to complete one transcation + * @check_counters:On resume, check if counters are getting incremented + * @pc10_counter: PC10 residency counter + * @s0ix_counter: S0ix residency (step adjusted) * * pmc_dev contains info about power management controller device. 
*/ @@ -253,6 +256,10 @@ struct pmc_dev { #endif /* CONFIG_DEBUG_FS */ int pmc_xram_read_bit; struct mutex lock; /* generic mutex lock for PMC Core */ + + bool check_counters; /* Check for counter increments on resume */ + u64 pc10_counter; + u64 s0ix_counter; }; #endif /* PMC_CORE_H */ --
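Stripped of the MSR and PMC register plumbing, the check-counters logic in this patch reduces to snapshot-and-compare: latch the residency counter at suspend, and on resume treat an unchanged counter as a failed low-power entry. A small userspace model, with `read_s0ix_counter()` standing in for pmc_core_dev_state_get() and all names hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint64_t hw_counter;	/* pretend hardware residency counter */

static uint64_t read_s0ix_counter(void) { return hw_counter; }

struct pm_state {
	bool check_counters;
	uint64_t s0ix_counter;
};

/* Suspend side: snapshot the counter for the later comparison. */
static void demo_suspend(struct pm_state *st)
{
	st->s0ix_counter = read_s0ix_counter();
	st->check_counters = true;
}

/* Resume side: an unchanged counter means the state was never entered. */
static bool demo_resume_s0ix_failed(const struct pm_state *st)
{
	if (!st->check_counters)
		return false;
	return read_s0ix_counter() == st->s0ix_counter;
}
```

This is why the check must run on the very next resume: a later successful S0ix entry would advance the counter and over-write the latched debug registers, exactly as the commit message warns.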
Re: [PATCH] [CIFS] Remove unneeded statements pointed out by Coverity
reviewed by me On Thu, Mar 14, 2019 at 7:53 AM Steve French wrote: > > cifs: remove unused value pointed out by Coverity > > Detected by CoverityScan CID#1438719 ("Unused Value") > > buf is reset again before being used so these two lines of code > are useless. > > Signed-off-by: Steve French > --- > fs/cifs/connect.c | 4 > 1 file changed, 4 deletions(-) > > diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c > index b95db2b593cb..a8e9738db691 100644 > --- a/fs/cifs/connect.c > +++ b/fs/cifs/connect.c > @@ -1191,10 +1191,6 @@ cifs_demultiplex_thread(void *p) > continue; > } > > -if (server->large_buf) > -buf = server->bigbuf; > - > - > server->lstrp = jiffies; > > for (i = 0; i < num_mids; i++) { > > -- > Thanks, > > Steve
Re: [PATCH] spi: mediatek: Attempt to address style issues in spi-mt7621.c
On Wed, Mar 13 2019, Stefan Roese wrote: > On 13.03.19 17:46, Matthias Brugger wrote: >> >> >> On 13/03/2019 17:34, Chuanhong Guo wrote: >>> Hi! >>> On Wed, Mar 13, 2019 at 8:28 PM Matthias Brugger >>> wrote: On 13/03/2019 13:24, Armando Miraglia wrote: [...] Apart from fixing styling issues it would be usefull to see if we can add support for mt7621 to drivers/spi/spi-mt65xx.c >>> It's impossible. They are completely different IPs. >> >> Thanks for the info. Do you know the status of the rest of the drivers in >> staging? > > Just to inform you. We are using this SPI driver from staging > in one of our customer projects and I will try to move this > driver out of staging into drivers/spi very shortly. This is good news! I think the driver is ready to move out of staging and would be very happy to see you do it. My only small concern is that this driver was backported to openwrt (4.14 based) and then reverted https://github.com/openwrt/openwrt/commit/749a29f76ca780d8df70a5163d43bbdc6f13ba3f "This breaks some mt7621 devices." Possibly it was backported badly, or possibly there is a problem. John: do you have any more details of the problem other than what is in the commit message? Thanks, NeilBrown
Re: overlayfs vs. fscrypt
On Wed, Mar 13, 2019 at 02:04:29PM -0700, James Bottomley wrote: > On Wed, 2019-03-13 at 13:25 -0700, Eric Biggers wrote: > > On Wed, Mar 13, 2019 at 01:06:06PM -0700, James Bottomley wrote: > > > On Wed, 2019-03-13 at 12:57 -0700, Eric Biggers wrote: > [...] > > > > fscrypt would allow the data to be stored encrypted on the local > > > > disk, so it's protected against offline compromise of the disk. > > > > > > Container images are essentially tars of the overlays. They only > > > become actual filesystems when instantiated at runtime. The > > > current encrypted container image is an overlay or set of overlays > > > which is tarred then encrypted. So to instantiate it is decrypted > > > then untarred. > > > > > > The thing I was wondering about was whether instead of a tar > > > encrypt we could instead produce an encrypted image from a fscrypt > > > filesystem. > > > > > > > Why do you care whether the container image is encrypted on the local > > disk, when you're extracting it in plaintext onto the local disk > > anyway each time it runs? Even after the runtime files are "deleted", > > they may still be recoverable from the disk. Are you using shred and > > BLKSECDISCARD, and a non-COW filesystem? > > > > Now, if you wanted to avoid writing the plaintext to disk entirely > > (and thereby use encryption to actually achieve a useful security > > property that can't be achieved through file permissions), fscrypt is > > a good solution for that. > > OK let's start with a cloud and container 101: A container is an > exactly transportable IaaS environment containing an application. The > format for the exact transport is the "container image" I've been > describing (layered tar file set deployed with overlays). These images > are usually stored in cloud based registries which may or may not have > useful access controls. I take it the reason for image encryption to > protect confidentiality within the registry is obvious. 
> > Because of the exact transport, the deployment may be on my laptop, on > my test system or in some type of public or private cloud. In all > cases bar the laptop, I won't actually own the physical system which > ends up deploying the container. So in exchange for security > guarantees from the physical system owner, I agree to turn over my > decryption key and possibly a cash payment. One of these guarantees is > usually that they shred the key after use and that they deploy a useful > key escrow system like vault or keyprotect to guard it even while the > decryption is being done. > Another is that all traces of the container be shredded after the execution is > finished. Well, sounds like that's not the case currently even with an encrypted container image, because the actual runtime files are not encrypted on disk. Encrypting the runtime files using fscrypt with an ephemeral key would be useful here. IOW, randomly generate an encryption key when the container starts, never store it anywhere, and wipe it when the container stops. Note that this is separate from the container *image* encryption. > considering is could I be protected against either cloud provider > cockups that might leak the image (the misconfigured backup scenario I > suggested) or malicious actions of other tenants. If the container image is encrypted with a key not on the system, then its confidentiality is protected from anything that may happen on that system. But if the container image encryption key *is* on the system, your container image may be leaked either accidentally or maliciously. - Eric
Re: [PATCH 4.9 00/96] 4.9.163-stable review
On Wed, Mar 13, 2019 at 01:49:49PM -0700, Greg Kroah-Hartman wrote: > On Wed, Mar 13, 2019 at 01:34:41PM -0700, Guenter Roeck wrote: > > On Tue, Mar 12, 2019 at 10:09:18AM -0700, Greg Kroah-Hartman wrote: > > > This is the start of the stable review cycle for the 4.9.163 release. > > > There are 96 patches in this series, all will be posted as a response > > > to this one. If anyone has any issues with these being applied, please > > > let me know. > > > > > > Responses should be made by Thu Mar 14 17:10:06 UTC 2019. > > > Anything received after that time might be too late. > > > > > > > Build results: > > total: 172 pass: 170 fail: 2 > > Failed builds: > > i386:tools/perf > > x86_64:tools/perf > > Qemu test results: > > total: 316 pass: 316 fail: 0 > > > > Culprit is commit 386ca5754fad ("perf trace: Support multiple "vfs_getname" > > probes"), which introduces a call to strstarts() but not the necessary > > include file or declaration. Unfortunately, fixing that is quite complex > > (commit 8e99b6d4533c ("tools include: Adopt strstarts() from the kernel") > > doesn't apply). > > Thanks for tracking this down, I'm lost in a maze of gcc8 issues with > perf on 4.9 and 4.4 at the moment trying to fix that. > > I'll go drop this patch from the 4.9 and 4.4 queues. > Agreed; the same patch also introduces a call to evlist__for_each_entry_safe(), which does not exist in v4.4.y. Guenter
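For context on the build failure: strstarts() is a one-line helper over strncmp(), and the backported perf patch calls it without its declaration being visible in the older trees. A standalone copy of that helper (same semantics as the kernel's, reproduced here for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Does @str begin with @prefix? */
static inline bool strstarts(const char *str, const char *prefix)
{
	return strncmp(str, prefix, strlen(prefix)) == 0;
}
```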
Re: [PATCH 1/5] lib/sort: Make swap functions more generic
On Wed, Mar 13, 2019 at 10:23 PM Rasmus Villemoes wrote: > On 21/02/2019 07.30, George Spelvin wrote: > > +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS > > + (void)base; > > +#else > > + lsbits |= (unsigned int)(size_t)base; > > The kernel usually casts pointers to long or unsigned long. If some > oddball arch has size_t something other than unsigned long, this would > give a "warning: cast from pointer to integer of different size". So > just cast to unsigned long, and drop the cast to unsigned int. The proper integer equivalent of a pointer is uintptr_t, so please use that instead of size_t. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
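The review point — cast pointers through uintptr_t, the integer type defined to round-trip a pointer — can be shown with a small alignment probe in the spirit of the patch under discussion. The function name is invented for the example; only the `(uintptr_t)base` cast is the technique being illustrated.

```c
#include <assert.h>
#include <stdint.h>

/* OR the low bits of the size and the base address together: the result
 * tells us the strictest alignment both are guaranteed to share, which
 * is what lib/sort uses to pick a wide or narrow swap routine. */
static unsigned int common_low_bits(const void *base, uintptr_t size)
{
	uintptr_t lsbits = size;

	lsbits |= (uintptr_t)base;	/* uintptr_t, not size_t or int */
	return (unsigned int)(lsbits & 7);	/* 0 => 8-byte ops are safe */
}
```

Casting to `unsigned long` also works on Linux ABIs, but `uintptr_t` states the intent and avoids the pointer-to-narrower-integer warning Rasmus describes.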
Re: Kernel bug with MPX?
On 3/7/19 11:41 PM, Michal Hocko wrote: > And this seems real leak because I just hit some bugons later > > [112423.206497] BUG: Bad rss-counter state mm:7aa9c8a7 idx:1 val:25593 > [113601.595093] page:ea00041a07c0 count:2 mapcount:1 > mapping:88818d70e9a1 index:0x7f821adf6 > [113601.595102] anon FWIW, I was able to reproduce this. No idea what the problem is, yet, but I'm looking at it.
Re: [PATCH] spi: fix SPI_BPW_RANGE_MASK() regression
On Wed, Mar 13, 2019 at 10:01 PM Arnd Bergmann wrote: > Geert points out that I confused the min/max arguments that are > reversed between SPI_BPW_RANGE_MASK() and GENMASK(). This time > I have verified the result of the macro after fixing the arguments. > > Cc: Geert Uytterhoeven > Fixes: eefffb42f665 ("spi: work around clang bug in SPI_BPW_RANGE_MASK()") > Signed-off-by: Arnd Bergmann Reviewed-by: Geert Uytterhoeven Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
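The regression being fixed is an argument-order trap: GENMASK(h, l) takes the high bit first, while SPI_BPW_RANGE_MASK(min, max) takes the minimum word size first, so defining one in terms of the other requires swapping. Userspace replicas of the macros (the `_SK` suffix marks them as local copies, assuming 32-bit masks since `bits_per_word_mask` is a u32):

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_LONG_SK 32

/* Same construction as the kernel's GENMASK(): bits h..l set. */
#define GENMASK_SK(h, l) \
	(((~0u) - (1u << (l)) + 1) & (~0u >> (BITS_PER_LONG_SK - 1 - (h))))

/* SPI_BPW_MASK(bits): single bit for one supported word size. */
#define SPI_BPW_MASK_SK(bits) (1u << ((bits) - 1))

/* Note the swap: min maps to the low bit, max to the high bit. */
#define SPI_BPW_RANGE_MASK_SK(min, max) GENMASK_SK((max) - 1, (min) - 1)
```

With the arguments in this order, SPI_BPW_RANGE_MASK(1, 32) covers the full u32 and a degenerate range collapses to the single-size mask, which is the behavior Arnd says he verified.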
[PATCH] [CIFS] Remove unneeded statements pointed out by Coverity
cifs: remove unused value pointed out by Coverity Detected by CoverityScan CID#1438719 ("Unused Value") buf is reset again before being used so these two lines of code are useless. Signed-off-by: Steve French --- fs/cifs/connect.c | 4 1 file changed, 4 deletions(-) diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c index b95db2b593cb..a8e9738db691 100644 --- a/fs/cifs/connect.c +++ b/fs/cifs/connect.c @@ -1191,10 +1191,6 @@ cifs_demultiplex_thread(void *p) continue; } -if (server->large_buf) -buf = server->bigbuf; - - server->lstrp = jiffies; for (i = 0; i < num_mids; i++) { -- Thanks, Steve
Re: [PATCH v2] PCI: qcom: Use default config space read function
On 13/03/2019 20:39, Marc Gonzalez wrote:
>> working without patch : https://paste.ubuntu.com/p/TJm4hgjGW4/
>> I have not debugged it any further other than just testing the patch.
>> Let me know if you need any more dumps for more debug.
>
> Could you pastebin the output of lspci -vvv -n in the working case?

This was already in my original reply:
working without patch : https://paste.ubuntu.com/p/TJm4hgjGW4/

--srini