[PATCH] pci: pcie-xilinx: fix a missing-check bug for __get_free_pages
In case __get_free_pages fails, the fix returns to avoid a NULL pointer
dereference.

Signed-off-by: Kangjie Lu
---
 drivers/pci/controller/pcie-xilinx.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/pci/controller/pcie-xilinx.c b/drivers/pci/controller/pcie-xilinx.c
index 9bd1a35cd5d8..b7083e995c45 100644
--- a/drivers/pci/controller/pcie-xilinx.c
+++ b/drivers/pci/controller/pcie-xilinx.c
@@ -341,6 +341,9 @@ static void xilinx_pcie_enable_msi(struct xilinx_pcie_port *port)
 	phys_addr_t msg_addr;
 
 	port->msi_pages = __get_free_pages(GFP_KERNEL, 0);
+	if (unlikely(!port->msi_pages))
+		return;
+
 	msg_addr = virt_to_phys((void *)port->msi_pages);
 	pcie_write(port, 0x0, XILINX_PCIE_REG_MSIBASE1);
 	pcie_write(port, msg_addr, XILINX_PCIE_REG_MSIBASE2);
-- 
2.17.1
[PATCH] pci: pci-tegra: fix a potential NULL pointer dereference
In case __get_free_pages fails and returns NULL, the fix returns -ENOMEM
and releases resources to avoid a NULL pointer dereference.

Signed-off-by: Kangjie Lu
---
 drivers/pci/controller/pci-tegra.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/pci/controller/pci-tegra.c b/drivers/pci/controller/pci-tegra.c
index f4f53d092e00..0bdc6ee904f3 100644
--- a/drivers/pci/controller/pci-tegra.c
+++ b/drivers/pci/controller/pci-tegra.c
@@ -1550,6 +1550,12 @@ static int tegra_pcie_msi_setup(struct tegra_pcie *pcie)
 	/* setup AFI/FPCI range */
 	msi->pages = __get_free_pages(GFP_KERNEL, 0);
+	if (!msi->pages) {
+		dev_err(dev, "failed to get free pages\n");
+		err = -ENOMEM;
+		goto err;
+	}
+
 	msi->phys = virt_to_phys((void *)msi->pages);
 	host->msi = &msi->chip;
-- 
2.17.1
[PATCH] hyperv: fix a potential NULL pointer dereference
In case alloc_page fails, the fix returns -ENOMEM to avoid the potential
NULL pointer dereference.

Signed-off-by: Kangjie Lu
---
 arch/x86/hyperv/hv_init.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index 7abb09e2eeb8..dfdb4ce1ae9c 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -102,9 +102,13 @@ static int hv_cpu_init(unsigned int cpu)
 	u64 msr_vp_index;
 	struct hv_vp_assist_page **hvp = &hv_vp_assist_page[smp_processor_id()];
 	void **input_arg;
+	struct page *pg;
 
 	input_arg = (void **)this_cpu_ptr(hyperv_pcpu_input_arg);
-	*input_arg = page_address(alloc_page(GFP_KERNEL));
+	pg = alloc_page(GFP_KERNEL);
+	if (unlikely(!pg))
+		return -ENOMEM;
+	*input_arg = page_address(pg);
 
 	hv_get_vp_index(msr_vp_index);
-- 
2.17.1
Re: [PATCH V3 07/10] mmc: tegra: add Tegra186 WAR for CQE
On 3/14/2019 3:15 AM, Sowjanya Komatineni wrote:

Tegra186 CQHCI host has a known bug where the CQHCI controller sets the
DATA_PRESENT_SELECT bit to 1 for DCMDs with an R1b response type. Since
a DCMD does not trigger any data transfer, DCMD task completion happens,
leaving the data FSM of the host controller waiting for data. This
affects the data transfer tasks issued after DCMDs with an R1b response
type, resulting in a timeout.

The SW WAR is to set CMD_TIMING to 1 in the DCMD task descriptor. This
bug and the SW WAR are applicable only to Tegra186, not Tegra194.

This patch implements the WAR through NVQUIRK_CQHCI_DCMD_R1B_CMD_TIMING
for Tegra186, and also implements the update_dcmd_desc hook of the
cqhci_host_ops interface to set the CMD_TIMING bit depending on the
NVQUIRK.

Tested-by: Jon Hunter
Signed-off-by: Sowjanya Komatineni

Thanks,
Reviewed-by: Ritesh Harjani

---
 drivers/mmc/host/sdhci-tegra.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/mmc/host/sdhci-tegra.c b/drivers/mmc/host/sdhci-tegra.c
index f1aa0591112a..2f08b6e480df 100644
--- a/drivers/mmc/host/sdhci-tegra.c
+++ b/drivers/mmc/host/sdhci-tegra.c
@@ -106,6 +106,7 @@
 #define NVQUIRK_HAS_PADCALIB			BIT(6)
 #define NVQUIRK_NEEDS_PAD_CONTROL		BIT(7)
 #define NVQUIRK_DIS_CARD_CLK_CONFIG_TAP		BIT(8)
+#define NVQUIRK_CQHCI_DCMD_R1B_CMD_TIMING	BIT(9)
 
 /* SDMMC CQE Base Address for Tegra Host Ver 4.1 and Higher */
 #define SDHCI_TEGRA_CQE_BASE_ADDR			0xF000
@@ -1123,6 +1124,18 @@ static void tegra_sdhci_voltage_switch(struct sdhci_host *host)
 		tegra_host->pad_calib_required = true;
 }
 
+static void sdhci_tegra_update_dcmd_desc(struct mmc_host *mmc,
+					 struct mmc_request *mrq, u64 *data)
+{
+	struct sdhci_pltfm_host *pltfm_host = sdhci_priv(mmc_priv(mmc));
+	struct sdhci_tegra *tegra_host = sdhci_pltfm_priv(pltfm_host);
+	const struct sdhci_tegra_soc_data *soc_data = tegra_host->soc_data;
+
+	if (soc_data->nvquirks & NVQUIRK_CQHCI_DCMD_R1B_CMD_TIMING &&
+	    mrq->cmd->flags & MMC_RSP_R1B)
+		*data |= CQHCI_CMD_TIMING(1);
+}
+
 static void sdhci_tegra_cqe_enable(struct mmc_host *mmc)
 {
 	struct cqhci_host *cq_host = mmc->cqe_private;
@@ -1164,6 +1177,7 @@ static const struct cqhci_host_ops sdhci_tegra_cqhci_ops = {
 	.enable	= sdhci_tegra_cqe_enable,
 	.disable = sdhci_cqe_disable,
 	.dumpregs = sdhci_tegra_dumpregs,
+	.update_dcmd_desc = sdhci_tegra_update_dcmd_desc,
 };
 
 static const struct sdhci_ops tegra_sdhci_ops = {
@@ -1345,7 +1359,8 @@ static const struct sdhci_tegra_soc_data soc_data_tegra186 = {
 		    NVQUIRK_HAS_PADCALIB |
 		    NVQUIRK_DIS_CARD_CLK_CONFIG_TAP |
 		    NVQUIRK_ENABLE_SDR50 |
-		    NVQUIRK_ENABLE_SDR104,
+		    NVQUIRK_ENABLE_SDR104 |
+		    NVQUIRK_CQHCI_DCMD_R1B_CMD_TIMING,
 	.min_tap_delay = 84,
 	.max_tap_delay = 136,
 };
[PATCH] infiniband: i40iw: fix potential NULL pointer dereferences
alloc_ordered_workqueue may fail and return NULL. The fix captures the
failure and handles it properly to avoid potential NULL pointer
dereferences.

Signed-off-by: Kangjie Lu
---
V2: add return value to capture the error code

 drivers/infiniband/hw/i40iw/i40iw.h      |  2 +-
 drivers/infiniband/hw/i40iw/i40iw_cm.c   | 19 +++++++++++++++---
 drivers/infiniband/hw/i40iw/i40iw_main.c |  5 ++++-
 3 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/i40iw/i40iw.h b/drivers/infiniband/hw/i40iw/i40iw.h
index 2f2b4426ded7..8feec35f95a7 100644
--- a/drivers/infiniband/hw/i40iw/i40iw.h
+++ b/drivers/infiniband/hw/i40iw/i40iw.h
@@ -552,7 +552,7 @@ enum i40iw_status_code i40iw_obj_aligned_mem(struct i40iw_device *iwdev,
 void i40iw_request_reset(struct i40iw_device *iwdev);
 void i40iw_destroy_rdma_device(struct i40iw_ib_device *iwibdev);
-void i40iw_setup_cm_core(struct i40iw_device *iwdev);
+int i40iw_setup_cm_core(struct i40iw_device *iwdev);
 void i40iw_cleanup_cm_core(struct i40iw_cm_core *cm_core);
 void i40iw_process_ceq(struct i40iw_device *, struct i40iw_ceq *iwceq);
 void i40iw_process_aeq(struct i40iw_device *);
diff --git a/drivers/infiniband/hw/i40iw/i40iw_cm.c b/drivers/infiniband/hw/i40iw/i40iw_cm.c
index 206cfb0016f8..dda24f44239b 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_cm.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_cm.c
@@ -3237,7 +3237,7 @@ void i40iw_receive_ilq(struct i40iw_sc_vsi *vsi, struct i40iw_puda_buf *rbuf)
  * core
  * @iwdev: iwarp device structure
  */
-void i40iw_setup_cm_core(struct i40iw_device *iwdev)
+int i40iw_setup_cm_core(struct i40iw_device *iwdev)
 {
 	struct i40iw_cm_core *cm_core = &iwdev->cm_core;
@@ -3256,9 +3256,20 @@ void i40iw_setup_cm_core(struct i40iw_device *iwdev)
 
 	cm_core->event_wq = alloc_ordered_workqueue("iwewq",
 						    WQ_MEM_RECLAIM);
+	if (!cm_core->event_wq)
+		goto error;
 
 	cm_core->disconn_wq = alloc_ordered_workqueue("iwdwq",
 						      WQ_MEM_RECLAIM);
+	if (!cm_core->disconn_wq)
+		goto error;
+
+	return 0;
+error:
+	i40iw_cleanup_cm_core(&iwdev->cm_core);
+	i40iw_pr_err("fail to setup CM core");
+
+	return -ENOMEM;
 }
 
 /**
@@ -3278,8 +3289,10 @@ void i40iw_cleanup_cm_core(struct i40iw_cm_core *cm_core)
 		del_timer_sync(&cm_core->tcp_timer);
 	spin_unlock_irqrestore(&cm_core->ht_lock, flags);
 
-	destroy_workqueue(cm_core->event_wq);
-	destroy_workqueue(cm_core->disconn_wq);
+	if (cm_core->event_wq)
+		destroy_workqueue(cm_core->event_wq);
+	if (cm_core->disconn_wq)
+		destroy_workqueue(cm_core->disconn_wq);
 }
 
 /**
diff --git a/drivers/infiniband/hw/i40iw/i40iw_main.c b/drivers/infiniband/hw/i40iw/i40iw_main.c
index 68095f00d08f..10932baee279 100644
--- a/drivers/infiniband/hw/i40iw/i40iw_main.c
+++ b/drivers/infiniband/hw/i40iw/i40iw_main.c
@@ -1641,7 +1641,10 @@ static int i40iw_open(struct i40e_info *ldev, struct i40e_client *client)
 	iwdev = &hdl->device;
 	iwdev->hdl = hdl;
 	dev = &iwdev->sc_dev;
-	i40iw_setup_cm_core(iwdev);
+	if (i40iw_setup_cm_core(iwdev)) {
+		kfree(iwdev->hdl);
+		return -ENOMEM;
+	}
 
 	dev->back_dev = (void *)iwdev;
 	iwdev->ldev = &hdl->ldev;
-- 
2.17.1
Re: [PATCH V3 06/10] mmc: cqhci: allow hosts to update dcmd cmd desc
On 3/14/2019 3:15 AM, Sowjanya Komatineni wrote:

This patch adds an update_dcmd_desc interface to cqhci_host_ops to allow
hosts to update any of the DCMD task descriptor attributes and
parameters.

Tested-by: Jon Hunter
Signed-off-by: Sowjanya Komatineni

Thanks,
Reviewed-by: Ritesh Harjani

---
 drivers/mmc/host/cqhci.c | 2 ++
 drivers/mmc/host/cqhci.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/drivers/mmc/host/cqhci.c b/drivers/mmc/host/cqhci.c
index a8af682a9182..d59cb0a51964 100644
--- a/drivers/mmc/host/cqhci.c
+++ b/drivers/mmc/host/cqhci.c
@@ -537,6 +537,8 @@ static void cqhci_prep_dcmd_desc(struct mmc_host *mmc,
 		 CQHCI_ACT(0x5) |
 		 CQHCI_CMD_INDEX(mrq->cmd->opcode) |
 		 CQHCI_CMD_TIMING(timing) | CQHCI_RESP_TYPE(resp_type));
+	if (cq_host->ops->update_dcmd_desc)
+		cq_host->ops->update_dcmd_desc(mmc, mrq, &data);
 	*task_desc |= data;
 	desc = (u8 *)task_desc;
 	pr_debug("%s: cqhci: dcmd: cmd: %d timing: %d resp: %d\n",
diff --git a/drivers/mmc/host/cqhci.h b/drivers/mmc/host/cqhci.h
index 9e68286a07b4..8c8ec6f01c45 100644
--- a/drivers/mmc/host/cqhci.h
+++ b/drivers/mmc/host/cqhci.h
@@ -210,6 +210,8 @@ struct cqhci_host_ops {
 	u32 (*read_l)(struct cqhci_host *host, int reg);
 	void (*enable)(struct mmc_host *mmc);
 	void (*disable)(struct mmc_host *mmc, bool recovery);
+	void (*update_dcmd_desc)(struct mmc_host *mmc, struct mmc_request *mrq,
+				 u64 *data);
 };
 
 static inline void cqhci_writel(struct cqhci_host *host, u32 val, int reg)
[PATCH v3 7/7] mm: Remove stale comment from page struct
We now use the slab_list list_head instead of the lru list_head. This
comment has become stale.

Remove the stale comment from the page struct slab_list list_head.

Reviewed-by: Roman Gushchin
Acked-by: Christoph Lameter
Signed-off-by: Tobin C. Harding
---
 include/linux/mm_types.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 7eade9132f02..63a34e3d7c29 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -103,7 +103,7 @@ struct page {
 		};
 		struct {	/* slab, slob and slub */
 			union {
-				struct list_head slab_list;	/* uses lru */
+				struct list_head slab_list;
 				struct {	/* Partial pages */
 					struct page *next;
 #ifdef CONFIG_64BIT
-- 
2.21.0
[PATCH v3 4/7] slub: Add comments to endif pre-processor macros
SLUB allocator makes heavy use of ifdef/endif pre-processor macros. The
pairing of these statements is at times hard to follow, e.g. if the pair
are further than a screen apart or if there are nested pairs. We can
reduce cognitive load by adding a comment to the endif statement of the
form:

	#ifdef CONFIG_FOO
	...
	#endif /* CONFIG_FOO */

Add comments to endif pre-processor macros if the ifdef/endif pair is
not immediately apparent.

Reviewed-by: Roman Gushchin
Acked-by: Christoph Lameter
Signed-off-by: Tobin C. Harding
---
 mm/slub.c | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index 1b08fbcb7e61..b282e22885cd 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1951,7 +1951,7 @@ static void *get_any_partial(struct kmem_cache *s, gfp_t flags,
 			}
 		}
 	} while (read_mems_allowed_retry(cpuset_mems_cookie));
-#endif
+#endif	/* CONFIG_NUMA */
 	return NULL;
 }
 
@@ -2249,7 +2249,7 @@ static void unfreeze_partials(struct kmem_cache *s,
 		discard_slab(s, page);
 		stat(s, FREE_SLAB);
 	}
-#endif
+#endif	/* CONFIG_SLUB_CPU_PARTIAL */
 }
 
 /*
@@ -2308,7 +2308,7 @@ static void put_cpu_partial(struct kmem_cache *s, struct page *page, int drain)
 		local_irq_restore(flags);
 	}
 	preempt_enable();
-#endif
+#endif	/* CONFIG_SLUB_CPU_PARTIAL */
 }
 
 static inline void flush_slab(struct kmem_cache *s, struct kmem_cache_cpu *c)
@@ -2813,7 +2813,7 @@ void *kmem_cache_alloc_node_trace(struct kmem_cache *s,
 }
 EXPORT_SYMBOL(kmem_cache_alloc_node_trace);
 #endif
-#endif
+#endif	/* CONFIG_NUMA */
 
 /*
  * Slow path handling. This may still be called frequently since objects
@@ -3845,7 +3845,7 @@ void *__kmalloc_node(size_t size, gfp_t flags, int node)
 	return ret;
 }
 EXPORT_SYMBOL(__kmalloc_node);
-#endif
+#endif	/* CONFIG_NUMA */
 
 #ifdef CONFIG_HARDENED_USERCOPY
 /*
@@ -4063,7 +4063,7 @@ void __kmemcg_cache_deactivate(struct kmem_cache *s)
 	 */
 	slab_deactivate_memcg_cache_rcu_sched(s, kmemcg_cache_deact_after_rcu);
 }
-#endif
+#endif	/* CONFIG_MEMCG */
 
 static int slab_mem_going_offline_callback(void *arg)
 {
@@ -4696,7 +4696,7 @@ static int list_locations(struct kmem_cache *s, char *buf,
 		len += sprintf(buf, "No data\n");
 	return len;
 }
-#endif
+#endif	/* CONFIG_SLUB_DEBUG */
 
 #ifdef SLUB_RESILIENCY_TEST
 static void __init resiliency_test(void)
@@ -4756,7 +4756,7 @@ static void __init resiliency_test(void)
 #ifdef CONFIG_SYSFS
 static void resiliency_test(void) {};
 #endif
-#endif
+#endif	/* SLUB_RESILIENCY_TEST */
 
 #ifdef CONFIG_SYSFS
 enum slab_stat_type {
@@ -5413,7 +5413,7 @@ STAT_ATTR(CPU_PARTIAL_ALLOC, cpu_partial_alloc);
 STAT_ATTR(CPU_PARTIAL_FREE, cpu_partial_free);
 STAT_ATTR(CPU_PARTIAL_NODE, cpu_partial_node);
 STAT_ATTR(CPU_PARTIAL_DRAIN, cpu_partial_drain);
-#endif
+#endif	/* CONFIG_SLUB_STATS */
 
 static struct attribute *slab_attrs[] = {
 	&slab_size_attr.attr,
@@ -5614,7 +5614,7 @@ static void memcg_propagate_slab_attrs(struct kmem_cache *s)
 
 	if (buffer)
 		free_page((unsigned long)buffer);
-#endif
+#endif	/* CONFIG_MEMCG */
 }
 
 static void kmem_cache_release(struct kobject *k)
-- 
2.21.0
[PATCH v3 0/7] mm: Use slab_list list_head instead of lru
Currently the slab allocators (ab)use the struct page 'lru' list_head.
We have a list head for slab allocators to use, 'slab_list'.

During v2 it was noted by Christoph that the SLOB allocator was reaching
into a list_head; this version adds 2 patches to the front of the set to
fix that.

Clean up all three allocators by using the 'slab_list' list_head instead
of overloading the 'lru' list_head.

Patch 1 - Adds a function to rotate a list to a specified entry.

Patch 2 - Removes the code that reaches into list_head and instead uses
          the list_head API, including the newly defined function.

Patches 3-7 are unchanged since v2 (modulo renumbering):

Patch 3 (v2: patch 4) - Changes the SLOB allocator to use slab_list
                        instead of lru.

Patch 4 (v2: patch 1) - Makes no code changes, adds comments to #endif
                        statements.

Patch 5 (v2: patch 2) - Use slab_list instead of lru for SLUB allocator.

Patch 6 (v2: patch 3) - Use slab_list instead of lru for SLAB allocator.

Patch 7 (v2: patch 5) - Removes the now stale comment in the page struct
                        definition.

During v2 development, patches were checked to see if the object file
before and after was identical. Clearly this will no longer be possible
for mm/slob.o; however, this work is still of use to validate the change
from lru -> slab_list.

Patch 1 was tested with a module (creates and populates a list then
calls list_rotate_to_front() and verifies the new order):

	https://github.com/tcharding/ktest/tree/master/list_head

Patch 2 was tested with another module that does some basic slab
allocation and freeing to a newly created slab cache:

	https://github.com/tcharding/ktest/tree/master/slab

Tested on a kernel with this in the config:

	CONFIG_SLOB=y
	CONFIG_SLAB_MERGE_DEFAULT=y

Changes since v2:
 - Add list_rotate_to_front().
 - Fix slob to use list_head API.
 - Re-order patches to put the list.h changes up front.
 - Add acks from Christoph.

Changes since v1:
 - Verify object files are the same before and after the patch set is
   applied (suggested by Matthew).
 - Add extra explanation to the commit logs explaining why these changes
   are safe to make (suggested by Roman).
 - Remove stale comment (thanks Willy).

thanks,
Tobin.

Tobin C. Harding (7):
  list: Add function list_rotate_to_front()
  slob: Respect list_head abstraction layer
  slob: Use slab_list instead of lru
  slub: Add comments to endif pre-processor macros
  slub: Use slab_list instead of lru
  slab: Use slab_list instead of lru
  mm: Remove stale comment from page struct

 include/linux/list.h     | 18
 include/linux/mm_types.h |  2 +-
 mm/slab.c                | 49
 mm/slob.c                | 32 +
 mm/slub.c                | 60
 5 files changed, 94 insertions(+), 67 deletions(-)
-- 
2.21.0
[PATCH v3 1/7] list: Add function list_rotate_to_front()
Currently if we wish to rotate a list until a specific item is at the
front of the list we can call list_move_tail(head, list). Note that the
arguments are the reverse way to the usual use of list_move_tail(list,
head). This is a hack: it depends on the developer knowing how the
list_head operates internally, which violates the layer of abstraction
offered by the list_head. Also, it is not intuitive, so the next
developer to come along must study list.h in order to fully understand
what is meant by the call; while this is 'good for' the developer, it
makes reading the code harder. We should have a function, appropriately
named, that does this if there are users for it in tree.

By grep'ing the tree for list_move_tail() and list_tail() and attempting
to guess the argument order from the names, it seems there is only one
place currently in the tree that does this - the slob allocator.

Add function list_rotate_to_front() to rotate a list until the specified
item is at the front of the list.

Signed-off-by: Tobin C. Harding
---
 include/linux/list.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/include/linux/list.h b/include/linux/list.h
index 79626b5ab36c..8ead813e7f1c 100644
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -270,6 +270,24 @@ static inline void list_rotate_left(struct list_head *head)
 	}
 }
 
+/**
+ * list_rotate_to_front() - Rotate list to specific item.
+ * @list: The desired new front of the list.
+ * @head: The head of the list.
+ *
+ * Rotates list so that @list becomes the new front of the list.
+ */
+static inline void list_rotate_to_front(struct list_head *list,
+					struct list_head *head)
+{
+	/*
+	 * Deletes the list head from the list denoted by @head and
+	 * places it as the tail of @list, this effectively rotates the
+	 * list so that @list is at the front.
+	 */
+	list_move_tail(head, list);
+}
+
 /**
  * list_is_singular - tests whether a list has just one entry.
  * @head: the list to test.
-- 
2.21.0
[PATCH v3 5/7] slub: Use slab_list instead of lru
Currently we use the page->lru list for maintaining lists of slabs. We
have a list in the page structure (slab_list) that can be used for this
purpose. Doing so makes the code cleaner since we are not overloading
the lru list.

Use the slab_list instead of the lru list for maintaining lists of
slabs.

Reviewed-by: Roman Gushchin
Acked-by: Christoph Lameter
Signed-off-by: Tobin C. Harding
---
 mm/slub.c | 40 ++++++++++++++++++++--------------------
 1 file changed, 20 insertions(+), 20 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index b282e22885cd..d692b5e0163d 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -1023,7 +1023,7 @@ static void add_full(struct kmem_cache *s,
 		return;
 
 	lockdep_assert_held(&n->list_lock);
-	list_add(&page->lru, &n->full);
+	list_add(&page->slab_list, &n->full);
 }
 
 static void remove_full(struct kmem_cache *s, struct kmem_cache_node *n, struct page *page)
@@ -1032,7 +1032,7 @@ static void remove_full(struct kmem_cache *s, struct kmem_cache_node *n, struct
 		return;
 
 	lockdep_assert_held(&n->list_lock);
-	list_del(&page->lru);
+	list_del(&page->slab_list);
 }
 
 /* Tracking of the number of slabs for debugging purposes */
@@ -1773,9 +1773,9 @@ __add_partial(struct kmem_cache_node *n, struct page *page, int tail)
 {
 	n->nr_partial++;
 	if (tail == DEACTIVATE_TO_TAIL)
-		list_add_tail(&page->lru, &n->partial);
+		list_add_tail(&page->slab_list, &n->partial);
 	else
-		list_add(&page->lru, &n->partial);
+		list_add(&page->slab_list, &n->partial);
 }
 
 static inline void add_partial(struct kmem_cache_node *n,
@@ -1789,7 +1789,7 @@ static inline void
 remove_partial(struct kmem_cache_node *n, struct page *page)
 {
 	lockdep_assert_held(&n->list_lock);
-	list_del(&page->lru);
+	list_del(&page->slab_list);
 	n->nr_partial--;
 }
 
@@ -1863,7 +1863,7 @@ static void *get_partial_node(struct kmem_cache *s, struct kmem_cache_node *n,
 		return NULL;
 
 	spin_lock(&n->list_lock);
-	list_for_each_entry_safe(page, page2, &n->partial, lru) {
+	list_for_each_entry_safe(page, page2, &n->partial, slab_list) {
 		void *t;
 
 		if (!pfmemalloc_match(page, flags))
@@ -2407,7 +2407,7 @@ static unsigned long count_partial(struct kmem_cache_node *n,
 	struct page *page;
 
 	spin_lock_irqsave(&n->list_lock, flags);
-	list_for_each_entry(page, &n->partial, lru)
+	list_for_each_entry(page, &n->partial, slab_list)
 		x += get_count(page);
 	spin_unlock_irqrestore(&n->list_lock, flags);
 	return x;
@@ -3702,10 +3702,10 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
 
 	BUG_ON(irqs_disabled());
 	spin_lock_irq(&n->list_lock);
-	list_for_each_entry_safe(page, h, &n->partial, lru) {
+	list_for_each_entry_safe(page, h, &n->partial, slab_list) {
 		if (!page->inuse) {
 			remove_partial(n, page);
-			list_add(&page->lru, &discard);
+			list_add(&page->slab_list, &discard);
 		} else {
 			list_slab_objects(s, page,
 			"Objects remaining in %s on __kmem_cache_shutdown()");
@@ -3713,7 +3713,7 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n)
 	}
 	spin_unlock_irq(&n->list_lock);
 
-	list_for_each_entry_safe(page, h, &discard, lru)
+	list_for_each_entry_safe(page, h, &discard, slab_list)
 		discard_slab(s, page);
 }
 
@@ -3993,7 +3993,7 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 		 * Note that concurrent frees may occur while we hold the
 		 * list_lock. page->inuse here is the upper limit.
 		 */
-		list_for_each_entry_safe(page, t, &n->partial, lru) {
+		list_for_each_entry_safe(page, t, &n->partial, slab_list) {
 			int free = page->objects - page->inuse;
 
 			/* Do not reread page->inuse */
@@ -4003,10 +4003,10 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 			BUG_ON(free <= 0);
 
 			if (free == page->objects) {
-				list_move(&page->lru, &discard);
+				list_move(&page->slab_list, &discard);
 				n->nr_partial--;
 			} else if (free <= SHRINK_PROMOTE_MAX)
-				list_move(&page->lru, promote + free - 1);
+				list_move(&page->slab_list, promote + free - 1);
 		}
 
 		/*
@@ -4019,7 +4019,7 @@ int __kmem_cache_shrink(struct kmem_cache *s)
 		spin_unlock_irqrestore(&n->list_lock, flags);
 
 		/* Release empty slabs */
-		list_for_each_entry_safe(page, t, &discard, lru)
+		list_for_each_entry_safe(page, t, &discard, slab_list)
 			discard_slab(s, page);
 
 		if
[PATCH v3 6/7] slab: Use slab_list instead of lru
Currently we use the page->lru list for maintaining lists of slabs. We
have a list in the page structure (slab_list) that can be used for this
purpose. Doing so makes the code cleaner since we are not overloading
the lru list.

Use the slab_list instead of the lru list for maintaining lists of
slabs.

Reviewed-by: Roman Gushchin
Signed-off-by: Tobin C. Harding
---
 mm/slab.c | 49 +++++++++++++++++++++++++------------------------
 1 file changed, 25 insertions(+), 24 deletions(-)

diff --git a/mm/slab.c b/mm/slab.c
index 28652e4218e0..09cc64ef9613 100644
--- a/mm/slab.c
+++ b/mm/slab.c
@@ -1710,8 +1710,8 @@ static void slabs_destroy(struct kmem_cache *cachep, struct list_head *list)
 {
 	struct page *page, *n;
 
-	list_for_each_entry_safe(page, n, list, lru) {
-		list_del(&page->lru);
+	list_for_each_entry_safe(page, n, list, slab_list) {
+		list_del(&page->slab_list);
 		slab_destroy(cachep, page);
 	}
 }
@@ -2265,8 +2265,8 @@ static int drain_freelist(struct kmem_cache *cache,
 		goto out;
 	}
 
-	page = list_entry(p, struct page, lru);
-	list_del(&page->lru);
+	page = list_entry(p, struct page, slab_list);
+	list_del(&page->slab_list);
 	n->free_slabs--;
 	n->total_slabs--;
 	/*
@@ -2726,13 +2726,13 @@ static void cache_grow_end(struct kmem_cache *cachep, struct page *page)
 	if (!page)
 		return;
 
-	INIT_LIST_HEAD(&page->lru);
+	INIT_LIST_HEAD(&page->slab_list);
 	n = get_node(cachep, page_to_nid(page));
 
 	spin_lock(&n->list_lock);
 	n->total_slabs++;
 	if (!page->active) {
-		list_add_tail(&page->lru, &(n->slabs_free));
+		list_add_tail(&page->slab_list, &n->slabs_free);
 		n->free_slabs++;
 	} else
 		fixup_slab_list(cachep, n, page, &list);
@@ -2841,9 +2841,9 @@ static inline void fixup_slab_list(struct kmem_cache *cachep,
 				   void **list)
 {
 	/* move slabp to correct slabp list: */
-	list_del(&page->lru);
+	list_del(&page->slab_list);
 	if (page->active == cachep->num) {
-		list_add(&page->lru, &n->slabs_full);
+		list_add(&page->slab_list, &n->slabs_full);
 		if (OBJFREELIST_SLAB(cachep)) {
 #if DEBUG
 			/* Poisoning will be done without holding the lock */
@@ -2857,7 +2857,7 @@ static inline void fixup_slab_list(struct kmem_cache *cachep,
 			page->freelist = NULL;
 		}
 	} else
-		list_add(&page->lru, &n->slabs_partial);
+		list_add(&page->slab_list, &n->slabs_partial);
 }
 
 /* Try to find non-pfmemalloc slab if needed */
@@ -2880,20 +2880,20 @@ static noinline struct page *get_valid_first_slab(struct kmem_cache_node *n,
 	}
 
 	/* Move pfmemalloc slab to the end of list to speed up next search */
-	list_del(&page->lru);
+	list_del(&page->slab_list);
 	if (!page->active) {
-		list_add_tail(&page->lru, &n->slabs_free);
+		list_add_tail(&page->slab_list, &n->slabs_free);
 		n->free_slabs++;
 	} else
-		list_add_tail(&page->lru, &n->slabs_partial);
+		list_add_tail(&page->slab_list, &n->slabs_partial);
 
-	list_for_each_entry(page, &n->slabs_partial, lru) {
+	list_for_each_entry(page, &n->slabs_partial, slab_list) {
 		if (!PageSlabPfmemalloc(page))
 			return page;
 	}
 
 	n->free_touched = 1;
-	list_for_each_entry(page, &n->slabs_free, lru) {
+	list_for_each_entry(page, &n->slabs_free, slab_list) {
 		if (!PageSlabPfmemalloc(page)) {
 			n->free_slabs--;
 			return page;
@@ -2908,11 +2908,12 @@ static struct page *get_first_slab(struct kmem_cache_node *n, bool pfmemalloc)
 	struct page *page;
 
 	assert_spin_locked(&n->list_lock);
-	page = list_first_entry_or_null(&n->slabs_partial, struct page, lru);
+	page = list_first_entry_or_null(&n->slabs_partial, struct page,
+					slab_list);
 	if (!page) {
 		n->free_touched = 1;
 		page = list_first_entry_or_null(&n->slabs_free, struct page,
-						lru);
+						slab_list);
 		if (page)
 			n->free_slabs--;
 	}
@@ -3413,29 +3414,29 @@ static void free_block(struct kmem_cache *cachep, void **objpp,
 		objp = objpp[i];
 
 		page = virt_to_head_page(objp);
-		list_del(&page->lru);
+		list_del(&page->slab_list);
 		check_spinlock_acquired_node(cachep, node);
 		slab_put_obj(cachep, page, objp);
 		STATS_DEC_ACTIVE(cachep);
 
 		/* fixup slab chains */
 		if (page->active == 0) {
-
[PATCH v3 3/7] slob: Use slab_list instead of lru
Currently we use the page->lru list for maintaining lists of slabs. We
have a list_head in the page structure (slab_list) that can be used for
this purpose. Doing so makes the code cleaner since we are not
overloading the lru list.

The slab_list is part of a union within the page struct (included here
stripped down):

	union {
		struct {	/* Page cache and anonymous pages */
			struct list_head lru;
			...
		};
		struct {
			dma_addr_t dma_addr;
		};
		struct {	/* slab, slob and slub */
			union {
				struct list_head slab_list;
				struct {	/* Partial pages */
					struct page *next;
					int pages;	/* Nr of pages left */
					int pobjects;	/* Approximate count */
				};
			};
		...

Here we see that slab_list and lru are the same bits. We can verify that
this change is safe to do by examining the object file produced from
slob.c before and after this patch is applied.

Steps taken to verify:

	1. checkout current tip of Linus' tree

	   commit a667cb7a94d4 ("Merge branch 'akpm' (patches from Andrew)")

	2. configure and build (select SLOB allocator)

	   CONFIG_SLOB=y
	   CONFIG_SLAB_MERGE_DEFAULT=y

	3. disassemble object file `objdump -dr mm/slob.o > before.s`
	4. apply patch
	5. build
	6. disassemble object file `objdump -dr mm/slob.o > after.s`
	7. diff before.s after.s

Use slab_list list_head instead of the lru list_head for maintaining
lists of slabs.

Reviewed-by: Roman Gushchin
Signed-off-by: Tobin C. Harding
---
 mm/slob.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/slob.c b/mm/slob.c
index 39ad9217ffea..94486c32e0ff 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -112,13 +112,13 @@ static inline int slob_page_free(struct page *sp)
 
 static void set_slob_page_free(struct page *sp, struct list_head *list)
 {
-	list_add(&sp->lru, list);
+	list_add(&sp->slab_list, list);
 	__SetPageSlobFree(sp);
 }
 
 static inline void clear_slob_page_free(struct page *sp)
 {
-	list_del(&sp->lru);
+	list_del(&sp->slab_list);
 	__ClearPageSlobFree(sp);
 }
 
@@ -282,7 +282,7 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
 	spin_lock_irqsave(&slob_lock, flags);
 	/* Iterate through each partially free page, try to find room */
-	list_for_each_entry(sp, slob_list, lru) {
+	list_for_each_entry(sp, slob_list, slab_list) {
 #ifdef CONFIG_NUMA
 		/*
 		 * If there's a node specification, search for a partial
@@ -331,7 +331,7 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
 	spin_lock_irqsave(&slob_lock, flags);
 	sp->units = SLOB_UNITS(PAGE_SIZE);
 	sp->freelist = b;
-	INIT_LIST_HEAD(&sp->lru);
+	INIT_LIST_HEAD(&sp->slab_list);
 	set_slob(b, SLOB_UNITS(PAGE_SIZE), b + SLOB_UNITS(PAGE_SIZE));
 	set_slob_page_free(sp, slob_list);
 	b = slob_page_alloc(sp, size, align);
-- 
2.21.0
[PATCH v3 2/7] slob: Respect list_head abstraction layer
Currently we reach inside the list_head. This is a violation of the
layer of abstraction provided by the list_head. It makes the code
fragile. More importantly it makes the code wicked hard to understand.

The code logic is based on the page in which an allocation was made; we
want to modify the slob_list we are working on to have this page at the
front. We already have a function to check if an entry is at the front
of the list. Recently a function was added to list.h to do the list
rotation. We can use these two functions to reduce line count, reduce
code fragility, and reduce cognitive load required to read the code.

Use list_head functions to interact with lists, thereby maintaining the
abstraction provided by the list_head structure.

Signed-off-by: Tobin C. Harding
---
I verified the comment pointing to Knuth; the page number may be out of
date, but with this comment I was able to find the text that discusses
this. Left the comment as is (after fixing style).

 mm/slob.c | 24 ++++++++++++++----------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/mm/slob.c b/mm/slob.c
index 307c2c9feb44..39ad9217ffea 100644
--- a/mm/slob.c
+++ b/mm/slob.c
@@ -268,8 +268,7 @@ static void *slob_page_alloc(struct page *sp, size_t size, int align)
  */
 static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
 {
-	struct page *sp;
-	struct list_head *prev;
+	struct page *sp, *prev, *next;
 	struct list_head *slob_list;
 	slob_t *b = NULL;
 	unsigned long flags;
@@ -296,18 +295,27 @@ static void *slob_alloc(size_t size, gfp_t gfp, int align, int node)
 		if (sp->units < SLOB_UNITS(size))
 			continue;
 
+		/*
+		 * Cache previous entry because slob_page_alloc() may
+		 * remove sp from slob_list.
+		 */
+		prev = list_prev_entry(sp, lru);
+
 		/* Attempt to alloc */
-		prev = sp->lru.prev;
 		b = slob_page_alloc(sp, size, align);
 		if (!b)
 			continue;
 
-		/* Improve fragment distribution and reduce our average
+		next = list_next_entry(prev, lru); /* This may or may not be sp */
+
+		/*
+		 * Improve fragment distribution and reduce our average
 		 * search time by starting our next search here. (see
-		 * Knuth vol 1, sec 2.5, pg 449) */
-		if (prev != slob_list->prev &&
-		    slob_list->next != prev->next)
-			list_move_tail(slob_list, prev->next);
+		 * Knuth vol 1, sec 2.5, pg 449)
+		 */
+		if (!list_is_first(&next->lru, slob_list))
+			list_rotate_to_front(&next->lru, slob_list);
+
 		break;
 	}
 	spin_unlock_irqrestore(&slob_lock, flags);
-- 
2.21.0
Re: [RFC][PATCH 00/16] sched: Core scheduling
On Thu, Mar 14, 2019 at 8:35 AM Tim Chen wrote:
> >>
> >> One more NULL pointer dereference:
> >>
> >> Mar 12 02:24:46 aubrey-ivb kernel: [ 201.916741] core sched enabled
> >> [ 201.950203] BUG: unable to handle kernel NULL pointer dereference at 0008
> >> [ 201.950254] [ cut here ]
> >> [ 201.959045] #PF error: [normal kernel read fault]
> >> [ 201.964272] !se->on_rq
> >> [ 201.964287] WARNING: CPU: 22 PID: 2965 at kernel/sched/fair.c:6849 set_next_buddy+0x52/0x70
>
> Shouldn't the for_each_sched_entity(se) skip the code block for the !se
> case and have avoided a null pointer access of se?
>
> Since
>	#define for_each_sched_entity(se) \
>		for (; se; se = se->parent)
>
> Scratching my head a bit here on how your changes would have made
> a difference.

This NULL pointer dereference is not reproducible, which made me think
the change worked...

> In your original log, I wonder if the !se->on_rq warning on CPU 22 is
> mixed with the actual Oops?
> Saw also in your original log rb_insert_color. Wonder if that
> was actually the source of the Oops?

No chance to figure this out; I only saw this once, while the lockup
occurs more frequently.

Thanks,
-Aubrey
[PATCH v3] hid: logitech: check the return value of create_singlethread_workqueue
create_singlethread_workqueue may fail and return NULL. The fix checks if it is NULL to avoid a NULL pointer dereference. Also, the fix moves the call of create_singlethread_workqueue earlier to avoid resource-release issues.

--
V3: do not introduce memory leaks.

Signed-off-by: Kangjie Lu
---
 drivers/hid/hid-logitech-hidpp.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/hid/hid-logitech-hidpp.c b/drivers/hid/hid-logitech-hidpp.c
index 15ed6177a7a3..0a243247b231 100644
--- a/drivers/hid/hid-logitech-hidpp.c
+++ b/drivers/hid/hid-logitech-hidpp.c
@@ -2111,6 +2111,13 @@ static int hidpp_ff_init(struct hidpp_device *hidpp, u8 feature_index)
 		kfree(data);
 		return -ENOMEM;
 	}
+	data->wq = create_singlethread_workqueue("hidpp-ff-sendqueue");
+	if (!data->wq) {
+		kfree(data->effect_ids);
+		kfree(data);
+		return -ENOMEM;
+	}
+
 	data->hidpp = hidpp;
 	data->feature_index = feature_index;
 	data->version = version;
@@ -2155,7 +2162,6 @@ static int hidpp_ff_init(struct hidpp_device *hidpp, u8 feature_index)
 	/* ignore boost value at response.fap.params[2] */
 
 	/* init the hardware command queue */
-	data->wq = create_singlethread_workqueue("hidpp-ff-sendqueue");
 	atomic_set(&data->workqueue_size, 0);
 
 	/* initialize with zero autocenter to get wheel in usable state */
--
2.17.1
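The ordering argument of the fix (take every allocation that can fail up front, so each failure path frees exactly what has been allocated so far and never has to undo half-initialized device state) can be sketched outside the driver. The types and the workqueue stand-in below are hypothetical, not the hidpp code:

```c
#include <assert.h>
#include <stdlib.h>

struct ff_data { int *effect_ids; void *wq; };

/* Stand-in for create_singlethread_workqueue(); 'fail' forces the NULL
 * return path so the unwind can be exercised. */
static void *fake_create_workqueue(int fail)
{
	return fail ? NULL : malloc(1);
}

/* Allocate everything first; each failure branch releases precisely the
 * allocations made before it, mirroring the moved call in the patch. */
struct ff_data *ff_init(int wq_fails)
{
	struct ff_data *data = malloc(sizeof(*data));

	if (!data)
		return NULL;
	data->effect_ids = malloc(16 * sizeof(int));
	if (!data->effect_ids) {
		free(data);
		return NULL;
	}
	data->wq = fake_create_workqueue(wq_fails);
	if (!data->wq) {
		free(data->effect_ids);
		free(data);
		return NULL;
	}
	return data;
}
```

Had the workqueue call stayed where it was, a NULL return late in init would have required tearing down device registration as well, which is the "resource-release issue" the changelog alludes to.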
Re: [PATCH v4] lib/string.c: implement a basic bcmp
On Thu, Mar 14, 2019 at 6:13 AM Nick Desaulniers wrote: > > A recent optimization in Clang (r355672) lowers comparisons of the > return value of memcmp against zero to comparisons of the return value > of bcmp against zero. This helps some platforms that implement bcmp > more efficiently than memcmp. glibc simply aliases bcmp to memcmp, but > an optimized implementation is in the works. > > This results in linkage failures for all targets with Clang due to the > undefined symbol. For now, just implement bcmp as a tailcall to memcmp > to unbreak the build. This routine can be further optimized in the > future. > > Other ideas discussed: > * A weak alias was discussed, but breaks for architectures that define > their own implementations of memcmp since aliases to declarations are > not permitted (only definitions). Arch-specific memcmp implementations > typically declare memcmp in C headers, but implement them in assembly. > * -ffreestanding also is used sporadically throughout the kernel. > * -fno-builtin-bcmp doesn't work when doing LTO. > > Link: https://bugs.llvm.org/show_bug.cgi?id=41035 > Link: https://code.woboq.org/userspace/glibc/string/memcmp.c.html#bcmp > Link: > https://github.com/llvm/llvm-project/commit/8e16d73346f8091461319a7dfc4ddd18eedcff13 > Link: https://github.com/ClangBuiltLinux/linux/issues/416 > Cc: sta...@vger.kernel.org > Reported-by: Nathan Chancellor > Reported-by: Adhemerval Zanella > Suggested-by: Arnd Bergmann > Suggested-by: James Y Knight > Suggested-by: Masahiro Yamada > Suggested-by: Nathan Chancellor > Suggested-by: Rasmus Villemoes > Signed-off-by: Nick Desaulniers > Acked-by: Steven Rostedt (VMware) Reviewed-by: Masahiro Yamada > --- > Changes V3 -> V4: > * Include the entirety of Rasmus' suggestion, as per Steven. > * Change formal parameter identifiers to match the comment. > Changes V2 -> V3: > * Adjust comment as per Steven to Rasmus' suggestion. > * Pick up Steven's Ack.
> Changes V1 -> V2: > * Add declaration to include/linux/string.h. > * Reword comment above bcmp. > > include/linux/string.h | 3 +++ > lib/string.c | 20 > 2 files changed, 23 insertions(+) > > diff --git a/include/linux/string.h b/include/linux/string.h > index 7927b875f80c..6ab0a6fa512e 100644 > --- a/include/linux/string.h > +++ b/include/linux/string.h > @@ -150,6 +150,9 @@ extern void * memscan(void *,int,__kernel_size_t); > #ifndef __HAVE_ARCH_MEMCMP > extern int memcmp(const void *,const void *,__kernel_size_t); > #endif > +#ifndef __HAVE_ARCH_BCMP > +extern int bcmp(const void *,const void *,__kernel_size_t); > +#endif > #ifndef __HAVE_ARCH_MEMCHR > extern void * memchr(const void *,int,__kernel_size_t); > #endif > diff --git a/lib/string.c b/lib/string.c > index 38e4ca08e757..3ab861c1a857 100644 > --- a/lib/string.c > +++ b/lib/string.c > @@ -866,6 +866,26 @@ __visible int memcmp(const void *cs, const void *ct, > size_t count) > EXPORT_SYMBOL(memcmp); > #endif > > +#ifndef __HAVE_ARCH_BCMP > +/** > + * bcmp - returns 0 if and only if the buffers have identical contents. > + * @a: pointer to first buffer. > + * @b: pointer to second buffer. > + * @len: size of buffers. > + * > + * The sign or magnitude of a non-zero return value has no particular > + * meaning, and architectures may implement their own more efficient bcmp(). > So > + * while this particular implementation is a simple (tail) call to memcmp, do > + * not rely on anything but whether the return value is zero or non-zero. > + */ > +#undef bcmp > +int bcmp(const void *a, const void *b, size_t len) > +{ > + return memcmp(a, b, len); > +} > +EXPORT_SYMBOL(bcmp); > +#endif > + > #ifndef __HAVE_ARCH_MEMSCAN > /** > * memscan - Find a character in an area of memory. > -- > 2.21.0.360.g471c308f928-goog > -- Best Regards Masahiro Yamada
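The whole patch boils down to one forwarding function. A userspace sketch (named my_bcmp here to avoid colliding with a libc-provided bcmp) shows why forwarding satisfies the contract: bcmp only promises zero versus non-zero, which memcmp's stronger ordered result trivially provides.

```c
#include <string.h>

/* bcmp's contract: 0 iff the first len bytes of a and b are identical.
 * The sign/magnitude of a non-zero result carries no meaning, so a
 * (tail) call to memcmp is a valid, if unoptimized, implementation. */
int my_bcmp(const void *a, const void *b, size_t len)
{
	return memcmp(a, b, len);
}
```

Callers that were relying on memcmp's ordering would break if Clang's lowering fired on them, but the lowering only applies to comparisons against zero, exactly the cases where the two functions agree.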
[RFC PATCH v2 6/7] tracing: Use tracing error_log with probe events
Use tracing error_log with probe events for logging errors more precisely. This also makes all parse errors return -EINVAL (except for -ENOMEM), because the user can now see a better error message in the error_log file.

Signed-off-by: Masami Hiramatsu
---
 Changes in v2:
  - Update error message according to Steve's comment
  - Add uprobe reference counter errors
  - Adjust some error positions
---
 kernel/trace/trace_kprobe.c |  77 +++-
 kernel/trace/trace_probe.c  | 274 +++
 kernel/trace/trace_probe.h  |  77
 kernel/trace/trace_uprobe.c |  44 +--
 4 files changed, 348 insertions(+), 124 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 56324c231688..b5945e1c3359 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -441,13 +441,8 @@ static int __register_trace_kprobe(struct trace_kprobe *tk)
 	else
 		ret = register_kprobe(&tk->rp.kp);
 
-	if (ret == 0) {
+	if (ret == 0)
 		tk->tp.flags |= TP_FLAG_REGISTERED;
-	} else if (ret == -EILSEQ) {
-		pr_warn("Probing address(0x%p) is not an instruction boundary.\n",
-			tk->rp.kp.addr);
-		ret = -EINVAL;
-	}
 	return ret;
 }
 
@@ -591,7 +586,7 @@ static int trace_kprobe_create(int argc, const char *argv[])
 	 * Type of args:
 	 *  FETCHARG:TYPE : use TYPE instead of unsigned long.
 	 */
-	struct trace_kprobe *tk;
+	struct trace_kprobe *tk = NULL;
 	int i, len, ret = 0;
 	bool is_return = false;
 	char *symbol = NULL, *tmp = NULL;
@@ -615,44 +610,50 @@ static int trace_kprobe_create(int argc, const char *argv[])
 	if (argc < 2)
 		return -ECANCELED;
 
+	trace_probe_log_init("trace_kprobe", argc, argv);
+
 	event = strchr(&argv[0][1], ':');
 	if (event)
 		event++;
 
 	if (isdigit(argv[0][1])) {
 		if (!is_return) {
-			pr_info("Maxactive is not for kprobe");
-			return -EINVAL;
+			trace_probe_log_err(1, MAXACT_NO_KPROBE);
+			goto parse_error;
 		}
 		if (event)
 			len = event - &argv[0][1] - 1;
 		else
 			len = strlen(&argv[0][1]);
-		if (len > MAX_EVENT_NAME_LEN - 1)
-			return -E2BIG;
+		if (len > MAX_EVENT_NAME_LEN - 1) {
+			trace_probe_log_err(1, BAD_MAXACT);
+			goto parse_error;
+		}
 		memcpy(buf, &argv[0][1], len);
 		buf[len] = '\0';
 		ret = kstrtouint(buf, 0, &maxactive);
 		if (ret || !maxactive) {
-			pr_info("Invalid maxactive number\n");
-			return ret;
+			trace_probe_log_err(1, BAD_MAXACT);
+			goto parse_error;
 		}
 		/* kretprobes instances are iterated over via a list. The
 		 * maximum should stay reasonable.
 		 */
 		if (maxactive > KRETPROBE_MAXACTIVE_MAX) {
-			pr_info("Maxactive is too big (%d > %d).\n",
-				maxactive, KRETPROBE_MAXACTIVE_MAX);
-			return -E2BIG;
+			trace_probe_log_err(1, MAXACT_TOO_BIG);
+			goto parse_error;
 		}
 	}
 
 	/* try to parse an address. if that fails, try to read the
 	 * input as a symbol.
 	 */
 	if (kstrtoul(argv[1], 0, (unsigned long *)&addr)) {
+		trace_probe_log_set_index(1);
 		/* Check whether uprobe event specified */
-		if (strchr(argv[1], '/') && strchr(argv[1], ':'))
-			return -ECANCELED;
+		if (strchr(argv[1], '/') && strchr(argv[1], ':')) {
+			ret = -ECANCELED;
+			goto error;
+		}
 		/* a symbol specified */
 		symbol = kstrdup(argv[1], GFP_KERNEL);
 		if (!symbol)
@@ -660,23 +661,23 @@ static int trace_kprobe_create(int argc, const char *argv[])
 		/* TODO: support .init module functions */
 		ret = traceprobe_split_symbol_offset(symbol, &offset);
 		if (ret || offset < 0 || offset > UINT_MAX) {
-			pr_info("Failed to parse either an address or a symbol.\n");
-			goto out;
+			trace_probe_log_err(0, BAD_PROBE_ADDR);
+			goto parse_error;
 		}
 		if (kprobe_on_func_entry(NULL, symbol, offset))
 			flags |= TPARG_FL_FENTRY;
 		if (offset && is_return && !(flags & TPARG_FL_FENTRY)) {
-			pr_info("Given offset is not valid for return
[RFC PATCH v2 7/7] selftests/ftrace: Add error_log testcase for probe errors
Add error_log testcase for error logs on probe events. This tests most of error cases and checks the error position is correct. Signed-off-by: Masami Hiramatsu --- Changes in v2: - Specify error position in command string by "^" - Clear error_log right before writing command - Add uprobe syntax error testcase --- .../ftrace/test.d/kprobe/kprobe_syntax_errors.tc | 93 .../ftrace/test.d/kprobe/uprobe_syntax_errors.tc | 31 +++ 2 files changed, 124 insertions(+) create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc new file mode 100644 index ..281665b1348c --- /dev/null +++ b/tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc @@ -0,0 +1,93 @@ +#!/bin/sh +# SPDX-License-Identifier: GPL-2.0 +# description: Kprobe event parser error log check + +[ -f kprobe_events ] || exit_unsupported # this is configurable + +[ -f error_log ] || exit_unsupported + +check_error() { # command-with-error-pos-by-^ +pos=$(echo -n "${1%^*}" | wc -c) # error position +command=$(echo "$1" | tr -d ^) +echo "Test command: $command" +echo > error_log +(! 
echo "$command" > kprobe_events ) >& /dev/null +grep "trace_kprobe: error:" -A 3 error_log +N=$(tail -n 1 error_log | wc -c) +# " Command: " and "^\n" => 13 +test $(expr 13 + $pos) -eq $N +} + +if grep -q 'r\[maxactive\]' README; then +check_error 'p^100 vfs_read' # MAXACT_NO_KPROBE +check_error 'r^1a111 vfs_read' # BAD_MAXACT +check_error 'r^10 vfs_read'# MAXACT_TOO_BIG +fi + +check_error 'p ^non_exist_func'# BAD_PROBE_ADDR (enoent) +check_error 'p ^hoge-fuga' # BAD_PROBE_ADDR (bad syntax) +check_error 'p ^hoge+1000-1000'# BAD_PROBE_ADDR (bad syntax) +check_error 'r ^vfs_read+10' # BAD_RETPROBE +check_error 'p:^/bar vfs_read' # NO_GROUP_NAME +check_error 'p:^12345678901234567890123456789012345678901234567890123456789012345/bar vfs_read'# GROUP_TOO_LONG + +check_error 'p:^foo.1/bar vfs_read'# BAD_GROUP_NAME +check_error 'p:foo/^ vfs_read' # NO_EVENT_NAME +check_error 'p:foo/^12345678901234567890123456789012345678901234567890123456789012345 vfs_read'# EVENT_TOO_LONG +check_error 'p:foo/^bar.1 vfs_read'# BAD_EVENT_NAME + +check_error 'p vfs_read ^$retval' # RETVAL_ON_PROBE +check_error 'p vfs_read ^$stack1' # BAD_STACK_NUM + +if grep -q '$arg' README; then +check_error 'p vfs_read ^$arg1'# BAD_ARG_NUM +fi + +check_error 'p vfs_read ^$none_var'# BAD_VAR + +check_error 'p vfs_read ^%none_reg'# BAD_REG_NAME +check_error 'p vfs_read ^@12345678abcde' # BAD_MEM_ADDR +check_error 'p vfs_read ^@+10' # FILE_ON_KPROBE + +check_error 'p vfs_read ^+0@0)'# DEREF_NEED_BRACE +check_error 'p vfs_read ^+0ab1(@0)'# BAD_DEREF_OFFS +check_error 'p vfs_read +0(+0(@0^)'# DEREF_OPEN_BRACE + +if grep -A1 "fetcharg:" README | grep -q '\$comm' ; then +check_error 'p vfs_read +0(^$comm)'# COMM_CANT_DEREF +fi + +check_error 'p vfs_read ^&1' # BAD_FETCH_ARG + + +# We've introduced this limitation with array support +if grep -q ' \\\[\\\]' README; then +check_error 'p vfs_read +0(^+0(+0(+0(+0(+0(+0(+0(+0(+0(+0(+0(+0(+0(@0))' # TOO_MANY_OPS? 
+check_error 'p vfs_read +0(@11):u8[10^'# ARRAY_NO_CLOSE +check_error 'p vfs_read +0(@11):u8[10]^a' # BAD_ARRAY_SUFFIX +check_error 'p vfs_read +0(@11):u8[^10a]' # BAD_ARRAY_NUM +check_error 'p vfs_read +0(@11):u8[^256]' # ARRAY_TOO_BIG +fi + +check_error 'p vfs_read @11:^unknown_type' # BAD_TYPE +check_error 'p vfs_read $stack0:^string' # BAD_STRING +check_error 'p vfs_read @11:^b10@a/16' # BAD_BITFIELD + +check_error 'p vfs_read ^arg123456789012345678901234567890=@11'# ARG_NAME_TOO_LOG +check_error 'p vfs_read ^=@11' # NO_ARG_NAME +check_error 'p vfs_read ^var.1=@11'# BAD_ARG_NAME +check_error 'p vfs_read var1=@11 ^var1=@12'# USED_ARG_NAME +check_error 'p vfs_read ^+1234567(+1234567(+1234567(+1234567(+1234567(+1234567(@1234))'# ARG_TOO_LONG +check_error 'p vfs_read arg1=^'# NO_ARG_BODY + +# instruction boundary check is valid on x86 (at this moment) +case $(uname -m) in + x86_64|i[3456]86) +echo 'p vfs_read' > kprobe_events +if grep -q FTRACE ../kprobes/list ; then + check_error 'p ^vfs_read+3' # BAD_INSN_BNDRY (only if function-tracer is enabled) +fi +;; +esac + +exit 0 diff --git a/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc b/tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc new file mode 100644 index
[RFC PATCH v2 5/7] tracing/probe: Verify alloc_trace_*probe() result
Since alloc_trace_*probe() returns -EINVAL only if !event && !group, it should not happen in trace_*probe_create(). If we catch that case there is a bug. So use WARN_ON_ONCE() instead of pr_info(). Signed-off-by: Masami Hiramatsu --- kernel/trace/trace_kprobe.c |4 ++-- kernel/trace/trace_uprobe.c |3 ++- 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index 5222fd82e7e4..56324c231688 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -693,9 +693,9 @@ static int trace_kprobe_create(int argc, const char *argv[]) tk = alloc_trace_kprobe(group, event, addr, symbol, offset, maxactive, argc, is_return); if (IS_ERR(tk)) { - pr_info("Failed to allocate trace_probe.(%d)\n", - (int)PTR_ERR(tk)); ret = PTR_ERR(tk); + /* This must return -ENOMEM otherwise there is a bug */ + WARN_ON_ONCE(ret != -ENOMEM); goto out; } diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index 52f033489377..b54137ec7810 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -514,8 +514,9 @@ static int trace_uprobe_create(int argc, const char **argv) tu = alloc_trace_uprobe(group, event, argc, is_return); if (IS_ERR(tu)) { - pr_info("Failed to allocate trace_uprobe.(%d)\n", (int)PTR_ERR(tu)); ret = PTR_ERR(tu); + /* This must return -ENOMEM otherwise there is a bug */ + WARN_ON_ONCE(ret != -ENOMEM); goto fail_address_parse; } tu->offset = offset;
[RFC PATCH v2 2/7] tracing/probe: Check event name length correctly
Ensure the given event name is not too long when parsing it, and update the event name offset correctly when a group name is given. For example, this makes the probe event check the "p:foo/" error case correctly.

Signed-off-by: Masami Hiramatsu
---
 kernel/trace/trace_probe.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 89da34b326e3..4cd50913cb5d 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -159,6 +159,7 @@ int traceprobe_parse_event_name(const char **pevent, const char **pgroup,
 				char *buf)
 {
 	const char *slash, *event = *pevent;
+	int len;
 
 	slash = strchr(event, '/');
 	if (slash) {
@@ -173,10 +174,15 @@ int traceprobe_parse_event_name(const char **pevent, const char **pgroup,
 		strlcpy(buf, event, slash - event + 1);
 		*pgroup = buf;
 		*pevent = slash + 1;
+		event = *pevent;
 	}
-	if (strlen(event) == 0) {
+	len = strlen(event);
+	if (len == 0) {
 		pr_info("Event name is not specified\n");
 		return -EINVAL;
+	} else if (len > MAX_EVENT_NAME_LEN) {
+		pr_info("Event name is too long\n");
+		return -E2BIG;
 	}
 	return 0;
 }
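The bug being fixed is subtle: once "group/" has been consumed, the length check must apply to the part after the slash, which is what the added `event = *pevent;` line achieves. A userspace sketch of the corrected ordering, with simplified error codes and a locally defined MAX_EVENT_NAME_LEN (the kernel's value, restated here as an assumption):

```c
#include <string.h>

#define MAX_EVENT_NAME_LEN 64	/* assumed; matches the kernel's constant */

/* Returns 0 on success, -1 for an empty event name (the "p:foo/" case),
 * -2 for an over-long event name. */
int check_event_name(const char *spec)
{
	const char *event = spec;
	const char *slash = strchr(spec, '/');
	size_t len;

	if (slash)
		event = slash + 1;	/* advance past the group first */

	len = strlen(event);	/* ...so the length check sees the event part */
	if (len == 0)
		return -1;
	if (len > MAX_EVENT_NAME_LEN)
		return -2;
	return 0;
}
```

Without the advance, `strlen(event)` would measure the whole "group/event" string, so "foo/" would wrongly pass the empty-name check.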
[RFC PATCH v2 3/7] tracing/probe: Check the size of argument name and body
Check that the size of the argument name and expression is non-zero and smaller than the maximum length.

Signed-off-by: Masami Hiramatsu
---
 kernel/trace/trace_probe.c | 2 ++
 kernel/trace/trace_probe.h | 1 +
 2 files changed, 3 insertions(+)

diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c
index 4cd50913cb5d..feae03056f0b 100644
--- a/kernel/trace/trace_probe.c
+++ b/kernel/trace/trace_probe.c
@@ -554,6 +554,8 @@ int traceprobe_parse_probe_arg(struct trace_probe *tp, int i, char *arg,
 
 	body = strchr(arg, '=');
 	if (body) {
+		if (body - arg > MAX_ARG_NAME_LEN || body == arg)
+			return -EINVAL;
 		parg->name = kmemdup_nul(arg, body - arg, GFP_KERNEL);
 		body++;
 	} else {
diff --git a/kernel/trace/trace_probe.h b/kernel/trace/trace_probe.h
index 8a63f8bc01bc..2177c206de15 100644
--- a/kernel/trace/trace_probe.h
+++ b/kernel/trace/trace_probe.h
@@ -32,6 +32,7 @@
 #define MAX_TRACE_ARGS		128
 #define MAX_ARGSTR_LEN		63
 #define MAX_ARRAY_LEN		64
+#define MAX_ARG_NAME_LEN	32
 #define MAX_STRING_SIZE		PATH_MAX
 
 /* Reserved field names */
[RFC PATCH v2 1/7] tracing/probe: Check maxactive error cases
Check maxactive on the kprobe error case, because maxactive is only for kretprobes, not for kprobes. Also, maxactive should not be 0; it should be at least 1.

Signed-off-by: Masami Hiramatsu
---
 kernel/trace/trace_kprobe.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index d5fb09ebba8b..d47e12596f12 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -624,7 +624,11 @@ static int trace_kprobe_create(int argc, const char *argv[])
 	if (event)
 		event++;
 
-	if (is_return && isdigit(argv[0][1])) {
+	if (isdigit(argv[0][1])) {
+		if (!is_return) {
+			pr_info("Maxactive is not for kprobe");
+			return -EINVAL;
+		}
 		if (event)
 			len = event - &argv[0][1] - 1;
 		else
@@ -634,8 +638,8 @@ static int trace_kprobe_create(int argc, const char *argv[])
 	memcpy(buf, &argv[0][1], len);
 	buf[len] = '\0';
 	ret = kstrtouint(buf, 0, &maxactive);
-	if (ret) {
-		pr_info("Failed to parse maxactive.\n");
+	if (ret || !maxactive) {
+		pr_info("Invalid maxactive number\n");
 		return ret;
 	}
 	/* kretprobes instances are iterated over via a list. The
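The two tightened checks can be sketched in isolation. This is a simplified userspace model, not the kernel parser: it assumes the first argument is just the leading "p<NUM>"/"r<NUM>" token, uses strtoul in place of kstrtouint (both reject trailing garbage here), and restates KRETPROBE_MAXACTIVE_MAX locally.

```c
#include <ctype.h>
#include <stdlib.h>

#define KRETPROBE_MAXACTIVE_MAX 4096	/* assumed; the kernel's limit */

/* Validate an optional maxactive suffix: only legal on 'r' (kretprobe)
 * events, must parse fully as a number, must be neither zero nor huge.
 * Returns 0 if the prefix is acceptable, -1 otherwise. */
int check_maxactive(const char *arg0, int is_return)
{
	char *end;
	unsigned long maxactive;

	if (!isdigit((unsigned char)arg0[1]))
		return 0;		/* no maxactive given at all */
	if (!is_return)
		return -1;		/* maxactive is not for kprobe */
	maxactive = strtoul(&arg0[1], &end, 10);
	if (*end != '\0' || maxactive == 0)
		return -1;		/* invalid maxactive number */
	if (maxactive > KRETPROBE_MAXACTIVE_MAX)
		return -1;		/* too big for the kretprobe list */
	return 0;
}
```

Note how "p10" now fails fast: before the patch, the digit suffix on a plain kprobe was silently ignored because the `is_return &&` guard skipped the whole block.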
[RFC PATCH v2 4/7] tracing/probe: Check event/group naming rule at parsing
Check event and group naming rule at parsing it instead of allocating probes. Signed-off-by: Masami Hiramatsu --- Changes in v2: - Update error message according to Steve's comment. --- kernel/trace/trace_kprobe.c |7 +-- kernel/trace/trace_probe.c |8 kernel/trace/trace_uprobe.c |5 + 3 files changed, 10 insertions(+), 10 deletions(-) diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c index d47e12596f12..5222fd82e7e4 100644 --- a/kernel/trace/trace_kprobe.c +++ b/kernel/trace/trace_kprobe.c @@ -221,7 +221,7 @@ static struct trace_kprobe *alloc_trace_kprobe(const char *group, tk->rp.maxactive = maxactive; - if (!event || !is_good_name(event)) { + if (!event || !group) { ret = -EINVAL; goto error; } @@ -231,11 +231,6 @@ static struct trace_kprobe *alloc_trace_kprobe(const char *group, if (!tk->tp.call.name) goto error; - if (!group || !is_good_name(group)) { - ret = -EINVAL; - goto error; - } - tk->tp.class.system = kstrdup(group, GFP_KERNEL); if (!tk->tp.class.system) goto error; diff --git a/kernel/trace/trace_probe.c b/kernel/trace/trace_probe.c index feae03056f0b..d2b73628f1e1 100644 --- a/kernel/trace/trace_probe.c +++ b/kernel/trace/trace_probe.c @@ -172,6 +172,10 @@ int traceprobe_parse_event_name(const char **pevent, const char **pgroup, return -E2BIG; } strlcpy(buf, event, slash - event + 1); + if (!is_good_name(buf)) { + pr_info("Group name must follow the same rules as C identifiers\n"); + return -EINVAL; + } *pgroup = buf; *pevent = slash + 1; event = *pevent; @@ -184,6 +188,10 @@ int traceprobe_parse_event_name(const char **pevent, const char **pgroup, pr_info("Event name is too long\n"); return -E2BIG; } + if (!is_good_name(event)) { + pr_info("Event name must follow the same rules as C identifiers\n"); + return -EINVAL; + } return 0; } diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index e335576b9411..52f033489377 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -266,10 +266,7 @@ 
alloc_trace_uprobe(const char *group, const char *event, int nargs, bool is_ret) { struct trace_uprobe *tu; - if (!event || !is_good_name(event)) - return ERR_PTR(-EINVAL); - - if (!group || !is_good_name(group)) + if (!event || !group) return ERR_PTR(-EINVAL); tu = kzalloc(SIZEOF_TRACE_UPROBE(nargs), GFP_KERNEL);
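The naming rule the patch moves to parse time is the kernel's is_good_name() check: event and group names must have C-identifier shape (a leading non-digit, then alphanumerics or underscore). A userspace restatement of that rule, for illustration:

```c
#include <ctype.h>

/* Sketch of the kernel's is_good_name() predicate: returns non-zero iff
 * name looks like a C identifier. An empty string fails the first test. */
int is_good_name(const char *name)
{
	if (!isalpha((unsigned char)*name) && *name != '_')
		return 0;
	while (*++name != '\0') {
		if (!isalnum((unsigned char)*name) && *name != '_')
			return 0;
	}
	return 1;
}
```

Rejecting bad names at parse time means the allocator only ever sees names that can become valid tracefs entries, which is why the later alloc_trace_*probe() checks collapse to a simple NULL test.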
[RFC PATCH v2 0/7] tracing: Use common error_log with probe events
Hi, Here is the 2nd version of using common error_log with probe events. Previous version is here. http://lkml.kernel.org/r/155248005229.10815.334731901778152247.stgit@devnote2 In this version, I've updated some error messages according to Steve's comment, adjust some error position, and update testcase to simplify a bit. - [4/7]: Update error message according to Steve's comment (Thanks!) - [6/7]: Update error message, adjust error positions, and add uprobe errors - [7/7]: Specify error position in command string by "^". Clear error_log right before writing command. Add uprobe syntax error checker Thank you, --- Masami Hiramatsu (7): tracing/probe: Check maxactive error cases tracing/probe: Check event name length correctly tracing/probe: Check the size of argument name and body tracing/probe: Check event/group naming rule at parsing tracing/probe: Verify alloc_trace_*probe() result tracing: Use tracing error_log with probe events selftests/ftrace: Add error_log testcase for probe errors kernel/trace/trace_kprobe.c| 90 -- kernel/trace/trace_probe.c | 282 +++- kernel/trace/trace_probe.h | 78 +- kernel/trace/trace_uprobe.c| 52 ++-- .../ftrace/test.d/kprobe/kprobe_syntax_errors.tc | 93 +++ .../ftrace/test.d/kprobe/uprobe_syntax_errors.tc | 31 ++ 6 files changed, 494 insertions(+), 132 deletions(-) create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/kprobe_syntax_errors.tc create mode 100644 tools/testing/selftests/ftrace/test.d/kprobe/uprobe_syntax_errors.tc -- Masami Hiramatsu
Compiling error if CONFIG_CPU_SUP_INTEL is disabled
If CONFIG_CPU_SUP_INTEL is disabled with either the 5.0.2 or 4.20.16 kernel, it errors out right away: In file included from arch/x86/events/amd/core.c:8: arch/x86/events/amd/../perf_event.h:1035:45: warning: ‘struct cpu_hw_event’ declared inside parameter list will not be visible outside of this definition or declaration static inline int intel_cpuc_prepare(struct cpu_hw_event *cpuc, int cpu) ^~~~ arch/x86/events/amd/../perf_event.h:1040:45: warning: ‘struct cpu_hw_event’ declared inside parameter list will not be visible outside of this definition or declaration static inline void intel_cpuc_finish(struct cpu_hw_event *cpuc) ^~~~ CC arch/x86/events/amd/uncore.o CC arch/x86/events/amd/ibs.o AR kernel/bpf/built-in.a CC kernel/cgroup/cgroup.o In file included from arch/x86/events/amd/ibs.c:19: arch/x86/events/amd/../perf_event.h:1035:45: warning: ‘struct cpu_hw_event’ declared inside parameter list will not be visible outside of this definition or declaration static inline int intel_cpuc_prepare(struct cpu_hw_event *cpuc, int cpu) ^~~~ arch/x86/events/amd/../perf_event.h:1040:45: warning: ‘struct cpu_hw_event’ declared inside parameter list will not be visible outside of this definition or declaration static inline void intel_cpuc_finish(struct cpu_hw_event *cpuc) ^~~~ AR arch/x86/events/amd/built-in.a CC arch/x86/events/core.o In file included from arch/x86/events/core.c:44: arch/x86/events/perf_event.h:1035:45: warning: ‘struct cpu_hw_event’ declared inside parameter list will not be visible outside of this definition or declaration static inline int intel_cpuc_prepare(struct cpu_hw_event *cpuc, int cpu) ^~~~ arch/x86/events/perf_event.h:1040:45: warning: ‘struct cpu_hw_event’ declared inside parameter list will not be visible outside of this definition or declaration static inline void intel_cpuc_finish(struct cpu_hw_event *cpuc) ^~~~ arch/x86/events/core.c: In function ‘free_fake_cpuc’: arch/x86/events/core.c:1998:20: error: passing argument 1 of 
‘intel_cpuc_finish’ from incompatible pointer type [-Werror=incompatible-pointer-types] intel_cpuc_finish(cpuc); ^~~~ In file included from arch/x86/events/core.c:44: arch/x86/events/perf_event.h:1040:59: note: expected ‘struct cpu_hw_event *’ but argument is of type ‘struct cpu_hw_events *’ static inline void intel_cpuc_finish(struct cpu_hw_event *cpuc) ~^~~~ arch/x86/events/core.c: In function ‘allocate_fake_cpuc’: arch/x86/events/core.c:2012:25: error: passing argument 1 of ‘intel_cpuc_prepare’ from incompatible pointer type [-Werror=incompatible-pointer-types] if (intel_cpuc_prepare(cpuc, cpu)) ^~~~ In file included from arch/x86/events/core.c:44: arch/x86/events/perf_event.h:1035:59: note: expected ‘struct cpu_hw_event *’ but argument is of type ‘struct cpu_hw_events *’ static inline int intel_cpuc_prepare(struct cpu_hw_event *cpuc, int cpu) ~^~~~ cc1: some warnings being treated as errors make[2]: *** [scripts/Makefile.build:291: arch/x86/events/core.o] Error 1 make[1]: *** [scripts/Makefile.build:516: arch/x86/events] Error 2 make: *** [Makefile:1058: arch/x86] Error 2
Re: [PATCH 0/3] userfaultfd: allow to forbid unprivileged users
On 3/13/19 4:55 PM, Andrea Arcangeli wrote: > On Wed, Mar 13, 2019 at 01:01:40PM -0700, Mike Kravetz wrote: >> On 3/13/19 11:52 AM, Andrea Arcangeli wrote: >>> Unless somebody suggests a consistent way to make hugetlbfs "just >>> work" (like we could achieve clean with CRIU and KVM), I think Oracle >>> will need a one liner change in the Oracle setup to echo into that >>> file in addition of running the hugetlbfs mount. >> >> I think you are suggesting the DB setup process enable uffd for all users. >> Correct? > > Yes. In addition of the hugetlbfs setup, various apps requires to also > increase fs.inotify.max_user_watches or file-max and other tweaks, > this would be one of those tweaks. Yes, I agree. It is just that unprivileged_userfaultfd disabled would likely to be the default set by distros. Or, perhaps 'kvm'? Then, if you want to run the DB, the admin (or the DB) will need to set it to enabled. And, this results in it being enabled for everyone. I think I understand the scope of any security exposure this would cause from a technical perspective. However, I can imagine people being concerned about enabling everywhere if this is not the default setting. If it is OK to disable everywhere, why not just use disable for the kvm use case as well? :) >> This may be too simple, and I don't really like group access, but how about >> just defining a uffd group? If you are in the group you can make uffd >> system calls. > > Everything is possible, I'm just afraid it gets too complex. > > So you suggest to echo a gid into the file? That is what I was thinking. But, I was mostly thinking of that because Peter's earlier comment made me go and check hugetlbfs code. There is a sysctl_hugetlb_shm_group variable that does this, even though it is mostly unused in the hugetlbfs code. I know the kvm dev open scheme works for kvm. Was just trying to think of something more general that would work for hugetlbfs/DB or other use cases. -- Mike Kravetz
Re: [PATCH v2 4/5] slob: Use slab_list instead of lru
On Wed, Mar 13, 2019 at 07:05:02PM +, Christopher Lameter wrote: > On Wed, 13 Mar 2019, Tobin C. Harding wrote: > > > @@ -297,7 +297,7 @@ static void *slob_alloc(size_t size, gfp_t gfp, int > > align, int node) > > continue; > > > > /* Attempt to alloc */ > > - prev = sp->lru.prev; > > + prev = sp->slab_list.prev; > > b = slob_page_alloc(sp, size, align); > > if (!b) > > continue; > > Hmmm... Is there a way to use a macro or so to avoid referencing the field > within the slab_list? Thanks for the review. Next version includes a fix for this. Tobin.
[PATCH v2 1/1] initrd: move initrd_start calculate within linear mapping range check
Previously, initrd_start and initrd_end could be successfully calculated even when (base < memblock_start_of_DRAM()) or (base + size > memblock_start_of_DRAM() + linear_region_size). That means that even if the linear mapping range check fails for initrd_start and initrd_end, they still get virtual addresses. Here we move the calculation of initrd_start/initrd_end so it happens only when the linear mapping check passes.

Fixes: c756c592e442 ("arm64: Utilize phys_initrd_start/phys_initrd_size")
Reviewed-by: Steven Price
Signed-off-by: pierre Kuo
---
Changes in v2:
- add Fixes tag

 arch/arm64/mm/init.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 7205a9085b4d..1adf418de685 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -425,6 +425,9 @@ void __init arm64_memblock_init(void)
 			memblock_remove(base, size); /* clear MEMBLOCK_ flags */
 			memblock_add(base, size);
 			memblock_reserve(base, size);
+			/* the generic initrd code expects virtual addresses */
+			initrd_start = __phys_to_virt(phys_initrd_start);
+			initrd_end = initrd_start + phys_initrd_size;
 		}
 	}
 
@@ -450,11 +453,6 @@ void __init arm64_memblock_init(void)
 	 * pagetables with memblock.
 	 */
 	memblock_reserve(__pa_symbol(_text), _end - _text);
-	if (IS_ENABLED(CONFIG_BLK_DEV_INITRD) && phys_initrd_size) {
-		/* the generic initrd code expects virtual addresses */
-		initrd_start = __phys_to_virt(phys_initrd_start);
-		initrd_end = initrd_start + phys_initrd_size;
-	}
 
 	early_init_fdt_scan_reserved_mem();
--
2.17.1
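The reordering can be modeled in userspace. Every constant below (DRAM base, linear-map size, phys-to-virt offset) is made up for illustration; the point is only that the virtual addresses are computed on the branch where the range check has passed, and never otherwise.

```c
#include <stdint.h>

#define DRAM_START	0x40000000ULL	/* hypothetical DRAM base */
#define LINEAR_SIZE	0x40000000ULL	/* hypothetical linear-map span */
#define PHYS_TO_VIRT_OFF 0xffff000000000000ULL	/* hypothetical offset */

uint64_t initrd_start, initrd_end;

/* Toy model of the fixed flow: reject an initrd outside the linear map
 * before any virtual address is derived from it. */
int map_initrd(uint64_t base, uint64_t size)
{
	if (base < DRAM_START || base + size > DRAM_START + LINEAR_SIZE)
		return -1;	/* not linearly mapped: leave both unset */

	/* the generic initrd code expects virtual addresses */
	initrd_start = base + PHYS_TO_VIRT_OFF;
	initrd_end = initrd_start + size;
	return 0;
}
```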
Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions
On 3/13/19 7:49 AM, Ira Weiny wrote: On Tue, Mar 12, 2019 at 05:38:55PM -0700, John Hubbard wrote: On 3/12/19 8:30 AM, Ira Weiny wrote: On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubb...@gmail.com wrote: From: John Hubbard Introduces put_user_page(), which simply calls put_page(). This provides a way to update all get_user_pages*() callers, so that they call put_user_page(), instead of put_page(). So I've been running with these patches for a while but today while ramping up my testing I hit the following: [ 1355.557819] [ cut here ] [ 1355.563436] get_user_pages pin count overflowed Hi Ira, Thanks for reporting this. That overflow, at face value, means that we've used more than the 22 bits worth of gup pin counts, so about 4 million pins of the same page... This is my bug in the patches I'm playing with. Somehow I'm causing more puts than gets... I'm not sure how but this is for sure my problem. Backing off to your patch set the numbers are good. Now that's a welcome bit of good news! Sorry for the noise. With the testing I've done today I feel comfortable adding Tested-by: Ira Weiny For the main GUP and InfiniBand patches. Ira OK, I'll add your tested-by tag to patches 1, 2, 4, 5 (the numbering refers to the "RFC v2: mm: gup/dma tracking" posting [1]) in my repo [2], and they'll show up in the next posting. (Patch 3 is already upstream, and patch 6 is documentation that needs to be rewritten entirely.) [1] https://lore.kernel.org/r/20190204052135.25784-1-jhubb...@nvidia.com [2] https://github.com/johnhubbard/linux/tree/gup_dma_core thanks, -- John Hubbard NVIDIA
Re:Re:Re: Re:[PATCH 4.9 81/96] MIPS:Loongson: Introduce and use loongson_llsc_mb()
Hi, Greg, Patch for 4.9 (and below 4.9) is here: https://patchwork.linux-mips.org/patch/21375/ Huacai -- Original -- From: "陈华才"; Date: Thu, Mar 14, 2019 06:55 AM To: "gregkh"; Cc: "linux-kernel"; "stable"; "huangpei"; "Paul Burton"; "Ralf Baechle"; "ambrosehua"; "Steven J . Hill"; "linux-mips"; "Fuxin Zhang"; "wuzhangjin"; "Li Xuefeng"; "Xu Chenghua"; "Sasha Levin"; Subject: Re:Re: Re:[PATCH 4.9 81/96] MIPS:Loongson: Introduce and use loongson_llsc_mb() Hi, Greg, Just 4.9 (and below) needs to modify spinlock.h; 4.14 and 4.19 do not need this because they have been converted to qspinlock. And sorry for my poor reply, because I can only use a mobile phone now. --- Original Message --- From: "Greg Kroah-Hartman" Date: Thursday, March 14, 2019, 4:58 AM To: "陈华才"; Subject: Re:Re: Re:[PATCH 4.9 81/96] MIPS:Loongson: Introduce and use loongson_llsc_mb() On Wed, Mar 13, 2019 at 09:17:15PM +0800, 陈华才 wrote: > Hi, GREG, > > 4.9 need to modify spinlock.h, please wait my patch. > > > > --- Original Message --- > From: "Greg Kroah-Hartman" > Date: Wednesday, March 13, 2019, 1:10 AM > To: "linux-kernel"; > Subject: [PATCH 4.9 81/96] MIPS: Loongson: Introduce and use loongson_llsc_mb() > 4.9-stable review patch. If anyone has any objections, please let me know. > > -- > > [ Upstream commit e02e07e3127d8aec1f4bcdfb2fc52a2d99b4859e ] > > On the Loongson-2G/2H/3A/3B there is a hardware flaw that ll/sc and > lld/scd is very weak ordering. We should add sync instructions "before > each ll/lld" and "at the branch-target between ll/sc" to workaround. > Otherwise, this flaw will cause deadlock occasionally (e.g. when doing > heavy load test with LTP). > > Below is the explanation of CPU designer: > > "For Loongson 3 family, when a memory access instruction (load, store, > or prefetch)'s executing occurs between the execution of LL and SC, the > success or failure of SC is not predictable.
> Although programmer would
> not insert memory access instructions between LL and SC, the memory
> instructions before LL in program-order, may dynamically executed
> between the execution of LL/SC, so a memory fence (SYNC) is needed
> before LL/LLD to avoid this situation.
>
> Since Loongson-3A R2 (3A2000), we have improved our hardware design to
> handle this case. But we later deduce a rarely circumstance that some
> speculatively executed memory instructions due to branch misprediction
> between LL/SC still fall into the above case, so a memory fence (SYNC)
> at branch-target (if its target is not between LL/SC) is needed for
> Loongson 3A1000, 3B1500, 3A2000 and 3A3000.
>
> Our processor is continually evolving and we aim to remove all these
> workaround-SYNCs around LL/SC for new-come processor."
>
> Here is an example:
>
> Both cpu1 and cpu2 simultaneously run atomic_add by 1 on same atomic var,
> this bug cause both 'sc' run by two cpus (in atomic_add) succeed at same
> time ('sc' return 1), and the variable is only *added by 1*, sometimes,
> which is wrong and unacceptable (it should be added by 2).
>
> Why disable fix-loongson3-llsc in compiler?
> Because the compiler fix will cause problems in kernel's __ex_table section.
>
> This patch fix all the cases in kernel, but:
>
> +. the fix at the end of futex_atomic_cmpxchg_inatomic is for branch-target
> of 'bne', there other cases which smp_mb__before_llsc() and smp_llsc_mb() fix
> the ll and branch-target coincidently such as atomic_sub_if_positive/
> cmpxchg/xchg, just like this one.
>
> +. Loongson 3 does not support CONFIG_EDAC_ATOMIC_SCRUB, so no need to touch
> edac.h
>
> +. local_ops and cmpxchg_local should not be affected by this bug since
> only the owner can write.
>
> +.
> mips_atomic_set for syscall.c is deprecated and rarely used, just let
> it go
>
> Signed-off-by: Huacai Chen
> Signed-off-by: Huang Pei
> [paul.bur...@mips.com:
> - Simplify the addition of -mno-fix-loongson3-llsc to cflags, and add
> a comment describing why it's here.
> - Make loongson_llsc_mb() a no-op when
> CONFIG_CPU_LOONGSON3_WORKAROUNDS=n, rather than a compiler memory
> barrier.
> - Add a comment describing the bug & how loongson_llsc_mb() helps
> in asm/barrier.h.]
> Signed-off-by: Paul Burton
> Cc: Ralf Baechle
> Cc: ambrose...@gmail.com
> Cc: Steven J . Hill
> Cc: linux-m...@linux-mips.org
> Cc: Fuxin Zhang
> Cc: Zhangjin Wu
> Cc: Li Xuefeng
> Cc: Xu Chenghua
> Signed-off-by: Sasha Levin
> ---
> arch/mips/Kconfig | 15 ++
> arch/mips/include/asm/atomic.h | 6 ++
> arch/mips/include/asm/barrier.h | 36 +
> arch/mips/include/asm/bitops.h | 5 +
> arch/mips/include/asm/futex.h | 3 +++
> arch/mips/include/asm/pgtable.h | 2 ++
> arch/mips/loongson64/Platform | 23 +
> arch/mips/mm/tlbex.c | 10 +
> 8 files changed, 100 insertions(+)

Ok, I will go drop this from all stable queues now, thanks!

greg k-h
Re: [PATCH v4] lib/string.c: implement a basic bcmp
On Wed, Mar 13, 2019 at 02:13:31PM -0700, Nick Desaulniers wrote:
> A recent optimization in Clang (r355672) lowers comparisons of the
> return value of memcmp against zero to comparisons of the return value
> of bcmp against zero. This helps some platforms that implement bcmp
> more efficiently than memcmp. glibc simply aliases bcmp to memcmp, but
> an optimized implementation is in the works.
>
> This results in linkage failures for all targets with Clang due to the
> undefined symbol. For now, just implement bcmp as a tail call to memcmp
> to unbreak the build. This routine can be further optimized in the
> future.
>
> Other ideas discussed:
> * A weak alias was discussed, but breaks for architectures that define
> their own implementations of memcmp since aliases to declarations are
> not permitted (only definitions). Arch-specific memcmp implementations
> typically declare memcmp in C headers, but implement them in assembly.
> * -ffreestanding also is used sporadically throughout the kernel.
> * -fno-builtin-bcmp doesn't work when doing LTO.
>
> Link: https://bugs.llvm.org/show_bug.cgi?id=41035
> Link: https://code.woboq.org/userspace/glibc/string/memcmp.c.html#bcmp
> Link: https://github.com/llvm/llvm-project/commit/8e16d73346f8091461319a7dfc4ddd18eedcff13
> Link: https://github.com/ClangBuiltLinux/linux/issues/416
> Cc: sta...@vger.kernel.org
> Reported-by: Nathan Chancellor
> Reported-by: Adhemerval Zanella
> Suggested-by: Arnd Bergmann
> Suggested-by: James Y Knight
> Suggested-by: Masahiro Yamada
> Suggested-by: Nathan Chancellor
> Suggested-by: Rasmus Villemoes
> Signed-off-by: Nick Desaulniers
> Acked-by: Steven Rostedt (VMware)

Reviewed-by: Nathan Chancellor
Tested-by: Nathan Chancellor

Thanks for fixing this better than I did (or tried to)!

> ---
> Changes V3 -> V4:
> * Include the entirety of Rasmus' suggestion, as per Steven.
> * Change formal parameter identifiers to match the comment.
> Changes V2 -> V3:
> * Adjust comment as per Steven to Rasmus' suggestion.
> * Pick up Steven's Ack.
> Changes V1 -> V2:
> * Add declaration to include/linux/string.h.
> * Reword comment above bcmp.
>
> include/linux/string.h | 3 +++
> lib/string.c | 20
> 2 files changed, 23 insertions(+)
>
> diff --git a/include/linux/string.h b/include/linux/string.h
> index 7927b875f80c..6ab0a6fa512e 100644
> --- a/include/linux/string.h
> +++ b/include/linux/string.h
> @@ -150,6 +150,9 @@ extern void * memscan(void *,int,__kernel_size_t);
> #ifndef __HAVE_ARCH_MEMCMP
> extern int memcmp(const void *,const void *,__kernel_size_t);
> #endif
> +#ifndef __HAVE_ARCH_BCMP
> +extern int bcmp(const void *,const void *,__kernel_size_t);
> +#endif
> #ifndef __HAVE_ARCH_MEMCHR
> extern void * memchr(const void *,int,__kernel_size_t);
> #endif
> diff --git a/lib/string.c b/lib/string.c
> index 38e4ca08e757..3ab861c1a857 100644
> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -866,6 +866,26 @@ __visible int memcmp(const void *cs, const void *ct, size_t count)
> EXPORT_SYMBOL(memcmp);
> #endif
>
> +#ifndef __HAVE_ARCH_BCMP
> +/**
> + * bcmp - returns 0 if and only if the buffers have identical contents.
> + * @a: pointer to first buffer.
> + * @b: pointer to second buffer.
> + * @len: size of buffers.
> + *
> + * The sign or magnitude of a non-zero return value has no particular
> + * meaning, and architectures may implement their own more efficient bcmp(). So
> + * while this particular implementation is a simple (tail) call to memcmp, do
> + * not rely on anything but whether the return value is zero or non-zero.
> + */
> +#undef bcmp
> +int bcmp(const void *a, const void *b, size_t len)
> +{
> +	return memcmp(a, b, len);
> +}
> +EXPORT_SYMBOL(bcmp);
> +#endif
> +
> #ifndef __HAVE_ARCH_MEMSCAN
> /**
> * memscan - Find a character in an area of memory.
> --
> 2.21.0.360.g471c308f928-goog
>
Re: lib/test_overflow.c causes WARNING and tainted kernel
Hi! On Wed, Mar 13, 2019 at 2:29 PM Randy Dunlap wrote: > > This is v5.0-11053-gebc551f2b8f9, MAR-12 around 4:00pm PT. > > In the first test_kmalloc() in test_overflow_allocation(): > > [54375.073895] test_overflow: ok: (s64)(0 << 63) == 0 > [54375.074228] WARNING: CPU: 2 PID: 5462 at ../mm/page_alloc.c:4584 > __alloc_pages_nodemask+0x33f/0x540 > [...] > [54375.079236] ---[ end trace 754acb68d8d1a1cb ]--- > [54375.079313] test_overflow: kmalloc detected saturation Yup! This is expected and operating as intended: it is exercising the allocator's detection of insane allocation sizes. :) If we want to make it less noisy, perhaps we could add a global flag the allocators could check before doing their WARNs? -Kees -- Kees Cook
Re: [PATCH v2 8/9] rtc: mt6397: fix alarm register overwrite
Hi,

On Mon, 2019-03-11 at 13:50 -0700, Sean Wang wrote:
> Hi,
>
> On Sun, Mar 10, 2019 at 8:49 PM Hsin-Hsiung Wang wrote:
> >
> > From: Ran Bi
> >
> > The alarm registers' high byte was reserved for other functions.
> > This adds masks to the alarm register operation functions.
> > This also fixes an error condition in the interrupt handler.
> >
> > Fixes: fc2979118f3f ("rtc: mediatek: Add MT6397 RTC driver")
>
> add a Cc: sta...@vger.kernel.org here to apply to all stable kernels
> for the critical fixup patch.
>
> > Signed-off-by: Ran Bi
> > ---
> > drivers/rtc/rtc-mt6397.c | 47 +--
> > 1 file changed, 33 insertions(+), 14 deletions(-)
> >
> > diff --git a/drivers/rtc/rtc-mt6397.c b/drivers/rtc/rtc-mt6397.c
> > index e9a25ec..f85f1fc 100644
> > --- a/drivers/rtc/rtc-mt6397.c
> > +++ b/drivers/rtc/rtc-mt6397.c
> > @@ -55,6 +55,14 @@
> >
> > #define RTC_AL_SEC 0x0018
> >
> > +#define RTC_AL_SEC_MASK 0x003f
> > +#define RTC_AL_MIN_MASK 0x003f
> > +#define RTC_AL_HOU_MASK 0x001f
> > +#define RTC_AL_DOM_MASK 0x001f
> > +#define RTC_AL_DOW_MASK 0x0007
> > +#define RTC_AL_MTH_MASK 0x000f
> > +#define RTC_AL_YEA_MASK 0x007f
> > +
> > #define RTC_PDN2 0x002e
> > #define RTC_PDN2_PWRON_ALARM BIT(4)
> >
> > @@ -111,7 +119,7 @@ static irqreturn_t mtk_rtc_irq_handler_thread(int irq, void *data)
> > irqen = irqsta & ~RTC_IRQ_EN_AL;
> > mutex_lock(&rtc->lock);
> > if (regmap_write(rtc->regmap, rtc->addr_base + RTC_IRQ_EN,
> > - irqen) < 0)
> > + irqen) == 0)
> > mtk_rtc_write_trigger(rtc);
> > mutex_unlock(&rtc->lock);
> >
> > @@ -233,12 +241,12 @@ static int mtk_rtc_read_alarm(struct device *dev, struct rtc_wkalrm *alm)
> > alm->pending = !!(pdn2 & RTC_PDN2_PWRON_ALARM);
> > mutex_unlock(&rtc->lock);
> >
> > - tm->tm_sec = data[RTC_OFFSET_SEC];
> > - tm->tm_min = data[RTC_OFFSET_MIN];
> > - tm->tm_hour = data[RTC_OFFSET_HOUR];
> > - tm->tm_mday = data[RTC_OFFSET_DOM];
> > - tm->tm_mon = data[RTC_OFFSET_MTH];
> > - tm->tm_year = data[RTC_OFFSET_YEAR];
> > + tm->tm_sec = data[RTC_OFFSET_SEC] & RTC_AL_SEC_MASK;
> > + tm->tm_min = data[RTC_OFFSET_MIN] & RTC_AL_MIN_MASK;
> > + tm->tm_hour = data[RTC_OFFSET_HOUR] & RTC_AL_HOU_MASK;
> > + tm->tm_mday = data[RTC_OFFSET_DOM] & RTC_AL_DOM_MASK;
> > + tm->tm_mon = data[RTC_OFFSET_MTH] & RTC_AL_MTH_MASK;
> > + tm->tm_year = data[RTC_OFFSET_YEAR] & RTC_AL_YEA_MASK;
> >
> > tm->tm_year += RTC_MIN_YEAR_OFFSET;
> > tm->tm_mon--;
> > @@ -259,14 +267,25 @@ static int mtk_rtc_set_alarm(struct device *dev, struct rtc_wkalrm *alm)
> > tm->tm_year -= RTC_MIN_YEAR_OFFSET;
> > tm->tm_mon++;
> >
> > - data[RTC_OFFSET_SEC] = tm->tm_sec;
> > - data[RTC_OFFSET_MIN] = tm->tm_min;
> > - data[RTC_OFFSET_HOUR] = tm->tm_hour;
> > - data[RTC_OFFSET_DOM] = tm->tm_mday;
> > - data[RTC_OFFSET_MTH] = tm->tm_mon;
> > - data[RTC_OFFSET_YEAR] = tm->tm_year;
> > -
> > mutex_lock(&rtc->lock);
> > + ret = regmap_bulk_read(rtc->regmap, rtc->addr_base + RTC_AL_SEC,
> > + data, RTC_OFFSET_COUNT);
> > + if (ret < 0)
> > + goto exit;
> > +
>
> Why did we need an additional regmap_bulk_read here? I would suppose
> data[RTC_OFFSET_*] & ~(RTC_AL_*_MASK) is always equal to 0.

In the bootloader, the RTC init code needs to operate on the RTC_AL_* registers' high byte for RTC operating-mode settings and data storage, while the RTC_AL_* low byte is used as the alarm register here. So we need an additional regmap_bulk_read here to keep the RTC settings correct.

> It might be another fixup since the change is not mentioned in the
> git message.
> > + data[RTC_OFFSET_SEC] = ((data[RTC_OFFSET_SEC] & ~(RTC_AL_SEC_MASK)) |
> > + (tm->tm_sec & RTC_AL_SEC_MASK));
> > + data[RTC_OFFSET_MIN] = ((data[RTC_OFFSET_MIN] & ~(RTC_AL_MIN_MASK)) |
> > + (tm->tm_min & RTC_AL_MIN_MASK));
> > + data[RTC_OFFSET_HOUR] = ((data[RTC_OFFSET_HOUR] & ~(RTC_AL_HOU_MASK)) |
> > + (tm->tm_hour & RTC_AL_HOU_MASK));
> > + data[RTC_OFFSET_DOM] = ((data[RTC_OFFSET_DOM] & ~(RTC_AL_DOM_MASK)) |
> > + (tm->tm_mday & RTC_AL_DOM_MASK));
> > + data[RTC_OFFSET_MTH] = ((data[RTC_OFFSET_MTH] & ~(RTC_AL_MTH_MASK)) |
> > + (tm->tm_mon & RTC_AL_MTH_MASK));
> > + data[RTC_OFFSET_YEAR] = ((data[RTC_OFFSET_YEAR] & ~(RTC_AL_YEA_MASK)) |
> > + (tm->tm_year & RTC_AL_YEA_MASK));
> > +
> > if
Re: [RFC PATCH 01/31] mm: migrate: Add exchange_pages to exchange two lists of pages.
On 19 Feb 2019, at 20:38, Anshuman Khandual wrote:

On 02/19/2019 06:26 PM, Matthew Wilcox wrote:

On Tue, Feb 19, 2019 at 01:12:07PM +0530, Anshuman Khandual wrote:

But the location of this temp page matters as well because you would like to saturate the inter-node interface. It needs to be either of the nodes where the source or destination page belongs. Any other node would generate two inter-node copy processes, which is not what you intend here, I guess.

That makes no sense. It should be allocated on the local node of the CPU performing the copy. If the CPU is in node A, the destination is in node B and the source is in node C, then you're doing 4k worth of reads from node C, 4k worth of reads from node B, 4k worth of writes to node C followed by 4k worth of writes to node B. Eventually the 4k of dirty cachelines on node A will be written back from cache to the local memory (... or not, if that page gets reused for some other purpose first). If you allocate the page on node B or node C, that's an extra 4k of writes to be sent across the inter-node link.

That's right, there will be an extra remote write. My assumption was that the CPU performing the copy belongs to either node B or node C.

I have some interesting throughput results for exchange per u64 and exchange per 4KB page. What I discovered is that using a 4KB page as the temporary storage for exchanging 2MB THPs does not improve the throughput. On the contrary, when we are exchanging more than 2^4=16 THPs, exchanging per 4KB page has lower throughput than exchanging per u64. Please see the results below.

The experiments are done on a two-socket machine with two Intel Xeon E5-2640 v3 CPUs. All exchanges are done via the QPI link across the two sockets.
Results
===
Throughput (GB/s) of exchanging 2 order-N 2MB pages between two NUMA nodes

2mb_page_order |    0 |    1 |    2 |    3 |    4 |    5 |    6 |    7 |    8 |    9
u64            | 5.31 | 5.58 | 5.89 | 5.69 | 8.97 | 9.51 | 9.21 | 9.50 | 9.57 | 9.62
per_page       | 5.85 | 6.48 | 6.20 | 5.26 | 7.22 | 7.25 | 7.28 | 7.30 | 7.32 | 7.31

Normalized throughput (to per_page)

2mb_page_order |    0 |    1 |    2 |    3 |    4 |    5 |    6 |    7 |    8 |    9
u64            | 0.90 | 0.86 | 0.94 | 1.08 | 1.24 | 1.31 | 1.26 | 1.30 | 1.30 | 1.31

Exchange page code
===
For exchanging per u64, I use the following function:

static void exchange_page(char *to, char *from)
{
	u64 tmp;
	int i;

	for (i = 0; i < PAGE_SIZE; i += sizeof(tmp)) {
		tmp = *((u64 *)(from + i));
		*((u64 *)(from + i)) = *((u64 *)(to + i));
		*((u64 *)(to + i)) = tmp;
	}
}

For exchange per 4KB, I use the following function:

static void exchange_page2(char *to, char *from)
{
	int cpu = smp_processor_id();

	VM_BUG_ON(!in_atomic());
	if (!page_tmp[cpu]) {
		int nid = cpu_to_node(cpu);
		struct page *page_tmp_page = alloc_pages_node(nid, GFP_KERNEL, 0);

		if (!page_tmp_page) {
			exchange_page(to, from);
			return;
		}
		page_tmp[cpu] = kmap(page_tmp_page);
	}
	copy_page(page_tmp[cpu], to);
	copy_page(to, from);
	copy_page(from, page_tmp[cpu]);
}

where page_tmp is pre-allocated local to each CPU and alloc_pages_node() above is for hot-added CPUs, which is not used in the tests.

The kernel is available at: https://gitlab.com/ziy/linux-contig-mem-rfc

To do a comparison, you can clone this repo: https://gitlab.com/ziy/thp-migration-bench, then make, ./run_test.sh, and ./get_results.sh using the kernel from above.

Let me know if I missed anything or did something wrong. Thanks.

--
Best Regards,
Yan Zi
Re: [LSF/MM TOPIC] Using XArray to manage the VMA
On Wed, Mar 13, 2019 at 02:06:03PM -0700, Davidlohr Bueso wrote: > On Wed, 13 Mar 2019, Laurent Dufour wrote: > > If this is not too late and if there is still place available, I would > > like to attend the MM track and propose a topic about using the XArray > > to replace the VMA's RB tree and list. > > > > Using the XArray in place of the VMA's tree and list seems to be a first > > step to the long way of removing/replacing the mmap_sem. > > So threaded (not as in threads of execution) rbtrees are another > alternative to deal with the two data structure approach we currently > have. Having O(1) rb_prev/next() calls allows us to basically get rid of > the vma list at the cost of an extra check for each node we visit on > the way down when inserting. It's probably worth listing the advantages of the Maple Tree over the rbtree. - Shallower tree. A 1000-entry rbtree is 10 levels deep. A 1000-entry Maple Tree is 5 levels deep (I did a more detailed analysis in an earlier email thread with Laurent and I can present it if needed). - O(1) prev/next - Lookups under the RCU lock There're some second-order effects too; by using externally allocated nodes, we avoid disturbing other VMAs when inserting/deleting, and we avoid bouncing cachelines around (eg the VMA which happens to end up at the head of the tree is accessed by every lookup in the tree because it's on the way to every other node).
Re: [PATCH] spi: mediatek: Attempt to address style issues in spi-mt7621.c
On Thu, Mar 14 2019, Chuanhong Guo wrote:
> Hi!
> On Thu, Mar 14, 2019 at 6:14 AM NeilBrown wrote:
>>
>> [...]
>> My only small concern is that this driver was backported to openwrt
>> (4.14 based) and then reverted
>>
>> https://github.com/openwrt/openwrt/commit/749a29f76ca780d8df70a5163d43bbdc6f13ba3f
>>
>> "This breaks some mt7621 devices."
>>
>> Possibly it was backported badly, or possibly there is a problem.
> Last time I backported the version with broken SPI modes so it got
> broken SPI CS1 support. There is one mt7621 router using SPI CS1 in
> OpenWrt and the backported driver broke it.
> It has been fixed by my two "drop broken spi modes" patches.

Ahh, thanks for the clarification. Good to know all known problems are fixed.

Thanks,
NeilBrown

>>
>> John: do you have any more details of the problem other than what is in
>> the commit message?
>>
>> Thanks,
>> NeilBrown
>
> Regards,
> Chuanhong Guo
must hold the driver_input_lock in hid_debug_rdesc_show
We see the below kernel panic logs when running a suspend/resume test with a USB mouse and a USB keyboard connected. The scenario is that userspace calls hid_debug_rdesc_show() to dump the input device while the device is removed. The patch holds the driver_input_lock to avoid the race.

[ 5381.757295] selinux: SELinux: Could not stat /sys/devices/pci:00/:00:15.0/usb1/1-2/1-2:1.0/0003:03F0:0325.0320/input/input960/input960::scrolllock: No such file or directory.
[ 5382.636498] BUG: unable to handle kernel paging request at 000783316040
[ 5382.651950] CPU: 1 PID: 1512 Comm: getevent Tainted: G U O 4.19.20-quilt-2e5dc0ac-00029-gc455a447dd55 #1
[ 5382.663797] RIP: 0010:hid_dump_device+0x9b/0x160
[ 5382.758853] Call Trace:
[ 5382.761581] hid_debug_rdesc_show+0x72/0x1d0
[ 5382.766343] seq_read+0xe0/0x410
[ 5382.769941] full_proxy_read+0x5f/0x90
[ 5382.774121] __vfs_read+0x3a/0x170
[ 5382.788392] vfs_read+0xa0/0x150
[ 5382.791984] ksys_read+0x58/0xc0
[ 5382.801404] __x64_sys_read+0x1a/0x20
[ 5382.805483] do_syscall_64+0x55/0x110
[ 5382.809559] entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: he, bo
Signed-off-by: "Zhang, Jun"
---
drivers/hid/hid-debug.c | 5 +
1 file changed, 5 insertions(+)

diff --git a/drivers/hid/hid-debug.c b/drivers/hid/hid-debug.c
index ebc9ffde41e9..a353a011fbdf 100644
--- a/drivers/hid/hid-debug.c
+++ b/drivers/hid/hid-debug.c
@@ -1060,10 +1060,15 @@ static int hid_debug_rdesc_show(struct seq_file *f, void *p)
 	seq_printf(f, "\n\n");

 	/* dump parsed data and input mappings */
+	if (down_interruptible(&hdev->driver_input_lock))
+		return 0;
+
 	hid_dump_device(hdev, f);
 	seq_printf(f, "\n");
 	hid_dump_input_mapping(hdev, f);

+	up(&hdev->driver_input_lock);
+
 	return 0;
 }
--
2.20.1
Re: [PATCH] spi: mediatek: Attempt to address style issues in spi-mt7621.c
Hi! On Thu, Mar 14, 2019 at 6:14 AM NeilBrown wrote: > > [...] > My only small concern is that this driver was backported to openwrt > (4.14 based) and then reverted > > https://github.com/openwrt/openwrt/commit/749a29f76ca780d8df70a5163d43bbdc6f13ba3f > > "This breaks some mt7621 devices." > > Possibly it was backported badly, or possibly there is a problem. Last time I backported the version with broken SPI modes so it got broken SPI CS1 support. There is one mt7621 router using SPI CS1 in OpenWrt and the backported driver broke it. It has been fixed by my two "drop broken spi modes" patches. > > John: do you have any more details of the problem other than what is in > the commit message? > > Thanks, > NeilBrown Regards, Chuanhong Guo
RE: [PATCH RESEND V2 1/4] dt-bindings: fsl: scu: add watchdog binding
Hi, Rob

Best Regards!
Anson Huang

> -----Original Message-----
> From: Rob Herring [mailto:r...@kernel.org]
> Sent: March 12, 2019 5:26
> To: Aisheng Dong
> Cc: Anson Huang ; Guenter Roeck us.net>; mark.rutl...@arm.com; ulf.hans...@linaro.org; he...@sntech.de; catalin.mari...@arm.com; will.dea...@arm.com; bjorn.anders...@linaro.org; feste...@gmail.com; ja...@amarulasolutions.com; Andy Gross ; dl-linux-imx ; devicet...@vger.kernel.org; linux-watch...@vger.kernel.org; a...@arndb.de; marc.w.gonza...@free.fr; s.ha...@pengutronix.de; enric.balle...@collabora.com; horms+rene...@verge.net.au; w...@linux-watchdog.org; Daniel Baluta ; linux-arm-ker...@lists.infradead.org; linux-ker...@vger.kernel.org; ker...@pengutronix.de; o...@lixom.net; shawn...@kernel.org; Jens Wiklander
> Subject: Re: [PATCH RESEND V2 1/4] dt-bindings: fsl: scu: add watchdog binding
>
> +Jens W
>
> On Thu, Mar 7, 2019 at 6:22 AM Aisheng Dong wrote:
> >
> > Hi Rob,
> >
> > > > > I think Rob suggested that the SCU parent driver should
> > > > > instantiate the watchdog without explicit watchdog node. That
> > > > > would be possible, but it currently uses
> > > > > devm_of_platform_populate() to do the instantiation, and
> > > > > changing that would be a mess. Besides, it does seem to me that
> > > > > your suggested node would describe the hardware, so I am not
> > > > > sure I understand the reasoning.
> > >
> > > It would just be a call to create a platform device instead. How is that a mess?
> > >
> > > It's describing firmware. We have DT for describing h/w we've failed
> > > to make discoverable. We should not repeat that and just describe firmware in DT.
> > > Make the firmware discoverable! Though there are cases like firmware
> > > provided clocks where we still need something in DT, but this is not
> > > one of them.
> >
> > The watchdog node here in question actually is not using SCU firmware call.
> > Due to security requirement by SCU, watchdog can only be accessed in > > security mode, for IMX case, via ARM Trust Firmware. That means the > > watchdog used in Linux actually is using ARM SMC call and does not > > depend SCU driver. So It would be strange for SCU driver to instantiate it. > > > > For this situation, do you think we can move watchdog out of scu node? > > Maybe rename the compatible string like "fsl,imx8qxp-sip-watchdog" > > because it's actually a watchdog serviced by ATF firmware. > > Yes, but that creates more questions. What exactly does ATF talk to for the > watchdog? The SCU firmware? Yes, ATF talks to SCU firmware directly, Linux kernel watchdog driver call SMC instructions to send command to ATF, and ATF will call SCU firmware API to finish the operation requested by Linux kernel watchdog driver. > > Maybe ATF should define and provide a standard watchdog interface? It is > still a question of making the firmware discoverable rather than trying to > describe the firmware in DT. The SMC call by Linux kernel watchdog already follow the SIP(silicon provider) standard, each SoC can define its own protocol for SIP. ATF does NOT have a standard common watchdog interface now, since it is more like a platform specific feature, most of platforms can control watchdog directly from Linux kernel I think. So, do you have suggestion for this case? Either find a place in DT to put watchdog node, or make it a build-in device inside SCU driver? Or you have other better ideas? Thanks, Anson. > > Rob
Re: [PATCH 5/5] lib/list_sort: Optimize number of calls to comparison function
On Thu, 14 Mar 2019 at 00:28:16 +0100, Rasmus Villemoes wrote: > On 05/03/2019 06.58, George Spelvin wrote: >> This patch avoids badly unbalanced merges on unlucky input sizes. >> It slightly increases the code size, but saves an average of 0.2*n >> calls to cmp(). >> > [snip] >> >> (I confess to being more than a little bit proud of how clean this >> code turned out. It took a lot of thinking, but the resultant inner >> loop is very simple and efficient.) > > This is beautiful. So no comments on the patch itself. One thing that > might be nice would be to see the reduction in number of cmp callbacks > explicitly; it should be trivial to use the priv element for that in the > list_sort_test module. But to really see it one would of course have to > extend that test to do a lot more different sizes of lists. And you'd have to compile and run two different kernels, because you can't run the algorithms side-by-side. Not a good way to do it. I used a user-space test harness for testing. The output is ugly, but here are lots of numbers for various list sizes. The first group is single sort invocations. The list is sorted by (random) key, then again by address (the inverse), and the number of comparisons printed. The numbers in parens are the linear coefficient K in comparisons = n*log2(n) - K*n, i.e. it's log2(n) - (comparisons / n). Higher is better. The three lines for each size are the original list_sort, a top-down version I wrote as a "perfect" value for reference, and the posted code. (1034 and 1040 are particularly bad cases for the old code.) 
1: 0 (0.00) 0 (0.00)
1: 0 (0.00) 0 (0.00)
1: 0 (0.00) 0 (0.00)
2: 1 (0.50) 1 (0.50)
2: 1 (0.50) 1 (0.50)
2: 1 (0.50) 1 (0.50)
3: 3 (0.584963) 3 (0.584963)
3: 2 (0.918296) 2 (0.918296)
3: 3 (0.584963) 3 (0.584963)
4: 5 (0.75) 5 (0.75)
4: 5 (0.75) 5 (0.75)
4: 5 (0.75) 5 (0.75)
5: 8 (0.721928) 8 (0.721928)
5: 8 (0.721928) 8 (0.721928)
5: 8 (0.721928) 8 (0.721928)
6: 10 (0.918296) 10 (0.918296)
6: 10 (0.918296) 10 (0.918296)
6: 10 (0.918296) 10 (0.918296)
7: 11 (1.235926) 12 (1.093069)
7: 13 (0.950212) 12 (1.093069)
7: 11 (1.235926) 12 (1.093069)
8: 17 (0.875000) 17 (0.875000)
8: 17 (0.875000) 17 (0.875000)
8: 17 (0.875000) 17 (0.875000)
9: 18 (1.169925) 22 (0.725481)
9: 21 (0.836592) 20 (0.947703)
9: 20 (0.947703) 20 (0.947703)
10: 20 (1.321928) 25 (0.821928)
10: 25 (0.821928) 24 (0.921928)
10: 22 (1.121928) 24 (0.921928)
13: 35 (1.008132) 35 (1.008132)
13: 33 (1.161978) 33 (1.161978)
13: 36 (0.931209) 35 (1.008132)
21: 71 (1.011365) 74 (0.868508)
21: 68 (1.154222) 71 (1.011365)
21: 69 (1.106603) 72 (0.963746)
31: 117 (1.180003) 111 (1.373551)
31: 114 (1.276777) 114 (1.276777)
31: 117 (1.180003) 111 (1.373551)
32: 123 (1.156250) 121 (1.218750)
32: 123 (1.156250) 121 (1.218750)
32: 123 (1.156250) 121 (1.218750)
33: 158 (0.256515) 149 (0.529243)
33: 130 (1.105000) 131 (1.074697)
33: 131 (1.074697) 131 (1.074697)
34: 142 (0.910992) 135 (1.116875)
34: 133 (1.175698) 135 (1.116875)
34: 131 (1.234522) 134 (1.146286)
55: 244 (1.344996) 249 (1.254087)
55: 246 (1.308632) 241 (1.399542)
55: 249 (1.254087) 250 (1.235905)
89: 484 (1.037531) 490 (0.970115)
89: 464 (1.262250) 477 (1.116183)
89: 479 (1.093711) 482 (1.060003)
127: 729 (1.248527) 727 (1.264275)
127: 734 (1.209157) 724 (1.287897)
127: 729 (1.248527) 727 (1.264275)
128: 744 (1.187500) 733 (1.273438)
128: 744 (1.187500) 733 (1.273438)
128: 744 (1.187500) 733 (1.273438)
129: 752 (1.181770) 853 (0.398824)
129: 746 (1.228282) 740 (1.274793)
129: 747 (1.220530) 741 (1.267041)
144: 926 (0.739369) 928 (0.725481)
144: 851 (1.260203) 866 (1.156036)
144: 865 (1.162981) 872 (1.114369)
233: 1556 (1.186075) 1541 (1.250452)
233: 1545 (1.233285) 1527 (1.310538)
233: 1550 (1.211826) 1534 (1.280495)
377: 2787 (1.165848) 2790 (1.157890)
377: 2752 (1.258686) 2771 (1.208288)
377: 2778 (1.189720) 2782 (1.179110)
610: 5115 (0.867420) 5115 (0.867420)
610: 4891 (1.234633) 4883 (1.247747)
610: 4909 (1.205124) 4930 (1.170698)
642: 5385 (0.938579) 5428 (0.871601)
642: 5166 (1.279701) 5185 (1.250105)
642: 5205 (1.218953) 5201 (1.225183)
987: 8574 (1.259976) 8620 (1.213370)
987: 8564 (1.270108) 8599 (1.234647)
987: 8565 (1.269095) 8614 (1.219449)
1022: 8937 (1.252561) 8913 (1.276044)
1022: 8916 (1.273109) 8928 (1.261367)
1022: 8937 (1.252561) 8913 (1.276044)
1023: 8959 (1.241015) 8909 (1.289891)
1023: 8927 (1.272295) 8918 (1.281093)
1023: 8959 (1.241015) 8909 (1.289891)
1024: 8970 (1.240234) 8916 (1.292969)
1024: 8966 (1.244141) 8912 (1.296875)
1024: 8966 (1.244141) 8912 (1.296875)
1025: 9548 (0.686286) 9724 (0.514579)
1025: 8971 (1.249213) 8943 (1.276530)
1025: 8970 (1.250189) 8944 (1.27)
1026: 9771 (0.479423) 9745 (0.504764)
1026: 8936 (1.293263) 8978 (1.252328)
1026: 8942 (1.287415) 8993 (1.237708)
1028: 9643 (0.625274) 9985 (0.292590)
1028: 8980 (1.270216) 9006 (1.244924)
1028: 8980 (1.270216) 9005 (1.245897)
1032: 9894 (0.424018) 10004
Re: [LKP] [ext4] fde872682e: fsmark.files_per_sec -38.0% regression
Theodore Ts'o writes: > On Wed, Mar 13, 2019 at 03:26:39PM +0800, huang ying wrote: >> > >> > >> > commit: fde872682e175743e0c3ef939c89e3c6008a1529 ("ext4: force inode >> > writes when nfsd calls commit_metadata()") >> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master >> >> It appears that this is a performance regression caused by a >> functionality fixing. So we should ignore this? > > Yes, this is a correctness issue that we discovered while tracking > down user data loss issue after a crash of the NFS server, so this is > a change we have to keep. When the NFS folks added the > commit_metadata() hook, they didn't realize that the fallback path in > nfsd/vfs.c using sync_inode_metadata() doesn't work on all file > systems --- and in particular doesn't work for ext3 and ext4 because > of how we do journalling. > > It only applies to NFS serving, not local ext4 use cases, so most ext4 > users won't be impacted on it; only those who export those file > systems using NFS. > > I do have some plans on how to claw back the performance hit. The > good news is that it won't require an on-disk format change; the bad > news is that it's a non-trivial change to how journalling works, and > it's not something we can backport to the stable kernel series. It's > something we're going to have to leave to a distribution who is > willing to do a lot of careful regression testing, once the change is > available, maybe in 3 months or so. > > - Ted > > P.S. I *believe* all other file systems should be OK, and I didn't > want to impose a performance tax on all other file systems (such as > btrfs), so I fixed it in an ext4-specific way. The more > general/conservative change would be to fall back to using fsync in > nfs/vfs.c:commit_metadata() unless the file system specifically set a > superblock flag indicating that using sync_inode_metadata is safe. > OTOH we lived with this flaw in ext3/ext4 for *years* without anyone > noticing or complaining, so Got it! 
Thanks for your detailed explanation! Best Regards, Huang, Ying
Re: [PATCH v3] lib/siphash.c: annotate implicit fall throughs
On Wed, 2019-03-13 at 22:12 +0100, Mathieu Malaterre wrote: > There is a plan to build the kernel with -Wimplicit-fallthrough and > these places in the code produced warnings (W=1). Fix them up. > > This commit remove the following warnings: > > lib/siphash.c:71:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > lib/siphash.c:72:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > lib/siphash.c:73:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > lib/siphash.c:75:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > lib/siphash.c:108:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > lib/siphash.c:109:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > lib/siphash.c:110:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > lib/siphash.c:112:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > lib/siphash.c:434:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > lib/siphash.c:462:12: warning: this statement may fall through > [-Wimplicit-fallthrough=] > > Move the break statement onto the next line to match the fall-through > comment pattern. Also move the trailing statement onto the next line to > pass checkpatch verification. [] > diff --git a/lib/siphash.c b/lib/siphash.c []. > @@ -68,13 +68,26 @@ u64 __siphash_aligned(const void *data, size_t len, const > siphash_key_t *key) > bytemask_from_count(left))); > #else > switch (left) { > - case 7: b |= ((u64)end[6]) << 48; > - case 6: b |= ((u64)end[5]) << 40; > - case 5: b |= ((u64)end[4]) << 32; It might also be worth not casting to u64 then shift as that can be moderately expensive on 32 bit systems and instead use ((char *))[]. 
> - case 4: b |= le32_to_cpup(data); break; > - case 3: b |= ((u64)end[2]) << 16; Perhaps an unnecessary cast before shift > - case 2: b |= le16_to_cpup(data); break; > - case 1: b |= end[0]; [] > @@ -101,13 +114,26 @@ u64 __siphash_unaligned(const void *data, size_t len, > const siphash_key_t *key) > bytemask_from_count(left))); > #else > switch (left) { > - case 7: b |= ((u64)end[6]) << 48; > - case 6: b |= ((u64)end[5]) << 40; > - case 5: b |= ((u64)end[4]) << 32; etc... > - case 4: b |= get_unaligned_le32(end); break; > - case 3: b |= ((u64)end[2]) << 16; > - case 2: b |= get_unaligned_le16(end); break; > - case 1: b |= end[0];
Re: [PATCH v1 2/2] interconnect: qcom: Add tagging and wake/sleep support for sdm845
On 3/8/2019 10:35 AM, Evan Green wrote: On Fri, Feb 8, 2019 at 9:22 AM Georgi Djakov wrote: From: David Dai Add support for wake and sleep commands by using a tag to indicate whether or not the aggregate and set requests are active only or dual context for a particular path. Signed-off-by: David Dai Signed-off-by: Georgi Djakov --- drivers/interconnect/qcom/sdm845.c | 101 + 1 file changed, 75 insertions(+), 26 deletions(-) diff --git a/drivers/interconnect/qcom/sdm845.c b/drivers/interconnect/qcom/sdm845.c index fb526004c82e..13499f681160 100644 --- a/drivers/interconnect/qcom/sdm845.c +++ b/drivers/interconnect/qcom/sdm845.c @@ -65,6 +65,12 @@ struct bcm_db { #define SDM845_MAX_BCMS 30 #define SDM845_MAX_BCM_PER_NODE 2 #define SDM845_MAX_VCD 10 +#define SDM845_MAX_CTX 2 +#define SDM845_EE_STATE 2 +#define EE_STATE_WAKE 0 +#define EE_STATE_SLEEP 1 +#define AO_CTX 0 +#define DUAL_CTX 1 I get really lost with these two sets of numbers here. I think this needs some explanation in the comments. From staring at this what I've understood so far is: 1) Clients pass in a tag that buckets their requests into either AO or DUAL within the node. (Although, the requests all seem to get aggregated together no matter what the tag is, more on that later). 2) During icc_set(), we go through the many-to-many mapping of BCMs to nodes. For each BCM, we aggregate all the nodes it has, bucketed again by AO or DUAL. 3) The actual votes sent to RPMh are in terms of WAKE and SLEEP. 4) The mapping from AO/DUAL to WAKE/SLEEP for a particular BCM is: WAKE = AO+DUAL, SLEEP=DUAL. 5) qcom_icc_set() sends EE_STATE_WAKE stuff to RPMH_ACTIVE_ONLY_STATE. 6) If there's any difference between SLEEP and WAKE, then the EE_STATE_WAKE commands are gathered together and sent to RPMH_WAKE_ONLY_STATE, and all the EE_STATE_SLEEP commands are sent to RPMH_SLEEP_STATE. 
So ultimately the muxing ends up like RPMH_ACTIVE_ONLY_STATE <-- EE_STATE_WAKE <-- AO+DUAL RPMH_SLEEP_STATE <-- EE_STATE_SLEEP <-- DUAL RPMH_WAKE_ONLY_STATE <-- EE_STATE_WAKE <-- AO+DUAL This muxing explanation is correct, there's also an inherent limitation in this muxing where RPMH_ACTIVE cannot differ from RPMH_WAKE. Why do we need this complicated muxing to happen? Is it because we're trying to avoid requiring drivers to specify three different paths in the simple case, when all they care about is "use this bandwidth all the time"? That's a part of it, I don't have strong justification for this legacy way of supporting wake/sleep, so I'm open to new suggestions. What I think would make more sense would be to use the "tag" as a bitfield instead. So you'd have #define QCOM_ICC_TAG_ACTIVE_ONLY 0x0001 #define QCOM_ICC_TAG_SLEEP 0x0002 #define QCOM_ICC_TAG_WAKE 0x0004 #define QCOM_ICC_TAG_ALL_TIMES (QCOM_ICC_TAG_ACTIVE_ONLY | QCOM_ICC_TAG_SLEEP | QCOM_ICC_TAG_WAKE) Drivers that don't care about sleep/wake sets would just not set a tag, or set a tag of QCOM_ICC_TAG_ALL_TIMES. Then in qcom_icc_aggregate(), you aggregate the same values up to three times, one for each bit they have set. Finally in qcom_icc_set, you can pass the votes directly down in their buckets, without doing a weird mapping from AO/DUAL to SLEEP/WAKE/ACTIVE_ONLY. Works for me, I'd still want to optimize out the WAKE/SLEEP commands that are equal to each other though. The sticky part about this is that now rather than one set of sum/peak values, there are now three independent ones, corresponding to each of the tag bits. It seems like the interconnect core wouldn't want to mandate whether providers use the tag as a bitfield or a value. So the provider would need to keep not only the official set of aggregates that are returned back to the core (probably ACTIVE_ONLY, or maybe a max of the three), but also the two shadow copies internally for SLEEP and WAKE. 
I would ideally like to return the most accurate current state of the system, which may be ACTIVE_ONLY or WAKE. We might also run into some confusion when printing out the summaries for each node. In a case where the votes for WAKE != ACTIVE (assuming one or more consumers are intentionally doing this with multiple paths), the state of the system will change depending on whether or not we've completed a sleep cycle (if it's prior to sleep, the ACTIVE_ONLY vote will be accurate, and after waking up the WAKE request will be correct), but this is more of a book-keeping concern than anything else, as long as consumers understand that this is the expected behavior. On a side note, I don't like the ACTIVE_ONLY name designation for this, which was brought over from the rpmh driver; it really means "set the current state immediately" and it has no association with the execution environment state, unlike the other WAKE/SLEEP types. If you organized the tag that way, the only extra thing
(.text+0x79c): warning: Using 'getpwuid' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
tree: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master head: fa3d493f7a573b4e4e2538486e912093a0161c1b commit: 0053102a869f1b909904b1b85ac282e2744deaab um: Include sys/uio.h to have writev() date: 3 months ago config: um-allmodconfig (attached as .config) compiler: gcc-7 (Debian 7.3.0-1) 7.3.0 reproduce: git checkout 0053102a869f1b909904b1b85ac282e2744deaab # save the attached .config to linux build tree make ARCH=um All warnings (new ones prefixed by >>): arch/um/drivers/vde.o: In function `vde_open_real': (.text+0x951): warning: Using 'getgrnam' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking >> (.text+0x79c): warning: Using 'getpwuid' in statically linked applications >> requires at runtime the shared libraries from the glibc version used for >> linking arch/um/drivers/vector_user.o: In function `user_init_socket_fds': vector_user.c:(.text+0x369): warning: Using 'getaddrinfo' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking arch/um/drivers/pcap.o: In function `pcap_nametoaddr': (.text+0xdee5): warning: Using 'gethostbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking arch/um/drivers/pcap.o: In function `pcap_nametonetaddr': (.text+0xdf85): warning: Using 'getnetbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking arch/um/drivers/pcap.o: In function `pcap_nametoproto': (.text+0xe1a5): warning: Using 'getprotobyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking arch/um/drivers/pcap.o: In function `pcap_nametoport': (.text+0xdfd7): warning: Using 'getservbyname' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking arch/um/drivers/vde.o: In 
function `vde_open_real': (.text+0x951): warning: Using 'getgrnam' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
Re: [RFC][PATCH 00/16] sched: Core scheduling
>> >> One more NULL pointer dereference: >> >> Mar 12 02:24:46 aubrey-ivb kernel: [ 201.916741] core sched enabled >> [ 201.950203] BUG: unable to handle kernel NULL pointer dereference >> at 0008 >> [ 201.950254] [ cut here ] >> [ 201.959045] #PF error: [normal kernel read fault] >> [ 201.964272] !se->on_rq >> [ 201.964287] WARNING: CPU: 22 PID: 2965 at kernel/sched/fair.c:6849 >> set_next_buddy+0x52/0x70 > > A quick workaround below: > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > index 1d0dac4fd94f..ef6acfe2cf7d 100644 > --- a/kernel/sched/fair.c > +++ b/kernel/sched/fair.c > @@ -6834,7 +6834,7 @@ static void set_last_buddy(struct sched_entity *se) > return; > > for_each_sched_entity(se) { > - if (SCHED_WARN_ON(!se->on_rq)) > + if (SCHED_WARN_ON(!(se && se->on_rq))) > return; > cfs_rq_of(se)->last = se; > } > @@ -6846,7 +6846,7 @@ static void set_next_buddy(struct sched_entity *se) > return; > > for_each_sched_entity(se) { > - if (SCHED_WARN_ON(!se->on_rq)) > + if (SCHED_WARN_ON(!(se && se->on_rq))) Shouldn't for_each_sched_entity(se) skipping the code block for the !se case have avoided the NULL pointer access of se? Since #define for_each_sched_entity(se) \ for (; se; se = se->parent) Scratching my head a bit here on how your changes would have made a difference. In your original log, I wonder if the !se->on_rq warning on CPU 22 is mixed with the actual Oops? I also saw rb_insert_color in your original log. Wonder if that was actually the source of the Oops? 
[ 202.078674] RIP: 0010:set_next_buddy+0x52/0x70 [ 202.090135] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.99.99.x058.082120120902 08/21/2012 [ 202.090144] RIP: 0010:rb_insert_color+0x17/0x190 [ 202.101623] Code: 48 85 ff 74 10 8b 47 40 85 c0 75 e2 80 3d 9e e5 6a 01 00 74 02 f3 c3 48 c7 c7 5c 05 2c 82 c6 05 8c e5 6a 01 01 e8 2e bb fb ff <0f> 0b c3 83 bf 04 03 0e [ 202.113216] Code: f3 c3 31 c0 c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 48 8b 17 48 85 d2 0f 84 4d 01 00 00 48 8b 02 a8 01 0f 85 6d 01 00 00 <48> 8b 48 08 49 89 c0 44 [ 202.118263] RSP: 0018:c9000a5cbbb0 EFLAGS: 00010086 [ 202.129858] RSP: 0018:c9000a463cc0 EFLAGS: 00010046 [ 202.135102] RAX: RBX: 88980047e800 RCX: [ 202.135105] RDX: 888be28caa40 RSI: 0001 RDI: 8110c3fa [ 202.156251] RAX: RBX: 888bfeb8 RCX: 888bfeb80 Thanks. Tim
Re: [PATCH v7 02/15] sched/core: uclamp: Enforce last task UCLAMP_MAX
On Wed, Mar 13, 2019 at 9:16 AM Patrick Bellasi wrote: > > On 13-Mar 15:12, Peter Zijlstra wrote: > > On Fri, Feb 08, 2019 at 10:05:41AM +, Patrick Bellasi wrote: > > > +static inline void uclamp_idle_reset(struct rq *rq, unsigned int > > > clamp_id, > > > +unsigned int clamp_value) > > > +{ > > > + /* Reset max-clamp retention only on idle exit */ > > > + if (!(rq->uclamp_flags & UCLAMP_FLAG_IDLE)) > > > + return; > > > + > > > + WRITE_ONCE(rq->uclamp[clamp_id].value, clamp_value); > > > + > > > + /* > > > +* This function is called for both UCLAMP_MIN (before) and UCLAMP_MAX > > > +* (after). The idle flag is reset only the second time, when we know > > > +* that UCLAMP_MIN has been already updated. > > > > Why do we care? That is, what is this comment trying to tell us. > > Right, the code is clear enough, I'll remove this comment. It would be probably even clearer if rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE is done from inside uclamp_rq_inc after uclamp_rq_inc_id for both clamps is called. > > > > > +*/ > > > + if (clamp_id == UCLAMP_MAX) > > > + rq->uclamp_flags &= ~UCLAMP_FLAG_IDLE; > > > +} > > -- > #include > > Patrick Bellasi
Re: [PATCH 2/5] lib/sort: Use more efficient bottom-up heapsort variant
On 14/03/2019 01.03, George Spelvin wrote: > On Wed, 13 Mar 2019 at 23:29:40 +0100, Rasmus Villemoes wrote: > >> Nice! > > Thank you. May I translate that into Acked-by? > Sort-of. I prefer first seeing the full rerolled series for context etc., even if (the important parts of) this patch would be unchanged.
Re: [PATCH 2/5] lib/sort: Use more efficient bottom-up heapsort variant
On Wed, 13 Mar 2019 at 23:29:40 +0100, Rasmus Villemoes wrote: > On 21/02/2019 09.21, George Spelvin wrote: >> +/** >> + * parent - given the offset of the child, find the offset of the parent. >> + * @i: the offset of the heap element whose parent is sought. Non-zero. >> + * @lsbit: a precomputed 1-bit mask, equal to "size & -size" >> + * @size: size of each element >> + * >> + * In terms of array indexes, the parent of element j = i/size is simply >> + * (j-1)/2. But when working in byte offsets, we can't use implicit >> + * truncation of integer divides. >> + * >> + * Fortunately, we only need one bit of the quotient, not the full divide. >> + * size has a least significant bit. That bit will be clear if i is >> + * an even multiple of size, and set if it's an odd multiple. >> + * >> + * Logically, we're doing "if (i & lsbit) i -= size;", but since the >> + * branch is unpredictable, it's done with a bit of clever branch-free >> + * code instead. >> + */ >> +__attribute_const__ __always_inline >> +static size_t parent(size_t i, unsigned int lsbit, size_t size) >> +{ >> +i -= size; >> +i -= size & -(i & lsbit); >> +return i / 2; >> +} >> + > > Really nice :) I had to work through this by hand, but it's solid. Thank you! Yes, the way the mask doesn't include the low-order bits that don't matter anyway is a bit subtle. When the code is subtle, use lots of comments. The entire reason for making this a separate helper function is to leave room for the large comment. >> +unsigned const lsbit = size & -size;/* Used to find parent */ > > Nit: qualifier before type, "const unsigned". And this sets ZF, so a > paranoid check for zero size (cf. the other mail) by doing "if (!lsbit) > return;" is practically free. Though it's probably a bit obscure doing > it that way... Actually, this is a personal style thing which I can ignore for the sake of the kernel, but I believe that it's better to put the qualifier *after* the type. This is due to C's pointer declaration syntax. 
The standard example of the issue is: typedef char *pointer; const char *a; char const *b; char * const c; const pointer d; pointer const e; Now, which variables are the same types? The answer is that a & b are the same (mutable pointer to const char), and c, d & e are the same (const pointer to mutable char). If you make a habit of putting the qualifier *after* the type, then a simple "textual substitution" mental model for the typedef works, and it's clear that c and e are the same. It's also clear that b cannot be represented by the typedef because the const is between "char" and "*", and you obviously can't do that with the typedef. But if you put the qualifier first, it's annoying to remember why a and d are not the same type. So I've deliberately cultivated the style of putting the qualifier after the type. But if the kernel prefers it before... >> +if (!n) >> +return; > > I'd make that n <= 1. Shouldn't be much more costly. (Actually, it's "num <= 1"; n is the pre-multiplied form so n <= 1 can only happen when sorting one one-byte value.) I actually thought about this and decided not to bother. I did it this way during development to stress the general-case code. But should I change it? === NEVER MIND === I had written a long reply justifying leaving it alone to save one instruction when the light dawned: I can do *both* tests in one step with size_t n = num * size, a = (num/2) * size; unsigned const lsbit = size & -size;/* Used to find parent */ if (!a) /* num < 2 || size == 0 */ return; So now everyone's happy. > Nice! Thank you. May I translate that into Acked-by?
Re: [PATCH v1] Bluetooth: hci_qca: Enable the ldisc for ROME for x86 platforms.
On Wed, Mar 13, 2019 at 05:43:14PM +0800, rjl...@codeaurora.org wrote: > 在 2019-03-12 23:52,Matthias Kaehlcke 写道: > > Hi Rocky, > > > > On Tue, Mar 12, 2019 at 05:01:59PM +0800, rjl...@codeaurora.org wrote: > > > 在 2019-03-09 02:52,Matthias Kaehlcke 写道: > > > > On Fri, Mar 08, 2019 at 10:43:14AM +0530, Balakrishna Godavarthi wrote: > > > > > Hi Matthias, > > > > > > > > > > On 2019-03-08 02:12, Matthias Kaehlcke wrote: > > > > > > Hi Balakrishna, > > > > > > > > > > > > On Thu, Mar 07, 2019 at 03:47:22PM +0530, Balakrishna Godavarthi > > > > > > wrote: > > > > > > > When using btattach to setup Rome over ldisc we observed a crash > > > > > > > in qca_setup as it will try to access the serdev which is not > > > > > > > available in the ldisc proto. This patch will fix the crash by > > > > > > > support both the ldisc and serdev way in the qca hci_uart driver. > > > > > > > > > > > > > > Signed-off-by: Balakrishna Godavarthi > > > > > > > > > > > > Oh, I wasn't aware of the instantiation through ldisc and was > > > > > > actually > > > > > > considering to *remove* some of the seemingly unnecessary serdev > > > > > > checks. 
> > > > > > > > > > > > > --- > > > > > > > drivers/bluetooth/hci_qca.c | 47 > > > > > > > ++--- > > > > > > > 1 file changed, 28 insertions(+), 19 deletions(-) > > > > > > > > > > > > > > diff --git a/drivers/bluetooth/hci_qca.c > > > > > > > b/drivers/bluetooth/hci_qca.c > > > > > > > index 237aea34b69f..0a5c98d46864 100644 > > > > > > > --- a/drivers/bluetooth/hci_qca.c > > > > > > > +++ b/drivers/bluetooth/hci_qca.c > > > > > > > @@ -963,7 +963,7 @@ static int qca_set_baudrate(struct hci_dev > > > > > > > *hdev, uint8_t baudrate) > > > > > > > { > > > > > > > struct hci_uart *hu = hci_get_drvdata(hdev); > > > > > > > struct qca_data *qca = hu->priv; > > > > > > > - struct qca_serdev *qcadev; > > > > > > > + struct qca_serdev *qcadev = NULL; > > > > > > > > > > > > In many cases the only field that is accessed is > > > > > > qcadev->btsoc_type. I > > > > > > think something like 'qca_get_soc_type(struct hci_dev *hdev / struct > > > > > > hci_uart *hu)' would make things more readable. > > > > > > > > > > > [Bala]: sure will update this in other patch once this change is > > > > > landed as > > > > > this has to > > > > > go in priority as we have crash coming. > > > > > > > > That's not how things should work, especially for fairly trivial > > > > changes. It requires reviewers to first spent time to review the patch > > > > that adds clutter and later spend more time to review the one that > > > > removes it. It's also easier to get a clean patch merged in the first > > > > place, rather than a noisy one. > > > > > > > > Anyway, here is my take at it: > > > > https://lore.kernel.org/patchwork/patch/1049014/ > > > > > > > > Please help with testing for ROME, unless you disagree with the > > > > approach. 
> > > > > > > > Thanks > > > > > > > > Matthias > > > > > > Hi Matthias, > > > > > > I will test your patch and update to you, and you are correct that > > > AR3002 is > > > not part of Rome family, you should use QCA_ROME as the default > > > return of > > > qca_soc_type. > > > > Thanks for the confirmation! > > > > > Could you also loop me in > > > https://lore.kernel.org/patchwork/patch/1049014/? > > > > This patch has been superseded by a newer version: > > > > https://lore.kernel.org/patchwork/patch/1049696/ > > > > It already landed in bluetooth-next. > > > > Testing with Rome and ldisc would still be appreciated, since I don't > > have such a configuration. > > > > Thanks > > > > Matthias > > Hi Matthias, > > I verified your change and found there is another deference to serdev in the > qca_set_baudrate() function while running the ldisc proto, it will cause a > crash and need to add a check as below. Could you help to add this change? > > < serdev_device_wait_until_sent(hu->serdev, > < msecs_to_jiffies(CMD_TRANS_TIMEOUT_MS)); > --- > > if (hu->serdev) > > serdev_device_wait_until_sent(hu->serdev, > > msecs_to_jiffies(CMD_TRANS_TIMEOUT_MS)); argh, I shouldn't have missed this, thanks for testing! Here is a fix: https://lore.kernel.org/patchwork/patch/1050594/
Re: [PATCH 0/3] userfaultfd: allow to forbid unprivileged users
On Wed, Mar 13, 2019 at 01:01:40PM -0700, Mike Kravetz wrote: > On 3/13/19 11:52 AM, Andrea Arcangeli wrote: > > > > hugetlbfs is more complicated to detect, because even if you inherit > > it from fork(), the services that mounts the fs may be in a different > > container than the one that Oracle that uses userfaultfd later on down > > the road from a different context. And I don't think it would be ok to > > allow running userfaultfd just because you can open a file in an > > hugetlbfs file system. With /dev/kvm it's a bit different, that's > > chmod o-r by default.. no luser should be able to open it. > > > > Unless somebody suggests a consistent way to make hugetlbfs "just > > work" (like we could achieve clean with CRIU and KVM), I think Oracle > > will need a one liner change in the Oracle setup to echo into that > > file in addition of running the hugetlbfs mount. > > I think you are suggesting the DB setup process enable uffd for all users. > Correct? Yes. In addition of the hugetlbfs setup, various apps requires to also increase fs.inotify.max_user_watches or file-max and other tweaks, this would be one of those tweaks. > This may be too simple, and I don't really like group access, but how about > just defining a uffd group? If you are in the group you can make uffd > system calls. Everything is possible, I'm just afraid it gets too complex. So you suggest to echo a gid into the file?
[PATCH] Bluetooth: hci_qca: Fix crash with non-serdev devices
qca_set_baudrate() calls serdev_device_wait_until_sent() assuming that the HCI is always associated with a serdev device. This isn't true for ROME controllers instantiated through ldisc, where the call causes a crash due to a NULL pointer dereference. Only call the function when we have a serdev device. The timeout for ROME devices at the end of qca_set_baudrate() is long enough to be reasonably sure that the command was sent. Fixes: fa9ad876b8e0 ("Bluetooth: hci_qca: Add support for Qualcomm Bluetooth chip wcn3990") Reported-by: Balakrishna Godavarthi Reported-by: Rocky Liao Signed-off-by: Matthias Kaehlcke --- drivers/bluetooth/hci_qca.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/bluetooth/hci_qca.c b/drivers/bluetooth/hci_qca.c index 4ea995d610d2..714a6a16f9d5 100644 --- a/drivers/bluetooth/hci_qca.c +++ b/drivers/bluetooth/hci_qca.c @@ -1004,7 +1004,8 @@ static int qca_set_baudrate(struct hci_dev *hdev, uint8_t baudrate) while (!skb_queue_empty(&qca->txq)) usleep_range(100, 200); - serdev_device_wait_until_sent(hu->serdev, + if (hu->serdev) + serdev_device_wait_until_sent(hu->serdev, msecs_to_jiffies(CMD_TRANS_TIMEOUT_MS)); /* Give the controller time to process the request */ -- 2.21.0.360.g471c308f928-goog
Re: [PATCH 0/3] userfaultfd: allow to forbid unprivileged users
Hi Paolo, On Wed, Mar 13, 2019 at 03:12:28PM -0400, Paolo Bonzini wrote: > > > On Wed, Mar 13, 2019 at 09:22:31AM +0100, Paolo Bonzini wrote: > > Unless somebody suggests a consistent way to make hugetlbfs "just > > work" (like we could achieve clean with CRIU and KVM), I think Oracle > > will need a one liner change in the Oracle setup to echo into that > > file in addition of running the hugetlbfs mount. > > Hi Andrea, can you explain more in detail the risks of enabling > userfaultfd for unprivileged users? There's no more risk than in allowing mremap (direct memory overwrite after underflow) or O_DIRECT (dirtycow) for unprivileged users. If there was risk in allowing userfaultfd for unprivileged users, CAP_SYS_PTRACE would be mandatory and there wouldn't be an option to allow userfaultfd to all unprivileged users. This is just an hardening policy in kernel (for those that don't run everything under seccomp) that may even be removed later. Unlike mremap and O_DIRECT, because we've only an handful of (important) applications using userfaultfd so far, we can do like the bpf syscall: SYSCALL_DEFINE3(bpf, int, cmd, union bpf_attr __user *, uattr, unsigned int, size) { union bpf_attr attr = {}; int err; if (sysctl_unprivileged_bpf_disabled && !capable(CAP_SYS_ADMIN)) return -EPERM; Except we picked CAP_SYS_PTRACE because CRIU already has to run with such capability for other reasons. So this is intended as the "sysctl_unprivileged_bpf_disabled" trick applied to the bpf syscall, also applied to the userfaultfd syscall, nothing more nothing less. Then I thought we can add a tristate so an open of /dev/kvm would also allow the syscall to make things more user friendly because unprivileged containers ideally should have writable mounts done with nodev and no matter the privilege they shouldn't ever get an hold on the KVM driver (and those who do, like kubevirt, will then just work). 
There has been one complaint because userfaultfd can also be used to facilitate exploiting use-after-free bugs in other kernel code: https://cyseclabs.com/blog/linux-kernel-heap-spray This isn't a bug in userfaultfd; the only bug there is some kernel code doing a use-after-free in a reproducible place, and userfaultfd only allows stopping copy-user where the exploit needs it to be stopped. This isn't particularly concerning, because you can achieve the same objective with FUSE. In fact even if you set CONFIG_USERFAULTFD=n in the kernel config and CONFIG_FUSE_FS=n, a security bug like that can still be exploited eventually. It's just less reproducible if you can't stop copy-user. Restricting userfaultfd to privileged processes won't make such a kernel bug become harmless; it'll just require more attempts to exploit, as far as I can tell. To put things in perspective, the exploits for the most serious security bugs like the mremap missing underflow check, dirtycow or the missing stack_guard_gap would not get any facilitation from userfaultfd. I also thought we were randomizing all slab heaps by now to avoid issues like the above; I don't know if the randomized slab freelist order (CONFIG_SLAB_FREELIST_RANDOM) and the hardened freelist pointer (CONFIG_SLAB_FREELIST_HARDENED) were added precisely to avoid exploits like the above. It's not like you can disable those two options even if you set CONFIG_USERFAULTFD=n. I wonder if in that blog post the exploit was run on a kernel with those two options enabled. In any case, not allowing unprivileged processes to run the userfaultfd syscall will also provide some hardening against this kind of concern. Thanks, Andrea
Re: INFO: rcu detected stall in sys_sendfile64 (2)
On Wed, Mar 13, 2019 at 07:43:38AM +0100, 'Dmitry Vyukov' via syzkaller-bugs wrote: > > Also, humans can sometimes find more simpler C reproducers from syzbot > > provided > > reproducers. It would be nice if syzbot can accept and use a user defined C > > reproducer for testing. > > It would be more useful to accept patches that make syzkaller create > better reproducers from these people. Manual work is not scalable. We > would need 10 reproducers per day for a dozen of OSes (incl some > private kernels/branches). Anybody is free to run syzkaller manually > and do full manual (perfect) reporting. But for us it become clear > very early that it won't work. Then see above, while that human is > sleeping/on weekend/vacation, syzbot will already bisect own > reproducer. Adding manual reproducer later won't help in any way. > syzkaller already does lots of smart work for reproducers. Let's not > give up on the last mile and switch back to all manual work. > Well, it's very tough and not many people are familiar with the syzkaller codebase, let alone have time to contribute. But having simplified a lot of the syzkaller reproducers manually, the main things I do are: - Replace bare system calls with proper C library calls. For example: #include syscall(__NR_socket, 0xa, 6, 0); becomes: #include socket(AF_INET, SOCK_DCCP, 0); - Do the same for structs. Use the appropriate C header rather than filling in each struct manually. 
  For example:

	*(uint16_t*)0x2000 = 0xa;
	*(uint16_t*)0x2002 = htobe16(0x4e20);
	*(uint32_t*)0x2004 = 0;
	*(uint8_t*)0x2008 = 0;
	*(uint8_t*)0x2009 = 0;
	*(uint8_t*)0x200a = 0;
	*(uint8_t*)0x200b = 0;
	*(uint8_t*)0x200c = 0;
	*(uint8_t*)0x200d = 0;
	*(uint8_t*)0x200e = 0;
	*(uint8_t*)0x200f = 0;
	*(uint8_t*)0x2010 = 0;
	*(uint8_t*)0x2011 = 0;
	*(uint8_t*)0x2012 = 0;
	*(uint8_t*)0x2013 = 0;
	*(uint8_t*)0x2014 = 0;
	*(uint8_t*)0x2015 = 0;
	*(uint8_t*)0x2016 = 0;
	*(uint8_t*)0x2017 = 0;
	*(uint32_t*)0x2018 = 0;

  becomes:

	struct sockaddr_in6 addr = {
		.sin6_family = AF_INET6,
		.sin6_port = htobe16(0x4e20)
	};

- Put arguments on the stack rather than in a mmap'd region, if possible.

- Simplify any calls to the helper functions that syzkaller emits, e.g. syz_open_dev(), syz_kvm_setup_vcpu(), or the networking setup stuff. Usually the reproducer needs a small subset of the functionality to work.

- For multithreaded reproducers, try to incrementally simplify the threading strategy. For example, reduce the number of threads by combining operations. Also try running the operations in loops. Also, using fork() can often result in a simpler reproducer than pthreads.

- Instead of using the 'r[]' array to hold all integer return values, give them appropriate names.

- Remove duplicate #includes.

- Considering the actual kernel code and the bug, if possible find a different way to trigger the same bug that's simpler or more reliable. If the problem is obvious it may be possible to jump right to this step from the beginning.

Some gotchas:

- fault-nth injections are fragile, since the number of memory allocations in a particular system call varies by kernel config and kernel version. Incrementing n starting from 1 is more reliable.

- Some of the perf_event_open() reproducers are fragile because they hardcode a trace event ID, which can change in every kernel version. Reading the trace event ID from /sys/kernel/debug/tracing/events/ is more reliable.

- Reproducers using the KVM API sometimes only work on certain processors (e.g. Intel but not AMD) or even depend on the host kernel.

- Reproducers that access the local filesystem sometimes assume that it's ext4.
[PATCH v2] MAINTAINERS: Add KVM selftests to existing KVM entry
It's safe to assume Paolo and Radim are maintaining the KVM selftests given that the vast majority of commits have their SOBs. Play nice with get_maintainers and make it official.

Signed-off-by: Sean Christopherson
---
v2: add all KVM subdirectories

 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 8347269aa386..5e5b5424fd93 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8295,6 +8295,8 @@ F:	include/linux/kvm*
 F:	include/kvm/iodev.h
 F:	virt/kvm/*
 F:	tools/kvm/
+F:	tools/testing/selftests/kvm/
+F:	tools/testing/selftests/kvm/*/
 
 KERNEL VIRTUAL MACHINE FOR AMD-V (KVM/amd)
 M:	Joerg Roedel
-- 
2.21.0
Re: [PATCH 5/5] lib/list_sort: Optimize number of calls to comparison function
On 05/03/2019 06.58, George Spelvin wrote: > CONFIG_RETPOLINE has severely degraded indirect function call > performance, so it's worth putting some effort into reducing > the number of times cmp() is called. > > This patch avoids badly unbalanced merges on unlucky input sizes. > It slightly increases the code size, but saves an average of 0.2*n > calls to cmp(). > [snip] > > (I confess to being more than a little bit proud of how clean this > code turned out. It took a lot of thinking, but the resultant inner > loop is very simple and efficient.) > > Refs: > Bottom-up Mergesort: A Detailed Analysis > Wolfgang Panny, Helmut Prodinger > Algorithmica 14(4):340--354, October 1995 > https://doi.org/10.1007/BF01294131 > https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.6.5260 > > The cost distribution of queue-mergesort, optimal mergesorts, and > power-of-two rules > Wei-Mei Chen, Hsien-Kuei Hwang, Gen-Huey Chen > Journal of Algorithms 30(2); Pages 423--448, February 1999 > https://doi.org/10.1006/jagm.1998.0986 > https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.4.5380 > > Queue-Mergesort > Mordecai J. Golin, Robert Sedgewick > Information Processing Letters, 48(5):253--259, 10 December 1993 > https://doi.org/10.1016/0020-0190(93)90088-q > https://sci-hub.tw/10.1016/0020-0190(93)90088-Q This is beautiful. So no comments on the patch itself. One thing that might be nice would be to see the reduction in number of cmp callbacks explicitly; it should be trivial to use the priv element for that in the list_sort_test module. But to really see it one would of course have to extend that test to do a lot more different sizes of lists. And looking at the actual cmp functions used made me think we might do something similar to what you did for the swap functions for sort(), though it will require updating call sites. Suppose we have some struct foo { ... struct list_head list; ... some_integer_type key; ... 
};

and a trivial cmp function that just compares the ->key members. What if we do something like

#define LIST_SORT_SIMPLE_CMP(type, list, key) ({ \
	/* encode the size and signedness of type.key along with
	 * offsetof(type, key) - offsetof(type, list) into the low, say,
	 * 24 bits - a signed 20 bit number should be sufficient for the
	 * offsetof diff, and we can bail at compile time if too big */ \
})

then do

int do_cmp(void *priv, list_head *a, list_head *b, cmp_func cmp)
{
	if ((long)cmp & high_bits) /* it's a kernel pointer */
		return cmp(priv, a, b);
	int diff = extract_diff(cmp);
	int type = extract_type(cmp);
	void *keya = (void*)a + diff;
	void *keyb = (void*)b + diff;
	if (type == s32)
		return *(s32*)keya > *(s32*)keyb;
	if (type == u32)
		return *(u32*)keya > *(u32*)keyb;
	...
}

In practice, we'd probably only have to support 4- and 8-byte signed and unsigned versions (and we can check at compile time whether key has a supported type). Similarly one could do a SORT_SIMPLE_CMP() for when sorting an array of structs according to a single numeric member. That sort is not stable, so the comparison functions would have to do a full -1,0,1 return, of course, but we'd still avoid indirect calls.

Rasmus
Re: [PATCHv8 07/10] acpi/hmat: Register processor domain to its memory
On Mon, Mar 11, 2019 at 9:55 PM Keith Busch wrote: > > If the HMAT Subsystem Address Range provides a valid processor proximity > domain for a memory domain, or a processor domain matches the performance > access of the valid processor proximity domain, register the memory > target with that initiator so this relationship will be visible under > the node's sysfs directory. > > Since HMAT requires valid address ranges have an equivalent SRAT entry, > verify each memory target satisfies this requirement. > > Reviewed-by: Jonathan Cameron > Signed-off-by: Keith Busch Acked-by: Rafael J. Wysocki > --- > drivers/acpi/hmat/Kconfig | 3 +- > drivers/acpi/hmat/hmat.c | 392 > +- > 2 files changed, 393 insertions(+), 2 deletions(-) > > diff --git a/drivers/acpi/hmat/Kconfig b/drivers/acpi/hmat/Kconfig > index 2f7111b7af62..13cddd612a52 100644 > --- a/drivers/acpi/hmat/Kconfig > +++ b/drivers/acpi/hmat/Kconfig > @@ -4,4 +4,5 @@ config ACPI_HMAT > depends on ACPI_NUMA > help > If set, this option has the kernel parse and report the > -platform's ACPI HMAT (Heterogeneous Memory Attributes Table). > +platform's ACPI HMAT (Heterogeneous Memory Attributes Table), > +and register memory initiators with their targets. > diff --git a/drivers/acpi/hmat/hmat.c b/drivers/acpi/hmat/hmat.c > index 4758beb3b2c1..01a6eddac6f7 100644 > --- a/drivers/acpi/hmat/hmat.c > +++ b/drivers/acpi/hmat/hmat.c > @@ -13,11 +13,105 @@ > #include > #include > #include > +#include > #include > #include > > static __initdata u8 hmat_revision; > > +static __initdata LIST_HEAD(targets); > +static __initdata LIST_HEAD(initiators); > +static __initdata LIST_HEAD(localities); > + > +/* > + * The defined enum order is used to prioritize attributes to break ties when > + * selecting the best performing node. 
> + */
> +enum locality_types {
> +	WRITE_LATENCY,
> +	READ_LATENCY,
> +	WRITE_BANDWIDTH,
> +	READ_BANDWIDTH,
> +};
> +
> +static struct memory_locality *localities_types[4];
> +
> +struct memory_target {
> +	struct list_head node;
> +	unsigned int memory_pxm;
> +	unsigned int processor_pxm;
> +	struct node_hmem_attrs hmem_attrs;
> +};
> +
> +struct memory_initiator {
> +	struct list_head node;
> +	unsigned int processor_pxm;
> +};
> +
> +struct memory_locality {
> +	struct list_head node;
> +	struct acpi_hmat_locality *hmat_loc;
> +};
> +
> +static __init struct memory_initiator *find_mem_initiator(unsigned int cpu_pxm)
> +{
> +	struct memory_initiator *initiator;
> +
> +	list_for_each_entry(initiator, &initiators, node)
> +		if (initiator->processor_pxm == cpu_pxm)
> +			return initiator;
> +	return NULL;
> +}
> +
> +static __init struct memory_target *find_mem_target(unsigned int mem_pxm)
> +{
> +	struct memory_target *target;
> +
> +	list_for_each_entry(target, &targets, node)
> +		if (target->memory_pxm == mem_pxm)
> +			return target;
> +	return NULL;
> +}
> +
> +static __init void alloc_memory_initiator(unsigned int cpu_pxm)
> +{
> +	struct memory_initiator *initiator;
> +
> +	if (pxm_to_node(cpu_pxm) == NUMA_NO_NODE)
> +		return;
> +
> +	initiator = find_mem_initiator(cpu_pxm);
> +	if (initiator)
> +		return;
> +
> +	initiator = kzalloc(sizeof(*initiator), GFP_KERNEL);
> +	if (!initiator)
> +		return;
> +
> +	initiator->processor_pxm = cpu_pxm;
> +	list_add_tail(&initiator->node, &initiators);
> +}
> +
> +static __init void alloc_memory_target(unsigned int mem_pxm)
> +{
> +	struct memory_target *target;
> +
> +	if (pxm_to_node(mem_pxm) == NUMA_NO_NODE)
> +		return;
> +
> +	target = find_mem_target(mem_pxm);
> +	if (target)
> +		return;
> +
> +	target = kzalloc(sizeof(*target), GFP_KERNEL);
> +	if (!target)
> +		return;
> +
> +	target->memory_pxm = mem_pxm;
> +	target->processor_pxm = PXM_INVAL;
> +	list_add_tail(&target->node, &targets);
> +}
> +
>  static __init const char *hmat_data_type(u8 type)
>  {
>  	switch (type) {
> 
> @@ -89,14 +183,83 @@ static __init u32 hmat_normalize(u16 entry, u64 base, u8 type)
>  	return value;
>  }
> 
> +static __init void hmat_update_target_access(struct memory_target *target,
> +					     u8 type, u32 value)
> +{
> +	switch (type) {
> +	case ACPI_HMAT_ACCESS_LATENCY:
> +		target->hmem_attrs.read_latency = value;
> +		target->hmem_attrs.write_latency = value;
> +		break;
> +	case ACPI_HMAT_READ_LATENCY:
> +		target->hmem_attrs.read_latency = value;
> +		break;
> +	case ACPI_HMAT_WRITE_LATENCY:
> +		target->hmem_attrs.write_latency = value;
> +
Re: Regression causes a hang on boot with a Comtrol PCI card
Hi Jesse,

On Wed, Mar 13, 2019 at 11:50:07AM -0500, Jesse Hathaway wrote:
> Two regressions cause Linux to hang on boot when a Comtrol PCI card
> is present.
>
> If I revert the following two commits, I can boot again and the card
> operates without issue:
>
> 1302fcf0d03e (refs/bisect/bad) PCI: Configure *all* devices, not just hot-added ones
> 1c3c5eab1715 sched/core: Enable might_sleep() and smp_processor_id() checks early

I'm very sorry about the regression, but thank you very much for narrowing it down and reporting it! How did you narrow it down to *two* commits, and do you have to revert both of them to avoid the hang? Usually a bisection identifies a single commit, and the two you mention aren't related.

> ; lspci -vs 82:00.0
> 82:00.0 Multiport serial controller: Comtrol Corporation Device 0061
>         Subsystem: Comtrol Corporation Device 0061
>         Flags: 66MHz, medium devsel, IRQ 35, NUMA node 1
>         Memory at c8004000 (32-bit, non-prefetchable) [size=4K]
>         Memory at c800 (32-bit, non-prefetchable) [size=16K]
>         Capabilities: [40] Hot-plug capable
>         Capabilities: [48] Power Management version 2
>         Kernel driver in use: rp2
>         Kernel modules: rp2
>
> Is it possible that the problem is that the card claims to support
> Hot-plug, but does not?
>
> I would love to help fix this issue, please let me know what other
> information would be helpful to provide.

Can you collect a complete dmesg log (with a working kernel) and output of "sudo lspci -vvxxx"? You can open a bug report at https://bugzilla.kernel.org, attach the logs there, and respond here with the URL.

Where does the hang happen? Is it when we configure the Comtrol card?

Bjorn
Re: [RFC][Patch v9 2/6] KVM: Enables the kernel to isolate guest free pages
On 13.03.19 23:54, Alexander Duyck wrote: > On Wed, Mar 13, 2019 at 9:39 AM David Hildenbrand wrote: >> >> On 13.03.19 17:37, Alexander Duyck wrote: >>> On Wed, Mar 13, 2019 at 5:18 AM David Hildenbrand wrote: On 13.03.19 12:54, Nitesh Narayan Lal wrote: > > On 3/12/19 5:13 PM, Alexander Duyck wrote: >> On Tue, Mar 12, 2019 at 12:46 PM Nitesh Narayan Lal >> wrote: >>> On 3/8/19 4:39 PM, Alexander Duyck wrote: On Fri, Mar 8, 2019 at 11:39 AM Nitesh Narayan Lal wrote: > On 3/8/19 2:25 PM, Alexander Duyck wrote: >> On Fri, Mar 8, 2019 at 11:10 AM Nitesh Narayan Lal >> wrote: >>> On 3/8/19 1:06 PM, Alexander Duyck wrote: On Thu, Mar 7, 2019 at 6:32 PM Michael S. Tsirkin wrote: > On Thu, Mar 07, 2019 at 02:35:53PM -0800, Alexander Duyck wrote: >> The only other thing I still want to try and see if I can do is >> to add >> a jiffies value to the page private data in the case of the buddy >> pages. > Actually there's one extra thing I think we should do, and that > is make > sure we do not leave less than X% off the free memory at a time. > This way chances of triggering an OOM are lower. If nothing else we could probably look at doing a watermark of some sort so we have to have X amount of memory free but not hinted before we will start providing the hints. It would just be a matter of tracking how much memory we have hinted on versus the amount of memory that has been pulled from that pool. >>> This is to avoid false OOM in the guest? >> Partially, though it would still be possible. Basically it would just >> be a way of determining when we have hinted "enough". Basically it >> doesn't do us much good to be hinting on free memory if the guest is >> already constrained and just going to reallocate the memory shortly >> after we hinted on it. The idea is with a watermark we can avoid >> hinting until we start having pages that are actually going to stay >> free for a while. 
>> It is another reason why we probably want a bit in the buddy pages somewhere to indicate if a page has been hinted or not as we can then use that to determine if we have to account for it in the statistics. >>> The one benefit which I can see of having an explicit bit is that it >>> will help us to have a single hook away from the hot path within >>> buddy >>> merging code (just like your arch_merge_page) and still avoid >>> duplicate >>> hints while releasing pages. >>> >>> I still have to check PG_idle and PG_young which you mentioned but I >>> don't think we can reuse any existing bits. >> Those are bits that are already there for 64b. I think those exist in >> the page extension for 32b systems. If I am not mistaken they are >> only >> used in VMA mapped memory. What I was getting at is that those are >> the >> bits we could think about reusing. >> >>> If we really want to have something like a watermark, then can't we >>> use >>> zone->free_pages before isolating to see how many free pages are >>> there >>> and put a threshold on it? (__isolate_free_page() does a similar >>> thing >>> but it does that on per request basis). >> Right. That is only part of it though since that tells you how many >> free pages are there. But how many of those free pages are hinted? >> That is the part we would need to track separately and then then >> compare to free_pages to determine if we need to start hinting on >> more >> memory or not. > Only pages which are isolated will be hinted, and once a page is > isolated it will not be counted in the zone free pages. > Feel free to correct me if I am wrong. You are correct up to here. When we isolate the page it isn't counted against the free pages. However after we complete the hint we end up taking it out of isolation and returning it to the "free" state, so it will be counted against the free pages. > If I am understanding it correctly you only want to hint the idle > pages, > is that right? 
Getting back to the ideas from our earlier discussion, we had 3 stages for things. Free but not hinted, isolated due to hinting, and free and hinted. So what we would need to do is identify the size of the first pool that is free and not hinted by knowing the total number of free
Re: [PATCHv8 06/10] node: Add memory-side caching attributes
On Mon, Mar 11, 2019 at 9:55 PM Keith Busch wrote: > > System memory may have caches to help improve access speed to frequently > requested address ranges. While the system provided cache is transparent > to the software accessing these memory ranges, applications can optimize > their own access based on cache attributes. > > Provide a new API for the kernel to register these memory-side caches > under the memory node that provides it. > > The new sysfs representation is modeled from the existing cpu cacheinfo > attributes, as seen from /sys/devices/system/cpu//cache/. Unlike CPU > cacheinfo though, the node cache level is reported from the view of the > memory. A higher level number is nearer to the CPU, while lower levels > are closer to the last level memory. > > The exported attributes are the cache size, the line size, associativity > indexing, and write back policy, and add the attributes for the system > memory caches to sysfs stable documentation. > > Signed-off-by: Keith Busch Reviewed-by: Rafael J. Wysocki > --- > Documentation/ABI/stable/sysfs-devices-node | 34 +++ > drivers/base/node.c | 151 > > include/linux/node.h| 39 +++ > 3 files changed, 224 insertions(+) > > diff --git a/Documentation/ABI/stable/sysfs-devices-node > b/Documentation/ABI/stable/sysfs-devices-node > index 735a40a3f9b2..f7ce68fbd4b9 100644 > --- a/Documentation/ABI/stable/sysfs-devices-node > +++ b/Documentation/ABI/stable/sysfs-devices-node > @@ -142,3 +142,37 @@ Contact: Keith Busch > Description: > This node's write latency in nanoseconds when access > from nodes found in this class's linked initiators. > + > +What: /sys/devices/system/node/nodeX/memory_side_cache/indexY/ > +Date: December 2018 > +Contact: Keith Busch > +Description: > + The directory containing attributes for the memory-side cache > + level 'Y'. 
> + > +What: > /sys/devices/system/node/nodeX/memory_side_cache/indexY/indexing > +Date: December 2018 > +Contact: Keith Busch > +Description: > + The caches associativity indexing: 0 for direct mapped, > + non-zero if indexed. > + > +What: > /sys/devices/system/node/nodeX/memory_side_cache/indexY/line_size > +Date: December 2018 > +Contact: Keith Busch > +Description: > + The number of bytes accessed from the next cache level on a > + cache miss. > + > +What: /sys/devices/system/node/nodeX/memory_side_cache/indexY/size > +Date: December 2018 > +Contact: Keith Busch > +Description: > + The size of this memory side cache in bytes. > + > +What: > /sys/devices/system/node/nodeX/memory_side_cache/indexY/write_policy > +Date: December 2018 > +Contact: Keith Busch > +Description: > + The cache write policy: 0 for write-back, 1 for write-through, > + other or unknown. > diff --git a/drivers/base/node.c b/drivers/base/node.c > index 2de546a040a5..8598fcbd2a17 100644 > --- a/drivers/base/node.c > +++ b/drivers/base/node.c > @@ -205,6 +205,155 @@ void node_set_perf_attrs(unsigned int nid, struct > node_hmem_attrs *hmem_attrs, > } > } > } > + > +/** > + * struct node_cache_info - Internal tracking for memory node caches > + * @dev: Device represeting the cache level > + * @node: List element for tracking in the node > + * @cache_attrs:Attributes for this cache level > + */ > +struct node_cache_info { > + struct device dev; > + struct list_head node; > + struct node_cache_attrs cache_attrs; > +}; > +#define to_cache_info(device) container_of(device, struct node_cache_info, > dev) > + > +#define CACHE_ATTR(name, fmt) \ > +static ssize_t name##_show(struct device *dev, \ > + struct device_attribute *attr, \ > + char *buf) \ > +{ \ > + return sprintf(buf, fmt "\n", to_cache_info(dev)->cache_attrs.name);\ > +} \ > +DEVICE_ATTR_RO(name); > + > +CACHE_ATTR(size, "%llu") > +CACHE_ATTR(line_size, "%u") > +CACHE_ATTR(indexing, "%u") > +CACHE_ATTR(write_policy, "%u") > + > +static 
struct attribute *cache_attrs[] = {
> +	&dev_attr_indexing.attr,
> +	&dev_attr_size.attr,
> +	&dev_attr_line_size.attr,
> +	&dev_attr_write_policy.attr,
> +	NULL,
> +};
> +ATTRIBUTE_GROUPS(cache);
> +
> +static void node_cache_release(struct device *dev)
> +{
> +	kfree(dev);
> +}
> +
> +static void node_cacheinfo_release(struct device *dev)
> +{
> +	struct node_cache_info *info =
Re: [PATCH 1/5] lib/sort: Make swap functions more generic
Thank you for your thoughtful comments!

On Wed, 13 Mar 2019 at 23:23:44 +0100, Rasmus Villemoes wrote:
> On 21/02/2019 07.30, George Spelvin wrote:
>> + * @align: required aignment (typically 4 or 8)
>
> typo aLignment

Thanks; fixed!

>> + * Returns true if elements can be copied using word loads and stores.
>> + * The size must be a multiple of the alignment, and the base address must
>> + * be if we do not have CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.
>
> I wonder if we shouldn't unconditionally require the base to be aligned.
> It of course depends on what exactly 'efficient' means, but if the base
> is not aligned, we know that every single load and store will be
> unaligned, and we're doing lots of them. IIRC even on x86 it's usually
> two cycles instead of one, and even more when we cross a cache line
> boundary (which we do rather often - e.g. in your 40 byte example,
> almost every swap will hit at least one). One could also have some data
> that is naturally 4-byte aligned with an element size that happens to be
> a multiple of 8 bytes; it would be a bit sad to use 8-byte accesses when
> 4-byte would all have been aligned.

Well, a 2-cycle unaligned access is still lots faster than 4 byte accesses. I think that's a decent interpretation of "EFFICIENT_UNALIGNED_ACCESS": faster than doing it a byte at a time. So the change is a net win; we're just wondering if it could be optimized more.

The 4/8 issue is similar, but not as extreme. Is one unaligned 64-bit access faster than two aligned 32-bit accesses? As you note, it's usually only twice the cost, so it's no slower, and there's less loop overhead. So I don't think it's a bad thing here, either.

Re your comments on cache lines, ISTR that x86 can do an unaligned load with minimal penalty as long as it doesn't cross a cache line:
https://software.intel.com/en-us/articles/reducing-the-impact-of-misaligned-memory-accesses/
So you win on most of the accesses, hopefully enough to pay for the one unaligned access.
Apparently in processors more recent than the P4 example Intel used above, it's even less:
https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/

On 32-bit machines, it's actually a 4-byte swap, unrolled twice; there are no 64-bit memory accesses. So the concern is only about 8-byte alignment on 64-bit machines. The great majority of call sites sort structures with pointer or long members, so are aligned and the question is moot. I don't think it's worth overthinking the issue on behalf of the performance of some rare corner cases. I have considered doing away with the two word-copy loops and just having one machine-word-sized loop plus a byte fallback.

>> +	unsigned int lsbits = (unsigned int)size;
>
> Drop that cast.

Actually, in response to an earlier comment, I changed it (and the type of lsbits) to (unsigned char), to emphasize that I *really* only care about a handful of low bits. I know gcc doesn't warn about implicit narrowing casts, but I prefer to make them explicit for documentation reasons. "Yes, I really mean to throw away the high bits." Compilers on machines without native byte operations are very good at using larger registers and optimizing away mask operations. (Remember that this is inlined, and "align" is a constant 4 or 8.)

>> +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
>> +	(void)base;
>> +#else
>> +	lsbits |= (unsigned int)(size_t)base;
>
> The kernel usually casts pointers to long or unsigned long. If some
> oddball arch has size_t something other than unsigned long, this would
> give a "warning: cast from pointer to integer of different size". So
> just cast to unsigned long, and drop the cast to unsigned int.

I changed this to uintptr_t and called it good.
>> static void u32_swap(void *a, void *b, int size)
>> {
>> -	u32 t = *(u32 *)a;
>> -	*(u32 *)a = *(u32 *)b;
>> -	*(u32 *)b = t;
>> +	size_t n = size;
>> +
>
> Since the callback has int in its prototype (we should fix that globally
> at some point...), might as well use a four-byte local variable, to
> avoid rex prefix on x86-64. So just make that "unsigned" or "u32".

Agreed about the eventual global fix. If you care, the only places a non-NULL swap function is passed in are:

- arch/arc/kernel/unwind.c
- arch/powerpc/kernel/module_32.c
- arch/powerpc/kernel/module_64.c
! arch/x86/kernel/unwind_orc.c (2x)
- fs/ocfs2/dir.c
- fs/ocfs2/refcounttree.c (3x)
- fs/ocfs2/xattr.c (3x)
- fs/ubifs/find.c
! kernel/jump_label.c
! lib/extable.c

The ones marked with "-" are simple memory swaps that could (should!) be replaced with NULL. The ones marked with "!" actually do something non-trivial.

Actually, I deliberately used a pointer-sized index, since the loop uses (a + n) addressing. Hardware indexed addressing on 64-bit machines generally (I vaguely recall Aarch64 is an exception) requires a 64-bit index; using a smaller index requires some masking or sign-extending. I thought it was safer to bet that the
Re: [PATCHv8 05/10] node: Add heterogenous memory access attributes
On Mon, Mar 11, 2019 at 9:55 PM Keith Busch wrote: > > Heterogeneous memory systems provide memory nodes with different latency > and bandwidth performance attributes. Provide a new kernel interface > for subsystems to register the attributes under the memory target > node's initiator access class. If the system provides this information, > applications may query these attributes when deciding which node to > request memory. > > The following example shows the new sysfs hierarchy for a node exporting > performance attributes: > > # tree -P "read*|write*"/sys/devices/system/node/nodeY/accessZ/initiators/ > /sys/devices/system/node/nodeY/accessZ/initiators/ > |-- read_bandwidth > |-- read_latency > |-- write_bandwidth > `-- write_latency > > The bandwidth is exported as MB/s and latency is reported in > nanoseconds. The values are taken from the platform as reported by the > manufacturer. > > Memory accesses from an initiator node that is not one of the memory's > access "Z" initiator nodes linked in the same directory may observe > different performance than reported here. When a subsystem makes use > of this interface, initiators of a different access number may not have > the same performance relative to initiators in other access numbers, or > omitted from the any access class' initiators. > > Descriptions for memory access initiator performance access attributes > are added to sysfs stable documentation. > > Acked-by: Jonathan Cameron > Tested-by: Jonathan Cameron > Signed-off-by: Keith Busch Reviewed-by: Rafael J. 
Wysocki > --- > Documentation/ABI/stable/sysfs-devices-node | 28 ++ > drivers/base/Kconfig| 8 > drivers/base/node.c | 59 > + > include/linux/node.h| 26 + > 4 files changed, 121 insertions(+) > > diff --git a/Documentation/ABI/stable/sysfs-devices-node > b/Documentation/ABI/stable/sysfs-devices-node > index 433bcc04e542..735a40a3f9b2 100644 > --- a/Documentation/ABI/stable/sysfs-devices-node > +++ b/Documentation/ABI/stable/sysfs-devices-node > @@ -114,3 +114,31 @@ Contact: Keith Busch > Description: > The directory containing symlinks to memory targets that > this initiator node has class "Y" access. > + > +What: > /sys/devices/system/node/nodeX/accessY/initiators/read_bandwidth > +Date: December 2018 > +Contact: Keith Busch > +Description: > + This node's read bandwidth in MB/s when accessed from > + nodes found in this access class's linked initiators. > + > +What: /sys/devices/system/node/nodeX/accessY/initiators/read_latency > +Date: December 2018 > +Contact: Keith Busch > +Description: > + This node's read latency in nanoseconds when accessed > + from nodes found in this access class's linked initiators. > + > +What: > /sys/devices/system/node/nodeX/accessY/initiators/write_bandwidth > +Date: December 2018 > +Contact: Keith Busch > +Description: > + This node's write bandwidth in MB/s when accessed from > + found in this access class's linked initiators. > + > +What: > /sys/devices/system/node/nodeX/accessY/initiators/write_latency > +Date: December 2018 > +Contact: Keith Busch > +Description: > + This node's write latency in nanoseconds when access > + from nodes found in this class's linked initiators. > diff --git a/drivers/base/Kconfig b/drivers/base/Kconfig > index 059700ea3521..a7438a58c250 100644 > --- a/drivers/base/Kconfig > +++ b/drivers/base/Kconfig > @@ -149,6 +149,14 @@ config DEBUG_TEST_DRIVER_REMOVE > unusable. You should say N here unless you are explicitly looking to > test this functionality. 
> > +config HMEM_REPORTING > + bool > + default n > + depends on NUMA > + help > + Enable reporting for heterogenous memory access attributes under > + their non-uniform memory nodes. > + > source "drivers/base/test/Kconfig" > > config SYS_HYPERVISOR > diff --git a/drivers/base/node.c b/drivers/base/node.c > index 6f4097680580..2de546a040a5 100644 > --- a/drivers/base/node.c > +++ b/drivers/base/node.c > @@ -71,6 +71,9 @@ struct node_access_nodes { > struct device dev; > struct list_headlist_node; > unsignedaccess; > +#ifdef CONFIG_HMEM_REPORTING > + struct node_hmem_attrs hmem_attrs; > +#endif > }; > #define to_access_nodes(dev) container_of(dev, struct node_access_nodes, dev) > > @@ -148,6 +151,62 @@ static struct node_access_nodes > *node_init_node_access(struct node *node, > return NULL; > } > > +#ifdef CONFIG_HMEM_REPORTING > +#define ACCESS_ATTR(name) \ > +static ssize_t name##_show(struct
Re: [PATCHv8 04/10] node: Link memory nodes to their compute nodes
On Mon, Mar 11, 2019 at 9:55 PM Keith Busch wrote: > > Systems may be constructed with various specialized nodes. Some nodes > may provide memory, some provide compute devices that access and use > that memory, and others may provide both. Nodes that provide memory are > referred to as memory targets, and nodes that can initiate memory access > are referred to as memory initiators. > > Memory targets will often have varying access characteristics from > different initiators, and platforms may have ways to express those > relationships. In preparation for these systems, provide interfaces for > the kernel to export the memory relationship among different nodes memory > targets and their initiators with symlinks to each other. > > If a system provides access locality for each initiator-target pair, nodes > may be grouped into ranked access classes relative to other nodes. The > new interface allows a subsystem to register relationships of varying > classes if available and desired to be exported. > > A memory initiator may have multiple memory targets in the same access > class. The target memory's initiators in a given class indicate the > nodes access characteristics share the same performance relative to other > linked initiator nodes. Each target within an initiator's access class, > though, do not necessarily perform the same as each other. > > A memory target node may have multiple memory initiators. All linked > initiators in a target's class have the same access characteristics to > that target. 
> > The following example show the nodes' new sysfs hierarchy for a memory > target node 'Y' with access class 0 from initiator node 'X': > > # symlinks -v /sys/devices/system/node/nodeX/access0/ > relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> > ../../nodeY > > # symlinks -v /sys/devices/system/node/nodeY/access0/ > relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> > ../../nodeX > > The new attributes are added to the sysfs stable documentation. > > Reviewed-by: Jonathan Cameron > Signed-off-by: Keith Busch Reviewed-by: Rafael J. Wysocki > --- > Documentation/ABI/stable/sysfs-devices-node | 25 - > drivers/base/node.c | 142 > +++- > include/linux/node.h| 6 ++ > 3 files changed, 171 insertions(+), 2 deletions(-) > > diff --git a/Documentation/ABI/stable/sysfs-devices-node > b/Documentation/ABI/stable/sysfs-devices-node > index 3e90e1f3bf0a..433bcc04e542 100644 > --- a/Documentation/ABI/stable/sysfs-devices-node > +++ b/Documentation/ABI/stable/sysfs-devices-node > @@ -90,4 +90,27 @@ Date:December 2009 > Contact: Lee Schermerhorn > Description: > The node's huge page size control/query attributes. > - See Documentation/admin-guide/mm/hugetlbpage.rst > \ No newline at end of file > + See Documentation/admin-guide/mm/hugetlbpage.rst > + > +What: /sys/devices/system/node/nodeX/accessY/ > +Date: December 2018 > +Contact: Keith Busch > +Description: > + The node's relationship to other nodes for access class "Y". > + > +What: /sys/devices/system/node/nodeX/accessY/initiators/ > +Date: December 2018 > +Contact: Keith Busch > +Description: > + The directory containing symlinks to memory initiator > + nodes that have class "Y" access to this target node's > + memory. CPUs and other memory initiators in nodes not in > + the list accessing this node's memory may have different > + performance. 
> + > +What: /sys/devices/system/node/nodeX/accessY/targets/ > +Date: December 2018 > +Contact: Keith Busch > +Description: > + The directory containing symlinks to memory targets that > + this initiator node has class "Y" access. > diff --git a/drivers/base/node.c b/drivers/base/node.c > index 86d6cd92ce3d..6f4097680580 100644 > --- a/drivers/base/node.c > +++ b/drivers/base/node.c > @@ -17,6 +17,7 @@ > #include > #include > #include > +#include > #include > #include > > @@ -59,6 +60,94 @@ static inline ssize_t node_read_cpulist(struct device *dev, > static DEVICE_ATTR(cpumap, S_IRUGO, node_read_cpumask, NULL); > static DEVICE_ATTR(cpulist, S_IRUGO, node_read_cpulist, NULL); > > +/** > + * struct node_access_nodes - Access class device to hold user visible > + * relationships to other nodes. > + * @dev: Device for this memory access class > + * @list_node: List element in the node's access list > + * @access:The access class rank > + */ > +struct node_access_nodes { > + struct device dev; > + struct list_headlist_node; > + unsignedaccess; > +}; > +#define to_access_nodes(dev) container_of(dev, struct node_access_nodes, dev) > + > +static
Re: [PATCH] percpu/module reservation: change reservation size iff X86_VSMP is set
On Wed, Mar 13, 2019 at 02:29:12PM -0700, Tejun Heo wrote: > Hello, > > On Wed, Mar 13, 2019 at 02:22:55PM -0700, Paul E. McKenney wrote: > > Should I define DEFINE_SRCU() and DEFINE_STATIC_SRCU() only if > > !defined(MODULE)? > > Yeah, that sounds like a great idea with comments explaining why it's > like that. Like this? * Build-time srcu_struct definition is not allowed in modules because * otherwise it is necessary to increase the size of the reserved region * each time a DEFINE_SRCU() or DEFINE_STATIC_SRCU() are added to a * kernel module. Kernel modules should instead declare an srcu_struct * and then invoke init_srcu_struct() from their module_init function and * cleanup_srcu_struct() from their module_exit function. Note that modules * using call_srcu() will also need to invoke srcu_barrier() from their * module_exit function. Also, it looks like Barret beat me to this suggestion. ;-) In addition, rcutorture and rcuperf needed to be updated because they used to use DEFINE_STATIC_SRCU() whether built in or built as a loadable module. How does the (very lightly tested) patch below look to you all? Thanx, Paul commit 34f67df09cc0c6bf082a7cfca435373caeeb8d82 Author: Paul E. McKenney Date: Wed Mar 13 16:06:22 2019 -0700 srcu: Forbid DEFINE{,_STATIC}_SRCU() from modules Adding DEFINE_SRCU() or DEFINE_STATIC_SRCU() to a loadable module requires that the size of the reserved region be increased, which is not something we want to be doing all that often. Instead, loadable modules should define an srcu_struct and invoke init_srcu_struct() from their module_init function and cleanup_srcu_struct() from their module_exit function. Note that modules using call_srcu() will also need to invoke srcu_barrier() from their module_exit function. This commit enforces this advice by refusing to define DEFINE_SRCU() and DEFINE_STATIC_SRCU() within loadable modules. Suggested-by: Barret Rhoden Signed-off-by: Paul E. 
McKenney diff --git a/include/linux/srcutree.h b/include/linux/srcutree.h index 7f7c8c050f63..ac5ea1c72e97 100644 --- a/include/linux/srcutree.h +++ b/include/linux/srcutree.h @@ -105,6 +105,15 @@ struct srcu_struct { * Define and initialize a srcu struct at build time. * Do -not- call init_srcu_struct() nor cleanup_srcu_struct() on it. * + * Build-time srcu_struct definition is not allowed in modules because + * otherwise it is necessary to increase the size of the reserved region + * each time a DEFINE_SRCU() or DEFINE_STATIC_SRCU() are added to a + * kernel module. Kernel modules should instead declare an srcu_struct + * and then invoke init_srcu_struct() from their module_init function and + * cleanup_srcu_struct() from their module_exit function. Note that modules + * using call_srcu() will also need to invoke srcu_barrier() from their + * module_exit function. + * * Note that although DEFINE_STATIC_SRCU() hides the name from other * files, the per-CPU variable rules nevertheless require that the * chosen name be globally unique. These rules also prohibit use of @@ -120,11 +129,13 @@ struct srcu_struct { * * See include/linux/percpu-defs.h for the rules on per-CPU variables. 
*/ -#define __DEFINE_SRCU(name, is_static) \ - static DEFINE_PER_CPU(struct srcu_data, name##_srcu_data);\ +#ifndef MODULE +# define __DEFINE_SRCU(name, is_static) \ + static DEFINE_PER_CPU(struct srcu_data, name##_srcu_data); \ is_static struct srcu_struct name = __SRCU_STRUCT_INIT(name, name##_srcu_data) -#define DEFINE_SRCU(name) __DEFINE_SRCU(name, /* not static */) -#define DEFINE_STATIC_SRCU(name) __DEFINE_SRCU(name, static) +# define DEFINE_SRCU(name) __DEFINE_SRCU(name, /* not static */) +# define DEFINE_STATIC_SRCU(name) __DEFINE_SRCU(name, static) +#endif void synchronize_srcu_expedited(struct srcu_struct *ssp); void srcu_barrier(struct srcu_struct *ssp); diff --git a/kernel/rcu/rcuperf.c b/kernel/rcu/rcuperf.c index c29761152874..b44208b3bf95 100644 --- a/kernel/rcu/rcuperf.c +++ b/kernel/rcu/rcuperf.c @@ -139,6 +139,7 @@ struct rcu_perf_ops { void (*sync)(void); void (*exp_sync)(void); const char *name; + const char *altname; }; static struct rcu_perf_ops *cur_ops; @@ -186,8 +187,16 @@ static struct rcu_perf_ops rcu_ops = { * Definitions for srcu perf testing. */ +static struct srcu_struct *srcu_ctlp; + +#ifndef MODULE DEFINE_STATIC_SRCU(srcu_ctl_perf); -static struct srcu_struct *srcu_ctlp = &srcu_ctl_perf; + +static void srcu_sync_perf_init(void) +{ + srcu_ctlp = &srcu_ctl_perf; +} +#endif static int srcu_perf_read_lock(void) __acquires(srcu_ctlp) { @@ -224,9 +233,10 @@ static void
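The usage discipline the patch enforces can be sketched in user space. All names below (srcu_like, init_srcu_like, and so on) are hypothetical and exist only to illustrate the built-in-vs-module split: built-in code may use a build-time definition, while code compiled with MODULE defined gets no such macro and must pair a runtime init with a runtime cleanup.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for srcu_struct; illustrative only, not kernel code. */
struct srcu_like {
	bool initialized;
	int  readers;
};

#ifndef MODULE
/* Built-in code path: a build-time definition is allowed. */
# define DEFINE_SRCU_LIKE(name) struct srcu_like name = { .initialized = true }
DEFINE_SRCU_LIKE(builtin_srcu);
#endif

/* Module code path: no convenience macro, so the runtime pair must be used,
 * mirroring init_srcu_struct()/cleanup_srcu_struct() in the advice above. */
static int init_srcu_like(struct srcu_like *s)
{
	s->initialized = true;
	s->readers = 0;
	return 0;
}

static void cleanup_srcu_like(struct srcu_like *s)
{
	assert(s->readers == 0);	/* like srcu_barrier() before teardown */
	s->initialized = false;
}
```

A module would call init_srcu_like() from its module_init function and cleanup_srcu_like() from its module_exit function, which is exactly the pattern the commit message prescribes.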
Re: [PATCH] EDAC, {skx|i10nm}_edac: Fix randconfig build error
On Wed, Mar 06, 2019 at 09:15:13PM +0100, Arnd Bergmann wrote: > On Wed, Mar 6, 2019 at 6:58 PM Luck, Tony wrote: > > From: Qiuxu Zhuo > > > > This seems cleaner than adding all the EXPORTs to skx_common.c > > I also tried a build with the 0x8A152468-config.gz that Arnd > > supplied. > > It's still a bit fragile since you do something that kbuild doesn't > expect with having a file in both a module and built-in code > in some configurations. I'm fairly sure this version works today, > but it would break the next time that file gets changed to include > a reference to THIS_MODULE, or anything else that is different > between built-in and modular code. > > Another alternative would be to mark all symbols in this file > 'static' and then include skx_common.c from the other two files. > Of course that is also ugly and it increases the overall code size, > so I don't see a way to win this. Boris, So where should we go? Proposed solutions are piling up: 1) Make skx_common a module [downside: have to EXPORT everything in it] 2) Move module-ish bits out of skx_common [downside: perhaps fragile] 3) #include skx_common.c into {skx,i10nm}_edac.c [downside: no patch yet, bigger code size] -Tony
Re: [PATCH 2/2] x86/mmap: handle worst-case heap randomization in mmap_base
On Tue, Mar 12, 2019 at 10:33 AM Ali Saidi wrote: > > Increase mmap_base by the worst-case brk randomization so that > the stack and heap remain apart. > > In Linux 4.13 a change was committed that special cased the kernel ELF > loader when the loader is invoked directly (eab09532d400; binfmt_elf: use > ELF_ET_DYN_BASE only for PIE). Generally, the loader isn’t invoked > directly and this issue is limited to cases where it is, (e.g to set a > non-inheritable LD_LIBRARY_PATH, testing new versions of the loader). In > those rare cases, the loader doesn't take into account the amount of brk > randomization that will be applied by arch_randomize_brk(). This can > lead to the stack and heap being arbitrarily close to each other. In the case of using the loader directly, brk (so helpfully identified as "[heap]") is allocated with the _loader_ not the binary. For example, with ASLR entirely disabled, you can see this more clearly: $ /bin/cat /proc/self/maps 4000-c000 r-xp fd:02 34603015 /bin/cat 5575b000-5575c000 r--p 7000 fd:02 34603015 /bin/cat 5575c000-5575d000 rw-p 8000 fd:02 34603015 /bin/cat 5575d000-5577e000 rw-p 00:00 0 [heap] ... 77ff7000-77ffa000 r--p 00:00 0 [vvar] 77ffa000-77ffc000 r-xp 00:00 0 [vdso] 77ffc000-77ffd000 r--p 00027000 fd:02 49287483 /lib/x86_64-linux-gnu/ld-2.27.so 77ffd000-77ffe000 rw-p 00028000 fd:02 49287483 /lib/x86_64-linux-gnu/ld-2.27.so 77ffe000-77fff000 rw-p 00:00 0 7ffde000-7000 rw-p 00:00 0 [stack] $ /lib/x86_64-linux-gnu/ld-2.27.so /bin/cat /proc/self/maps ... 
77bcc000-77bd4000 r-xp fd:02 34603015 /bin/cat 77bd4000-77dd3000 ---p 8000 fd:02 34603015 /bin/cat 77dd3000-77dd4000 r--p 7000 fd:02 34603015 /bin/cat 77dd4000-77dd5000 rw-p 8000 fd:02 34603015 /bin/cat 77dd5000-77dfc000 r-xp fd:02 49287483 /lib/x86_64-linux-gnu/ld-2.27.so 77fb2000-77fd6000 rw-p 00:00 0 77ff7000-77ffa000 r--p 00:00 0 [vvar] 77ffa000-77ffc000 r-xp 00:00 0 [vdso] 77ffc000-77ffd000 r--p 00027000 fd:02 49287483 /lib/x86_64-linux-gnu/ld-2.27.so 77ffd000-77ffe000 rw-p 00028000 fd:02 49287483 /lib/x86_64-linux-gnu/ld-2.27.so 77ffe000-7802 rw-p 00:00 0 [heap] 7ffde000-7000 rw-p 00:00 0 [stack] So I think changing this globally isn't the right solution (normally brk is between text and mmap). Adjusting the mmap base padding means we lose even more memory space. Perhaps it would be better if brk allocation would be placed before the mmap region (i.e. use ELF_ET_DYN_BASE). This seems to work for me: diff --git a/fs/binfmt_elf.c b/fs/binfmt_elf.c index 7d09d125f148..cdaa33f4a3ef 100644 --- a/fs/binfmt_elf.c +++ b/fs/binfmt_elf.c @@ -1131,6 +1131,15 @@ static int load_elf_binary(struct linux_binprm *bprm) current->mm->end_data = end_data; current->mm->start_stack = bprm->p; + /* +* When executing a loader directly (ET_DYN without Interp), move +* the brk area out of the mmap region (since it grows up, and may +* collide early with the stack growing down), and into the unused +* ELF_ET_DYN_BASE region. +*/ + if (!elf_interpreter) + current->mm->brk = current->mm->start_brk = ELF_ET_DYN_BASE; + if ((current->flags & PF_RANDOMIZE) && (randomize_va_space > 1)) { current->mm->brk = current->mm->start_brk = arch_randomize_brk(current->mm); $ /lib/x86_64-linux-gnu/ld-2.27.so /bin/cat /proc/self/maps 56de3000-56e04000 rw-p 00:00 0 [heap] 7f8467da9000-7f8467f9 r-xp fd:01 399295 /lib/x86_64-linux-gnu/libc-2.27.so ... 7f846819a000-7f84681a2000 r-xp fd:01 263229 /bin/cat ... 
7f84685cb000-7f84685cc000 rw-p 00028000 fd:01 399286 /lib/x86_64-linux-gnu/ld-2.27.so 7f84685cc000-7f84685cd000 rw-p 00:00 0 7ffce68f8000-7ffce6919000 rw-p 00:00 0 [stack] 7ffce69f-7ffce69f3000 r--p 00:00 0 [vvar] 7ffce69f3000-7ffce69f4000 r-xp 00:00 0 [vdso] Does anyone see problems with this? (Note that ET_EXEC base is 0x40, so no collision there...) -- Kees Cook
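The map dumps quoted in this thread all come from /proc/self/maps. A minimal Linux-only sketch of how to locate a named region such as "[heap]" or "[stack]" programmatically (find_region and its layout are illustrative, not from the thread):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Find the address range of a named region (e.g. "[heap]" or "[stack]")
 * in this process's own /proc/self/maps -- the same file the dumps in
 * the discussion above were taken from.  Returns 1 on success. */
static int find_region(const char *name, unsigned long *start,
		       unsigned long *end)
{
	char line[512];
	FILE *f = fopen("/proc/self/maps", "r");
	int found = 0;

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		/* The region name is the last field; match it as a substring
		 * and parse the leading "start-end" hex range. */
		if (strstr(line, name) &&
		    sscanf(line, "%lx-%lx", start, end) == 2) {
			found = 1;
			break;
		}
	}
	fclose(f);
	return found;
}
```

Comparing the ranges returned for "[heap]" and "[stack]" is one way to observe, as the discussion does, how close the two regions end up under a given brk placement.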
Re: overlayfs vs. fscrypt
On Wed, Mar 13, 2019 at 03:29:43PM -0700, James Bottomley wrote: > On Wed, 2019-03-13 at 15:13 -0700, Eric Biggers wrote: > > On Wed, Mar 13, 2019 at 02:04:29PM -0700, James Bottomley wrote: > > > On Wed, 2019-03-13 at 13:25 -0700, Eric Biggers wrote: > > > > On Wed, Mar 13, 2019 at 01:06:06PM -0700, James Bottomley wrote: > > > > > On Wed, 2019-03-13 at 12:57 -0700, Eric Biggers wrote: > > > > > > [...] > > > > > > fscrypt would allow the data to be stored encrypted on the > > > > > > local disk, so it's protected against offline compromise of > > > > > > the disk. > > > > > > > > > > Container images are essentially tars of the overlays. They > > > > > only become actual filesystems when instantiated at > > > > > runtime. The current encrypted container image is an overlay > > > > > or set of overlays which is tarred then encrypted. So to > > > > > instantiate it is decrypted then untarred. > > > > > > > > > > The thing I was wondering about was whether instead of a tar > > > > > encrypt we could instead produce an encrypted image from a > > > > > fscrypt filesystem. > > > > > > > > > > > > > Why do you care whether the container image is encrypted on the > > > > local disk, when you're extracting it in plaintext onto the local > > > > disk anyway each time it runs? Even after the runtime files are > > > > "deleted", they may still be recoverable from the disk. Are you > > > > using shred and BLKSECDISCARD, and a non-COW filesystem? > > > > > > > > Now, if you wanted to avoid writing the plaintext to disk > > > > entirely (and thereby use encryption to actually achieve a useful > > > > security property that can't be achieved through file > > > > permissions), fscrypt is a good solution for that. > > > > > > OK let's start with a cloud and container 101: A container is an > > > exactly transportable IaaS environment containing an > > > application. 
The format for the exact transport is the "container > > > image" I've been describing (layered tar file set deployed with > > > overlays). These images are usually stored in cloud based > > > registries which may or may not have useful access controls. I > > > take it the reason for image encryption to protect confidentiality > > > within the registry is obvious. > > > > > > Because of the exact transport, the deployment may be on my laptop, > > > on my test system or in some type of public or private cloud. In > > > all cases bar the laptop, I won't actually own the physical system > > > which ends up deploying the container. So in exchange for security > > > guarantees from the physical system owner, I agree to turn over my > > > decryption key and possibly a cash payment. One of these > > > guarantees is usually that they shred the key after use and that > > > they deploy a useful key escrow system like vault or keyprotect to > > > guard it even while the decryption is being done. > > > > > > > Another is that all traces of the container be shredded after the > > > execution is finished. > > > > Well, sounds like that's not the case currently even with an > > encrypted container image, because the actual runtime files are not > > encrypted on disk. > > Shredding means destroying all trace including in the on-disk image. > However, one problem with the current implementation is there's a > window between container run and container stop where the unencrypted > files are in memory and on local disk. Access or cockup in that window > can leak confidential data. Well, another problem is that it almost certainly doesn't actually work, because some plaintext will still be recoverable from disk or the raw flash. Actually erasing data on modern filesystems and storage devices is extremely difficult. To start, you'd have to run BLKSECDISCARD on every single block ever written to disk to prepare the container, or written by the container during its execution. 
That's one reason to use storage encryption: it reduces the secure deletion problem to just erasing the encryption key. > > > Encrypting the runtime files using fscrypt with an ephemeral key > > would be useful here. IOW, randomly generate an encryption key when > > the container starts, never store it anywhere, and wipe it when the > > container stops. > > > > Note that this is separate from the container *image* encryption. > > Actually, that was my original thought: it needn't be. If fscrypt can > usefully add runtime security, then we could have the encrypted layer > be simply an fscrypt image ... I presume without the key we can create > a tar image of an fscrypt that is encrypted and would still be visible > on untar if we did have the key? So the encrypted layer would be a tar > of the fscrypt filesystem without the key. > fscrypt doesn't support backup and restore without the key, so no you can't do that. But there wouldn't be much benefit for doing it that way anyway, since you'd still have to untar the container image each time the container is started, then
Re: [RFC][Patch v9 2/6] KVM: Enables the kernel to isolate guest free pages
On Wed, Mar 13, 2019 at 9:39 AM David Hildenbrand wrote: > > On 13.03.19 17:37, Alexander Duyck wrote: > > On Wed, Mar 13, 2019 at 5:18 AM David Hildenbrand wrote: > >> > >> On 13.03.19 12:54, Nitesh Narayan Lal wrote: > >>> > >>> On 3/12/19 5:13 PM, Alexander Duyck wrote: > On Tue, Mar 12, 2019 at 12:46 PM Nitesh Narayan Lal > wrote: > > On 3/8/19 4:39 PM, Alexander Duyck wrote: > >> On Fri, Mar 8, 2019 at 11:39 AM Nitesh Narayan Lal > >> wrote: > >>> On 3/8/19 2:25 PM, Alexander Duyck wrote: > On Fri, Mar 8, 2019 at 11:10 AM Nitesh Narayan Lal > wrote: > > On 3/8/19 1:06 PM, Alexander Duyck wrote: > >> On Thu, Mar 7, 2019 at 6:32 PM Michael S. Tsirkin > >> wrote: > >>> On Thu, Mar 07, 2019 at 02:35:53PM -0800, Alexander Duyck wrote: > The only other thing I still want to try and see if I can do is > to add > a jiffies value to the page private data in the case of the buddy > pages. > >>> Actually there's one extra thing I think we should do, and that > >>> is make > >>> sure we do not leave less than X% off the free memory at a time. > >>> This way chances of triggering an OOM are lower. > >> If nothing else we could probably look at doing a watermark of some > >> sort so we have to have X amount of memory free but not hinted > >> before > >> we will start providing the hints. It would just be a matter of > >> tracking how much memory we have hinted on versus the amount of > >> memory > >> that has been pulled from that pool. > > This is to avoid false OOM in the guest? > Partially, though it would still be possible. Basically it would just > be a way of determining when we have hinted "enough". Basically it > doesn't do us much good to be hinting on free memory if the guest is > already constrained and just going to reallocate the memory shortly > after we hinted on it. The idea is with a watermark we can avoid > hinting until we start having pages that are actually going to stay > free for a while. 
> > >> It is another reason why we > >> probably want a bit in the buddy pages somewhere to indicate if a > >> page > >> has been hinted or not as we can then use that to determine if we > >> have > >> to account for it in the statistics. > > The one benefit which I can see of having an explicit bit is that it > > will help us to have a single hook away from the hot path within > > buddy > > merging code (just like your arch_merge_page) and still avoid > > duplicate > > hints while releasing pages. > > > > I still have to check PG_idle and PG_young which you mentioned but I > > don't think we can reuse any existing bits. > Those are bits that are already there for 64b. I think those exist in > the page extension for 32b systems. If I am not mistaken they are > only > used in VMA mapped memory. What I was getting at is that those are > the > bits we could think about reusing. > > > If we really want to have something like a watermark, then can't we > > use > > zone->free_pages before isolating to see how many free pages are > > there > > and put a threshold on it? (__isolate_free_page() does a similar > > thing > > but it does that on per request basis). > Right. That is only part of it though since that tells you how many > free pages are there. But how many of those free pages are hinted? > That is the part we would need to track separately and then then > compare to free_pages to determine if we need to start hinting on > more > memory or not. > >>> Only pages which are isolated will be hinted, and once a page is > >>> isolated it will not be counted in the zone free pages. > >>> Feel free to correct me if I am wrong. > >> You are correct up to here. When we isolate the page it isn't counted > >> against the free pages. However after we complete the hint we end up > >> taking it out of isolation and returning it to the "free" state, so it > >> will be counted against the free pages. 
> >> > >>> If I am understanding it correctly you only want to hint the idle > >>> pages, > >>> is that right? > >> Getting back to the ideas from our earlier discussion, we had 3 stages > >> for things. Free but not hinted, isolated due to hinting, and free and > >> hinted. So what we would need to do is identify the size of the first > >> pool that is free and not hinted by knowing the total number of free > >> pages, and then subtract the size of the pages
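The watermark idea being debated above can be sketched as a pure accounting check. All names and fields here (hint_stats, should_start_hinting, and so on) are hypothetical, purely to illustrate the free/hinted bookkeeping; this is not kernel code:

```c
#include <assert.h>
#include <stdbool.h>

/* Three-pool model from the thread: free-but-unhinted, isolated for
 * hinting, and free-and-hinted.  Isolated pages are not counted in
 * free_pages; hinted pages return to the free count after hinting. */
struct hint_stats {
	unsigned long free_pages;	/* free pages counted by the zone */
	unsigned long hinted_pages;	/* subset of those already hinted */
};

/* Hinting is only worthwhile while the free-but-unhinted pool sits
 * above a watermark; otherwise the guest is likely to reallocate the
 * memory shortly after it was hinted. */
static bool should_start_hinting(const struct hint_stats *s,
				 unsigned long watermark)
{
	if (s->hinted_pages >= s->free_pages)
		return false;	/* everything free has been hinted already */
	return s->free_pages - s->hinted_pages > watermark;
}
```

This captures why tracking hinted pages separately matters: free_pages alone cannot tell the caller how much of the free pool still needs hinting.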
Re: [PATCH v3 1/1] mm: introduce put_user_page*(), placeholder versions
On Tue, Mar 12, 2019 at 05:38:55PM -0700, John Hubbard wrote: > On 3/12/19 8:30 AM, Ira Weiny wrote: > > On Wed, Mar 06, 2019 at 03:54:55PM -0800, john.hubb...@gmail.com wrote: > > > From: John Hubbard > > > > > > Introduces put_user_page(), which simply calls put_page(). > > > This provides a way to update all get_user_pages*() callers, > > > so that they call put_user_page(), instead of put_page(). > > > > So I've been running with these patches for a while but today while ramping > > up > > my testing I hit the following: > > > > [ 1355.557819] [ cut here ] > > [ 1355.563436] get_user_pages pin count overflowed > > Hi Ira, > > Thanks for reporting this. That overflow, at face value, means that we've > used more than the 22 bits worth of gup pin counts, so about 4 million pins > of the same page... This is my bug in the patches I'm playing with. Somehow I'm causing more puts than gets... I'm not sure how but this is for sure my problem. Backing off to your patch set the numbers are good. Sorry for the noise. With the testing I've done today I feel comfortable adding Tested-by: Ira Weiny For the main GUP and InfiniBand patches. 
Ira > > > [ 1355.563446] WARNING: CPU: 1 PID: 1740 at mm/gup.c:73 > > get_gup_pin_page+0xa5/0xb0 > > [ 1355.577391] Modules linked in: ib_isert iscsi_target_mod ib_srpt > > target_core_mod ib_srp scsi_transpo > > rt_srp ext4 mbcache jbd2 mlx4_ib opa_vnic rpcrdma sunrpc rdma_ucm ib_iser > > rdma_cm ib_umad iw_cm libiscs > > i ib_ipoib scsi_transport_iscsi ib_cm sb_edac x86_pkg_temp_thermal > > intel_powerclamp coretemp kvm irqbyp > > ass snd_hda_codec_realtek ib_uverbs snd_hda_codec_generic crct10dif_pclmul > > ledtrig_audio snd_hda_intel > > crc32_pclmul snd_hda_codec snd_hda_core ghash_clmulni_intel snd_hwdep > > snd_pcm aesni_intel crypto_simd s > > nd_timer ib_core cryptd snd glue_helper dax_pmem soundcore nd_pmem ipmi_si > > device_dax nd_btt ioatdma nd > > _e820 ipmi_devintf ipmi_msghandler iTCO_wdt i2c_i801 iTCO_vendor_support > > libnvdimm pcspkr lpc_ich mei_m > > e mei mfd_core wmi pcc_cpufreq acpi_cpufreq sch_fq_codel xfs libcrc32c > > mlx4_en sr_mod cdrom sd_mod mgag > > 200 drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops mlx4_core > > ttm crc32c_intel igb isci ah > > ci dca libsas firewire_ohci drm i2c_algo_bit libahci scsi_transport_sas > > [ 1355.577429] firewire_core crc_itu_t i2c_core libata dm_mod [last > > unloaded: rdmavt] > > [ 1355.686703] CPU: 1 PID: 1740 Comm: reg-mr Not tainted 5.0.0+ #10 > > [ 1355.693851] Hardware name: Intel Corporation W2600CR/W2600CR, BIOS > > SE5C600.86B.02.04.0003.1023201411 > > 38 10/23/2014 > > [ 1355.705750] RIP: 0010:get_gup_pin_page+0xa5/0xb0 > > [ 1355.711348] Code: e8 40 02 ff ff 80 3d ba a2 fb 00 00 b8 b5 ff ff ff 75 > > bb 48 c7 c7 48 0a e9 81 89 4 > > 4 24 04 c6 05 a1 a2 fb 00 01 e8 35 63 e8 ff <0f> 0b 8b 44 24 04 eb 9c 0f 1f > > 00 66 66 66 66 90 41 57 49 > > bf 00 00 > > [ 1355.733244] RSP: 0018:c90005a23b30 EFLAGS: 00010286 > > [ 1355.739536] RAX: RBX: ea001422 RCX: > > > > [ 1355.748005] RDX: 0003 RSI: 827d94a3 RDI: > > 0246 > > [ 1355.756453] RBP: ea001422 R08: 0002 R09: > > 00022400 > > [ 
1355.764907] R10: 0009ccf0ad0c4203 R11: 0001 R12: > > 00010207 > > [ 1355.773369] R13: 8884130b7040 R14: fff00fff R15: > > 000fffe0 > > [ 1355.781836] FS: 7f2680d0d740() GS:88842e84() > > knlGS: > > [ 1355.791384] CS: 0010 DS: ES: CR0: 80050033 > > [ 1355.798319] CR2: 00589000 CR3: 00040b05e004 CR4: > > 000606e0 > > [ 1355.806809] Call Trace: > > [ 1355.810078] follow_page_pte+0x4f3/0x5c0 > > [ 1355.814987] __get_user_pages+0x1eb/0x730 > > [ 1355.820020] get_user_pages+0x3e/0x50 > > [ 1355.824657] ib_umem_get+0x283/0x500 [ib_uverbs] > > [ 1355.830340] ? _cond_resched+0x15/0x30 > > [ 1355.835065] mlx4_ib_reg_user_mr+0x75/0x1e0 [mlx4_ib] > > [ 1355.841235] ib_uverbs_reg_mr+0x10c/0x220 [ib_uverbs] > > [ 1355.847400] ib_uverbs_write+0x2f9/0x4d0 [ib_uverbs] > > [ 1355.853473] __vfs_write+0x36/0x1b0 > > [ 1355.857904] ? selinux_file_permission+0xf0/0x130 > > [ 1355.863702] ? security_file_permission+0x2e/0xe0 > > [ 1355.869503] vfs_write+0xa5/0x1a0 > > [ 1355.873751] ksys_write+0x4f/0xb0 > > [ 1355.878009] do_syscall_64+0x5b/0x180 > > [ 1355.882656] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > [ 1355.62] RIP: 0033:0x7f2680ec3ed8 > > [ 1355.893420] Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 > > f3 0f 1e fa 48 8d 05 45 78 0 > > d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f > > 1f 80 00 00 00 00 41 54 49 > > 89 d4 55 > > [ 1355.915573] RSP: 002b:7ffe65d50bc8 EFLAGS: 0246 ORIG_RAX: > > 0001 > > [ 1355.924621] RAX: ffda RBX: 7ffe65d50c74 RCX: > >
Re: [RFC PATCH 4/7] tracing/probe: Check event/group naming rule at parsing
On Wed, 13 Mar 2019 11:23:46 -0400 Steven Rostedt wrote: > On Thu, 14 Mar 2019 00:04:02 +0900 > Masami Hiramatsu wrote: > > > > > strlcpy(buf, event, slash - event + 1); > > > > + if (!is_good_name(buf)) { > > > > + pr_info("Group name must follow the rule of C > > > > identifier\n"); > > > > > > What do you mean by "C identifier"? > > > > I meant "the naming rules of C language identifiers". It means > > that the name has to start with alphabet or "_" and only contain > > alphanumeric characters and "_". Does it clear? > > I couldn't think of a good word. maybe "naming convention" > > does not fit... > > Ah, OK that now makes sense. > > > > > Would you have any idea? > > I think I was more confused by the way it was stated. What about saying: > > "Group names must follow the same rules as C identifiers" OK, I will update it. > Saying "the rule of C identifiers" made me think more of "the rule of > law", and thought that this had something to do with rules of C > zealots, and those that identify with "C" ;-) Oh, I got it. :-) Thank you! > > -- Steve -- Masami Hiramatsu
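The rule under discussion — group and event names must follow the same rules as C identifiers, i.e. `[A-Za-z_][A-Za-z0-9_]*` — matches what the kernel's is_good_name() helper checks. A minimal user-space sketch of that check (layout illustrative, not necessarily the kernel's exact source):

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>

/* Accept only names that follow the same rules as C identifiers:
 * first character alphabetic or '_', remainder alphanumeric or '_'. */
static bool is_good_name(const char *name)
{
	if (!isalpha((unsigned char)*name) && *name != '_')
		return false;
	while (*++name != '\0') {
		if (!isalnum((unsigned char)*name) && *name != '_')
			return false;
	}
	return true;
}
```

Under this check, "my_group" and "_event1" pass, while "1bad", "bad-name", and the empty string are rejected — exactly the cases the proposed error message is meant to explain.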
Re: [RFC PATCH 1/7] tracing/probe: Check maxactive error cases
On Wed, 13 Mar 2019 10:51:56 -0400 Steven Rostedt wrote: > On Wed, 13 Mar 2019 23:37:39 +0900 > Masami Hiramatsu wrote: > > > > > So now 'p1:..." will error out and not just be ignored? > > > > Yes, I think it is better to tell user "your command has a > > meaningless option, maybe you made a mistake" than ignore that. > > > > OK, just making sure. Hope nobody complains about it ;-) > > Are these OK to add for the next merge window, or do you want them in > now? I could probably get them ready for -rc1. Yes, I think [1/7] to [3/7] would be better to go to 5.1. Thank you, -- Masami Hiramatsu
Re: overlayfs vs. fscrypt
On Wednesday, March 13, 2019, 23:26:11 CET, Eric Biggers wrote: > On Wed, Mar 13, 2019 at 09:33:10PM +0100, Richard Weinberger wrote: > > On Wednesday, March 13, 2019, 15:26:54 CET, Amir Goldstein wrote: > > > IMO, the best thing for UBIFS to do would be to modify fscrypt to support > > > opting out of the revalidate behavior, IOW, sanitize your hack to an API. > > > > Given the WTF/s rate this thread has, this might be a good option. > > Actually people already asked me how to disable this feature because > > they saw no use of it. > > Being able to delete encrypted files looks good on the feature list but in > > reality it has very few users but causes confusion, IMHO. > > > > I propose a new fscrypt_operations flag, FS_CFLG_NO_CRYPT_FNAMES. > > If this flag is set, a) fscrypt_setup_filename() will return -EPERM if > > no key is found. > > And b) __fscrypt_prepare_lookup() will not attach fscrypt_d_ops to the > > dentry. > > > > Eric, what do you think? > > > > Thanks, > > //richard > > > > What specifically is wrong with supporting the ciphertext "view" of encrypted > directories, and why do you want to opt UBIFS out of it specifically but not > ext4 and f2fs? (The fscrypt_operations are per-filesystem type, not > per-filesystem instance, so I assume that's what you had in mind.) Note that > we > can't unconditionally remove it because people need it to delete files without > the key. We could add a mount option to disable it, but why exactly? You are right, fscrypt_operations is the wrong structure. My plan was having it per filesystem instance. So a mount-option seems like a good option. Of course for all filesystems that support fscrypt, not just UBIFS. Over the last year I've converted many embedded systems to fscrypt and it happened more than once that users, and more importantly, applications got confused that you can mount and walk the filesystem even if you don't have the key loaded yet. 
For them it felt more natural that you cannot even readdir if you don't have the key. In my opinion having such a mount option is useful to match these expectations. And it is also useful because you can enable only the features you actually need. On embedded systems that I have in mind you never delete files without having the key and since fscrypt is used for the whole filesystem you can just recreate it if you really lost the key. Thanks, //richard
Re: [PATCH 2/5] lib/sort: Use more efficient bottom-up heapsort variant
On 21/02/2019 09.21, George Spelvin wrote: > > +/** > + * parent - given the offset of the child, find the offset of the parent. > + * @i: the offset of the heap element whose parent is sought. Non-zero. > + * @lsbit: a precomputed 1-bit mask, equal to "size & -size" > + * @size: size of each element > + * > + * In terms of array indexes, the parent of element j = i/size is simply > + * (j-1)/2. But when working in byte offsets, we can't use implicit > + * truncation of integer divides. > + * > + * Fortunately, we only need one bit of the quotient, not the full divide. > + * size has a least significant bit. That bit will be clear if i is > + * an even multiple of size, and set if it's an odd multiple. > + * > + * Logically, we're doing "if (i & lsbit) i -= size;", but since the > + * branch is unpredictable, it's done with a bit of clever branch-free > + * code instead. > + */ > +__attribute_const__ __always_inline > +static size_t parent(size_t i, unsigned int lsbit, size_t size) > +{ > + i -= size; > + i -= size & -(i & lsbit); > + return i / 2; > +} > + Really nice :) I had to work through this by hand, but it's solid. > /** > * sort - sort an array of elements > * @base: pointer to data to sort > @@ -125,21 +151,26 @@ static void generic_swap(void *a, void *b, int size) > * @cmp_func: pointer to comparison function > * @swap_func: pointer to swap function or NULL > * > - * This function does a heapsort on the given array. You may provide a > - * swap_func function optimized to your element type. > + * This function does a heapsort on the given array. You may provide a > + * swap_func function if you need to do something more than a memory copy > + * (e.g. fix up pointers or auxiliary data), but the built-in swap isn't > + * usually a bottleneck. > * > * Sorting time is O(n log n) both on average and worst-case. 
While > * qsort is about 20% faster on average, it suffers from exploitable > * O(n*n) worst-case behavior and extra memory requirements that make > * it less suitable for kernel use. > */ > - > void sort(void *base, size_t num, size_t size, > int (*cmp_func)(const void *, const void *), > void (*swap_func)(void *, void *, int size)) > { > /* pre-scale counters for performance */ > - int i = (num/2 - 1) * size, n = num * size, c, r; > + size_t n = num * size, a = (num/2) * size; > + unsigned const lsbit = size & -size;/* Used to find parent */ > + Nit: qualifier before type, "const unsigned". And this sets ZF, so a paranoid check for zero size (cf. the other mail) by doing "if (!lsbit) return;" is practically free. Though it's probably a bit obscure doing it that way... > + if (!n) > + return; I'd make that n <= 1. Shouldn't be much more costly. > - } > - } > + /* > + * Loop invariants: > + * 1. elements [a,n) satisfy the heap property (compare greater than > + *all of their children), > + * 2. elements [n,num*size) are sorted, and > + * 3. a <= b <= c <= d <= n (whenever they are valid). > + */ > + for (;;) { > + size_t b, c, d; > > + if (a) /* Building heap: sift down --a */ > + a -= size; > + else if (n -= size) /* Sorting: Extract root to --n */ > + swap_func(base, base + n, size); > + else/* Sort complete */ > + break; > + > + /* > + * Sift element at "a" down into heap. This is the > + * "bottom-up" variant, which significantly reduces > + * calls to cmp_func(): we find the sift-down path all > + * the way to the leaves (one compare per level), then > + * backtrack to find where to insert the target element. > + * > + * Because elements tend to sift down close to the leaves, > + * this uses fewer compares than doing two per level > + * on the way down. (A bit more than half as many on > + * average, 3/4 worst-case.) > + */ > + for (b = a; c = 2*b + size, (d = c + size) < n;) > + b = cmp_func(base + c, base + d) >= 0 ? 
c : d; > + if (d == n) /* Special case last leaf with no sibling */ > + b = c; > + > + /* Now backtrack from "b" to the correct location for "a" */ > + while (b != a && cmp_func(base + a, base + b) >= 0) > + b = parent(b, lsbit, size); > + c = b; /* Where "a" belongs */ > + while (b != a) {/* Shift it into place */ > + b = parent(b, lsbit, size); > + swap_func(base + b, base + c, size); > } > } > } > Nice! Rasmus
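The branch-free parent() helper reviewed above can be checked in user space against the straightforward index formula. Below, parent() is copied from the patch, while parent_ref() is a reference version added here only for comparison (it is not part of the patch):

```c
#include <assert.h>
#include <stddef.h>

/* Branch-free parent offset from the patch: given a child's byte offset i
 * (a non-zero multiple of size) and lsbit == size & -size, return the byte
 * offset of its heap parent.  Logically "i -= size; if (i & lsbit) i -= size;
 * return i / 2;", written without the unpredictable branch. */
static size_t parent(size_t i, unsigned int lsbit, size_t size)
{
	i -= size;
	i -= size & -(i & lsbit);
	return i / 2;
}

/* Reference version working in element indexes: the parent of element j
 * is (j - 1) / 2, converted back to a byte offset. */
static size_t parent_ref(size_t i, size_t size)
{
	return ((i / size - 1) / 2) * size;
}
```

The subtraction of size before the masking step is what makes the single lsbit test sufficient: after "i -= size", the lsbit of i is set exactly when the child's element index was even, which is the case where one extra subtraction of size is needed before the halving.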
Re: overlayfs vs. fscrypt
On Wed, 2019-03-13 at 15:13 -0700, Eric Biggers wrote:
> On Wed, Mar 13, 2019 at 02:04:29PM -0700, James Bottomley wrote:
> > On Wed, 2019-03-13 at 13:25 -0700, Eric Biggers wrote:
> > > On Wed, Mar 13, 2019 at 01:06:06PM -0700, James Bottomley wrote:
> > > > On Wed, 2019-03-13 at 12:57 -0700, Eric Biggers wrote:
> > > > [...]
> > > > > fscrypt would allow the data to be stored encrypted on the
> > > > > local disk, so it's protected against offline compromise of
> > > > > the disk.
> > > >
> > > > Container images are essentially tars of the overlays. They
> > > > only become actual filesystems when instantiated at runtime.
> > > > The current encrypted container image is an overlay or set of
> > > > overlays which is tarred then encrypted. So to instantiate it
> > > > is decrypted then untarred.
> > > >
> > > > The thing I was wondering about was whether instead of a tar
> > > > encrypt we could instead produce an encrypted image from a
> > > > fscrypt filesystem.
> > >
> > > Why do you care whether the container image is encrypted on the
> > > local disk, when you're extracting it in plaintext onto the local
> > > disk anyway each time it runs? Even after the runtime files are
> > > "deleted", they may still be recoverable from the disk. Are you
> > > using shred and BLKSECDISCARD, and a non-COW filesystem?
> > >
> > > Now, if you wanted to avoid writing the plaintext to disk
> > > entirely (and thereby use encryption to actually achieve a useful
> > > security property that can't be achieved through file
> > > permissions), fscrypt is a good solution for that.
> >
> > OK, let's start with a cloud and container 101: A container is an
> > exactly transportable IaaS environment containing an application.
> > The format for the exact transport is the "container image" I've
> > been describing (layered tar file set deployed with overlays).
> > These images are usually stored in cloud based registries which
> > may or may not have useful access controls. I take it the reason
> > for image encryption to protect confidentiality within the
> > registry is obvious.
> >
> > Because of the exact transport, the deployment may be on my laptop,
> > on my test system or in some type of public or private cloud. In
> > all cases bar the laptop, I won't actually own the physical system
> > which ends up deploying the container. So in exchange for security
> > guarantees from the physical system owner, I agree to turn over my
> > decryption key and possibly a cash payment. One of these
> > guarantees is usually that they shred the key after use and that
> > they deploy a useful key escrow system like vault or keyprotect to
> > guard it even while the decryption is being done.
> >
> > Another is that all traces of the container be shredded after the
> > execution is finished.
>
> Well, sounds like that's not the case currently even with an
> encrypted container image, because the actual runtime files are not
> encrypted on disk.

Shredding means destroying all trace including in the on-disk image. However, one problem with the current implementation is there's a window between container run and container stop where the unencrypted files are in memory and on local disk. Access or cockup in that window can leak confidential data.

> Encrypting the runtime files using fscrypt with an ephemeral key
> would be useful here. IOW, randomly generate an encryption key when
> the container starts, never store it anywhere, and wipe it when the
> container stops.
>
> Note that this is separate from the container *image* encryption.

Actually, that was my original thought: it needn't be. If fscrypt can usefully add runtime security, then we could have the encrypted layer be simply an fscrypt image ... I presume without the key we can create a tar image of an fscrypt filesystem that is encrypted and would still be visible on untar if we did have the key? So the encrypted layer would be a tar of the fscrypt filesystem without the key.

> > considering is could I be protected against either cloud provider
> > cockups that might leak the image (the misconfigured backup
> > scenario I suggested) or malicious actions of other tenants.
>
> If the container image is encrypted with a key not on the system,
> then its confidentiality is protected from anything that may happen
> on that system.
>
> But if the container image encryption key *is* on the system, your
> container image may be leaked either accidentally or maliciously.

Well, yes, but that's like saying if you don't want to pick up a virus from your network, unplug it. We have to look at ways of deploying the filesystem and the key such that it's hard to exfiltrate ... which seems to be similar to your Android fscrypt use case.

James
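The ephemeral-key scheme Eric describes can be sketched in shell. This is a sketch under assumptions, not a tested recipe: it presumes root, a kernel with CONFIG_FS_ENCRYPTION, an ext4/f2fs filesystem with the "encrypt" feature enabled mounted at /mnt/scratch, and a recent fscryptctl providing add_key/set_policy/remove_key subcommands (v2 policies); all paths are illustrative.

```shell
# Ephemeral fscrypt key for a container's runtime directory (sketch).

# 1. Feed 64 random bytes straight into the kernel keyring for this
#    filesystem; the raw key never touches the disk. fscryptctl prints
#    a key identifier, which we keep only in a shell variable.
KEYID=$(head -c 64 /dev/urandom | fscryptctl add_key /mnt/scratch)

# 2. Bind an empty directory to that key: everything created beneath it
#    is transparently encrypted while the key is loaded.
mkdir /mnt/scratch/runtime
fscryptctl set_policy "$KEYID" /mnt/scratch/runtime

# ... run the container with /mnt/scratch/runtime as its writable layer ...

# 3. On container stop, evict the key. The directory's contents become
#    unreadable ciphertext immediately, even before deletion; removing
#    the files afterwards only discards ciphertext.
fscryptctl remove_key "$KEYID" /mnt/scratch
rm -rf /mnt/scratch/runtime
```

The final rm works without the key only because fscrypt supports operating on the ciphertext names of keyless directories, which is exactly the "ciphertext view" feature debated elsewhere in this thread.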
Re: [PATCH v2] PCI: qcom: Use default config space read function
On 13/03/2019 22:52, Srinivas Kandagatla wrote:
> On 13/03/2019 20:39, Marc Gonzalez wrote:
>
>> Could you pastebin the output of lspci -vvv -n in the working case?
>
> This was already in my original reply:
> working without patch : https://paste.ubuntu.com/p/TJm4hgjGW4/

I needed to see the numeric values, hence the -n flag ;-)

lspci -vvv -n

Regards.
Re: overlayfs vs. fscrypt
On Wed, Mar 13, 2019 at 09:33:10PM +0100, Richard Weinberger wrote:
> On Wednesday, 13 March 2019 at 15:26:54 CET, Amir Goldstein wrote:
> > IMO, the best thing for UBIFS to do would be to modify fscrypt to support
> > opting out of the revalidate behavior, IOW, sanitize your hack to an API.
>
> Given the WTF/s rate this thread has, this might be a good option.
> Actually people already asked me how to disable this feature because
> they saw no use of it.
> Being able to delete encrypted files looks good on the feature list, but in
> reality it has very few users and causes confusion, IMHO.
>
> I propose a new fscrypt_operations flag, FS_CFLG_NO_CRYPT_FNAMES.
> If this flag is set, a) fscrypt_setup_filename() will return -EPERM if
> no key is found.
> And b) __fscrypt_prepare_lookup() will not attach fscrypt_d_ops to the dentry.
>
> Eric, what do you think?
>
> Thanks,
> //richard

What specifically is wrong with supporting the ciphertext "view" of encrypted directories, and why do you want to opt UBIFS out of it specifically but not ext4 and f2fs? (The fscrypt_operations are per-filesystem type, not per-filesystem instance, so I assume that's what you had in mind.)

Note that we can't unconditionally remove it because people need it to delete files without the key. We could add a mount option to disable it, but why exactly?

By the way, I suggest that people read Documentation/filesystems/fscrypt.rst for more information about what fscrypt is supposed to do, as there seems to be a lot of misconceptions.

- Eric
[PATCH v5 05/12] mtd: rawnand: ingenic: Use SPDX license notifiers
Use SPDX license notifiers instead of GPLv2 license text in the headers. Signed-off-by: Paul Cercueil Reviewed-by: Boris Brezillon --- Notes: v2: No changes v3: No changes v4: No changes v5: No changes drivers/mtd/nand/raw/ingenic/jz4780_bch.c | 5 + drivers/mtd/nand/raw/ingenic/jz4780_bch.h | 5 + drivers/mtd/nand/raw/ingenic/jz4780_nand.c | 5 + 3 files changed, 3 insertions(+), 12 deletions(-) diff --git a/drivers/mtd/nand/raw/ingenic/jz4780_bch.c b/drivers/mtd/nand/raw/ingenic/jz4780_bch.c index c5f74ed85862..fdf483a49f7b 100644 --- a/drivers/mtd/nand/raw/ingenic/jz4780_bch.c +++ b/drivers/mtd/nand/raw/ingenic/jz4780_bch.c @@ -1,12 +1,9 @@ +// SPDX-License-Identifier: GPL-2.0 /* * JZ4780 BCH controller * * Copyright (c) 2015 Imagination Technologies * Author: Alex Smith - * - * This program is free software; you can redistribute it and/or modify it - * under the terms of the GNU General Public License version 2 as published - * by the Free Software Foundation. */ #include diff --git a/drivers/mtd/nand/raw/ingenic/jz4780_bch.h b/drivers/mtd/nand/raw/ingenic/jz4780_bch.h index bf4718088a3a..451e0c770160 100644 --- a/drivers/mtd/nand/raw/ingenic/jz4780_bch.h +++ b/drivers/mtd/nand/raw/ingenic/jz4780_bch.h @@ -1,12 +1,9 @@ +/* SPDX-License-Identifier: GPL-2.0 */ /* * JZ4780 BCH controller * * Copyright (c) 2015 Imagination Technologies * Author: Alex Smith - * - * This program is free software; you can redistribute it and/or modify it - * under the terms of the GNU General Public License version 2 as published - * by the Free Software Foundation. 
*/ #ifndef __DRIVERS_MTD_NAND_JZ4780_BCH_H__ diff --git a/drivers/mtd/nand/raw/ingenic/jz4780_nand.c b/drivers/mtd/nand/raw/ingenic/jz4780_nand.c index 22e58975f0d5..7f55358b860f 100644 --- a/drivers/mtd/nand/raw/ingenic/jz4780_nand.c +++ b/drivers/mtd/nand/raw/ingenic/jz4780_nand.c @@ -1,12 +1,9 @@ +// SPDX-License-Identifier: GPL-2.0 /* * JZ4780 NAND driver * * Copyright (c) 2015 Imagination Technologies * Author: Alex Smith - * - * This program is free software; you can redistribute it and/or modify it - * under the terms of the GNU General Public License version 2 as published - * by the Free Software Foundation. */ #include -- 2.11.0
[PATCH v5 07/12] mtd: rawnand: ingenic: Rename jz4780_bch_init to jz4780_bch_reset
The jz4780_bch_init name was confusing, as it suggested that its content should be executed once at init time, whereas what the function really does is reset the hardware for a new ECC operation. Signed-off-by: Paul Cercueil --- Notes: v5: New patch drivers/mtd/nand/raw/ingenic/jz4780_bch.c | 9 + 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/mtd/nand/raw/ingenic/jz4780_bch.c b/drivers/mtd/nand/raw/ingenic/jz4780_bch.c index fdf483a49f7b..7635753bd7ea 100644 --- a/drivers/mtd/nand/raw/ingenic/jz4780_bch.c +++ b/drivers/mtd/nand/raw/ingenic/jz4780_bch.c @@ -69,8 +69,8 @@ struct jz4780_bch { struct mutex lock; }; -static void jz4780_bch_init(struct jz4780_bch *bch, - struct jz4780_bch_params *params, bool encode) +static void jz4780_bch_reset(struct jz4780_bch *bch, +struct jz4780_bch_params *params, bool encode) { u32 reg; @@ -183,7 +183,8 @@ int jz4780_bch_calculate(struct jz4780_bch *bch, struct jz4780_bch_params *param int ret = 0; mutex_lock(>lock); - jz4780_bch_init(bch, params, true); + + jz4780_bch_reset(bch, params, true); jz4780_bch_write_data(bch, buf, params->size); if (jz4780_bch_wait_complete(bch, BCH_BHINT_ENCF, NULL)) { @@ -220,7 +221,7 @@ int jz4780_bch_correct(struct jz4780_bch *bch, struct jz4780_bch_params *params, mutex_lock(>lock); - jz4780_bch_init(bch, params, false); + jz4780_bch_reset(bch, params, false); jz4780_bch_write_data(bch, buf, params->size); jz4780_bch_write_data(bch, ecc_code, params->bytes); -- 2.11.0
[PATCH v5 03/12] dt-bindings: mtd: ingenic: Use standard ecc-engine property
The 'ingenic,bch-controller' property is now deprecated and the 'ecc-engine' property should be used instead. Signed-off-by: Paul Cercueil --- Notes: v5: New patch Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt b/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt index 5a45cc54f46d..c02259353327 100644 --- a/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt +++ b/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt @@ -14,7 +14,7 @@ Required NAND controller device properties: an offset of 0 and a size of 0x100 (i.e. the whole NEMC bank). Optional NAND controller device properties: -- ingenic,bch-controller: To make use of the hardware ECC controller, this +- ecc-engine: To make use of the hardware ECC controller, this property must contain a phandle for the ECC controller node. The required properties for this node are described below. If this is not specified, software ECC will be used instead. @@ -48,7 +48,7 @@ nemc: nemc@1341 { #address-cells = <1>; #size-cells = <0>; - ingenic,bch-controller = <>; + ecc-engine = <>; nand@1 { reg = <1>; -- 2.11.0
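In a board devicetree, the change amounts to renaming the phandle property on the NAND controller node. A sketch mirroring the binding's example (the bch label and unit address are illustrative):

```
nemc: nemc@13410000 {
	#address-cells = <1>;
	#size-cells = <0>;

	/* Preferred standard property; the old "ingenic,bch-controller"
	 * spelling is deprecated but points at the same controller node. */
	ecc-engine = <&bch>;

	nand@1 {
		reg = <1>;
	};
};
```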
[PATCH v5 12/12] mtd: rawnand: ingenic: Add ooblayout for the Qi Ben Nanonote
The Ben Nanonote from Qi Hardware expects a specific OOB layout on its NAND. Signed-off-by: Paul Cercueil --- Notes: v2: New patch v3: Use the qi,lb60 layout unconditionally if we detect that we're running on that board. v4: No change v5: No change drivers/mtd/nand/raw/ingenic/ingenic_nand.c | 41 - 1 file changed, 40 insertions(+), 1 deletion(-) diff --git a/drivers/mtd/nand/raw/ingenic/ingenic_nand.c b/drivers/mtd/nand/raw/ingenic/ingenic_nand.c index a2450e10932a..d32d68ef0dc7 100644 --- a/drivers/mtd/nand/raw/ingenic/ingenic_nand.c +++ b/drivers/mtd/nand/raw/ingenic/ingenic_nand.c @@ -72,6 +72,41 @@ static inline struct ingenic_nfc *to_ingenic_nfc(struct nand_controller *ctrl) return container_of(ctrl, struct ingenic_nfc, controller); } +static int qi_lb60_ooblayout_ecc(struct mtd_info *mtd, int section, +struct mtd_oob_region *oobregion) +{ + struct nand_chip *chip = mtd_to_nand(mtd); + struct nand_ecc_ctrl *ecc = >ecc; + + if (section || !ecc->total) + return -ERANGE; + + oobregion->length = ecc->total; + oobregion->offset = 12; + + return 0; +} + +static int qi_lb60_ooblayout_free(struct mtd_info *mtd, int section, + struct mtd_oob_region *oobregion) +{ + struct nand_chip *chip = mtd_to_nand(mtd); + struct nand_ecc_ctrl *ecc = >ecc; + + if (section) + return -ERANGE; + + oobregion->length = mtd->oobsize - ecc->total - 12; + oobregion->offset = 12 + ecc->total; + + return 0; +} + +const struct mtd_ooblayout_ops qi_lb60_ooblayout_ops = { + .ecc = qi_lb60_ooblayout_ecc, + .free = qi_lb60_ooblayout_free, +}; + static int jz4725b_ooblayout_ecc(struct mtd_info *mtd, int section, struct mtd_oob_region *oobregion) { @@ -247,7 +282,11 @@ static int ingenic_nand_attach_chip(struct nand_chip *chip) return -EINVAL; } - mtd_set_ooblayout(mtd, nfc->soc_info->oob_layout); + /* For legacy reasons we use a different layout on the qi,lb60 board. 
*/ + if (of_machine_is_compatible("qi,lb60")) + mtd_set_ooblayout(mtd, _lb60_ooblayout_ops); + else + mtd_set_ooblayout(mtd, nfc->soc_info->oob_layout); return 0; } -- 2.11.0
[PATCH v5 11/12] mtd: rawnand: ingenic: Add support for the JZ4725B
The boot ROM of the JZ4725B SoC expects a specific OOB layout on the NAND, so we use it unconditionally in the ingenic-nand driver. Also add the jz4725b-bch driver to support the JZ4725B-specific BCH hardware. Signed-off-by: Paul Cercueil --- Notes: v2: Instead of forcing the OOB layout, leave it to the board code or devicetree to decide if the jz4725b-specific layout should be used or not. v3: - Revert the change in v2, as the previous behaviour was correct. - Also add support for the hardware BCH of the JZ4725B in this patch. v4: - Add MODULE_* macros - Add tweaks suggested by upstream feedback v5: - Rename jz4740_ecc_init to jz4740_ecc_reset drivers/mtd/nand/raw/ingenic/Kconfig| 10 + drivers/mtd/nand/raw/ingenic/Makefile | 1 + drivers/mtd/nand/raw/ingenic/ingenic_nand.c | 48 - drivers/mtd/nand/raw/ingenic/jz4725b_bch.c | 293 4 files changed, 351 insertions(+), 1 deletion(-) create mode 100644 drivers/mtd/nand/raw/ingenic/jz4725b_bch.c diff --git a/drivers/mtd/nand/raw/ingenic/Kconfig b/drivers/mtd/nand/raw/ingenic/Kconfig index cc663cc15119..497b46b144f2 100644 --- a/drivers/mtd/nand/raw/ingenic/Kconfig +++ b/drivers/mtd/nand/raw/ingenic/Kconfig @@ -27,6 +27,16 @@ config MTD_NAND_JZ4740_ECC This driver can also be built as a module. If so, the module will be called jz4740-ecc. +config MTD_NAND_JZ4725B_BCH + tristate "Hardware BCH support for JZ4725B SoC" + select MTD_NAND_INGENIC_ECC + help + Enable this driver to support the BCH error-correction hardware + present on the JZ4725B SoC from Ingenic. + + This driver can also be built as a module. If so, the module + will be called jz4725b-bch. 
+ config MTD_NAND_JZ4780_BCH tristate "Hardware BCH support for JZ4780 SoC" select MTD_NAND_INGENIC_ECC diff --git a/drivers/mtd/nand/raw/ingenic/Makefile b/drivers/mtd/nand/raw/ingenic/Makefile index 563b7effcf59..ab2c5f47e5b7 100644 --- a/drivers/mtd/nand/raw/ingenic/Makefile +++ b/drivers/mtd/nand/raw/ingenic/Makefile @@ -3,4 +3,5 @@ obj-$(CONFIG_MTD_NAND_JZ4780) += ingenic_nand.o obj-$(CONFIG_MTD_NAND_INGENIC_ECC) += ingenic_ecc.o obj-$(CONFIG_MTD_NAND_JZ4740_ECC) += jz4740_ecc.o +obj-$(CONFIG_MTD_NAND_JZ4725B_BCH) += jz4725b_bch.o obj-$(CONFIG_MTD_NAND_JZ4780_BCH) += jz4780_bch.o diff --git a/drivers/mtd/nand/raw/ingenic/ingenic_nand.c b/drivers/mtd/nand/raw/ingenic/ingenic_nand.c index 1bd1f18d4532..a2450e10932a 100644 --- a/drivers/mtd/nand/raw/ingenic/ingenic_nand.c +++ b/drivers/mtd/nand/raw/ingenic/ingenic_nand.c @@ -34,6 +34,7 @@ struct jz_soc_info { unsigned long data_offset; unsigned long addr_offset; unsigned long cmd_offset; + const struct mtd_ooblayout_ops *oob_layout; }; struct ingenic_nand_cs { @@ -71,6 +72,41 @@ static inline struct ingenic_nfc *to_ingenic_nfc(struct nand_controller *ctrl) return container_of(ctrl, struct ingenic_nfc, controller); } +static int jz4725b_ooblayout_ecc(struct mtd_info *mtd, int section, +struct mtd_oob_region *oobregion) +{ + struct nand_chip *chip = mtd_to_nand(mtd); + struct nand_ecc_ctrl *ecc = >ecc; + + if (section || !ecc->total) + return -ERANGE; + + oobregion->length = ecc->total; + oobregion->offset = 3; + + return 0; +} + +static int jz4725b_ooblayout_free(struct mtd_info *mtd, int section, + struct mtd_oob_region *oobregion) +{ + struct nand_chip *chip = mtd_to_nand(mtd); + struct nand_ecc_ctrl *ecc = >ecc; + + if (section) + return -ERANGE; + + oobregion->length = mtd->oobsize - ecc->total - 3; + oobregion->offset = 3 + ecc->total; + + return 0; +} + +const struct mtd_ooblayout_ops jz4725b_ooblayout_ops = { + .ecc = jz4725b_ooblayout_ecc, + .free = jz4725b_ooblayout_free, +}; + static void 
ingenic_nand_select_chip(struct nand_chip *chip, int chipnr) { struct ingenic_nand *nand = to_ingenic_nand(nand_to_mtd(chip)); @@ -211,7 +247,7 @@ static int ingenic_nand_attach_chip(struct nand_chip *chip) return -EINVAL; } - mtd_set_ooblayout(mtd, _ooblayout_lp_ops); + mtd_set_ooblayout(mtd, nfc->soc_info->oob_layout); return 0; } @@ -407,16 +443,26 @@ static const struct jz_soc_info jz4740_soc_info = { .data_offset = 0x, .cmd_offset = 0x8000, .addr_offset = 0x0001, + .oob_layout = _ooblayout_lp_ops, +}; + +static const struct jz_soc_info jz4725b_soc_info = { + .data_offset = 0x, + .cmd_offset = 0x8000, + .addr_offset = 0x0001, + .oob_layout = _ooblayout_ops, }; static const struct jz_soc_info jz4780_soc_info = { .data_offset = 0x, .cmd_offset =
[PATCH v5 10/12] mtd: rawnand: ingenic: Add support for the JZ4740
Add support for probing the ingenic-nand driver on the JZ4740 SoC from Ingenic, and the jz4740-ecc driver to support the JZ4740-specific ECC hardware. Signed-off-by: Paul Cercueil --- Notes: v2: New patch v3: Also add support for the hardware ECC of the JZ4740 in this patch v4: - Fix formatting issues - Add MODULE_* macros v5: - Rename jz4740_ecc_init to jz4740_ecc_reset - Fix comments blocks not following the kernel style drivers/mtd/nand/raw/ingenic/Kconfig| 10 ++ drivers/mtd/nand/raw/ingenic/Makefile | 1 + drivers/mtd/nand/raw/ingenic/ingenic_nand.c | 48 +-- drivers/mtd/nand/raw/ingenic/jz4740_ecc.c | 197 4 files changed, 245 insertions(+), 11 deletions(-) create mode 100644 drivers/mtd/nand/raw/ingenic/jz4740_ecc.c diff --git a/drivers/mtd/nand/raw/ingenic/Kconfig b/drivers/mtd/nand/raw/ingenic/Kconfig index 4bf7d7af3b0a..cc663cc15119 100644 --- a/drivers/mtd/nand/raw/ingenic/Kconfig +++ b/drivers/mtd/nand/raw/ingenic/Kconfig @@ -17,6 +17,16 @@ if MTD_NAND_JZ4780 config MTD_NAND_INGENIC_ECC tristate +config MTD_NAND_JZ4740_ECC + tristate "Hardware BCH support for JZ4740 SoC" + select MTD_NAND_INGENIC_ECC + help + Enable this driver to support the Reed-Solomon error-correction + hardware present on the JZ4740 SoC from Ingenic. + + This driver can also be built as a module. If so, the module + will be called jz4740-ecc. 
+ config MTD_NAND_JZ4780_BCH tristate "Hardware BCH support for JZ4780 SoC" select MTD_NAND_INGENIC_ECC diff --git a/drivers/mtd/nand/raw/ingenic/Makefile b/drivers/mtd/nand/raw/ingenic/Makefile index f3c3c0f230b0..563b7effcf59 100644 --- a/drivers/mtd/nand/raw/ingenic/Makefile +++ b/drivers/mtd/nand/raw/ingenic/Makefile @@ -2,4 +2,5 @@ obj-$(CONFIG_MTD_NAND_JZ4740) += jz4740_nand.o obj-$(CONFIG_MTD_NAND_JZ4780) += ingenic_nand.o obj-$(CONFIG_MTD_NAND_INGENIC_ECC) += ingenic_ecc.o +obj-$(CONFIG_MTD_NAND_JZ4740_ECC) += jz4740_ecc.o obj-$(CONFIG_MTD_NAND_JZ4780_BCH) += jz4780_bch.o diff --git a/drivers/mtd/nand/raw/ingenic/ingenic_nand.c b/drivers/mtd/nand/raw/ingenic/ingenic_nand.c index a5f2f21c5198..1bd1f18d4532 100644 --- a/drivers/mtd/nand/raw/ingenic/ingenic_nand.c +++ b/drivers/mtd/nand/raw/ingenic/ingenic_nand.c @@ -13,6 +13,7 @@ #include #include #include +#include #include #include #include @@ -26,13 +27,15 @@ #define DRV_NAME "ingenic-nand" -#define OFFSET_DATA0x -#define OFFSET_CMD 0x0040 -#define OFFSET_ADDR0x0080 - /* Command delay when there is no R/B pin. 
*/ #define RB_DELAY_US100 +struct jz_soc_info { + unsigned long data_offset; + unsigned long addr_offset; + unsigned long cmd_offset; +}; + struct ingenic_nand_cs { unsigned int bank; void __iomem *base; @@ -41,6 +44,7 @@ struct ingenic_nand_cs { struct ingenic_nfc { struct device *dev; struct ingenic_ecc *ecc; + const struct jz_soc_info *soc_info; struct nand_controller controller; unsigned int num_banks; struct list_head chips; @@ -100,9 +104,9 @@ static void ingenic_nand_cmd_ctrl(struct nand_chip *chip, int cmd, return; if (ctrl & NAND_ALE) - writeb(cmd, cs->base + OFFSET_ADDR); + writeb(cmd, cs->base + nfc->soc_info->addr_offset); else if (ctrl & NAND_CLE) - writeb(cmd, cs->base + OFFSET_CMD); + writeb(cmd, cs->base + nfc->soc_info->cmd_offset); } static int ingenic_nand_dev_ready(struct nand_chip *chip) @@ -160,8 +164,13 @@ static int ingenic_nand_attach_chip(struct nand_chip *chip) struct ingenic_nfc *nfc = to_ingenic_nfc(chip->controller); int eccbytes; - chip->ecc.bytes = fls((1 + 8) * chip->ecc.size) * - (chip->ecc.strength / 8); + if (chip->ecc.strength == 4) { + /* JZ4740 uses 9 bytes of ECC to correct maximum 4 errors */ + chip->ecc.bytes = 9; + } else { + chip->ecc.bytes = fls((1 + 8) * chip->ecc.size) * + (chip->ecc.strength / 8); + } switch (chip->ecc.mode) { case NAND_ECC_HW: @@ -270,8 +279,8 @@ static int ingenic_nand_init_chip(struct platform_device *pdev, return -ENOMEM; mtd->dev.parent = dev; - chip->legacy.IO_ADDR_R = cs->base + OFFSET_DATA; - chip->legacy.IO_ADDR_W = cs->base + OFFSET_DATA; + chip->legacy.IO_ADDR_R = cs->base + nfc->soc_info->data_offset; + chip->legacy.IO_ADDR_W = cs->base + nfc->soc_info->data_offset; chip->legacy.chip_delay = RB_DELAY_US; chip->options = NAND_NO_SUBPAGE_WRITE; chip->legacy.select_chip = ingenic_nand_select_chip; @@ -353,6 +362,10 @@ static int ingenic_nand_probe(struct platform_device *pdev) if
[PATCH v5 06/12] mtd: rawnand: ingenic: Rename jz4780_nand driver to ingenic_nand
The jz4780_nand driver will be modified to handle all the Ingenic JZ47xx SoCs that the upstream Linux kernel supports (JZ4740, JZ4725B, JZ4770, JZ4780), so it makes sense to rename it. Signed-off-by: Paul Cercueil --- Notes: v3: New patch v4: No changes v5: No changes drivers/mtd/nand/raw/ingenic/Makefile | 2 +- .../raw/ingenic/{jz4780_nand.c => ingenic_nand.c} | 146 ++--- 2 files changed, 74 insertions(+), 74 deletions(-) rename drivers/mtd/nand/raw/ingenic/{jz4780_nand.c => ingenic_nand.c} (63%) diff --git a/drivers/mtd/nand/raw/ingenic/Makefile b/drivers/mtd/nand/raw/ingenic/Makefile index 44c2ca053d24..af2d5de21f60 100644 --- a/drivers/mtd/nand/raw/ingenic/Makefile +++ b/drivers/mtd/nand/raw/ingenic/Makefile @@ -1,2 +1,2 @@ obj-$(CONFIG_MTD_NAND_JZ4740) += jz4740_nand.o -obj-$(CONFIG_MTD_NAND_JZ4780) += jz4780_nand.o jz4780_bch.o +obj-$(CONFIG_MTD_NAND_JZ4780) += ingenic_nand.o jz4780_bch.o diff --git a/drivers/mtd/nand/raw/ingenic/jz4780_nand.c b/drivers/mtd/nand/raw/ingenic/ingenic_nand.c similarity index 63% rename from drivers/mtd/nand/raw/ingenic/jz4780_nand.c rename to drivers/mtd/nand/raw/ingenic/ingenic_nand.c index 7f55358b860f..8c73f7c5be9a 100644 --- a/drivers/mtd/nand/raw/ingenic/jz4780_nand.c +++ b/drivers/mtd/nand/raw/ingenic/ingenic_nand.c @@ -1,6 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 /* - * JZ4780 NAND driver + * Ingenic JZ47xx NAND driver * * Copyright (c) 2015 Imagination Technologies * Author: Alex Smith @@ -24,7 +24,7 @@ #include "jz4780_bch.h" -#define DRV_NAME "jz4780-nand" +#define DRV_NAME "ingenic-nand" #define OFFSET_DATA0x #define OFFSET_CMD 0x0040 @@ -33,22 +33,22 @@ /* Command delay when there is no R/B pin. 
*/ #define RB_DELAY_US100 -struct jz4780_nand_cs { +struct ingenic_nand_cs { unsigned int bank; void __iomem *base; }; -struct jz4780_nand_controller { +struct ingenic_nfc { struct device *dev; struct jz4780_bch *bch; struct nand_controller controller; unsigned int num_banks; struct list_head chips; int selected; - struct jz4780_nand_cs cs[]; + struct ingenic_nand_cs cs[]; }; -struct jz4780_nand_chip { +struct ingenic_nand { struct nand_chip chip; struct list_head chip_list; @@ -57,22 +57,21 @@ struct jz4780_nand_chip { unsigned int reading: 1; }; -static inline struct jz4780_nand_chip *to_jz4780_nand_chip(struct mtd_info *mtd) +static inline struct ingenic_nand *to_ingenic_nand(struct mtd_info *mtd) { - return container_of(mtd_to_nand(mtd), struct jz4780_nand_chip, chip); + return container_of(mtd_to_nand(mtd), struct ingenic_nand, chip); } -static inline struct jz4780_nand_controller -*to_jz4780_nand_controller(struct nand_controller *ctrl) +static inline struct ingenic_nfc *to_ingenic_nfc(struct nand_controller *ctrl) { - return container_of(ctrl, struct jz4780_nand_controller, controller); + return container_of(ctrl, struct ingenic_nfc, controller); } -static void jz4780_nand_select_chip(struct nand_chip *chip, int chipnr) +static void ingenic_nand_select_chip(struct nand_chip *chip, int chipnr) { - struct jz4780_nand_chip *nand = to_jz4780_nand_chip(nand_to_mtd(chip)); - struct jz4780_nand_controller *nfc = to_jz4780_nand_controller(nand->chip.controller); - struct jz4780_nand_cs *cs; + struct ingenic_nand *nand = to_ingenic_nand(nand_to_mtd(chip)); + struct ingenic_nfc *nfc = to_ingenic_nfc(nand->chip.controller); + struct ingenic_nand_cs *cs; /* Ensure the currently selected chip is deasserted. 
*/ if (chipnr == -1 && nfc->selected >= 0) { @@ -83,12 +82,12 @@ static void jz4780_nand_select_chip(struct nand_chip *chip, int chipnr) nfc->selected = chipnr; } -static void jz4780_nand_cmd_ctrl(struct nand_chip *chip, int cmd, -unsigned int ctrl) +static void ingenic_nand_cmd_ctrl(struct nand_chip *chip, int cmd, + unsigned int ctrl) { - struct jz4780_nand_chip *nand = to_jz4780_nand_chip(nand_to_mtd(chip)); - struct jz4780_nand_controller *nfc = to_jz4780_nand_controller(nand->chip.controller); - struct jz4780_nand_cs *cs; + struct ingenic_nand *nand = to_ingenic_nand(nand_to_mtd(chip)); + struct ingenic_nfc *nfc = to_ingenic_nfc(nand->chip.controller); + struct ingenic_nand_cs *cs; if (WARN_ON(nfc->selected < 0)) return; @@ -106,25 +105,25 @@ static void jz4780_nand_cmd_ctrl(struct nand_chip *chip, int cmd, writeb(cmd, cs->base + OFFSET_CMD); } -static int jz4780_nand_dev_ready(struct nand_chip *chip) +static int ingenic_nand_dev_ready(struct nand_chip *chip) { - struct jz4780_nand_chip *nand = to_jz4780_nand_chip(nand_to_mtd(chip)); + struct ingenic_nand *nand =
[PATCH v5 02/12] dt-bindings: mtd: ingenic: Change 'BCH' to 'ECC' in documentation
The JZ4740 ECC hardware is not BCH but Reed-Solomon, so it makes more sense to use the more generic ECC term. Signed-off-by: Paul Cercueil Reviewed-by: Rob Herring --- Notes: v3: New patch v4: No change v5: No change .../devicetree/bindings/mtd/ingenic,jz4780-nand.txt| 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt b/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt index a5b940f18bf6..5a45cc54f46d 100644 --- a/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt +++ b/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt @@ -1,4 +1,4 @@ -* Ingenic JZ4780 NAND/BCH +* Ingenic JZ4780 NAND/ECC This file documents the device tree bindings for NAND flash devices on the JZ4780. NAND devices are connected to the NEMC controller (described in @@ -14,10 +14,10 @@ Required NAND controller device properties: an offset of 0 and a size of 0x100 (i.e. the whole NEMC bank). Optional NAND controller device properties: -- ingenic,bch-controller: To make use of the hardware BCH controller, this - property must contain a phandle for the BCH controller node. The required +- ingenic,bch-controller: To make use of the hardware ECC controller, this + property must contain a phandle for the ECC controller node. The required properties for this node are described below. If this is not specified, - software BCH will be used instead. + software ECC will be used instead. Optional children nodes: - Individual NAND chips are children of the NAND controller node. @@ -70,17 +70,17 @@ nemc: nemc@1341 { }; }; -The BCH controller is a separate SoC component used for error correction on +The ECC controller is a separate SoC component used for error correction on NAND devices. The following is a description of the device properties for a -BCH controller. +ECC controller. 
-Required BCH properties: +Required ECC properties: - compatible: Should be one of: * ingenic,jz4740-ecc * ingenic,jz4725b-bch * ingenic,jz4780-bch -- reg: Should specify the BCH controller registers location and length. -- clocks: Clock for the BCH controller. +- reg: Should specify the ECC controller registers location and length. +- clocks: Clock for the ECC controller. Example: -- 2.11.0
[PATCH v5 08/12] mtd: rawnand: ingenic: Separate top-level and SoC specific code
The ingenic-nand driver uses an API provided by the jz4780-bch driver. This makes it difficult to support other SoCs in the jz4780-bch driver. To work around this, we separate the API functions from the SoC-specific code, so that these API functions are SoC-agnostic. Signed-off-by: Paul Cercueil --- Notes: v2: Add an optional .probe() callback. It is used for instance to set the clock rate in the JZ4780 backend. v3: The common code is now inside the ingenic-ecc module. Each SoC-specific ECC code is now in its own module, which leaves to the user the choice of which (if any) ECC code should be supported. v4: No change v5: Remove .probe callback. Instead, call ingenic_ecc_probe from jz4780_bch_probe and set the clock rate afterwards. drivers/mtd/nand/raw/ingenic/Kconfig| 17 +++ drivers/mtd/nand/raw/ingenic/Makefile | 5 +- drivers/mtd/nand/raw/ingenic/ingenic_ecc.c | 153 + drivers/mtd/nand/raw/ingenic/ingenic_ecc.h | 83 ++ drivers/mtd/nand/raw/ingenic/ingenic_nand.c | 40 +++ drivers/mtd/nand/raw/ingenic/jz4780_bch.c | 172 +--- drivers/mtd/nand/raw/ingenic/jz4780_bch.h | 40 --- 7 files changed, 307 insertions(+), 203 deletions(-) create mode 100644 drivers/mtd/nand/raw/ingenic/ingenic_ecc.c create mode 100644 drivers/mtd/nand/raw/ingenic/ingenic_ecc.h delete mode 100644 drivers/mtd/nand/raw/ingenic/jz4780_bch.h diff --git a/drivers/mtd/nand/raw/ingenic/Kconfig b/drivers/mtd/nand/raw/ingenic/Kconfig index 67806c87b2c4..4bf7d7af3b0a 100644 --- a/drivers/mtd/nand/raw/ingenic/Kconfig +++ b/drivers/mtd/nand/raw/ingenic/Kconfig @@ -11,3 +11,20 @@ config MTD_NAND_JZ4780 help Enables support for NAND Flash connected to the NEMC on JZ4780 SoC based boards, using the BCH controller for hardware error correction. 
+ +if MTD_NAND_JZ4780 + +config MTD_NAND_INGENIC_ECC + tristate + +config MTD_NAND_JZ4780_BCH + tristate "Hardware BCH support for JZ4780 SoC" + select MTD_NAND_INGENIC_ECC + help + Enable this driver to support the BCH error-correction hardware + present on the JZ4780 SoC from Ingenic. + + This driver can also be built as a module. If so, the module + will be called jz4780-bch. + +endif # MTD_NAND_JZ4780 diff --git a/drivers/mtd/nand/raw/ingenic/Makefile b/drivers/mtd/nand/raw/ingenic/Makefile index af2d5de21f60..f3c3c0f230b0 100644 --- a/drivers/mtd/nand/raw/ingenic/Makefile +++ b/drivers/mtd/nand/raw/ingenic/Makefile @@ -1,2 +1,5 @@ obj-$(CONFIG_MTD_NAND_JZ4740) += jz4740_nand.o -obj-$(CONFIG_MTD_NAND_JZ4780) += ingenic_nand.o jz4780_bch.o +obj-$(CONFIG_MTD_NAND_JZ4780) += ingenic_nand.o + +obj-$(CONFIG_MTD_NAND_INGENIC_ECC) += ingenic_ecc.o +obj-$(CONFIG_MTD_NAND_JZ4780_BCH) += jz4780_bch.o diff --git a/drivers/mtd/nand/raw/ingenic/ingenic_ecc.c b/drivers/mtd/nand/raw/ingenic/ingenic_ecc.c new file mode 100644 index ..d7f3a8c3abea --- /dev/null +++ b/drivers/mtd/nand/raw/ingenic/ingenic_ecc.c @@ -0,0 +1,153 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * JZ47xx ECC common code + * + * Copyright (c) 2015 Imagination Technologies + * Author: Alex Smith + */ + +#include +#include +#include +#include + +#include "ingenic_ecc.h" + +/** + * ingenic_ecc_calculate() - calculate ECC for a data buffer + * @ecc: ECC device. + * @params: ECC parameters. + * @buf: input buffer with raw data. + * @ecc_code: output buffer with ECC. + * + * Return: 0 on success, -ETIMEDOUT if timed out while waiting for ECC + * controller. + */ +int ingenic_ecc_calculate(struct ingenic_ecc *ecc, + struct ingenic_ecc_params *params, + const u8 *buf, u8 *ecc_code) +{ + return ecc->ops->calculate(ecc, params, buf, ecc_code); +} +EXPORT_SYMBOL(ingenic_ecc_calculate); + +/** + * ingenic_ecc_correct() - detect and correct bit errors + * @ecc: ECC device. + * @params: ECC parameters. 
+ * @buf: raw data read from the chip. + * @ecc_code: ECC read from the chip. + * + * Given the raw data and the ECC read from the NAND device, detects and + * corrects errors in the data. + * + * Return: the number of bit errors corrected, -EBADMSG if there are too many + * errors to correct or -ETIMEDOUT if we timed out waiting for the controller. + */ +int ingenic_ecc_correct(struct ingenic_ecc *ecc, + struct ingenic_ecc_params *params, + u8 *buf, u8 *ecc_code) +{ + return ecc->ops->correct(ecc, params, buf, ecc_code); +} +EXPORT_SYMBOL(ingenic_ecc_correct); + +/** + * ingenic_ecc_get() - get the ECC controller device + * @np: ECC device tree node. + * + * Gets the ECC controller device from the specified device tree node. The + * device must be released with ingenic_ecc_release() when it is no longer being + * used. + * + * Return: a pointer to ingenic_ecc, errors are encoded into the
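The SoC-agnostic wrappers above are one-line delegations through a per-SoC ops table. The following is a minimal userspace sketch of that pattern, not the actual ingenic_ecc structures: the type names, the toy XOR "backend", and the 4-byte buffer size are all hypothetical, chosen only to make the delegation testable.

```c
#include <assert.h>
#include <stddef.h>

struct ecc_params { int strength; };
struct ecc_dev;

/* Per-SoC backends fill in this ops table; the common layer never
 * touches hardware itself. */
struct ecc_ops {
	int (*calculate)(struct ecc_dev *ecc, struct ecc_params *p,
			 const unsigned char *buf, unsigned char *code);
	int (*correct)(struct ecc_dev *ecc, struct ecc_params *p,
		       unsigned char *buf, unsigned char *code);
};

struct ecc_dev { const struct ecc_ops *ops; };

/* SoC-agnostic API, mirroring the shape of ingenic_ecc_calculate(). */
static int ecc_calculate(struct ecc_dev *ecc, struct ecc_params *p,
			 const unsigned char *buf, unsigned char *code)
{
	return ecc->ops->calculate(ecc, p, buf, code);
}

/* Toy backend: XOR four data bytes into one parity byte. */
static int toy_calculate(struct ecc_dev *ecc, struct ecc_params *p,
			 const unsigned char *buf, unsigned char *code)
{
	unsigned char parity = 0;
	int i;

	(void)ecc;
	(void)p;
	for (i = 0; i < 4; i++)
		parity ^= buf[i];
	code[0] = parity;
	return 0;
}

static const struct ecc_ops toy_ops = { .calculate = toy_calculate };
```

The point of the split is that ingenic_nand only ever sees the wrapper, so a new SoC just supplies a new ops table in its own module.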
[PATCH v5 09/12] mtd: rawnand: ingenic: Make use of ecc-engine property
Use the 'ecc-engine' standard property instead of the custom 'ingenic,bch-controller' property, which is now deprecated. Signed-off-by: Paul Cercueil --- Notes: v5: New patch drivers/mtd/nand/raw/ingenic/ingenic_ecc.c | 13 ++--- 1 file changed, 10 insertions(+), 3 deletions(-) diff --git a/drivers/mtd/nand/raw/ingenic/ingenic_ecc.c b/drivers/mtd/nand/raw/ingenic/ingenic_ecc.c index d7f3a8c3abea..30436ca6628a 100644 --- a/drivers/mtd/nand/raw/ingenic/ingenic_ecc.c +++ b/drivers/mtd/nand/raw/ingenic/ingenic_ecc.c @@ -82,9 +82,9 @@ static struct ingenic_ecc *ingenic_ecc_get(struct device_node *np) /** * of_ingenic_ecc_get() - get the ECC controller from a DT node - * @of_node: the node that contains a bch-controller property. + * @of_node: the node that contains a ecc-engine property. * - * Get the bch-controller property from the given device tree + * Get the ecc-engine property from the given device tree * node and pass it to ingenic_ecc_get to do the work. * * Return: a pointer to ingenic_ecc, errors are encoded into the pointer. @@ -95,7 +95,14 @@ struct ingenic_ecc *of_ingenic_ecc_get(struct device_node *of_node) struct ingenic_ecc *ecc = NULL; struct device_node *np; - np = of_parse_phandle(of_node, "ingenic,bch-controller", 0); + np = of_parse_phandle(of_node, "ecc-engine", 0); + + /* +* If the ecc-engine property is not found, check for the deprecated +* ingenic,bch-controller property +*/ + if (!np) + np = of_parse_phandle(of_node, "ingenic,bch-controller", 0); if (np) { ecc = ingenic_ecc_get(np); -- 2.11.0
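The lookup order the patch introduces — standard property first, deprecated property as fallback — can be modeled in userspace. This sketch substitutes a two-entry string table for of_parse_phandle(); `fake_node` and the helper names are invented for the test and are not DT APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct fake_node {
	const char *props[2];	/* property names present on the node */
};

/* Stand-in for of_parse_phandle(): returns the property if present. */
static const char *find_prop(const struct fake_node *np, const char *name)
{
	size_t i;

	for (i = 0; i < 2; i++)
		if (np->props[i] && !strcmp(np->props[i], name))
			return np->props[i];
	return NULL;
}

/* Mirrors the patched of_ingenic_ecc_get() control flow: prefer the
 * standard "ecc-engine" binding, fall back to the deprecated one. */
static const char *ecc_engine_prop(const struct fake_node *np)
{
	const char *prop = find_prop(np, "ecc-engine");

	if (!prop)
		prop = find_prop(np, "ingenic,bch-controller");
	return prop;
}
```

Keeping the deprecated lookup after the standard one means old device trees keep working while new ones use the generic binding.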
[PATCH v5 04/12] mtd: rawnand: Move drivers for Ingenic SoCs to subfolder
Before adding support for more SoCs and seeing the number of files for these drivers grow, we move them to their own subfolder to keep it tidy. Signed-off-by: Paul Cercueil --- Notes: v2: New patch v3: No change v4: No change v5: No change drivers/mtd/nand/raw/Kconfig | 14 +- drivers/mtd/nand/raw/Makefile| 3 +-- drivers/mtd/nand/raw/ingenic/Kconfig | 13 + drivers/mtd/nand/raw/ingenic/Makefile| 2 ++ drivers/mtd/nand/raw/{ => ingenic}/jz4740_nand.c | 0 drivers/mtd/nand/raw/{ => ingenic}/jz4780_bch.c | 0 drivers/mtd/nand/raw/{ => ingenic}/jz4780_bch.h | 0 drivers/mtd/nand/raw/{ => ingenic}/jz4780_nand.c | 0 8 files changed, 17 insertions(+), 15 deletions(-) create mode 100644 drivers/mtd/nand/raw/ingenic/Kconfig create mode 100644 drivers/mtd/nand/raw/ingenic/Makefile rename drivers/mtd/nand/raw/{ => ingenic}/jz4740_nand.c (100%) rename drivers/mtd/nand/raw/{ => ingenic}/jz4780_bch.c (100%) rename drivers/mtd/nand/raw/{ => ingenic}/jz4780_bch.h (100%) rename drivers/mtd/nand/raw/{ => ingenic}/jz4780_nand.c (100%) diff --git a/drivers/mtd/nand/raw/Kconfig b/drivers/mtd/nand/raw/Kconfig index e604625e2dfa..705863c6709a 100644 --- a/drivers/mtd/nand/raw/Kconfig +++ b/drivers/mtd/nand/raw/Kconfig @@ -470,19 +470,7 @@ config MTD_NAND_NUC900 This enables the driver for the NAND Flash on evaluation board based on w90p910 / NUC9xx. -config MTD_NAND_JZ4740 - tristate "Support for JZ4740 SoC NAND controller" - depends on MACH_JZ4740 || COMPILE_TEST - depends on HAS_IOMEM - help - Enables support for NAND Flash on JZ4740 SoC based boards. - -config MTD_NAND_JZ4780 - tristate "Support for NAND on JZ4780 SoC" - depends on JZ4780_NEMC - help - Enables support for NAND Flash connected to the NEMC on JZ4780 SoC - based boards, using the BCH controller for hardware error correction. 
+source "drivers/mtd/nand/raw/ingenic/Kconfig" config MTD_NAND_FSMC tristate "Support for NAND on ST Micros FSMC" diff --git a/drivers/mtd/nand/raw/Makefile b/drivers/mtd/nand/raw/Makefile index 5a5a72f0793e..5552c4a9f2cf 100644 --- a/drivers/mtd/nand/raw/Makefile +++ b/drivers/mtd/nand/raw/Makefile @@ -45,8 +45,7 @@ obj-$(CONFIG_MTD_NAND_NUC900) += nuc900_nand.o obj-$(CONFIG_MTD_NAND_MPC5121_NFC) += mpc5121_nfc.o obj-$(CONFIG_MTD_NAND_VF610_NFC) += vf610_nfc.o obj-$(CONFIG_MTD_NAND_RICOH) += r852.o -obj-$(CONFIG_MTD_NAND_JZ4740) += jz4740_nand.o -obj-$(CONFIG_MTD_NAND_JZ4780) += jz4780_nand.o jz4780_bch.o +obj-y += ingenic/ obj-$(CONFIG_MTD_NAND_GPMI_NAND) += gpmi-nand/ obj-$(CONFIG_MTD_NAND_XWAY)+= xway_nand.o obj-$(CONFIG_MTD_NAND_BCM47XXNFLASH) += bcm47xxnflash/ diff --git a/drivers/mtd/nand/raw/ingenic/Kconfig b/drivers/mtd/nand/raw/ingenic/Kconfig new file mode 100644 index ..67806c87b2c4 --- /dev/null +++ b/drivers/mtd/nand/raw/ingenic/Kconfig @@ -0,0 +1,13 @@ +config MTD_NAND_JZ4740 + tristate "Support for JZ4740 SoC NAND controller" + depends on MACH_JZ4740 || COMPILE_TEST + depends on HAS_IOMEM + help + Enables support for NAND Flash on JZ4740 SoC based boards. + +config MTD_NAND_JZ4780 + tristate "Support for NAND on JZ4780 SoC" + depends on JZ4780_NEMC + help + Enables support for NAND Flash connected to the NEMC on JZ4780 SoC + based boards, using the BCH controller for hardware error correction. 
diff --git a/drivers/mtd/nand/raw/ingenic/Makefile b/drivers/mtd/nand/raw/ingenic/Makefile new file mode 100644 index ..44c2ca053d24 --- /dev/null +++ b/drivers/mtd/nand/raw/ingenic/Makefile @@ -0,0 +1,2 @@ +obj-$(CONFIG_MTD_NAND_JZ4740) += jz4740_nand.o +obj-$(CONFIG_MTD_NAND_JZ4780) += jz4780_nand.o jz4780_bch.o diff --git a/drivers/mtd/nand/raw/jz4740_nand.c b/drivers/mtd/nand/raw/ingenic/jz4740_nand.c similarity index 100% rename from drivers/mtd/nand/raw/jz4740_nand.c rename to drivers/mtd/nand/raw/ingenic/jz4740_nand.c diff --git a/drivers/mtd/nand/raw/jz4780_bch.c b/drivers/mtd/nand/raw/ingenic/jz4780_bch.c similarity index 100% rename from drivers/mtd/nand/raw/jz4780_bch.c rename to drivers/mtd/nand/raw/ingenic/jz4780_bch.c diff --git a/drivers/mtd/nand/raw/jz4780_bch.h b/drivers/mtd/nand/raw/ingenic/jz4780_bch.h similarity index 100% rename from drivers/mtd/nand/raw/jz4780_bch.h rename to drivers/mtd/nand/raw/ingenic/jz4780_bch.h diff --git a/drivers/mtd/nand/raw/jz4780_nand.c b/drivers/mtd/nand/raw/ingenic/jz4780_nand.c similarity index 100% rename from drivers/mtd/nand/raw/jz4780_nand.c rename to drivers/mtd/nand/raw/ingenic/jz4780_nand.c -- 2.11.0
[PATCH v5 01/12] dt-bindings: mtd: ingenic: Add compatible strings for JZ4740 and JZ4725B
Add compatible strings to probe the jz4780-nand and jz4780-bch drivers from devicetree on the JZ4725B and JZ4740 SoCs from Ingenic. Signed-off-by: Paul Cercueil Reviewed-by: Rob Herring --- Notes: v2: - Change 'ingenic,jz4725b-nand' compatible string to 'ingenic,jz4740-nand' to reflect driver change - Add 'ingenic,jz4740-bch' compatible string - Document 'ingenic,oob-layout' property v3: - Removed 'ingenic,oob-layout' property - Update compatible strings to what the driver supports v4: No change v5: No change Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt b/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt index 29ea5853ca91..a5b940f18bf6 100644 --- a/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt +++ b/Documentation/devicetree/bindings/mtd/ingenic,jz4780-nand.txt @@ -6,7 +6,10 @@ memory-controllers/ingenic,jz4780-nemc.txt), and thus NAND device nodes must be children of the NEMC node. Required NAND controller device properties: -- compatible: Should be set to "ingenic,jz4780-nand". +- compatible: Should be one of: + * ingenic,jz4740-nand + * ingenic,jz4725b-nand + * ingenic,jz4780-nand - reg: For each bank with a NAND chip attached, should specify a bank number, an offset of 0 and a size of 0x100 (i.e. the whole NEMC bank). @@ -72,7 +75,10 @@ NAND devices. The following is a description of the device properties for a BCH controller. Required BCH properties: -- compatible: Should be set to "ingenic,jz4780-bch". +- compatible: Should be one of: + * ingenic,jz4740-ecc + * ingenic,jz4725b-bch + * ingenic,jz4780-bch - reg: Should specify the BCH controller registers location and length. - clocks: Clock for the BCH controller. -- 2.11.0
[PATCH 1/2] platform/x86: intel_pmc_core: Convert to a platform_driver
Convert the intel_pmc_core driver to a platform driver. There is no functional change. Some code that tries to determine what kind of CPU this is has been moved from pmc_core_probe() to pmc_core_init(). Signed-off-by: Rajat Jain --- This is rebased off git://git.infradead.org/linux-platform-drivers-x86.git/for-next drivers/platform/x86/intel_pmc_core.c | 93 --- 1 file changed, 68 insertions(+), 25 deletions(-) diff --git a/drivers/platform/x86/intel_pmc_core.c b/drivers/platform/x86/intel_pmc_core.c index f2c621b55f49..55578d07610c 100644 --- a/drivers/platform/x86/intel_pmc_core.c +++ b/drivers/platform/x86/intel_pmc_core.c @@ -19,6 +19,7 @@ #include #include #include +#include #include #include @@ -854,12 +855,59 @@ static const struct dmi_system_id pmc_core_dmi_table[] = { {} }; -static int __init pmc_core_probe(void) +static int pmc_core_probe(struct platform_device *pdev) { - struct pmc_dev *pmcdev = &pmc; + struct pmc_dev *pmcdev = platform_get_drvdata(pdev); + int err; + + pmcdev->regbase = ioremap(pmcdev->base_addr, + pmcdev->map->regmap_length); + if (!pmcdev->regbase) + return -ENOMEM; + + mutex_init(&pmcdev->lock); + pmcdev->pmc_xram_read_bit = pmc_core_check_read_lock_bit(); + + err = pmc_core_dbgfs_register(pmcdev); + if (err < 0) { + dev_warn(&pdev->dev, "debugfs register failed.\n"); + iounmap(pmcdev->regbase); + return err; + } + + dmi_check_system(pmc_core_dmi_table); + dev_info(&pdev->dev, " initialized\n"); + return 0; +} + +static int pmc_core_remove(struct platform_device *pdev) +{ + struct pmc_dev *pmcdev = platform_get_drvdata(pdev); + + pmc_core_dbgfs_unregister(pmcdev); + mutex_destroy(&pmcdev->lock); + iounmap(pmcdev->regbase); + return 0; +} + +static struct platform_driver pmc_core_driver = { + .driver = { + .name = "pmc_core", + }, + .probe = pmc_core_probe, + .remove = pmc_core_remove, +}; + +static struct platform_device pmc_core_device = { + .name = "pmc_core", +}; + +static int __init pmc_core_init(void) +{ + int ret; const struct x86_cpu_id *cpu_id;
+ struct pmc_dev *pmcdev = &pmc; u64 slp_s0_addr; - int err; cpu_id = x86_match_cpu(intel_pmc_core_ids); if (!cpu_id) @@ -880,36 +928,31 @@ static int __init pmc_core_probe(void) else pmcdev->base_addr = slp_s0_addr - pmcdev->map->slp_s0_offset; - pmcdev->regbase = ioremap(pmcdev->base_addr, - pmcdev->map->regmap_length); - if (!pmcdev->regbase) - return -ENOMEM; + platform_set_drvdata(&pmc_core_device, pmcdev); - mutex_init(&pmcdev->lock); - pmcdev->pmc_xram_read_bit = pmc_core_check_read_lock_bit(); + ret = platform_device_register(&pmc_core_device); + if (ret) + return ret; - err = pmc_core_dbgfs_register(pmcdev); - if (err < 0) { - pr_warn(" debugfs register failed.\n"); - iounmap(pmcdev->regbase); - return err; - } + ret = platform_driver_register(&pmc_core_driver); + if (ret) + goto out_remove_dev; - dmi_check_system(pmc_core_dmi_table); - pr_info(" initialized\n"); return 0; + +out_remove_dev: + platform_device_unregister(&pmc_core_device); + return ret; } -module_init(pmc_core_probe) -static void __exit pmc_core_remove(void) +static void __init pmc_core_exit(void) { - struct pmc_dev *pmcdev = &pmc; - - pmc_core_dbgfs_unregister(pmcdev); - mutex_destroy(&pmcdev->lock); - iounmap(pmcdev->regbase); + platform_driver_unregister(&pmc_core_driver); + platform_device_unregister(&pmc_core_device); } -module_exit(pmc_core_remove) + +module_init(pmc_core_init); +module_exit(pmc_core_exit); MODULE_LICENSE("GPL v2"); MODULE_DESCRIPTION("Intel PMC Core Driver"); -- 2.21.0.360.g471c308f928-goog
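The init path in the patch registers a platform device and then a platform driver, and must unregister the device if driver registration fails. Stripped of the platform-bus plumbing, that unwind pattern looks like the sketch below; the `demo_*` names and the `fail_driver` test knob are hypothetical, only the goto-unwind shape matches the patch.

```c
#include <assert.h>
#include <stdbool.h>

static bool device_registered, driver_registered, fail_driver;

static int register_device(void) { device_registered = true; return 0; }
static void unregister_device(void) { device_registered = false; }

static int register_driver(void)
{
	if (fail_driver)
		return -1;	/* stands in for -ENOMEM etc. */
	driver_registered = true;
	return 0;
}

/* Same shape as pmc_core_init(): second failure undoes the first step. */
static int demo_init(void)
{
	int ret = register_device();

	if (ret)
		return ret;

	ret = register_driver();
	if (ret)
		goto out_remove_dev;	/* same label name as the patch */

	return 0;

out_remove_dev:
	unregister_device();
	return ret;
}
```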
[PATCH 2/2] platform/x86: intel_pmc_core: Allow to dump debug registers on S0ix failure
Add a module parameter which, when enabled, will check on resume whether the last S0ix attempt was successful. If not, the driver would provide helpful debug information (which gets latched during the failed suspend attempt) to debug the S0ix failure. This information is very useful to debug S0ix failures, especially since the latched debug information will be lost (over-written) if the system attempts to go into runtime (or imminent) S0ix again after that failed suspend attempt. Signed-off-by: Rajat Jain --- drivers/platform/x86/intel_pmc_core.c | 86 +++ drivers/platform/x86/intel_pmc_core.h | 7 +++ 2 files changed, 93 insertions(+) diff --git a/drivers/platform/x86/intel_pmc_core.c b/drivers/platform/x86/intel_pmc_core.c index 55578d07610c..b1f4405a27ce 100644 --- a/drivers/platform/x86/intel_pmc_core.c +++ b/drivers/platform/x86/intel_pmc_core.c @@ -20,6 +20,7 @@ #include #include #include +#include #include #include @@ -890,9 +891,94 @@ static int pmc_core_remove(struct platform_device *pdev) return 0; } +#ifdef CONFIG_PM_SLEEP + +static bool warn_on_s0ix_failures; +module_param(warn_on_s0ix_failures, bool, 0644); +MODULE_PARM_DESC(warn_on_s0ix_failures, "Check and warn for S0ix failures"); + +static int pmc_core_suspend(struct device *dev) +{ + struct pmc_dev *pmcdev = dev_get_drvdata(dev); + + /* Save PC10 and S0ix residency for checking later */ + if (warn_on_s0ix_failures && + !rdmsrl_safe(MSR_PKG_C10_RESIDENCY, &pmcdev->pc10_counter) && + !pmc_core_dev_state_get(pmcdev, &pmcdev->s0ix_counter)) + pmcdev->check_counters = true; + else + pmcdev->check_counters = false; + + return 0; +} + +static inline bool pc10_failed(struct pmc_dev *pmcdev) +{ + u64 pc10_counter; + + if (!rdmsrl_safe(MSR_PKG_C10_RESIDENCY, &pc10_counter) && + pc10_counter == pmcdev->pc10_counter) + return true; + else + return false; +} + +static inline bool s0ix_failed(struct pmc_dev *pmcdev) +{ + u64 s0ix_counter; + + if (!pmc_core_dev_state_get(pmcdev, &s0ix_counter) && + s0ix_counter == pmcdev->s0ix_counter) + return 
true; + else + return false; +} + +static int pmc_core_resume(struct device *dev) +{ + struct pmc_dev *pmcdev = dev_get_drvdata(dev); + + if (!pmcdev->check_counters) + return 0; + + if (pc10_failed(pmcdev)) { + dev_info(dev, "PC10 entry had failed (PC10 cnt=0x%llx)\n", +pmcdev->pc10_counter); + } else if (s0ix_failed(pmcdev)) { + + const struct pmc_bit_map **maps = pmcdev->map->slps0_dbg_maps; + const struct pmc_bit_map *map; + int offset = pmcdev->map->slps0_dbg_offset; + u32 data; + + dev_warn(dev, "S0ix entry had failed (S0ix cnt=%llu)\n", +pmcdev->s0ix_counter); + while (*maps) { + map = *maps; + data = pmc_core_reg_read(pmcdev, offset); + offset += 4; + while (map->name) { + dev_warn(dev, "SLP_S0_DBG: %-32s\tState: %s\n", +map->name, +data & map->bit_mask ? "Yes" : "No"); + ++map; + } + ++maps; + } + } + return 0; +} + +#endif + +const struct dev_pm_ops pmc_core_pm_ops = { + SET_LATE_SYSTEM_SLEEP_PM_OPS(pmc_core_suspend, pmc_core_resume) +}; + static struct platform_driver pmc_core_driver = { .driver = { .name = "pmc_core", + .pm = &pmc_core_pm_ops }, .probe = pmc_core_probe, .remove = pmc_core_remove, diff --git a/drivers/platform/x86/intel_pmc_core.h b/drivers/platform/x86/intel_pmc_core.h index 88d9c0653a5f..fdee5772e532 100644 --- a/drivers/platform/x86/intel_pmc_core.h +++ b/drivers/platform/x86/intel_pmc_core.h @@ -241,6 +241,9 @@ struct pmc_reg_map { * @pmc_xram_read_bit: flag to indicate whether PMC XRAM shadow registers * used to read MPHY PG and PLL status are available * @mutex_lock:mutex to complete one transcation + * @check_counters:On resume, check if counters are getting incremented + * @pc10_counter: PC10 residency counter + * @s0ix_counter: S0ix residency (step adjusted) * * pmc_dev contains info about power management controller device. 
*/ @@ -253,6 +256,10 @@ struct pmc_dev { #endif /* CONFIG_DEBUG_FS */ int pmc_xram_read_bit; struct mutex lock; /* generic mutex lock for PMC Core */ + + bool check_counters; /* Check for counter increments on resume */ + u64 pc10_counter; + u64 s0ix_counter; }; #endif /* PMC_CORE_H */ --
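Stripped of the MSR and PMC register plumbing, the check-counters logic in this patch reduces to snapshot-and-compare: latch the residency counter at suspend, and on resume treat an unchanged counter as a failed low-power entry. A small userspace model, with `read_s0ix_counter()` standing in for pmc_core_dev_state_get() and all names hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static uint64_t hw_counter;	/* pretend hardware residency counter */

static uint64_t read_s0ix_counter(void) { return hw_counter; }

struct pm_state {
	bool check_counters;
	uint64_t s0ix_counter;
};

/* Suspend side: snapshot the counter for the later comparison. */
static void demo_suspend(struct pm_state *st)
{
	st->s0ix_counter = read_s0ix_counter();
	st->check_counters = true;
}

/* Resume side: an unchanged counter means the state was never entered. */
static bool demo_resume_s0ix_failed(const struct pm_state *st)
{
	if (!st->check_counters)
		return false;
	return read_s0ix_counter() == st->s0ix_counter;
}
```

This is why the check must run on the very next resume: a later successful S0ix entry would advance the counter and over-write the latched debug registers, exactly as the commit message warns.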
Re: [PATCH] [CIFS] Remove unneeded statements pointed out by Coverity
reviewed by me On Thu, Mar 14, 2019 at 7:53 AM Steve French wrote: > > cifs: remove unused value pointed out by Coverity > > Detected by CoverityScan CID#1438719 ("Unused Value") > > buf is reset again before being used so these two lines of code > are useless. > > Signed-off-by: Steve French > --- > fs/cifs/connect.c | 4 > 1 file changed, 4 deletions(-) > > diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c > index b95db2b593cb..a8e9738db691 100644 > --- a/fs/cifs/connect.c > +++ b/fs/cifs/connect.c > @@ -1191,10 +1191,6 @@ cifs_demultiplex_thread(void *p) > continue; > } > > -if (server->large_buf) > -buf = server->bigbuf; > - > - > server->lstrp = jiffies; > > for (i = 0; i < num_mids; i++) { > > -- > Thanks, > > Steve
Re: [PATCH] spi: mediatek: Attempt to address style issues in spi-mt7621.c
On Wed, Mar 13 2019, Stefan Roese wrote: > On 13.03.19 17:46, Matthias Brugger wrote: >> >> >> On 13/03/2019 17:34, Chuanhong Guo wrote: >>> Hi! >>> On Wed, Mar 13, 2019 at 8:28 PM Matthias Brugger >>> wrote: On 13/03/2019 13:24, Armando Miraglia wrote: [...] Apart from fixing styling issues it would be usefull to see if we can add support for mt7621 to drivers/spi/spi-mt65xx.c >>> It's impossible. They are completely different IPs. >> >> Thanks for the info. Do you know the status of the rest of the drivers in >> staging? > > Just to inform you. We are using this SPI driver from staging > in one of our customer projects and I will try to move this > driver out of staging into drivers/spi very shortly. This is good news! I think the driver is ready to move out of staging and would be very happy to see you do it. My only small concern is that this driver was backported to openwrt (4.14 based) and then reverted https://github.com/openwrt/openwrt/commit/749a29f76ca780d8df70a5163d43bbdc6f13ba3f "This breaks some mt7621 devices." Possibly it was backported badly, or possibly there is a problem. John: do you have any more details of the problem other than what is in the commit message? Thanks, NeilBrown
Re: overlayfs vs. fscrypt
On Wed, Mar 13, 2019 at 02:04:29PM -0700, James Bottomley wrote: > On Wed, 2019-03-13 at 13:25 -0700, Eric Biggers wrote: > > On Wed, Mar 13, 2019 at 01:06:06PM -0700, James Bottomley wrote: > > > On Wed, 2019-03-13 at 12:57 -0700, Eric Biggers wrote: > [...] > > > > fscrypt would allow the data to be stored encrypted on the local > > > > disk, so it's protected against offline compromise of the disk. > > > > > > Container images are essentially tars of the overlays. They only > > > become actual filesystems when instantiated at runtime. The > > > current encrypted container image is an overlay or set of overlays > > > which is tarred then encrypted. So to instantiate it is decrypted > > > then untarred. > > > > > > The thing I was wondering about was whether instead of a tar > > > encrypt we could instead produce an encrypted image from a fscrypt > > > filesystem. > > > > > > > Why do you care whether the container image is encrypted on the local > > disk, when you're extracting it in plaintext onto the local disk > > anyway each time it runs? Even after the runtime files are "deleted", > > they may still be recoverable from the disk. Are you using shred and > > BLKSECDISCARD, and a non-COW filesystem? > > > > Now, if you wanted to avoid writing the plaintext to disk entirely > > (and thereby use encryption to actually achieve a useful security > > property that can't be achieved through file permissions), fscrypt is > > a good solution for that. > > OK let's start with a cloud and container 101: A container is an > exactly transportable IaaS environment containing an application. The > format for the exact transport is the "container image" I've been > describing (layered tar file set deployed with overlays). These images > are usually stored in cloud based registries which may or may not have > useful access controls. I take it the reason for image encryption to > protect confidentiality within the registry is obvious. 
> > Because of the exact transport, the deployment may be on my laptop, on > my test system or in some type of public or private cloud. In all > cases bar the laptop, I won't actually own the physical system which > ends up deploying the container. So in exchange for security > guarantees from the physical system owner, I agree to turn over my > decryption key and possibly a cash payment. One of these guarantees is > usually that they shred the key after use and that they deploy a useful > key escrow system like vault or keyprotect to guard it even while the > decryption is being done. > Another is that all traces of the container be shredded after the execution is > finished. Well, sounds like that's not the case currently even with an encrypted container image, because the actual runtime files are not encrypted on disk. Encrypting the runtime files using fscrypt with an ephemeral key would be useful here. IOW, randomly generate an encryption key when the container starts, never store it anywhere, and wipe it when the container stops. Note that this is separate from the container *image* encryption. > considering is could I be protected against either cloud provider > cockups that might leak the image (the misconfigured backup scenario I > suggested) or malicious actions of other tenants. If the container image is encrypted with a key not on the system, then its confidentiality is protected from anything that may happen on that system. But if the container image encryption key *is* on the system, your container image may be leaked either accidentally or maliciously. - Eric
Re: [PATCH 4.9 00/96] 4.9.163-stable review
On Wed, Mar 13, 2019 at 01:49:49PM -0700, Greg Kroah-Hartman wrote: > On Wed, Mar 13, 2019 at 01:34:41PM -0700, Guenter Roeck wrote: > > On Tue, Mar 12, 2019 at 10:09:18AM -0700, Greg Kroah-Hartman wrote: > > > This is the start of the stable review cycle for the 4.9.163 release. > > > There are 96 patches in this series, all will be posted as a response > > > to this one. If anyone has any issues with these being applied, please > > > let me know. > > > > > > Responses should be made by Thu Mar 14 17:10:06 UTC 2019. > > > Anything received after that time might be too late. > > > > > > > Build results: > > total: 172 pass: 170 fail: 2 > > Failed builds: > > i386:tools/perf > > x86_64:tools/perf > > Qemu test results: > > total: 316 pass: 316 fail: 0 > > > > Culprit is commit 386ca5754fad ("perf trace: Support multiple "vfs_getname" > > probes"), which introduces a call to strstarts() but not the necessary > > include file or declaration. Unfortunately, fixing that is quite complex > > (commit 8e99b6d4533c ("tools include: Adopt strstarts() from the kernel") > > doesn't apply). > > Thanks for tracking this down, I'm lost in a maze of gcc8 issues with > perf on 4.9 and 4.4 at the moment trying to fix that. > > I'll go drop this patch from the 4.9 and 4.4 queues. > Agreed; the same patch also introduces a call to evlist__for_each_entry_safe(), which does not exist in v4.4.y. Guenter
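For context on the build failure: strstarts() is a one-line helper over strncmp(), and the backported perf patch calls it without its declaration being visible in the older trees. A standalone copy of that helper (same semantics as the kernel's, reproduced here for illustration):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Does @str begin with @prefix? */
static inline bool strstarts(const char *str, const char *prefix)
{
	return strncmp(str, prefix, strlen(prefix)) == 0;
}
```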
Re: [PATCH 1/5] lib/sort: Make swap functions more generic
On Wed, Mar 13, 2019 at 10:23 PM Rasmus Villemoes wrote: > On 21/02/2019 07.30, George Spelvin wrote: > > +#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS > > + (void)base; > > +#else > > + lsbits |= (unsigned int)(size_t)base; > > The kernel usually casts pointers to long or unsigned long. If some > oddball arch has size_t something other than unsigned long, this would > give a "warning: cast from pointer to integer of different size". So > just cast to unsigned long, and drop the cast to unsigned int. The proper integer equivalent of a pointer is uintptr_t, so please use that instead of size_t. Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
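The review point — cast pointers through uintptr_t, the integer type defined to round-trip a pointer — can be shown with a small alignment probe in the spirit of the patch under discussion. The function name is invented for the example; only the `(uintptr_t)base` cast is the technique being illustrated.

```c
#include <assert.h>
#include <stdint.h>

/* OR the low bits of the size and the base address together: the result
 * tells us the strictest alignment both are guaranteed to share, which
 * is what lib/sort uses to pick a wide or narrow swap routine. */
static unsigned int common_low_bits(const void *base, uintptr_t size)
{
	uintptr_t lsbits = size;

	lsbits |= (uintptr_t)base;	/* uintptr_t, not size_t or int */
	return (unsigned int)(lsbits & 7);	/* 0 => 8-byte ops are safe */
}
```

Casting to `unsigned long` also works on Linux ABIs, but `uintptr_t` states the intent and avoids the pointer-to-narrower-integer warning Rasmus describes.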
Re: Kernel bug with MPX?
On 3/7/19 11:41 PM, Michal Hocko wrote: > And this seems real leak because I just hit some bugons later > > [112423.206497] BUG: Bad rss-counter state mm:7aa9c8a7 idx:1 val:25593 > [113601.595093] page:ea00041a07c0 count:2 mapcount:1 > mapping:88818d70e9a1 index:0x7f821adf6 > [113601.595102] anon FWIW, I was able to reproduce this. No idea what the problem is, yet, but I'm looking at it.
Re: [PATCH] spi: fix SPI_BPW_RANGE_MASK() regression
On Wed, Mar 13, 2019 at 10:01 PM Arnd Bergmann wrote: > Geert points out that I confused the min/max arguments that are > reversed between SPI_BPW_RANGE_MASK() and GENMASK(). This time > I have verified the result of the macro after fixing the arguments. > > Cc: Geert Uytterhoeven > Fixes: eefffb42f665 ("spi: work around clang bug in SPI_BPW_RANGE_MASK()") > Signed-off-by: Arnd Bergmann Reviewed-by: Geert Uytterhoeven Gr{oetje,eeting}s, Geert -- Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- ge...@linux-m68k.org In personal conversations with technical people, I call myself a hacker. But when I'm talking to journalists I just say "programmer" or something like that. -- Linus Torvalds
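The regression being fixed is an argument-order trap: GENMASK(h, l) takes the high bit first, while SPI_BPW_RANGE_MASK(min, max) takes the minimum word size first, so defining one in terms of the other requires swapping. Userspace replicas of the macros (the `_SK` suffix marks them as local copies, assuming 32-bit masks since `bits_per_word_mask` is a u32):

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_LONG_SK 32

/* Same construction as the kernel's GENMASK(): bits h..l set. */
#define GENMASK_SK(h, l) \
	(((~0u) - (1u << (l)) + 1) & (~0u >> (BITS_PER_LONG_SK - 1 - (h))))

/* SPI_BPW_MASK(bits): single bit for one supported word size. */
#define SPI_BPW_MASK_SK(bits) (1u << ((bits) - 1))

/* Note the swap: min maps to the low bit, max to the high bit. */
#define SPI_BPW_RANGE_MASK_SK(min, max) GENMASK_SK((max) - 1, (min) - 1)
```

With the arguments in this order, SPI_BPW_RANGE_MASK(1, 32) covers the full u32 and a degenerate range collapses to the single-size mask, which is the behavior Arnd says he verified.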
[PATCH] [CIFS] Remove unneeded statements pointed out by Coverity
cifs: remove unused value pointed out by Coverity Detected by CoverityScan CID#1438719 ("Unused Value") buf is reset again before being used so these two lines of code are useless. Signed-off-by: Steve French --- fs/cifs/connect.c | 4 1 file changed, 4 deletions(-) diff --git a/fs/cifs/connect.c b/fs/cifs/connect.c index b95db2b593cb..a8e9738db691 100644 --- a/fs/cifs/connect.c +++ b/fs/cifs/connect.c @@ -1191,10 +1191,6 @@ cifs_demultiplex_thread(void *p) continue; } -if (server->large_buf) -buf = server->bigbuf; - - server->lstrp = jiffies; for (i = 0; i < num_mids; i++) { -- Thanks, Steve
Re: [PATCH v2] PCI: qcom: Use default config space read function
On 13/03/2019 20:39, Marc Gonzalez wrote:
>> working without patch : https://paste.ubuntu.com/p/TJm4hgjGW4/
>> I have not debugged it any further other than just testing the patch.
>> Let me know if you need any more dumps for more debug.
>
> Could you pastebin the output of lspci -vvv -n in the working case?

This was already in my original reply:
working without patch : https://paste.ubuntu.com/p/TJm4hgjGW4/

--srini