Re: [PATCH v2] netprio_cgroup: Fix unlimited memory leak of v2 cgroups

2020-05-08 Thread Jakub Kicinski
On Sat, 9 May 2020 11:32:10 +0800 Zefan Li wrote:
> If systemd is configured to use hybrid mode, which enables the use of
> both cgroup v1 and v2, systemd will create a new cgroup on both the
> default root (v2) and the netprio_cgroup hierarchy (v1) for a new
> session and attach the task to both cgroups. If the task then does any
> networking, the v2 cgroup can never be freed after the session exits.
> 
> One of our machines ran into OOM due to this memory leak.
> 
> In the scenario described above, when sk_alloc() is called,
> cgroup_sk_alloc() thinks it is in v2 mode, so it stores the cgroup
> pointer in sk->sk_cgrp_data and increments the cgroup refcnt; but then
> sock_update_netprioidx() thinks it is in v1 mode and overwrites
> sk->sk_cgrp_data with the netprioidx value, so the cgroup refcnt is
> never released.
> 
> Currently we do the mode switch when someone writes to the ifpriomap cgroup
> control file. The easiest fix is to also do the switch when a task is attached
> to a new cgroup.
> 
> Fixes: bd1060a1d671("sock, cgroup: add sock->sk_cgroup")

 ^ space missing here

> Reported-by: Yang Yingliang 
> Tested-by: Yang Yingliang 
> Signed-off-by: Zefan Li 
> ---
> 
> forgot to rebase to the latest kernel.
> 
> ---
>  net/core/netprio_cgroup.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
> index 8881dd9..9bd4cab 100644
> --- a/net/core/netprio_cgroup.c
> +++ b/net/core/netprio_cgroup.c
> @@ -236,6 +236,8 @@ static void net_prio_attach(struct cgroup_taskset *tset)
>   struct task_struct *p;
>   struct cgroup_subsys_state *css;
>  
> + cgroup_sk_alloc_disable();
> +
>   cgroup_taskset_for_each(p, css, tset) {
>   void *v = (void *)(unsigned long)css->id;
>  
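To make the overwrite easier to picture, here is a rough userspace model of the
sequence the commit message describes (illustration only; the names and layout
are simplified and are not the kernel's actual sock_cgroup_data):

  #include <stdio.h>
  #include <stdlib.h>

  struct fake_cgroup { int refcnt; };

  /* stand-in for sk->sk_cgrp_data: one word that holds either a cgroup
   * pointer (v2 mode) or a priority index (v1 mode) */
  union sk_cgrp_data {
          struct fake_cgroup *cgrp;       /* v2 mode */
          unsigned long prioidx;          /* v1 mode */
  };

  int main(void)
  {
          struct fake_cgroup *cgrp = calloc(1, sizeof(*cgrp));
          union sk_cgrp_data skcd;

          if (!cgrp)
                  return 1;

          /* cgroup_sk_alloc(): the v2 path stores the pointer and takes a reference */
          cgrp->refcnt++;
          skcd.cgrp = cgrp;

          /* sock_update_netprioidx(): the v1 path overwrites the same storage */
          skcd.prioidx = 42;

          /* the reference taken above is now unreachable and can never be
           * dropped -- the leak the patch avoids by switching to v1 mode
           * (cgroup_sk_alloc_disable()) when a task is attached */
          printf("leaked refcnt: %d\n", cgrp->refcnt);
          return 0;
  }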



RE: [RFC] Issue in final aggregate value, in case of multiple events present in metric expression

2020-05-08 Thread Joakim Zhang

Hi Arnaldo,

Kajol reported this issue almost two months ago and has received no feedback; do
you have any comments? It would be appreciated if you could look into it. Thanks a lot!

Please refer to below link:
https://www.spinics.net/lists/linux-perf-users/msg11011.html

Best Regards,
Joakim Zhang

> -Original Message-
> From: kajoljain 
> Sent: 24 March 2020 16:01
> To: Joakim Zhang ; a...@kernel.org; Jiri Olsa
> ; Andi Kleen 
> Cc: linux-kernel@vger.kernel.org; linux-perf-us...@vger.kernel.org; Kan Liang
> ; Madhavan Srinivasan
> ; Anju T Sudhakar ;
> Ravi Bangoria 
> Subject: [RFC] Issue in final aggregate value, in case of multiple events 
> present
> in metric expression
> 
> Hello All,
>   I want to discuss an issue raised by Joakim Zhang, who mentioned
> that we are not getting the correct result when multiple events are present
> in a metric expression.
> 
> This is one example he pointed out:
> 
> below is the JSON file and result.
> [
> {
>  "PublicDescription": "Calculate DDR0 bus actual utilization
> which vary from DDR0 controller clock frequency",
>  "BriefDescription": "imx8qm: ddr0 bus actual utilization",
>  "MetricName": "imx8qm-ddr0-bus-util",
>  "MetricExpr": "( imx8_ddr0\\/read\\-cycles\\/ +
> imx8_ddr0\\/write\\-cycles\\/ )",
>  "MetricGroup": "i.MX8QM_DDR0_BUS_UTIL"
> }
> ]
> ./perf stat -I 1000 -M imx8qm-ddr0-bus-util
> #           time             counts unit events
>      1.000104250              16720      imx8_ddr0/read-cycles/    #  22921.0 imx8qm-ddr0-bus-util
>      1.000104250               6201      imx8_ddr0/write-cycles/
>      2.000525625               8316      imx8_ddr0/read-cycles/    #  12785.5 imx8qm-ddr0-bus-util
>      2.000525625               2738      imx8_ddr0/write-cycles/
>      3.000819125               1056      imx8_ddr0/read-cycles/    #   4136.7 imx8qm-ddr0-bus-util
>      3.000819125                303      imx8_ddr0/write-cycles/
>      4.001103750               6260      imx8_ddr0/read-cycles/    #   9149.8 imx8qm-ddr0-bus-util
>      4.001103750               2317      imx8_ddr0/write-cycles/
>      5.001392750               2084      imx8_ddr0/read-cycles/    #   4516.0 imx8qm-ddr0-bus-util
>      5.001392750                601      imx8_ddr0/write-cycles/
> 
> Based on the given metric expression, the sum comes out correct for the first
> iteration, while for the rest we do not see the same addition result. But when
> we have a single event in the metric expression, we get the correct result as expected.
> 
> 
> So, I tried to look into this issue and understand the flow. From my
> understanding, whenever we calculate a metric expression we do not use the
> exact counts we are getting.
> Basically, we use the mean value of each metric event in the calculation of the
> metric expression.
> 
> So, I take same example:
> 
> Metric Event: imx8qm-ddr0-bus-util
> MetricExpr": "( imx8_ddr0\\/read\\-cycles\\/ + imx8_ddr0\\/write\\-cycles\\/ 
> )"
> 
> command#: ./perf stat -I 1000 -M imx8qm-ddr0-bus-util
> 
> #           time             counts unit events
>      1.000104250              16720      imx8_ddr0/read-cycles/    #  22921.0 imx8qm-ddr0-bus-util
>      1.000104250               6201      imx8_ddr0/write-cycles/
>      2.000525625               8316      imx8_ddr0/read-cycles/    #  12785.5 imx8qm-ddr0-bus-util
>      2.000525625               2738      imx8_ddr0/write-cycles/
>      3.000819125               1056      imx8_ddr0/read-cycles/    #   4136.7 imx8qm-ddr0-bus-util
>      3.000819125                303      imx8_ddr0/write-cycles/
>      4.001103750               6260      imx8_ddr0/read-cycles/    #   9149.8 imx8qm-ddr0-bus-util
>      4.001103750               2317      imx8_ddr0/write-cycles/
>      5.001392750               2084      imx8_ddr0/read-cycles/    #   4516.0 imx8qm-ddr0-bus-util
>      5.001392750                601      imx8_ddr0/write-cycles/
> 
> So, there is a function called 'update_stats' in util/stat.c where we do this
> calculation and update the stats->mean value. And this mean value is what we
> actually use in our metric expression calculation.
> 
> We call this function in each iteration, where we update stats->mean and
> stats->n for each event.
> But one odd issue is that for the very first event, stats->n is always 1, which
> is why its mean is the same as its count.
> 
> So this is the reason why, for a single event, we get the exact aggregate of the
> metric expression.
> So no matter how many events you have in your metric expression, every time
> you take the exact count for the first one and a normalized (mean) value for the
> rest, which is inconsistent.
> 
> According to the update_stats function, the mean is updated as:
> 
> stats->mean += delta / stats->n, where delta = val - stats->mean.
> 
> If we take write-cycles here, initially mean = 0 and n = 1.
> 
> 1st iteration: n=1, write-cycles: 6201 and mean = 6201 (final aggregate value:
> 16720 + 6201 = 22921); 2nd iteration: n=2, write
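The running-mean behaviour described above can be reproduced with a small
standalone program (a simplified model of update_stats() in tools/perf/util/stat.c,
not the actual perf source; the sample values are the write-cycles counts from the
output above):

  #include <stdio.h>

  struct stats {
          double mean;
          unsigned long n;
  };

  /* same incremental-mean arithmetic as described for update_stats() */
  static void update_stats(struct stats *stats, double val)
  {
          double delta;

          stats->n++;
          delta = val - stats->mean;
          stats->mean += delta / stats->n;
  }

  int main(void)
  {
          double write_cycles[] = { 6201, 2738, 303, 2317, 601 };
          struct stats st = { 0.0, 0 };

          for (int i = 0; i < 5; i++) {
                  update_stats(&st, write_cycles[i]);
                  printf("interval %d: mean = %.1f\n", i + 1, st.mean);
          }
          return 0;
  }

After the second interval the mean is (6201 + 2738) / 2 = 4469.5, and
8316 + 4469.5 = 12785.5, which is exactly the metric value reported for the second
interval above; likewise (6201 + 2738 + 303) / 3 = 3080.7 and 1056 + 3080.7 = 4136.7
for the third. In other words, the first event contributes its exact per-interval
count (its n stays 1, so mean == count) while the other events contribute their
running mean, which is the inconsistency being reported.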

Re: [PATCH net-next] net: ipa: Remove ipa_endpoint_stop{,_rx_dma} again

2020-05-08 Thread Jakub Kicinski
On Fri,  8 May 2020 12:41:33 -0700 Nathan Chancellor wrote:
> When building arm64 allyesconfig:
> 
> drivers/net/ipa/ipa_endpoint.c: In function 'ipa_endpoint_stop_rx_dma':
> drivers/net/ipa/ipa_endpoint.c:1274:13: error: 'IPA_ENDPOINT_STOP_RX_SIZE' undeclared (first use in this function)
> drivers/net/ipa/ipa_endpoint.c:1274:13: note: each undeclared identifier is reported only once for each function it appears in
> drivers/net/ipa/ipa_endpoint.c:1289:2: error: implicit declaration of function 'ipa_cmd_dma_task_32b_addr_add' [-Werror=implicit-function-declaration]
> drivers/net/ipa/ipa_endpoint.c:1291:45: error: 'ENDPOINT_STOP_DMA_TIMEOUT' undeclared (first use in this function)
> drivers/net/ipa/ipa_endpoint.c: In function 'ipa_endpoint_stop':
> drivers/net/ipa/ipa_endpoint.c:1309:16: error: 'IPA_ENDPOINT_STOP_RX_RETRIES' undeclared (first use in this function)
> 
> These functions were removed in a series, merged in as
> commit 33395f4a5c1b ("Merge branch 'net-ipa-kill-endpoint-stop-workaround'").
> 
> Remove them again so that the build works properly.
> 
> Fixes: 3793faad7b5b ("Merge 
> git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net")
> Signed-off-by: Nathan Chancellor 

Applied, thank you!

I think I already said this, but it would be great if IPA built on x86
with COMPILE_TEST...


Re: [PATCH v7 8/8] bus: mhi: core: Ensure non-zero session or sequence ID values are used

2020-05-08 Thread Manivannan Sadhasivam
On Fri, May 08, 2020 at 07:26:48PM -0700, Bhaumik Bhatt wrote:
> While writing any sequence or session identifiers, it is possible that
> the host could write a zero value, whereas only non-zero values should
> be supported writes to those registers. Ensure that the host does not
> write a zero value for them, and also log the values used in debug
> messages. A macro is introduced to simplify this check, and the
> existing checks are also converted to use this macro.
> 
> Signed-off-by: Bhaumik Bhatt 
> Reviewed-by: Jeffrey Hugo 

Reviewed-by: Manivannan Sadhasivam 

Thanks,
Mani

> ---
>  drivers/bus/mhi/core/boot.c | 15 +++
>  drivers/bus/mhi/core/internal.h |  1 +
>  2 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/bus/mhi/core/boot.c b/drivers/bus/mhi/core/boot.c
> index 80e4d76..0b38014 100644
> --- a/drivers/bus/mhi/core/boot.c
> +++ b/drivers/bus/mhi/core/boot.c
> @@ -43,10 +43,7 @@ void mhi_rddm_prepare(struct mhi_controller *mhi_cntrl,
> lower_32_bits(mhi_buf->dma_addr));
>  
>   mhi_write_reg(mhi_cntrl, base, BHIE_RXVECSIZE_OFFS, mhi_buf->len);
> - sequence_id = prandom_u32() & BHIE_RXVECSTATUS_SEQNUM_BMSK;
> -
> - if (unlikely(!sequence_id))
> - sequence_id = 1;
> + sequence_id = MHI_RANDOM_U32_NONZERO(BHIE_RXVECSTATUS_SEQNUM_BMSK);
>  
>   mhi_write_reg_field(mhi_cntrl, base, BHIE_RXVECDB_OFFS,
>   BHIE_RXVECDB_SEQNUM_BMSK, BHIE_RXVECDB_SEQNUM_SHFT,
> @@ -189,7 +186,9 @@ static int mhi_fw_load_amss(struct mhi_controller 
> *mhi_cntrl,
>   return -EIO;
>   }
>  
> - dev_dbg(dev, "Starting AMSS download via BHIe\n");
> + sequence_id = MHI_RANDOM_U32_NONZERO(BHIE_TXVECSTATUS_SEQNUM_BMSK);
> + dev_dbg(dev, "Starting AMSS download via BHIe. Sequence ID:%u\n",
> + sequence_id);
>   mhi_write_reg(mhi_cntrl, base, BHIE_TXVECADDR_HIGH_OFFS,
> upper_32_bits(mhi_buf->dma_addr));
>  
> @@ -198,7 +197,6 @@ static int mhi_fw_load_amss(struct mhi_controller 
> *mhi_cntrl,
>  
>   mhi_write_reg(mhi_cntrl, base, BHIE_TXVECSIZE_OFFS, mhi_buf->len);
>  
> - sequence_id = prandom_u32() & BHIE_TXVECSTATUS_SEQNUM_BMSK;
>   mhi_write_reg_field(mhi_cntrl, base, BHIE_TXVECDB_OFFS,
>   BHIE_TXVECDB_SEQNUM_BMSK, BHIE_TXVECDB_SEQNUM_SHFT,
>   sequence_id);
> @@ -246,14 +244,15 @@ static int mhi_fw_load_sbl(struct mhi_controller 
> *mhi_cntrl,
>   goto invalid_pm_state;
>   }
>  
> - dev_dbg(dev, "Starting SBL download via BHI\n");
> + session_id = MHI_RANDOM_U32_NONZERO(BHI_TXDB_SEQNUM_BMSK);
> + dev_dbg(dev, "Starting SBL download via BHI. Session ID:%u\n",
> + session_id);
>   mhi_write_reg(mhi_cntrl, base, BHI_STATUS, 0);
>   mhi_write_reg(mhi_cntrl, base, BHI_IMGADDR_HIGH,
> upper_32_bits(dma_addr));
>   mhi_write_reg(mhi_cntrl, base, BHI_IMGADDR_LOW,
> lower_32_bits(dma_addr));
>   mhi_write_reg(mhi_cntrl, base, BHI_IMGSIZE, size);
> - session_id = prandom_u32() & BHI_TXDB_SEQNUM_BMSK;
>   mhi_write_reg(mhi_cntrl, base, BHI_IMGTXDB, session_id);
>   read_unlock_bh(pm_lock);
>  
> diff --git a/drivers/bus/mhi/core/internal.h b/drivers/bus/mhi/core/internal.h
> index 0965ca3..80b32c2 100644
> --- a/drivers/bus/mhi/core/internal.h
> +++ b/drivers/bus/mhi/core/internal.h
> @@ -452,6 +452,7 @@ enum mhi_pm_state {
>  #define PRIMARY_CMD_RING 0
>  #define MHI_DEV_WAKE_DB  127
>  #define MHI_MAX_MTU  0x
> +#define MHI_RANDOM_U32_NONZERO(bmsk) (prandom_u32_max(bmsk) + 1)
>  
>  enum mhi_er_type {
>   MHI_ER_TYPE_INVALID = 0x0,
> -- 
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project
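As a side note on the macro itself: prandom_u32_max(ep_ro) returns a pseudo-random
value in [0, ep_ro), so adding one yields a value in [1, ep_ro], i.e. never zero and
never outside the bitmask. A tiny userspace model (not the driver code; the mask
below is only an example value):

  #include <stdio.h>
  #include <stdint.h>
  #include <stdlib.h>

  /* userspace stand-in for the kernel's prandom_u32_max(): returns [0, ep_ro) */
  static uint32_t prandom_u32_max(uint32_t ep_ro)
  {
          return (uint32_t)(((double)rand() / ((double)RAND_MAX + 1.0)) * ep_ro);
  }

  #define MHI_RANDOM_U32_NONZERO(bmsk) (prandom_u32_max(bmsk) + 1)

  int main(void)
  {
          const uint32_t seqnum_bmsk = 0x3fffffff;        /* example mask only */

          for (int i = 0; i < 5; i++) {
                  uint32_t id = MHI_RANDOM_U32_NONZERO(seqnum_bmsk);

                  /* always in [1, seqnum_bmsk]: non-zero and within the field */
                  printf("sequence/session id = %u\n", id);
          }
          return 0;
  }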


Re: [PATCH v7 6/8] bus: mhi: core: Return appropriate error codes for AMSS load failure

2020-05-08 Thread Manivannan Sadhasivam
On Fri, May 08, 2020 at 07:26:46PM -0700, Bhaumik Bhatt wrote:
> When loading AMSS firmware using the BHIe protocol, return -ETIMEDOUT if no
> response is received within the timeout, or return -EIO in case the
> protocol reports a failure or MHI is in an error state.
> 
> Signed-off-by: Bhaumik Bhatt 
> Reviewed-by: Jeffrey Hugo 

Reviewed-by: Manivannan Sadhasivam 

Thanks,
Mani

> ---
>  drivers/bus/mhi/core/boot.c | 23 ---
>  1 file changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/bus/mhi/core/boot.c b/drivers/bus/mhi/core/boot.c
> index 17c636b..cf6dc5a 100644
> --- a/drivers/bus/mhi/core/boot.c
> +++ b/drivers/bus/mhi/core/boot.c
> @@ -176,6 +176,7 @@ static int mhi_fw_load_amss(struct mhi_controller 
> *mhi_cntrl,
>   void __iomem *base = mhi_cntrl->bhie;
>   rwlock_t *pm_lock = _cntrl->pm_lock;
>   u32 tx_status, sequence_id;
> + int ret;
>  
>   read_lock_bh(pm_lock);
>   if (!MHI_REG_ACCESS_VALID(mhi_cntrl->pm_state)) {
> @@ -198,19 +199,19 @@ static int mhi_fw_load_amss(struct mhi_controller 
> *mhi_cntrl,
>   read_unlock_bh(pm_lock);
>  
>   /* Wait for the image download to complete */
> - wait_event_timeout(mhi_cntrl->state_event,
> -MHI_PM_IN_ERROR_STATE(mhi_cntrl->pm_state) ||
> -mhi_read_reg_field(mhi_cntrl, base,
> -   BHIE_TXVECSTATUS_OFFS,
> -   BHIE_TXVECSTATUS_STATUS_BMSK,
> -   BHIE_TXVECSTATUS_STATUS_SHFT,
> -   _status) || tx_status,
> -msecs_to_jiffies(mhi_cntrl->timeout_ms));
> -
> - if (MHI_PM_IN_ERROR_STATE(mhi_cntrl->pm_state))
> + ret = wait_event_timeout(mhi_cntrl->state_event,
> +  MHI_PM_IN_ERROR_STATE(mhi_cntrl->pm_state) ||
> +  mhi_read_reg_field(mhi_cntrl, base,
> +BHIE_TXVECSTATUS_OFFS,
> +BHIE_TXVECSTATUS_STATUS_BMSK,
> +BHIE_TXVECSTATUS_STATUS_SHFT,
> +_status) || tx_status,
> +  msecs_to_jiffies(mhi_cntrl->timeout_ms));
> + if (MHI_PM_IN_ERROR_STATE(mhi_cntrl->pm_state) ||
> + tx_status != BHIE_TXVECSTATUS_STATUS_XFER_COMPL)
>   return -EIO;
>  
> - return (tx_status == BHIE_TXVECSTATUS_STATUS_XFER_COMPL) ? 0 : -EIO;
> + return (!ret) ? -ETIMEDOUT : 0;
>  }
>  
>  static int mhi_fw_load_sbl(struct mhi_controller *mhi_cntrl,
> -- 
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project
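For reference, the restructured wait relies on wait_event_timeout() returning 0 when
the condition is still false after the timeout, and the remaining jiffies (at least 1)
otherwise. A condensed sketch of the resulting error mapping (kernel context;
"done_condition" and "device_error" are placeholders, not real driver symbols):

  long ret;

  ret = wait_event_timeout(mhi_cntrl->state_event, done_condition,
                           msecs_to_jiffies(mhi_cntrl->timeout_ms));
  if (device_error)               /* MHI error state or bad TX status */
          return -EIO;
  if (!ret)                       /* condition never became true: timed out */
          return -ETIMEDOUT;
  return 0;                       /* transfer completed within the timeout */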


Re: [PATCH v7 7/8] bus: mhi: core: Improve debug logs for loading firmware

2020-05-08 Thread Manivannan Sadhasivam
On Fri, May 08, 2020 at 07:26:47PM -0700, Bhaumik Bhatt wrote:
> Add log messages to track boot flow errors and timeouts in SBL or AMSS
> firmware loading, to aid debugging.
> 
> Signed-off-by: Bhaumik Bhatt 
> Reviewed-by: Jeffrey Hugo 

Reviewed-by: Manivannan Sadhasivam 

Thanks,
Mani

> ---
>  drivers/bus/mhi/core/boot.c | 21 ++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/bus/mhi/core/boot.c b/drivers/bus/mhi/core/boot.c
> index cf6dc5a..80e4d76 100644
> --- a/drivers/bus/mhi/core/boot.c
> +++ b/drivers/bus/mhi/core/boot.c
> @@ -121,7 +121,8 @@ static int __mhi_download_rddm_in_panic(struct 
> mhi_controller *mhi_cntrl)
>   ee = mhi_get_exec_env(mhi_cntrl);
>   }
>  
> - dev_dbg(dev, "Waiting for image download completion, current EE: %s\n",
> + dev_dbg(dev,
> + "Waiting for RDDM image download via BHIe, current EE:%s\n",
>   TO_MHI_EXEC_STR(ee));
>  
>   while (retry--) {
> @@ -152,11 +153,14 @@ static int __mhi_download_rddm_in_panic(struct 
> mhi_controller *mhi_cntrl)
>  int mhi_download_rddm_img(struct mhi_controller *mhi_cntrl, bool in_panic)
>  {
>   void __iomem *base = mhi_cntrl->bhie;
> + struct device *dev = _cntrl->mhi_dev->dev;
>   u32 rx_status;
>  
>   if (in_panic)
>   return __mhi_download_rddm_in_panic(mhi_cntrl);
>  
> + dev_dbg(dev, "Waiting for RDDM image download via BHIe\n");
> +
>   /* Wait for the image download to complete */
>   wait_event_timeout(mhi_cntrl->state_event,
>  mhi_read_reg_field(mhi_cntrl, base,
> @@ -174,6 +178,7 @@ static int mhi_fw_load_amss(struct mhi_controller 
> *mhi_cntrl,
>   const struct mhi_buf *mhi_buf)
>  {
>   void __iomem *base = mhi_cntrl->bhie;
> + struct device *dev = _cntrl->mhi_dev->dev;
>   rwlock_t *pm_lock = _cntrl->pm_lock;
>   u32 tx_status, sequence_id;
>   int ret;
> @@ -184,6 +189,7 @@ static int mhi_fw_load_amss(struct mhi_controller 
> *mhi_cntrl,
>   return -EIO;
>   }
>  
> + dev_dbg(dev, "Starting AMSS download via BHIe\n");
>   mhi_write_reg(mhi_cntrl, base, BHIE_TXVECADDR_HIGH_OFFS,
> upper_32_bits(mhi_buf->dma_addr));
>  
> @@ -435,7 +441,12 @@ void mhi_fw_load_handler(struct mhi_controller 
> *mhi_cntrl)
>   release_firmware(firmware);
>  
>   /* Error or in EDL mode, we're done */
> - if (ret || mhi_cntrl->ee == MHI_EE_EDL)
> + if (ret) {
> + dev_err(dev, "MHI did not load SBL, ret:%d\n", ret);
> + return;
> + }
> +
> + if (mhi_cntrl->ee == MHI_EE_EDL)
>   return;
>  
>   write_lock_irq(_cntrl->pm_lock);
> @@ -463,8 +474,10 @@ void mhi_fw_load_handler(struct mhi_controller 
> *mhi_cntrl)
>   if (!mhi_cntrl->fbc_download)
>   return;
>  
> - if (ret)
> + if (ret) {
> + dev_err(dev, "MHI did not enter READY state\n");
>   goto error_read;
> + }
>  
>   /* Wait for the SBL event */
>   ret = wait_event_timeout(mhi_cntrl->state_event,
> @@ -482,6 +495,8 @@ void mhi_fw_load_handler(struct mhi_controller *mhi_cntrl)
>   ret = mhi_fw_load_amss(mhi_cntrl,
>  /* Vector table is the last entry */
>  _info->mhi_buf[image_info->entries - 1]);
> + if (ret)
> + dev_err(dev, "MHI did not load AMSS, ret:%d\n", ret);
>  
>   release_firmware(firmware);
>  
> -- 
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project


Re: [PATCH v7 4/8] bus: mhi: core: Read transfer length from an event properly

2020-05-08 Thread Manivannan Sadhasivam
On Fri, May 08, 2020 at 07:26:44PM -0700, Bhaumik Bhatt wrote:
> From: Hemant Kumar 
> 
> When the MHI driver receives an EOT event, it reads xfer_len from the
> event in the last TRE. The value is under the control of the MHI device
> and is never validated by the host MHI driver. The value should never be
> larger than the real size of the buffer, but a malicious device can
> set the value 0x as maximum. This can cause the driver to overflow
> memory (on both read and write). Fix this issue by using the minimum of
> the transfer length from the event and the buffer length provided.
> 
> Signed-off-by: Hemant Kumar 
> Signed-off-by: Bhaumik Bhatt 
> Reviewed-by: Jeffrey Hugo 

Reviewed-by: Manivannan Sadhasivam 

Thanks,
Mani

> ---
>  drivers/bus/mhi/core/main.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
> index 30798ec..6a80666 100644
> --- a/drivers/bus/mhi/core/main.c
> +++ b/drivers/bus/mhi/core/main.c
> @@ -514,7 +514,10 @@ static int parse_xfer_event(struct mhi_controller 
> *mhi_cntrl,
>   mhi_cntrl->unmap_single(mhi_cntrl, buf_info);
>  
>   result.buf_addr = buf_info->cb_buf;
> - result.bytes_xferd = xfer_len;
> +
> + /* truncate to buf len if xfer_len is larger */
> + result.bytes_xferd =
> + min_t(u16, xfer_len, buf_info->len);
>   mhi_del_ring_element(mhi_cntrl, buf_ring);
>   mhi_del_ring_element(mhi_cntrl, tre_ring);
>   local_rp = tre_ring->rp;
> @@ -598,7 +601,9 @@ static int parse_rsc_event(struct mhi_controller 
> *mhi_cntrl,
>  
>   result.transaction_status = (ev_code == MHI_EV_CC_OVERFLOW) ?
>   -EOVERFLOW : 0;
> - result.bytes_xferd = xfer_len;
> +
> + /* truncate to buf len if xfer_len is larger */
> + result.bytes_xferd = min_t(u16, xfer_len, buf_info->len);
>   result.buf_addr = buf_info->cb_buf;
>   result.dir = mhi_chan->dir;
>  
> -- 
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project


Re: [PATCH][V2] net: tg3: tidy up loop, remove need to compute off with a multiply

2020-05-08 Thread Jakub Kicinski
On Sat,  9 May 2020 00:14:47 +0100 Colin King wrote:
> From: Colin Ian King 
> 
> Currently the value for 'off' is computed using a multiplication, and a
> couple of statements later 'off' is incremented by 'len', but the
> resulting value is never read.  Clean up the code by removing the
> multiplication and just incrementing 'off' by 'len' on each iteration.
> 
> Addresses-Coverity: ("Unused value")
> Signed-off-by: Colin Ian King 

Applied, thank you!


Re: [PATCH v7 5/8] bus: mhi: core: Handle firmware load using state worker

2020-05-08 Thread Manivannan Sadhasivam
On Fri, May 08, 2020 at 07:26:45PM -0700, Bhaumik Bhatt wrote:
> Upon power up, the driver queues the firmware worker thread if the execution
> environment is PBL. The firmware worker is blocked with a timeout until the
> state worker gets a chance to run and unblock it. An endpoint power-up
> failure can be seen if the state worker only gets a chance to run after the
> firmware worker has timed out. Remove this dependency and handle the
> firmware load directly from the state worker thread.
> 
> Signed-off-by: Bhaumik Bhatt 
> Reviewed-by: Jeffrey Hugo 

Reviewed-by: Manivannan Sadhasivam 

Thanks,
Mani

> ---
>  drivers/bus/mhi/core/boot.c | 18 +++---
>  drivers/bus/mhi/core/init.c |  1 -
>  drivers/bus/mhi/core/internal.h |  1 +
>  drivers/bus/mhi/core/pm.c   |  6 +-
>  include/linux/mhi.h |  2 --
>  5 files changed, 5 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/bus/mhi/core/boot.c b/drivers/bus/mhi/core/boot.c
> index ebad5eb..17c636b 100644
> --- a/drivers/bus/mhi/core/boot.c
> +++ b/drivers/bus/mhi/core/boot.c
> @@ -377,30 +377,18 @@ static void mhi_firmware_copy(struct mhi_controller 
> *mhi_cntrl,
>   }
>  }
>  
> -void mhi_fw_load_worker(struct work_struct *work)
> +void mhi_fw_load_handler(struct mhi_controller *mhi_cntrl)
>  {
> - struct mhi_controller *mhi_cntrl;
>   const struct firmware *firmware = NULL;
>   struct image_info *image_info;
> - struct device *dev;
> + struct device *dev = _cntrl->mhi_dev->dev;
>   const char *fw_name;
>   void *buf;
>   dma_addr_t dma_addr;
>   size_t size;
>   int ret;
>  
> - mhi_cntrl = container_of(work, struct mhi_controller, fw_worker);
> - dev = _cntrl->mhi_dev->dev;
> -
> - dev_dbg(dev, "Waiting for device to enter PBL from: %s\n",
> - TO_MHI_EXEC_STR(mhi_cntrl->ee));
> -
> - ret = wait_event_timeout(mhi_cntrl->state_event,
> -  MHI_IN_PBL(mhi_cntrl->ee) ||
> -  MHI_PM_IN_ERROR_STATE(mhi_cntrl->pm_state),
> -  msecs_to_jiffies(mhi_cntrl->timeout_ms));
> -
> - if (!ret || MHI_PM_IN_ERROR_STATE(mhi_cntrl->pm_state)) {
> + if (MHI_PM_IN_ERROR_STATE(mhi_cntrl->pm_state)) {
>   dev_err(dev, "Device MHI is not in valid state\n");
>   return;
>   }
> diff --git a/drivers/bus/mhi/core/init.c b/drivers/bus/mhi/core/init.c
> index 1a93d24..6882206 100644
> --- a/drivers/bus/mhi/core/init.c
> +++ b/drivers/bus/mhi/core/init.c
> @@ -835,7 +835,6 @@ int mhi_register_controller(struct mhi_controller 
> *mhi_cntrl,
>   spin_lock_init(_cntrl->wlock);
>   INIT_WORK(_cntrl->st_worker, mhi_pm_st_worker);
>   INIT_WORK(_cntrl->syserr_worker, mhi_pm_sys_err_worker);
> - INIT_WORK(_cntrl->fw_worker, mhi_fw_load_worker);
>   init_waitqueue_head(_cntrl->state_event);
>  
>   mhi_cmd = mhi_cntrl->mhi_cmd;
> diff --git a/drivers/bus/mhi/core/internal.h b/drivers/bus/mhi/core/internal.h
> index 40c47f9..0965ca3 100644
> --- a/drivers/bus/mhi/core/internal.h
> +++ b/drivers/bus/mhi/core/internal.h
> @@ -627,6 +627,7 @@ void mhi_ring_chan_db(struct mhi_controller *mhi_cntrl,
>  void mhi_deinit_free_irq(struct mhi_controller *mhi_cntrl);
>  void mhi_rddm_prepare(struct mhi_controller *mhi_cntrl,
> struct image_info *img_info);
> +void mhi_fw_load_handler(struct mhi_controller *mhi_cntrl);
>  int mhi_prepare_channel(struct mhi_controller *mhi_cntrl,
>   struct mhi_chan *mhi_chan);
>  int mhi_init_chan_ctxt(struct mhi_controller *mhi_cntrl,
> diff --git a/drivers/bus/mhi/core/pm.c b/drivers/bus/mhi/core/pm.c
> index e7c8318..3cc238a 100644
> --- a/drivers/bus/mhi/core/pm.c
> +++ b/drivers/bus/mhi/core/pm.c
> @@ -528,7 +528,6 @@ static void mhi_pm_disable_transition(struct 
> mhi_controller *mhi_cntrl,
>   dev_dbg(dev, "Waiting for all pending threads to complete\n");
>   wake_up_all(_cntrl->state_event);
>   flush_work(_cntrl->st_worker);
> - flush_work(_cntrl->fw_worker);
>  
>   dev_dbg(dev, "Reset all active channels and remove MHI devices\n");
>   device_for_each_child(mhi_cntrl->cntrl_dev, NULL, mhi_destroy_device);
> @@ -643,7 +642,7 @@ void mhi_pm_st_worker(struct work_struct *work)
>   mhi_cntrl->ee = mhi_get_exec_env(mhi_cntrl);
>   write_unlock_irq(_cntrl->pm_lock);
>   if (MHI_IN_PBL(mhi_cntrl->ee))
> - wake_up_all(_cntrl->state_event);
> + mhi_fw_load_handler(mhi_cntrl);
>   break;
>   case DEV_ST_TRANSITION_SBL:
>   write_lock_irq(_cntrl->pm_lock);
> @@ -976,9 +975,6 @@ int mhi_async_power_up(struct mhi_controller *mhi_cntrl)
>   next_state = MHI_IN_PBL(current_ee) ?
>   DEV_ST_TRANSITION_PBL : DEV_ST_TRANSITION_READY;
>  
> - if (next_state == DEV_ST_TRANSITION_PBL)
> - 

Re: [PATCH] cnic: remove redundant assignment to variable ret

2020-05-08 Thread Jakub Kicinski
On Fri,  8 May 2020 23:40:26 +0100 Colin King wrote:
> From: Colin Ian King 
> 
> The variable ret is being assigned a value that is never read;
> the assignment is redundant and can be removed.
> 
> Addresses-Coverity: ("Unused value")
> Signed-off-by: Colin Ian King 

Applied, thank you!


Re: [PATCH] net: lio_core: remove redundant assignment to variable tx_done

2020-05-08 Thread Jakub Kicinski
On Fri,  8 May 2020 23:58:10 +0100 Colin King wrote:
> From: Colin Ian King 
> 
> The variable tx_done is being assigned a value that is never read,
> as the function returns a few statements later.  The assignment is
> redundant and can be removed.
> 
> Addresses-Coverity: ("Unused value")
> Signed-off-by: Colin Ian King 

Applied, thank you!


Re: [PATCH] net/atheros: remove redundant assignment to variable size

2020-05-08 Thread Jakub Kicinski
On Fri,  8 May 2020 23:33:21 +0100 Colin King wrote:
> From: Colin Ian King 
> 
> The variable size is being assigned a value that is never read;
> the assignment is redundant and can be removed.
> 
> Addresses-Coverity: ("Unused value")
> Signed-off-by: Colin Ian King 

Applied, thank you!


Re: [PATCH v7 3/8] bus: mhi: core: Add range check for channel id received in event ring

2020-05-08 Thread Manivannan Sadhasivam
On Fri, May 08, 2020 at 07:26:43PM -0700, Bhaumik Bhatt wrote:
> From: Hemant Kumar 
> 
> The MHI data completion handler function reads the channel id from the event
> ring element. The value is under the control of the MHI device and can be
> any value between 0 and 255. In order to prevent out-of-bounds access,
> add a bounds check against the max channel supported by the controller
> and skip processing of that event ring element.
> 
> Signed-off-by: Hemant Kumar 
> Signed-off-by: Bhaumik Bhatt 
> Reviewed-by: Jeffrey Hugo 

Reviewed-by: Manivannan Sadhasivam 

Thanks,
Mani

> ---
>  drivers/bus/mhi/core/main.c | 40 +---
>  1 file changed, 29 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
> index 605640c..30798ec 100644
> --- a/drivers/bus/mhi/core/main.c
> +++ b/drivers/bus/mhi/core/main.c
> @@ -775,9 +775,18 @@ int mhi_process_ctrl_ev_ring(struct mhi_controller 
> *mhi_cntrl,
>   }
>   case MHI_PKT_TYPE_TX_EVENT:
>   chan = MHI_TRE_GET_EV_CHID(local_rp);
> - mhi_chan = _cntrl->mhi_chan[chan];
> - parse_xfer_event(mhi_cntrl, local_rp, mhi_chan);
> - event_quota--;
> +
> + WARN_ON(chan >= mhi_cntrl->max_chan);
> +
> + /*
> +  * Only process the event ring elements whose channel
> +  * ID is within the maximum supported range.
> +  */
> + if (chan < mhi_cntrl->max_chan) {
> + mhi_chan = _cntrl->mhi_chan[chan];
> + parse_xfer_event(mhi_cntrl, local_rp, mhi_chan);
> + event_quota--;
> + }
>   break;
>   default:
>   dev_err(dev, "Unhandled event type: %d\n", type);
> @@ -820,14 +829,23 @@ int mhi_process_data_event_ring(struct mhi_controller 
> *mhi_cntrl,
>   enum mhi_pkt_type type = MHI_TRE_GET_EV_TYPE(local_rp);
>  
>   chan = MHI_TRE_GET_EV_CHID(local_rp);
> - mhi_chan = _cntrl->mhi_chan[chan];
> -
> - if (likely(type == MHI_PKT_TYPE_TX_EVENT)) {
> - parse_xfer_event(mhi_cntrl, local_rp, mhi_chan);
> - event_quota--;
> - } else if (type == MHI_PKT_TYPE_RSC_TX_EVENT) {
> - parse_rsc_event(mhi_cntrl, local_rp, mhi_chan);
> - event_quota--;
> +
> + WARN_ON(chan >= mhi_cntrl->max_chan);
> +
> + /*
> +  * Only process the event ring elements whose channel
> +  * ID is within the maximum supported range.
> +  */
> + if (chan < mhi_cntrl->max_chan) {
> + mhi_chan = _cntrl->mhi_chan[chan];
> +
> + if (likely(type == MHI_PKT_TYPE_TX_EVENT)) {
> + parse_xfer_event(mhi_cntrl, local_rp, mhi_chan);
> + event_quota--;
> + } else if (type == MHI_PKT_TYPE_RSC_TX_EVENT) {
> + parse_rsc_event(mhi_cntrl, local_rp, mhi_chan);
> + event_quota--;
> + }
>   }
>  
>   mhi_recycle_ev_ring_element(mhi_cntrl, ev_ring);
> -- 
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project


Re: [PATCH v7 2/8] bus: mhi: core: Cache intmod from mhi event to mhi channel

2020-05-08 Thread Manivannan Sadhasivam
On Fri, May 08, 2020 at 07:26:42PM -0700, Bhaumik Bhatt wrote:
> From: Hemant Kumar 
> 
> The driver is using a zero-initialized intmod value from the mhi channel when
> configuring the TRE bei field. This prevents interrupt moderation from
> taking effect in case it is supported by an event ring. Fix this by
> copying the intmod value from the associated event ring to the mhi channel
> upon registering the mhi controller.
> 
> Signed-off-by: Hemant Kumar 
> Signed-off-by: Bhaumik Bhatt 
> Reviewed-by: Jeffrey Hugo 

Reviewed-by: Manivannan Sadhasivam 

Thanks,
Mani

> ---
>  drivers/bus/mhi/core/init.c | 4 
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/bus/mhi/core/init.c b/drivers/bus/mhi/core/init.c
> index eb2ab05..1a93d24 100644
> --- a/drivers/bus/mhi/core/init.c
> +++ b/drivers/bus/mhi/core/init.c
> @@ -863,6 +863,10 @@ int mhi_register_controller(struct mhi_controller 
> *mhi_cntrl,
>   mutex_init(_chan->mutex);
>   init_completion(_chan->completion);
>   rwlock_init(_chan->lock);
> +
> + /* used in setting bei field of TRE */
> + mhi_event = _cntrl->mhi_event[mhi_chan->er_index];
> + mhi_chan->intmod = mhi_event->intmod;
>   }
>  
>   if (mhi_cntrl->bounce_buf) {
> -- 
> The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
> a Linux Foundation Collaborative Project


Re: [PATCH v7 1/8] bus: mhi: core: Refactor mhi queue APIs

2020-05-08 Thread Manivannan Sadhasivam
On Fri, May 08, 2020 at 07:26:41PM -0700, Bhaumik Bhatt wrote:
> From: Hemant Kumar 
> 
> Move all the common code that generates a TRE from mhi_queue_buf,
> mhi_queue_dma and mhi_queue_skb into mhi_gen_tre. This helps
> to centralize the TRE generation code, which makes any future
> bug fixes in these APIs easier to manage.
> 
> Suggested-by: Jeffrey Hugo 
> Signed-off-by: Hemant Kumar 
> Signed-off-by: Bhaumik Bhatt 
> Reviewed-by: Jeffrey Hugo 

Reviewed-by: Manivannan Sadhasivam 

Thanks,
Mani

> ---
>  drivers/bus/mhi/core/internal.h |   3 +-
>  drivers/bus/mhi/core/main.c | 107 
> +---
>  2 files changed, 47 insertions(+), 63 deletions(-)
> 
> diff --git a/drivers/bus/mhi/core/internal.h b/drivers/bus/mhi/core/internal.h
> index 095d95b..40c47f9 100644
> --- a/drivers/bus/mhi/core/internal.h
> +++ b/drivers/bus/mhi/core/internal.h
> @@ -670,8 +670,7 @@ int mhi_process_ctrl_ev_ring(struct mhi_controller 
> *mhi_cntrl,
>  irqreturn_t mhi_intvec_handler(int irq_number, void *dev);
>  
>  int mhi_gen_tre(struct mhi_controller *mhi_cntrl, struct mhi_chan *mhi_chan,
> - void *buf, void *cb, size_t buf_len, enum mhi_flags flags);
> -
> + struct mhi_buf_info *info, enum mhi_flags flags);
>  int mhi_map_single_no_bb(struct mhi_controller *mhi_cntrl,
>struct mhi_buf_info *buf_info);
>  int mhi_map_single_use_bb(struct mhi_controller *mhi_cntrl,
> diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
> index c26eed0..605640c 100644
> --- a/drivers/bus/mhi/core/main.c
> +++ b/drivers/bus/mhi/core/main.c
> @@ -919,9 +919,7 @@ int mhi_queue_skb(struct mhi_device *mhi_dev, enum 
> dma_data_direction dir,
>   struct mhi_chan *mhi_chan = (dir == DMA_TO_DEVICE) ? mhi_dev->ul_chan :
>mhi_dev->dl_chan;
>   struct mhi_ring *tre_ring = _chan->tre_ring;
> - struct mhi_ring *buf_ring = _chan->buf_ring;
> - struct mhi_buf_info *buf_info;
> - struct mhi_tre *mhi_tre;
> + struct mhi_buf_info buf_info = { };
>   int ret;
>  
>   /* If MHI host pre-allocates buffers then client drivers cannot queue */
> @@ -946,27 +944,15 @@ int mhi_queue_skb(struct mhi_device *mhi_dev, enum 
> dma_data_direction dir,
>   /* Toggle wake to exit out of M2 */
>   mhi_cntrl->wake_toggle(mhi_cntrl);
>  
> - /* Generate the TRE */
> - buf_info = buf_ring->wp;
> -
> - buf_info->v_addr = skb->data;
> - buf_info->cb_buf = skb;
> - buf_info->wp = tre_ring->wp;
> - buf_info->dir = mhi_chan->dir;
> - buf_info->len = len;
> - ret = mhi_cntrl->map_single(mhi_cntrl, buf_info);
> - if (ret)
> - goto map_error;
> -
> - mhi_tre = tre_ring->wp;
> -
> - mhi_tre->ptr = MHI_TRE_DATA_PTR(buf_info->p_addr);
> - mhi_tre->dword[0] = MHI_TRE_DATA_DWORD0(buf_info->len);
> - mhi_tre->dword[1] = MHI_TRE_DATA_DWORD1(1, 1, 0, 0);
> + buf_info.v_addr = skb->data;
> + buf_info.cb_buf = skb;
> + buf_info.len = len;
>  
> - /* increment WP */
> - mhi_add_ring_element(mhi_cntrl, tre_ring);
> - mhi_add_ring_element(mhi_cntrl, buf_ring);
> + ret = mhi_gen_tre(mhi_cntrl, mhi_chan, _info, mflags);
> + if (unlikely(ret)) {
> + read_unlock_bh(_cntrl->pm_lock);
> + return ret;
> + }
>  
>   if (mhi_chan->dir == DMA_TO_DEVICE)
>   atomic_inc(_cntrl->pending_pkts);
> @@ -980,11 +966,6 @@ int mhi_queue_skb(struct mhi_device *mhi_dev, enum 
> dma_data_direction dir,
>   read_unlock_bh(_cntrl->pm_lock);
>  
>   return 0;
> -
> -map_error:
> - read_unlock_bh(_cntrl->pm_lock);
> -
> - return ret;
>  }
>  EXPORT_SYMBOL_GPL(mhi_queue_skb);
>  
> @@ -996,9 +977,8 @@ int mhi_queue_dma(struct mhi_device *mhi_dev, enum 
> dma_data_direction dir,
>mhi_dev->dl_chan;
>   struct device *dev = _cntrl->mhi_dev->dev;
>   struct mhi_ring *tre_ring = _chan->tre_ring;
> - struct mhi_ring *buf_ring = _chan->buf_ring;
> - struct mhi_buf_info *buf_info;
> - struct mhi_tre *mhi_tre;
> + struct mhi_buf_info buf_info = { };
> + int ret;
>  
>   /* If MHI host pre-allocates buffers then client drivers cannot queue */
>   if (mhi_chan->pre_alloc)
> @@ -1025,25 +1005,16 @@ int mhi_queue_dma(struct mhi_device *mhi_dev, enum 
> dma_data_direction dir,
>   /* Toggle wake to exit out of M2 */
>   mhi_cntrl->wake_toggle(mhi_cntrl);
>  
> - /* Generate the TRE */
> - buf_info = buf_ring->wp;
> - WARN_ON(buf_info->used);
> - buf_info->p_addr = mhi_buf->dma_addr;
> - buf_info->pre_mapped = true;
> - buf_info->cb_buf = mhi_buf;
> - buf_info->wp = tre_ring->wp;
> - buf_info->dir = mhi_chan->dir;
> - buf_info->len = len;
> -
> - mhi_tre = tre_ring->wp;
> -
> - mhi_tre->ptr = MHI_TRE_DATA_PTR(buf_info->p_addr);
> - mhi_tre->dword[0] 

Re: [PATCH v4 1/4] device-dax: Don't leak kernel memory to user space after unloading kmem

2020-05-08 Thread David Hildenbrand



> On 09.05.2020 at 01:53, Andrew Morton wrote:
> 
> On Fri,  8 May 2020 10:42:14 +0200 David Hildenbrand  
> wrote:
> 
>> Assume we have kmem configured and loaded:
>>  [root@localhost ~]# cat /proc/iomem
>>  ...
>>  14000-33fff : Persistent Memory$
>>14000-1481f : namespace0.0
>>15000-33fff : dax0.0
>>  15000-33fff : System RAM
>> 
>> Assume we try to unload kmem. This force-unloading will work, even if
>> memory cannot get removed from the system.
>>  [root@localhost ~]# rmmod kmem
>>  [   86.380228] removing memory fails, because memory 
>> [0x00015000-0x000157ff] is onlined
>>  ...
>>  [   86.431225] kmem dax0.0: DAX region [mem 0x15000-0x33fff] cannot 
>> be hotremoved until the next reboot
>> 
>> Now, we can reconfigure the namespace:
>>  [root@localhost ~]# ndctl create-namespace --force --reconfig=namespace0.0 
>> --mode=devdax
>>  [  131.409351] nd_pmem namespace0.0: could not reserve region [mem 
>> 0x14000-0x33fff]dax
>>  [  131.410147] nd_pmem: probe of namespace0.0 failed with error 
>> -16namespace0.0 --mode=devdax
>>  ...
>> 
>> This fails as expected due to the busy memory resource, and the memory
>> cannot be used. However, the dax0.0 device is removed, and along its name.
>> 
>> The name of the memory resource now points at freed memory (name of the
>> device).
>>  [root@localhost ~]# cat /proc/iomem
>>  ...
>>  14000-33fff : Persistent Memory
>>14000-1481f : namespace0.0
>>15000-33fff : �_�^7_��/_��wR��WQ���^��� ...
>>15000-33fff : System RAM
>> 
>> We have to make sure to duplicate the string. While at it, remove the
>> superfluous setting of the name and fixup a stale comment.
>> 
>> Fixes: 9f960da72b25 ("device-dax: "Hotremove" persistent memory that is used 
>> like normal RAM")
>> Cc: sta...@vger.kernel.org # v5.3
> 
> hm.
> 
> Is this really -stable material?  These are all privileged operations,
> I expect?

Yes, my thought was rather that an admin could bring the system into such a
state (by mistake?). Let's see if somebody has a suggestion.

I guess if we were really unlucky, we could access invalid memory and trigger a
BUG (e.g., if the name sits on a page at the end of memory and does not contain a 0 byte).

> 
> Assuming "yes", I've queued this separately, staged for 5.7-rcX.  I'll
> redo patches 2-4 as a three-patch series for 5.8-rc1.

Makes sense, let's wait for review feedback, thanks!
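For reference, the underlying fix amounts to giving the resource its own copy of
the device name ("We have to make sure to duplicate the string"), presumably via
kstrdup(); roughly along these lines (a sketch under that assumption, not the
exact patch; res_name, start and size are placeholders):

  /* duplicate the name so /proc/iomem never points at memory that is
   * freed together with the dax device */
  res_name = kstrdup(dev_name(dev), GFP_KERNEL);
  if (!res_name)
          return -ENOMEM;
  res = request_mem_region(start, size, res_name);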


[PATCH 4/4] fs: btrfs: fix a data race in btrfs_block_rsv_release()

2020-05-08 Thread Jia-Ju Bai
The functions btrfs_block_rsv_release() and
btrfs_update_delayed_refs_rsv() are concurrently executed at runtime in
the following call contexts:

Thread 1:
  btrfs_file_write_iter()
btrfs_buffered_write()
  btrfs_delalloc_release_extents()
btrfs_inode_rsv_release()
  __btrfs_block_rsv_release()

Thread 2:
  finish_ordered_fn()
btrfs_finish_ordered_io()
  insert_reserved_file_extent()
__btrfs_drop_extents()
  btrfs_free_extent()
btrfs_add_delayed_data_ref()
  btrfs_update_delayed_refs_rsv()

In __btrfs_block_rsv_release():
  else if (... && !delayed_rsv->full)

In btrfs_update_delayed_refs_rsv():
  spin_lock(&delayed_rsv->lock);
  delayed_rsv->size += num_bytes;
  delayed_rsv->full = 0;
  spin_unlock(&delayed_rsv->lock);

Thus a data race for delayed_rsv->full can occur.
This race was found and actually reproduced by our concurrency fuzzer.

To fix this race, the spinlock delayed_rsv->lock is used to
protect the access to delayed_rsv->full in btrfs_block_rsv_release().

Signed-off-by: Jia-Ju Bai 
---
 fs/btrfs/block-rsv.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/block-rsv.c b/fs/btrfs/block-rsv.c
index 27efec8f7c5b..89c53a7137b4 100644
--- a/fs/btrfs/block-rsv.c
+++ b/fs/btrfs/block-rsv.c
@@ -277,6 +277,11 @@ u64 btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
 	struct btrfs_block_rsv *global_rsv = &fs_info->global_block_rsv;
 	struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_refs_rsv;
 	struct btrfs_block_rsv *target = NULL;
+	unsigned short full = 0;
+
+	spin_lock(&delayed_rsv->lock);
+	full = delayed_rsv->full;
+	spin_unlock(&delayed_rsv->lock);
 
 	/*
 	 * If we are the delayed_rsv then push to the global rsv, otherwise dump
@@ -284,7 +289,7 @@ u64 btrfs_block_rsv_release(struct btrfs_fs_info *fs_info,
 	 */
 	if (block_rsv == delayed_rsv)
 		target = global_rsv;
-	else if (block_rsv != global_rsv && !delayed_rsv->full)
+	else if (block_rsv != global_rsv && !full)
 		target = delayed_rsv;
 
 	if (target && block_rsv->space_info != target->space_info)
-- 
2.17.1



Re: [PATCH 6/6] exec: Set the point of no return sooner

2020-05-08 Thread Kees Cook
On Fri, May 08, 2020 at 01:48:13PM -0500, Eric W. Biederman wrote:
> 
> Make the code more robust by marking the point of no return sooner.
> This ensures that future code changes don't need to worry about how
> they return errors if they are past this point.
> 
> This results in no actual change in behavior, as __do_execve_file does
> not force SIGSEGV when there is already a fatal signal pending past
> the point of no return.  Further, the only error returns from de_thread
> and exec_mmap that can occur result in fatal signals being pending.
> 
> Signed-off-by: "Eric W. Biederman" 

Yes, thank you. I'm a fan; this makes the comment above the function a
bit easier to understand, since the very first thing is to set the
point_of_no_return. :)

Reviewed-by: Kees Cook 

-- 
Kees Cook


Re: [PATCH -next] net: dsa: sja1105: remove set but not used variable 'prev_time'

2020-05-08 Thread Jakub Kicinski
On Fri, 8 May 2020 20:00:55 +0800 Samuel Zou wrote:
> Fixes gcc '-Wunused-but-set-variable' warning:
> 
> drivers/net/dsa/sja1105/sja1105_vl.c:468:6: warning: variable ‘prev_time’ set 
> but not used [-Wunused-but-set-variable]
>   u32 prev_time = 0;
>   ^
> 
> Reported-by: Hulk Robot 
> Signed-off-by: Samuel Zou 

Applied, thank you!


Re: [PATCH 5/6] exec: Move handling of the point of no return to the top level

2020-05-08 Thread Kees Cook
On Fri, May 08, 2020 at 01:47:10PM -0500, Eric W. Biederman wrote:
> 
> Move the handling of the point of no return from search_binary_handler
> into __do_execve_file so that it is easier to find, and to keep
> things robust in the face of change.
> 
> Make it clear that an existing fatal signal will take precedence over
> a forced SIGSEGV by not forcing SIGSEGV if a fatal signal is already
> pending.  This does not change the behavior but it saves a reader
> of the code the tedium of reading and understanding force_sig
> and the signal delivery code.
> 
> Update the comment in begin_new_exec about where SIGSEGV is forced.
> 
> Keep point_of_no_return from being a mystery by documenting
> what the code is doing where it forces SIGSEGV if the
> code is past the point of no return.
> 
> Signed-off-by: "Eric W. Biederman" 

I had to read the code around these changes a bit carefully, but yeah,
this looks like a safe cleanup. It is a behavioral change, though (in
that it unmasks non-SEGV fatal signals), so I do wonder if something
somewhere might notice this, but I'd agree that it's the more robust
behavior.

Reviewed-by: Kees Cook 

-- 
Kees Cook


[PATCH 3/4] fs: btrfs: fix data races in start_transaction()

2020-05-08 Thread Jia-Ju Bai
The functions start_transaction() and btrfs_update_delayed_refs_rsv()
are concurrently executed at runtime in the following call contexts:

Thread 1:
  btrfs_sync_file()
btrfs_start_transaction()
  start_transaction()

Thread 2:
  finish_ordered_fn()
btrfs_finish_ordered_io()
  insert_reserved_file_extent()
__btrfs_drop_extents()
  btrfs_free_extent()
btrfs_add_delayed_data_ref()
  btrfs_update_delayed_refs_rsv()

In start_transaction():
  if (delayed_refs_rsv->full == 0)
  ...
  else if (... && !delayed_refs_rsv->full)

In btrfs_update_delayed_refs_rsv():
  spin_lock(&delayed_rsv->lock);
  delayed_rsv->size += num_bytes;
  delayed_rsv->full = 0;
  spin_unlock(&delayed_rsv->lock);

The names delayed_refs_rsv->full and delayed_rsv->full refer to the same
memory, so these data races can occur.
These data races were found and actually reproduced by our concurrency
fuzzer.

To fix these races, the spinlock delayed_refs_rsv->lock is used to
protect the access to delayed_refs_rsv->full in start_transaction().

Signed-off-by: Jia-Ju Bai 
---
 fs/btrfs/transaction.c | 9 +++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 8cede6eb9843..ca38d7cf665d 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -524,6 +524,7 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 	u64 qgroup_reserved = 0;
 	bool reloc_reserved = false;
 	int ret;
+	unsigned short full = 0;
 
 	/* Send isn't supposed to start transactions. */
 	ASSERT(current->journal_info != BTRFS_SEND_TRANS_STUB);
@@ -541,6 +542,10 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 		goto got_it;
 	}
 
+	spin_lock(&delayed_refs_rsv->lock);
+	full = delayed_refs_rsv->full;
+	spin_unlock(&delayed_refs_rsv->lock);
+
 	/*
 	 * Do the reservation before we join the transaction so we can do all
 	 * the appropriate flushing if need be.
@@ -563,7 +568,7 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 	 * refill that amount for whatever is missing in the reserve.
 	 */
 	num_bytes = btrfs_calc_insert_metadata_size(fs_info, num_items);
-	if (delayed_refs_rsv->full == 0) {
+	if (full == 0) {
 		delayed_refs_bytes = num_bytes;
 		num_bytes <<= 1;
 	}
@@ -585,7 +590,7 @@ start_transaction(struct btrfs_root *root, unsigned int num_items,
 			num_bytes -= delayed_refs_bytes;
 		}
 	} else if (num_items == 0 && flush == BTRFS_RESERVE_FLUSH_ALL &&
-		   !delayed_refs_rsv->full) {
+		   !full) {
 		/*
 		 * Some people call with btrfs_start_transaction(root, 0)
 		 * because they can be throttled, but have some other mechanism
-- 
2.17.1



[PATCH 2/4] fs: btrfs: fix data races in extent_write_cache_pages()

2020-05-08 Thread Jia-Ju Bai
The function extent_write_cache_pages is concurrently executed with
itself at runtime in the following call contexts:

Thread 1:
  btrfs_sync_file()
start_ordered_ops()
  btrfs_fdatawrite_range()
btrfs_writepages() [via function pointer]
  extent_writepages()
extent_write_cache_pages()

Thread 2:
  btrfs_writepages() 
extent_writepages()
  extent_write_cache_pages()

In extent_write_cache_pages():
  index = mapping->writeback_index;
  ...
  mapping->writeback_index = done_index;

The accesses to mapping->writeback_index are not synchronized, and thus
data races for this value can occur.
These data races were found and actually reproduced by our concurrency 
fuzzer.

To fix these races, the spinlock mapping->private_lock is used to
protect the accesses to mapping->writeback_index.

Signed-off-by: Jia-Ju Bai 
---
 fs/btrfs/extent_io.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 39e45b8a5031..8c33a60bde1d 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4160,7 +4160,9 @@ static int extent_write_cache_pages(struct address_space *mapping,
 
 	pagevec_init(&pvec);
 	if (wbc->range_cyclic) {
+		spin_lock(&mapping->private_lock);
 		index = mapping->writeback_index; /* Start from prev offset */
+		spin_unlock(&mapping->private_lock);
 		end = -1;
 		/*
 		 * Start from the beginning does not need to cycle over the
@@ -4271,8 +4273,11 @@ static int extent_write_cache_pages(struct address_space *mapping,
 		goto retry;
 	}
 
-	if (wbc->range_cyclic || (wbc->nr_to_write > 0 && range_whole))
+	if (wbc->range_cyclic || (wbc->nr_to_write > 0 && range_whole)) {
+		spin_lock(&mapping->private_lock);
 		mapping->writeback_index = done_index;
+		spin_unlock(&mapping->private_lock);
+	}
 
 	btrfs_add_delayed_iput(inode);
 	return ret;
-- 
2.17.1



[PATCH 1/4] fs: btrfs: fix a data race in btrfs_block_group_done()

2020-05-08 Thread Jia-Ju Bai
The functions btrfs_block_group_done() and caching_thread() are 
concurrently executed at runtime in the following call contexts:

Thread 1:
  btrfs_sync_file()
start_ordered_ops()
  btrfs_fdatawrite_range()
btrfs_writepages() [via function pointer]
  extent_writepages()
extent_write_cache_pages()
  __extent_writepage()
writepage_delalloc()
  btrfs_run_delalloc_range()
cow_file_range()
  btrfs_reserve_extent()
find_free_extent()
  btrfs_block_group_done()

Thread 2:
  caching_thread()

In btrfs_block_group_done():
  smp_mb();
  return cache->cached == BTRFS_CACHE_FINISHED ||
  cache->cached == BTRFS_CACHE_ERROR;

In caching_thread():
  spin_lock(&block_group->lock);
  block_group->caching_ctl = NULL;
  block_group->cached = ret ? BTRFS_CACHE_ERROR : BTRFS_CACHE_FINISHED;
  spin_unlock(&block_group->lock);

The names cache->cached and block_group->cached refer to the same memory,
and thus a data race can occur.
This data race was found and actually reproduced by our concurrency 
fuzzer.

To fix this race, the spinlock cache->lock is used to protect the 
access to cache->cached in btrfs_block_group_done().

Signed-off-by: Jia-Ju Bai 
---
 fs/btrfs/block-group.h | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h
index 107bb557ca8d..fb5f12acea40 100644
--- a/fs/btrfs/block-group.h
+++ b/fs/btrfs/block-group.h
@@ -278,9 +278,13 @@ static inline u64 btrfs_system_alloc_profile(struct btrfs_fs_info *fs_info)
 
 static inline int btrfs_block_group_done(struct btrfs_block_group *cache)
 {
+	int flag;
 	smp_mb();
-	return cache->cached == BTRFS_CACHE_FINISHED ||
-		cache->cached == BTRFS_CACHE_ERROR;
+	spin_lock(&cache->lock);
+	flag = (cache->cached == BTRFS_CACHE_FINISHED ||
+		cache->cached == BTRFS_CACHE_ERROR);
+	spin_unlock(&cache->lock);
+	return flag;
 }
 
 #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS
-- 
2.17.1



Re: [PATCH 4/6] exec: Run sync_mm_rss before taking exec_update_mutex

2020-05-08 Thread Kees Cook
On Fri, May 08, 2020 at 01:45:56PM -0500, Eric W. Biederman wrote:
> Like exec_mm_release, sync_mm_rss is about flushing out the state of
> the old_mm, which does not need to happen under exec_update_mutex.
> 
> Make this explicit by moving sync_mm_rss outside of exec_update_mutex.
> 
> Signed-off-by: "Eric W. Biederman" 

Reviewed-by: Kees Cook 

Additional thoughts below...

> ---
>  fs/exec.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 11a5c073aa35..15682a1dfee9 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1051,13 +1051,14 @@ static int exec_mmap(struct mm_struct *mm)
>   tsk = current;
>   old_mm = current->mm;
>   exec_mm_release(tsk, old_mm);
> + if (old_mm)
> + sync_mm_rss(old_mm);
>  
>   ret = mutex_lock_killable(>signal->exec_update_mutex);
>   if (ret)
>   return ret;
>  
>   if (old_mm) {
> - sync_mm_rss(old_mm);
>   /*
>* Make sure that if there is a core dump in progress
>* for the old mm, we get out and die instead of going

$ git grep exec_mm_release
fs/exec.c:  exec_mm_release(tsk, old_mm);
include/linux/sched/mm.h:extern void exec_mm_release(struct task_struct *, 
struct mm_struct *);
kernel/fork.c:void exec_mm_release(struct task_struct *tsk, struct mm_struct 
*mm)

kernel/fork.c:

void exit_mm_release(struct task_struct *tsk, struct mm_struct *mm)
{
futex_exit_release(tsk);
mm_release(tsk, mm);
}

void exec_mm_release(struct task_struct *tsk, struct mm_struct *mm)
{
futex_exec_release(tsk);
mm_release(tsk, mm);
}

$ git grep exit_mm_release
include/linux/sched/mm.h:extern void exit_mm_release(struct task_struct *, 
struct mm_struct *);
kernel/exit.c:  exit_mm_release(current, mm);
kernel/fork.c:void exit_mm_release(struct task_struct *tsk, struct mm_struct 
*mm)

kernel/exit.c:

exit_mm_release(current, mm);
if (!mm)
return;
sync_mm_rss(mm);

It looks to me like both exec_mm_release() and exit_mm_release() could
easily have the sync_mm_rss(...) folded into their function bodies and
removed from the callers. *shrug*
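For illustration, the folding suggested above might look roughly like this in
kernel/fork.c (untested sketch, not a patch):

  void exit_mm_release(struct task_struct *tsk, struct mm_struct *mm)
  {
          futex_exit_release(tsk);
          mm_release(tsk, mm);
          if (mm)
                  sync_mm_rss(mm);
  }

  void exec_mm_release(struct task_struct *tsk, struct mm_struct *mm)
  {
          futex_exec_release(tsk);
          mm_release(tsk, mm);
          if (mm)
                  sync_mm_rss(mm);
  }

with the corresponding sync_mm_rss() calls then dropped from exit_mm() and exec_mmap().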

-- 
Kees Cook


Re: [PATCH v4 12/12] mtd: Support kmsg dumper based on pstore/blk

2020-05-08 Thread WeiXiong Liao
hi Kees Cook,

On 2020/5/8 PM 2:40, Kees Cook wrote:
> From: WeiXiong Liao 
> 
> This introduces mtdpstore, which is similar to mtdoops but more
> powerful. It uses pstore/blk, and aims to store panic and oops logs to
> a flash partition, where pstore can later read back and present as files
> in the mounted pstore filesystem.
> 
> To make mtdpstore work, the "blkdev" parameter of pstore/blk should be set
> to an MTD device name or MTD device number. For more details, see
> Documentation/admin-guide/pstore-blk.rst
> 
> This solves a number of issues:
> - Work duplication: both pstore and mtdoops do the same job of storing
>   panic/oops logs. They have very similar logic, registering with the kmsg
>   dumper and storing logs to several chunks one by one.
> - Layer violations: drivers should provide methods instead of policies.
>   MTD should provide read/write/erase operations, and allow a higher
>   level driver to provide the chunk management, kmsg dump
>   configuration, etc.
> - Missing features: pstore provides many additional features, including
>   presenting the logs as files, logging dump time and count, and
>   supporting other frontends like pmsg, console, etc.
> 
> Signed-off-by: WeiXiong Liao 
> Link: 
> https://lore.kernel.org/r/1585126506-18635-12-git-send-email-liaoweixi...@allwinnertech.com
> Signed-off-by: Kees Cook 
> ---
>  Documentation/admin-guide/pstore-blk.rst |   9 +-
>  drivers/mtd/Kconfig  |  10 +
>  drivers/mtd/Makefile |   1 +
>  drivers/mtd/mtdpstore.c  | 564 +++
>  fs/pstore/platform.c |  22 +-
>  5 files changed, 583 insertions(+), 23 deletions(-)
>  create mode 100644 drivers/mtd/mtdpstore.c
> 
> diff --git a/Documentation/admin-guide/pstore-blk.rst 
> b/Documentation/admin-guide/pstore-blk.rst
> index 2f3602397715..bf0b5a227e24 100644
> --- a/Documentation/admin-guide/pstore-blk.rst
> +++ b/Documentation/admin-guide/pstore-blk.rst
> @@ -43,9 +43,9 @@ blkdev
>  ~~
>  
>  The block device to use. Most of the time, it is a partition of block device.
> -It's required for pstore/blk.
> +It's required for pstore/blk. It is also used for MTD device.
>  
> -It accepts the following variants:
> +It accepts the following variants for block device:
>  
>  1.  device number in hexadecimal represents itself; no
> leading 0x, for example b302.
> @@ -64,6 +64,11 @@ It accepts the following variants:
> partition with a known unique id.
>  #. : major and minor number of the device separated by a colon.
>  
> +It accepts the following variants for MTD device:
> +
> +1.  MTD device name. "pstore" is recommended.
> +#.  MTD device number.
> +
>  kmsg_size
>  ~
>  
> diff --git a/drivers/mtd/Kconfig b/drivers/mtd/Kconfig
> index 42d401ea60ee..6ddab796216d 100644
> --- a/drivers/mtd/Kconfig
> +++ b/drivers/mtd/Kconfig
> @@ -170,6 +170,16 @@ config MTD_OOPS
> buffer in a flash partition where it can be read back at some
> later point.
>  
> +config MTD_PSTORE
> + tristate "Log panic/oops to an MTD buffer based on pstore"
> + depends on PSTORE_BLK
> + help
> +   This enables panic and oops messages to be logged to a circular
> +   buffer in a flash partition where it can be read back as files after
> +   mounting pstore filesystem.
> +
> +   If unsure, say N.
> +
>  config MTD_SWAP
>   tristate "Swap on MTD device support"
>   depends on MTD && SWAP
> diff --git a/drivers/mtd/Makefile b/drivers/mtd/Makefile
> index 56cc60ccc477..593d0593a038 100644
> --- a/drivers/mtd/Makefile
> +++ b/drivers/mtd/Makefile
> @@ -20,6 +20,7 @@ obj-$(CONFIG_RFD_FTL)   += rfd_ftl.o
>  obj-$(CONFIG_SSFDC)  += ssfdc.o
>  obj-$(CONFIG_SM_FTL) += sm_ftl.o
>  obj-$(CONFIG_MTD_OOPS)   += mtdoops.o
> +obj-$(CONFIG_MTD_PSTORE) += mtdpstore.o
>  obj-$(CONFIG_MTD_SWAP)   += mtdswap.o
>  
>  nftl-objs:= nftlcore.o nftlmount.o
> diff --git a/drivers/mtd/mtdpstore.c b/drivers/mtd/mtdpstore.c
> new file mode 100644
> index ..50c8fc746f39
> --- /dev/null
> +++ b/drivers/mtd/mtdpstore.c
> @@ -0,0 +1,564 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#define dev_fmt(fmt) "mtdoops-pstore: " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static struct mtdpstore_context {
> + int index;
> + struct pstore_blk_info info;
> + struct psblk_device dev;
> + struct mtd_info *mtd;
> + unsigned long *rmmap;   /* removed bit map */
> + unsigned long *usedmap; /* used bit map */
> + /*
> +  * used for panic write
> +  * As there are no block_isbad for panic case, we should keep this
> +  * status before panic to ensure panic_write not failed.
> +  */
> + unsigned long *badmap;  /* bad block bit map */
> +} oops_cxt;
> +
> +static int mtdpstore_block_isbad(struct mtdpstore_context *cxt, loff_t off)
> +{
> + int 

Re: [PATCH 3/6] exec: Stop open coding mutex_lock_killable of cred_guard_mutex

2020-05-08 Thread Kees Cook
On Fri, May 08, 2020 at 01:45:25PM -0500, Eric W. Biederman wrote:
> 
> Oleg modified the code that did
> "mutex_lock_interruptible(>cred_guard_mutex)" to return
> -ERESTARTNOINTR instead of -EINTR, so that userspace will never see a
> failure to grab the mutex.
> 
> Slightly earlier Liam R. Howlett defined mutex_lock_killable for
> exactly the same situation but it does it a little more cleanly.
> 
> Switch the code to mutex_lock_killable so that it is clearer what the
> code is doing.
> 
> Ref: ad776537cc6b ("Add mutex_lock_killable")
> Ref: 793285fcafce ("cred_guard_mutex: do not return -EINTR to user-space")
> Signed-off-by: "Eric W. Biederman" 

Reviewed-by: Kees Cook 

-- 
Kees Cook


Re: [PATCH 2/6] exec: Fix spelling of search_binary_handler in a comment

2020-05-08 Thread Kees Cook
On Fri, May 08, 2020 at 01:44:46PM -0500, Eric W. Biederman wrote:
> 
> Signed-off-by: "Eric W. Biederman" 

Reviewed-by: Kees Cook 

-- 
Kees Cook


Re: [PATCH 1/6] exec: Move the comment from above de_thread to above unshare_sighand

2020-05-08 Thread Kees Cook
On Fri, May 08, 2020 at 01:44:19PM -0500, Eric W. Biederman wrote:
> 
> The comment describes work that now happens in unshare_sighand so
> move the comment where it makes sense.
> 
> Signed-off-by: "Eric W. Biederman" 

Reviewed-by: Kees Cook 

-- 
Kees Cook


Re: linux-next 20200506 - build failure with net/bpfilter/bpfilter_umh

2020-05-08 Thread Valdis Klētnieks
On Sat, 09 May 2020 12:45:22 +0900, Masahiro Yamada said:

> >   LD [U]  net/bpfilter/bpfilter_umh
> > /usr/bin/ld: cannot find -lc
> > collect2: error: ld returned 1 exit status
> > make[2]: *** [scripts/Makefile.userprogs:36: net/bpfilter/bpfilter_umh] 
> > Error 1
> > make[1]: *** [scripts/Makefile.build:494: net/bpfilter] Error 2
> > make: *** [Makefile:1726: net] Error 2
> >
> > The culprit is this commit:
>
> Thanks. I will try to fix it,
> but my commit is innocent because
> it is just textual cleanups.
> No functional change is intended.

Hmm... 'git show' made it look like a totally new line...

Proposed patch following in separate email...






Re: [PATCH v4 06/12] pstore/blk: Add console frontend support

2020-05-08 Thread WeiXiong Liao
hi Kees Cook,

On 2020/5/8 PM 2:39, Kees Cook wrote:
> From: WeiXiong Liao 
> 
> Support backend for console. To enable console backend, just make
> console_size be greater than 0 and a multiple of 4096.
> 
> Signed-off-by: WeiXiong Liao 
> Link: 
> https://lore.kernel.org/r/1585126506-18635-6-git-send-email-liaoweixi...@allwinnertech.com
> Signed-off-by: Kees Cook 
> ---
>  fs/pstore/Kconfig   | 12 +++
>  fs/pstore/blk.c | 12 ++-
>  fs/pstore/zone.c| 67 +++--
>  include/linux/pstore_zone.h |  4 ++-
>  4 files changed, 90 insertions(+), 5 deletions(-)
> 
> diff --git a/fs/pstore/Kconfig b/fs/pstore/Kconfig
> index f18cd126d83f..f1484f751c5e 100644
> --- a/fs/pstore/Kconfig
> +++ b/fs/pstore/Kconfig
> @@ -236,3 +236,15 @@ config PSTORE_BLK_PMSG_SIZE
>  
> NOTE that, both Kconfig and module parameters can configure
> pstore/blk, but module parameters have priority over Kconfig.
> +
> +config PSTORE_BLK_CONSOLE_SIZE
> + int "Size in Kbytes of console to store"
> + depends on PSTORE_BLK
> + depends on PSTORE_CONSOLE
> + default 64
> + help
> +   This just sets size of console (console_size) for pstore/blk. The
> +   size is in KB and must be a multiple of 4.
> +
> +   NOTE that, both Kconfig and module parameters can configure
> +   pstore/blk, but module parameters have priority over Kconfig.
> diff --git a/fs/pstore/blk.c b/fs/pstore/blk.c
> index 401e5ba66a5f..813025ea7edd 100644
> --- a/fs/pstore/blk.c
> +++ b/fs/pstore/blk.c
> @@ -32,6 +32,14 @@ static long pmsg_size = -1;
>  module_param(pmsg_size, long, 0400);
>  MODULE_PARM_DESC(pmsg_size, "pmsg size in kbytes");
>  
> +#if IS_ENABLED(CONFIG_PSTORE_CONSOLE)
> +static long console_size = CONFIG_PSTORE_BLK_CONSOLE_SIZE;
> +#else
> +static long console_size = -1;
> +#endif
> +module_param(console_size, long, 0400);
> +MODULE_PARM_DESC(console_size, "console size in kbytes");
> +
>  /*
>   * blkdev - The block device to use.
>   *
> @@ -83,7 +91,8 @@ static struct bdev_info {
>   *   whole disk).
>   *   On success, the number of bytes should be returned, others
>   *   means error.
> - * @write:   The same as @read.
> + * @write:   The same as @read, but the following error number:
> + *   -EBUSY means try to write again later.
>   * @panic_write:The write operation only used for panic case. It's optional
>   *   if you do not care panic log. The parameters and return value
>   *   are the same as @read.
> @@ -133,6 +142,7 @@ static int psblk_register_do(struct psblk_device *dev)
>  
>   verify_size(kmsg_size, 4096, dev->flags & PSTORE_FLAGS_DMESG);
>   verify_size(pmsg_size, 4096, dev->flags & PSTORE_FLAGS_PMSG);
> + verify_size(console_size, 4096, dev->flags & PSTORE_FLAGS_CONSOLE);
>  #undef verify_size
>  
>   pstore_zone_info->total_size = dev->total_size;
> diff --git a/fs/pstore/zone.c b/fs/pstore/zone.c
> index f472b06a6c14..0b952eea39fe 100644
> --- a/fs/pstore/zone.c
> +++ b/fs/pstore/zone.c
> @@ -87,10 +87,12 @@ struct pstore_zone {
>   *
>   * @opszs: oops/panic storage zones
>   * @ppsz: pmsg storage zone
> + * @cpsz: console storage zone
>   * @oops_max_cnt: max count of @opszs
>   * @oops_read_cnt: counter to read oops zone
>   * @oops_write_cnt: counter to write
>   * @pmsg_read_cnt: counter to read pmsg zone
> + * @console_read_cnt: counter to read console zone
>   * @oops_counter: counter to oops
>   * @panic_counter: counter to panic
>   * @recovered: whether finish recovering data from storage
> @@ -102,10 +104,12 @@ struct pstore_zone {
>  struct psz_context {
>   struct pstore_zone **opszs;
>   struct pstore_zone *ppsz;
> + struct pstore_zone *cpsz;
>   unsigned int oops_max_cnt;
>   unsigned int oops_read_cnt;
>   unsigned int oops_write_cnt;
>   unsigned int pmsg_read_cnt;
> + unsigned int console_read_cnt;
>   /*
>* the counter should be recovered when recover.
>* It records the oops/panic times after burning rather than booting.
> @@ -125,6 +129,9 @@ struct psz_context {
>  };
>  static struct psz_context psz_cxt;
>  
> +static void psz_flush_all_dirty_zones(struct work_struct *);
> +static DECLARE_WORK(psz_cleaner, psz_flush_all_dirty_zones);

I think it's better to use delayed work.

static DECLARE_DELAYED_WORK(psz_cleaner, psz_flush_all_dirty_zones);

> +
>  /**
>   * enum psz_flush_mode - flush mode for psz_zone_write()
>   *
> @@ -235,6 +242,9 @@ static int psz_zone_write(struct pstore_zone *zone,
>   return 0;
>  dirty:
> + atomic_set(&zone->dirty, true);
> + /* flush dirty zones nicely */
> + if (wcnt == -EBUSY && !is_on_panic())
> + schedule_work(&psz_cleaner);

Change to:

schedule_delayed_work(&psz_cleaner, msecs_to_jiffies(500));

Delay for 500 ms so that more console output is merged into one flush and
the work runs less often.
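
Spelled out, the suggested batching looks roughly like this (a sketch;
psz_schedule_flush is a made-up wrapper name, the other symbols come from
the patch, and the 500 ms coalescing delay is only a proposal, not
something required by the workqueue API):

	static void psz_flush_all_dirty_zones(struct work_struct *work);
	static DECLARE_DELAYED_WORK(psz_cleaner, psz_flush_all_dirty_zones);

	static void psz_schedule_flush(void)
	{
		/* If the work is already pending this is a no-op, so bursts
		 * of dirty zones collapse into a single delayed flush.
		 */
		schedule_delayed_work(&psz_cleaner, msecs_to_jiffies(500));
	}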

>   return -EBUSY;
>  }
>  
> @@ -291,6 

Re: [PATCH] cnic: remove redundant assignment to variable ret

2020-05-08 Thread Michael Chan
On Fri, May 8, 2020 at 3:40 PM Colin King  wrote:
>
> From: Colin Ian King 
>
> The variable ret is being assigned with a value that is never read,
> the assignment is redundant and can be removed.
>
> Addresses-Coverity: ("Unused value")
> Signed-off-by: Colin Ian King 

Reviewed-by: Michael Chan 


net/mptcp/protocol.c:538:6: warning: 'dfrag_collapsed' may be used uninitialized in this function

2020-05-08 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
master
head:   d5eeab8d7e269e8cfc53b915bccd7bd30485bcbf
commit: 3b1d6210a9577369103330b0d802b0bf74b65e7f mptcp: implement and use 
MPTCP-level retransmission
date:   6 weeks ago
config: x86_64-randconfig-a003-20200509 (attached as .config)
compiler: gcc-6 (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
reproduce:
git checkout 3b1d6210a9577369103330b0d802b0bf74b65e7f
# save the attached .config to linux build tree
make ARCH=x86_64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kbuild test robot 

Note: it may well be a FALSE warning. FWIW you are at least aware of it now.
http://gcc.gnu.org/wiki/Better_Uninitialized_Warnings

All warnings (new ones prefixed by >>):

   net/mptcp/protocol.c: In function 'mptcp_sendmsg_frag':
>> net/mptcp/protocol.c:538:6: warning: 'dfrag_collapsed' may be used 
>> uninitialized in this function [-Wmaybe-uninitialized]
  if (!dfrag_collapsed) {
 ^

vim +/dfrag_collapsed +538 net/mptcp/protocol.c

18b683bff89d46ac Paolo Abeni   2020-03-27  422  
6d0060f600adfdda Mat Martineau 2020-01-21  423  static int 
mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk,
3f8e0aae17963634 Paolo Abeni   2020-03-27  424
struct msghdr *msg, struct mptcp_data_frag *dfrag,
3f8e0aae17963634 Paolo Abeni   2020-03-27  425
long *timeo, int *pmss_now,
57040755a3e43a1e Paolo Abeni   2020-01-21  426
int *ps_goal)
6d0060f600adfdda Mat Martineau 2020-01-21  427  {
18b683bff89d46ac Paolo Abeni   2020-03-27  428  int mss_now, 
avail_size, size_goal, offset, ret, frag_truesize = 0;
18b683bff89d46ac Paolo Abeni   2020-03-27  429  bool dfrag_collapsed, 
can_collapse = false;
6d0060f600adfdda Mat Martineau 2020-01-21  430  struct mptcp_sock *msk 
= mptcp_sk(sk);
6d0060f600adfdda Mat Martineau 2020-01-21  431  struct mptcp_ext *mpext 
= NULL;
3f8e0aae17963634 Paolo Abeni   2020-03-27  432  bool retransmission = 
!!dfrag;
57040755a3e43a1e Paolo Abeni   2020-01-21  433  struct sk_buff *skb, 
*tail;
6d0060f600adfdda Mat Martineau 2020-01-21  434  struct page_frag *pfrag;
3f8e0aae17963634 Paolo Abeni   2020-03-27  435  struct page *page;
3f8e0aae17963634 Paolo Abeni   2020-03-27  436  u64 *write_seq;
6d0060f600adfdda Mat Martineau 2020-01-21  437  size_t psize;
6d0060f600adfdda Mat Martineau 2020-01-21  438  
6d0060f600adfdda Mat Martineau 2020-01-21  439  /* use the mptcp page 
cache so that we can easily move the data
6d0060f600adfdda Mat Martineau 2020-01-21  440   * from one substream 
to another, but do per subflow memory accounting
3f8e0aae17963634 Paolo Abeni   2020-03-27  441   * Note: pfrag is used 
only !retransmission, but the compiler if
3f8e0aae17963634 Paolo Abeni   2020-03-27  442   * fooled into a 
warning if we don't init here
6d0060f600adfdda Mat Martineau 2020-01-21  443   */
6d0060f600adfdda Mat Martineau 2020-01-21  444  pfrag = 
sk_page_frag(sk);
3f8e0aae17963634 Paolo Abeni   2020-03-27  445  while ((!retransmission 
&& !mptcp_page_frag_refill(ssk, pfrag)) ||
6d0060f600adfdda Mat Martineau 2020-01-21  446 
!mptcp_ext_cache_refill(msk)) {
6d0060f600adfdda Mat Martineau 2020-01-21  447  ret = 
sk_stream_wait_memory(ssk, timeo);
6d0060f600adfdda Mat Martineau 2020-01-21  448  if (ret)
6d0060f600adfdda Mat Martineau 2020-01-21  449  return 
ret;
18b683bff89d46ac Paolo Abeni   2020-03-27  450  
18b683bff89d46ac Paolo Abeni   2020-03-27  451  /* if 
sk_stream_wait_memory() sleeps snd_una can change
18b683bff89d46ac Paolo Abeni   2020-03-27  452   * 
significantly, refresh the rtx queue
18b683bff89d46ac Paolo Abeni   2020-03-27  453   */
18b683bff89d46ac Paolo Abeni   2020-03-27  454  
mptcp_clean_una(sk);
18b683bff89d46ac Paolo Abeni   2020-03-27  455  
8ab183deb26a3b79 Paolo Abeni   2020-01-21  456  if 
(unlikely(__mptcp_needs_tcp_fallback(msk)))
8ab183deb26a3b79 Paolo Abeni   2020-01-21  457  return 
0;
6d0060f600adfdda Mat Martineau 2020-01-21  458  }
3f8e0aae17963634 Paolo Abeni   2020-03-27  459  if (!retransmission) {
3f8e0aae17963634 Paolo Abeni   2020-03-27  460  write_seq = 
>write_seq;
3f8e0aae17963634 Paolo Abeni   2020-03-27  461  page = 
pfrag->page;
3f8e0aae17963634 Paolo Abeni   2020-03-27  462  } else {
3f8e0aae17963634 Paolo Abeni   2020-03-27  463  write_seq = 
>data_seq;
3f8e0aae17963634 Paolo Abeni   2020-03-27  464  page = 
dfrag->page;
3f8e0aae17963634 Paolo Abeni   2020-03-27  465  }
6d0060f600adfdda Mat Martineau 2020-01-21  466  
6d0060f600adfdda 

Re: [PATCH] net: tg3: tidy up loop, remove need to compute off with a multiply

2020-05-08 Thread Michael Chan
On Fri, May 8, 2020 at 7:31 PM Joe Perches  wrote:
>
> On Fri, 2020-05-08 at 18:48 -0700, Jakub Kicinski wrote:
> > On Sat, 9 May 2020 00:31:03 +0100 Colin Ian King wrote:
> > > > My preference would be for
> > > >
> > > > {
> > > >   int i;
> > > >   u32 off = 0;
> > > >
> > > >   for (i = 0; i < TG3_SD_NUM_RECS; i++) {
> > > >   tg3_ape_scratchpad_read(tp, (u32 *)ocir, off, TG3_OCIR_LEN);
> > > >
> > > >   if (ocir->signature != TG3_OCIR_SIG_MAGIC ||
> > > >   !(ocir->version_flags & TG3_OCIR_FLAG_ACTIVE))
> > > >   memset(ocir, 0, TG3_OCIR_LEN);
> > > >
> > > >   off += TG3_OCIR_LEN;
> > > >   ocir++;
> > > >   }
> > > >
> > > OK, I'll send a V3 tomorrow.
> >
> > I already reviewed and applied v2, just waiting for builds to finish,
> > let's leave it.
>
>
> I think clarity should be preferred.

Either way is fine with me.  I'm fine with v2 since it's already applied.


[PATCH] bpfilter: document build requirements for bpfilter_umh

2020-05-08 Thread Valdis Klētnieks
It's not intuitively obvious that bpfilter_umh is a statically linked binary.
Mention the toolchain requirement in the Kconfig help, so people
have an easier time figuring out what's needed.

Signed-off-by: Valdis Kletnieks 

diff --git a/net/bpfilter/Kconfig b/net/bpfilter/Kconfig
index fed9290e3b41..0ec6c7958c20 100644
--- a/net/bpfilter/Kconfig
+++ b/net/bpfilter/Kconfig
@@ -13,4 +13,8 @@ config BPFILTER_UMH
default m
help
  This builds bpfilter kernel module with embedded user mode helper
+
+ Note: your toolchain must support building static binaries, since
+ rootfs isn't mounted at the time when __init functions are called
+ and do_execv won't be able to find the elf interpreter.
 endif



[PATCH 12/15] ath10k: use new module_firmware_crashed()

2020-05-08 Thread Luis Chamberlain
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: linux-wirel...@vger.kernel.org
Cc: ath...@lists.infradead.org
Cc: Kalle Valo 
Signed-off-by: Luis Chamberlain 
---
 drivers/net/wireless/ath/ath10k/pci.c  | 2 ++
 drivers/net/wireless/ath/ath10k/sdio.c | 2 ++
 drivers/net/wireless/ath/ath10k/snoc.c | 1 +
 3 files changed, 5 insertions(+)

diff --git a/drivers/net/wireless/ath/ath10k/pci.c 
b/drivers/net/wireless/ath/ath10k/pci.c
index 1d941d53fdc9..6bd0f3b518b9 100644
--- a/drivers/net/wireless/ath/ath10k/pci.c
+++ b/drivers/net/wireless/ath/ath10k/pci.c
@@ -1767,6 +1767,7 @@ static void ath10k_pci_fw_dump_work(struct work_struct 
*work)
scnprintf(guid, sizeof(guid), "n/a");
 
ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
+   module_firmware_crashed();
ath10k_print_driver_info(ar);
ath10k_pci_dump_registers(ar, crash_data);
ath10k_ce_dump_registers(ar, crash_data);
@@ -2837,6 +2838,7 @@ static int ath10k_pci_hif_power_up(struct ath10k *ar,
if (ret) {
if (ath10k_pci_has_fw_crashed(ar)) {
ath10k_warn(ar, "firmware crashed during chip reset\n");
+   module_firmware_crashed();
ath10k_pci_fw_crashed_clear(ar);
ath10k_pci_fw_crashed_dump(ar);
}
diff --git a/drivers/net/wireless/ath/ath10k/sdio.c 
b/drivers/net/wireless/ath/ath10k/sdio.c
index e2aff2254a40..d34ad289380f 100644
--- a/drivers/net/wireless/ath/ath10k/sdio.c
+++ b/drivers/net/wireless/ath/ath10k/sdio.c
@@ -794,6 +794,7 @@ static int ath10k_sdio_mbox_proc_dbg_intr(struct ath10k *ar)
 
/* TODO: Add firmware crash handling */
ath10k_warn(ar, "firmware crashed\n");
+   module_firmware_crashed();
 
/* read counter to clear the interrupt, the debug error interrupt is
 * counter 0.
@@ -915,6 +916,7 @@ static int ath10k_sdio_mbox_proc_cpu_intr(struct ath10k *ar)
if (cpu_int_status & MBOX_CPU_STATUS_ENABLE_ASSERT_MASK) {
ath10k_err(ar, "firmware crashed!\n");
queue_work(ar->workqueue, >restart_work);
+   module_firmware_crashed();
}
return ret;
 }
diff --git a/drivers/net/wireless/ath/ath10k/snoc.c 
b/drivers/net/wireless/ath/ath10k/snoc.c
index 354d49b1cd45..7cfc123c345c 100644
--- a/drivers/net/wireless/ath/ath10k/snoc.c
+++ b/drivers/net/wireless/ath/ath10k/snoc.c
@@ -1451,6 +1451,7 @@ void ath10k_snoc_fw_crashed_dump(struct ath10k *ar)
scnprintf(guid, sizeof(guid), "n/a");
 
ath10k_err(ar, "firmware crashed! (guid %s)\n", guid);
+   module_firmware_crashed();
ath10k_print_driver_info(ar);
ath10k_msa_dump_memory(ar, crash_data);
mutex_unlock(>dump_mutex);
-- 
2.25.1



Re: [PATCH v5] x86: bitops: fix build regression

2020-05-08 Thread Sedat Dilek
On Sat, May 9, 2020 at 1:47 AM Jesse Brandeburg
 wrote:
>
> On Fri, 8 May 2020 13:28:35 -0700
> Nathan Chancellor  wrote:
>
> > On Fri, May 08, 2020 at 11:32:29AM -0700, Nick Desaulniers wrote:
> > > Use the `%b` "x86 Operand Modifier" to instead force register allocation
> > > to select a lower-8-bit GPR operand.
>
> This looks OK to me, I appreciate the work done to find the right
> fix and clean up the code while not breaking sparse! I had a look at
> the assembly from gcc 9.3.1 and it looks good. Thanks!
>
> Reviewed-by: Jesse Brandeburg 
>

Tested v5 on Debian/testing AMD64 with a selfmade llvm-toolchain
(LLVM/Clang/LLD) v10.0.1+git92d5c1be9ee9

Please add:

 Tested-by: Sedat Dilek 

For details see


Thanks to all involved people.

- Sedat -
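
For reference, a stand-alone illustration of the "%b" operand modifier
being discussed (this is not the kernel's bitops code, just a minimal
sketch):

	/* "%b1" makes the compiler print the low-8-bit name of whatever
	 * register it allocated for operand 1 (e.g. %al instead of %eax),
	 * so the instruction gets a byte-sized register operand.
	 */
	static inline void store_low_byte(unsigned char *dst, unsigned int val)
	{
		asm volatile("movb %b1, %0" : "=m" (*dst) : "r" (val));
	}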


Re: [PATCH v4 05/12] pstore/blk: Add support for pmsg frontend

2020-05-08 Thread WeiXiong Liao
hi Kees Cook,

On 2020/5/8 PM 2:39, Kees Cook wrote:
> From: WeiXiong Liao 
> 
> Add pmsg support to pstore/blk (through pstore/zone). To enable, pmsg_size
> must be greater than 0 and a multiple of 4096.
> 
> Signed-off-by: WeiXiong Liao 
> Link: 
> https://lore.kernel.org/r/1585126506-18635-5-git-send-email-liaoweixi...@allwinnertech.com
> Signed-off-by: Kees Cook 
> ---
>  fs/pstore/Kconfig   |  12 ++
>  fs/pstore/blk.c |   9 ++
>  fs/pstore/zone.c| 268 ++--
>  include/linux/pstore_zone.h |   2 +
>  4 files changed, 281 insertions(+), 10 deletions(-)
> 
> diff --git a/fs/pstore/Kconfig b/fs/pstore/Kconfig
> index 92ba73bd0b62..f18cd126d83f 100644
> --- a/fs/pstore/Kconfig
> +++ b/fs/pstore/Kconfig
> @@ -224,3 +224,15 @@ config PSTORE_BLK_MAX_REASON
>  
> NOTE that, both Kconfig and module parameters can configure
> pstore/blk, but module parameters have priority over Kconfig.
> +
> +config PSTORE_BLK_PMSG_SIZE
> + int "Size in Kbytes of pmsg to store"
> + depends on PSTORE_BLK
> + depends on PSTORE_PMSG
> + default 64
> + help
> +   This just sets size of pmsg (pmsg_size) for pstore/blk. The size is
> +   in KB and must be a multiple of 4.
> +
> +   NOTE that, both Kconfig and module parameters can configure
> +   pstore/blk, but module parameters have priority over Kconfig.
> diff --git a/fs/pstore/blk.c b/fs/pstore/blk.c
> index d1c3074aa128..401e5ba66a5f 100644
> --- a/fs/pstore/blk.c
> +++ b/fs/pstore/blk.c
> @@ -24,6 +24,14 @@ module_param(max_reason, int, 0400);
>  MODULE_PARM_DESC(max_reason,
>"maximum reason for kmsg dump (default 2: Oops and Panic)");
>  
> +#if IS_ENABLED(CONFIG_PSTORE_PMSG)
> +static long pmsg_size = CONFIG_PSTORE_BLK_PMSG_SIZE;
> +#else
> +static long pmsg_size = -1;
> +#endif
> +module_param(pmsg_size, long, 0400);
> +MODULE_PARM_DESC(pmsg_size, "pmsg size in kbytes");
> +
>  /*
>   * blkdev - The block device to use.
>   *
> @@ -124,6 +132,7 @@ static int psblk_register_do(struct psblk_device *dev)
>   }
>  
>   verify_size(kmsg_size, 4096, dev->flags & PSTORE_FLAGS_DMESG);
> + verify_size(pmsg_size, 4096, dev->flags & PSTORE_FLAGS_PMSG);
>  #undef verify_size
>  
>   pstore_zone_info->total_size = dev->total_size;
> diff --git a/fs/pstore/zone.c b/fs/pstore/zone.c
> index 6c25c443c8e2..f472b06a6c14 100644
> --- a/fs/pstore/zone.c
> +++ b/fs/pstore/zone.c
> @@ -23,12 +23,14 @@
>   *
>   * @sig: signature to indicate header (PSZ_SIG xor PSZONE-type value)
>   * @datalen: length of data in @data
> + * @start: offset into @data where the beginning of the stored bytes begin
>   * @data: zone data.
>   */
>  struct psz_buffer {
>  #define PSZ_SIG (0x43474244) /* DBGC */
>   uint32_t sig;
>   atomic_t datalen;
> + atomic_t start;
>   uint8_t data[];
>  };
>  
> @@ -84,9 +86,11 @@ struct pstore_zone {
>   * struct psz_context - all about running state of pstore/zone
>   *
>   * @opszs: oops/panic storage zones
> + * @ppsz: pmsg storage zone
>   * @oops_max_cnt: max count of @opszs
>   * @oops_read_cnt: counter to read oops zone
>   * @oops_write_cnt: counter to write
> + * @pmsg_read_cnt: counter to read pmsg zone
>   * @oops_counter: counter to oops
>   * @panic_counter: counter to panic
>   * @recovered: whether finish recovering data from storage
> @@ -97,9 +101,11 @@ struct pstore_zone {
>   */
>  struct psz_context {
>   struct pstore_zone **opszs;
> + struct pstore_zone *ppsz;
>   unsigned int oops_max_cnt;
>   unsigned int oops_read_cnt;
>   unsigned int oops_write_cnt;
> + unsigned int pmsg_read_cnt;
>   /*
>* the counter should be recovered when recover.
>* It records the oops/panic times after burning rather than booting.
> @@ -139,6 +145,11 @@ static inline int buffer_datalen(struct pstore_zone 
> *zone)
>   return atomic_read(>buffer->datalen);
>  }
>  
> +static inline int buffer_start(struct pstore_zone *zone)
> +{
> + return atomic_read(>buffer->start);
> +}
> +
>  static inline bool is_on_panic(void)
>  {
>   struct psz_context *cxt = _cxt;
> @@ -146,10 +157,10 @@ static inline bool is_on_panic(void)
>   return atomic_read(>on_panic);
>  }
>  
> -static ssize_t psz_zone_read(struct pstore_zone *zone, char *buf,
> +static ssize_t psz_zone_read_buffer(struct pstore_zone *zone, char *buf,
>   size_t len, unsigned long off)
>  {
> - if (!buf || !zone->buffer)
> + if (!buf || !zone || !zone->buffer)
>   return -EINVAL;
>   if (off > zone->buffer_size)
>   return -EINVAL;
> @@ -158,6 +169,18 @@ static ssize_t psz_zone_read(struct pstore_zone *zone, 
> char *buf,
>   return len;
>  }
>  
> +static int psz_zone_read_oldbuf(struct pstore_zone *zone, char *buf,
> + size_t len, unsigned long off)
> +{
> + if (!buf || !zone || !zone->oldbuf)
> + return -EINVAL;
> + if (off 

[PATCH 01/15] taint: add module firmware crash taint support

2020-05-08 Thread Luis Chamberlain
Device driver firmware can crash, and sometimes, this can leave your
system in a state which makes the device or subsystem completely
useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
of scraping some magical words from the kernel log, which is driver
specific, is much easier. So instead provide a helper which lets drivers
annotate this.

Once this happens, scrapers can easily look for the module's taint flag
for a firmware crash. This will taint both the kernel and the
respective calling module.

The new helper module_firmware_crashed() uses LOCKDEP_STILL_OK as this
fact should in no way shape or form affect lockdep. This taint is device
driver specific.

Signed-off-by: Luis Chamberlain 
---
 include/linux/kernel.h|  3 ++-
 include/linux/module.h| 13 +
 include/trace/events/module.h |  3 ++-
 kernel/module.c   |  5 +++--
 kernel/panic.c|  1 +
 5 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index 04a5885cec1b..19e1541c82c7 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -601,7 +601,8 @@ extern enum system_states {
 #define TAINT_LIVEPATCH			15
 #define TAINT_AUX  16
 #define TAINT_RANDSTRUCT   17
-#define TAINT_FLAGS_COUNT  18
+#define TAINT_FIRMWARE_CRASH   18
+#define TAINT_FLAGS_COUNT  19
 
 struct taint_flag {
char c_true;/* character printed when tainted */
diff --git a/include/linux/module.h b/include/linux/module.h
index 2c2e988bcf10..221200078180 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -697,6 +697,14 @@ static inline bool is_livepatch_module(struct module *mod)
 bool is_module_sig_enforced(void);
 void set_module_sig_enforced(void);
 
+void add_taint_module(struct module *mod, unsigned flag,
+ enum lockdep_ok lockdep_ok);
+
+static inline void module_firmware_crashed(void)
+{
+   add_taint_module(THIS_MODULE, TAINT_FIRMWARE_CRASH, LOCKDEP_STILL_OK);
+}
+
 #else /* !CONFIG_MODULES... */
 
 static inline struct module *__module_address(unsigned long addr)
@@ -844,6 +852,11 @@ void *dereference_module_function_descriptor(struct module 
*mod, void *ptr)
return ptr;
 }
 
+static inline void module_firmware_crashed(void)
+{
+   add_taint(TAINT_FIRMWARE_CRASH, LOCKDEP_STILL_OK);
+}
+
 #endif /* CONFIG_MODULES */
 
 #ifdef CONFIG_SYSFS
diff --git a/include/trace/events/module.h b/include/trace/events/module.h
index 097485c73c01..b749ea25affd 100644
--- a/include/trace/events/module.h
+++ b/include/trace/events/module.h
@@ -26,7 +26,8 @@ struct module;
{ (1UL << TAINT_OOT_MODULE),"O" },  \
{ (1UL << TAINT_FORCED_MODULE), "F" },  \
{ (1UL << TAINT_CRAP),  "C" },  \
-   { (1UL << TAINT_UNSIGNED_MODULE),   "E" })
+   { (1UL << TAINT_UNSIGNED_MODULE),   "E" },  \
+   { (1UL << TAINT_FIRMWARE_CRASH),"Q" })
 
 TRACE_EVENT(module_load,
 
diff --git a/kernel/module.c b/kernel/module.c
index 80faaf2116dd..f98e8c25c6b4 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -325,12 +325,13 @@ static inline int strong_try_module_get(struct module 
*mod)
return -ENOENT;
 }
 
-static inline void add_taint_module(struct module *mod, unsigned flag,
-   enum lockdep_ok lockdep_ok)
+void add_taint_module(struct module *mod, unsigned flag,
+ enum lockdep_ok lockdep_ok)
 {
add_taint(flag, lockdep_ok);
set_bit(flag, >taints);
 }
+EXPORT_SYMBOL_GPL(add_taint_module);
 
 /*
  * A thread that wants to hold a reference to a module only while it
diff --git a/kernel/panic.c b/kernel/panic.c
index ec6d7d788ce7..504fb926947e 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -384,6 +384,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
[ TAINT_LIVEPATCH ] = { 'K', ' ', true },
[ TAINT_AUX ]   = { 'X', ' ', true },
[ TAINT_RANDSTRUCT ]= { 'T', ' ', true },
+   [ TAINT_FIRMWARE_CRASH ]= { 'Q', ' ', true },
 };
 
 /**
-- 
2.25.1
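
Once such a flag exists, a monitoring tool only has to test one bit of
/proc/sys/kernel/tainted rather than scrape the log. A minimal user-space
sketch (the bit number 18 comes from this patch and would change if the
series were rebased):

	#include <stdio.h>

	int main(void)
	{
		unsigned long taints = 0;
		FILE *f = fopen("/proc/sys/kernel/tainted", "r");

		if (!f)
			return 1;
		if (fscanf(f, "%lu", &taints) != 1)
			taints = 0;
		fclose(f);

		if (taints & (1UL << 18))	/* TAINT_FIRMWARE_CRASH here */
			printf("a driver reported a firmware crash\n");
		return 0;
	}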



[PATCH 11/15] wimax/i2400m: use new module_firmware_crashed()

2020-05-08 Thread Luis Chamberlain
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: linux-wi...@intel.com
Cc: Inaky Perez-Gonzalez 
Signed-off-by: Luis Chamberlain 
---
 drivers/net/wimax/i2400m/rx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/wimax/i2400m/rx.c b/drivers/net/wimax/i2400m/rx.c
index c9fb619a9e01..307a62ca183f 100644
--- a/drivers/net/wimax/i2400m/rx.c
+++ b/drivers/net/wimax/i2400m/rx.c
@@ -712,6 +712,7 @@ void __i2400m_roq_queue(struct i2400m *i2400m, struct 
i2400m_roq *roq,
dev_err(dev, "SW BUG? failed to insert packet\n");
dev_err(dev, "ERX: roq %p [ws %u] skb %p nsn %d sn %u\n",
roq, roq->ws, skb, nsn, roq_data->sn);
+   module_firmware_crashed();
skb_queue_walk(>queue, skb_itr) {
roq_data_itr = (struct i2400m_roq_data *) _itr->cb;
nsn_itr = __i2400m_roq_nsn(roq, roq_data_itr->sn);
-- 
2.25.1



[PATCH 07/15] cxgb4: use new module_firmware_crashed()

2020-05-08 Thread Luis Chamberlain
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: Vishal Kulkarni 
Signed-off-by: Luis Chamberlain 
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index a70018f067aa..c67fc86c0e42 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -3646,6 +3646,7 @@ void t4_fatal_err(struct adapter *adap)
 * could be exposed to the adapter.  RDMA MWs for example...
 */
t4_shutdown_adapter(adap);
+   module_firmware_crashed();
for_each_port(adap, port) {
struct net_device *dev = adap->port[port];
 
-- 
2.25.1



[PATCH 10/15] soc: qcom: ipa: use new module_firmware_crashed()

2020-05-08 Thread Luis Chamberlain
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: Alex Elder 
Signed-off-by: Luis Chamberlain 
---
 drivers/net/ipa/ipa_modem.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ipa/ipa_modem.c b/drivers/net/ipa/ipa_modem.c
index ed10818dd99f..1790b87446ed 100644
--- a/drivers/net/ipa/ipa_modem.c
+++ b/drivers/net/ipa/ipa_modem.c
@@ -285,6 +285,7 @@ static void ipa_modem_crashed(struct ipa *ipa)
struct device *dev = >pdev->dev;
int ret;
 
+   module_firmware_crashed();
ipa_endpoint_modem_pause_all(ipa, true);
 
ipa_endpoint_modem_hol_block_clear_all(ipa);
-- 
2.25.1



[PATCH 08/15] ehea: use new module_firmware_crashed()

2020-05-08 Thread Luis Chamberlain
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: Douglas Miller 
Signed-off-by: Luis Chamberlain 
---
 drivers/net/ethernet/ibm/ehea/ehea_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/ibm/ehea/ehea_main.c 
b/drivers/net/ethernet/ibm/ehea/ehea_main.c
index 0273fb7a9d01..6ae35067003f 100644
--- a/drivers/net/ethernet/ibm/ehea/ehea_main.c
+++ b/drivers/net/ethernet/ibm/ehea/ehea_main.c
@@ -3285,6 +3285,8 @@ static void ehea_crash_handler(void)
 {
int i;
 
+   module_firmware_crashed();
+
if (ehea_fw_handles.arr)
for (i = 0; i < ehea_fw_handles.num_entries; i++)
ehea_h_free_resource(ehea_fw_handles.arr[i].adh,
-- 
2.25.1



[PATCH 15/15] mwl8k: use new module_firmware_crashed()

2020-05-08 Thread Luis Chamberlain
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: linux-wirel...@vger.kernel.org
Cc: Lennert Buytenhek 
Cc: Kalle Valo 
Cc: "Gustavo A. R. Silva" 
Cc: Johannes Berg 
Cc: Ganapathi Bhat 
Signed-off-by: Luis Chamberlain 
---
 drivers/net/wireless/marvell/mwl8k.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/wireless/marvell/mwl8k.c 
b/drivers/net/wireless/marvell/mwl8k.c
index 97f23f93f6e7..d609ef1bb879 100644
--- a/drivers/net/wireless/marvell/mwl8k.c
+++ b/drivers/net/wireless/marvell/mwl8k.c
@@ -1551,6 +1551,7 @@ static int mwl8k_tx_wait_empty(struct ieee80211_hw *hw)
 * the firmware has crashed
 */
if (priv->hw_restart_in_progress) {
+   module_firmware_crashed();
if (priv->hw_restart_owner == current)
return 0;
else
-- 
2.25.1



[PATCH 04/15] bnxt: use new module_firmware_crashed()

2020-05-08 Thread Luis Chamberlain
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: Michael Chan 
Signed-off-by: Luis Chamberlain 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
index dd0c3f227009..5ba1bd0734e9 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c
@@ -3503,6 +3503,7 @@ static int bnxt_get_dump_data(struct net_device *dev, 
struct ethtool_dump *dump,
 
dump->flag = bp->dump_flag;
if (dump->flag == BNXT_DUMP_CRASH) {
+   module_firmware_crashed();
 #ifdef CONFIG_TEE_BNXT_FW
return tee_bnxt_copy_coredump(buf, 0, dump->len);
 #endif
-- 
2.25.1



[PATCH 06/15] liquidio: use new module_firmware_crashed()

2020-05-08 Thread Luis Chamberlain
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: Derek Chickles 
Cc: Satanand Burla 
Cc: Felix Manlunas 
Signed-off-by: Luis Chamberlain 
---
 drivers/net/ethernet/cavium/liquidio/lio_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c 
b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 66d31c018c7e..f18085262982 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -801,6 +801,7 @@ static int liquidio_watchdog(void *param)
continue;
 
WRITE_ONCE(oct->cores_crashed, true);
+   module_firmware_crashed();
other_oct = get_other_octeon_device(oct);
if (other_oct)
WRITE_ONCE(other_oct->cores_crashed, true);
-- 
2.25.1



[PATCH 03/15] bnx2x: use new module_firmware_crashed()

2020-05-08 Thread Luis Chamberlain
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: Ariel Elior 
Cc: Sudarsana Kalluru 
CC: gr-everest-linux...@marvell.com
Signed-off-by: Luis Chamberlain 
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
index db5107e7937c..c38b8c9c8af0 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c
@@ -909,6 +909,7 @@ void bnx2x_panic_dump(struct bnx2x *bp, bool disable_int)
bp->eth_stats.unrecoverable_error++;
DP(BNX2X_MSG_STATS, "stats_state - DISABLED\n");
 
+   module_firmware_crashed();
BNX2X_ERR("begin crash dump -\n");
 
/* Indices */
-- 
2.25.1



[PATCH 13/15] ath6kl: use new module_firmware_crashed()

2020-05-08 Thread Luis Chamberlain
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: linux-wirel...@vger.kernel.org
Cc: ath...@lists.infradead.org
Cc: Kalle Valo 
Signed-off-by: Luis Chamberlain 
---
 drivers/net/wireless/ath/ath6kl/hif.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/wireless/ath/ath6kl/hif.c 
b/drivers/net/wireless/ath/ath6kl/hif.c
index d1942537ea10..cfd838607544 100644
--- a/drivers/net/wireless/ath/ath6kl/hif.c
+++ b/drivers/net/wireless/ath/ath6kl/hif.c
@@ -120,6 +120,7 @@ static int ath6kl_hif_proc_dbg_intr(struct ath6kl_device 
*dev)
int ret;
 
ath6kl_warn("firmware crashed\n");
+   module_firmware_crashed();
 
/*
 * read counter to clear the interrupt, the debug error interrupt is
-- 
2.25.1



[PATCH 05/15] bna: use new module_firmware_crashed()

2020-05-08 Thread Luis Chamberlain
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: Rasesh Mody 
Cc: Sudarsana Kalluru 
Cc: gr-linux-nic-...@marvell.com
Signed-off-by: Luis Chamberlain 
---
 drivers/net/ethernet/brocade/bna/bfa_ioc.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/brocade/bna/bfa_ioc.c 
b/drivers/net/ethernet/brocade/bna/bfa_ioc.c
index e17bfc87da90..b3f44a912574 100644
--- a/drivers/net/ethernet/brocade/bna/bfa_ioc.c
+++ b/drivers/net/ethernet/brocade/bna/bfa_ioc.c
@@ -927,6 +927,7 @@ bfa_iocpf_sm_disabled(struct bfa_iocpf *iocpf, enum 
iocpf_event event)
 static void
 bfa_iocpf_sm_initfail_sync_entry(struct bfa_iocpf *iocpf)
 {
+   module_firmware_crashed();
bfa_nw_ioc_debug_save_ftrc(iocpf->ioc);
bfa_ioc_hw_sem_get(iocpf->ioc);
 }
-- 
2.25.1



[PATCH 09/15] qed: use new module_firmware_crashed()

2020-05-08 Thread Luis Chamberlain
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: Ariel Elior 
Cc: gr-everest-linux...@marvell.com
Signed-off-by: Luis Chamberlain 
---
 drivers/net/ethernet/qlogic/qed/qed_debug.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/qlogic/qed/qed_debug.c 
b/drivers/net/ethernet/qlogic/qed/qed_debug.c
index f4eebaabb6d0..9cc6287b889b 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_debug.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_debug.c
@@ -7854,6 +7854,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void *buffer)
 REGDUMP_HEADER_SIZE,
 _size);
if (!rc) {
+   module_firmware_crashed();
*(u32 *)((u8 *)buffer + offset) =
qed_calc_regdump_header(cdev, PROTECTION_OVERRIDE,
cur_engine,
@@ -7869,6 +7870,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void *buffer)
rc = qed_dbg_fw_asserts(cdev, (u8 *)buffer + offset +
REGDUMP_HEADER_SIZE, _size);
if (!rc) {
+   module_firmware_crashed();
*(u32 *)((u8 *)buffer + offset) =
qed_calc_regdump_header(cdev, FW_ASSERTS,
cur_engine, feature_size,
@@ -7906,6 +7908,7 @@ int qed_dbg_all_data(struct qed_dev *cdev, void *buffer)
rc = qed_dbg_grc(cdev, (u8 *)buffer + offset +
 REGDUMP_HEADER_SIZE, _size);
if (!rc) {
+   module_firmware_crashed();
*(u32 *)((u8 *)buffer + offset) =
qed_calc_regdump_header(cdev, GRC_DUMP,
cur_engine,
-- 
2.25.1



[PATCH 00/15] net: taint when the device driver firmware crashes

2020-05-08 Thread Luis Chamberlain
Device driver firmware can crash, and sometimes, this can leave your
system in a state which makes the device or subsystem completely
useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
of scraping some magical words from the kernel log, which is driver
specific, is much easier. So instead this series provides a helper which
lets drivers annotate this and shows how to use this on networking
drivers.

My methodology for finding where firmware crashes is to git grep for
"crash" and then study the code to see if it is indeed a place where
the firmware crashes. In some places this is quite obvious.

I'm starting off with networking first; if this gets merged, I can
focus on the other drivers later on. I already have some work done on
other subsystems.

Review, flames, etc are greatly appreciated.

This work, only on networking drivers, can be found on my git tree as well:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20200509-taint-firmware-net

Luis Chamberlain (15):
  taint: add module firmware crash taint support
  ethernet/839: use new module_firmware_crashed()
  bnx2x: use new module_firmware_crashed()
  bnxt: use new module_firmware_crashed()
  bna: use new module_firmware_crashed()
  liquidio: use new module_firmware_crashed()
  cxgb4: use new module_firmware_crashed()
  ehea: use new module_firmware_crashed()
  qed: use new module_firmware_crashed()
  soc: qcom: ipa: use new module_firmware_crashed()
  wimax/i2400m: use new module_firmware_crashed()
  ath10k: use new module_firmware_crashed()
  ath6kl: use new module_firmware_crashed()
  brcm80211: use new module_firmware_crashed()
  mwl8k: use new module_firmware_crashed()

 drivers/net/ethernet/8390/axnet_cs.c|  4 +++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c|  1 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c   |  1 +
 drivers/net/ethernet/brocade/bna/bfa_ioc.c  |  1 +
 drivers/net/ethernet/cavium/liquidio/lio_main.c |  1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  1 +
 drivers/net/ethernet/ibm/ehea/ehea_main.c   |  2 ++
 drivers/net/ethernet/qlogic/qed/qed_debug.c |  3 +++
 drivers/net/ipa/ipa_modem.c |  1 +
 drivers/net/wimax/i2400m/rx.c   |  1 +
 drivers/net/wireless/ath/ath10k/pci.c   |  2 ++
 drivers/net/wireless/ath/ath10k/sdio.c  |  2 ++
 drivers/net/wireless/ath/ath10k/snoc.c  |  1 +
 drivers/net/wireless/ath/ath6kl/hif.c   |  1 +
 .../net/wireless/broadcom/brcm80211/brcmfmac/core.c |  1 +
 drivers/net/wireless/marvell/mwl8k.c|  1 +
 include/linux/kernel.h  |  3 ++-
 include/linux/module.h  | 13 +
 include/trace/events/module.h   |  3 ++-
 kernel/module.c |  5 +++--
 kernel/panic.c  |  1 +
 21 files changed, 44 insertions(+), 5 deletions(-)

-- 
2.25.1



[PATCH 14/15] brcm80211: use new module_firmware_crashed()

2020-05-08 Thread Luis Chamberlain
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: linux-wirel...@vger.kernel.org
Cc: brcm80211-dev-list@broadcom.com
Cc: brcm80211-dev-l...@cypress.com
Cc: Arend van Spriel 
Cc: Franky Lin 
Cc: Hante Meuleman 
Cc: Chi-Hsien Lin 
Cc: Wright Feng 
Cc: Kalle Valo 
Cc: "Rafał Miłecki" 
Cc: Pieter-Paul Giesberts 
Signed-off-by: Luis Chamberlain 
---
 drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c 
b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
index c88655acc78c..d623f83568b3 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/core.c
@@ -1393,6 +1393,7 @@ void brcmf_fw_crashed(struct device *dev)
struct brcmf_pub *drvr = bus_if->drvr;
 
bphy_err(drvr, "Firmware has halted or crashed\n");
+   module_firmware_crashed();
 
brcmf_dev_coredump(dev);
 
-- 
2.25.1



[PATCH 02/15] ethernet/839: use new module_firmware_crashed()

2020-05-08 Thread Luis Chamberlain
This makes use of the new module_firmware_crashed() to help
annotate when firmware for device drivers crashes. When firmware
crashes, devices can sometimes become unresponsive, and recovery
sometimes requires a driver unload / reload and in the worst cases
a reboot.

Using a taint flag allows us to annotate when this happens clearly.

Cc: "Michael S. Tsirkin" 
Cc: Shannon Nelson 
Cc: Jakub Kicinski 
Cc: Heiner Kallweit 
Signed-off-by: Luis Chamberlain 
---
 drivers/net/ethernet/8390/axnet_cs.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/8390/axnet_cs.c 
b/drivers/net/ethernet/8390/axnet_cs.c
index aeae7966a082..8ad0200db8e9 100644
--- a/drivers/net/ethernet/8390/axnet_cs.c
+++ b/drivers/net/ethernet/8390/axnet_cs.c
@@ -1358,9 +1358,11 @@ static void ei_receive(struct net_device *dev)
 */
if ((netif_msg_rx_err(ei_local)) &&
this_frame != ei_local->current_page &&
-   (this_frame != 0x0 || rxing_page != 0xFF))
+   (this_frame != 0x0 || rxing_page != 0xFF)) {
+   module_firmware_crashed();
netdev_err(dev, "mismatched read page pointers %2x vs 
%2x\n",
   this_frame, ei_local->current_page);
+   }

if (this_frame == rxing_page)   /* Read all the frames? */
break;  /* Done for now */
-- 
2.25.1



Re: [PATCH -next v2] locking/osq_lock: annotate a data race in osq_lock

2020-05-08 Thread Paul E. McKenney
On Fri, May 08, 2020 at 04:59:05PM -0400, Qian Cai wrote:
> 
> 
> > On Feb 11, 2020, at 8:54 AM, Qian Cai  wrote:
> > 
> > prev->next could be accessed concurrently as noticed by KCSAN,
> > 
> > write (marked) to 0x9d3370dbbe40 of 8 bytes by task 3294 on cpu 107:
> >  osq_lock+0x25f/0x350
> >  osq_wait_next at kernel/locking/osq_lock.c:79
> >  (inlined by) osq_lock at kernel/locking/osq_lock.c:185
> >  rwsem_optimistic_spin
> >  
> > 
> > read to 0x9d3370dbbe40 of 8 bytes by task 3398 on cpu 100:
> >  osq_lock+0x196/0x350
> >  osq_lock at kernel/locking/osq_lock.c:157
> >  rwsem_optimistic_spin
> >  
> > 
> > Since the write only stores NULL to prev->next and the read tests if
> > prev->next equals to this_cpu_ptr(_node). Even if the value is
> > shattered, the code is still working correctly. Thus, mark it as an
> > intentional data race using the data_race() macro.
> > 
> > Signed-off-by: Qian Cai 
> 
> Hmm, this patch has been dropped from linux-next for some reason.
> 
> Paul, can you pick this up along with KCSAN fixes?
> 
> https://lore.kernel.org/lkml/1581429255-12542-1-git-send-email-...@lca.pw/

I have queued it on -rcu, but it is too late for v5.8 via the -rcu tree,
so this is v5.9 at the earliest.  Plus I would need an ack from one of
the locking folks.

Thanx, Paul

> > ---
> > 
> > v2: insert some code comments.
> > 
> > kernel/locking/osq_lock.c | 6 +-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> > 
> > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > index 1f7734949ac8..f733bcd99e8a 100644
> > --- a/kernel/locking/osq_lock.c
> > +++ b/kernel/locking/osq_lock.c
> > @@ -154,7 +154,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> >  */
> > 
> > for (;;) {
> > -   if (prev->next == node &&
> > +   /*
> > +* cpu_relax() below implies a compiler barrier which would
> > +* prevent this comparison being optimized away.
> > +*/
> > +   if (data_race(prev->next == node) &&
> > cmpxchg(>next, node, NULL) == node)
> > break;
> > 
> > -- 
> > 1.8.3.1
> > 
> 
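
For readers unfamiliar with the annotation, a stand-alone sketch of what
data_race() does (illustrative only; struct foo and its flag member are
made up):

	/* data_race() tells KCSAN that this plain, possibly-concurrent read
	 * is intentional, so it is not reported as a data race. It adds no
	 * ordering or atomicity.
	 */
	static bool peek_flag(struct foo *f)
	{
		return data_race(f->flag);
	}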


[PATCH net v2] hinic: fix a bug of ndo_stop

2020-05-08 Thread Luo bin
If some function in the ndo_stop path returns failure because of a
hardware fault, we must go on executing the remaining steps rather
than return failure directly, otherwise memory will be leaked. Also
bump the timeout for SET_FUNC_STATE to ensure that the cmd won't
return failure when the hw is busy. Otherwise the hw may stomp on
host memory if we free the memory regardless of the return value of
SET_FUNC_STATE.

Signed-off-by: Luo bin 
---
 .../net/ethernet/huawei/hinic/hinic_hw_mgmt.c | 28 ++-
 .../net/ethernet/huawei/hinic/hinic_main.c| 16 ++-
 2 files changed, 23 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/huawei/hinic/hinic_hw_mgmt.c 
b/drivers/net/ethernet/huawei/hinic/hinic_hw_mgmt.c
index eef855f11a01..e66659bab012 100644
--- a/drivers/net/ethernet/huawei/hinic/hinic_hw_mgmt.c
+++ b/drivers/net/ethernet/huawei/hinic/hinic_hw_mgmt.c
@@ -45,6 +45,10 @@
 
 #define MGMT_MSG_TIMEOUT		5000
 
+#define SET_FUNC_PORT_MBOX_TIMEOUT 3
+
+#define SET_FUNC_PORT_MGMT_TIMEOUT 25000
+
 #define mgmt_to_pfhwdev(pf_mgmt)\
container_of(pf_mgmt, struct hinic_pfhwdev, pf_to_mgmt)
 
@@ -238,12 +242,13 @@ static int msg_to_mgmt_sync(struct hinic_pf_to_mgmt 
*pf_to_mgmt,
u8 *buf_in, u16 in_size,
u8 *buf_out, u16 *out_size,
enum mgmt_direction_type direction,
-   u16 resp_msg_id)
+   u16 resp_msg_id, u32 timeout)
 {
struct hinic_hwif *hwif = pf_to_mgmt->hwif;
struct pci_dev *pdev = hwif->pdev;
struct hinic_recv_msg *recv_msg;
struct completion *recv_done;
+   unsigned long timeo;
u16 msg_id;
int err;
 
@@ -267,8 +272,9 @@ static int msg_to_mgmt_sync(struct hinic_pf_to_mgmt 
*pf_to_mgmt,
goto unlock_sync_msg;
}
 
-   if (!wait_for_completion_timeout(recv_done,
-msecs_to_jiffies(MGMT_MSG_TIMEOUT))) {
+   timeo = msecs_to_jiffies(timeout ? timeout : MGMT_MSG_TIMEOUT);
+
+   if (!wait_for_completion_timeout(recv_done, timeo)) {
dev_err(>dev, "MGMT timeout, MSG id = %d\n", msg_id);
err = -ETIMEDOUT;
goto unlock_sync_msg;
@@ -342,6 +348,7 @@ int hinic_msg_to_mgmt(struct hinic_pf_to_mgmt *pf_to_mgmt,
 {
struct hinic_hwif *hwif = pf_to_mgmt->hwif;
struct pci_dev *pdev = hwif->pdev;
+   u32 timeout = 0;
 
if (sync != HINIC_MGMT_MSG_SYNC) {
dev_err(>dev, "Invalid MGMT msg type\n");
@@ -353,13 +360,20 @@ int hinic_msg_to_mgmt(struct hinic_pf_to_mgmt *pf_to_mgmt,
return -EINVAL;
}
 
-   if (HINIC_IS_VF(hwif))
+   if (HINIC_IS_VF(hwif)) {
+   if (cmd == HINIC_PORT_CMD_SET_FUNC_STATE)
+   timeout = SET_FUNC_PORT_MBOX_TIMEOUT;
+
return hinic_mbox_to_pf(pf_to_mgmt->hwdev, mod, cmd, buf_in,
-   in_size, buf_out, out_size, 0);
-   else
+   in_size, buf_out, out_size, timeout);
+   } else {
+   if (cmd == HINIC_PORT_CMD_SET_FUNC_STATE)
+   timeout = SET_FUNC_PORT_MGMT_TIMEOUT;
+
return msg_to_mgmt_sync(pf_to_mgmt, mod, cmd, buf_in, in_size,
buf_out, out_size, MGMT_DIRECT_SEND,
-   MSG_NOT_RESP);
+   MSG_NOT_RESP, timeout);
+   }
 }
 
 /**
diff --git a/drivers/net/ethernet/huawei/hinic/hinic_main.c 
b/drivers/net/ethernet/huawei/hinic/hinic_main.c
index 3d6569d7bac8..22ee868dd11e 100644
--- a/drivers/net/ethernet/huawei/hinic/hinic_main.c
+++ b/drivers/net/ethernet/huawei/hinic/hinic_main.c
@@ -487,7 +487,6 @@ static int hinic_close(struct net_device *netdev)
 {
struct hinic_dev *nic_dev = netdev_priv(netdev);
unsigned int flags;
-   int err;
 
down(_dev->mgmt_lock);
 
@@ -504,20 +503,9 @@ static int hinic_close(struct net_device *netdev)
if (!HINIC_IS_VF(nic_dev->hwdev->hwif))
hinic_notify_all_vfs_link_changed(nic_dev->hwdev, 0);
 
-   err = hinic_port_set_func_state(nic_dev, HINIC_FUNC_PORT_DISABLE);
-   if (err) {
-   netif_err(nic_dev, drv, netdev,
- "Failed to set func port state\n");
-   nic_dev->flags |= (flags & HINIC_INTF_UP);
-   return err;
-   }
+   hinic_port_set_state(nic_dev, HINIC_PORT_DISABLE);
 
-   err = hinic_port_set_state(nic_dev, HINIC_PORT_DISABLE);
-   if (err) {
-   netif_err(nic_dev, drv, netdev, "Failed to set port state\n");
-   nic_dev->flags |= (flags & HINIC_INTF_UP);
-   return err;
-   }
+   hinic_port_set_func_state(nic_dev, HINIC_FUNC_PORT_DISABLE);
 
if (nic_dev->flags & HINIC_RSS_ENABLE) {

[PATCH net-next v2] hinic: add three net_device_ops of vf

2020-05-08 Thread Luo bin
Add the ndo_set_vf_rate/ndo_set_vf_spoofchk/ndo_set_vf_link_state
callbacks to configure the netdev of a virtual function.

Signed-off-by: Luo bin 
---
 .../net/ethernet/huawei/hinic/hinic_hw_cmdq.c |  29 ++
 .../net/ethernet/huawei/hinic/hinic_hw_dev.c  |  35 ++-
 .../net/ethernet/huawei/hinic/hinic_hw_dev.h  |  21 ++
 .../net/ethernet/huawei/hinic/hinic_hw_if.c   |  32 +-
 .../net/ethernet/huawei/hinic/hinic_hw_if.h   |   6 +-
 .../net/ethernet/huawei/hinic/hinic_hw_mbox.c |   4 +-
 .../net/ethernet/huawei/hinic/hinic_main.c|  17 +-
 .../net/ethernet/huawei/hinic/hinic_port.c|   8 +-
 .../net/ethernet/huawei/hinic/hinic_port.h|  43 +++
 .../net/ethernet/huawei/hinic/hinic_sriov.c   | 275 ++
 .../net/ethernet/huawei/hinic/hinic_sriov.h   |   7 +
 11 files changed, 453 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.c 
b/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.c
index 33c5333657c1..cb5b6e5f787f 100644
--- a/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.c
+++ b/drivers/net/ethernet/huawei/hinic/hinic_hw_cmdq.c
@@ -849,6 +849,25 @@ static int init_cmdqs_ctxt(struct hinic_hwdev *hwdev,
return err;
 }
 
+static int hinic_set_cmdq_depth(struct hinic_hwdev *hwdev, u16 cmdq_depth)
+{
+   struct hinic_cmd_hw_ioctxt hw_ioctxt = { 0 };
+   struct hinic_pfhwdev *pfhwdev;
+
+   pfhwdev = container_of(hwdev, struct hinic_pfhwdev, hwdev);
+
+   hw_ioctxt.func_idx = HINIC_HWIF_FUNC_IDX(hwdev->hwif);
+   hw_ioctxt.ppf_idx = HINIC_HWIF_PPF_IDX(hwdev->hwif);
+
+   hw_ioctxt.set_cmdq_depth = HW_IOCTXT_SET_CMDQ_DEPTH_ENABLE;
+   hw_ioctxt.cmdq_depth = (u8)ilog2(cmdq_depth);
+
+   return hinic_msg_to_mgmt(>pf_to_mgmt, HINIC_MOD_COMM,
+HINIC_COMM_CMD_HWCTXT_SET,
+_ioctxt, sizeof(hw_ioctxt), NULL,
+NULL, HINIC_MGMT_MSG_SYNC);
+}
+
 /**
  * hinic_init_cmdqs - init all cmdqs
  * @cmdqs: cmdqs to init
@@ -899,8 +918,18 @@ int hinic_init_cmdqs(struct hinic_cmdqs *cmdqs, struct 
hinic_hwif *hwif,
 
hinic_ceq_register_cb(_to_io->ceqs, HINIC_CEQ_CMDQ, cmdqs,
  cmdq_ceq_handler);
+
+   err = hinic_set_cmdq_depth(hwdev, CMDQ_DEPTH);
+   if (err) {
+   dev_err(>pdev->dev, "Failed to set cmdq depth\n");
+   goto err_set_cmdq_depth;
+   }
+
return 0;
 
+err_set_cmdq_depth:
+   hinic_ceq_unregister_cb(_to_io->ceqs, HINIC_CEQ_CMDQ);
+
 err_cmdq_ctxt:
hinic_wqs_cmdq_free(>cmdq_pages, cmdqs->saved_wqs,
HINIC_MAX_CMDQ_TYPES);
diff --git a/drivers/net/ethernet/huawei/hinic/hinic_hw_dev.c 
b/drivers/net/ethernet/huawei/hinic/hinic_hw_dev.c
index e5cab58e4ddd..1ce8b8d572cf 100644
--- a/drivers/net/ethernet/huawei/hinic/hinic_hw_dev.c
+++ b/drivers/net/ethernet/huawei/hinic/hinic_hw_dev.c
@@ -44,10 +44,6 @@ enum io_status {
IO_RUNNING = 1,
 };
 
-enum hw_ioctxt_set_cmdq_depth {
-   HW_IOCTXT_SET_CMDQ_DEPTH_DEFAULT,
-};
-
 /**
  * get_capability - convert device capabilities to NIC capabilities
  * @hwdev: the HW device to set and convert device capabilities for
@@ -667,6 +663,32 @@ static void free_pfhwdev(struct hinic_pfhwdev *pfhwdev)
hinic_pf_to_mgmt_free(>pf_to_mgmt);
 }
 
+static int hinic_l2nic_reset(struct hinic_hwdev *hwdev)
+{
+   struct hinic_cmd_l2nic_reset l2nic_reset = {0};
+   u16 out_size = sizeof(l2nic_reset);
+   struct hinic_pfhwdev *pfhwdev;
+   int err;
+
+   pfhwdev = container_of(hwdev, struct hinic_pfhwdev, hwdev);
+
+   l2nic_reset.func_id = HINIC_HWIF_FUNC_IDX(hwdev->hwif);
+   /* 0 represents standard l2nic reset flow */
+   l2nic_reset.reset_flag = 0;
+
+   err = hinic_msg_to_mgmt(>pf_to_mgmt, HINIC_MOD_COMM,
+   HINIC_COMM_CMD_L2NIC_RESET, _reset,
+   sizeof(l2nic_reset), _reset,
+   _size, HINIC_MGMT_MSG_SYNC);
+   if (err || !out_size || l2nic_reset.status) {
+   dev_err(>hwif->pdev->dev, "Failed to reset L2NIC 
resources, err: %d, status: 0x%x, out_size: 0x%x\n",
+   err, l2nic_reset.status, out_size);
+   return -EIO;
+   }
+
+   return 0;
+}
+
 /**
  * hinic_init_hwdev - Initialize the NIC HW
  * @pdev: the NIC pci device
@@ -729,6 +751,10 @@ struct hinic_hwdev *hinic_init_hwdev(struct pci_dev *pdev)
goto err_init_pfhwdev;
}
 
+   err = hinic_l2nic_reset(hwdev);
+   if (err)
+   goto err_l2nic_reset;
+
err = get_dev_cap(hwdev);
if (err) {
dev_err(>dev, "Failed to get device capabilities\n");
@@ -759,6 +785,7 @@ struct hinic_hwdev *hinic_init_hwdev(struct pci_dev *pdev)
 err_init_fw_ctxt:
hinic_vf_func_free(hwdev);
 err_vf_func_init:
+err_l2nic_reset:
 err_dev_cap:
free_pfhwdev(pfhwdev);
 
diff --git 

RE: [PATCH] fpga: dfl: Replace zero-length array with flexible-array

2020-05-08 Thread Wu, Hao
I think Moritz had already applied it to his tree per last submission.
https://lkml.org/lkml/2020/3/21/373

Thanks
Hao

> -Original Message-
> From: matthew.gerl...@linux.intel.com 
> Sent: Saturday, May 9, 2020 2:21 AM
> To: Gustavo A. R. Silva 
> Cc: Wu, Hao ; linux-f...@vger.kernel.org; linux-
> ker...@vger.kernel.org
> Subject: Re: [PATCH] fpga: dfl: Replace zero-length array with flexible-array
> 
> 
> 
> This looks like a good change to me.
> 
> Tested-by: Matthew Gerlach 
> 
> On Thu, 7 May 2020, Gustavo A. R. Silva wrote:
> 
> > The current codebase makes use of the zero-length array language
> > extension to the C90 standard, but the preferred mechanism to declare
> > variable-length types such as these ones is a flexible array member[1][2],
> > introduced in C99:
> >
> > struct foo {
> >int stuff;
> >struct boo array[];
> > };
> >
> > By making use of the mechanism above, we will get a compiler warning
> > in case the flexible array does not occur last in the structure, which
> > will help us prevent some kind of undefined behavior bugs from being
> > inadvertently introduced[3] to the codebase from now on.
> >
> > Also, notice that, dynamic memory allocations won't be affected by
> > this change:
> >
> > "Flexible array members have incomplete type, and so the sizeof operator
> > may not be applied. As a quirk of the original implementation of
> > zero-length arrays, sizeof evaluates to zero."[1]
> >
> > sizeof(flexible-array-member) triggers a warning because flexible array
> > members have incomplete type[1]. There are some instances of code in
> > which the sizeof operator is being incorrectly/erroneously applied to
> > zero-length arrays and the result is zero. Such instances may be hiding
> > some bugs. So, this work (flexible-array member conversions) will also
> > help to get completely rid of those sorts of issues.
> >
> > This issue was found with the help of Coccinelle.
> >
> > [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
> > [2] https://github.com/KSPP/linux/issues/21
> > [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")
> >
> > Signed-off-by: Gustavo A. R. Silva 
> > ---
> > drivers/fpga/dfl.h |2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/fpga/dfl.h b/drivers/fpga/dfl.h
> > index 9f0e656de720..1cd51ea52ce1 100644
> > --- a/drivers/fpga/dfl.h
> > +++ b/drivers/fpga/dfl.h
> > @@ -235,7 +235,7 @@ struct dfl_feature_platform_data {
> > unsigned long dev_status;
> > void *private;
> > int num;
> > -   struct dfl_feature features[0];
> > +   struct dfl_feature features[];
> > };
> >
> > static inline
> >
> >


Re: [PATCH v2] net: Protect INET_ADDR_COOKIE on 32-bit architectures

2020-05-08 Thread Jakub Kicinski
On Fri,  8 May 2020 14:04:57 +0200 Stephen Kitt wrote:
> Commit c7228317441f ("net: Use a more standard macro for
> INET_ADDR_COOKIE") added a __deprecated marker to the cookie name on
> 32-bit architectures, with the intent that the compiler would flag
> uses of the name. However since commit 771c035372a0 ("deprecate the
> '__deprecated' attribute warnings entirely and for good"),
> __deprecated doesn't do anything and should be avoided.
> 
> This patch changes INET_ADDR_COOKIE to declare a dummy struct so that
> any subsequent use of the cookie's name will in all likelihood break
> the build. It also removes the __deprecated marker.
> 
> Signed-off-by: Stephen Kitt 
> ---
> Changes since v1:
>   - use a dummy struct rather than a typedef
> 
>  include/net/inet_hashtables.h | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
> index ad64ba6a057f..889d9b00c905 100644
> --- a/include/net/inet_hashtables.h
> +++ b/include/net/inet_hashtables.h
> @@ -301,8 +301,9 @@ static inline struct sock *inet_lookup_listener(struct 
> net *net,
> ((__sk)->sk_bound_dev_if == (__sdif)))&&  \
>net_eq(sock_net(__sk), (__net)))
>  #else /* 32-bit arch */
> +/* Break the build if anything tries to use the cookie's name. */

I think the macro is supposed to cause a warning when the variable
itself is accessed. And I don't think that happens with your patch
applied.

diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 2bbaaf0c7176..6c4a3904ed8b 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -360,6 +360,8 @@ struct sock *__inet_lookup_established(struct net *net,
unsigned int slot = hash & hashinfo->ehash_mask;
	struct inet_ehash_bucket *head = &hashinfo->ehash[slot];
 
+   kfree();
 begin:
	sk_nulls_for_each_rcu(sk, node, &head->chain) {
if (sk->sk_hash != hash)

$ make ARCH=i386
make[1]: Entering directory `/netdev/net-next/build_allmodconfig_warn_32bit'
  GEN Makefile
  CALL../scripts/atomic/check-atomics.sh
  CALL../scripts/checksyscalls.sh
  CHK include/generated/compile.h
  CC  net/ipv4/inet_hashtables.o
  CHK kernel/kheaders_data.tar.xz
  AR  net/ipv4/built-in.a
  AR  net/built-in.a
  GEN .version
  CHK include/generated/compile.h
  UPD include/generated/compile.h
  CC  init/version.o
  AR  init/built-in.a
  LD  vmlinux.o
  MODPOST vmlinux.o

Builds fine.

>  #define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
> - const int __name __deprecated __attribute__((unused))
> + struct {} __name __attribute__((unused))
>  
>  #define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, 
> __sdif) \
>   (((__sk)->sk_portpair == (__ports)) &&  \
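A stand-alone sketch of why the dummy struct keeps building cleanly (plain C, not kernel code; the kfree() prototype is only a stand-in): taking the cookie's address still compiles, so only a use of the name as a value, e.g. in INET_MATCH's comparison, would break the build.

#define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
	struct {} __name __attribute__((unused))

extern void kfree(const void *objp);

void demo(void)
{
	INET_ADDR_COOKIE(acookie, 0, 0);

	/* &acookie is a perfectly valid pointer to an empty struct, so
	 * this line compiles without any diagnostic. */
	kfree(&acookie);
}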



Re: [PATCH v2] kernel: add panic_on_taint

2020-05-08 Thread Luis Chamberlain
On Fri, May 08, 2020 at 08:47:19AM -0400, Rafael Aquini wrote:
> On Thu, May 07, 2020 at 10:25:58PM +, Luis Chamberlain wrote:
> > On Thu, May 07, 2020 at 06:06:06PM -0400, Rafael Aquini wrote:
> > > On Thu, May 07, 2020 at 08:33:40PM +, Luis Chamberlain wrote:
> > > > I *think* that a cmdline route to enable this would likely remove the
> > > > need for the kernel config for this. But even with Vlastimil's work
> > > > merged, I think we'd want yet-another value to enable / disable this
> > > > feature. Do we need yet-another-taint flag to tell us that this feature
> > > > was enabled?
> > > >
> > > 
> > > I guess it makes sense to get rid of the sysctl interface for
> > > panic_on_taint, and only keep it as a cmdline option.
> > 
> > That would be easier to support and keeps this simple.
> > 
> > > But the real issue seems to be, regardless we go with a cmdline-only 
> > > option
> > > or not, the ability of proc_taint() to set any arbitrary taint flag 
> > > other than just marking the kernel with TAINT_USER. 
> > 
> > I think we would have no other option but to add a new TAINT flag so
> > that we know that the taint flag was modified by a user. Perhaps just
> > re-using TAINT_USER when proc_taint() would suffice.
> >
> 
> We might not need an extra taint flag if, perhaps, we could make these
> two features mutually exclusive. The idea here is that bitmasks added 
> via panic_on_taint get filtered out in proc_taint(), so a malicious 
> user couldn't exploit the latter interface to easily panic the system,
> when the first one is also in use. 

I get it, however I can still see the person who enables
panic-on-taint wanting to know if proc_taint() was used. So even if
it was not in their mask, if it was modified that seems like important
information for a bug report analysis.
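A rough sketch of the filtering idea discussed above, for reference only; the mask variable and helper are hypothetical, not taken from the patch:

#include <linux/bitops.h>
#include <linux/kernel.h>

/* bits armed via a panic_on_taint= command line option (hypothetical name) */
static unsigned long panic_on_taint_mask;

/* Apply user-requested taint bits, dropping any bit that would let a
 * write to /proc/sys/kernel/tainted trip an armed panic-on-taint. */
static void apply_user_taint(unsigned long requested)
{
	unsigned long allowed = requested & ~panic_on_taint_mask;
	unsigned long bit;

	for_each_set_bit(bit, &allowed, TAINT_FLAGS_COUNT)
		add_taint(bit, LOCKDEP_STILL_OK);
}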

  Luis


Re: [PATCH v4 03/12] pstore/blk: Introduce backend for block devices

2020-05-08 Thread WeiXiong Liao
hi Kees Cook,

On 2020/5/8 PM 2:39, Kees Cook wrote:
> From: WeiXiong Liao 
> 
> pstore/blk is similar to pstore/ram, but uses a block device as the
> storage rather than persistent ram.
> 
> The pstore/blk backend solves two common use-cases that used to preclude
> using pstore/ram:
> - not all devices have a battery that could be used to persist
>   regular RAM across power failures.
> - most embedded intelligent equipment have no persistent ram, which
>   increases costs, instead preferring cheaper solutions, like block
>   devices.
> 
> pstore/blk provides separate configurations for the end user and for the
> block drivers. User configuration determines how pstore/blk operates, such
> as record sizes, max kmsg dump reasons, etc. These can be set by Kconfig
> and/or module parameters, but module parameter have priority over Kconfig.
> Driver configuration covers all the details about the target block device,
> such as total size of the device and how to perform read/write operations.
> These are provided by block drivers, calling pstore_register_blkdev(),
> including an optional panic_write callback used to bypass regular IO
> APIs in an effort to avoid potentially destabilized kernel code during
> a panic.
> 
> Signed-off-by: WeiXiong Liao 
> Link: 
> https://lore.kernel.org/r/1585126506-18635-3-git-send-email-liaoweixi...@allwinnertech.com
> Co-developed-by: Kees Cook 
> Signed-off-by: Kees Cook 
> ---
>  fs/pstore/Kconfig  |  64 ++
>  fs/pstore/Makefile |   3 +
>  fs/pstore/blk.c| 426 +
>  include/linux/pstore_blk.h |  27 +++
>  4 files changed, 520 insertions(+)
>  create mode 100644 fs/pstore/blk.c
>  create mode 100644 include/linux/pstore_blk.h
> 
> diff --git a/fs/pstore/Kconfig b/fs/pstore/Kconfig
> index 98d2457bdd9f..92ba73bd0b62 100644
> --- a/fs/pstore/Kconfig
> +++ b/fs/pstore/Kconfig
> @@ -160,3 +160,67 @@ config PSTORE_ZONE
>   help
> The common layer for pstore/blk (and pstore/ram in the future)
> to manage storage in zones.
> +
> +config PSTORE_BLK
> + tristate "Log panic/oops to a block device"
> + depends on PSTORE
> + depends on BLOCK
> + select PSTORE_ZONE
> + default n
> + help
> +   This enables panic and oops message to be logged to a block dev
> +   where it can be read back at some later point.
> +
> +   If unsure, say N.
> +
> +config PSTORE_BLK_BLKDEV
> + string "block device identifier"
> + depends on PSTORE_BLK
> + default ""
> + help
> +   Which block device should be used for pstore/blk.
> +
> +   It accept the following variants:
> +   1)  device number in hexadecimal represents
> +  itself no leading 0x, for example b302.
> +   2) /dev/ represents the device number of disk
> +   3) /dev/ represents the device number
> +  of partition - device number of disk plus the partition number
> +   4) /dev/p - same as the above, this form is
> +  used when disk name of partitioned disk ends with a digit.
> +   5) PARTUUID=00112233-4455-6677-8899-AABBCCDDEEFF representing the
> +  unique id of a partition if the partition table provides it.
> +  The UUID may be either an EFI/GPT UUID, or refer to an MSDOS
> +  partition using the format -PP, where  is a zero-
> +  filled hex representation of the 32-bit "NT disk signature", and PP
> +  is a zero-filled hex representation of the 1-based partition 
> number.
> +   6) PARTUUID=/PARTNROFF= to select a partition in relation
> +  to a partition with a known unique id.
> +   7) : major and minor number of the device separated by
> +  a colon.
> +
> +   NOTE that, both Kconfig and module parameters can configure
> +   pstore/blk, but module parameters have priority over Kconfig.
> +
> +config PSTORE_BLK_KMSG_SIZE
> + int "Size in Kbytes of kmsg dump log to store"
> + depends on PSTORE_BLK
> + default 64
> + help
> +   This just sets size of kmsg dump (oops, panic, etc) log for
> +   pstore/blk. The size is in KB and must be a multiple of 4.
> +
> +   NOTE that, both Kconfig and module parameters can configure
> +   pstore/blk, but module parameters have priority over Kconfig.
> +
> +config PSTORE_BLK_MAX_REASON
> + int "Maximum kmsg dump reason to store"
> + depends on PSTORE_BLK
> + default 2
> + help
> +   The maximum reason for kmsg dumps to store. The default is
> +   2 (KMSG_DUMP_OOPS), see include/linux/kmsg_dump.h's
> +   enum kmsg_dump_reason for more details.
> +
> +   NOTE that, both Kconfig and module parameters can configure
> +   pstore/blk, but module parameters have priority over Kconfig.
> diff --git a/fs/pstore/Makefile b/fs/pstore/Makefile
> index 58a967cbe4af..c270467aeece 100644
> --- a/fs/pstore/Makefile
> +++ b/fs/pstore/Makefile
> @@ -15,3 +15,6 @@ 

Re: linux-next: new contact(s) for the ceph tree?

2020-05-08 Thread Stephen Rothwell
Hi Sage,

On Sat, 9 May 2020 01:03:14 + (UTC) Sage Weil  wrote:
>
> Jeff Layton 

Done.
> On Sat, 9 May 2020, Stephen Rothwell wrote:
> > 
> > I noticed commit
> > 
> >   3a5ccecd9af7 ("MAINTAINERS: remove myself as ceph co-maintainer")
> > 
> > appear recently.  So who should I now list as the contact(s) for the
> > ceph tree?

-- 
Cheers,
Stephen Rothwell




Re: linux-next 20200506 - build failure with net/bpfilter/bpfilter_umh

2020-05-08 Thread Masahiro Yamada
On Fri, May 8, 2020 at 2:22 PM Valdis Klētnieks  wrote:
>
> My kernel build came to a screeching halt with:
>
>   CHECK   net/bpfilter/bpfilter_kern.c
>   CC [M]  net/bpfilter/bpfilter_kern.o
>   CC [U]  net/bpfilter/main.o
>   LD [U]  net/bpfilter/bpfilter_umh
> /usr/bin/ld: cannot find -lc
> collect2: error: ld returned 1 exit status
> make[2]: *** [scripts/Makefile.userprogs:36: net/bpfilter/bpfilter_umh] Error 
> 1
> make[1]: *** [scripts/Makefile.build:494: net/bpfilter] Error 2
> make: *** [Makefile:1726: net] Error 2
>
> The culprit is this commit:

Thanks. I will try to fix it,
but my commit is innocent because
it only contains textual cleanups.
No functional change is intended.


This '-static' option has existed since
the day 1 of bpfilter umh support.
See commit d2ba09c17a0647f899d6c20a11bab9e6d3382f07



I built the mainline kernel in
Fedora running on docker.

I was able to reproduce the same issue.


[masahiro@ed7f2ae1915f linux]$ git log --oneline -1
0e698dfa2822 (HEAD, tag: v5.7-rc4) Linux 5.7-rc4
[masahiro@ed7f2ae1915f linux]$ make  defconfig
  HOSTCC  scripts/basic/fixdep
  HOSTCC  scripts/kconfig/conf.o
  HOSTCC  scripts/kconfig/confdata.o
  HOSTCC  scripts/kconfig/expr.o
  LEX scripts/kconfig/lexer.lex.c
  YACCscripts/kconfig/parser.tab.[ch]
  HOSTCC  scripts/kconfig/lexer.lex.o
  HOSTCC  scripts/kconfig/parser.tab.o
  HOSTCC  scripts/kconfig/preprocess.o
  HOSTCC  scripts/kconfig/symbol.o
  HOSTCC  scripts/kconfig/util.o
  HOSTLD  scripts/kconfig/conf
*** Default configuration is based on 'x86_64_defconfig'
#
# configuration written to .config
#
[masahiro@ed7f2ae1915f linux]$ scripts/config -e BPFILTER
[masahiro@ed7f2ae1915f linux]$ scripts/config -e BPFILTER_UMH
[masahiro@ed7f2ae1915f linux]$ make -j24 net/bpfilter/
scripts/kconfig/conf  --syncconfig Kconfig
  SYSTBL  arch/x86/include/generated/asm/syscalls_32.h
  SYSHDR  arch/x86/include/generated/asm/unistd_32_ia32.h
  SYSHDR  arch/x86/include/generated/asm/unistd_64_x32.h
  SYSTBL  arch/x86/include/generated/asm/syscalls_64.h
  SYSHDR  arch/x86/include/generated/uapi/asm/unistd_32.h
  SYSHDR  arch/x86/include/generated/uapi/asm/unistd_64.h
  SYSHDR  arch/x86/include/generated/uapi/asm/unistd_x32.h
  WRAParch/x86/include/generated/uapi/asm/bpf_perf_event.h
  WRAParch/x86/include/generated/uapi/asm/errno.h
  WRAParch/x86/include/generated/uapi/asm/ioctl.h
  WRAParch/x86/include/generated/uapi/asm/ioctls.h
  WRAParch/x86/include/generated/uapi/asm/fcntl.h
  WRAParch/x86/include/generated/uapi/asm/param.h
  WRAParch/x86/include/generated/uapi/asm/ipcbuf.h
  WRAParch/x86/include/generated/uapi/asm/resource.h
  WRAParch/x86/include/generated/uapi/asm/poll.h
  WRAParch/x86/include/generated/uapi/asm/socket.h
  WRAParch/x86/include/generated/uapi/asm/sockios.h
  WRAParch/x86/include/generated/uapi/asm/termbits.h
  WRAParch/x86/include/generated/uapi/asm/termios.h
  WRAParch/x86/include/generated/uapi/asm/types.h
  HOSTCC  arch/x86/tools/relocs_32.o
  HOSTCC  arch/x86/tools/relocs_64.o
  HOSTCC  arch/x86/tools/relocs_common.o
  WRAParch/x86/include/generated/asm/export.h
  WRAParch/x86/include/generated/asm/early_ioremap.h
  WRAParch/x86/include/generated/asm/dma-contiguous.h
  WRAParch/x86/include/generated/asm/mcs_spinlock.h
  WRAParch/x86/include/generated/asm/mm-arch-hooks.h
  WRAParch/x86/include/generated/asm/mmiowb.h
  HOSTCC  scripts/kallsyms
  HOSTCC  scripts/sorttable
  HOSTCC  scripts/asn1_compiler
  UPD include/config/kernel.release
  DESCEND  objtool
  HOSTCC  scripts/selinux/mdp/mdp
  HOSTCC  scripts/selinux/genheaders/genheaders
  UPD include/generated/utsrelease.h
scripts/kallsyms.c: In function ‘read_symbol’:
scripts/kallsyms.c:222:2: warning: ‘strcpy’ writing between 1 and 128
bytes into a region of size 0 [-Wstringop-overflow=]
  222 |  strcpy(sym_name(sym), name);
  |  ^~~
  HOSTCC   /home/masahiro/ref/linux/tools/objtool/fixdep.o
  HOSTLD  arch/x86/tools/relocs
  HOSTLD   /home/masahiro/ref/linux/tools/objtool/fixdep-in.o
  LINK /home/masahiro/ref/linux/tools/objtool/fixdep
  CC   /home/masahiro/ref/linux/tools/objtool/builtin-check.o
  CC   /home/masahiro/ref/linux/tools/objtool/builtin-orc.o
  CC   /home/masahiro/ref/linux/tools/objtool/check.o
  CC   /home/masahiro/ref/linux/tools/objtool/orc_gen.o
  CC   /home/masahiro/ref/linux/tools/objtool/orc_dump.o
  CC   /home/masahiro/ref/linux/tools/objtool/elf.o
  GEN  /home/masahiro/ref/linux/tools/objtool/arch/x86/lib/inat-tables.c
  CC   /home/masahiro/ref/linux/tools/objtool/special.o
  CC   /home/masahiro/ref/linux/tools/objtool/objtool.o
  CC   /home/masahiro/ref/linux/tools/objtool/libstring.o
  CC   /home/masahiro/ref/linux/tools/objtool/libctype.o
  CC   /home/masahiro/ref/linux/tools/objtool/str_error_r.o
  CC   /home/masahiro/ref/linux/tools/objtool/librbtree.o
  CC   

Re: [PATCH v2 1/4] Documentation: LKMM: Move MP+onceassign+derefonce to new litmus-tests/rcu/

2020-05-08 Thread Akira Yokosawa
Hi Joel,

Sorry for the late response but I've noticed some glitches.
 
On Sun, 22 Mar 2020 21:57:32 -0400, Joel Fernandes (Google) wrote:
> Move MP+onceassign+derefonce to the new Documentation/litmus-tests/rcu/
> directory.

MP+onceassign+derefonce.litmus is called out in
tools/memory-model/Documentation/recipes.txt as a representative example
of RCU related litmus test.

So I think it should be kept under tools/memory-model/litmus-tests.

Further RCU-related litmus tests can be added under Documentation/litmus-tests/.

IIUC, this change is not picked up by tip tree yet. So we have time to respin
the series targeting v5.9.

> 
> More RCU-related litmus tests would be added here.
> 
> Signed-off-by: Joel Fernandes (Google) 
> 
> ---
> Cc: vpil...@digitalocean.com
> 
>  Documentation/litmus-tests/README| 9 +

Please note that later patches to add atomic litmus tests under
Documentation/litmus-tests/ by Boqun put README as
Documentation/litmus-tests/atomic/README.

This patch's location of RCU's README as Documentation/litmus-tests/README
looks asymmetric to me.

I'm OK with either merging atomic's README with the top-level one or
moving RCU's README to under Documentation/litmus-tests/rcu.

Joel, Boqun, can you sort out the location of the README?

Thanks, Akira

>  .../litmus-tests/rcu}/MP+onceassign+derefonce.litmus | 0
>  tools/memory-model/litmus-tests/README   | 3 ---
>  3 files changed, 9 insertions(+), 3 deletions(-)
>  create mode 100644 Documentation/litmus-tests/README
>  rename {tools/memory-model/litmus-tests => 
> Documentation/litmus-tests/rcu}/MP+onceassign+derefonce.litmus (100%)
> 
> diff --git a/Documentation/litmus-tests/README 
> b/Documentation/litmus-tests/README
> new file mode 100644
> index 0..84208bc197f2e
> --- /dev/null
> +++ b/Documentation/litmus-tests/README
> @@ -0,0 +1,9 @@
> +
> +LITMUS TESTS
> +
> +
> +RCU (/rcu directory)
> +
> +MP+onceassign+derefonce.litmus
> +Demonstrates that rcu_assign_pointer() and rcu_dereference() to
> +ensure that an RCU reader will not see pre-initialization garbage.
> diff --git a/tools/memory-model/litmus-tests/MP+onceassign+derefonce.litmus 
> b/Documentation/litmus-tests/rcu/MP+onceassign+derefonce.litmus
> similarity index 100%
> rename from tools/memory-model/litmus-tests/MP+onceassign+derefonce.litmus
> rename to Documentation/litmus-tests/rcu/MP+onceassign+derefonce.litmus
> diff --git a/tools/memory-model/litmus-tests/README 
> b/tools/memory-model/litmus-tests/README
> index 681f9067fa9ed..79e1b1ed4929a 100644
> --- a/tools/memory-model/litmus-tests/README
> +++ b/tools/memory-model/litmus-tests/README
> @@ -63,9 +63,6 @@ LB+poonceonces.litmus
>   As above, but with store-release replaced with WRITE_ONCE()
>   and load-acquire replaced with READ_ONCE().
>  
> -MP+onceassign+derefonce.litmus
> - As below, but with rcu_assign_pointer() and an rcu_dereference().
> -
>  MP+polockmbonce+poacquiresilsil.litmus
>   Protect the access with a lock and an smp_mb__after_spinlock()
>   in one process, and use an acquire load followed by a pair of
> 
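For readers without the litmus test at hand, MP+onceassign+derefonce models the usual RCU publish/subscribe idiom, roughly as in this sketch (not the litmus source itself):

#include <linux/printk.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

struct blob {
	int data;
};

static struct blob __rcu *gp;

static void writer(void)
{
	struct blob *p = kzalloc(sizeof(*p), GFP_KERNEL);

	if (!p)
		return;
	p->data = 42;
	rcu_assign_pointer(gp, p);	/* orders the init before publication */
}

static void reader(void)
{
	struct blob *p;

	rcu_read_lock();
	p = rcu_dereference(gp);	/* paired with rcu_assign_pointer() */
	if (p)
		pr_info("data = %d\n", p->data);	/* never pre-init garbage */
	rcu_read_unlock();
}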


Re: [RFC v1 0/6] block: add error handling for *add_disk*()

2020-05-08 Thread Luis Chamberlain
On Tue, May 05, 2020 at 06:18:22PM -0700, Bart Van Assche wrote:
> On 2020-04-29 00:48, Luis Chamberlain wrote:
> > While working on some blktrace races I noticed that we don't do
> > error handling on *add_disk*() and friends. This is my initial
> > work on that.
> > 
> > This is based on linux-next tag next-20200428, you can also get this
> > on my branch 20200428-block-fixes [0].
> > 
> > Let me know what you think.
> Hi Luis,
> 
> Thank you for having done this work.

My pleasure, I just made one minor change to this series, but that's
all so far. Note that break-blktrace run_0004.sh still yields:

debugfs: Directory 'loop0' with parent 'block' already present!

And so I suspect something else is up; this happens even after this series. That's using
my latest:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20200508-block-fixes

Some more eyeballs on that would be useful.

> Since triggering error paths can be
> challenging, how about adding fault injection capabilities that make it
> possible to trigger all modified error paths and how about adding
> blktests that trigger these paths? That is the strategy that I followed
> myself recently to fix an error path in blk_mq_realloc_hw_ctxs().

Sure thing, but I get the impression that adding this may make it odd
or harder to review. Shouldn't this be done after we have *some*
error handling? Right now we shouldn't regress as we never fail, and
that seems worse.

Let me know, either way, I'll start work on it.
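For what it's worth, a rough sketch of the kind of fault injection Bart suggests, built on the generic lib/fault-inject helpers; the attribute name and where the hook would sit in the add_disk() paths are hypothetical:

#include <linux/fault-inject.h>
#include <linux/init.h>

static DECLARE_FAULT_ATTR(fail_add_disk);

static int __init setup_fail_add_disk(char *str)
{
	return setup_fault_attr(&fail_add_disk, str);
}
__setup("fail_add_disk=", setup_fail_add_disk);

/* called from the add_disk() error paths under test, so blktests can
 * force each new error branch deterministically */
static bool add_disk_should_fail(void)
{
	return should_fail(&fail_add_disk, 1);
}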

  Luis


Re: [RFC PATCH 00/13] Core scheduling v5

2020-05-08 Thread Ning, Hongyu


- Test environment:
Intel Xeon Server platform
CPU(s):  192
On-line CPU(s) list: 0-191
Thread(s) per core:  2
Core(s) per socket:  48
Socket(s):   2
NUMA node(s):4

- Kernel under test: 
Core scheduling v5 base
https://github.com/digitalocean/linux-coresched/tree/coresched/v5-v5.5.y

- Test set based on sysbench 1.1.0-bd4b418:
A: sysbench cpu in cgroup cpu 1 + sysbench mysql in cgroup mysql 1 (192 
workload tasks for each cgroup)
B: sysbench cpu in cgroup cpu 1 + sysbench cpu in cgroup cpu 2 + sysbench mysql 
in cgroup mysql 1 + sysbench mysql in cgroup mysql 2 (192 workload tasks for 
each cgroup)

- Test results briefing:
1 Good results:
1.1 For test set A, coresched could achieve same or better performance compared 
to smt_off, for both cpu workload and sysbench workload
1.2 For test set B, cpu workload, coresched could achieve better performance 
compared to smt_off

2 Bad results:
2.1 For test set B, mysql workload, coresched performance is lower than 
smt_off, potential fairness issue between cpu workloads and mysql workloads
2.2 For test set B, cpu workload, potential fairness issue between 2 cgroups 
cpu workloads

- Test results:
Note: test results in following tables are Tput normalized to default baseline

-- Test set A Tput normalized results:
cgroup        workload    default    coresched    smt_off
cg cpu 1      cpu         1          0.95         0.54
cg mysql 1    mysql       1          0.92         0.97
(192 tasks per cgroup)

-- Test set B Tput normalized results:
cgroup        workload    default    coresched    smt_off
cg cpu 1      cpu         1          0.9          0.47
cg cpu 2      cpu         1          1.32         0.66
cg mysql 1    mysql       1          0.42         0.89
cg mysql 2    mysql       1          0.42         0.89
(192 tasks per cgroup)


> On Date: Wed,  4 Mar 2020 16:59:50 +, vpillai  
> wrote:
> To: Nishanth Aravamudan , Julien Desfossez 
> , Peter Zijlstra , Tim 
> Chen , mi...@kernel.org, t...@linutronix.de, 
> p...@google.com, torva...@linux-foundation.org
> CC: vpillai , linux-kernel@vger.kernel.org, 
> fweis...@gmail.com, keesc...@chromium.org, kerr...@google.com, Phil Auld 
> , Aaron Lu , Aubrey Li 
> , aubrey...@linux.intel.com, Valentin Schneider 
> , Mel Gorman , Pawan 
> Gupta , Paolo Bonzini 
> , Joel Fernandes , 
> j...@joelfernandes.org
> 
> 

[PATCH v2] netprio_cgroup: Fix unlimited memory leak of v2 cgroups

2020-05-08 Thread Zefan Li
If systemd is configured to use hybrid mode which enables the use of
both cgroup v1 and v2, systemd will create new cgroup on both the default
root (v2) and netprio_cgroup hierarchy (v1) for a new session and attach
task to the two cgroups. If the task does some network thing then the v2
cgroup can never be freed after the session exited.

One of our machines ran into OOM due to this memory leak.

In the scenario described above when sk_alloc() is called cgroup_sk_alloc()
thought it's in v2 mode, so it stores the cgroup pointer in sk->sk_cgrp_data
and increments the cgroup refcnt, but then sock_update_netprioidx() thought
it's in v1 mode, so it stores netprioidx value in sk->sk_cgrp_data, so the
cgroup refcnt will never be freed.

Currently we do the mode switch when someone writes to the ifpriomap cgroup
control file. The easiest fix is to also do the switch when a task is attached
to a new cgroup.

Fixes: bd1060a1d671("sock, cgroup: add sock->sk_cgroup")
Reported-by: Yang Yingliang 
Tested-by: Yang Yingliang 
Signed-off-by: Zefan Li 
---

forgot to rebase to the latest kernel.

---
 net/core/netprio_cgroup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index 8881dd9..9bd4cab 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -236,6 +236,8 @@ static void net_prio_attach(struct cgroup_taskset *tset)
struct task_struct *p;
struct cgroup_subsys_state *css;
 
+   cgroup_sk_alloc_disable();
+
cgroup_taskset_for_each(p, css, tset) {
void *v = (void *)(unsigned long)css->id;
 
-- 
2.7.4


Re: [PATCH] usb: roles: Switch on role-switch uevent reporting

2020-05-08 Thread Wen Yang




在 2020/5/9 上午12:29, Bryan O'Donoghue 写道:

Right now we don't report to user-space a role switch when doing a
usb_role_switch_set_role() despite having registered the uevent callbacks.

This patch switches on the notifications allowing user-space to see
role-switch change notifications and subsequently determine the current
controller data-role.

example:
PFX=/devices/platform/soc/78d9000.usb/ci_hdrc.0

root@somebox# udevadm monitor -p

KERNEL[49.894994] change $PFX/usb_role/ci_hdrc.0-role-switch (usb_role)
ACTION=change
DEVPATH=$PFX/usb_role/ci_hdrc.0-role-switch
SUBSYSTEM=usb_role
DEVTYPE=usb_role_switch
USB_ROLE_SWITCH=ci_hdrc.0-role-switch
SEQNUM=2432

Cc: Heikki Krogerus 
Cc: Greg Kroah-Hartman 
Cc: Chunfeng Yun 
Cc: Suzuki K Poulose 
Cc: Alexandre Belloni 
Cc: Wen Yang 
Cc: chenqiwu 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Bryan O'Donoghue 
---
  drivers/usb/roles/class.c | 4 +++-
  1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/usb/roles/class.c b/drivers/usb/roles/class.c
index 5b17709821df..27d92af29635 100644
--- a/drivers/usb/roles/class.c
+++ b/drivers/usb/roles/class.c
@@ -49,8 +49,10 @@ int usb_role_switch_set_role(struct usb_role_switch *sw, 
enum usb_role role)
	mutex_lock(&sw->lock);
  
  	ret = sw->set(sw, role);

-   if (!ret)
+   if (!ret) {
sw->role = role;
+   kobject_uevent(&sw->dev.kobj, KOBJ_CHANGE);
+   }
  
	mutex_unlock(&sw->lock);
  



Hi, we may also need to deal with the return value of kobject_uevent(). 
Should we move it below the mutex_unlock(&sw->lock) line?
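Something along the lines of the following, presumably (a sketch of the suggested reordering against drivers/usb/roles/class.c, not the patch as posted):

int usb_role_switch_set_role(struct usb_role_switch *sw, enum usb_role role)
{
	int ret;

	if (IS_ERR_OR_NULL(sw))
		return 0;

	mutex_lock(&sw->lock);

	ret = sw->set(sw, role);
	if (!ret)
		sw->role = role;

	mutex_unlock(&sw->lock);

	/* notify user space outside the lock; note, but do not propagate,
	 * a failure to send the uevent */
	if (!ret && kobject_uevent(&sw->dev.kobj, KOBJ_CHANGE))
		dev_warn(&sw->dev, "failed to send role change uevent\n");

	return ret;
}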


Regards,
Wen



[PATCH 2/2] ASoC: max98390: Added Amplifier Driver

2020-05-08 Thread Steve Lee
Signed-off-by: Steve Lee 
---
 sound/soc/codecs/Kconfig|5 +
 sound/soc/codecs/Makefile   |2 +
 sound/soc/codecs/max98390.c | 1039 +++
 sound/soc/codecs/max98390.h |  661 ++
 4 files changed, 1707 insertions(+)
 create mode 100644 sound/soc/codecs/max98390.c
 create mode 100644 sound/soc/codecs/max98390.h

diff --git a/sound/soc/codecs/Kconfig b/sound/soc/codecs/Kconfig
index e60e0b6a689c..bb0c73422163 100644
--- a/sound/soc/codecs/Kconfig
+++ b/sound/soc/codecs/Kconfig
@@ -116,6 +116,7 @@ config SND_SOC_ALL_CODECS
imply SND_SOC_MAX98926
imply SND_SOC_MAX98927
imply SND_SOC_MAX98373
+   imply SND_SOC_MAX98390 if I2C
imply SND_SOC_MAX9850
imply SND_SOC_MAX9860
imply SND_SOC_MAX9759
@@ -867,6 +868,10 @@ config SND_SOC_MAX98373
tristate "Maxim Integrated MAX98373 Speaker Amplifier"
depends on I2C
 
+config SND_SOC_MAX98390
+   tristate "Maxim Integrated MAX98390 Speaker Amplifier"
+   depends on I2C
+
 config SND_SOC_MAX9850
tristate
depends on I2C
diff --git a/sound/soc/codecs/Makefile b/sound/soc/codecs/Makefile
index 03533157cda6..e7c17f701529 100644
--- a/sound/soc/codecs/Makefile
+++ b/sound/soc/codecs/Makefile
@@ -115,6 +115,7 @@ snd-soc-max98925-objs := max98925.o
 snd-soc-max98926-objs := max98926.o
 snd-soc-max98927-objs := max98927.o
 snd-soc-max98373-objs := max98373.o
+snd-soc-max98390-objs := max98390.o
 snd-soc-max9850-objs := max9850.o
 snd-soc-max9860-objs := max9860.o
 snd-soc-mc13783-objs := mc13783.o
@@ -415,6 +416,7 @@ obj-$(CONFIG_SND_SOC_MAX98925)  += snd-soc-max98925.o
 obj-$(CONFIG_SND_SOC_MAX98926) += snd-soc-max98926.o
 obj-$(CONFIG_SND_SOC_MAX98927) += snd-soc-max98927.o
 obj-$(CONFIG_SND_SOC_MAX98373) += snd-soc-max98373.o
+obj-$(CONFIG_SND_SOC_MAX98390) += snd-soc-max98390.o
 obj-$(CONFIG_SND_SOC_MAX9850)  += snd-soc-max9850.o
 obj-$(CONFIG_SND_SOC_MAX9860)  += snd-soc-max9860.o
 obj-$(CONFIG_SND_SOC_MC13783)  += snd-soc-mc13783.o
diff --git a/sound/soc/codecs/max98390.c b/sound/soc/codecs/max98390.c
new file mode 100644
index ..625219b7517c
--- /dev/null
+++ b/sound/soc/codecs/max98390.c
@@ -0,0 +1,1039 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2020, Maxim Integrated.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "max98390.h"
+
+static struct reg_default max98390_reg_defaults[] = {
+   {MAX98390_INT_EN1, 0xf0},
+   {MAX98390_INT_EN2, 0x00},
+   {MAX98390_INT_EN3, 0x00},
+   {MAX98390_INT_FLAG_CLR1, 0x00},
+   {MAX98390_INT_FLAG_CLR2, 0x00},
+   {MAX98390_INT_FLAG_CLR3, 0x00},
+   {MAX98390_IRQ_CTRL, 0x01},
+   {MAX98390_CLK_MON, 0x6d},
+   {MAX98390_DAT_MON, 0x03},
+   {MAX98390_WDOG_CTRL, 0x00},
+   {MAX98390_WDOG_RST, 0x00},
+   {MAX98390_MEAS_ADC_THERM_WARN_THRESH, 0x75},
+   {MAX98390_MEAS_ADC_THERM_SHDN_THRESH, 0x8c},
+   {MAX98390_MEAS_ADC_THERM_HYSTERESIS, 0x08},
+   {MAX98390_PIN_CFG, 0x55},
+   {MAX98390_PCM_RX_EN_A, 0x00},
+   {MAX98390_PCM_RX_EN_B, 0x00},
+   {MAX98390_PCM_TX_EN_A, 0x00},
+   {MAX98390_PCM_TX_EN_B, 0x00},
+   {MAX98390_PCM_TX_HIZ_CTRL_A, 0xff},
+   {MAX98390_PCM_TX_HIZ_CTRL_B, 0xff},
+   {MAX98390_PCM_CH_SRC_1, 0x00},
+   {MAX98390_PCM_CH_SRC_2, 0x00},
+   {MAX98390_PCM_CH_SRC_3, 0x00},
+   {MAX98390_PCM_MODE_CFG, 0xc0},
+   {MAX98390_PCM_MASTER_MODE, 0x1c},
+   {MAX98390_PCM_CLK_SETUP, 0x44},
+   {MAX98390_PCM_SR_SETUP, 0x08},
+   {MAX98390_ICC_RX_EN_A, 0x00},
+   {MAX98390_ICC_RX_EN_B, 0x00},
+   {MAX98390_ICC_TX_EN_A, 0x00},
+   {MAX98390_ICC_TX_EN_B, 0x00},
+   {MAX98390_ICC_HIZ_MANUAL_MODE, 0x00},
+   {MAX98390_ICC_TX_HIZ_EN_A, 0x00},
+   {MAX98390_ICC_TX_HIZ_EN_B, 0x00},
+   {MAX98390_ICC_LNK_EN, 0x00},
+   {MAX98390_R2039_AMP_DSP_CFG, 0x0f},
+   {MAX98390_R203A_AMP_EN, 0x81},
+   {MAX98390_TONE_GEN_DC_CFG, 0x00},
+   {MAX98390_SPK_SRC_SEL, 0x00},
+   {MAX98390_SSM_CFG, 0x85},
+   {MAX98390_MEAS_EN, 0x03},
+   {MAX98390_MEAS_DSP_CFG, 0x0f},
+   {MAX98390_BOOST_CTRL0, 0x1c},
+   {MAX98390_BOOST_CTRL3, 0x01},
+   {MAX98390_BOOST_CTRL1, 0x40},
+   {MAX98390_MEAS_ADC_CFG, 0x07},
+   {MAX98390_MEAS_ADC_BASE_MSB, 0x00},
+   {MAX98390_MEAS_ADC_BASE_LSB, 0x23},
+   {MAX98390_ADC_CH0_DIVIDE, 0x00},
+   {MAX98390_ADC_CH1_DIVIDE, 0x00},
+   {MAX98390_ADC_CH2_DIVIDE, 0x00},
+   {MAX98390_ADC_CH0_FILT_CFG, 0x00},
+   {MAX98390_ADC_CH1_FILT_CFG, 0x00},
+   {MAX98390_ADC_CH2_FILT_CFG, 0x00},
+   {MAX98390_PWR_GATE_CTL, 0x2c},
+   {MAX98390_BROWNOUT_EN, 0x00},
+   {MAX98390_BROWNOUT_INFINITE_HOLD, 0x00},
+   {MAX98390_BROWNOUT_INFINITE_HOLD_CLR, 0x00},
+   {MAX98390_BROWNOUT_LVL_HOLD, 0x00},
+   

[PATCH] netprio_cgroup: Fix unlimited memory leak of v2 cgroup

2020-05-08 Thread Zefan Li
If systemd is configured to use hybrid mode which enables the use of
both cgroup v1 and v2, systemd will create new cgroup on both the default
root (v2) and netprio_cgroup hierarchy (v1) for a new session and attach
task to the two cgroups. If the task does some network thing then the v2
cgroup can never be freed after the session exited.

One of our machines ran into OOM due to this memory leak.

In the scenario described above when sk_alloc() is called cgroup_sk_alloc()
thought it's in v2 mode, so it stores the cgroup pointer in sk->sk_cgrp_data
and increments the cgroup refcnt, but then sock_update_netprioidx() thought
it's in v1 mode, so it stores netprioidx value in sk->sk_cgrp_data, so the
cgroup refcnt will never be freed.

Currently we do the mode switch when someone writes to the ifpriomap cgroup
control file. The easiest fix is to also do the switch when a task is attached
to a new cgroup.

Fixes: bd1060a1d671("sock, cgroup: add sock->sk_cgroup")
Reported-by: Yang Yingliang 
Tested-by: Yang Yingliang 
Signed-off-by: Zefan Li 
---
 net/core/netprio_cgroup.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index b905747..2397866 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -240,6 +240,8 @@ static void net_prio_attach(struct cgroup_taskset *tset)
struct task_struct *p;
struct cgroup_subsys_state *css;
 
+   cgroup_sk_alloc_disable();
+
cgroup_taskset_for_each(p, css, tset) {
void *v = (void *)(unsigned long)css->cgroup->id;
 
-- 
2.7.4



[PATCH 1/2] dt-bindings: Added device tree binding for max98390

2020-05-08 Thread Steve Lee
Signed-off-by: Steve Lee 
---
 .../devicetree/bindings/sound/max98390.txt| 26 +++
 1 file changed, 26 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/sound/max98390.txt

diff --git a/Documentation/devicetree/bindings/sound/max98390.txt 
b/Documentation/devicetree/bindings/sound/max98390.txt
new file mode 100644
index ..147dfd88cd3b
--- /dev/null
+++ b/Documentation/devicetree/bindings/sound/max98390.txt
@@ -0,0 +1,26 @@
+Maxim Integrated MAX98390 Speaker Amplifier
+
+This device supports I2C.
+
+Required properties:
+
+ - compatible : "maxim,max98390"
+
+ - reg : the I2C address of the device.
+
+Optional properties:
+
+ - maxim,temperature_calib
+   u32. The calculated temperature data was measured while doing the 
calibration. Data calibration : Temp / 100 * 2^12
+
+ - maxim,r0_calib
+   u32. This is r0 calibration data which was measured in factory mode.
+
+Example:
+
+codec: max98390@38 {
+   compatible = "maxim,max98390";
+   reg = <0x38>;
+   maxim,temperature_calib = <1024>;
+   maxim,r0_calib = <0x224050>;
+};
-- 
2.17.1



Re: [PATCH 0/6] nouveau/hmm: add support for mapping large pages

2020-05-08 Thread Matthew Wilcox
On Fri, May 08, 2020 at 01:17:55PM -0700, Ralph Campbell wrote:
> On 5/8/20 12:59 PM, Matthew Wilcox wrote:
> > On Fri, May 08, 2020 at 12:20:03PM -0700, Ralph Campbell wrote:
> > > hmm_range_fault() returns an array of page frame numbers and flags for
> > > how the pages are mapped in the requested process' page tables. The PFN
> > > can be used to get the struct page with hmm_pfn_to_page() and the page 
> > > size
> > > order can be determined with compound_order(page) but if the page is 
> > > larger
> > > than order 0 (PAGE_SIZE), there is no indication that the page is mapped
> > > using a larger page size. To be fully general, hmm_range_fault() would 
> > > need
> > > to return the mapping size to handle cases like a 1GB compound page being
> > > mapped with 2MB PMD entries. However, the most common case is the mapping
> > > size the same as the underlying compound page size.
> > > This series adds a new output flag to indicate this so that callers know 
> > > it
> > > is safe to use a large device page table mapping if one is available.
> > > Nouveau and the HMM tests are updated to use the new flag.
> > 
> > This explanation doesn't make any sense.  It doesn't matter how somebody
> > else has it mapped; if it's a PMD-sized page, you can map it with a
> > 2MB mapping.
> 
> Sure, the I/O will work OK, but is it safe?
> Copy on write isn't an issue? Splitting a PMD in one process due to
> mprotect of a shared page will cause other processes' page tables to be split
> the same way?

Are you saying that if you call this function on an address range of a
process which has done COW of a single page in the middle of a THP,
you want to return with this flag clear, but if the THP is still intact,
you want to set this flag?

> Recall that these are system memory pages that could be THPs, shmem, 
> hugetlbfs,
> mmap shared file pages, etc.


Re: [PATCH] drm: vmwgfx: include linux/highmem.h

2020-05-08 Thread Stephen Rothwell
Hi Arnd,

On Sat,  9 May 2020 00:01:31 +0200 Arnd Bergmann  wrote:
>
> In order to call kmap_atomic() etc, we need to include linux/highmem.h:
> 
> drivers/gpu/drm/vmwgfx/vmwgfx_blit.c: In function 'vmw_bo_cpu_blit_line':
> drivers/gpu/drm/vmwgfx/vmwgfx_blit.c:377:4: error: implicit declaration of 
> function 'kunmap_atomic'; did you mean 'in_atomic'? 
> [-Werror=implicit-function-declaration]
>   377 |kunmap_atomic(d->src_addr);
>   |^
>   |in_atomic
> drivers/gpu/drm/vmwgfx/vmwgfx_blit.c:391:5: error: implicit declaration of 
> function 'kmap_atomic_prot' [-Werror=implicit-function-declaration]
>   391 | kmap_atomic_prot(d->dst_pages[dst_page],
>   | ^~~~
> drivers/gpu/drm/vmwgfx/vmwgfx_blit.c:390:16: warning: assignment to 'u8 *' 
> {aka 'unsigned char *'} from 'int' makes pointer from integer without a cast 
> [-Wint-conversion]
>   390 |d->dst_addr =
>   |^
> drivers/gpu/drm/vmwgfx/vmwgfx_blit.c:403:16: warning: assignment to 'u8 *' 
> {aka 'unsigned char *'} from 'int' makes pointer from integer without a cast 
> [-Wint-conversion]
>   403 |d->src_addr =
>   |^
> 
> Fixes: 46385a895322 ("drm: remove drm specific kmap_atomic code")
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/gpu/drm/vmwgfx/vmwgfx_blit.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c 
> b/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c
> index 94d456a1d1a9..1629427d5734 100644
> --- a/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c
> +++ b/drivers/gpu/drm/vmwgfx/vmwgfx_blit.c
> @@ -27,6 +27,7 @@
>   **/
>  
>  #include "vmwgfx_drv.h"
> +#include 
>  
>  /*
>   * Template that implements find_first_diff() for a generic
> -- 
> 2.26.0
> 

Added to linux-next for Monday (in case Andrew doesn't get around to it).

-- 
Cheers,
Stephen Rothwell




[GIT PULL] io_uring fixes for 5.7-rc5

2020-05-08 Thread Jens Axboe
Hi Linus,

A few fixes that should go into this series:

- Fix finish_wait() balancing in file cancelation (Xiaoguang)

- Ensure early cleanup of resources in ring map failure (Xiaoguang)

- Ensure IORING_OP_SPLICE does the right file mode checks (Pavel)

- Remove file opening from openat/openat2/statx, it's not needed and
  messes with O_PATH

Please pull!


  git://git.kernel.dk/linux-block.git tags/io_uring-5.7-2020-05-08



Jens Axboe (1):
  io_uring: don't use 'fd' for openat/openat2/statx

Pavel Begunkov (1):
  splice: move f_mode checks to do_{splice,tee}()

Xiaoguang Wang (2):
  io_uring: fix mismatched finish_wait() calls in io_uring_cancel_files()
  io_uring: handle -EFAULT properly in io_uring_setup()

 fs/io_uring.c | 65 ---
 fs/splice.c   | 45 +
 2 files changed, 40 insertions(+), 70 deletions(-)

-- 
Jens Axboe



[PATCH v4 1/5] block: revert back to synchronous request_queue removal

2020-05-08 Thread Luis Chamberlain
Commit dc9edc44de6c ("block: Fix a blk_exit_rl() regression") merged on
v4.12 moved the work behind blk_release_queue() into a workqueue after a
splat floated around which indicated some work on blk_release_queue()
could sleep in blk_exit_rl(). This splat would be possible when a driver
called blk_put_queue() or blk_cleanup_queue() (which calls blk_put_queue()
as its final call) from an atomic context.

blk_put_queue() decrements the refcount for the request_queue kobject,
and upon reaching 0 blk_release_queue() is called. Although blk_exit_rl()
is now removed through commit db6d9952356 ("block: remove request_list code")
on v5.0, we reserve the right to be able to sleep within blk_release_queue()
context.

The last reference to the request_queue must not be dropped from atomic
context. *When* the last reference to the request_queue reaches 0 varies,
and so let's take the opportunity to document when that is expected to
happen and also document the context of the related calls as best as possible
so we can avoid future issues, and with the hopes that the synchronous
request_queue removal sticks.

We revert back to synchronous request_queue removal because asynchronous
removal creates a regression with expected userspace interaction with
several drivers. An example is when removing the loopback driver, one
uses ioctls from userspace to do so, but upon return and if successful,
one expects the device to be removed. Likewise if one races to add another
device the new one may not be added as it is still being removed. This was
expected behavior before and it now fails as the device is still present
and busy still. Moving to asynchronous request_queue removal could have
broken many scripts which relied on the removal to have been completed if
there was no error. Document this expectation as well so that this
doesn't regress userspace again.

Using asynchronous request_queue removal however has helped us find
other bugs. In the future we can test what could break with this
arrangement by enabling CONFIG_DEBUG_KOBJECT_RELEASE.

Cc: Bart Van Assche 
Cc: Omar Sandoval 
Cc: Hannes Reinecke 
Cc: Nicolai Stange 
Cc: Greg Kroah-Hartman 
Cc: Michal Hocko 
Cc: yu kuai 
Suggested-by: Nicolai Stange 
Fixes: dc9edc44de6c ("block: Fix a blk_exit_rl() regression")
Reviewed-by: Christoph Hellwig 
Signed-off-by: Luis Chamberlain 
---
 block/blk-core.c   | 23 +
 block/blk-sysfs.c  | 43 +
 block/genhd.c  | 73 +-
 include/linux/blkdev.h |  2 --
 4 files changed, 117 insertions(+), 24 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index dccdae09b7b6..da120fd257fa 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -306,6 +306,16 @@ void blk_clear_pm_only(struct request_queue *q)
 }
 EXPORT_SYMBOL_GPL(blk_clear_pm_only);
 
+/**
+ * blk_put_queue - decrement the request_queue refcount
+ * @q: the request_queue structure to decrement the refcount for
+ *
+ * Decrements the refcount of the request_queue kobject. When this reaches 0
+ * we'll have blk_release_queue() called.
+ *
+ * Context: Any context, but the last reference must not be dropped from
+ *  atomic context.
+ */
 void blk_put_queue(struct request_queue *q)
 {
	kobject_put(&q->kobj);
@@ -337,9 +347,14 @@ EXPORT_SYMBOL_GPL(blk_set_queue_dying);
  *
  * Mark @q DYING, drain all pending requests, mark @q DEAD, destroy and
  * put it.  All future requests will be failed immediately with -ENODEV.
+ *
+ * Context: can sleep
  */
 void blk_cleanup_queue(struct request_queue *q)
 {
+   /* cannot be called from atomic context */
+   might_sleep();
+
WARN_ON_ONCE(blk_queue_registered(q));
 
/* mark @q DYING, no new request or merges will be allowed afterwards */
@@ -584,6 +599,14 @@ struct request_queue *blk_alloc_queue(make_request_fn 
make_request, int node_id)
 }
 EXPORT_SYMBOL(blk_alloc_queue);
 
+/**
+ * blk_get_queue - increment the request_queue refcount
+ * @q: the request_queue structure to increment the refcount for
+ *
+ * Increment the refcount of the request_queue kobject.
+ *
+ * Context: Any context.
+ */
 bool blk_get_queue(struct request_queue *q)
 {
if (likely(!blk_queue_dying(q))) {
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index fca9b158f4a0..5d0fc165a036 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -860,22 +860,32 @@ static void blk_exit_queue(struct request_queue *q)
bdi_put(q->backing_dev_info);
 }
 
-
 /**
- * __blk_release_queue - release a request queue
- * @work: pointer to the release_work member of the request queue to be 
released
+ * blk_release_queue - releases all allocated resources of the request_queue
+ * @kobj: pointer to a kobject, whose container is a request_queue
+ *
+ * This function releases all allocated resources of the request queue.
+ *
+ * The struct request_queue refcount is incremented with blk_get_queue() and
+ * decremented with blk_put_queue(). 

[PATCH v4 5/5] loop: be paranoid on exit and prevent new additions / removals

2020-05-08 Thread Luis Chamberlain
Be pedantic on removal as well and hold the mutex.
This should prevent additions from racing with module exit.

I cannot trigger an issue with this though, this is
just a fix done through code inspection.

Reviewed-by: Ming Lei 
Signed-off-by: Luis Chamberlain 
---
 drivers/block/loop.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 14372df0f354..54fbcbd930de 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -2333,6 +2333,8 @@ static void __exit loop_exit(void)
 
range = max_loop ? max_loop << part_shift : 1UL << MINORBITS;
 
+   mutex_lock(&loop_ctl_mutex);
+
	idr_for_each(&loop_index_idr, &loop_exit_cb, NULL);
	idr_destroy(&loop_index_idr);
 
@@ -2340,6 +2342,8 @@ static void __exit loop_exit(void)
unregister_blkdev(LOOP_MAJOR, "loop");
 
	misc_deregister(&loop_misc);
+
+   mutex_unlock(&loop_ctl_mutex);
 }
 
 module_init(loop_init);
-- 
2.25.1



[PATCH v4 4/5] blktrace: break out of blktrace setup on concurrent calls

2020-05-08 Thread Luis Chamberlain
We use one blktrace per request_queue, that means one per the entire
disk.  So we cannot run one blktrace on say /dev/vda and then /dev/vda1,
or just two calls on /dev/vda.

We check for concurrent setup only at the very end of the blktrace setup though.

If we try to run two concurrent blktraces on the same block device the
second one will fail, and the first one seems to go on. However, when
one tries to kill the first one, one will see things like this:

The kernel will show these:

```
debugfs: File 'dropped' in directory 'nvme1n1' already present!
debugfs: File 'msg' in directory 'nvme1n1' already present!
debugfs: File 'trace0' in directory 'nvme1n1' already present!
``

And userspace just sees this error message for the second call:

```
blktrace /dev/nvme1n1
BLKTRACESETUP(2) /dev/nvme1n1 failed: 5/Input/output error
```

The first userspace process #1 will also claim that the files
were taken away from underneath it. The files are taken
away from the first process because, when the second blktrace
fails, it follows up with a BLKTRACESTOP and BLKTRACETEARDOWN.
This means that even though process #1 is still waiting for blktrace
data, we *have* been asked to tear down the blktrace.

This can easily be reproduced with break-blktrace [0] run_0005.sh test.

Just break out early if we know we're already going to fail, this will
prevent trying to create the files all over again, which we know still
exist.

[0] https://github.com/mcgrof/break-blktrace
Signed-off-by: Luis Chamberlain 
---
 kernel/trace/blktrace.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/kernel/trace/blktrace.c b/kernel/trace/blktrace.c
index 6c10a1427de2..bd5ec2184d46 100644
--- a/kernel/trace/blktrace.c
+++ b/kernel/trace/blktrace.c
@@ -3,6 +3,9 @@
  * Copyright (C) 2006 Jens Axboe 
  *
  */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
 #include 
 #include 
 #include 
@@ -493,6 +496,12 @@ static int do_blk_trace_setup(struct request_queue *q, 
char *name, dev_t dev,
 */
strreplace(buts->name, '/', '_');
 
+   if (q->blk_trace) {
+   pr_warn("Concurrent blktraces are not allowed on %s\n",
+   buts->name);
+   return -EBUSY;
+   }
+
bt = kzalloc(sizeof(*bt), GFP_KERNEL);
if (!bt)
return -ENOMEM;
-- 
2.25.1



[PATCH v4 2/5] block: move main block debugfs initialization to its own file

2020-05-08 Thread Luis Chamberlain
make_request-based drivers and request-based drivers share some
debugfs code. By moving this into its own file it makes it easier
to expand and audit this shared code.

This patch contains no functional changes.

Cc: Bart Van Assche 
Cc: Omar Sandoval 
Cc: Hannes Reinecke 
Cc: Nicolai Stange 
Cc: Greg Kroah-Hartman 
Cc: Michal Hocko 
Cc: yu kuai 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Greg Kroah-Hartman 
Reviewed-by: Bart Van Assche 
Signed-off-by: Luis Chamberlain 
---
 block/Makefile  |  1 +
 block/blk-core.c|  9 +
 block/blk-debugfs.c | 15 +++
 block/blk.h |  7 +++
 4 files changed, 24 insertions(+), 8 deletions(-)
 create mode 100644 block/blk-debugfs.c

diff --git a/block/Makefile b/block/Makefile
index 206b96e9387f..1d3ab20505d8 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -10,6 +10,7 @@ obj-$(CONFIG_BLOCK) := bio.o elevator.o blk-core.o 
blk-sysfs.o \
blk-mq-sysfs.o blk-mq-cpumap.o blk-mq-sched.o ioctl.o \
genhd.o ioprio.o badblocks.o partitions/ blk-rq-qos.o
 
+obj-$(CONFIG_DEBUG_FS) += blk-debugfs.o
 obj-$(CONFIG_BOUNCE)   += bounce.o
 obj-$(CONFIG_BLK_SCSI_REQUEST) += scsi_ioctl.o
 obj-$(CONFIG_BLK_DEV_BSG)  += bsg.o
diff --git a/block/blk-core.c b/block/blk-core.c
index da120fd257fa..0a34b299275e 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -49,10 +49,6 @@
 #include "blk-pm.h"
 #include "blk-rq-qos.h"
 
-#ifdef CONFIG_DEBUG_FS
-struct dentry *blk_debugfs_root;
-#endif
-
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
 EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_complete);
@@ -1813,10 +1809,7 @@ int __init blk_dev_init(void)
 
blk_requestq_cachep = kmem_cache_create("request_queue",
sizeof(struct request_queue), 0, SLAB_PANIC, NULL);
-
-#ifdef CONFIG_DEBUG_FS
-   blk_debugfs_root = debugfs_create_dir("block", NULL);
-#endif
+   blk_debugfs_register();
 
return 0;
 }
diff --git a/block/blk-debugfs.c b/block/blk-debugfs.c
new file mode 100644
index ..19091e1effc0
--- /dev/null
+++ b/block/blk-debugfs.c
@@ -0,0 +1,15 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/*
+ * Shared request-based / make_request-based functionality
+ */
+#include 
+#include 
+#include 
+
+struct dentry *blk_debugfs_root;
+
+void blk_debugfs_register(void)
+{
+   blk_debugfs_root = debugfs_create_dir("block", NULL);
+}
diff --git a/block/blk.h b/block/blk.h
index 73bd3b1c6938..ec16e8a6049e 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -456,5 +456,12 @@ struct request_queue *__blk_alloc_queue(int node_id);
 int __bio_add_pc_page(struct request_queue *q, struct bio *bio,
struct page *page, unsigned int len, unsigned int offset,
bool *same_page);
+#ifdef CONFIG_DEBUG_FS
+void blk_debugfs_register(void);
+#else
+static inline void blk_debugfs_register(void)
+{
+}
+#endif /* CONFIG_DEBUG_FS */
 
 #endif /* BLK_INTERNAL_H */
-- 
2.25.1



[PATCH v4 3/5] blktrace: fix debugfs use after free

2020-05-08 Thread Luis Chamberlain
On commit 6ac93117ab00 ("blktrace: use existing disk debugfs directory")
merged on v4.12 Omar fixed the original blktrace code for request-based
drivers (multiqueue). This however left in place a possible crash, if you
happen to abuse blktrace while racing to remove / add a device.

We used to use asynchronous removal of the request_queue, and with that
the issue was easier to reproduce. Now that we have reverted to
synchronous removal of the request_queue, the issue is still possible to
reproduce; it is however just a bit more difficult.

We essentially run two instances of break-blktrace which add/remove
a loop device, and setup a blktrace and just never tear the blktrace
down. We do this twice in parallel. This is easily reproduced with the
break-blktrace run_0004.sh script.

We can end up with two types of panics each reflecting where we
race, one a failed blktrace setup:

[  252.426751] debugfs: Directory 'loop0' with parent 'block' already present!
[  252.432265] BUG: kernel NULL pointer dereference, address: 00a0
[  252.436592] #PF: supervisor write access in kernel mode
[  252.439822] #PF: error_code(0x0002) - not-present page
[  252.442967] PGD 0 P4D 0
[  252.444656] Oops: 0002 [#1] SMP NOPTI
[  252.446972] CPU: 10 PID: 1153 Comm: break-blktrace Tainted: GE   
  5.7.0-rc2-next-20200420+ #164
[  252.452673] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.13.0-1 04/01/2014
[  252.456343] RIP: 0010:down_write+0x15/0x40
[  252.458146] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc
   cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00
   00 00  48 0f b1 55 00 75 0f 48 8b 04 25 c0 8b 01 00 48 89
   45 08 5d
[  252.463638] RSP: 0018:a626415abcc8 EFLAGS: 00010246
[  252.464950] RAX:  RBX: 958c25f0f5c0 RCX: ff81
[  252.466727] RDX: 0001 RSI: ff81 RDI: 00a0
[  252.468482] RBP: 00a0 R08:  R09: 0001
[  252.470014] R10:  R11: 958d1f9227ff R12: 
[  252.471473] R13: 958c25ea5380 R14: 8cce15f1 R15: 00a0
[  252.473346] FS:  7f2e69dee540() GS:958c2fc8() 
knlGS:
[  252.475225] CS:  0010 DS:  ES:  CR0: 80050033
[  252.476267] CR2: 00a0 CR3: 000427d10004 CR4: 00360ee0
[  252.477526] DR0:  DR1:  DR2: 
[  252.478776] DR3:  DR6: fffe0ff0 DR7: 0400
[  252.479866] Call Trace:
[  252.480322]  simple_recursive_removal+0x4e/0x2e0
[  252.481078]  ? debugfs_remove+0x60/0x60
[  252.481725]  ? relay_destroy_buf+0x77/0xb0
[  252.482662]  debugfs_remove+0x40/0x60
[  252.483518]  blk_remove_buf_file_callback+0x5/0x10
[  252.484328]  relay_close_buf+0x2e/0x60
[  252.484930]  relay_open+0x1ce/0x2c0
[  252.485520]  do_blk_trace_setup+0x14f/0x2b0
[  252.486187]  __blk_trace_setup+0x54/0xb0
[  252.486803]  blk_trace_ioctl+0x90/0x140
[  252.487423]  ? do_sys_openat2+0x1ab/0x2d0
[  252.488053]  blkdev_ioctl+0x4d/0x260
[  252.488636]  block_ioctl+0x39/0x40
[  252.489139]  ksys_ioctl+0x87/0xc0
[  252.489675]  __x64_sys_ioctl+0x16/0x20
[  252.490380]  do_syscall_64+0x52/0x180
[  252.491032]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

And the other on the device removal:

[  128.528940] debugfs: Directory 'loop0' with parent 'block' already present!
[  128.615325] BUG: kernel NULL pointer dereference, address: 00a0
[  128.619537] #PF: supervisor write access in kernel mode
[  128.622700] #PF: error_code(0x0002) - not-present page
[  128.625842] PGD 0 P4D 0
[  128.627585] Oops: 0002 [#1] SMP NOPTI
[  128.629871] CPU: 12 PID: 544 Comm: break-blktrace Tainted: GE
 5.7.0-rc2-next-20200420+ #164
[  128.635595] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
1.13.0-1 04/01/2014
[  128.640471] RIP: 0010:down_write+0x15/0x40
[  128.643041] Code: eb ca e8 ae 22 8d ff cc cc cc cc cc cc cc cc cc cc cc cc
   cc cc 0f 1f 44 00 00 55 48 89 fd e8 52 db ff ff 31 c0 ba 01 00
   00 00  48 0f b1 55 00 75 0f 65 48 8b 04 25 c0 8b 01 00 48 89
   45 08 5d
[  128.650180] RSP: 0018:a9c3c05ebd78 EFLAGS: 00010246
[  128.651820] RAX:  RBX: 8ae9a6370240 RCX: ff81
[  128.653942] RDX: 0001 RSI: ff81 RDI: 00a0
[  128.655720] RBP: 00a0 R08: 0002 R09: 8ae9afd2d3d0
[  128.657400] R10: 0056 R11:  R12: 
[  128.659099] R13:  R14: 0003 R15: 00a0
[  128.660500] FS:  7febfd995540() GS:8ae9afd0() 
knlGS:
[  128.662204] CS:  0010 DS:  ES:  CR0: 80050033
[  128.663426] CR2: 00a0 CR3: 000420042003 CR4: 00360ee0
[  128.664776] DR0: 

[PATCH v4 0/5] block: fix blktrace debugfs use after free

2020-05-08 Thread Luis Chamberlain
Phew, well, since we didn't hear back about removing the scsi-generic
blktrace functionality, I put work into keeping it. That
took a lot of code inspection and also testing. Since libvirt
is limited in what devices you can test, I resorted to testing
all supported device types with iSCSI TCP and tgt.

I decided to simplify the partition work further by just using
a symbolic link. In the end that makes the blktrace code even
cleaner than anything we had before.

scsi-generic stuff still required quite a bit of work to figure
out what to do. Since scsi devices probe asynchronously and scsi-generic
is nothing but a class_interface whose sg_add_device() runs *prior*
to the scsi device probe, we currently address the symlink on the
sg ioctl. I however think this reveals a shortcoming of the class
interface, now that we have async probe and it's used widely. I
think we need a probe_complete() call or something like that.
If that seems reasonable I can work on it; that would allow us to
move the debugfs_dir symlink / settings from sg's ioctl to a new
proper call. I'd prefer to address that later though, as an evolution.

It's my first time touching scsi stuff, so I'd highly appreciate a good
review of what I propose for scsi-generic. It gets a bit more
complicated with some drivers using the bsg queue. FWIW, if bsg is
enabled we *reshare* the request_queue from the scsi_device, unless
you're a scsi transport, in which case it creates its own.

You can find this on my git tree:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20200508-block-fixes

Luis Chamberlain (5):
  block: revert back to synchronous request_queue removal
  block: move main block debugfs initialization to its own file
  blktrace: fix debugfs use after free
  blktrace: break out of blktrace setup on concurrent calls
  loop: be paranoid on exit and prevent new additions / removals

 block/Makefile   |   1 +
 block/blk-core.c |  32 --
 block/blk-debugfs.c  | 202 +++
 block/blk-mq-debugfs.c   |   5 -
 block/blk-sysfs.c|  46 
 block/blk.h  |  23 
 block/bsg.c  |   2 +
 block/genhd.c|  73 -
 block/partitions/core.c  |   9 ++
 drivers/block/loop.c |   4 +
 drivers/scsi/ch.c|   1 +
 drivers/scsi/sg.c|  75 +
 drivers/scsi/st.c|   2 +
 include/linux/blkdev.h   |   6 +-
 include/linux/blktrace_api.h |   1 -
 include/linux/genhd.h|  69 
 kernel/trace/blktrace.c  |  33 --
 17 files changed, 539 insertions(+), 45 deletions(-)
 create mode 100644 block/blk-debugfs.c

-- 
2.25.1



Re: [PATCH net-next v1] hinic: add three net_device_ops of vf

2020-05-08 Thread Jakub Kicinski
On Sat, 9 May 2020 10:56:55 +0800 luobin (L) wrote:
> >> -  if (!HINIC_IS_VF(nic_dev->hwdev->hwif))
> >> -  /* Wait up to 3 sec between port enable to link state */
> >> -  msleep(3000);  
> > Why is this no longer needed?
> > ---When the physical port links up, hw will notify this event to the hinic
> > driver. Driver code can
> 
> handle this event now, so there is no need to wait for link up in hinic_open().

Makes sense.


Re: [PATCH v4 02/12] pstore/zone: Introduce common layer to manage storage zones

2020-05-08 Thread WeiXiong Liao
hi Kees Cook,

On 2020/5/8 PM 2:39, Kees Cook wrote:
> From: WeiXiong Liao 
> 
> Implement a common set of APIs needed to support pstore storage zones,
> based on how ramoops is designed. This will be used by pstore/blk with
> the intention of migrating pstore/ram in the future.
> 
> Signed-off-by: WeiXiong Liao 
> Link: 
> https://lore.kernel.org/r/1585126506-18635-2-git-send-email-liaoweixi...@allwinnertech.com
> Co-developed-by: Kees Cook 
> Signed-off-by: Kees Cook 
> ---
>  fs/pstore/Kconfig   |   7 +
>  fs/pstore/Makefile  |   3 +
>  fs/pstore/zone.c| 973 
>  include/linux/pstore_zone.h |  44 ++
>  4 files changed, 1027 insertions(+)
>  create mode 100644 fs/pstore/zone.c
>  create mode 100644 include/linux/pstore_zone.h
> 
> diff --git a/fs/pstore/Kconfig b/fs/pstore/Kconfig
> index 8f0369aad22a..98d2457bdd9f 100644
> --- a/fs/pstore/Kconfig
> +++ b/fs/pstore/Kconfig
> @@ -153,3 +153,10 @@ config PSTORE_RAM
> "ramoops.ko".
>  
> For more information, see Documentation/admin-guide/ramoops.rst.
> +
> +config PSTORE_ZONE
> + tristate
> + depends on PSTORE
> + help
> +   The common layer for pstore/blk (and pstore/ram in the future)
> +   to manage storage in zones.
> diff --git a/fs/pstore/Makefile b/fs/pstore/Makefile
> index 967b5891f325..58a967cbe4af 100644
> --- a/fs/pstore/Makefile
> +++ b/fs/pstore/Makefile
> @@ -12,3 +12,6 @@ pstore-$(CONFIG_PSTORE_PMSG)+= pmsg.o
>  
>  ramoops-objs += ram.o ram_core.o
>  obj-$(CONFIG_PSTORE_RAM) += ramoops.o
> +
> +pstore_zone-objs += zone.o
> +obj-$(CONFIG_PSTORE_ZONE)+= pstore_zone.o
> diff --git a/fs/pstore/zone.c b/fs/pstore/zone.c
> new file mode 100644
> index ..6c25c443c8e2
> --- /dev/null
> +++ b/fs/pstore/zone.c
> @@ -0,0 +1,973 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#define MODNAME "pstore-zone"
> +#define pr_fmt(fmt) MODNAME ": " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +/**
> + * struct psz_head - header of zone to flush to storage
> + *
> + * @sig: signature to indicate header (PSZ_SIG xor PSZONE-type value)
> + * @datalen: length of data in @data
> + * @data: zone data.
> + */
> +struct psz_buffer {
> +#define PSZ_SIG (0x43474244) /* DBGC */
> + uint32_t sig;
> + atomic_t datalen;
> + uint8_t data[];
> +};
> +
> +/**
> + * struct psz_oops_header - sub header of oops zones to flush to storage
> + *
> + * @magic: magic num for oops header
> + * @time: oops/panic trigger time
> + * @compressed: whether compressed
> + * @counter: oops/panic counter
> + * @reason: identify oops or panic
> + * @data: pointer to log data
> + *
> + * It's a sub-header of oops zone, trailing after _buffer.
> + */
> +struct psz_oops_header {
> +#define OOPS_HEADER_MAGIC 0x4dfc3ae5 /* Just a random number */
> + uint32_t magic;
> + struct timespec64 time;
> + bool compressed;
> + uint32_t counter;
> + enum kmsg_dump_reason reason;
> + uint8_t data[];
> +};
> +
> +/**
> + * struct pstore_zone - zone information
> + *
> + * @off: zone offset of storage
> + * @type: front-end type for this zone
> + * @name: front-end name for this zone
> + * @buffer: pointer to data buffer managed by this zone
> + * @oldbuf: pointer to old data buffer.
> + * @buffer_size: bytes in @buffer->data
> + * @should_recover: whether this zone should recover from storage
> + * @dirty: whether the data in @buffer dirty
> + *
> + * zone structure in memory.
> + */
> +struct pstore_zone {
> + loff_t off;
> + const char *name;
> + enum pstore_type_id type;
> +
> + struct psz_buffer *buffer;
> + struct psz_buffer *oldbuf;
> + size_t buffer_size;
> + bool should_recover;
> + atomic_t dirty;
> +};
> +
> +/**
> + * struct psz_context - all about running state of pstore/zone
> + *
> + * @opszs: oops/panic storage zones
> + * @oops_max_cnt: max count of @opszs
> + * @oops_read_cnt: counter to read oops zone
> + * @oops_write_cnt: counter to write
> + * @oops_counter: counter to oops
> + * @panic_counter: counter to panic
> + * @recovered: whether finish recovering data from storage
> + * @on_panic: whether occur panic
> + * @pstore_zone_info_lock: lock to @pstore_zone_info
> + * @pstore_zone_info: information from back-end
> + * @pstore: structure for pstore
> + */
> +struct psz_context {
> + struct pstore_zone **opszs;
> + unsigned int oops_max_cnt;
> + unsigned int oops_read_cnt;
> + unsigned int oops_write_cnt;
> + /*
> +  * the counter should be recovered when recover.
> +  * It records the oops/panic times after burning rather than booting.
> +  */
> + unsigned int oops_counter;
> + unsigned int panic_counter;
> + atomic_t recovered;
> + atomic_t on_panic;
> +
> + /*
> +  * pstore_zone_info_lock 

[PATCH v9 3/8] x86/split_lock: Introduce flag X86_FEATURE_SLD_FATAL and drop sld_state

2020-05-08 Thread Xiaoyao Li
Introduce a synthetic feature flag X86_FEATURE_SLD_FATAL, which means
kernel is in sld_fatal mode if set.

Now sld_state is not needed any more, as the state of SLD can be
inferred from X86_FEATURE_SPLIT_LOCK_DETECT and X86_FEATURE_SLD_FATAL.
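
For illustration, the old state can be recovered from the two flags like
this (just a sketch, not part of the patch):

/* sketch: mapping the two synthetic flags back to the old sld_state */
static enum split_lock_detect_state sld_state_from_caps(void)
{
	if (!boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT))
		return sld_off;
	if (boot_cpu_has(X86_FEATURE_SLD_FATAL))
		return sld_fatal;
	return sld_warn;
}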

Suggested-by: Sean Christopherson 
Signed-off-by: Xiaoyao Li 
---
 arch/x86/include/asm/cpufeatures.h |  1 +
 arch/x86/kernel/cpu/intel.c| 16 ++--
 2 files changed, 7 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h 
b/arch/x86/include/asm/cpufeatures.h
index db189945e9b0..260adfc6c61a 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -286,6 +286,7 @@
 #define X86_FEATURE_FENCE_SWAPGS_USER  (11*32+ 4) /* "" LFENCE in user entry 
SWAPGS path */
 #define X86_FEATURE_FENCE_SWAPGS_KERNEL(11*32+ 5) /* "" LFENCE in 
kernel entry SWAPGS path */
 #define X86_FEATURE_SPLIT_LOCK_DETECT  (11*32+ 6) /* #AC for split lock */
+#define X86_FEATURE_SLD_FATAL  (11*32+ 7) /* split lock detection in 
fatal mode */
 
 /* Intel-defined CPU features, CPUID level 0x0007:1 (EAX), word 12 */
 #define X86_FEATURE_AVX512_BF16(12*32+ 5) /* AVX512 BFLOAT16 
instructions */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 4602dac14dcb..93b8ccf2fa11 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -40,12 +40,6 @@ enum split_lock_detect_state {
sld_fatal,
 };
 
-/*
- * Default to sld_off because most systems do not support split lock detection
- * split_lock_setup() will switch this to sld_warn on systems that support
- * split lock detect, unless there is a command line override.
- */
-static enum split_lock_detect_state sld_state __ro_after_init = sld_off;
 static u64 msr_test_ctrl_cache __ro_after_init;
 
 /*
@@ -1043,8 +1037,9 @@ static void __init split_lock_setup(void)
return;
}
 
-   sld_state = state;
setup_force_cpu_cap(X86_FEATURE_SPLIT_LOCK_DETECT);
+   if (state == sld_fatal)
+   setup_force_cpu_cap(X86_FEATURE_SLD_FATAL);
 }
 
 /*
@@ -1064,7 +1059,7 @@ static void sld_update_msr(bool on)
 
 static void split_lock_init(void)
 {
-   split_lock_verify_msr(sld_state != sld_off);
+   split_lock_verify_msr(boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT));
 }
 
 static void split_lock_warn(unsigned long ip)
@@ -1083,7 +1078,7 @@ static void split_lock_warn(unsigned long ip)
 
 bool handle_guest_split_lock(unsigned long ip)
 {
-   if (sld_state == sld_warn) {
+   if (!boot_cpu_has(X86_FEATURE_SLD_FATAL)) {
split_lock_warn(ip);
return true;
}
@@ -1100,7 +1095,8 @@ EXPORT_SYMBOL_GPL(handle_guest_split_lock);
 
 bool handle_user_split_lock(struct pt_regs *regs, long error_code)
 {
-   if ((regs->flags & X86_EFLAGS_AC) || sld_state == sld_fatal)
+   if ((regs->flags & X86_EFLAGS_AC) ||
+   boot_cpu_has(X86_FEATURE_SLD_FATAL))
return false;
split_lock_warn(regs->ip);
return true;
-- 
2.18.2



[PATCH v9 7/8] KVM: VMX: virtualize split lock detection

2020-05-08 Thread Xiaoyao Li
The TEST_CTRL MSR has per-core scope, i.e., the sibling threads in the same
physical core share the same MSR. This requires additional constraints
when exposing it to the guest.

1) When host SLD state is sld_off (no X86_FEATURE_SPLIT_LOCK_DETECT),
   feature split lock detection is unsupported/disabled.
   Cannot expose it to guest.

2) When host SLD state is sld_warn (has X86_FEATURE_SPLIT_LOCK_DETECT
   but no X86_FEATURE_SLD_FATAL), feature split lock detection can be
   exposed to guest only when nosmt due to the per-core scope.

   In this case, guest's setting can be propagated into real hardware
   MSR. Further, to avoid the potential MSR_TEST_CTRL.SLD toggling
   overhead during every vm-enter and vm-exit, it loads and keeps
   guest's SLD setting when in vcpu run loop and guest_state_loaded,
   i.e., between vmx_prepare_switch_to_guest() and
   vmx_prepare_switch_to_host().

3) when host SLD state is sld_fatal (has X86_FEATURE_SLD_FATAL),
   feature split lock detection can be exposed to guest regardless of
   SMT but KVM_HINTS_SLD_FATAL needs to be set.

   In this case, the guest can still set and clear the MSR_TEST_CTRL.SLD
   bit, but the bit value is never propagated to the real MSR. KVM always
   keeps the SLD bit turned on for the guest vcpu. The reason for not
   forcing the guest's MSR_TEST_CTRL.SLD bit to 1 is that the guest needs
   to set this bit to 1 itself to tell KVM it's SLD-aware.
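
The three cases above collapse into a single predicate on the KVM side.
A sketch of what kvm_split_lock_detect_supported() amounts to (inferred
from the rules above, not copied verbatim from the patch):

static inline bool kvm_split_lock_detect_supported(void)
{
	/*
	 * sld_fatal can always be exposed (KVM keeps SLD on for the
	 * guest); sld_warn is only safe without SMT because
	 * MSR_TEST_CTRL is per-core.
	 */
	return boot_cpu_has(X86_FEATURE_SLD_FATAL) ||
	       (boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT) &&
		!cpu_smt_possible());
}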

Signed-off-by: Xiaoyao Li 
---
 arch/x86/kvm/cpuid.c   |  6 
 arch/x86/kvm/vmx/vmx.c | 68 --
 arch/x86/kvm/vmx/vmx.h |  3 ++
 arch/x86/kvm/x86.c |  6 +++-
 arch/x86/kvm/x86.h |  7 +
 5 files changed, 80 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 901cd1fdecd9..7d9f2daddaf3 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -717,9 +717,15 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array 
*array, u32 function)
if (sched_info_on())
entry->eax |= (1 << KVM_FEATURE_STEAL_TIME);
 
+   if (kvm_split_lock_detect_supported())
+   entry->eax |= (1 << KVM_FEATURE_SPLIT_LOCK_DETECT);
+
entry->ebx = 0;
entry->ecx = 0;
entry->edx = 0;
+
+   if (boot_cpu_has(X86_FEATURE_SLD_FATAL))
+   entry->edx |= (1 << KVM_HINTS_SLD_FATAL);
break;
case 0x8000:
entry->eax = min(entry->eax, 0x801f);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index dbec38ad5035..1cc386c5801d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1120,6 +1120,29 @@ void vmx_set_host_fs_gs(struct vmcs_host_state *host, 
u16 fs_sel, u16 gs_sel,
}
 }
 
+static inline u64 vmx_msr_test_ctrl_valid_bits(struct vcpu_vmx *vmx)
+{
+   u64 valid_bits = 0;
+
+   if (vmx->guest_has_sld)
+   valid_bits |= MSR_TEST_CTRL_SPLIT_LOCK_DETECT;
+
+   return valid_bits;
+}
+
+static inline bool guest_sld_on(struct vcpu_vmx *vmx)
+{
+   return vmx->msr_test_ctrl & MSR_TEST_CTRL_SPLIT_LOCK_DETECT;
+}
+
+static inline void vmx_update_guest_sld(struct vcpu_vmx *vmx)
+{
+   preempt_disable();
+   if (vmx->guest_state_loaded)
+   split_lock_set_guest(guest_sld_on(vmx));
+   preempt_enable();
+}
+
 void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 {
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -1188,6 +1211,10 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 #endif
 
vmx_set_host_fs_gs(host_state, fs_sel, gs_sel, fs_base, gs_base);
+
+   if (static_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT) && vmx->guest_has_sld)
+   vmx->host_sld_on = split_lock_set_guest(guest_sld_on(vmx));
+
vmx->guest_state_loaded = true;
 }
 
@@ -1226,6 +1253,10 @@ static void vmx_prepare_switch_to_host(struct vcpu_vmx 
*vmx)
wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
 #endif
load_fixmap_gdt(raw_smp_processor_id());
+
+   if (static_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT) && vmx->guest_has_sld)
+   split_lock_restore_host(vmx->host_sld_on);
+
vmx->guest_state_loaded = false;
vmx->guest_msrs_ready = false;
 }
@@ -1790,7 +1821,7 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
 
switch (msr_info->index) {
case MSR_TEST_CTRL:
-   msr_info->data = 0;
+   msr_info->data = vmx->msr_test_ctrl;
break;
 #ifdef CONFIG_X86_64
case MSR_FS_BASE:
@@ -1946,9 +1977,12 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
 
switch (msr_index) {
case MSR_TEST_CTRL:
-   if (data)
+   if (data & ~vmx_msr_test_ctrl_valid_bits(vmx))
return 1;
 
+   vmx->msr_test_ctrl = data;
+   vmx_update_guest_sld(vmx);
+
break;
case 

[PATCH v9 8/8] x86/split_lock: Enable split lock detection initialization when running as a guest on KVM

2020-05-08 Thread Xiaoyao Li
When running as a guest, enumerating the split lock detection feature
through the CPU model is not easy since the CPU model is configurable
by the host VMM.

If running on KVM, it can be enumerated through
KVM_FEATURE_SPLIT_LOCK_DETECT, and if KVM_HINTS_SLD_FATAL is set, it
needs to be set to sld_fatal mode.

Signed-off-by: Xiaoyao Li 
---
 arch/x86/include/asm/cpu.h  |  2 ++
 arch/x86/kernel/cpu/intel.c | 12 ++--
 arch/x86/kernel/kvm.c   |  3 +++
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
index a57f00f1d5b5..5d5b488b4b45 100644
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -42,12 +42,14 @@ unsigned int x86_model(unsigned int sig);
 unsigned int x86_stepping(unsigned int sig);
 #ifdef CONFIG_CPU_SUP_INTEL
 extern void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c);
+extern void __init split_lock_setup(bool fatal);
 extern void switch_to_sld(unsigned long tifn);
 extern bool handle_user_split_lock(struct pt_regs *regs, long error_code);
 extern bool handle_guest_split_lock(unsigned long ip);
 extern bool split_lock_virt_switch(bool on);
 #else
 static inline void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c) {}
+static inline void __init split_lock_setup(bool fatal) {}
 static inline void switch_to_sld(unsigned long tifn) {}
 static inline bool handle_user_split_lock(struct pt_regs *regs, long 
error_code)
 {
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 1e2a74e8c592..02e24134b9b5 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -996,12 +996,18 @@ static bool split_lock_verify_msr(bool on)
return ctrl == tmp;
 }
 
-static void __init split_lock_setup(void)
+void __init split_lock_setup(bool fatal)
 {
enum split_lock_detect_state state = sld_warn;
char arg[20];
int i, ret;
 
+   if (fatal) {
+   state = sld_fatal;
+   pr_info("forced on, sending SIGBUS on user-space 
split_locks\n");
+   goto set_cap;
+   }
+
if (!split_lock_verify_msr(false)) {
pr_info("MSR access failed: Disabled\n");
return;
@@ -1037,6 +1043,7 @@ static void __init split_lock_setup(void)
return;
}
 
+set_cap:
setup_force_cpu_cap(X86_FEATURE_SPLIT_LOCK_DETECT);
if (state == sld_fatal)
setup_force_cpu_cap(X86_FEATURE_SLD_FATAL);
@@ -1161,6 +1168,7 @@ void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c)
const struct x86_cpu_id *m;
u64 ia32_core_caps;
 
+   /* Note, paravirt support can enable SLD, e.g., see kvm_guest_init(). */
if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
return;
 
@@ -1182,5 +1190,5 @@ void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c)
return;
}
 
-   split_lock_setup();
+   split_lock_setup(false);
 }
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 6efe0410fb72..489ea89e2e8e 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -670,6 +670,9 @@ static void __init kvm_guest_init(void)
 * overcommitted.
 */
hardlockup_detector_disable();
+
+   if (kvm_para_has_feature(KVM_FEATURE_SPLIT_LOCK_DETECT))
+   split_lock_setup(kvm_para_has_hint(KVM_HINTS_SLD_FATAL));
 }
 
 static noinline uint32_t __kvm_cpuid_base(void)
-- 
2.18.2



[PATCH v9 4/8] x86/split_lock: Introduce split_lock_virt_switch() and two wrappers

2020-05-08 Thread Xiaoyao Li
Introduce split_lock_virt_switch(), which is used for toggling split
lock detection setting as well as updating TIF_SLD_DISABLED flag to make
them consistent. Note, it can only be used in sld warn mode, i.e.,
X86_FEATURE_SPLIT_LOCK_DETECT && !X86_FEATURE_SLD_FATAL.

The FATAL check is handled by wrappers, split_lock_set_guest() and
split_lock_restore_host(), which will be used by KVM when virtualizing
split lock detection for guest in the future.
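
The intended KVM-side usage is roughly the following (a sketch for
illustration; the real calls are added by a later patch in this series):

bool host_sld_on;

/* before running the vcpu, with guest state loaded */
host_sld_on = split_lock_set_guest(guest_sld_on);

/* ... guest runs ... */

/* when switching back to host state */
split_lock_restore_host(host_sld_on);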

Signed-off-by: Xiaoyao Li 
---
 arch/x86/include/asm/cpu.h  | 33 +
 arch/x86/kernel/cpu/intel.c | 20 
 2 files changed, 53 insertions(+)

diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
index dd17c2da1af5..a57f00f1d5b5 100644
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -45,6 +45,7 @@ extern void __init cpu_set_core_cap_bits(struct cpuinfo_x86 
*c);
 extern void switch_to_sld(unsigned long tifn);
 extern bool handle_user_split_lock(struct pt_regs *regs, long error_code);
 extern bool handle_guest_split_lock(unsigned long ip);
+extern bool split_lock_virt_switch(bool on);
 #else
 static inline void __init cpu_set_core_cap_bits(struct cpuinfo_x86 *c) {}
 static inline void switch_to_sld(unsigned long tifn) {}
@@ -57,5 +58,37 @@ static inline bool handle_guest_split_lock(unsigned long ip)
 {
return false;
 }
+
+static inline bool split_lock_virt_switch(bool on) { return false; }
 #endif
+
+/**
+ * split_lock_set_guest - Set SLD state for a guest
+ * @guest_sld_on:  If SLD is on in the guest
+ *
+ * returns:%true if SLD was enabled in the task
+ *
+ * Must be called when X86_FEATURE_SPLIT_LOCK_DETECT is available.
+ */
+static inline bool split_lock_set_guest(bool guest_sld_on)
+{
+   if (static_cpu_has(X86_FEATURE_SLD_FATAL))
+   return true;
+
+   return split_lock_virt_switch(guest_sld_on);
+}
+
+/**
+ * split_lock_restore_host - Restore host SLD state
+ * @host_sld_on:   If SLD is on in the host
+ *
+ * Must be called when X86_FEATURE_SPLIT_LOCK_DETECT is available.
+ */
+static inline void split_lock_restore_host(bool host_sld_on)
+{
+   if (static_cpu_has(X86_FEATURE_SLD_FATAL))
+   return;
+
+   split_lock_virt_switch(host_sld_on);
+}
 #endif /* _ASM_X86_CPU_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 93b8ccf2fa11..1e2a74e8c592 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -1062,6 +1062,26 @@ static void split_lock_init(void)
split_lock_verify_msr(boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT));
 }
 
+/*
+ * It should never be called directly but should use split_lock_set_guest()
+ * and split_lock_restore_host() instead.
+ *
+ * The caller needs to be in preemption disabled context to ensure
+ * MSR state and TIF_SLD_DISABLED state consistent.
+ */
+bool split_lock_virt_switch(bool on)
+{
+   bool was_on = !test_thread_flag(TIF_SLD_DISABLED);
+
+   if (on != was_on) {
+   sld_update_msr(on);
+   update_thread_flag(TIF_SLD_DISABLED, !on);
+   }
+
+   return was_on;
+}
+EXPORT_SYMBOL_GPL(split_lock_virt_switch);
+
 static void split_lock_warn(unsigned long ip)
 {
pr_warn_ratelimited("#AC: %s/%d took a split_lock trap at address: 
0x%lx\n",
-- 
2.18.2



[PATCH v9 0/8] KVM: Add virtualization support of split lock detection

2020-05-08 Thread Xiaoyao Li
This series aims to add the virtualization of split lock detection in
KVM.

Due to the fact that split lock detection is tightly coupled with the CPU
model, and the CPU model is configurable by the host VMM, we elect to use
a paravirt method to expose and enumerate it for the guest.

Changes in v9
 - rebase to v5.7-rc4
 - Add one patch to rename TIF_SLD to TIF_SLD_DISABLED;
 - Add one patch to remove bogus case in handle_guest_split_lock;
 - Introduce flag X86_FEATURE_SPLIT_LOCK_DETECT_FATAL and thus drop
   sld_state;
 - Use X86_FEATURE_SPLIT_LOCK_DETECT and X86_FEATURE_SPLIT_LOCK_DETECT_FATAL
   to determine the SLD state of host;
 - Introduce split_lock_virt_switch() and two wrappers for KVM instead
   of sld_update_to(); 
 - Use paravirt to expose and enumerate split lock detection for guest;
 - Split lock detection can be exposed to the guest when the host is sld_fatal,
   even when SMT is available on the host.

Changes in v8:
https://lkml.kernel.org/r/20200414063129.133630-1-xiaoyao...@intel.com
 - rebase to v5.7-rc1.
 - basic enabling of split lock detection already merged.
 - When host is sld_warn and nosmt, load guest's sld bit when in KVM
   context, i.e., between vmx_prepare_switch_to_guest() and before
   vmx_prepare_switch_to_host(), KVM uses guest sld setting.  

Changes in v7:
https://lkml.kernel.org/r/20200325030924.132881-1-xiaoyao...@intel.com
 - only pick patch 1 and patch 2, and hold all the left.
 - Update SLD bit on each processor based on sld_state.

Changes in v6:
https://lkml.kernel.org/r/20200324151859.31068-1-xiaoyao...@intel.com
 - Drop the sld_not_exist flag and use X86_FEATURE_SPLIT_LOCK_DETECT to
   check whether need to init split lock detection. [tglx]
 - Use tglx's method to verify the existence of split lock detectoin.
 - small optimization of sld_update_msr() that the default value of
   msr_test_ctrl_cache has split_lock_detect bit cleared.
 - Drop the patch3 in v5 that introducing kvm_only option. [tglx]
 - Rebase patch4-8 to kvm/queue.
 - use the new kvm-cpu-cap to expose X86_FEATURE_CORE_CAPABILITIES in
   Patch 6.

Changes in v5:
https://lkml.kernel.org/r/20200315050517.127446-1-xiaoyao...@intel.com
 - Use X86_FEATURE_SPLIT_LOCK_DETECT flag in kvm to ensure split lock
   detection is really supported.
 - Add and export sld related helper functions in their related usecase 
   kvm patches.

Xiaoyao Li (8):
  x86/split_lock: Rename TIF_SLD to TIF_SLD_DISABLED
  x86/split_lock: Remove bogus case in handle_guest_split_lock()
  x86/split_lock: Introduce flag X86_FEATURE_SLD_FATAL and drop
sld_state
  x86/split_lock: Introduce split_lock_virt_switch() and two wrappers
  x86/kvm: Introduce paravirt split lock detection enumeration
  KVM: VMX: Enable MSR TEST_CTRL for guest
  KVM: VMX: virtualize split lock detection
  x86/split_lock: Enable split lock detection initialization when
 running as a guest on KVM

 Documentation/virt/kvm/cpuid.rst | 29 +++
 arch/x86/include/asm/cpu.h   | 35 ++
 arch/x86/include/asm/cpufeatures.h   |  1 +
 arch/x86/include/asm/thread_info.h   |  6 +--
 arch/x86/include/uapi/asm/kvm_para.h |  8 ++--
 arch/x86/kernel/cpu/intel.c  | 59 ---
 arch/x86/kernel/kvm.c|  3 ++
 arch/x86/kernel/process.c|  2 +-
 arch/x86/kvm/cpuid.c |  6 +++
 arch/x86/kvm/vmx/vmx.c   | 72 +---
 arch/x86/kvm/vmx/vmx.h   |  3 ++
 arch/x86/kvm/x86.c   |  6 ++-
 arch/x86/kvm/x86.h   |  7 +++
 13 files changed, 196 insertions(+), 41 deletions(-)

-- 
2.18.2



[PATCH v9 2/8] x86/split_lock: Remove bogus case in handle_guest_split_lock()

2020-05-08 Thread Xiaoyao Li
The bogus case can never happen, i.e., when sld_state == sld_off, the
guest won't trigger a split lock #AC and of course
handle_guest_split_lock() will never be called.

Besides, dropping the bogus case also makes it easier for a future patch
to remove sld_state, if we agree that it must be sld_warn or sld_fatal
when handle_guest_split_lock() is called.

Signed-off-by: Xiaoyao Li 
---
  The alternative would be to remove the "SLD enabled" check from KVM so
  that a truly unexpected/bogus #AC would generate a warn.  It's not clear
  whether or not calling handle_guest_split_lock() iff SLD is enabled was
  intended in the long term.
---
 arch/x86/kernel/cpu/intel.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 0e6aee6ef1e8..4602dac14dcb 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -1088,9 +1088,8 @@ bool handle_guest_split_lock(unsigned long ip)
return true;
}
 
-   pr_warn_once("#AC: %s/%d %s split_lock trap at address: 0x%lx\n",
-current->comm, current->pid,
-sld_state == sld_fatal ? "fatal" : "bogus", ip);
+   pr_warn_once("#AC: %s/%d fatal split_lock trap at address: 0x%lx\n",
+current->comm, current->pid, ip);
 
current->thread.error_code = 0;
current->thread.trap_nr = X86_TRAP_AC;
-- 
2.18.2



[PATCH v9 6/8] KVM: VMX: Enable MSR TEST_CTRL for guest

2020-05-08 Thread Xiaoyao Li
Unconditionally allow the guest to read and zero-write MSR TEST_CTRL.

This matches the fact that most Intel CPUs support MSR TEST_CTRL, and
it also alleviates the effort to handle wrmsr/rdmsr when split lock
detection is exposed to the guest in a future patch.

Signed-off-by: Xiaoyao Li 
---
 arch/x86/kvm/vmx/vmx.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c2c6335a998c..dbec38ad5035 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1789,6 +1789,9 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
u32 index;
 
switch (msr_info->index) {
+   case MSR_TEST_CTRL:
+   msr_info->data = 0;
+   break;
 #ifdef CONFIG_X86_64
case MSR_FS_BASE:
msr_info->data = vmcs_readl(GUEST_FS_BASE);
@@ -1942,6 +1945,11 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct 
msr_data *msr_info)
u32 index;
 
switch (msr_index) {
+   case MSR_TEST_CTRL:
+   if (data)
+   return 1;
+
+   break;
case MSR_EFER:
ret = kvm_set_msr_common(vcpu, msr_info);
break;
-- 
2.18.2



[PATCH v9 1/8] x86/split_lock: Rename TIF_SLD to TIF_SLD_DISABLED

2020-05-08 Thread Xiaoyao Li
TIF_SLD can only be set if a user space thread hits split lock and
sld_state == sld_warn. This flag is set to indicate SLD (split lock
detection) is turned off for the thread, so rename it to
TIF_SLD_DISABLED, which is pretty much self-explanatory.

Suggested-by: Sean Christopherson 
Suggested-by: Thomas Gleixner 
Signed-off-by: Xiaoyao Li 
---
 arch/x86/include/asm/thread_info.h | 6 +++---
 arch/x86/kernel/cpu/intel.c| 6 +++---
 arch/x86/kernel/process.c  | 2 +-
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/thread_info.h 
b/arch/x86/include/asm/thread_info.h
index 8de8ceccb8bc..451a930de1c0 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -92,7 +92,7 @@ struct thread_info {
 #define TIF_NOCPUID15  /* CPUID is not accessible in userland 
*/
 #define TIF_NOTSC  16  /* TSC is not accessible in userland */
 #define TIF_IA32   17  /* IA32 compatibility process */
-#define TIF_SLD18  /* Restore split lock detection 
on context switch */
+#define TIF_SLD_DISABLED   18  /* split lock detection is turned off */
 #define TIF_MEMDIE 20  /* is terminating due to OOM killer */
 #define TIF_POLLING_NRFLAG 21  /* idle is polling for TIF_NEED_RESCHED 
*/
 #define TIF_IO_BITMAP  22  /* uses I/O bitmap */
@@ -122,7 +122,7 @@ struct thread_info {
 #define _TIF_NOCPUID   (1 << TIF_NOCPUID)
 #define _TIF_NOTSC (1 << TIF_NOTSC)
 #define _TIF_IA32  (1 << TIF_IA32)
-#define _TIF_SLD   (1 << TIF_SLD)
+#define _TIF_SLD_DISABLED  (1 << TIF_SLD_DISABLED)
 #define _TIF_POLLING_NRFLAG(1 << TIF_POLLING_NRFLAG)
 #define _TIF_IO_BITMAP (1 << TIF_IO_BITMAP)
 #define _TIF_FORCED_TF (1 << TIF_FORCED_TF)
@@ -141,7 +141,7 @@ struct thread_info {
 /* flags to check in __switch_to() */
 #define _TIF_WORK_CTXSW_BASE   \
(_TIF_NOCPUID | _TIF_NOTSC | _TIF_BLOCKSTEP |   \
-_TIF_SSBD | _TIF_SPEC_FORCE_UPDATE | _TIF_SLD)
+_TIF_SSBD | _TIF_SPEC_FORCE_UPDATE | _TIF_SLD_DISABLED)
 
 /*
  * Avoid calls to __switch_to_xtra() on UP as STIBP is not evaluated.
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index a19a680542ce..0e6aee6ef1e8 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -1074,11 +1074,11 @@ static void split_lock_warn(unsigned long ip)
 
/*
 * Disable the split lock detection for this task so it can make
-* progress and set TIF_SLD so the detection is re-enabled via
+* progress and set TIF_SLD_DISABLED so the detection is re-enabled via
 * switch_to_sld() when the task is scheduled out.
 */
sld_update_msr(false);
-   set_tsk_thread_flag(current, TIF_SLD);
+   set_tsk_thread_flag(current, TIF_SLD_DISABLED);
 }
 
 bool handle_guest_split_lock(unsigned long ip)
@@ -1116,7 +1116,7 @@ bool handle_user_split_lock(struct pt_regs *regs, long 
error_code)
  */
 void switch_to_sld(unsigned long tifn)
 {
-   sld_update_msr(!(tifn & _TIF_SLD));
+   sld_update_msr(!(tifn & _TIF_SLD_DISABLED));
 }
 
 /*
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 9da70b279dad..e7693a283489 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -650,7 +650,7 @@ void __switch_to_xtra(struct task_struct *prev_p, struct 
task_struct *next_p)
__speculation_ctrl_update(~tifn, tifn);
}
 
-   if ((tifp ^ tifn) & _TIF_SLD)
+   if ((tifp ^ tifn) & _TIF_SLD_DISABLED)
switch_to_sld(tifn);
 }
 
-- 
2.18.2



[PATCH v9 5/8] x86/kvm: Introduce paravirt split lock detection enumeration

2020-05-08 Thread Xiaoyao Li
Introduce KVM_FEATURE_SPLIT_LOCK_DETECT, with which a Linux guest running
on KVM can enumerate the availability of the split lock detection feature.

Introduce KVM_HINTS_SLD_FATAL, which tells whether the host is in sld_fatal
mode, i.e., whether split lock detection is forced on for the guest vcpu.

Signed-off-by: Xiaoyao Li 
---
 Documentation/virt/kvm/cpuid.rst | 29 
 arch/x86/include/uapi/asm/kvm_para.h |  8 +---
 2 files changed, 26 insertions(+), 11 deletions(-)

diff --git a/Documentation/virt/kvm/cpuid.rst b/Documentation/virt/kvm/cpuid.rst
index 01b081f6e7ea..a7e85ac090a8 100644
--- a/Documentation/virt/kvm/cpuid.rst
+++ b/Documentation/virt/kvm/cpuid.rst
@@ -86,6 +86,12 @@ KVM_FEATURE_PV_SCHED_YIELD13  guest checks 
this feature bit
   before using paravirtualized
   sched yield.
 
+KVM_FEATURE_SPLIT_LOCK_DETECT 14  guest checks this feature bit for
+  available of split lock 
detection.
+
+  KVM doesn't support enumerating
+ split lock detection via CPU model
+
 KVM_FEATURE_CLOCSOURCE_STABLE_BIT 24  host will warn if no guest-side
   per-cpu warps are expeced in
   kvmclock
@@ -97,11 +103,18 @@ KVM_FEATURE_CLOCSOURCE_STABLE_BIT 24  host will 
warn if no guest-side
 
 Where ``flag`` here is defined as below:
 
-==  =
-flag   valuemeaning
-==  =
-KVM_HINTS_REALTIME 0guest checks this feature bit to
-determine that vCPUs are never
-preempted for an unlimited time
-allowing optimizations
-==  =
+  =
+flag valuemeaning
+  =
+KVM_HINTS_REALTIME   0guest checks this feature bit to
+  determine that vCPUs are never
+  preempted for an unlimited time
+  allowing optimizations
+
+KVM_HINTS_SLD_FATAL  1set if split lock detection is
+  forced on in the host, in which
+ case KVM will kill the guest if it
+ generates a split lock #AC with
+ SLD disabled from guest's
+ perspective
+  =
diff --git a/arch/x86/include/uapi/asm/kvm_para.h 
b/arch/x86/include/uapi/asm/kvm_para.h
index 2a8e0b6b9805..a8fe0221403a 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -31,14 +31,16 @@
 #define KVM_FEATURE_PV_SEND_IPI11
 #define KVM_FEATURE_POLL_CONTROL   12
 #define KVM_FEATURE_PV_SCHED_YIELD 13
-
-#define KVM_HINTS_REALTIME  0
-
+#define KVM_FEATURE_SPLIT_LOCK_DETECT  14
 /* The last 8 bits are used to indicate how to interpret the flags field
  * in pvclock structure. If no bits are set, all flags are ignored.
  */
 #define KVM_FEATURE_CLOCKSOURCE_STABLE_BIT 24
 
+/* KVM feature hints in CPUID.0x4001.EDX */
+#define KVM_HINTS_REALTIME 0
+#define KVM_HINTS_SLD_FATAL1
+
 #define MSR_KVM_WALL_CLOCK  0x11
 #define MSR_KVM_SYSTEM_TIME 0x12
 
-- 
2.18.2



Re: [PATCH V4] f2fs: Avoid double lock for cp_rwsem during checkpoint

2020-05-08 Thread Chao Yu
On 2020/5/9 0:10, Jaegeuk Kim wrote:
> Hi Sayali,
> 
> In order to address the perf regression, how about this?
> 
>>From 48418af635884803ffb35972df7958a2e6649322 Mon Sep 17 00:00:00 2001
> From: Jaegeuk Kim 
> Date: Fri, 8 May 2020 09:08:37 -0700
> Subject: [PATCH] f2fs: avoid double lock for cp_rwsem during checkpoint
> 
> There could be a scenario where f2fs_sync_node_pages gets
> called during checkpoint, which in turn tries to flush
> inline data and calls iput(). This results in deadlock as
> iput() tries to hold cp_rwsem, which is already held at the
> beginning by checkpoint->block_operations().
> 
> Call stack :
> 
> Thread A  Thread B
> f2fs_write_checkpoint()
> - block_operations(sbi)
>  - f2fs_lock_all(sbi);
>   - down_write(&sbi->cp_rwsem);
> 
> - open()
>  - igrab()
> - write() write inline data
> - unlink()
> - f2fs_sync_node_pages()
>  - if (is_inline_node(page))
>   - flush_inline_data()
>- ilookup()
>  page = f2fs_pagecache_get_page()
>  if (!page)
>   goto iput_out;
>  iput_out:
>   -close()
>   -iput()
>iput(inode);
>- f2fs_evict_inode()
> - f2fs_truncate_blocks()
>  - f2fs_lock_op()
>- down_read(&sbi->cp_rwsem);
> 
> Fixes: 2049d4fcb057 ("f2fs: avoid multiple node page writes due to 
> inline_data")
> Signed-off-by: Sayali Lokhande 
> Signed-off-by: Jaegeuk Kim 
> ---
>  fs/f2fs/node.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/f2fs/node.c b/fs/f2fs/node.c
> index 1db8cabf727ef..626d7daca09de 100644
> --- a/fs/f2fs/node.c
> +++ b/fs/f2fs/node.c
> @@ -1870,8 +1870,8 @@ int f2fs_sync_node_pages(struct f2fs_sb_info *sbi,
>   goto continue_unlock;
>   }
>  
> - /* flush inline_data */
> - if (is_inline_node(page)) {
> + /* flush inline_data, if it's not sync path. */
> + if (do_balance && is_inline_node(page)) {

IIRC, this flow was designed to avoid a running-out-of-free-space issue
during checkpoint:

2049d4fcb057 ("f2fs: avoid multiple node page writes due to inline_data")

The scenario is:
1. create fully node blocks
2. flush node blocks
3. write inline_data for all the node blocks again
4. flush node blocks redundantly

I guess this may cause one case of fstest to fail.


Since block_operations->f2fs_sync_inode_meta has synced the inode cache to
the inode page, could we check nlink in block_operations->f2fs_sync_node_pages
before flush_inline_data():

if (is_inline_node(page)) {
if (IS_INODE(page) && raw_inode_page->i_links) {
flush_inline_data()
}
}
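
Fleshing that out, a rough and untested sketch (field and helper names
are my assumptions based on the current f2fs code):

		/* flush inline_data only for inodes that still have links */
		if (is_inline_node(page)) {
			if (IS_INODE(page) &&
			    le32_to_cpu(F2FS_INODE(page)->i_links)) {
				clear_inline_node(page);
				unlock_page(page);
				flush_inline_data(sbi, ino_of_node(page));
				goto lock_node;
			}
		}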


>   clear_inline_node(page);
>   unlock_page(page);
>   flush_inline_data(sbi, ino_of_node(page));
> 


[PATCH -next] mm/hmm/test: fix missing unlock on error in dmirror_migrate_finalize_and_map()

2020-05-08 Thread Wei Yongjun
Add the missing unlock before return from function
dmirror_migrate_finalize_and_map() in the error
handling case.

Fixes: 5d5e54be8a1e ("mm/hmm/test: add selftest driver for HMM")
Reported-by: Hulk Robot 
Signed-off-by: Wei Yongjun 
---
 lib/test_hmm.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 00bca6116f93..30462193c4ff 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -647,8 +647,10 @@ static int dmirror_migrate_finalize_and_map(struct 
migrate_vma *args,
if (*dst & MIGRATE_PFN_WRITE)
entry = xa_tag_pointer(entry, DPT_XA_TAG_WRITE);
	entry = xa_store(&dmirror->pt, pfn, entry, GFP_ATOMIC);
-   if (xa_is_err(entry))
+   if (xa_is_err(entry)) {
+   mutex_unlock(&dmirror->mutex);
return xa_err(entry);
+   }
}
 
	mutex_unlock(&dmirror->mutex);





[PATCH -next] mm/hmm/test: fix error return code in hmm_dmirror_init()

2020-05-08 Thread Wei Yongjun
Fix to return negative error code -ENOMEM from the alloc_page() error
handling case instead of 0, as done elsewhere in this function.

Fixes: 5d5e54be8a1e ("mm/hmm/test: add selftest driver for HMM")
Reported-by: Hulk Robot 
Signed-off-by: Wei Yongjun 
---
 lib/test_hmm.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 00bca6116f93..b4d9434e49e7 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -1119,8 +1119,10 @@ static int __init hmm_dmirror_init(void)
 * make the code here simpler (i.e., we need a struct page for it).
 */
dmirror_zero_page = alloc_page(GFP_HIGHUSER | __GFP_ZERO);
-   if (!dmirror_zero_page)
+   if (!dmirror_zero_page) {
+   ret = -ENOMEM;
goto err_chrdev;
+   }
 
pr_info("HMM test module loaded. This is only for testing HMM.\n");
return 0;





Re: [PATCH 13/14] sched: cpufreq: Use IS_ENABLED() for schedutil

2020-05-08 Thread Pavan Kondeti
On Fri, May 08, 2020 at 02:21:29PM +0100, Quentin Perret wrote:
> On Friday 08 May 2020 at 11:00:53 (+0530), Pavan Kondeti wrote:
> > > -#if defined(CONFIG_ENERGY_MODEL) && 
> > > defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
> > > +#if defined(CONFIG_ENERGY_MODEL) && 
> > > IS_ENABLED(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
> > >   /* Build perf. domains: */
> > >   for (i = 0; i < ndoms_new; i++) {
> > >   for (j = 0; j < n && !sched_energy_update; j++) {
> > 
> > Now that scheduler does not have any references to schedutil_gov and cpufreq
> > has want_eas flag, do we need this CONFIG_CPU_FREQ_GOV_SCHEDUTIL checks 
> > here?
> 
> Right, they're not absolutely required, but given that sugov is the only
> one to have 'want_eas' set I guess there is no need to compile that in
> without it, no?
> 
Right.

Since you removed all compile time dependencies on schedutil, I thought the
#ifdef check around schedutil can be removed too. 

Thanks,
Pavan

-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.


[PATCH -next] power: reset: ltc2952: remove set but not used variable

2020-05-08 Thread Hongbo Yao
Fixes gcc '-Wunused-but-set-variable' warning:
drivers/power/reset/ltc2952-poweroff.c:97:16: warning: variable
‘overruns’ set but not used [-Wunused-but-set-variable]
  unsigned long overruns;

Reported-by: Hulk Robot 
Signed-off-by: Hongbo Yao 
---
 drivers/power/reset/ltc2952-poweroff.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/power/reset/ltc2952-poweroff.c 
b/drivers/power/reset/ltc2952-poweroff.c
index e4a0cc45b3d1..318927938b05 100644
--- a/drivers/power/reset/ltc2952-poweroff.c
+++ b/drivers/power/reset/ltc2952-poweroff.c
@@ -94,7 +94,6 @@ static enum hrtimer_restart ltc2952_poweroff_timer_wde(struct 
hrtimer *timer)
 {
ktime_t now;
int state;
-   unsigned long overruns;
struct ltc2952_poweroff *data = to_ltc2952(timer, timer_wde);
 
if (data->kernel_panic)
@@ -104,7 +103,7 @@ static enum hrtimer_restart 
ltc2952_poweroff_timer_wde(struct hrtimer *timer)
gpiod_set_value(data->gpio_watchdog, !state);
 
now = hrtimer_cb_get_time(timer);
-   overruns = hrtimer_forward(timer, now, data->wde_interval);
+   hrtimer_forward(timer, now, data->wde_interval);
 
return HRTIMER_RESTART;
 }
-- 
2.20.1



Re: [RFC PATCH v3 12/14] perf metricgroup: order event groups by size

2020-05-08 Thread Andi Kleen
> > I'm not sure if size is that great an heuristic. The dedup algorithm should
> > work in any case even if you don't order by size, right?
> 
> Consider two metrics:
>  - metric 1 with events {A,B}
>  - metric 2 with events {A,B,C,D}
> If the list isn't sorted then as the matching takes the first group
> with all the events, metric 1 will match {A,B} and metric 2 {A,B,C,D}.
> If the order is sorted to {A,B,C,D},{A,B} then metric 1 matches within
> the {A,B,C,D} group as does metric 2. The events in metric 1 aren't
> used and are removed.

Ok. It's better for the longer metric if they stay together.

> 
> The dedup algorithm is very naive :-)

I guess what matters is that it gives reasonable results on the current
metrics. I assume it does?

How much deduping is happening if you run all metrics?

For toplev on my long term todo list was to compare it against
a hopefully better schedule generated by or-tools, but I never
got around to coding that up.

-Andi


Re: [RFC 2/4] sched/core: Set nr_lat_sensitive counter at various scheduler entry/exit points

2020-05-08 Thread Pavan Kondeti
On Fri, May 08, 2020 at 04:45:16PM +0530, Parth Shah wrote:
> Hi Pavan,
> 
> Thanks for going through this patch-set.
> 
> On 5/8/20 2:03 PM, Pavan Kondeti wrote:
> > Hi Parth,
> > 
> > On Thu, May 07, 2020 at 07:07:21PM +0530, Parth Shah wrote:
> >> Monitor tasks at:
> >> 1. wake_up_new_task() - forked tasks
> >>
> >> 2. set_task_cpu() - task migrations, Load balancer
> >>
> >> 3. __sched_setscheduler() - set/unset latency_nice value
> >> Increment the nr_lat_sensitive count on the CPU with task marked with
> >> latency_nice == -20.
> >> Similarly, decrement the nr_lat_sensitive counter upon re-marking the task
> >> with >-20 latency_nice task.
> >>
> >> 4. finish_task_switch() - dying task
> >>
> > 
> > 
> >> Signed-off-by: Parth Shah 
> >> ---
> >>  kernel/sched/core.c  | 30 --
> >>  kernel/sched/sched.h |  5 +
> >>  2 files changed, 33 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> >> index 2d8b76f41d61..ad396c36eba6 100644
> >> --- a/kernel/sched/core.c
> >> +++ b/kernel/sched/core.c
> >> @@ -1744,6 +1744,11 @@ void set_task_cpu(struct task_struct *p, unsigned 
> >> int new_cpu)
> >>trace_sched_migrate_task(p, new_cpu);
> >>  
> >>if (task_cpu(p) != new_cpu) {
> >> +  if (task_is_lat_sensitive(p)) {
> >> +  per_cpu(nr_lat_sensitive, task_cpu(p))--;
> >> +  per_cpu(nr_lat_sensitive, new_cpu)++;
> >> +  }
> >> +
> > 
> > Since we can come here without rq locks, there is a possibility
> > of a race and incorrect updates can happen. Since the counters
> > are used to prevent C-states, we don't want that to happen.
> 
> I did try using task_lock(p) wherever we change the refcount and when the
> latency_nice value is set. There I was using nr_lat_sensitive with atomic_t
> type.
> 
> After lots of thinking about how to optimize it, and given that we anyway
> hold the rq lock, I thought of not using any lock here and seeing if the
> scheduler community has a well-known solution for this :-)
> 
> But in brief, using atomic_t nr_lat_sensitive and task_lock(p) when changing
> the refcount should solve the problem, right?
> 
> If you or anyone else has a solution for this kind of pattern, that would
> surely be helpful.
> 
I am not sure task_lock() can help here, because we are operating on the
counter on a per-CPU basis. Maybe a cmpxchg-based loop works here to make
sure that the increment/decrement operation happens atomically.
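
As a rough sketch of that (illustration only, assuming nr_lat_sensitive
stays a plain per-CPU int):

static inline void nr_lat_sensitive_add(int cpu, int delta)
{
	int *ptr = &per_cpu(nr_lat_sensitive, cpu);
	int old;

	do {
		old = READ_ONCE(*ptr);
	} while (cmpxchg(ptr, old, old + delta) != old);
}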

> > 
> >>if (p->sched_class->migrate_task_rq)
> >>p->sched_class->migrate_task_rq(p, new_cpu);
> >>p->se.nr_migrations++;

[...]

> >> @@ -4732,8 +4749,17 @@ static void __setscheduler_params(struct 
> >> task_struct *p,
> >>p->normal_prio = normal_prio(p);
> >>set_load_weight(p, true);
> >>  
> >> -  if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE)
> >> +  if (attr->sched_flags & SCHED_FLAG_LATENCY_NICE) {
> >> +  if (p->state != TASK_DEAD &&
> >> +  attr->sched_latency_nice != p->latency_nice) {
> >> +  if (attr->sched_latency_nice == MIN_LATENCY_NICE)
> >> +  per_cpu(nr_lat_sensitive, task_cpu(p))++;
> >> +  else if (task_is_lat_sensitive(p))
> >> +  per_cpu(nr_lat_sensitive, task_cpu(p))--;
> >> +  }
> >> +
> >>p->latency_nice = attr->sched_latency_nice;
> >> +  }
> >>  }
> > 
> > There is a potential race here due to which we can mess up the refcount.
> > 
> > - A latency sensitive task is marked TASK_DEAD
> > 
> > - sched_setattr() called on the task to clear the latency nice. Since
> > we check the task state here, we skip the decrement.
> > - The task is finally context switched out and we skip the decrement again
> > since it is not a latency sensitive task.
> 
> if task is already marked TASK_DEAD then we should have already decremented
> its refcount in finish_task_switch().
> am I missing something?

There is a window (context switch and dropping the rq lock) between
marking a task DEAD (in do_task_dead()) and dropping the ref counter
(in finish_task_switch()) during which we can get here and skip
the check because the task is marked as DEAD.

Thanks,
Pavan
-- 
Qualcomm India Private Limited, on behalf of Qualcomm Innovation Center, Inc.
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux 
Foundation Collaborative Project.


Re: [PATCH] phy: samsung: s5pv210-usb2: Add delay after reset

2020-05-08 Thread Jonathan Bakker
Hi Kishon,

Is there anything else you (or someone else) need from me?  I'd really
appreciate getting USB working again on s5pv210, as without this patch it is
entirely broken.

Thanks,
Jonathan

On 2020-04-28 2:50 p.m., Jonathan Bakker wrote:
> Hi Kishon,
> 
> On 2020-04-27 5:40 a.m., Kishon Vijay Abraham I wrote:
>> Hi Jonathan,
>>
>> On 4/25/2020 11:06 PM, Jonathan Bakker wrote:
>>> The USB phy takes some time to reset, so make sure we give it to it. The
>>> delay length was taken from the 4x12 phy driver.
>>>
>>> This manifested in issues with the DWC2 driver since commit fe369e1826b3
>>> ("usb: dwc2: Make dwc2_readl/writel functions endianness-agnostic.")
>>> where the endianness check would read the DWC ID as 0 due to the phy still
>>> resetting, resulting in the wrong endian mode being chosen.
>>>
>>> Signed-off-by: Jonathan Bakker 
>>> ---
>>>  drivers/phy/samsung/phy-s5pv210-usb2.c | 4 
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/drivers/phy/samsung/phy-s5pv210-usb2.c 
>>> b/drivers/phy/samsung/phy-s5pv210-usb2.c
>>> index 56a5083fe6f9..32be62e49804 100644
>>> --- a/drivers/phy/samsung/phy-s5pv210-usb2.c
>>> +++ b/drivers/phy/samsung/phy-s5pv210-usb2.c
>>> @@ -139,6 +139,10 @@ static void s5pv210_phy_pwr(struct 
>>> samsung_usb2_phy_instance *inst, bool on)
>>> udelay(10);
>>> rst &= ~rstbits;
>>> writel(rst, drv->reg_phy + S5PV210_UPHYRST);
>>> +   /* The following delay is necessary for the reset sequence to be
>>> +* completed
>>> +*/
>>> +   udelay(80);
>>
>> Please fix the following checkpatch check error.
>> CHECK: usleep_range is preferred over udelay; see
>> Documentation/timers/timers-howto.rst
>> #151: FILE: drivers/phy/samsung/phy-s5pv210-usb2.c:145:
>> +   udelay(80);
>>
>> total: 0 errors, 0 warnings, 1 checks, 10 lines checked
> 
> 
> Unfortunately, this is an atomic code path (and hence why the other Samsung 
> phy driver use udelay
> in the same place). Changing to usleep_range brings up the following BUG:
> 
> BUG: scheduling while atomic: swapper/1/0x0002
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper Not tainted 5.7.0-rc2-1-gf9f8ac7cc48c-dirty 
> #443
> Hardware name: Samsung S5PC110/S5PV210-based board
> [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
> [] (show_stack) from [] (__schedule_bug+0x70/0x84)
> [] (__schedule_bug) from [] (__schedule+0x38c/0x464)
> [] (__schedule) from [] (schedule+0xa0/0x138)
> [] (schedule) from [] 
> (schedule_hrtimeout_range_clock+0xd4/0x158)
> [] (schedule_hrtimeout_range_clock) from [] 
> (schedule_hrtimeout_range+0x18/0x20)
> [] (schedule_hrtimeout_range) from [] 
> (usleep_range+0x68/0x8c)
> [] (usleep_range) from [] (s5pv210_power_on+0xbc/0xe4)
> [] (s5pv210_power_on) from [] 
> (samsung_usb2_phy_power_on+0xec/0x16c)
> [] (samsung_usb2_phy_power_on) from [] 
> (phy_power_on+0x8c/0xdc)
> [] (phy_power_on) from [] 
> (__dwc2_lowlevel_hw_enable+0xb8/0xcc)
> [] (__dwc2_lowlevel_hw_enable) from [] 
> (dwc2_driver_probe+0x1e4/0x580)
> [] (dwc2_driver_probe) from [] 
> (platform_drv_probe+0x48/0x98)
> [] (platform_drv_probe) from [] (really_probe+0x1e0/0x344)
> [] (really_probe) from [] (driver_probe_device+0x60/0x168)
> [] (driver_probe_device) from [] 
> (device_driver_attach+0x58/0x60)
> [] (device_driver_attach) from [] 
> (__driver_attach+0x58/0xcc)
> [] (__driver_attach) from [] (bus_for_each_dev+0x74/0xb4)
> [] (bus_for_each_dev) from [] (bus_add_driver+0x1b4/0x1d4)
> [] (bus_add_driver) from [] (driver_register+0x74/0x108)
> [] (driver_register) from [] (do_one_initcall+0x7c/0x1cc)
> [] (do_one_initcall) from [] 
> (kernel_init_freeable+0x15c/0x1d4)
> [] (kernel_init_freeable) from [] (kernel_init+0x8/0x110)
> [] (kernel_init) from [] (ret_from_fork+0x14/0x2c)
> Exception stack(0xe745ffb0 to 0xe745fff8)
> ffa0:    
> ffc0:        
> ffe0:     0013 
> dwc2 ec00.hsotg: dwc2_check_params: Invalid parameter besl=1
> dwc2 ec00.hsotg: dwc2_check_params: Invalid parameter 
> g_np_tx_fifo_size=1024
> dwc2 ec00.hsotg: EPs: 16, dedicated fifos, 7808 entries in SPRAM
> [ cut here ]
> kernel BUG at mm/vmalloc.c:2101!
> Internal error: Oops - BUG: 0 [#1] PREEMPT ARM
> Modules linked in:
> CPU: 0 PID: 1 Comm: swapper Tainted: GW 
> 5.7.0-rc2-1-gf9f8ac7cc48c-dirty #443
> Hardware name: Samsung S5PC110/S5PV210-based board
> PC is at __get_vm_area_node+0x174/0x178
> LR is at 0xe745e000
> pc : []lr : []psr: 2013
> sp : e745fc60  ip : 001fff00  fp : 0038
> r10: e6cb9880  r9 : 0001  r8 : c0161760
> r7 : e6cb98c0  r6 : 0cc0  r5 : 0247  r4 : e7fd2800
> r3 : e880  r2 : 0010  r1 :   r0 : 1000
> Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> 

Re: [PATCH] net: tg3: tidy up loop, remove need to compute off with a multiply

2020-05-08 Thread Joe Perches
On Fri, 2020-05-08 at 18:48 -0700, Jakub Kicinski wrote:
> On Sat, 9 May 2020 00:31:03 +0100 Colin Ian King wrote:
> > > My preference would be for
> > > 
> > > {
> > >   int i;
> > >   u32 off = 0;
> > > 
> > >   for (i = 0; i < TG3_SD_NUM_RECS; i++) {
> > >   tg3_ape_scratchpad_read(tp, (u32 *)ocir, off, TG3_OCIR_LEN);
> > > 
> > >   if (ocir->signature != TG3_OCIR_SIG_MAGIC ||
> > >   !(ocir->version_flags & TG3_OCIR_FLAG_ACTIVE))
> > >   memset(ocir, 0, TG3_OCIR_LEN);
> > > 
> > >   off += TG3_OCIR_LEN;
> > >   ocir++;
> > >   }
> > >   
> > OK, I'll send a V3 tomorrow.
> 
> I already reviewed and applied v2, just waiting for builds to finish,
> let's leave it.


I think clarity should be preferred.
Are you a maintainer of this file?

$ ./scripts/get_maintainer.pl -f drivers/net/ethernet/broadcom/tg3.c
Siva Reddy Kallam  (supporter:BROADCOM TG3 GIGABIT 
ETHERNET DRIVER)
Prashant Sreedharan  (supporter:BROADCOM TG3 GIGABIT 
ETHERNET DRIVER)
Michael Chan  (supporter:BROADCOM TG3 GIGABIT ETHERNET 
DRIVER)
"David S. Miller"  (odd fixer:NETWORKING DRIVERS)
net...@vger.kernel.org (open list:BROADCOM TG3 GIGABIT ETHERNET DRIVER)
linux-kernel@vger.kernel.org (open list)




Re: cgroup pointed by sock is leaked on mode switch

2020-05-08 Thread Yang Yingliang



On 2020/5/6 15:51, Zefan Li wrote:

On 2020/5/6 10:16, Zefan Li wrote:

On 2020/5/6 9:50, Yang Yingliang wrote:

+cc lize...@huawei.com

On 2020/5/6 0:06, Tejun Heo wrote:

Hello, Yang.

On Sat, May 02, 2020 at 06:27:21PM +0800, Yang Yingliang wrote:

I find the number nr_dying_descendants is increasing:
linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep
'^nr_dying_descendants [^0]'  {} +
/sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 80
/sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 
1
/sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants 


1
/sys/fs/cgroup/unified/lxc/cgroup.stat:nr_dying_descendants 79
/sys/fs/cgroup/unified/lxc/5f1fdb8c54fa40c3e599613dab6e4815058b76ebada8a27bc1fe80c0d4801764/cgroup.stat:nr_dying_descendants 


78
/sys/fs/cgroup/unified/lxc/5f1fdb8c54fa40c3e599613dab6e4815058b76ebada8a27bc1fe80c0d4801764/system.slice/cgroup.stat:nr_dying_descendants 


78

Those numbers are nowhere close to causing oom issues. There are some
aspects of page and other cache draining which is being improved 
but unless
you're seeing numbers multiple orders of magnitude higher, this 
isn't the

source of your problem.

The situation is as same as the commit bd1060a1d671 ("sock, 
cgroup: add

sock->sk_cgroup") describes.
"On mode switch, cgroup references which are already being pointed 
to by

socks may be leaked."
I'm doubtful that you're hitting that issue. Mode switching means 
memcg
being switched between cgroup1 and cgroup2 hierarchies, which is 
unlikely to

be what's happening when you're launching docker containers.

The first step would be identifying where memory is going and 
finding out
whether memcg is actually being switched between cgroup1 and 2 - 
look at the

hierarchy number in /proc/cgroups, if that's switching between 0 and
something not zero, it is switching.



I think there's a bug here which can lead to unlimited memory leak.
This should reproduce the bug:

    # mount -t cgroup -o netprio xxx /cgroup/netprio
    # mkdir /cgroup/netprio/xxx
    # echo PID > /cgroup/netprio/xxx/tasks
    /* this PID process starts to do some network thing and then 
exits */

    # rmdir /cgroup/netprio/xxx
    /* now this cgroup will never be freed */



Correction (still not tested):

    # mount -t cgroup2 none /cgroup/v2
    # mkdir /cgroup/v2/xxx
    # echo PID > /cgroup/v2/xxx/cgroup.procs
    /* this PID process starts to do some network thing */

    # mount -t cgroup -o netprio xxx /cgroup/netprio
    # mkdir /cgroup/netprio/xxx
    # echo PID > /cgroup/netprio/xxx/tasks
    ...
    /* the PID process exits */

    rmdir /cgroup/netprio/xxx
    rmdir /cgroup/v2/xxx
    /* now looks like this v2 cgroup will never be freed */


Look at the code:

static inline void sock_update_netprioidx(struct sock_cgroup_data *skcd)
{
 ...
 sock_cgroup_set_prioidx(skcd, task_netprioidx(current));
}

static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data 
*skcd,

 u16 prioidx)
{
 ...
 if (sock_cgroup_prioidx(&skcd_buf) == prioidx)
 return;
 ...
 skcd_buf.prioidx = prioidx;
 WRITE_ONCE(skcd->val, skcd_buf.val);
}

task_netprioidx() will be the cgrp id of xxx which is not 1, but
sock_cgroup_prioidx(_buf) is 1 because it thought it's in v2 mode.
Now we have a memory leak.

I think the easiest fix is to do the mode switch here:

diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index b905747..2397866 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -240,6 +240,8 @@ static void net_prio_attach(struct cgroup_taskset 
*tset)

 struct task_struct *p;
 struct cgroup_subsys_state *css;

+   cgroup_sk_alloc_disable();
+
 cgroup_taskset_for_each(p, css, tset) {
 void *v = (void *)(unsigned long)css->cgroup->id;



I've done some tests with this change; here is the test result:


Without this change, nr_dying_descendants is increased and the cgroup is 
leaked:


linux-dVpNUK:~ # mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 
(rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup 
(rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/blkio type cgroup 
(rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup 
(rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup 
(rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/freezer type cgroup 
(rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/devices type cgroup 
(rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/memory type cgroup 
(rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/cpuset type cgroup 
(rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on 

[PATCH v7 3/8] bus: mhi: core: Add range check for channel id received in event ring

2020-05-08 Thread Bhaumik Bhatt
From: Hemant Kumar 

The MHI data completion handler function reads the channel id from the
event ring element. The value is under the control of the MHI device and
can be any value between 0 and 255. In order to prevent out-of-bounds
access, add a bounds check against the max channel supported by the
controller and skip processing of event ring elements that fail it.

Signed-off-by: Hemant Kumar 
Signed-off-by: Bhaumik Bhatt 
Reviewed-by: Jeffrey Hugo 
---
 drivers/bus/mhi/core/main.c | 40 +---
 1 file changed, 29 insertions(+), 11 deletions(-)

diff --git a/drivers/bus/mhi/core/main.c b/drivers/bus/mhi/core/main.c
index 605640c..30798ec 100644
--- a/drivers/bus/mhi/core/main.c
+++ b/drivers/bus/mhi/core/main.c
@@ -775,9 +775,18 @@ int mhi_process_ctrl_ev_ring(struct mhi_controller 
*mhi_cntrl,
}
case MHI_PKT_TYPE_TX_EVENT:
chan = MHI_TRE_GET_EV_CHID(local_rp);
-   mhi_chan = _cntrl->mhi_chan[chan];
-   parse_xfer_event(mhi_cntrl, local_rp, mhi_chan);
-   event_quota--;
+
+   WARN_ON(chan >= mhi_cntrl->max_chan);
+
+   /*
+* Only process the event ring elements whose channel
+* ID is within the maximum supported range.
+*/
+   if (chan < mhi_cntrl->max_chan) {
+   mhi_chan = _cntrl->mhi_chan[chan];
+   parse_xfer_event(mhi_cntrl, local_rp, mhi_chan);
+   event_quota--;
+   }
break;
default:
dev_err(dev, "Unhandled event type: %d\n", type);
@@ -820,14 +829,23 @@ int mhi_process_data_event_ring(struct mhi_controller 
*mhi_cntrl,
enum mhi_pkt_type type = MHI_TRE_GET_EV_TYPE(local_rp);
 
chan = MHI_TRE_GET_EV_CHID(local_rp);
-   mhi_chan = _cntrl->mhi_chan[chan];
-
-   if (likely(type == MHI_PKT_TYPE_TX_EVENT)) {
-   parse_xfer_event(mhi_cntrl, local_rp, mhi_chan);
-   event_quota--;
-   } else if (type == MHI_PKT_TYPE_RSC_TX_EVENT) {
-   parse_rsc_event(mhi_cntrl, local_rp, mhi_chan);
-   event_quota--;
+
+   WARN_ON(chan >= mhi_cntrl->max_chan);
+
+   /*
+* Only process the event ring elements whose channel
+* ID is within the maximum supported range.
+*/
+   if (chan < mhi_cntrl->max_chan) {
+   mhi_chan = _cntrl->mhi_chan[chan];
+
+   if (likely(type == MHI_PKT_TYPE_TX_EVENT)) {
+   parse_xfer_event(mhi_cntrl, local_rp, mhi_chan);
+   event_quota--;
+   } else if (type == MHI_PKT_TYPE_RSC_TX_EVENT) {
+   parse_rsc_event(mhi_cntrl, local_rp, mhi_chan);
+   event_quota--;
+   }
}
 
mhi_recycle_ev_ring_element(mhi_cntrl, ev_ring);
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project

