Re: [PATCH 4/4] mtd: rawnand: meson: only initialize the RB completion once

2019-04-18 Thread Liang Yang

Hi Martin,

On 2019/4/19 3:44, Martin Blumenstingl wrote:

Hi Liang,

On Mon, Apr 15, 2019 at 8:04 AM Liang Yang  wrote:



On 2019/4/12 6:00, Martin Blumenstingl wrote:

Documentation/scheduler/completion.txt states:
Calling init_completion() on the same completion object twice is
most likely a bug as it re-initializes the queue to an empty queue and
enqueued tasks could get "lost" - use reinit_completion() in that case,
but be aware of other races.

Initialize nfc->completion in meson_nfc_probe using init_completion and
change the call in meson_nfc_queue_rb to reinit_completion so the logic
matches what the documentation suggests.

Signed-off-by: Martin Blumenstingl 
---
   drivers/mtd/nand/raw/meson_nand.c | 3 ++-
   1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/mtd/nand/raw/meson_nand.c 
b/drivers/mtd/nand/raw/meson_nand.c
index 57cc4bd3f665..ea57ddcec41e 100644
--- a/drivers/mtd/nand/raw/meson_nand.c
+++ b/drivers/mtd/nand/raw/meson_nand.c
@@ -400,7 +400,7 @@ static int meson_nfc_queue_rb(struct meson_nfc *nfc, int 
timeout_ms)
   cfg |= NFC_RB_IRQ_EN;
   writel(cfg, nfc->reg_base + NFC_REG_CFG);

- init_completion(&nfc->completion);
+ reinit_completion(&nfc->completion);

Tested-by: Liang Yang 
Acked-by: Liang Yang 

thank you for reviewing and testing my patches!

[...]

Tested-by: Liang Yang 
Acked-by: Liang Yang 

please consider the following note for future code-reviews:
most maintainers take the patch from patchwork and apply it to their git tree.
however, patchwork is not smart enough to detect when the same
Tested-by/Acked-by is sent multiple times.
this results in the same Tested-by/Acked-by being listed multiple
times in the final commit: [0]

what I do instead is to reply with one set of Tested-by/Acked-by
(below the author's Signed-off-by) which is then valid for the whole
patch.
There's no problem with having Tested-by and Acked-by at the same time;
the issue only shows up if you send Acked-by (or any other tag) for
the same patch multiple times.


Thanks. Well, I know about it now.



Have a great day!
Regards,
Martin


[0] 
https://git.kernel.org/pub/scm/linux/kernel/git/mtd/linux.git/commit/?h=nand/next&id=39e01956e2f70ff9f0e97db1a69c9847aa1d5d8b




Re: [PATCH v7 00/10] Add support for XO 1.75 to OLPC battery driver

2019-04-18 Thread Sebastian Reichel
Hi,

On Thu, Apr 18, 2019 at 04:46:45PM +0200, Lubomir Rintel wrote:
> This patch set modifies the OLPC battery driver so that it could eventually
> be used on an Arm-based OLPC XO 1.75 machine.
> 
> Compared to the previous version, it addresses review comments for the
> x86 platform parts from Thomas Gleixner. Details in individual patches'
> change logs.
> 
> I've failed to version the previous iteration of the set that should've
> been "v6" for the change logs to make sense. Sorry for that.
> 
> Thank you,
> Lubo

I queued up the complete patchset via an immutable branch based on
v5.1-rc1. X86 maintainers can pull this signed immutable branch as
required. Thanks for the patches,

-- Sebastian

The following changes since commit 9e98c678c2d6ae3a17cb2de55d17f69dddaa231b:

  Linux 5.1-rc1 (2019-03-17 14:22:26 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/sre/linux-power-supply.git 
tags/psy-olpc-1.75-battery-signed

for you to fetch changes up to 31e220877981d0005d0c4a76f68027a73131212b:

  power: supply: olpc_battery: Have the framework register sysfs files for us 
(2019-04-18 21:55:04 +0200)


Immutable branch between x86 and power-supply for OLPC

This immutable branch contains the changes required for OLPC
1.75 battery, which touches x86 and power-supply and is based
on v5.1-rc1.

Signed-off-by: Sebastian Reichel 


Lubomir Rintel (10):
  dt-bindings: olpc_battery: Add XO-1.5 battery
  x86/platform/olpc: Don't split string literals when fixing up the DT
  x86/platform/olpc: Trivial code move in DT fixup
  x86/platform/olpc: Use a correct version when making up a battery node
  power: supply: olpc_battery: Use DT to get battery version
  power: supply: olpc_battery: Move priv data to a struct
  power: supply: olpc_battery: Use devm_power_supply_register()
  power: supply: olpc_battery: Avoid using platform_info
  power: supply: olpc_battery: Add OLPC XO 1.75 support
  power: supply: olpc_battery: Have the framework register sysfs files for 
us

 .../bindings/power/supply/olpc_battery.txt |   2 +-
 arch/x86/platform/olpc/olpc_dt.c   | 101 
 drivers/power/supply/olpc_battery.c| 169 +
 3 files changed, 177 insertions(+), 95 deletions(-)


signature.asc
Description: PGP signature


Re: [PATCH v3] power: supply: add Ingenic JZ47xx battery driver.

2019-04-18 Thread Sebastian Reichel
Hi,

On Thu, Apr 18, 2019 at 08:24:04PM +0200, Artur Rojek wrote:
> Add a driver for battery present on Ingenic JZ47xx SoCs.
> 
> Signed-off-by: Artur Rojek 
> Reviewed-by: Jonathan Cameron 
> ---

Thanks, queued.

-- Sebastian

> 
> Changes:
> 
> v2: - rework the return logic in ingenic_battery_get_property,
> - make index offsets point forward in ingenic_battery_set_scale,
> - fix spacing around scale_raw[best_idx + 1]
> 
> v3: Add the missing MODULE_LICENSE & Co.
> 
>  drivers/power/supply/Kconfig   |  11 ++
>  drivers/power/supply/Makefile  |   1 +
>  drivers/power/supply/ingenic-battery.c | 184 +
>  3 files changed, 196 insertions(+)
>  create mode 100644 drivers/power/supply/ingenic-battery.c
> 
> diff --git a/drivers/power/supply/Kconfig b/drivers/power/supply/Kconfig
> index b6a4265bf0ec..81b0c20b24dd 100644
> --- a/drivers/power/supply/Kconfig
> +++ b/drivers/power/supply/Kconfig
> @@ -169,6 +169,17 @@ config BATTERY_COLLIE
> Say Y to enable support for the battery on the Sharp Zaurus
> SL-5500 (collie) models.
>  
> +config BATTERY_INGENIC
> + tristate "Ingenic JZ47xx SoCs battery driver"
> + depends on MIPS || COMPILE_TEST
> + depends on INGENIC_ADC
> + help
> +   Choose this option if you want to monitor battery status on
> +   Ingenic JZ47xx SoC based devices.
> +
> +   This driver can also be built as a module. If so, the module will be
> +   called ingenic-battery.
> +
>  config BATTERY_IPAQ_MICRO
>   tristate "iPAQ Atmel Micro ASIC battery driver"
>   depends on MFD_IPAQ_MICRO
> diff --git a/drivers/power/supply/Makefile b/drivers/power/supply/Makefile
> index 1cdce3ded839..976dc1698134 100644
> --- a/drivers/power/supply/Makefile
> +++ b/drivers/power/supply/Makefile
> @@ -34,6 +34,7 @@ obj-$(CONFIG_BATTERY_PMU)   += pmu_battery.o
>  obj-$(CONFIG_BATTERY_OLPC)   += olpc_battery.o
>  obj-$(CONFIG_BATTERY_TOSA)   += tosa_battery.o
>  obj-$(CONFIG_BATTERY_COLLIE) += collie_battery.o
> +obj-$(CONFIG_BATTERY_INGENIC)	+= ingenic-battery.o
>  obj-$(CONFIG_BATTERY_IPAQ_MICRO) += ipaq_micro_battery.o
>  obj-$(CONFIG_BATTERY_WM97XX) += wm97xx_battery.o
>  obj-$(CONFIG_BATTERY_SBS)+= sbs-battery.o
> diff --git a/drivers/power/supply/ingenic-battery.c 
> b/drivers/power/supply/ingenic-battery.c
> new file mode 100644
> index ..35816d4b3012
> --- /dev/null
> +++ b/drivers/power/supply/ingenic-battery.c
> @@ -0,0 +1,184 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Battery driver for the Ingenic JZ47xx SoCs
> + * Copyright (c) 2019 Artur Rojek 
> + *
> + * based on drivers/power/supply/jz4740-battery.c
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +struct ingenic_battery {
> + struct device *dev;
> + struct iio_channel *channel;
> + struct power_supply_desc desc;
> + struct power_supply *battery;
> + struct power_supply_battery_info info;
> +};
> +
> +static int ingenic_battery_get_property(struct power_supply *psy,
> + enum power_supply_property psp,
> + union power_supply_propval *val)
> +{
> + struct ingenic_battery *bat = power_supply_get_drvdata(psy);
> + struct power_supply_battery_info *info = &bat->info;
> + int ret;
> +
> + switch (psp) {
> + case POWER_SUPPLY_PROP_HEALTH:
> + ret = iio_read_channel_processed(bat->channel, &val->intval);
> + val->intval *= 1000;
> + if (val->intval < info->voltage_min_design_uv)
> + val->intval = POWER_SUPPLY_HEALTH_DEAD;
> + else if (val->intval > info->voltage_max_design_uv)
> + val->intval = POWER_SUPPLY_HEALTH_OVERVOLTAGE;
> + else
> + val->intval = POWER_SUPPLY_HEALTH_GOOD;
> + return ret;
> + case POWER_SUPPLY_PROP_VOLTAGE_NOW:
> + ret = iio_read_channel_processed(bat->channel, &val->intval);
> + val->intval *= 1000;
> + return ret;
> + case POWER_SUPPLY_PROP_VOLTAGE_MIN_DESIGN:
> + val->intval = info->voltage_min_design_uv;
> + return 0;
> + case POWER_SUPPLY_PROP_VOLTAGE_MAX_DESIGN:
> + val->intval = info->voltage_max_design_uv;
> + return 0;
> + default:
> + return -EINVAL;
> + };
> +}
> +
> +/* Set the most appropriate IIO channel voltage reference scale
> + * based on the battery's max voltage.
> + */
> +static int ingenic_battery_set_scale(struct ingenic_battery *bat)
> +{
> + const int *scale_raw;
> + int scale_len, scale_type, best_idx = -1, best_mV, max_raw, i, ret;
> + u64 max_mV;
> +
> + ret = iio_read_max_channel_raw(bat->channel, &scale_raw);
> + if (ret) {
> + dev_err(bat->dev, "Unable to read max raw channel value\n");
> + return ret;
> + }
> +
> + ret = 

Re: [PATCH v2 1/4] iio: inkern: API for reading available iio channel attribute values

2019-04-18 Thread Sebastian Reichel
Hi,

On Sun, Apr 14, 2019 at 11:34:54AM +0100, Jonathan Cameron wrote:
> On Sun, 24 Mar 2019 15:27:25 +
> Jonathan Cameron  wrote:
> 
> > On Sat, 23 Mar 2019 18:28:06 +0100
> > Artur Rojek  wrote:
> > 
> > > Extend the inkern API with a function for reading available
> > > attribute values of iio channels.
> > > 
> > > Signed-off-by: Artur Rojek   
> > If this goes through a route other than IIO (otherwise
> > I'll just add a signed-off-by...)
> > 
> > Acked-by: Jonathan Cameron 
> 
> Applied to the ib-jz47xx-prereq branch of
> git://git.kernel.org/pub/scm/linux/kernel/git/jic23/iio.git
> 
> I'll merge that into the togreg branch as well shortly.

Thanks, I pulled that branch into power-supply next for the
other two patches.

-- Sebastian


signature.asc
Description: PGP signature


Re: [PATCH v2 3/4] dt-bindings: power: supply: Add docs for Ingenic JZ47xx SoCs battery.

2019-04-18 Thread Sebastian Reichel
Hi,

On Sun, Mar 24, 2019 at 03:30:18PM +, Jonathan Cameron wrote:
> On Sat, 23 Mar 2019 18:28:08 +0100
> Artur Rojek  wrote:
> 
> > Add documentation for the ingenic-battery driver, used on JZ47xx SoCs.
> > 
> > Signed-off-by: Artur Rojek 
> Looks fine to me.
> 
> Acked-by: Jonathan Cameron 

Thanks, queued.

-- Sebastian

> 
> > ---
> > 
> > Changes:
> > 
> > v2: no change
> > 
> >  .../bindings/power/supply/ingenic,battery.txt | 31 +++
> >  1 file changed, 31 insertions(+)
> >  create mode 100644 
> > Documentation/devicetree/bindings/power/supply/ingenic,battery.txt
> > 
> > diff --git 
> > a/Documentation/devicetree/bindings/power/supply/ingenic,battery.txt 
> > b/Documentation/devicetree/bindings/power/supply/ingenic,battery.txt
> > new file mode 100644
> > index ..66430bf73815
> > --- /dev/null
> > +++ b/Documentation/devicetree/bindings/power/supply/ingenic,battery.txt
> > @@ -0,0 +1,31 @@
> > +* Ingenic JZ47xx battery bindings
> > +
> > +Required properties:
> > +
> > +- compatible: Must be "ingenic,jz4740-battery".
> > +- io-channels: phandle and IIO specifier pair to the IIO device.
> > +  Format described in iio-bindings.txt.
> > +- monitored-battery: phandle to a "simple-battery" compatible node.
> > +
> > +The "monitored-battery" property must be a phandle to a node using the 
> > format
> > +described in battery.txt, with the following properties being required:
> > +
> > +- voltage-min-design-microvolt: Drained battery voltage.
> > +- voltage-max-design-microvolt: Fully charged battery voltage.
> > +
> > +Example:
> > +
> > +#include 
> > +
> > +simple_battery: battery {
> > +   compatible = "simple-battery";
> > > +   voltage-min-design-microvolt = <3600000>;
> > > +   voltage-max-design-microvolt = <4200000>;
> > +};
> > +
> > +ingenic_battery {
> > +   compatible = "ingenic,jz4740-battery";
> > > +   io-channels = <&adc INGENIC_ADC_BATTERY>;
> > +   io-channel-names = "battery";
> > > +   monitored-battery = <&simple_battery>;
> > +};
> 


signature.asc
Description: PGP signature


[PATCH] PCI: keystone: Fix build error while only CONFIG_PCI_KEYSTONE is set

2019-04-18 Thread Yue Haibing
From: YueHaibing 

During randconfig builds, I occasionally run into an invalid configuration

drivers/pci/controller/dwc/pci-keystone.o: In function `ks_pcie_link_up':
pci-keystone.c:(.text+0x90): undefined reference to `__dw_pcie_read_dbi'
pci-keystone.c:(.text+0x90): relocation truncated to fit: R_AARCH64_CALL26 
against undefined symbol `__dw_pcie_read_dbi'
drivers/pci/controller/dwc/pci-keystone.o: In function `ks_pcie_v3_65_scan_bus':
pci-keystone.c:(.text+0x4f0): undefined reference to `__dw_pcie_write_dbi'
pci-keystone.c:(.text+0x4f0): relocation truncated to fit: R_AARCH64_CALL26 
against undefined symbol `__dw_pcie_write_dbi'

This happens when CONFIG_PCI_KEYSTONE is selected but CONFIG_PCIE_DW
is not set. Make PCI_KEYSTONE select PCIE_DW to fix it.

Reported-by: Hulk Robot 
Fixes: 5709114f0a97 ("PCI: keystone: Add support for PCIe EP in AM654x 
Platforms")
Signed-off-by: YueHaibing 
---
 drivers/pci/controller/dwc/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/pci/controller/dwc/Kconfig 
b/drivers/pci/controller/dwc/Kconfig
index b450ad2..641fa0f 100644
--- a/drivers/pci/controller/dwc/Kconfig
+++ b/drivers/pci/controller/dwc/Kconfig
@@ -105,6 +105,7 @@ config PCIE_SPEAR13XX
 config PCI_KEYSTONE
bool "TI Keystone PCIe controller"
depends on ARCH_KEYSTONE || ARCH_K3 || ((ARM || ARM64) && COMPILE_TEST)
+   select PCIE_DW
help
  Say Y here if you want to enable PCI controller support on Keystone
  SoCs. The PCI controller on Keystone is based on DesignWare hardware
-- 
2.7.4




[PATCH net-next 05/12] net: hns3: refine tx timeout count handle

2019-04-18 Thread Huazhong Tan
From: Jian Shen 

In the current code, tx_timeout_count is used before it is incremented,
so the printed count is still 0 when the first tx timeout happens, e.g.
"hns3 :7d:00.3 eth3: tx_timeout count: 0, queue id: 0, SW_NTU:
 0xa6, SW_NTC: 0xa4, HW_HEAD: 0xa4, HW_TAIL: 0xa6, INT: 0x1"

tx_timeout_count should be updated before it is used.

Signed-off-by: Jian Shen 
Signed-off-by: Peng Li 
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_enet.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
index 9233438..193994c 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_enet.c
@@ -1655,6 +1655,8 @@ static bool hns3_get_tx_timeo_queue_info(struct 
net_device *ndev)
return false;
}
 
+   priv->tx_timeout_count++;
+
tx_ring = priv->ring_data[timeout_queue].ring;
 
hw_head = readl_relaxed(tx_ring->tqp->io_base +
@@ -1682,8 +1684,6 @@ static void hns3_nic_net_timeout(struct net_device *ndev)
if (!hns3_get_tx_timeo_queue_info(ndev))
return;
 
-   priv->tx_timeout_count++;
-
/* request the reset, and let the hclge to determine
 * which reset level should be done
 */
-- 
2.7.4



[PATCH net-next 09/12] net: hns3: add support for dump ncl config by debugfs

2019-04-18 Thread Huazhong Tan
From: Weihang Li 

This patch allows users to dump the content of NCL_CONFIG by using a
debugfs command.
Command format:
echo dump ncl_config <offset> <length> > cmd
It will print as follows:
hns3 :7d:00.0: offset |data
hns3 :7d:00.0: 0x | 0x0020
hns3 :7d:00.0: 0x0004 | 0x0400
hns3 :7d:00.0: 0x0008 | 0x08020401

Signed-off-by: Weihang Li 
Signed-off-by: Peng Li 
Signed-off-by: Huazhong Tan 
---
 drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c |  1 +
 .../net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h |  3 +
 .../ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c | 66 ++
 3 files changed, 70 insertions(+)

diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c
index c46da81..5ccb701 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3_debugfs.c
@@ -248,6 +248,7 @@ static void hns3_dbg_help(struct hnae3_handle *h)
dev_info(>pdev->dev, "dump qos buf cfg\n");
dev_info(>pdev->dev, "dump mng tbl\n");
dev_info(>pdev->dev, "dump reset info\n");
+   dev_info(>pdev->dev, "dump ncl_config  (in hex)\n");
 
memset(printf_buf, 0, HNS3_DBG_BUF_LEN);
strncat(printf_buf, "dump reg [[bios common] [ssu ]",
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
index 3714733..124d78e 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_cmd.h
@@ -237,6 +237,9 @@ enum hclge_opcode_type {
/* Led command */
HCLGE_OPC_LED_STATUS_CFG= 0xB000,
 
+   /* NCL config command */
+   HCLGE_OPC_QUERY_NCL_CONFIG  = 0x7011,
+
/* SFP command */
HCLGE_OPC_SFP_GET_SPEED = 0x7104,
 
diff --git a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c 
b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c
index d9782c7..6a774cc 100644
--- a/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c
+++ b/drivers/net/ethernet/hisilicon/hns3/hns3pf/hclge_debugfs.c
@@ -921,6 +921,69 @@ static void hclge_dbg_dump_rst_info(struct hclge_dev *hdev)
 hdev->rst_stats.reset_cnt);
 }
 
+/* hclge_dbg_dump_ncl_config: print specified range of NCL_CONFIG file
+ * @hdev: pointer to struct hclge_dev
+ * @cmd_buf: string that contains offset and length
+ */
+static void hclge_dbg_dump_ncl_config(struct hclge_dev *hdev, char *cmd_buf)
+{
+#define HCLGE_MAX_NCL_CONFIG_OFFSET4096
+#define HCLGE_MAX_NCL_CONFIG_LENGTH(20 + 24 * 4)
+#define HCLGE_CMD_DATA_NUM 6
+
+   struct hclge_desc desc[5];
+   u32 byte_offset;
+   int bd_num = 5;
+   int offset;
+   int length;
+   int data0;
+   int ret;
+   int i;
+   int j;
+
+   ret = sscanf(cmd_buf, "%x %x", &offset, &length);
+   if (ret != 2 || offset >= HCLGE_MAX_NCL_CONFIG_OFFSET ||
+   length > HCLGE_MAX_NCL_CONFIG_OFFSET - offset) {
+   dev_err(>pdev->dev, "Invalid offset or length.\n");
+   return;
+   }
+   if (offset < 0 || length <= 0) {
+   dev_err(>pdev->dev, "Non-positive offset or length.\n");
+   return;
+   }
+
+   dev_info(>pdev->dev, "offset |data\n");
+
+   while (length > 0) {
+   data0 = offset;
+   if (length >= HCLGE_MAX_NCL_CONFIG_LENGTH)
+   data0 |= HCLGE_MAX_NCL_CONFIG_LENGTH << 16;
+   else
+   data0 |= length << 16;
+   ret = hclge_dbg_cmd_send(hdev, desc, data0, bd_num,
+HCLGE_OPC_QUERY_NCL_CONFIG);
+   if (ret)
+   return;
+
+   byte_offset = offset;
+   for (i = 0; i < bd_num; i++) {
+   for (j = 0; j < HCLGE_CMD_DATA_NUM; j++) {
+   if (i == 0 && j == 0)
+   continue;
+
+   dev_info(>pdev->dev, "0x%04x | 0x%08x\n",
+byte_offset,
+le32_to_cpu(desc[i].data[j]));
+   byte_offset += sizeof(u32);
+   length -= sizeof(u32);
+   if (length <= 0)
+   return;
+   }
+   }
+   offset += HCLGE_MAX_NCL_CONFIG_LENGTH;
+   }
+}
+
 int hclge_dbg_run_cmd(struct hnae3_handle *handle, char *cmd_buf)
 {
struct hclge_vport *vport = hclge_get_vport(handle);
@@ -946,6 +1009,9 @@ int hclge_dbg_run_cmd(struct hnae3_handle *handle, char 
*cmd_buf)
hclge_dbg_dump_reg_cmd(hdev, cmd_buf);
} else if (strncmp(cmd_buf, "dump reset info", 15) == 0) {

Re: [PATCH v20 16/28] x86/sgx: Add provisioning

2019-04-18 Thread Huang, Kai
On Wed, 2019-04-17 at 13:39 +0300, Jarkko Sakkinen wrote:
> In order to provide a mechanism for delivering provisioning rights:
> 
> 1. Add a new device file /dev/sgx/provision that works as a token for
>allowing an enclave to have the provisioning privileges.
> 2. Add a new ioctl called SGX_IOC_ENCLAVE_SET_ATTRIBUTE that accepts the
>following data structure:
> 
>struct sgx_enclave_set_attribute {
>__u64 addr;
>__u64 attribute_fd;
>};
> 
> A daemon could sit on top of /dev/sgx/provision and send a file
> descriptor of this file to a process that needs to be able to provision
> enclaves.
> 
> The way this API is used is straightforward. Let's assume that dev_fd is
> a handle to /dev/sgx/enclave and prov_fd is a handle to
> /dev/sgx/provision.  You would allow SGX_IOC_ENCLAVE_CREATE to
> initialize an enclave with the PROVISIONKEY attribute by
> 
> params.addr = ;
> params.token_fd = prov_fd;
> 
Should be params.attribute_fd = prov_fd;

Thanks,
-Kai

Re: [PATCH v2] moduleparam: Save information about built-in modules in separate file

2019-04-18 Thread Masahiro Yamada
On Fri, Apr 19, 2019 at 12:36 AM Jessica Yu  wrote:
>
> +++ Masahiro Yamada [19/04/19 00:26 +0900]:
> >On Thu, Apr 18, 2019 at 10:52 PM Jessica Yu  wrote:
> >>
> >> +++ Masahiro Yamada [18/04/19 20:10 +0900]:
> >> >On Sat, Apr 6, 2019 at 9:15 PM Alexey Gladkov  
> >> >wrote:
> >> >>
> >> >> Problem:
> >> >>
> >> >> When a kernel module is compiled as a separate module, some important
> >> >> information about the kernel module is available via .modinfo section of
> >> >> the module.  In contrast, when the kernel module is compiled into the
> >> >> kernel, that information is not available.
> >> >>
> >> >> Information about built-in modules is necessary in the following cases:
> >> >>
> >> >> 1. When it is necessary to find out what additional parameters can be
> >> >> passed to the kernel at boot time.
> >> >>
> >> >> 2. When you need to know which module names and their aliases are in
> >> >> the kernel. This is very useful for creating an initrd image.
> >> >>
> >> >> Proposal:
> >> >>
> >> >> The proposed patch does not remove .modinfo section with module
> >> >> information from the vmlinux at the build time and saves it into a
> >> >> separate file after kernel linking. So, the kernel does not increase in
> >> >> size and no additional information remains in it. Information is stored
> >> >> in the same format as in the separate modules (null-terminated string
> >> >> array). Because the .modinfo section is already exported with a separate
> >> >> modules, we are not creating a new API.
> >> >>
> >> >> It can be easily read in the userspace:
> >> >>
> >> >> $ tr '\0' '\n' < kernel.builtin
> >> >> ext4.softdep=pre: crc32c
> >> >> ext4.license=GPL
> >> >> ext4.description=Fourth Extended Filesystem
> >> >> ext4.author=Remy Card, Stephen Tweedie, Andrew Morton, Andreas Dilger, 
> >> >> Theodore Ts'o and others
> >> >> ext4.alias=fs-ext4
> >> >> ext4.alias=ext3
> >> >> ext4.alias=fs-ext3
> >> >> ext4.alias=ext2
> >> >> ext4.alias=fs-ext2
> >> >> md_mod.alias=block-major-9-*
> >> >> md_mod.alias=md
> >> >> md_mod.description=MD RAID framework
> >> >> md_mod.license=GPL
> >> >> md_mod.parmtype=create_on_open:bool
> >> >> md_mod.parmtype=start_dirty_degraded:int
> >> >> ...
> >> >>
> >> >> v2:
> >> >>  * Extract modinfo from vmlinux.o as suggested by Masahiro Yamada;
> >> >>  * Rename output file to kernel.builtin;
> >> >
> >> >Sorry, I do not get why you renamed
> >> >"kernel.builtin.modinfo" to "kernel.builtin".
> >> >
> >> >If you drop "modinfo", we do not understand
> >> >what kind information is contained in it.
> >> >
> >> >I think "kernel" and "builtin" have
> >> >a quite similar meaning here.
> >> >
> >> >How about "builtin.modinfo" for example?
> >> >
> >> >
> >> >It is shorter, and it is clear enough
> >> >that it contains module_info.
> >>
> >> I agree that the name kernel.builtin is unclear in what kind of
> >> information it contains. Apologies for not having clarified this in
> >> the previous review.
> >>
> >> Since kbuild already produces "modules.order" and "modules.builtin"
> >> files, why not just name it "modules.builtin.modinfo" to keep the
> >> names consistent with what is already there?
> >
> >
> >Is it consistent?
> >
> >If we had "modules.order" and "modules.builtin.order" there,
> >I would agree with "modules.builtin.modinfo",
> >and also "modules.alias" vs "modules.builtin.alias".
> >
> >
> >We already have "modules.builtin", and probably impossible
> >to rename it, so we cannot keep consistency in any way.
> >
> >
> >"modules.builtin" is a weird name since
> >it actually contains "order", but its extension
> >does not express what kind of information is in it.
> >Hence, I doubt "modules.builtin" is a good precedent.
> >
> >IMHO, "modules" and "builtin" are opposite
> >to each other. "modules.builtin" sounds iffy to me.
>
> I've always interpreted "modules.builtin" to mean "this is a list of
> modules that have been built-in into the kernel", no? So I thought the
> name made sense.

OK, I see.

> But you are the maintainer, so I do not have a strong
> opinion on this either way :-)

My idea was to use
'modules.'  vs  'builtin.'
instead of
'modules.'  vs  'modules.builtin.'

I am slightly in favor of the former
since it is shorter and
(I hope) still clear enough.

If this naming is not nice for external projects such as kmod,
please speak up.


(BTW, I am thinking of renaming 'modules.builtin' into 'builtin.order'
for kbuild internal. We cannot change that for the installation area, though.)

>
> Thanks,
>
> Jessica



-- 
Best Regards
Masahiro Yamada


Re: net: ax25: %x specifier misuse in kernel?

2019-04-18 Thread Eric Dumazet



On 04/18/2019 07:01 PM, PlusOneSecond wrote:
> In ax25_info_show of af_ax25.c:1891, linux-5.1.
> The pointer ax25 is cast to long type to print out.
> 
> Why does it print the pointer 'ax25' using %8.8lx rather than %p?
> If it really wants to print the value of ax25, it should use %px.

I guess nobody cared enough about this security issue.

Please send a patch, thanks.

> 
> Also, I scanned the kernel code and noticed that most of the pointers cast to 
> long or unsigned long for printing are marked with __iomem.
> Is that a misuse of %x?
> 
> 
> Thanks.
> 


Re: kernel BUG at kernel/cred.c:434!

2019-04-18 Thread Yang Yingliang




On 2019/4/19 10:04, Paul Moore wrote:

On Wed, Apr 17, 2019 at 10:50 PM Yang Yingliang
 wrote:

On 2019/4/18 8:24, Casey Schaufler wrote:

On 4/17/2019 4:39 PM, Paul Moore wrote:

Since it looks like all three LSMs which implement the setprocattr
hook are vulnerable I'm open to the idea that proc_pid_attr_write() is
a better choice for the cred != read_cred check, but I would want to
make sure John and Casey are okay with that.

John?

Casey?

I'm fine with the change going into proc_pid_attr_write().

The cred != real_cred checking is not enough.

Consider this situation, when doing override, cred, real_cred and
new_cred are all same:

after override_creds()    cred == real_cred == new1_cred

I'm sorry, you've lost me.  After override_creds() returns
current->cred and current->real_cred are not going to be the same,
yes?
It's possible the new cred is equal to current->real_cred and
current->cred,
so after override_creds(), they have the same value.



after prepare_creds()     new2_cred
after commit_creds()      because the check is false, cred ==
real_cred == new2_cred
after revert_creds()      cred == new1_cred, real_cred == new2_cred

It will cause cred != real_cred finally.





[PATCH] kvm: x86: refine kvm_get_arch_capabilities()

2019-04-18 Thread Xiaoyao Li
1. Use X86_FEATURE_ARCH_CAPABILITIES to check for the existence of
MSR_IA32_ARCH_CAPABILITIES, avoiding rdmsrl_safe().

2. Since kvm_get_arch_capabilities() is only used in this file, make
it static.

Signed-off-by: Xiaoyao Li 
---
 arch/x86/include/asm/kvm_host.h | 1 -
 arch/x86/kvm/x86.c  | 8 
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index a9d03af34030..d4ae67870764 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1526,7 +1526,6 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long 
ipi_bitmap_low,
unsigned long ipi_bitmap_high, u32 min,
unsigned long icr, int op_64_bit);
 
-u64 kvm_get_arch_capabilities(void);
 void kvm_define_shared_msr(unsigned index, u32 msr);
 int kvm_set_shared_msr(unsigned index, u64 val, u64 mask);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a0d1fc80ac5a..ba8e269a8cd2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1205,11 +1205,12 @@ static u32 msr_based_features[] = {
 
 static unsigned int num_msr_based_features;
 
-u64 kvm_get_arch_capabilities(void)
+static u64 kvm_get_arch_capabilities(void)
 {
-   u64 data;
+   u64 data = 0;
 
-   rdmsrl_safe(MSR_IA32_ARCH_CAPABILITIES, &data);
+   if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
+   rdmsrl(MSR_IA32_ARCH_CAPABILITIES, data);
 
/*
 * If we're doing cache flushes (either "always" or "cond")
@@ -1225,7 +1226,6 @@ u64 kvm_get_arch_capabilities(void)
 
return data;
 }
-EXPORT_SYMBOL_GPL(kvm_get_arch_capabilities);
 
 static int kvm_get_msr_feature(struct kvm_msr_entry *msr)
 {
-- 
2.19.1



Re: [RFC PATCH v3] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER

2019-04-18 Thread Kairui Song
On Fri, Apr 19, 2019 at 8:58 AM Josh Poimboeuf  wrote:
>
> I still don't like using regs->bp because it results in different code
> paths for FP and ORC.  In the FP case, the regs are treated like real
> regs even though they're fake.
>
> Something like the below would be much simpler.  Would this work?  I don't
> know if any other code relies on the fake regs->bp or regs->sp.

Works perfectly. My only concern is that the FP path used to work very
well; I'm not sure it's a good idea to change it, and this may add some
extra overhead to the FP path.

>
> (BTW, tomorrow is a US holiday so I may not be very responsive until
> Monday.)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index de1a924a4914..f315425d8468 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2382,6 +2382,15 @@ void arch_perf_update_userpage(struct perf_event 
> *event,
> cyc2ns_read_end();
>  }
>
> +/*
> + * Determine whether the regs were taken from an irq/exception handler rather
> + * than from perf_arch_fetch_caller_regs().
> + */
> +static bool perf_hw_regs(struct pt_regs *regs)
> +{
> +   return regs->flags & X86_EFLAGS_FIXED;
> +}
> +
>  void
>  perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs 
> *regs)
>  {
> @@ -2393,11 +2402,15 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx 
> *entry, struct pt_regs *re
> return;
> }
>
> -   if (perf_callchain_store(entry, regs->ip))
> -   return;
> +   if (perf_hw_regs(regs)) {
> +   if (perf_callchain_store(entry, regs->ip))
> +   return;
> +   unwind_start(&state, current, regs, NULL);
> +   } else {
> +   unwind_start(&state, current, NULL, (void *)regs->sp);
> +   }
>
> -   for (unwind_start(&state, current, regs, NULL); !unwind_done(&state);
> -unwind_next_frame(&state)) {
> +   for (; !unwind_done(&state); unwind_next_frame(&state)) {
> addr = unwind_get_return_address(&state);
> if (!addr || perf_callchain_store(entry, addr))
> return;
> diff --git a/arch/x86/include/asm/perf_event.h 
> b/arch/x86/include/asm/perf_event.h
> index 04768a3a5454..1392d5e6e8d6 100644
> --- a/arch/x86/include/asm/perf_event.h
> +++ b/arch/x86/include/asm/perf_event.h
> @@ -308,14 +308,9 @@ extern unsigned long perf_misc_flags(struct pt_regs 
> *regs);
>   */
>  #define perf_arch_fetch_caller_regs(regs, __ip){   \
> (regs)->ip = (__ip);\
> -   (regs)->bp = caller_frame_pointer();\
> +   (regs)->sp = (unsigned long)__builtin_frame_address(0); \
> (regs)->cs = __KERNEL_CS;   \
> regs->flags = 0;\
> -   asm volatile(   \
> -   _ASM_MOV "%%"_ASM_SP ", %0\n"   \
> -   : "=m" ((regs)->sp) \
> -   :: "memory" \
> -   );  \
>  }
>
>  struct perf_guest_switch_msr {
> diff --git a/arch/x86/include/asm/stacktrace.h 
> b/arch/x86/include/asm/stacktrace.h
> index d6d758a187b6..a8d0cdf48616 100644
> --- a/arch/x86/include/asm/stacktrace.h
> +++ b/arch/x86/include/asm/stacktrace.h
> @@ -100,19 +100,6 @@ struct stack_frame_ia32 {
>  u32 return_address;
>  };
>
> -static inline unsigned long caller_frame_pointer(void)
> -{
> -   struct stack_frame *frame;
> -
> -   frame = __builtin_frame_address(0);
> -
> -#ifdef CONFIG_FRAME_POINTER
> -   frame = frame->next_frame;
> -#endif
> -
> -   return (unsigned long)frame;
> -}
> -
>  void show_opcodes(struct pt_regs *regs, const char *loglvl);
>  void show_ip(struct pt_regs *regs, const char *loglvl);
>  #endif /* _ASM_X86_STACKTRACE_H */
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index f3864e1c5569..0f560069aeec 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1062,7 +1062,7 @@ static inline void perf_arch_fetch_caller_regs(struct 
> pt_regs *regs, unsigned lo
>   * the nth caller. We only need a few of the regs:
>   * - ip for PERF_SAMPLE_IP
>   * - cs for user_mode() tests
> - * - bp for callchains
> + * - sp for callchains
>   * - eflags, for future purposes, just in case
>   */
>  static inline void perf_fetch_caller_regs(struct pt_regs *regs)

-- 
Best Regards,
Kairui Song


Re: [PATCH v2] panic: add an option to replay all the printk message in buffer

2019-04-18 Thread Feng Tang
On Thu, Apr 18, 2019 at 01:01:52PM +0200, Petr Mladek wrote:
> On Thu 2019-04-18 17:00:44, Feng Tang wrote:
> > Hi Petr,
> > 
> > On Thu, Apr 18, 2019 at 09:45:52AM +0200, Petr Mladek wrote:
> > > On Thu 2019-04-18 09:00:14, Sergey Senozhatsky wrote:
> > > > I think that PANIC_PRINT_ALL_PRINTK_MSG is a debugging option; a quite
> > > > specific one. So people who ask the kernel for PANIC_PRINT_ALL_PRINTK_MSG
> > > > know what they are doing, and we probably will not confuse anyone.
> > > > After all, we don't print any headers when we ftrace_dump() or imitate
> > > > sysrq via sysrq_timer_list_show(), or for any other 
> > > > panic_print_sys_info()
> > > > printouts. So it's OK to just do the simple thing for
> > > > PANIC_PRINT_ALL_PRINTK_MSG.
> > > 
> > > The following functions are currently called from panic_print_sys_info():
> > > 
> > > + show_state():
> > >   printk(KERN_INFO
> > >   "  taskPC stack   pid father\n");
> > > + show_mem():
> > >   printk("Mem-Info:\n");
> > > 
> > > + sysrq_timer_list_show()
> > >   no global header; but each section can be easily distinguished
> > >   because there are several static strings that explains the
> > >   content
> > > 
> > > + debug_show_all_locks()
> > >   pr_warn("\nShowing all locks held in the system:\n");
> > > 
> > > + ftrace_dump():
> > >   printk(KERN_TRACE "Dumping ftrace buffer:\n");
> > > 
> > > 
> > > The person that enabled the debugging option might know what it did
> > > when processing the log the very same day. It might be less clear
> > > for others reading the log.
> > > 
> > > Also it still might be convenient to find the beginning easily.
> > > Or it might help to orient oneself when several test runs
> > > (overnight tests) are squashed in a single file. I see
> > > such logs pretty often.
> > 
> > Thanks for checking all the info. For the note, how about adding
> > something like this inside panic_print_sys_info()?
> > 
> > if (panic_print & PANIC_PRINT_ALL_PRINTK_MSG) {
> > printk("\nprintk: will replay all the messages in buffer:\n");
> > console_flush_on_panic(CONSOLE_FLUSH_ALL);
> > }
> 
> This will store the message at the end of the buffer. It will repeat
> the message once again when replaying the log. I personally think
> that it is worse than printing nothing.

Got it :) 

> 
> 
> > If you are all fine with it, I'll go post a v3, thanks,
> 
> I suggest waiting a few days. We might get an opinion from someone
> else...
> 
> Then send whatever approach makes the most sense to you. In fact, you
> are the 3rd person in this triangle. I believe that we will
> respect it. The last many mails have just been bikeshedding.
> 
> Just one thing. Please, use "replay" in the function or parameter
> name. The effect of console_flush_on_panic(CONSOLE_FLUSH_ALL)
> is really hard to guess from the name.

Ok. thanks

- Feng


Re: kernel BUG at kernel/cred.c:434!

2019-04-18 Thread Paul Moore
On Wed, Apr 17, 2019 at 10:50 PM Yang Yingliang
 wrote:
> On 2019/4/18 8:24, Casey Schaufler wrote:
> > On 4/17/2019 4:39 PM, Paul Moore wrote:
> >>
> >> Since it looks like all three LSMs which implement the setprocattr
> >> hook are vulnerable I'm open to the idea that proc_pid_attr_write() is
> >> a better choice for the cred != read_cred check, but I would want to
> >> make sure John and Casey are okay with that.
> >>
> >> John?
> >>
> >> Casey?
> >
> > I'm fine with the change going into proc_pid_attr_write().
>
> The cred != real_cred check is not enough.
>
> Consider this situation: when doing an override, cred, real_cred and
> new_cred are all the same:
>
> after override_creds()cred == real_cred == new1_cred

I'm sorry, you've lost me.  After override_creds() returns,
current->cred and current->real_cred are not going to be the same,
yes?

> after prepare_creds() new2_cred
> after commit_creds() because the check is false, so cred ==
> real_cred == new2_cred
> after revert_creds()cred == new1_cred, real_cred == new2_cred
>
> It will cause cred != real_cred finally.

-- 
paul moore
www.paul-moore.com


net: ax25: %x specifier misuse in kernel?

2019-04-18 Thread PlusOneSecond
In ax25_info_show of af_ax25.c:1891, linux-5.1,
the pointer ax25 is cast to long type to be printed.

Why does it print the pointer 'ax25' using %8.8lx rather than %p?
If it really wants to print the value of ax25, it should use %px.

Also, I scanned the kernel code and noticed that most of the pointers cast to long
or unsigned long type for printing are marked with __iomem.
Is this a misuse of %x?


Thanks.

Re: [PATCH v7 00/10] vsprintf: Prevent silent crashes and consolidate error handling

2019-04-18 Thread Sergey Senozhatsky
On (04/17/19 13:53), Petr Mladek wrote:
> Crash in vsprintf() might be silent when it happens under logbuf_lock
> in vprintk_emit(). This patch set prevents most of the crashes by probing
> the address. The check is done only by %s and some %p* specifiers that need
> to dereference the address.
>
> Only the first byte of the address is checked to keep it simple. It should
> be enough to catch most problems.
>
> The check is explicitly done in each function that does the dereference.
> It helps to avoid the questionable strchr() of affected specifiers. This
> change motivated me to do some preparation patches that consolidated
> the error handling and cleaned the code a bit.

The patch set looks OK to me.

I got confused by the 'pC?' error string, but once you start looking
at it as a regex (? - zero or one occurrence) things look OK.
Regex in dmesg/serial output might be something very new to people;
stack traces, after all, are a rather common error reporting mechanism.
So the previous "WARN_ON() + exact unrecognized fmt[N] char" was not
totally awful or wrong (well, it was, before we introduced printk_safe()),
but I don't have strong objections against that new regex thing.

FWIW,
Reviewed-by: Sergey Senozhatsky 

-ss


[PATCH] tools/lib/traceevent: Remove hardcoded install paths from pkg-config file

2019-04-18 Thread Steven Rostedt


From: Tzvetomir Stoyanov 

Install directories of header and library files are hardcoded in the pkg-config
template file. They must be configurable; the Makefile should set them at the
compilation / install stage.

Link: http://lkml.kernel.org/r/20190329144546.5819-1-tstoya...@vmware.com

Signed-off-by: Tzvetomir Stoyanov 
Signed-off-by: Steven Rostedt (VMware) 
---
 tools/lib/traceevent/Makefile  | 13 +
 tools/lib/traceevent/libtraceevent.pc.template |  4 ++--
 2 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/tools/lib/traceevent/Makefile b/tools/lib/traceevent/Makefile
index 941761d9923d..34cf33a4f001 100644
--- a/tools/lib/traceevent/Makefile
+++ b/tools/lib/traceevent/Makefile
@@ -50,6 +50,9 @@ man_dir = $(prefix)/share/man
 man_dir_SQ = '$(subst ','\'',$(man_dir))'
 pkgconfig_dir ?= $(word 1,$(shell $(PKG_CONFIG)\
--variable pc_path pkg-config | tr ":" " "))
+includedir_relative = traceevent
+includedir = $(prefix)/include/$(includedir_relative)
+includedir_SQ = '$(subst ','\'',$(includedir))'
 
 export man_dir man_dir_SQ INSTALL
 export DESTDIR DESTDIR_SQ
@@ -279,6 +282,8 @@ define do_install_pkgconfig_file
cp -f ${PKG_CONFIG_FILE}.template ${PKG_CONFIG_FILE};   
\
sed -i "s|INSTALL_PREFIX|${1}|g" ${PKG_CONFIG_FILE};
\
sed -i "s|LIB_VERSION|${EVENT_PARSE_VERSION}|g" 
${PKG_CONFIG_FILE}; \
+   sed -i "s|LIB_DIR|${libdir}|g" ${PKG_CONFIG_FILE}; \
+   sed -i "s|HEADER_DIR|$(includedir)|g" ${PKG_CONFIG_FILE}; \
$(call do_install,$(PKG_CONFIG_FILE),$(pkgconfig_dir),644); 
\
else
\
(echo Failed to locate pkg-config directory) 1>&2;  
\
@@ -300,10 +305,10 @@ install_pkgconfig:
 
 install_headers:
$(call QUIET_INSTALL, headers) \
-   $(call 
do_install,event-parse.h,$(prefix)/include/traceevent,644); \
-   $(call 
do_install,event-utils.h,$(prefix)/include/traceevent,644); \
-   $(call 
do_install,trace-seq.h,$(prefix)/include/traceevent,644); \
-   $(call do_install,kbuffer.h,$(prefix)/include/traceevent,644)
+   $(call 
do_install,event-parse.h,$(DESTDIR)$(includedir_SQ),644); \
+   $(call 
do_install,event-utils.h,$(DESTDIR)$(includedir_SQ),644); \
+   $(call do_install,trace-seq.h,$(DESTDIR)$(includedir_SQ),644); \
+   $(call do_install,kbuffer.h,$(DESTDIR)$(includedir_SQ),644)
 
 install: install_lib
 
diff --git a/tools/lib/traceevent/libtraceevent.pc.template 
b/tools/lib/traceevent/libtraceevent.pc.template
index 42e4d6cb6b9e..86384fcd57f1 100644
--- a/tools/lib/traceevent/libtraceevent.pc.template
+++ b/tools/lib/traceevent/libtraceevent.pc.template
@@ -1,6 +1,6 @@
 prefix=INSTALL_PREFIX
-libdir=${prefix}/lib64
-includedir=${prefix}/include/traceevent
+libdir=LIB_DIR
+includedir=HEADER_DIR
 
 Name: libtraceevent
 URL: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
-- 
2.20.1
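For illustration, the substitution performed by the do_install_pkgconfig_file rule above can be reproduced by hand; the prefix and directory paths below are example values, not project defaults:

```shell
# recreate the template -> .pc substitution done by the Makefile rule
cat > libtraceevent.pc.template <<'EOF'
prefix=INSTALL_PREFIX
libdir=LIB_DIR
includedir=HEADER_DIR
EOF

cp -f libtraceevent.pc.template libtraceevent.pc
sed -i "s|INSTALL_PREFIX|/usr/local|g" libtraceevent.pc
sed -i "s|LIB_DIR|/usr/local/lib64|g" libtraceevent.pc
sed -i "s|HEADER_DIR|/usr/local/include/traceevent|g" libtraceevent.pc

cat libtraceevent.pc
```

The resulting .pc file carries the configured paths instead of the previously hardcoded ${prefix}/lib64 and ${prefix}/include/traceevent.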



Re: linux-next: build warning after merge of the block tree

2019-04-18 Thread Chao Yu
On 2019/4/18 22:01, Jaegeuk Kim wrote:
> On 04/18, Chao Yu wrote:
>> On 2019/4/17 22:03, Jaegeuk Kim wrote:
>>> On 04/17, Chao Yu wrote:
 Hi Jaegeuk,

 On 2019/4/17 10:31, Stephen Rothwell wrote:
> Hi all,
>
> After merging the block tree, today's linux-next build (x86_64
> allmodconfig) produced this warning:
>
> fs/f2fs/node.c: In function 'f2fs_remove_inode_page':
> fs/f2fs/node.c:1193:47: warning: format '%zu' expects argument of type 
> 'size_t', but argument 5 has type 'blkcnt_t' {aka 'long long unsigned 
> int'} [-Wformat=]
> "Inconsistent i_blocks, ino:%lu, iblocks:%zu",
>  ~~^
>  %llu
> inode->i_ino, inode->i_blocks);
>   ~~~   

 Could you please help to fix that as below in your dev branch directly?

 "Inconsistent i_blocks, ino:%lu, iblocks:%llu",
>>>
>>> We can just use "%lu"?
>>
>> We'd better follow the sample in Documentation/core-api/printk-formats.rst:
>>
>> If <type> is dependent on a config option for its size (e.g., sector_t,
>> blkcnt_t) or is architecture-dependent for its size (e.g., tcflag_t), use a
>> format specifier of its largest possible type and explicitly cast to it.
> 
> Great, done. :)

Thanks, :)

> 
> Thanks,
> 
>>
>> Example::
>>
>> printk("test: sector number/total blocks: %llu/%llu\n",
>> (unsigned long long)sector, (unsigned long long)blockcount);
>>
>> Thanks,
>>
>>>
 inode->i_ino, (unsigned long long)inode->i_blocks)


 It looks that we need to fix below commits as well:

 f2fs: fix to avoid panic in dec_valid_block_count()
 f2fs: fix to avoid panic in dec_valid_node_count()

 Thanks,

>
> Introduced by commit
>
>   90ae238d9dac ("f2fs: fix to avoid panic in f2fs_remove_inode_page()")
>
> from the f2fs tree interacting with commit
>
>   72deb455b5ec ("block: remove CONFIG_LBDAF")
>
>>> .
>>>
> .
> 


Re: [PATCH v3] proc/sysctl: add shared variables for range check

2019-04-18 Thread Matthew Wilcox
On Fri, Apr 19, 2019 at 09:17:17AM +0900, Matteo Croce wrote:
> > extern const int sysctl_zero;
> > /* comment goes here */
> > #define SYSCTL_ZERO ((void *)&sysctl_zero)
> > 
> > and then use SYSCTL_ZERO everywhere.  That centralizes the ugliness
> > and
> > makes it easier to switch over if/when extra1&2 are constified.
> > 
> > But it's all a bit sad and lame :( 
> 
> No, we didn't decide yet. I need to check all the extra1/extra2 assignments. Not an
> impossible task, anyway.
> 
> I agree that the casts are ugly. Your suggested macro moves the ugliness to a
> single point, which is good. Or maybe we can do a single macro like:
> 
> #define SYSCTL_VAL(x) ((void *)&sysctl_##x)
> 
> to avoid defining one for every value. And when we decide that everything can 
> be const, we just update the macro.

If we're going to do that, we can save two EXPORTs and do:

const int sysctl_vals[] = { 0, 1, -1 };
EXPORT_SYMBOL(sysctl_vals);

#define SYSCTL_ZERO ((void *)&sysctl_vals[0])


Re: [PATCH v5 0/2] tty: serial: add DT bindings and serial driver for the SiFive FU540 UART

2019-04-18 Thread Atish Patra

On 4/18/19 4:22 PM, Kevin Hilman wrote:

Hi Paul,

Paul Walmsley  writes:


This series adds a serial driver, with console support, for the
UART IP block present on the SiFive FU540 SoC.  The programming
model is straightforward, but unique.

Boot-tested on a SiFive FU540 HiFive-U board, using BBL and the
open-source FSBL (with appropriate patches to the DT data).

This fifth version fixes a bug in the set_termios handler,
found by Andreas Schwab .

The patches in this series can also be found, with the PRCI patches,
DT patches, and DT prerequisite patch, at:

https://github.com/sifive/riscv-linux/tree/dev/paulw/serial-v5.1-rc4


I tried this branch, and it doesn't boot on my unleashed board.

Here's the boot log when I pass the DT built from your branch via
u-boot: https://termbin.com/rfp3.



Unfortunately, that won't work. The current DT is modified by OpenSBI:

1. Change hart status to "masked" from "okay".
2. M-mode interrupt masking in PLIC node.
3. Add a chosen node for serial access in U-Boot.

You can ignore 3 for your use case. However, if you pass a dtb built 
from source code, that will have hart0 enabled and M-mode interrupts 
enabled in DT.


Not sure if we should do these DT modifications in U-Boot as well.

I also noticed that your kernel is booting only 1 hart.
Just FYI: the RISC-V SMP patches for U-Boot are merged in master, so you
should be able to boot all cpus. You can ignore FU540_ENABLED_HART_MASK
in the OpenSBI build as well.


Regards,
Atish

I also tried the same thing, but using the DT that's hard-coded into
SBI/u-boot.  That doesn't boot fully either[1], but one thing I noted is
that with the DT from the kernel tree, the printk timestamps aren't
moving.  Maybe I'm still missing some kconfig options to enable the
right clock and/or IRQ controllers? I'm using this fragment[2] on top of
the default defconfig (arch/riscv/configs/defconfig).

Could you share the defconfig you're using when testing your branch?

Also for reference, I'm able to successfully build/boot the
5.1-rc1-unleashed branch from Atish's tree[3] using that kconfig
fragment[2] (and the hard-coded DT from u-boot/SBI).  Full log here[4].

Thanks,

Kevin

[1] https://termbin.com/wuc9
[2]
CONFIG_CLK_SIFIVE=y
CONFIG_CLK_SIFIVE_FU540_PRCI=y

CONFIG_SERIAL_SIFIVE=y
CONFIG_SERIAL_SIFIVE_CONSOLE=y

CONFIG_SIFIVE_PLIC=y
CONFIG_SPI=y
CONFIG_SPI_SIFIVE=y
CONFIG_GPIOLIB=y
CONFIG_GPIO_SIFIVE=y
CONFIG_PWM_SIFIVE=y

CONFIG_CLK_U54_PRCI=y
CONFIG_CLK_GEMGXL_MGMT=y

[3] https://github.com/atishp04/linux/tree/5.1-rc1-unleashed
[4] https://termbin.com/12bg

___
linux-riscv mailing list
linux-ri...@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv





[PATCH 1/2] regulator: fan53555: Clean up unneeded fields from struct fan53555_device_info

2019-04-18 Thread Axel Lin
The *regmap and *rdev can be replaced by local variables.
The slew_rate is no longer used since commit dd7e71fbeefe
("regulator: fan53555: use set_ramp_delay to set the ramp up slew rate").

Signed-off-by: Axel Lin 
---
 drivers/regulator/fan53555.c | 36 +---
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/drivers/regulator/fan53555.c b/drivers/regulator/fan53555.c
index 771a06d1900d..e2caf4173ab5 100644
--- a/drivers/regulator/fan53555.c
+++ b/drivers/regulator/fan53555.c
@@ -91,10 +91,8 @@ enum {
 
 struct fan53555_device_info {
enum fan53555_vendor vendor;
-   struct regmap *regmap;
struct device *dev;
struct regulator_desc desc;
-   struct regulator_dev *rdev;
struct regulator_init_data *regulator;
/* IC Type and Rev */
int chip_id;
@@ -106,8 +104,6 @@ struct fan53555_device_info {
unsigned int vsel_min;
unsigned int vsel_step;
unsigned int vsel_count;
-   /* Voltage slew rate limiting */
-   unsigned int slew_rate;
/* Mode */
unsigned int mode_reg;
unsigned int mode_mask;
@@ -125,7 +121,7 @@ static int fan53555_set_suspend_voltage(struct 
regulator_dev *rdev, int uV)
ret = regulator_map_voltage_linear(rdev, uV, uV);
if (ret < 0)
return ret;
-   ret = regmap_update_bits(di->regmap, di->sleep_reg,
+   ret = regmap_update_bits(rdev->regmap, di->sleep_reg,
 di->desc.vsel_mask, ret);
if (ret < 0)
return ret;
@@ -140,7 +136,7 @@ static int fan53555_set_suspend_enable(struct regulator_dev 
*rdev)
 {
struct fan53555_device_info *di = rdev_get_drvdata(rdev);
 
-   return regmap_update_bits(di->regmap, di->sleep_reg,
+   return regmap_update_bits(rdev->regmap, di->sleep_reg,
  VSEL_BUCK_EN, VSEL_BUCK_EN);
 }
 
@@ -148,7 +144,7 @@ static int fan53555_set_suspend_disable(struct 
regulator_dev *rdev)
 {
struct fan53555_device_info *di = rdev_get_drvdata(rdev);
 
-   return regmap_update_bits(di->regmap, di->sleep_reg,
+   return regmap_update_bits(rdev->regmap, di->sleep_reg,
  VSEL_BUCK_EN, 0);
 }
 
@@ -158,11 +154,11 @@ static int fan53555_set_mode(struct regulator_dev *rdev, 
unsigned int mode)
 
switch (mode) {
case REGULATOR_MODE_FAST:
-   regmap_update_bits(di->regmap, di->mode_reg,
+   regmap_update_bits(rdev->regmap, di->mode_reg,
   di->mode_mask, di->mode_mask);
break;
case REGULATOR_MODE_NORMAL:
-   regmap_update_bits(di->regmap, di->vol_reg, di->mode_mask, 0);
+   regmap_update_bits(rdev->regmap, di->vol_reg, di->mode_mask, 0);
break;
default:
return -EINVAL;
@@ -176,7 +172,7 @@ static unsigned int fan53555_get_mode(struct regulator_dev 
*rdev)
unsigned int val;
int ret = 0;
 
> -   ret = regmap_read(di->regmap, di->mode_reg, &val);
> +   ret = regmap_read(rdev->regmap, di->mode_reg, &val);
if (ret < 0)
return ret;
if (val & di->mode_mask)
@@ -213,7 +209,7 @@ static int fan53555_set_ramp(struct regulator_dev *rdev, 
int ramp)
return -EINVAL;
}
 
-   return regmap_update_bits(di->regmap, FAN53555_CONTROL,
+   return regmap_update_bits(rdev->regmap, FAN53555_CONTROL,
  CTL_SLEW_MASK, regval << CTL_SLEW_SHIFT);
 }
 
@@ -396,6 +392,7 @@ static int fan53555_regulator_register(struct 
fan53555_device_info *di,
struct regulator_config *config)
 {
> struct regulator_desc *rdesc = &di->desc;
+   struct regulator_dev *rdev;
 
rdesc->name = "fan53555-reg";
rdesc->supply_name = "vin";
@@ -410,8 +407,8 @@ static int fan53555_regulator_register(struct 
fan53555_device_info *di,
rdesc->vsel_mask = di->vsel_count - 1;
rdesc->owner = THIS_MODULE;
 
> -   di->rdev = devm_regulator_register(di->dev, &di->desc, config);
> -   return PTR_ERR_OR_ZERO(di->rdev);
> +   rdev = devm_regulator_register(di->dev, &di->desc, config);
> +   return PTR_ERR_OR_ZERO(rdev);
 }
 
 static const struct regmap_config fan53555_regmap_config = {
@@ -466,6 +463,7 @@ static int fan53555_regulator_probe(struct i2c_client 
*client,
struct fan53555_device_info *di;
struct fan53555_platform_data *pdata;
struct regulator_config config = { };
+   struct regmap *regmap;
unsigned int val;
int ret;
 
@@ -502,22 +500,22 @@ static int fan53555_regulator_probe(struct i2c_client 
*client,
di->vendor = id->driver_data;
}
 
> -   di->regmap = devm_regmap_init_i2c(client, &fan53555_regmap_config);
> -   if (IS_ERR(di->regmap)) {
> +   regmap = devm_regmap_init_i2c(client, &fan53555_regmap_config);
> +   if (IS_ERR(regmap)) {

Re: [RFC PATCH v3] perf/x86: make perf callchain work without CONFIG_FRAME_POINTER

2019-04-18 Thread Josh Poimboeuf
On Thu, Apr 18, 2019 at 12:07:30PM -0400, Kairui Song wrote:
> Currently perf callchain doesn't work well when sampling from trace
> point, with ORC unwinder enabled and CONFIG_FRAME_POINTER disabled.
> We'll get a useless in-kernel callchain like this:
> 
> perf  6429 [000]22.498450: kmem:mm_page_alloc: page=0x176a17 
> pfn=1534487 order=0 migratetype=0 gfp_flags=GFP_KERNEL
> be23e32e __alloc_pages_nodemask+0x22e 
> (/lib/modules/5.1.0-rc3+/build/vmlinux)
>   7efdf7f7d3e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
>   5651468729c1 [unknown] (/usr/bin/perf)
>   5651467ee82a main+0x69a (/usr/bin/perf)
>   7efdf7eaf413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
> 5541f689495641d7 [unknown] ([unknown])
> 
> The root cause is that within a trace point perf will try to dump the
> required caller's registers, but without CONFIG_FRAME_POINTER we
> can't get the caller's BP as the frame pointer, so the current frame
> pointer is returned instead. We get an invalid register combination
> which confuses the unwinder and ends the stacktrace early.
> 
> So in such a case just don't try to dump BP when doing a partial
> register dump, and just let the unwinder start directly when the
> registers are incapable of presenting an unwinding start point. Use SP
> as the skip mark, skipping all the frames until we meet the frame we want.
> 
> This makes the callchain get the full kernel space stacktrace again:
> 
> perf  6503 [000]  1567.570191: kmem:mm_page_alloc: page=0x16c904 
> pfn=1493252 order=0 migratetype=0 gfp_flags=GFP_KERNEL
> b523e2ae __alloc_pages_nodemask+0x22e 
> (/lib/modules/5.1.0-rc3+/build/vmlinux)
> b52383bd __get_free_pages+0xd 
> (/lib/modules/5.1.0-rc3+/build/vmlinux)
> b52fd28a __pollwait+0x8a (/lib/modules/5.1.0-rc3+/build/vmlinux)
> b521426f perf_poll+0x2f (/lib/modules/5.1.0-rc3+/build/vmlinux)
> b52fe3e2 do_sys_poll+0x252 (/lib/modules/5.1.0-rc3+/build/vmlinux)
> b52ff027 __x64_sys_poll+0x37 
> (/lib/modules/5.1.0-rc3+/build/vmlinux)
> b500418b do_syscall_64+0x5b 
> (/lib/modules/5.1.0-rc3+/build/vmlinux)
> b5a0008c entry_SYSCALL_64_after_hwframe+0x44 
> (/lib/modules/5.1.0-rc3+/build/vmlinux)
>   7f71e92d03e8 __poll+0x18 (/usr/lib64/libc-2.28.so)
>   55a22960d9c1 [unknown] (/usr/bin/perf)
>   55a22958982a main+0x69a (/usr/bin/perf)
>   7f71e9202413 __libc_start_main+0xf3 (/usr/lib64/libc-2.28.so)
> 5541f689495641d7 [unknown] ([unknown])
> 
> Signed-off-by: Kairui Song 
> ---
> 
> Update from V2:
>   - Instead of looking at if BP is 0, use X86_EFLAGS_FIXED flag bit as
> the indicator of where the pt_regs is valid for unwinding. As
> suggested by Peter Zijlstra
>   - Update some comments accordingly.
> 
> Update from V1:
>   Get rid of a lot of unnecessary code and just don't dump an inaccurate
>   BP, and use SP as the marker for the target frame.
> 
>  arch/x86/events/core.c| 24 +---
>  arch/x86/include/asm/perf_event.h |  5 +
>  arch/x86/include/asm/stacktrace.h |  9 +++--
>  include/linux/perf_event.h|  6 +++---
>  4 files changed, 36 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index e2b1447192a8..e181e195fe5d 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -2355,6 +2355,18 @@ void arch_perf_update_userpage(struct perf_event 
> *event,
>   cyc2ns_read_end();
>  }
>  
> +static inline int
> +valid_unwinding_registers(struct pt_regs *regs)
> +{
> + /*
> +  * regs might be a fake one, it won't dump the flags reg,
> +  * and without frame pointer, it won't have a valid BP.
> +  */
> + if (IS_ENABLED(CONFIG_FRAME_POINTER))
> + return 1;
> + return (regs->flags & PERF_EFLAGS_SNAP);
> +}
> +
>  void
>  perf_callchain_kernel(struct perf_callchain_entry_ctx *entry, struct pt_regs 
> *regs)
>  {
> @@ -2366,11 +2378,17 @@ perf_callchain_kernel(struct perf_callchain_entry_ctx 
> *entry, struct pt_regs *re
>   return;
>   }
>  
> - if (perf_callchain_store(entry, regs->ip))
> + if (valid_unwinding_registers(regs)) {
> + if (perf_callchain_store(entry, regs->ip))
> + return;
> + unwind_start(&state, current, regs, NULL);
> + } else if (regs->sp) {
> + unwind_start(&state, current, NULL, (unsigned long *)regs->sp);
> + } else {
>   return;
> + }
>  
> - for (unwind_start(&state, current, regs, NULL); !unwind_done(&state);
> -  unwind_next_frame(&state)) {
> + for (; !unwind_done(&state); unwind_next_frame(&state)) {
>   addr = unwind_get_return_address(&state);
>   if (!addr || perf_callchain_store(entry, addr))
>   return;
> diff --git a/arch/x86/include/asm/perf_event.h 
> b/arch/x86/include/asm/perf_event.h
> index 8bdf74902293..77c8519512ff 100644
> --- 

[PATCH 2/2] regulator: fan53555: Switch to SPDX identifier

2019-04-18 Thread Axel Lin
Signed-off-by: Axel Lin 
---
 drivers/regulator/fan53555.c | 24 ++--
 1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/drivers/regulator/fan53555.c b/drivers/regulator/fan53555.c
index e2caf4173ab5..dbe477da4e55 100644
--- a/drivers/regulator/fan53555.c
+++ b/drivers/regulator/fan53555.c
@@ -1,17 +1,13 @@
-/*
- * FAN53555 Fairchild Digitally Programmable TinyBuck Regulator Driver.
- *
- * Supported Part Numbers:
- * FAN53555UC00X/01X/03X/04X/05X
- *
- * Copyright (c) 2012 Marvell Technology Ltd.
- * Yunfan Zhang 
- *
- * This package is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License version 2 as
- * published by the Free Software Foundation.
- *
- */
+// SPDX-License-Identifier: GPL-2.0
+//
+// FAN53555 Fairchild Digitally Programmable TinyBuck Regulator Driver.
+//
+// Supported Part Numbers:
+// FAN53555UC00X/01X/03X/04X/05X
+//
+// Copyright (c) 2012 Marvell Technology Ltd.
+// Yunfan Zhang 
+
 #include 
 #include 
 #include 
-- 
2.17.1



Re: Adding plain accesses and detecting data races in the LKMM

2019-04-18 Thread Andrea Parri
> Are you saying that on x86, atomic_inc() acts as a full memory barrier 
> but not as a compiler barrier, and vice versa for 
> smp_mb__after_atomic()?  Or that neither atomic_inc() nor 
> smp_mb__after_atomic() implements a full memory barrier?

I'd say the former; AFAICT, these boil down to:

  
https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/atomic.h#L95
  
https://elixir.bootlin.com/linux/v5.1-rc5/source/arch/x86/include/asm/barrier.h#L84

  Andrea


Re: EDAC: Fix memory leak in creating CSROW object

2019-04-18 Thread Borislav Petkov
On Fri, Apr 19, 2019 at 08:35:36AM +0800, PanBian wrote:
> Yes, I see that. Because the loop start with (--i), there is no put
> operation for the device that fails to create. So, I think we cannot
> rule out the possibility of memory leak.

Ok, so this is not something you trigger - you're basically staring at
the code?

Well, there's something else questionable in that code which I asked
Greg about today but we didn't finish that conversation, let me CC him.

So AFAIU, devices for which device_add() has returned success,
should be removed with their counterpart device_del().
edac_create_csrow_objects(), however, does put_device() on those in the
"unwinding" loop.

And for the case where device_add() fails, you should do put_device() to
it. I.e., what you're saying.

So I think we need to figure what needs to be done when before fixing
this properly.

Greg?

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-18 Thread Borislav Petkov
On Thu, Apr 18, 2019 at 05:07:45PM -0700, Luck, Tony wrote:
> On Fri, Apr 19, 2019 at 01:29:10AM +0200, Borislav Petkov wrote:
> > Which reminds me, Tony, I think all those debugging files "pfn"
> > and "array" and the one you add now, should all be under a
> > CONFIG_RAS_CEC_DEBUG which is default off and used only for development.
> > Mind adding that too pls?
> 
> Patch below, on top of previous patch.  Note that I didn't move "enable"
> into the RAS_CEC_DEBUG code. I think it has some value even on
> production systems.

And that value is? Usecase?

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


[PATCH v2] tools/power: turbostat: make output buffer extensible (Re: [PATCH v1] tools/power: turbostat: fix buffer overrun)

2019-04-18 Thread Naoya Horiguchi
I updated the patch with a trival fix.
Could you review it?

- Naoya


From: Naoya Horiguchi 
Date: Fri, 19 Apr 2019 09:21:59 +0900
Subject: [PATCH v2] tools/power: turbostat: make output buffer extensible

"turbostat --Dump" could be terminated by a general protection fault on
some recent hardware which (for example) supports 9 levels of C-states
and shows 18 "tADDED" lines. That bloats the total output and finally
causes a buffer overrun.  So this patch extends the output
buffer when reaching its end.

This patch also removes a duplicated "pc10:" line to reduce buffer usage.

Signed-off-by: Naoya Horiguchi 
---
ChangeLog v1->v2:
- moved va_end() before if block.
---
 tools/power/x86/turbostat/turbostat.c | 397 ++
 1 file changed, 210 insertions(+), 187 deletions(-)

diff --git a/tools/power/x86/turbostat/turbostat.c 
b/tools/power/x86/turbostat/turbostat.c
index c7727be9719f..767ada558d75 100644
--- a/tools/power/x86/turbostat/turbostat.c
+++ b/tools/power/x86/turbostat/turbostat.c
@@ -84,6 +84,7 @@ double tsc_tweak = 1.0;
 unsigned int show_pkg_only;
 unsigned int show_core_only;
 char *output_buffer, *outp;
+ssize_t outbuf_size;
 unsigned int do_rapl;
 unsigned int do_dts;
 unsigned int do_ptm;
@@ -625,6 +626,28 @@ unsigned long long bic_lookup(char *name_list, enum 
show_hide_mode mode)
return retval;
 }
 
+static void *append_to_output_buffer(const char *fmt, ...)
+{
+   va_list args;
+
+   va_start(args, fmt);
+   outp += vsprintf(outp, fmt, args);
+   va_end(args);
+
+   /* Approaching the buffer end, so extend it. */
+   if (outp - output_buffer >= (outbuf_size - 256)) {
+   int output_size = outp - output_buffer;
+
+   outbuf_size += 1024;
+   output_buffer = realloc(output_buffer, outbuf_size);
+   if (output_buffer == NULL)
+   err(-1, "realloc output buffer");
+   if (debug)
+   printf("Output buffer was extended.\n");
+   outp = output_buffer + output_size;
+   }
+   return outp;
+}
 
 void print_header(char *delim)
 {
@@ -632,173 +655,173 @@ void print_header(char *delim)
int printed = 0;
 
if (DO_BIC(BIC_USEC))
-   outp += sprintf(outp, "%susec", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%susec", (printed++ ? delim : ""));
if (DO_BIC(BIC_TOD))
-   outp += sprintf(outp, "%sTime_Of_Day_Seconds", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sTime_Of_Day_Seconds", (printed++ ? delim : ""));
if (DO_BIC(BIC_Package))
-   outp += sprintf(outp, "%sPackage", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sPackage", (printed++ ? delim : ""));
if (DO_BIC(BIC_Die))
-   outp += sprintf(outp, "%sDie", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sDie", (printed++ ? delim : ""));
if (DO_BIC(BIC_Node))
-   outp += sprintf(outp, "%sNode", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sNode", (printed++ ? delim : ""));
if (DO_BIC(BIC_Core))
-   outp += sprintf(outp, "%sCore", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sCore", (printed++ ? delim : ""));
if (DO_BIC(BIC_CPU))
-   outp += sprintf(outp, "%sCPU", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sCPU", (printed++ ? delim : ""));
if (DO_BIC(BIC_APIC))
-   outp += sprintf(outp, "%sAPIC", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sAPIC", (printed++ ? delim : ""));
if (DO_BIC(BIC_X2APIC))
-   outp += sprintf(outp, "%sX2APIC", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sX2APIC", (printed++ ? delim : ""));
if (DO_BIC(BIC_Avg_MHz))
-   outp += sprintf(outp, "%sAvg_MHz", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sAvg_MHz", (printed++ ? delim : ""));
if (DO_BIC(BIC_Busy))
-   outp += sprintf(outp, "%sBusy%%", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sBusy%%", (printed++ ? delim : ""));
if (DO_BIC(BIC_Bzy_MHz))
-   outp += sprintf(outp, "%sBzy_MHz", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sBzy_MHz", (printed++ ? delim : ""));
if (DO_BIC(BIC_TSC_MHz))
-   outp += sprintf(outp, "%sTSC_MHz", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%sTSC_MHz", (printed++ ? delim : ""));

if (DO_BIC(BIC_IRQ)) {
	if (sums_need_wide_columns)
-   outp += sprintf(outp, "%s IRQ", (printed++ ? delim : ""));
+   outp = append_to_output_buffer("%s  
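The fragment above converts turbostat's raw `outp += sprintf(outp, ...)` appends into calls to a buffer-managing helper. A minimal sketch of such a helper is below; the name matches the patch, but the globals, growth policy, and implementation details are assumptions, not the actual turbostat code.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of a growing output buffer; not the actual
 * turbostat implementation. */
static char *out_buf;
static size_t out_len, out_cap;

/* Append a formatted string, growing the buffer as needed.  Returns a
 * pointer to the end of the buffered string, mirroring how the patch
 * assigns the helper's return value back to 'outp'. */
char *append_to_output_buffer(const char *fmt, ...)
{
	va_list ap;
	int need;

	va_start(ap, fmt);
	need = vsnprintf(NULL, 0, fmt, ap);	/* measure first */
	va_end(ap);

	if (out_len + need + 1 > out_cap) {
		out_cap = (out_len + need + 1) * 2;
		out_buf = realloc(out_buf, out_cap);
		assert(out_buf);	/* toy error handling */
	}

	va_start(ap, fmt);
	vsnprintf(out_buf + out_len, out_cap - out_len, fmt, ap);
	va_end(ap);
	out_len += need;
	return out_buf + out_len;
}
```

Measuring with vsnprintf(NULL, 0, ...) before writing makes the append safe regardless of remaining capacity, which is presumably the overflow risk the conversion guards against.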

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-18 Thread Borislav Petkov
On Thu, Apr 18, 2019 at 04:58:22PM -0700, Cong Wang wrote:
> No, it is all about whether we should break users' expectation.

What user expectation?

> This doesn't sound like a valid reason for us to break users'
> expectation.

I think it is *you* who has some sort of "expectation" but that
"expectation" is wrong.

> Prior to CONFIG_RAS, mcelog just works fine for users (at least Intel
> users). Suddenly after enabling CONFIG_RAS in kernel, mcelog will
> no longer receive any correctable memory errors _silently_.

That is, of course, wrong too.

> What's more, we don't even have rasdaemon running in our system, so

Are you saying "we" to mean "we the users" or some company "we"?

And that is wrong too, there's at least one rasdaemon:

  http://git.infradead.org/users/mchehab/rasdaemon.git

> there is no consumer of RAS CEC,

RAS CEC doesn't need a consumer. You're misunderstanding the whole
concept of the error collector.

> these errors just simply disappear from users' expected place.

They "disappear" because you have CONFIG_RAS_CEC enabled. But they
don't really disappear - they're collected by the thing to filter out
only the pages which keep generating errors constantly and those get
soft-offlined.

The sporadic ones simply get ignored because they don't happen again
and are only the result of alpha particles or overheating conditions or
whatever.
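The collector described above boils down to a per-PFN counter cache with an offline threshold. A toy model of the idea is sketched below; the constants, the linear lookup, and the missing decay/eviction logic are all simplifications, nothing like the kernel's compact array implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the CEC idea, not the kernel implementation: count
 * correctable errors per 4K page frame and report "offline this page"
 * once a PFN's counter reaches a threshold.  Sporadic one-off errors
 * are merely recorded and never trigger an offline. */
#define CEC_SLOTS	128
#define CEC_THRESHOLD	16	/* made-up value */

struct cec_entry {
	uint64_t pfn;
	unsigned int count;
	bool used;
};

static struct cec_entry cec[CEC_SLOTS];

/* Returns true when the page should be soft-offlined. */
bool cec_add_error(uint64_t pfn)
{
	int i;

	for (i = 0; i < CEC_SLOTS; i++) {
		if (cec[i].used && cec[i].pfn == pfn)
			return ++cec[i].count >= CEC_THRESHOLD;
	}
	for (i = 0; i < CEC_SLOTS; i++) {
		if (!cec[i].used) {
			cec[i].pfn = pfn;
			cec[i].count = 1;
			cec[i].used = true;
			return false;
		}
	}
	return false;	/* cache full; the real CEC evicts via decay */
}
```

The point of the model: no consumer is needed, because the only output is the in-kernel decision to offline a repeatedly failing page.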

Now here's the CEC help text:

config RAS_CEC
bool "Correctable Errors Collector"
depends on X86_MCE && MEMORY_FAILURE && DEBUG_FS
---help---
  This is a small cache which collects correctable memory errors per 4K
  page PFN and counts their repeated occurrence. Once the counter for a
  PFN overflows, we try to soft-offline that page as we take it to mean
  that it has reached a relatively high error count and would probably
  be best if we don't use it anymore.

  Bear in mind that this is absolutely useless if your platform doesn't
  have ECC DIMMs and doesn't have DRAM ECC checking enabled in the BIOS.

you can tell me what in that text is not clear so that I can make it
more clear and obvious what that thing is.

> I know CONFIG_RAS is a new feature supposed to replace MCELOG,

No, it isn't. CONFIG_RAS is supposed to collect all the RAS-related
functionality in the kernel and it looks like you have some
misconceptions about it.

> but they can co-exist in kernel config, which means mcelog should
> continue to work as before until it gets fully replaced.

For that you need to enable X86_MCELOG_LEGACY.

And let me repeat it again - if you want to collect errors in userspace,
do not enable RAS_CEC at all.

> Even the following PoC change could make this situation better,
> because with this change when we enable CONFIG_RAS, mcelog
> will break _loudly_ rather than just silently, users will notice mcelog
> is no longer supported and will look for its alternative choice.

You have somehow put in your head that CONFIG_RAS is the counterpart of
CONFIG_X86_MCELOG_LEGACY. Which is *simply* *not* *true*.

And the moment you realize that, then you'll be a step further in the
right direction.

So enable X86_MCELOG_LEGACY and you can collect all the errors you wish.
And there's a rasdaemon which you can use too, as I pointed above, if
you don't want mcelog.

CEC is something *completely* different. Its purpose is to run in the
kernel and prevent users and admins from getting upset unnecessarily
over every sporadic correctable error: just because an alpha particle
flew through their DIMMs, they all start running in headless chicken
mode, trying to RMA perfectly good hardware.

Now, if any of that above still doesn't make it clear, please state what
you're trying to achieve and I'll try to help.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.


Re: [PATCH v3] proc/sysctl: add shared variables for range check

2019-04-18 Thread Matteo Croce
On April 19, 2019 7:40:45 AM GMT+09:00, Andrew Morton 
 wrote:
> On Wed, 17 Apr 2019 15:15:31 +0200 Matteo Croce 
> wrote:
> 
> > In the sysctl code the proc_dointvec_minmax() function is often used
> to
> > validate the user supplied value between an allowed range. This
> function
> > uses the extra1 and extra2 members from struct ctl_table as minimum
> and
> > maximum allowed value.
> > 
> > On sysctl handler declaration, in every source file there are some
> readonly
> > variables containing just an integer which address is assigned to
> the
> > extra1 and extra2 members, so the sysctl range is enforced.
> > 
> > The special values 0, 1 and INT_MAX are very often used as range
> boundary,
> > leading duplication of variables like zero=0, one=1, int_max=INT_MAX
> in
> > different source files:
> > 
> > $ git grep -E '\.extra[12].*&(zero|one|int_max)\b' |wc -l
> > 245
> > 
> > This patch adds three const variables for the most commonly used
> values,
> > and use them instead of creating a local one for every object file.
> > 
> > ...
> >
> > --- a/arch/s390/appldata/appldata_base.c
> > +++ b/arch/s390/appldata/appldata_base.c
> > @@ -220,15 +220,13 @@ appldata_timer_handler(struct ctl_table *ctl,
> int write,
> >void __user *buffer, size_t *lenp, loff_t *ppos)
> >  {
> > int timer_active = appldata_timer_active;
> > -   int zero = 0;
> > -   int one = 1;
> > int rc;
> > struct ctl_table ctl_entry = {
> > .procname   = ctl->procname,
> > .data   = &timer_active,
> > .maxlen = sizeof(int),
> > -   .extra1 = &zero,
> > -   .extra2 = &one,
> > +   .extra1 = (void *)&sysctl_zero,
> > +   .extra2 = (void *)&sysctl_one,
> > };
> 
> Still not liking the casts :(
> 
> Did we decide whether making extra1&2 const void*'s was feasible?
> 
> I'm wondering if it would be better to do
> 
> extern const int sysctl_zero;
> /* comment goes here */
> #define SYSCTL_ZERO ((void *)&sysctl_zero)
> 
> and then use SYSCTL_ZERO everywhere.  That centralizes the ugliness
> and
> makes it easier to switch over if/when extra1&2 are constified.
> 
> But it's all a bit sad and lame :( 

No, we didn't decide yet. I need to check for all extra1,2 assignment. Not an 
impossible task, anyway.

I agree that the casts are ugly. Your suggested macro moves the ugliness in a 
single point, which is good. Or maybe we can do a single macro like:

#define SYSCTL_VAL(x) ((void *)&sysctl_##x)

to avoid defining one for every value. And when we decide that everything can 
be const, we just update the macro.
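Outside the kernel tree, the shared-constant pattern under discussion can be sketched as follows; the struct here is a minimal stand-in for the kernel's struct ctl_table, reduced to the members relevant to the thread, and the entry itself is hypothetical.

```c
#include <assert.h>
#include <limits.h>

/* Shared, read-only boundary values, defined once for the whole tree. */
const int sysctl_zero = 0;
const int sysctl_one = 1;
const int sysctl_int_max = INT_MAX;

/* One single point of cast ugliness, per the SYSCTL_VAL(x) idea: */
#define SYSCTL_VAL(x) ((void *)&sysctl_##x)

/* Minimal stand-in for the kernel's struct ctl_table. */
struct ctl_table {
	const char *procname;
	void *extra1;	/* minimum allowed value */
	void *extra2;	/* maximum allowed value */
};

/* Hypothetical sysctl entry clamped to [0, 1]. */
static struct ctl_table timer_entry = {
	.procname = "timer",
	.extra1 = SYSCTL_VAL(zero),
	.extra2 = SYSCTL_VAL(one),
};
```

This keeps one const int per boundary value and centralizes the const-discarding cast in a single macro, so constifying extra1/extra2 later only means updating the macro.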

Regards,
-- 
Matteo Croce
per aspera ad upstream


Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-18 Thread Luck, Tony
On Fri, Apr 19, 2019 at 01:29:10AM +0200, Borislav Petkov wrote:
> Which reminds me, Tony, I think all those debugging files "pfn"
> and "array" and the one you add now, should all be under a
> CONFIG_RAS_CEC_DEBUG which is default off and used only for development.
> Mind adding that too pls?

Patch below, on top of previous patch.  Note that I didn't move "enable"
into the RAS_CEC_DEBUG code. I think it has some value even on
production systems.  It is still in debugfs (which many production
systems don't mount) so I don't see that people are going to be randomly
using it to disable the CEC.

-Tony


From ac9d8c9bf7b38e18dcffdd41f8fcf0f07c632cd3 Mon Sep 17 00:00:00 2001
From: Tony Luck 
Date: Thu, 18 Apr 2019 16:46:55 -0700
Subject: [PATCH] RAS/CEC: Move CEC debug features under a CONFIG_RAS_CEC_DEBUG
 option

The pfn and array files in /sys/kernel/debug/ras/cec are intended
for debugging the CEC code itself. They are not needed on production
systems, so the default setting for this CONFIG option is "n".

Signed-off-by: Tony Luck 
---
 arch/x86/ras/Kconfig | 10 ++
 drivers/ras/cec.c| 13 -
 2 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/arch/x86/ras/Kconfig b/arch/x86/ras/Kconfig
index a9c3db125222..7fde8d55e394 100644
--- a/arch/x86/ras/Kconfig
+++ b/arch/x86/ras/Kconfig
@@ -11,3 +11,13 @@ config RAS_CEC
 
  Bear in mind that this is absolutely useless if your platform doesn't
  have ECC DIMMs and doesn't have DRAM ECC checking enabled in the BIOS.
+
+config RAS_CEC_DEBUG
+   bool "Debugging"
+   default n
+   depends on RAS_CEC
+   ---help---
+ Add extra files to /sys/kernel/debug/ras/cec to test the correctable
+ memory error feature. "pfn" is a writable file that allows user to
+ simulate an error in a particular page frame. "array" is a read-only
+ file that dumps out the current state of all pages logged so far.
diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index a2ceedcd8516..ff880a5c289a 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -118,7 +118,9 @@ static struct ce_array {
 } ce_arr;
 
 static DEFINE_MUTEX(ce_mutex);
+#ifdef CONFIG_RAS_CEC_DEBUG
 static u64 dfs_pfn;
+#endif
 
 /* Amount of errors after which we offline */
 static unsigned int count_threshold = COUNT_MASK;
@@ -364,6 +366,7 @@ static int u64_get(void *data, u64 *val)
return 0;
 }
 
+#ifdef CONFIG_RAS_CEC_DEBUG
 static int pfn_set(void *data, u64 val)
 {
*(u64 *)data = val;
@@ -372,6 +375,7 @@ static int pfn_set(void *data, u64 val)
 }
 
 DEFINE_DEBUGFS_ATTRIBUTE(pfn_ops, u64_get, pfn_set, "0x%llx\n");
+#endif
 
 static int decay_interval_set(void *data, u64 val)
 {
@@ -411,6 +415,7 @@ static int enable_set(void *data, u64 val)
 
 DEFINE_DEBUGFS_ATTRIBUTE(enable_ops, u64_get, enable_set, "%lld\n");
 
+#ifdef CONFIG_RAS_CEC_DEBUG
 static int array_dump(struct seq_file *m, void *v)
 {
struct ce_array *ca = &ce_arr;
@@ -459,10 +464,14 @@ static const struct file_operations array_ops = {
.llseek  = seq_lseek,
.release = single_release,
 };
+#endif
 
 static int __init create_debugfs_nodes(void)
 {
-   struct dentry *d, *pfn, *decay, *count, *array, *enable;
+   struct dentry *d, *decay, *count, *enable;
+#ifdef CONFIG_RAS_CEC_DEBUG
+   struct dentry *pfn, *array;
+#endif
 
d = debugfs_create_dir("cec", ras_debugfs_dir);
if (!d) {
@@ -470,6 +479,7 @@ static int __init create_debugfs_nodes(void)
return -1;
}
 
+#ifdef CONFIG_RAS_CEC_DEBUG
pfn = debugfs_create_file("pfn", S_IRUSR | S_IWUSR, d, &dfs_pfn, &pfn_ops);
if (!pfn) {
pr_warn("Error creating pfn debugfs node!\n");
@@ -481,6 +491,7 @@ static int __init create_debugfs_nodes(void)
pr_warn("Error creating array debugfs node!\n");
goto err;
}
+#endif
 
decay = debugfs_create_file("decay_interval", S_IRUSR | S_IWUSR, d,
			    &decay_interval, &decay_interval_ops);
-- 
2.19.1



mmotm 2019-04-18-16-58 uploaded

2019-04-18 Thread akpm
The mm-of-the-moment snapshot 2019-04-18-16-58 has been uploaded to

   http://www.ozlabs.org/~akpm/mmotm/

mmotm-readme.txt says

README for mm-of-the-moment:

http://www.ozlabs.org/~akpm/mmotm/

This is a snapshot of my -mm patch queue.  Uploaded at random hopefully
more than once a week.

You will need quilt to apply these patches to the latest Linus release (5.x
or 5.x-rcY).  The series file is in broken-out.tar.gz and is duplicated in
http://ozlabs.org/~akpm/mmotm/series

The file broken-out.tar.gz contains two datestamp files: .DATE and
.DATE-yyyy-mm-dd-hh-mm-ss.  Both contain the string yyyy-mm-dd-hh-mm-ss,
followed by the base kernel version against which this patch series is to
be applied.

This tree is partially included in linux-next.  To see which patches are
included in linux-next, consult the `series' file.  Only the patches
within the #NEXT_PATCHES_START/#NEXT_PATCHES_END markers are included in
linux-next.


A full copy of the full kernel tree with the linux-next and mmotm patches
already applied is available through git within an hour of the mmotm
release.  Individual mmotm releases are tagged.  The master branch always
points to the latest release, so it's constantly rebasing.

http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/



The directory http://www.ozlabs.org/~akpm/mmots/ (mm-of-the-second)
contains daily snapshots of the -mm tree.  It is updated more frequently
than mmotm, and is untested.

A git copy of this tree is available at

http://git.cmpxchg.org/cgit.cgi/linux-mmots.git/

and use of this tree is similar to
http://git.cmpxchg.org/cgit.cgi/linux-mmotm.git/, described above.


This mmotm tree contains the following patches against 5.1-rc5:
(patches marked "*" will be included in linux-next)

  origin.patch
* slab-store-tagged-freelist-for-off-slab-slabmgmt.patch
* mm-swapoff-shmem_find_swap_entries-filter-out-other-types.patch
* mm-swapoff-remove-too-limiting-swap_unuse_max_tries.patch
* mm-swapoff-take-notice-of-completion-sooner.patch
* mm-swapoff-shmem_unuse-stop-eviction-without-igrab.patch
* mm-swapoff-shmem_unuse-stop-eviction-without-igrab-fix.patch
* mm-memory_hotplug-do-not-unlock-when-fails-to-take-the-device_hotplug_lock.patch
* mm-vmstat-fix-proc-vmstat-format-for-config_debug_tlbflush=y-config_smp=n.patch
* proc-fix-map_files-test-on-f29.patch
* proc-fixup-proc-pid-vm-test.patch
* mm-hotplug-treat-cma-pages-as-unmovable.patch
* mm-hotplug-treat-cma-pages-as-unmovable-v4.patch
* mm-fix-inactive-list-balancing-between-numa-nodes-and-cgroups.patch
* mm-fix-inactive-list-balancing-between-numa-nodes-and-cgroups-fix.patch
* kcov-improve-config_arch_has_kcov-help-text.patch
* watchdog-hard-lockup-message-should-end-with-a-newline.patch
* init-initialize-jump-labels-before-command-line-option-parsing.patch
* kmemleak-fix-unused-function-warning.patch
* coredump-fix-race-condition-between-mmget_not_zero-get_task_mm-and-core-dumping.patch
* checkpatch-dont-interpret-stack-dumps-as-commit-ids.patch
* mm-add-sys-kernel-slab-cache-cache_dma32.patch
* userfaultfd-use-rcu-to-free-the-task-struct-when-fork-fails.patch
* mm-memory_hotplug-drop-memory-device-reference-after-find_memory_block.patch
* zram-pass-down-the-bvec-we-need-to-read-into-in-the-work-struct.patch
* lib-kconfigdebug-fix-build-error-without-config_block.patch
* lib-test_vmalloc-do-not-create-cpumask_t-variable-on-stack.patch
* prctl-fix-false-positive-in-validate_prctl_map.patch
* scripts-spellingtxt-add-more-typos-to-spellingtxt-and-sort.patch
* arch-sh-boards-mach-dreamcast-irqc-remove-duplicate-header.patch
* debugobjects-move-printk-out-of-db-lock-critical-sections.patch
* ocfs2-use-common-file-type-conversion.patch
* ocfs2-fix-ocfs2-read-inode-data-panic-in-ocfs2_iget.patch
* ocfs2-clear-zero-in-unaligned-direct-io.patch
* ocfs2-clear-zero-in-unaligned-direct-io-checkpatch-fixes.patch
* ocfs2-wait-for-recovering-done-after-direct-unlock-request.patch
* ocfs2-checkpoint-appending-truncate-log-transaction-before-flushing.patch
* ramfs-support-o_tmpfile.patch
  mm.patch
* list-add-function-list_rotate_to_front.patch
* slob-respect-list_head-abstraction-layer.patch
* slob-use-slab_list-instead-of-lru.patch
* slub-add-comments-to-endif-pre-processor-macros.patch
* slub-use-slab_list-instead-of-lru.patch
* slab-use-slab_list-instead-of-lru.patch
* mm-remove-stale-comment-from-page-struct.patch
* slub-remove-useless-kmem_cache_debug-before-remove_full.patch
* mm-slab-remove-unneed-check-in-cpuup_canceled.patch
* slub-update-the-comment-about-slab-frozen.patch
* slab-fix-an-infinite-loop-in-leaks_show.patch
* slab-fix-an-infinite-loop-in-leaks_show-fix.patch
* mm-vmscan-drop-zone-id-from-kswapd-tracepoints.patch
* mm-cma_debugc-fix-the-break-condition-in-cma_maxchunk_get.patch
* userfaultfd-sysctl-add-vmunprivileged_userfaultfd.patch
* userfaultfd-sysctl-add-vmunprivileged_userfaultfd-fix.patch
* page-cache-store-only-head-pages-in-i_pages.patch
* page-cache-store-only-head-pages-in-i_pages-fix.patch
* 

Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-18 Thread Cong Wang
On Thu, Apr 18, 2019 at 4:29 PM Borislav Petkov  wrote:
>
> On Thu, Apr 18, 2019 at 03:51:07PM -0700, Cong Wang wrote:
> > On Thu, Apr 18, 2019 at 3:02 PM Tony Luck  wrote:
> > >
> > > Useful when running error injection tests that want to
> > > see all of the MCi_(STATUS|ADDR|MISC) data via /dev/mcelog.
> > >
> > > Signed-off-by: Tony Luck 
> >
> > We saw the same problem, CONFIG_RAS hijacks all the
> > correctable memory errors, which leaves mcelog "broken"
> > silently. I know it is arguable, but until we can switch from
> > mcelog to rasdaemon, mcelog should still work as before.
>
> It is "arguable" because this is not how the CEC is supposed to be used.

No, it is all about whether we should break users' expectation.

>
> If you want to collect errors with mcelog, you don't use the CEC at all.
> And there's ras=cec_disable for that or you simply don't enable it in
> your .config.
>
> As Tony says in the commit message, the enable should be used only for
> injection tests. Which is where that thing should only be used for -
> debugging the CEC itself.

This doesn't sound like a valid reason for us to break users'
expectation.

Prior to CONFIG_RAS, mcelog just works fine for users (at least Intel
users). Suddenly after enabling CONFIG_RAS in kernel, mcelog will
no longer receive any correctable memory errors _silently_. What's
more, we don't even have rasdaemon running in our system, so there is
no consumer of RAS CEC, and these errors simply disappear from
users' expected place.

I know CONFIG_RAS is a new feature supposed to replace MCELOG,
but they can co-exist in kernel config, which means mcelog should
continue to work as before until it gets fully replaced.

Even the following PoC change could make this situation better,
because with this change when we enable CONFIG_RAS, mcelog
will break _loudly_ rather than just silently, users will notice mcelog
is no longer supported and will look for its alternative choice.

diff --git a/drivers/ras/Kconfig b/drivers/ras/Kconfig
index b834ff555188..f2e2b75fffbe 100644
--- a/drivers/ras/Kconfig
+++ b/drivers/ras/Kconfig
@@ -1,5 +1,6 @@
 menuconfig RAS
bool "Reliability, Availability and Serviceability (RAS) features"
+   depends on !X86_MCELOG_LEGACY
help
  Reliability, availability and serviceability (RAS) is a computer
  hardware engineering term. Computers designed with higher levels


Just my 2 cents.

Thanks.


Re: [PATCH v5 00/18] locking/rwsem: Rwsem rearchitecture part 2

2019-04-18 Thread Waiman Long
On 04/18/2019 07:46 PM, Waiman Long wrote:
>  v5:
>   - Drop v4 patch 1 as it is merged into tip's locking/core branch.
>   - Integrate the 2 followup patches into the series. The first
> follow-up patch is broken into 2 pieces. The first piece comes in
> before the "Enable readers spinning on writer" and the 2nd piece
> is merged into the "Enable time-based spinning on reader-owned
> rwsem" patch. The 2nd followup patch is added after that.
>   - Add a new patch to make all wake_up_q() calls after dropping
> wait_lock as suggested by PeterZ.
>   - Incorporate numerous suggestions by PeterZ and Davidlohr.

This patchset is still being reviewed by Peter. The purpose of this
series is mainly to sync up the version that Peter has with the one
that I am working on, incorporating his feedback. Further changes may
still be needed.

I ran an overall performance test on this new patchset and present the
data in this cover letter. However, I haven't run performance tests for
individual patches. So the performance data listed in some of the
patches may be stale.

Cheers,
Longman





[PATCH v5 18/18] locking/rwsem: Remove redundant computation of writer lock word

2019-04-18 Thread Waiman Long
On 64-bit architectures, each rwsem writer will have its unique lock
word for acquiring the lock. Right now, the writer code recomputes the
lock word every time it tries to acquire the lock. This is a waste of
time. The lock word is now cached and reused when it is needed.

When CONFIG_RWSEM_OWNER_COUNT isn't defined, the extra constant argument
to rwsem_try_write_lock() and rwsem_try_write_lock_unqueued() should
be optimized out by the compiler.

Signed-off-by: Waiman Long 
---
 kernel/locking/rwsem.c | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 2dc975f7eb8e..19d8fbd50d17 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -578,6 +578,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
  * bit is set or the lock is acquired with handoff bit cleared.
  */
 static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem,
+   const long wlock,
enum writer_wait_state wstate)
 {
long new;
@@ -594,8 +595,7 @@ static inline bool rwsem_try_write_lock(long count, struct 
rw_semaphore *sem,
return false;
new = count | RWSEM_FLAG_HANDOFF;
} else {
-   new = (count | RWSEM_WRITER_LOCKED) &
-   ~RWSEM_FLAG_HANDOFF;
+   new = (count | wlock) & ~RWSEM_FLAG_HANDOFF;
 
if (list_is_singular(&sem->wait_list))
new &= ~RWSEM_FLAG_WAITERS;
@@ -641,13 +641,14 @@ static inline bool rwsem_try_read_lock_unqueued(struct 
rw_semaphore *sem)
 /*
  * Try to acquire write lock before the writer has been put on wait queue.
  */
-static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
+static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem,
+const long wlock)
 {
long count = atomic_long_read(&sem->count);
 
while (!(count & (RWSEM_LOCK_MASK|RWSEM_FLAG_HANDOFF))) {
if (atomic_long_try_cmpxchg_acquire(&sem->count, &count,
-   count | RWSEM_WRITER_LOCKED)) {
+   count | wlock)) {
rwsem_set_owner(sem);
lockevent_inc(rwsem_opt_wlock);
return true;
@@ -813,7 +814,7 @@ static inline u64 rwsem_rspin_threshold(struct rw_semaphore 
*sem)
return sched_clock() + delta;
 }
 
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem, const long wlock)
 {
bool taken = false;
int prev_owner_state = OWNER_NULL;
@@ -841,7 +842,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem, 
bool wlock)
/*
 * Try to acquire the lock
 */
-   taken = wlock ? rwsem_try_write_lock_unqueued(sem)
+   taken = wlock ? rwsem_try_write_lock_unqueued(sem, wlock)
  : rwsem_try_read_lock_unqueued(sem);
 
if (taken)
@@ -957,7 +958,8 @@ static inline bool rwsem_can_spin_on_owner(struct 
rw_semaphore *sem, bool wr)
return false;
 }
 
-static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
+static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem,
+const long wlock)
 {
return false;
 }
@@ -1122,10 +1124,11 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int 
state)
struct rwsem_waiter waiter;
struct rw_semaphore *ret = sem;
DEFINE_WAKE_Q(wake_q);
+   const long wlock = RWSEM_WRITER_LOCKED;
 
/* do optimistic spinning and steal lock if possible */
if (rwsem_can_spin_on_owner(sem, true) &&
-   rwsem_optimistic_spin(sem, true))
+   rwsem_optimistic_spin(sem, wlock))
return sem;
 
/*
@@ -1193,7 +1196,7 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int 
state)
/* wait until we successfully acquire the lock */
set_current_state(state);
while (true) {
-   if (rwsem_try_write_lock(count, sem, wstate))
+   if (rwsem_try_write_lock(count, sem, wlock, wstate))
break;
 
raw_spin_unlock_irq(&sem->wait_lock);
-- 
2.18.1



[PATCH v5 05/18] locking/rwsem: Code cleanup after files merging

2019-04-18 Thread Waiman Long
After merging all the relevant rwsem code into one single file, there
are a number of optimizations and cleanups that can be done:

 1) Remove all the EXPORT_SYMBOL() calls for functions that are not
accessed elsewhere.
 2) Remove all the __visible tags as none of the functions will be
called from assembly code anymore.
 3) Make all the internal functions static.
 4) Remove some unneeded blank lines.
 5) Remove the intermediate rwsem_down_{read|write}_failed*() functions
and rename __rwsem_down_{read|write}_failed_common() to
rwsem_down_{read|write}_slowpath().
 6) Use atomic_long_try_cmpxchg_acquire() as much as possible.
 7) Remove the rwsem_rtrylock and rwsem_wtrylock lock events as they
are not that useful.

That enables the compiler to do better optimization and reduce code
size. The text+data size of rwsem.o on an x86-64 machine with gcc8 was
reduced from 10237 bytes to 5030 bytes with this change.

Suggested-by: Peter Zijlstra 
Signed-off-by: Waiman Long 
---
 kernel/locking/lock_events_list.h |   2 -
 kernel/locking/rwsem.c| 119 +-
 2 files changed, 34 insertions(+), 87 deletions(-)

diff --git a/kernel/locking/lock_events_list.h 
b/kernel/locking/lock_events_list.h
index ad7668cfc9da..11187a1d40b8 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -61,7 +61,5 @@ LOCK_EVENT(rwsem_opt_fail)/* # of failed opt-spinnings
*/
 LOCK_EVENT(rwsem_rlock)/* # of read locks acquired 
*/
 LOCK_EVENT(rwsem_rlock_fast)   /* # of fast read locks acquired*/
 LOCK_EVENT(rwsem_rlock_fail)   /* # of failed read lock acquisitions   */
-LOCK_EVENT(rwsem_rtrylock) /* # of read trylock calls  */
 LOCK_EVENT(rwsem_wlock)/* # of write locks acquired
*/
 LOCK_EVENT(rwsem_wlock_fail)   /* # of failed write lock acquisitions  */
-LOCK_EVENT(rwsem_wtrylock) /* # of write trylock calls */
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 06e060c360d8..467279ffbd4c 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -205,7 +205,6 @@ void __init_rwsem(struct rw_semaphore *sem, const char 
*name,
osq_lock_init(&sem->osq);
 #endif
 }
-
 EXPORT_SYMBOL(__init_rwsem);
 
 enum rwsem_waiter_type {
@@ -304,7 +303,7 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
list_del(>list);
/*
 * Ensure calling get_task_struct() before setting the reader
-* waiter to nil such that rwsem_down_read_failed() cannot
+* waiter to nil such that rwsem_down_read_slowpath() cannot
 * race with do_exit() by always holding a reference count
 * to the task to wakeup.
 */
@@ -500,8 +499,8 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 /*
  * Wait for the read lock to be granted
  */
-static inline struct rw_semaphore __sched *
-__rwsem_down_read_failed_common(struct rw_semaphore *sem, int state)
+static struct rw_semaphore __sched *
+rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
 {
long count, adjustment = -RWSEM_READER_BIAS;
struct rwsem_waiter waiter;
@@ -573,25 +572,11 @@ __rwsem_down_read_failed_common(struct rw_semaphore *sem, 
int state)
return ERR_PTR(-EINTR);
 }
 
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed(struct rw_semaphore *sem)
-{
-   return __rwsem_down_read_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_read_failed_killable(struct rw_semaphore *sem)
-{
-   return __rwsem_down_read_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_read_failed_killable);
-
 /*
  * Wait until we successfully acquire the write lock
  */
-static inline struct rw_semaphore *
-__rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
+static struct rw_semaphore *
+rwsem_down_write_slowpath(struct rw_semaphore *sem, int state)
 {
long count;
bool waiting = true; /* any queued threads before us */
@@ -692,26 +677,11 @@ __rwsem_down_write_failed_common(struct rw_semaphore 
*sem, int state)
return ERR_PTR(-EINTR);
 }
 
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed(struct rw_semaphore *sem)
-{
-   return __rwsem_down_write_failed_common(sem, TASK_UNINTERRUPTIBLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed);
-
-__visible struct rw_semaphore * __sched
-rwsem_down_write_failed_killable(struct rw_semaphore *sem)
-{
-   return __rwsem_down_write_failed_common(sem, TASK_KILLABLE);
-}
-EXPORT_SYMBOL(rwsem_down_write_failed_killable);
-
 /*
  * handle waking up a waiter on the semaphore
  * - up_read/up_write has decremented the active part of count if we come here
  */
-__visible
-struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
+static struct 

[PATCH v5 17/18] locking/rwsem: Merge owner into count on x86-64

2019-04-18 Thread Waiman Long
With separate count and owner, there are timing windows where the two
values are inconsistent. That can cause problems when trying to figure
out the exact state of the rwsem. For instance, a RT task will stop
optimistic spinning if the lock is acquired by a writer but the owner
field isn't set yet. That can be solved by combining the count and
owner together in a single atomic value.

On 32-bit architectures, there aren't enough bits to hold both.
64-bit architectures, however, can have enough bits to do that. For
x86-64, the physical address can use up to 52 bits. That is 4PB of
memory. That leaves 12 bits available for other use. The task structure
pointer is aligned to the L1 cache size. That means another 6 bits
(64 bytes cacheline) will be available. Reserving 2 bits for status
flags, we will have 16 bits for the reader count and the read fail bit.
That supports up to (32k-1) readers. Without 5-level page tables,
we can support up to (2M-1) readers.
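The bit accounting in the commit message can be sketched as plain arithmetic. The layout below is illustrative only: it follows the 5-level-page-table case (52 physical-address bits, 64-byte-aligned task_struct, 2 flag bits), uses all 16 remaining bits for the reader count (the real patch shares them with a read-fail bit), and is not the kernel's actual macros.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit budget for merging owner and count into one word. */
#define PA_BITS		52	/* significant physical-address bits */
#define ALIGN_BITS	6	/* 64-byte L1 cacheline alignment */
#define FLAG_BITS	2	/* status flag bits */
#define OWNER_BITS	(PA_BITS - ALIGN_BITS)		/* 46 owner bits */
#define COUNT_BITS	(64 - OWNER_BITS - FLAG_BITS)	/* 16 count bits */
#define COUNT_MASK	((1ULL << COUNT_BITS) - 1)

/* Merge an owner pointer and a reader count into one 64-bit word:
 * owner in the high 46 bits, count just above the 2 flag bits. */
uint64_t merge_owner_count(uint64_t owner, uint64_t readers)
{
	return ((owner >> ALIGN_BITS) << (COUNT_BITS + FLAG_BITS)) |
	       (readers << FLAG_BITS);
}

uint64_t owner_of(uint64_t word)
{
	return (word >> (COUNT_BITS + FLAG_BITS)) << ALIGN_BITS;
}

uint64_t readers_of(uint64_t word)
{
	return (word >> FLAG_BITS) & COUNT_MASK;
}
```

With both values in one atomic word, a single read yields a consistent owner/count pair, which is the inconsistency window the patch closes.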

The owner value will still be duplicated in the owner field as that
will ease debugging when looking at a core dump.

This change is currently enabled for x86-64 only. Other 64-bit
architectures may be enabled in the future if the need arises.

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) on a 8-socket IvyBridge-EX system with
writer-only locking threads and then equal numbers of readers and writers
(mixed) before patch and after this and subsequent related patches were
as follows:

                 Before Patch        After Patch
   # of Threads  wlock    mixed     wlock    mixed
   ------------  -----    -----     -----    -----
         1       30,422   31,034    30,323   30,379
         2        6,427    6,684     7,804    9,436
         4        6,742    6,738     7,568    8,268
         8        7,092    7,222     5,679    7,041
        16        6,882    7,163     6,848    7,652
        32        7,458    7,316     7,975    2,189
        64        7,906      520     8,269      534
       128        1,680      425     8,047      448

In the single thread case, the complex write-locking operation does
introduce a little bit of overhead (about 0.3%). For the contended cases,
except for some anomalies in the data, there is no evidence that this
change will adversely impact performance.

When running the same microbenchmark with RT locking threads instead,
we got the following results:

                 Before Patch        After Patch
   # of Threads  wlock    mixed     wlock    mixed
   ------------  -----    -----     -----    -----
         2        4,065    3,642     4,756    5,062
         4        2,254    1,907     3,460    2,496
         8        2,386      964     3,012    1,964
        16        2,095    1,596     3,083    1,862
        32        2,388      530     3,717      359
        64        1,424      322     4,060      401
       128        1,642      510     4,488      628

It is obvious that RT tasks can benefit pretty significantly with this set
of patches.

Signed-off-by: Waiman Long 
---
 arch/x86/Kconfig   |   6 +++
 kernel/Kconfig.locks   |  12 +
 kernel/locking/rwsem.c | 102 +
 3 files changed, 111 insertions(+), 9 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 7d160f58a8f6..82a8c02f1b44 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -80,6 +80,7 @@ config X86
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
+   select ARCH_USE_RWSEM_OWNER_COUNT   if X86_64
select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
select ARCH_WANTS_DYNAMIC_TASK_STRUCT
select ARCH_WANTS_THP_SWAP  if X86_64
@@ -350,6 +351,11 @@ config PGTABLE_LEVELS
default 3 if X86_PAE
default 2
 
+config RWSEM_OWNER_COUNT_PA_BITS
+   int
+   default 52 if X86_5LEVEL
+   default 46 if X86_64
+
 config CC_HAS_SANE_STACKPROTECTOR
bool
default $(success,$(srctree)/scripts/gcc-x86_64-has-stack-protector.sh 
$(CC)) if 64BIT
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index e335953fa704..3370ea21407b 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -251,3 +251,15 @@ config ARCH_USE_QUEUED_RWLOCKS
 config QUEUED_RWLOCKS
def_bool y if ARCH_USE_QUEUED_RWLOCKS
depends on SMP
+
+#
+# An architecture that want to merge rwsem write-owner into count should
+# select ARCH_USE_RWSEM_OWNER_COUNT and define RWSEM_OWNER_COUNT_PA_BITS
+# as the correct number of physical address bits.
+#
+config ARCH_USE_RWSEM_OWNER_COUNT
+   bool
+
+config RWSEM_OWNER_COUNT
+   def_bool y if ARCH_USE_RWSEM_OWNER_COUNT
+   depends on SMP && 64BIT
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 7d2dc1314208..2dc975f7eb8e 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -108,7 +108,38 @@
 #endif
 
 /*
- * On 64-bit architectures, the bit definitions of the count 

[PATCH v5 14/18] locking/rwsem: Adaptive disabling of reader optimistic spinning

2019-04-18 Thread Waiman Long
Reader optimistic spinning is helpful when the reader critical section
is short and there aren't that many readers around. It makes readers
relatively more preferred than writers. When a writer times out spinning
on a reader-owned lock and sets the nonspinnable bits, there are two main
reasons for that.

 1) The reader critical section is long, perhaps the task sleeps after
acquiring the read lock.
 2) There are just too many readers contending the lock causing it to
take a while to service all of them.

In the former case, a long reader critical section will impede the progress
of writers, which is usually more important for system performance. In
the latter case, reader optimistic spinning tends to make the reader
groups (readers that acquire the lock together) smaller, leading to more
of them. That may hurt performance in some cases. In other words, the
setting of the nonspinnable bits indicates that reader optimistic
spinning may not be helpful for those workloads that cause it.

Therefore, any writers that had observed the setting of the writer
nonspinnable bit for a given rwsem after they fail to acquire the lock
via optimistic spinning will set the reader nonspinnable bit once they
acquire the write lock. This is to discourage reader optimistic spinning
on that particular rwsem and make writers more preferred. This adaptive
disabling of reader optimistic spinning will alleviate some of the
negative side effect of this feature.
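The adaptive-disabling rule described above can be modeled in a few lines of userspace C. This is only an illustrative sketch, not the kernel code: the flag values mirror the patch series, but `struct sem_model`, `wr_nonspinnable()` and `writer_acquired()` are made-up names.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of adaptive disabling of reader
 * optimistic spinning.  The flag values mirror the patch series,
 * but this struct and these helpers are illustrative only. */
#define RWSEM_WR_NONSPINNABLE	(1UL << 1)
#define RWSEM_RD_NONSPINNABLE	(1UL << 2)

struct sem_model {
	unsigned long owner_flags;	/* low bits of the owner word */
};

/* Writers sample this while trying to optimistically spin. */
static bool wr_nonspinnable(const struct sem_model *s)
{
	return s->owner_flags & RWSEM_WR_NONSPINNABLE;
}

/*
 * A writer that observed the writer-nonspinnable bit after failing to
 * spin-acquire the lock disables reader spinning once it finally gets
 * the write lock, making writers more preferred from then on.
 */
static void writer_acquired(struct sem_model *s, bool saw_wr_nonspin)
{
	if (saw_wr_nonspin)
		s->owner_flags |= RWSEM_RD_NONSPINNABLE;
}
```

The point of the model is the one-way coupling: the reader-nonspinnable bit is only ever set as a consequence of the writer-nonspinnable bit having been observed.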

On a 2-socket 40-core 80-thread Skylake system, the page_fault1 test of
the will-it-scale benchmark was run with various number of threads. The
number of operations done before and after the patch were:

  Threads   Before patch   After patch   % change
  -------   ------------   -----------   --------
    20        5409075        5436456      +0.5%
    40        7174080        7903845     +10.2%
    60        6749707        7009784      +3.9%
    80        7071334        7353806      +4.0%

This doesn't recover all the lost performance, but is close to half. Given
the fact that reader optimistic spinning does benefit some workloads, this
is a good compromise.

Using the rwsem locking microbenchmark with very short critical section,
this patch also helps performance at high contention level as shown
by the locking rates (kops/s) below with equal numbers of readers and
writers before and after this patch:

   # of Threads   Pre-patch   Post-patch
   ------------   ---------   ----------
         2          4,472       4,839
         4          4,623       4,143
         8          4,764       4,126
        16          4,678       3,873
        32          2,847       3,263
        64          2,478       3,121
        80          2,222       3,104

Signed-off-by: Waiman Long 
---
 kernel/locking/lock_events_list.h |  9 +++---
 kernel/locking/rwsem.c| 52 +--
 2 files changed, 55 insertions(+), 6 deletions(-)

diff --git a/kernel/locking/lock_events_list.h 
b/kernel/locking/lock_events_list.h
index baa998401052..661326885800 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -56,10 +56,11 @@ LOCK_EVENT(rwsem_sleep_reader)  /* # of reader sleeps   
*/
 LOCK_EVENT(rwsem_sleep_writer) /* # of writer sleeps   */
 LOCK_EVENT(rwsem_wake_reader)  /* # of reader wakeups  */
 LOCK_EVENT(rwsem_wake_writer)  /* # of writer wakeups  */
-LOCK_EVENT(rwsem_opt_rlock)/* # of read locks opt-spin acquired*/
-LOCK_EVENT(rwsem_opt_wlock)/* # of write locks opt-spin acquired   */
-LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings*/
-LOCK_EVENT(rwsem_opt_nospin)   /* # of disabled reader opt-spinnings   */
+LOCK_EVENT(rwsem_opt_rlock)/* # of opt-acquired read locks */
+LOCK_EVENT(rwsem_opt_wlock)/* # of opt-acquired write locks*/
+LOCK_EVENT(rwsem_opt_fail) /* # of failed optspins */
+LOCK_EVENT(rwsem_opt_nospin)   /* # of disabled optspins   */
+LOCK_EVENT(rwsem_opt_norspin)  /* # of disabled reader-only optspins   */
 LOCK_EVENT(rwsem_rlock)/* # of read locks acquired 
*/
 LOCK_EVENT(rwsem_rlock_fast)   /* # of fast read locks acquired*/
 LOCK_EVENT(rwsem_rlock_fail)   /* # of failed read lock acquisitions   */
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index d08187e079be..80f6e75c92ad 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -59,6 +59,34 @@
  * seems to hang on a reader owned rwsem especially if only one reader
  * is involved. Ideally we would like to track all the readers that own
  * a rwsem, but the overhead is simply too big.
+ *
+ * Reader optimistic spinning is helpful when the reader critical section
+ * is short and there aren't that many readers around. It makes readers
+ * relatively more preferred than writers. When a writer times out spinning
+ * on a 

[PATCH v5 07/18] locking/rwsem: Implement lock handoff to prevent lock starvation

2019-04-18 Thread Waiman Long
Because of writer lock stealing, it is possible that a constant
stream of incoming writers will cause a waiting writer or reader to
wait indefinitely, leading to lock starvation.

This patch implements a lock handoff mechanism to disable lock stealing
and force lock handoff to the first waiter or waiters (for readers)
in the queue after at least a 4ms waiting period unless it is a RT
writer task which doesn't need to wait. The waiting period is used to
avoid discouraging lock stealing too much to affect performance.

The setting and clearing of the handoff bit is serialized by the
wait_lock. So racing is not possible.
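The waiting-period rule can be sketched as follows. This is an illustrative standalone helper, not the kernel implementation; with HZ=250, one jiffy is 4ms, which matches the grace period described above.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative sketch of the ~4ms handoff grace period. */
#define HZ			250
#define RWSEM_WAIT_TIMEOUT	((HZ + 249) / 250)	/* ~4 ms in jiffies */

/*
 * A waiter records its deadline (enqueue time + timeout) when queued;
 * it only requests a handoff once that deadline has passed, so lock
 * stealing is still permitted during the grace period.
 */
static bool may_set_handoff(unsigned long timeout_deadline,
			    unsigned long now_jiffies)
{
	/* simplified time_after(); ignores jiffies wraparound */
	return now_jiffies > timeout_deadline;
}
```

A waiter enqueued at jiffy 100 may only ask for handoff from jiffy 102 onward in this model, so writers arriving during the grace period can still steal the lock.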

A rwsem microbenchmark was run for 5 seconds on a 2-socket 40-core
80-thread Skylake system with a v5.1 based kernel and 240 write_lock
threads with 5us sleep critical section.

Before the patch, the min/mean/max numbers of locking operations for
the locking threads were 1/7,792/173,696. After the patch, the figures
became 5,842/6,542/7,458.  It can be seen that the rwsem became much
fairer, though there was a drop of about 16% in the mean number of
locking operations done, a tradeoff for the better fairness.

Making the waiter set the handoff bit right after the first wakeup can
impact performance especially with a mixed reader/writer workload. With
the same microbenchmark with short critical section and equal number of
reader and writer threads (40/40), the reader/writer locking operation
counts with the current patch were:

  40 readers, Iterations Min/Mean/Max = 1,793/1,794/1,796
  40 writers, Iterations Min/Mean/Max = 1,793/34,956/86,081

By making waiter set handoff bit immediately after wakeup:

  40 readers, Iterations Min/Mean/Max = 43/44/46
  40 writers, Iterations Min/Mean/Max = 43/1,263/3,191

Signed-off-by: Waiman Long 
---
 kernel/locking/lock_events_list.h |   2 +
 kernel/locking/rwsem.c| 221 ++
 2 files changed, 169 insertions(+), 54 deletions(-)

diff --git a/kernel/locking/lock_events_list.h 
b/kernel/locking/lock_events_list.h
index 11187a1d40b8..634b47fd8b5e 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -61,5 +61,7 @@ LOCK_EVENT(rwsem_opt_fail)/* # of failed opt-spinnings
*/
 LOCK_EVENT(rwsem_rlock)/* # of read locks acquired 
*/
 LOCK_EVENT(rwsem_rlock_fast)   /* # of fast read locks acquired*/
 LOCK_EVENT(rwsem_rlock_fail)   /* # of failed read lock acquisitions   */
+LOCK_EVENT(rwsem_rlock_handoff)/* # of read lock handoffs  
*/
 LOCK_EVENT(rwsem_wlock)/* # of write locks acquired
*/
 LOCK_EVENT(rwsem_wlock_fail)   /* # of failed write lock acquisitions  */
+LOCK_EVENT(rwsem_wlock_handoff)/* # of write lock handoffs 
*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index fb3493fef1fb..82cfc5a1c42d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -10,8 +10,9 @@
  * Optimistic spinning by Tim Chen 
  * and Davidlohr Bueso . Based on mutexes.
  *
- * Rwsem count bit fields re-definition and rwsem rearchitecture
- * by Waiman Long .
+ * Rwsem count bit fields re-definition and rwsem rearchitecture by
+ * Waiman Long  and
+ * Peter Zijlstra .
  */
 
 #include 
@@ -74,20 +75,33 @@
  *
  * Bit  0   - writer locked bit
  * Bit  1   - waiters present bit
- * Bits 2-7 - reserved
+ * Bit  2   - lock handoff bit
+ * Bits 3-7 - reserved
  * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
  *
  * atomic_long_fetch_add() is used to obtain reader lock, whereas
  * atomic_long_cmpxchg() will be used to obtain writer lock.
+ *
+ * There are three places where the lock handoff bit may be set or cleared.
+ * 1) __rwsem_mark_wake() for readers.
+ * 2) rwsem_try_write_lock() for writers.
+ * 3) Error path of __rwsem_down_write_failed_common().
+ *
+ * For all the above cases, wait_lock will be held. A writer must also
+ * be the first one in the wait_list to be eligible for setting the handoff
+ * bit. So concurrent setting/clearing of handoff bit is not possible.
  */
 #define RWSEM_WRITER_LOCKED(1UL << 0)
 #define RWSEM_FLAG_WAITERS (1UL << 1)
+#define RWSEM_FLAG_HANDOFF (1UL << 2)
+
 #define RWSEM_READER_SHIFT 8
 #define RWSEM_READER_BIAS  (1UL << RWSEM_READER_SHIFT)
 #define RWSEM_READER_MASK  (~(RWSEM_READER_BIAS - 1))
 #define RWSEM_WRITER_MASK  RWSEM_WRITER_LOCKED
 #define RWSEM_LOCK_MASK(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
-#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS)
+#define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
+RWSEM_FLAG_HANDOFF)
 
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -216,7 +230,10 @@ struct rwsem_waiter {
struct list_head list;
struct task_struct *task;
enum rwsem_waiter_type type;
+   unsigned long timeout;
 };
+#define 

[PATCH v5 04/18] locking/rwsem: Merge rwsem.h and rwsem-xadd.c into rwsem.c

2019-04-18 Thread Waiman Long
Now we only have one implementation of rwsem. Even though we still use
xadd to handle reader locking, we use cmpxchg for writer instead. So
the filename rwsem-xadd.c is no longer strictly correct. Also, no one
outside of the rwsem code needs to know the internal implementation
other than the function prototypes of the two internal functions that
are called directly from percpu-rwsem.c.

So the rwsem-xadd.c and rwsem.h files are now merged into rwsem.c in
the following order:

  
  
  
  

The rwsem.h file now contains only 2 function declarations for
__up_read() and __down_read().

This is a code relocation patch with no code change at all except
making __up_read() and __down_read() non-static functions so they
can be used by percpu-rwsem.c.

Suggested-by: Peter Zijlstra 
Signed-off-by: Waiman Long 
---
 kernel/locking/Makefile |   2 +-
 kernel/locking/rwsem-xadd.c | 608 -
 kernel/locking/rwsem.c  | 868 
 kernel/locking/rwsem.h  | 281 +---
 4 files changed, 875 insertions(+), 884 deletions(-)
 delete mode 100644 kernel/locking/rwsem-xadd.c

diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 6fe2f333aecb..45452facff3b 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -3,7 +3,7 @@
 # and is generally not a function of system call inputs.
 KCOV_INSTRUMENT:= n
 
-obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o rwsem-xadd.o
+obj-y += mutex.o semaphore.o rwsem.o percpu-rwsem.o
 
 ifdef CONFIG_FUNCTION_TRACER
 CFLAGS_REMOVE_lockdep.o = $(CC_FLAGS_FTRACE)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
deleted file mode 100644
index 54ca0a58e7f7..
--- a/kernel/locking/rwsem-xadd.c
+++ /dev/null
@@ -1,608 +0,0 @@
-// SPDX-License-Identifier: GPL-2.0
-/* rwsem.c: R/W semaphores: contention handling functions
- *
- * Written by David Howells (dhowe...@redhat.com).
- * Derived from arch/i386/kernel/semaphore.c
- *
- * Writer lock-stealing by Alex Shi 
- * and Michel Lespinasse 
- *
- * Optimistic spinning by Tim Chen 
- * and Davidlohr Bueso . Based on mutexes.
- *
- * Rwsem count bit fields re-definition by Waiman Long .
- */
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-#include 
-
-#include "rwsem.h"
-
-/*
- * Guide to the rw_semaphore's count field.
- *
- * When the RWSEM_WRITER_LOCKED bit in count is set, the lock is owned
- * by a writer.
- *
- * The lock is owned by readers when
- * (1) the RWSEM_WRITER_LOCKED isn't set in count,
- * (2) some of the reader bits are set in count, and
- * (3) the owner field has RWSEM_READ_OWNED bit set.
- *
- * Having some reader bits set is not enough to guarantee a readers owned
- * lock as the readers may be in the process of backing out from the count
- * and a writer has just released the lock. So another writer may steal
- * the lock immediately after that.
- */
-
-/*
- * Initialize an rwsem:
- */
-void __init_rwsem(struct rw_semaphore *sem, const char *name,
- struct lock_class_key *key)
-{
-#ifdef CONFIG_DEBUG_LOCK_ALLOC
-   /*
-* Make sure we are not reinitializing a held semaphore:
-*/
-   debug_check_no_locks_freed((void *)sem, sizeof(*sem));
-   lockdep_init_map(&sem->dep_map, name, key, 0);
-#endif
-   atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
-   raw_spin_lock_init(&sem->wait_lock);
-   INIT_LIST_HEAD(&sem->wait_list);
-   sem->owner = NULL;
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-   osq_lock_init(&sem->osq);
-#endif
-}
-
-EXPORT_SYMBOL(__init_rwsem);
-
-enum rwsem_waiter_type {
-   RWSEM_WAITING_FOR_WRITE,
-   RWSEM_WAITING_FOR_READ
-};
-
-struct rwsem_waiter {
-   struct list_head list;
-   struct task_struct *task;
-   enum rwsem_waiter_type type;
-};
-
-enum rwsem_wake_type {
-   RWSEM_WAKE_ANY, /* Wake whatever's at head of wait list */
-   RWSEM_WAKE_READERS, /* Wake readers only */
-   RWSEM_WAKE_READ_OWNED   /* Waker thread holds the read lock */
-};
-
-/*
- * handle the lock release when processes blocked on it that can now run
- * - if we come here from up_xxxx(), then the RWSEM_FLAG_WAITERS bit must
- *   have been set.
- * - there must be someone on the queue
- * - the wait_lock must be held by the caller
- * - tasks are marked for wakeup, the caller must later invoke wake_up_q()
- *   to actually wakeup the blocked task(s) and drop the reference count,
- *   preferably when the wait_lock is released
- * - woken process blocks are discarded from the list after having task zeroed
- * - writers are only marked woken if downgrading is false
- */
-static void __rwsem_mark_wake(struct rw_semaphore *sem,
- enum rwsem_wake_type wake_type,
- struct wake_q_head *wake_q)
-{
-   struct rwsem_waiter *waiter, *tmp;
-   long oldcount, woken = 0, adjustment = 0;
-
-   /*
-* Take a peek at the queue head waiter such 

[PATCH v5 16/18] locking/rwsem: Guard against making count negative

2019-04-18 Thread Waiman Long
The upper bits of the count field are used as the reader count. When a
sufficient number of active readers are present, the most significant
bit will be set and the count becomes negative. If the number of active
readers keeps piling up, we may eventually overflow the reader count.
This is not likely to happen unless the number of bits reserved for the
reader count is reduced because those bits are needed for other purposes.

To prevent this count overflow from happening, the most significant bit
is now treated as a guard bit (RWSEM_FLAG_READFAIL). Read-lock attempts
will now fail for both the fast and optimistic spinning paths whenever
this bit is set. So all those extra readers will be put to sleep in
the wait queue. Wakeup will not happen until the reader count reaches 0.
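The fastpath check can be illustrated with a small userspace model. The constants mirror the patch, but `read_trylock_fast()` is a made-up, non-atomic stand-in for the real fetch-add fastpath:

```c
#include <assert.h>
#include <stdbool.h>

#define RWSEM_READER_SHIFT	8
#define RWSEM_READER_BIAS	(1L << RWSEM_READER_SHIFT)

/*
 * Sketch of the guarded reader fastpath: add the reader bias, and if
 * the resulting count is negative (the read fail guard bit is set),
 * back the increment out and fall into the slowpath instead.  The real
 * code uses atomic_long_fetch_add_acquire() and also checks the other
 * RWSEM_READ_FAILED_MASK bits.
 */
static bool read_trylock_fast(long *count)
{
	long c = (*count += RWSEM_READER_BIAS);

	if (c < 0) {
		*count -= RWSEM_READER_BIAS;	/* go to slowpath */
		return false;
	}
	return true;
}
```

Backing the increment out before sleeping is what keeps the extra readers from pushing the count further toward overflow while they wait.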

Signed-off-by: Waiman Long 
---
 kernel/locking/rwsem.c | 73 +-
 1 file changed, 58 insertions(+), 15 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 327bf8295d5d..7d2dc1314208 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -108,13 +108,28 @@
 #endif
 
 /*
- * The definition of the atomic counter in the semaphore:
+ * On 64-bit architectures, the bit definitions of the count are:
  *
- * Bit  0   - writer locked bit
- * Bit  1   - waiters present bit
- * Bit  2   - lock handoff bit
- * Bits 3-7 - reserved
- * Bits 8-X - 24-bit (32-bit) or 56-bit reader count
+ * Bit  0    - writer locked bit
+ * Bit  1    - waiters present bit
+ * Bit  2    - lock handoff bit
+ * Bits 3-7  - reserved
+ * Bits 8-62 - 55-bit reader count
+ * Bit  63   - read fail bit
+ *
+ * On 32-bit architectures, the bit definitions of the count are:
+ *
+ * Bit  0    - writer locked bit
+ * Bit  1    - waiters present bit
+ * Bit  2    - lock handoff bit
+ * Bits 3-7  - reserved
+ * Bits 8-30 - 23-bit reader count
+ * Bit  31   - read fail bit
+ *
+ * It is not likely that the most significant bit (read fail bit) will ever
+ * be set. This guard bit is still checked anyway in the down_read() fastpath
+ * just in case we need to use up more of the reader bits for other purposes
+ * in the future.
  *
  * atomic_long_fetch_add() is used to obtain reader lock, whereas
  * atomic_long_cmpxchg() will be used to obtain writer lock.
@@ -131,6 +146,7 @@
 #define RWSEM_WRITER_LOCKED(1UL << 0)
 #define RWSEM_FLAG_WAITERS (1UL << 1)
 #define RWSEM_FLAG_HANDOFF (1UL << 2)
+#define RWSEM_FLAG_READFAIL(1UL << (BITS_PER_LONG - 1))
 
 #define RWSEM_READER_SHIFT 8
 #define RWSEM_READER_BIAS  (1UL << RWSEM_READER_SHIFT)
@@ -138,7 +154,7 @@
 #define RWSEM_WRITER_MASK  RWSEM_WRITER_LOCKED
 #define RWSEM_LOCK_MASK(RWSEM_WRITER_MASK|RWSEM_READER_MASK)
 #define RWSEM_READ_FAILED_MASK (RWSEM_WRITER_MASK|RWSEM_FLAG_WAITERS|\
-RWSEM_FLAG_HANDOFF)
+RWSEM_FLAG_HANDOFF|RWSEM_FLAG_READFAIL)
 
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
@@ -872,13 +888,36 @@ static inline void clear_wr_nonspinnable(struct 
rw_semaphore *sem) { }
  * Wait for the read lock to be granted
  */
 static struct rw_semaphore __sched *
-rwsem_down_read_slowpath(struct rw_semaphore *sem, int state)
+rwsem_down_read_slowpath(struct rw_semaphore *sem, int state, long count)
 {
-   long count, adjustment = -RWSEM_READER_BIAS;
+   long adjustment = -RWSEM_READER_BIAS;
bool wake = false;
struct rwsem_waiter waiter;
DEFINE_WAKE_Q(wake_q);
 
+   if (unlikely(count < 0)) {
+   /*
+* The sign bit has been set meaning that too many active
+* readers are present. We need to decrement reader count &
+* enter wait queue immediately to avoid overflowing the
+* reader count.
+*
+* As preemption is not disabled, there is a remote
+* possibility that preemption can happen in the narrow
+* timing window between incrementing and decrementing
+* the reader count and the task is put to sleep for a
+* considerable amount of time. If a sufficient number
+* of such unfortunate sequences of events happen, we
+* may still overflow the reader count. It is extremely
+* unlikely, though. If this is a concern, we should consider
+* disabling preemption during this timing window to make
+* sure that such an unfortunate event will not happen.
+*/
+   atomic_long_add(-RWSEM_READER_BIAS, &sem->count);
+   adjustment = 0;
+   goto queue;
+   }
+
if (!rwsem_can_spin_on_owner(sem, false))
goto queue;
 
@@ -1189,9 +1228,11 @@ static struct rw_semaphore *rwsem_downgrade_wake(struct 
rw_semaphore *sem)
  */
 inline void __down_read(struct rw_semaphore *sem)
 {
-   if 

[PATCH v5 15/18] locking/rwsem: Add more rwsem owner access helpers

2019-04-18 Thread Waiman Long
Before combining owner and count, add two new helpers for accessing
the owner value in the rwsem:

 1) struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
 2) bool is_rwsem_reader_owned(struct rw_semaphore *sem)

Signed-off-by: Waiman Long 
---
 kernel/locking/rwsem.c | 41 ++---
 1 file changed, 30 insertions(+), 11 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 80f6e75c92ad..327bf8295d5d 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -157,6 +157,11 @@ static inline void rwsem_clear_owner(struct rw_semaphore 
*sem)
WRITE_ONCE(sem->owner, NULL);
 }
 
+static inline struct task_struct *rwsem_get_owner(struct rw_semaphore *sem)
+{
+   return READ_ONCE(sem->owner);
+}
+
 /*
  * The task_struct pointer of the last owning reader will be left in
  * the owner field.
@@ -198,6 +203,23 @@ static inline bool is_rwsem_spinnable(struct rw_semaphore 
*sem, bool wr)
return is_rwsem_owner_spinnable(READ_ONCE(sem->owner), wr);
 }
 
+/*
+ * Return true if the rwsem is owned by a reader.
+ */
+static inline bool is_rwsem_reader_owned(struct rw_semaphore *sem)
+{
+#ifdef CONFIG_DEBUG_RWSEMS
+   /*
+* Check the count to see if it is write-locked.
+*/
+   long count = atomic_long_read(&sem->count);
+
+   if (count & RWSEM_WRITER_MASK)
+   return false;
+#endif
+   return (unsigned long)sem->owner & RWSEM_READER_OWNED;
+}
+
 #ifdef CONFIG_DEBUG_RWSEMS
 /*
  * With CONFIG_DEBUG_RWSEMS configured, it will make sure that if there
@@ -210,7 +232,7 @@ static inline void rwsem_clear_reader_owned(struct 
rw_semaphore *sem)
unsigned long owner = (unsigned long)READ_ONCE(sem->owner);
 
if ((owner & ~RWSEM_OWNER_FLAGS_MASK) == (unsigned long)current)
-   cmpxchg_relaxed((unsigned long *)&sem->owner, val,
+   cmpxchg_relaxed((unsigned long *)&sem->owner, owner,
owner & RWSEM_OWNER_FLAGS_MASK);
 }
 #else
@@ -565,7 +587,7 @@ static inline bool rwsem_can_spin_on_owner(struct 
rw_semaphore *sem, bool wr)
 
preempt_disable();
rcu_read_lock();
-   owner = READ_ONCE(sem->owner);
+   owner = rwsem_get_owner(sem);
if (owner) {
ret = is_rwsem_owner_spinnable(owner, wr) &&
  (((unsigned long)owner & RWSEM_READER_OWNED) ||
@@ -614,7 +636,7 @@ static inline enum owner_state rwsem_owner_state(unsigned 
long owner, bool wr)
 static noinline enum owner_state
 rwsem_spin_on_owner(struct rw_semaphore *sem, bool wr)
 {
-   struct task_struct *tmp, *owner = READ_ONCE(sem->owner);
+   struct task_struct *tmp, *owner = rwsem_get_owner(sem);
enum owner_state state = rwsem_owner_state((unsigned long)owner, wr);
 
if (state != OWNER_WRITER)
@@ -627,7 +649,7 @@ rwsem_spin_on_owner(struct rw_semaphore *sem, bool wr)
break;
}
 
-   tmp = READ_ONCE(sem->owner);
+   tmp = rwsem_get_owner(sem);
if (tmp != owner) {
state = rwsem_owner_state((unsigned long)tmp, wr);
break;
@@ -1170,8 +1192,7 @@ inline void __down_read(struct rw_semaphore *sem)
if (unlikely(atomic_long_fetch_add_acquire(RWSEM_READER_BIAS,
&sem->count) & RWSEM_READ_FAILED_MASK)) {
rwsem_down_read_slowpath(sem, TASK_UNINTERRUPTIBLE);
-   DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-   RWSEM_READER_OWNED), sem);
+   DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
} else {
rwsem_set_reader_owned(sem);
}
@@ -1183,8 +1204,7 @@ static inline int __down_read_killable(struct 
rw_semaphore *sem)
&sem->count) & RWSEM_READ_FAILED_MASK)) {
if (IS_ERR(rwsem_down_read_slowpath(sem, TASK_KILLABLE)))
return -EINTR;
-   DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner &
-   RWSEM_READER_OWNED), sem);
+   DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
} else {
rwsem_set_reader_owned(sem);
}
@@ -1255,7 +1275,7 @@ inline void __up_read(struct rw_semaphore *sem)
 {
long tmp;
 
-   DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED), 
sem);
+   DEBUG_RWSEMS_WARN_ON(!is_rwsem_reader_owned(sem), sem);
rwsem_clear_reader_owned(sem);
tmp = atomic_long_add_return_release(-RWSEM_READER_BIAS, &sem->count);
if (unlikely((tmp & (RWSEM_LOCK_MASK|RWSEM_FLAG_WAITERS)) ==
@@ -1469,8 +1489,7 @@ EXPORT_SYMBOL(down_write_killable_nested);
 
 void up_read_non_owner(struct rw_semaphore *sem)
 {
-   DEBUG_RWSEMS_WARN_ON(!((unsigned long)sem->owner & RWSEM_READER_OWNED),
-   sem);
+   

[PATCH v5 11/18] locking/rwsem: Clarify usage of owner's nonspinnable bit

2019-04-18 Thread Waiman Long
Bit 1 of sem->owner (RWSEM_ANONYMOUSLY_OWNED) is used to designate an
anonymous owner - readers or an anonymous writer. The setting of this
anonymous bit is used as an indicator that optimistic spinning cannot
be done on this rwsem.

With the upcoming reader optimistic spinning patches, a reader-owned
rwsem can be spun on for a limited period of time. We still need
this bit to indicate that a rwsem is nonspinnable, but the bit being
set no longer implies that the owner is unknown. So rename the bit
to RWSEM_NONSPINNABLE to clarify its meaning.
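In practice the rename only changes what the bit is taken to mean; the spinnability test itself stays a simple mask check, roughly as in this illustrative userspace sketch (not the kernel code):

```c
#include <assert.h>
#include <stdbool.h>

#define RWSEM_READER_OWNED	(1UL << 0)
#define RWSEM_NONSPINNABLE	(1UL << 1)	/* was RWSEM_ANONYMOUSLY_OWNED */

/*
 * Spinnability is now a property of its own bit: a set bit means
 * "do not spin", regardless of whether the owner happens to be known.
 */
static bool is_owner_spinnable(unsigned long owner)
{
	return !(owner & RWSEM_NONSPINNABLE);
}
```

A NULL owner (a free rwsem) and a plain writer-owned value both remain spinnable; only a value with the nonspinnable bit set refuses spinners.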

This patch also fixes a DEBUG_RWSEMS_WARN_ON() bug in __up_write().

Signed-off-by: Waiman Long 
---
 include/linux/rwsem.h  |  2 +-
 kernel/locking/rwsem.c | 43 +-
 2 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 148983e21d47..bb76e82398b2 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -50,7 +50,7 @@ struct rw_semaphore {
 };
 
 /*
- * Setting bit 1 of the owner field but not bit 0 will indicate
+ * Setting all bits of the owner field except bit 0 will indicate
  * that the rwsem is writer-owned with an unknown owner.
  */
 #define RWSEM_OWNER_UNKNOWN((struct task_struct *)-2L)
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 76c380b63b0c..80d6377e68f5 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -33,17 +33,18 @@
 /*
  * The least significant 2 bits of the owner value has the following
  * meanings when set.
- *  - RWSEM_READER_OWNED (bit 0): The rwsem is owned by readers
- *  - RWSEM_ANONYMOUSLY_OWNED (bit 1): The rwsem is anonymously owned,
- *i.e. the owner(s) cannot be readily determined. It can be reader
- *owned or the owning writer is indeterminate.
+ *  - Bit 0: RWSEM_READER_OWNED - The rwsem is owned by readers
+ *  - Bit 1: RWSEM_NONSPINNABLE - Waiters cannot spin on the rwsem
+ *The rwsem is anonymously owned, i.e. the owner(s) cannot be
+ *readily determined. It can be reader owned or the owning writer
+ *is indeterminate.
  *
  * When a writer acquires a rwsem, it puts its task_struct pointer
  * into the owner field. It is cleared after an unlock.
  *
  * When a reader acquires a rwsem, it will also puts its task_struct
  * pointer into the owner field with both the RWSEM_READER_OWNED and
- * RWSEM_ANONYMOUSLY_OWNED bits set. On unlock, the owner field will
+ * RWSEM_NONSPINNABLE bits set. On unlock, the owner field will
  * largely be left untouched. So for a free or reader-owned rwsem,
  * the owner value may contain information about the last reader that
  * acquires the rwsem. The anonymous bit is set because that particular
@@ -55,7 +56,8 @@
  * a rwsem, but the overhead is simply too big.
  */
 #define RWSEM_READER_OWNED (1UL << 0)
-#define RWSEM_ANONYMOUSLY_OWNED(1UL << 1)
+#define RWSEM_NONSPINNABLE (1UL << 1)
+#define RWSEM_OWNER_FLAGS_MASK (RWSEM_READER_OWNED | RWSEM_NONSPINNABLE)
 
 #ifdef CONFIG_DEBUG_RWSEMS
 # define DEBUG_RWSEMS_WARN_ON(c, sem)  do {\
@@ -132,7 +134,7 @@ static inline void __rwsem_set_reader_owned(struct 
rw_semaphore *sem,
struct task_struct *owner)
 {
unsigned long val = (unsigned long)owner | RWSEM_READER_OWNED
-| RWSEM_ANONYMOUSLY_OWNED;
+| RWSEM_NONSPINNABLE;
 
WRITE_ONCE(sem->owner, (struct task_struct *)val);
 }
@@ -144,20 +146,12 @@ static inline void rwsem_set_reader_owned(struct 
rw_semaphore *sem)
 
 /*
  * Return true if the a rwsem waiter can spin on the rwsem's owner
- * and steal the lock, i.e. the lock is not anonymously owned.
+ * and steal the lock.
  * N.B. !owner is considered spinnable.
  */
 static inline bool is_rwsem_owner_spinnable(struct task_struct *owner)
 {
-   return !((unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED);
-}
-
-/*
- * Return true if rwsem is owned by an anonymous writer or readers.
- */
-static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
-{
-   return (unsigned long)owner & RWSEM_ANONYMOUSLY_OWNED;
+   return !((unsigned long)owner & RWSEM_NONSPINNABLE);
 }
 
 #ifdef CONFIG_DEBUG_RWSEMS
@@ -170,10 +164,10 @@ static inline bool rwsem_has_anonymous_owner(struct 
task_struct *owner)
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
-  | RWSEM_ANONYMOUSLY_OWNED;
+  | RWSEM_NONSPINNABLE;
if (READ_ONCE(sem->owner) == (struct task_struct *)val)
cmpxchg_relaxed((unsigned long *)&sem->owner, val,
-   RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
+   RWSEM_READER_OWNED | RWSEM_NONSPINNABLE);
 }
 #else
 static inline 

[PATCH v5 13/18] locking/rwsem: Enable time-based spinning on reader-owned rwsem

2019-04-18 Thread Waiman Long
When the rwsem is owned by readers, writers stop optimistic spinning
simply because there is no easy way to figure out if all the readers
are actively running or not. However, there are scenarios where
the readers are unlikely to sleep and optimistic spinning can help
performance.

This patch provides a simple mechanism for spinning on a reader-owned
rwsem by a writer. It is a time threshold based spinning where the
allowable spinning time can vary from 10us to 25us depending on the
condition of the rwsem.

When the time threshold is exceeded, the nonspinnable bits will be set
in the owner field to indicate that no more optimistic spinning will
be allowed on this rwsem until it becomes writer-owned again. Not even
readers are allowed to acquire the reader-locked rwsem via optimistic
spinning, for fairness.

We also want a writer to acquire the lock after the readers hold the
lock for a relatively long time. In order to give preference to writers
under such a circumstance, the single RWSEM_NONSPINNABLE bit is now split
into two - one for reader and one for writer. When optimistic spinning
is disabled, both bits will be set. When the reader count drops down
to 0, the writer nonspinnable bit will be cleared to allow writers to
spin on the lock, but not the readers. When a writer acquires the lock,
it will write its own task structure pointer into sem->owner and clear
the reader nonspinnable bit in the process.
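The split of the single nonspinnable bit can be sketched as two small state transitions. This is again an illustrative model; the flag values mirror the patch but the helper names are made up:

```c
#include <assert.h>

#define RWSEM_RD_NONSPINNABLE	(1UL << 1)
#define RWSEM_WR_NONSPINNABLE	(1UL << 2)

/* Spinning on a reader-owned rwsem timed out: disable both kinds. */
static unsigned long disable_all_spinning(unsigned long owner_flags)
{
	return owner_flags | RWSEM_RD_NONSPINNABLE | RWSEM_WR_NONSPINNABLE;
}

/*
 * Last reader dropped the lock: let writers (but not readers) spin
 * again, giving writers preference after a long reader-held period.
 */
static unsigned long last_reader_gone(unsigned long owner_flags)
{
	return owner_flags & ~RWSEM_WR_NONSPINNABLE;
}
```

The asymmetry is the point: clearing only the writer bit on the last reader unlock is what biases the subsequent acquisition toward writers.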

The time taken for each iteration of the reader-owned rwsem spinning
loop varies. Below are sample minimum elapsed times for 16 iterations
of the loop.

  System                       Time for 16 Iterations
  ------                       ----------------------
  1-socket Skylake  ~800ns
  4-socket Broadwell~300ns
  2-socket ThunderX2 (arm64)~250ns

When the lock cacheline is contended, we can see up to almost 10X
increase in elapsed time.  So 25us will be at most 500, 1300 and 1600
iterations for each of the above systems.

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) on a 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:

   # of Threads   Pre-patch   Post-patch
   ------------   ---------   ----------
         2          1,759       6,684
         4          1,684       6,738
         8          1,074       7,222
        16            900       7,163
        32            458       7,316
        64            208         520
       128            168         425
       240            143         474

This patch gives a big boost in performance for mixed reader/writer
workloads.

With 32 locking threads, the rwsem lock event data were:

rwsem_opt_fail=79850
rwsem_opt_nospin=5069
rwsem_opt_rlock=597484
rwsem_opt_wlock=957339
rwsem_sleep_reader=57782
rwsem_sleep_writer=55663

With 64 locking threads, the data looked like:

rwsem_opt_fail=346723
rwsem_opt_nospin=6293
rwsem_opt_rlock=1127119
rwsem_opt_wlock=1400628
rwsem_sleep_reader=308201
rwsem_sleep_writer=72281

So a lot more threads acquired the lock in the slowpath and more threads
went to sleep.

Signed-off-by: Waiman Long 
---
 kernel/locking/lock_events_list.h |   1 +
 kernel/locking/rwsem.c| 218 --
 2 files changed, 180 insertions(+), 39 deletions(-)

diff --git a/kernel/locking/lock_events_list.h 
b/kernel/locking/lock_events_list.h
index ca954e4e00e4..baa998401052 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -59,6 +59,7 @@ LOCK_EVENT(rwsem_wake_writer) /* # of writer wakeups  
*/
 LOCK_EVENT(rwsem_opt_rlock)/* # of read locks opt-spin acquired*/
 LOCK_EVENT(rwsem_opt_wlock)/* # of write locks opt-spin acquired   */
 LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings*/
+LOCK_EVENT(rwsem_opt_nospin)   /* # of disabled reader opt-spinnings   */
 LOCK_EVENT(rwsem_rlock)/* # of read locks acquired 
*/
 LOCK_EVENT(rwsem_rlock_fast)   /* # of fast read locks acquired*/
 LOCK_EVENT(rwsem_rlock_fail)   /* # of failed read lock acquisitions   */
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 8d7210d2f8d7..d08187e079be 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -23,6 +23,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -31,24 +32,28 @@
 #include "lock_events.h"
 
 /*
- * The least significant 2 bits of the owner value has the following
+ * The least significant 3 bits of the owner value have the following
  * meanings when set.
  *  - Bit 0: RWSEM_READER_OWNED - The rwsem is owned by readers
- *  - Bit 1: RWSEM_NONSPINNABLE - Waiters cannot spin on the rwsem
- *The rwsem is anonymously owned, i.e. the owner(s) cannot be
- *readily determined. It can be reader owned or the owning writer
- *is indeterminate.
+ *  - Bit 1: RWSEM_RD_NONSPINNABLE - Readers 

[PATCH v5 06/18] locking/rwsem: Make rwsem_spin_on_owner() return owner state

2019-04-18 Thread Waiman Long
This patch modifies rwsem_spin_on_owner() to return four possible
values to better reflect the state of the lock holder, which enables
us to make a better decision about what to do next.

Signed-off-by: Waiman Long 
---
 kernel/locking/rwsem.c | 65 ++
 1 file changed, 47 insertions(+), 18 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 467279ffbd4c..fb3493fef1fb 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -398,17 +398,54 @@ static inline bool rwsem_can_spin_on_owner(struct 
rw_semaphore *sem)
 }
 
 /*
- * Return true only if we can still spin on the owner field of the rwsem.
+ * The rwsem_spin_on_owner() function returns the following 4 values
+ * depending on the lock owner state.
+ *   OWNER_NULL  : owner is currently NULL
+ *   OWNER_WRITER: when owner changes and is a writer
+ *   OWNER_READER: when owner changes and the new owner may be a reader.
+ *   OWNER_NONSPINNABLE:
+ *when optimistic spinning has to stop because either the
+ *owner stops running, is unknown, or its timeslice has
+ *been used up.
  */
-static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
+enum owner_state {
+   OWNER_NULL  = 1 << 0,
+   OWNER_WRITER= 1 << 1,
+   OWNER_READER= 1 << 2,
+   OWNER_NONSPINNABLE  = 1 << 3,
+};
+#define OWNER_SPINNABLE(OWNER_NULL | OWNER_WRITER)
+
+static inline enum owner_state rwsem_owner_state(unsigned long owner)
 {
-   struct task_struct *owner = READ_ONCE(sem->owner);
+   if (!owner)
+   return OWNER_NULL;
 
-   if (!is_rwsem_owner_spinnable(owner))
-   return false;
+   if (owner & RWSEM_ANONYMOUSLY_OWNED)
+   return OWNER_NONSPINNABLE;
+
+   if (owner & RWSEM_READER_OWNED)
+   return OWNER_READER;
+
+   return OWNER_WRITER;
+}
+
+static noinline enum owner_state rwsem_spin_on_owner(struct rw_semaphore *sem)
+{
+   struct task_struct *tmp, *owner = READ_ONCE(sem->owner);
+   enum owner_state state = rwsem_owner_state((unsigned long)owner);
+
+   if (state != OWNER_WRITER)
+   return state;
 
rcu_read_lock();
-   while (owner && (READ_ONCE(sem->owner) == owner)) {
+   for (;;) {
+   tmp = READ_ONCE(sem->owner);
+   if (tmp != owner) {
+   state = rwsem_owner_state((unsigned long)tmp);
+   break;
+   }
+
/*
 * Ensure we emit the owner->on_cpu, dereference _after_
 * checking sem->owner still matches owner, if that fails,
@@ -417,24 +454,16 @@ static noinline bool rwsem_spin_on_owner(struct 
rw_semaphore *sem)
 */
barrier();
 
-   /*
-* abort spinning when need_resched or owner is not running or
-* owner's cpu is preempted.
-*/
if (need_resched() || !owner_on_cpu(owner)) {
-   rcu_read_unlock();
-   return false;
+   state = OWNER_NONSPINNABLE;
+   break;
}
 
cpu_relax();
}
rcu_read_unlock();
 
-   /*
-* If there is a new owner or the owner is not set, we continue
-* spinning.
-*/
-   return is_rwsem_owner_spinnable(READ_ONCE(sem->owner));
+   return state;
 }
 
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
@@ -457,7 +486,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 *  2) readers own the lock as we can't determine if they are
 * actively running or not.
 */
-   while (rwsem_spin_on_owner(sem)) {
+   while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
/*
 * Try to acquire the lock
 */
-- 
2.18.1



[PATCH v5 09/18] locking/rwsem: More optimal RT task handling of null owner

2019-04-18 Thread Waiman Long
An RT task can do optimistic spinning only if the lock holder is
actually running. If the state of the lock holder isn't known, there
is a possibility that the high priority of the RT task may block forward
progress of the lock holder if it happens to reside on the same CPU.
This will lead to deadlock. So we have to make sure that an RT task
will not spin on a reader-owned rwsem.

When the owner is temporarily set to NULL, there are two cases
where we may want to continue spinning:

 1) The lock owner is in the process of releasing the lock, sem->owner
is cleared but the lock has not been released yet.

 2) The lock was free and owner cleared, but another task just comes
in and acquires the lock before we try to get it. The new owner may
be a spinnable writer.

So an RT task is now made to retry one more time to see if it can
acquire the lock or continue spinning on the new owning writer.

When testing on an 8-socket IvyBridge-EX system, the one additional retry
seems to improve locking performance of RT write locking threads under
heavy contentions. The table below shows the locking rates (in kops/s)
with various write locking threads before and after the patch.

Locking threads Pre-patch Post-patch
--- - ---
4 2,753  2,608
8 2,529  2,520
   16 1,727  1,918
   32 1,263  1,956
   64   889  1,343

Signed-off-by: Waiman Long 
---
 kernel/locking/rwsem.c | 51 --
 1 file changed, 44 insertions(+), 7 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 087bbef2089e..97c4e92482be 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -545,6 +545,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct 
rw_semaphore *sem)
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 {
bool taken = false;
+   int prev_owner_state = OWNER_NULL;
 
preempt_disable();
 
@@ -562,7 +563,12 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 *  2) readers own the lock as we can't determine if they are
 * actively running or not.
 */
-   while (rwsem_spin_on_owner(sem) & OWNER_SPINNABLE) {
+   for (;;) {
+   enum owner_state owner_state = rwsem_spin_on_owner(sem);
+
+   if (!(owner_state & OWNER_SPINNABLE))
+   break;
+
/*
 * Try to acquire the lock
 */
@@ -572,13 +578,44 @@ static bool rwsem_optimistic_spin(struct rw_semaphore 
*sem)
}
 
/*
-* When there's no owner, we might have preempted between the
-* owner acquiring the lock and setting the owner field. If
-* we're an RT task that will live-lock because we won't let
-* the owner complete.
+* An RT task cannot do optimistic spinning if it cannot
+* be sure the lock holder is running or live-lock may
+* happen if the current task and the lock holder happen
+* to run in the same CPU. However, aborting optimistic
+* spinning while a NULL owner is detected may miss some
+* opportunity where spinning can continue without causing
+* problem.
+*
+* There are 2 possible cases where an RT task may be able
+* to continue spinning.
+*
+* 1) The lock owner is in the process of releasing the
+*lock, sem->owner is cleared but the lock has not
+*been released yet.
+* 2) The lock was free and owner cleared, but another
+*task just comes in and acquires the lock before
+*we try to get it. The new owner may be a spinnable
+*writer.
+*
+* To take advantage of the two scenarios listed above, the RT
+* task is made to retry one more time to see if it can
+* acquire the lock or continue spinning on the new owning
+* writer. Of course, if the time lag is long enough or the
+* new owner is not a writer or spinnable, the RT task will
+* quit spinning.
+*
+* If the owner is a writer, the need_resched() check is
+* done inside rwsem_spin_on_owner(). If the owner is not
+* a writer, need_resched() check needs to be done here.
 */
-   if (!sem->owner && (need_resched() || rt_task(current)))
-   break;
+   if (owner_state != OWNER_WRITER) {
+   if (need_resched())
+   break;
+ 

[PATCH v5 03/18] locking/rwsem: Implement a new locking scheme

2019-04-18 Thread Waiman Long
The current way of using various reader, writer and waiting biases
in the rwsem code is confusing and hard to understand. I have to
reread the rwsem count guide in the rwsem-xadd.c file from time to
time to remind myself how this whole thing works. It also makes the
rwsem code harder to optimize.

To make rwsem more sane, a new locking scheme similar to the one in
qrwlock is now being used.  The atomic long count has the following
bit definitions:

  Bit  0   - writer locked bit
  Bit  1   - waiters present bit
  Bits 2-7 - reserved for future extension
  Bits 8-X - reader count (24/56 bits)

The cmpxchg instruction is now used to acquire the write lock. The read
lock is still acquired with xadd instruction, so there is no change here.
This scheme will allow up to 16M/64P active readers which should be
more than enough. We can always use some more reserved bits if necessary.

With that change, we can deterministically know if a rwsem has been
write-locked. Looking at the count alone, however, one cannot determine
for certain if a rwsem is owned by readers or not as the readers that
set the reader count bits may be in the process of backing out. So we
still need the reader-owned bit in the owner field to be sure.

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) of the benchmark on an 8-socket 120-core
IvyBridge-EX system before and after the patch were as follows:

                   Before Patch          After Patch
   # of Threads   wlock     rlock      wlock     rlock
   ------------   -----     -----      -----     -----
        1        30,659    31,341     31,055    31,283
        2         8,909    16,457      9,884    17,659
        4         9,028    15,823      8,933    20,233
        8         8,410    14,212      7,230    17,140
       16         8,217    25,240      7,479    24,607

The locking rates of the benchmark on a Power8 system were as follows:

                   Before Patch          After Patch
   # of Threads   wlock     rlock      wlock     rlock
   ------------   -----     -----      -----     -----
        1        12,963    13,647     13,275    13,601
        2         7,570    11,569      7,902    10,829
        4         5,232     5,516      5,466     5,435
        8         5,233     3,386      5,467     3,168

The locking rates of the benchmark on a 2-socket ARM64 system were
as follows:

                   Before Patch          After Patch
   # of Threads   wlock     rlock      wlock     rlock
   ------------   -----     -----      -----     -----
        1        21,495    21,046     21,524    21,074
        2         5,293    10,502      5,333    10,504
        4         5,325    11,463      5,358    11,631
        8         5,391    11,712      5,470    11,680

The performance is roughly the same before and after the patch. There
are run-to-run variations in performance. Runs with higher variances
usually have higher throughput.

Signed-off-by: Waiman Long 
---
 kernel/locking/rwsem-xadd.c | 147 
 kernel/locking/rwsem.h  |  74 +-
 2 files changed, 85 insertions(+), 136 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 98de7f0cfedd..54ca0a58e7f7 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -9,6 +9,8 @@
  *
  * Optimistic spinning by Tim Chen 
  * and Davidlohr Bueso . Based on mutexes.
+ *
+ * Rwsem count bit fields re-definition by Waiman Long .
  */
 #include 
 #include 
@@ -22,52 +24,20 @@
 #include "rwsem.h"
 
 /*
- * Guide to the rw_semaphore's count field for common values.
- * (32-bit case illustrated, similar for 64-bit)
- *
- * 0x000X  (1) X readers active or attempting lock, no writer waiting
- * X = #active_readers + #readers attempting to lock
- * (X*ACTIVE_BIAS)
- *
- * 0x  rwsem is unlocked, and no one is waiting for the lock or
- * attempting to read lock or write lock.
- *
- * 0x000X  (1) X readers active or attempting lock, with waiters for lock
- * X = #active readers + # readers attempting lock
- * (X*ACTIVE_BIAS + WAITING_BIAS)
- * (2) 1 writer attempting lock, no waiters for lock
- * X-1 = #active readers + #readers attempting lock
- * ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- * (3) 1 writer active, no waiters for lock
- * X-1 = #active readers + #readers attempting lock
- * ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- *
- * 0x0001  (1) 1 reader active or attempting lock, waiters for lock
- * (WAITING_BIAS + ACTIVE_BIAS)
- * (2) 1 writer active or attempting lock, no waiters for lock
- * (ACTIVE_WRITE_BIAS)
+ * Guide to the rw_semaphore's count field.
  *
- * 0x  (1) There are writers or readers queued but none active
- * or in the process of attempting lock.
- * (WAITING_BIAS)
- * Note: writer 

[PATCH v5 08/18] locking/rwsem: Always release wait_lock before waking up tasks

2019-04-18 Thread Waiman Long
With the use of wake_q, we can do task wakeups without holding the
wait_lock. There is one exception in the rwsem code, though. It is
when the writer in the slowpath detects that there are waiters ahead
but the rwsem is not held by a writer. This can lead to a long wait_lock
hold time especially when a large number of readers are to be woken up.

Remediate this situation by releasing the wait_lock before waking up
tasks and re-acquiring it afterward.

Suggested-by: Peter Zijlstra 
Signed-off-by: Waiman Long 
---
 include/linux/sched/wake_q.h |  5 +
 kernel/locking/rwsem.c   | 27 ---
 2 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched/wake_q.h b/include/linux/sched/wake_q.h
index ad826d2a4557..26a2013ac39c 100644
--- a/include/linux/sched/wake_q.h
+++ b/include/linux/sched/wake_q.h
@@ -51,6 +51,11 @@ static inline void wake_q_init(struct wake_q_head *head)
head->lastp = >first;
 }
 
+static inline bool wake_q_empty(struct wake_q_head *head)
+{
+   return head->first == WAKE_Q_TAIL;
+}
+
 extern void wake_q_add(struct wake_q_head *head, struct task_struct *task);
 extern void wake_q_add_safe(struct wake_q_head *head, struct task_struct 
*task);
 extern void wake_up_q(struct wake_q_head *head);
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 82cfc5a1c42d..087bbef2089e 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -731,17 +731,22 @@ rwsem_down_write_slowpath(struct rw_semaphore *sem, int 
state)
? RWSEM_WAKE_READERS
: RWSEM_WAKE_ANY, _q);
 
-   /*
-* The wakeup is normally called _after_ the wait_lock
-* is released, but given that we are proactively waking
-* readers we can deal with the wake_q overhead as it is
-* similar to releasing and taking the wait_lock again
-* for attempting rwsem_try_write_lock().
-*/
-   wake_up_q(_q);
-
-   /* We need wake_q again below, reinitialize */
-   wake_q_init(_q);
+   if (!wake_q_empty(_q)) {
+   /*
+* We want to minimize wait_lock hold time especially
+* when a large number of readers are to be woken up.
+*/
+   raw_spin_unlock_irq(>wait_lock);
+   wake_up_q(_q);
+   wake_q_init(_q);   /* Used again, reinit */
+   raw_spin_lock_irq(>wait_lock);
+   /*
+* This waiter may have become first in the wait
+* list after re-acquiring the wait_lock. The
+* rwsem_first_waiter() test in the main while
+* loop below will correctly detect that.
+*/
+   }
} else {
count = atomic_long_add_return(RWSEM_FLAG_WAITERS, >count);
}
-- 
2.18.1



[PATCH v5 02/18] locking/rwsem: Remove rwsem_wake() wakeup optimization

2019-04-18 Thread Waiman Long
With the commit 59aabfc7e959 ("locking/rwsem: Reduce spinlock contention
in wakeup after up_read()/up_write()"), the rwsem_wake() forgoes doing
a wakeup if the wait_lock cannot be directly acquired and an optimistic
spinning locker is present.  This can help performance by avoiding
spinning on the wait_lock when it is contended.

With the later commit 133e89ef5ef3 ("locking/rwsem: Enable lockless
waiter wakeup(s)"), the performance advantage of the above optimization
diminishes as the average wait_lock hold time becomes much shorter.

With a later patch that supports rwsem lock handoff, we can no
longer rely on the fact that the presence of an optimistic spinning
locker will ensure that the lock will be acquired by a task soon and
rwsem_wake() will be called later on to wake up waiters. This can lead
to missed wakeup and application hang. So the commit 59aabfc7e959
("locking/rwsem: Reduce spinlock contention in wakeup after
up_read()/up_write()") will have to be reverted.

Signed-off-by: Waiman Long 
---
 kernel/locking/rwsem-xadd.c | 72 -
 1 file changed, 72 deletions(-)

diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 7fd4f1de794a..98de7f0cfedd 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -395,25 +395,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore 
*sem)
lockevent_cond_inc(rwsem_opt_fail, !taken);
return taken;
 }
-
-/*
- * Return true if the rwsem has active spinner
- */
-static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
-{
-   return osq_is_locked(>osq);
-}
-
 #else
 static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
 {
return false;
 }
-
-static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
-{
-   return false;
-}
 #endif
 
 /*
@@ -635,65 +621,7 @@ struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem)
unsigned long flags;
DEFINE_WAKE_Q(wake_q);
 
-   /*
-   * __rwsem_down_write_failed_common(sem)
-   *   rwsem_optimistic_spin(sem)
-   * osq_unlock(sem->osq)
-   *   ...
-   *   atomic_long_add_return(>count)
-   *
-   *  - VS -
-   *
-   *  __up_write()
-   *if (atomic_long_sub_return_release(>count) < 0)
-   *  rwsem_wake(sem)
-   *osq_is_locked(>osq)
-   *
-   * And __up_write() must observe !osq_is_locked() when it observes the
-   * atomic_long_add_return() in order to not miss a wakeup.
-   *
-   * This boils down to:
-   *
-   * [S.rel] X = 1[RmW] r0 = (Y += 0)
-   * MB RMB
-   * [RmW]   Y += 1   [L]   r1 = X
-   *
-   * exists (r0=1 /\ r1=0)
-   */
-   smp_rmb();
-
-   /*
-* If a spinner is present, it is not necessary to do the wakeup.
-* Try to do wakeup only if the trylock succeeds to minimize
-* spinlock contention which may introduce too much delay in the
-* unlock operation.
-*
-*spinning writer   up_write/up_read caller
-*---   ---
-* [S]   osq_unlock()   [L]   osq
-*   MB   RMB
-* [RmW] rwsem_try_write_lock() [RmW] spin_trylock(wait_lock)
-*
-* Here, it is important to make sure that there won't be a missed
-* wakeup while the rwsem is free and the only spinning writer goes
-* to sleep without taking the rwsem. Even when the spinning writer
-* is just going to break out of the waiting loop, it will still do
-* a trylock in rwsem_down_write_failed() before sleeping. IOW, if
-* rwsem_has_spinner() is true, it will guarantee at least one
-* trylock attempt on the rwsem later on.
-*/
-   if (rwsem_has_spinner(sem)) {
-   /*
-* The smp_rmb() here is to make sure that the spinner
-* state is consulted before reading the wait_lock.
-*/
-   smp_rmb();
-   if (!raw_spin_trylock_irqsave(>wait_lock, flags))
-   return sem;
-   goto locked;
-   }
raw_spin_lock_irqsave(>wait_lock, flags);
-locked:
 
if (!list_empty(>wait_list))
__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, _q);
-- 
2.18.1



[PATCH v5 12/18] locking/rwsem: Enable readers spinning on writer

2019-04-18 Thread Waiman Long
This patch enables readers to optimistically spin on a
rwsem when it is owned by a writer instead of going to sleep
directly.  The rwsem_can_spin_on_owner() function is extracted
out of rwsem_optimistic_spin() and is called directly by
__rwsem_down_read_failed_common() and __rwsem_down_write_failed_common().

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with equal
numbers of readers and writers before and after the patch were as
follows:

   # of Threads  Pre-patch  Post-patch
   ------------  ---------  ----------
         4         1,674      1,684
         8         1,062      1,074
        16           924        900
        32           300        458
        64           195        208
       128           164        168
       240           149        143

The performance change wasn't significant in this case, but this change
is required by a follow-on patch.

Signed-off-by: Waiman Long 
---
 kernel/locking/lock_events_list.h |  1 +
 kernel/locking/rwsem.c| 86 ++-
 2 files changed, 75 insertions(+), 12 deletions(-)

diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
index 634b47fd8b5e..ca954e4e00e4 100644
--- a/kernel/locking/lock_events_list.h
+++ b/kernel/locking/lock_events_list.h
@@ -56,6 +56,7 @@ LOCK_EVENT(rwsem_sleep_reader)/* # of reader sleeps   
*/
 LOCK_EVENT(rwsem_sleep_writer) /* # of writer sleeps   */
 LOCK_EVENT(rwsem_wake_reader)  /* # of reader wakeups  */
 LOCK_EVENT(rwsem_wake_writer)  /* # of writer wakeups  */
+LOCK_EVENT(rwsem_opt_rlock)/* # of read locks opt-spin acquired*/
 LOCK_EVENT(rwsem_opt_wlock)/* # of write locks opt-spin acquired   */
 LOCK_EVENT(rwsem_opt_fail) /* # of failed opt-spinnings*/
 LOCK_EVENT(rwsem_rlock)/* # of read locks acquired 
*/
diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 80d6377e68f5..8d7210d2f8d7 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -435,6 +435,30 @@ static inline bool rwsem_try_write_lock(long count, struct 
rw_semaphore *sem,
 }
 
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
+/*
+ * Try to acquire read lock before the reader is put on wait queue.
+ * Lock acquisition isn't allowed if the rwsem is locked or a writer handoff
+ * is ongoing.
+ */
+static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
+{
+   long count = atomic_long_read(>count);
+
+   if (count & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))
+   return false;
+
+   count = atomic_long_fetch_add_acquire(RWSEM_READER_BIAS, >count);
+   if (!(count & (RWSEM_WRITER_MASK | RWSEM_FLAG_HANDOFF))) {
+   rwsem_set_reader_owned(sem);
+   lockevent_inc(rwsem_opt_rlock);
+   return true;
+   }
+
+   /* Back out the change */
+   atomic_long_add(-RWSEM_READER_BIAS, >count);
+   return false;
+}
+
 /*
  * Try to acquire write lock before the writer has been put on wait queue.
  */
@@ -469,9 +493,12 @@ static inline bool rwsem_can_spin_on_owner(struct 
rw_semaphore *sem)
 
BUILD_BUG_ON(is_rwsem_owner_spinnable(RWSEM_OWNER_UNKNOWN));
 
-   if (need_resched())
+   if (need_resched()) {
+   lockevent_inc(rwsem_opt_fail);
return false;
+   }
 
+   preempt_disable();
rcu_read_lock();
owner = READ_ONCE(sem->owner);
if (owner) {
@@ -479,6 +506,9 @@ static inline bool rwsem_can_spin_on_owner(struct 
rw_semaphore *sem)
  owner_on_cpu(owner);
}
rcu_read_unlock();
+   preempt_enable();
+
+   lockevent_cond_inc(rwsem_opt_fail, !ret);
return ret;
 }
 
@@ -556,7 +586,7 @@ static noinline enum owner_state rwsem_spin_on_owner(struct 
rw_semaphore *sem)
return state;
 }
 
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem, bool wlock)
 {
bool taken = false;
int prev_owner_state = OWNER_NULL;
@@ -564,9 +594,6 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
preempt_disable();
 
/* sem->wait_lock should not be held when doing optimistic spinning */
-   if (!rwsem_can_spin_on_owner(sem))
-   goto done;
-
if (!osq_lock(>osq))
goto done;
 
@@ -586,10 +613,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore 
*sem)
/*
 * Try to acquire the lock
 */
-   if (rwsem_try_write_lock_unqueued(sem)) {
-   taken = true;
+   taken = wlock ? rwsem_try_write_lock_unqueued(sem)
+ : rwsem_try_read_lock_unqueued(sem);
+
+   if (taken)
break;
- 

[PATCH v5 10/18] locking/rwsem: Wake up almost all readers in wait queue

2019-04-18 Thread Waiman Long
When the front of the wait queue is a reader, other readers
immediately following the first reader will also be woken up at the
same time. However, if there is a writer in between, the readers
behind the writer will not be woken up.

Because of optimistic spinning, the lock acquisition order is not FIFO
anyway. The lock handoff mechanism will ensure that lock starvation
will not happen.

Assuming that the lock hold times of the other readers still in the
queue will be about the same as the readers that are being woken up,
there is really not much additional cost other than the additional
latency due to the wakeup of additional tasks by the waker. Therefore
all the readers up to a maximum of 256 in the queue are woken up when
the first waiter is a reader to improve reader throughput. This is
somewhat similar in concept to a phase-fair R/W lock.

With a locking microbenchmark running on 5.1 based kernel, the total
locking rates (in kops/s) on an 8-socket IvyBridge-EX system with
equal numbers of readers and writers before and after this patch were
as follows:

   # of Threads  Pre-Patch  Post-patch
   ------------  ---------  ----------
         4         1,641      1,674
         8           731      1,062
        16           564        924
        32            78        300
        64            38        195
       240            50        149

There is no performance gain at low contention level. At high contention
level, however, this patch gives a pretty decent performance boost.

Signed-off-by: Waiman Long 
---
 kernel/locking/rwsem.c | 26 +++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
index 97c4e92482be..76c380b63b0c 100644
--- a/kernel/locking/rwsem.c
+++ b/kernel/locking/rwsem.c
@@ -254,6 +254,14 @@ enum writer_wait_state {
  */
 #define RWSEM_WAIT_TIMEOUT DIV_ROUND_UP(HZ, 250)
 
+/*
+ * Magic number to batch-wakeup waiting readers, even when writers are
+ * also present in the queue. This both limits the amount of work the
+ * waking thread must do and also prevents any potential counter overflow,
+ * however unlikely.
+ */
+#define MAX_READERS_WAKEUP 0x100
+
 /*
  * handle the lock release when processes blocked on it that can now run
  * - if we come here from up_(), then the RWSEM_FLAG_WAITERS bit must
@@ -328,16 +336,22 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
}
 
/*
-* Grant an infinite number of read locks to the readers at the front
-* of the queue. We know that woken will be at least 1 as we accounted
+* Grant up to MAX_READERS_WAKEUP read locks to all the readers in the
+* queue. We know that the woken will be at least 1 as we accounted
 * for above. Note we increment the 'active part' of the count by the
 * number of readers before waking any processes up.
+*
+* This is an adaptation of the phase-fair R/W locks where at the
+* reader phase (first waiter is a reader), all readers are eligible
+* to acquire the lock at the same time irrespective of their order
+* in the queue. The writers acquire the lock according to their
+* order in the queue.
 */
list_for_each_entry_safe(waiter, tmp, >wait_list, list) {
struct task_struct *tsk;
 
if (waiter->type == RWSEM_WAITING_FOR_WRITE)
-   break;
+   continue;
 
woken++;
tsk = waiter->task;
@@ -356,6 +370,12 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
 * after setting the reader waiter to nil.
 */
wake_q_add_safe(wake_q, tsk);
+
+   /*
+* Limit # of readers that can be woken up per wakeup call.
+*/
+   if (woken >= MAX_READERS_WAKEUP)
+   break;
}
 
adjustment = woken * RWSEM_READER_BIAS - adjustment;
-- 
2.18.1



[PATCH v5 00/18] locking/rwsem: Rwsem rearchitecture part 2

2019-04-18 Thread Waiman Long
 v5:
  - Drop v4 patch 1 as it is merged into tip's locking/core branch.
  - Integrate the 2 followup patches into the series. The first
follow-up patch is broken into 2 pieces. The first piece comes in
before the "Enable readers spinning on writer" and the 2nd piece
is merged into the "Enable time-based spinning on reader-owned
rwsem" patch. The 2nd followup patch is added after that.
  - Add a new patch to make all wake_up_q() calls after dropping
wait_lock as suggested by PeterZ.
  - Incorporate numerous suggestions by PeterZ and Davidlohr.

 v4:
  - Fix the missing initialization bug with !CONFIG_RWSEM_SPIN_ON_OWNER
in patch 2.
  - Move the "Remove rwsem_wake() wakeup optimization" patch before
the "Implement a new locking scheme" patch.
  - Add two new patches to merge the relevant content of rwsem.h and
rwsem-xadd.c into rwsem.c as suggested by PeterZ.
  - Refactor the lock handoff patch to make all setting and clearing
of the handoff bit serialized by wait_lock to ensure correctness.
  - Adapt the rest of the patches to the new code base.

 v3:
  - Add 2 more patches in front to fix build and testing issues found.
Patch 1 can actually be merged on top of the patch "locking/rwsem:
Enhance DEBUG_RWSEMS_WARN_ON() macro" in part 1.
  - Change the handoff patch (now patch 4) to set handoff bit immediately
after wakeup for RT writers. The timeout limit is also tightened to
4ms.
  - There is no code changes in other patches other than resolving conflicts
with patches 1, 2 and 4.

 v2:
  - Move the negative reader count checking patch (patch 12->10)
forward to before the merge owner to count patch as suggested by
Linus & expand the comment.
  - Change the reader-owned rwsem spinning from count based to time
based to have better control of the max time allowed.

This is part 2 of a 3-part (0/1/2) series to rearchitect the internal
operation of rwsem. Both part 0 and part 1 are merged into tip.

This patchset revamps the current rwsem-xadd implementation to make
it saner and easier to work with. It also implements the following 3
new features:

 1) Waiter lock handoff
 2) Reader optimistic spinning with adaptive disabling
 3) Store write-lock owner in the atomic count (x86-64 only for now)

Waiter lock handoff is similar to the mechanism currently in the mutex
code. This ensures that lock starvation won't happen.

Reader optimistic spinning enables readers to acquire the lock more
quickly.  So workloads that use a mix of readers and writers should
see an increase in performance as long as the reader critical sections
are short. For those workloads that have long reader critical sections
reader optimistic spinning may hurt performance, so an adaptive disabling
mechanism is also implemented to disable it when reader-owned lock
spinning timeouts happen.

Finally, storing the write-lock owner into the count will allow
optimistic spinners to get to the lock holder's task structure more
quickly and eliminating the timing gap where the write lock is acquired
but the owner isn't known yet. This is important for RT tasks where
spinning on a lock with an unknown owner is not allowed.

Because of the fact that multiple readers can share the same lock,
there is a natural preference for readers when measuring in terms of
locking throughput as more readers are likely to get into the locking
fast path than the writers. With waiter lock handoff, we are not going
to starve the writers.

On a 1-socket 22-core 44-thread Skylake system with 22 reader and writer
locking threads, the min/mean/max locking operations done in a 5-second
testing window before the patchset were:

  22 readers, Iterations Min/Mean/Max = 38,193/38,194/38,194
  22 writers, Iterations Min/Mean/Max = 104,663/162,133/261,370

After the patchset, they became:

  22 readers, Iterations Min/Mean/Max = 520,981/560,847/631,603
  22 writers, Iterations Min/Mean/Max = 230,567/241,565/276,940

So it was much fairer to readers.

Patch 1 makes owner a permanent member of the rw_semaphore structure and
set it irrespective of CONFIG_RWSEM_SPIN_ON_OWNER.

Patch 2 removes rwsem_wake() wakeup optimization as it doesn't work
with lock handoff.

Patch 3 implements a new rwsem locking scheme similar to what qrwlock
is currently doing. Write lock is done by atomic_cmpxchg() while read
lock is still being done by atomic_add().

Patch 4 merges the content of rwsem.h and rwsem-xadd.c into rwsem.c just
like the mutex. The rwsem-xadd.c is removed and a bare-bone rwsem.h is
left for internal function declaration needed by percpu-rwsem.c.

Patch 5 optimizes the merged rwsem.c file to generate smaller object
file and performs other miscellaneous code cleanups.

Patch 6 makes rwsem_spin_on_owner() returns owner state.

Patch 7 implements lock handoff to prevent lock starvation. It is expected
that throughput will be lower on workloads with highly contended rwsems
for better fairness.

Patch 8 makes sure that all 

[PATCH v5 01/18] locking/rwsem: Make owner available even if !CONFIG_RWSEM_SPIN_ON_OWNER

2019-04-18 Thread Waiman Long
The owner field in the rw_semaphore structure is used primarily for
optimistic spinning. However, identifying the rwsem owner can also be
helpful in debugging as well as tracing locking related issues when
analyzing crash dump. The owner field may also store state information
that can be important to the operation of the rwsem.

So the owner field is now made a permanent member of the rw_semaphore
structure irrespective of CONFIG_RWSEM_SPIN_ON_OWNER.

Signed-off-by: Waiman Long 
---
 include/linux/rwsem.h   |  9 +
 kernel/locking/rwsem-xadd.c |  2 +-
 kernel/locking/rwsem.h  | 23 ---
 lib/Kconfig.debug   |  8 
 4 files changed, 10 insertions(+), 32 deletions(-)

diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 2ea18a3def04..148983e21d47 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -34,12 +34,12 @@
  */
 struct rw_semaphore {
atomic_long_t count;
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
/*
-* Write owner. Used as a speculative check to see
-* if the owner is running on the cpu.
+* Write owner or one of the read owners. Can be used as a
+* speculative check to see if the owner is running on the cpu.
 */
struct task_struct *owner;
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
struct optimistic_spin_queue osq; /* spinner MCS lock */
 #endif
raw_spinlock_t wait_lock;
@@ -73,13 +73,14 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
 #endif
 
 #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED, .owner = NULL
+#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED
 #else
 #define __RWSEM_OPT_INIT(lockname)
 #endif
 
 #define __RWSEM_INITIALIZER(name)  \
{ __RWSEM_INIT_COUNT(name), \
+ .owner = NULL,\
  .wait_list = LIST_HEAD_INIT((name).wait_list),\
  .wait_lock = __RAW_SPIN_LOCK_UNLOCKED(name.wait_lock) \
  __RWSEM_OPT_INIT(name)\
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 6b3ee9948bf1..7fd4f1de794a 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -86,8 +86,8 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
	atomic_long_set(&sem->count, RWSEM_UNLOCKED_VALUE);
	raw_spin_lock_init(&sem->wait_lock);
	INIT_LIST_HEAD(&sem->wait_list);
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
	sem->owner = NULL;
+#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
	osq_lock_init(&sem->osq);
 #endif
 }
diff --git a/kernel/locking/rwsem.h b/kernel/locking/rwsem.h
index 64877f5294e3..eb9c8534299b 100644
--- a/kernel/locking/rwsem.h
+++ b/kernel/locking/rwsem.h
@@ -61,7 +61,6 @@
 #define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
 #define RWSEM_ACTIVE_WRITE_BIAS	(RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
 
-#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
 /*
  * All writes to owner are protected by WRITE_ONCE() to make sure that
  * store tearing can't happen as optimistic spinners may read and use
@@ -126,7 +125,6 @@ static inline bool rwsem_has_anonymous_owner(struct task_struct *owner)
  * real owner or one of the real owners. The only exception is when the
  * unlock is done by up_read_non_owner().
  */
-#define rwsem_clear_reader_owned rwsem_clear_reader_owned
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
unsigned long val = (unsigned long)current | RWSEM_READER_OWNED
@@ -135,28 +133,7 @@ static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
	cmpxchg_relaxed((unsigned long *)&sem->owner, val,
RWSEM_READER_OWNED | RWSEM_ANONYMOUSLY_OWNED);
 }
-#endif
-
 #else
-static inline void rwsem_set_owner(struct rw_semaphore *sem)
-{
-}
-
-static inline void rwsem_clear_owner(struct rw_semaphore *sem)
-{
-}
-
-static inline void __rwsem_set_reader_owned(struct rw_semaphore *sem,
-  struct task_struct *owner)
-{
-}
-
-static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
-{
-}
-#endif
-
-#ifndef rwsem_clear_reader_owned
 static inline void rwsem_clear_reader_owned(struct rw_semaphore *sem)
 {
 }
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 0d9e81779e37..2047f3884540 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1067,7 +1067,7 @@ config PROVE_LOCKING
select DEBUG_SPINLOCK
select DEBUG_MUTEXES
select DEBUG_RT_MUTEXES if RT_MUTEXES
-   select DEBUG_RWSEMS if RWSEM_SPIN_ON_OWNER
+   select DEBUG_RWSEMS
select DEBUG_WW_MUTEX_SLOWPATH
select DEBUG_LOCK_ALLOC
select TRACE_IRQFLAGS
@@ -1171,10 +1171,10 @@ config DEBUG_WW_MUTEX_SLOWPATH
 
 config DEBUG_RWSEMS
bool "RW Semaphore debugging: basic checks"
-   depends on DEBUG_KERNEL && RWSEM_SPIN_ON_OWNER
+ 

[PATCH v5] power_supply: platform/chrome: wilco_ec: Add charging config driver

2019-04-18 Thread Nick Crews
Add control of the charging algorithm used on Wilco devices.
See Documentation/ABI/testing/sysfs-class-power-wilco for the
userspace interface and other info.

v5 changes:
-Remove OP_SYNC, it has no immediate use case.
-Merge properties.h into wilco-ec.h
-Remove enum get_set_sync_op from the public interface,
 since without OP_SYNC they are irrelevant.
-Fix Kconfigs and Makefiles so they actually work
 with the v4 changes
-Tweak some formatting, spacing, and comments
-Fix validation of charge_type so illegal values
 can't be set. Before negative error codes were
 accidentally getting casted to positive numbers
-Remove more unneeded parentheses.
v4 changes:
-Use put_unaligned_le32() to store PID in request.
-Move implementation from
 drivers/platform/chrome/wilco_ec/charge_config.c to
 drivers/power/supply/wilco_charger.c
-Move drivers/platform/chrome/wilco_ec/properties.h to
 include/linux/platform_data/wilco-ec-properties.h
-Remove parentheses in switch statement in psp_val_to_charge_mode()
-Check for any negative return code from psp_val_to_charge_mode()
 instead of just -EINVAL so it's less brittle
-Tweak comments in wilco-ec-properties.h
v3 changes:
-Add this changelog
-Fix commit message tags
v2 changes:
-Update Documentation to say KernelVersion 5.2
-Update Documentation to explain Trickle mode better.
-rename things from using *PCC* to *CHARGE*
-Split up conversions between POWER_SUPPLY_PROP_CHARGE_TYPE values
and Wilco EC codes
-Use devm_ flavor of power_supply_register(), which simplifies things
-Add extra error checking on property messages received from the EC
-Fix bug in memcpy() calls in properties.c
-Refactor fill_property_id()
-Add valid input checks to charge_type
-Properly convert charge_type when get()ting

Signed-off-by: Nick Crews 
---
 .../ABI/testing/sysfs-class-power-wilco   |  30 +++
 drivers/platform/chrome/wilco_ec/Kconfig  |   9 +
 drivers/platform/chrome/wilco_ec/Makefile |   2 +-
 drivers/platform/chrome/wilco_ec/core.c   |  13 ++
 drivers/platform/chrome/wilco_ec/properties.c | 134 
 drivers/power/supply/Makefile |   1 +
 drivers/power/supply/wilco-charger.c  | 190 ++
 include/linux/platform_data/wilco-ec.h|  73 +++
 8 files changed, 451 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/ABI/testing/sysfs-class-power-wilco
 create mode 100644 drivers/platform/chrome/wilco_ec/properties.c
 create mode 100644 drivers/power/supply/wilco-charger.c

diff --git a/Documentation/ABI/testing/sysfs-class-power-wilco b/Documentation/ABI/testing/sysfs-class-power-wilco
new file mode 100644
index ..7f3b01310476
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-power-wilco
@@ -0,0 +1,30 @@
+What:  /sys/class/power_supply/wilco_charger/charge_type
+Date:  April 2019
+KernelVersion: 5.2
+Description:
+   What charging algorithm to use:
+
+   Standard: Fully charges battery at a standard rate.
+   Adaptive: Battery settings adaptively optimized based on
+   typical battery usage pattern.
+   Fast: Battery charges over a shorter period.
+   Trickle: Extends battery lifespan, intended for users who
+   primarily use their Chromebook while connected to AC.
+   Custom: A low and high threshold percentage is specified.
+   Charging begins when level drops below
+   charge_control_start_threshold, and ceases when
+   level is above charge_control_end_threshold.
+
+What:  /sys/class/power_supply/wilco_charger/charge_control_start_threshold
+Date:  April 2019
+KernelVersion: 5.2
+Description:
+   Used when charge_type="Custom", as described above. Measured in
+   percentages. The valid range is [50, 95].
+
+What:  /sys/class/power_supply/wilco_charger/charge_control_end_threshold
+Date:  April 2019
+KernelVersion: 5.2
+Description:
+   Used when charge_type="Custom", as described above. Measured in
+   percentages. The valid range is [55, 100].
diff --git a/drivers/platform/chrome/wilco_ec/Kconfig b/drivers/platform/chrome/wilco_ec/Kconfig
index e09e4cebe9b4..ec0db8dd14c2 100644
--- a/drivers/platform/chrome/wilco_ec/Kconfig
+++ b/drivers/platform/chrome/wilco_ec/Kconfig
@@ -18,3 +18,12 @@ config WILCO_EC_DEBUGFS
  manipulation and allow for testing arbitrary commands.  This
  interface is intended for debug only and will not be present
  on production devices.
+
+config CHARGER_WILCO
+   tristate "Enable charger control"
+   depends on WILCO_EC
+   help
+ If you say Y here, you get support to control the charging
+ routines performed by the Wilco Embedded Controller.
+ Further information can be found in
+ Documentation/ABI/testing/sysfs-class-power-wilco

[PATCH V2] staging: vc04_services: handle kzalloc failure

2019-04-18 Thread Nicholas Mc Guire
The kzalloc here was being used without checking the return - if the
kzalloc fails return VCHIQ_ERROR. The call-site of
vchiq_platform_init_state(), vchiq_init_state(), was not responding
to an allocation failure either, so check for != VCHIQ_SUCCESS
and pass VCHIQ_ERROR up to vchiq_platform_init(), which then
will fail with -EINVAL.

Signed-off-by: Nicholas Mc Guire 
Reported-by: kbuild test robot 
---

Problem located with experimental coccinelle script

V2: The missing != VCHIQ_SUCCESS check went unnoticed; it is not clear how
that happened, as building drivers/staging/vc04_services/vchiq.o seemed to
have succeeded anyway - the kbuild test found it. This is one of the
disadvantages when one cannot make path/file.o due to missing Makefiles and
is required to build entire directories for compile-testing. I suspect that
I did not clean the directory before compile-testing the patch.


Patch was compile-tested with: bcm2835_defconfig (implies BCM2835_VCHIQ=m)
(with sparse warning:
  CHECK   drivers/staging/vc04_services/interface/vchiq_arm/vchiq_2835_arm.c
drivers/staging/vc04_services/interface/vchiq_arm/vchiq_2835_arm.c:410:29: 
warning: incorrect type in argument 1 (different address spaces)
drivers/staging/vc04_services/interface/vchiq_arm/vchiq_2835_arm.c:410:29:
expected void const *x
drivers/staging/vc04_services/interface/vchiq_arm/vchiq_2835_arm.c:410:29:
got char [noderef]  *buf
drivers/staging/vc04_services/interface/vchiq_arm/vchiq_2835_arm.c:416:63: 
warning: incorrect type in argument 1 (different address spaces)
drivers/staging/vc04_services/interface/vchiq_arm/vchiq_2835_arm.c:416:63:
expected void const *addr
drivers/staging/vc04_services/interface/vchiq_arm/vchiq_2835_arm.c:416:63:
got char [noderef]  *
  CC [M]  drivers/staging/vc04_services/interface/vchiq_arm/vchiq_2835_arm.o
Q: should this buf have been remapped to user space e.g. remap_vmalloc_range ?

Patch is against 5.1-rc5 (localversion next is 20190417)

 drivers/staging/vc04_services/interface/vchiq_arm/vchiq_2835_arm.c | 3 +++
 drivers/staging/vc04_services/interface/vchiq_arm/vchiq_core.c | 2 ++
 2 files changed, 5 insertions(+)

diff --git a/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_2835_arm.c b/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_2835_arm.c
index 31eb7d5..a9a2291 100644
--- a/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_2835_arm.c
+++ b/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_2835_arm.c
@@ -179,6 +179,9 @@ vchiq_platform_init_state(struct vchiq_state *state)
struct vchiq_2835_state *platform_state;
 
state->platform_state = kzalloc(sizeof(*platform_state), GFP_KERNEL);
+   if (!state->platform_state)
+   return VCHIQ_ERROR;
+
platform_state = (struct vchiq_2835_state *)state->platform_state;
 
platform_state->inited = 1;
diff --git a/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_core.c b/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_core.c
index 6057a90..92e6221 100644
--- a/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_core.c
+++ b/drivers/staging/vc04_services/interface/vchiq_arm/vchiq_core.c
@@ -2209,6 +2209,8 @@ vchiq_init_state(struct vchiq_state *state, struct vchiq_slot_zero *slot_zero)
local->debug[DEBUG_ENTRIES] = DEBUG_MAX;
 
status = vchiq_platform_init_state(state);
+   if (status != VCHIQ_SUCCESS)
+   return VCHIQ_ERROR;
 
/*
bring up slot handler thread
-- 
2.1.4



Re: [REGRESSION 5.0.x] Windows XP broken on KVM

2019-04-18 Thread Sean Christopherson
On Thu, Apr 18, 2019 at 10:05:34AM +0200, Takashi Iwai wrote:
> On Thu, 18 Apr 2019 09:53:28 +0200,
> Greg Kroah-Hartman wrote:
> > 
> > On Thu, Apr 18, 2019 at 09:38:52AM +0200, Takashi Iwai wrote:
> > > Hi,
> > > 
> > > we've got a regression report on the recent 5.0.x kernel, starting
> > > from 5.0.6, where Windows XP can't boot on KVM any longer.
> > > 
> > > The culprit seems to be the patch
> > >KVM: x86: update %rip after emulating IO
> > > with the upstream commit 45def77ebf79e2e8942b89ed79294d97ce914fa0.
> > > Reverting this alone fixed the problem.
> > > 
> > > The report is found at openSUSE bugzilla:
> > >   https://bugzilla.suse.com/show_bug.cgi?id=1132694
> > > 
> > > Is there already a followup fix?  If not, we need to revert it from
> > > stable, at least.
> > 
> > Is this also a problem in 5.1-rc5?
> 
> Only 5.0.x has been tested, so far.
> I'll ask the reporter to try 5.1-rc if Paolo can't reproduce the
> problem in his side.

I'd be surprised if 5.1-rc* changes anything relative to 5.0.x, the
commit in question is fairly isolated and specific.

Can we get the Qemu version and command line that is being run?


Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-18 Thread Borislav Petkov
On Thu, Apr 18, 2019 at 03:51:07PM -0700, Cong Wang wrote:
> On Thu, Apr 18, 2019 at 3:02 PM Tony Luck  wrote:
> >
> > Useful when running error injection tests that want to
> > see all of the MCi_(STATUS|ADDR|MISC) data via /dev/mcelog.
> >
> > Signed-off-by: Tony Luck 
> 
> We saw the same problem, CONFIG_RAS hijacks all the
> correctable memory errors, which leaves mcelog "broken"
> silently. I know it is arguable, but until we can switch from
> mcelog to rasdaemon, mcelog should still work as before.

It is "arguable" because this is not how the CEC is supposed to be used.

If you want to collect errors with mcelog, you don't use the CEC at all.
And there's ras=cec_disable for that or you simply don't enable it in
your .config.

As Tony says in the commit message, the enable should be used only for
injection tests. Which is where that thing should only be used for -
debugging the CEC itself.

Which reminds me, Tony, I think all those debugging files "pfn"
and "array" and the one you add now, should all be under a
CONFIG_RAS_CEC_DEBUG which is default off and used only for development.
Mind adding that too pls?

Thx.

-- 
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
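A hedged sketch of what the CONFIG_RAS_CEC_DEBUG option Boris asks for could look like. The option name comes from his mail; the wording, placement under drivers/ras, and dependency are assumptions:

```kconfig
config RAS_CEC_DEBUG
	bool "CEC debugging machinery"
	depends on RAS_CEC
	default n
	help
	  Expose the debugfs files ("pfn", "array", and the runtime disable
	  switch) used to exercise and inspect the correctable errors
	  collector. Intended only for development and testing of the CEC
	  itself; leave disabled on production kernels so that the debug
	  interfaces are not present.
```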


Re: [PATCH] drivers: iio: proximity: This patch fix the following checkpatch warning.

2019-04-18 Thread Matt Ranostay
See comments inline

On Wed, Apr 17, 2019 at 11:15 AM Mohan Kumar  wrote:
>
> As per Documentation/timers/timers-howto.txt, msleep() for a period of
> less than 20ms can sleep for up to 20ms, so use usleep_range.
>
> Signed-off-by: Mohan Kumar 
> ---
>  drivers/iio/proximity/mb1232.c | 2 +-
>  drivers/iio/proximity/srf08.c  | 4 ++--
>  2 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/iio/proximity/mb1232.c b/drivers/iio/proximity/mb1232.c
> index 166b3e6..74f7eae 100644
> --- a/drivers/iio/proximity/mb1232.c
> +++ b/drivers/iio/proximity/mb1232.c
> @@ -81,7 +81,7 @@ static s16 mb1232_read_distance(struct mb1232_data *data)
> }
> } else {
> /* use simple sleep if announce irq is not connected */
> -   msleep(15);
> +   usleep_range(15000, 20000);

This is actually less than ideal.. because usleep_range uses the
hrtimers which forces an interrupt.

Unless you actually need to read between 15ms and 20ms it is best just
to leave the msleep() as it is, and let it take a bit longer
if required.

- Matt

> }
>
> ret = i2c_master_recv(client, (char *)&buf, sizeof(buf));
> diff --git a/drivers/iio/proximity/srf08.c b/drivers/iio/proximity/srf08.c
> index f2bf783..605a582 100644
> --- a/drivers/iio/proximity/srf08.c
> +++ b/drivers/iio/proximity/srf08.c
> @@ -150,7 +150,7 @@ static int srf08_read_ranging(struct srf08_data *data)
>  * polling for not more than 20 ms should be enough
>  */
> waittime = 1 + data->range_mm / 172;
> -   msleep(waittime);
> +   usleep_range(waittime * 1000, (waittime * 1000) + 2000);
> for (i = 0; i < 4; i++) {
> ret = i2c_smbus_read_byte_data(data->client,
> SRF08_READ_SW_REVISION);
> @@ -158,7 +158,7 @@ static int srf08_read_ranging(struct srf08_data *data)
> /* check if a valid version number is read */
> if (ret < 255 && ret > 0)
> break;
> -   msleep(5);
> +   usleep_range(5000, 15000);
> }
>
> if (ret >= 255 || ret <= 0) {
> --
> 2.7.4
>


Re: [PATCH v2] clk: hi3660: Mark clk_gate_ufs_subsys as critical

2019-04-18 Thread Leo Yan
On Thu, Apr 18, 2019 at 02:52:44PM -0700, Stephen Boyd wrote:
> Quoting Leo Yan (2019-04-17 22:33:39)
> > Hi Michael, Stephen,
> > 
> > On Wed, Mar 20, 2019 at 06:05:08PM +0800, Leo Yan wrote:
> > > clk_gate_ufs_subsys is a system bus clock, turning off it will
> > > introduce lockup issue during system suspend flow.  Let's mark
> > > clk_gate_ufs_subsys as critical clock, thus keeps it on during
> > > system suspend and resume.
> > 
> > Could you pick up this patch?  Or should I resend it?
> > 
> 
> I can pick it up. Is it critical or can it wait for v5.2?

It's fine to wait for v5.2.

Thanks,
Leo Yan


Re: [PATCH v5 0/2] tty: serial: add DT bindings and serial driver for the SiFive FU540 UART

2019-04-18 Thread Kevin Hilman
Hi Paul,

Paul Walmsley  writes:

> This series adds a serial driver, with console support, for the
> UART IP block present on the SiFive FU540 SoC.  The programming
> model is straightforward, but unique.
>
> Boot-tested on a SiFive FU540 HiFive-U board, using BBL and the
> open-source FSBL (with appropriate patches to the DT data).
>
> This fifth version fixes a bug in the set_termios handler,
> found by Andreas Schwab .
>
> The patches in this series can also be found, with the PRCI patches,
> DT patches, and DT prerequisite patch, at:
>
> https://github.com/sifive/riscv-linux/tree/dev/paulw/serial-v5.1-rc4

I tried this branch, and it doesn't boot on my unleashed board.

Here's the boot log when I pass the DT built from your branch via
u-boot: https://termbin.com/rfp3.

I also tried the same thing, but using the DT that's hard-coded into
SBI/u-boot.  That doesn't boot fully either[1], but one thing I noted is
that with the DT from the kernel tree, the printk timestamps aren't
moving.  Maybe I'm still missing some kconfig options to enable the
right clock and/or IRQ controllers? I'm using this fragment[2] on top of
the default defconfig (arch/riscv/configs/defconfig).

Could you share the defconfig you're using when testing your branch?

Also for reference, I'm able to successfully build/boot the
5.1-rc1-unleashed branch from Atish's tree[3] using that kconfig
fragment[2] (and the hard-coded DT from u-boot/SBI).  Full log here[4].

Thanks,

Kevin

[1] https://termbin.com/wuc9
[2]
CONFIG_CLK_SIFIVE=y
CONFIG_CLK_SIFIVE_FU540_PRCI=y

CONFIG_SERIAL_SIFIVE=y
CONFIG_SERIAL_SIFIVE_CONSOLE=y

CONFIG_SIFIVE_PLIC=y
CONFIG_SPI=y
CONFIG_SPI_SIFIVE=y
CONFIG_GPIOLIB=y
CONFIG_GPIO_SIFIVE=y
CONFIG_PWM_SIFIVE=y

CONFIG_CLK_U54_PRCI=y
CONFIG_CLK_GEMGXL_MGMT=y

[3] https://github.com/atishp04/linux/tree/5.1-rc1-unleashed
[4] https://termbin.com/12bg


Re: [PATCH v4 1/2] mm: refactor __vunmap() to avoid duplicated call to find_vm_area()

2019-04-18 Thread Eric Dumazet



On 04/18/2019 03:24 PM, Andrew Morton wrote:

> afaict, vfree() will only do a mutex_trylock() in
> try_purge_vmap_area_lazy().  So does vfree actually sleep in any
> situation?  Whether or not local interrupts are enabled?

We would be in a big trouble if vfree() could potentially sleep...

Random example : __free_fdtable() called from rcu callback.



Re: [PATCH v2] fix compile errors due to unsync linux/in6.h and netinet/in.h

2019-04-18 Thread Alexei Starovoitov
On Thu, Apr 18, 2019 at 11:06 AM Wang YanQing  wrote:
>
> I meet below compile errors:
> "
> In file included from test_tcpnotify_kern.c:12:
> /usr/include/netinet/in.h:101:5: error: expected identifier
> IPPROTO_HOPOPTS = 0,   /* IPv6 Hop-by-Hop options.  */
> ^
> /usr/include/linux/in6.h:131:26: note: expanded from macro 'IPPROTO_HOPOPTS'
> ^
> In file included from test_tcpnotify_kern.c:12:
> /usr/include/netinet/in.h:103:5: error: expected identifier
> IPPROTO_ROUTING = 43,  /* IPv6 routing header.  */
> ^
> /usr/include/linux/in6.h:132:26: note: expanded from macro 'IPPROTO_ROUTING'
> ^
> In file included from test_tcpnotify_kern.c:12:
> /usr/include/netinet/in.h:105:5: error: expected identifier
> IPPROTO_FRAGMENT = 44, /* IPv6 fragmentation header.  */
> ^
> /usr/include/linux/in6.h:133:26: note: expanded from macro 'IPPROTO_FRAGMENT'
> "
> The same compile errors are reported for test_tcpbpf_kern.c too.
>
> My environment:
> lsb_release -a:
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:Ubuntu 16.04.6 LTS
> Release:16.04
> Codename:   xenial
>
> dpkg -l | grep libc-dev:
> ii  libc-dev-bin  2.23-0ubuntu11   amd64GNU C 
> Library: Development binaries
> ii  linux-libc-dev:amd64  4.4.0-145.171amd64Linux 
> Kernel Headers for development.
>
> The reason is linux/in6.h and netinet/in.h aren't synchronous about how to
> handle the same definitions, IPPROTO_HOPOPTS, etc.
>
> This patch fixes the compile errors by moving  to before the
> .
>
> Signed-off-by: Wang YanQing 

Added 'selftests/bpf' to commit log and pushed.


Re: [PATCH v2 2/4] interconnect: qcom: Add QCS404 interconnect provider driver

2019-04-18 Thread Bjorn Andersson
On Mon 15 Apr 03:43 PDT 2019, Georgi Djakov wrote:
> diff --git a/drivers/interconnect/qcom/qcs404.c b/drivers/interconnect/qcom/qcs404.c
[..]
> +#define QCS404_MASTER_AMPSS_M0		0
> +#define QCS404_MASTER_GRAPHICS_3D		1
> +#define QCS404_MASTER_MDP_PORT0		2
> +#define QCS404_SNOC_BIMC_1_MAS		3
> +#define QCS404_MASTER_TCU_0			4
> +#define QCS404_MASTER_SPDM			5
> +#define QCS404_MASTER_BLSP_1			6
> +#define QCS404_MASTER_BLSP_2			7
> +#define QCS404_MASTER_XM_USB_HS1		8
> +#define QCS404_MASTER_CRYPTO_CORE0		9
> +#define QCS404_MASTER_SDCC_1			10
> +#define QCS404_MASTER_SDCC_2			11
> +#define QCS404_SNOC_PNOC_MAS			12
> +#define QCS404_MASTER_QPIC			13
> +#define QCS404_MASTER_QDSS_BAM		14
> +#define QCS404_BIMC_SNOC_MAS			15
> +#define QCS404_PNOC_SNOC_MAS			16
> +#define QCS404_MASTER_QDSS_ETR		17
> +#define QCS404_MASTER_EMAC			18
> +#define QCS404_MASTER_PCIE			19
> +#define QCS404_MASTER_USB3			20
> +#define QCS404_PNOC_INT_0			21
> +#define QCS404_PNOC_INT_2			22
> +#define QCS404_PNOC_INT_3			23
> +#define QCS404_PNOC_SLV_0			24
> +#define QCS404_PNOC_SLV_1			25
> +#define QCS404_PNOC_SLV_2			26
> +#define QCS404_PNOC_SLV_3			27
> +#define QCS404_PNOC_SLV_4			28
> +#define QCS404_PNOC_SLV_6			29
> +#define QCS404_PNOC_SLV_7			30
> +#define QCS404_PNOC_SLV_8			31
> +#define QCS404_PNOC_SLV_9			32
> +#define QCS404_PNOC_SLV_10			33
> +#define QCS404_PNOC_SLV_11			34
> +#define QCS404_SNOC_QDSS_INT			35
> +#define QCS404_SNOC_INT_0			36
> +#define QCS404_SNOC_INT_1			37
> +#define QCS404_SNOC_INT_2			38
> +#define QCS404_SLAVE_EBI_CH0			39
> +#define QCS404_BIMC_SNOC_SLV			40
> +#define QCS404_SLAVE_SPDM_WRAPPER		41
> +#define QCS404_SLAVE_PDM			42
> +#define QCS404_SLAVE_PRNG			43
> +#define QCS404_SLAVE_TCSR			44
> +#define QCS404_SLAVE_SNOC_CFG			45
> +#define QCS404_SLAVE_MESSAGE_RAM		46
> +#define QCS404_SLAVE_DISPLAY_CFG		47
> +#define QCS404_SLAVE_GRAPHICS_3D_CFG		48
> +#define QCS404_SLAVE_BLSP_1			49
> +#define QCS404_SLAVE_TLMM_NORTH		50
> +#define QCS404_SLAVE_PCIE_1			51
> +#define QCS404_SLAVE_EMAC_CFG			52
> +#define QCS404_SLAVE_BLSP_2			53
> +#define QCS404_SLAVE_TLMM_EAST		54
> +#define QCS404_SLAVE_TCU			55
> +#define QCS404_SLAVE_PMIC_ARB			56
> +#define QCS404_SLAVE_SDCC_1			57
> +#define QCS404_SLAVE_SDCC_2			58
> +#define QCS404_SLAVE_TLMM_SOUTH		59
> +#define QCS404_SLAVE_USB_HS			60
> +#define QCS404_SLAVE_USB3			61
> +#define QCS404_SLAVE_CRYPTO_0_CFG		62
> +#define QCS404_PNOC_SNOC_SLV			63
> +#define QCS404_SLAVE_APPSS			64
> +#define QCS404_SLAVE_WCSS			65
> +#define QCS404_SNOC_BIMC_1_SLV		66
> +#define QCS404_SLAVE_OCIMEM			67
> +#define QCS404_SNOC_PNOC_SLV			68
> +#define QCS404_SLAVE_QDSS_STM			69
> +#define QCS404_SLAVE_CATS_128			70
> +#define QCS404_SLAVE_OCMEM_64			71
> +#define QCS404_SLAVE_LPASS			72

An enum would probably be cleaner, given that the actual values are not
significant.
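Such an enum might look like the following. This is illustrative and abbreviated, showing only the first few and last entries of the table above:

```c
/*
 * The actual values are not significant, so an enum can assign them
 * automatically and new nodes can be inserted without renumbering.
 */
enum qcs404_nodes {
	QCS404_MASTER_AMPSS_M0,		/* = 0 */
	QCS404_MASTER_GRAPHICS_3D,	/* = 1 */
	QCS404_MASTER_MDP_PORT0,	/* = 2 */
	/* ... remaining masters, internal nodes and slaves ... */
	QCS404_SLAVE_LPASS,
};
```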

[..]
> +static int qnoc_probe(struct platform_device *pdev)
> +{
> + struct device *dev = &pdev->dev;
> + const struct qcom_icc_desc *desc;
> + struct icc_onecell_data *data;
> + struct icc_provider *provider;
> + struct qcom_icc_node **qnodes;
> + struct qcom_icc_provider *qp;
> + struct icc_node *node;
> + size_t num_nodes, i;
> + int ret;
> +
> + rpm = dev_get_drvdata(dev->parent);
> + if (!rpm) {
> + dev_err(dev, "unable to retrieve handle to RPM\n");
> + return -ENODEV;
> + }

In line with my feedback on the DT binding I would prefer if this driver
deals with the devices on the mmio/soc bus and you move the SMD/RPM part
to a separate driver - then perform the SMD/RPM operation by calling
into the other driver.

Given how much of this driver is platforms specific then I think it's a
pretty clean split for adding further (SMD/RPM based) platforms.



Re: [PATCH] reiserfs: Force type conversion in xattr_hash

2019-04-18 Thread Andrew Morton
On Wed, 17 Apr 2019 17:22:00 +0530 Bharath Vedartham  wrote:

> This patch fixes the sparse warning:
> 
> fs/reiserfs//xattr.c:453:28: warning: incorrect type in return
> expression (different base types)
> fs/reiserfs//xattr.c:453:28:expected unsigned int
> fs/reiserfs//xattr.c:453:28:got restricted __wsum
> fs/reiserfs//xattr.c:453:28: warning: incorrect type in return
> expression (different base types)
> fs/reiserfs//xattr.c:453:28:expected unsigned int
> fs/reiserfs//xattr.c:453:28:got restricted __wsum
> 
> csum_partial returns restricted integer __wsum whereas xattr_hash
> expects a return type of __u32.
> 
> ...
>
> --- a/fs/reiserfs/xattr.c
> +++ b/fs/reiserfs/xattr.c
> @@ -450,7 +450,7 @@ static struct page *reiserfs_get_page(struct inode *dir, 
> size_t n)
>  
>  static inline __u32 xattr_hash(const char *msg, int len)
>  {
> - return csum_partial(msg, len, 0);
> + return (__force __u32)csum_partial(msg, len, 0);
>  }
>  
>  int reiserfs_commit_write(struct file *f, struct page *page,

hm.  Conversion from int to __u32 should be OK - why is sparse being so
picky here?

Why is the __force needed, btw?
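Sparse is picky here because `__wsum` is declared with the `__bitwise` attribute, which makes it a "restricted" type that must not silently mix with plain integers even when the underlying representation is identical; `__force` tells sparse the conversion is intentional. A rough userspace approximation of how such a restricted type works (the macro definitions mirror the kernel's annotations; under a plain compiler, where `__CHECKER__` is undefined, the attributes vanish):

```c
#ifdef __CHECKER__
#define __bitwise __attribute__((bitwise))
#define __force   __attribute__((force))
#else
#define __bitwise
#define __force
#endif

/* Stand-in for the kernel's __wsum: same bits, distinct type for sparse. */
typedef unsigned int __bitwise wsum_t;

static inline unsigned int wsum_to_u32(wsum_t s)
{
	/* Without the __force cast, sparse warns about mixing restricted
	 * and plain integer types, exactly as in the xattr_hash() report. */
	return (__force unsigned int)s;
}
```

The cast is a no-op at runtime; it only silences the type-safety check that sparse layers on top of C.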


Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-18 Thread Cong Wang
On Thu, Apr 18, 2019 at 3:02 PM Tony Luck  wrote:
>
> Useful when running error injection tests that want to
> see all of the MCi_(STATUS|ADDR|MISC) data via /dev/mcelog.
>
> Signed-off-by: Tony Luck 

We saw the same problem, CONFIG_RAS hijacks all the
correctable memory errors, which leaves mcelog "broken"
silently. I know it is arguable, but until we can switch from
mcelog to rasdaemon, mcelog should still work as before.

I like this idea of disabling it at runtime, so:

Acked-by: Cong Wang 

Thanks.


Re: [PATCH v2 4/4] dt-bindings: interconnect: qcs404: Introduce qcom,qos DT property

2019-04-18 Thread Bjorn Andersson
On Mon 15 Apr 03:43 PDT 2019, Georgi Djakov wrote:

> There are separate hardware blocks for each interconnect that allow QoS
> configuration to be applied to each port (node). There are different kinds of
> priorities that could be set on these ports. Each port supports also various
> QoS modes such as "fixed", "limiter", "bypass" and "regulator". Depending on
> the mode, there are a few additional knobs that could be configured.
> 
> Introduce the qcom,qos property, so that we describe this relation in DT and
> allow the interconnect provider drivers to make use of it.
> 

As the example shows we will end up with two nodes describing the same
hardware with pretty much identical set of properties.

I really do think it's better to represent and implement the NoC
controllers on the mmio/platform bus and have a small "proxy" as a child
of the RPM.


By doing this you avoid the duplication of the clock properties and you
don't need the qcom,qos reference to pair up each "bus performance
driver" to the "bus qos driver".

And you still maintain the idea that the entity you request bandwidth
votes with are the NoC controller (which also will be the QoS
controller).

Regards,
Bjorn

> Signed-off-by: Georgi Djakov 
> ---
> 
> v2:
> - New patch.
> 
>  .../bindings/interconnect/qcom,qcs404.txt | 31 +++
>  1 file changed, 31 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/interconnect/qcom,qcs404.txt b/Documentation/devicetree/bindings/interconnect/qcom,qcs404.txt
> index 9befcd14a5b5..b971e0ee2963 100644
> --- a/Documentation/devicetree/bindings/interconnect/qcom,qcs404.txt
> +++ b/Documentation/devicetree/bindings/interconnect/qcom,qcs404.txt
> @@ -11,9 +11,37 @@ Required properties :
>  Optional properties :
>  clocks : list of phandles and specifiers to all interconnect bus clocks
>  clock-names : clock names should include both "bus_clk" and "bus_a_clk"
> +qcom,qos : phandle to the QoS device-tree node
>  
>  Example:
>  
> +soc {
> + ...
> + bimc_qos: interconnect@40 {
> + compatible = "qcom,qcs404-bimc-qos";
> + reg = <0x40 0x8>;
> + clock-names = "bus_clk", "bus_a_clk";
> + clocks = <&rpmcc RPM_SMD_BIMC_CLK>,
> + <&rpmcc RPM_SMD_BIMC_A_CLK>;
> + };
> +
> + pcnoc_qos: interconnect@50 {
> + compatible = "qcom,qcs404-pcnoc-qos";
> + reg = <0x50 0x15080>;
> + clock-names = "bus_clk", "bus_a_clk";
> + clocks = <&rpmcc RPM_SMD_PNOC_CLK>,
> + <&rpmcc RPM_SMD_PNOC_A_CLK>;
> + };
> +
> + snoc_qos: interconnect@58 {
> + compatible = "qcom,qcs404-snoc-qos";
> + reg = <0x58 0x14000>;
> + clock-names = "bus_clk", "bus_a_clk";
> + clocks = <&rpmcc RPM_SMD_SNOC_CLK>,
> + <&rpmcc RPM_SMD_SNOC_A_CLK>;
> + };
> +};
> +
>  rpm-glink {
>   ...
>   rpm_requests: glink-channel {
> @@ -24,6 +52,7 @@ rpm-glink {
>   clock-names = "bus_clk", "bus_a_clk";
> clocks = <&rpmcc RPM_SMD_BIMC_CLK>,
> <&rpmcc RPM_SMD_BIMC_A_CLK>;
> + qcom,qos = <&bimc_qos>;
>   };
>  
>   pnoc: interconnect@1 {
> @@ -32,6 +61,7 @@ rpm-glink {
>   clock-names = "bus_clk", "bus_a_clk";
> clocks = <&rpmcc RPM_SMD_PNOC_CLK>,
> <&rpmcc RPM_SMD_PNOC_A_CLK>;
> + qcom,qos = <&pcnoc_qos>;
>   };
>  
>   snoc: interconnect@2 {
> @@ -40,6 +70,7 @@ rpm-glink {
>   clock-names = "bus_clk", "bus_a_clk";
> clocks = <&rpmcc RPM_SMD_SNOC_CLK>,
> <&rpmcc RPM_SMD_SNOC_A_CLK>;
> + qcom,qos = <&snoc_qos>;
>   };
>   };
>  };


Re: [PATCH 1/2] ras: fix an off-by-one error in __find_elem()

2019-04-18 Thread Cong Wang
On Wed, Apr 17, 2019 at 2:15 PM Luck, Tony  wrote:
>
> On Tue, Apr 16, 2019 at 07:37:41PM -0700, Cong Wang wrote:
> > On Tue, Apr 16, 2019 at 7:31 PM Cong Wang  wrote:
> > > Yes it is, I have a stacktrace in production which clearly shows
> > > del_elem.isra.1+0x34/0x40, unlike the one I triggered via fake
> > > PFN's. I can show you if you want, it is on 4.14, so very unlikely
> > > it is interesting to anyone here.
> >
> > Correct myself:
> >
> > The one triggered via fake PFN's also shows
> > del_elem.constprop.1+0x39/0x40.
> >
> > Sorry for the mistake, I kinda read into another crash...
>
> Ok. Now I follow what you are saying. I'll try to rephrase
> here.

Yes, your detailed analysis is correct. Sorry if I caused any
confusion here.

>
>
> So I'm now liking this version of the patch by Cong:
>
...
> Reviewed-by: Tony Luck 


Thanks for reviewing my patch!


Re: [PATCH v3] proc/sysctl: add shared variables for range check

2019-04-18 Thread Andrew Morton
On Wed, 17 Apr 2019 15:15:31 +0200 Matteo Croce  wrote:

> In the sysctl code the proc_dointvec_minmax() function is often used to
> validate the user supplied value between an allowed range. This function
> uses the extra1 and extra2 members from struct ctl_table as minimum and
> maximum allowed value.
> 
> On sysctl handler declaration, in every source file there are some read-only
> variables containing just an integer whose address is assigned to the
> extra1 and extra2 members, so the sysctl range is enforced.
> 
> The special values 0, 1 and INT_MAX are very often used as range boundaries,
> leading to duplication of variables like zero=0, one=1, int_max=INT_MAX in
> different source files:
> 
> $ git grep -E '\.extra[12].*&(zero|one|int_max)\b' |wc -l
> 245
> 
> This patch adds three const variables for the most commonly used values,
> and use them instead of creating a local one for every object file.
> 
> ...
>
> --- a/arch/s390/appldata/appldata_base.c
> +++ b/arch/s390/appldata/appldata_base.c
> @@ -220,15 +220,13 @@ appldata_timer_handler(struct ctl_table *ctl, int write,
>  void __user *buffer, size_t *lenp, loff_t *ppos)
>  {
>   int timer_active = appldata_timer_active;
> - int zero = 0;
> - int one = 1;
>   int rc;
>   struct ctl_table ctl_entry = {
>   .procname   = ctl->procname,
>   .data   = &timer_active,
>   .maxlen = sizeof(int),
> - .extra1 = &zero,
> - .extra2 = &one,
> + .extra1 = (void *)&sysctl_zero,
> + .extra2 = (void *)&sysctl_one,
>   };

Still not liking the casts :(

Did we decide whether making extra1&2 const void*'s was feasible?

I'm wondering if it would be better to do

extern const int sysctl_zero;
/* comment goes here */
#define SYSCTL_ZERO ((void *)&sysctl_zero)

and then use SYSCTL_ZERO everywhere.  That centralizes the ugliness and
makes it easier to switch over if/when extra1&2 are constified.

But it's all a bit sad and lame :( 


Re: Detecting x86 LAPIC timer frequency from CPUID data

2019-04-18 Thread Thomas Gleixner
On Thu, 18 Apr 2019, Thomas Gleixner wrote:
> On Wed, 17 Apr 2019, Daniel Drake wrote:
> 
> > The CPUID.0x16 leaf provides "Bus (Reference) Frequency (in MHz)".
> > 
> > In the thread "No 8254 PIT & no HPET on new Intel N3350 platforms
> > causes kernel panic during early boot" we are exploring ways to have
> > the kernel avoid using the PIT/HPET IRQ0 timer in more cases, and
> > Thomas Gleixner suggested that we could use this CPUID data to set
> > lapic_timer_frequency, avoiding the need for calibrate_APIC_clock()
> > to measure the APIC clock against the IRQ0 timer.
> > 
> > I'm thinking of the the following code change, however I get
> > unexpected results on Intel i7-8565U (Whiskey Lake). When
> > booting without this change, and with apic=notscdeadline (so that
> > APIC clock gets calibrated and used), the bus speed is detected as
> > 23MHz:
> > 
> >  ... lapic delta = 149994
> >  ... PM-Timer delta = 357939
> >  ... PM-Timer result ok
> >  . delta 149994
> >  . mult: 6442193
> >  . calibration result: 23999

That's 24MHz, which is the nominal clock for these machines.

> >  . CPU clock speed is 1991.0916 MHz.
> >  . host bus clock speed is 23.0999 MHz.

I think that printout is wrong in two aspects. First, it should be
23.MHz; second, it should be 100MHz. This stuff comes from old systems
which had completely different clock setups.

> > However the CPUID.0x16 ECX reports a 100MHz bus speed on this device,
> > so this code change would produce a significantly different calibration.

Yes.

> > Am I doing anything obviously wrong?
> 
> It's probably just my fault sending you down the wrong path. What's the
> content of CPUUD.0x15 on that box?

I bet that CPUID.0x15 ECX says 24MHz or it just says 0 like on my
machine. But then interestingly enough on that box I see:

   Time Stamp Counter/Core Crystal Clock Information (0x15):
  TSC/clock ratio = 168/2
  nominal core crystal clock = 0 Hz

   Processor Frequency Information (0x16):
  Core Base Frequency (MHz) = 0x834 (2100)
  Core Maximum Frequency (MHz) = 0xed8 (3800)
  Bus (Reference) Frequency (MHz) = 0x64 (100)

Assuming that TSC and local APIC timer run from the same frequency on these
modern machines.

   2100MHz * 2 / 168 = 25MHz

and disabling the tsc deadline timer tells me:

  . calibration result: 24999

Close enough. Thinking about it - that makes a lot of sense. With TSC
deadline timer available it would be pretty silly if there is yet another
clock feeding into local APIC.

And the 0x15 -> 0x16 correlation is clear as well. The TSC runs at the
nominal CPU frequency, in this case 2100MHz. So the numerator/denominator
pair from 0x15 allows us to deduce the nominal core crystal clock, which seems
to be exactly the clock which is fed into the local APIC timer.

If someone from Intel could confirm that, then we could make that work for
any modern system.

Thanks,

tglx


Re: [PATCH v4 1/2] mm: refactor __vunmap() to avoid duplicated call to find_vm_area()

2019-04-18 Thread Andrew Morton
On Thu, 18 Apr 2019 04:18:34 -0700 Matthew Wilcox  wrote:

> On Wed, Apr 17, 2019 at 02:58:27PM -0700, Andrew Morton wrote:
> > On Wed, 17 Apr 2019 12:40:01 -0700 Roman Gushchin  wrote:
> > > +static struct vm_struct *__remove_vm_area(struct vmap_area *va)
> > > +{
> > > + struct vm_struct *vm = va->vm;
> > > +
> > > + might_sleep();
> > 
> > Where might __remove_vm_area() sleep?
> > 
> > >From a quick scan I'm only seeing vfree(), and that has the
> > might_sleep_if(!in_interrupt()).
> > 
> > So perhaps we can remove this...
> 
> See commit 5803ed292e63 ("mm: mark all calls into the vmalloc subsystem as 
> potentially sleeping")
> 
> It looks like the intent is to unconditionally check might_sleep() at
> the entry points to the vmalloc code, rather than only catch them in
> the occasional place where it happens to go wrong.

afaict, vfree() will only do a mutex_trylock() in
try_purge_vmap_area_lazy().  So does vfree actually sleep in any
situation?  Whether or not local interrupts are enabled?




Re: [PATCH 1/1] lib/test_vmalloc: do not create cpumask_t variable on stack

2019-04-18 Thread Andrew Morton
On Thu, 18 Apr 2019 21:39:25 +0200 "Uladzislau Rezki (Sony)"  
wrote:

> On my "Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz" system (12 CPUs)
> I get the warning from the compiler about frame size:
> 
> 
> warning: the frame size of 1096 bytes is larger than 1024 bytes
> [-Wframe-larger-than=]
> 
> 
> the size of cpumask_t depends on the number of CPUs, therefore just
> make use of cpumask_of() in set_cpus_allowed_ptr() as the second
> argument.
> 
> ...
> --- a/lib/test_vmalloc.c
> +++ b/lib/test_vmalloc.c
> @@ -383,14 +383,14 @@ static void shuffle_array(int *arr, int n)
>  static int test_func(void *private)
>  {
>   struct test_driver *t = private;
> - cpumask_t newmask = CPU_MASK_NONE;
>   int random_array[ARRAY_SIZE(test_case_array)];
>   int index, i, j, ret;
>   ktime_t kt;
>   u64 delta;
>  
> - cpumask_set_cpu(t->cpu, &newmask);
> - set_cpus_allowed_ptr(current, &newmask);
> + ret = set_cpus_allowed_ptr(current, cpumask_of(t->cpu));
> + if (ret < 0)
> + pr_err("Failed to set affinity to %d CPU\n", t->cpu);
>  
>   for (i = 0; i < ARRAY_SIZE(test_case_array); i++)
>   random_array[i] = i;

lgtm.

While we're in there...


From: Andrew Morton 
Subject: lib/test_vmalloc.c:test_func(): eliminate local `ret'

Local 'ret' is unneeded and was poorly named: the variable `ret' generally
means "the value which this function will return".

Cc: Roman Gushchin 
Cc: Uladzislau Rezki 
Cc: Michal Hocko 
Cc: Matthew Wilcox 
Cc: Thomas Garnier 
Cc: Oleksiy Avramchenko 
Cc: Steven Rostedt 
Cc: Joel Fernandes 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Tejun Heo 
Signed-off-by: Andrew Morton 
---

 lib/test_vmalloc.c |8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

--- a/lib/test_vmalloc.c~a
+++ a/lib/test_vmalloc.c
@@ -384,12 +384,11 @@ static int test_func(void *private)
 {
struct test_driver *t = private;
int random_array[ARRAY_SIZE(test_case_array)];
-   int index, i, j, ret;
+   int index, i, j;
ktime_t kt;
u64 delta;
 
-   ret = set_cpus_allowed_ptr(current, cpumask_of(t->cpu));
-   if (ret < 0)
+   if (set_cpus_allowed_ptr(current, cpumask_of(t->cpu)) < 0)
pr_err("Failed to set affinity to %d CPU\n", t->cpu);
 
for (i = 0; i < ARRAY_SIZE(test_case_array); i++)
@@ -415,8 +414,7 @@ static int test_func(void *private)
 
kt = ktime_get();
for (j = 0; j < test_repeat_count; j++) {
-   ret = test_case_array[index].test_func();
-   if (!ret)
+   if (!test_case_array[index].test_func())
per_cpu_test_data[t->cpu][index].test_passed++;
else
per_cpu_test_data[t->cpu][index].test_failed++;
_



[PATCH] RAS/CEC: Add debugfs switch to disable at run time

2019-04-18 Thread Tony Luck
Useful when running error injection tests that want to
see all of the MCi_(STATUS|ADDR|MISC) data via /dev/mcelog.

Signed-off-by: Tony Luck 
---
 drivers/ras/cec.c | 20 +++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index 2d9ec378a8bc..a2ceedcd8516 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -123,6 +123,9 @@ static u64 dfs_pfn;
 /* Amount of errors after which we offline */
 static unsigned int count_threshold = COUNT_MASK;
 
+/* debugfs switch to enable/disable CEC */
+static u64 cec_enabled = 1;
+
 /*
  * The timer "decays" element count each timer_interval which is 24hrs by
  * default.
@@ -400,6 +403,14 @@ static int count_threshold_set(void *data, u64 val)
 }
 DEFINE_DEBUGFS_ATTRIBUTE(count_threshold_ops, u64_get, count_threshold_set, 
"%lld\n");
 
+static int enable_set(void *data, u64 val)
+{
+   ce_arr.disabled = !val;
+   return 0;
+}
+
+DEFINE_DEBUGFS_ATTRIBUTE(enable_ops, u64_get, enable_set, "%lld\n");
+
 static int array_dump(struct seq_file *m, void *v)
 {
struct ce_array *ca = &ce_arr;
@@ -451,7 +462,7 @@ static const struct file_operations array_ops = {
 
 static int __init create_debugfs_nodes(void)
 {
-   struct dentry *d, *pfn, *decay, *count, *array;
+   struct dentry *d, *pfn, *decay, *count, *array, *enable;
 
d = debugfs_create_dir("cec", ras_debugfs_dir);
if (!d) {
@@ -485,6 +496,13 @@ static int __init create_debugfs_nodes(void)
goto err;
}
 
+   enable = debugfs_create_file("enable", S_IRUSR | S_IWUSR, d,
+   &cec_enabled, &enable_ops);
+   if (!enable) {
+   pr_warn("Error creating enable debugfs node!\n");
+   goto err;
+   }
+
 
return 0;
 
-- 
2.19.1



Re: [RESEND v2] soc: imx: Add generic i.MX8 SoC driver

2019-04-18 Thread Leonard Crestez
On 3/28/2019 6:43 PM, Leonard Crestez wrote:
> On Fri, 2019-03-22 at 16:49 +, Abel Vesa wrote:
>> Add generic i.MX8 SoC driver along with the i.MX8MQ SoC specific code.
>> For now, only i.MX8MQ revision B1 is supported. For any other, i.MX8MQ
>> revision it will print 'unknown'.
> 
>> +#define REV_B1  0x21
>> +
>> +#define IMX8MQ_SW_INFO_B1   0x40
>> +#define IMX8MQ_SW_MAGIC_B1  0xff0055aa
>> +
>> +np = of_find_compatible_node(NULL, NULL, "fsl,imx8mq-ocotp");
>> +
>> +ocotp_base = of_iomap(np, 0);
>> +
>> +magic = readl_relaxed(ocotp_base + IMX8MQ_SW_INFO_B1);
>> +if (magic == IMX8MQ_SW_MAGIC_B1)
>> +rev = REV_B1;
> 
> This is based on ATF code in vendor tree, but shouldn't we have some
> sort of explanation for this "magic"?
> 
> Looking at the OCOTP driver reg 0x40 is IMX_OCOTP_ADDR_DATA2 and it's
> used as part of fuse writes. According to the driver code 8mq is
> compatible with 7d and this write path is enabled for imx8mq-ocotp.

After further digging in NXP manuals and uboot sources it seems that 
imx8mq ocotp is like imx6 rather than imx7. Posted fix for nvmem driver:

https://patchwork.kernel.org/patch/10908081/

Reviewed-by: Leonard Crestez 

It might still be nice to find a way to identify imx8mq B0.


Re: [REGRESSION 5.0.x] Windows XP broken on KVM

2019-04-18 Thread Harald Arnesen
Takashi Iwai [18.04.2019 09:38]:

> Hi,
> 
> we've got a regression report on the recent 5.0.x kernel, starting
> from 5.0.6, where Windows XP can't boot on KVM any longer.

Not for all configurations. I just checked with 5.0.8, and Windows XP
boots just fine on KVM.
-- 
Hilsen Harald



Re: [PATCH] clk: ingenic/jz4725b: Fix parent of pixel clock

2019-04-18 Thread Stephen Boyd
Quoting Paul Cercueil (2019-04-17 04:24:20)
> The pixel clock is directly connected to the output of the PLL, and not
> to the /2 divider.
> 
> Cc: sta...@vger.kernel.org
> Fixes: 226dfa4726eb ("clk: Add Ingenic jz4725b CGU driver")
> Signed-off-by: Paul Cercueil 
> ---

Applied to clk-next



Re: [PATCH] clk: mvebu: fix spelling mistake "gatable" -> "gateable"

2019-04-18 Thread Stephen Boyd
Quoting Colin King (2019-04-16 04:56:16)
> From: Colin Ian King 
> 
> There are a few spelling mistakes in comments and a pr_err
> error message. Fix these.
> 
> Signed-off-by: Colin Ian King 
> ---

Applied to clk-next



Re: [PATCH v2] clk: hi3660: Mark clk_gate_ufs_subsys as critical

2019-04-18 Thread Stephen Boyd
Quoting Leo Yan (2019-04-17 22:33:39)
> Hi Michael, Stephen,
> 
> On Wed, Mar 20, 2019 at 06:05:08PM +0800, Leo Yan wrote:
> > clk_gate_ufs_subsys is a system bus clock, turning off it will
> > introduce lockup issue during system suspend flow.  Let's mark
> > clk_gate_ufs_subsys as critical clock, thus keeps it on during
> > system suspend and resume.
> 
> Could you pick up this patch?  Or should I resend it?
> 

I can pick it up. Is it critical or can it wait for v5.2?



Re: [PATCH bpf-next v2 1/3] bpf: sock ops: add netns ino and dev in bpf context

2019-04-18 Thread Song Liu
On Thu, Apr 18, 2019 at 2:41 PM Song Liu  wrote:
>
> On Thu, Apr 18, 2019 at 8:59 AM Alban Crequy  wrote:
> >
> > From: Alban Crequy 
> >
> > sockops programs can now access the network namespace inode and device
> > via (struct bpf_sock_ops)->netns_ino and ->netns_dev. This can be useful
> > to apply different policies on different network namespaces.
> >
> > In the unlikely case where network namespaces are not compiled in
> > (CONFIG_NET_NS=n), the verifier will not allow access to ->netns_*.
> >
> > The generated BPF bytecode for netns_ino is loading the correct inode
> > number at the time of execution.
> >
> > However, the generated BPF bytecode for netns_dev is loading an
> > immediate value determined at BPF-load-time by looking at the initial
> > network namespace. In practice, this works because all netns currently
> > use the same virtual device. If this was to change, this code would need
> > to be updated too.
> >
> > Signed-off-by: Alban Crequy 
>
> Acked-by: Song Liu 

Sorry! I didn't see Yonghong's comments. Please resolve those.

Sorry for the confusion.
Song

>
> >
> > ---
> >
> > Changes since v1:
> > - add netns_dev (review from Alexei)
> > ---
> >  include/uapi/linux/bpf.h |  2 ++
> >  net/core/filter.c| 70 
> >  2 files changed, 72 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index eaf2d3284248..f4f841dde42c 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -3213,6 +3213,8 @@ struct bpf_sock_ops {
> > __u32 sk_txhash;
> > __u64 bytes_received;
> > __u64 bytes_acked;
> > +   __u64 netns_dev;
> > +   __u64 netns_ino;
> >  };
> >
> >  /* Definitions for bpf_sock_ops_cb_flags */
> > diff --git a/net/core/filter.c b/net/core/filter.c
> > index 1833926a63fc..93e3429603d7 100644
> > --- a/net/core/filter.c
> > +++ b/net/core/filter.c
> > @@ -75,6 +75,8 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> >
> >  /**
> >   * sk_filter_trim_cap - run a packet through a socket filter
> > @@ -6774,6 +6776,15 @@ static bool sock_ops_is_valid_access(int off, int 
> > size,
> > }
> > } else {
> > switch (off) {
> > +   case offsetof(struct bpf_sock_ops, netns_dev):
> > +   case offsetof(struct bpf_sock_ops, netns_ino):
> > +#ifdef CONFIG_NET_NS
> > +   if (size != sizeof(__u64))
> > +   return false;
> > +#else
> > +   return false;
> > +#endif
> > +   break;
> > case bpf_ctx_range_till(struct bpf_sock_ops, bytes_received,
> > bytes_acked):
> > if (size != sizeof(__u64))
> > @@ -7660,6 +7671,11 @@ static u32 sock_addr_convert_ctx_access(enum 
> > bpf_access_type type,
> > return insn - insn_buf;
> >  }
> >
> > +static struct ns_common *sockops_netns_cb(void *private_data)
> > +{
> > +   return &init_net.ns;
> > +}
> > +
> >  static u32 sock_ops_convert_ctx_access(enum bpf_access_type type,
> >const struct bpf_insn *si,
> >struct bpf_insn *insn_buf,
> > @@ -7668,6 +7684,10 @@ static u32 sock_ops_convert_ctx_access(enum 
> > bpf_access_type type,
> >  {
> > struct bpf_insn *insn = insn_buf;
> > int off;
> > +   struct inode *ns_inode;
> > +   struct path ns_path;
> > +   __u64 netns_dev;
> > +   void *res;
> >
> >  /* Helper macro for adding read access to tcp_sock or sock fields. */
> >  #define SOCK_OPS_GET_FIELD(BPF_FIELD, OBJ_FIELD, OBJ)  
> >   \
> > @@ -7914,6 +7934,56 @@ static u32 sock_ops_convert_ctx_access(enum 
> > bpf_access_type type,
> > SOCK_OPS_GET_OR_SET_FIELD(sk_txhash, sk_txhash,
> >   struct sock, type);
> > break;
> > +
> > +   case offsetof(struct bpf_sock_ops, netns_dev):
> > +#ifdef CONFIG_NET_NS
> > +   /* We get the netns_dev at BPF-load-time and not at
> > +* BPF-exec-time. We assume that netns_dev is a constant.
> > +*/
> > +   res = ns_get_path_cb(&ns_path, sockops_netns_cb, NULL);
> > +   if (IS_ERR(res)) {
> > +   netns_dev = 0;
> > +   } else {
> > +   ns_inode = ns_path.dentry->d_inode;
> > +   netns_dev = new_encode_dev(ns_inode->i_sb->s_dev);
> > +   }
> > +#else
> > +   netns_dev = 0;
> > +#endif
> > +   *insn++ = BPF_MOV64_IMM(si->dst_reg, netns_dev);
> > +   break;
> > +
> > +   case offsetof(struct bpf_sock_ops, netns_ino):
> > +#ifdef CONFIG_NET_NS
> > +   /* Loading: sk_ops->sk->__sk_common.skc_net.net->ns.inum
> > +* 

Re: [tip:perf/core] perf/x86/intel: Force resched when TFA sysctl is modified

2019-04-18 Thread Stephane Eranian
Vince

On Tue, Apr 16, 2019 at 11:06 PM Ingo Molnar  wrote:
>
>
> * Vince Weaver  wrote:
>
> > On Tue, 16 Apr 2019, tip-bot for Stephane Eranian wrote:
> >
> > > Commit-ID:  f447e4eb3ad1e60d173ca997fcb2ef2a66f12574
> > > Gitweb: 
> > > https://git.kernel.org/tip/f447e4eb3ad1e60d173ca997fcb2ef2a66f12574
> > > Author: Stephane Eranian 
> > > AuthorDate: Mon, 8 Apr 2019 10:32:52 -0700
> > > Committer:  Ingo Molnar 
> > > CommitDate: Tue, 16 Apr 2019 12:19:35 +0200
> > >
> > > perf/x86/intel: Force resched when TFA sysctl is modified
> >
> > What's TFA?  Tuna-fish-alarm?
>
> Heh, I wish! :-)
>
Sorry about the confusion. I was just trying to mimic the function
names that Peter used
in the code. Hard to fit the whole sysctl name in the title, anyway.

> > [...] Nowhere in the commit or in the code does it ever say what a TFA
> > is or why we'd want to resched when it is modified.
>
> Yeah, it's the TSX-Force-Abort acronym - Intel has numbed our general
> dislike of random acronyms ...
>
> Peter and me usually fix such changelog context omissions, but this one
> slipped through. :-/
>
> The commit is too deep down perf/core already to rebase it just for the
> changelog, but if we are going to rebase it for some functional reason
> I'll take care of it next time around.
>
> TFA. (Thanks For your Assistance. :-)
>
> Ingo


Re: [REGRESSION 5.0.x] Windows XP broken on KVM

2019-04-18 Thread Harald Arnesen
Greg Kroah-Hartman [18.04.2019 09:53]:

> On Thu, Apr 18, 2019 at 09:38:52AM +0200, Takashi Iwai wrote:
>> Hi,
>> 
>> we've got a regression report on the recent 5.0.x kernel, starting
>> from 5.0.6, where Windows XP can't boot on KVM any longer.
>> 
>> The culprit seems to be the patch
>>KVM: x86: update %rip after emulating IO
>> with the upstream commit 45def77ebf79e2e8942b89ed79294d97ce914fa0.
>> Reverting this alone fixed the problem.
>> 
>> The report is found at openSUSE bugzilla:
>>   https://bugzilla.suse.com/show_bug.cgi?id=1132694
>> 
>> Is there already a followup fix?  If not, we need to revert it from
>> stable, at least.
> 
> Is this also a problem in 5.1-rc5?

My previously installed Windows XP boots and runs fine in 5.1-rc5.
-- 
Hilsen Harald


Re: [PATCH] cifs: fix page reference leak with readv/writev

2019-04-18 Thread Pavel Shilovsky
Thu, Apr 18, 2019 at 14:12, Jerome Glisse :
>
> On Wed, Apr 10, 2019 at 09:47:01PM -0500, Steve French wrote:
> > How was this discovered? Does it address a reported user problem?
>
> I spotted it while tracking down how page references are taken
> for bios and how they are released. In the current code, once the pages
> are GUPed they are never released if there is any failure; on failure
> cifs_aio_ctx_release() will be called, but it will just free the bio_vec,
> not release the page references.
>
> Page references are only dropped if everything is successful.
>
> So this patch moves the page reference dropping to cifs_aio_ctx_release(),
> which is called from all code paths including failure AFAICT, and thus page
> references will be dropped if a failure does happen.
>
> Cheers,
> Jérôme
>
> >
> > On Wed, Apr 10, 2019 at 2:38 PM  wrote:
> > >
> > > From: Jérôme Glisse 
> > >
> > > CIFS can leak page references gotten through GUP (get_user_pages*()
> > > through iov_iter_get_pages()). This happens if cifs_send_async_read()
> > > or cifs_write_from_iter() calls fail from within __cifs_readv() and
> > > __cifs_writev() respectively. This patch moves the page unreferencing to
> > > cifs_aio_ctx_release(), which happens on all code paths; this is
> > > simpler to follow for correctness.
> > >
> > > Signed-off-by: Jérôme Glisse 
> > > Cc: Steve French 
> > > Cc: linux-c...@vger.kernel.org
> > > Cc: samba-techni...@lists.samba.org
> > > Cc: Alexander Viro 
> > > Cc: linux-fsde...@vger.kernel.org
> > > Cc: Linus Torvalds 
> > > Cc: Stable 
> > > ---
> > >  fs/cifs/file.c | 15 +--
> > >  fs/cifs/misc.c | 23 ++-
> > >  2 files changed, 23 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> > > index 89006e044973..a756a4d3f70f 100644
> > > --- a/fs/cifs/file.c
> > > +++ b/fs/cifs/file.c
> > > @@ -2858,7 +2858,6 @@ static void collect_uncached_write_data(struct 
> > > cifs_aio_ctx *ctx)
> > > struct cifs_tcon *tcon;
> > > struct cifs_sb_info *cifs_sb;
> > > struct dentry *dentry = ctx->cfile->dentry;
> > > -   unsigned int i;
> > > int rc;
> > >
> > > tcon = tlink_tcon(ctx->cfile->tlink);
> > > @@ -2922,10 +2921,6 @@ static void collect_uncached_write_data(struct 
> > > cifs_aio_ctx *ctx)
> > > kref_put(>refcount, 
> > > cifs_uncached_writedata_release);
> > > }
> > >
> > > -   if (!ctx->direct_io)
> > > -   for (i = 0; i < ctx->npages; i++)
> > > -   put_page(ctx->bv[i].bv_page);
> > > -
> > > cifs_stats_bytes_written(tcon, ctx->total_len);
> > > set_bit(CIFS_INO_INVALID_MAPPING,
> > > &CIFS_I(dentry->d_inode)->flags);
> > >
> > > @@ -3563,7 +3558,6 @@ collect_uncached_read_data(struct cifs_aio_ctx *ctx)
> > > struct iov_iter *to = &ctx->iter;
> > > struct cifs_sb_info *cifs_sb;
> > > struct cifs_tcon *tcon;
> > > -   unsigned int i;
> > > int rc;
> > >
> > > tcon = tlink_tcon(ctx->cfile->tlink);
> > > @@ -3647,15 +3641,8 @@ collect_uncached_read_data(struct cifs_aio_ctx 
> > > *ctx)
> > > kref_put(>refcount, 
> > > cifs_uncached_readdata_release);
> > > }
> > >
> > > -   if (!ctx->direct_io) {
> > > -   for (i = 0; i < ctx->npages; i++) {
> > > -   if (ctx->should_dirty)
> > > -   set_page_dirty(ctx->bv[i].bv_page);
> > > -   put_page(ctx->bv[i].bv_page);
> > > -   }
> > > -
> > > +   if (!ctx->direct_io)
> > > ctx->total_len = ctx->len - iov_iter_count(to);
> > > -   }
> > >
> > > /* mask nodata case */
> > > if (rc == -ENODATA)
> > > diff --git a/fs/cifs/misc.c b/fs/cifs/misc.c
> > > index bee203055b30..9bc0d17a9d77 100644
> > > --- a/fs/cifs/misc.c
> > > +++ b/fs/cifs/misc.c
> > > @@ -768,6 +768,11 @@ cifs_aio_ctx_alloc(void)
> > >  {
> > > struct cifs_aio_ctx *ctx;
> > >
> > > +   /*
> > > +* Must use kzalloc to initialize ctx->bv to NULL and
> > > +* ctx->direct_io to false so that we know when we have to
> > > +* unreference pages within cifs_aio_ctx_release()
> > > +*/
> > > ctx = kzalloc(sizeof(struct cifs_aio_ctx), GFP_KERNEL);
> > > if (!ctx)
> > > return NULL;
> > > @@ -786,7 +791,23 @@ cifs_aio_ctx_release(struct kref *refcount)
> > > struct cifs_aio_ctx, refcount);
> > >
> > > cifsFileInfo_put(ctx->cfile);
> > > -   kvfree(ctx->bv);
> > > +
> > > +   /*
> > > +* ctx->bv is only set if setup_aio_ctx_iter() was called
> > > +* successfully, which means that iov_iter_get_pages() was a
> > > +* success and thus that we have taken references on pages.
> > > +*/
> > > +   if (ctx->bv) {
> > > +   unsigned i;
> > > +
> > > +   for (i = 0; 

[PATCH] memcg: refill_stock for kmem uncharging too

2019-04-18 Thread Shakeel Butt
The commit 475d0487a2ad ("mm: memcontrol: use per-cpu stocks for socket
memory uncharging") added refill_stock() for skmem uncharging path to
optimize workloads having high network traffic. Do the same for the kmem
uncharging as well. However bypass the refill for offlined memcgs to not
cause zombie apocalypse.

Signed-off-by: Shakeel Butt 
---
 mm/memcontrol.c | 17 -
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 2535e54e7989..7b8de091f572 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -178,6 +178,7 @@ struct mem_cgroup_event {
 
 static void mem_cgroup_threshold(struct mem_cgroup *memcg);
 static void mem_cgroup_oom_notify(struct mem_cgroup *memcg);
+static void cancel_charge(struct mem_cgroup *memcg, unsigned int nr_pages);
 
 /* Stuffs for move charges at task migration. */
 /*
@@ -2097,10 +2098,7 @@ static void drain_stock(struct memcg_stock_pcp *stock)
struct mem_cgroup *old = stock->cached;
 
if (stock->nr_pages) {
-   page_counter_uncharge(&old->memory, stock->nr_pages);
-   if (do_memsw_account())
-   page_counter_uncharge(&old->memsw, stock->nr_pages);
-   css_put_many(&old->css, stock->nr_pages);
+   cancel_charge(old, stock->nr_pages);
stock->nr_pages = 0;
}
stock->cached = NULL;
@@ -2133,6 +2131,11 @@ static void refill_stock(struct mem_cgroup *memcg, 
unsigned int nr_pages)
struct memcg_stock_pcp *stock;
unsigned long flags;
 
+   if (unlikely(!mem_cgroup_online(memcg))) {
+   cancel_charge(memcg, nr_pages);
+   return;
+   }
+
local_irq_save(flags);
 
stock = this_cpu_ptr(&memcg_stock);
@@ -2768,17 +2771,13 @@ void __memcg_kmem_uncharge(struct page *page, int order)
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
page_counter_uncharge(&memcg->kmem, nr_pages);
 
-   page_counter_uncharge(&memcg->memory, nr_pages);
-   if (do_memsw_account())
-   page_counter_uncharge(&memcg->memsw, nr_pages);
-
page->mem_cgroup = NULL;
 
/* slab pages do not have PageKmemcg flag set */
if (PageKmemcg(page))
__ClearPageKmemcg(page);
 
-   css_put_many(&memcg->css, nr_pages);
+   refill_stock(memcg, nr_pages);
 }
 #endif /* CONFIG_MEMCG_KMEM */
 
-- 
2.21.0.392.gf8f6787159e-goog



Re: [PATCH] i2c: iproc: Change driver to use 'BIT' macro

2019-04-18 Thread Peter Rosin
On 2019-04-18 19:25, Ray Jui wrote:
> 
> 
> On 4/17/2019 11:21 PM, Peter Rosin wrote:
>> On 2019-04-18 01:48, Ray Jui wrote:
>>>
>>>
>>> On 4/14/2019 11:56 PM, Peter Rosin wrote:
 On 2019-04-13 00:59, Peter Rosin wrote:
> On 2019-04-03 23:05, Ray Jui wrote:
>> Change the iProc I2C driver to use the 'BIT' macro from all '1 << XXX'
>> bit operations to get rid of compiler warning and improve readability of
>> the code
>
> All? I see lots more '1 << XXX_SHIFT' matches. I might be behind though?

 I verified that, and yes indeed, I was behind. That said, see below...

>>>
>>> Right. Previous change that this change depends on is already queued in
>>> i2c/for-next.
>>>
> Anyway, if you are cleaning up, I'm just flagging that BIT(XXX_SHIFT) 
> looks
> a bit clunky to me. You might consider renaming all those single-bit
> XXX_SHIFT macros to simple be
>
> #define XXX BIT()
>
> instead of
>
> #define XXX_SHIFT 
>
> but that triggers more churn, so is obviously more error prone. You might
> not dare it?
>
>>>
>>> With the current code, I don't see how that is cleaner. With XXX_SHIFT
>>> specified, it makes it very clear to the user that the define is for a
>>> bit location within a register. You can argue that it makes the
>>> define longer, but not less clear.
>>>
> Cheers,
> Peter
>
>> Signed-off-by: Ray Jui 
>> ---
>>  drivers/i2c/busses/i2c-bcm-iproc.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/i2c/busses/i2c-bcm-iproc.c 
>> b/drivers/i2c/busses/i2c-bcm-iproc.c
>> index 562942d0c05c..a845b8decac8 100644
>> --- a/drivers/i2c/busses/i2c-bcm-iproc.c
>> +++ b/drivers/i2c/busses/i2c-bcm-iproc.c
>> @@ -717,7 +717,7 @@ static int bcm_iproc_i2c_xfer_single_msg(struct 
>> bcm_iproc_i2c_dev *iproc_i2c,
>>  
>>  /* mark the last byte */
>>  if (i == msg->len - 1)
>> -val |= 1 << M_TX_WR_STATUS_SHIFT;
>> +val |= BIT(M_TX_WR_STATUS_SHIFT);
>>  
>>  iproc_i2c_wr_reg(iproc_i2c, M_TX_OFFSET, val);
>>  }
>> @@ -844,7 +844,7 @@ static int bcm_iproc_i2c_cfg_speed(struct 
>> bcm_iproc_i2c_dev *iproc_i2c)
>>  
>>  iproc_i2c->bus_speed = bus_speed;
>>  val = iproc_i2c_rd_reg(iproc_i2c, TIM_CFG_OFFSET);
>> -val &= ~(1 << TIM_CFG_MODE_400_SHIFT);
>> +val &= ~BIT(TIM_CFG_MODE_400_SHIFT);
>>  val |= (bus_speed == 40) << TIM_CFG_MODE_400_SHIFT;

 These two statements now no longer "match". One uses BIT and the other open
 codes the shift. I think that's bad. Losing the _SHIFT suffix and including
 BIT in the macro expansion, as suggested above, yields:

val &= ~TIM_CFG_MODE_400;
if (bus_speed == 40)
val |= TIM_CFG_MODE_400;

 which is perhaps one more line, but also more readable IMO.

>>>
>>> A single line with evaluation embedded is nice and clean to me. I guess
>>> this is subjective.
>>
>> The "problem" I had when I looked at the driver was not any one specific
>> instance. It was just that, for my taste, the code had too many shifts
>> etc inline with the real code. Replacing 1 << xyz_SHIFT with BIT(xyz_SHIFT)
>> is not a real improvement, they are just about equal to me, it's just that
>> there are still too many mechanical things happening that could easily be
>> tucked away with the suggested approach.
>>
> 
> Right, for your taste. Like I said, I feel this is very subjective. To
> me, and many other I2C driver owners (I just checked how many other I2C
> drivers also appear to prefer to use XXX_SHIFT and there are a lot of
> them), using XXX_SHIFT makes it more clear that the define is intended
> to be used for bit shift operation.

Which is a very strange thing to say about my suggestion. There is no need
for a _SHIFT suffix for the macro names if they are not going to be used in
shifts! That's the whole friggin point.

Regarding other I2C drivers, I just had a brief look at about 10 or so
picked at random, and NONE of them use the XXX_SHIFT paradigm that this
driver is using. The ones I picked were:

i2c-acorn.c
i2c-ali563.c
i2c-altera.c
i2c-at91.c
i2c-cmp.c
i2c-davinci.c
i2c-digicolor.c
i2c-elektor.c
i2c-st.c

The only one I looked at not doing it the way I suggested is i2c-dln2.c
which does not appear to need any register field accesses at all.

So, perhaps you should read the suggestion again with more care? Or not.
Anyway, I'm not going to waste any more time here.

Cheers,
Peter

> 
>>> I'll leave the decision to Wolfram. If he also prefers the above change
>>> to be made, sure. Otherwise, I'll leave it as it is.
>>
>> But if you see no value in my suggestion and/or don't want to take the
>> cleanup 

Re: [PATCH v3 08/11] mips: Properly account for stack randomization and stack guard gap

2019-04-18 Thread Paul Burton
Hi Alexandre,

On Wed, Apr 17, 2019 at 01:22:44AM -0400, Alexandre Ghiti wrote:
> This commit takes care of stack randomization and stack guard gap when
> computing mmap base address and checks if the task asked for randomization.
> This fixes the problem uncovered and not fixed for mips here:
> https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1429066.html
> 
> Signed-off-by: Alexandre Ghiti 

For patches 8-10:

Acked-by: Paul Burton 

Thanks for improving this,

Paul

> ---
>  arch/mips/mm/mmap.c | 14 --
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/mips/mm/mmap.c b/arch/mips/mm/mmap.c
> index 2f616ebeb7e0..3ff82c6f7e24 100644
> --- a/arch/mips/mm/mmap.c
> +++ b/arch/mips/mm/mmap.c
> @@ -21,8 +21,9 @@ unsigned long shm_align_mask = PAGE_SIZE - 1;   /* Sane caches */
>  EXPORT_SYMBOL(shm_align_mask);
>  
>  /* gap between mmap and stack */
> -#define MIN_GAP (128*1024*1024UL)
> -#define MAX_GAP ((TASK_SIZE)/6*5)
> +#define MIN_GAP  (128*1024*1024UL)
> +#define MAX_GAP  ((TASK_SIZE)/6*5)
> +#define STACK_RND_MASK   (0x7ff >> (PAGE_SHIFT - 12))
>  
>  static int mmap_is_legacy(struct rlimit *rlim_stack)
>  {
> @@ -38,6 +39,15 @@ static int mmap_is_legacy(struct rlimit *rlim_stack)
>  static unsigned long mmap_base(unsigned long rnd, struct rlimit *rlim_stack)
>  {
>   unsigned long gap = rlim_stack->rlim_cur;
> + unsigned long pad = stack_guard_gap;
> +
> + /* Account for stack randomization if necessary */
> + if (current->flags & PF_RANDOMIZE)
> + pad += (STACK_RND_MASK << PAGE_SHIFT);
> +
> + /* Values close to RLIM_INFINITY can overflow. */
> + if (gap + pad > gap)
> + gap += pad;
>  
>   if (gap < MIN_GAP)
>   gap = MIN_GAP;
> -- 
> 2.20.1
> 


Re: [PATCH 5/7] mmc: meson-gx: avoid clock glitch when switching to DDR modes

2019-04-18 Thread Jerome Brunet
On Thu, 2019-04-18 at 22:53 +0200, Martin Blumenstingl wrote:
> Hi Jerome,
> 
> On Thu, Apr 18, 2019 at 10:46 PM Jerome Brunet  wrote:
> > On Thu, 2019-04-18 at 22:16 +0200, Martin Blumenstingl wrote:
> > > Hi Jerome,
> > > 
> > > On Wed, Apr 17, 2019 at 10:44 PM Jerome Brunet  
> > > wrote:
> > > > Activating DDR in the Amlogic mmc controller, among other things, will
> > > > divide the output clock by 2. So by activating it with clock on, we are
> > > > creating a glitch on the output.
> > > > 
> > > > Instead, let's deal with DDR when the clock output is off, when setting
> > > > the clock.
> > > > 
> > > > Signed-off-by: Jerome Brunet 
> > > it seems that this patch breaks SD card on my Khadas VIM and Khadas VIM2.
> > 
> > The error I see in your logs is with eMMC and hs200, not SD card.
> sorry, I should have been more clear that there are two errors:
> eMMC, this is what I have been seeing for a while on my Khadas VIM2
> (it's probably not related to this patch):
>   mmc1: mmc_select_hs200 failed, error -84
>   mmc1: error -84 whilst initialising MMC card

The following patches were also supposed to help the vim2,
... something I tested numerous times on this particular board.

> 
> however, then there's this other error:
>   print_req_error: I/O error, dev mmcblk0, sector 0 flags 0
>   Buffer I/O error on dev mmcblk0, logical block 0, async page read
> as result of this the partition table cannot be read and my kernel
> cannot find the rootfs.

I don't know what the problem is (probably some CRC error - you can check the
log in the interrupt handler for this) but I'm pretty sure it is not related to this patch.

I see in the log that the SD card is indeed in HS mode, so not in any DDR mode.
I also see that the SDIO fails as well. There is something really weird, which can't
be explained by this patch alone AFAICT.

> 
> > Either way, there is something I don't really get. eMMC should not go through
> > any DDR mode to reach HS200 (which is an SDR mode), neither should SD to reach
> > HS.
> > 
> > All this does is flipping the DDR bit (when necessary) when the clock is off for
> > the mmc device, avoiding a glitch on the clk line.
> > 
> > This patch should not make any difference for an SDR-only setup. Maybe I missed
> > something, but I don't see how it could make anything different for SDR only.
> > 
> > I (repeatedly) tested both vim1 and vim2, without seeing this issue, so I can't
> > debug this. I'll need more detail to progress; something does not make sense here.
> please let me know from which part of the driver do you want debug logs
> 
> 
> Regards
> Martin




Re: [PATCH] cifs: fix page reference leak with readv/writev

2019-04-18 Thread Jerome Glisse
On Wed, Apr 10, 2019 at 09:47:01PM -0500, Steve French wrote:
> How was this discovered? Does it address a reported user problem?

I spotted it while tracking down how page references are taken
for bios and how they are released. In the current code, once the pages
are GUPed they are never released if there is any failure; on failure
cifs_aio_ctx_release() will be called, but it will just free the bio_vec,
not release the page references.

Page references are only dropped if everything is successful.

So this patch moves the page reference dropping to cifs_aio_ctx_release(),
which is called from all code paths including failure AFAICT, and thus the
page references will be dropped if a failure does happen.

Cheers,
Jérôme

> 
> On Wed, Apr 10, 2019 at 2:38 PM  wrote:
> >
> > From: Jérôme Glisse 
> >
> > CIFS can leak page references taken through GUP (get_user_pages*()
> > through iov_iter_get_pages()). This happens if the cifs_send_async_read()
> > or cifs_write_from_iter() calls fail from within __cifs_readv() and
> > __cifs_writev() respectively. This patch moves the page unreferencing to
> > cifs_aio_ctx_release(), which happens on all code paths; this is
> > also simpler to follow for correctness.
> >
> > Signed-off-by: Jérôme Glisse 
> > Cc: Steve French 
> > Cc: linux-c...@vger.kernel.org
> > Cc: samba-techni...@lists.samba.org
> > Cc: Alexander Viro 
> > Cc: linux-fsde...@vger.kernel.org
> > Cc: Linus Torvalds 
> > Cc: Stable 
> > ---
> >  fs/cifs/file.c | 15 +--
> >  fs/cifs/misc.c | 23 ++-
> >  2 files changed, 23 insertions(+), 15 deletions(-)
> >
> > diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> > index 89006e044973..a756a4d3f70f 100644
> > --- a/fs/cifs/file.c
> > +++ b/fs/cifs/file.c
> > @@ -2858,7 +2858,6 @@ static void collect_uncached_write_data(struct cifs_aio_ctx *ctx)
> > struct cifs_tcon *tcon;
> > struct cifs_sb_info *cifs_sb;
> > struct dentry *dentry = ctx->cfile->dentry;
> > -   unsigned int i;
> > int rc;
> >
> > tcon = tlink_tcon(ctx->cfile->tlink);
> > @@ -2922,10 +2921,6 @@ static void collect_uncached_write_data(struct cifs_aio_ctx *ctx)
> > kref_put(>refcount, cifs_uncached_writedata_release);
> > }
> >
> > -   if (!ctx->direct_io)
> > -   for (i = 0; i < ctx->npages; i++)
> > -   put_page(ctx->bv[i].bv_page);
> > -
> > cifs_stats_bytes_written(tcon, ctx->total_len);
> > set_bit(CIFS_INO_INVALID_MAPPING, _I(dentry->d_inode)->flags);
> >
> > @@ -3563,7 +3558,6 @@ collect_uncached_read_data(struct cifs_aio_ctx *ctx)
> > struct iov_iter *to = >iter;
> > struct cifs_sb_info *cifs_sb;
> > struct cifs_tcon *tcon;
> > -   unsigned int i;
> > int rc;
> >
> > tcon = tlink_tcon(ctx->cfile->tlink);
> > @@ -3647,15 +3641,8 @@ collect_uncached_read_data(struct cifs_aio_ctx *ctx)
> > kref_put(>refcount, cifs_uncached_readdata_release);
> > }
> >
> > -   if (!ctx->direct_io) {
> > -   for (i = 0; i < ctx->npages; i++) {
> > -   if (ctx->should_dirty)
> > -   set_page_dirty(ctx->bv[i].bv_page);
> > -   put_page(ctx->bv[i].bv_page);
> > -   }
> > -
> > +   if (!ctx->direct_io)
> > ctx->total_len = ctx->len - iov_iter_count(to);
> > -   }
> >
> > /* mask nodata case */
> > if (rc == -ENODATA)
> > diff --git a/fs/cifs/misc.c b/fs/cifs/misc.c
> > index bee203055b30..9bc0d17a9d77 100644
> > --- a/fs/cifs/misc.c
> > +++ b/fs/cifs/misc.c
> > @@ -768,6 +768,11 @@ cifs_aio_ctx_alloc(void)
> >  {
> > struct cifs_aio_ctx *ctx;
> >
> > +   /*
> > +* Must use kzalloc to initialize ctx->bv to NULL and ctx->direct_io
> > +* to false so that we know when we have to unreference pages within
> > +* cifs_aio_ctx_release()
> > +*/
> > ctx = kzalloc(sizeof(struct cifs_aio_ctx), GFP_KERNEL);
> > if (!ctx)
> > return NULL;
> > @@ -786,7 +791,23 @@ cifs_aio_ctx_release(struct kref *refcount)
> > struct cifs_aio_ctx, refcount);
> >
> > cifsFileInfo_put(ctx->cfile);
> > -   kvfree(ctx->bv);
> > +
> > +   /*
> > +* ctx->bv is only set if setup_aio_ctx_iter() was called successfully,
> > +* which means that iov_iter_get_pages() was a success and thus that
> > +* we have taken references on the pages.
> > +*/
> > +   if (ctx->bv) {
> > +   unsigned i;
> > +
> > +   for (i = 0; i < ctx->npages; i++) {
> > +   if (ctx->should_dirty)
> > +   set_page_dirty(ctx->bv[i].bv_page);
> > +   put_page(ctx->bv[i].bv_page);
> > +   }
> > +   kvfree(ctx->bv);
> > +   }
> > +
> > kfree(ctx);
> >  }
> >
> > --
> > 2.20.1
> >
> 

Re: [PATCH v3 00/28] userfaultfd: write protection support

2019-04-18 Thread Jerome Glisse
On Tue, Apr 09, 2019 at 02:08:39PM +0800, Peter Xu wrote:
> On Wed, Mar 20, 2019 at 10:06:14AM +0800, Peter Xu wrote:
> > This series implements initial write protection support for
> > userfaultfd.  Currently both shmem and hugetlbfs are not supported
> > yet, but only anonymous memory.  This is the 3rd version of it.
> > 
> > The latest code can also be found at:
> > 
> >   https://github.com/xzpeter/linux/tree/uffd-wp-merged
> > 
> > Note again that the first 5 patches in the series can be seen as
> > isolated work on page fault mechanism.  I would hope that they can be
> > considered to be reviewed/picked even earlier than the rest of the
> > series since it's even useful for existing userfaultfd MISSING case
> > [8].
> 
> Ping - any further comments for v3?  Is there any chance to have this
> series (or the first 5 patches) for 5.2?

A few issues are left; sorry for taking so long to get to the review, sometimes
it goes to the bottom of my stack.

I am guessing this should be merged through Andrew? Unless Andrea has
a tree for userfaultfd (I am not following all that closely).

From my point of view it almost all looks good. I sent reviews before
this email. Maybe we need some review from x86 folks on the x86 arch
changes for the feature?

Cheers,
Jérôme


Re: [v2 RFC PATCH 0/9] Another Approach to Use PMEM as NUMA Node

2019-04-18 Thread Zi Yan
On 18 Apr 2019, at 15:23, Yang Shi wrote:

> On 4/18/19 11:16 AM, Keith Busch wrote:
>> On Wed, Apr 17, 2019 at 10:13:44AM -0700, Dave Hansen wrote:
>>> On 4/17/19 2:23 AM, Michal Hocko wrote:
 yes. This could be achieved by GFP_NOWAIT opportunistic allocation for
 the migration target. That should prevent loops or artificial nodes from
 being exhausted quite naturally AFAICS. Maybe we will need some tricks to
 raise the watermark but I am not convinced something like that is really
 necessary.
>>> I don't think GFP_NOWAIT alone is good enough.
>>>
>>> Let's say we have a system full of clean page cache and only two nodes:
>>> 0 and 1.  GFP_NOWAIT will eventually kick off kswapd on both nodes.
>>> Each kswapd will be migrating pages to the *other* node since each is in
>>> the other's fallback path.
>>>
>>> I think what you're saying is that, eventually, the kswapds will see
>>> allocation failures and stop migrating, providing hysteresis.  This is
>>> probably true.
>>>
>>> But, I'm more concerned about that window where the kswapds are throwing
>>> pages at each other because they're effectively just wasting resources
>>> in this window.  I guess we should figure out how large this window is
>>> and how fast (or if) the dampening occurs in practice.
>> I'm still refining tests to help answer this and have some preliminary
>> data. My test rig has CPU + memory Node 0, memory-only Node 1, and a
>> fast swap device. The test has an application strict mbind more than
>> the total memory to node 0, and forever writes random cachelines from
>> per-cpu threads.
>
> Thanks for the test. A follow-up question: how about the size of each node?
> Is node 1 bigger than node 0? Since PMEM typically has larger capacity,
> I'm wondering whether the capacity may make things different or not.
>
>> I'm testing two memory pressure policies:
>>
>>Node 0 can migrate to Node 1, no cycles
>>Node 0 and Node 1 migrate with each other (0 -> 1 -> 0 cycles)
>>
>> After the initial ramp up time, the second policy is ~7-10% slower than
>> no cycles. There doesn't appear to be a temporary window dealing with
>> bouncing pages: it's just a slower overall steady state. Looks like when
>> migration fails and falls back to swap, the newly freed pages occasionally
>> get sniped by the other node, keeping the pressure up.


In addition to these two policies, I am curious about how MPOL_PREFERRED to Node 0
performs. I just wonder how badly static page allocation does.

--
Best Regards,
Yan Zi




Re: [PATCH v3 25/28] userfaultfd: wp: fixup swap entries in change_pte_range

2019-04-18 Thread Jerome Glisse
On Wed, Mar 20, 2019 at 10:06:39AM +0800, Peter Xu wrote:
> In change_pte_range() we do nothing for uffd if the PTE is a swap
> entry.  That can lead to data mismatch if the page that we are going
> to write protect is swapped out when sending the UFFDIO_WRITEPROTECT.
> This patch applies/removes the uffd-wp bit even for the swap entries.
> 
> Signed-off-by: Peter Xu 

This one seems to address some of the comments I made on patch 17,
though not all. Maybe squash them together?

> ---
> 
> I kept this patch a standalone one majorly to make review easier.  The
> patch can be considered as standalone or to squash into the patch
> "userfaultfd: wp: support swap and page migration".
> ---
>  mm/mprotect.c | 24 +---
>  1 file changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 96c0f521099d..a23e03053787 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -183,11 +183,11 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>   }
>   ptep_modify_prot_commit(mm, addr, pte, ptent);
>   pages++;
> - } else if (IS_ENABLED(CONFIG_MIGRATION)) {
> + } else if (is_swap_pte(oldpte)) {
>   swp_entry_t entry = pte_to_swp_entry(oldpte);
> + pte_t newpte;
>  
>   if (is_write_migration_entry(entry)) {
> - pte_t newpte;
>   /*
>* A protection check is difficult so
>* just be safe and disable write
> @@ -198,22 +198,24 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>   newpte = pte_swp_mksoft_dirty(newpte);
>   if (pte_swp_uffd_wp(oldpte))
>   newpte = pte_swp_mkuffd_wp(newpte);
> - set_pte_at(mm, addr, pte, newpte);
> -
> - pages++;
> - }
> -
> - if (is_write_device_private_entry(entry)) {
> - pte_t newpte;
> -
> + } else if (is_write_device_private_entry(entry)) {
>   /*
>* We do not preserve soft-dirtiness. See
>* copy_one_pte() for explanation.
>*/
>   make_device_private_entry_read();
>   newpte = swp_entry_to_pte(entry);
> - set_pte_at(mm, addr, pte, newpte);
> + } else {
> + newpte = oldpte;
> + }
>  
> + if (uffd_wp)
> + newpte = pte_swp_mkuffd_wp(newpte);
> + else if (uffd_wp_resolve)
> + newpte = pte_swp_clear_uffd_wp(newpte);
> +
> + if (!pte_same(oldpte, newpte)) {
> + set_pte_at(mm, addr, pte, newpte);
>   pages++;
>   }
>   }
> -- 
> 2.17.1
> 


Re: [PATCH v3 17/28] userfaultfd: wp: support swap and page migration

2019-04-18 Thread Jerome Glisse
On Wed, Mar 20, 2019 at 10:06:31AM +0800, Peter Xu wrote:
> For both swap and page migration, we use bit 2 of the entry to
> identify whether this entry is uffd write-protected.  It plays a similar
> role as the existing soft dirty bit in swap entries but only for keeping
> the uffd-wp tracking for a specific PTE/PMD.
> 
> Something special here is that when we want to recover the uffd-wp bit
> from a swap/migration entry to the PTE bit we'll also need to take care
> of the _PAGE_RW bit and make sure it's cleared, otherwise even with the
> _PAGE_UFFD_WP bit we can't trap it at all.
> 
> Note that this patch removed two lines from "userfaultfd: wp: hook
> userfault handler to write protection fault" where we try to remove the
> VM_FAULT_WRITE from vmf->flags when uffd-wp is set for the VMA.  This
> patch will still keep the write flag there.
> 
> Reviewed-by: Mike Rapoport 
> Signed-off-by: Peter Xu 

Some things are missing, see below.

[...]

> diff --git a/mm/memory.c b/mm/memory.c
> index 6405d56debee..c3d57fa890f2 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -736,6 +736,8 @@ copy_one_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>   pte = swp_entry_to_pte(entry);
>   if (pte_swp_soft_dirty(*src_pte))
>   pte = pte_swp_mksoft_dirty(pte);
> + if (pte_swp_uffd_wp(*src_pte))
> + pte = pte_swp_mkuffd_wp(pte);
>   set_pte_at(src_mm, addr, src_pte, pte);
>   }
>   } else if (is_device_private_entry(entry)) {

You need to handle the is_device_private_entry() case like the migration case
too.



> @@ -2825,6 +2827,10 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
>   flush_icache_page(vma, page);
>   if (pte_swp_soft_dirty(vmf->orig_pte))
>   pte = pte_mksoft_dirty(pte);
> + if (pte_swp_uffd_wp(vmf->orig_pte)) {
> + pte = pte_mkuffd_wp(pte);
> + pte = pte_wrprotect(pte);
> + }
>   set_pte_at(vma->vm_mm, vmf->address, vmf->pte, pte);
>   arch_do_swap_page(vma->vm_mm, vma, vmf->address, pte, vmf->orig_pte);
>   vmf->orig_pte = pte;
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 181f5d2718a9..72cde187d4a1 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -241,6 +241,8 @@ static bool remove_migration_pte(struct page *page, struct vm_area_struct *vma,
>   entry = pte_to_swp_entry(*pvmw.pte);
>   if (is_write_migration_entry(entry))
>   pte = maybe_mkwrite(pte, vma);
> + else if (pte_swp_uffd_wp(*pvmw.pte))
> + pte = pte_mkuffd_wp(pte);
>  
>   if (unlikely(is_zone_device_page(new))) {
>   if (is_device_private_page(new)) {

You need to handle the is_device_private_page() case, i.e. mark its swap
entry as uffd_wp.

> @@ -2301,6 +2303,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
>   swp_pte = swp_entry_to_pte(entry);
>   if (pte_soft_dirty(pte))
>   swp_pte = pte_swp_mksoft_dirty(swp_pte);
> + if (pte_uffd_wp(pte))
> + swp_pte = pte_swp_mkuffd_wp(swp_pte);
>   set_pte_at(mm, addr, ptep, swp_pte);
>
>   /*
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 855dddb07ff2..96c0f521099d 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -196,6 +196,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>   newpte = swp_entry_to_pte(entry);
>   if (pte_swp_soft_dirty(oldpte))
>   newpte = pte_swp_mksoft_dirty(newpte);
> + if (pte_swp_uffd_wp(oldpte))
> + newpte = pte_swp_mkuffd_wp(newpte);
>   set_pte_at(mm, addr, pte, newpte);
>  
>   pages++;

You need to handle the is_write_device_private_entry() case just below
that chunk.

Cheers,
Jérôme


Re: [PATCH v3 00/3] MIPS: SGI-IP27 rework part2

2019-04-18 Thread Bjorn Helgaas
Hi Thomas,

On Tue, Mar 19, 2019 at 04:47:49PM +0100, Thomas Bogendoerfer wrote:
> SGI IP27 (Origin/Onyx2) and SGI IP30 (Octane) have a similar
> architecture and share some hardware (ioc3/bridge). To share
> the software parts this patchset reworks SGI IP27 interrupt
> and pci bridge code. By using features Linux gained during the
> many years since SGI IP27 code was integrated this even results
> in code reduction and IMHO cleaner code.
> 
> Tests have been done on a two module O200 (4 CPUs) and an
> Origin 2000 (8 CPUs).

Thanks for doing all this work!  It seems like it basically converts
some of the SGI PCI code to the structure typical of current host
controller drivers and moves it to drivers/pci/controller, which all
seems great to me.

The patches were kind of in limbo as far as Patchwork.  Lorenzo
handles the native host controller drivers, so I just delegated them
to him, so now they should be on his radar.

Bjorn

> My next step in integrating SGI IP30 support is splitting ioc3eth
> into a MFD and subdevice drivers, which will be submitted soon.
> 
> Changes in v3:
> 
> - dropped patches accepted by Paul
> - moved IP27 specific __phys_to_dma/__dma_to_phys into its own file
> - moved pcibios_to_node into IP27 specific file
> - moved PCI bus address resources setup out of pci-xtalk code into
>   IP27 specific code
> - dropped bit from hub_irq_data and use hwirq from irq_data
> - introduced intr_addr for setting up bridge interrupts (IP30 preparation)
> 
> Changes in v2:
> 
> - replaced HUB_L/HUB_S by __raw_readq/__raw_writeq
> - removed union bridge_ate
> - replaced remaing fields in slice_data by per_cpu data
> - use generic_handle_irq instead of do_IRQ
> - use hierarchy irq domain for stacking bridge and hub interrupt
> - moved __dma_to_phys/__phy_to_dma to mach-ip27/dma-direct.h
> - use dev_to_node() for pcibus_to_node() implementation
> 
> Thomas Bogendoerfer (3):
>   MIPS: SGI-IP27: move IP27 specific code out of pci-ip27.c into new
> file
>   MIPS: SGI-IP27: use generic PCI driver
>   MIPS: SGI-IP27: abstract chipset irq from bridge
> 
>  arch/mips/Kconfig  |   3 +
>  arch/mips/include/asm/mach-ip27/topology.h |  11 +-
>  arch/mips/include/asm/pci/bridge.h |  14 +-
>  arch/mips/include/asm/sn/irq_alloc.h   |  11 +
>  arch/mips/include/asm/xtalk/xtalk.h|   9 -
>  arch/mips/pci/Makefile |   1 -
>  arch/mips/pci/ops-bridge.c | 302 --
>  arch/mips/pci/pci-ip27.c   | 214 --
>  arch/mips/sgi-ip27/Makefile|   4 +-
>  arch/mips/sgi-ip27/ip27-init.c |   2 +
>  arch/mips/sgi-ip27/ip27-irq.c  | 191 -
>  arch/mips/sgi-ip27/ip27-pci.c  |  30 ++
>  arch/mips/sgi-ip27/ip27-xtalk.c|  61 ++-
>  drivers/pci/controller/Kconfig |   3 +
>  drivers/pci/controller/Makefile|   1 +
>  drivers/pci/controller/pci-xtalk-bridge.c  | 610 +
>  include/linux/platform_data/xtalk-bridge.h |  22 ++
>  17 files changed, 822 insertions(+), 667 deletions(-)
>  create mode 100644 arch/mips/include/asm/sn/irq_alloc.h
>  delete mode 100644 arch/mips/pci/ops-bridge.c
>  delete mode 100644 arch/mips/pci/pci-ip27.c
>  create mode 100644 arch/mips/sgi-ip27/ip27-pci.c
>  create mode 100644 drivers/pci/controller/pci-xtalk-bridge.c
>  create mode 100644 include/linux/platform_data/xtalk-bridge.h
> 
> -- 
> 2.13.7
> 


Re: [PATCH 5/7] mmc: meson-gx: avoid clock glitch when switching to DDR modes

2019-04-18 Thread Martin Blumenstingl
Hi Jerome,

On Thu, Apr 18, 2019 at 10:46 PM Jerome Brunet  wrote:
>
> On Thu, 2019-04-18 at 22:16 +0200, Martin Blumenstingl wrote:
> > Hi Jerome,
> >
> > On Wed, Apr 17, 2019 at 10:44 PM Jerome Brunet  wrote:
> > > Activating DDR in the Amlogic mmc controller, among other things, will
> > > divide the output clock by 2. So by activating it with clock on, we are
> > > creating a glitch on the output.
> > >
> > > Instead, let's deal with DDR when the clock output is off, when setting
> > > the clock.
> > >
> > > Signed-off-by: Jerome Brunet 
> > it seems that this patch breaks SD card on my Khadas VIM and Khadas VIM2.
>
> The error I see in your logs is with eMMC and hs200, not SD card.
sorry, I should have been more clear that there are two errors:
eMMC, this is what I have been seeing for a while on my Khadas VIM2
(it's probably not related to this patch):
  mmc1: mmc_select_hs200 failed, error -84
  mmc1: error -84 whilst initialising MMC card

however, then there's this other error:
  print_req_error: I/O error, dev mmcblk0, sector 0 flags 0
  Buffer I/O error on dev mmcblk0, logical block 0, async page read
as result of this the partition table cannot be read and my kernel
cannot find the rootfs.

> Either way, there is something I don't really get. eMMC should not go through
> any DDR mode to reach HS200 (which is an SDR mode), neither should SD to reach
> HS.
>
> All this does is flipping the DDR bit (when necessary) when the clock is off for
> the mmc device, avoiding a glitch on the clk line.
>
> This patch should not make any difference for an SDR-only setup. Maybe I missed
> something, but I don't see how it could make anything different for SDR only.
>
> I (repeatedly) tested both vim1 and vim2, without seeing this issue, so I can't
> debug this. I'll need more detail to progress; something does not make sense here.
please let me know from which part of the driver do you want debug logs


Regards
Martin


Re: [PATCH v3 14/28] userfaultfd: wp: handle COW properly for uffd-wp

2019-04-18 Thread Jerome Glisse
On Wed, Mar 20, 2019 at 10:06:28AM +0800, Peter Xu wrote:
> This allows uffd-wp to support write-protected pages for COW.
> 
> For example, the uffd write-protected PTE could also be write-protected
> by other usages like COW or zero pages.  When that happens, we can't
> simply set the write bit in the PTE since otherwise it'll change the
> content of every single reference to the page.  Instead, we should do
> the COW first if necessary, then handle the uffd-wp fault.
> 
> To correctly copy the page, we'll also need to carry over the
> _PAGE_UFFD_WP bit if it was set in the original PTE.
> 
> For huge PMDs, we just simply split the huge PMDs where we want to
> resolve an uffd-wp page fault always.  That matches what we do with
> general huge PMD write protections.  In that way, we resolved the huge
> PMD copy-on-write issue into PTE copy-on-write.
> 
> Signed-off-by: Peter Xu 

This one has a bug, see below.


> ---
>  mm/memory.c   |  5 +++-
>  mm/mprotect.c | 64 ---
>  2 files changed, 65 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index e7a4b9650225..b8a4c0bab461 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -2291,7 +2291,10 @@ vm_fault_t wp_page_copy(struct vm_fault *vmf)
>   }
>   flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
>   entry = mk_pte(new_page, vma->vm_page_prot);
> - entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> + if (pte_uffd_wp(vmf->orig_pte))
> + entry = pte_mkuffd_wp(entry);
> + else
> + entry = maybe_mkwrite(pte_mkdirty(entry), vma);
>   /*
>* Clear the pte entry and flush it first, before updating the
>* pte with the new entry. This will avoid a race condition
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 9d4433044c21..855dddb07ff2 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -73,18 +73,18 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>   flush_tlb_batched_pending(vma->vm_mm);
>   arch_enter_lazy_mmu_mode();
>   do {
> +retry_pte:
>   oldpte = *pte;
>   if (pte_present(oldpte)) {
>   pte_t ptent;
>   bool preserve_write = prot_numa && pte_write(oldpte);
> + struct page *page;
>  
>   /*
>* Avoid trapping faults against the zero or KSM
>* pages. See similar comment in change_huge_pmd.
>*/
>   if (prot_numa) {
> - struct page *page;
> -
>   page = vm_normal_page(vma, addr, oldpte);
>   if (!page || PageKsm(page))
>   continue;
> @@ -114,6 +114,54 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
>   continue;
>   }
>  
> + /*
> +  * Detect whether we'll need to COW before
> +  * resolving an uffd-wp fault.  Note that this
> +  * includes detection of the zero page (where
> +  * page==NULL)
> +  */
> + if (uffd_wp_resolve) {
> + /* If the fault is resolved already, skip */
> + if (!pte_uffd_wp(*pte))
> + continue;
> + page = vm_normal_page(vma, addr, oldpte);
> + if (!page || page_mapcount(page) > 1) {
> + struct vm_fault vmf = {
> + .vma = vma,
> + .address = addr & PAGE_MASK,
> + .page = page,
> + .orig_pte = oldpte,
> + .pmd = pmd,
> + /* pte and ptl not needed */
> + };
> + vm_fault_t ret;
> +
> + if (page)
> + get_page(page);
> + arch_leave_lazy_mmu_mode();
> + pte_unmap_unlock(pte, ptl);
> + ret = wp_page_copy();
> + /* PTE is changed, or OOM */
> + if (ret == 0)
> + /* It's done by others */
> + continue;

This is wrong: if ret == 0 you still need to remap the pte before

Re: [PATCH 5/7] mmc: meson-gx: avoid clock glitch when switching to DDR modes

2019-04-18 Thread Jerome Brunet
On Thu, 2019-04-18 at 22:16 +0200, Martin Blumenstingl wrote:
> Hi Jerome,
> 
> On Wed, Apr 17, 2019 at 10:44 PM Jerome Brunet  wrote:
> > Activating DDR in the Amlogic mmc controller, among other things, will
> > divide the output clock by 2. So by activating it with clock on, we are
> > creating a glitch on the output.
> > 
> > Instead, let's deal with DDR when the clock output is off, when setting
> > the clock.
> > 
> > Signed-off-by: Jerome Brunet 
> it seems that this patch breaks SD card on my Khadas VIM and Khadas VIM2.

The error I see in your logs is with eMMC and hs200, not SD card.

Either way, there is something I don't really get. eMMC should not go through
any DDR mode to reach HS200 (which is an SDR mode), neither should SD to reach
HS.

All this does is flipping the DDR bit (when necessary) when the clock is off for
the mmc device, avoiding a glitch on the clk line.

This patch should not make any difference for an SDR-only setup. Maybe I missed
something, but I don't see how it could make anything different for SDR only.

I (repeatedly) tested both vim1 and vim2, without seeing this issue, so I can't
debug this. I'll need more detail to progress; something does not make sense here.

> I used git bisect within this series to find that issue.
> applying your .dts patches on top doesn't fix it
> 
> two boot logs attached:
> * kvim-broken.txt has patches 1-5 (= including this patch) applied
> * kvim-working.txt has only patches 1-4 (= excluding this patch) applied
> 
> 
> Regards
> Martin




Re: [PATCH v2] fix compile errors due to unsync linux/in6.h and netinet/in.h

2019-04-18 Thread Yonghong Song


On 4/18/19 11:05 AM, Wang YanQing wrote:
> I hit the compile errors below:
> "
> In file included from test_tcpnotify_kern.c:12:
> /usr/include/netinet/in.h:101:5: error: expected identifier
>  IPPROTO_HOPOPTS = 0,   /* IPv6 Hop-by-Hop options.  */
>  ^
> /usr/include/linux/in6.h:131:26: note: expanded from macro 'IPPROTO_HOPOPTS'
>  ^
> In file included from test_tcpnotify_kern.c:12:
> /usr/include/netinet/in.h:103:5: error: expected identifier
>  IPPROTO_ROUTING = 43,  /* IPv6 routing header.  */
>  ^
> /usr/include/linux/in6.h:132:26: note: expanded from macro 'IPPROTO_ROUTING'
>  ^
> In file included from test_tcpnotify_kern.c:12:
> /usr/include/netinet/in.h:105:5: error: expected identifier
>  IPPROTO_FRAGMENT = 44, /* IPv6 fragmentation header.  */
>  ^
> /usr/include/linux/in6.h:133:26: note: expanded from macro 'IPPROTO_FRAGMENT'
> "
> The same compile errors are reported for test_tcpbpf_kern.c too.
> 
> My environment:
> lsb_release -a:
> No LSB modules are available.
> Distributor ID: Ubuntu
> Description:Ubuntu 16.04.6 LTS
> Release:16.04
> Codename:   xenial
> 
> dpkg -l | grep libc-dev:
> ii  libc-dev-bin  2.23-0ubuntu11   amd64GNU C Library: Development binaries
> ii  linux-libc-dev:amd64  4.4.0-145.171amd64Linux Kernel Headers for development.
> 
> The reason is that linux/in6.h and netinet/in.h aren't in sync about how to
> handle the same definitions, IPPROTO_HOPOPTS, etc.
> 
> This patch fixes the compile errors by moving  to before the
> .
> 
> Signed-off-by: Wang YanQing 

Tested, and the change makes sense. By changing the include order, we now
have the kernel header check for the existence of the macro _NETINET_IN_H,
instead of certain versions of glibc checking for _UAPI_LINUX_IN6_H, which
does not exist in all recent kernels (at least 4.4 and onwards).

Acked-by: Yonghong Song 


> ---
>   Changes
>   v1-v2:
>   1:Rewrite the changelog.
> 
>   tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c| 2 +-
>   tools/testing/selftests/bpf/progs/test_tcpnotify_kern.c | 2 +-
>   2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
> index 74f73b3..c7c3240 100644
> --- a/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
> +++ b/tools/testing/selftests/bpf/progs/test_tcpbpf_kern.c
> @@ -1,6 +1,7 @@
>   // SPDX-License-Identifier: GPL-2.0
>   #include 
>   #include 
> +#include 
>   #include 
>   #include 
>   #include 
> @@ -9,7 +10,6 @@
>   #include 
>   #include 
>   #include 
> -#include 
>   #include "bpf_helpers.h"
>   #include "bpf_endian.h"
>   #include "test_tcpbpf.h"
> diff --git a/tools/testing/selftests/bpf/progs/test_tcpnotify_kern.c b/tools/testing/selftests/bpf/progs/test_tcpnotify_kern.c
> index edbca20..ec6db6e 100644
> --- a/tools/testing/selftests/bpf/progs/test_tcpnotify_kern.c
> +++ b/tools/testing/selftests/bpf/progs/test_tcpnotify_kern.c
> @@ -1,6 +1,7 @@
>   // SPDX-License-Identifier: GPL-2.0
>   #include 
>   #include 
> +#include 
>   #include 
>   #include 
>   #include 
> @@ -9,7 +10,6 @@
>   #include 
>   #include 
>   #include 
> -#include 
>   #include "bpf_helpers.h"
>   #include "bpf_endian.h"
>   #include "test_tcpnotify.h"
> 


Re: [PATCH v4 1/9] clkdev: Hold clocks_mutex while iterating clocks list

2019-04-18 Thread Stephen Boyd
Quoting Stephen Boyd (2019-04-12 11:31:42)
> We recently introduced a change to support devm clk lookups. That change
> introduced a code-path that used clk_find() without holding the
> 'clocks_mutex'. Unfortunately, clk_find() iterates over the 'clocks'
> list and so we need to prevent the list from being modified at the same
> time. Do this by holding the mutex and checking to make sure it's held
> while iterating the list.
> 
> Note, we don't really care if the lookup is freed after we find it with
> clk_find() because we're just doing a pointer comparison, but if we did
> care we would need to keep holding the mutex while we dereference the
> clk_lookup pointer.
> 
> Fixes: 3eee6c7d119c ("clkdev: add managed clkdev lookup registration")
> Cc: Miquel Raynal 
> Cc: Jerome Brunet 
> Cc: Russell King 
> Cc: Michael Turquette 
> Cc: Jeffrey Hugo 
> Cc: Chen-Yu Tsai 
> Cc: Matti Vaittinen 
> Signed-off-by: Stephen Boyd 
> ---

Applied to clk-fixes



Re: [PATCH] selftests/bpf: fix compile errors with older glibc

2019-04-18 Thread Yonghong Song


On 4/18/19 11:00 AM, Wang YanQing wrote:
> On Thu, Apr 18, 2019 at 05:09:53AM +, Yonghong Song wrote:
>>
>>
>> On 4/17/19 10:48 AM, Wang YanQing wrote:
>>> The older glibc (for example, 2.23) doesn't handle __UAPI_DEF_*
>>> in libc-compat.h properly, and it brings the compile errors below:
>>
>> I have an even old glibc 2.17 and it still works. Not sure
>> why it failed here. Could you explain more?
> 
> These errors occur with certain combinations of kernel header versions
> and glibc header versions.
> 
> After some research into the git history of glibc and the kernel, I found
> that the reason behind the scenes is a little complex:
> glibc's netinet/in.h and the kernel's linux/in6.h share some definitions,
> IPPROTO_HOPOPTS, etc.
> These shared definitions don't cause trouble when we include both headers,
> provided the kernel and glibc coordinate with each other well, but in
> reality the coordination has historically been poor and out of sync.
> 
> The kernel and glibc use guard macros to decide whether they need to
> export their definitions: linux/in6.h includes libc-compat.h, which checks
> the guard macro _NETINET_IN_H, and netinet/in.h includes bits/in.h, which
> checks the guard macro _UAPI_LINUX_IN6_H
> (glibc-2.19~821, 6c82a2f8d7c8e21e39237225c819f182ae438db3, "Coordinate
> IPv6 definitions for Linux and glibc").
> 
> The problem is that during the installation of the kernel headers, the
> "_UAPI" prefix of _UAPI_LINUX_IN6_H in linux/in6.h is stripped, due to
> commit 56c176c9cac9 ("UAPI: strip the _UAPI prefix from header guards
> during header installation").
> 
> The good news is that glibc fixed this by also checking the guard macro
> _LINUX_IN6_H (glibc c9bd40daaee18cf1d9824e4a7ebaebe321e0a5a8, "Bug 20214:
> Fix linux/in6.h and netinet/in.h sync.").

Thanks for the explanation; my system contains this glibc fix
(_LINUX_IN6_H is checked), which is why I did not see the issue.

> 
> My environment still have this trouble:
> lsb_release -a:
> No LSB modules are available.
> Distributor ID:   Ubuntu
> Description:  Ubuntu 16.04.6 LTS
> Release:  16.04
> Codename: xenial
> 
> dpkg -l | grep libc-dev:
> ii  libc-dev-bin  2.23-0ubuntu11   amd64GNU C 
> Library: Development binaries
> ii  linux-libc-dev:amd64  4.4.0-145.171amd64Linux 
> Kernel Headers for development.
> 
> 
> I will send out the v2 which I will change some words in changelog.
> 
> Thanks for review.
> 


Re: [PATCH] clk: ingenic/jz4725b: Fix parent of pixel clock

2019-04-18 Thread Stephen Boyd
Quoting Paul Cercueil (2019-04-17 16:53:53)
> Hi Stephen,
> 
> On Thu, Apr 18, 2019 at 1:48, Stephen Boyd  wrote:
> > Quoting Paul Cercueil (2019-04-17 04:24:20)
> >>  The pixel clock is directly connected to the output of the PLL, and
> >>  not to the /2 divider.
> >> 
> >>  Cc: sta...@vger.kernel.org
> >>  Fixes: 226dfa4726eb ("clk: Add Ingenic jz4725b CGU driver")
> >>  Signed-off-by: Paul Cercueil 
> >>  ---
> > 
> > Is this breaking something in 5.1-rc series? Or just found by
> > inspection? I'm trying to understand the priority of this patch.
> 
> I verified it with the hardware. It fixes a bug that has been present
> since the introduction of this driver.
> 
> However, until now nothing has used this particular clock, so it can go
> to 5.2.
> 

Great. Thanks for the background!



Re: [PATCH 5/6] arm64: dts: meson: vim2: add missing clk-gate pinctrl

2019-04-18 Thread Martin Blumenstingl
On Thu, Apr 18, 2019 at 2:28 PM Jerome Brunet  wrote:
>
> For some reason the vim2 is missing the clk-gate pinctrl setting that all
> the other boards have. Just add this missing bit.
>
> Signed-off-by: Jerome Brunet 
Reviewed-by: Martin Blumenstingl
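For context, the "missing bit" on other Amlogic boards is a second pinctrl
state on the MMC controller nodes so the driver can park the clock line
when gating it. A sketch of what such a setting looks like — the node
(sd_emmc_a) and pin-group names here are assumed from the style of the
other boards, not copied from this patch:

```dts
&sd_emmc_a {
	/* "default" routes the pins; "clk-gate" switches the clock pin
	 * to a gated state. Pin-group names are illustrative. */
	pinctrl-0 = <&sdio_pins>;
	pinctrl-1 = <&sdio_clk_gate_pins>;
	pinctrl-names = "default", "clk-gate";
};
```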

