linux-next: Tree for Sep 6
Hi all,

Please do not add any v4.15 related material to your linux-next included branches until after v4.14-rc1 has been released.

Changes since 20170905:

The devicetree tree gained conflicts against Linus' and the sound trees.

The akpm-current tree lost its build failure.

Non-merge commits (relative to Linus' tree): 7233
 7400 files changed, 308697 insertions(+), 105511 deletions(-)

I have created today's linux-next tree at
git://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
(patches at http://www.kernel.org/pub/linux/kernel/next/ ).

If you are tracking the linux-next tree using git, you should not use "git pull" to do so as that will try to merge the new linux-next release with the old one. You should use "git fetch" and checkout or reset to the new master.

You can see which trees have been included by looking in the Next/Trees file in the source. There are also quilt-import.log and merge.log files in the Next directory.

Between each merge, the tree was built with a ppc64_defconfig for powerpc and an allmodconfig (with CONFIG_BUILD_DOCSRC=n) for x86_64, a multi_v7_defconfig for arm and a native build of tools/perf. After the final fixups (if any), I do an x86_64 modules_install followed by builds for x86_64 allnoconfig, powerpc allnoconfig (32 and 64 bit), ppc44x_defconfig, allyesconfig and pseries_le_defconfig and i386, sparc and sparc64 defconfig. And finally, a simple boot test of the powerpc pseries_le_defconfig kernel in qemu.

Below is a summary of the state of the merge. I am currently merging 268 trees (counting Linus' and 41 trees of bug fix patches pending for the current merge release).

Stats about the size of the tree over time can be seen at http://neuling.org/linux-next-size.html .

Status of my local build tests will be at http://kisskb.ellerman.id.au/linux-next . If maintainers want to give advice about cross compilers/configs that work, we are always open to add more builds.

Thanks to Randy Dunlap for doing many randconfig builds.
And to Paul Gortmaker for triage and bug fixes.

-- 
Cheers,
Stephen Rothwell

$ git checkout master
$ git reset --hard stable
Merging origin/master (e7d0c41ecc2e Merge tag 'devprop-4.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm)
Merging fixes/master (b4b8cbf679c4 Cavium CNN55XX: fix broken default Kconfig entry)
Merging kbuild-current/fixes (64236e315955 kbuild: update comments of Makefile.asm-generic)
Merging arc-current/for-curr (1ee55a8f7f6b ARC: Re-enable MMU upon Machine Check exception)
Merging arm-current/fixes (746a272e4414 ARM: 8692/1: mm: abort uaccess retries upon fatal signal)
Merging m68k-current/for-linus (558d5ad276c9 m68k/mac: Avoid soft-lockup warning after mach_power_off)
Merging metag-fixes/fixes (b884a190afce metag/usercopy: Add missing fixups)
Merging powerpc-fixes/fixes (1a92a80ad386 powerpc/mm: Ensure cpumask update is ordered)
Merging sparc/master (6470812e2226 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc)
Merging fscrypt-current/for-stable (42d97eb0ade3 fscrypt: fix renaming and linking special files)
Merging net/master (6d9c153a0b84 net: dsa: loop: Do not unregister invalid fixed PHY)
Merging ipsec/master (f581a0dd744f wl1251: add a missing spin_lock_init())
Merging netfilter/master (6d9c153a0b84 net: dsa: loop: Do not unregister invalid fixed PHY)
Merging ipvs/master (f7fb77fc1235 netfilter: nft_compat: check extension hook mask only if set)
Merging wireless-drivers/master (10a54d8196d1 iwlwifi: pcie: move rx workqueue initialization to iwl_trans_pcie_alloc())
Merging mac80211/master (04306f4dcf90 mac80211: Complete ampdu work schedule during session tear down)
Merging sound-current/for-linus (ee5f38a4459a Merge tag 'asoc-v4.14-cs43130' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound into for-linus)
Merging pci-current/for-linus (8e1101d25164 PCI/MSI: Don't warn when irq_create_affinity_masks() returns NULL)
Merging driver-core.current/driver-core-linus (ef954844c7ac Linux 4.13-rc5)
Merging tty.current/tty-linus (ef954844c7ac Linux 4.13-rc5)
Merging usb.current/usb-linus (ef954844c7ac Linux 4.13-rc5)
Merging usb-gadget-fixes/fixes (b7d44c36a6f6 usb: renesas_usbhs: gadget: fix unused-but-set-variable warning)
Merging usb-serial-fixes/usb-linus (fd1b8668af59 USB: serial: option: add D-Link DWM-222 device ID)
Merging usb-chipidea-fixes/ci-for-usb-stable (cbb22ebcfb99 usb: chipidea: core: check before accessing ci_role in ci_role_show)
Merging phy/fixes (5771a8c08880 Linux v4.13-rc1)
Merging staging.current/staging-linus (cc4a41fe5541 Linux 4.13-rc7)
Merging char-misc.current/char-misc-linus (cc4a41fe5541 Linux 4.13-rc7)
Merging input-current/for-linus (a6cbfa1e6d38 Merge branch 'next' into for-linus)
Merging crypto-current/master (2d45a7e89833 crypto: af_alg - get_page upon reassignment to TX SGL)
Merging ide/master (b671e170339
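The advice above about tracking linux-next with "git fetch" plus a reset, rather than "git pull", can be sketched against a throwaway local repository. Everything below (the `origin.git` path, the commit subjects, the temp directory) is invented for the demonstration; for the real tree substitute the linux-next URL.

```shell
# Track a rewinding branch with fetch + reset instead of pull.
# "origin.git" stands in for the linux-next remote (hypothetical path).
set -e
tmp=$(mktemp -d)
cd "$tmp"

git init -q origin.git
git -C origin.git symbolic-ref HEAD refs/heads/master
git -C origin.git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "next-20170905"

git clone -q origin.git tracking

# A new linux-next release appears upstream...
git -C origin.git -c user.name=demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "next-20170906"

# ...so fetch and hard-reset to the new master; a "git pull" here would
# instead try to merge the new release with the old one.
git -C tracking fetch -q origin
git -C tracking reset -q --hard origin/master

git -C tracking log -1 --format=%s
```

Since each linux-next release is cut fresh rather than built on the previous one, the reset is what keeps the local checkout identical to the published tree.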
[PATCH] drivers/iommu/tegra: Optimize mutex_unlock function call
The code can be simplified by giving the function a single exit point and calling mutex_unlock() once at the end.

Signed-off-by: Pushkar Jambhlekar
---
 drivers/iommu/tegra-smmu.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/tegra-smmu.c b/drivers/iommu/tegra-smmu.c
index 3b6449e..da38833 100644
--- a/drivers/iommu/tegra-smmu.c
+++ b/drivers/iommu/tegra-smmu.c
@@ -232,20 +232,22 @@ static inline void smmu_flush(struct tegra_smmu *smmu)
 static int tegra_smmu_alloc_asid(struct tegra_smmu *smmu, unsigned int *idp)
 {
 	unsigned long id;
+	int ret = 0;
 
 	mutex_lock(&smmu->lock);
 
 	id = find_first_zero_bit(smmu->asids, smmu->soc->num_asids);
 	if (id >= smmu->soc->num_asids) {
-		mutex_unlock(&smmu->lock);
-		return -ENOSPC;
+		ret = -ENOSPC;
+		goto Exit;
 	}
 
 	set_bit(id, smmu->asids);
 	*idp = id;
 
+Exit:
 	mutex_unlock(&smmu->lock);
-	return 0;
+	return ret;
 }
 
 static void tegra_smmu_free_asid(struct tegra_smmu *smmu, unsigned int id)
-- 
2.7.4
Re: [PATCH v2] md/raid5: preserve STRIPE_ON_UNPLUG_LIST in break_stripe_batch_list
On Wed, Sep 06, 2017 at 11:02:35AM +0800, Dennis Yang wrote:
> In release_stripe_plug(), if a stripe_head has its STRIPE_ON_UNPLUG_LIST
> set, it indicates that this stripe_head is already in the raid5_plug_cb
> list and release_stripe() would be called instead to drop a reference
> count. Otherwise, the STRIPE_ON_UNPLUG_LIST bit would be set for this
> stripe_head and it will get queued into the raid5_plug_cb list.
>
> Since break_stripe_batch_list() did not preserve STRIPE_ON_UNPLUG_LIST,
> a stripe could be re-added to the plug list while it is still on that
> list in the following situation. If stripe_head A is added to another
> stripe_head B's batch list, A will have its batch_head != NULL and be
> added to the plug list. After that, stripe_head B gets handled and calls
> break_stripe_batch_list() to reset the state of all the batched
> stripe_heads (including A, which is still on the plug list) and reset
> their batch_head to NULL. Before the plug list gets processed, if
> another write request comes in and gets stripe_head A, A will have its
> batch_head == NULL (cleared by calling break_stripe_batch_list() on B)
> and be added to the plug list once again.

applied, thanks!

> Signed-off-by: Dennis Yang
> ---
>  drivers/md/raid5.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c
> index e92dd2d..faf3cfd 100644
> --- a/drivers/md/raid5.c
> +++ b/drivers/md/raid5.c
> @@ -4611,7 +4611,8 @@ static void break_stripe_batch_list(struct stripe_head *head_sh,
>
> 		set_mask_bits(&sh->state, ~(STRIPE_EXPAND_SYNC_FLAGS |
> 					    (1 << STRIPE_PREREAD_ACTIVE) |
> -					    (1 << STRIPE_DEGRADED)),
> +					    (1 << STRIPE_DEGRADED) |
> +					    (1 << STRIPE_ON_UNPLUG_LIST)),
> 			      head_sh->state & (1 << STRIPE_INSYNC));
>
> 		sh->check_state = head_sh->check_state;
> -- 
> 1.9.1
>
[PATCH 1/2] dt-bindings: watchdog: Add Spreadtrum watchdog documentation
This patch adds the devicetree binding documentation for the Spreadtrum watchdog driver.

Signed-off-by: Eric Long
---
 .../devicetree/bindings/watchdog/sprd-wdt.txt | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/watchdog/sprd-wdt.txt

diff --git a/Documentation/devicetree/bindings/watchdog/sprd-wdt.txt b/Documentation/devicetree/bindings/watchdog/sprd-wdt.txt
new file mode 100644
index 0000000..aeaf3e0
--- /dev/null
+++ b/Documentation/devicetree/bindings/watchdog/sprd-wdt.txt
@@ -0,0 +1,19 @@
+Spreadtrum SoCs Watchdog timer
+
+Required properties:
+- compatible : Should be "sprd,sp9860-wdt".
+- reg : Specifies base physical address and size of the registers.
+- interrupts : Exactly one interrupt specifier.
+- timeout-sec : Contain the default watchdog timeout in seconds.
+- clock-names : Contain the input clock names.
+- clocks : Phandles to input clocks.
+
+Example:
+	watchdog: watchdog@4031 {
+		compatible = "sprd,sp9860-wdt";
+		reg = <0 0x4031 0 0x1000>;
+		interrupts = ;
+		timeout-sec = <12>;
+		clock-names = "enable", "rtc_enable";
+		clocks = <_aon_apb_gates1 8>, <_aon_apb_rtc_gates 9>;
+	};
-- 
1.9.1
[PATCH 2/2] watchdog: Add Spreadtrum watchdog driver
This patch adds the watchdog driver for the Spreadtrum SC9860 platform.

Signed-off-by: Eric Long
---
 drivers/watchdog/Kconfig    |   8 +
 drivers/watchdog/Makefile   |   1 +
 drivers/watchdog/sprd_wdt.c | 366 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 375 insertions(+)
 create mode 100644 drivers/watchdog/sprd_wdt.c

diff --git a/drivers/watchdog/Kconfig b/drivers/watchdog/Kconfig
index c722cbf..ea07718 100644
--- a/drivers/watchdog/Kconfig
+++ b/drivers/watchdog/Kconfig
@@ -787,6 +787,14 @@ config UNIPHIER_WATCHDOG
 	  To compile this driver as a module, choose M here: the
 	  module will be called uniphier_wdt.
 
+config SPRD_WATCHDOG
+	tristate "Spreadtrum watchdog support"
+	depends on ARCH_SPRD
+	select WATCHDOG_CORE
+	help
+	  Say Y here to include support for the watchdog timer
+	  embedded in the Spreadtrum system.
+
 # AVR32 Architecture
 
 config AT32AP700X_WDT

diff --git a/drivers/watchdog/Makefile b/drivers/watchdog/Makefile
index 56adf9f..187cca2 100644
--- a/drivers/watchdog/Makefile
+++ b/drivers/watchdog/Makefile
@@ -87,6 +87,7 @@ obj-$(CONFIG_ASPEED_WATCHDOG) += aspeed_wdt.o
 obj-$(CONFIG_ZX2967_WATCHDOG) += zx2967_wdt.o
 obj-$(CONFIG_STM32_WATCHDOG) += stm32_iwdg.o
 obj-$(CONFIG_UNIPHIER_WATCHDOG) += uniphier_wdt.o
+obj-$(CONFIG_SPRD_WATCHDOG) += sprd_wdt.o
 
 # AVR32 Architecture
 obj-$(CONFIG_AT32AP700X_WDT) += at32ap700x_wdt.o

diff --git a/drivers/watchdog/sprd_wdt.c b/drivers/watchdog/sprd_wdt.c
new file mode 100644
index 0000000..6006bb4
--- /dev/null
+++ b/drivers/watchdog/sprd_wdt.c
@@ -0,0 +1,366 @@
+/*
+ * Spreadtrum watchdog driver
+ * Copyright (C) 2017 Spreadtrum - http://www.spreadtrum.com
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+#define WDT_LOAD_LOW		0x0
+#define WDT_LOAD_HIGH		0x4
+#define WDT_CTRL		0x8
+#define WDT_INT_CLR		0xc
+#define WDT_INT_RAW		0x10
+#define WDT_INT_MSK		0x14
+#define WDT_CNT_LOW		0x18
+#define WDT_CNT_HIGH		0x1c
+#define WDT_LOCK		0x20
+#define WDT_IRQ_LOAD_LOW	0x2c
+#define WDT_IRQ_LOAD_HIGH	0x30
+
+/* WDT_CTRL */
+#define WDT_INT_EN_BIT		BIT(0)
+#define WDT_CNT_EN_BIT		BIT(1)
+#define WDT_NEW_VER_EN		BIT(2)
+#define WDT_RST_EN_BIT		BIT(3)
+
+/* WDT_INT_CLR */
+#define WDT_INT_CLEAR_BIT	BIT(0)
+#define WDT_RST_CLEAR_BIT	BIT(3)
+
+/* WDT_INT_RAW */
+#define WDT_INT_RAW_BIT		BIT(0)
+#define WDT_RST_RAW_BIT		BIT(3)
+#define WDT_LD_BUSY_BIT		BIT(4)
+
+#define WDT_CLK			32768
+#define WDT_UNLOCK_KEY		0xe551
+#define WDT_IRQ_TMROUT_OFFSET	0x3
+
+#define WDT_CNT_VALUE_SIZE	16
+#define WDT_CNT_VALUE_MASK	GENMASK(15, 0)
+#define WDT_LOAD_TIMEOUT_NUM	1
+
+struct sprd_wdt {
+	void __iomem *base;
+	struct watchdog_device wdd;
+	struct clk *enable;
+	struct clk *rtc_enable;
+	u32 irq_tmr_out;
+	u32 rst_tmr_out;
+	unsigned int irq;
+};
+
+static inline struct sprd_wdt *to_sprd_wdt(struct watchdog_device *wdd)
+{
+	return container_of(wdd, struct sprd_wdt, wdd);
+}
+
+static inline void sprd_wdt_lock(void __iomem *addr)
+{
+	writel_relaxed(0x0, addr + WDT_LOCK);
+}
+
+static inline void sprd_wdt_unlock(void __iomem *addr)
+{
+	writel_relaxed(WDT_UNLOCK_KEY, addr + WDT_LOCK);
+}
+
+static irqreturn_t sprd_wdt_isr(int irq, void *dev_id)
+{
+	struct sprd_wdt *wdt = (struct sprd_wdt *)dev_id;
+
+	sprd_wdt_unlock(wdt->base);
+	writel_relaxed(WDT_INT_CLEAR_BIT, wdt->base + WDT_INT_CLR);
+	sprd_wdt_lock(wdt->base);
+	watchdog_notify_pretimeout(&wdt->wdd);
+	return IRQ_HANDLED;
+}
+
+static u32 sprd_wdt_get_cnt_value(struct sprd_wdt *wdt)
+{
+	u32 val;
+
+	val = readl_relaxed(wdt->base + WDT_CNT_HIGH) << WDT_CNT_VALUE_SIZE;
+	val |= readl_relaxed(wdt->base + WDT_CNT_LOW) & WDT_CNT_VALUE_MASK;
+
+	return val;
+}
+
+static int sprd_wdt_load_value(struct sprd_wdt *wdt, u32 rst_value,
+			       u32 irq_value)
+{
+	u32 val, cnt = 0;
+
+	sprd_wdt_unlock(wdt->base);
+	writel_relaxed((rst_value >> WDT_CNT_VALUE_SIZE) & WDT_CNT_VALUE_MASK,
+		       wdt->base + WDT_LOAD_HIGH);
+	writel_relaxed((rst_value & WDT_CNT_VALUE_MASK),
+		       wdt->base + WDT_LOAD_LOW);
+
[PATCH] drivers/staging/pi433: Fixing coding guidelines
Fix the brace style of if/else blocks.

Signed-off-by: Pushkar Jambhlekar
---
 drivers/staging/pi433/pi433_if.c | 162 ++++++++++-----------------------------
 1 file changed, 54 insertions(+), 108 deletions(-)

diff --git a/drivers/staging/pi433/pi433_if.c b/drivers/staging/pi433/pi433_if.c
index 93c0168..c070cf3 100644
--- a/drivers/staging/pi433/pi433_if.c
+++ b/drivers/staging/pi433/pi433_if.c
@@ -133,19 +133,16 @@ static irqreturn_t DIO0_irq_handler(int irq, void *dev_id)
 {
 	struct pi433_device *device = dev_id;
 
-	if (device->irq_state[DIO0] == DIO_PacketSent)
-	{
+	if (device->irq_state[DIO0] == DIO_PacketSent) {
 		device->free_in_fifo = FIFO_SIZE;
 		printk("DIO0 irq: Packet sent\n"); // TODO: printk() should include KERN_ facility level
 		wake_up_interruptible(&device->fifo_wait_queue);
 	}
-	else if (device->irq_state[DIO0] == DIO_Rssi_DIO0)
-	{
+	else if (device->irq_state[DIO0] == DIO_Rssi_DIO0) {
 		printk("DIO0 irq: RSSI level over threshold\n");
 		wake_up_interruptible(&device->rx_wait_queue);
 	}
-	else if (device->irq_state[DIO0] == DIO_PayloadReady)
-	{
+	else if (device->irq_state[DIO0] == DIO_PayloadReady) {
 		printk("DIO0 irq: PayloadReady\n");
 		device->free_in_fifo = 0;
 		wake_up_interruptible(&device->fifo_wait_queue);
@@ -158,12 +155,10 @@ static irqreturn_t DIO1_irq_handler(int irq, void *dev_id)
 {
 	struct pi433_device *device = dev_id;
 
-	if (device->irq_state[DIO1] == DIO_FifoNotEmpty_DIO1)
-	{
+	if (device->irq_state[DIO1] == DIO_FifoNotEmpty_DIO1) {
 		device->free_in_fifo = FIFO_SIZE;
 	}
-	else if (device->irq_state[DIO1] == DIO_FifoLevel)
-	{
+	else if (device->irq_state[DIO1] == DIO_FifoLevel) {
 		if (device->rx_active)	device->free_in_fifo = FIFO_THRESHOLD - 1;
 		else			device->free_in_fifo = FIFO_SIZE - FIFO_THRESHOLD - 1;
 	}
@@ -197,12 +192,10 @@ rf69_set_rx_cfg(struct pi433_device *dev, struct pi433_rx_cfg *rx_cfg)
 	/* packet config */
 	/* enable */
 	SET_CHECKED(rf69_set_sync_enable(dev->spi, rx_cfg->enable_sync));
-	if (rx_cfg->enable_sync == optionOn)
-	{
+	if (rx_cfg->enable_sync == optionOn) {
 		SET_CHECKED(rf69_set_fifo_fill_condition(dev->spi, afterSyncInterrupt));
 	}
-	else
-	{
+	else {
 		SET_CHECKED(rf69_set_fifo_fill_condition(dev->spi, always));
 	}
 	if (rx_cfg->enable_length_byte == optionOn) {
@@ -219,29 +212,24 @@ rf69_set_rx_cfg(struct pi433_device *dev, struct pi433_rx_cfg *rx_cfg)
 	/* lengths */
 	SET_CHECKED(rf69_set_sync_size(dev->spi, rx_cfg->sync_length));
-	if (rx_cfg->enable_length_byte == optionOn)
-	{
+	if (rx_cfg->enable_length_byte == optionOn) {
 		SET_CHECKED(rf69_set_payload_length(dev->spi, 0xff));
 	}
-	else if (rx_cfg->fixed_message_length != 0)
-	{
+	else if (rx_cfg->fixed_message_length != 0) {
 		payload_length = rx_cfg->fixed_message_length;
 		if (rx_cfg->enable_length_byte == optionOn) payload_length++;
 		if (rx_cfg->enable_address_filtering != filteringOff) payload_length++;
 		SET_CHECKED(rf69_set_payload_length(dev->spi, payload_length));
 	}
-	else
-	{
+	else {
 		SET_CHECKED(rf69_set_payload_length(dev->spi, 0));
 	}
 
 	/* values */
-	if (rx_cfg->enable_sync == optionOn)
-	{
+	if (rx_cfg->enable_sync == optionOn) {
 		SET_CHECKED(rf69_set_sync_values(dev->spi, rx_cfg->sync_pattern));
 	}
-	if (rx_cfg->enable_address_filtering != filteringOff)
-	{
+	if (rx_cfg->enable_address_filtering != filteringOff) {
 		SET_CHECKED(rf69_set_node_address (dev->spi, rx_cfg->node_address));
 		SET_CHECKED(rf69_set_broadcast_address(dev->spi, rx_cfg->broadcast_address));
 	}
@@ -263,12 +251,10 @@ rf69_set_tx_cfg(struct pi433_device *dev, struct pi433_tx_cfg *tx_cfg)
 	SET_CHECKED(rf69_set_tx_start_condition(dev->spi, tx_cfg->tx_start_condition));
 
 	/* packet format enable */
-	if (tx_cfg->enable_preamble == optionOn)
-	{
+	if (tx_cfg->enable_preamble == optionOn) {
 		SET_CHECKED(rf69_set_preamble_length(dev->spi, tx_cfg->preamble_length));
 	}
-	else
-	{
+	else {
 		SET_CHECKED(rf69_set_preamble_length(dev->spi, 0));
 	}
 	SET_CHECKED(rf69_set_sync_enable  (dev->spi, tx_cfg->enable_sync));
@@ -284,8 +270,7 @@ rf69_set_tx_cfg(struct pi433_device *dev, struct pi433_tx_cfg *tx_cfg)
 	SET_CHECKED(rf69_set_crc_enable   (dev->spi, tx_cfg->enable_crc));
 
 	/*
Re: [PATCH v8 06/13] x86/apic: Mark the apic_intr_mode extern for sanity check cleanup
Hi Baoquan, At 09/06/2017 01:26 PM, Baoquan He wrote: [...] static void __init smp_cpu_index_default(void) @@ -1335,19 +1295,20 @@ void __init native_smp_prepare_cpus(unsigned int max_cpus) Please also clean up the passed-in max_cpus since it's not used here any more. And up to the caller: Oops, I just checked the x86 code; other arches also have this hook. Please ignore this comment. Yes, you are right. I also found the comments for native_smp_prepare_cpus() are out of date. I will update the comment and explain it like: diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c index 4f63afc..9f8479c 100644 --- a/arch/x86/kernel/smpboot.c +++ b/arch/x86/kernel/smpboot.c @@ -1260,8 +1260,10 @@ static void __init smp_get_logical_apicid(void) } /* - * Prepare for SMP bootup. The MP table or ACPI has been read - * earlier. Just do some sanity checking here and enable APIC mode. + * Prepare for SMP bootup. + * + * @max_cpus: configured maximum number of CPUs + * It is not used here, but other architectures also implement this hook, so keep it. */ void __init native_smp_prepare_cpus(unsigned int max_cpus) { Thanks, dou. static noinline void __init kernel_init_freeable(void) { ... smp_prepare_cpus(setup_max_cpus); ... } apic_intr_mode_init(); - switch (smp_sanity_check(max_cpus)) { - case SMP_NO_CONFIG: - disable_smp(); - return; - case SMP_NO_APIC: + smp_sanity_check(); + + switch (apic_intr_mode) { + case APIC_PIC: + case APIC_VIRTUAL_WIRE_NO_CONFIG: disable_smp(); return; - case SMP_FORCE_UP: + case APIC_SYMMETRIC_IO_NO_ROUTING: disable_smp(); /* Setup local timer */ x86_init.timers.setup_percpu_clockev(); return; - case SMP_OK: + case APIC_VIRTUAL_WIRE: + case APIC_SYMMETRIC_IO: break; } -- 2.5.5
Re: [PATCH v2 0/8] MT2712 IOMMU SUPPORT
On Tue, 2017-08-22 at 22:38 +0800, Joerg Roedel wrote: > On Mon, Aug 21, 2017 at 07:00:13PM +0800, Yong Wu wrote: > > Yong Wu (8): > > dt-bindings: mediatek: Add binding for mt2712 IOMMU and SMI > > iommu/mediatek: Move MTK_M4U_TO_LARB/PORT into mtk_iommu.c > > iommu/mediatek: Add mt2712 IOMMU support > > iommu/mediatek: Merge 2 M4U HWs into one iommu domain > > iommu/mediatek: Move pgtable allocation into domain_alloc > > iommu/mediatek: Disable iommu clock when system suspend > > iommu/mediatek: Enlarge the validate PA range for 4GB mode > > memory: mtk-smi: Degrade SMI init to module_init > > Applied patches 2-7. Patch 1 is something for Matthias. Hi Matthias, Could you help review patch [1/8]? If it's OK for you too, could you help apply that one via your tree? Thanks.
Re: [PATCH] arm64: KVM: VHE: save and restore some PSTATE bits
CC Catalin On 2017/9/6 2:58, gengdongjiu wrote: > when exit from guest, some host PSTATE bits may be lost, such as > PSTATE.PAN or PSTATE.UAO. It is because host and hypervisor all run > in the EL2, host PSTATE value cannot be saved and restored via > SPSR_EL2. So if guest has changed the PSTATE, host continues with > a wrong value guest has set. > > Signed-off-by: Dongjiu Geng > Signed-off-by: Haibin Zhang > --- > arch/arm64/include/asm/kvm_host.h | 8 +++ > arch/arm64/include/asm/kvm_hyp.h | 2 ++ > arch/arm64/include/asm/sysreg.h | 23 +++ > arch/arm64/kvm/hyp/entry.S | 2 -- > arch/arm64/kvm/hyp/switch.c | 24 ++-- > arch/arm64/kvm/hyp/sysreg-sr.c | 48 > --- > 6 files changed, 100 insertions(+), 7 deletions(-) > > diff --git a/arch/arm64/include/asm/kvm_host.h > b/arch/arm64/include/asm/kvm_host.h > index e923b58..cba7d3e 100644 > --- a/arch/arm64/include/asm/kvm_host.h > +++ b/arch/arm64/include/asm/kvm_host.h > @@ -193,6 +193,12 @@ struct kvm_cpu_context { > }; > }; > > +struct kvm_cpu_host_pstate { > + u64 daif; > + u64 uao; > + u64 pan; > +}; > + > typedef struct kvm_cpu_context kvm_cpu_context_t; > > struct kvm_vcpu_arch { > @@ -227,6 +233,8 @@ struct kvm_vcpu_arch { > > /* Pointer to host CPU context */ > kvm_cpu_context_t *host_cpu_context; > + /* Host PSTATE value */ > + struct kvm_cpu_host_pstate host_pstate; > struct { > /* {Break,watch}point registers */ > struct kvm_guest_debug_arch regs; > diff --git a/arch/arm64/include/asm/kvm_hyp.h > b/arch/arm64/include/asm/kvm_hyp.h > index 4572a9b..a75587a 100644 > --- a/arch/arm64/include/asm/kvm_hyp.h > +++ b/arch/arm64/include/asm/kvm_hyp.h > @@ -134,6 +134,8 @@ > > void __sysreg_save_host_state(struct kvm_cpu_context *ctxt); > void __sysreg_restore_host_state(struct kvm_cpu_context *ctxt); > +void __sysreg_save_host_pstate(struct kvm_vcpu *vcpu); > +void __sysreg_restore_host_pstate(struct kvm_vcpu *vcpu); > void __sysreg_save_guest_state(struct kvm_cpu_context *ctxt); > void __sysreg_restore_guest_state(struct
kvm_cpu_context *ctxt); > void __sysreg32_save_state(struct kvm_vcpu *vcpu); > diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h > index 248339e..efdcf40 100644 > --- a/arch/arm64/include/asm/sysreg.h > +++ b/arch/arm64/include/asm/sysreg.h > @@ -295,6 +295,29 @@ > #define SYS_ICH_LR14_EL2 __SYS__LR8_EL2(6) > #define SYS_ICH_LR15_EL2 __SYS__LR8_EL2(7) > > +#define REG_PSTATE_PAN sys_reg(3, 0, 4, 2, 3) > +#define REG_PSTATE_UAO sys_reg(3, 0, 4, 2, 4) > + > +#define GET_PSTATE_PAN \ > + ({ \ > + u64 reg;\ > + asm volatile(ALTERNATIVE("mov %0, #0", \ > + "mrs_s %0, " __stringify(REG_PSTATE_PAN),\ > + ARM64_HAS_PAN)\ > + : "=r" (reg));\ > + reg;\ > + }) > + > +#define GET_PSTATE_UAO \ > + ({ \ > + u64 reg;\ > + asm volatile(ALTERNATIVE("mov %0, #0",\ > + "mrs_s %0, " __stringify(REG_PSTATE_UAO),\ > + ARM64_HAS_UAO)\ > + : "=r" (reg));\ > + reg;\ > + }) > + > /* Common SCTLR_ELx flags. */ > #define SCTLR_ELx_EE(1 << 25) > #define SCTLR_ELx_I (1 << 12) > diff --git a/arch/arm64/kvm/hyp/entry.S b/arch/arm64/kvm/hyp/entry.S > index 12ee62d..7662ef5 100644 > --- a/arch/arm64/kvm/hyp/entry.S > +++ b/arch/arm64/kvm/hyp/entry.S > @@ -96,8 +96,6 @@ ENTRY(__guest_exit) > > add x1, x1, #VCPU_CONTEXT > > - ALTERNATIVE(nop, SET_PSTATE_PAN(1), ARM64_HAS_PAN, CONFIG_ARM64_PAN) > - > // Store the guest regs x2 and x3 > stp x2, x3, [x1, #CPU_XREG_OFFSET(2)] > > diff --git a/arch/arm64/kvm/hyp/switch.c b/arch/arm64/kvm/hyp/switch.c > index 945e79c..9b380a1 100644 > --- a/arch/arm64/kvm/hyp/switch.c > +++ b/arch/arm64/kvm/hyp/switch.c > @@ -278,6 +278,26 @@ static void __hyp_text __skip_instr(struct kvm_vcpu > *vcpu) > write_sysreg_el2(*vcpu_pc(vcpu), elr); > } > > +static void __hyp_text __save_host_state(struct kvm_vcpu *vcpu) > +{ > + struct kvm_cpu_context *host_ctxt; > + > + host_ctxt = kern_hyp_va(vcpu->arch.host_cpu_context); > + > + __sysreg_save_host_state(host_ctxt); > + __sysreg_save_host_pstate(vcpu); > +} > + > +static void __hyp_text 
__restore_host_state(struct kvm_vcpu *vcpu) > +{ > + struct
Re: [RFC v2 6/6] platform/x86: intel_pmc_ipc: Use generic Intel IPC device calls
Hi Lee, On 9/5/2017 12:38 AM, Lee Jones wrote: On Sat, 02 Sep 2017, sathyanarayanan.kuppusw...@linux.intel.com wrote: From: Kuppuswamy Sathyanarayanan Removed redundant IPC helper functions and refactored the driver to use generic IPC device driver APIs. This patch also cleans up PMC IPC user drivers to use APIs provided by the generic IPC driver. Signed-off-by: Kuppuswamy Sathyanarayanan --- arch/x86/include/asm/intel_pmc_ipc.h | 37 +-- drivers/mfd/intel_soc_pmic_bxtwc.c | 24 +- include/linux/mfd/intel_soc_pmic.h | 2 + I'm a bit concerned by the API. This is not a new change. Even before refactoring this driver, we have been using a u32 variable to pass the DPTR and SPTR addresses. Any reason why you're not using pointers for addresses? I think the main reason is, this is the expected format defined by the SCU/PMC spec. According to the spec document, the SPTR and DPTR registers are used to program the 32 bit SRAM address from which the PMC/SCU firmware can read/write the data of an IPC command. If we are not using SPTR or DPTR, we need to leave the value at zero. If you're using pointers, you should be using NULL, instead of 0. drivers/platform/x86/intel_pmc_ipc.c | 364 +- drivers/platform/x86/intel_telemetry_pltdrv.c | 114 5 files changed, 215 insertions(+), 326 deletions(-) Changes since v1: * Removed custom APIs. * Cleaned up PMC IPC user drivers to use APIs provided by generic IPC driver.
[PATCH 3/3] locking/rwsem/x86: Add stack frame dependency for __downgrade_write()
kernel/locking/rwsem.o: warning: objtool: downgrade_write()+0x22: call without frame pointer save/setup The warning means gcc 7.2.0 placed the __downgrade_write() inline asm (and its call instruction) before the frame pointer setup in downgrade_write(), which breaks frame pointer convention and can result in incorrect stack traces. Force a stack frame to be created before the call instruction by listing the stack pointer as an output operand in the inline asm statement. Signed-off-by: Miguel Bernal Marin --- arch/x86/include/asm/rwsem.h | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h index d26b6916b935..a749dc6a3103 100644 --- a/arch/x86/include/asm/rwsem.h +++ b/arch/x86/include/asm/rwsem.h @@ -205,8 +205,10 @@ static inline void __up_write(struct rw_semaphore *sem) */ static inline void __downgrade_write(struct rw_semaphore *sem) { + register void *__sp asm(_ASM_SP); + asm volatile("# beginning __downgrade_write\n\t" -LOCK_PREFIX _ASM_ADD "%2,(%1)\n\t" +LOCK_PREFIX _ASM_ADD "%3,(%2)\n\t" /* * transitions 0x0001 -> 0x0001 (i386) * 0x0001 -> 0x0001 (x86_64) @@ -215,7 +217,7 @@ static inline void __downgrade_write(struct rw_semaphore *sem) " call call_rwsem_downgrade_wake\n" "1:\n\t" "# ending __downgrade_write\n" -: "+m" (sem->count) +: "+m" (sem->count), "+r" (__sp) : "a" (sem), "er" (-RWSEM_WAITING_BIAS) : "memory", "cc"); } -- 2.14.1
[PATCH 0/3] locking/rwsem/x86: Add stack frame dependency for some inline asm
Some warnings were reported by objtool when building with gcc 7.2.0 kernel/locking/rwsem.o: warning: objtool: up_read()+0x11: call without frame pointer save/setup kernel/locking/rwsem.o: warning: objtool: up_write()+0x17: call without frame pointer save/setup kernel/locking/rwsem.o: warning: objtool: downgrade_write()+0x22: call without frame pointer save/setup which means gcc placed an inline asm function and its call instruction before the frame pointer setup. This series forces a stack frame to be created before the call instruction by listing the stack pointer as an output operand in the inline asm statement. Miguel Bernal Marin (3): locking/rwsem/x86: Add stack frame dependency for __up_read() locking/rwsem/x86: Add stack frame dependency for __up_write() locking/rwsem/x86: Add stack frame dependency for __downgrade_write() arch/x86/include/asm/rwsem.h | 18 -- 1 file changed, 12 insertions(+), 6 deletions(-) -- 2.14.1
[PATCH 2/3] locking/rwsem/x86: Add stack frame dependency for __up_write()
kernel/locking/rwsem.o: warning: objtool: up_write()+0x17: call without frame pointer save/setup The warning means gcc 7.2.0 placed the __up_write() inline asm (and its call instruction) before the frame pointer setup in up_write(), which breaks frame pointer convention and can result in incorrect stack traces. Force a stack frame to be created before the call instruction by listing the stack pointer as an output operand in the inline asm statement. Signed-off-by: Miguel Bernal Marin --- arch/x86/include/asm/rwsem.h | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h index 762167afaec0..d26b6916b935 100644 --- a/arch/x86/include/asm/rwsem.h +++ b/arch/x86/include/asm/rwsem.h @@ -186,14 +186,16 @@ static inline void __up_read(struct rw_semaphore *sem) static inline void __up_write(struct rw_semaphore *sem) { long tmp; + register void *__sp asm(_ASM_SP); + asm volatile("# beginning __up_write\n\t" -LOCK_PREFIX " xadd %1,(%2)\n\t" +LOCK_PREFIX " xadd %1,(%3)\n\t" /* subtracts 0x0001, returns the old value */ " jns 1f\n\t" " call call_rwsem_wake\n" /* expects old value in %edx */ "1:\n\t" "# ending __up_write\n" -: "+m" (sem->count), "=d" (tmp) +: "+m" (sem->count), "=d" (tmp), "+r" (__sp) : "a" (sem), "1" (-RWSEM_ACTIVE_WRITE_BIAS) : "memory", "cc"); } -- 2.14.1
[PATCH 1/3] locking/rwsem/x86: Add stack frame dependency for __up_read()
kernel/locking/rwsem.o: warning: objtool: up_read()+0x11: call without frame pointer save/setup The warning means gcc 7.2.0 placed the __up_read() inline asm (and its call instruction) before the frame pointer setup in up_read(), which breaks frame pointer convention and can result in incorrect stack traces. Force a stack frame to be created before the call instruction by listing the stack pointer as an output operand in the inline asm statement. Signed-off-by: Miguel Bernal Marin --- arch/x86/include/asm/rwsem.h | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h index a34e0d4b957d..762167afaec0 100644 --- a/arch/x86/include/asm/rwsem.h +++ b/arch/x86/include/asm/rwsem.h @@ -166,14 +166,16 @@ static inline bool __down_write_trylock(struct rw_semaphore *sem) static inline void __up_read(struct rw_semaphore *sem) { long tmp; + register void *__sp asm(_ASM_SP); + asm volatile("# beginning __up_read\n\t" -LOCK_PREFIX " xadd %1,(%2)\n\t" +LOCK_PREFIX " xadd %1,(%3)\n\t" /* subtracts 1, returns the old value */ " jns 1f\n\t" " call call_rwsem_wake\n" /* expects old value in %edx */ "1:\n" "# ending __up_read\n" -: "+m" (sem->count), "=d" (tmp) +: "+m" (sem->count), "=d" (tmp), "+r" (__sp) : "a" (sem), "1" (-RWSEM_ACTIVE_READ_BIAS) : "memory", "cc"); } -- 2.14.1
Re: [PATCH v8 06/13] x86/apic: Mark the apic_intr_mode extern for sanity check cleanup
On 09/06/17 at 12:25pm, Baoquan He wrote: > On 08/28/17 at 11:20am, Dou Liyang wrote: > > - /* > > * Should not be necessary because the MP table should list the boot > > * CPU too, but we do it for the sake of robustness anyway. > > */ > > @@ -1254,29 +1237,6 @@ static int __init smp_sanity_check(unsigned max_cpus) > > physid_set(hard_smp_processor_id(), phys_cpu_present_map); > > } > > preempt_enable(); > > - > > - /* > > -* If we couldn't find a local APIC, then get out of here now! > > -*/ > > - if (APIC_INTEGRATED(boot_cpu_apic_version) && > > - !boot_cpu_has(X86_FEATURE_APIC)) { > > - if (!disable_apic) { > > - pr_err("BIOS bug, local APIC #%d not detected!...\n", > > - boot_cpu_physical_apicid); > > - pr_err("... forcing use of dummy APIC emulation (tell > > your hw vendor)\n"); > > - } > > - return SMP_NO_APIC; > > - } > > - > > - /* > > -* If SMP should be disabled, then really disable it! > > -*/ > > - if (!max_cpus) { > > - pr_info("SMP mode deactivated\n"); > > - return SMP_FORCE_UP; > > - } > > - > > - return SMP_OK; > > } > > > > static void __init smp_cpu_index_default(void) > > @@ -1335,19 +1295,20 @@ void __init native_smp_prepare_cpus(unsigned int > > max_cpus) > > Please also cleanup the passed in max_cpus since it's not used here any > more. And up to the caller: Oops, I just checked x86 code, other arch also have this hook. Please ignore this comment. > > static noinline void __init kernel_init_freeable(void) > { > ... > smp_prepare_cpus(setup_max_cpus); > ...
> } > > > > > apic_intr_mode_init(); > > > > - switch (smp_sanity_check(max_cpus)) { > > - case SMP_NO_CONFIG: > > - disable_smp(); > > - return; > > - case SMP_NO_APIC: > > + smp_sanity_check(); > > + > > + switch (apic_intr_mode) { > > + case APIC_PIC: > > + case APIC_VIRTUAL_WIRE_NO_CONFIG: > > disable_smp(); > > return; > > - case SMP_FORCE_UP: > > + case APIC_SYMMETRIC_IO_NO_ROUTING: > > disable_smp(); > > /* Setup local timer */ > > x86_init.timers.setup_percpu_clockev(); > > return; > > - case SMP_OK: > > + case APIC_VIRTUAL_WIRE: > > + case APIC_SYMMETRIC_IO: > > break; > > } > > > > -- > > 2.5.5 > > > > > >
Re: [PATCH v3 2/3] arm: dts: add Nuvoton NPCM750 device tree
On Wed, Sep 6, 2017 at 10:00 AM, Brendan Higgins wrote: > +++ b/Documentation/devicetree/bindings/arm/npcm/npcm.txt > @@ -0,0 +1,6 @@ > +NPCM Platforms Device Tree Bindings > +--- > +NPCM750 SoC > +Required root node properties: > + - compatible = "nuvoton,npcm750"; > + This is minimal. I assume there will be more content added as more support is added? Does it need its own directory? > diff --git a/arch/arm/boot/dts/nuvoton-npcm750-evb.dts > b/arch/arm/boot/dts/nuvoton-npcm750-evb.dts > new file mode 100644 > index ..54df32cff21b > --- /dev/null > +++ b/arch/arm/boot/dts/nuvoton-npcm750-evb.dts > + > +/dts-v1/; > +#include "nuvoton-npcm750.dtsi" > + > +/ { > + model = "Nuvoton npcm750 Development Board (Device Tree)"; > + compatible = "nuvoton,npcm750"; > + > + chosen { > + stdout-path = > + bootargs = "earlyprintk=serial,serial3,115200"; > + }; > + > + memory { > + reg = <0 0x4000>; > + }; > + > + cpus { > + enable-method = "nuvoton,npcm7xx-smp"; > + }; > + > + clk: clock-controller@f0801000 { > + status = "okay"; > + }; > + > + apb { > + watchdog1: watchdog@f0009000 { > + status = "okay"; > + }; You've already got the label for the node, is there a reason you don't use a phandle to set the status? { status = "okay"; }; Same with the serial nodes below. > + > + serial0: serial0@f0001000 { > + status = "okay"; > + }; > + > + serial1: serial1@f0002000 { > + status = "okay"; > + }; > + > + serial2: serial2@f0003000 { > + status = "okay"; > + }; > + > + serial3: serial3@f0004000 { > + status = "okay"; > + }; > + };
Abysmal scheduler performance in Linus' tree?
I'm running e7d0c41ecc2e372a81741a30894f556afec24315 from Linus' tree today, and I'm seeing abysmal scheduler performance. Running make -j4 ends up with all the tasks on CPU 3 most of the time (on my 4-logical-thread laptop). taskset -c 0 whatever puts whatever on CPU 0, but plain while true; do true; done puts the infinite loop on CPU 3 right along with the make -j4 tasks. This is on Fedora 26, and I don't think I'm doing anything weird. systemd has enabled the cpu controller, but it doesn't seem to have configured anything or created any non-root cgroups. Just a heads up. I haven't tried to diagnose it at all.
[PATCH v2] arm: dts: artpec6: Remove unnecessary interrupt-parent property from sub-nodes
From: Surender Polsani <surend...@techveda.org> The "interrupt-parent" property is declared in the root node, so it is global to all nodes. The property is re-declared in a few sub-nodes; to avoid duplication it is removed from the following sub-nodes: pmu, amba@0, amba@0/ethernet. Signed-off-by: Surender Polsani <surend...@techveda.org> Acked-by: Niklas Cassel <niklas.cas...@axis.com> --- Changes for v2: - Made a few changes as suggested by Niklas Cassel. - Patch is rebased and built (ARCH=arm) on latest next-20170905. --- arch/arm/boot/dts/artpec6.dtsi | 3 --- 1 file changed, 3 deletions(-) diff --git a/arch/arm/boot/dts/artpec6.dtsi b/arch/arm/boot/dts/artpec6.dtsi index 767cbe8..2ed1177 100644 --- a/arch/arm/boot/dts/artpec6.dtsi +++ b/arch/arm/boot/dts/artpec6.dtsi @@ -151,7 +151,6 @@ interrupts = , ; interrupt-affinity = <>, <>; - interrupt-parent = <>; }; pcie: pcie@f805 { @@ -185,7 +184,6 @@ compatible = "simple-bus"; #address-cells = <0x1>; #size-cells = <0x1>; - interrupt-parent = <>; ranges; dma-ranges = <0x8000 0x 0x4000>; dma-coherent; @@ -195,7 +193,6 @@ clocks = <_phy_ref_clk>, < ARTPEC6_CLK_ETH_ACLK>; compatible = "snps,dwc-qos-ethernet-4.10"; - interrupt-parent = <>; interrupts = ; reg = <0xf801 0x4000>; -- 1.9.1
UBSAN: Undefined error in log2.h
Hi, I am hitting this bug when running the syzkaller fuzzer on kernel 4.13-rc7 Syzkaller hit 'UBSAN: Undefined behaviour in ./include/linux/log2.h:LINE' bug. Guilty file: fs/pipe.c Maintainers: [] UBSAN: Undefined behaviour in ./include/linux/log2.h:57:13 shift exponent 64 is too large for 64-bit type 'long unsigned int' CPU: 3 PID: 2712 Comm: syz-executor0 Not tainted 4.13.0-rc7 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0xf7/0x1ae lib/dump_stack.c:52 ubsan_epilogue+0x12/0x8f lib/ubsan.c:164 __ubsan_handle_shift_out_of_bounds+0x2b2/0x32c lib/ubsan.c:421 __roundup_pow_of_two include/linux/log2.h:57 [inline] round_pipe_size fs/pipe.c:1027 [inline] pipe_set_size fs/pipe.c:1041 [inline] pipe_fcntl+0x6b6/0x7b0 fs/pipe.c:1158 do_fcntl+0x362/0x1250 fs/fcntl.c:416 SYSC_fcntl fs/fcntl.c:462 [inline] SyS_fcntl+0xd6/0x110 fs/fcntl.c:447 entry_SYSCALL_64_fastpath+0x18/0xad RIP: 0033:0x451e59 RSP: 002b:7ffe1469d358 EFLAGS: 0212 ORIG_RAX: 0048 RAX: ffda RBX: 00718000 RCX: 00451e59 RDX: RSI: 0407 RDI: 0003 RBP: 0046 R08: R09: R10: R11: 0212 R12: 20011ff8 R13: 00718000 R14: R15: I have the full reproducer program here: https://pastebin.com/mLZx0ySS The main code is below: long r[4]; void loop() { syscall(SYS_write, 1, "executing program\n", strlen("executing program\n")); memset(r, -1, sizeof(r)); r[0] = syscall(__NR_mmap, 0x2000ul, 0x12000ul, 0x3ul, 0x32ul, 0xul, 0x0ul); r[1] = syscall(__NR_pipe, 0x20011ff8ul); if (r[1] != -1) NONFAILING(r[2] = *(uint32_t*)0x20011ff8); r[3] = syscall(__NR_fcntl, r[2], 0x407ul, 0x0ul); }
Re: UBSAN: Undefined error in time.h signed integer overflow
On Tue, Sep 5, 2017 at 9:30 PM, Shankara Pailoor wrote: > Hi, > > I encountered this bug while fuzzing linux kernel 4.13-rc7 with syzkaller. > > > UBSAN: Undefined behaviour in ./include/linux/time.h:233:27 > signed integer overflow: > 8391720337152500783 * 10 cannot be represented in type 'long int' > CPU: 0 PID: 31798 Comm: syz-executor2 Not tainted 4.13.0-rc7 #2 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS > Ubuntu-1.8.2-1ubuntu1 04/01/2014 > Call Trace: > __dump_stack lib/dump_stack.c:16 [inline] > dump_stack+0xf7/0x1ae lib/dump_stack.c:52 > ubsan_epilogue+0x12/0x8f lib/ubsan.c:164 > handle_overflow+0x21e/0x292 lib/ubsan.c:195 > __ubsan_handle_mul_overflow+0x2a/0x3e lib/ubsan.c:219 > timespec_to_ns include/linux/time.h:233 [inline] > posix_cpu_timer_set+0xb5c/0xf20 kernel/time/posix-cpu-timers.c:686 > do_timer_settime+0x1f4/0x390 kernel/time/posix-timers.c:890 > SYSC_timer_settime kernel/time/posix-timers.c:916 [inline] > SyS_timer_settime+0xea/0x170 kernel/time/posix-timers.c:902 > entry_SYSCALL_64_fastpath+0x18/0xad > RIP: 0033:0x451e59 > RSP: 002b:7fb62af4fc08 EFLAGS: 0216 ORIG_RAX: 00df > RAX: ffda RBX: 00718000 RCX: 00451e59 > RDX: 20006000 RSI: RDI: > RBP: 0046 R08: R09: > R10: 20003fe0 R11: 0216 R12: 004be920 > R13: R14: R15: > Looks similar to the issue Thomas fixed here: https://patchwork.kernel.org/patch/9799827/ Thomas: Should we change timespec_to_ns() to use the same transition internally rather than trying to track all the callers? thanks -john
[PATCH 1/2] mm/slub: wake up kswapd for initial high order allocation
From: Joonsoo Kim slub uses a higher order allocation than it actually needs. In this case, we don't want to do direct reclaim to make such a high order page since it causes a big latency for the user. Instead, we would like to fall back to the lower order allocation that it actually needs. However, we also want to get this higher order page the next time in order to get the best performance, and that would be the role of background threads like kswapd and kcompactd. To wake them up, we should not clear __GFP_KSWAPD_RECLAIM. Unlike this intention, the current code clears __GFP_KSWAPD_RECLAIM, so fix it. The current unintended code came from Mel's commit 444eb2a449ef ("mm: thp: set THP defrag by default to madvise and add a stall-free defrag option") for the slub part. It removes a special case in __alloc_page_slowpath() where including __GFP_THISNODE and lacking __GFP_DIRECT_RECLAIM effectively also means lacking __GFP_KSWAPD_RECLAIM. However, slub doesn't use __GFP_THISNODE, so that case doesn't apply here and partially reverting this code in slub doesn't hurt Mel's intention. Note that this patch does some clean up, too: __GFP_NOFAIL is cleared twice, so remove one. Signed-off-by: Joonsoo Kim --- mm/slub.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/mm/slub.c b/mm/slub.c index 163352c..45f4a4b 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1578,8 +1578,12 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node) * so we fall-back to the minimum order allocation. */ alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL; - if ((alloc_gfp & __GFP_DIRECT_RECLAIM) && oo_order(oo) > oo_order(s->min)) - alloc_gfp = (alloc_gfp | __GFP_NOMEMALLOC) & ~(__GFP_RECLAIM|__GFP_NOFAIL); + if (oo_order(oo) > oo_order(s->min)) { + if (alloc_gfp & __GFP_DIRECT_RECLAIM) { + alloc_gfp |= __GFP_NOMEMALLOC; + alloc_gfp &= ~__GFP_DIRECT_RECLAIM; + } + } page = alloc_slab_page(s, alloc_gfp, node, oo); if (unlikely(!page)) { -- 2.7.4
[PATCH 2/2] mm/slub: don't use reserved memory for optimistic try
From: Joonsoo Kim High-order atomic allocation is difficult to satisfy since we cannot reclaim anything in this context, so we reserve a pageblock for this kind of request. In slub, we try to allocate a higher-order page than actually needed in order to get the best performance. If this optimistic try is used with GFP_ATOMIC, alloc_flags will be set to ALLOC_HARDER and the pageblock reserved for high-order atomic allocation would be used. Moreover, this request would, if it succeeds, reserve the MIGRATE_HIGHATOMIC pageblock to prepare for further requests. Using a MIGRATE_HIGHATOMIC pageblock is not good in terms of fragmentation management, since unreserving the pageblock unconditionally sets its migratetype to the request's migratetype without considering the migratetype of the pages already used in the pageblock. This is not what we intend, so fix it by unconditionally masking out __GFP_ATOMIC so that ALLOC_HARDER is not set. It is also undesirable to use reserved memory for the optimistic try, so mask out __GFP_HIGH. This patch also adds __GFP_NOMEMALLOC since we don't want to use the reserved memory for the optimistic try even if the task has the PF_MEMALLOC flag. 
Signed-off-by: Joonsoo Kim --- include/linux/gfp.h | 1 + mm/page_alloc.c | 8 mm/slub.c | 6 ++ 3 files changed, 11 insertions(+), 4 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index f780718..1f5658e 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -568,6 +568,7 @@ extern gfp_t gfp_allowed_mask; /* Returns true if the gfp_mask allows use of ALLOC_NO_WATERMARK */ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask); +gfp_t gfp_drop_reserves(gfp_t gfp_mask); extern void pm_restrict_gfp_mask(void); extern void pm_restore_gfp_mask(void); diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6dbc49e..0f34356 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -3720,6 +3720,14 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask) return !!__gfp_pfmemalloc_flags(gfp_mask); } +gfp_t gfp_drop_reserves(gfp_t gfp_mask) +{ + gfp_mask &= ~(__GFP_HIGH | __GFP_ATOMIC); + gfp_mask |= __GFP_NOMEMALLOC; + + return gfp_mask; +} + /* * Checks whether it makes sense to retry the reclaim to make a forward progress * for the given allocation request. diff --git a/mm/slub.c b/mm/slub.c index 45f4a4b..3d75d30 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -1579,10 +1579,8 @@ static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node) */ alloc_gfp = (flags | __GFP_NOWARN | __GFP_NORETRY) & ~__GFP_NOFAIL; if (oo_order(oo) > oo_order(s->min)) { - if (alloc_gfp & __GFP_DIRECT_RECLAIM) { - alloc_gfp |= __GFP_NOMEMALLOC; - alloc_gfp &= ~__GFP_DIRECT_RECLAIM; - } + alloc_gfp = gfp_drop_reserves(alloc_gfp); + alloc_gfp &= ~__GFP_DIRECT_RECLAIM; } page = alloc_slab_page(s, alloc_gfp, node, oo); -- 2.7.4
[PATCH] mm/page_alloc: don't reserve ZONE_HIGHMEM for ZONE_MOVABLE request
From: Joonsoo Kim Free pages on ZONE_HIGHMEM cannot be used for kernel memory, so reserving them there is not that important. When ZONE_MOVABLE is used, this would theoretically decrease the memory usable for GFP_HIGHUSER_MOVABLE allocation requests, which are mainly used for page cache and anon page allocation. So, fix it by setting sysctl_lowmem_reserve_ratio[ZONE_HIGHMEM] to 0. Also, defining the sysctl_lowmem_reserve_ratio array with MAX_NR_ZONES - 1 entries makes the code complex. For example, on a highmem system the following reserve ratio is activated for the *NORMAL ZONE*, which could easily mislead people. #ifdef CONFIG_HIGHMEM 32 #endif This patch also fixes that by defining the sysctl_lowmem_reserve_ratio array with MAX_NR_ZONES entries and placing the "#ifdef"s correctly. Reviewed-by: Aneesh Kumar K.V Acked-by: Vlastimil Babka Signed-off-by: Joonsoo Kim --- Documentation/sysctl/vm.txt | 5 ++--- include/linux/mmzone.h | 2 +- mm/page_alloc.c | 25 ++--- 3 files changed, 17 insertions(+), 15 deletions(-) diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index 9baf66a..e9059d3 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -336,8 +336,6 @@ The lowmem_reserve_ratio is an array. You can see them by reading this file. % cat /proc/sys/vm/lowmem_reserve_ratio 256 256 32 - -Note: # of this elements is one fewer than number of zones. Because the highest - zone's value is not necessary for following calculation. But, these values are not used directly. The kernel calculates # of protection pages for each zones from them. These are shown as array of protection pages @@ -388,7 +386,8 @@ As above expression, they are reciprocal number of ratio. pages of higher zones on the node. If you would like to protect more pages, smaller values are effective. -The minimum value is 1 (1/1 -> 100%). +The minimum value is 1 (1/1 -> 100%). The value less than 1 completely +disables protection of the pages. 
== diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 356a814..d549c4e 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -890,7 +890,7 @@ int min_free_kbytes_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); int watermark_scale_factor_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); -extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1]; +extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES]; int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int, void __user *, size_t *, loff_t *); int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int, diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 0f34356..2a7f7e9 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -203,17 +203,18 @@ static void __free_pages_ok(struct page *page, unsigned int order); * TBD: should special case ZONE_DMA32 machines here - in those we normally * don't need any ZONE_NORMAL reservation */ -int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES-1] = { +int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES] = { #ifdef CONFIG_ZONE_DMA -256, + [ZONE_DMA] = 256, #endif #ifdef CONFIG_ZONE_DMA32 -256, + [ZONE_DMA32] = 256, #endif + [ZONE_NORMAL] = 32, #ifdef CONFIG_HIGHMEM -32, + [ZONE_HIGHMEM] = 0, #endif -32, + [ZONE_MOVABLE] = 0, }; EXPORT_SYMBOL(totalram_pages); @@ -6921,13 +6922,15 @@ static void setup_per_zone_lowmem_reserve(void) struct zone *lower_zone; idx--; - - if (sysctl_lowmem_reserve_ratio[idx] < 1) - sysctl_lowmem_reserve_ratio[idx] = 1; - lower_zone = pgdat->node_zones + idx; - lower_zone->lowmem_reserve[j] = managed_pages / - sysctl_lowmem_reserve_ratio[idx]; + + if (sysctl_lowmem_reserve_ratio[idx] < 1) { + sysctl_lowmem_reserve_ratio[idx] = 0; + lower_zone->lowmem_reserve[j] = 0; + } else { + lower_zone->lowmem_reserve[j] = + managed_pages / sysctl_lowmem_reserve_ratio[idx]; + } managed_pages += lower_zone->managed_pages; } } -- 2.7.4
UBSAN: Undefined error in time.h signed integer overflow
Hi, I encountered this bug while fuzzing linux kernel 4.13-rc7 with syzkaller. UBSAN: Undefined behaviour in ./include/linux/time.h:233:27 signed integer overflow: 8391720337152500783 * 10 cannot be represented in type 'long int' CPU: 0 PID: 31798 Comm: syz-executor2 Not tainted 4.13.0-rc7 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:16 [inline] dump_stack+0xf7/0x1ae lib/dump_stack.c:52 ubsan_epilogue+0x12/0x8f lib/ubsan.c:164 handle_overflow+0x21e/0x292 lib/ubsan.c:195 __ubsan_handle_mul_overflow+0x2a/0x3e lib/ubsan.c:219 timespec_to_ns include/linux/time.h:233 [inline] posix_cpu_timer_set+0xb5c/0xf20 kernel/time/posix-cpu-timers.c:686 do_timer_settime+0x1f4/0x390 kernel/time/posix-timers.c:890 SYSC_timer_settime kernel/time/posix-timers.c:916 [inline] SyS_timer_settime+0xea/0x170 kernel/time/posix-timers.c:902 entry_SYSCALL_64_fastpath+0x18/0xad RIP: 0033:0x451e59 RSP: 002b:7fb62af4fc08 EFLAGS: 0216 ORIG_RAX: 00df RAX: ffda RBX: 00718000 RCX: 00451e59 RDX: 20006000 RSI: RDI: RBP: 0046 R08: R09: R10: 20003fe0 R11: 0216 R12: 004be920 R13: R14: R15: Here is the full reproducer program: https://pastebin.com/xucAtmbD Below is the core of the reproducer: long r[16]; void *thr(void *arg) { switch ((long)arg) { case 0: r[0] = syscall(__NR_mmap, 0x2000ul, 0xd000ul, 0x3ul, 0x32ul, 0xul, 0x0ul); break; case 1: NONFAILING(*(uint64_t*)0x20001fb0 = (uint64_t)0x0); NONFAILING(*(uint32_t*)0x20001fb8 = (uint32_t)0x1); NONFAILING(*(uint32_t*)0x20001fbc = (uint32_t)0x0); NONFAILING(*(uint64_t*)0x20001fc0 = (uint64_t)0x20007fcd); NONFAILING(*(uint64_t*)0x20001fc8 = (uint64_t)0x20005000); r[6] = syscall(__NR_timer_create, 0x3ul, 0x20001fb0ul, 0x2000ul); break; case 2: r[7] = syscall(__NR_clock_gettime, 0x0ul, 0x20004000ul); if (r[7] != -1) NONFAILING(r[8] = *(uint64_t*)0x20004008); break; case 3: NONFAILING(*(uint64_t*)0x20006000 = (uint64_t)0x0); NONFAILING(*(uint64_t*)0x20006008 = 
(uint64_t)0x0); NONFAILING(*(uint64_t*)0x20006010 = (uint64_t)0x0); NONFAILING(*(uint64_t*)0x20006018 = r[8]+1000); r[13] = syscall(__NR_timer_settime, 0x0ul, 0x0ul, 0x20006000ul, 0x20003fe0ul); break; case 4: NONFAILING(memcpy((void*)0x20006000, "\x2f\x64\x65\x76\x2f\x61\x75\x74\x6f\x66\x73\x00", 12)); r[15] = syscall(__NR_openat, 0xff9cul, 0x20006000ul, 0x8000ul, 0x0ul); break; } return 0; }
Re: [PATCH 4/4][RFC v2] x86/apic: Spread the vectors by choosing the idlest CPU
On Wed, Sep 06, 2017 at 12:57:41AM +0200, Thomas Gleixner wrote: > On Sun, 3 Sep 2017, Thomas Gleixner wrote: > > > On Fri, 1 Sep 2017, Chen Yu wrote: > > > > > This is the major logic to spread the vectors on different CPUs. > > > The main idea is to choose the 'idlest' CPU which has assigned > > > the least number of vectors as the candidate/hint for the vector > > > allocation domain, in the hope that the vector allocation domain > > > could leverage this hint to generate corresponding cpumask. > > > > > > One of the requirements to do this vector spreading work comes from the > > > following hibernation problem found on a 16 cores server: > > > > > > CPU 31 disable failed: CPU has 62 vectors assigned and there > > > are only 0 available. > > Thinking more about this, this makes no sense whatsoever. > > The total number of interrupts on a system is the same whether they are > all on CPU 0 or evenly spread over all CPUs. > > As this machine is using physcial destination mode, the number of vectors > used is the same as the number of interrupts, except for the case where a > move of an interrupt is in progress and the interrupt which cleans up the > old vector has not yet arrived. Lets ignore that for now. > > The available vector space is 204 per CPU on such a system. > > 256 - SYSTEM[0-31, 32, 128, 239-255] - LEGACY[50] = 204 > > > > CPU 31 disable failed: CPU has 62 vectors assigned and there > > > are only 0 available. > > CPU31 is the last AP going offline (CPU0 is still online). > > It wants to move 62 vectors to CPU0, but it can't because CPU0 has 0 > available vectors. That means CPU0 has 204 vectors used. I doubt that, but > what I doubt even more is that this interrupt spreading helps in any way. > > Assumed that we have a total of 204 + 62 = 266 device interrupt vectors in > use and they are evenly spread over 32 CPUs, so each CPU has either 8 or > nine vectors. Fine. 
> > Now if you unplug all CPUs except CPU0 starting from CPU1 up to CPU31 then > at the point where CPU31 is about to be unplugged, CPU0 holds 133 vectors > and CPU31 holds 133 vectors as well - assumed that the spread is exactly > even. > > I have a hard time to figure out how the 133 vectors on CPU31 are now > magically fitting in the empty space on CPU0, which is 204 - 133 = 71. In > my limited understanding of math 133 is greater than 71, but your patch > might make that magically be wrong. > The problem is reproduced when the network cable is not plugged in, because this driver works like this: step 1. Reserve enough IRQ vectors and corresponding IRQs. step 2. If the network is activated, invoke request_irq() to register the handler. step 3. Invoke set_affinity() to spread the IRQs onto different CPUs, which spreads the vectors too. Here's my understanding of why spreading vectors might help in this special case: since step 2 never runs, the IRQs of this driver have not been enabled, so migrate_one_irq() skips them because of the irqd_is_started(d) check. Thus only 8 vectors are allocated by this driver on CPU0 and 8 vectors are left on CPU31; the 8 vectors on CPU31 will not be migrated to CPU0 either, so there is room for other 'valid' vectors to be migrated to CPU0. > Can you please provide detailed information about how many device > interrupts are actually in use/allocated on that system? 
> > Please enable CONFIG_GENERIC_IRQ_DEBUGFS and provide the output of > > # cat /sys/kernel/debug/irq/domains/* > > and > > # ls /sys/kernel/debug/irq/irqs > Ok, here's the information after system bootup on top of 4.13: # cat /sys/kernel/debug/irq/domains/* name: VECTOR size: 0 mapped: 388 flags: 0x0041 name: IO-APIC-0 size: 24 mapped: 16 flags: 0x0041 parent: VECTOR name: VECTOR size: 0 mapped: 388 flags: 0x0041 name: IO-APIC-1 size: 8 mapped: 2 flags: 0x0041 parent: VECTOR name: VECTOR size: 0 mapped: 388 flags: 0x0041 name: IO-APIC-2 size: 8 mapped: 0 flags: 0x0041 parent: VECTOR name: VECTOR size: 0 mapped: 388 flags: 0x0041 name: IO-APIC-3 size: 8 mapped: 0 flags: 0x0041 parent: VECTOR name: VECTOR size: 0 mapped: 388 flags: 0x0041 name: IO-APIC-4 size: 8 mapped: 5 flags: 0x0041 parent: VECTOR name: VECTOR size: 0 mapped: 388 flags: 0x0041 name: PCI-HT size: 0 mapped: 0 flags: 0x0041 parent: VECTOR name: VECTOR size: 0 mapped: 388 flags: 0x0041 name: PCI-MSI-2 size: 0 mapped: 365 flags: 0x0051 parent: VECTOR name: VECTOR size: 0 mapped: 388 flags: 0x0041 name: VECTOR size: 0 mapped: 388 flags: 0x0041 # ls /sys/kernel/debug/irq/irqs ls /sys/kernel/debug/irq/irqs 0 10 11 13 142 184 217 259 292 31 33 337 339 340
Re: [PATCH 23/25 v3] ALSA/dummy: Replace tasklet with softirq hrtimer
On Sep 6 2017 01:18, Sebastian Andrzej Siewior wrote: From: Thomas Gleixner The tasklet is used to defer the execution of snd_pcm_period_elapsed() to the softirq context. Using the CLOCK_MONOTONIC_SOFT base invokes the timer callback in softirq context as well which renders the tasklet useless. Signed-off-by: Thomas Gleixner Signed-off-by: Anna-Maria Gleixner Cc: Jaroslav Kysela Cc: Takashi Iwai Cc: Takashi Sakamoto Cc: alsa-de...@alsa-project.org [o-takashi: avoid stall due to a call of hrtimer_cancel() on a callback of hrtimer] Signed-off-by: Takashi Sakamoto Signed-off-by: Sebastian Andrzej Siewior --- On 2017-09-06 01:05:43 [+0900], Takashi Sakamoto wrote: As Iwai-san mentioned, in this function, .trigger can be called in two cases; XRUN occurs and draining is done. Thus, let you change the comment as 'In cases of XRUN and draining, this calls .trigger to stop PCM substream.'. I'm sorry to trouble you. snd_pcm_period_elapsed() ->snd_pcm_update_hw_ptr0() ->snd_pcm_update_state() ->snd_pcm_drain_done() ... ->.trigger(TRIGGER_STOP) ->xrun() ->snd_pcm_stop() ... ->.trigger(TRIGGER_STOP) I think you asked me just to update the comment. Did I do it right? v2…v3: updated the comment as per Takashi Sakamoto's suggestion. v1…v2: merged Takashi Sakamoto's fixup of the original patch into v2. sound/drivers/dummy.c | 25 +++-- 1 file changed, 11 insertions(+), 14 deletions(-) This looks good to me. I did a quick test and confirmed that it brings no stalls. 
Tested-by: Takashi Sakamoto diff --git a/sound/drivers/dummy.c b/sound/drivers/dummy.c index dd5ed037adf2..cdd851286f92 100644 --- a/sound/drivers/dummy.c +++ b/sound/drivers/dummy.c @@ -376,17 +376,9 @@ struct dummy_hrtimer_pcm { ktime_t period_time; atomic_t running; struct hrtimer timer; - struct tasklet_struct tasklet; struct snd_pcm_substream *substream; }; -static void dummy_hrtimer_pcm_elapsed(unsigned long priv) -{ - struct dummy_hrtimer_pcm *dpcm = (struct dummy_hrtimer_pcm *)priv; - if (atomic_read(&dpcm->running)) - snd_pcm_period_elapsed(dpcm->substream); -} - static enum hrtimer_restart dummy_hrtimer_callback(struct hrtimer *timer) { struct dummy_hrtimer_pcm *dpcm; @@ -394,7 +386,14 @@ static enum hrtimer_restart dummy_hrtimer_callback(struct hrtimer *timer) dpcm = container_of(timer, struct dummy_hrtimer_pcm, timer); if (!atomic_read(&dpcm->running)) return HRTIMER_NORESTART; - tasklet_schedule(&dpcm->tasklet); + /* +* In cases of XRUN and draining, this calls .trigger to stop PCM +* substream. 
+*/ + snd_pcm_period_elapsed(dpcm->substream); + if (!atomic_read(&dpcm->running)) + return HRTIMER_NORESTART; + hrtimer_forward_now(timer, dpcm->period_time); return HRTIMER_RESTART; } @@ -414,14 +413,14 @@ static int dummy_hrtimer_stop(struct snd_pcm_substream *substream) struct dummy_hrtimer_pcm *dpcm = substream->runtime->private_data; atomic_set(&dpcm->running, 0); - hrtimer_cancel(&dpcm->timer); + if (!hrtimer_callback_running(&dpcm->timer)) + hrtimer_cancel(&dpcm->timer); return 0; } static inline void dummy_hrtimer_sync(struct dummy_hrtimer_pcm *dpcm) { hrtimer_cancel(&dpcm->timer); - tasklet_kill(&dpcm->tasklet); } static snd_pcm_uframes_t @@ -466,12 +465,10 @@ static int dummy_hrtimer_create(struct snd_pcm_substream *substream) if (!dpcm) return -ENOMEM; substream->runtime->private_data = dpcm; - hrtimer_init(&dpcm->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL); + hrtimer_init(&dpcm->timer, CLOCK_MONOTONIC_SOFT, HRTIMER_MODE_REL); dpcm->timer.function = dummy_hrtimer_callback; dpcm->substream = substream; atomic_set(&dpcm->running, 0); - tasklet_init(&dpcm->tasklet, dummy_hrtimer_pcm_elapsed, -(unsigned long)dpcm); return 0; } Thanks Takashi Sakamoto
Re: [PATCH v8 06/13] x86/apic: Mark the apic_intr_mode extern for sanity check cleanup
On 08/28/17 at 11:20am, Dou Liyang wrote: > Calling native_smp_prepare_cpus() to prepare for SMP bootup, does > some sanity checking, enables APIC mode and disables SMP feature. > > Now, APIC mode setup has been unified to apic_intr_mode_init(), > some sanity checks are redundant and need to be cleanup. > > Mark the apic_intr_mode extern to refine the switch and remove > the redundant sanity check. > > Signed-off-by: Dou Liyang> --- > arch/x86/include/asm/apic.h | 9 +++ > arch/x86/kernel/apic/apic.c | 12 -- > arch/x86/kernel/smpboot.c | 57 > +++-- > 3 files changed, 22 insertions(+), 56 deletions(-) > > diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h > index 4e550c7..01f3fc8 100644 > --- a/arch/x86/include/asm/apic.h > +++ b/arch/x86/include/asm/apic.h > @@ -53,6 +53,15 @@ extern int local_apic_timer_c2_ok; > extern int disable_apic; > extern unsigned int lapic_timer_frequency; > > +extern enum apic_intr_mode_id apic_intr_mode; > +enum apic_intr_mode_id { > + APIC_PIC, > + APIC_VIRTUAL_WIRE, > + APIC_VIRTUAL_WIRE_NO_CONFIG, > + APIC_SYMMETRIC_IO, > + APIC_SYMMETRIC_IO_NO_ROUTING > +}; > + > #ifdef CONFIG_SMP > extern void __inquire_remote_apic(int apicid); > #else /* CONFIG_SMP */ > diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c > index 9038c5f..97bd47b 100644 > --- a/arch/x86/kernel/apic/apic.c > +++ b/arch/x86/kernel/apic/apic.c > @@ -1235,13 +1235,7 @@ void __init sync_Arb_IDs(void) > APIC_INT_LEVELTRIG | APIC_DM_INIT); > } > > -enum apic_intr_mode { > - APIC_PIC, > - APIC_VIRTUAL_WIRE, > - APIC_VIRTUAL_WIRE_NO_CONFIG, > - APIC_SYMMETRIC_IO, > - APIC_SYMMETRIC_IO_NO_ROUTING, > -}; > +enum apic_intr_mode_id apic_intr_mode; > > static int __init apic_intr_mode_select(void) > { > @@ -1367,7 +1361,9 @@ void __init apic_intr_mode_init(void) > { > bool upmode = false; > > - switch (apic_intr_mode_select()) { > + apic_intr_mode = apic_intr_mode_select(); > + > + switch (apic_intr_mode) { > case APIC_PIC: > pr_info("APIC: 
Keep in PIC mode(8259)\n"); > return; > diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c > index 8301b75..d2160f7 100644 > --- a/arch/x86/kernel/smpboot.c > +++ b/arch/x86/kernel/smpboot.c > @@ -1187,17 +1187,10 @@ static __init void disable_smp(void) > cpumask_set_cpu(0, topology_core_cpumask(0)); > } > > -enum { > - SMP_OK, > - SMP_NO_CONFIG, > - SMP_NO_APIC, > - SMP_FORCE_UP, > -}; > - > /* > * Various sanity checks. > */ > -static int __init smp_sanity_check(unsigned max_cpus) > +static void __init smp_sanity_check(void) > { > preempt_disable(); > > @@ -1235,16 +1228,6 @@ static int __init smp_sanity_check(unsigned max_cpus) > } > > /* > - * If we couldn't find an SMP configuration at boot time, > - * get out of here now! > - */ > - if (!smp_found_config && !acpi_lapic) { > - preempt_enable(); > - pr_notice("SMP motherboard not detected\n"); > - return SMP_NO_CONFIG; > - } > - > - /* >* Should not be necessary because the MP table should list the boot >* CPU too, but we do it for the sake of robustness anyway. >*/ > @@ -1254,29 +1237,6 @@ static int __init smp_sanity_check(unsigned max_cpus) > physid_set(hard_smp_processor_id(), phys_cpu_present_map); > } > preempt_enable(); > - > - /* > - * If we couldn't find a local APIC, then get out of here now! > - */ > - if (APIC_INTEGRATED(boot_cpu_apic_version) && > - !boot_cpu_has(X86_FEATURE_APIC)) { > - if (!disable_apic) { > - pr_err("BIOS bug, local APIC #%d not detected!...\n", > - boot_cpu_physical_apicid); > - pr_err("... forcing use of dummy APIC emulation (tell > your hw vendor)\n"); > - } > - return SMP_NO_APIC; > - } > - > - /* > - * If SMP should be disabled, then really disable it! 
> - */ > - if (!max_cpus) { > - pr_info("SMP mode deactivated\n"); > - return SMP_FORCE_UP; > - } > - > - return SMP_OK; > } > > static void __init smp_cpu_index_default(void) > @@ -1335,19 +1295,20 @@ void __init native_smp_prepare_cpus(unsigned int > max_cpus) Please also cleanup the passed in max_cpus since it's not used here any more. And up to the caller: static noinline void __init kernel_init_freeable(void) { ... smp_prepare_cpus(setup_max_cpus); ... } > > apic_intr_mode_init(); > > - switch (smp_sanity_check(max_cpus)) { > - case SMP_NO_CONFIG: > - disable_smp(); > - return; > - case SMP_NO_APIC: > + smp_sanity_check(); > + > + switch
Re: [PATCH] gpio: thunderx: select IRQ_DOMAIN_HIERARCHY instead of depends on
Hi David, 2017-09-06 11:09 GMT+09:00 David Daney: > On 09/05/2017 06:40 PM, Masahiro Yamada wrote: >> >> IRQ_DOMAIN_HIERARCHY is not user-configurable, but supposed to be >> selected by drivers that need IRQ domain hierarchy support. >> >> GPIO_THUNDERX is the only user of "depends on IRQ_DOMAIN_HIERARCHY". >> This means, we can not enable GPIO_THUNDERX unless other drivers >> select IRQ_DOMAIN_HIERARCHY elsewhere. This is odd. Flip the logic. >> >> Signed-off-by: Masahiro Yamada > > > IRQ_DOMAIN_HIERARCHY is set as a result of ARCH_THUNDER (this SoC hardware), > so it actually works as-is. Right, ARCH_THUNDER does not select it directly, but does it indirectly. (this is not so clear...) ARCH_THUNDER -> ARM64 -> ARM_GIC -> IRQ_DOMAIN_HIERARCHY > That said, this looks like a reasonable > improvement, and will allow the COMPILE_TEST to enable it, so... > > Acked-by: David Daney BTW, I could not understand your intention of (64BIT && COMPILE_TEST) Why can COMPILE_TEST be enabled when 64BIT? -- Best Regards Masahiro Yamada
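The dependency flip being discussed can be sketched as a Kconfig fragment (simplified and hypothetical; the real GPIO_THUNDERX entry carries more dependencies):

```
config GPIO_THUNDERX
	tristate "Cavium ThunderX/OCTEON-TX GPIO"
	# before: the option is invisible unless something else on the
	# platform has already selected the non-user-visible symbol
	#depends on IRQ_DOMAIN_HIERARCHY
	# after: enabling this driver pulls the support in itself
	select IRQ_DOMAIN_HIERARCHY
```

Since IRQ_DOMAIN_HIERARCHY has no prompt, users can never set it directly; "select" is the idiomatic way for a driver to declare that it needs the feature.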
Re: [PATCH v8 01/13] x86/apic: Construct a selector for the interrupt delivery mode
Hi Baoquan, Thanks for your reply! My answer is in below. At 09/06/2017 08:55 AM, Baoquan He wrote: Hi Liyang, On 08/28/17 at 11:20am, Dou Liyang wrote: Now, there are many switches in kernel which are used to determine the final interrupt delivery mode, as shown below: 1) kconfig: CONFIG_X86_64; CONFIG_X86_LOCAL_APIC; CONFIG_x86_IO_APIC 2) kernel option: disable_apic; skip_ioapic_setup 3) CPU Capability: boot_cpu_has(X86_FEATURE_APIC) 4) MP table: smp_found_config 5) ACPI: acpi_lapic; acpi_ioapic; nr_ioapic These switches are disordered and scattered and there are also some dependencies with each other. These make the code difficult to maintain and read. Construct a selector to unify them into a single function, then, Use this selector to get an interrupt delivery mode directly. Signed-off-by: Dou Liyang --- arch/x86/kernel/apic/apic.c | 59 + 1 file changed, 59 insertions(+) diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index 98b3dd8..01bde03 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -1235,6 +1235,65 @@ void __init sync_Arb_IDs(void) APIC_INT_LEVELTRIG | APIC_DM_INIT); } +enum apic_intr_mode { + APIC_PIC, + APIC_VIRTUAL_WIRE, + APIC_SYMMETRIC_IO, +}; + +static int __init apic_intr_mode_select(void) +{ + /* Check kernel option */ + if (disable_apic) { + pr_info("APIC disabled via kernel command line\n"); + return APIC_PIC; + } + + /* Check BIOS */ +#ifdef CONFIG_X86_64 + /* On 64-bit, the APIC must be integrated, Check local APIC only */ + if (!boot_cpu_has(X86_FEATURE_APIC)) { + disable_apic = 1; + pr_info("APIC disabled by BIOS\n"); + return APIC_PIC; + } +#else + /* +* On 32-bit, check whether there is a separate chip or integrated Change the comment to: On 32-bit, the APIC may be a separated chip(82489DX) or integrated chip. if BSP doesn't has APIC feature, we can sure there is no integrated chip, but can not be sure there is no independent chip. 
So check two situation when BSP doesn't has APIC feature. +* APIC +*/ + + /* Has a separate chip ? */ If there is also no SMP configuration, we can be sure there is no separated chip. Switch the interrupt delivery node to APIC_PIC directly. + if (!boot_cpu_has(X86_FEATURE_APIC) && !smp_found_config) { + disable_apic = 1; + + return APIC_PIC; + } + Here you said several times you are checking if APIC is integrated, but you always check boot_cpu_has(X86_FEATURE_APIC), and you also check smp_found_config in above case. Can you make the comment match the code? Yes. E.g if (!boot_cpu_has(X86_FEATURE_APIC)), cpu doesn't support lapic, just return, you can remove the CONFIG_X86_64 check to make it a common check. And we have lapic_is_integrated() to check if lapic is integrated. I am sorry my comment may confuse you. our target is checking if BIOS supports APIC, no matter what type(separated/integrated) it is. The new logic 1) as you said may like : if (!boot_cpu_has(X86_FEATURE_APIC)) return ... if (lapic_is_integrated()) return ... here we miss (!boot_cpu_has(X86_FEATURE_APIC) && smp_found_config) for a separated chip. Besides, we are saying lapic is integrated with ioapic in a single chip, right? I found MP-Spec mention it. If yes, could you add more words to Yes, 82489DX – it was a discrete chip that functioned both as local and I/O APIC make it more specific and precise? Then people can get the exact Indeed, I will. Please see the modification of comments information from the comment and code. Thanks Baoquan + /* Has a local APIC ? */ Sanity check if the BIOS pretends there is one local APIC. Thanks, dou. 
+ if (!boot_cpu_has(X86_FEATURE_APIC) && + APIC_INTEGRATED(boot_cpu_apic_version)) { + disable_apic = 1; + pr_err(FW_BUG "Local APIC %d not detected, force emulation\n", + boot_cpu_physical_apicid); + + return APIC_PIC; + } +#endif + + /* Check MP table or ACPI MADT configuration */ + if (!smp_found_config) { + disable_ioapic_support(); + + if (!acpi_lapic) + pr_info("APIC: ACPI MADT or MP tables are not detected\n"); + + return APIC_VIRTUAL_WIRE; + } + + return APIC_SYMMETRIC_IO; +} + /* * An initial setup of the virtual wire mode. */ -- 2.5.5
Re: [PATCH 4/4][RFC v2] x86/apic: Spread the vectors by choosing the idlest CPU
Thanks for looking at this, (please bear my slow response as I need to check related code before replying.) On Sun, Sep 03, 2017 at 08:17:04PM +0200, Thomas Gleixner wrote: > On Fri, 1 Sep 2017, Chen Yu wrote: > > > This is the major logic to spread the vectors on different CPUs. > > The main idea is to choose the 'idlest' CPU which has assigned > > the least number of vectors as the candidate/hint for the vector > > allocation domain, in the hope that the vector allocation domain > > could leverage this hint to generate corresponding cpumask. > > > > One of the requirements to do this vector spreading work comes from the > > following hibernation problem found on a 16 cores server: > > > > CPU 31 disable failed: CPU has 62 vectors assigned and there > > are only 0 available. > > > > 2 issues were found after investigation: > > > > 1. The network driver has declared many vector resources via > >pci_enable_msix_range(), say, this driver might likely want > >to reserve 6 per logical CPU, then there would be 192 of them. > > 2. Besides, most of the vectors reserved by this driver are assigned > >on CPU0 due to the current code strategy, so there would be > >insufficient slots left on CPU0 to receive any migrated IRQs > >during CPU offine. > > > > In theory after the commit c5cb83bb337c ("genirq/cpuhotplug: Handle > > managed IRQs on CPU hotplug") the issue 1 has been solved with the > > managed interrupt mechanism for multi queue devices, which no longer > > migrate these interrupts. It simply shuts them down and restarts > > them when the CPU comes back online. > > > > However, according to the implementation of this network driver, > > the commit mentioned above does not fully fix the issue 1. > > Here's the framework of the network driver: > > > > step 1. Reserved enough irq vectors and corresponding IRQs. > > step 2. If the network is activated, invoke request_irq() to > > register the handler. > > step 3. 
Invoke set_affinity() to spread the IRQs onto different > > CPUs, thus to spread the vectors too. > > > > The problem is, if the network cable is not connected, step 2 > > and 3 will not get invoked, thus the IRQ vectors will not spread > > on different CPUs and will still be allocated on CPU0. As a result > > the CPUs will still get the offline failure. > > So the proper solution for this network driver is to switch to managed > interrupts instead of trying to work around it in some other place. It's > using the wrong mechanism - end of story. > > Why are you insisting on implementing a band aid for this particular driver > instead of fixing the underlying problem of that driver which requires to > have 32 queues and interrupts open even if only a single CPU is online? > I agree, the driver could be rewritten, but it might take some time, so meanwhile I'm looking at also other possible optimization. > > Previously there were some discussion in the thread [1] about the > > vector spread, and here are the suggestion from Thomas: > > > > Q1: > > Rewrite that allocation code from scratch, use per cpu bitmaps, > > so we can do fast search over masks and keep track of > > the vectors which are used/free per cpu. > > A1: > > per cpu bitmap was onced considered but this might not be > > as fast as the solution proposed here. That is, if we introduce the > > per cpu bitmap, we need to count the number of vectors assigned on > > each CPUs by cpumask_weight() and sort them in order to get the > > CPUs who have the least number of vectors assigned. By contrast, > > if we introduce the bitmaps for each vector number, we can get the > > 'idle' CPU at the cost of nearly O(1). > > The weight accounting with the cpumasks is an orthogonal issue to the per > cpu vector bitmaps. > > Once you selected a CPU the current code still loops in circles instead of > just searching a bitmap. Yes, I agree. 
> > And if the required affinity mask needs more than one CPU then this is > still not giving you the right answer. You assign perhaps a vector to the > least busy CPU, but the other CPUs which happen to be in the mask are going > to be randomly selected. Yes, for multi cpumask, it might depends on how vector domain behaves. > > I'm not against making the vector allocation better, but certainly not by > adding yet more duct tape to something which is well known as one of the > dumbest search algorithms on the planet with a worst case of > >nvectors * nr_online_cpus * nr_online_cpus_in_affinity_mask > > and your mechanism nests another loop of potentially NR_VECTORS into > that. Which is pointless as the actual assignable vector space is > smaller. While x86 has 256 vectors the assignable vector space is way > smaller and non continous: > > Vectors 0- 31 are reserved for CPU traps and exceptions > Vectors 239-255 are reserved by the kernel for IPIs, local timer etc. > Vector 32 can be reserved by the kernel for the
[PATCH 1/1] workqueue: use type int instead of bool to index array
From: zijun_hu Type bool is used to index three arrays in alloc_and_link_pwqs(); that is unconventional, so fix it by using type int to index the relevant arrays. Signed-off-by: zijun_hu --- kernel/workqueue.c | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/kernel/workqueue.c b/kernel/workqueue.c index ab3c0dc8c7ed..0c7a16792949 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -3917,7 +3917,7 @@ static void wq_update_unbound_numa(struct workqueue_struct *wq, int cpu, static int alloc_and_link_pwqs(struct workqueue_struct *wq) { - bool highpri = wq->flags & WQ_HIGHPRI; + int pool_idx = (wq->flags & WQ_HIGHPRI) ? 1 : 0; int cpu, ret; if (!(wq->flags & WQ_UNBOUND)) { @@ -3931,7 +3931,7 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq) struct worker_pool *cpu_pools = per_cpu(cpu_worker_pools, cpu); - init_pwq(pwq, wq, &cpu_pools[highpri]); + init_pwq(pwq, wq, &cpu_pools[pool_idx]); mutex_lock(&wq->mutex); link_pwq(pwq); @@ -3939,14 +3939,15 @@ static int alloc_and_link_pwqs(struct workqueue_struct *wq) } return 0; } else if (wq->flags & __WQ_ORDERED) { - ret = apply_workqueue_attrs(wq, ordered_wq_attrs[highpri]); + ret = apply_workqueue_attrs(wq, ordered_wq_attrs[pool_idx]); /* there should only be single pwq for ordering guarantee */ WARN(!ret && (wq->pwqs.next != &wq->dfl_pwq->pwqs_node || wq->pwqs.prev != &wq->dfl_pwq->pwqs_node), "ordering guarantee broken for workqueue %s\n", wq->name); return ret; } else { - return apply_workqueue_attrs(wq, unbound_std_wq_attrs[highpri]); + return apply_workqueue_attrs(wq, + unbound_std_wq_attrs[pool_idx]); } } -- 1.9.1
SME/32-bit regression
It appears there is a regression for 32-bit kernels due to SME changes. I bisected my particular problem (Xen PV guest) to 21729f81ce8ae76a6995681d40e16f7ce8075db4 but I also saw pmd_clear_bad() errors on baremetal. This seems to be caused by sme_me_mask being an unsigned long as opposed to phys_addr_t (the actual problem is that __PHYSICAL_MASK is truncated). When I declare it as u64 and drop unsigned long cast in __sme_set()/__sme_clr() the problem goes way. (This presumably won't work for non-PAE which I haven't tried). -boris
Re: [PATCH 3/3] f2fs: support flexible inline xattr size
Hi Jaegeuk, Do we have time to test and stabilize this new feature before merge window? Thanks, On 2017/9/4 18:58, Chao Yu wrote: > Now, in product, more and more features based on file encryption were > introduced, their demand of xattr space is increasing, however, inline > xattr has fixed-size of 200 bytes, once inline xattr space is full, new > increased xattr data would occupy additional xattr block which may bring > us more space usage and performance regression during persisting. > > In order to resolve above issue, it's better to expand inline xattr size > flexibly according to user's requirement. > > So this patch introduces new filesystem feature 'flexible inline xattr', > and new mount option 'inline_xattr_size=%u', once mkfs enables the > feature, we can use the option to make f2fs supporting flexible inline > xattr size. > > To support this feature, we add extra attribute i_inline_xattr_size in > inode layout, indicating that how many space inline xattr borrows from > block address mapping space in inode layout, by this, we can easily > locate and store flexible-sized inline xattr data in inode. > > Inode disk layout: > +--+ > | .i_mode | > | ... | > | .i_ext | > +--+ > | .i_extra_isize | > | .i_inline_xattr_size |---+ > | ... 
| | > +--+ | > | .i_addr | | > | - block address or | | > | - inline data | | > +--+<---+ v > |inline xattr |+---inline xattr range > +--+<---+ > | .i_nid | > +--+ > | node_footer| > | (nid, ino, offset) | > +--+ > > Signed-off-by: Chao Yu > --- > fs/f2fs/f2fs.h | 43 ++- > fs/f2fs/inode.c | 12 > fs/f2fs/namei.c | 6 ++ > fs/f2fs/node.c | 4 ++-- > fs/f2fs/super.c | 32 +++- > fs/f2fs/sysfs.c | 7 +++ > fs/f2fs/xattr.c | 18 +- > include/linux/f2fs_fs.h | 5 +++-- > 8 files changed, 100 insertions(+), 27 deletions(-) > > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h > index 1f24ad4ca1bb..168ad51b7fb9 100644 > --- a/fs/f2fs/f2fs.h > +++ b/fs/f2fs/f2fs.h > @@ -93,6 +93,7 @@ extern char *fault_name[FAULT_MAX]; > #define F2FS_MOUNT_GRPQUOTA 0x0010 > #define F2FS_MOUNT_PRJQUOTA 0x0020 > #define F2FS_MOUNT_QUOTA 0x0040 > +#define F2FS_MOUNT_INLINE_XATTR_SIZE 0x0080 > > #define clear_opt(sbi, option) ((sbi)->mount_opt.opt &= > ~F2FS_MOUNT_##option) > #define set_opt(sbi, option) ((sbi)->mount_opt.opt |= F2FS_MOUNT_##option) > @@ -118,6 +119,7 @@ struct f2fs_mount_info { > #define F2FS_FEATURE_EXTRA_ATTR 0x0008 > #define F2FS_FEATURE_PRJQUOTA0x0010 > #define F2FS_FEATURE_INODE_CHKSUM0x0020 > +#define F2FS_FEATURE_FLEXIBLE_INLINE_XATTR 0x0040 > > #define F2FS_HAS_FEATURE(sb, mask) \ > ((F2FS_SB(sb)->raw_super->feature & cpu_to_le32(mask)) != 0) > @@ -379,11 +381,14 @@ struct f2fs_flush_device { > > /* for inline stuff */ > #define DEF_INLINE_RESERVED_SIZE 1 > +#define DEF_MIN_INLINE_SIZE 1 > static inline int get_extra_isize(struct inode *inode); > -#define MAX_INLINE_DATA(inode) (sizeof(__le32) * \ > - (CUR_ADDRS_PER_INODE(inode) - \ > - DEF_INLINE_RESERVED_SIZE - \ > - F2FS_INLINE_XATTR_ADDRS)) > +static inline int get_inline_xattr_addrs(struct inode *inode); > +#define F2FS_INLINE_XATTR_ADDRS(inode) get_inline_xattr_addrs(inode) > +#define MAX_INLINE_DATA(inode) (sizeof(__le32) * > \ > + (CUR_ADDRS_PER_INODE(inode) - \ > + F2FS_INLINE_XATTR_ADDRS(inode) -\ > + 
DEF_INLINE_RESERVED_SIZE)) > > /* for inline dir */ > #define NR_INLINE_DENTRY(inode) (MAX_INLINE_DATA(inode) * BITS_PER_BYTE > / \ > @@ -592,6 +597,7 @@ struct f2fs_inode_info { > > int i_extra_isize; /* size of extra space located in > i_addr */ > kprojid_t i_projid; /* id for project quota */ > + int i_inline_xattr_size;/* inline xattr size */ > }; > > static inline void get_extent_info(struct extent_info *ext, > @@ -1043,6 +1049,7 @@ struct f2fs_sb_info { > loff_t max_file_blocks; /* max block index of file */ > int active_logs;/* # of active logs */ > int dir_level; /* directory level */ > + int inline_xattr_size; /* inline xattr
kvm: use-after-free in irq_bypass_register_consumer
Got the following report while fuzzing 4.9.47. It seems that this bug has already been reported by Dmitry Vyukov [https://lkml.org/lkml/2017/1/27/828], but I can still reproduce it on the latest Ubuntu 16.04 kernel (4.4.0-94-generic). PoC: https://gist.githubusercontent.com/dvyukov/3411518fa22baff19ae204cc46a6/raw/7826cbcd1cbc472dfa4972fe56371df3c94b70c7/gistfile1.txt === BUG: KASAN: use-after-free in __list_add include/linux/list.h:43 [inline] at addr 88007ab9dc80 BUG: KASAN: use-after-free in list_add include/linux/list.h:63 [inline] at addr 88007ab9dc80 BUG: KASAN: use-after-free in irq_bypass_register_consumer+0x3a3/0x420 virt/lib/irqbypass.c:217 at addr 88007ab9dc80 Write of size 8 by task syz-executor4/2970 CPU: 0 PID: 2970 Comm: syz-executor4 Tainted: GB 4.9.47 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 8800760df9c0 81ad97d9 88007f803080 88007ab9db80 88007ab9dd80 88007ab9da00 8800760df9e8 8153892c 8800760dfa78 88007f803080 8800776f7d20 8800760dfa68 Call Trace: [] __dump_stack lib/dump_stack.c:15 [inline] [] dump_stack+0x83/0xba lib/dump_stack.c:51 [] kasan_object_err+0x1c/0x70 mm/kasan/report.c:160 [] print_address_description mm/kasan/report.c:198 [inline] [] kasan_report_error+0x1f0/0x4e0 mm/kasan/report.c:287 [] kasan_report mm/kasan/report.c:309 [inline] [] __asan_report_store8_noabort+0x3e/0x40 mm/kasan/report.c:335 [] __list_add include/linux/list.h:43 [inline] [] list_add include/linux/list.h:63 [inline] [] irq_bypass_register_consumer+0x3a3/0x420 virt/lib/irqbypass.c:217 [] kvm_irqfd_assign arch/x86/kvm/../../../virt/kvm/eventfd.c:417 [inline] [] kvm_irqfd+0x1095/0x1840 arch/x86/kvm/../../../virt/kvm/eventfd.c:572 [] kvm_vm_ioctl+0x2bc/0x1580 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2999 [] vfs_ioctl fs/ioctl.c:43 [inline] [] do_vfs_ioctl+0x18c/0xf80 fs/ioctl.c:679 [] SYSC_ioctl fs/ioctl.c:694 [inline] [] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:685 [] entry_SYSCALL_64_fastpath+0x1a/0xa9 Object at 88007ab9db80, in cache 
kmalloc-512 size: 512 Allocated: PID = 2970 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57 save_stack+0x46/0xd0 mm/kasan/kasan.c:495 set_track mm/kasan/kasan.c:507 [inline] kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:598 kmem_cache_alloc_trace include/linux/slab.h:391 [inline] kmalloc include/linux/slab.h:490 [inline] kzalloc include/linux/slab.h:636 [inline] kvm_irqfd_assign arch/x86/kvm/../../../virt/kvm/eventfd.c:296 [inline] kvm_irqfd+0xbd/0x1840 arch/x86/kvm/../../../virt/kvm/eventfd.c:572 kvm_vm_ioctl+0x2bc/0x1580 arch/x86/kvm/../../../virt/kvm/kvm_main.c:2999 vfs_ioctl fs/ioctl.c:43 [inline] do_vfs_ioctl+0x18c/0xf80 fs/ioctl.c:679 SYSC_ioctl fs/ioctl.c:694 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:685 entry_SYSCALL_64_fastpath+0x1a/0xa9 Freed: PID = 1098 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:57 save_stack+0x46/0xd0 mm/kasan/kasan.c:495 set_track mm/kasan/kasan.c:507 [inline] kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:571 slab_free_hook mm/slub.c:1355 [inline] slab_free_freelist_hook mm/slub.c:1377 [inline] slab_free mm/slub.c:2958 [inline] kfree+0xa0/0x150 mm/slub.c:3878 irqfd_shutdown+0x137/0x1a0 arch/x86/kvm/../../../virt/kvm/eventfd.c:148 process_one_work+0x87c/0x1170 kernel/workqueue.c:2096 worker_thread+0xed/0x14e0 kernel/workqueue.c:2230 kthread+0x220/0x2a0 kernel/kthread.c:211 ret_from_fork+0x25/0x30 arch/x86/entry/entry_64.S:433 Memory state around the buggy address: 88007ab9db80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb 88007ab9dc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >88007ab9dc80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ 88007ab9dd00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb 88007ab9dd80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc -- Regards, idaifish
[PATCH 2/3] KVM: Add struct kvm_vcpu pointer parameter to get_enable_apicv()
Modify struct kvm_x86_ops.get_enable_apicv() to take a struct kvm_vcpu pointer as parameter in preparation for subsequent changes. Signed-off-by: Suravee Suthikulpanit --- arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/svm.c | 2 +- arch/x86/kvm/vmx.c | 2 +- arch/x86/kvm/x86.c | 2 +- 4 files changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index f4d120a..3cdde44 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -969,7 +969,7 @@ struct kvm_x86_ops { void (*enable_nmi_window)(struct kvm_vcpu *vcpu); void (*enable_irq_window)(struct kvm_vcpu *vcpu); void (*update_cr8_intercept)(struct kvm_vcpu *vcpu, int tpr, int irr); - bool (*get_enable_apicv)(void); + bool (*get_enable_apicv)(struct kvm_vcpu *vcpu); void (*refresh_apicv_exec_ctrl)(struct kvm_vcpu *vcpu); void (*hwapic_irr_update)(struct kvm_vcpu *vcpu, int max_irr); void (*hwapic_isr_update)(struct kvm_vcpu *vcpu, int isr); diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 316edbf..d1b3eb4 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -4386,7 +4386,7 @@ static void svm_set_virtual_x2apic_mode(struct kvm_vcpu *vcpu, bool set) return; } -static bool svm_get_enable_apicv(void) +static bool svm_get_enable_apicv(struct kvm_vcpu *vcpu) { return avic; } diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index c6ef294..8a7ab16 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -4983,7 +4983,7 @@ static void vmx_disable_intercept_msr_x2apic(u32 msr, int type, bool apicv_activ } } -static bool vmx_get_enable_apicv(void) +static bool vmx_get_enable_apicv(struct kvm_vcpu *vcpu) { return enable_apicv; } diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 05a5e57..b8c40ca 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -7946,7 +7946,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu) BUG_ON(vcpu->kvm == NULL); kvm = vcpu->kvm; - vcpu->arch.apicv_active = kvm_x86_ops->get_enable_apicv(); + vcpu->arch.apicv_active = kvm_x86_ops->get_enable_apicv(vcpu); vcpu->arch.pv.pv_unhalted = false; vcpu->arch.emulate_ctxt.ops = &emulate_ops; if (!irqchip_in_kernel(kvm) || kvm_vcpu_is_reset_bsp(vcpu)) -- 1.8.3.1
[PATCH 1/3] KVM: SVM: Refactor AVIC vcpu initialization into avic_init_vcpu()
Preparing the base code for subsequent changes. This does not change existing logic. Signed-off-by: Suravee Suthikulpanit --- arch/x86/kvm/svm.c | 28 1 file changed, 20 insertions(+), 8 deletions(-) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index af256b7..316edbf 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1587,6 +1587,23 @@ static void svm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) avic_update_vapic_bar(svm, APIC_DEFAULT_PHYS_BASE); } +static int avic_init_vcpu(struct vcpu_svm *svm) +{ + int ret; + + if (!avic) + return 0; + + ret = avic_init_backing_page(&svm->vcpu); + if (ret) + return ret; + + INIT_LIST_HEAD(&svm->ir_list); + spin_lock_init(&svm->ir_list_lock); + + return ret; +} + static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) { struct vcpu_svm *svm; @@ -1623,14 +1640,9 @@ static struct kvm_vcpu *svm_create_vcpu(struct kvm *kvm, unsigned int id) if (!hsave_page) goto free_page3; - if (avic) { - err = avic_init_backing_page(&svm->vcpu); - if (err) - goto free_page4; - - INIT_LIST_HEAD(&svm->ir_list); - spin_lock_init(&svm->ir_list_lock); - } + err = avic_init_vcpu(svm); + if (err) + goto free_page4; /* We initialize this flag to true to make sure that the is_running * bit would be set the first time the vcpu is loaded. -- 1.8.3.1
[PATCH 0/3] KVM: SVM: Fix guest not booting w/ AVIC enabled
Certain QEMU options fail to boot a VM guest w/ SVM AVIC enabled (e.g. modprobe kvm_amd avic=1). Investigation shows that this is mainly due to the AVIC hardware not trapping into the hypervisor when the guest OS writes to the APIC_EOI register. The boot hang is caused by a missing timer interrupt when using the in-kernel PIT model (e.g. launching qemu w/ the '-no-hpet' option), since it requires irq acknowledgment before injecting another interrupt when irq re-injection is enabled (the default). Suravee Suthikulpanit (3): KVM: SVM: Refactor AVIC vcpu initialization into avic_init_vcpu() KVM: Add struct kvm_vcpu pointer parameter to get_enable_apicv() KVM: SVM: Add irqchip_split() checks before enabling AVIC arch/x86/include/asm/kvm_host.h | 2 +- arch/x86/kvm/svm.c | 43 - arch/x86/kvm/vmx.c | 2 +- arch/x86/kvm/x86.c | 2 +- 4 files changed, 33 insertions(+), 16 deletions(-) -- 1.8.3.1
Re: [PATCH RFC v3 0/4] Add cross-compilation support to eBPF samples
Hi Alexei, On Tue, Sep 5, 2017 at 8:24 PM, Alexei Starovoitov wrote: > On Sun, Sep 03, 2017 at 11:23:21AM -0700, Joel Fernandes wrote: >> These patches fix issues seen when cross-compiling eBPF samples on arm64. >> Compared to [1], I dropped the controversial inline-asm patch pending further >> discussion on the right way to do it. However these patches are still a step >> in >> the right direction and I wanted them to get in before the more controversial >> bit. > > All patches in this set look good to me. > When you repost without changes after net-next reopens pls keep my: > Acked-by: Alexei Starovoitov Thanks a lot! will do. Regards, Joel
Re: [PATCH RFC v3 0/4] Add cross-compilation support to eBPF samples
On Sun, Sep 03, 2017 at 11:23:21AM -0700, Joel Fernandes wrote: > These patches fix issues seen when cross-compiling eBPF samples on arm64. > Compared to [1], I dropped the controversial inline-asm patch pending further > discussion on the right way to do it. However these patches are still a step > in > the right direction and I wanted them to get in before the more controversial > bit. All patches in this set look good to me. When you repost without changes after net-next reopens pls keep my: Acked-by: Alexei Starovoitov
Re: [PATCH] f2fs: fix wrong bfree and bavail
On 2017/9/6 11:10, Jaegeuk Kim wrote: > On 09/06, Chao Yu wrote: >> On 2017/9/6 5:06, Jaegeuk Kim wrote: >>> This patch fixes bavail and bfree finally. >>> >>> long f_bfree; /* free blocks in fs */ >>> long f_bavail; /* free blocks avail to non-superuser */ >>> >>> So, bfree represents all the free blocks including the reserved ones, while >>> bavail represents only the space visible to a normal user. >>> >>> Fixes: 3e6d0b4d9c1c ("f2fs: fix incorrect f_bfree calculation in ->statfs") >>> Cc: # 4.8+ >>> Signed-off-by: Jaegeuk Kim >>> --- >>> fs/f2fs/super.c | 4 ++-- >>> 1 file changed, 2 insertions(+), 2 deletions(-) >>> >>> diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c >>> index 7e29db227910..17cca4cb93af 100644 >>> --- a/fs/f2fs/super.c >>> +++ b/fs/f2fs/super.c >>> @@ -955,8 +955,8 @@ static int f2fs_statfs(struct dentry *dentry, struct >>> kstatfs *buf) >>> buf->f_bsize = sbi->blocksize; >>> >>> buf->f_blocks = total_count - start_count; >>> - buf->f_bfree = user_block_count - valid_user_blocks(sbi) + ovp_count; >>> - buf->f_bavail = user_block_count - valid_user_blocks(sbi) - >>> + buf->f_bfree = user_block_count - valid_user_blocks(sbi); >>> + buf->f_bavail = user_block_count - valid_user_blocks(sbi) - ovp_count - >> >> set_cp(user_block_count, ((get_cp(free_segment_count) + 6 - >> get_cp(overprov_segment_count)) * c.blks_per_seg)); >> >> As we calculated in mkfs, we have removed ovp_count from user_block_count, so >> shouldn't we add ovp_count to f_bfree here? > > Hehe, I already dropped this. :P Alright. :) > It seems like something is wrong with df in android. Maybe its version is too old? Thanks, > >> >> Thanks, >> >>> sbi->reserved_blocks; >>> >>> avail_node_count = sbi->total_node_count - F2FS_RESERVED_NODE_NUM; >>> > > . >
Re: [f2fs-dev] [PATCH 2/2] f2fs: use generic terms used for encrypted block management
On 2017/9/6 8:15, Jaegeuk Kim wrote: > This patch renames functions regarding buffer management via META_MAPPING, > used especially for encrypted blocks. We can actually use them in a generic way. Meta inode cache is more like bd_inode cache now. ;) Reviewed-by: Chao Yu Thanks, > > Signed-off-by: Jaegeuk Kim > --- > fs/f2fs/data.c| 6 +++--- > fs/f2fs/f2fs.h| 3 +-- > fs/f2fs/file.c| 2 +- > fs/f2fs/gc.c | 13 + > fs/f2fs/segment.c | 3 +-- > 5 files changed, 15 insertions(+), 12 deletions(-) > > diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c > index e6c683e7a10e..ee6801fdbdec 100644 > --- a/fs/f2fs/data.c > +++ b/fs/f2fs/data.c > @@ -1163,7 +1163,7 @@ static struct bio *f2fs_grab_bio(struct inode *inode, > block_t blkaddr, > return ERR_CAST(ctx); > > /* wait the page to be moved by cleaning */ > - f2fs_wait_on_encrypted_page_writeback(sbi, blkaddr); > + f2fs_wait_on_block_writeback(sbi, blkaddr); > } > > bio = bio_alloc(GFP_KERNEL, min_t(int, nr_pages, BIO_MAX_PAGES)); > @@ -1349,7 +1349,7 @@ static int encrypt_one_page(struct f2fs_io_info *fio) > return 0; > > /* wait for GCed encrypted page writeback */ > - f2fs_wait_on_encrypted_page_writeback(fio->sbi, fio->old_blkaddr); > + f2fs_wait_on_block_writeback(fio->sbi, fio->old_blkaddr); > > retry_encrypt: > fio->encrypted_page = fscrypt_encrypt_page(inode, fio->page, > @@ -1975,7 +1975,7 @@ static int f2fs_write_begin(struct file *file, struct > address_space *mapping, > > /* wait for GCed encrypted page writeback */ > if (f2fs_encrypted_file(inode)) > - f2fs_wait_on_encrypted_page_writeback(sbi, blkaddr); > + f2fs_wait_on_block_writeback(sbi, blkaddr); > > if (len == PAGE_SIZE || PageUptodate(page)) > return 0; > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h > index 417c28f01fd5..91fb749686f2 100644 > --- a/fs/f2fs/f2fs.h > +++ b/fs/f2fs/f2fs.h > @@ -2550,8 +2550,7 @@ void allocate_data_block(struct f2fs_sb_info *sbi, > struct page *page, > struct f2fs_io_info *fio, bool add_list); > void f2fs_wait_on_page_writeback(struct 
page *page, > enum page_type type, bool ordered); > -void f2fs_wait_on_encrypted_page_writeback(struct f2fs_sb_info *sbi, > - block_t blkaddr); > +void f2fs_wait_on_block_writeback(struct f2fs_sb_info *sbi, block_t blkaddr); > void write_data_summaries(struct f2fs_sb_info *sbi, block_t start_blk); > void write_node_summaries(struct f2fs_sb_info *sbi, block_t start_blk); > int lookup_journal_in_cursum(struct f2fs_journal *journal, int type, > diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c > index f8cedaba7ce4..224379a9848c 100644 > --- a/fs/f2fs/file.c > +++ b/fs/f2fs/file.c > @@ -107,7 +107,7 @@ static int f2fs_vm_page_mkwrite(struct vm_fault *vmf) > > /* wait for GCed encrypted page writeback */ > if (f2fs_encrypted_file(inode)) > - f2fs_wait_on_encrypted_page_writeback(sbi, dn.data_blkaddr); > + f2fs_wait_on_block_writeback(sbi, dn.data_blkaddr); > > out_sem: > up_read(&F2FS_I(inode)->i_mmap_sem); > diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c > index 6a12f33d0cdd..bfe6a8ccc3a0 100644 > --- a/fs/f2fs/gc.c > +++ b/fs/f2fs/gc.c > @@ -599,8 +599,12 @@ static bool is_alive(struct f2fs_sb_info *sbi, struct > f2fs_summary *sum, > return true; > } > > -static void move_encrypted_block(struct inode *inode, block_t bidx, > - unsigned int segno, int > off) > +/* > + * Move data block via META_MAPPING while keeping locked data page. > + * This can be used to move blocks, aka LBAs, directly on disk. 
> + */ > +static void move_data_block(struct inode *inode, block_t bidx, > + unsigned int segno, int off) > { > struct f2fs_io_info fio = { > .sbi = F2FS_I_SB(inode), > @@ -873,9 +877,10 @@ static void gc_data_segment(struct f2fs_sb_info *sbi, > struct f2fs_summary *sum, > start_bidx = start_bidx_of_node(nofs, inode) > + ofs_in_node; > if (f2fs_encrypted_file(inode)) > - move_encrypted_block(inode, start_bidx, segno, > off); > + move_data_block(inode, start_bidx, segno, off); > else > - move_data_page(inode, start_bidx, gc_type, > segno, off); > + move_data_page(inode, start_bidx, gc_type, > + segno, off); > > if (locked) { > up_write(&fi->dio_rwsem[WRITE]); > diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c > index
Re: [PATCH] md/raid5: preserve STRIPE_ON_UNPLUG_LIST in break_stripe_batch_list
Re: [f2fs-dev] [PATCH 2/2] f2fs: use generic terms used for encrypted block management
On 2017/9/6 8:15, Jaegeuk Kim wrote: > This patch renames functions regarding to buffer management via META_MAPPING > used for encrypted blocks especially. We can actually use them in generic way. Meta inode cache is more like bd_inode cache now. ;) Reviewed-by: Chao Yu Thanks, > > Signed-off-by: Jaegeuk Kim > --- > fs/f2fs/data.c| 6 +++--- > fs/f2fs/f2fs.h| 3 +-- > fs/f2fs/file.c| 2 +- > fs/f2fs/gc.c | 13 + > fs/f2fs/segment.c | 3 +-- > 5 files changed, 15 insertions(+), 12 deletions(-) > > diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c > index e6c683e7a10e..ee6801fdbdec 100644 > --- a/fs/f2fs/data.c > +++ b/fs/f2fs/data.c > @@ -1163,7 +1163,7 @@ static struct bio *f2fs_grab_bio(struct inode *inode, > block_t blkaddr, > return ERR_CAST(ctx); > > /* wait the page to be moved by cleaning */ > - f2fs_wait_on_encrypted_page_writeback(sbi, blkaddr); > + f2fs_wait_on_block_writeback(sbi, blkaddr); > } > > bio = bio_alloc(GFP_KERNEL, min_t(int, nr_pages, BIO_MAX_PAGES)); > @@ -1349,7 +1349,7 @@ static int encrypt_one_page(struct f2fs_io_info *fio) > return 0; > > /* wait for GCed encrypted page writeback */ > - f2fs_wait_on_encrypted_page_writeback(fio->sbi, fio->old_blkaddr); > + f2fs_wait_on_block_writeback(fio->sbi, fio->old_blkaddr); > > retry_encrypt: > fio->encrypted_page = fscrypt_encrypt_page(inode, fio->page, > @@ -1975,7 +1975,7 @@ static int f2fs_write_begin(struct file *file, struct > address_space *mapping, > > /* wait for GCed encrypted page writeback */ > if (f2fs_encrypted_file(inode)) > - f2fs_wait_on_encrypted_page_writeback(sbi, blkaddr); > + f2fs_wait_on_block_writeback(sbi, blkaddr); > > if (len == PAGE_SIZE || PageUptodate(page)) > return 0; > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h > index 417c28f01fd5..91fb749686f2 100644 > --- a/fs/f2fs/f2fs.h > +++ b/fs/f2fs/f2fs.h > @@ -2550,8 +2550,7 @@ void allocate_data_block(struct f2fs_sb_info *sbi, > struct page *page, > struct f2fs_io_info *fio, bool add_list); > void 
f2fs_wait_on_page_writeback(struct page *page, > enum page_type type, bool ordered); > -void f2fs_wait_on_encrypted_page_writeback(struct f2fs_sb_info *sbi, > - block_t blkaddr); > +void f2fs_wait_on_block_writeback(struct f2fs_sb_info *sbi, block_t blkaddr); > void write_data_summaries(struct f2fs_sb_info *sbi, block_t start_blk); > void write_node_summaries(struct f2fs_sb_info *sbi, block_t start_blk); > int lookup_journal_in_cursum(struct f2fs_journal *journal, int type, > diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c > index f8cedaba7ce4..224379a9848c 100644 > --- a/fs/f2fs/file.c > +++ b/fs/f2fs/file.c > @@ -107,7 +107,7 @@ static int f2fs_vm_page_mkwrite(struct vm_fault *vmf) > > /* wait for GCed encrypted page writeback */ > if (f2fs_encrypted_file(inode)) > - f2fs_wait_on_encrypted_page_writeback(sbi, dn.data_blkaddr); > + f2fs_wait_on_block_writeback(sbi, dn.data_blkaddr); > > out_sem: > up_read(_I(inode)->i_mmap_sem); > diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c > index 6a12f33d0cdd..bfe6a8ccc3a0 100644 > --- a/fs/f2fs/gc.c > +++ b/fs/f2fs/gc.c > @@ -599,8 +599,12 @@ static bool is_alive(struct f2fs_sb_info *sbi, struct > f2fs_summary *sum, > return true; > } > > -static void move_encrypted_block(struct inode *inode, block_t bidx, > - unsigned int segno, int > off) > +/* > + * Move data block via META_MAPPING while keeping locked data page. > + * This can be used to move blocks, aka LBAs, directly on disk. 
> + */ > +static void move_data_block(struct inode *inode, block_t bidx, > + unsigned int segno, int off) > { > struct f2fs_io_info fio = { > .sbi = F2FS_I_SB(inode), > @@ -873,9 +877,10 @@ static void gc_data_segment(struct f2fs_sb_info *sbi, > struct f2fs_summary *sum, > start_bidx = start_bidx_of_node(nofs, inode) > + ofs_in_node; > if (f2fs_encrypted_file(inode)) > - move_encrypted_block(inode, start_bidx, segno, > off); > + move_data_block(inode, start_bidx, segno, off); > else > - move_data_page(inode, start_bidx, gc_type, > segno, off); > + move_data_page(inode, start_bidx, gc_type, > + segno, off); > > if (locked) { > up_write(>dio_rwsem[WRITE]); > diff --git a/fs/f2fs/segment.c b/fs/f2fs/segment.c > index 265c3bc44f2d..9e708e525ba8 100644 > ---
Re: [PATCH] md/raid5: preserve STRIPE_ON_UNPLUG_LIST in break_stripe_batch_list
> On Fri, Sep 01, 2017 at 05:26:48PM +0800, Dennis Yang wrote: > > >On Mon, Aug 28, 2017 at 08:01:59PM +0800, Dennis Yang wrote: > > >> break_stripe_batch_list() did not preserve STRIPE_ON_UNPLUG_LIST which is > > >> set when a stripe_head gets queued to the stripe_head list maintained by > > >> raid5_plug_cb and waiting for releasing after blk_unplug(). > > >> > > >> In release_stripe_plug(), if a stripe_head has its STRIPE_ON_UNPLUG_LIST > > >> set, it indicates that this stripe_head is already in the raid5_plug_cb > > >> list and release_stripe() would be called instead to drop a reference > > >> count. Otherwise, the STRIPE_ON_UNPLUG_LIST bit would be set for this > > >> stripe_head and it will get queued into the raid5_plug_cb list. > > >> > > >> Without preserving STRIPE_ON_UNPLUG_LIST in break_stripe_batch_list(), > > >> a stripe_head could be re-added to the raid5_plug_cb list while it is > > >> currently on that list and waiting to be released. This would mess up > > >> the raid5_plug_cb and leads to soft/hard lockup in raid5_unplug() or > > >> kernel crash. > > > > > >Did you observe this? From my understanding, this is impossible. If a > > >stripe is > > >in batch list (that's why we have break_stripe_batch_list), add_stripe_bio > > >will > > >bail out (because sh->batch_head != NULL), so the stripe is never added > > >into > > >plug list, hence has the STRIPE_ON_UNPLUG_LIST bit set. Am I missing > > >anything? > > > > > >Thanks, > > >Shaohua > > > > It is hard to reproduce. Our storage exports 4 iscsi LUNs which are built > > on > > top of a dm-thin pool with 8 x SSD RAID6. Each two of these iscsi LUNs are > > connected to a windows server which runs 4 fio jobs with 16 threads and > > iodepth = 16. In this setup, we can get the following call trace after > > couples > > of hours. The stack dump basically shows that the raid5_unplug is holding > > the > > conf->device_lock for a very long time. 
After adding some debug traces, we > > can > > confirm that the raid5_plug_cb->list is corrupted which sometimes leads to > > infinite loop in raid5_unplug. > > > > I didn't notice that add_stripe_bio will bail out if the stripe_head is in > > someone's batch list, but I am wondering if the following scenario will > > still > > happen. If a stripe_head A is added to another stripe_head B's batch list, > > in > > this case A will have its batch_head != NULL and be added into the plug > > list. > > After that, stripe_head B gets handled and called break_stripe_batch_list() > > to > > reset all the batched stripe_head(including A which is still on the plug > > list)'s > > state and reset their batch_head to NULL. Before the plug list gets > > processed, > > if there is another write request comes in and get stripe_head A, A will > > have > > its batch_head == NULL (cleared by calling break_stripe_batch_list() on B) > > and > > be added to plug list once again. > > > > Since we have the environment which can reproduce this issue in 6 hours, if > > there is anything we can do to help resolve this issue, we are more than > > willing to help. > > yes, I think you are right. If the stripe is on one thread's plug list, the > thread doesn't do unplug till the stripe is handled (because it's part of a > batch), we will have the stripe in plug list but not have the bit set. Can you > resend the patch with description what happens? BTW, does the patch fix your > issue? > > Thanks, > Shaohua Sure, I will resend a patch to you with updated commit message. We have already run the same stress test on the same environment for 6 days without any error. Since the test cannot run more than 6 hours before this patch, I think it does fix my issue. 
Thanks, Dennis > > > > Thanks, > > Dennis > > > > [ 4481.493998] BUG: spinlock lockup suspected on CPU#10, md1_raid6/5497 > > [ 4481.501109] lock: 0x88102b242888, .magic: dead4ead, .owner: > > kworker/u24:2/22322, .owner_cpu: 0 > > [ 4481.511231] CPU: 10 PID: 5497 Comm: md1_raid6 Tainted: PW O > > 4.2.8 #7 > > [ 4481.519504] Hardware name: Default string Default string/Default string, > > BIOS QY38AR06 01/24/2017 > > [ 4481.529429] 88102b242888 881014967c78 81c933b6 > > 0007 > > [ 4481.537739] 8805c9720a00 881014967c98 810d2c20 > > 88102b242888 > > [ 4481.546048] 82d0f7c8 881014967cc8 810d2d9b > > 88102b242888 > > [ 4481.554357] Call Trace: > > [ 4481.557093] [] dump_stack+0x4c/0x65 > > [ 4481.562840] [] spin_dump+0x80/0xd0 > > [ 4481.568489] [] do_raw_spin_lock+0x9b/0x120 > > [ 4481.574916] [] _raw_spin_lock_irq+0x41/0x50 > > [ 4481.581443] [] ? raid5d+0x64/0x790 > > [ 4481.587091] [] raid5d+0x64/0x790 > > [ 4481.592546] [] ? finish_wait+0x46/0x80 > > [ 4481.598583] [] ? finish_wait+0x68/0x80 > > [ 4481.604621] [] md_thread+0x13e/0x150 > > [ 4481.610464] [] ? wait_woken+0xb0/0xb0 > > [ 4481.616405] [] ?
Re: [PATCH v3 2/2] ip6_tunnel: fix ip6 tunnel lookup in collect_md mode
On 9/4/17 1:36 AM, Haishuang Yan wrote: In collect_md mode, if the tun dev is down, it still can call __ip6_tnl_rcv to receive on packets, and the rx statistics increase improperly. Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels") Cc: Alexei Starovoitov Signed-off-by: Haishuang Yan --- Change since v3: * Increment rx_dropped if tunnel device is not up, suggested by Pravin B Shelar * Fix wrong recipient address --- net/ipv6/ip6_tunnel.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c index 10a693a..e91d3b6 100644 --- a/net/ipv6/ip6_tunnel.c +++ b/net/ipv6/ip6_tunnel.c @@ -171,8 +171,11 @@ static struct net_device_stats *ip6_get_stats(struct net_device *dev) } t = rcu_dereference(ip6n->collect_md_tun); - if (t) - return t; + if (t) { + if (t->dev->flags & IFF_UP) + return t; + t->dev->stats.rx_dropped++; + } Why increment the stats only for this drop case? There are plenty of other conditions where packet will be dropped in ip6 tunnel. I think it's important to present consistent behavior to the users, so I'd increment drop stats either for all drop cases or for none. And today it's none. The ! IFF_UP case should probably be return NULL too.
Re: [f2fs-dev] [PATCH 1/2] f2fs: introduce f2fs_encrypted_file for clean-up
On 2017/9/6 8:15, Jaegeuk Kim wrote: > This patch replaces (f2fs_encrypted_inode() && S_ISREG()) with > f2fs_encrypted_file(), which gives no functional change. > > Signed-off-by: Jaegeuk Kim Reviewed-by: Chao Yu Thanks, > --- > fs/f2fs/data.c | 10 +- > fs/f2fs/f2fs.h | 5 + > fs/f2fs/file.c | 2 +- > fs/f2fs/gc.c | 5 ++--- > fs/f2fs/inline.c | 2 +- > 5 files changed, 14 insertions(+), 10 deletions(-) > > diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c > index 67da4f6eeaa0..e6c683e7a10e 100644 > --- a/fs/f2fs/data.c > +++ b/fs/f2fs/data.c > @@ -581,7 +581,7 @@ struct page *get_read_data_page(struct inode *inode, > pgoff_t index, > .encrypted_page = NULL, > }; > > - if (f2fs_encrypted_inode(inode) && S_ISREG(inode->i_mode)) > + if (f2fs_encrypted_file(inode)) > return read_mapping_page(mapping, index, NULL); > > page = f2fs_grab_cache_page(mapping, index, for_write); > @@ -786,7 +786,7 @@ static int __allocate_data_block(struct dnode_of_data *dn) > > static inline bool __force_buffered_io(struct inode *inode, int rw) > { > - return ((f2fs_encrypted_inode(inode) && S_ISREG(inode->i_mode)) || > + return (f2fs_encrypted_file(inode) || > (rw == WRITE && test_opt(F2FS_I_SB(inode), LFS)) || > F2FS_I_SB(inode)->s_ndevs); > } > @@ -1157,7 +1157,7 @@ static struct bio *f2fs_grab_bio(struct inode *inode, > block_t blkaddr, > struct fscrypt_ctx *ctx = NULL; > struct bio *bio; > > - if (f2fs_encrypted_inode(inode) && S_ISREG(inode->i_mode)) { > + if (f2fs_encrypted_file(inode)) { > ctx = fscrypt_get_ctx(inode, GFP_NOFS); > if (IS_ERR(ctx)) > return ERR_CAST(ctx); > @@ -1345,7 +1345,7 @@ static int encrypt_one_page(struct f2fs_io_info *fio) > struct inode *inode = fio->page->mapping->host; > gfp_t gfp_flags = GFP_NOFS; > > - if (!f2fs_encrypted_inode(inode) || !S_ISREG(inode->i_mode)) > + if (!f2fs_encrypted_file(inode)) > return 0; > > /* wait for GCed encrypted page writeback */ > @@ -1974,7 +1974,7 @@ static int f2fs_write_begin(struct file *file, struct > address_space *mapping, 
> f2fs_wait_on_page_writeback(page, DATA, false); > > /* wait for GCed encrypted page writeback */ > - if (f2fs_encrypted_inode(inode) && S_ISREG(inode->i_mode)) > + if (f2fs_encrypted_file(inode)) > f2fs_wait_on_encrypted_page_writeback(sbi, blkaddr); > > if (len == PAGE_SIZE || PageUptodate(page)) > diff --git a/fs/f2fs/f2fs.h b/fs/f2fs/f2fs.h > index 4b993961d81d..417c28f01fd5 100644 > --- a/fs/f2fs/f2fs.h > +++ b/fs/f2fs/f2fs.h > @@ -2939,6 +2939,11 @@ static inline bool f2fs_encrypted_inode(struct inode > *inode) > return file_is_encrypt(inode); > } > > +static inline bool f2fs_encrypted_file(struct inode *inode) > +{ > + return f2fs_encrypted_inode(inode) && S_ISREG(inode->i_mode); > +} > + > static inline void f2fs_set_encrypted_inode(struct inode *inode) > { > #ifdef CONFIG_F2FS_FS_ENCRYPTION > diff --git a/fs/f2fs/file.c b/fs/f2fs/file.c > index 63e394907808..f8cedaba7ce4 100644 > --- a/fs/f2fs/file.c > +++ b/fs/f2fs/file.c > @@ -106,7 +106,7 @@ static int f2fs_vm_page_mkwrite(struct vm_fault *vmf) > f2fs_wait_on_page_writeback(page, DATA, false); > > /* wait for GCed encrypted page writeback */ > - if (f2fs_encrypted_inode(inode) && S_ISREG(inode->i_mode)) > + if (f2fs_encrypted_file(inode)) > f2fs_wait_on_encrypted_page_writeback(sbi, dn.data_blkaddr); > > out_sem: > diff --git a/fs/f2fs/gc.c b/fs/f2fs/gc.c > index b226760afba8..6a12f33d0cdd 100644 > --- a/fs/f2fs/gc.c > +++ b/fs/f2fs/gc.c > @@ -831,8 +831,7 @@ static void gc_data_segment(struct f2fs_sb_info *sbi, > struct f2fs_summary *sum, > continue; > > /* if encrypted inode, let's go phase 3 */ > - if (f2fs_encrypted_inode(inode) && > - S_ISREG(inode->i_mode)) { > + if (f2fs_encrypted_file(inode)) { > add_gc_inode(gc_list, inode); > continue; > } > @@ -873,7 +872,7 @@ static void gc_data_segment(struct f2fs_sb_info *sbi, > struct f2fs_summary *sum, > > start_bidx = start_bidx_of_node(nofs, inode) > + ofs_in_node; > - if (f2fs_encrypted_inode(inode) && > S_ISREG(inode->i_mode)) > + if 
(f2fs_encrypted_file(inode)) > move_encrypted_block(inode, start_bidx, segno, > off); > else > move_data_page(inode, start_bidx, gc_type, > segno, off); > diff --git a/fs/f2fs/inline.c b/fs/f2fs/inline.c > index
Re: [PATCH] f2fs: fix wrong bfree and bavail
On 09/06, Chao Yu wrote: > On 2017/9/6 5:06, Jaegeuk Kim wrote: > > This patch fixes bavail and bfree finally. > > > > longf_bfree;/* free blocks in fs */ > > longf_bavail; /* free blocks avail to non-superuser */ > > > > So, bfree represents all the reserved blocks, while bavail does the space > > only > > visible to normal user. > > > > Fix: 3e6d0b4d9c1c ("f2fs: fix incorrect f_bfree calculation in ->statfs") > > Cc: # 4.8+ > > Signed-off-by: Jaegeuk Kim > > --- > > fs/f2fs/super.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c > > index 7e29db227910..17cca4cb93af 100644 > > --- a/fs/f2fs/super.c > > +++ b/fs/f2fs/super.c > > @@ -955,8 +955,8 @@ static int f2fs_statfs(struct dentry *dentry, struct > > kstatfs *buf) > > buf->f_bsize = sbi->blocksize; > > > > buf->f_blocks = total_count - start_count; > > - buf->f_bfree = user_block_count - valid_user_blocks(sbi) + ovp_count; > > - buf->f_bavail = user_block_count - valid_user_blocks(sbi) - > > + buf->f_bfree = user_block_count - valid_user_blocks(sbi); > > + buf->f_bavail = user_block_count - valid_user_blocks(sbi) - ovp_count - > > set_cp(user_block_count, ((get_cp(free_segment_count) + 6 - > get_cp(overprov_segment_count)) * c.blks_per_seg)); > > As we calculated in mkfs, we have removed ovp_count in user_block_count, so > shouldn't we add ovp_count for f_bfree here? Hehe, I already dropped this. :P It seems like something wrong with df in android. > > Thanks, > > > sbi->reserved_blocks; > > > > avail_node_count = sbi->total_node_count - F2FS_RESERVED_NODE_NUM; > >
Re: [PATCH] f2fs: fix wrong bfree and bavail
On 2017/9/6 5:06, Jaegeuk Kim wrote: > This patch fixes bavail and bfree finally. > > longf_bfree;/* free blocks in fs */ > longf_bavail; /* free blocks avail to non-superuser */ > > So, bfree represents all the reserved blocks, while bavail does the space only > visible to normal user. > > Fix: 3e6d0b4d9c1c ("f2fs: fix incorrect f_bfree calculation in ->statfs") > Cc: # 4.8+ > Signed-off-by: Jaegeuk Kim > --- > fs/f2fs/super.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/fs/f2fs/super.c b/fs/f2fs/super.c > index 7e29db227910..17cca4cb93af 100644 > --- a/fs/f2fs/super.c > +++ b/fs/f2fs/super.c > @@ -955,8 +955,8 @@ static int f2fs_statfs(struct dentry *dentry, struct > kstatfs *buf) > buf->f_bsize = sbi->blocksize; > > buf->f_blocks = total_count - start_count; > - buf->f_bfree = user_block_count - valid_user_blocks(sbi) + ovp_count; > - buf->f_bavail = user_block_count - valid_user_blocks(sbi) - > + buf->f_bfree = user_block_count - valid_user_blocks(sbi); > + buf->f_bavail = user_block_count - valid_user_blocks(sbi) - ovp_count - set_cp(user_block_count, ((get_cp(free_segment_count) + 6 - get_cp(overprov_segment_count)) * c.blks_per_seg)); As we calculated in mkfs, we have removed ovp_count in user_block_count, so shouldn't we add ovp_count for f_bfree here? Thanks, > sbi->reserved_blocks; > > avail_node_count = sbi->total_node_count - F2FS_RESERVED_NODE_NUM; >
[PATCH v2] md/raid5: preserve STRIPE_ON_UNPLUG_LIST in break_stripe_batch_list
In release_stripe_plug(), if a stripe_head has its STRIPE_ON_UNPLUG_LIST set, it indicates that this stripe_head is already in the raid5_plug_cb list and release_stripe() would be called instead to drop a reference count. Otherwise, the STRIPE_ON_UNPLUG_LIST bit would be set for this stripe_head and it will get queued into the raid5_plug_cb list. Since break_stripe_batch_list() did not preserve STRIPE_ON_UNPLUG_LIST, a stripe could be re-added to the plug list while it is still on that list, in the following situation. If stripe_head A is added to another stripe_head B's batch list, A will have its batch_head != NULL and be added into the plug list. After that, stripe_head B gets handled and calls break_stripe_batch_list() to reset the state of all the batched stripe_heads (including A, which is still on the plug list) and reset their batch_head to NULL. Before the plug list gets processed, if another write request comes in and gets stripe_head A, A will have its batch_head == NULL (cleared by calling break_stripe_batch_list() on B) and be added to the plug list once again. Signed-off-by: Dennis Yang --- drivers/md/raid5.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/md/raid5.c b/drivers/md/raid5.c index e92dd2d..faf3cfd 100644 --- a/drivers/md/raid5.c +++ b/drivers/md/raid5.c @@ -4611,7 +4611,8 @@ static void break_stripe_batch_list(struct stripe_head *head_sh, set_mask_bits(&sh->state, ~(STRIPE_EXPAND_SYNC_FLAGS | (1 << STRIPE_PREREAD_ACTIVE) | - (1 << STRIPE_DEGRADED)), + (1 << STRIPE_DEGRADED) | + (1 << STRIPE_ON_UNPLUG_LIST)), head_sh->state & (1 << STRIPE_INSYNC)); sh->check_state = head_sh->check_state; -- 1.9.1
linux-next: manual merge of the devicetree tree with Linus' tree
Hi Rob, Today's linux-next merge of the devicetree tree got a conflict in: Documentation/devicetree/bindings/net/brcm,bgmac-nsp.txt between commit: ea6c3077678f ("dt-bindings: net: Remove duplicate NSP Ethernet MAC binding document") from Linus' tree and commit: 4da722ca19f3 ("dt-bindings: Remove "status" from examples") from the devicetree tree. I fixed it up (the former removed the file, so I did that) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. -- Cheers, Stephen Rothwell
linux-next: manual merge of the devicetree tree with the sound tree
Hi all,

Today's linux-next merge of the devicetree tree got a conflict in:

  Documentation/devicetree/bindings/sound/rockchip,pdm.txt

between commit:

  fce7358b54c6 ("ASoC: rockchip: separate pinctrl pins from each other")

from the sound tree and commit:

  4da722ca19f3 ("dt-bindings: Remove "status" from examples")

from the devicetree tree.

I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non-trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc Documentation/devicetree/bindings/sound/rockchip,pdm.txt
index 2ad66f649a28,beb32ca26e5d..
--- a/Documentation/devicetree/bindings/sound/rockchip,pdm.txt
+++ b/Documentation/devicetree/bindings/sound/rockchip,pdm.txt
@@@ -33,10 -34,5 +33,9 @@@ pdm: pdm@ff04
  		     _sdi1 _sdi2 _sdi3>;
- 	pinctrl-1 = <_sleep>;
+ 	pinctrl-1 = <_clk_sleep
+ 		     _sdi0_sleep
+ 		     _sdi1_sleep
+ 		     _sdi2_sleep
+ 		     _sdi3_sleep>;
- 	status = "disabled";
  };