Re: [PATCH v6] lib: add basic KUnit test for lib/math

2021-04-16 Thread kernel test robot
Hi Daniel,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on 7e25f40eab52c57ff6772d27d2aef3640a3237d7]

url:
https://github.com/0day-ci/linux/commits/Daniel-Latypov/lib-add-basic-KUnit-test-for-lib-math/20210417-020619
base:   7e25f40eab52c57ff6772d27d2aef3640a3237d7
config: powerpc-randconfig-c004-20210416 (attached as .config)
compiler: powerpc-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# 
https://github.com/0day-ci/linux/commit/0f1888ffeaa6baa1bc2a99eac8ba7d1df29c8450
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review 
Daniel-Latypov/lib-add-basic-KUnit-test-for-lib-math/20210417-020619
git checkout 0f1888ffeaa6baa1bc2a99eac8ba7d1df29c8450
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross W=1 
ARCH=powerpc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All warnings (new ones prefixed by >>):

   lib/math/math_kunit.c: In function 'abs_test':
>> lib/math/math_kunit.c:41:1: warning: the frame size of 1088 bytes is larger 
>> than 1024 bytes [-Wframe-larger-than=]
  41 | }
 | ^


vim +41 lib/math/math_kunit.c

14  
15  static void abs_test(struct kunit *test)
16  {
17  KUNIT_EXPECT_EQ(test, abs((char)0), (char)0);
18  KUNIT_EXPECT_EQ(test, abs((char)42), (char)42);
19  KUNIT_EXPECT_EQ(test, abs((char)-42), (char)42);
20  
21  /* The expression in the macro is actually promoted to an int. 
*/
22  KUNIT_EXPECT_EQ(test, abs((short)0),  0);
23  KUNIT_EXPECT_EQ(test, abs((short)42),  42);
24  KUNIT_EXPECT_EQ(test, abs((short)-42),  42);
25  
26  KUNIT_EXPECT_EQ(test, abs(0),  0);
27  KUNIT_EXPECT_EQ(test, abs(42),  42);
28  KUNIT_EXPECT_EQ(test, abs(-42),  42);
29  
30  KUNIT_EXPECT_EQ(test, abs(0L), 0L);
31  KUNIT_EXPECT_EQ(test, abs(42L), 42L);
32  KUNIT_EXPECT_EQ(test, abs(-42L), 42L);
33  
34  KUNIT_EXPECT_EQ(test, abs(0LL), 0LL);
35  KUNIT_EXPECT_EQ(test, abs(42LL), 42LL);
36  KUNIT_EXPECT_EQ(test, abs(-42LL), 42LL);
37  
38  /* Unsigned types get casted to signed. */
39  KUNIT_EXPECT_EQ(test, abs(0ULL), 0LL);
40  KUNIT_EXPECT_EQ(test, abs(42ULL), 42LL);
  > 41  }
42  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


.config.gz
Description: application/gzip


Re: [PATCH V2 5/9] platform/x86: intel_pmc_core: Get LPM requirements for Tiger Lake

2021-04-16 Thread kernel test robot
Hi "David,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on 823b31517ad3196324322804ee365d5fcff704d6]

url:
https://github.com/0day-ci/linux/commits/David-E-Box/intel_pmc_core-Add-sub-state-requirements-and-mode/20210417-111530
base:   823b31517ad3196324322804ee365d5fcff704d6
config: i386-randconfig-a001-20210417 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-22) 9.3.0
reproduce (this is a W=1 build):
# 
https://github.com/0day-ci/linux/commit/703038f16e99686bf2538222cee482f823bfa60f
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review 
David-E-Box/intel_pmc_core-Add-sub-state-requirements-and-mode/20210417-111530
git checkout 703038f16e99686bf2538222cee482f823bfa60f
# save the attached .config to linux build tree
make W=1 W=1 ARCH=i386 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All warnings (new ones prefixed by >>):

   In file included from drivers/platform/x86/intel_pmc_core.c:14:
   drivers/platform/x86/intel_pmc_core.c: In function 
'pmc_core_get_tgl_lpm_reqs':
>> drivers/platform/x86/intel_pmc_core.c:621:5: warning: format '%ld' expects 
>> argument of type 'long int', but argument 5 has type 'size_t' {aka 'unsigned 
>> int'} [-Wformat=]
 621 | "_DSM returned unexpected buffer size,"
 | ^~~
 622 | " have %d, expect %ld\n", size, lpm_size);
 | 
 | |
 | size_t {aka unsigned int}
   include/linux/acpi.h:1073:42: note: in definition of macro 
'acpi_handle_debug'
1073 |   acpi_handle_printk(KERN_DEBUG, handle, fmt, ##__VA_ARGS__); \
 |  ^~~
   drivers/platform/x86/intel_pmc_core.c:622:25: note: format string is defined 
here
 622 | " have %d, expect %ld\n", size, lpm_size);
 |   ~~^
 | |
 | long int
 |   %d


vim +621 drivers/platform/x86/intel_pmc_core.c

   597  
   598  static void pmc_core_get_tgl_lpm_reqs(struct platform_device *pdev)
   599  {
   600  struct pmc_dev *pmcdev = platform_get_drvdata(pdev);
   601  const int num_maps = pmcdev->map->lpm_num_maps;
   602  size_t lpm_size = LPM_MAX_NUM_MODES * num_maps * 4;
   603  union acpi_object *out_obj;
   604  struct acpi_device *adev;
   605  guid_t s0ix_dsm_guid;
   606  u32 *lpm_req_regs, *addr;
   607  
   608  adev = ACPI_COMPANION(&pdev->dev);
   609  if (!adev)
   610  return;
   611  
   612  guid_parse(ACPI_S0IX_DSM_UUID, &s0ix_dsm_guid);
   613  
   614  out_obj = acpi_evaluate_dsm(adev->handle, &s0ix_dsm_guid, 0,
   615  ACPI_GET_LOW_MODE_REGISTERS, NULL);
   616  if (out_obj && out_obj->type == ACPI_TYPE_BUFFER) {
   617  int size = out_obj->buffer.length;
   618  
   619  if (size != lpm_size) {
   620  acpi_handle_debug(adev->handle,
 > 621  "_DSM returned unexpected buffer size,"
   622  " have %d, expect %ld\n", size, 
lpm_size);
   623  goto free_acpi_obj;
   624  }
   625  } else {
   626  acpi_handle_debug(adev->handle,
   627"_DSM function 0 evaluation 
failed\n");
   628  goto free_acpi_obj;
   629  }
   630  
   631  addr = (u32 *)out_obj->buffer.pointer;
   632  
   633  lpm_req_regs = devm_kzalloc(&pdev->dev, lpm_size * sizeof(u32),
   634   GFP_KERNEL);
   635  if (!lpm_req_regs)
   636  goto free_acpi_obj;
   637  
   638  memcpy(lpm_req_regs, addr, lpm_size);
   639  pmcdev->lpm_req_regs = lpm_req_regs;
   640  
   641  free_acpi_obj:
   642  ACPI_FREE(out_obj);
   643  }
   644  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


.config.gz
Description: application/gzip


Re: [PATCH v1 0/3] mm,hwpoison: fix sending SIGBUS for Action Required MCE

2021-04-16 Thread Aili Yao
On Tue, 13 Apr 2021 07:43:17 +0900
Naoya Horiguchi  wrote:

> Hi,
> 
> I wrote this patchset to materialize what I think is the current
> allowable solution mentioned by the previous discussion [1].
> I simply borrowed Tony's mutex patch and Aili's return code patch,
> then I queued another one to find error virtual address in the best
> effort manner.  I know that this is not a perfect solution, but
> should work for some typical case.
> 
> My simple testing showed this patchset seems to work as intended,
> but if you have the related testcases, could you please test and
> let me have some feedback?
> 
> Thanks,
> Naoya Horiguchi
> 
> [1]: 
> https://lore.kernel.org/linux-mm/20210331192540.2141052f@alex-virtual-machine/
> ---
> Summary:
> 
> Aili Yao (1):
>   mm,hwpoison: return -EHWPOISON when page already
> 
> Naoya Horiguchi (1):
>   mm,hwpoison: add kill_accessing_process() to find error virtual address
> 
> Tony Luck (1):
>   mm/memory-failure: Use a mutex to avoid memory_failure() races
> 
>  arch/x86/kernel/cpu/mce/core.c |  13 +++-
>  include/linux/swapops.h|   5 ++
>  mm/memory-failure.c| 166 
> -
>  3 files changed, 178 insertions(+), 6 deletions(-)

Hi Naoya,

Thanks for your patch and complete fix for this race issue.

I tested your patches; mainly they worked as expected, but in some cases they failed. 
I checked it
and found some doubtful places, could you help confirm them?

1. there is a compile warning:
static int hwpoison_pte_range(pmd_t *pmdp, unsigned long addr,
  unsigned long end, struct mm_walk *walk)
{
struct hwp_walk *hwp = (struct hwp_walk *)walk->private;
int ret; here

It seems this ret may not be initialized, and sometimes an error may be 
returned?

and for this:
static int check_hwpoisoned_entry(pte_t pte, unsigned long addr, short shift,
unsigned long poisoned_pfn, struct to_kill *tk)
{
unsigned long pfn;

I think it better to be initialized too.

2. In the function hwpoison_pte_range():
if (pfn <= hwp->pfn && hwp->pfn < pfn + PMD_SIZE) this check seem we should use 
PMD_SIZE/PAGE_SIZE or some macro like this?

3. unsigned long hwpoison_vaddr = addr + (hwp->pfn << PAGE_SHIFT & ~PMD_MASK); 
this seems not exact accurate?

4. static int set_to_kill(struct to_kill *tk, unsigned long addr, short shift)
{
if (tk->addr) {--- I am not sure about this check and if it will 
lead failure.
return 1;
}
In my test, it seems sometimes it will hit this branch; I don't know if it's a multi 
entry issue or a multi poison issue.
when i get to this fail, there is not enough log for this, but i can't 
reproduce it after that.

Would you help confirm this, and if there are any changes, please post again and I will do 
the test again.

Thanks
Aili Yao


Re: [PATCH v7 2/9] reboot: thermal: Export hardware protection shutdown

2021-04-16 Thread Daniel Lezcano
On 14/04/2021 07:52, Matti Vaittinen wrote:
> Thermal core contains a logic for safety shutdown. System is attempted to
> be powered off if temperature exceeds safety limits.
> 
> Currently this can be also utilized by regulator subsystem as a final
> protection measure if PMICs report dangerous over-voltage, over-current or
> over-temperature and if per regulator counter measures fail or do not
> exist.
> 
> Move this logic to kernel/reboot.c and export the functionality for other
> subsystems to use. Also replace the mutex with a spinlock to allow using
> the function from any context.
> 
> Also the EMIF bus code has implemented a safety shut-down. EMIF does not
> attempt orderly_poweroff at all. Thus the EMIF code is not converted to use
> this new function.
> 
> Signed-off-by: Matti Vaittinen 
> ---
> Changelog
>  v7:
>   - new patch
> 
> Please note - this patch has received only a minimal amount of testing.
> (The new API call was tested to shut-down my system at driver probe but
> no odd corner-cases have been tested).
> 
> Any testing for thermal shutdown is appreciated.
> ---
>  drivers/thermal/thermal_core.c | 63 ++---
>  include/linux/reboot.h |  1 +
>  kernel/reboot.c| 86 ++

Please send a patch implementing the reboot/shutdown and then another
one replacing the thermal shutdown code by a call to the new API.

>  3 files changed, 91 insertions(+), 59 deletions(-)
> 
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 996c038f83a4..b1444845af38 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -36,10 +36,8 @@ static LIST_HEAD(thermal_governor_list);
>  
>  static DEFINE_MUTEX(thermal_list_lock);
>  static DEFINE_MUTEX(thermal_governor_lock);
> -static DEFINE_MUTEX(poweroff_lock);
>  
>  static atomic_t in_suspend;
> -static bool power_off_triggered;
>  
>  static struct thermal_governor *def_governor;
>  
> @@ -327,70 +325,18 @@ static void handle_non_critical_trips(struct 
> thermal_zone_device *tz, int trip)
>  def_governor->throttle(tz, trip);
>  }
>  
> -/**
> - * thermal_emergency_poweroff_func - emergency poweroff work after a known 
> delay
> - * @work: work_struct associated with the emergency poweroff function
> - *
> - * This function is called in very critical situations to force
> - * a kernel poweroff after a configurable timeout value.
> - */
> -static void thermal_emergency_poweroff_func(struct work_struct *work)
> -{
> - /*
> -  * We have reached here after the emergency thermal shutdown
> -  * Waiting period has expired. This means orderly_poweroff has
> -  * not been able to shut off the system for some reason.
> -  * Try to shut down the system immediately using kernel_power_off
> -  * if populated
> -  */
> - WARN(1, "Attempting kernel_power_off: Temperature too high\n");
> - kernel_power_off();
> -
> - /*
> -  * Worst of the worst case trigger emergency restart
> -  */
> - WARN(1, "Attempting emergency_restart: Temperature too high\n");
> - emergency_restart();
> -}
> -
> -static DECLARE_DELAYED_WORK(thermal_emergency_poweroff_work,
> - thermal_emergency_poweroff_func);
> -
> -/**
> - * thermal_emergency_poweroff - Trigger an emergency system poweroff
> - *
> - * This may be called from any critical situation to trigger a system 
> shutdown
> - * after a known period of time. By default this is not scheduled.
> - */
> -static void thermal_emergency_poweroff(void)
> +void thermal_zone_device_critical(struct thermal_zone_device *tz)
>  {
> - int poweroff_delay_ms = CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS;
>   /*
>* poweroff_delay_ms must be a carefully profiled positive value.
> -  * Its a must for thermal_emergency_poweroff_work to be scheduled
> +  * Its a must for forced_emergency_poweroff_work to be scheduled.
>*/
> - if (poweroff_delay_ms <= 0)
> - return;
> - schedule_delayed_work(&thermal_emergency_poweroff_work,
> -   msecs_to_jiffies(poweroff_delay_ms));
> -}
> + int poweroff_delay_ms = CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS;
>  
> -void thermal_zone_device_critical(struct thermal_zone_device *tz)
> -{
>   dev_emerg(&tz->device, "%s: critical temperature reached, "
> "shutting down\n", tz->type);
>  
> - mutex_lock(&poweroff_lock);
> - if (!power_off_triggered) {
> - /*
> -  * Queue a backup emergency shutdown in the event of
> -  * orderly_poweroff failure
> -  */
> - thermal_emergency_poweroff();
> - orderly_poweroff(true);
> - power_off_triggered = true;
> - }
> - mutex_unlock(&poweroff_lock);
> + hw_protection_shutdown("Temperature too high", poweroff_delay_ms);
>  }
>  EXPORT_SYMBOL(thermal_zone_device_critical);
>  
> @@ -1549,7 +1495,6 

[tip:perf/core] BUILD SUCCESS 5deac80d4571dffb51f452f0027979d72259a1b9

2021-04-16 Thread kernel test robot
 allnoconfig
nds32   defconfig
nios2allyesconfig
alpha   defconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
arc defconfig
sh   allmodconfig
parisc  defconfig
s390 allyesconfig
s390 allmodconfig
parisc   allyesconfig
s390defconfig
sparcallyesconfig
sparc   defconfig
i386defconfig
mips allyesconfig
mips allmodconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc   allnoconfig
i386 randconfig-a003-20210416
i386 randconfig-a006-20210416
i386 randconfig-a001-20210416
i386 randconfig-a005-20210416
i386 randconfig-a004-20210416
i386 randconfig-a002-20210416
x86_64   randconfig-a014-20210416
x86_64   randconfig-a015-20210416
x86_64   randconfig-a011-20210416
x86_64   randconfig-a013-20210416
x86_64   randconfig-a012-20210416
x86_64   randconfig-a016-20210416
i386 randconfig-a015-20210416
i386 randconfig-a014-20210416
i386 randconfig-a013-20210416
i386 randconfig-a012-20210416
i386 randconfig-a016-20210416
i386 randconfig-a011-20210416
riscvnommu_k210_defconfig
riscvnommu_virt_defconfig
riscv allnoconfig
riscv   defconfig
riscv  rv32_defconfig
um   allmodconfig
umallnoconfig
um   allyesconfig
um  defconfig
x86_64   allyesconfig
x86_64rhel-8.3-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64  rhel-8.3-kbuiltin
x86_64  kexec

clang tested configs:
x86_64   randconfig-a003-20210416
x86_64   randconfig-a002-20210416
x86_64   randconfig-a005-20210416
x86_64   randconfig-a001-20210416
x86_64   randconfig-a006-20210416
x86_64   randconfig-a004-20210416

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


Re: [PATCH v7 6/6] w1: ds2438: support for writing to offset register

2021-04-16 Thread kernel test robot
Hi Luiz,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v5.12-rc7 next-20210416]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/Luiz-Sampaio/w1-ds2438-fixed-a-coding-style-issue/20210417-071754
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
151501160401e2dc669ea7dac2c599b53f220c33
config: csky-randconfig-m031-20210416 (attached as .config)
compiler: csky-linux-gcc (GCC) 9.3.0

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

smatch warnings:
drivers/w1/slaves/w1_ds2438.c:218 w1_ds2438_change_offset_register() warn: 
inconsistent indenting

vim +218 drivers/w1/slaves/w1_ds2438.c

   195  
   196  static int w1_ds2438_change_offset_register(struct w1_slave *sl, u8 
*value)
   197  {
   198  unsigned int retries = W1_DS2438_RETRIES;
   199  u8 w1_buf[9];
   200  u8 w1_page1_buf[DS2438_PAGE_SIZE + 1 /*for CRC*/];
   201  
   202  if (w1_ds2438_get_page(sl, 1, w1_page1_buf) == 0) {
   203  memcpy(&w1_buf[2], w1_page1_buf, DS2438_PAGE_SIZE - 1); 
/* last register reserved */
   204  w1_buf[7] = value[0]; /* change only offset register */
   205  w1_buf[8] = value[1];
   206  while (retries--) {
   207  if (w1_reset_select_slave(sl))
   208  continue;
   209  w1_buf[0] = W1_DS2438_WRITE_SCRATCH;
   210  w1_buf[1] = 0x01; /* write to page 1 */
   211  w1_write_block(sl->master, w1_buf, 9);
   212  
   213  if (w1_reset_select_slave(sl))
   214  continue;
   215  w1_buf[0] = W1_DS2438_COPY_SCRATCH;
   216  w1_buf[1] = 0x01;
   217  w1_write_block(sl->master, w1_buf, 2);
 > 218  return 0;
   219  }
   220  }
   221  return -1;
   222  }
   223  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


.config.gz
Description: application/gzip


Re: [PATCH 00/13] [RFC] Rust support

2021-04-16 Thread comex
On Fri, Apr 16, 2021 at 4:24 AM Peter Zijlstra  wrote:
> Simlar thing for RCU; C11 can't optimally do that; it needs to make
> rcu_dereference() a load-acquire [something ARM64 has already done in C
> because the compiler might be too clever by half when doing LTO :-(].
> But it's the compiler needing the acquire semantics, not the computer,
> which is just bloody wrong.

You may already know, but perhaps worth clarifying:

C11 does have atomic_signal_fence() which is a compiler fence.  But a
compiler fence only ensures the loads will be emitted in the right
order, not that the CPU will execute them in the right order.  CPU
architectures tend to guarantee that two loads will be executed in the
right order if the second one's address depends on the first one's
result, but a dependent load can stop being dependent after compiler
optimizations involving value speculation.  Using a load-acquire works
around this, not because it stops the compiler from performing any
optimization, but because it tells the computer to execute the loads
in the right order *even if* the compiler has broken the value
dependence.

So C11 atomics don't make the situation worse, compared to Linux's
atomics implementation based on volatile and inline assembly.  Both
are unsound in the presence of value speculation.  C11 atomics were
*supposed* to make the situation better, with memory_order_consume,
which would have specifically forbidden the compiler from performing
value speculation.  But all the compilers punted on getting this to
work and instead just implemented memory_order_consume as
memory_order_acquire.

As for Rust, it compiles to the same LLVM IR that Clang compiles C
into.  Volatile, inline assembly, and C11-based atomics: all of these
are available in Rust, and generate exactly the same code as their C
counterparts, for better or for worse.  Unfortunately, the Rust
project has relatively limited muscle when it comes to contributing to
LLVM.  So while it would definitely be nice if Rust could make RCU
sound, and from a specification perspective I think people would be
quite willing and probably easier to work with than the C committee...
I suspect that implementing this would require the kind of sweeping
change to LLVM that is probably not going to come from Rust.

There are other areas where I think that kind of discussion might be
more fruitful.  For example, the Rust documentation currently says
that a volatile read racing with a non-volatile write (i.e. seqlocks)
is undefined behavior. [1]  However, I am of the opinion that this is
essentially a spec bug, for reasons that are probably not worth
getting into here.

[1] https://doc.rust-lang.org/nightly/std/ptr/fn.read_volatile.html


[GIT PULL] Networking for 5.12-rc8

2021-04-16 Thread Jakub Kicinski
The following changes since commit 4e04e7513b0fa2fe8966a1c83fb473f1667e2810:

  Merge tag 'net-5.12-rc7' of 
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net (2021-04-09 15:26:51 
-0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git tags/net-5.12-rc8

for you to fetch changes up to f2764bd4f6a8dffaec3e220728385d9756b3c2cb:

  netlink: don't call ->netlink_bind with table lock held (2021-04-16 17:01:04 
-0700)


Networking fixes for 5.12-rc8, including fixes from netfilter,
and bpf. BPF verifier changes stand out, otherwise things have
slowed down.

Current release - regressions:

 - gro: ensure frag0 meets IP header alignment

 - Revert "net: stmmac: re-init rx buffers when mac resume back"

 - ethernet: macb: fix the restore of cmp registers

Previous releases - regressions:

 - ixgbe: Fix NULL pointer dereference in ethtool loopback test

 - ixgbe: fix unbalanced device enable/disable in suspend/resume

 - phy: marvell: fix detection of PHY on Topaz switches

 - make tcp_allowed_congestion_control readonly in non-init netns

 - xen-netback: Check for hotplug-status existence before watching

Previous releases - always broken:

 - bpf: mitigate a speculative oob read of up to map value size by
tightening the masking window

 - sctp: fix race condition in sctp_destroy_sock

 - sit, ip6_tunnel: Unregister catch-all devices

 - netfilter: nftables: clone set element expression template

 - netfilter: flowtable: fix NAT IPv6 offload mangling

 - net: geneve: check skb is large enough for IPv4/IPv6 header

 - netlink: don't call ->netlink_bind with table lock held

Signed-off-by: Jakub Kicinski 


Alexander Duyck (1):
  ixgbe: Fix NULL pointer dereference in ethtool loopback test

Aya Levin (2):
  net/mlx5: Fix setting of devlink traps in switchdev mode
  net/mlx5e: Fix setting of RS FEC mode

Christophe JAILLET (1):
  net: davicom: Fix regulator not turned off on failed probe

Ciara Loftus (1):
  libbpf: Fix potential NULL pointer dereference

Claudiu Beznea (1):
  net: macb: fix the restore of cmp registers

Colin Ian King (1):
  ice: Fix potential infinite loop when using u8 loop counter

Daniel Borkmann (9):
  bpf: Use correct permission flag for mixed signed bounds arithmetic
  bpf: Move off_reg into sanitize_ptr_alu
  bpf: Ensure off_reg has no mixed signed bounds for all types
  bpf: Rework ptr_limit into alu_limit and add common error path
  bpf: Improve verifier error messages for users
  bpf: Refactor and streamline bounds check into helper
  bpf: Move sanitize_val_alu out of op switch
  bpf: Tighten speculative pointer arithmetic mask
  bpf: Update selftests to reflect new error states

David S. Miller (7):
  Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
  Merge branch 'catch-all-devices'
  Merge branch 'ibmvnic-napi-fixes'
  Merge branch '10GbE' of 
git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/net-queue
  Merge tag 'mlx5-fixes-2021-04-14' of 
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
  Merge branch 'ch_tlss-fixes'
  Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf

Eric Dumazet (2):
  netfilter: nft_limit: avoid possible divide error in nft_limit_init
  gro: ensure frag0 meets IP header alignment

Florian Westphal (4):
  netfilter: bridge: add pre_exit hooks for ebtable unregistration
  netfilter: arp_tables: add pre_exit hook for table unregister
  netfilter: x_tables: fix compat match/target pad out-of-bound write
  netlink: don't call ->netlink_bind with table lock held

Heiner Kallweit (1):
  r8169: don't advertise pause in jumbo mode

Hristo Venev (2):
  net: sit: Unregister catch-all devices
  net: ip6_tunnel: Unregister catch-all devices

Jakub Kicinski (2):
  ethtool: fix kdoc attr name
  ethtool: pause: make sure we init driver stats

Jason Xing (1):
  i40e: fix the panic when running bpf in xdpdrv mode

Joakim Zhang (1):
  MAINTAINERS: update maintainer entry for freescale fec driver

Jonathon Reinhart (1):
  net: Make tcp_allowed_congestion_control readonly in non-init netns

Lijun Pan (5):
  ibmvnic: correctly use dev_consume/free_skb_irq
  ibmvnic: avoid calling napi_disable() twice
  ibmvnic: remove duplicate napi_schedule call in do_reset function
  ibmvnic: remove duplicate napi_schedule call in open function
  MAINTAINERS: update my email

Michael Brown (1):
  xen-netback: Check for hotplug-status existence before watching

Nicolas Dichtel (2):
  doc: move seg6_flowlabel to seg6-sysctl.rst
  vrf: fix a comment about loopback device

Or Cohen (1):
  net/sctp: fix race condition in sctp_destroy_sock

Pablo Neira Ayuso (3):
  netfilter: flowtable: fix 

Re: [External] Re: [PATCH] tcp: fix silent loss when syncookie is trigered

2021-04-16 Thread Eric Dumazet
On Sat, Apr 17, 2021 at 12:45 AM 赵亚  wrote:
>
> On Fri, Apr 16, 2021 at 7:52 PM Eric Dumazet  wrote:
> >
> > On Fri, Apr 16, 2021 at 12:52 PM zhaoya  wrote:
> > >
> > > When syncookie is triggered, since $MSSID is spliced into cookie and
> > > the legal index of msstab  is 0,1,2,3, this gives client 3 bytes
> > > of freedom, resulting in at most 3 bytes of silent loss.
> > >
> > > C seq=12345-> S
> > > C <--seq=cookie/ack=12346 S S generated the cookie
> > > [RFC4987 Appendix A]
> > > C ---seq=123456/ack=cookie+1-->X  S The first byte was loss.
> > > C -seq=123457/ack=cookie+1--> S The second byte was received and
> > > cookie-check was still okay and
> > > handshake was finished.
> > > C  >
> >
> > I think this has been discussed in the past :
> > https://kognitio.com/blog/syn-cookies-ate-my-dog-breaking-tcp-on-linux/
> >
> > If I remember well, this can not be fixed "easily"
> >
> > I suspect you are trading one minor issue with another (which is
> > considered more practical these days)
> > Have you tried what happens if the server receives an out-of-order
> > packet after the SYN & SYN-ACK ?
> > The answer is : RST packet is sent, killing the session.
> >
> > That is the reason why sseq is not part of the hash key.
>
> Yes, I've tested this scenario. More sessions do get reset.
>
> If a client got an RST, it knew the session failed, which was clear. However,
> if the client send a character and it was acknowledged, but the server did not
> receive it, this could cause confusion.
> >
> > In practice, secure connexions are using a setup phase where more than
> > 3 bytes are sent in the first packet.
> > We recommend using secure protocols over TCP. (prefer HTTPS over HTTP,
> > SSL over plaintext)
>
> Yes, i agree with you. But the basis of practice is principle.
> Syncookie breaks the
> semantics of TCP.
> >
> > Your change would severely impair servers under DDOS ability to really
> > establish flows.
>
> Would you tell me more details.
> >
> > Now, if your patch is protected by a sysctl so that admins can choose
> > the preferred behavior, then why not...
>
> The sysctl in the POC is just for triggering problems easily.
>
> So the question is, when syncookie is triggered, which is more important,
> the practice or the principle?

SYNCOOKIES have lots of known limitations.

You can disable them if you need.

Or you can add a sysctl or socket options so that each listener can
decide what they want.

I gave feedback of why your initial patch was _not_ good.

I think it can render a server under DDOS absolutely unusable.
Exactly the same situation than _without_ syncookies being used.
We do not want to go back to the situation we had before SYNCOOKIES
were invented.

I think you should have put a big warning in the changelog to explain
that you fully understood
the risks.

We prefer having servers that can still be useful, especially ones
serving 100% HTTPS traffic.

Thank you.


Re: [PATCH v7 2/9] reboot: thermal: Export hardware protection shutdown

2021-04-16 Thread Daniel Lezcano
On 14/04/2021 07:52, Matti Vaittinen wrote:
> Thermal core contains a logic for safety shutdown. System is attempted to
> be powered off if temperature exceeds safety limits.
> 
> Currently this can be also utilized by regulator subsystem as a final
> protection measure if PMICs report dangerous over-voltage, over-current or
> over-temperature and if per regulator counter measures fail or do not
> exist.
> 
> Move this logic to kernel/reboot.c and export the functionality for other
> subsystems to use. Also replace the mutex with a spinlock to allow using
> the function from any context.
> 
> Also the EMIF bus code has implemented a safety shut-down. EMIF does not
> attempt orderly_poweroff at all. Thus the EMIF code is not converted to use
> this new function.
> 
> Signed-off-by: Matti Vaittinen 
> ---
> Changelog
>  v7:
>   - new patch
> 
> Please note - this patch has received only a minimal amount of testing.
> (The new API call was tested to shut-down my system at driver probe but
> no odd corner-cases have been tested).
> 
> Any testing for thermal shutdown is appreciated.

You can test it easily by enabling the option CONFIG_THERMAL_EMULATION

Then in any thermal zone:

Assuming the critical temp is below the one specified in the command:

echo 10 > /sys/class/thermal/thermal_zone0/emul_temp

> ---
>  drivers/thermal/thermal_core.c | 63 ++---
>  include/linux/reboot.h |  1 +
>  kernel/reboot.c| 86 ++
>  3 files changed, 91 insertions(+), 59 deletions(-)
> 
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index 996c038f83a4..b1444845af38 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -36,10 +36,8 @@ static LIST_HEAD(thermal_governor_list);
>  
>  static DEFINE_MUTEX(thermal_list_lock);
>  static DEFINE_MUTEX(thermal_governor_lock);
> -static DEFINE_MUTEX(poweroff_lock);
>  
>  static atomic_t in_suspend;
> -static bool power_off_triggered;
>  
>  static struct thermal_governor *def_governor;
>  
> @@ -327,70 +325,18 @@ static void handle_non_critical_trips(struct 
> thermal_zone_device *tz, int trip)
>  def_governor->throttle(tz, trip);
>  }
>  
> -/**
> - * thermal_emergency_poweroff_func - emergency poweroff work after a known 
> delay
> - * @work: work_struct associated with the emergency poweroff function
> - *
> - * This function is called in very critical situations to force
> - * a kernel poweroff after a configurable timeout value.
> - */
> -static void thermal_emergency_poweroff_func(struct work_struct *work)
> -{
> - /*
> -  * We have reached here after the emergency thermal shutdown
> -  * Waiting period has expired. This means orderly_poweroff has
> -  * not been able to shut off the system for some reason.
> -  * Try to shut down the system immediately using kernel_power_off
> -  * if populated
> -  */
> - WARN(1, "Attempting kernel_power_off: Temperature too high\n");
> - kernel_power_off();
> -
> - /*
> -  * Worst of the worst case trigger emergency restart
> -  */
> - WARN(1, "Attempting emergency_restart: Temperature too high\n");
> - emergency_restart();
> -}
> -
> -static DECLARE_DELAYED_WORK(thermal_emergency_poweroff_work,
> - thermal_emergency_poweroff_func);
> -
> -/**
> - * thermal_emergency_poweroff - Trigger an emergency system poweroff
> - *
> - * This may be called from any critical situation to trigger a system 
> shutdown
> - * after a known period of time. By default this is not scheduled.
> - */
> -static void thermal_emergency_poweroff(void)
> +void thermal_zone_device_critical(struct thermal_zone_device *tz)
>  {
> - int poweroff_delay_ms = CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS;
>   /*
>* poweroff_delay_ms must be a carefully profiled positive value.
> -  * Its a must for thermal_emergency_poweroff_work to be scheduled
> +  * Its a must for forced_emergency_poweroff_work to be scheduled.
>*/
> - if (poweroff_delay_ms <= 0)
> - return;
> - schedule_delayed_work(&thermal_emergency_poweroff_work,
> -   msecs_to_jiffies(poweroff_delay_ms));
> -}
> + int poweroff_delay_ms = CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS;
>  
> -void thermal_zone_device_critical(struct thermal_zone_device *tz)
> -{
>   dev_emerg(&tz->device, "%s: critical temperature reached, "
> "shutting down\n", tz->type);
>  
> - mutex_lock(&poweroff_lock);
> - if (!power_off_triggered) {
> - /*
> -  * Queue a backup emergency shutdown in the event of
> -  * orderly_poweroff failure
> -  */
> - thermal_emergency_poweroff();
> - orderly_poweroff(true);
> - power_off_triggered = true;
> - }
> - mutex_unlock(&poweroff_lock);
> + hw_protection_shutdown("Temperature too high", 

Re: [PATCH v4 2/2] riscv: Disable data start offset in flat binaries

2021-04-16 Thread Greg Ungerer



On 17/4/21 11:10 am, Damien Le Moal wrote:

uclibc/gcc combined with elf2flt riscv linker file fully resolve the
PC relative __global_pointer$ value at compile time and do not generate
a relocation entry to set a correct value of the gp register at runtime.
As a result, if the flatbin loader offsets the start of the data
section, the relative position change between the text and data sections
compared to the compile time positions results in an incorrect gp value
being used. This causes flatbin executables to crash.

Avoid this problem by enabling CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET
automatically when CONFIG_RISCV is enabled and CONFIG_MMU is disabled.

Signed-off-by: Damien Le Moal 
Acked-by: Palmer Dabbelt 


Acked-by: Greg Ungerer 

Palmer do you want me to take this via my tree with 1/2 in the series,
or are you going to pick it up?

Regards
Greg



---
  arch/riscv/Kconfig | 1 +
  1 file changed, 1 insertion(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 4515a10c5d22..add528eb9235 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -33,6 +33,7 @@ config RISCV
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
+   select BINFMT_FLAT_NO_DATA_START_OFFSET if !MMU
select CLONE_BACKWARDS
select CLINT_TIMER if !MMU
select COMMON_CLK



Re: [PATCH v4 1/2] binfmt_flat: allow not offsetting data start

2021-04-16 Thread Damien Le Moal
On 2021/04/17 13:52, Greg Ungerer wrote:
> 
> On 17/4/21 11:10 am, Damien Le Moal wrote:
>> Commit 2217b9826246 ("binfmt_flat: revert "binfmt_flat: don't offset
>> the data start"") restored offsetting the start of the data section by
>> a number of words defined by MAX_SHARED_LIBS. As a result, since
>> MAX_SHARED_LIBS is never 0, a gap between the text and data sections
>> always exists. For architectures which cannot support a such gap
>> between the text and data sections (e.g. riscv nommu), flat binary
>> programs cannot be executed.
>>
>> To allow an architecture to request no data start offset to allow for
>> contiguous text and data sections for binaries flagged with
>> FLAT_FLAG_RAM, introduce the new config option
>> CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET. Using this new option, the
>> macro DATA_START_OFFSET_WORDS is conditionally defined in binfmt_flat.c
>> to MAX_SHARED_LIBS for architectures tolerating or needing the data
>> start offset (CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET disabled case)
>> and to 0 when CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET is enabled.
>> DATA_START_OFFSET_WORDS is used in load_flat_file() to calculate the
>> data section length and start position.
>>
>> Signed-off-by: Damien Le Moal 
>> ---
>>   fs/Kconfig.binfmt |  3 +++
>>   fs/binfmt_flat.c  | 19 ++-
>>   2 files changed, 17 insertions(+), 5 deletions(-)
>>
>> diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
>> index c6f1c8c1934e..06fb7a93a1bd 100644
>> --- a/fs/Kconfig.binfmt
>> +++ b/fs/Kconfig.binfmt
>> @@ -112,6 +112,9 @@ config BINFMT_FLAT_ARGVP_ENVP_ON_STACK
>>   config BINFMT_FLAT_OLD_ALWAYS_RAM
>>  bool
>>   
>> +config BINFMT_FLAT_NO_DATA_START_OFFSET
>> +bool
>> +
>>   config BINFMT_FLAT_OLD
>>  bool "Enable support for very old legacy flat binaries"
>>  depends on BINFMT_FLAT
>> diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
>> index b9c658e0548e..1dc68dfba3e0 100644
>> --- a/fs/binfmt_flat.c
>> +++ b/fs/binfmt_flat.c
>> @@ -74,6 +74,12 @@
>>   #defineMAX_SHARED_LIBS (1)
>>   #endif
>>   
>> +#ifdef CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET
>> +#define DATA_START_OFFSET_WORDS (0)
>> +#else
>> +#define DATA_START_OFFSET_WORDS (MAX_SHARED_LIBS)
>> +#endif
>> +
>>   struct lib_info {
>>  struct {
>>  unsigned long start_code;   /* Start of text 
>> segment */
>> @@ -560,6 +566,7 @@ static int load_flat_file(struct linux_binprm *bprm,
>>   * it all together.
>>   */
>>  if (!IS_ENABLED(CONFIG_MMU) && !(flags & 
>> (FLAT_FLAG_RAM|FLAT_FLAG_GZIP))) {
>> +
> 
> Random white space change...
> Don't worry about re-spinning though, I will just edit this chunk out.

Oops. Sorry about that. I should have better checked :)

> 
> 
>>  /*
>>   * this should give us a ROM ptr,  but if it doesn't we don't
>>   * really care
>> @@ -576,7 +583,8 @@ static int load_flat_file(struct linux_binprm *bprm,
>>  goto err;
>>  }
>>   
>> -len = data_len + extra + MAX_SHARED_LIBS * sizeof(unsigned 
>> long);
>> +len = data_len + extra +
>> +DATA_START_OFFSET_WORDS * sizeof(unsigned long);
>>  len = PAGE_ALIGN(len);
>>  realdatastart = vm_mmap(NULL, 0, len,
>>  PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE, 0);
>> @@ -591,7 +599,7 @@ static int load_flat_file(struct linux_binprm *bprm,
>>  goto err;
>>  }
>>  datapos = ALIGN(realdatastart +
>> -MAX_SHARED_LIBS * sizeof(unsigned long),
>> +DATA_START_OFFSET_WORDS * sizeof(unsigned long),
>>  FLAT_DATA_ALIGN);
>>   
>>  pr_debug("Allocated data+bss+stack (%u bytes): %lx\n",
>> @@ -622,7 +630,8 @@ static int load_flat_file(struct linux_binprm *bprm,
>>  memp_size = len;
>>  } else {
>>   
>> -len = text_len + data_len + extra + MAX_SHARED_LIBS * 
>> sizeof(u32);
>> +len = text_len + data_len + extra +
>> +DATA_START_OFFSET_WORDS * sizeof(u32);
>>  len = PAGE_ALIGN(len);
>>  textpos = vm_mmap(NULL, 0, len,
>>  PROT_READ | PROT_EXEC | PROT_WRITE, MAP_PRIVATE, 0);
>> @@ -638,7 +647,7 @@ static int load_flat_file(struct linux_binprm *bprm,
>>   
>>  realdatastart = textpos + ntohl(hdr->data_start);
>>  datapos = ALIGN(realdatastart +
>> -MAX_SHARED_LIBS * sizeof(u32),
>> +DATA_START_OFFSET_WORDS * sizeof(u32),
>>  FLAT_DATA_ALIGN);
>>   
>>  reloc = (__be32 __user *)
>> @@ -714,7 +723,7 @@ static int load_flat_file(struct linux_binprm *bprm,
>>  ret = result;
>>  pr_err("Unable to read code+data+bss, errno %d\n", ret);
>>

Re: [PATCH net] net/core/dev.c: Ensure pfmemalloc skbs are correctly handled when receiving

2021-04-16 Thread Eric Dumazet
On Sat, Apr 17, 2021 at 2:08 AM Xie He  wrote:
>
> When an skb is allocated by "__netdev_alloc_skb" in "net/core/skbuff.c",
> if "sk_memalloc_socks()" is true, and if there's not sufficient memory,
> the skb would be allocated using emergency memory reserves. This kind of
> skbs are called pfmemalloc skbs.
>
> pfmemalloc skbs must be specially handled in "net/core/dev.c" when
> receiving. They must NOT be delivered to the target protocol if
> "skb_pfmemalloc_protocol(skb)" is false.
>
> However, if, after a pfmemalloc skb is allocated and before it reaches
> the code in "__netif_receive_skb", "sk_memalloc_socks()" becomes false,
> then the skb will be handled by "__netif_receive_skb" as a normal skb.
> This causes the skb to be delivered to the target protocol even if
> "skb_pfmemalloc_protocol(skb)" is false.
>
> This patch fixes this problem by ensuring all pfmemalloc skbs are handled
> by "__netif_receive_skb" as pfmemalloc skbs.
>
> "__netif_receive_skb_list" has the same problem as "__netif_receive_skb".
> This patch also fixes it.
>
> Fixes: b4b9e3558508 ("netvm: set PF_MEMALLOC as appropriate during SKB 
> processing")
> Cc: Mel Gorman 
> Cc: David S. Miller 
> Cc: Neil Brown 
> Cc: Peter Zijlstra 
> Cc: Jiri Slaby 
> Cc: Mike Christie 
> Cc: Eric B Munson 
> Cc: Eric Dumazet 
> Cc: Sebastian Andrzej Siewior 
> Cc: Christoph Lameter 
> Cc: Andrew Morton 
> Signed-off-by: Xie He 
> ---
>  net/core/dev.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1f79b9aa9a3f..3e6b7879daef 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5479,7 +5479,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>  {
> int ret;
>
> -   if (sk_memalloc_socks() && skb_pfmemalloc(skb)) {
> +   if (skb_pfmemalloc(skb)) {
> unsigned int noreclaim_flag;
>
> /*
> @@ -5507,7 +5507,7 @@ static void __netif_receive_skb_list(struct list_head 
> *head)
> bool pfmemalloc = false; /* Is current sublist PF_MEMALLOC? */
>
> list_for_each_entry_safe(skb, next, head, list) {
> -   if ((sk_memalloc_socks() && skb_pfmemalloc(skb)) != 
> pfmemalloc) {
> +   if (skb_pfmemalloc(skb) != pfmemalloc) {
> struct list_head sublist;
>
> /* Handle the previous sublist */
> --
> 2.27.0
>

The race window has been considered to be small enough that we prefer the
code as it is.

The reason why we prefer current code is that we use a static key for
the implementation
of sk_memalloc_socks()

Trading some minor condition (race) with extra cycles for each
received packet is a serious concern.

What matters is a persistent condition that would _deplete_ memory,
not for a dozen packets,
but thousands. Can you demonstrate such an issue?


Re: [PATCH v4 1/2] binfmt_flat: allow not offsetting data start

2021-04-16 Thread Greg Ungerer



On 17/4/21 11:10 am, Damien Le Moal wrote:

Commit 2217b9826246 ("binfmt_flat: revert "binfmt_flat: don't offset
the data start"") restored offsetting the start of the data section by
a number of words defined by MAX_SHARED_LIBS. As a result, since
MAX_SHARED_LIBS is never 0, a gap between the text and data sections
always exists. For architectures which cannot support such a gap
between the text and data sections (e.g. riscv nommu), flat binary
programs cannot be executed.

To allow an architecture to request no data start offset to allow for
contiguous text and data sections for binaries flagged with
FLAT_FLAG_RAM, introduce the new config option
CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET. Using this new option, the
macro DATA_START_OFFSET_WORDS is conditionally defined in binfmt_flat.c
to MAX_SHARED_LIBS for architectures tolerating or needing the data
start offset (CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET disabled case)
and to 0 when CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET is enabled.
DATA_START_OFFSET_WORDS is used in load_flat_file() to calculate the
data section length and start position.

Signed-off-by: Damien Le Moal 
---
  fs/Kconfig.binfmt |  3 +++
  fs/binfmt_flat.c  | 19 ++-
  2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
index c6f1c8c1934e..06fb7a93a1bd 100644
--- a/fs/Kconfig.binfmt
+++ b/fs/Kconfig.binfmt
@@ -112,6 +112,9 @@ config BINFMT_FLAT_ARGVP_ENVP_ON_STACK
  config BINFMT_FLAT_OLD_ALWAYS_RAM
bool
  
+config BINFMT_FLAT_NO_DATA_START_OFFSET

+   bool
+
  config BINFMT_FLAT_OLD
bool "Enable support for very old legacy flat binaries"
depends on BINFMT_FLAT
diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
index b9c658e0548e..1dc68dfba3e0 100644
--- a/fs/binfmt_flat.c
+++ b/fs/binfmt_flat.c
@@ -74,6 +74,12 @@
  #define   MAX_SHARED_LIBS (1)
  #endif
  
+#ifdef CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET

+#define DATA_START_OFFSET_WORDS(0)
+#else
+#define DATA_START_OFFSET_WORDS(MAX_SHARED_LIBS)
+#endif
+
  struct lib_info {
struct {
unsigned long start_code;   /* Start of text 
segment */
@@ -560,6 +566,7 @@ static int load_flat_file(struct linux_binprm *bprm,
 * it all together.
 */
if (!IS_ENABLED(CONFIG_MMU) && !(flags & 
(FLAT_FLAG_RAM|FLAT_FLAG_GZIP))) {
+


Random white space change...
Don't worry about re-spinning though, I will just edit this chunk out.



/*
 * this should give us a ROM ptr,  but if it doesn't we don't
 * really care
@@ -576,7 +583,8 @@ static int load_flat_file(struct linux_binprm *bprm,
goto err;
}
  
-		len = data_len + extra + MAX_SHARED_LIBS * sizeof(unsigned long);

+   len = data_len + extra +
+   DATA_START_OFFSET_WORDS * sizeof(unsigned long);
len = PAGE_ALIGN(len);
realdatastart = vm_mmap(NULL, 0, len,
PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE, 0);
@@ -591,7 +599,7 @@ static int load_flat_file(struct linux_binprm *bprm,
goto err;
}
datapos = ALIGN(realdatastart +
-   MAX_SHARED_LIBS * sizeof(unsigned long),
+   DATA_START_OFFSET_WORDS * sizeof(unsigned long),
FLAT_DATA_ALIGN);
  
  		pr_debug("Allocated data+bss+stack (%u bytes): %lx\n",

@@ -622,7 +630,8 @@ static int load_flat_file(struct linux_binprm *bprm,
memp_size = len;
} else {
  
-		len = text_len + data_len + extra + MAX_SHARED_LIBS * sizeof(u32);

+   len = text_len + data_len + extra +
+   DATA_START_OFFSET_WORDS * sizeof(u32);
len = PAGE_ALIGN(len);
textpos = vm_mmap(NULL, 0, len,
PROT_READ | PROT_EXEC | PROT_WRITE, MAP_PRIVATE, 0);
@@ -638,7 +647,7 @@ static int load_flat_file(struct linux_binprm *bprm,
  
  		realdatastart = textpos + ntohl(hdr->data_start);

datapos = ALIGN(realdatastart +
-   MAX_SHARED_LIBS * sizeof(u32),
+   DATA_START_OFFSET_WORDS * sizeof(u32),
FLAT_DATA_ALIGN);
  
  		reloc = (__be32 __user *)

@@ -714,7 +723,7 @@ static int load_flat_file(struct linux_binprm *bprm,
ret = result;
pr_err("Unable to read code+data+bss, errno %d\n", ret);
vm_munmap(textpos, text_len + data_len + extra +
-   MAX_SHARED_LIBS * sizeof(u32));
+ DATA_START_OFFSET_WORDS * sizeof(u32));
goto err;
}
}



Thanks, otherwise looks good.

Acked-by: Greg Ungerer 

I will push this into my 

[PATCH v2 (RESEND) 2/2] riscv: atomic: Using ARCH_ATOMIC in asm/atomic.h

2021-04-16 Thread guoren
From: Guo Ren 

The linux/atomic-arch-fallback.h has been there for a while, but
only x86 & arm64 support it. Let's make riscv follow the
linux/arch/* development trend and make the code more readable
and maintainable.

This patch also cleanup some codes:
 - Add atomic_andnot_* operation
 - Using amoswap.w.rl & amoswap.w.aq instructions in xchg
 - Remove cmpxchg_acquire/release unnecessary optimization

Change in v2:
 - Fixup andnot bug by Peter Zijlstra

Signed-off-by: Guo Ren 
Link: 
https://lore.kernel.org/linux-riscv/cak8p3a0fg3cpqbnup7kxj3713cmuqv1wceh-vcrngkm00wx...@mail.gmail.com/
Cc: Arnd Bergmann 
Cc: Peter Zijlstra 
Cc: Anup Patel 
Cc: Palmer Dabbelt 
---

Signed-off-by: Guo Ren 
---
 arch/riscv/include/asm/atomic.h  | 230 +++
 arch/riscv/include/asm/cmpxchg.h | 199 ++---
 2 files changed, 99 insertions(+), 330 deletions(-)

diff --git a/arch/riscv/include/asm/atomic.h b/arch/riscv/include/asm/atomic.h
index 400a8c8..b127cb1 100644
--- a/arch/riscv/include/asm/atomic.h
+++ b/arch/riscv/include/asm/atomic.h
@@ -8,13 +8,8 @@
 #ifndef _ASM_RISCV_ATOMIC_H
 #define _ASM_RISCV_ATOMIC_H
 
-#ifdef CONFIG_GENERIC_ATOMIC64
-# include 
-#else
-# if (__riscv_xlen < 64)
-#  error "64-bit atomics require XLEN to be at least 64"
-# endif
-#endif
+#include 
+#include 
 
 #include 
 #include 
@@ -25,25 +20,13 @@
 #define __atomic_release_fence()   \
__asm__ __volatile__(RISCV_RELEASE_BARRIER "" ::: "memory");
 
-static __always_inline int atomic_read(const atomic_t *v)
-{
-   return READ_ONCE(v->counter);
-}
-static __always_inline void atomic_set(atomic_t *v, int i)
-{
-   WRITE_ONCE(v->counter, i);
-}
+#define arch_atomic_read(v)__READ_ONCE((v)->counter)
+#define arch_atomic_set(v, i)  __WRITE_ONCE(((v)->counter), 
(i))
 
 #ifndef CONFIG_GENERIC_ATOMIC64
-#define ATOMIC64_INIT(i) { (i) }
-static __always_inline s64 atomic64_read(const atomic64_t *v)
-{
-   return READ_ONCE(v->counter);
-}
-static __always_inline void atomic64_set(atomic64_t *v, s64 i)
-{
-   WRITE_ONCE(v->counter, i);
-}
+#define ATOMIC64_INIT  ATOMIC_INIT
+#define arch_atomic64_read arch_atomic_read
+#define arch_atomic64_set  arch_atomic_set
 #endif
 
 /*
@@ -53,7 +36,7 @@ static __always_inline void atomic64_set(atomic64_t *v, s64 i)
  */
 #define ATOMIC_OP(op, asm_op, I, asm_type, c_type, prefix) \
 static __always_inline \
-void atomic##prefix##_##op(c_type i, atomic##prefix##_t *v)\
+void arch_atomic##prefix##_##op(c_type i, atomic##prefix##_t *v)   \
 {  \
__asm__ __volatile__ (  \
"   amo" #asm_op "." #asm_type " zero, %1, %0"  \
@@ -76,6 +59,12 @@ ATOMIC_OPS(sub, add, -i)
 ATOMIC_OPS(and, and,  i)
 ATOMIC_OPS( or,  or,  i)
 ATOMIC_OPS(xor, xor,  i)
+ATOMIC_OPS(andnot, and,  ~i)
+
+#define arch_atomic_andnot arch_atomic_andnot
+#ifndef CONFIG_GENERIC_ATOMIC64
+#define arch_atomic64_andnot   arch_atomic64_andnot
+#endif
 
 #undef ATOMIC_OP
 #undef ATOMIC_OPS
@@ -87,7 +76,7 @@ ATOMIC_OPS(xor, xor,  i)
  */
 #define ATOMIC_FETCH_OP(op, asm_op, I, asm_type, c_type, prefix)   \
 static __always_inline \
-c_type atomic##prefix##_fetch_##op##_relaxed(c_type i, \
+c_type arch_atomic##prefix##_fetch_##op##_relaxed(c_type i,\
 atomic##prefix##_t *v) \
 {  \
register c_type ret;\
@@ -99,7 +88,7 @@ c_type atomic##prefix##_fetch_##op##_relaxed(c_type i,
\
return ret; \
 }  \
 static __always_inline \
-c_type atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v)\
+c_type arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v)\
 {  \
register c_type ret;\
__asm__ __volatile__ (  \
@@ -112,15 +101,16 @@ c_type atomic##prefix##_fetch_##op(c_type i, 
atomic##prefix##_t *v)   \
 
 #define ATOMIC_OP_RETURN(op, asm_op, c_op, I, asm_type, c_type, prefix)
\
 static __always_inline \
-c_type atomic##prefix##_##op##_return_relaxed(c_type i,
\
+c_type arch_atomic##prefix##_##op##_return_relaxed(c_type i,   \
 

[PATCH v2 (RESEND) 1/2] locking/atomics: Fixup GENERIC_ATOMIC64 conflict with atomic-arch-fallback.h

2021-04-16 Thread guoren
From: Guo Ren 

Current GENERIC_ATOMIC64 in atomic-arch-fallback.h is broken. When a 32-bit
arch use atomic-arch-fallback.h will cause compile error.

In file included from include/linux/atomic.h:81,
from include/linux/rcupdate.h:25,
from include/linux/rculist.h:11,
from include/linux/pid.h:5,
from include/linux/sched.h:14,
from arch/riscv/kernel/asm-offsets.c:10:
   include/linux/atomic-arch-fallback.h: In function 'arch_atomic64_inc':
>> include/linux/atomic-arch-fallback.h:1447:2: error: implicit declaration of 
>> function 'arch_atomic64_add'; did you mean 'arch_atomic_add'? 
>> [-Werror=implicit-function-declaration]
1447 |  arch_atomic64_add(1, v);
 |  ^
 |  arch_atomic_add

The atomic-arch-fallback.h & atomic-fallback.h &
atomic-instrumented.h are generated by gen-atomic-fallback.sh &
gen-atomic-instrumented.sh, so just take care the bash files.

Add atomic64_* wrapper into atomic64.h, then there is no dependency
of atomic-*-fallback.h in atomic64.h.

Change in v2:
 - Fixup scripts/atomic/gen-atomic-instrumented.sh wih duplicated
   definition export.

Signed-off-by: Guo Ren 
Cc: Peter Zijlstra 
Cc: Arnd Bergmann 
---
 include/asm-generic/atomic-instrumented.h | 264 +++---
 include/asm-generic/atomic64.h|  89 ++
 include/linux/atomic-arch-fallback.h  |   5 +-
 include/linux/atomic-fallback.h   |   5 +-
 scripts/atomic/gen-atomic-fallback.sh |   3 +-
 scripts/atomic/gen-atomic-instrumented.sh |  23 ++-
 6 files changed, 251 insertions(+), 138 deletions(-)

diff --git a/include/asm-generic/atomic-instrumented.h 
b/include/asm-generic/atomic-instrumented.h
index 888b6cf..c9e69c6 100644
--- a/include/asm-generic/atomic-instrumented.h
+++ b/include/asm-generic/atomic-instrumented.h
@@ -831,6 +831,137 @@ atomic_dec_if_positive(atomic_t *v)
 #define atomic_dec_if_positive atomic_dec_if_positive
 #endif
 
+#if !defined(arch_xchg_relaxed) || defined(arch_xchg)
+#define xchg(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_xchg(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_xchg_acquire)
+#define xchg_acquire(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_xchg_acquire(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_xchg_release)
+#define xchg_release(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_xchg_release(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_xchg_relaxed)
+#define xchg_relaxed(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_xchg_relaxed(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if !defined(arch_cmpxchg_relaxed) || defined(arch_cmpxchg)
+#define cmpxchg(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_cmpxchg(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_cmpxchg_acquire)
+#define cmpxchg_acquire(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_cmpxchg_acquire(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_cmpxchg_release)
+#define cmpxchg_release(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_cmpxchg_release(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_cmpxchg_relaxed)
+#define cmpxchg_relaxed(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_cmpxchg_relaxed(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if !defined(arch_try_cmpxchg_relaxed) || defined(arch_try_cmpxchg)
+#define try_cmpxchg(ptr, oldp, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   typeof(oldp) __ai_oldp = (oldp); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   instrument_atomic_write(__ai_oldp, sizeof(*__ai_oldp)); \
+   arch_try_cmpxchg(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_try_cmpxchg_acquire)
+#define try_cmpxchg_acquire(ptr, oldp, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   typeof(oldp) __ai_oldp = (oldp); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   instrument_atomic_write(__ai_oldp, sizeof(*__ai_oldp)); \
+   arch_try_cmpxchg_acquire(__ai_ptr, __ai_oldp, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_try_cmpxchg_release)
+#define try_cmpxchg_release(ptr, oldp, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   typeof(oldp) __ai_oldp = (oldp); \
+   

Re: PROBLEM: DoS Attack on Fragment Cache

2021-04-16 Thread Eric Dumazet
On Sat, Apr 17, 2021 at 2:31 AM David Ahern  wrote:
>
> [ cc author of 648700f76b03b7e8149d13cc2bdb3355035258a9 ]



I think this has been discussed already. There is no strategy that
makes IP reassembly units immune to DDOS attacks.

We added rb-tree and sysctls to let admins choose to use GB of RAM if
they really care.



>
> On 4/16/21 3:58 PM, Keyu Man wrote:
> > Hi,
> >
> >
> >
> > My name is Keyu Man. We are a group of researchers from University
> > of California, Riverside. Zhiyun Qian is my advisor. We found the code
> > in processing IPv4/IPv6 fragments will potentially lead to DoS Attacks.
> > Specifically, after the latest kernel receives an IPv4 fragment, it will
> > try to fit it into a queue by calling function
> >
> >
> >
> > struct inet_frag_queue *inet_frag_find(struct fqdir *fqdir, void
> > *key) in net/ipv4/inet_fragment.c.
> >
> >
> >
> > However, this function will first check if the existing fragment
> > memory exceeds the fqdir->high_thresh. If it exceeds, then drop the
> > fragment regardless whether it belongs to a new queue or an existing queue.
> >
> > Chances are that an attacker can fill the cache with fragments that
> > will never be assembled (i.e., only sends the first fragment with new
> > IPIDs every time) to exceed the threshold so that all future incoming
> > fragmented IPv4 traffic would be blocked and dropped. Since there is no
> > GC mechanism, the victim host has to wait for 30s when the fragments are
> > expired to continue receive incoming fragments normally.
> >
> > In practice, given the 4MB fragment cache, the attacker only needs
> > to send 1766 fragments to exhaust the cache and DoS the victim for 30s,
> > whose cost is pretty low. Besides, IPv6 would also be affected since the
> > issue resides in inet part.
> >
> > This issue is introduced in commit
> > 648700f76b03b7e8149d13cc2bdb3355035258a9 (inet: frags: use rhashtables
> > for reassembly units) which removes fqdir->low_thresh, and GC worker as
> > well. We would gently request to bring GC worker back to the kernel to
> > prevent the DoS attacks.
> >
> > Looking forward to hear from you
> >
> >
> >
> > Thanks,
> >
> > Keyu Man
> >
>


[PATCH v3 7/8] mm: memcontrol: move obj_cgroup_uncharge_pages() out of css_set_lock

2021-04-16 Thread Muchun Song
The css_set_lock is used to guard the list of inherited objcgs. So there
is no need to uncharge kernel memory under css_set_lock. Just move it
out of the lock.

Signed-off-by: Muchun Song 
Reviewed-by: Shakeel Butt 
Acked-by: Roman Gushchin 
Acked-by: Johannes Weiner 
---
 mm/memcontrol.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index c4eebe2a2914..e0c398fe7443 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -289,9 +289,10 @@ static void obj_cgroup_release(struct percpu_ref *ref)
WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
nr_pages = nr_bytes >> PAGE_SHIFT;
 
-   spin_lock_irqsave(_set_lock, flags);
if (nr_pages)
obj_cgroup_uncharge_pages(objcg, nr_pages);
+
+   spin_lock_irqsave(_set_lock, flags);
list_del(>list);
spin_unlock_irqrestore(_set_lock, flags);
 
-- 
2.11.0



[PATCH v3 8/8] mm: vmscan: remove noinline_for_stack

2021-04-16 Thread Muchun Song
The noinline_for_stack is introduced by commit 666356297ec4 ("vmscan:
set up pagevec as late as possible in shrink_inactive_list()"), its
purpose is to delay the allocation of pagevec as late as possible to
save stack memory. But the commit 2bcf88796381 ("mm: take pagevecs off
reclaim stack") replace pagevecs by lists of pages_to_free. So we do
not need noinline_for_stack, just remove it (let the compiler decide
whether to inline).

Signed-off-by: Muchun Song 
Acked-by: Johannes Weiner 
Acked-by: Roman Gushchin 
Reviewed-by: Shakeel Butt 
Acked-by: Michal Hocko 
---
 mm/vmscan.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2bc5cf409958..2d2727b78df9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2014,8 +2014,8 @@ static int too_many_isolated(struct pglist_data *pgdat, 
int file,
  *
  * Returns the number of pages moved to the given lruvec.
  */
-static unsigned noinline_for_stack move_pages_to_lru(struct lruvec *lruvec,
-struct list_head *list)
+static unsigned int move_pages_to_lru(struct lruvec *lruvec,
+ struct list_head *list)
 {
int nr_pages, nr_moved = 0;
LIST_HEAD(pages_to_free);
@@ -2095,7 +2095,7 @@ static int current_may_throttle(void)
  * shrink_inactive_list() is a helper for shrink_node().  It returns the number
  * of reclaimed pages
  */
-static noinline_for_stack unsigned long
+static unsigned long
 shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 struct scan_control *sc, enum lru_list lru)
 {
-- 
2.11.0



[PATCH v3 3/8] mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec

2021-04-16 Thread Muchun Song
All the callers of mem_cgroup_page_lruvec() just pass page_pgdat(page)
as the 2nd parameter to it (except isolate_migratepages_block()). But
for isolate_migratepages_block(), the page_pgdat(page) is also equal
to the local variable of @pgdat. So mem_cgroup_page_lruvec() do not
need the pgdat parameter. Just remove it to simplify the code.

Signed-off-by: Muchun Song 
Acked-by: Johannes Weiner 
Reviewed-by: Shakeel Butt 
Acked-by: Roman Gushchin 
Acked-by: Michal Hocko 
---
 include/linux/memcontrol.h | 10 +-
 mm/compaction.c|  2 +-
 mm/memcontrol.c|  9 +++--
 mm/swap.c  |  2 +-
 mm/workingset.c|  2 +-
 5 files changed, 11 insertions(+), 14 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c193be760709..f2a5aaba3577 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -743,13 +743,12 @@ static inline struct lruvec *mem_cgroup_lruvec(struct 
mem_cgroup *memcg,
 /**
  * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
  * @page: the page
- * @pgdat: pgdat of the page
  *
  * This function relies on page->mem_cgroup being stable.
  */
-static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
-   struct pglist_data *pgdat)
+static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page)
 {
+   pg_data_t *pgdat = page_pgdat(page);
struct mem_cgroup *memcg = page_memcg(page);
 
VM_WARN_ON_ONCE_PAGE(!memcg && !mem_cgroup_disabled(), page);
@@ -1221,9 +1220,10 @@ static inline struct lruvec *mem_cgroup_lruvec(struct 
mem_cgroup *memcg,
return >__lruvec;
 }
 
-static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page,
-   struct pglist_data *pgdat)
+static inline struct lruvec *mem_cgroup_page_lruvec(struct page *page)
 {
+   pg_data_t *pgdat = page_pgdat(page);
+
return >__lruvec;
 }
 
diff --git a/mm/compaction.c b/mm/compaction.c
index 8c5028bfbd56..1c500e697c88 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -994,7 +994,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
if (!TestClearPageLRU(page))
goto isolate_fail_put;
 
-   lruvec = mem_cgroup_page_lruvec(page, pgdat);
+   lruvec = mem_cgroup_page_lruvec(page);
 
/* If we already hold the lock, we can skip some rechecking */
if (lruvec != locked) {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 50e3cf1e263e..caf193088beb 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1181,9 +1181,8 @@ void lruvec_memcg_debug(struct lruvec *lruvec, struct 
page *page)
 struct lruvec *lock_page_lruvec(struct page *page)
 {
struct lruvec *lruvec;
-   struct pglist_data *pgdat = page_pgdat(page);
 
-   lruvec = mem_cgroup_page_lruvec(page, pgdat);
+   lruvec = mem_cgroup_page_lruvec(page);
spin_lock(>lru_lock);
 
lruvec_memcg_debug(lruvec, page);
@@ -1194,9 +1193,8 @@ struct lruvec *lock_page_lruvec(struct page *page)
 struct lruvec *lock_page_lruvec_irq(struct page *page)
 {
struct lruvec *lruvec;
-   struct pglist_data *pgdat = page_pgdat(page);
 
-   lruvec = mem_cgroup_page_lruvec(page, pgdat);
+   lruvec = mem_cgroup_page_lruvec(page);
spin_lock_irq(>lru_lock);
 
lruvec_memcg_debug(lruvec, page);
@@ -1207,9 +1205,8 @@ struct lruvec *lock_page_lruvec_irq(struct page *page)
 struct lruvec *lock_page_lruvec_irqsave(struct page *page, unsigned long 
*flags)
 {
struct lruvec *lruvec;
-   struct pglist_data *pgdat = page_pgdat(page);
 
-   lruvec = mem_cgroup_page_lruvec(page, pgdat);
+   lruvec = mem_cgroup_page_lruvec(page);
spin_lock_irqsave(>lru_lock, *flags);
 
lruvec_memcg_debug(lruvec, page);
diff --git a/mm/swap.c b/mm/swap.c
index a75a8265302b..e0d5699213cc 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -313,7 +313,7 @@ void lru_note_cost(struct lruvec *lruvec, bool file, 
unsigned int nr_pages)
 
 void lru_note_cost_page(struct page *page)
 {
-   lru_note_cost(mem_cgroup_page_lruvec(page, page_pgdat(page)),
+   lru_note_cost(mem_cgroup_page_lruvec(page),
  page_is_file_lru(page), thp_nr_pages(page));
 }
 
diff --git a/mm/workingset.c b/mm/workingset.c
index b7cdeca5a76d..4f7a306ce75a 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -408,7 +408,7 @@ void workingset_activation(struct page *page)
memcg = page_memcg_rcu(page);
if (!mem_cgroup_disabled() && !memcg)
goto out;
-   lruvec = mem_cgroup_page_lruvec(page, page_pgdat(page));
+   lruvec = mem_cgroup_page_lruvec(page);
workingset_age_nonresident(lruvec, thp_nr_pages(page));
 out:
rcu_read_unlock();
-- 
2.11.0



[PATCH v3 6/8] mm: memcontrol: simplify the logic of objcg pinning memcg

2021-04-16 Thread Muchun Song
The obj_cgroup_release() and memcg_reparent_objcgs() are serialized by
the css_set_lock. We do not need to care about objcg->memcg being
released in the process of obj_cgroup_release(). So there is no need
to pin memcg before releasing objcg. Remove that pinning logic to
simplify the code.

There are only two places that modifies the objcg->memcg. One is the
initialization to objcg->memcg in the memcg_online_kmem(), another
is objcgs reparenting in the memcg_reparent_objcgs(). It is also
impossible for the two to run in parallel. So xchg() is unnecessary
and it is enough to use WRITE_ONCE().

Signed-off-by: Muchun Song 
Acked-by: Johannes Weiner 
Reviewed-by: Shakeel Butt 
Acked-by: Roman Gushchin 
---
 mm/memcontrol.c | 20 ++--
 1 file changed, 6 insertions(+), 14 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index caf193088beb..c4eebe2a2914 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -261,7 +261,6 @@ static void obj_cgroup_uncharge_pages(struct obj_cgroup 
*objcg,
 static void obj_cgroup_release(struct percpu_ref *ref)
 {
struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
-   struct mem_cgroup *memcg;
unsigned int nr_bytes;
unsigned int nr_pages;
unsigned long flags;
@@ -291,11 +290,9 @@ static void obj_cgroup_release(struct percpu_ref *ref)
nr_pages = nr_bytes >> PAGE_SHIFT;
 
spin_lock_irqsave(_set_lock, flags);
-   memcg = obj_cgroup_memcg(objcg);
if (nr_pages)
obj_cgroup_uncharge_pages(objcg, nr_pages);
list_del(>list);
-   mem_cgroup_put(memcg);
spin_unlock_irqrestore(_set_lock, flags);
 
percpu_ref_exit(ref);
@@ -330,17 +327,12 @@ static void memcg_reparent_objcgs(struct mem_cgroup 
*memcg,
 
spin_lock_irq(_set_lock);
 
-   /* Move active objcg to the parent's list */
-   xchg(>memcg, parent);
-   css_get(>css);
-   list_add(>list, >objcg_list);
-
-   /* Move already reparented objcgs to the parent's list */
-   list_for_each_entry(iter, >objcg_list, list) {
-   css_get(>css);
-   xchg(>memcg, parent);
-   css_put(>css);
-   }
+   /* 1) Ready to reparent active objcg. */
+   list_add(>list, >objcg_list);
+   /* 2) Reparent active objcg and already reparented objcgs to parent. */
+   list_for_each_entry(iter, >objcg_list, list)
+   WRITE_ONCE(iter->memcg, parent);
+   /* 3) Move already reparented objcgs to the parent's list */
list_splice(>objcg_list, >objcg_list);
 
spin_unlock_irq(_set_lock);
-- 
2.11.0



[PATCH v3 4/8] mm: memcontrol: simplify lruvec_holds_page_lru_lock

2021-04-16 Thread Muchun Song
We already have a helper lruvec_memcg() to get the memcg from lruvec, we
do not need to do it ourselves in the lruvec_holds_page_lru_lock(). So use
lruvec_memcg() instead. And if mem_cgroup_disabled() returns false, the
page_memcg(page) (the LRU pages) cannot be NULL. So remove the odd logic
of "memcg = page_memcg(page) ? : root_mem_cgroup". And use lruvec_pgdat
to simplify the code. We can have a single definition for this function
that works for !CONFIG_MEMCG, CONFIG_MEMCG + mem_cgroup_disabled() and
CONFIG_MEMCG.

Signed-off-by: Muchun Song 
Acked-by: Johannes Weiner 
Reviewed-by: Shakeel Butt 
Acked-by: Roman Gushchin 
Acked-by: Michal Hocko 
---
 include/linux/memcontrol.h | 31 +++
 1 file changed, 7 insertions(+), 24 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f2a5aaba3577..2fc728492c9b 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -755,22 +755,6 @@ static inline struct lruvec *mem_cgroup_page_lruvec(struct 
page *page)
return mem_cgroup_lruvec(memcg, pgdat);
 }
 
-static inline bool lruvec_holds_page_lru_lock(struct page *page,
- struct lruvec *lruvec)
-{
-   pg_data_t *pgdat = page_pgdat(page);
-   const struct mem_cgroup *memcg;
-   struct mem_cgroup_per_node *mz;
-
-   if (mem_cgroup_disabled())
-   return lruvec == >__lruvec;
-
-   mz = container_of(lruvec, struct mem_cgroup_per_node, lruvec);
-   memcg = page_memcg(page) ? : root_mem_cgroup;
-
-   return lruvec->pgdat == pgdat && mz->memcg == memcg;
-}
-
 struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
 
 struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
@@ -1227,14 +1211,6 @@ static inline struct lruvec 
*mem_cgroup_page_lruvec(struct page *page)
return >__lruvec;
 }
 
-static inline bool lruvec_holds_page_lru_lock(struct page *page,
- struct lruvec *lruvec)
-{
-   pg_data_t *pgdat = page_pgdat(page);
-
-   return lruvec == >__lruvec;
-}
-
 static inline void lruvec_memcg_debug(struct lruvec *lruvec, struct page *page)
 {
 }
@@ -1516,6 +1492,13 @@ static inline void unlock_page_lruvec_irqrestore(struct 
lruvec *lruvec,
spin_unlock_irqrestore(>lru_lock, flags);
 }
 
+static inline bool lruvec_holds_page_lru_lock(struct page *page,
+ struct lruvec *lruvec)
+{
+   return lruvec_pgdat(lruvec) == page_pgdat(page) &&
+  lruvec_memcg(lruvec) == page_memcg(page);
+}
+
 /* Don't lock again iff page's lruvec locked */
 static inline struct lruvec *relock_page_lruvec_irq(struct page *page,
struct lruvec *locked_lruvec)
-- 
2.11.0



[PATCH v3 5/8] mm: memcontrol: rename lruvec_holds_page_lru_lock to page_matches_lruvec

2021-04-16 Thread Muchun Song
lruvec_holds_page_lru_lock() doesn't check anything about locking and is
used to check whether the page belongs to the lruvec. So rename it to
page_matches_lruvec().

Signed-off-by: Muchun Song 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Reviewed-by: Shakeel Butt 
---
 include/linux/memcontrol.h | 8 
 mm/vmscan.c| 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 2fc728492c9b..0ce97eff79e2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1492,8 +1492,8 @@ static inline void unlock_page_lruvec_irqrestore(struct 
lruvec *lruvec,
spin_unlock_irqrestore(>lru_lock, flags);
 }
 
-static inline bool lruvec_holds_page_lru_lock(struct page *page,
- struct lruvec *lruvec)
+/* Test requires a stable page->memcg binding, see page_memcg() */
+static inline bool page_matches_lruvec(struct page *page, struct lruvec 
*lruvec)
 {
return lruvec_pgdat(lruvec) == page_pgdat(page) &&
   lruvec_memcg(lruvec) == page_memcg(page);
@@ -1504,7 +1504,7 @@ static inline struct lruvec 
*relock_page_lruvec_irq(struct page *page,
struct lruvec *locked_lruvec)
 {
if (locked_lruvec) {
-   if (lruvec_holds_page_lru_lock(page, locked_lruvec))
+   if (page_matches_lruvec(page, locked_lruvec))
return locked_lruvec;
 
unlock_page_lruvec_irq(locked_lruvec);
@@ -1518,7 +1518,7 @@ static inline struct lruvec 
*relock_page_lruvec_irqsave(struct page *page,
struct lruvec *locked_lruvec, unsigned long *flags)
 {
if (locked_lruvec) {
-   if (lruvec_holds_page_lru_lock(page, locked_lruvec))
+   if (page_matches_lruvec(page, locked_lruvec))
return locked_lruvec;
 
unlock_page_lruvec_irqrestore(locked_lruvec, *flags);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index bb8321026c0c..2bc5cf409958 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2062,7 +2062,7 @@ static unsigned noinline_for_stack 
move_pages_to_lru(struct lruvec *lruvec,
 * All pages were isolated from the same lruvec (and isolation
 * inhibits memcg migration).
 */
-   VM_BUG_ON_PAGE(!lruvec_holds_page_lru_lock(page, lruvec), page);
+   VM_BUG_ON_PAGE(!page_matches_lruvec(page, lruvec), page);
add_page_to_lru_list(page, lruvec);
nr_pages = thp_nr_pages(page);
nr_moved += nr_pages;
-- 
2.11.0



[PATCH v3 1/8] mm: memcontrol: fix page charging in page replacement

2021-04-16 Thread Muchun Song
The pages aren't accounted at the root level, so do not charge the page
to the root memcg in page replacement. Although we do not display the
value (mem_cgroup_usage) so there shouldn't be any actual problem, but
there is a WARN_ON_ONCE in the page_counter_cancel(). Who knows if it
will trigger? So it is better to fix it.

Signed-off-by: Muchun Song 
Acked-by: Johannes Weiner 
Reviewed-by: Shakeel Butt 
Acked-by: Roman Gushchin 
Acked-by: Michal Hocko 
---
 mm/memcontrol.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 64ada9e650a5..f229de925aa5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6806,9 +6806,11 @@ void mem_cgroup_migrate(struct page *oldpage, struct 
page *newpage)
/* Force-charge the new page. The old one will be freed soon */
nr_pages = thp_nr_pages(newpage);
 
-   page_counter_charge(>memory, nr_pages);
-   if (do_memsw_account())
-   page_counter_charge(>memsw, nr_pages);
+   if (!mem_cgroup_is_root(memcg)) {
+   page_counter_charge(>memory, nr_pages);
+   if (do_memsw_account())
+   page_counter_charge(>memsw, nr_pages);
+   }
 
css_get(>css);
commit_charge(newpage, memcg);
-- 
2.11.0



[PATCH v3 2/8] mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm

2021-04-16 Thread Muchun Song
When mm is NULL, we do not need to hold rcu lock and call css_tryget for
the root memcg. And we also do not need to check !mm in every loop of
while. So bail out early when !mm.

Signed-off-by: Muchun Song 
Acked-by: Johannes Weiner 
Reviewed-by: Shakeel Butt 
Acked-by: Roman Gushchin 
Acked-by: Michal Hocko 
---
 mm/memcontrol.c | 25 ++---
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f229de925aa5..50e3cf1e263e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -901,20 +901,23 @@ struct mem_cgroup *get_mem_cgroup_from_mm(struct 
mm_struct *mm)
if (mem_cgroup_disabled())
return NULL;
 
+   /*
+* Page cache insertions can happen without an
+* actual mm context, e.g. during disk probing
+* on boot, loopback IO, acct() writes etc.
+*
+* No need to css_get on root memcg as the reference
+* counting is disabled on the root level in the
+* cgroup core. See CSS_NO_REF.
+*/
+   if (unlikely(!mm))
+   return root_mem_cgroup;
+
rcu_read_lock();
do {
-   /*
-* Page cache insertions can happen without an
-* actual mm context, e.g. during disk probing
-* on boot, loopback IO, acct() writes etc.
-*/
-   if (unlikely(!mm))
+   memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
+   if (unlikely(!memcg))
memcg = root_mem_cgroup;
-   else {
-   memcg = 
mem_cgroup_from_task(rcu_dereference(mm->owner));
-   if (unlikely(!memcg))
-   memcg = root_mem_cgroup;
-   }
} while (!css_tryget(>css));
rcu_read_unlock();
return memcg;
-- 
2.11.0



[PATCH v3 0/8] memcontrol code cleanup and simplification

2021-04-16 Thread Muchun Song
This patch series is part of [1] patch series. Because those patches are
code cleanup or simplification. I gather those patches into a separate
series to make it easier to review.

[1] 
https://lore.kernel.org/linux-mm/20210409122959.82264-1-songmuc...@bytedance.com/

Changelogs in v3:
  1. Collect Acked-by and Reviewed-by tags.
  2. Add a comment to patch 5 (suggested by Johannes).

  Thanks to Johannes, Shakeel and Michal's review.

Changelogs in v2:
  1. Collect Acked-by and Reviewed-by tags.
  2. Add a new patch to rename lruvec_holds_page_lru_lock to 
page_matches_lruvec.
  3. Add a comment to patch 2.

  Thanks to Roman, Johannes, Shakeel and Michal's review.

Muchun Song (8):
  mm: memcontrol: fix page charging in page replacement
  mm: memcontrol: bail out early when !mm in get_mem_cgroup_from_mm
  mm: memcontrol: remove the pgdata parameter of mem_cgroup_page_lruvec
  mm: memcontrol: simplify lruvec_holds_page_lru_lock
  mm: memcontrol: rename lruvec_holds_page_lru_lock to
page_matches_lruvec
  mm: memcontrol: simplify the logic of objcg pinning memcg
  mm: memcontrol: move obj_cgroup_uncharge_pages() out of css_set_lock
  mm: vmscan: remove noinline_for_stack

 include/linux/memcontrol.h | 43 ++
 mm/compaction.c|  2 +-
 mm/memcontrol.c| 65 +-
 mm/swap.c  |  2 +-
 mm/vmscan.c|  8 +++---
 mm/workingset.c|  2 +-
 6 files changed, 50 insertions(+), 72 deletions(-)

-- 
2.11.0



Re: [PATCH v2] mm, thp: Relax the VM_DENYWRITE constraint on file-backed THPs

2021-04-16 Thread Hugh Dickins
On Mon, 5 Apr 2021, Collin Fijalkovich wrote:

> Transparent huge pages are supported for read-only non-shmem files,
> but are only used for vmas with VM_DENYWRITE. This condition ensures that
> file THPs are protected from writes while an application is running
> (ETXTBSY).  Any existing file THPs are then dropped from the page cache
> when a file is opened for write in do_dentry_open(). Since sys_mmap
> ignores MAP_DENYWRITE, this constrains the use of file THPs to vmas
> produced by execve().
> 
> Systems that make heavy use of shared libraries (e.g. Android) are unable
> to apply VM_DENYWRITE through the dynamic linker, preventing them from
> benefiting from the resultant reduced contention on the TLB.
> 
> This patch reduces the constraint on file THPs allowing use with any
> executable mapping from a file not opened for write (see
> inode_is_open_for_write()). It also introduces additional conditions to
> ensure that files opened for write will never be backed by file THPs.
> 
> Restricting the use of THPs to executable mappings eliminates the risk that
> a read-only file later opened for write would encounter significant
> latencies due to page cache truncation.
> 
> The ld linker flag '-z max-page-size=(hugepage size)' can be used to
> produce executables with the necessary layout. The dynamic linker must
> map these file's segments at a hugepage size aligned vma for the mapping to
> be backed with THPs.
> 
> Comparison of the performance characteristics of 4KB and 2MB-backed
> libraries follows; the Android dex2oat tool was used to AOT compile an
> example application on a single ARM core.
> 
> 4KB Pages:
> ==
> 
> count  event_name# count / runtime
> 598,995,035,942cpu-cycles# 1.800861 GHz
>  81,195,620,851raw-stall-frontend# 244.112 M/sec
> 347,754,466,597iTLB-loads# 1.046 G/sec
>   2,970,248,900iTLB-load-misses  # 0.854122% miss rate
> 
> Total test time: 332.854998 seconds.
> 
> 2MB Pages:
> ==
> 
> count  event_name# count / runtime
> 592,872,663,047cpu-cycles# 1.800358 GHz
>  76,485,624,143raw-stall-frontend# 232.261 M/sec
> 350,478,413,710iTLB-loads# 1.064 G/sec
> 803,233,322iTLB-load-misses  # 0.229182% miss rate
> 
> Total test time: 329.826087 seconds
> 
> A check of /proc/$(pidof dex2oat64)/smaps shows THPs in use:
> 
> /apex/com.android.art/lib64/libart.so
> FilePmdMapped:  4096 kB
> 
> /apex/com.android.art/lib64/libart-compiler.so
> FilePmdMapped:  2048 kB
> 
> Signed-off-by: Collin Fijalkovich 

Acked-by: Hugh Dickins 

and you also won

Reviewed-by: William Kucharski 

in the v1 thread.

I had hoped to see a more dramatic difference in the numbers above,
but I'm a performance naif, and presume other loads and other
libraries may show further benefit.

> ---
> Changes v1 -> v2:
> * commit message 'non-shmem filesystems' -> 'non-shmem files'
> * Add performance testing data to commit message
> 
>  fs/open.c   | 13 +++--
>  mm/khugepaged.c | 16 +++-
>  2 files changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/open.c b/fs/open.c
> index e53af13b5835..f76e960d10ea 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -852,8 +852,17 @@ static int do_dentry_open(struct file *f,
>* XXX: Huge page cache doesn't support writing yet. Drop all page
>* cache for this file before processing writes.
>*/
> - if ((f->f_mode & FMODE_WRITE) && filemap_nr_thps(inode->i_mapping))
> - truncate_pagecache(inode, 0);
> + if (f->f_mode & FMODE_WRITE) {
> + /*
> +  * Paired with smp_mb() in collapse_file() to ensure nr_thps
> +  * is up to date and the update to i_writecount by
> +  * get_write_access() is visible. Ensures subsequent insertion
> +  * of THPs into the page cache will fail.
> +  */
> + smp_mb();
> + if (filemap_nr_thps(inode->i_mapping))
> + truncate_pagecache(inode, 0);
> + }
>  
>   return 0;
>  
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index a7d6cb912b05..4c7cc877d5e3 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -459,7 +459,8 @@ static bool hugepage_vma_check(struct vm_area_struct *vma,
>  
>   /* Read-only file mappings need to be aligned for THP to work. */
>   if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file &&
> - (vm_flags & VM_DENYWRITE)) {
> + !inode_is_open_for_write(vma->vm_file->f_inode) &&
> + (vm_flags & VM_EXEC)) {
>   return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
>   HPAGE_PMD_NR);
>   }
> @@ -1872,6 +1873,19 @@ static void collapse_file(struct mm_struct *mm,
>   else {
>   __mod_lruvec_page_state(new_page, NR_FILE_THPS, nr);
>   

Re: [PATCH 04/13] Kbuild: Rust support

2021-04-16 Thread Willy Tarreau
On Sat, Apr 17, 2021 at 01:46:35AM +0200, Miguel Ojeda wrote:
> On Sat, Apr 17, 2021 at 12:04 AM Willy Tarreau  wrote:
> >
> > But my point remains that the point of extreme care is at the interface
> > with the rest of the kernel because there is a change of semantics
> > there.
> >
> > Sure but as I said most often (due to API or ABI inheritance), both
> > are already exclusive and stored as ranges. Returning 1..4095 for
> > errno or a pointer including NULL for a success doesn't shock me at
> > all.
> 
> At the point of the interface we definitely need to take care of
> converting properly, but for Rust-to-Rust code (i.e. the ones using
> `Result` etc.), that would not be a concern.

Sure.

> Just to ensure I understood your concern, for instance, in this case
> you mentioned:
> 
>result.status = foo_alloc();
>if (!result.status) {
>result.error = -ENOMEM;
>return result;
>}

Yes I mentioned this when it was my understanding that the composite
result returned was made both of a pointer and an error code, but Connor
explained that it was in fact more of a selector and a union.

> Is your concern is that the caller would mix up the `status` with the
> `error`, basically bubbling up the `status` as an `int` and forgetting
> about the `error`, and then someone else later understanding that
> `int` as a non-error because it is non-negative?

My concern was to know what field to look at to reliably detect an error
from the C side after a sequence doing C -> Rust -> C when the inner C
code uses NULL to mark an error and the upper C code uses NULL as a valid
value and needs to look at an error code instead to rebuild a result. But
if it's more:
 
 if (result.ok)
return result.pointer;
 else
return (void *)-result.error;

then it shouldn't be an issue.

Willy


Re: [PATCH v2 5/6] kunit: mptcp: adhear to KUNIT formatting standard

2021-04-16 Thread David Gow
Hi Matt,

> Like patch 1/6, I can apply it in MPTCP tree and send it later to
> net-next with other patches.
> Except if you guys prefer to apply it in KUnit tree and send it to
> linux-next?

Given 1/6 is going to net-next, it makes sense to send this out that
way too, then, IMHO.
The only slight concern I have is that the m68k test config patch in
the series will get split from the others, but that should resolve
itself when they pick up the last patch.

At the very least, this shouldn't cause any conflicts with anything
we're doing in the KUnit tree.

Cheers,
-- David


Re: [PATCH] watchdog: aspeed: fix hardware timeout calculation

2021-04-16 Thread Guenter Roeck
On Fri, Apr 16, 2021 at 08:42:49PM -0700, rentao.b...@gmail.com wrote:
> From: Tao Ren 
> 
> Fix hardware timeout calculation in aspeed_wdt_set_timeout function to
> ensure the reload value does not exceed the hardware limit.
> 
> Fixes: efa859f7d786 ("watchdog: Add Aspeed watchdog driver")
> Reported-by: Amithash Prasad 
> Signed-off-by: Tao Ren 

Reviewed-by: Guenter Roeck 

> ---
>  drivers/watchdog/aspeed_wdt.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/watchdog/aspeed_wdt.c b/drivers/watchdog/aspeed_wdt.c
> index 7e00960651fa..507fd815d767 100644
> --- a/drivers/watchdog/aspeed_wdt.c
> +++ b/drivers/watchdog/aspeed_wdt.c
> @@ -147,7 +147,7 @@ static int aspeed_wdt_set_timeout(struct watchdog_device 
> *wdd,
>  
>   wdd->timeout = timeout;
>  
> - actual = min(timeout, wdd->max_hw_heartbeat_ms * 1000);
> + actual = min(timeout, wdd->max_hw_heartbeat_ms / 1000);
>  
>   writel(actual * WDT_RATE_1MHZ, wdt->base + WDT_RELOAD_VALUE);
>   writel(WDT_RESTART_MAGIC, wdt->base + WDT_RESTART);
> -- 
> 2.17.1
> 


hppa64-linux-ld: lib/zstd/entropy_common.o(.text+0x1c): cannot reach _mcount

2021-04-16 Thread kernel test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
master
head:   9cdbf6467424045617cd6e79dcaad06bb8efa31c
commit: 8ef7ca75120a39167def40f41daefee013c4b5af lockdep/selftest: Add more 
recursive read related test cases
date:   8 months ago
config: parisc-randconfig-r006-20210416 (attached as .config)
compiler: hppa64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8ef7ca75120a39167def40f41daefee013c4b5af
git remote add linus 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
git fetch --no-tags linus master
git checkout 8ef7ca75120a39167def40f41daefee013c4b5af
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross W=1 
ARCH=parisc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

>> hppa64-linux-ld: lib/zstd/entropy_common.o(.text+0x1c): cannot reach _mcount
   lib/zstd/entropy_common.o: in function `FSE_versionNumber':
   (.text+0x1c): relocation truncated to fit: R_PARISC_PCREL22F against symbol 
`_mcount' defined in .text.hot section in arch/parisc/kernel/entry.o
   hppa64-linux-ld: lib/zstd/entropy_common.o(.text+0x64): cannot reach _mcount
   lib/zstd/entropy_common.o: in function `FSE_isError':
   (.text+0x64): relocation truncated to fit: R_PARISC_PCREL22F against symbol 
`_mcount' defined in .text.hot section in arch/parisc/kernel/entry.o
   hppa64-linux-ld: lib/zstd/entropy_common.o(.text+0xbc): cannot reach _mcount
   lib/zstd/entropy_common.o: in function `HUF_isError':
   (.text+0xbc): relocation truncated to fit: R_PARISC_PCREL22F against symbol 
`_mcount' defined in .text.hot section in arch/parisc/kernel/entry.o
   hppa64-linux-ld: lib/zstd/entropy_common.o(.text+0x160): cannot reach _mcount
   lib/zstd/entropy_common.o: in function `FSE_readNCount':
   (.text+0x160): relocation truncated to fit: R_PARISC_PCREL22F against symbol 
`_mcount' defined in .text.hot section in arch/parisc/kernel/entry.o
   hppa64-linux-ld: lib/zstd/entropy_common.o(.text+0xb3c): cannot reach _mcount
   lib/zstd/entropy_common.o: in function `HUF_readStats_wksp':
   (.text+0xb3c): relocation truncated to fit: R_PARISC_PCREL22F against symbol 
`_mcount' defined in .text.hot section in arch/parisc/kernel/entry.o
   hppa64-linux-ld: lib/zstd/fse_decompress.o(.text+0x70): cannot reach _mcount
   lib/zstd/fse_decompress.o: in function `FSE_buildDTable_wksp':
   (.text+0x70): relocation truncated to fit: R_PARISC_PCREL22F against symbol 
`_mcount' defined in .text.hot section in arch/parisc/kernel/entry.o
   hppa64-linux-ld: lib/zstd/fse_decompress.o(.text+0x3d4): cannot reach _mcount
   lib/zstd/fse_decompress.o: in function `FSE_buildDTable_rle':
   (.text+0x3d4): relocation truncated to fit: R_PARISC_PCREL22F against symbol 
`_mcount' defined in .text.hot section in arch/parisc/kernel/entry.o
   hppa64-linux-ld: lib/zstd/fse_decompress.o(.text+0x440): cannot reach _mcount
   lib/zstd/fse_decompress.o: in function `FSE_buildDTable_raw':
   (.text+0x440): relocation truncated to fit: R_PARISC_PCREL22F against symbol 
`_mcount' defined in .text.hot section in arch/parisc/kernel/entry.o
   hppa64-linux-ld: lib/zstd/fse_decompress.o(.text+0x564): cannot reach _mcount
   lib/zstd/fse_decompress.o: in function `FSE_decompress_usingDTable':
   (.text+0x564): relocation truncated to fit: R_PARISC_PCREL22F against symbol 
`_mcount' defined in .text.hot section in arch/parisc/kernel/entry.o
   hppa64-linux-ld: lib/zstd/fse_decompress.o(.text+0x347c): cannot reach 
_mcount
   lib/zstd/fse_decompress.o: in function `FSE_decompress_wksp':
   (.text+0x347c): relocation truncated to fit: R_PARISC_PCREL22F against 
symbol `_mcount' defined in .text.hot section in arch/parisc/kernel/entry.o
   hppa64-linux-ld: lib/zstd/zstd_common.o(.text+0x2c): cannot reach _mcount
   lib/zstd/zstd_common.o: in function `ZSTD_stackAlloc':
   (.text+0x2c): additional relocation overflows omitted from the output
   hppa64-linux-ld: lib/zstd/zstd_common.o(.text+0x8c): cannot reach _mcount
   hppa64-linux-ld: lib/zstd/zstd_common.o(.text+0xe0): cannot reach _mcount
   hppa64-linux-ld: lib/zstd/zstd_common.o(.text+0x1c4): cannot reach _mcount
   hppa64-linux-ld: lib/zstd/zstd_common.o(.text+0x25c): cannot reach _mcount
   hppa64-linux-ld: lib/zstd/zstd_common.o(.text+0x2dc): cannot reach _mcount

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


.config.gz
Description: application/gzip


Re: [PATCH v6] lib: add basic KUnit test for lib/math

2021-04-16 Thread David Gow
On Sat, Apr 17, 2021 at 2:04 AM Daniel Latypov  wrote:
>
> Add basic test coverage for files that don't require any config options:
> * part of math.h (what seem to be the most commonly used macros)
> * gcd.c
> * lcm.c
> * int_sqrt.c
> * reciprocal_div.c
> (Ignored int_pow.c since it's a simple textbook algorithm.)
>
> These tests aren't particularly interesting, but they
> * provide short and simple examples of parameterized tests
> * provide a place to add tests for any new files in this dir
> * are written so adding new test cases to cover edge cases should be easy
>   * looking at code coverage, we hit all the branches in the .c files
>
> Signed-off-by: Daniel Latypov 
> Reviewed-by: David Gow 
> ---

Thanks: I've tested this version, and am happy with it. A part of me
still kind-of would like there to be names for the parameters, but I
definitely understand that it doesn't really work well for the lcm and
gcd cases where we're doing both (a,b) and (b,a). So let's keep it
as-is.

Hopefully we can get these in for 5.13!

Cheers,
-- David


Re: [External] Re: [PATCH v20 5/9] mm: hugetlb: defer freeing of HugeTLB pages

2021-04-16 Thread Muchun Song
On Sat, Apr 17, 2021 at 7:56 AM Mike Kravetz  wrote:
>
> On 4/15/21 1:40 AM, Muchun Song wrote:
> > In the subsequent patch, we should allocate the vmemmap pages when
> > freeing a HugeTLB page. But update_and_free_page() can be called
> > under any context, so we cannot use GFP_KERNEL to allocate vmemmap
> > pages. However, we can defer the actual freeing in a kworker to
> > prevent from using GFP_ATOMIC to allocate the vmemmap pages.
>
> Thanks!  I knew we would need to introduce a kworker for this when I
> removed the kworker previously used in free_huge_page.

Yeah, but another choice is using GFP_ATOMIC to allocate vmemmap
pages when we are in an atomic context. If not atomic context, just
use GFP_KERNEL. In this case, we can drop kworker.

>
> > The __update_and_free_page() is where the call to allocate vmemmmap
> > pages will be inserted.
>
> This patch adds the functionality required for __update_and_free_page
> to potentially sleep and fail.  More questions will come up in the
> subsequent patch when code must deal with the failures.

Right. More questions are welcome.

>
> >
> > Signed-off-by: Muchun Song 
> > ---
> >  mm/hugetlb.c | 73 
> > 
> >  mm/hugetlb_vmemmap.c | 12 -
> >  mm/hugetlb_vmemmap.h | 17 
> >  3 files changed, 85 insertions(+), 17 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 923d05e2806b..eeb8f5480170 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -1376,7 +1376,7 @@ static void remove_hugetlb_page(struct hstate *h, 
> > struct page *page,
> >   h->nr_huge_pages_node[nid]--;
> >  }
> >
> > -static void update_and_free_page(struct hstate *h, struct page *page)
> > +static void __update_and_free_page(struct hstate *h, struct page *page)
> >  {
> >   int i;
> >   struct page *subpage = page;
> > @@ -1399,12 +1399,73 @@ static void update_and_free_page(struct hstate *h, 
> > struct page *page)
> >   }
> >  }
> >
> > +/*
> > + * As update_and_free_page() can be called under any context, so we cannot
> > + * use GFP_KERNEL to allocate vmemmap pages. However, we can defer the
> > + * actual freeing in a workqueue to prevent from using GFP_ATOMIC to 
> > allocate
> > + * the vmemmap pages.
> > + *
> > + * free_hpage_workfn() locklessly retrieves the linked list of pages to be
> > + * freed and frees them one-by-one. As the page->mapping pointer is going
> > + * to be cleared in free_hpage_workfn() anyway, it is reused as the 
> > llist_node
> > + * structure of a lockless linked list of huge pages to be freed.
> > + */
> > +static LLIST_HEAD(hpage_freelist);
> > +
> > +static void free_hpage_workfn(struct work_struct *work)
> > +{
> > + struct llist_node *node;
> > +
> > + node = llist_del_all(_freelist);
> > +
> > + while (node) {
> > + struct page *page;
> > + struct hstate *h;
> > +
> > + page = container_of((struct address_space **)node,
> > +  struct page, mapping);
> > + node = node->next;
> > + page->mapping = NULL;
> > + h = page_hstate(page);
>
> The VM_BUG_ON_PAGE(!PageHuge(page), page) in page_hstate is going to
> trigger because a previous call to remove_hugetlb_page() will
> set_compound_page_dtor(page, NULL_COMPOUND_DTOR)

Sorry, I did not realise that. Thanks for your reminder.

>
> Note how h(hstate) is grabbed before calling update_and_free_page in
> existing code.
>
> We could potentially drop the !PageHuge(page) in page_hstate.  Or,
> perhaps just use 'size_to_hstate(page_size(page))' in free_hpage_workfn.

I prefer not to change the behavior of page_hstate(). So I
should use 'size_to_hstate(page_size(page))' directly.

Thanks Mike.


> --
> Mike Kravetz


Re: [PATCH v2 1/2] locking/atomics: Fixup GENERIC_ATOMIC64 conflict with atomic-arch-fallback.h

2021-04-16 Thread Guo Ren
Abandoned, it has duplicated definition export in gen-atomic-instrumented.sh

On Sat, Apr 17, 2021 at 10:57 AM  wrote:
>
> From: Guo Ren 
>
> Current GENERIC_ATOMIC64 in atomic-arch-fallback.h is broken. When a 32-bit
> arch use atomic-arch-fallback.h will cause compile error.
>
> In file included from include/linux/atomic.h:81,
> from include/linux/rcupdate.h:25,
> from include/linux/rculist.h:11,
> from include/linux/pid.h:5,
> from include/linux/sched.h:14,
> from arch/riscv/kernel/asm-offsets.c:10:
>include/linux/atomic-arch-fallback.h: In function 'arch_atomic64_inc':
> >> include/linux/atomic-arch-fallback.h:1447:2: error: implicit declaration 
> >> of function 'arch_atomic64_add'; did you mean 'arch_atomic_add'? 
> >> [-Werror=implicit-function-declaration]
> 1447 |  arch_atomic64_add(1, v);
>  |  ^
>  |  arch_atomic_add
>
> The atomic-arch-fallback.h & atomic-fallback.h &
> atomic-instrumented.h are generated by gen-atomic-fallback.sh &
> gen-atomic-instrumented.sh, so just take care the bash files.
>
> Remove the dependency of atomic-*-fallback.h in atomic64.h.
>
> Signed-off-by: Guo Ren 
> Cc: Peter Zijlstra 
> Cc: Arnd Bergmann 
> ---
>  include/asm-generic/atomic-instrumented.h | 307 
> +-
>  include/asm-generic/atomic64.h|  89 +
>  include/linux/atomic-arch-fallback.h  |   5 +-
>  include/linux/atomic-fallback.h   |   5 +-
>  scripts/atomic/gen-atomic-fallback.sh |   3 +-
>  scripts/atomic/gen-atomic-instrumented.sh |  23 ++-
>  6 files changed, 294 insertions(+), 138 deletions(-)
>
> diff --git a/include/asm-generic/atomic-instrumented.h 
> b/include/asm-generic/atomic-instrumented.h
> index 888b6cf..f6ce7a2 100644
> --- a/include/asm-generic/atomic-instrumented.h
> +++ b/include/asm-generic/atomic-instrumented.h
> @@ -831,6 +831,180 @@ atomic_dec_if_positive(atomic_t *v)
>  #define atomic_dec_if_positive atomic_dec_if_positive
>  #endif
>
> +#if !defined(arch_xchg_relaxed) || defined(arch_xchg)
> +#define xchg(ptr, ...) \
> +({ \
> +   typeof(ptr) __ai_ptr = (ptr); \
> +   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +   arch_xchg(__ai_ptr, __VA_ARGS__); \
> +})
> +#endif
> +
> +#if defined(arch_xchg_acquire)
> +#define xchg_acquire(ptr, ...) \
> +({ \
> +   typeof(ptr) __ai_ptr = (ptr); \
> +   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +   arch_xchg_acquire(__ai_ptr, __VA_ARGS__); \
> +})
> +#endif
> +
> +#if defined(arch_xchg_release)
> +#define xchg_release(ptr, ...) \
> +({ \
> +   typeof(ptr) __ai_ptr = (ptr); \
> +   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +   arch_xchg_release(__ai_ptr, __VA_ARGS__); \
> +})
> +#endif
> +
> +#if defined(arch_xchg_relaxed)
> +#define xchg_relaxed(ptr, ...) \
> +({ \
> +   typeof(ptr) __ai_ptr = (ptr); \
> +   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +   arch_xchg_relaxed(__ai_ptr, __VA_ARGS__); \
> +})
> +#endif
> +
> +#if !defined(arch_cmpxchg_relaxed) || defined(arch_cmpxchg)
> +#define cmpxchg(ptr, ...) \
> +({ \
> +   typeof(ptr) __ai_ptr = (ptr); \
> +   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +   arch_cmpxchg(__ai_ptr, __VA_ARGS__); \
> +})
> +#endif
> +
> +#if defined(arch_cmpxchg_acquire)
> +#define cmpxchg_acquire(ptr, ...) \
> +({ \
> +   typeof(ptr) __ai_ptr = (ptr); \
> +   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +   arch_cmpxchg_acquire(__ai_ptr, __VA_ARGS__); \
> +})
> +#endif
> +
> +#if defined(arch_cmpxchg_release)
> +#define cmpxchg_release(ptr, ...) \
> +({ \
> +   typeof(ptr) __ai_ptr = (ptr); \
> +   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +   arch_cmpxchg_release(__ai_ptr, __VA_ARGS__); \
> +})
> +#endif
> +
> +#if defined(arch_cmpxchg_relaxed)
> +#define cmpxchg_relaxed(ptr, ...) \
> +({ \
> +   typeof(ptr) __ai_ptr = (ptr); \
> +   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +   arch_cmpxchg_relaxed(__ai_ptr, __VA_ARGS__); \
> +})
> +#endif
> +
> +#if !defined(arch_cmpxchg64_relaxed) || defined(arch_cmpxchg64)
> +#define cmpxchg64(ptr, ...) \
> +({ \
> +   typeof(ptr) __ai_ptr = (ptr); \
> +   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +   arch_cmpxchg64(__ai_ptr, __VA_ARGS__); \
> +})
> +#endif
> +
> +#if defined(arch_cmpxchg64_acquire)
> +#define cmpxchg64_acquire(ptr, ...) \
> +({ \
> +   typeof(ptr) __ai_ptr = (ptr); \
> +   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +   arch_cmpxchg64_acquire(__ai_ptr, __VA_ARGS__); \
> +})
> +#endif
> +
> +#if defined(arch_cmpxchg64_release)
> +#define cmpxchg64_release(ptr, ...) \
> +({ \
> +   typeof(ptr) __ai_ptr = (ptr); \
> +   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
> +   

[tip:sched/core] BUILD SUCCESS WITH WARNING a1b93fc0377e73dd54f819a993f83291324bb54a

2021-04-16 Thread kernel test robot
   defconfig
arc  allyesconfig
nds32 allnoconfig
nds32   defconfig
nios2allyesconfig
alpha   defconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
arc defconfig
sh   allmodconfig
parisc  defconfig
s390 allyesconfig
s390 allmodconfig
parisc   allyesconfig
s390defconfig
sparcallyesconfig
sparc   defconfig
i386defconfig
mips allyesconfig
mips allmodconfig
powerpc  allyesconfig
powerpc   allnoconfig
i386 randconfig-a003-20210416
i386 randconfig-a006-20210416
i386 randconfig-a001-20210416
i386 randconfig-a005-20210416
i386 randconfig-a004-20210416
i386 randconfig-a002-20210416
x86_64   randconfig-a014-20210416
x86_64   randconfig-a015-20210416
x86_64   randconfig-a011-20210416
x86_64   randconfig-a013-20210416
x86_64   randconfig-a012-20210416
x86_64   randconfig-a016-20210416
i386 randconfig-a015-20210416
i386 randconfig-a014-20210416
i386 randconfig-a013-20210416
i386 randconfig-a012-20210416
i386 randconfig-a016-20210416
i386 randconfig-a011-20210416
riscvnommu_k210_defconfig
riscvnommu_virt_defconfig
riscv allnoconfig
riscv   defconfig
riscv  rv32_defconfig
um   allmodconfig
umallnoconfig
um   allyesconfig
um  defconfig
x86_64rhel-8.3-kselftests
x86_64  defconfig
x86_64   rhel-8.3
x86_64  rhel-8.3-kbuiltin
x86_64  kexec

clang tested configs:
x86_64   randconfig-a003-20210416
x86_64   randconfig-a002-20210416
x86_64   randconfig-a005-20210416
x86_64   randconfig-a001-20210416
x86_64   randconfig-a006-20210416
x86_64   randconfig-a004-20210416

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


[PATCH v2] Documentation: kunit: Update kunit_tool page

2021-04-16 Thread David Gow
The kunit_tool documentation page was pretty minimal, and a bit
outdated. Update it and flesh it out a bit.

In particular,
- Mention that .kunitconfig is now in the build directory
- Describe the use of --kunitconfig to specify a different config
  fragment
- Mention the split functionality (i.e., commands other than 'run')
- Describe --raw_output and kunit.py parse
- Mention the globbing support
- Provide a quick overview of other options, including --build_dir and
  --alltests

Note that this does overlap a little with the new running_tips page. I
don't think it's a problem having both: this page is supposed to be a
bit more of a reference, rather than a list of useful tips, so the fact
that they both describe the same features isn't a problem.

Signed-off-by: David Gow 
Reviewed-by: Daniel Latypov 
---

Adopted the changes from Daniel.

Changes since v1:
https://lore.kernel.org/linux-kselftest/20210416034036.797727-1-david...@google.com/
- Mention that the default build directory is '.kunit' when discussing
  '.kunitconfig' files.
- Reword the discussion of 'CONFIG_KUNIT_ALL_TESTS' under '--alltests'



 Documentation/dev-tools/kunit/kunit-tool.rst | 140 +--
 1 file changed, 132 insertions(+), 8 deletions(-)

diff --git a/Documentation/dev-tools/kunit/kunit-tool.rst 
b/Documentation/dev-tools/kunit/kunit-tool.rst
index 29ae2fee8123..4247b7420e3b 100644
--- a/Documentation/dev-tools/kunit/kunit-tool.rst
+++ b/Documentation/dev-tools/kunit/kunit-tool.rst
@@ -22,14 +22,19 @@ not require any virtualization support: it is just a 
regular program.
 What is a .kunitconfig?
 ===
 
-It's just a defconfig that kunit_tool looks for in the base directory.
-kunit_tool uses it to generate a .config as you might expect. In addition, it
-verifies that the generated .config contains the CONFIG options in the
-.kunitconfig; the reason it does this is so that it is easy to be sure that a
-CONFIG that enables a test actually ends up in the .config.
+It's just a defconfig that kunit_tool looks for in the build directory
+(``.kunit`` by default).  kunit_tool uses it to generate a .config as you might
+expect. In addition, it verifies that the generated .config contains the CONFIG
+options in the .kunitconfig; the reason it does this is so that it is easy to
+be sure that a CONFIG that enables a test actually ends up in the .config.
 
-How do I use kunit_tool?
-
+It's also possible to pass a separate .kunitconfig fragment to kunit_tool,
+which is useful if you have several different groups of tests you wish
+to run independently, or if you want to use pre-defined test configs for
+certain subsystems.
+
+Getting Started with kunit_tool
+===
 
 If a kunitconfig is present at the root directory, all you have to do is:
 
@@ -48,10 +53,129 @@ However, you most likely want to use it with the following 
options:
 
 .. note::
This command will work even without a .kunitconfig file: if no
-.kunitconfig is present, a default one will be used instead.
+   .kunitconfig is present, a default one will be used instead.
+
+If you wish to use a different .kunitconfig file (such as one provided for
+testing a particular subsystem), you can pass it as an option.
+
+.. code-block:: bash
+
+   ./tools/testing/kunit/kunit.py run --kunitconfig=fs/ext4/.kunitconfig
 
 For a list of all the flags supported by kunit_tool, you can run:
 
 .. code-block:: bash
 
./tools/testing/kunit/kunit.py run --help
+
+Configuring, Building, and Running Tests
+
+
+It's also possible to run just parts of the KUnit build process independently,
+which is useful if you want to make manual changes to part of the process.
+
+A .config can be generated from a .kunitconfig by using the ``config`` argument
+when running kunit_tool:
+
+.. code-block:: bash
+
+   ./tools/testing/kunit/kunit.py config
+
+Similarly, if you just want to build a KUnit kernel from the current .config,
+you can use the ``build`` argument:
+
+.. code-block:: bash
+
+   ./tools/testing/kunit/kunit.py build
+
+And, if you already have a built UML kernel with built-in KUnit tests, you can
+run the kernel and display the test results with the ``exec`` argument:
+
+.. code-block:: bash
+
+   ./tools/testing/kunit/kunit.py exec
+
+The ``run`` command which is discussed above is equivalent to running all three
+of these in sequence.
+
+All of these commands accept a number of optional command-line arguments. The
+``--help`` flag will give a complete list of these, or keep reading this page
+for a guide to some of the more useful ones.
+
+Parsing Test Results
+
+
+KUnit tests output their results in TAP (Test Anything Protocol) format.
+kunit_tool will, when running tests, parse this output and print a summary
+which is much more pleasant to read. If you wish to look at the raw test
+results in TAP format, you can 

Re: [PATCH net] net: fix use-after-free when UDP GRO with shared fraglist

2021-04-16 Thread Yunsheng Lin
On 2021/1/6 11:32, Dongseok Yi wrote:
> On 2021-01-06 12:07, Willem de Bruijn wrote:
>>
>> On Tue, Jan 5, 2021 at 8:29 PM Dongseok Yi  wrote:
>>>
>>> On 2021-01-05 06:03, Willem de Bruijn wrote:

 On Mon, Jan 4, 2021 at 4:00 AM Dongseok Yi  wrote:
>
> skbs in frag_list could be shared by pskb_expand_head() from BPF.

 Can you elaborate on the BPF connection?
>>>
>>> With the following registered ptypes,
>>>
>>> /proc/net # cat ptype
>>> Type Device  Function
>>> ALL   tpacket_rcv
>>> 0800  ip_rcv.cfi_jt
>>> 0011  llc_rcv.cfi_jt
>>> 0004  llc_rcv.cfi_jt
>>> 0806  arp_rcv
>>> 86dd  ipv6_rcv.cfi_jt
>>>
>>> BPF checks skb_ensure_writable between tpacket_rcv and ip_rcv
>>> (or ipv6_rcv). And it calls pskb_expand_head.
>>>
>>> [  132.051228] pskb_expand_head+0x360/0x378
>>> [  132.051237] skb_ensure_writable+0xa0/0xc4
>>> [  132.051249] bpf_skb_pull_data+0x28/0x60
>>> [  132.051262] bpf_prog_331d69c77ea5e964_schedcls_ingres+0x5f4/0x1000
>>> [  132.051273] cls_bpf_classify+0x254/0x348
>>> [  132.051284] tcf_classify+0xa4/0x180
>>
>> Ah, you have a BPF program loaded at TC. That was not entirely obvious.
>>
>> This program gets called after packet sockets with ptype_all, before
>> those with a specific protocol.
>>
>> Tcpdump will have inserted a program with ptype_all, which cloned the
>> skb. This triggers skb_ensure_writable -> pskb_expand_head ->
>> skb_clone_fraglist -> skb_get.
>>
>>> [  132.051294] __netif_receive_skb_core+0x590/0xd28
>>> [  132.051303] __netif_receive_skb+0x50/0x17c
>>> [  132.051312] process_backlog+0x15c/0x1b8
>>>

> While tcpdump, sk_receive_queue of PF_PACKET has the original frag_list.
> But the same frag_list is queued to PF_INET (or PF_INET6) as the fraglist
> chain made by skb_segment_list().
>
> If the new skb (not frag_list) is queued to one of the sk_receive_queue,
> multiple ptypes can see this. The skb could be released by ptypes and
> it causes use-after-free.

 If I understand correctly, a udp-gro-list skb makes it up the receive
 path with one or more active packet sockets.

 The packet socket will call skb_clone after accepting the filter. This
 replaces the head_skb, but shares the skb_shinfo and thus frag_list.

 udp_rcv_segment later converts the udp-gro-list skb to a list of
 regular packets to pass these one-by-one to udp_queue_rcv_one_skb.
 Now all the frags are fully fledged packets, with headers pushed
 before the payload. This does not change their refcount anymore than
 the skb_clone in pf_packet did. This should be 1.

 Eventually udp_recvmsg will call skb_consume_udp on each packet.

 The packet socket eventually also frees its cloned head_skb, which triggers

   kfree_skb_list(shinfo->frag_list)
 kfree_skb
   skb_unref
 refcount_dec_and_test(&skb->users)
>>>
>>> Every your understanding is right, but
>>>

>
> [ 4443.426215] [ cut here ]
> [ 4443.426222] refcount_t: underflow; use-after-free.
> [ 4443.426291] WARNING: CPU: 7 PID: 28161 at lib/refcount.c:190
> refcount_dec_and_test_checked+0xa4/0xc8
> [ 4443.426726] pstate: 6045 (nZCv daif +PAN -UAO)
> [ 4443.426732] pc : refcount_dec_and_test_checked+0xa4/0xc8
> [ 4443.426737] lr : refcount_dec_and_test_checked+0xa0/0xc8
> [ 4443.426808] Call trace:
> [ 4443.426813]  refcount_dec_and_test_checked+0xa4/0xc8
> [ 4443.426823]  skb_release_data+0x144/0x264
> [ 4443.426828]  kfree_skb+0x58/0xc4
> [ 4443.426832]  skb_queue_purge+0x64/0x9c
> [ 4443.426844]  packet_set_ring+0x5f0/0x820
> [ 4443.426849]  packet_setsockopt+0x5a4/0xcd0
> [ 4443.426853]  __sys_setsockopt+0x188/0x278
> [ 4443.426858]  __arm64_sys_setsockopt+0x28/0x38
> [ 4443.426869]  el0_svc_common+0xf0/0x1d0
> [ 4443.426873]  el0_svc_handler+0x74/0x98
> [ 4443.426880]  el0_svc+0x8/0xc
>
> Fixes: 3a1296a38d0c (net: Support GRO/GSO fraglist chaining.)
> Signed-off-by: Dongseok Yi 
> ---
>  net/core/skbuff.c | 20 +++-
>  1 file changed, 19 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f62cae3..1dcbda8 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -3655,7 +3655,8 @@ struct sk_buff *skb_segment_list(struct sk_buff 
> *skb,
> unsigned int delta_truesize = 0;
> unsigned int delta_len = 0;
> struct sk_buff *tail = NULL;
> -   struct sk_buff *nskb;
> +   struct sk_buff *nskb, *tmp;
> +   int err;
>
> skb_push(skb, -skb_network_offset(skb) + offset);
>
> @@ -3665,11 +3666,28 @@ struct sk_buff *skb_segment_list(struct sk_buff 
> *skb,
> nskb = list_skb;
> list_skb = list_skb->next;
>
> +   

[PATCH] watchdog: aspeed: fix hardware timeout calculation

2021-04-16 Thread rentao . bupt
From: Tao Ren 

Fix hardware timeout calculation in aspeed_wdt_set_timeout function to
ensure the reload value does not exceed the hardware limit.

Fixes: efa859f7d786 ("watchdog: Add Aspeed watchdog driver")
Reported-by: Amithash Prasad 
Signed-off-by: Tao Ren 
---
 drivers/watchdog/aspeed_wdt.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/watchdog/aspeed_wdt.c b/drivers/watchdog/aspeed_wdt.c
index 7e00960651fa..507fd815d767 100644
--- a/drivers/watchdog/aspeed_wdt.c
+++ b/drivers/watchdog/aspeed_wdt.c
@@ -147,7 +147,7 @@ static int aspeed_wdt_set_timeout(struct watchdog_device 
*wdd,
 
wdd->timeout = timeout;
 
-   actual = min(timeout, wdd->max_hw_heartbeat_ms * 1000);
+   actual = min(timeout, wdd->max_hw_heartbeat_ms / 1000);
 
writel(actual * WDT_RATE_1MHZ, wdt->base + WDT_RELOAD_VALUE);
writel(WDT_RESTART_MAGIC, wdt->base + WDT_RESTART);
-- 
2.17.1



[GIT PULL] libnvdimm fixes for v5.12-rc8 / final

2021-04-16 Thread Dan Williams
Hi Linus, please pull from:

...to receive a handful of libnvdimm fixups.

The largest change is for a regression that landed during -rc1 for
block-device read-only handling. Vaibhav found a new use for the
ability (originally introduced by virtio_pmem) to call back to the
platform to flush data, but also found an original bug in that
implementation. Lastly, Arnd cleans up some compile warnings in dax.

This has all appeared in -next with no reported issues.

---

The following changes since commit e49d033bddf5b565044e2abe4241353959bc9120:

  Linux 5.12-rc6 (2021-04-04 14:15:36 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
tags/libnvdimm-fixes-for-5.12-rc8

for you to fetch changes up to 11d2498f1568a0f923dc8ef7621de15a9e89267f:

  Merge branch 'for-5.12/dax' into libnvdimm-fixes (2021-04-09 22:00:09 -0700)


libnvdimm fixes for v5.12-rc8

- Fix a regression of read-only handling in the pmem driver.

- Fix a compile warning.

- Fix support for platform cache flush commands on powerpc/papr


Arnd Bergmann (1):
  dax: avoid -Wempty-body warnings

Dan Williams (2):
  libnvdimm: Notify disk drivers to revalidate region read-only
  Merge branch 'for-5.12/dax' into libnvdimm-fixes

Vaibhav Jain (1):
  libnvdimm/region: Fix nvdimm_has_flush() to handle ND_REGION_ASYNC

 drivers/dax/bus.c|  6 ++
 drivers/nvdimm/bus.c | 14 ++
 drivers/nvdimm/pmem.c| 37 +
 drivers/nvdimm/region_devs.c | 16 ++--
 include/linux/nd.h   |  1 +
 5 files changed, 56 insertions(+), 18 deletions(-)


Re: [syzbot] unexpected kernel reboot (4)

2021-04-16 Thread Tetsuo Handa
On 2021/04/15 0:39, Andrey Konovalov wrote:
> On Wed, Apr 14, 2021 at 7:45 AM Dmitry Vyukov  wrote:
>> The reproducer connects some USB HID device and communicates with the driver.
>> Previously we observed reboots because HID devices can trigger reboot
>> SYSRQ, but we disable it with "CONFIG_MAGIC_SYSRQ is not set".
>> How else can a USB device reboot the machine? Is it possible to disable it?
>> I don't see any direct includes of  in drivers/usb/*
> 
> This happens when a keyboard sends the Ctrl+Alt+Del sequence, see
> fn_boot_it()->ctrl_alt_del() in drivers/tty/vt/keyboard.c.
> 

Regarding ctrl_alt_del() problem, doing

  sh -c 'echo 0 > /proc/sys/kernel/ctrl-alt-del; echo $$ > 
/proc/sys/kernel/cad_pid'

as root before start fuzzing might help.

Also, with the command above, reproducer still triggers suspend operation which 
freezes userspace processes.
This could possibly be one of causes for no output / lost connections. Try 
disabling freeze/suspend related configs?

[   60.881255][ T6280] usb 5-1: new high-speed USB device number 2 using 
dummy_hcd
[   61.260648][ T6280] usb 5-1: config 0 interface 0 altsetting 0 endpoint 0x81 
has an invalid bInterval 0, changing to 7
[   61.274056][ T6280] usb 5-1: New USB device found, idVendor=0926, 
idProduct=, bcdDevice= 0.40
[   61.284700][ T6280] usb 5-1: New USB device strings: Mfr=0, Product=0, 
SerialNumber=0
[   61.289556][ T6280] usb 5-1: config 0 descriptor??
[   61.780871][ T6280] keytouch 0003:0926:.0002: fixing up Keytouch IEC 
report descriptor
[   61.792015][ T6280] input: HID 0926: as 
/devices/platform/dummy_hcd.0/usb5/5-1/5-1:0.0/0003:0926:.0002/input/input5
[   61.871612][ T6280] keytouch 0003:0926:.0002: input,hidraw1: USB HID 
v0.00 Keyboard [HID 0926:] on usb-dummy_hcd.0-1/input0
[   62.137706][ T6847] PM: suspend entry (s2idle)
[   62.147914][ T6847] Filesystems sync: 0.007 seconds
[   62.152031][ T6847] Freezing user space processes ... (elapsed 0.003 
seconds) done.
[   62.158369][ T6847] OOM killer disabled.
[   62.159673][ T6847] Freezing remaining freezable tasks ... (elapsed 0.003 
seconds) done.
[   62.167440][ T6847] vhci_hcd vhci_hcd.15: suspend vhci_hcd
[   62.169569][ T6847] vhci_hcd vhci_hcd.14: suspend vhci_hcd
[   62.171562][ T6847] vhci_hcd vhci_hcd.13: suspend vhci_hcd
[   62.173500][ T6847] vhci_hcd vhci_hcd.12: suspend vhci_hcd
[   62.175740][ T6847] vhci_hcd vhci_hcd.11: suspend vhci_hcd
[   62.177677][ T6847] vhci_hcd vhci_hcd.10: suspend vhci_hcd
[   62.179725][ T6847] vhci_hcd vhci_hcd.9: suspend vhci_hcd
[   62.181602][ T6847] vhci_hcd vhci_hcd.8: suspend vhci_hcd
[   62.183681][ T6847] vhci_hcd vhci_hcd.7: suspend vhci_hcd
[   62.185594][ T6847] vhci_hcd vhci_hcd.6: suspend vhci_hcd
[   62.187552][ T6847] vhci_hcd vhci_hcd.5: suspend vhci_hcd
[   62.189566][ T6847] vhci_hcd vhci_hcd.4: suspend vhci_hcd
[   62.191767][ T6847] vhci_hcd vhci_hcd.3: suspend vhci_hcd
[   62.193657][ T6847] vhci_hcd vhci_hcd.2: suspend vhci_hcd
[   62.195634][ T6847] vhci_hcd vhci_hcd.1: suspend vhci_hcd
[   62.197430][ T6847] vhci_hcd vhci_hcd.0: suspend vhci_hcd
[   62.249881][T8] mptbase: ioc0: pci-suspend: pdev=0x888005495000, 
slot=:00:10.0, Entering operating state [D0]



Re: [PATCH 1/1] mm: Fix struct page layout on 32-bit systems

2021-04-16 Thread Matthew Wilcox
On Fri, Apr 16, 2021 at 07:08:23PM +0200, Jesper Dangaard Brouer wrote:
> On Fri, 16 Apr 2021 16:27:55 +0100
> Matthew Wilcox  wrote:
> 
> > On Thu, Apr 15, 2021 at 08:08:32PM +0200, Jesper Dangaard Brouer wrote:
> > > See below patch.  Where I swap32 the dma address to satisfy
> > > page->compound having bit zero cleared. (It is the simplest fix I could
> > > come up with).  
> > 
> > I think this is slightly simpler, and as a bonus code that assumes the
> > old layout won't compile.
> 
> This is clever, I like it!  When reading the code one just have to
> remember 'unsigned long' size difference between 64-bit vs 32-bit.
> And I assume compiler can optimize the sizeof check out then doable.

I checked before/after with the replacement patch that doesn't
have compiler warnings.  On x86, there is zero codegen difference
(objdump -dr before/after matches exactly) for both x86-32 with 32-bit
dma_addr_t and x86-64.  For x86-32 with 64-bit dma_addr_t, the compiler
makes some different inlining decisions in page_pool_empty_ring(),
page_pool_put_page() and page_pool_put_page_bulk(), but it's not clear
to me that they're wrong.

Function old new   delta
page_pool_empty_ring 387 307 -80
page_pool_put_page   604 516 -88
page_pool_put_page_bulk  690 517-173



[PATCH V2 2/9] platform/x86: intel_pmc_core: Remove global struct pmc_dev

2021-04-16 Thread David E. Box
The intel_pmc_core driver did not always bind to a device which meant it
lacked a struct device that could be used to maintain driver data. So a
global instance of struct pmc_dev was used for this purpose and functions
accessed this directly. Since the driver now binds to an ACPI device,
remove the global pmc_dev in favor of one that is allocated during probe.
Modify users of the global to obtain the object by argument instead.

Signed-off-by: David E. Box 
Reviewed-by: Hans de Goede 
Reviewed-by: Rajneesh Bhardwaj 
---

V2: No change

 drivers/platform/x86/intel_pmc_core.c | 41 ++-
 1 file changed, 21 insertions(+), 20 deletions(-)

diff --git a/drivers/platform/x86/intel_pmc_core.c 
b/drivers/platform/x86/intel_pmc_core.c
index 07657532ccdb..e8474d171d23 100644
--- a/drivers/platform/x86/intel_pmc_core.c
+++ b/drivers/platform/x86/intel_pmc_core.c
@@ -31,8 +31,6 @@
 
 #include "intel_pmc_core.h"
 
-static struct pmc_dev pmc;
-
 /* PKGC MSRs are common across Intel Core SoCs */
 static const struct pmc_bit_map msr_map[] = {
{"Package C2",  MSR_PKG_C2_RESIDENCY},
@@ -729,9 +727,8 @@ static int pmc_core_dev_state_get(void *data, u64 *val)
 
 DEFINE_DEBUGFS_ATTRIBUTE(pmc_core_dev_state, pmc_core_dev_state_get, NULL, 
"%llu\n");
 
-static int pmc_core_check_read_lock_bit(void)
+static int pmc_core_check_read_lock_bit(struct pmc_dev *pmcdev)
 {
-   struct pmc_dev *pmcdev = &pmc;
u32 value;
 
value = pmc_core_reg_read(pmcdev, pmcdev->map->pm_cfg_offset);
@@ -856,28 +853,26 @@ static int pmc_core_ppfear_show(struct seq_file *s, void 
*unused)
 DEFINE_SHOW_ATTRIBUTE(pmc_core_ppfear);
 
 /* This function should return link status, 0 means ready */
-static int pmc_core_mtpmc_link_status(void)
+static int pmc_core_mtpmc_link_status(struct pmc_dev *pmcdev)
 {
-   struct pmc_dev *pmcdev = &pmc;
u32 value;
 
value = pmc_core_reg_read(pmcdev, SPT_PMC_PM_STS_OFFSET);
return value & BIT(SPT_PMC_MSG_FULL_STS_BIT);
 }
 
-static int pmc_core_send_msg(u32 *addr_xram)
+static int pmc_core_send_msg(struct pmc_dev *pmcdev, u32 *addr_xram)
 {
-   struct pmc_dev *pmcdev = &pmc;
u32 dest;
int timeout;
 
for (timeout = NUM_RETRIES; timeout > 0; timeout--) {
-   if (pmc_core_mtpmc_link_status() == 0)
+   if (pmc_core_mtpmc_link_status(pmcdev) == 0)
break;
msleep(5);
}
 
-   if (timeout <= 0 && pmc_core_mtpmc_link_status())
+   if (timeout <= 0 && pmc_core_mtpmc_link_status(pmcdev))
return -EBUSY;
 
dest = (*addr_xram & MTPMC_MASK) | (1U << 1);
@@ -903,7 +898,7 @@ static int pmc_core_mphy_pg_show(struct seq_file *s, void 
*unused)
 
	mutex_lock(&pmcdev->lock);
 
-   if (pmc_core_send_msg(&mphy_core_reg_low) != 0) {
+   if (pmc_core_send_msg(pmcdev, &mphy_core_reg_low) != 0) {
err = -EBUSY;
goto out_unlock;
}
@@ -911,7 +906,7 @@ static int pmc_core_mphy_pg_show(struct seq_file *s, void 
*unused)
msleep(10);
val_low = pmc_core_reg_read(pmcdev, SPT_PMC_MFPMC_OFFSET);
 
-   if (pmc_core_send_msg(&mphy_core_reg_high) != 0) {
+   if (pmc_core_send_msg(pmcdev, &mphy_core_reg_high) != 0) {
err = -EBUSY;
goto out_unlock;
}
@@ -954,7 +949,7 @@ static int pmc_core_pll_show(struct seq_file *s, void 
*unused)
mphy_common_reg  = (SPT_PMC_MPHY_COM_STS_0 << 16);
	mutex_lock(&pmcdev->lock);
 
-   if (pmc_core_send_msg(&mphy_common_reg) != 0) {
+   if (pmc_core_send_msg(pmcdev, &mphy_common_reg) != 0) {
err = -EBUSY;
goto out_unlock;
}
@@ -975,9 +970,8 @@ static int pmc_core_pll_show(struct seq_file *s, void 
*unused)
 }
 DEFINE_SHOW_ATTRIBUTE(pmc_core_pll);
 
-static int pmc_core_send_ltr_ignore(u32 value)
+static int pmc_core_send_ltr_ignore(struct pmc_dev *pmcdev, u32 value)
 {
-   struct pmc_dev *pmcdev = &pmc;
const struct pmc_reg_map *map = pmcdev->map;
u32 reg;
int err = 0;
@@ -1003,6 +997,8 @@ static ssize_t pmc_core_ltr_ignore_write(struct file *file,
 const char __user *userbuf,
 size_t count, loff_t *ppos)
 {
+   struct seq_file *s = file->private_data;
+   struct pmc_dev *pmcdev = s->private;
u32 buf_size, value;
int err;
 
@@ -1012,7 +1008,7 @@ static ssize_t pmc_core_ltr_ignore_write(struct file 
*file,
if (err)
return err;
 
-   err = pmc_core_send_ltr_ignore(value);
+   err = pmc_core_send_ltr_ignore(pmcdev, value);
 
return err == 0 ? count : err;
 }
@@ -1340,13 +1336,19 @@ static void pmc_core_do_dmi_quirks(struct pmc_dev 
*pmcdev)
 static int pmc_core_probe(struct platform_device *pdev)
 {
static bool device_initialized;
-   struct pmc_dev *pmcdev = &pmc;
+   struct pmc_dev *pmcdev;
const struct x86_cpu_id 

[PATCH V2 3/9] platform/x86: intel_pmc_core: Handle sub-states generically

2021-04-16 Thread David E. Box
From: Gayatri Kammela 

The current implementation of pmc_core_substate_res_show() is written
specifically for Tiger Lake. However, new platform will also have
sub-states and may support different modes. Therefore rewrite the code to
handle sub-states generically.

Obtain the number and type of enabled states from the PMC. Use the Low
Power Mode (LPM) priority register to store the states in order from
shallowest to deepest for displays. Add a for_each macro to simplify
this. While changing the sub-state display it makes sense to show only the
"enabled" sub-states instead of showing all possible ones. After this
patch, the debugfs file looks like this:

Substate   Residency
S0i2.0 0
S0i3.0 0
S0i2.1 9329279
S0i3.1 0
S0i3.2 0

Suggested-by: David E. Box 
Signed-off-by: Gayatri Kammela 
Signed-off-by: David E. Box 
Reviewed-by: Hans de Goede 
Acked-by: Rajneesh Bhardwaj 
---

V2: Renamed num_modes to num_lpm_modes as suggested by Rajneesh

 drivers/platform/x86/intel_pmc_core.c | 59 ++-
 drivers/platform/x86/intel_pmc_core.h | 18 +++-
 2 files changed, 64 insertions(+), 13 deletions(-)

diff --git a/drivers/platform/x86/intel_pmc_core.c 
b/drivers/platform/x86/intel_pmc_core.c
index e8474d171d23..c02f63c00ecc 100644
--- a/drivers/platform/x86/intel_pmc_core.c
+++ b/drivers/platform/x86/intel_pmc_core.c
@@ -579,8 +579,9 @@ static const struct pmc_reg_map tgl_reg_map = {
.pm_cfg_offset = CNP_PMC_PM_CFG_OFFSET,
.pm_read_disable_bit = CNP_PMC_READ_DISABLE_BIT,
.ltr_ignore_max = TGL_NUM_IP_IGN_ALLOWED,
-   .lpm_modes = tgl_lpm_modes,
+   .lpm_num_maps = TGL_LPM_NUM_MAPS,
.lpm_en_offset = TGL_LPM_EN_OFFSET,
+   .lpm_priority_offset = TGL_LPM_PRI_OFFSET,
.lpm_residency_offset = TGL_LPM_RESIDENCY_OFFSET,
.lpm_sts = tgl_lpm_maps,
.lpm_status_offset = TGL_LPM_STATUS_OFFSET,
@@ -1140,18 +1141,14 @@ DEFINE_SHOW_ATTRIBUTE(pmc_core_ltr);
 static int pmc_core_substate_res_show(struct seq_file *s, void *unused)
 {
struct pmc_dev *pmcdev = s->private;
-   const char **lpm_modes = pmcdev->map->lpm_modes;
u32 offset = pmcdev->map->lpm_residency_offset;
-   u32 lpm_en;
-   int index;
+   int i, mode;
 
-   lpm_en = pmc_core_reg_read(pmcdev, pmcdev->map->lpm_en_offset);
-   seq_printf(s, "status substate residency\n");
-   for (index = 0; lpm_modes[index]; index++) {
-   seq_printf(s, "%7s %7s %-15u\n",
-  BIT(index) & lpm_en ? "Enabled" : " ",
-  lpm_modes[index], pmc_core_reg_read(pmcdev, offset));
-   offset += 4;
+   seq_printf(s, "%-10s %-15s\n", "Substate", "Residency");
+
+   pmc_for_each_mode(i, mode, pmcdev) {
+   seq_printf(s, "%-10s %-15u\n", pmc_lpm_modes[mode],
+  pmc_core_reg_read(pmcdev, offset + (4 * mode)));
}
 
return 0;
@@ -1203,6 +1200,45 @@ static int pmc_core_pkgc_show(struct seq_file *s, void 
*unused)
 }
 DEFINE_SHOW_ATTRIBUTE(pmc_core_pkgc);
 
+static void pmc_core_get_low_power_modes(struct pmc_dev *pmcdev)
+{
+   u8 lpm_priority[LPM_MAX_NUM_MODES];
+   u32 lpm_en;
+   int mode, i, p;
+
+   /* Use LPM Maps to indicate support for substates */
+   if (!pmcdev->map->lpm_num_maps)
+   return;
+
+   lpm_en = pmc_core_reg_read(pmcdev, pmcdev->map->lpm_en_offset);
+   pmcdev->num_lpm_modes = hweight32(lpm_en);
+
+   /* Each byte contains information for 2 modes (7:4 and 3:0) */
+   for (mode = 0; mode < LPM_MAX_NUM_MODES; mode += 2) {
+   u8 priority = pmc_core_reg_read_byte(pmcdev,
+   pmcdev->map->lpm_priority_offset + (mode / 2));
+   int pri0 = GENMASK(3, 0) & priority;
+   int pri1 = (GENMASK(7, 4) & priority) >> 4;
+
+   lpm_priority[pri0] = mode;
+   lpm_priority[pri1] = mode + 1;
+   }
+
+   /*
+* Loop though all modes from lowest to highest priority,
+* and capture all enabled modes in order
+*/
+   i = 0;
+   for (p = LPM_MAX_NUM_MODES - 1; p >= 0; p--) {
+   int mode = lpm_priority[p];
+
+   if (!(BIT(mode) & lpm_en))
+   continue;
+
+   pmcdev->lpm_en_modes[i++] = mode;
+   }
+}
+
 static void pmc_core_dbgfs_unregister(struct pmc_dev *pmcdev)
 {
debugfs_remove_recursive(pmcdev->dbgfs_dir);
@@ -1379,6 +1415,7 @@ static int pmc_core_probe(struct platform_device *pdev)
 
	mutex_init(&pmcdev->lock);
pmcdev->pmc_xram_read_bit = pmc_core_check_read_lock_bit(pmcdev);
+   pmc_core_get_low_power_modes(pmcdev);
pmc_core_do_dmi_quirks(pmcdev);
 
/*
diff --git a/drivers/platform/x86/intel_pmc_core.h 
b/drivers/platform/x86/intel_pmc_core.h
index 98ebdfe57138..2ffe0eba36e1 100644
--- a/drivers/platform/x86/intel_pmc_core.h
+++ 

[PATCH V2 7/9] platform/x86: intel_pmc_core: Add option to set/clear LPM mode

2021-04-16 Thread David E. Box
By default the Low Power Mode (LPM or sub-state) status registers will
latch condition status on every entry into Package C10. This is
configurable in the PMC to allow latching on any achievable sub-state. Add
a debugfs file to support this.

Also add the option to clear the status registers to 0. Clearing the status
registers before testing removes ambiguity around when the current values
were set.

The new file, latch_lpm_mode, looks like this:

[c10] S0i2.0 S0i3.0 S0i2.1 S0i3.1 S0i3.2 clear

Signed-off-by: David E. Box 
Reviewed-by: Hans de Goede 
---

V2: - Rebase on Tamar/Tomas global reset patch that already adds
  Extended Test Register 3
- In write function, make sure count is 1 less than buffer to reserve
  space for '\0'
- Use sysfs_streq to properly compare the input string

 drivers/platform/x86/intel_pmc_core.c | 112 ++
 drivers/platform/x86/intel_pmc_core.h |  20 +
 2 files changed, 132 insertions(+)

diff --git a/drivers/platform/x86/intel_pmc_core.c 
b/drivers/platform/x86/intel_pmc_core.c
index 684f13f0c4a5..97cf3384c4c0 100644
--- a/drivers/platform/x86/intel_pmc_core.c
+++ b/drivers/platform/x86/intel_pmc_core.c
@@ -586,6 +586,7 @@ static const struct pmc_reg_map tgl_reg_map = {
.ltr_ignore_max = TGL_NUM_IP_IGN_ALLOWED,
.lpm_num_maps = TGL_LPM_NUM_MAPS,
.lpm_res_counter_step_x2 = TGL_PMC_LPM_RES_COUNTER_STEP_X2,
+   .lpm_sts_latch_en_offset = TGL_LPM_STS_LATCH_EN_OFFSET,
.lpm_en_offset = TGL_LPM_EN_OFFSET,
.lpm_priority_offset = TGL_LPM_PRI_OFFSET,
.lpm_residency_offset = TGL_LPM_RESIDENCY_OFFSET,
@@ -1321,6 +1322,114 @@ static int pmc_core_substate_req_regs_show(struct 
seq_file *s, void *unused)
 }
 DEFINE_SHOW_ATTRIBUTE(pmc_core_substate_req_regs);
 
+static int pmc_core_lpm_latch_mode_show(struct seq_file *s, void *unused)
+{
+   struct pmc_dev *pmcdev = s->private;
+   bool c10;
+   u32 reg;
+   int idx, mode;
+
+   reg = pmc_core_reg_read(pmcdev, pmcdev->map->lpm_sts_latch_en_offset);
+   if (reg & LPM_STS_LATCH_MODE) {
+   seq_puts(s, "c10");
+   c10 = false;
+   } else {
+   seq_puts(s, "[c10]");
+   c10 = true;
+   }
+
+   pmc_for_each_mode(idx, mode, pmcdev) {
+   if ((BIT(mode) & reg) && !c10)
+   seq_printf(s, " [%s]", pmc_lpm_modes[mode]);
+   else
+   seq_printf(s, " %s", pmc_lpm_modes[mode]);
+   }
+
+   seq_puts(s, " clear\n");
+
+   return 0;
+}
+
+static ssize_t pmc_core_lpm_latch_mode_write(struct file *file,
+const char __user *userbuf,
+size_t count, loff_t *ppos)
+{
+   struct seq_file *s = file->private_data;
+   struct pmc_dev *pmcdev = s->private;
+   bool clear = false, c10 = false;
+   unsigned char buf[8];
+   size_t ret;
+   int idx, m, mode;
+   u32 reg;
+
+   if (count > sizeof(buf) - 1)
+   return -EINVAL;
+
+   ret = simple_write_to_buffer(buf, sizeof(buf) - 1, ppos, userbuf, 
count);
+   if (ret < 0)
+   return ret;
+
+   buf[count] = '\0';
+
+   /*
+* Allowed strings are:
+*  Any enabled substate, e.g. 'S0i2.0'
+*  'c10'
+*  'clear'
+*/
+   mode = sysfs_match_string(pmc_lpm_modes, buf);
+
+   /* Check string matches enabled mode */
+   pmc_for_each_mode(idx, m, pmcdev)
+   if (mode == m)
+   break;
+
+   if (mode != m || mode < 0) {
+   if (sysfs_streq(buf, "clear"))
+   clear = true;
+   else if (sysfs_streq(buf, "c10"))
+   c10 = true;
+   else
+   return -EINVAL;
+   }
+
+   if (clear) {
+   mutex_lock(&pmcdev->lock);
+
+   reg = pmc_core_reg_read(pmcdev, pmcdev->map->etr3_offset);
+   reg |= ETR3_CLEAR_LPM_EVENTS;
+   pmc_core_reg_write(pmcdev, pmcdev->map->etr3_offset, reg);
+
+   mutex_unlock(&pmcdev->lock);
+
+   return count;
+   }
+
+   if (c10) {
+   mutex_lock(&pmcdev->lock);
+
+   reg = pmc_core_reg_read(pmcdev, 
pmcdev->map->lpm_sts_latch_en_offset);
+   reg &= ~LPM_STS_LATCH_MODE;
+   pmc_core_reg_write(pmcdev, 
pmcdev->map->lpm_sts_latch_en_offset, reg);
+
+   mutex_unlock(&pmcdev->lock);
+
+   return count;
+   }
+
+   /*
+* For LPM mode latching we set the latch enable bit and selected mode
+* and clear everything else.
+*/
+   reg = LPM_STS_LATCH_MODE | BIT(mode);
+   mutex_lock(&pmcdev->lock);
+   pmc_core_reg_write(pmcdev, pmcdev->map->lpm_sts_latch_en_offset, reg);
+   mutex_unlock(&pmcdev->lock);
+
+   return count;
+}

[PATCH V2 5/9] platform/x86: intel_pmc_core: Get LPM requirements for Tiger Lake

2021-04-16 Thread David E. Box
From: Gayatri Kammela 

Platforms that support low power modes (LPM) such as Tiger Lake maintain
requirements for each sub-state that a readable in the PMC. However, unlike
LPM status registers, requirement registers are not memory mapped but are
available from an ACPI _DSM. Collect the requirements for Tiger Lake using
the _DSM method and store in a buffer.

Signed-off-by: Gayatri Kammela 
Co-developed-by: David E. Box 
Signed-off-by: David E. Box 
Reviewed-by: Hans de Goede 
---

V2: - Move buffer allocation so that it does not need to be freed
  (which was missing anyway) when an error is encountered.
- Use label to free out_obj after errors
- Use memcpy instead of memcpy_fromio for ACPI memory

 drivers/platform/x86/intel_pmc_core.c | 56 +++
 drivers/platform/x86/intel_pmc_core.h |  2 +
 2 files changed, 58 insertions(+)

diff --git a/drivers/platform/x86/intel_pmc_core.c 
b/drivers/platform/x86/intel_pmc_core.c
index 0e59a84b51bf..97efe9a6bd01 100644
--- a/drivers/platform/x86/intel_pmc_core.c
+++ b/drivers/platform/x86/intel_pmc_core.c
@@ -23,7 +23,9 @@
 #include 
 #include 
 #include 
+#include 
 
+#include 
 #include 
 #include 
 #include 
@@ -31,6 +33,9 @@
 
 #include "intel_pmc_core.h"
 
+#define ACPI_S0IX_DSM_UUID "57a6512e-3979-4e9d-9708-ff13b2508972"
+#define ACPI_GET_LOW_MODE_REGISTERS1
+
 /* PKGC MSRs are common across Intel Core SoCs */
 static const struct pmc_bit_map msr_map[] = {
{"Package C2",  MSR_PKG_C2_RESIDENCY},
@@ -590,6 +595,53 @@ static const struct pmc_reg_map tgl_reg_map = {
.etr3_offset = ETR3_OFFSET,
 };
 
+static void pmc_core_get_tgl_lpm_reqs(struct platform_device *pdev)
+{
+   struct pmc_dev *pmcdev = platform_get_drvdata(pdev);
+   const int num_maps = pmcdev->map->lpm_num_maps;
+   size_t lpm_size = LPM_MAX_NUM_MODES * num_maps * 4;
+   union acpi_object *out_obj;
+   struct acpi_device *adev;
+   guid_t s0ix_dsm_guid;
+   u32 *lpm_req_regs, *addr;
+
+   adev = ACPI_COMPANION(>dev);
+   if (!adev)
+   return;
+
+   guid_parse(ACPI_S0IX_DSM_UUID, &s0ix_dsm_guid);
+
+   out_obj = acpi_evaluate_dsm(adev->handle, &s0ix_dsm_guid, 0,
+   ACPI_GET_LOW_MODE_REGISTERS, NULL);
+   if (out_obj && out_obj->type == ACPI_TYPE_BUFFER) {
+   int size = out_obj->buffer.length;
+
+   if (size != lpm_size) {
+   acpi_handle_debug(adev->handle,
+   "_DSM returned unexpected buffer size,"
+   " have %d, expect %ld\n", size, lpm_size);
+   goto free_acpi_obj;
+   }
+   } else {
+   acpi_handle_debug(adev->handle,
+ "_DSM function 0 evaluation failed\n");
+   goto free_acpi_obj;
+   }
+
+   addr = (u32 *)out_obj->buffer.pointer;
+
+   lpm_req_regs = devm_kzalloc(>dev, lpm_size * sizeof(u32),
+GFP_KERNEL);
+   if (!lpm_req_regs)
+   goto free_acpi_obj;
+
+   memcpy(lpm_req_regs, addr, lpm_size);
+   pmcdev->lpm_req_regs = lpm_req_regs;
+
+free_acpi_obj:
+   ACPI_FREE(out_obj);
+}
+
 static inline u32 pmc_core_reg_read(struct pmc_dev *pmcdev, int reg_offset)
 {
return readl(pmcdev->regbase + reg_offset);
@@ -1424,10 +1476,14 @@ static int pmc_core_probe(struct platform_device *pdev)
return -ENOMEM;
 
	mutex_init(&pmcdev->lock);
+
pmcdev->pmc_xram_read_bit = pmc_core_check_read_lock_bit(pmcdev);
pmc_core_get_low_power_modes(pmcdev);
pmc_core_do_dmi_quirks(pmcdev);
 
+   if (pmcdev->map == _reg_map)
+   pmc_core_get_tgl_lpm_reqs(pdev);
+
/*
 * On TGL, due to a hardware limitation, the GBE LTR blocks PC10 when
 * a cable is attached. Tell the PMC to ignore it.
diff --git a/drivers/platform/x86/intel_pmc_core.h 
b/drivers/platform/x86/intel_pmc_core.h
index aa44fd5399cc..64fb368f40f6 100644
--- a/drivers/platform/x86/intel_pmc_core.h
+++ b/drivers/platform/x86/intel_pmc_core.h
@@ -294,6 +294,7 @@ struct pmc_reg_map {
  * @s0ix_counter:  S0ix residency (step adjusted)
  * @num_lpm_modes: Count of enabled modes
  * @lpm_en_modes:  Array of enabled modes from lowest to highest priority
+ * @lpm_req_regs:  List of substate requirements
  *
  * pmc_dev contains info about power management controller device.
  */
@@ -310,6 +311,7 @@ struct pmc_dev {
u64 s0ix_counter;
int num_lpm_modes;
int lpm_en_modes[LPM_MAX_NUM_MODES];
+   u32 *lpm_req_regs;
 };
 
 #define pmc_for_each_mode(i, mode, pmcdev) \
-- 
2.25.1



[PATCH V2 6/9] platform/x86: intel_pmc_core: Add requirements file to debugfs

2021-04-16 Thread David E. Box
From: Gayatri Kammela 

Add the debugfs file, substate_requirements, to view the low power mode
(LPM) requirements for each enabled mode alongside the last latched status
of the condition.

After this patch, the new file will look like this:

Element |S0i2.0 |S0i3.0 |S0i2.1 |S0i3.1 |   
 S0i3.2 |Status |
USB2PLL_OFF_STS |  Required |  Required |  Required |  Required |  
Required |   |
PCIe/USB3.1_Gen2PLL_OFF_STS |  Required |  Required |  Required |  Required |  
Required |   |
   PCIe_Gen3PLL_OFF_STS |  Required |  Required |  Required |  Required |  
Required |   Yes |
OPIOPLL_OFF_STS |  Required |  Required |  Required |  Required |  
Required |   Yes |
  OCPLL_OFF_STS |  Required |  Required |  Required |  Required |  
Required |   Yes |
MainPLL_OFF_STS |   |  Required |   |  Required |  
Required |   |

Signed-off-by: Gayatri Kammela 
Co-developed-by: David E. Box 
Signed-off-by: David E. Box 
Reviewed-by: Hans de Goede 
---

V2: No change

 drivers/platform/x86/intel_pmc_core.c | 86 +++
 1 file changed, 86 insertions(+)

diff --git a/drivers/platform/x86/intel_pmc_core.c 
b/drivers/platform/x86/intel_pmc_core.c
index 97efe9a6bd01..684f13f0c4a5 100644
--- a/drivers/platform/x86/intel_pmc_core.c
+++ b/drivers/platform/x86/intel_pmc_core.c
@@ -1241,6 +1241,86 @@ static int pmc_core_substate_l_sts_regs_show(struct 
seq_file *s, void *unused)
 }
 DEFINE_SHOW_ATTRIBUTE(pmc_core_substate_l_sts_regs);
 
+static void pmc_core_substate_req_header_show(struct seq_file *s)
+{
+   struct pmc_dev *pmcdev = s->private;
+   int i, mode;
+
+   seq_printf(s, "%30s |", "Element");
+   pmc_for_each_mode(i, mode, pmcdev)
+   seq_printf(s, " %9s |", pmc_lpm_modes[mode]);
+
+   seq_printf(s, " %9s |\n", "Status");
+}
+
+static int pmc_core_substate_req_regs_show(struct seq_file *s, void *unused)
+{
+   struct pmc_dev *pmcdev = s->private;
+   const struct pmc_bit_map **maps = pmcdev->map->lpm_sts;
+   const struct pmc_bit_map *map;
+   const int num_maps = pmcdev->map->lpm_num_maps;
+   u32 sts_offset = pmcdev->map->lpm_status_offset;
+   u32 *lpm_req_regs = pmcdev->lpm_req_regs;
+   int mp;
+
+   /* Display the header */
+   pmc_core_substate_req_header_show(s);
+
+   /* Loop over maps */
+   for (mp = 0; mp < num_maps; mp++) {
+   u32 req_mask = 0;
+   u32 lpm_status;
+   int mode, idx, i, len = 32;
+
+   /*
+* Capture the requirements and create a mask so that we only
+* show an element if it's required for at least one of the
+* enabled low power modes
+*/
+   pmc_for_each_mode(idx, mode, pmcdev)
+   req_mask |= lpm_req_regs[mp + (mode * num_maps)];
+
+   /* Get the last latched status for this map */
+   lpm_status = pmc_core_reg_read(pmcdev, sts_offset + (mp * 4));
+
+   /*  Loop over elements in this map */
+   map = maps[mp];
+   for (i = 0; map[i].name && i < len; i++) {
+   u32 bit_mask = map[i].bit_mask;
+
+   if (!(bit_mask & req_mask))
+   /*
+* Not required for any enabled states
+* so don't display
+*/
+   continue;
+
+   /* Display the element name in the first column */
+   seq_printf(s, "%30s |", map[i].name);
+
+   /* Loop over the enabled states and display if required 
*/
+   pmc_for_each_mode(idx, mode, pmcdev) {
+   if (lpm_req_regs[mp + (mode * num_maps)] & 
bit_mask)
+   seq_printf(s, " %9s |",
+  "Required");
+   else
+   seq_printf(s, " %9s |", " ");
+   }
+
+   /* In Status column, show the last captured state of 
this agent */
+   if (lpm_status & bit_mask)
+   seq_printf(s, " %9s |", "Yes");
+   else
+   seq_printf(s, " %9s |", " ");
+
+   seq_puts(s, "\n");
+   }
+   }
+
+   return 0;
+}
+DEFINE_SHOW_ATTRIBUTE(pmc_core_substate_req_regs);
+
 static int pmc_core_pkgc_show(struct seq_file *s, void *unused)
 {
struct pmc_dev *pmcdev = s->private;
@@ -1360,6 +1440,12 @@ static void pmc_core_dbgfs_register(struct pmc_dev 
*pmcdev)
pmcdev->dbgfs_dir, pmcdev,

[PATCH V2 9/9] platform/x86: intel_pmc_core: Add support for Alder Lake PCH-P

2021-04-16 Thread David E. Box
Alder PCH-P is based on Tiger Lake PCH.

Signed-off-by: David E. Box 
Reviewed-by: Hans de Goede 
Acked-by: Rajneesh Bhardwaj 
---

V2: No change

 drivers/platform/x86/intel_pmc_core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/platform/x86/intel_pmc_core.c 
b/drivers/platform/x86/intel_pmc_core.c
index 786b67171ddc..900aa5e40a0f 100644
--- a/drivers/platform/x86/intel_pmc_core.c
+++ b/drivers/platform/x86/intel_pmc_core.c
@@ -1577,6 +1577,7 @@ static const struct x86_cpu_id intel_pmc_core_ids[] = {
X86_MATCH_INTEL_FAM6_MODEL(ATOM_TREMONT,_reg_map),
X86_MATCH_INTEL_FAM6_MODEL(ATOM_TREMONT_L,  _reg_map),
X86_MATCH_INTEL_FAM6_MODEL(ROCKETLAKE,  _reg_map),
+   X86_MATCH_INTEL_FAM6_MODEL(ALDERLAKE_L, _reg_map),
{}
 };
 
-- 
2.25.1



[PATCH V2 8/9] platform/x86: intel_pmc_core: Add LTR registers for Tiger Lake

2021-04-16 Thread David E. Box
From: Gayatri Kammela 

Just like Ice Lake, Tiger Lake uses Cannon Lake's LTR information
and supports a few additional registers. Hence add the LTR registers
specific to Tiger Lake to the cnp_ltr_show_map[].

Also adjust the number of LTR IPs for Tiger Lake to the correct amount.

Signed-off-by: Gayatri Kammela 
Signed-off-by: David E. Box 
Reviewed-by: Hans de Goede 
Acked-by: Rajneesh Bhardwaj 
---

V2: No change

 drivers/platform/x86/intel_pmc_core.c | 2 ++
 drivers/platform/x86/intel_pmc_core.h | 4 +++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/platform/x86/intel_pmc_core.c 
b/drivers/platform/x86/intel_pmc_core.c
index 97cf3384c4c0..786b67171ddc 100644
--- a/drivers/platform/x86/intel_pmc_core.c
+++ b/drivers/platform/x86/intel_pmc_core.c
@@ -383,6 +383,8 @@ static const struct pmc_bit_map cnp_ltr_show_map[] = {
 * a list of core SoCs using this.
 */
{"WIGIG",   ICL_PMC_LTR_WIGIG},
+   {"THC0",TGL_PMC_LTR_THC0},
+   {"THC1",TGL_PMC_LTR_THC1},
/* Below two cannot be used for LTR_IGNORE */
{"CURRENT_PLATFORM",CNP_PMC_LTR_CUR_PLT},
{"AGGREGATED_SYSTEM",   CNP_PMC_LTR_CUR_ASLT},
diff --git a/drivers/platform/x86/intel_pmc_core.h 
b/drivers/platform/x86/intel_pmc_core.h
index c45805671c4a..e8dae9c6c45f 100644
--- a/drivers/platform/x86/intel_pmc_core.h
+++ b/drivers/platform/x86/intel_pmc_core.h
@@ -191,8 +191,10 @@ enum ppfear_regs {
 #define GET_X2_COUNTER(v)  ((v) >> 1)
 #define LPM_STS_LATCH_MODE BIT(31)
 
-#define TGL_NUM_IP_IGN_ALLOWED 22
 #define TGL_PMC_SLP_S0_RES_COUNTER_STEP0x7A
+#define TGL_PMC_LTR_THC0   0x1C04
+#define TGL_PMC_LTR_THC1   0x1C08
+#define TGL_NUM_IP_IGN_ALLOWED 23
 #define TGL_PMC_LPM_RES_COUNTER_STEP_X261  /* 30.5us * 2 */
 
 /*
-- 
2.25.1



[PATCH V2 4/9] platform/x86: intel_pmc_core: Show LPM residency in microseconds

2021-04-16 Thread David E. Box
From: Gayatri Kammela 

Modify the low power mode (LPM or sub-state) residency counters to display
in microseconds just like the slp_s0_residency counter. The granularity of
the counter is approximately 30.5us per tick. Double this value then divide
by two to maintain accuracy.

Signed-off-by: Gayatri Kammela 
Signed-off-by: David E. Box 
Reviewed-by: Hans de Goede 
Reviewed-by: Rajneesh Bhardwaj 
---

V2: No change

 drivers/platform/x86/intel_pmc_core.c | 14 --
 drivers/platform/x86/intel_pmc_core.h |  3 +++
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/drivers/platform/x86/intel_pmc_core.c 
b/drivers/platform/x86/intel_pmc_core.c
index c02f63c00ecc..0e59a84b51bf 100644
--- a/drivers/platform/x86/intel_pmc_core.c
+++ b/drivers/platform/x86/intel_pmc_core.c
@@ -580,6 +580,7 @@ static const struct pmc_reg_map tgl_reg_map = {
.pm_read_disable_bit = CNP_PMC_READ_DISABLE_BIT,
.ltr_ignore_max = TGL_NUM_IP_IGN_ALLOWED,
.lpm_num_maps = TGL_LPM_NUM_MAPS,
+   .lpm_res_counter_step_x2 = TGL_PMC_LPM_RES_COUNTER_STEP_X2,
.lpm_en_offset = TGL_LPM_EN_OFFSET,
.lpm_priority_offset = TGL_LPM_PRI_OFFSET,
.lpm_residency_offset = TGL_LPM_RESIDENCY_OFFSET,
@@ -1138,17 +1139,26 @@ static int pmc_core_ltr_show(struct seq_file *s, void 
*unused)
 }
 DEFINE_SHOW_ATTRIBUTE(pmc_core_ltr);
 
+static inline u64 adjust_lpm_residency(struct pmc_dev *pmcdev, u32 offset,
+  const int lpm_adj_x2)
+{
+   u64 lpm_res = pmc_core_reg_read(pmcdev, offset);
+
+   return GET_X2_COUNTER((u64)lpm_adj_x2 * lpm_res);
+}
+
 static int pmc_core_substate_res_show(struct seq_file *s, void *unused)
 {
struct pmc_dev *pmcdev = s->private;
+   const int lpm_adj_x2 = pmcdev->map->lpm_res_counter_step_x2;
u32 offset = pmcdev->map->lpm_residency_offset;
int i, mode;
 
seq_printf(s, "%-10s %-15s\n", "Substate", "Residency");
 
pmc_for_each_mode(i, mode, pmcdev) {
-   seq_printf(s, "%-10s %-15u\n", pmc_lpm_modes[mode],
-  pmc_core_reg_read(pmcdev, offset + (4 * mode)));
+   seq_printf(s, "%-10s %-15llu\n", pmc_lpm_modes[mode],
+  adjust_lpm_residency(pmcdev, offset + (4 * mode), 
lpm_adj_x2));
}
 
return 0;
diff --git a/drivers/platform/x86/intel_pmc_core.h 
b/drivers/platform/x86/intel_pmc_core.h
index 2ffe0eba36e1..aa44fd5399cc 100644
--- a/drivers/platform/x86/intel_pmc_core.h
+++ b/drivers/platform/x86/intel_pmc_core.h
@@ -188,9 +188,11 @@ enum ppfear_regs {
 #define ICL_PMC_SLP_S0_RES_COUNTER_STEP0x64
 
 #define LPM_MAX_NUM_MODES  8
+#define GET_X2_COUNTER(v)  ((v) >> 1)
 
 #define TGL_NUM_IP_IGN_ALLOWED 22
 #define TGL_PMC_SLP_S0_RES_COUNTER_STEP0x7A
+#define TGL_PMC_LPM_RES_COUNTER_STEP_X261  /* 30.5us * 2 */
 
 /*
  * Tigerlake Power Management Controller register offsets
@@ -268,6 +270,7 @@ struct pmc_reg_map {
const u32 pm_vric1_offset;
/* Low Power Mode registers */
const int lpm_num_maps;
+   const int lpm_res_counter_step_x2;
const u32 lpm_en_offset;
const u32 lpm_priority_offset;
const u32 lpm_residency_offset;
-- 
2.25.1



[PATCH V2 1/9] platform/x86: intel_pmc_core: Don't use global pmcdev in quirks

2021-04-16 Thread David E. Box
The DMI callbacks, used for quirks, currently access the PMC by getting
the address a global pmc_dev struct. Instead, have the callbacks set a
global quirk specific variable. In probe, after calling dmi_check_system(),
pass pmc_dev to a function that will handle each quirk if its variable
condition is met. This allows removing the global pmc_dev later.

Signed-off-by: David E. Box 
Reviewed-by: Hans de Goede 
Reviewed-by: Rajneesh Bhardwaj 
---

V2: No change

 drivers/platform/x86/intel_pmc_core.c | 19 ---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/platform/x86/intel_pmc_core.c 
b/drivers/platform/x86/intel_pmc_core.c
index 8fb4e6d1d68d..07657532ccdb 100644
--- a/drivers/platform/x86/intel_pmc_core.c
+++ b/drivers/platform/x86/intel_pmc_core.c
@@ -1298,9 +1298,15 @@ static const struct pci_device_id pmc_pci_ids[] = {
  * the platform BIOS enforces 24Mhz crystal to shutdown
  * before PMC can assert SLP_S0#.
  */
+static bool xtal_ignore;
 static int quirk_xtal_ignore(const struct dmi_system_id *id)
 {
-   struct pmc_dev *pmcdev = &pmc;
+   xtal_ignore = true;
+   return 0;
+}
+
+static void pmc_core_xtal_ignore(struct pmc_dev *pmcdev)
+{
u32 value;
 
value = pmc_core_reg_read(pmcdev, pmcdev->map->pm_vric1_offset);
@@ -1309,7 +1315,6 @@ static int quirk_xtal_ignore(const struct dmi_system_id 
*id)
/* Low Voltage Mode Enable */
value &= ~SPT_PMC_VRIC1_SLPS0LVEN;
pmc_core_reg_write(pmcdev, pmcdev->map->pm_vric1_offset, value);
-   return 0;
 }
 
 static const struct dmi_system_id pmc_core_dmi_table[]  = {
@@ -1324,6 +1329,14 @@ static const struct dmi_system_id pmc_core_dmi_table[]  
= {
{}
 };
 
+static void pmc_core_do_dmi_quirks(struct pmc_dev *pmcdev)
+{
+   dmi_check_system(pmc_core_dmi_table);
+
+   if (xtal_ignore)
+   pmc_core_xtal_ignore(pmcdev);
+}
+
 static int pmc_core_probe(struct platform_device *pdev)
 {
static bool device_initialized;
@@ -1365,7 +1378,7 @@ static int pmc_core_probe(struct platform_device *pdev)
	mutex_init(&pmcdev->lock);
platform_set_drvdata(pdev, pmcdev);
pmcdev->pmc_xram_read_bit = pmc_core_check_read_lock_bit();
-   dmi_check_system(pmc_core_dmi_table);
+   pmc_core_do_dmi_quirks(pmcdev);
 
/*
 * On TGL, due to a hardware limitation, the GBE LTR blocks PC10 when
-- 
2.25.1



[PATCH V2 0/9] intel_pmc_core: Add sub-state requirements and mode

2021-04-16 Thread David E. Box
- Patch 1 and 2 remove the use of the global struct pmc_dev
- Patches 3-7 add support for reading low power mode sub-state
  requirements, latching sub-state status on different low power mode
  events, and displaying the sub-state residency in microseconds
- Patch 8 adds missing LTR IPs for TGL
- Patch 9 adds support for ADL-P which is based on TGL

Applied on top of latest hans-review/review-hans

Patches that changed in V2:
Patch 3: Variable name change
Patch 5: Do proper cleanup after fail
Patch 7: Debugfs write function fixes

David E. Box (4):
  platform/x86: intel_pmc_core: Don't use global pmcdev in quirks
  platform/x86: intel_pmc_core: Remove global struct pmc_dev
  platform/x86: intel_pmc_core: Add option to set/clear LPM mode
  platform/x86: intel_pmc_core: Add support for Alder Lake PCH-P

Gayatri Kammela (5):
  platform/x86: intel_pmc_core: Handle sub-states generically
  platform/x86: intel_pmc_core: Show LPM residency in microseconds
  platform/x86: intel_pmc_core: Get LPM requirements for Tiger Lake
  platform/x86: intel_pmc_core: Add requirements file to debugfs
  platform/x86: intel_pmc_core: Add LTR registers for Tiger Lake

 drivers/platform/x86/intel_pmc_core.c | 384 +++---
 drivers/platform/x86/intel_pmc_core.h |  47 +++-
 2 files changed, 395 insertions(+), 36 deletions(-)


base-commit: 823b31517ad3196324322804ee365d5fcff704d6
-- 
2.25.1



Re: [RFC PATCH] USB:XHCI:skip hub registration

2021-04-16 Thread liulongfang
On 2021/4/16 23:20, Alan Stern wrote:
> On Fri, Apr 16, 2021 at 10:03:21AM +0800, liulongfang wrote:
>> On 2021/4/15 22:43, Alan Stern wrote:
>>> On Thu, Apr 15, 2021 at 08:22:38PM +0800, Longfang Liu wrote:
 When the number of ports on the USB hub is 0, skip the registration
 operation of the USB hub.

 The current Kunpeng930's XHCI hardware controller is defective. The number
 of ports on its USB3.0 bus controller is 0, and the number of ports on
 the USB2.0 bus controller is 1.

 In order to solve this problem that the USB3.0 controller does not have
 a port which causes the registration of the hub to fail, this patch passes
 the defect information by adding flags in the quirks of xhci and usb_hcd,
 and finally skips the registration process of the hub directly according
 to the results of these flags when the hub is initialized.

 Signed-off-by: Longfang Liu 
>>>
>>> The objections that Greg raised are all good ones.
>>>
>>> But even aside from them, this patch doesn't actually do what the 
>>> description says.  The patch doesn't remove the call to usb_add_hcd 
>>> for the USB-3 bus.  If you simply skipped that call (and the 
>>> corresponding call to usb_remove_hcd) when there are no 
>>> ports on the root hub, none of the stuff in this patch would be needed.
>>>
>>> Alan Stern
>>>
>>
>> "[RFC PATCH] USB:XHCI:Adjust the log level of hub"
> 
> I don't understand.  What patch is that?  Do you have a URL for it?
> 
URL: 
https://patchwork.kernel.org/project/linux-usb/patch/161652-37920-1-git-send-email-liulongf...@huawei.com/
Thanks
Longfang.

>> The current method is an improved method of the above patch.
>> This patch just make it skip registering USB-3 root hub if that hub has no 
>> ports,
> 
> No, that isn't what this patch does.
> 
> If the root hub wasn't registered, hub_probe wouldn't get called.  But 
> with your patch, the system tries to register the root hub, and it does 
> call hub_probe, and then that function fails with a warning message.
> 
> The way to _really_ akip registering the root hub is to change the 
> xhci-hcd code.  Make it skip calling usb_add_hcd.
> 
>> after skipping registering, no port will not report error log,the goal of 
>> this
>> patch is reached without error log output.
> 
> Why do you want to get rid of the error log output?  There really _is_ 
> an error, because the USB-3 hardware on your controller is defective.  
> Since the hardware is buggy, we _should_ print an error message in the 
> kernel log.
> 
> Alan Stern
> .
> 


[GIT PULL] cxl fixes for v5.12-rc8 / final

2021-04-16 Thread Dan Williams
Hi Linus, please pull from:

  git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
tags/cxl-fixes-for-5.12-rc8

...to receive a collection of fixes for the CXL memory class driver
introduced in v5.12-rc1.

The driver was primarily developed on a work-in-progress QEMU
emulation of the interface and we have since found a couple places
where it hid spec compat bugs in the driver, or had a spec
implementation bug itself. The biggest change here is replacing a
percpu_ref with an rwsem to cleanup a couple bugs in the error unwind
path during ioctl device init. Lastly there were some minor cleanups
to not export the power-management sysfs-ABI for the ioctl device, use
the proper sysfs helper for emitting values, and prevent subtle bugs
as new administration commands are added to the supported list.

The bulk of it has appeared in -next save for the top commit which was
found today and validated on a fixed-up QEMU model.

---

The following changes since commit a38fd8748464831584a19438cbb3082b5a2dab15:

  Linux 5.12-rc2 (2021-03-05 17:33:41 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
tags/cxl-fixes-for-5.12-rc8

for you to fetch changes up to fae8817ae804a682c6823ad1672438f39fc46c28:

  cxl/mem: Fix memory device capacity probing (2021-04-16 18:21:56 -0700)


cxl for 5.12-rc8

- Fix support for CXL memory devices with registers offset from the BAR
  base.

- Fix the reporting of device capacity.

- Fix the driver commands list definition to be disconnected from the
  UAPI command list.

- Replace percpu_ref with rwsem to fix initialization error path.

- Fix leaks in the driver initialization error path.

- Drop the power/ directory from CXL device sysfs.

- Use the recommended sysfs helper for attribute 'show' implementations.


Ben Widawsky (1):
  cxl/mem: Fix register block offset calculation

Dan Williams (5):
  cxl/mem: Use sysfs_emit() for attribute show routines
  cxl/mem: Fix synchronization mechanism for device removal vs
ioctl operations
  cxl/mem: Do not rely on device_add() side effects for
dev_set_name() failures
  cxl/mem: Disable cxl device power management
  cxl/mem: Fix memory device capacity probing

Robert Richter (1):
  cxl/mem: Force array size of mem_commands[] to CXL_MEM_COMMAND_ID_MAX

 drivers/cxl/mem.c | 152 --
 1 file changed, 89 insertions(+), 63 deletions(-)


Re: [External] [PATCH v3] dma-buf: Add DmaBufTotal counter in meminfo

2021-04-16 Thread Muchun Song
On Sat, Apr 17, 2021 at 12:08 AM Peter Enderborg
 wrote:
>
> This adds a total used dma-buf memory. Details
> can be found in debugfs, however it is not for everyone
> and not always available. dma-buf are indirect allocated by
> userspace. So with this value we can monitor and detect
> userspace applications that have problems.

I want to know more details about the problems.
Can you share what problems you have encountered?

Thanks.

>
> Signed-off-by: Peter Enderborg 
> ---
>  drivers/dma-buf/dma-buf.c | 12 
>  fs/proc/meminfo.c |  5 -
>  include/linux/dma-buf.h   |  1 +
>  3 files changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
> index f264b70c383e..d40fff2ae1fa 100644
> --- a/drivers/dma-buf/dma-buf.c
> +++ b/drivers/dma-buf/dma-buf.c
> @@ -37,6 +37,7 @@ struct dma_buf_list {
>  };
>
>  static struct dma_buf_list db_list;
> +static atomic_long_t dma_buf_global_allocated;
>
>  static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
>  {
> @@ -79,6 +80,7 @@ static void dma_buf_release(struct dentry *dentry)
> if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
> dma_resv_fini(dmabuf->resv);
>
> +   atomic_long_sub(dmabuf->size, &dma_buf_global_allocated);
> module_put(dmabuf->owner);
> kfree(dmabuf->name);
> kfree(dmabuf);
> @@ -586,6 +588,7 @@ struct dma_buf *dma_buf_export(const struct 
> dma_buf_export_info *exp_info)
> mutex_lock(&db_list.lock);
> list_add(&dmabuf->list_node, &db_list.head);
> mutex_unlock(&db_list.lock);
> +   atomic_long_add(dmabuf->size, &dma_buf_global_allocated);
>
> return dmabuf;
>
> @@ -1346,6 +1349,15 @@ void dma_buf_vunmap(struct dma_buf *dmabuf, struct 
> dma_buf_map *map)
>  }
>  EXPORT_SYMBOL_GPL(dma_buf_vunmap);
>
> +/**
> + * dma_buf_get_size - Return the used nr pages by dma-buf
> + */
> +long dma_buf_allocated_pages(void)
> +{
> +   return atomic_long_read(&dma_buf_global_allocated) >> PAGE_SHIFT;
> +}
> +EXPORT_SYMBOL_GPL(dma_buf_allocated_pages);

Why need "EXPORT_SYMBOL_GPL"?

> +
>  #ifdef CONFIG_DEBUG_FS
>  static int dma_buf_debug_show(struct seq_file *s, void *unused)
>  {
> diff --git a/fs/proc/meminfo.c b/fs/proc/meminfo.c
> index 6fa761c9cc78..ccc7c40c8db7 100644
> --- a/fs/proc/meminfo.c
> +++ b/fs/proc/meminfo.c
> @@ -16,6 +16,7 @@
>  #ifdef CONFIG_CMA
>  #include 
>  #endif
> +#include 
>  #include 
>  #include "internal.h"
>
> @@ -145,7 +146,9 @@ static int meminfo_proc_show(struct seq_file *m, void *v)
> show_val_kb(m, "CmaFree:",
> global_zone_page_state(NR_FREE_CMA_PAGES));
>  #endif
> -
> +#ifdef CONFIG_DMA_SHARED_BUFFER
> +   show_val_kb(m, "DmaBufTotal:", dma_buf_allocated_pages());
> +#endif
> hugetlb_report_meminfo(m);
>
> arch_report_meminfo(m);
> diff --git a/include/linux/dma-buf.h b/include/linux/dma-buf.h
> index efdc56b9d95f..5b05816bd2cd 100644
> --- a/include/linux/dma-buf.h
> +++ b/include/linux/dma-buf.h
> @@ -507,4 +507,5 @@ int dma_buf_mmap(struct dma_buf *, struct vm_area_struct 
> *,
>  unsigned long);
>  int dma_buf_vmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
>  void dma_buf_vunmap(struct dma_buf *dmabuf, struct dma_buf_map *map);
> +long dma_buf_allocated_pages(void);
>  #endif /* __DMA_BUF_H__ */
> --
> 2.17.1
>


[PATCH v2 2/2] riscv: atomic: Using ARCH_ATOMIC in asm/atomic.h

2021-04-16 Thread guoren
From: Guo Ren 

The linux/atomic-arch-fallback.h has been there for a while, but
only x86 & arm64 support it. Let's make riscv follow the
linux/arch/* development trendy and make the codes more readable
and maintainable.

This patch also cleanup some codes:
 - Add atomic_andnot_* operation
 - Using amoswap.w.rl & amoswap.w.aq instructions in xchg
 - Remove cmpxchg_acquire/release unnecessary optimization

Change in v2:
 - Fixup andnot bug by Peter Zijlstra

Signed-off-by: Guo Ren 
Link: 
https://lore.kernel.org/linux-riscv/cak8p3a0fg3cpqbnup7kxj3713cmuqv1wceh-vcrngkm00wx...@mail.gmail.com/
Cc: Arnd Bergmann 
Cc: Peter Zijlstra 
Cc: Anup Patel 
Cc: Palmer Dabbelt 
---
 arch/riscv/include/asm/atomic.h  | 230 +++
 arch/riscv/include/asm/cmpxchg.h | 199 ++---
 2 files changed, 99 insertions(+), 330 deletions(-)

diff --git a/arch/riscv/include/asm/atomic.h b/arch/riscv/include/asm/atomic.h
index 400a8c8..b127cb1 100644
--- a/arch/riscv/include/asm/atomic.h
+++ b/arch/riscv/include/asm/atomic.h
@@ -8,13 +8,8 @@
 #ifndef _ASM_RISCV_ATOMIC_H
 #define _ASM_RISCV_ATOMIC_H
 
-#ifdef CONFIG_GENERIC_ATOMIC64
-# include 
-#else
-# if (__riscv_xlen < 64)
-#  error "64-bit atomics require XLEN to be at least 64"
-# endif
-#endif
+#include 
+#include 
 
 #include 
 #include 
@@ -25,25 +20,13 @@
 #define __atomic_release_fence()   \
__asm__ __volatile__(RISCV_RELEASE_BARRIER "" ::: "memory");
 
-static __always_inline int atomic_read(const atomic_t *v)
-{
-   return READ_ONCE(v->counter);
-}
-static __always_inline void atomic_set(atomic_t *v, int i)
-{
-   WRITE_ONCE(v->counter, i);
-}
+#define arch_atomic_read(v)__READ_ONCE((v)->counter)
+#define arch_atomic_set(v, i)  __WRITE_ONCE(((v)->counter), 
(i))
 
 #ifndef CONFIG_GENERIC_ATOMIC64
-#define ATOMIC64_INIT(i) { (i) }
-static __always_inline s64 atomic64_read(const atomic64_t *v)
-{
-   return READ_ONCE(v->counter);
-}
-static __always_inline void atomic64_set(atomic64_t *v, s64 i)
-{
-   WRITE_ONCE(v->counter, i);
-}
+#define ATOMIC64_INIT  ATOMIC_INIT
+#define arch_atomic64_read arch_atomic_read
+#define arch_atomic64_set  arch_atomic_set
 #endif
 
 /*
@@ -53,7 +36,7 @@ static __always_inline void atomic64_set(atomic64_t *v, s64 i)
  */
 #define ATOMIC_OP(op, asm_op, I, asm_type, c_type, prefix) \
 static __always_inline \
-void atomic##prefix##_##op(c_type i, atomic##prefix##_t *v)\
+void arch_atomic##prefix##_##op(c_type i, atomic##prefix##_t *v)   \
 {  \
__asm__ __volatile__ (  \
"   amo" #asm_op "." #asm_type " zero, %1, %0"  \
@@ -76,6 +59,12 @@ ATOMIC_OPS(sub, add, -i)
 ATOMIC_OPS(and, and,  i)
 ATOMIC_OPS( or,  or,  i)
 ATOMIC_OPS(xor, xor,  i)
+ATOMIC_OPS(andnot, and,  ~i)
+
+#define arch_atomic_andnot arch_atomic_andnot
+#ifndef CONFIG_GENERIC_ATOMIC64
+#define arch_atomic64_andnot   arch_atomic64_andnot
+#endif
 
 #undef ATOMIC_OP
 #undef ATOMIC_OPS
@@ -87,7 +76,7 @@ ATOMIC_OPS(xor, xor,  i)
  */
 #define ATOMIC_FETCH_OP(op, asm_op, I, asm_type, c_type, prefix)   \
 static __always_inline \
-c_type atomic##prefix##_fetch_##op##_relaxed(c_type i, \
+c_type arch_atomic##prefix##_fetch_##op##_relaxed(c_type i,\
 atomic##prefix##_t *v) \
 {  \
register c_type ret;\
@@ -99,7 +88,7 @@ c_type atomic##prefix##_fetch_##op##_relaxed(c_type i,
\
return ret; \
 }  \
 static __always_inline \
-c_type atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v)\
+c_type arch_atomic##prefix##_fetch_##op(c_type i, atomic##prefix##_t *v)\
 {  \
register c_type ret;\
__asm__ __volatile__ (  \
@@ -112,15 +101,16 @@ c_type atomic##prefix##_fetch_##op(c_type i, 
atomic##prefix##_t *v)   \
 
 #define ATOMIC_OP_RETURN(op, asm_op, c_op, I, asm_type, c_type, prefix)
\
 static __always_inline \
-c_type atomic##prefix##_##op##_return_relaxed(c_type i,
\
+c_type arch_atomic##prefix##_##op##_return_relaxed(c_type i,   \
  

[PATCH v2 1/2] locking/atomics: Fixup GENERIC_ATOMIC64 conflict with atomic-arch-fallback.h

2021-04-16 Thread guoren
From: Guo Ren 

Current GENERIC_ATOMIC64 in atomic-arch-fallback.h is broken. When a 32-bit
arch use atomic-arch-fallback.h will cause compile error.

In file included from include/linux/atomic.h:81,
from include/linux/rcupdate.h:25,
from include/linux/rculist.h:11,
from include/linux/pid.h:5,
from include/linux/sched.h:14,
from arch/riscv/kernel/asm-offsets.c:10:
   include/linux/atomic-arch-fallback.h: In function 'arch_atomic64_inc':
>> include/linux/atomic-arch-fallback.h:1447:2: error: implicit declaration of 
>> function 'arch_atomic64_add'; did you mean 'arch_atomic_add'? 
>> [-Werror=implicit-function-declaration]
1447 |  arch_atomic64_add(1, v);
 |  ^
 |  arch_atomic_add

The atomic-arch-fallback.h & atomic-fallback.h &
atomic-instrumented.h are generated by gen-atomic-fallback.sh &
gen-atomic-instrumented.sh, so just take care the bash files.

Remove the dependency of atomic-*-fallback.h in atomic64.h.

Signed-off-by: Guo Ren 
Cc: Peter Zijlstra 
Cc: Arnd Bergmann 
---
 include/asm-generic/atomic-instrumented.h | 307 +-
 include/asm-generic/atomic64.h|  89 +
 include/linux/atomic-arch-fallback.h  |   5 +-
 include/linux/atomic-fallback.h   |   5 +-
 scripts/atomic/gen-atomic-fallback.sh |   3 +-
 scripts/atomic/gen-atomic-instrumented.sh |  23 ++-
 6 files changed, 294 insertions(+), 138 deletions(-)

diff --git a/include/asm-generic/atomic-instrumented.h 
b/include/asm-generic/atomic-instrumented.h
index 888b6cf..f6ce7a2 100644
--- a/include/asm-generic/atomic-instrumented.h
+++ b/include/asm-generic/atomic-instrumented.h
@@ -831,6 +831,180 @@ atomic_dec_if_positive(atomic_t *v)
 #define atomic_dec_if_positive atomic_dec_if_positive
 #endif
 
+#if !defined(arch_xchg_relaxed) || defined(arch_xchg)
+#define xchg(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_xchg(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_xchg_acquire)
+#define xchg_acquire(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_xchg_acquire(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_xchg_release)
+#define xchg_release(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_xchg_release(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_xchg_relaxed)
+#define xchg_relaxed(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_xchg_relaxed(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if !defined(arch_cmpxchg_relaxed) || defined(arch_cmpxchg)
+#define cmpxchg(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_cmpxchg(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_cmpxchg_acquire)
+#define cmpxchg_acquire(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_cmpxchg_acquire(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_cmpxchg_release)
+#define cmpxchg_release(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_cmpxchg_release(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_cmpxchg_relaxed)
+#define cmpxchg_relaxed(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_cmpxchg_relaxed(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if !defined(arch_cmpxchg64_relaxed) || defined(arch_cmpxchg64)
+#define cmpxchg64(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_cmpxchg64(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_cmpxchg64_acquire)
+#define cmpxchg64_acquire(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_cmpxchg64_acquire(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_cmpxchg64_release)
+#define cmpxchg64_release(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_cmpxchg64_release(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if defined(arch_cmpxchg64_relaxed)
+#define cmpxchg64_relaxed(ptr, ...) \
+({ \
+   typeof(ptr) __ai_ptr = (ptr); \
+   instrument_atomic_write(__ai_ptr, sizeof(*__ai_ptr)); \
+   arch_cmpxchg64_relaxed(__ai_ptr, __VA_ARGS__); \
+})
+#endif
+
+#if !defined(arch_try_cmpxchg_relaxed) || defined(arch_try_cmpxchg)
+#define 

Re: [External] Re: [PATCH v20 4/9] mm: hugetlb: free the vmemmap pages associated with each HugeTLB page

2021-04-16 Thread Muchun Song
On Sat, Apr 17, 2021 at 5:10 AM Mike Kravetz  wrote:
>
> On 4/15/21 1:40 AM, Muchun Song wrote:
> > Every HugeTLB has more than one struct page structure. We __know__ that
> > we only use the first 4 (__NR_USED_SUBPAGE) struct page structures
> > to store metadata associated with each HugeTLB.
> >
> > There are a lot of struct page structures associated with each HugeTLB
> > page. For tail pages, the value of compound_head is the same. So we can
> > reuse first page of tail page structures. We map the virtual addresses
> > of the remaining pages of tail page structures to the first tail page
> > struct, and then free these page frames. Therefore, we need to reserve
> > two pages as vmemmap areas.
> >
> > When we allocate a HugeTLB page from the buddy, we can free some vmemmap
> > pages associated with each HugeTLB page. It is more appropriate to do it
> > in the prep_new_huge_page().
> >
> > The free_vmemmap_pages_per_hpage(), which indicates how many vmemmap
> > pages associated with a HugeTLB page can be freed, returns zero for
> > now, which means the feature is disabled. We will enable it once all
> > the infrastructure is there.
> >
> > Signed-off-by: Muchun Song 
> > Reviewed-by: Oscar Salvador 
> > Tested-by: Chen Huang 
> > Tested-by: Bodeddula Balasubramaniam 
> > Acked-by: Michal Hocko 
>
> There may need to be some trivial rebasing due to Oscar's changes
> when they go in.

Yeah, thanks for your reminder.

>
> Reviewed-by: Mike Kravetz 
> --
> Mike Kravetz


Re: [External] Re: [PATCH v2 5/8] mm: memcontrol: rename lruvec_holds_page_lru_lock to page_matches_lruvec

2021-04-16 Thread Muchun Song
On Fri, Apr 16, 2021 at 11:20 PM Johannes Weiner  wrote:
>
> On Fri, Apr 16, 2021 at 01:14:04PM +0800, Muchun Song wrote:
> > lruvec_holds_page_lru_lock() doesn't check anything about locking and is
> > used to check whether the page belongs to the lruvec. So rename it to
> > page_matches_lruvec().
> >
> > Signed-off-by: Muchun Song 
>
> The rename makes sense, since the previous name was defined by a
> specific use case rather than what it does. That said, it did imply a
> lock context that makes the test result stable. Without that the
> function could use a short comment, IMO. How about:
>
> /* Test requires a stable page->memcg binding, see page_memcg() */

Make sense. I will add this comment.

>
> With that,
> Acked-by: Johannes Weiner 

Thanks.


Re: [RFC 0/2] Add a new translation tool scripts/trslt.py

2021-04-16 Thread Wu X.C.
On Thu, Apr 15, 2021 at 03:00:36PM -0600, Jonathan Corbet wrote:
> Wu XiangCheng  writes:
> 
> > Hi all,
> >
> > This set of patches aims to add a new translation tool - trslt.py, which
> > can control the translations version corresponding to source files.
> >
> > For a long time, kernel documentation translations lacks a way to control 
> > the
> > version corresponding to the source files. If you translate a file and then
> > someone updates the source file, there will be a problem. It's hard to know
> > which version the existing translation corresponds to, and even harder to 
> > sync
> > them. 
> >
> > The common way now is to check the date, but this is not exactly accurate,
> > especially for documents that are often updated. And some translators write 
> > corresponding commit ID in the commit log for reference, it is a good way, 
> > but still a little troublesome.
> >
> > Thus, the purpose of ``trslt.py`` is to add a new annotating tag to the file
> > to indicate corresponding version of the source file::
> >
> > .. translation_origin_commit: 
> >
> > The script will automatically copy file and generate tag when creating new
> > translation, and give update suggestions based on those tags when updating
> > translations.
> >
> > More details please read doc in [Patch 2/2].
> 
> So, like Federico, I'm unconvinced about putting this into the
> translated text itself.  This is metadata, and I'd put it with the rest
> of the metadata.  My own suggestion would be a tag like:
> 
>   Translates: 6161a4b18a66 ("docs: reporting-issues: make people CC the 
> regressions list")
> 
> It would be an analogue to the Fixes tag in this regard; you could have
> more than one of them if need be.

Yes, that's also a good idea rather than add a tag to text itself.

> 
> I'm not sure we really need a script in the kernel tree for this; it
> seems like what you really want is some sort of git commit hook.  That
> said, if you come up with something useful, we can certainly find a
> place for it.

Emmm, thought again.

Maybe we just need a doc to tell people recommended practice, just put a
script or hook in the doc.

Whether to use it or not would be up to each translator. That may be easier,
but I'm worried about whether such a loose approach will work well.

Thanks!

Wu X.C.


signature.asc
Description: PGP signature


Re: [PATCH 1/2] mm: Fix struct page layout on 32-bit systems

2021-04-16 Thread Matthew Wilcox


Replacement patch to fix compiler warning.

From: "Matthew Wilcox (Oracle)" 
Date: Fri, 16 Apr 2021 16:34:55 -0400
Subject: [PATCH 1/2] mm: Fix struct page layout on 32-bit systems
To: bro...@redhat.com
Cc: linux-kernel@vger.kernel.org,
linux...@kvack.org,
net...@vger.kernel.org,
linuxppc-...@lists.ozlabs.org,
linux-arm-ker...@lists.infradead.org,
linux-m...@vger.kernel.org,
ilias.apalodi...@linaro.org,
mcr...@linux.microsoft.com,
grygorii.stras...@ti.com,
a...@kernel.org,
h...@lst.de,
linux-snps-...@lists.infradead.org,
mho...@kernel.org,
mgor...@suse.de

32-bit architectures which expect 8-byte alignment for 8-byte integers
and need 64-bit DMA addresses (arc, arm, mips, ppc) had their struct
page inadvertently expanded in 2019.  When the dma_addr_t was added,
it forced the alignment of the union to 8 bytes, which inserted a 4 byte
gap between 'flags' and the union.

Fix this by storing the dma_addr_t in one or two adjacent unsigned longs.
This restores the alignment to that of an unsigned long, and also fixes a
potential problem where (on a big endian platform), the bit used to denote
PageTail could inadvertently get set, and a racing get_user_pages_fast()
could dereference a bogus compound_head().

Fixes: c25fff7171be ("mm: add dma_addr_t to struct page")
Signed-off-by: Matthew Wilcox (Oracle) 
---
 include/linux/mm_types.h |  4 ++--
 include/net/page_pool.h  | 12 +++-
 net/core/page_pool.c | 12 +++-
 3 files changed, 20 insertions(+), 8 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6613b26a8894..5aacc1c10a45 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -97,10 +97,10 @@ struct page {
};
struct {/* page_pool used by netstack */
/**
-* @dma_addr: might require a 64-bit value even on
+* @dma_addr: might require a 64-bit value on
 * 32-bit architectures.
 */
-   dma_addr_t dma_addr;
+   unsigned long dma_addr[2];
};
struct {/* slab, slob and slub */
union {
diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index b5b195305346..ad6154dc206c 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -198,7 +198,17 @@ static inline void page_pool_recycle_direct(struct 
page_pool *pool,
 
 static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
 {
-   return page->dma_addr;
+   dma_addr_t ret = page->dma_addr[0];
+   if (sizeof(dma_addr_t) > sizeof(unsigned long))
+   ret |= (dma_addr_t)page->dma_addr[1] << 16 << 16;
+   return ret;
+}
+
+static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t addr)
+{
+   page->dma_addr[0] = addr;
+   if (sizeof(dma_addr_t) > sizeof(unsigned long))
+   page->dma_addr[1] = addr >> 16 >> 16;
 }
 
 static inline bool is_page_pool_compiled_in(void)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ad8b0707af04..f014fd8c19a6 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -174,8 +174,10 @@ static void page_pool_dma_sync_for_device(struct page_pool 
*pool,
  struct page *page,
  unsigned int dma_sync_size)
 {
+   dma_addr_t dma_addr = page_pool_get_dma_addr(page);
+
dma_sync_size = min(dma_sync_size, pool->p.max_len);
-   dma_sync_single_range_for_device(pool->p.dev, page->dma_addr,
+   dma_sync_single_range_for_device(pool->p.dev, dma_addr,
 pool->p.offset, dma_sync_size,
 pool->p.dma_dir);
 }
@@ -226,7 +228,7 @@ static struct page *__page_pool_alloc_pages_slow(struct 
page_pool *pool,
put_page(page);
return NULL;
}
-   page->dma_addr = dma;
+   page_pool_set_dma_addr(page, dma);
 
if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
@@ -294,13 +296,13 @@ void page_pool_release_page(struct page_pool *pool, 
struct page *page)
 */
goto skip_dma_unmap;
 
-   dma = page->dma_addr;
+   dma = page_pool_get_dma_addr(page);
 
-   /* When page is unmapped, it cannot be returned our pool */
+   /* When page is unmapped, it cannot be returned to our pool */
dma_unmap_page_attrs(pool->p.dev, dma,
 PAGE_SIZE << pool->p.order, pool->p.dma_dir,
 DMA_ATTR_SKIP_CPU_SYNC);
-   page->dma_addr = 0;
+   page_pool_set_dma_addr(page, 0);
 skip_dma_unmap:
/* This may be the last page returned, releasing the pool, so
 * it is not safe to reference pool 

Re: [PATCH v13 14/14] powerpc/64s/radix: Enable huge vmalloc mappings

2021-04-16 Thread Nicholas Piggin
Excerpts from Andrew Morton's message of April 16, 2021 4:55 am:
> On Thu, 15 Apr 2021 12:23:55 +0200 Christophe Leroy 
>  wrote:
>> > +   * is done. STRICT_MODULE_RWX may require extra work to support this
>> > +   * too.
>> > +   */
>> >   
>> > -  return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END, 
>> > GFP_KERNEL,
>> > -  PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS, 
>> > NUMA_NO_NODE,
>> 
>> 
>> I think you should add the following in 
>> 
>> #ifndef MODULES_VADDR
>> #define MODULES_VADDR VMALLOC_START
>> #define MODULES_END VMALLOC_END
>> #endif
>> 
>> And leave module_alloc() as is (just removing the enclosing #ifdef 
>> MODULES_VADDR and adding the 
>> VM_NO_HUGE_VMAP  flag)
>> 
>> This would minimise the conflits with the changes I did in powerpc/next 
>> reported by Stephen R.
>> 
> 
> I'll drop powerpc-64s-radix-enable-huge-vmalloc-mappings.patch for now,
> make life simpler.

Yeah that's fine.

> Nick, a redo on top of Christophe's changes in linux-next would be best
> please.

Will do.

Thanks,
Nick


[PATCH] riscv: patch: remove lockdep assertion on lock text_mutex

2021-04-16 Thread Changbin Du
The function patch_insn_write() expects that the text_mutex is already
held. There's a case where text_mutex is acquired by ftrace_run_update_code()
under syscall context, but patch_insn_write() is then executed under the
migration kthread context since it involves stop_machine. So we should remove
the assertion, or it can cause a warning storm in the kernel log.

[  104.641978] [ cut here ]
[  104.642327] WARNING: CPU: 0 PID: 13 at arch/riscv/kernel/patch.c:63 
patch_insn_write+0x166/0x17c
[  104.643587] Modules linked in:
[  104.644691] CPU: 0 PID: 13 Comm: migration/0 Not tainted 
5.12.0-rc7-00067-g9cdbf6467424 #102
[  104.644907] Hardware name: riscv-virtio,qemu (DT)
[  104.645068] Stopper: multi_cpu_stop+0x0/0x17e <- 0x0
[  104.645349] epc : patch_insn_write+0x166/0x17c
[  104.645467]  ra : patch_insn_write+0x162/0x17c
[  104.645534] epc : ffe059c6 ra : ffe059c2 sp : 
ffe002a33c70
[  104.645580]  gp : ffe0019e5518 tp : ffe002a232c0 t0 : 
ffe01295e8a8
[  104.645622]  t1 : 0001 t2 :  s0 : 
ffe002a33cc0
[  104.645675]  s1 : ffe07f72 a0 :  a1 : 

[  104.645716]  a2 : 0001 a3 :  a4 : 
0001
[  104.645757]  a5 : ffe0799e45c8 a6 : 000ca097 a7 : 

[  104.645798]  s2 : 0008 s3 : 0f72 s4 : 
ffe002a33ce0
[  104.645839]  s5 : 0f7a s6 : 0003 s7 : 
0003
[  104.645880]  s8 : 0004 s9 : 0002 s10: 

[  104.645920]  s11: 0002 t3 : 0001 t4 : 
ffe000c615c8
[  104.645958]  t5 : 7fff t6 : 0380
[  104.645998] status: 0100 badaddr:  cause: 
0003
[  104.646081] Call Trace:
[  104.646147] [] patch_insn_write+0x166/0x17c
[  104.646280] [] patch_text_nosync+0x10/0x32
[  104.646317] [] ftrace_update_ftrace_func+0x74/0xac
[  104.646352] [] ftrace_modify_all_code+0x9c/0x144
[  104.646387] [] __ftrace_modify_code+0x12/0x1c
[  104.646420] [] multi_cpu_stop+0xa8/0x17e
[  104.646451] [] cpu_stopper_thread+0xb2/0x156
[  104.646489] [] smpboot_thread_fn+0x102/0x1ea
[  104.646524] [] kthread+0x132/0x148
[  104.646556] [] ret_from_exception+0x0/0x14
[  104.646657] ---[ end trace ccf71babb9de4d5b ]---
[  104.647444] [ cut here ]

Signed-off-by: Changbin Du 
---
 arch/riscv/kernel/patch.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/arch/riscv/kernel/patch.c b/arch/riscv/kernel/patch.c
index 0b552873a577..6d2ed9c15065 100644
--- a/arch/riscv/kernel/patch.c
+++ b/arch/riscv/kernel/patch.c
@@ -49,19 +49,18 @@ static void patch_unmap(int fixmap)
 }
 NOKPROBE_SYMBOL(patch_unmap);
 
+
+/*
+ * Before reaching here, it was expected to lock the text_mutex
+ * already, so we don't need to give another lock here and could
+ * ensure that it was safe between each cores.
+ */
 static int patch_insn_write(void *addr, const void *insn, size_t len)
 {
void *waddr = addr;
bool across_pages = (((uintptr_t) addr & ~PAGE_MASK) + len) > PAGE_SIZE;
int ret;
 
-   /*
-* Before reaching here, it was expected to lock the text_mutex
-* already, so we don't need to give another lock here and could
-* ensure that it was safe between each cores.
-*/
-   lockdep_assert_held(_mutex);
-
if (across_pages)
patch_map(addr + len, FIX_TEXT_POKE1);
 
-- 
2.27.0



Re: [RFC PATCH] capabilities: require CAP_SETFCAP to map uid 0 (v3)

2021-04-16 Thread Serge E. Hallyn
On Fri, Apr 16, 2021 at 04:34:53PM -0500, Serge E. Hallyn wrote:
> On Fri, Apr 16, 2021 at 05:05:01PM +0200, Christian Brauner wrote:
> > On Thu, Apr 15, 2021 at 11:58:51PM -0500, Serge Hallyn wrote:
> > > (Eric - this patch (v3) is a cleaned up version of the previous approach.
> > > v4 is at 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/sergeh/linux.git/log/?h=2021-04-15/setfcap-nsfscaps-v4
> > > and is the approach you suggested.  I can send it also as a separate patch
> > > if you like)
> > > 
> > > A process running as uid 0 but without cap_setfcap currently can simply
> > > unshare a new user namespace with uid 0 mapped to 0.  While this task
> > > will not have new capabilities against the parent namespace, there is
> > > a loophole due to the way namespaced file capabilities work.  File
> > > capabilities valid in userns 1 are distinguised from file capabilities
> > > valid in userns 2 by the kuid which underlies uid 0.  Therefore
> > > the restricted root process can unshare a new self-mapping namespace,
> > > add a namespaced file capability onto a file, then use that file
> > > capability in the parent namespace.
> > > 
> > > To prevent that, do not allow mapping uid 0 if the process which
> > > opened the uid_map file does not have CAP_SETFCAP, which is the capability
> > > for setting file capabilities.
> > > 
> > > A further wrinkle:  a task can unshare its user namespace, then
> > > open its uid_map file itself, and map (only) its own uid.  In this
> > > case we do not have the credential from before unshare,  which was
> > > potentially more restricted.  So, when creating a user namespace, we
> > > record whether the creator had CAP_SETFCAP.  Then we can use that
> > > during map_write().
> > > 
> > > With this patch:
> > > 
> > > 1. unprivileged user can still unshare -Ur
> > > 
> > > ubuntu@caps:~$ unshare -Ur
> > > root@caps:~# logout
> > > 
> > > 2. root user can still unshare -Ur
> > > 
> > > ubuntu@caps:~$ sudo bash
> > > root@caps:/home/ubuntu# unshare -Ur
> > > root@caps:/home/ubuntu# logout
> > > 
> > > 3. root user without CAP_SETFCAP cannot unshare -Ur:
> > > 
> > > root@caps:/home/ubuntu# /sbin/capsh --drop=cap_setfcap --
> > > root@caps:/home/ubuntu# /sbin/setcap cap_setfcap=p /sbin/setcap
> > > unable to set CAP_SETFCAP effective capability: Operation not permitted
> > > root@caps:/home/ubuntu# unshare -Ur
> > > unshare: write failed /proc/self/uid_map: Operation not permitted
> > > 
> > > Signed-off-by: Serge Hallyn 
> > > 
> > > Changelog:
> > >* fix logic in the case of writing to another task's uid_map
> > >* rename 'ns' to 'map_ns', and make a file_ns local variable
> > >* use /* comments */
> > >* update the CAP_SETFCAP comment in capability.h
> > >* rename parent_unpriv to parent_can_setfcap (and reverse the
> > >  logic)
> > >* remove printks
> > >* clarify (i hope) the code comments
> > >* update capability.h comment
> > >* renamed parent_can_setfcap to parent_could_setfcap
> > >* made the check its own disallowed_0_mapping() fn
> > >* moved the check into new_idmap_permitted
> > > ---
> > 
> > Thank you for working on this fix!
> > 
> > I do prefer your approach of doing the check at user namespace creation
> > time instead of moving it into the setxattr() codepath.
> > 
> > Let me reiterate that the ability to write through fscaps is a valid
> > usecase and this should continue to work but that for locked down user
> > namespace as Andrew wants to use them your patch provides a clean
> > solution.
> > We've are using identity mappings in quite a few scenarios partially
> > when performing tests but also to write through fscaps.
> > We also had reports of users that use identity mappings. They create
> > their rootfs by running image extraction in an identity mapped userns
> > where fscaps are written through.
> > Podman has use-cases for this feature as well and has been affected by
> > the regression of the first fix.
> 
> Thanks for reviewing.
> 
> I'm not sure what your point above is, so just to make sure - the
> alternative implementation also does allow fscaps for cases where
> root uid is remapped, only disallowing it if it would violate the
> ancestor's lack of cap_setfcap.
> 
> 
> > >  include/linux/user_namespace.h  |  3 ++
> > >  include/uapi/linux/capability.h |  3 +-
> > >  kernel/user_namespace.c | 61 +++--
> > >  3 files changed, 63 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/include/linux/user_namespace.h 
> > > b/include/linux/user_namespace.h
> > > index 64cf8ebdc4ec..f6c5f784be5a 100644
> > > --- a/include/linux/user_namespace.h
> > > +++ b/include/linux/user_namespace.h
> > > @@ -63,6 +63,9 @@ struct user_namespace {
> > >   kgid_t  group;
> > >   struct ns_commonns;
> > >   unsigned long   flags;
> > > + /* parent_could_setfcap: true if the creator if this ns had CAP_SETFCAP
> > > +  * in its effective 

Re: [PATCH v3 2/2] iommu/sva: Remove mm parameter from SVA bind API

2021-04-16 Thread kernel test robot
Hi Jacob,

I love your patch! Yet something to improve:

[auto build test ERROR on e49d033bddf5b565044e2abe4241353959bc9120]

url:
https://github.com/0day-ci/linux/commits/Jacob-Pan/Simplify-and-restrict-IOMMU-SVA-APIs/20210417-052451
base:   e49d033bddf5b565044e2abe4241353959bc9120
config: arm64-randconfig-r034-20210416 (attached as .config)
compiler: clang version 13.0.0 (https://github.com/llvm/llvm-project 
f549176ad976caa3e19edd036df9a7e12770af7c)
reproduce (this is a W=1 build):
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# install arm64 cross compiling tool for clang build
# apt-get install binutils-aarch64-linux-gnu
# 
https://github.com/0day-ci/linux/commit/6d85fee95bdcd7e53f10442ddc71d0c310d43367
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review 
Jacob-Pan/Simplify-and-restrict-IOMMU-SVA-APIs/20210417-052451
git checkout 6d85fee95bdcd7e53f10442ddc71d0c310d43367
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 
ARCH=arm64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:2631:15: error: incompatible 
>> function pointer types initializing 'struct iommu_sva *(*)(struct device *, 
>> unsigned int)' with an expression of type 'struct iommu_sva *(struct device 
>> *, struct mm_struct *, unsigned int)' 
>> [-Werror,-Wincompatible-function-pointer-types]
   .sva_bind   = arm_smmu_sva_bind,
 ^
   1 error generated.


vim +2631 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c

f534d98b9d2705 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c Jean-Philippe 
Brucker 2020-09-18  2608  
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2609  static struct iommu_ops arm_smmu_ops = {
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2610   .capable= arm_smmu_capable,
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2611   .domain_alloc   = arm_smmu_domain_alloc,
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2612   .domain_free= arm_smmu_domain_free,
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2613   .attach_dev = arm_smmu_attach_dev,
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2614   .map= arm_smmu_map,
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2615   .unmap  = arm_smmu_unmap,
07fdef34d2be68 drivers/iommu/arm-smmu-v3.c Zhen Lei 
 2018-09-20  2616   .flush_iotlb_all= arm_smmu_flush_iotlb_all,
32b124492bdf97 drivers/iommu/arm-smmu-v3.c Robin Murphy 
 2017-09-28  2617   .iotlb_sync = arm_smmu_iotlb_sync,
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2618   .iova_to_phys   = arm_smmu_iova_to_phys,
cefa0d55da3753 drivers/iommu/arm-smmu-v3.c Joerg Roedel 
 2020-04-29  2619   .probe_device   = arm_smmu_probe_device,
cefa0d55da3753 drivers/iommu/arm-smmu-v3.c Joerg Roedel 
 2020-04-29  2620   .release_device = arm_smmu_release_device,
08d4ca2a672bab drivers/iommu/arm-smmu-v3.c Robin Murphy 
 2016-09-12  2621   .device_group   = arm_smmu_device_group,
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2622   .domain_get_attr= arm_smmu_domain_get_attr,
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2623   .domain_set_attr= arm_smmu_domain_set_attr,
8f78515425daea drivers/iommu/arm-smmu-v3.c Robin Murphy 
 2016-09-12  2624   .of_xlate   = arm_smmu_of_xlate,
50019f09a4baa0 drivers/iommu/arm-smmu-v3.c Eric Auger   
 2017-01-19  2625   .get_resv_regions   = arm_smmu_get_resv_regions,
a66c5dc549d1e1 drivers/iommu/arm-smmu-v3.c Thierry Reding   
 2019-12-18  2626   .put_resv_regions   = 
generic_iommu_put_resv_regions,
f534d98b9d2705 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c Jean-Philippe 
Brucker 2020-09-18  2627   .dev_has_feat   = 
arm_smmu_dev_has_feature,
f534d98b9d2705 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c Jea

Re: [RFC PATCH] irqchip/gic-v3: Do not enable irqs when handling spurious interrups

2021-04-16 Thread He Ying

Hello Marc,


在 2021/4/16 22:15, Marc Zyngier 写道:

[+ Mark]

On Fri, 16 Apr 2021 07:22:17 +0100,
He Ying  wrote:

We found this problem in our kernel src tree:

[   14.816231] [ cut here ]
[   14.816231] kernel BUG at irq.c:99!
[   14.816232] Internal error: Oops - BUG: 0 [#1] SMP
[   14.816232] Process swapper/0 (pid: 0, stack limit = 0x(ptrval))
[   14.816233] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G   O  
4.19.95-1.h1.AOS2.0.aarch64 #14
[   14.816233] Hardware name: evb (DT)
[   14.816234] pstate: 80400085 (Nzcv daIf +PAN -UAO)
[   14.816234] pc : asm_nmi_enter+0x94/0x98
[   14.816235] lr : asm_nmi_enter+0x18/0x98
[   14.816235] sp : 08003c50
[   14.816235] pmr_save: 0070
[   14.816237] x29: 08003c50 x28: 095f56c0
[   14.816238] x27:  x26: 08004000
[   14.816239] x25: 015e x24: 8008fb916000
[   14.816240] x23: 2045 x22: 080817cc
[   14.816241] x21: 08003da0 x20: 0060
[   14.816242] x19: 03ff x18: 
[   14.816243] x17: 0008 x16: 003d0900
[   14.816244] x15: 095ea6c8 x14: 8008fff5ab40
[   14.816244] x13: 8008fff58b9d x12: 
[   14.816245] x11: 08c8a200 x10: 8e31fca5
[   14.816246] x9 : 08c8a208 x8 : 000f
[   14.816247] x7 : 0004 x6 : 8008fff58b9e
[   14.816248] x5 :  x4 : 8000
[   14.816249] x3 :  x2 : 8000
[   14.816250] x1 : 0012 x0 : 095f56c0
[   14.816251] Call trace:
[   14.816251]  asm_nmi_enter+0x94/0x98
[   14.816251]  el1_irq+0x8c/0x180
[   14.816252]  gic_handle_irq+0xbc/0x2e4
[   14.816252]  el1_irq+0xcc/0x180
[   14.816253]  arch_timer_handler_virt+0x38/0x58
[   14.816253]  handle_percpu_devid_irq+0x90/0x240
[   14.816253]  generic_handle_irq+0x34/0x50
[   14.816254]  __handle_domain_irq+0x68/0xc0
[   14.816254]  gic_handle_irq+0xf8/0x2e4
[   14.816255]  el1_irq+0xcc/0x180
[   14.816255]  arch_cpu_idle+0x34/0x1c8
[   14.816255]  default_idle_call+0x24/0x44
[   14.816256]  do_idle+0x1d0/0x2c8
[   14.816256]  cpu_startup_entry+0x28/0x30
[   14.816256]  rest_init+0xb8/0xc8
[   14.816257]  start_kernel+0x4c8/0x4f4
[   14.816257] Code: 940587f1 d5384100 b9401001 36a7fd01 (d421)
[   14.816258] Modules linked in: start_dp(O) smeth(O)
[   15.103092] ---[ end trace 701753956cb14aa8 ]---
[   15.103093] Kernel panic - not syncing: Fatal exception in interrupt
[   15.103099] SMP: stopping secondary CPUs
[   15.103100] Kernel Offset: disabled
[   15.103100] CPU features: 0x36,a2400218
[   15.103100] Memory Limit: none

Urgh...


Our kernel src tree is based on 4.19.95 and backports arm64 pseudo-NMI
patches but doesn't support nested NMI. Its top relative commit is
commit 17ce302f3117 ("arm64: Fix interrupt tracing in the presence of NMIs").

Can you please reproduce it with mainline and without any backport?
It is hard to reason about something that isn't a vanilla kernel.


I think our kernel is quite like v5.3 mainline. Reproducing it in v5.3 
mainline may


be a little difficult for us because our product needs some more self 
developed


patches to work.




I look into this issue and find that it's caused by 'BUG_ON(in_nmi())'
in nmi_enter(). From the call trace, we find two 'el1_irqs' which
means an interrupt preempts the other one and the new one is an NMI.
Furthermore, by adding some prints, we find the first irq also calls
nmi_enter(), but its priority is not GICD_INT_NMI_PRI and its irq number
is 1023. It enables irq by calling gic_arch_enable_irqs() in
gic_handle_irq(). At this moment, the second irq preempts the first irq
and it's an NMI but current context is already in nmi. So that may be
the problem.

I'm not sure I get it. From the stack trace, I see this:

[   14.816251]  asm_nmi_enter+0x94/0x98
[   14.816251]  el1_irq+0x8c/0x180  (C)
[   14.816252]  gic_handle_irq+0xbc/0x2e4
[   14.816252]  el1_irq+0xcc/0x180  (B)
[   14.816253]  arch_timer_handler_virt+0x38/0x58
[   14.816253]  handle_percpu_devid_irq+0x90/0x240
[   14.816253]  generic_handle_irq+0x34/0x50
[   14.816254]  __handle_domain_irq+0x68/0xc0
[   14.816254]  gic_handle_irq+0xf8/0x2e4
[   14.816255]  el1_irq+0xcc/0x180  (A)

which indicates that we preempted a timer interrupt (A) with another
IRQ (B), itself immediately preempted by another IRQ (C)? That's
indeed at least one too many.

Can you please describe for each of (A), (B) and (C) whether they are
spurious or not, what their priorities are if they aren't spurious?


Yes. I ignored interrupt (A). (B) is spurious and its priority is 0xa0 
and PMR is 0x70.


(C) is an NMI and its priority is 0x20. Note that GIC_PRIO_IRQON is 0xe0,

GIC_PRIO_IRQOFF is 0x60, GICD_INT_DEF_PRI is 0xa0 and GICD_INT_NMI_PRI is

0x20 in our kernel.


In my opinion, when handling 

Re: Candidate Linux ABI for Intel AMX and hypothetical new related features

2021-04-16 Thread Len Brown
On Fri, Apr 16, 2021 at 6:14 PM Andy Lutomirski  wrote:

> My point is that every ...

I encourage you to continue to question everything and trust nobody.
While it may cost you a lot in counseling, it is certainly valuable,
at least to me! :-)

I do request, however, that feedback stay specific, stay technical,
and stay on-topic.
We all have plenty of real challenges we can be tackling with our limited time.

> Is there any authoritative guidance at all on what actually happens,
> performance-wise, when someone does AMX math?

Obviously, I can't speak to the performance of AMX itself pre-production,
and somebody who does that for a living will release stuff on or
before release day.

What I've told you about the performance side-effects on the system
(and lack thereof)
from running AMX code is an authoritative answer, and is as much as I
can tell you today.
If I failed to answer a question about AMX, my apologies, please re-ask it.

And if we learn something new between now and release day that is
relevant to this discussion,
I will certainly request to share it.

Our team (Intel Open Source Technology Center) advocated getting the existing
public AMX documentation published as early as possible.  However, if
you are really
into the details of how AMX works, you may also be interested to know
that the AMX hardware patent filings are fully public ;-)

cheers,
Len Brown, Intel Open Source Technology Center


Re: [PATCH] riscv: atomic: Using ARCH_ATOMIC in asm/atomic.h

2021-04-16 Thread Guo Ren
On Thu, Apr 15, 2021 at 4:52 PM Peter Zijlstra  wrote:
>
> On Thu, Apr 15, 2021 at 07:39:22AM +, guo...@kernel.org wrote:
> >  - Add atomic_andnot_* operation
>
> > @@ -76,6 +59,12 @@ ATOMIC_OPS(sub, add, -i)
> >  ATOMIC_OPS(and, and,  i)
> >  ATOMIC_OPS( or,  or,  i)
> >  ATOMIC_OPS(xor, xor,  i)
> > +ATOMIC_OPS(andnot, and,  -i)
>
> ~i, surely.

Thx for correcting me. I'll fix it in the next version of the patch.

-- 
Best Regards
 Guo Ren

ML: https://lore.kernel.org/linux-csky/


Re: [PATCH v3 2/2] iommu/sva: Remove mm parameter from SVA bind API

2021-04-16 Thread kernel test robot
Hi Jacob,

I love your patch! Yet something to improve:

[auto build test ERROR on e49d033bddf5b565044e2abe4241353959bc9120]

url:
https://github.com/0day-ci/linux/commits/Jacob-Pan/Simplify-and-restrict-IOMMU-SVA-APIs/20210417-052451
base:   e49d033bddf5b565044e2abe4241353959bc9120
config: arm64-randconfig-r022-20210416 (attached as .config)
compiler: aarch64-linux-gcc (GCC) 9.3.0
reproduce (this is a W=1 build):
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# 
https://github.com/0day-ci/linux/commit/6d85fee95bdcd7e53f10442ddc71d0c310d43367
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review 
Jacob-Pan/Simplify-and-restrict-IOMMU-SVA-APIs/20210417-052451
git checkout 6d85fee95bdcd7e53f10442ddc71d0c310d43367
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross W=1 
ARCH=arm64 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All errors (new ones prefixed by >>):

>> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:2631:15: error: initialization 
>> of 'struct iommu_sva * (*)(struct device *, unsigned int)' from incompatible 
>> pointer type 'struct iommu_sva * (*)(struct device *, struct mm_struct *, 
>> unsigned int)' [-Werror=incompatible-pointer-types]
2631 |  .sva_bind  = arm_smmu_sva_bind,
 |   ^
   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c:2631:15: note: (near 
initialization for 'arm_smmu_ops.sva_bind')
   cc1: some warnings being treated as errors


vim +2631 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c

f534d98b9d2705 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c Jean-Philippe 
Brucker 2020-09-18  2608  
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2609  static struct iommu_ops arm_smmu_ops = {
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2610   .capable= arm_smmu_capable,
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2611   .domain_alloc   = arm_smmu_domain_alloc,
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2612   .domain_free= arm_smmu_domain_free,
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2613   .attach_dev = arm_smmu_attach_dev,
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2614   .map= arm_smmu_map,
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2615   .unmap  = arm_smmu_unmap,
07fdef34d2be68 drivers/iommu/arm-smmu-v3.c Zhen Lei 
 2018-09-20  2616   .flush_iotlb_all= arm_smmu_flush_iotlb_all,
32b124492bdf97 drivers/iommu/arm-smmu-v3.c Robin Murphy 
 2017-09-28  2617   .iotlb_sync = arm_smmu_iotlb_sync,
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2618   .iova_to_phys   = arm_smmu_iova_to_phys,
cefa0d55da3753 drivers/iommu/arm-smmu-v3.c Joerg Roedel 
 2020-04-29  2619   .probe_device   = arm_smmu_probe_device,
cefa0d55da3753 drivers/iommu/arm-smmu-v3.c Joerg Roedel 
 2020-04-29  2620   .release_device = arm_smmu_release_device,
08d4ca2a672bab drivers/iommu/arm-smmu-v3.c Robin Murphy 
 2016-09-12  2621   .device_group   = arm_smmu_device_group,
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2622   .domain_get_attr= arm_smmu_domain_get_attr,
48ec83bcbcf509 drivers/iommu/arm-smmu-v3.c Will Deacon  
 2015-05-27  2623   .domain_set_attr= arm_smmu_domain_set_attr,
8f78515425daea drivers/iommu/arm-smmu-v3.c Robin Murphy 
 2016-09-12  2624   .of_xlate   = arm_smmu_of_xlate,
50019f09a4baa0 drivers/iommu/arm-smmu-v3.c Eric Auger   
 2017-01-19  2625   .get_resv_regions   = arm_smmu_get_resv_regions,
a66c5dc549d1e1 drivers/iommu/arm-smmu-v3.c Thierry Reding   
 2019-12-18  2626   .put_resv_regions   = 
generic_iommu_put_resv_regions,
f534d98b9d2705 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c Jean-Philippe 
Brucker 2020-09-18  2627   .dev_has_feat   = 
arm_smmu_dev_has_feature,
f534d98b9d2705 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c Jean-Philippe 
Brucker 2020-09-18  2628   .dev_feat_enabled   = 
arm_smmu_dev_feature_enabled,
f534d98b9d2705 drivers/iom

[PATCH 3/3] ARM: dts: mstar: Add a dts for M5Stack UnitV2

2021-04-16 Thread Daniel Palmer
M5Stack are releasing a new widget based on the
SigmaStar SSD202D. We have some support for the
SSD202D so lets add a dts for it.

Link: https://m5stack-store.myshopify.com/products/unitv2-ai-camera-gc2145
Signed-off-by: Daniel Palmer 
---
 arch/arm/boot/dts/Makefile|  1 +
 .../dts/mstar-infinity2m-ssd202d-unitv2.dts   | 25 +++
 2 files changed, 26 insertions(+)
 create mode 100644 arch/arm/boot/dts/mstar-infinity2m-ssd202d-unitv2.dts

diff --git a/arch/arm/boot/dts/Makefile b/arch/arm/boot/dts/Makefile
index 8e5d4ab4e75e..66ddc7d0bd03 100644
--- a/arch/arm/boot/dts/Makefile
+++ b/arch/arm/boot/dts/Makefile
@@ -1397,6 +1397,7 @@ dtb-$(CONFIG_ARCH_MILBEAUT) += milbeaut-m10v-evb.dtb
 dtb-$(CONFIG_ARCH_MSTARV7) += \
mstar-infinity-msc313-breadbee_crust.dtb \
mstar-infinity2m-ssd202d-ssd201htv2.dtb \
+   mstar-infinity2m-ssd202d-unitv2.dtb \
mstar-infinity3-msc313e-breadbee.dtb \
mstar-mercury5-ssc8336n-midrived08.dtb
 dtb-$(CONFIG_ARCH_ASPEED) += \
diff --git a/arch/arm/boot/dts/mstar-infinity2m-ssd202d-unitv2.dts 
b/arch/arm/boot/dts/mstar-infinity2m-ssd202d-unitv2.dts
new file mode 100644
index ..a81684002e45
--- /dev/null
+++ b/arch/arm/boot/dts/mstar-infinity2m-ssd202d-unitv2.dts
@@ -0,0 +1,25 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2021 thingy.jp.
+ * Author: Daniel Palmer 
+ */
+
+/dts-v1/;
+#include "mstar-infinity2m-ssd202d.dtsi"
+
+/ {
+   model = "UnitV2";
+   compatible = "m5stack,unitv2", "mstar,infinity2m";
+
+   aliases {
+   serial0 = &pm_uart;
+   };
+
+   chosen {
+   stdout-path = "serial0:115200n8";
+   };
+};
+
+&pm_uart {
+   status = "okay";
+};
-- 
2.31.0



[PATCH 2/3] dt-bindings: arm: mstar: Add compatible for M5Stack UnitV2

2021-04-16 Thread Daniel Palmer
Add a compatible for the M5Stack UnitV2 that is based on the
SigmaStar SSD202D (infinity2m).

Signed-off-by: Daniel Palmer 
---
 Documentation/devicetree/bindings/arm/mstar/mstar.yaml | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/devicetree/bindings/arm/mstar/mstar.yaml 
b/Documentation/devicetree/bindings/arm/mstar/mstar.yaml
index 61d08c473eb8..a316eef1b728 100644
--- a/Documentation/devicetree/bindings/arm/mstar/mstar.yaml
+++ b/Documentation/devicetree/bindings/arm/mstar/mstar.yaml
@@ -24,6 +24,7 @@ properties:
 items:
   - enum:
   - honestar,ssd201htv2 # Honestar SSD201_HT_V2 devkit
+  - m5stack,unitv2 # M5Stack UnitV2
   - const: mstar,infinity2m
 
   - description: infinity3 boards
-- 
2.31.0



Re: [PATCH 0/8] iommu: fix a couple of spelling mistakes detected by codespell tool

2021-04-16 Thread Leizhen (ThunderTown)



On 2021/4/16 23:24, Joerg Roedel wrote:
> On Fri, Mar 26, 2021 at 02:24:04PM +0800, Zhen Lei wrote:
>> This detection and correction covers the entire driver/iommu directory.
>>
>> Zhen Lei (8):
>>   iommu/pamu: fix a couple of spelling mistakes
>>   iommu/omap: Fix spelling mistake "alignement" -> "alignment"
>>   iommu/mediatek: Fix spelling mistake "phyiscal" -> "physical"
>>   iommu/sun50i: Fix spelling mistake "consits" -> "consists"
>>   iommu: fix a couple of spelling mistakes
>>   iommu/amd: fix a couple of spelling mistakes
>>   iommu/arm-smmu: Fix spelling mistake "initally" -> "initially"
>>   iommu/vt-d: fix a couple of spelling mistakes
> 
> This patch-set doesn't apply. Please re-send it as a single patch when
> v5.13-rc1 is released.

OK

> 
> Thanks,
> 
>   Joerg
> 
> .
> 



[PATCH 0/3] ARM: mstar: Add initial support for M5Stack UnitV2

2021-04-16 Thread Daniel Palmer
This series adds basic support for the soon to be released M5Stack
UnitV2 based on the SigmaStar SSD202D.

With the rest of the commits in my tree the SPI NAND, ethernet, USB etc
should work so the UnitV2 should be fully usable with a mainline-ish
kernel.

Hopefully this will encourage someone else to help with cleaning
up and pushing the commits for these SoCs.

Link: https://m5stack-store.myshopify.com/products/unitv2-ai-camera-gc2145

Daniel Palmer (3):
  dt-bindings: vendor-prefixes: Add vendor prefix for M5Stack
  dt-bindings: arm: mstar: Add compatible for M5Stack UnitV2
  ARM: dts: mstar: Add a dts for M5Stack UnitV2

 .../devicetree/bindings/arm/mstar/mstar.yaml  |  1 +
 .../devicetree/bindings/vendor-prefixes.yaml  |  2 ++
 arch/arm/boot/dts/Makefile|  1 +
 .../dts/mstar-infinity2m-ssd202d-unitv2.dts   | 25 +++
 4 files changed, 29 insertions(+)
 create mode 100644 arch/arm/boot/dts/mstar-infinity2m-ssd202d-unitv2.dts

-- 
2.31.0



[PATCH 1/3] dt-bindings: vendor-prefixes: Add vendor prefix for M5Stack

2021-04-16 Thread Daniel Palmer
M5Stack make various modules for STEM, Makers, IoT.
Their UnitV2 is based on a SigmaStar SSD202D SoC which
we already have some minimal support for so add a
prefix in preparation for UnitV2 board support.

Link: https://m5stack.com/
Signed-off-by: Daniel Palmer 
---
 Documentation/devicetree/bindings/vendor-prefixes.yaml | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/Documentation/devicetree/bindings/vendor-prefixes.yaml 
b/Documentation/devicetree/bindings/vendor-prefixes.yaml
index f6064d84a424..7129fe3b9144 100644
--- a/Documentation/devicetree/bindings/vendor-prefixes.yaml
+++ b/Documentation/devicetree/bindings/vendor-prefixes.yaml
@@ -651,6 +651,8 @@ patternProperties:
 description: Liebherr-Werk Nenzing GmbH
   "^lxa,.*":
 description: Linux Automation GmbH
+  "^m5stack,.*":
+description: M5Stack
   "^macnica,.*":
 description: Macnica Americas
   "^mantix,.*":
-- 
2.31.0



Re: [PATCH 5/8] iommu: fix a couple of spelling mistakes

2021-04-16 Thread Leizhen (ThunderTown)



On 2021/4/16 23:55, John Garry wrote:
> On 26/03/2021 06:24, Zhen Lei wrote:
>> There are several spelling mistakes, as follows:
>> funcions ==> functions
>> distiguish ==> distinguish
>> detroyed ==> destroyed
>>
>> Signed-off-by: Zhen Lei
> 
> I think that there should be a /s/appropriatley/appropriately/ in iommu.c

OK, I will fix it in v2.

> 
> Thanks,
> john
> 
> .
> 



[PATCH v4 1/2] binfmt_flat: allow not offsetting data start

2021-04-16 Thread Damien Le Moal
Commit 2217b9826246 ("binfmt_flat: revert "binfmt_flat: don't offset
the data start"") restored offsetting the start of the data section by
a number of words defined by MAX_SHARED_LIBS. As a result, since
MAX_SHARED_LIBS is never 0, a gap between the text and data sections
always exists. For architectures which cannot support such a gap
between the text and data sections (e.g. riscv nommu), flat binary
programs cannot be executed.

To allow an architecture to request no data start offset to allow for
contiguous text and data sections for binaries flagged with
FLAT_FLAG_RAM, introduce the new config option
CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET. Using this new option, the
macro DATA_START_OFFSET_WORDS is conditionally defined in binfmt_flat.c
to MAX_SHARED_LIBS for architectures tolerating or needing the data
start offset (CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET disabled case)
and to 0 when CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET is enabled.
DATA_START_OFFSET_WORDS is used in load_flat_file() to calculate the
data section length and start position.

Signed-off-by: Damien Le Moal 
---
 fs/Kconfig.binfmt |  3 +++
 fs/binfmt_flat.c  | 19 ++-
 2 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/Kconfig.binfmt b/fs/Kconfig.binfmt
index c6f1c8c1934e..06fb7a93a1bd 100644
--- a/fs/Kconfig.binfmt
+++ b/fs/Kconfig.binfmt
@@ -112,6 +112,9 @@ config BINFMT_FLAT_ARGVP_ENVP_ON_STACK
 config BINFMT_FLAT_OLD_ALWAYS_RAM
bool
 
+config BINFMT_FLAT_NO_DATA_START_OFFSET
+   bool
+
 config BINFMT_FLAT_OLD
bool "Enable support for very old legacy flat binaries"
depends on BINFMT_FLAT
diff --git a/fs/binfmt_flat.c b/fs/binfmt_flat.c
index b9c658e0548e..1dc68dfba3e0 100644
--- a/fs/binfmt_flat.c
+++ b/fs/binfmt_flat.c
@@ -74,6 +74,12 @@
 #defineMAX_SHARED_LIBS (1)
 #endif
 
+#ifdef CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET
+#define DATA_START_OFFSET_WORDS(0)
+#else
+#define DATA_START_OFFSET_WORDS(MAX_SHARED_LIBS)
+#endif
+
 struct lib_info {
struct {
unsigned long start_code;   /* Start of text 
segment */
@@ -560,6 +566,7 @@ static int load_flat_file(struct linux_binprm *bprm,
 * it all together.
 */
if (!IS_ENABLED(CONFIG_MMU) && !(flags & 
(FLAT_FLAG_RAM|FLAT_FLAG_GZIP))) {
+
/*
 * this should give us a ROM ptr,  but if it doesn't we don't
 * really care
@@ -576,7 +583,8 @@ static int load_flat_file(struct linux_binprm *bprm,
goto err;
}
 
-   len = data_len + extra + MAX_SHARED_LIBS * sizeof(unsigned 
long);
+   len = data_len + extra +
+   DATA_START_OFFSET_WORDS * sizeof(unsigned long);
len = PAGE_ALIGN(len);
realdatastart = vm_mmap(NULL, 0, len,
PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE, 0);
@@ -591,7 +599,7 @@ static int load_flat_file(struct linux_binprm *bprm,
goto err;
}
datapos = ALIGN(realdatastart +
-   MAX_SHARED_LIBS * sizeof(unsigned long),
+   DATA_START_OFFSET_WORDS * sizeof(unsigned long),
FLAT_DATA_ALIGN);
 
pr_debug("Allocated data+bss+stack (%u bytes): %lx\n",
@@ -622,7 +630,8 @@ static int load_flat_file(struct linux_binprm *bprm,
memp_size = len;
} else {
 
-   len = text_len + data_len + extra + MAX_SHARED_LIBS * 
sizeof(u32);
+   len = text_len + data_len + extra +
+   DATA_START_OFFSET_WORDS * sizeof(u32);
len = PAGE_ALIGN(len);
textpos = vm_mmap(NULL, 0, len,
PROT_READ | PROT_EXEC | PROT_WRITE, MAP_PRIVATE, 0);
@@ -638,7 +647,7 @@ static int load_flat_file(struct linux_binprm *bprm,
 
realdatastart = textpos + ntohl(hdr->data_start);
datapos = ALIGN(realdatastart +
-   MAX_SHARED_LIBS * sizeof(u32),
+   DATA_START_OFFSET_WORDS * sizeof(u32),
FLAT_DATA_ALIGN);
 
reloc = (__be32 __user *)
@@ -714,7 +723,7 @@ static int load_flat_file(struct linux_binprm *bprm,
ret = result;
pr_err("Unable to read code+data+bss, errno %d\n", ret);
vm_munmap(textpos, text_len + data_len + extra +
-   MAX_SHARED_LIBS * sizeof(u32));
+ DATA_START_OFFSET_WORDS * sizeof(u32));
goto err;
}
}
-- 
2.30.2



[PATCH v4 2/2] riscv: Disable data start offset in flat binaries

2021-04-16 Thread Damien Le Moal
uclibc/gcc combined with elf2flt riscv linker file fully resolve the
PC relative __global_pointer$ value at compile time and do not generate
a relocation entry to set a correct value of the gp register at runtime.
As a result, if the flatbin loader offsets the start of the data
section, the relative position change between the text and data sections
compared to the compile time positions results in an incorrect gp value
being used. This causes flatbin executables to crash.

Avoid this problem by enabling CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET
automatically when CONFIG_RISCV is enabled and CONFIG_MMU is disabled.

Signed-off-by: Damien Le Moal 
Acked-by: Palmer Dabbelt 
---
 arch/riscv/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 4515a10c5d22..add528eb9235 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -33,6 +33,7 @@ config RISCV
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
+   select BINFMT_FLAT_NO_DATA_START_OFFSET if !MMU
select CLONE_BACKWARDS
select CLINT_TIMER if !MMU
select COMMON_CLK
-- 
2.30.2



[PATCH v4 0/2] Fix binfmt_flat loader for RISC-V

2021-04-16 Thread Damien Le Moal
RISC-V NOMMU flat binaries cannot tolerate a gap between the text and
data section as the toolchain fully resolves at compile time the PC
relative global pointer (__global_pointer$ value loaded in the gp
register). Without a relocation entry provided, the flat bin loader
cannot fix the value if a gap is introduced and user executables fail
to run.

This series fixes this problem by allowing an architecture to request
the flat loader to suppress the offset of the data start section.
Combined with the use of elf2flt "-r" option to mark the flat
executables with the FLAT_FLAG_RAM flag, the text and data sections are
loaded contiguously in memory, without a change in their relative
position from compile time.

The first patch fixes binfmt_flat flat_load_file() using the new
configuration option CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET. The
second patch enables this new option for RISCV NOMMU builds.

These patches do not change the binfmt_flat loader behavior for other
architectures.

Changes from v3:
* Renamed the configuration option from
  CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP to
  CONFIG_BINFMT_FLAT_NO_DATA_START_OFFSET to clarify that only the
  offseting of the data section start is suppressed.
* Do not force loading to RAM (contiguously) if the flat binary does not
  have the FLAT_FLAG_RAM flag set.
* Updated commit messages to reflect above changes.

Changes from v2:
* Updated distribution list
* Added Palmer ack-by tag

Changes from v1:
* Replace FLAT_TEXT_DATA_NO_GAP macro with
  CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP config option (patch 1).
* Remove the addition of riscv/include/asm/flat.h and set
  CONFIG_BINFMT_FLAT_NO_TEXT_DATA_GAP for RISCV and !MMU

Damien Le Moal (2):
  binfmt_flat: allow not offsetting data start
  riscv: Disable data start offset in flat binaries

 arch/riscv/Kconfig |  1 +
 fs/Kconfig.binfmt  |  3 +++
 fs/binfmt_flat.c   | 19 ++-
 3 files changed, 18 insertions(+), 5 deletions(-)

-- 
2.30.2



Re: [PATCH 1/2] mm: Fix struct page layout on 32-bit systems

2021-04-16 Thread kernel test robot
Hi "Matthew,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on linux/master]
[also build test WARNING on linus/master v5.12-rc7]
[cannot apply to hnaz-linux-mm/master next-20210416]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch]

url:
https://github.com/0day-ci/linux/commits/Matthew-Wilcox-Oracle/Change-struct-page-layout-for-page_pool/20210417-070951
base:   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
5e46d1b78a03d52306f21f77a4e4a144b6d31486
config: parisc-randconfig-s031-20210416 (attached as .config)
compiler: hppa-linux-gcc (GCC) 9.3.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# apt-get install sparse
# sparse version: v0.6.3-280-g2cd6d34e-dirty
# 
https://github.com/0day-ci/linux/commit/898e155048088be20b2606575a24108eacc4c91b
git remote add linux-review https://github.com/0day-ci/linux
git fetch --no-tags linux-review 
Matthew-Wilcox-Oracle/Change-struct-page-layout-for-page_pool/20210417-070951
git checkout 898e155048088be20b2606575a24108eacc4c91b
# save the attached .config to linux build tree
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-9.3.0 make.cross C=1 
CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' W=1 ARCH=parisc 

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot 

All warnings (new ones prefixed by >>):

   In file included from net/core/xdp.c:15:
   include/net/page_pool.h: In function 'page_pool_get_dma_addr':
>> include/net/page_pool.h:203:40: warning: left shift count >= width of type 
>> [-Wshift-count-overflow]
 203 |   ret |= (dma_addr_t)page->dma_addr[1] << 32;
 |^~
   include/net/page_pool.h: In function 'page_pool_set_dma_addr':
>> include/net/page_pool.h:211:28: warning: right shift count >= width of type 
>> [-Wshift-count-overflow]
 211 |   page->dma_addr[1] = addr >> 32;
 |^~


vim +203 include/net/page_pool.h

   198  
   199  static inline dma_addr_t page_pool_get_dma_addr(struct page *page)
   200  {
   201  dma_addr_t ret = page->dma_addr[0];
   202  if (sizeof(dma_addr_t) > sizeof(unsigned long))
 > 203  ret |= (dma_addr_t)page->dma_addr[1] << 32;
   204  return ret;
   205  }
   206  
   207  static inline void page_pool_set_dma_addr(struct page *page, dma_addr_t 
addr)
   208  {
   209  page->dma_addr[0] = addr;
   210  if (sizeof(dma_addr_t) > sizeof(unsigned long))
 > 211  page->dma_addr[1] = addr >> 32;
   212  }
   213  

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


.config.gz
Description: application/gzip


Re: [PATCH v4 2/2] kvm/arm64: Try stage2 block mapping for host device MMIO

2021-04-16 Thread Keqian Zhu
Hi Marc,

On 2021/4/16 22:44, Marc Zyngier wrote:
> On Thu, 15 Apr 2021 15:08:09 +0100,
> Keqian Zhu  wrote:
>>
>> Hi Marc,
>>
>> On 2021/4/15 22:03, Keqian Zhu wrote:
>>> The MMIO region of a device maybe huge (GB level), try to use
>>> block mapping in stage2 to speedup both map and unmap.
>>>
>>> Compared to normal memory mapping, we should consider two more
>>> points when try block mapping for MMIO region:
>>>
>>> 1. For normal memory mapping, the PA(host physical address) and
>>> HVA have same alignment within PUD_SIZE or PMD_SIZE when we use
>>> the HVA to request hugepage, so we don't need to consider PA
>>> alignment when verifing block mapping. But for device memory
>>> mapping, the PA and HVA may have different alignment.
>>>
>>> 2. For normal memory mapping, we are sure hugepage size properly
>>> fit into vma, so we don't check whether the mapping size exceeds
>>> the boundary of vma. But for device memory mapping, we should pay
>>> attention to this.
>>>
>>> This adds get_vma_page_shift() to get page shift for both normal
>>> memory and device MMIO region, and check these two points when
>>> selecting block mapping size for MMIO region.
>>>
>>> Signed-off-by: Keqian Zhu 
>>> ---
>>>  arch/arm64/kvm/mmu.c | 61 
>>>  1 file changed, 51 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index c59af5ca01b0..5a1cc7751e6d 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -738,6 +738,35 @@ transparent_hugepage_adjust(struct kvm_memory_slot 
>>> *memslot,
>>> return PAGE_SIZE;
>>>  }
>>>  
[...]

>>> +   /*
>>> +* logging_active is guaranteed to never be true for VM_PFNMAP
>>> +* memslots.
>>> +*/
>>> +   if (logging_active) {
>>> force_pte = true;
>>> vma_shift = PAGE_SHIFT;
>>> +   } else {
>>> +   vma_shift = get_vma_page_shift(vma, hva);
>>> }
>> I use a if/else manner in v4, please check that. Thanks very much!
> 
> That's fine. However, it is getting a bit late for 5.13, and we don't
> have much time to left it simmer in -next. I'll probably wait until
> after the merge window to pick it up.
OK, no problem. Thanks! :)

BRs,
Keqian


Re: [PATCH 5.10 00/25] 5.10.31-rc1 review

2021-04-16 Thread Samuel Zou




On 2021/4/15 22:47, Greg Kroah-Hartman wrote:

This is the start of the stable review cycle for the 5.10.31 release.
There are 25 patches in this series, all will be posted as a response
to this one.  If anyone has any issues with these being applied, please
let me know.

Responses should be made by Sat, 17 Apr 2021 14:44:01 +.
Anything received after that time might be too late.

The whole patch series can be found in one patch at:

https://www.kernel.org/pub/linux/kernel/v5.x/stable-review/patch-5.10.31-rc1.gz
or in the git tree and branch at:

git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
linux-5.10.y
and the diffstat can be found below.

thanks,

greg k-h



Tested on arm64 and x86 for 5.10.31-rc1,

Kernel repo:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
Branch: linux-5.10.y
Version: 5.10.31-rc1
Commit: 32f5704a0a4f7dcc8aa74a49dbcce359d758f6d5
Compiler: gcc version 7.3.0 (GCC)

arm64:

Testcase Result Summary:
total: 5764
passed: 5764
failed: 0
timeout: 0


x86:

Testcase Result Summary:
total: 5764
passed: 5764
failed: 0
timeout: 0


Tested-by: Hulk Robot 


Re: [PATCH v3] powerpc: fix EDEADLOCK redefinition error in uapi/asm/errno.h

2021-04-16 Thread Tony Ambardar
On Fri, 16 Apr 2021 at 03:41, Michael Ellerman  wrote:
>
> Tony Ambardar  writes:
> > Hello Michael,
> >
> > The latest version of this patch addressed all feedback I'm aware of
> > when submitted last September, and I've seen no further comments from
> > reviewers since then.
> >
> > Could you please let me know where this stands and if anything further
> > is needed?
>
> Sorry, it's still sitting in my inbox :/
>
> I was going to reply to suggest we split the tools change out. The
> headers under tools are usually updated by another maintainer, I think
> it might even be scripted.
>
> Anyway I've applied your patch and done that (dropped the change to
> tools/.../errno.h), which should also mean the stable backport is more
> likely to work automatically.
>
> It will hit mainline in v5.13-rc1 and then be backported to the stable
> trees.
>
> I don't think you actually need the tools version of the header updated
> to fix your bug? In which case we can probably just wait for it to be
> updated automatically when the tools headers are sync'ed with the kernel
> versions.
>
> cheers

I appreciate the follow up. My original bug was indeed with the tools
header but is being patched locally, so waiting for those headers to
sync with the kernel versions is fine if it simplifies things overall.

Thanks,
Tony


Re: [PATCH 4.14 00/68] 4.14.231-rc1 review

2021-04-16 Thread Samuel Zou




On 2021/4/15 22:46, Greg Kroah-Hartman wrote:

This is the start of the stable review cycle for the 4.14.231 release.
There are 68 patches in this series, all will be posted as a response
to this one.  If anyone has any issues with these being applied, please
let me know.

Responses should be made by Sat, 17 Apr 2021 14:44:01 +.
Anything received after that time might be too late.

The whole patch series can be found in one patch at:

https://www.kernel.org/pub/linux/kernel/v4.x/stable-review/patch-4.14.231-rc1.gz
or in the git tree and branch at:

git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git 
linux-4.14.y
and the diffstat can be found below.

thanks,

greg k-h



Tested on x86 for 4.14.231-rc1,

Kernel repo:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable-rc.git
Branch: linux-4.14.y
Version: 4.14.231-rc1
Commit: 520c87617485a8885f18d5cb9d70076199e37b43
Compiler: gcc version 7.3.0 (GCC)

x86:

Testcase Result Summary:
total: 5711
passed: 5711
failed: 0
timeout: 0


Tested-by: Hulk Robot 


Re: [PATCH] cxl/mem: Fix memory device capacity probing

2021-04-16 Thread Verma, Vishal L
On Fri, 2021-04-16 at 17:43 -0700, Dan Williams wrote:
> The CXL Identify Memory Device output payload emits capacity in 256MB
> units. The driver is treating the capacity field as bytes. This was
> missed because QEMU reports bytes when it should report bytes / 256MB.
> 
> Fixes: 8adaf747c9f0 ("cxl/mem: Find device capabilities")
> Cc: Ben Widawsky 
> Signed-off-by: Dan Williams 
> ---
>  drivers/cxl/mem.c |7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)

Looks good,
Reviewed-by: Vishal Verma 

> 
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 1b5078311f7d..2acc6173da36 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -4,6 +4,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -1419,6 +1420,7 @@ static int cxl_mem_enumerate_cmds(struct cxl_mem *cxlm)
>   */
>  static int cxl_mem_identify(struct cxl_mem *cxlm)
>  {
> + /* See CXL 2.0 Table 175 Identify Memory Device Output Payload */
>   struct cxl_mbox_identify {
>   char fw_revision[0x10];
>   __le64 total_capacity;
> @@ -1447,10 +1449,11 @@ static int cxl_mem_identify(struct cxl_mem *cxlm)
>    * For now, only the capacity is exported in sysfs
>    */
>   cxlm->ram_range.start = 0;
> - cxlm->ram_range.end = le64_to_cpu(id.volatile_capacity) - 1;
> + cxlm->ram_range.end = le64_to_cpu(id.volatile_capacity) * SZ_256M - 1;
>  
> 
> 
> 
>   cxlm->pmem_range.start = 0;
> - cxlm->pmem_range.end = le64_to_cpu(id.persistent_capacity) - 1;
> + cxlm->pmem_range.end =
> + le64_to_cpu(id.persistent_capacity) * SZ_256M - 1;
>  
> 
> 
> 
>   memcpy(cxlm->firmware_version, id.fw_revision, sizeof(id.fw_revision));
>  
> 
> 
> 
> 



[gustavoars-linux:testing/warray-bounds] BUILD SUCCESS WITH WARNING f26f5c4b90b01bfc415b38f9246b0a36d63b9aaa

2021-04-16 Thread kernel test robot
tree/branch: 
https://git.kernel.org/pub/scm/linux/kernel/git/gustavoars/linux.git 
testing/warray-bounds
branch HEAD: f26f5c4b90b01bfc415b38f9246b0a36d63b9aaa  sctp: Fix out-of-bounds 
warning in sctp_process_asconf_param()

Warning reports:

https://lore.kernel.org/lkml/202104160817.a5foa0xa-...@intel.com

Warning in current branch:

arch/alpha/include/asm/string.h:22:16: warning: '__builtin_memcpy' offset [12, 
16] from the object at 'com' is out of the bounds of referenced subobject 
'config' with type 'unsigned char' at offset 10 [-Warray-bounds]
arch/alpha/include/asm/string.h:22:16: warning: '__builtin_memcpy' offset [17, 
24] from the object at 'alloc' is out of the bounds of referenced subobject 
'key' with type 'struct bkey' at offset 0 [-Warray-bounds]
arch/alpha/include/asm/string.h:22:16: warning: '__builtin_memcpy' offset [3, 
7] from the object at 'cmd' is out of the bounds of referenced subobject 
'feature' with type 'unsigned char' at offset 1 [-Warray-bounds]

possible Warning in current branch:

arch/x86/include/asm/string_32.h:182:25: warning: '__builtin_memcpy' offset 
[12, 16] from the object at 'com' is out of the bounds of referenced subobject 
'config' with type 'unsigned char' at offset 10 [-Warray-bounds]
arch/x86/include/asm/string_32.h:182:25: warning: '__builtin_memcpy' offset 
[17, 38] from the object at 'txdesc' is out of the bounds of referenced 
subobject 'frame_control' with type 'short unsigned int' at offset 14 
[-Warray-bounds]
include/linux/bitmap.h:249:2: warning: 'memcpy' offset [17, 24] from the object 
at 'settings' is out of the bounds of referenced subobject 'advertising' with 
type 'long long unsigned int' at offset 8 [-Warray-bounds]
include/linux/bitmap.h:249:2: warning: 'memcpy' offset [9, 16] from the object 
at 'settings' is out of the bounds of referenced subobject 'supported' with 
type 'long long unsigned int' at offset 0 [-Warray-bounds]
include/linux/unaligned/le_byteshift.h:14:17: warning: array subscript 1 is 
outside array bounds of 'unsigned char[1]' [-Warray-bounds]
include/linux/unaligned/le_byteshift.h:14:29: warning: intermediate array 
offset 2 is outside array bounds of 'unsigned char' [-Warray-bounds]

Warning ids grouped by kconfigs:

gcc_recent_errors
|-- alpha-allmodconfig
|   |-- 
arch-alpha-include-asm-string.h:warning:__builtin_memcpy-offset-from-the-object-at-alloc-is-out-of-the-bounds-of-referenced-subobject-key-with-type-struct-bkey-at-offset
|   |-- 
arch-alpha-include-asm-string.h:warning:__builtin_memcpy-offset-from-the-object-at-cmd-is-out-of-the-bounds-of-referenced-subobject-feature-with-type-unsigned-char-at-offset
|   `-- 
arch-alpha-include-asm-string.h:warning:__builtin_memcpy-offset-from-the-object-at-com-is-out-of-the-bounds-of-referenced-subobject-config-with-type-unsigned-char-at-offset
|-- alpha-allyesconfig
|   |-- 
arch-alpha-include-asm-string.h:warning:__builtin_memcpy-offset-from-the-object-at-alloc-is-out-of-the-bounds-of-referenced-subobject-key-with-type-struct-bkey-at-offset
|   |-- 
arch-alpha-include-asm-string.h:warning:__builtin_memcpy-offset-from-the-object-at-cmd-is-out-of-the-bounds-of-referenced-subobject-feature-with-type-unsigned-char-at-offset
|   `-- 
arch-alpha-include-asm-string.h:warning:__builtin_memcpy-offset-from-the-object-at-com-is-out-of-the-bounds-of-referenced-subobject-config-with-type-unsigned-char-at-offset
|-- alpha-defconfig
|   `-- 
arch-alpha-include-asm-string.h:warning:__builtin_memcpy-offset-from-the-object-at-cmd-is-out-of-the-bounds-of-referenced-subobject-feature-with-type-unsigned-char-at-offset
|-- alpha-randconfig-r013-20210416
|   |-- 
arch-alpha-include-asm-string.h:warning:__builtin_memcpy-offset-from-the-object-at-alloc-is-out-of-the-bounds-of-referenced-subobject-key-with-type-struct-bkey-at-offset
|   `-- 
arch-alpha-include-asm-string.h:warning:__builtin_memcpy-offset-from-the-object-at-cmd-is-out-of-the-bounds-of-referenced-subobject-feature-with-type-unsigned-char-at-offset
|-- i386-randconfig-a002-20210416
|   `-- 
arch-x86-include-asm-string_32.h:warning:__builtin_memcpy-offset-from-the-object-at-com-is-out-of-the-bounds-of-referenced-subobject-config-with-type-unsigned-char-at-offset
|-- i386-randconfig-a003-20210416
|   `-- 
arch-x86-include-asm-string_32.h:warning:__builtin_memcpy-offset-from-the-object-at-com-is-out-of-the-bounds-of-referenced-subobject-config-with-type-unsigned-char-at-offset
|-- i386-randconfig-a016-20210416
|   `-- 
arch-x86-include-asm-string_32.h:warning:__builtin_memcpy-offset-from-the-object-at-txdesc-is-out-of-the-bounds-of-referenced-subobject-frame_control-with-type-short-unsigned-int-at-offset
|-- mips-randconfig-c003-20210416
|   |-- 
include-linux-unaligned-le_byteshift.h:warning:array-subscript-is-outside-array-bounds-of-unsigned-char
|   `-- 
include-linux-unaligned-le_byteshift.h:warning:intermediate-array-offset-is-outside-array-bounds-of-unsigned-char
`-- x86_64-randconfig-a011-20210416
|-- 
include-linux

[PATCH] cxl/mem: Fix memory device capacity probing

2021-04-16 Thread Dan Williams
The CXL Identify Memory Device output payload emits capacity in 256MB
units. The driver is treating the capacity field as bytes. This was
missed because QEMU reports bytes when it should report bytes / 256MB.

Fixes: 8adaf747c9f0 ("cxl/mem: Find device capabilities")
Cc: Ben Widawsky 
Signed-off-by: Dan Williams 
---
 drivers/cxl/mem.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 1b5078311f7d..2acc6173da36 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1419,6 +1420,7 @@ static int cxl_mem_enumerate_cmds(struct cxl_mem *cxlm)
  */
 static int cxl_mem_identify(struct cxl_mem *cxlm)
 {
+   /* See CXL 2.0 Table 175 Identify Memory Device Output Payload */
struct cxl_mbox_identify {
char fw_revision[0x10];
__le64 total_capacity;
@@ -1447,10 +1449,11 @@ static int cxl_mem_identify(struct cxl_mem *cxlm)
 * For now, only the capacity is exported in sysfs
 */
cxlm->ram_range.start = 0;
-   cxlm->ram_range.end = le64_to_cpu(id.volatile_capacity) - 1;
+   cxlm->ram_range.end = le64_to_cpu(id.volatile_capacity) * SZ_256M - 1;
 
cxlm->pmem_range.start = 0;
-   cxlm->pmem_range.end = le64_to_cpu(id.persistent_capacity) - 1;
+   cxlm->pmem_range.end =
+   le64_to_cpu(id.persistent_capacity) * SZ_256M - 1;
 
memcpy(cxlm->firmware_version, id.fw_revision, sizeof(id.fw_revision));
 



[PATCH 2/2] drivers: hv: Create a consistent pattern for checking Hyper-V hypercall status

2021-04-16 Thread Joseph Salisbury
From: Joseph Salisbury 

There is not a consistent pattern for checking Hyper-V hypercall status.
Existing code uses a number of variants.  The variants work, but a consistent
pattern would improve the readability of the code, and be more conformant
to what the Hyper-V TLFS says about hypercall status.

Implemented new helper functions hv_result(), hv_result_success(), and
hv_repcomp().  Changed the places where hv_do_hypercall() and related variants
are used to use the helper functions.

Signed-off-by: Joseph Salisbury 
---
 arch/x86/hyperv/hv_apic.c   | 16 +---
 arch/x86/hyperv/hv_init.c   |  2 +-
 arch/x86/hyperv/hv_proc.c   | 25 ++---
 arch/x86/hyperv/irqdomain.c |  6 +++---
 arch/x86/hyperv/mmu.c   |  8 
 arch/x86/hyperv/nested.c|  8 
 arch/x86/include/asm/mshyperv.h |  1 +
 drivers/hv/hv.c |  2 +-
 drivers/pci/controller/pci-hyperv.c |  2 +-
 include/asm-generic/mshyperv.h  | 25 -
 10 files changed, 54 insertions(+), 41 deletions(-)

diff --git a/arch/x86/hyperv/hv_apic.c b/arch/x86/hyperv/hv_apic.c
index 284e73661a18..ca581b24974a 100644
--- a/arch/x86/hyperv/hv_apic.c
+++ b/arch/x86/hyperv/hv_apic.c
@@ -103,7 +103,7 @@ static bool __send_ipi_mask_ex(const struct cpumask *mask, 
int vector)
struct hv_send_ipi_ex *ipi_arg;
unsigned long flags;
int nr_bank = 0;
-   int ret = 1;
+   u64 status = HV_STATUS_INVALID_PARAMETER;
 
if (!(ms_hyperv.hints & HV_X64_EX_PROCESSOR_MASKS_RECOMMENDED))
return false;
@@ -128,19 +128,19 @@ static bool __send_ipi_mask_ex(const struct cpumask 
*mask, int vector)
if (!nr_bank)
ipi_arg->vp_set.format = HV_GENERIC_SET_ALL;
 
-   ret = hv_do_rep_hypercall(HVCALL_SEND_IPI_EX, 0, nr_bank,
+   status = hv_do_rep_hypercall(HVCALL_SEND_IPI_EX, 0, nr_bank,
  ipi_arg, NULL);
 
 ipi_mask_ex_done:
local_irq_restore(flags);
-   return ((ret == 0) ? true : false);
+   return hv_result_success(status);
 }
 
 static bool __send_ipi_mask(const struct cpumask *mask, int vector)
 {
int cur_cpu, vcpu;
struct hv_send_ipi ipi_arg;
-   int ret = 1;
+   u64 status;
 
trace_hyperv_send_ipi_mask(mask, vector);
 
@@ -184,9 +184,9 @@ static bool __send_ipi_mask(const struct cpumask *mask, int 
vector)
__set_bit(vcpu, (unsigned long *)_arg.cpu_mask);
}
 
-   ret = hv_do_fast_hypercall16(HVCALL_SEND_IPI, ipi_arg.vector,
+   status = hv_do_fast_hypercall16(HVCALL_SEND_IPI, ipi_arg.vector,
 ipi_arg.cpu_mask);
-   return ((ret == 0) ? true : false);
+   return hv_result_success(status);
 
 do_ex_hypercall:
return __send_ipi_mask_ex(mask, vector);
@@ -195,6 +195,7 @@ static bool __send_ipi_mask(const struct cpumask *mask, int 
vector)
 static bool __send_ipi_one(int cpu, int vector)
 {
int vp = hv_cpu_number_to_vp_number(cpu);
+   u64 status;
 
trace_hyperv_send_ipi_one(cpu, vector);
 
@@ -207,7 +208,8 @@ static bool __send_ipi_one(int cpu, int vector)
if (vp >= 64)
return __send_ipi_mask_ex(cpumask_of(cpu), vector);
 
-   return !hv_do_fast_hypercall16(HVCALL_SEND_IPI, vector, BIT_ULL(vp));
+   status = hv_do_fast_hypercall16(HVCALL_SEND_IPI, vector, BIT_ULL(vp));
+   return hv_result_success(status);
 }
 
 static void hv_send_ipi(int cpu, int vector)
diff --git a/arch/x86/hyperv/hv_init.c b/arch/x86/hyperv/hv_init.c
index b81047dec1da..a5b73584e2cc 100644
--- a/arch/x86/hyperv/hv_init.c
+++ b/arch/x86/hyperv/hv_init.c
@@ -369,7 +369,7 @@ static void __init hv_get_partition_id(void)
local_irq_save(flags);
output_page = *this_cpu_ptr(hyperv_pcpu_output_arg);
status = hv_do_hypercall(HVCALL_GET_PARTITION_ID, NULL, output_page);
-   if ((status & HV_HYPERCALL_RESULT_MASK) != HV_STATUS_SUCCESS) {
+   if (!hv_result_success(status)) {
/* No point in proceeding if this failed */
pr_err("Failed to get partition ID: %lld\n", status);
BUG();
diff --git a/arch/x86/hyperv/hv_proc.c b/arch/x86/hyperv/hv_proc.c
index 60461e598239..f16234aef358 100644
--- a/arch/x86/hyperv/hv_proc.c
+++ b/arch/x86/hyperv/hv_proc.c
@@ -93,10 +93,9 @@ int hv_call_deposit_pages(int node, u64 partition_id, u32 
num_pages)
status = hv_do_rep_hypercall(HVCALL_DEPOSIT_MEMORY,
 page_count, 0, input_page, NULL);
local_irq_restore(flags);
-
-   if ((status & HV_HYPERCALL_RESULT_MASK) != HV_STATUS_SUCCESS) {
+   if (!hv_result_success(status)) {
pr_err("Failed to deposit pages: %lld\n", status);
-   ret = status;
+   ret = hv_result(status);
goto err_free_allocations;
}
 
@@ -122,7 +121,7 @@ 

[PATCH 1/2] x86/hyperv: Move hv_do_rep_hypercall to asm-generic

2021-04-16 Thread Joseph Salisbury
From: Joseph Salisbury 

This patch makes no functional changes.  It simply moves hv_do_rep_hypercall()
out of arch/x86/include/asm/mshyperv.h and into asm-generic/mshyperv.h

hv_do_rep_hypercall() is architecture independent, so it makes sense that it
should be in the architecture independent mshyperv.h, not in the x86-specific
mshyperv.h.

This is done in preparation for a follow up patch which creates a consistent
pattern for checking Hyper-V hypercall status.

Signed-off-by: Joseph Salisbury 
---
 arch/x86/include/asm/mshyperv.h | 32 
 include/asm-generic/mshyperv.h  | 31 +++
 2 files changed, 31 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
index ccf60a809a17..bfc98b490f07 100644
--- a/arch/x86/include/asm/mshyperv.h
+++ b/arch/x86/include/asm/mshyperv.h
@@ -189,38 +189,6 @@ static inline u64 hv_do_fast_hypercall16(u16 code, u64 
input1, u64 input2)
return hv_status;
 }
 
-/*
- * Rep hypercalls. Callers of this functions are supposed to ensure that
- * rep_count and varhead_size comply with Hyper-V hypercall definition.
- */
-static inline u64 hv_do_rep_hypercall(u16 code, u16 rep_count, u16 
varhead_size,
- void *input, void *output)
-{
-   u64 control = code;
-   u64 status;
-   u16 rep_comp;
-
-   control |= (u64)varhead_size << HV_HYPERCALL_VARHEAD_OFFSET;
-   control |= (u64)rep_count << HV_HYPERCALL_REP_COMP_OFFSET;
-
-   do {
-   status = hv_do_hypercall(control, input, output);
-   if ((status & HV_HYPERCALL_RESULT_MASK) != HV_STATUS_SUCCESS)
-   return status;
-
-   /* Bits 32-43 of status have 'Reps completed' data. */
-   rep_comp = (status & HV_HYPERCALL_REP_COMP_MASK) >>
-   HV_HYPERCALL_REP_COMP_OFFSET;
-
-   control &= ~HV_HYPERCALL_REP_START_MASK;
-   control |= (u64)rep_comp << HV_HYPERCALL_REP_START_OFFSET;
-
-   touch_nmi_watchdog();
-   } while (rep_comp < rep_count);
-
-   return status;
-}
-
 extern struct hv_vp_assist_page **hv_vp_assist_page;
 
 static inline struct hv_vp_assist_page *hv_get_vp_assist_page(unsigned int cpu)
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index dff58a3db5d5..a5246a6ea02d 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -41,6 +41,37 @@ extern struct ms_hyperv_info ms_hyperv;
 extern u64 hv_do_hypercall(u64 control, void *inputaddr, void *outputaddr);
 extern u64 hv_do_fast_hypercall8(u16 control, u64 input8);
 
+/*
+ * Rep hypercalls. Callers of this functions are supposed to ensure that
+ * rep_count and varhead_size comply with Hyper-V hypercall definition.
+ */
+static inline u64 hv_do_rep_hypercall(u16 code, u16 rep_count, u16 
varhead_size,
+ void *input, void *output)
+{
+   u64 control = code;
+   u64 status;
+   u16 rep_comp;
+
+   control |= (u64)varhead_size << HV_HYPERCALL_VARHEAD_OFFSET;
+   control |= (u64)rep_count << HV_HYPERCALL_REP_COMP_OFFSET;
+
+   do {
+   status = hv_do_hypercall(control, input, output);
+   if ((status & HV_HYPERCALL_RESULT_MASK) != HV_STATUS_SUCCESS)
+   return status;
+
+   /* Bits 32-43 of status have 'Reps completed' data. */
+   rep_comp = (status & HV_HYPERCALL_REP_COMP_MASK) >>
+   HV_HYPERCALL_REP_COMP_OFFSET;
+
+   control &= ~HV_HYPERCALL_REP_START_MASK;
+   control |= (u64)rep_comp << HV_HYPERCALL_REP_START_OFFSET;
+
+   touch_nmi_watchdog();
+   } while (rep_comp < rep_count);
+
+   return status;
+}
 
 /* Generate the guest OS identifier as described in the Hyper-V TLFS */
 static inline  __u64 generate_guest_id(__u64 d_info1, __u64 kernel_version,
-- 
2.17.1



Re: [PATCH v2 9/9] userfaultfd/shmem: modify shmem_mcopy_atomic_pte to use install_ptes

2021-04-16 Thread Hugh Dickins
On Mon, 12 Apr 2021, Axel Rasmussen wrote:

> In a previous commit, we added the mcopy_atomic_install_ptes() helper.
> This helper does the job of setting up PTEs for an existing page, to map
> it into a given VMA. It deals with both the anon and shmem cases, as
> well as the shared and private cases.
> 
> In other words, shmem_mcopy_atomic_pte() duplicates a case it already
> handles. So, expose it, and let shmem_mcopy_atomic_pte() use it
> directly, to reduce code duplication.
> 
> This requires that we refactor shmem_mcopy_atomic-pte() a bit:
> 
> Instead of doing accounting (shmem_recalc_inode() et al) part-way
> through the PTE setup, do it beforehand. This frees up
> mcopy_atomic_install_ptes() from having to care about this accounting,
> but it does mean we need to clean it up if we get a failure afterwards
> (shmem_uncharge()).
> 
> We can *almost* use shmem_charge() to do this, reducing code
> duplication. But, it does `inode->i_mapping->nrpages++`, which would
> double-count since shmem_add_to_page_cache() also does this.
> 
> Signed-off-by: Axel Rasmussen 
> ---
>  include/linux/userfaultfd_k.h |  5 
>  mm/shmem.c| 52 +++
>  mm/userfaultfd.c  | 25 -
>  3 files changed, 27 insertions(+), 55 deletions(-)

Very nice, and it gets better.

> 
> diff --git a/include/linux/userfaultfd_k.h b/include/linux/userfaultfd_k.h
> index 794d1538b8ba..3e20bfa9ef80 100644
> --- a/include/linux/userfaultfd_k.h
> +++ b/include/linux/userfaultfd_k.h
> @@ -53,6 +53,11 @@ enum mcopy_atomic_mode {
>   MCOPY_ATOMIC_CONTINUE,
>  };
>  
> +extern int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t 
> *dst_pmd,

mcopy_atomic_install_pte throughout as before.

> +  struct vm_area_struct *dst_vma,
> +  unsigned long dst_addr, struct page *page,
> +  bool newly_allocated, bool wp_copy);
> +
>  extern ssize_t mcopy_atomic(struct mm_struct *dst_mm, unsigned long 
> dst_start,
>   unsigned long src_start, unsigned long len,
>   bool *mmap_changing, __u64 mode);
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 3f48cb5e8404..9b12298405a4 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2376,10 +2376,8 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
>   struct address_space *mapping = inode->i_mapping;
>   gfp_t gfp = mapping_gfp_mask(mapping);
>   pgoff_t pgoff = linear_page_index(dst_vma, dst_addr);
> - spinlock_t *ptl;
>   void *page_kaddr;
>   struct page *page;
> - pte_t _dst_pte, *dst_pte;
>   int ret;
>   pgoff_t max_off;
>  
> @@ -2389,8 +2387,10 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  
>   if (!*pagep) {
>   page = shmem_alloc_page(gfp, info, pgoff);
> - if (!page)
> - goto out_unacct_blocks;
> + if (!page) {
> + shmem_inode_unacct_blocks(inode, 1);
> + goto out;
> + }
>  
>   if (!zeropage) {/* COPY */
>   page_kaddr = kmap_atomic(page);
> @@ -2430,59 +2430,27 @@ int shmem_mcopy_atomic_pte(struct mm_struct *dst_mm,
>   if (ret)
>   goto out_release;
>  
> - _dst_pte = mk_pte(page, dst_vma->vm_page_prot);
> - if (dst_vma->vm_flags & VM_WRITE)
> - _dst_pte = pte_mkwrite(pte_mkdirty(_dst_pte));
> - else {
> - /*
> -  * We don't set the pte dirty if the vma has no
> -  * VM_WRITE permission, so mark the page dirty or it
> -  * could be freed from under us. We could do it
> -  * unconditionally before unlock_page(), but doing it
> -  * only if VM_WRITE is not set is faster.
> -  */
> - set_page_dirty(page);
> - }
> -
> - dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, );
> -
> - ret = -EFAULT;
> - max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE);
> - if (unlikely(pgoff >= max_off))
> - goto out_release_unlock;
> -
> - ret = -EEXIST;
> - if (!pte_none(*dst_pte))
> - goto out_release_unlock;
> -
> - lru_cache_add(page);
> -
>   spin_lock_irq(>lock);
>   info->alloced++;
>   inode->i_blocks += BLOCKS_PER_PAGE;
>   shmem_recalc_inode(inode);
>   spin_unlock_irq(>lock);
>  
> - inc_mm_counter(dst_mm, mm_counter_file(page));
> - page_add_file_rmap(page, false);
> - set_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte);
> + ret = mcopy_atomic_install_ptes(dst_mm, dst_pmd, dst_vma, dst_addr,
> + page, true, false);
> + if (ret)
> + goto out_release_uncharge;
>  
> - /* No need to invalidate - it was non-present before */
> - update_mmu_cache(dst_vma, dst_addr, dst_pte);
> - 

Re: PROBLEM: DoS Attack on Fragment Cache

2021-04-16 Thread David Ahern
[ cc author of 648700f76b03b7e8149d13cc2bdb3355035258a9 ]

On 4/16/21 3:58 PM, Keyu Man wrote:
> Hi,
> 
>  
> 
>     My name is Keyu Man. We are a group of researchers from University
> of California, Riverside. Zhiyun Qian is my advisor. We found the code
> in processing IPv4/IPv6 fragments will potentially lead to DoS Attacks.
> Specifically, after the latest kernel receives an IPv4 fragment, it will
> try to fit it into a queue by calling function
> 
>  
> 
>     struct inet_frag_queue *inet_frag_find(struct fqdir *fqdir, void
> *key) in net/ipv4/inet_fragment.c.
> 
>  
> 
>     However, this function will first check if the existing fragment
> memory exceeds the fqdir->high_thresh. If it exceeds, then drop the
> fragment regardless whether it belongs to a new queue or an existing queue.
> 
> Chances are that an attacker can fill the cache with fragments that
> will never be assembled (i.e., only sends the first fragment with new
> IPIDs every time) to exceed the threshold so that all future incoming
> fragmented IPv4 traffic would be blocked and dropped. Since there is no
> GC mechanism, the victim host has to wait for 30s when the fragments are
> expired to continue receive incoming fragments normally.
> 
>     In practice, given the 4MB fragment cache, the attacker only needs
> to send 1766 fragments to exhaust the cache and DoS the victim for 30s,
> whose cost is pretty low. Besides, IPv6 would also be affected since the
> issue resides in inet part.
> 
> This issue is introduced in commit
> 648700f76b03b7e8149d13cc2bdb3355035258a9 (inet: frags: use rhashtables
> for reassembly units) which removes fqdir->low_thresh, and GC worker as
> well. We would gently request to bring GC worker back to the kernel to
> prevent the DoS attacks.
> 
> Looking forward to hear from you
> 
>  
> 
>     Thanks,
> 
> Keyu Man
> 



Re: [PATCH v8 clocksource 2/5] clocksource: Retry clock read if long delays detected

2021-04-16 Thread Paul E. McKenney
On Fri, Apr 16, 2021 at 10:45:28PM +0200, Thomas Gleixner wrote:
> On Tue, Apr 13 2021 at 21:35, Paul E. McKenney wrote:
> >  #define WATCHDOG_INTERVAL (HZ >> 1)
> >  #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
> 
> Didn't we discuss that the threshold is too big ?

Indeed we did!  How about like this, so that WATCHDOG_INTERVAL is at 500
microseconds and we tolerate up to 125 microseconds of delay during the
timer-read process?

I am firing up overnight tests.

Thanx, Paul



commit 6c52b5f3cfefd6e429efc4413fd25e3c394e959f
Author: Paul E. McKenney 
Date:   Fri Apr 16 16:19:43 2021 -0700

clocksource: Reduce WATCHDOG_THRESHOLD

Currently, WATCHDOG_THRESHOLD is set to detect a 62.5-millisecond skew in
a 500-millisecond WATCHDOG_INTERVAL.  This requires that clocks be skewed
by more than 12.5% in order to be marked unstable.  Except that a clock
that is skewed by that much is probably destroying unsuspecting software
right and left.  And given that there are now checks for false-positive
skews due to delays between reading the two clocks, and given that current
hardware clocks all increment well in excess of 1MHz, it should be possible
to greatly decrease WATCHDOG_THRESHOLD.

Therefore, decrease WATCHDOG_THRESHOLD from the current 62.5 milliseconds
down to 500 microseconds.

Suggested-by: Thomas Gleixner 
Cc: John Stultz 
Cc: Stephen Boyd 
Cc: Jonathan Corbet 
Cc: Mark Rutland 
Cc: Marc Zyngier 
Cc: Andi Kleen 
[ paulmck: Apply Rik van Riel feedback. ]
Reported-by: Chris Mason 
Signed-off-by: Paul E. McKenney 

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 8f4967e59b05..4ec19a13dcf0 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -125,8 +125,8 @@ static void __clocksource_change_rating(struct clocksource 
*cs, int rating);
  * Interval: 0.5sec Threshold: 0.0625s
  */
 #define WATCHDOG_INTERVAL (HZ >> 1)
-#define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
-#define WATCHDOG_MAX_SKEW (NSEC_PER_SEC >> 6)
+#define WATCHDOG_THRESHOLD (500 * NSEC_PER_USEC)
+#define WATCHDOG_MAX_SKEW (WATCHDOG_THRESHOLD / 4)
 
 static void clocksource_watchdog_work(struct work_struct *work)
 {


Re: [PATCH 04/15] static_call: Use global functions for the self-test

2021-04-16 Thread Thomas Gleixner
On Fri, Apr 16 2021 at 23:37, Thomas Gleixner wrote:
> On Fri, Apr 16 2021 at 13:38, Sami Tolvanen wrote:
>>  #ifdef CONFIG_STATIC_CALL_SELFTEST
>>  
>> -static int func_a(int x)
>> +int func_a(int x)
>>  {
>>  return x+1;
>>  }
>>  
>> -static int func_b(int x)
>> +int func_b(int x)
>>  {
>>  return x+2;
>>  }
>
> Did you even compile that?
>
> Global functions without a prototype are generating warnings, but we can
> ignore them just because of sekurity, right?
>
> Aside of that polluting the global namespace with func_a/b just to work
> around a tool shortcoming is beyond hillarious.
>
> Fix the tool not the perfectly correct code.

That said, I wouldn't mind a  __dont_dare_to_rename annotation to help
the compiler, but anything else is just wrong.

Thanks,

tglx


[RFC v2 4/6] fs: distinguish between user initiated freeze and kernel initiated freeze

2021-04-16 Thread Luis Chamberlain
Userspace can initiate a freeze call using ioctls. If the kernel decides
to freeze a filesystem later it must be able to distinguish if userspace
had initiated the freeze, so that it does not unfreeze it later
automatically on resume.

Likewise if the kernel is initiating a freeze on its own it should *not*
fail to freeze a filesystem if a user had already frozen it on our behalf.
This same concept applies to thawing, even if it's not possible for
userspace to beat the kernel in thawing a filesystem. This logic however
has never applied to userspace freezing and thawing; two consecutive
userspace freeze calls will result in only the first one succeeding, so
we must retain the same behaviour in userspace.

This doesn't yet implement kernel initiated filesystem freeze calls;
this will be done in subsequent patches. This change should introduce
no functional changes, it just extends the definition of a frozen
filesystem to account for future kernel initiated filesystem freezes.

Signed-off-by: Luis Chamberlain 
---
 fs/super.c | 27 ++-
 include/linux/fs.h | 17 +++--
 2 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 744b2399a272..53106d4c7f56 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -40,7 +40,7 @@
 #include 
 #include "internal.h"
 
-static int thaw_super_locked(struct super_block *sb);
+static int thaw_super_locked(struct super_block *sb, bool usercall);
 
 static LIST_HEAD(super_blocks);
 static DEFINE_SPINLOCK(sb_lock);
@@ -977,7 +977,7 @@ static void do_thaw_all_callback(struct super_block *sb)
down_write(>s_umount);
if (sb->s_root && sb->s_flags & SB_BORN) {
emergency_thaw_bdev(sb);
-   thaw_super_locked(sb);
+   thaw_super_locked(sb, false);
} else {
up_write(>s_umount);
}
@@ -1625,10 +1625,13 @@ static void sb_freeze_unlock(struct super_block *sb)
 }
 
 /* Caller takes lock and handles active count */
-static int freeze_locked_super(struct super_block *sb)
+static int freeze_locked_super(struct super_block *sb, bool usercall)
 {
int ret;
 
+   if (!usercall && sb_is_frozen(sb))
+   return 0;
+
if (!sb_is_unfrozen(sb))
return -EBUSY;
 
@@ -1673,7 +1676,10 @@ static int freeze_locked_super(struct super_block *sb)
 * For debugging purposes so that fs can warn if it sees write activity
 * when frozen is set to SB_FREEZE_COMPLETE, and for thaw_super().
 */
-   sb->s_writers.frozen = SB_FREEZE_COMPLETE;
+   if (usercall)
+   sb->s_writers.frozen = SB_FREEZE_COMPLETE;
+   else
+   sb->s_writers.frozen = SB_FREEZE_COMPLETE_AUTO;
return 0;
 }
 
@@ -1717,7 +1723,7 @@ int freeze_super(struct super_block *sb)
atomic_inc(>s_active);
 
down_write(>s_umount);
-   error = freeze_locked_super(sb);
+   error = freeze_locked_super(sb, true);
if (error) {
deactivate_locked_super(sb);
goto out;
@@ -1731,10 +1737,13 @@ int freeze_super(struct super_block *sb)
 EXPORT_SYMBOL(freeze_super);
 
 /* Caller deals with the sb->s_umount */
-static int __thaw_super_locked(struct super_block *sb)
+static int __thaw_super_locked(struct super_block *sb, bool usercall)
 {
int error;
 
+   if (!usercall && sb_is_unfrozen(sb))
+   return 0;
+
if (!sb_is_frozen(sb))
return -EINVAL;
 
@@ -1763,11 +1772,11 @@ static int __thaw_super_locked(struct super_block *sb)
 }
 
 /* Handles unlocking of sb->s_umount for you */
-static int thaw_super_locked(struct super_block *sb)
+static int thaw_super_locked(struct super_block *sb, bool usercall)
 {
int error;
 
-   error = __thaw_super_locked(sb);
+   error = __thaw_super_locked(sb, usercall);
if (error) {
up_write(>s_umount);
return error;
@@ -1787,6 +1796,6 @@ static int thaw_super_locked(struct super_block *sb)
 int thaw_super(struct super_block *sb)
 {
down_write(>s_umount);
-   return thaw_super_locked(sb);
+   return thaw_super_locked(sb, true);
 }
 EXPORT_SYMBOL(thaw_super);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3dcf2c1968e5..6980e709e94a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1406,9 +1406,10 @@ enum {
SB_FREEZE_FS = 3,   /* For internal FS use (e.g. to stop
 * internal threads if needed) */
SB_FREEZE_COMPLETE = 4, /* ->freeze_fs finished successfully */
+   SB_FREEZE_COMPLETE_AUTO = 5,/* same but initiated automatically */
 };
 
-#define SB_FREEZE_LEVELS (SB_FREEZE_COMPLETE - 1)
+#define SB_FREEZE_LEVELS (SB_FREEZE_COMPLETE_AUTO - 2)
 
 struct sb_writers {
int frozen; /* Is sb frozen? */
@@ -1897,6 +1898,18 @@ static inline bool sb_is_frozen_by_user(struct 

[RFC v2 6/6] fs: add automatic kernel fs freeze / thaw and remove kthread freezing

2021-04-16 Thread Luis Chamberlain
Add support to automatically handle freezing and thawing filesystems
during the kernel's suspend/resume cycle.

This is needed so that we properly really stop IO in flight without
races after userspace has been frozen. Without this we rely on
kthread freezing and its semantics are loose and error prone.
For instance, even though a kthread may use try_to_freeze() and end
up being frozen we have no way of being sure that everything that
has been spawned asynchronously from it (such as timers) have also
been stopped as well.

A long term advantage of also adding filesystem freeze / thawing
supporting during suspend / hibernation is that long term we may
be able to eventually drop the kernel's thread freezing completely
as it was originally added to stop disk IO in flight as we hibernate
or suspend.

This also removes all the superfluous freezer calls on all filesystems
as they are no longer needed as the VFS now performs filesystem
freezing/thaw if the filesystem has support for it. The filesystem
therefore is in charge of properly dealing with quiescing of the
filesystem through its callbacks.

This also implies that many kthread users exist which have been
adding freezer semantics onto its kthreads without need. These also
will need to be reviewed later.

This is based on prior work originally by Rafael Wysocki and later by
Jiri Kosina.

The following Coccinelle rule was used to remove the now superfluous
freezer calls:

spatch --sp-file fs-freeze-cleanup.cocci --in-place fs/

@ has_freeze_fs @
identifier super_ops;
expression freeze_op;
@@

struct super_operations super_ops = {
.freeze_fs = freeze_op,
};

@ remove_set_freezable depends on has_freeze_fs @
expression time;
statement S, S2;
expression task, current;
@@

(
-   set_freezable();
|
-   if (try_to_freeze())
-   continue;
|
-   try_to_freeze();
|
-   freezable_schedule();
+   schedule();
|
-   freezable_schedule_timeout(time);
+   schedule_timeout(time);
|
-   if (freezing(task)) { S }
|
-   if (freezing(task)) { S }
-   else
{ S2 }
|
-   freezing(current)
)

@ remove_wq_freezable @
expression WQ_E, WQ_ARG1, WQ_ARG2, WQ_ARG3, WQ_ARG4;
identifier fs_wq_fn;
@@

(
WQ_E = alloc_workqueue(WQ_ARG1,
-  WQ_ARG2 | WQ_FREEZABLE,
+  WQ_ARG2,
   ...);
|
WQ_E = alloc_workqueue(WQ_ARG1,
-  WQ_ARG2 | WQ_FREEZABLE | WQ_ARG3,
+  WQ_ARG2 | WQ_ARG3,
   ...);
|
WQ_E = alloc_workqueue(WQ_ARG1,
-  WQ_ARG2 | WQ_ARG3 | WQ_FREEZABLE,
+  WQ_ARG2 | WQ_ARG3,
   ...);
|
WQ_E = alloc_workqueue(WQ_ARG1,
-  WQ_ARG2 | WQ_ARG3 | WQ_FREEZABLE | WQ_ARG4,
+  WQ_ARG2 | WQ_ARG3 | WQ_ARG4,
   ...);
|
WQ_E =
-   WQ_ARG1 | WQ_FREEZABLE
+   WQ_ARG1
|
WQ_E =
-   WQ_ARG1 | WQ_FREEZABLE | WQ_ARG3
+   WQ_ARG1 | WQ_ARG3
|
fs_wq_fn(
-   WQ_FREEZABLE | WQ_ARG2 | WQ_ARG3
+   WQ_ARG2 | WQ_ARG3
)
|
fs_wq_fn(
-   WQ_FREEZABLE | WQ_ARG2
+   WQ_ARG2
)
|
fs_wq_fn(
-   WQ_FREEZABLE
+   0
)
)

Signed-off-by: Luis Chamberlain 
---
 fs/btrfs/disk-io.c |  4 +-
 fs/btrfs/scrub.c   |  2 +-
 fs/cifs/cifsfs.c   | 10 ++---
 fs/cifs/dfs_cache.c|  2 +-
 fs/ext4/super.c|  2 -
 fs/f2fs/gc.c   |  7 +---
 fs/f2fs/segment.c  |  6 +--
 fs/gfs2/glock.c|  6 +--
 fs/gfs2/main.c |  4 +-
 fs/jfs/jfs_logmgr.c| 11 ++
 fs/jfs/jfs_txnmgr.c| 31 +--
 fs/nilfs2/segment.c| 48 ++-
 fs/super.c | 88 ++
 fs/xfs/xfs_log.c   |  3 +-
 fs/xfs/xfs_mru_cache.c |  2 +-
 fs/xfs/xfs_pwork.c |  2 +-
 fs/xfs/xfs_super.c | 14 +++
 fs/xfs/xfs_trans_ail.c |  7 +---
 include/linux/fs.h | 13 +++
 kernel/power/process.c | 15 ++-
 20 files changed, 175 insertions(+), 102 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9fc2ec72327f..2c718f1eaae3 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2303,7 +2303,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info 
*fs_info,
struct btrfs_fs_devices *fs_devices)
 {
u32 max_active = fs_info->thread_pool_size;
-   unsigned int flags = WQ_MEM_RECLAIM | WQ_FREEZABLE | WQ_UNBOUND;
+   unsigned int flags = WQ_MEM_RECLAIM | WQ_UNBOUND;
 
fs_info->workers =
btrfs_alloc_workqueue(fs_info, "worker",
@@ -2355,7 +2355,7 @@ static int btrfs_init_workqueues(struct btrfs_fs_info 
*fs_info,

[RFC v2 5/6] fs: add iterate_supers_excl() and iterate_supers_reverse_excl()

2021-04-16 Thread Luis Chamberlain
There are use cases where we wish to traverse the superblock list
but also capture errors, and in which case we want to avoid having
our callers issue a lock themselves since we can do the locking for
the callers. Provide a iterate_supers_excl() which calls a function
with the write lock held. If an error occurs we capture it and
propagate it.

Likewise there are use cases where we wish to traverse the superblock
list but in reverse order. The new iterate_supers_reverse_excl() helper
does this but also captures any errors encountered.

Reviewed-by: Jan Kara 
Signed-off-by: Luis Chamberlain 
---
 fs/super.c | 91 ++
 include/linux/fs.h |  2 +
 2 files changed, 93 insertions(+)

diff --git a/fs/super.c b/fs/super.c
index 53106d4c7f56..2a6ef4ec2496 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -705,6 +705,97 @@ void iterate_supers(void (*f)(struct super_block *, void 
*), void *arg)
	spin_unlock(&sb_lock);
 }
 
+/**
+ * iterate_supers_excl - exclusively call func for all active superblocks
+ * @f: function to call
+ * @arg: argument to pass to it
+ *
+ * Scans the superblock list and calls given function, passing it
+ * locked superblock and given argument. Returns 0 unless an error
+ * occurred on calling the function on any superblock.
+ */
+int iterate_supers_excl(int (*f)(struct super_block *, void *), void *arg)
+{
+   struct super_block *sb, *p = NULL;
+   int error = 0;
+
+   spin_lock(&sb_lock);
+   list_for_each_entry(sb, &super_blocks, s_list) {
+   if (hlist_unhashed(&sb->s_instances))
+   continue;
+   sb->s_count++;
+   spin_unlock(&sb_lock);
+
+   down_write(&sb->s_umount);
+   if (sb->s_root && (sb->s_flags & SB_BORN)) {
+   error = f(sb, arg);
+   if (error) {
+   up_write(&sb->s_umount);
+   spin_lock(&sb_lock);
+   __put_super(sb);
+   break;
+   }
+   }
+   up_write(&sb->s_umount);
+
+   spin_lock(&sb_lock);
+   if (p)
+   __put_super(p);
+   p = sb;
+   }
+   if (p)
+   __put_super(p);
+   spin_unlock(&sb_lock);
+
+   return error;
+}
+
+/**
+ * iterate_supers_reverse_excl - exclusively calls func in reverse order
+ * @f: function to call
+ * @arg: argument to pass to it
+ *
+ * Scans the superblock list and calls given function, passing it
+ * locked superblock and given argument, in reverse order, and holding
+ * the s_umount write lock. Returns 0 unless an error occurred.
+ */
+int iterate_supers_reverse_excl(int (*f)(struct super_block *, void *),
+void *arg)
+{
+   struct super_block *sb, *p = NULL;
+   int error = 0;
+
+   spin_lock(&sb_lock);
+   list_for_each_entry_reverse(sb, &super_blocks, s_list) {
+   if (hlist_unhashed(&sb->s_instances))
+   continue;
+   sb->s_count++;
+   spin_unlock(&sb_lock);
+
+   down_write(&sb->s_umount);
+   if (sb->s_root && (sb->s_flags & SB_BORN)) {
+   error = f(sb, arg);
+   if (error) {
+   up_write(&sb->s_umount);
+   spin_lock(&sb_lock);
+   __put_super(sb);
+   break;
+   }
+   }
+   up_write(&sb->s_umount);
+
+   spin_lock(&sb_lock);
+   if (p)
+   __put_super(p);
+   p = sb;
+   }
+   if (p)
+   __put_super(p);
+   spin_unlock(&sb_lock);
+
+   return error;
+}
+
 /**
  * iterate_supers_type - call function for superblocks of given type
  * @type: fs type
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 6980e709e94a..0f4d624f0f3f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3442,6 +3442,8 @@ extern struct super_block *get_active_super(struct 
block_device *bdev);
 extern void drop_super(struct super_block *sb);
 extern void drop_super_exclusive(struct super_block *sb);
 extern void iterate_supers(void (*)(struct super_block *, void *), void *);
+extern int iterate_supers_excl(int (*f)(struct super_block *, void *), void 
*arg);
+extern int iterate_supers_reverse_excl(int (*)(struct super_block *, void *), 
void *);
 extern void iterate_supers_type(struct file_system_type *,
void (*)(struct super_block *, void *), void *);
 
-- 
2.29.2



[RFC v2 1/6] fs: provide unlocked helper for freeze_super()

2021-04-16 Thread Luis Chamberlain
freeze_super() holds a write lock, however we wish to also enable
callers which already hold the write lock. To do this provide a helper
and make freeze_super() use it. This way, all that freeze_super() does
now is lock handling and active count management.

There are no functional changes.

Suggested-by: Dave Chinner 
Reviewed-by: Jan Kara 
Signed-off-by: Luis Chamberlain 
---
 fs/super.c | 100 +
 1 file changed, 55 insertions(+), 45 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 11b7e7213fd1..e24d0849d935 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1624,59 +1624,20 @@ static void sb_freeze_unlock(struct super_block *sb)
percpu_up_write(sb->s_writers.rw_sem + level);
 }
 
-/**
- * freeze_super - lock the filesystem and force it into a consistent state
- * @sb: the super to lock
- *
- * Syncs the super to make sure the filesystem is consistent and calls the fs's
- * freeze_fs.  Subsequent calls to this without first thawing the fs will 
return
- * -EBUSY.
- *
- * During this function, sb->s_writers.frozen goes through these values:
- *
- * SB_UNFROZEN: File system is normal, all writes progress as usual.
- *
- * SB_FREEZE_WRITE: The file system is in the process of being frozen.  New
- * writes should be blocked, though page faults are still allowed. We wait for
- * all writes to complete and then proceed to the next stage.
- *
- * SB_FREEZE_PAGEFAULT: Freezing continues. Now also page faults are blocked
- * but internal fs threads can still modify the filesystem (although they
- * should not dirty new pages or inodes), writeback can run etc. After waiting
- * for all running page faults we sync the filesystem which will clean all
- * dirty pages and inodes (no new dirty pages or inodes can be created when
- * sync is running).
- *
- * SB_FREEZE_FS: The file system is frozen. Now all internal sources of fs
- * modification are blocked (e.g. XFS preallocation truncation on inode
- * reclaim). This is usually implemented by blocking new transactions for
- * filesystems that have them and need this additional guard. After all
- * internal writers are finished we call ->freeze_fs() to finish filesystem
- * freezing. Then we transition to SB_FREEZE_COMPLETE state. This state is
- * mostly auxiliary for filesystems to verify they do not modify frozen fs.
- *
- * sb->s_writers.frozen is protected by sb->s_umount.
- */
-int freeze_super(struct super_block *sb)
+/* Caller takes lock and handles active count */
+static int freeze_locked_super(struct super_block *sb)
 {
int ret;
 
-   atomic_inc(>s_active);
-   down_write(>s_umount);
-   if (sb->s_writers.frozen != SB_UNFROZEN) {
-   deactivate_locked_super(sb);
+   if (sb->s_writers.frozen != SB_UNFROZEN)
return -EBUSY;
-   }
 
-   if (!(sb->s_flags & SB_BORN)) {
-   up_write(>s_umount);
+   if (!(sb->s_flags & SB_BORN))
return 0;   /* sic - it's "nothing to do" */
-   }
 
if (sb_rdonly(sb)) {
/* Nothing to do really... */
sb->s_writers.frozen = SB_FREEZE_COMPLETE;
-   up_write(>s_umount);
return 0;
}
 
@@ -1705,7 +1666,6 @@ int freeze_super(struct super_block *sb)
sb->s_writers.frozen = SB_UNFROZEN;
sb_freeze_unlock(sb);
wake_up(>s_writers.wait_unfrozen);
-   deactivate_locked_super(sb);
return ret;
}
}
@@ -1714,9 +1674,59 @@ int freeze_super(struct super_block *sb)
 * when frozen is set to SB_FREEZE_COMPLETE, and for thaw_super().
 */
sb->s_writers.frozen = SB_FREEZE_COMPLETE;
+   return 0;
+}
+
+/**
+ * freeze_super - lock the filesystem and force it into a consistent state
+ * @sb: the super to lock
+ *
+ * Syncs the super to make sure the filesystem is consistent and calls the fs's
+ * freeze_fs.  Subsequent calls to this without first thawing the fs will 
return
+ * -EBUSY.
+ *
+ * During this function, sb->s_writers.frozen goes through these values:
+ *
+ * SB_UNFROZEN: File system is normal, all writes progress as usual.
+ *
+ * SB_FREEZE_WRITE: The file system is in the process of being frozen.  New
+ * writes should be blocked, though page faults are still allowed. We wait for
+ * all writes to complete and then proceed to the next stage.
+ *
+ * SB_FREEZE_PAGEFAULT: Freezing continues. Now also page faults are blocked
+ * but internal fs threads can still modify the filesystem (although they
+ * should not dirty new pages or inodes), writeback can run etc. After waiting
+ * for all running page faults we sync the filesystem which will clean all
+ * dirty pages and inodes (no new dirty pages or inodes can be created when
+ * sync is running).
+ *
+ * SB_FREEZE_FS: The file system is frozen. Now all internal sources of fs
+ * 

[RFC v2 3/6] fs: add a helper for thaw_super_locked() which does not unlock

2021-04-16 Thread Luis Chamberlain
The thaw_super_locked() expects the caller to hold the sb->s_umount
semaphore. It also handles the unlocking of the semaphore for you.
Allow for cases where the caller will do the unlocking of the semaphore.

Signed-off-by: Luis Chamberlain 
---
 fs/super.c | 25 +++--
 1 file changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/super.c b/fs/super.c
index 72b445a69a45..744b2399a272 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1730,14 +1730,13 @@ int freeze_super(struct super_block *sb)
 }
 EXPORT_SYMBOL(freeze_super);
 
-static int thaw_super_locked(struct super_block *sb)
+/* Caller deals with the sb->s_umount */
+static int __thaw_super_locked(struct super_block *sb)
 {
int error;
 
-   if (!sb_is_frozen(sb)) {
-   up_write(>s_umount);
+   if (!sb_is_frozen(sb))
return -EINVAL;
-   }
 
if (sb_rdonly(sb)) {
sb->s_writers.frozen = SB_UNFROZEN;
@@ -1752,7 +1751,6 @@ static int thaw_super_locked(struct super_block *sb)
printk(KERN_ERR
"VFS:Filesystem thaw failed\n");
lockdep_sb_freeze_release(sb);
-   up_write(>s_umount);
return error;
}
}
@@ -1761,10 +1759,25 @@ static int thaw_super_locked(struct super_block *sb)
sb_freeze_unlock(sb);
 out:
wake_up(>s_writers.wait_unfrozen);
-   deactivate_locked_super(sb);
return 0;
 }
 
+/* Handles unlocking of sb->s_umount for you */
+static int thaw_super_locked(struct super_block *sb)
+{
+   int error;
+
+   error = __thaw_super_locked(sb);
+   if (error) {
+   up_write(&sb->s_umount);
+   return error;
+   }
+
+   deactivate_locked_super(sb);
+
+   return 0;
+ }
+
 /**
  * thaw_super -- unlock filesystem
  * @sb: the super to thaw
-- 
2.29.2



[RFC v2 2/6] fs: add frozen sb state helpers

2021-04-16 Thread Luis Chamberlain
The question of whether or not a superblock is frozen needs to be
augmented in the future to account for differences between a user
initiated freeze and a kernel initiated freeze done automatically
on behalf of the kernel.

Provide helpers so that these can be used instead so that we don't
have to expand checks later in these same call sites as we expand
the definition of a frozen superblock.

Signed-off-by: Luis Chamberlain 
---
 fs/ext4/ext4_jbd2.c |  2 +-
 fs/super.c  |  4 ++--
 fs/xfs/xfs_trans.c  |  3 +--
 include/linux/fs.h  | 34 ++
 4 files changed, 38 insertions(+), 5 deletions(-)

diff --git a/fs/ext4/ext4_jbd2.c b/fs/ext4/ext4_jbd2.c
index be799040a415..efda50563feb 100644
--- a/fs/ext4/ext4_jbd2.c
+++ b/fs/ext4/ext4_jbd2.c
@@ -72,7 +72,7 @@ static int ext4_journal_check_start(struct super_block *sb)
 
if (sb_rdonly(sb))
return -EROFS;
-   WARN_ON(sb->s_writers.frozen == SB_FREEZE_COMPLETE);
+   WARN_ON(sb_is_frozen(sb));
journal = EXT4_SB(sb)->s_journal;
/*
 * Special case here: if the journal has aborted behind our
diff --git a/fs/super.c b/fs/super.c
index e24d0849d935..72b445a69a45 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -1629,7 +1629,7 @@ static int freeze_locked_super(struct super_block *sb)
 {
int ret;
 
-   if (sb->s_writers.frozen != SB_UNFROZEN)
+   if (!sb_is_unfrozen(sb))
return -EBUSY;
 
if (!(sb->s_flags & SB_BORN))
@@ -1734,7 +1734,7 @@ static int thaw_super_locked(struct super_block *sb)
 {
int error;
 
-   if (sb->s_writers.frozen != SB_FREEZE_COMPLETE) {
+   if (!sb_is_frozen(sb)) {
up_write(>s_umount);
return -EINVAL;
}
diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index bcc978011869..b4669dd65c9e 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -272,8 +272,7 @@ xfs_trans_alloc(
 * Zero-reservation ("empty") transactions can't modify anything, so
 * they're allowed to run while we're frozen.
 */
-   WARN_ON(resp->tr_logres > 0 &&
-   mp->m_super->s_writers.frozen == SB_FREEZE_COMPLETE);
+   WARN_ON(resp->tr_logres > 0 && sb_is_frozen(mp->m_super));
ASSERT(!(flags & XFS_TRANS_RES_FDBLKS) ||
   xfs_sb_version_haslazysbcount(>m_sb));
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c3c88fdb9b2a..3dcf2c1968e5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1885,6 +1885,40 @@ static inline bool sb_start_intwrite_trylock(struct 
super_block *sb)
return __sb_start_write_trylock(sb, SB_FREEZE_FS);
 }
 
+/**
+ * sb_is_frozen_by_user - is superblock frozen by a user call
+ * @sb: the super to check
+ *
+ * Returns true if the super freeze was initiated by userspace, for instance,
+ * an ioctl call.
+ */
+static inline bool sb_is_frozen_by_user(struct super_block *sb)
+{
+   return sb->s_writers.frozen == SB_FREEZE_COMPLETE;
+}
+
+/**
+ * sb_is_frozen - is superblock frozen
+ * @sb: the super to check
+ *
+ * Returns true if the super is frozen.
+ */
+static inline bool sb_is_frozen(struct super_block *sb)
+{
+   return sb_is_frozen_by_user(sb);
+}
+
+/**
+ * sb_is_unfrozen - is superblock unfrozen
+ * @sb: the super to check
+ *
+ * Returns true if the super is unfrozen.
+ */
+static inline bool sb_is_unfrozen(struct super_block *sb)
+{
+   return sb->s_writers.frozen == SB_UNFROZEN;
+}
+
 bool inode_owner_or_capable(struct user_namespace *mnt_userns,
const struct inode *inode);
 
-- 
2.29.2



[RFC v2 0/6] vfs: provide automatic kernel freeze / resume

2021-04-16 Thread Luis Chamberlain
This picks up where I left off in 2017, I was inclined to respin this
up again due to a new issue Lukas reported which is easy to reproduce.
I posted a stop-gap patch for that issue [0] and a test case, however,
*this* is the work we want upstream to fix these sorts of issues.

As discussed long ago though at LSFMM, we have much work to do. However,
we can take small strides to get us closer. This is trying to take one
step. It addresses most of the comments and feedback from the last
series I posted, however it doesn't address the biggest two issues:

 o provide clean and clear semantics between userspace ioctls /
   automatic fs freeze, and freeze bdev. This also involves moving the
   counter stuff from bdev to the superblock. This is pending work.
 o The loopback hack which breaks the reverse ordering isn't addressed,
   perhaps just flagging it suffices for now?
 o The long term desirable DAG *is* plausable and I have an initial
   kernel graph implementation which I could use, but that may take
   longer to merge.

What this series *does* address based on the last series is:

  o Rebased onto linux-next tag next-20210415
  o Fixed RCU complaints. The issue was that I was adding new fs levels, and
this also undesirably increased the number of rw semaphores, but we were
just using the new levels to distinguish *who* was triggering the suspend,
it was either inside the kernel and automatic, or triggered by userspace.
  o thaw_super_locked() was added but that unlocks the sb sb->s_umount,
our exclusive reverse iterate supers however will want to hold that
semaphore, so we provide a helper which lets the caller do the unlocking
for you, and make thaw_super_locked() a user of that.
  o WQ_FREEZABLE is now dealt with
  o I have folded the fs freezer removal stuff into the patch which adds
the automatic fs freezer / thaw work from the kernel as otherwise separating
this would create intermediate steps which would produce kernels which
can stall on suspend / resume.

[0] https://lkml.kernel.org/r/20210416235850.23690-1-mcg...@kernel.org
[1] https://lkml.kernel.org/r/20171129232356.28296-1-mcg...@kernel.org  


Luis Chamberlain (6):
  fs: provide unlocked helper for freeze_super()
  fs: add frozen sb state helpers
  fs: add a helper for thaw_super_locked() which does not unlock
  fs: distinguish between user initiated freeze and kernel initiated
freeze
  fs: add iterate_supers_excl() and iterate_supers_reverse_excl()
  fs: add automatic kernel fs freeze / thaw and remove kthread freezing

 fs/btrfs/disk-io.c |   4 +-
 fs/btrfs/scrub.c   |   2 +-
 fs/cifs/cifsfs.c   |  10 +-
 fs/cifs/dfs_cache.c|   2 +-
 fs/ext4/ext4_jbd2.c|   2 +-
 fs/ext4/super.c|   2 -
 fs/f2fs/gc.c   |   7 +-
 fs/f2fs/segment.c  |   6 +-
 fs/gfs2/glock.c|   6 +-
 fs/gfs2/main.c |   4 +-
 fs/jfs/jfs_logmgr.c|  11 +-
 fs/jfs/jfs_txnmgr.c|  31 ++--
 fs/nilfs2/segment.c|  48 +++---
 fs/super.c | 321 ++---
 fs/xfs/xfs_log.c   |   3 +-
 fs/xfs/xfs_mru_cache.c |   2 +-
 fs/xfs/xfs_pwork.c |   2 +-
 fs/xfs/xfs_super.c |  14 +-
 fs/xfs/xfs_trans.c |   3 +-
 fs/xfs/xfs_trans_ail.c |   7 +-
 include/linux/fs.h |  64 +++-
 kernel/power/process.c |  15 +-
 22 files changed, 405 insertions(+), 161 deletions(-)

-- 
2.29.2



Re: [PATCH 1/5] scsi: BusLogic: Fix missing `pr_cont' use

2021-04-16 Thread Joe Perches
On Fri, 2021-04-16 at 14:41 -0600, Khalid Aziz wrote:
> On 4/15/21 8:08 PM, Joe Perches wrote:
> > And while it's a lot more code, I'd prefer a solution that looks more
> > like the other commonly used kernel logging extension mechanisms
> > where adapter is placed before the format, ... in the argument list.
> 
> Hi Joe,
> 
> I don't mind making these changes. It is quite a bit of code but
> consistency with other kernel code is useful. Would you like to finalize
> this patch, or would you prefer that I take this patch as starting point
> and finalize it?

Probably better to apply Maciej's patches first and then something
like this.

btw: the coccinelle script was

@@
expression a, b;
@@


\(blogic_announce\|blogic_info\|blogic_notice\|blogic_warn\|blogic_err\)(
-   a, b
+   b, a
, ...)




[PATCH net] net/core/dev.c: Ensure pfmemalloc skbs are correctly handled when receiving

2021-04-16 Thread Xie He
When an skb is allocated by "__netdev_alloc_skb" in "net/core/skbuff.c",
if "sk_memalloc_socks()" is true, and if there's not sufficient memory,
the skb would be allocated using emergency memory reserves. This kind of
skbs are called pfmemalloc skbs.

pfmemalloc skbs must be specially handled in "net/core/dev.c" when
receiving. They must NOT be delivered to the target protocol if
"skb_pfmemalloc_protocol(skb)" is false.

However, if, after a pfmemalloc skb is allocated and before it reaches
the code in "__netif_receive_skb", "sk_memalloc_socks()" becomes false,
then the skb will be handled by "__netif_receive_skb" as a normal skb.
This causes the skb to be delivered to the target protocol even if
"skb_pfmemalloc_protocol(skb)" is false.

This patch fixes this problem by ensuring all pfmemalloc skbs are handled
by "__netif_receive_skb" as pfmemalloc skbs.

"__netif_receive_skb_list" has the same problem as "__netif_receive_skb".
This patch also fixes it.

Fixes: b4b9e3558508 ("netvm: set PF_MEMALLOC as appropriate during SKB 
processing")
Cc: Mel Gorman 
Cc: David S. Miller 
Cc: Neil Brown 
Cc: Peter Zijlstra 
Cc: Jiri Slaby 
Cc: Mike Christie 
Cc: Eric B Munson 
Cc: Eric Dumazet 
Cc: Sebastian Andrzej Siewior 
Cc: Christoph Lameter 
Cc: Andrew Morton 
Signed-off-by: Xie He 
---
 net/core/dev.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 1f79b9aa9a3f..3e6b7879daef 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5479,7 +5479,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
 {
int ret;
 
-   if (sk_memalloc_socks() && skb_pfmemalloc(skb)) {
+   if (skb_pfmemalloc(skb)) {
unsigned int noreclaim_flag;
 
/*
@@ -5507,7 +5507,7 @@ static void __netif_receive_skb_list(struct list_head 
*head)
bool pfmemalloc = false; /* Is current sublist PF_MEMALLOC? */
 
list_for_each_entry_safe(skb, next, head, list) {
-   if ((sk_memalloc_socks() && skb_pfmemalloc(skb)) != pfmemalloc) 
{
+   if (skb_pfmemalloc(skb) != pfmemalloc) {
struct list_head sublist;
 
/* Handle the previous sublist */
-- 
2.27.0



  1   2   3   4   5   6   7   8   9   10   >