[PATCH] keys: Allow disabling read permissions for key possessor

2021-03-22 Thread Andrey Ryabinin
keyctl_read_key() has strange code which allows a possessor to read a
key's payload regardless of READ permission:

$ keyctl add user test test @u
196773443
$ keyctl print 196773443
test
$ keyctl describe 196773443
196773443: alswrv-----v------------  1000  1000 user: test
$ keyctl rdescribe 196773443
user;1000;1000;3f010000;test
$ keyctl setperm 196773443 0x3d010000
$ keyctl describe 196773443
196773443: alsw-v-----v------------  1000  1000 user: test
$ keyctl  print 196773443
test

The last 'keyctl print' should fail with -EACCES instead of succeeding.
Fix this by removing the possessor check.

Signed-off-by: Andrey Ryabinin 
---

 - This was noticed during code review. It seems like a bug to me,
 but if I'm wrong and the current behavior is correct, I think we need
 at least a better comment here.
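 For reference, the same check can be exercised from C with libkeyutils
 (a minimal sketch, assuming the libkeyutils headers are installed; build
 with -lkeyutils). With this patch applied, the final read is expected to
 fail with EACCES:

#include <stdio.h>
#include <stdlib.h>
#include <keyutils.h>

int main(void)
{
	void *payload = NULL;
	long len;
	key_serial_t key;

	key = add_key("user", "test", "test", 4, KEY_SPEC_SESSION_KEYRING);
	if (key < 0) {
		perror("add_key");
		return 1;
	}

	/* Drop READ for the possessor, keep VIEW for the user (alswrv -> alsw-v). */
	if (keyctl_setperm(key, (KEY_POS_ALL & ~KEY_POS_READ) | KEY_USR_VIEW) < 0) {
		perror("keyctl_setperm");
		return 1;
	}

	len = keyctl_read_alloc(key, &payload);
	if (len < 0)
		perror("keyctl_read_alloc");	/* expected: Permission denied */
	else
		printf("read %ld bytes despite the dropped READ permission\n", len);

	free(payload);
	return 0;
}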
   

 security/keys/keyctl.c | 15 +--
 1 file changed, 1 insertion(+), 14 deletions(-)

diff --git a/security/keys/keyctl.c b/security/keys/keyctl.c
index 96a92a645216d..2ec021c7adc12 100644
--- a/security/keys/keyctl.c
+++ b/security/keys/keyctl.c
@@ -845,22 +845,9 @@ long keyctl_read_key(key_serial_t keyid, char __user 
*buffer, size_t buflen)
 
/* see if we can read it directly */
ret = key_permission(key_ref, KEY_NEED_READ);
-   if (ret == 0)
-   goto can_read_key;
-   if (ret != -EACCES)
+   if (ret != 0)
goto key_put_out;
 
-   /* we can't; see if it's searchable from this process's keyrings
-* - we automatically take account of the fact that it may be
-*   dangling off an instantiation key
-*/
-   if (!is_key_possessed(key_ref)) {
-   ret = -EACCES;
-   goto key_put_out;
-   }
-
-   /* the key is probably readable - now try to read it */
-can_read_key:
if (!key->type->read) {
ret = -EOPNOTSUPP;
goto key_put_out;
-- 
2.26.2



[PATCH stable 5.10-5.11] iommu/amd: Fix sleeping in atomic in increase_address_space()

2021-03-09 Thread Andrey Ryabinin
commit 140456f994195b568ecd7fc2287a34eadffef3ca upstream.

increase_address_space() calls get_zeroed_page(gfp) under spin_lock with
disabled interrupts. gfp flags passed to increase_address_space() may allow
sleeping, so it comes to this:

 BUG: sleeping function called from invalid context at mm/page_alloc.c:4342
 in_atomic(): 1, irqs_disabled(): 1, pid: 21555, name: epdcbbf1qnhbsd8

 Call Trace:
  dump_stack+0x66/0x8b
  ___might_sleep+0xec/0x110
  __alloc_pages_nodemask+0x104/0x300
  get_zeroed_page+0x15/0x40
  iommu_map_page+0xdd/0x3e0
  amd_iommu_map+0x50/0x70
  iommu_map+0x106/0x220
  vfio_iommu_type1_ioctl+0x76e/0x950 [vfio_iommu_type1]
  do_vfs_ioctl+0xa3/0x6f0
  ksys_ioctl+0x66/0x70
  __x64_sys_ioctl+0x16/0x20
  do_syscall_64+0x4e/0x100
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by moving get_zeroed_page() out of spin_lock/unlock section.

Fixes: 754265bcab ("iommu/amd: Fix race in increase_address_space()")
Signed-off-by: Andrey Ryabinin 
Acked-by: Will Deacon 
Link: https://lore.kernel.org/r/20210217143004.19165-1-a...@yandex-team.com
Signed-off-by: Joerg Roedel 
Signed-off-by: Andrey Ryabinin 
---
 drivers/iommu/amd/iommu.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index f0adbc48fd179..9256f84f5ebf1 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1502,6 +1502,10 @@ static bool increase_address_space(struct 
protection_domain *domain,
bool ret = true;
u64 *pte;
 
+   pte = (void *)get_zeroed_page(gfp);
+   if (!pte)
+   return false;
+
spin_lock_irqsave(&domain->lock, flags);
 
amd_iommu_domain_get_pgtable(domain, &pgtable);
@@ -1513,10 +1517,6 @@ static bool increase_address_space(struct 
protection_domain *domain,
if (WARN_ON_ONCE(pgtable.mode == PAGE_MODE_6_LEVEL))
goto out;
 
-   pte = (void *)get_zeroed_page(gfp);
-   if (!pte)
-   goto out;
-
*pte = PM_LEVEL_PDE(pgtable.mode, iommu_virt_to_phys(pgtable.root));
 
pgtable.root  = pte;
@@ -1530,10 +1530,12 @@ static bool increase_address_space(struct 
protection_domain *domain,
 */
amd_iommu_domain_set_pgtable(domain, pte, pgtable.mode);
 
+   pte = NULL;
ret = true;
 
 out:
spin_unlock_irqrestore(&domain->lock, flags);
+   free_page((unsigned long)pte);
 
return ret;
 }
-- 
2.26.2
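The fix here (and in the other backports of the same commit below) follows a
common idiom: do the sleeping allocation before taking the spinlock, publish
the page under the lock, and free it after unlocking if it turned out to be
unneeded. A standalone sketch of the idiom with hypothetical names (not the
AMD IOMMU code itself):

#include <linux/gfp.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct my_domain {
	spinlock_t lock;
	u64 *root;
	int mode;
};

static bool grow_table(struct my_domain *dom, gfp_t gfp)
{
	unsigned long flags;
	u64 *pte;
	bool ret = false;

	/* May sleep depending on gfp; the lock is not held yet. */
	pte = (void *)get_zeroed_page(gfp);
	if (!pte)
		return false;

	spin_lock_irqsave(&dom->lock, flags);

	if (dom->mode >= 6)		/* someone else already grew it */
		goto out;

	dom->root = pte;
	dom->mode += 1;
	pte = NULL;			/* ownership transferred to the domain */
	ret = true;

out:
	spin_unlock_irqrestore(&dom->lock, flags);
	free_page((unsigned long)pte);	/* no-op when pte was consumed (NULL) */

	return ret;
}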



[PATCH stable 5.4] iommu/amd: Fix sleeping in atomic in increase_address_space()

2021-03-09 Thread Andrey Ryabinin
commit 140456f994195b568ecd7fc2287a34eadffef3ca upstream.

increase_address_space() calls get_zeroed_page(gfp) under spin_lock with
disabled interrupts. gfp flags passed to increase_address_space() may allow
sleeping, so it comes to this:

 BUG: sleeping function called from invalid context at mm/page_alloc.c:4342
 in_atomic(): 1, irqs_disabled(): 1, pid: 21555, name: epdcbbf1qnhbsd8

 Call Trace:
  dump_stack+0x66/0x8b
  ___might_sleep+0xec/0x110
  __alloc_pages_nodemask+0x104/0x300
  get_zeroed_page+0x15/0x40
  iommu_map_page+0xdd/0x3e0
  amd_iommu_map+0x50/0x70
  iommu_map+0x106/0x220
  vfio_iommu_type1_ioctl+0x76e/0x950 [vfio_iommu_type1]
  do_vfs_ioctl+0xa3/0x6f0
  ksys_ioctl+0x66/0x70
  __x64_sys_ioctl+0x16/0x20
  do_syscall_64+0x4e/0x100
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by moving get_zeroed_page() out of spin_lock/unlock section.

Fixes: 754265bcab ("iommu/amd: Fix race in increase_address_space()")
Signed-off-by: Andrey Ryabinin 
Acked-by: Will Deacon 
Link: https://lore.kernel.org/r/20210217143004.19165-1-a...@yandex-team.com
Signed-off-by: Joerg Roedel 
Signed-off-by: Andrey Ryabinin 
---
 drivers/iommu/amd_iommu.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 7b724f7b27a99..c392930253a30 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1469,25 +1469,27 @@ static bool increase_address_space(struct 
protection_domain *domain,
bool ret = false;
u64 *pte;
 
+   pte = (void *)get_zeroed_page(gfp);
+   if (!pte)
+   return false;
+
spin_lock_irqsave(&domain->lock, flags);
 
if (address <= PM_LEVEL_SIZE(domain->mode) ||
WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
goto out;
 
-   pte = (void *)get_zeroed_page(gfp);
-   if (!pte)
-   goto out;
-
*pte = PM_LEVEL_PDE(domain->mode,
iommu_virt_to_phys(domain->pt_root));
domain->pt_root  = pte;
domain->mode+= 1;
 
+   pte = NULL;
ret = true;
 
 out:
spin_unlock_irqrestore(&domain->lock, flags);
+   free_page((unsigned long)pte);
 
return ret;
 }
-- 
2.26.2



[PATCH stable 4.9] iommu/amd: Fix sleeping in atomic in increase_address_space()

2021-03-09 Thread Andrey Ryabinin
commit 140456f994195b568ecd7fc2287a34eadffef3ca upstream.

increase_address_space() calls get_zeroed_page(gfp) under spin_lock with
disabled interrupts. gfp flags passed to increase_address_space() may allow
sleeping, so it comes to this:

 BUG: sleeping function called from invalid context at mm/page_alloc.c:4342
 in_atomic(): 1, irqs_disabled(): 1, pid: 21555, name: epdcbbf1qnhbsd8

 Call Trace:
  dump_stack+0x66/0x8b
  ___might_sleep+0xec/0x110
  __alloc_pages_nodemask+0x104/0x300
  get_zeroed_page+0x15/0x40
  iommu_map_page+0xdd/0x3e0
  amd_iommu_map+0x50/0x70
  iommu_map+0x106/0x220
  vfio_iommu_type1_ioctl+0x76e/0x950 [vfio_iommu_type1]
  do_vfs_ioctl+0xa3/0x6f0
  ksys_ioctl+0x66/0x70
  __x64_sys_ioctl+0x16/0x20
  do_syscall_64+0x4e/0x100
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by moving get_zeroed_page() out of spin_lock/unlock section.

Fixes: 754265bcab ("iommu/amd: Fix race in increase_address_space()")
Signed-off-by: Andrey Ryabinin 
Acked-by: Will Deacon 
Link: https://lore.kernel.org/r/20210217143004.19165-1-a...@yandex-team.com
Signed-off-by: Joerg Roedel 
Signed-off-by: Andrey Ryabinin 
---
 drivers/iommu/amd_iommu.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index f4a15d9f2bbb2..8377bd388d673 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1331,24 +1331,26 @@ static void increase_address_space(struct 
protection_domain *domain,
unsigned long flags;
u64 *pte;
 
+   pte = (void *)get_zeroed_page(gfp);
+   if (!pte)
+   goto out;
+
spin_lock_irqsave(&domain->lock, flags);
 
if (WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
/* address space already 64 bit large */
goto out;
 
-   pte = (void *)get_zeroed_page(gfp);
-   if (!pte)
-   goto out;
-
*pte = PM_LEVEL_PDE(domain->mode,
virt_to_phys(domain->pt_root));
domain->pt_root  = pte;
domain->mode+= 1;
domain->updated  = true;
+   pte  = NULL;
 
 out:
spin_unlock_irqrestore(&domain->lock, flags);
+   free_page((unsigned long)pte);
 
return;
 }
-- 
2.26.2



[PATCH stable 4.14-4.19] iommu/amd: Fix sleeping in atomic in increase_address_space()

2021-03-09 Thread Andrey Ryabinin
commit 140456f994195b568ecd7fc2287a34eadffef3ca upstream.

increase_address_space() calls get_zeroed_page(gfp) under spin_lock with
disabled interrupts. gfp flags passed to increase_address_space() may allow
sleeping, so it comes to this:

 BUG: sleeping function called from invalid context at mm/page_alloc.c:4342
 in_atomic(): 1, irqs_disabled(): 1, pid: 21555, name: epdcbbf1qnhbsd8

 Call Trace:
  dump_stack+0x66/0x8b
  ___might_sleep+0xec/0x110
  __alloc_pages_nodemask+0x104/0x300
  get_zeroed_page+0x15/0x40
  iommu_map_page+0xdd/0x3e0
  amd_iommu_map+0x50/0x70
  iommu_map+0x106/0x220
  vfio_iommu_type1_ioctl+0x76e/0x950 [vfio_iommu_type1]
  do_vfs_ioctl+0xa3/0x6f0
  ksys_ioctl+0x66/0x70
  __x64_sys_ioctl+0x16/0x20
  do_syscall_64+0x4e/0x100
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by moving get_zeroed_page() out of spin_lock/unlock section.

Fixes: 754265bcab ("iommu/amd: Fix race in increase_address_space()")
Signed-off-by: Andrey Ryabinin 
Acked-by: Will Deacon 
Link: https://lore.kernel.org/r/20210217143004.19165-1-a...@yandex-team.com
Signed-off-by: Joerg Roedel 
Signed-off-by: Andrey Ryabinin 
---
 drivers/iommu/amd_iommu.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index 494caaa265af0..8195ff219b48c 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -1347,24 +1347,26 @@ static void increase_address_space(struct 
protection_domain *domain,
unsigned long flags;
u64 *pte;
 
+   pte = (void *)get_zeroed_page(gfp);
+   if (!pte)
+   return;
+
spin_lock_irqsave(&domain->lock, flags);
 
if (WARN_ON_ONCE(domain->mode == PAGE_MODE_6_LEVEL))
/* address space already 64 bit large */
goto out;
 
-   pte = (void *)get_zeroed_page(gfp);
-   if (!pte)
-   goto out;
-
*pte = PM_LEVEL_PDE(domain->mode,
iommu_virt_to_phys(domain->pt_root));
domain->pt_root  = pte;
domain->mode+= 1;
domain->updated  = true;
+   pte  = NULL;
 
 out:
spin_unlock_irqrestore(&domain->lock, flags);
+   free_page((unsigned long)pte);
 
return;
 }
-- 
2.26.2



Re: [PATCH] iommu/amd: Fix sleeping in atomic in increase_address_space()

2021-03-05 Thread Andrey Ryabinin



On 3/4/21 3:19 PM, Joerg Roedel wrote:
> On Wed, Feb 17, 2021 at 06:10:02PM +, Will Deacon wrote:
>>>  drivers/iommu/amd/iommu.c | 10 ++
>>>  1 file changed, 6 insertions(+), 4 deletions(-)
>>
>> Acked-by: Will Deacon 
> 
> Applied for v5.12, thanks.
> 
> There were some conflicts which I resolved, can you please check the
> result, Andrey? The updated patch is attached.
> 

Thanks, looks good to me.


[PATCH] iommu/amd: Fix sleeping in atomic in increase_address_space()

2021-02-17 Thread Andrey Ryabinin
increase_address_space() calls get_zeroed_page(gfp) under spin_lock with
disabled interrupts. gfp flags passed to increase_address_space() may allow
sleeping, so it comes to this:

 BUG: sleeping function called from invalid context at mm/page_alloc.c:4342
 in_atomic(): 1, irqs_disabled(): 1, pid: 21555, name: epdcbbf1qnhbsd8

 Call Trace:
  dump_stack+0x66/0x8b
  ___might_sleep+0xec/0x110
  __alloc_pages_nodemask+0x104/0x300
  get_zeroed_page+0x15/0x40
  iommu_map_page+0xdd/0x3e0
  amd_iommu_map+0x50/0x70
  iommu_map+0x106/0x220
  vfio_iommu_type1_ioctl+0x76e/0x950 [vfio_iommu_type1]
  do_vfs_ioctl+0xa3/0x6f0
  ksys_ioctl+0x66/0x70
  __x64_sys_ioctl+0x16/0x20
  do_syscall_64+0x4e/0x100
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix this by moving get_zeroed_page() out of spin_lock/unlock section.

Fixes: 754265bcab ("iommu/amd: Fix race in increase_address_space()")
Signed-off-by: Andrey Ryabinin 
Cc: 
---
 drivers/iommu/amd/iommu.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index f0adbc48fd17..9256f84f5ebf 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -1502,6 +1502,10 @@ static bool increase_address_space(struct 
protection_domain *domain,
bool ret = true;
u64 *pte;
 
+   pte = (void *)get_zeroed_page(gfp);
+   if (!pte)
+   return false;
+
spin_lock_irqsave(&domain->lock, flags);
 
amd_iommu_domain_get_pgtable(domain, &pgtable);
@@ -1513,10 +1517,6 @@ static bool increase_address_space(struct 
protection_domain *domain,
if (WARN_ON_ONCE(pgtable.mode == PAGE_MODE_6_LEVEL))
goto out;
 
-   pte = (void *)get_zeroed_page(gfp);
-   if (!pte)
-   goto out;
-
*pte = PM_LEVEL_PDE(pgtable.mode, iommu_virt_to_phys(pgtable.root));
 
pgtable.root  = pte;
@@ -1530,10 +1530,12 @@ static bool increase_address_space(struct 
protection_domain *domain,
 */
amd_iommu_domain_set_pgtable(domain, pte, pgtable.mode);
 
+   pte = NULL;
ret = true;
 
 out:
spin_unlock_irqrestore(&domain->lock, flags);
+   free_page((unsigned long)pte);
 
return ret;
 }
-- 
2.26.2



[PATCH 4/4] sched/cpuacct: Make user/system times in cpuacct.stat more precise

2021-02-17 Thread Andrey Ryabinin
cpuacct.stat shows user time based on raw, tick-based counters of random
precision. Use cputime_adjust() to scale these values against the
total runtime accounted by the scheduler, like we already do
for the user/system times in /proc/<pid>/stat.

Signed-off-by: Andrey Ryabinin 
---
 kernel/sched/cpuacct.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 7eff79faab0d..36347f2810c0 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -266,25 +266,30 @@ static int cpuacct_all_seq_show(struct seq_file *m, void 
*V)
 static int cpuacct_stats_show(struct seq_file *sf, void *v)
 {
struct cpuacct *ca = css_ca(seq_css(sf));
-   s64 val[CPUACCT_STAT_NSTATS];
+   struct task_cputime cputime;
+   u64 val[CPUACCT_STAT_NSTATS];
int cpu;
int stat;
 
-   memset(val, 0, sizeof(val));
+   memset(&cputime, 0, sizeof(cputime));
for_each_possible_cpu(cpu) {
u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
 
-   val[CPUACCT_STAT_USER] += cpustat[CPUTIME_USER];
-   val[CPUACCT_STAT_USER] += cpustat[CPUTIME_NICE];
-   val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SYSTEM];
-   val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_IRQ];
-   val[CPUACCT_STAT_SYSTEM] += cpustat[CPUTIME_SOFTIRQ];
+   cputime.utime += cpustat[CPUTIME_USER];
+   cputime.utime += cpustat[CPUTIME_NICE];
+   cputime.stime += cpustat[CPUTIME_SYSTEM];
+   cputime.stime += cpustat[CPUTIME_IRQ];
+   cputime.stime += cpustat[CPUTIME_SOFTIRQ];
+
+   cputime.sum_exec_runtime += this_cpu_read(*ca->cpuusage);
}
 
+   cputime_adjust(&cputime, &seq_css(sf)->cgroup->prev_cputime,
+   &val[CPUACCT_STAT_USER], &val[CPUACCT_STAT_SYSTEM]);
+
for (stat = 0; stat < CPUACCT_STAT_NSTATS; stat++) {
-   seq_printf(sf, "%s %lld\n",
-  cpuacct_stat_desc[stat],
-  (long long)nsec_to_clock_t(val[stat]));
+   seq_printf(sf, "%s %llu\n", cpuacct_stat_desc[stat],
+   nsec_to_clock_t(val[stat]));
}
 
return 0;
-- 
2.26.2
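For context, the scaling that cputime_adjust() performs can be illustrated
with a small standalone sketch (simplified; the kernel implementation
additionally keeps the reported values monotonic via prev_cputime):

#include <stdint.h>
#include <stdio.h>

/* Tick-based utime/stime only provide a ratio; the scheduler's precise
 * sum_exec_runtime provides the total that gets distributed. */
static void scale_cputime(uint64_t utime_raw, uint64_t stime_raw,
			  uint64_t sum_exec_runtime,
			  uint64_t *utime, uint64_t *stime)
{
	uint64_t total = utime_raw + stime_raw;

	if (!total) {
		*utime = 0;
		*stime = sum_exec_runtime;
		return;
	}
	*stime = (uint64_t)((__uint128_t)stime_raw * sum_exec_runtime / total);
	*utime = sum_exec_runtime - *stime;
}

int main(void)
{
	uint64_t u, s;

	/* 3 user ticks vs 1 system tick, 40ms of precisely accounted runtime */
	scale_cputime(3, 1, 40000000ULL, &u, &s);
	printf("utime=%llu ns stime=%llu ns\n",
	       (unsigned long long)u, (unsigned long long)s);
	return 0;
}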



[PATCH 1/4] cputime,cpuacct: Include guest time in user time in cpuacct.stat

2021-02-17 Thread Andrey Ryabinin
cpuacct.stat in non-root cgroups shows user time without guest time
included in it. This doesn't match the user time shown in the root
cpuacct.stat and in /proc/<pid>/stat.

Make account_guest_time() add user time to the cgroup's cpustat to
fix this.

Fixes: ef12fefabf94 ("cpuacct: add per-cgroup utime/stime statistics")
Signed-off-by: Andrey Ryabinin 
Cc: 
---
 kernel/sched/cputime.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 5f611658eeab..95a9c5603d29 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -139,8 +139,6 @@ void account_user_time(struct task_struct *p, u64 cputime)
  */
 void account_guest_time(struct task_struct *p, u64 cputime)
 {
-   u64 *cpustat = kcpustat_this_cpu->cpustat;
-
/* Add guest time to process. */
p->utime += cputime;
account_group_user_time(p, cputime);
@@ -148,11 +146,11 @@ void account_guest_time(struct task_struct *p, u64 
cputime)
 
/* Add guest time to cpustat. */
if (task_nice(p) > 0) {
-   cpustat[CPUTIME_NICE] += cputime;
-   cpustat[CPUTIME_GUEST_NICE] += cputime;
+   task_group_account_field(p, CPUTIME_NICE, cputime);
+   task_group_account_field(p, CPUTIME_GUEST_NICE, cputime);
} else {
-   cpustat[CPUTIME_USER] += cputime;
-   cpustat[CPUTIME_GUEST] += cputime;
+   task_group_account_field(p, CPUTIME_USER, cputime);
+   task_group_account_field(p, CPUTIME_GUEST, cputime);
}
 }
 
-- 
2.26.2



[PATCH 2/4] cgroup: Fix 'usage_usec' time in root's cpu.stat

2021-02-17 Thread Andrey Ryabinin
The global CPUTIME_USER counter already includes CPUTIME_GUEST, and
CPUTIME_NICE already includes CPUTIME_GUEST_NICE.

Remove the additions of CPUTIME_GUEST[_NICE] to the total ->sum_exec_runtime
so that guest time is not accounted twice: e.g. a vCPU thread that spends
10s running guest code currently contributes 10s via CPUTIME_USER plus
another 10s via CPUTIME_GUEST to usage_usec.

Fixes: 936f2a70f207 ("cgroup: add cpu.stat file to root cgroup")
Signed-off-by: Andrey Ryabinin 
Cc: 
---
 kernel/cgroup/rstat.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index d51175cedfca..89ca9b61aa0d 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -421,8 +421,6 @@ static void root_cgroup_cputime(struct task_cputime 
*cputime)
cputime->sum_exec_runtime += user;
cputime->sum_exec_runtime += sys;
cputime->sum_exec_runtime += cpustat[CPUTIME_STEAL];
-   cputime->sum_exec_runtime += cpustat[CPUTIME_GUEST];
-   cputime->sum_exec_runtime += cpustat[CPUTIME_GUEST_NICE];
}
 }
 
-- 
2.26.2



[PATCH 3/4] sched/cpuacct: fix user/system in shown cpuacct.usage*

2021-02-17 Thread Andrey Ryabinin
cpuacct has 2 different ways of accounting and showing user
and system times.

The first one uses cpuacct_account_field() to account times and the
cpuacct.stat file to expose them. And this one seems to work ok.

The second one uses the cpuacct_charge() function for accounting and the
set of cpuacct.usage* files to show times. Despite some attempts to
fix it in the past, it still doesn't work. E.g. while running a KVM
guest, cpuacct_charge() accounts most of the guest time as
system time. This doesn't match the user times shown in
cpuacct.stat or /proc/<pid>/stat.

Use the cpustats accounted in cpuacct_account_field() as the source
of user/sys times for the cpuacct.usage* files. Make cpuacct_charge()
account only the total execution time.

Fixes: d740037fac70 ("sched/cpuacct: Split usage accounting into user_usage and 
sys_usage")
Signed-off-by: Andrey Ryabinin 
Cc: 
---
 kernel/sched/cpuacct.c | 77 +++---
 1 file changed, 34 insertions(+), 43 deletions(-)

diff --git a/kernel/sched/cpuacct.c b/kernel/sched/cpuacct.c
index 941c28cf9738..7eff79faab0d 100644
--- a/kernel/sched/cpuacct.c
+++ b/kernel/sched/cpuacct.c
@@ -29,7 +29,7 @@ struct cpuacct_usage {
 struct cpuacct {
struct cgroup_subsys_state  css;
/* cpuusage holds pointer to a u64-type object on every CPU */
-   struct cpuacct_usage __percpu   *cpuusage;
+   u64 __percpu*cpuusage;
struct kernel_cpustat __percpu  *cpustat;
 };
 
@@ -49,7 +49,7 @@ static inline struct cpuacct *parent_ca(struct cpuacct *ca)
return css_ca(ca->css.parent);
 }
 
-static DEFINE_PER_CPU(struct cpuacct_usage, root_cpuacct_cpuusage);
+static DEFINE_PER_CPU(u64, root_cpuacct_cpuusage);
 static struct cpuacct root_cpuacct = {
.cpustat= &kernel_cpustat,
.cpuusage   = &root_cpuacct_cpuusage,
@@ -68,7 +68,7 @@ cpuacct_css_alloc(struct cgroup_subsys_state *parent_css)
if (!ca)
goto out;
 
-   ca->cpuusage = alloc_percpu(struct cpuacct_usage);
+   ca->cpuusage = alloc_percpu(u64);
if (!ca->cpuusage)
goto out_free_ca;
 
@@ -99,7 +99,8 @@ static void cpuacct_css_free(struct cgroup_subsys_state *css)
 static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int cpu,
 enum cpuacct_stat_index index)
 {
-   struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+   u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+   u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
u64 data;
 
/*
@@ -115,14 +116,17 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int 
cpu,
raw_spin_lock_irq(&cpu_rq(cpu)->lock);
 #endif
 
-   if (index == CPUACCT_STAT_NSTATS) {
-   int i = 0;
-
-   data = 0;
-   for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
-   data += cpuusage->usages[i];
-   } else {
-   data = cpuusage->usages[index];
+   switch (index) {
+   case CPUACCT_STAT_USER:
+   data = cpustat[CPUTIME_USER] + cpustat[CPUTIME_NICE];
+   break;
+   case CPUACCT_STAT_SYSTEM:
+   data = cpustat[CPUTIME_SYSTEM] + cpustat[CPUTIME_IRQ] +
+   cpustat[CPUTIME_SOFTIRQ];
+   break;
+   case CPUACCT_STAT_NSTATS:
+   data = *cpuusage;
+   break;
}
 
 #ifndef CONFIG_64BIT
@@ -132,10 +136,14 @@ static u64 cpuacct_cpuusage_read(struct cpuacct *ca, int 
cpu,
return data;
 }
 
-static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu, u64 val)
+static void cpuacct_cpuusage_write(struct cpuacct *ca, int cpu)
 {
-   struct cpuacct_usage *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
-   int i;
+   u64 *cpuusage = per_cpu_ptr(ca->cpuusage, cpu);
+   u64 *cpustat = per_cpu_ptr(ca->cpustat, cpu)->cpustat;
+
+   /* Don't allow to reset global kernel_cpustat */
+   if (ca == _cpuacct)
+   return;
 
 #ifndef CONFIG_64BIT
/*
@@ -143,9 +151,10 @@ static void cpuacct_cpuusage_write(struct cpuacct *ca, int 
cpu, u64 val)
 */
raw_spin_lock_irq(&cpu_rq(cpu)->lock);
 #endif
-
-   for (i = 0; i < CPUACCT_STAT_NSTATS; i++)
-   cpuusage->usages[i] = val;
+   *cpuusage = 0;
+   cpustat[CPUTIME_USER] = cpustat[CPUTIME_NICE] = 0;
+   cpustat[CPUTIME_SYSTEM] = cpustat[CPUTIME_IRQ] = 0;
+   cpustat[CPUTIME_SOFTIRQ] = 0;
 
 #ifndef CONFIG_64BIT
raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
@@ -196,7 +205,7 @@ static int cpuusage_write(struct cgroup_subsys_state *css, 
struct cftype *cft,
return -EINVAL;
 
for_each_possible_cpu(cpu)
-   cpuacct_cpuusage_write(ca, cpu, 0);
+   cpuacct_cpuusage_write(ca, cpu);
 
return 0;
 }
@@ -243,25 +252,12 @@ static int cpuacct_all_seq_show(struct seq_file *m, void 
*V)
seq_puts(m

[PATCH] ubsan: remove overflow checks

2021-02-09 Thread Andrey Ryabinin
Since GCC 8.0, -fsanitize=signed-integer-overflow doesn't work with -fwrapv:
-fwrapv makes signed overflow defined, so GCC essentially disables the
UBSAN checks. On GCC < 8.0, -fwrapv has no influence on the
-fsanitize=signed-integer-overflow setting, so the checks kind of work
but generate false positives and violate uaccess rules:

lib/iov_iter.o: warning: objtool: iovec_from_user()+0x22d: call to 
__ubsan_handle_add_overflow() with UACCESS enabled

Disable the signed overflow checks to avoid these problems.
Remove the unsigned overflow checks as well.
Unsigned overflow checking appeared as a side effect of commit
 cdf8a76fda4a ("ubsan: move cc-option tests into Kconfig"),
but it never worked (the kernel doesn't boot). And unsigned overflow
is well-defined by the C standard, so checking for it is pointless.
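A minimal user-space illustration of the compiler behaviour described above
(a sketch; the exact diagnostics depend on the GCC build):

/* overflow.c */
#include <limits.h>
#include <stdio.h>

int main(void)
{
	volatile int x = INT_MAX;

	x += 1;		/* signed overflow: wraps, i.e. defined, under -fwrapv */
	printf("%d\n", x);
	return 0;
}

/*
 * gcc-7 -fwrapv -fsanitize=signed-integer-overflow overflow.c
 *	still instruments the addition and reports at run time (the
 *	false-positive class hit by the kernel), whereas gcc-8 and later
 *	honour -fwrapv and emit no check for the same command line.
 */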

Signed-off-by: Andrey Ryabinin 
Cc: Peter Zijlstra 
Cc: Josh Poimboeuf 
Cc: Randy Dunlap 
Cc: Stephen Rothwell 
Cc: Dmitry Vyukov 
Cc: Kees Cook 
Cc: Alexander Viro 
---
 lib/Kconfig.ubsan  | 17 ---
 lib/test_ubsan.c   | 49 --
 lib/ubsan.c| 68 --
 scripts/Makefile.ubsan |  2 --
 4 files changed, 136 deletions(-)

diff --git a/lib/Kconfig.ubsan b/lib/Kconfig.ubsan
index 3a0b1c930733..e5372a13511d 100644
--- a/lib/Kconfig.ubsan
+++ b/lib/Kconfig.ubsan
@@ -112,23 +112,6 @@ config UBSAN_UNREACHABLE
  This option enables -fsanitize=unreachable which checks for control
  flow reaching an expected-to-be-unreachable position.
 
-config UBSAN_SIGNED_OVERFLOW
-   bool "Perform checking for signed arithmetic overflow"
-   default UBSAN
-   depends on $(cc-option,-fsanitize=signed-integer-overflow)
-   help
- This option enables -fsanitize=signed-integer-overflow which checks
- for overflow of any arithmetic operations with signed integers.
-
-config UBSAN_UNSIGNED_OVERFLOW
-   bool "Perform checking for unsigned arithmetic overflow"
-   depends on $(cc-option,-fsanitize=unsigned-integer-overflow)
-   depends on !X86_32 # avoid excessive stack usage on x86-32/clang
-   help
- This option enables -fsanitize=unsigned-integer-overflow which checks
- for overflow of any arithmetic operations with unsigned integers. This
- currently causes x86 to fail to boot.
-
 config UBSAN_OBJECT_SIZE
bool "Perform checking for accesses beyond the end of objects"
default UBSAN
diff --git a/lib/test_ubsan.c b/lib/test_ubsan.c
index 5e5d9355ef49..7e7bbd0f3fd2 100644
--- a/lib/test_ubsan.c
+++ b/lib/test_ubsan.c
@@ -11,51 +11,6 @@ typedef void(*test_ubsan_fp)(void);
#config, IS_ENABLED(config) ? "y" : "n");   \
} while (0)
 
-static void test_ubsan_add_overflow(void)
-{
-   volatile int val = INT_MAX;
-   volatile unsigned int uval = UINT_MAX;
-
-   UBSAN_TEST(CONFIG_UBSAN_SIGNED_OVERFLOW);
-   val += 2;
-
-   UBSAN_TEST(CONFIG_UBSAN_UNSIGNED_OVERFLOW);
-   uval += 2;
-}
-
-static void test_ubsan_sub_overflow(void)
-{
-   volatile int val = INT_MIN;
-   volatile unsigned int uval = 0;
-   volatile int val2 = 2;
-
-   UBSAN_TEST(CONFIG_UBSAN_SIGNED_OVERFLOW);
-   val -= val2;
-
-   UBSAN_TEST(CONFIG_UBSAN_UNSIGNED_OVERFLOW);
-   uval -= val2;
-}
-
-static void test_ubsan_mul_overflow(void)
-{
-   volatile int val = INT_MAX / 2;
-   volatile unsigned int uval = UINT_MAX / 2;
-
-   UBSAN_TEST(CONFIG_UBSAN_SIGNED_OVERFLOW);
-   val *= 3;
-
-   UBSAN_TEST(CONFIG_UBSAN_UNSIGNED_OVERFLOW);
-   uval *= 3;
-}
-
-static void test_ubsan_negate_overflow(void)
-{
-   volatile int val = INT_MIN;
-
-   UBSAN_TEST(CONFIG_UBSAN_SIGNED_OVERFLOW);
-   val = -val;
-}
-
 static void test_ubsan_divrem_overflow(void)
 {
volatile int val = 16;
@@ -155,10 +110,6 @@ static void test_ubsan_object_size_mismatch(void)
 }
 
 static const test_ubsan_fp test_ubsan_array[] = {
-   test_ubsan_add_overflow,
-   test_ubsan_sub_overflow,
-   test_ubsan_mul_overflow,
-   test_ubsan_negate_overflow,
test_ubsan_shift_out_of_bounds,
test_ubsan_out_of_bounds,
test_ubsan_load_invalid_value,
diff --git a/lib/ubsan.c b/lib/ubsan.c
index bec38c64d6a6..26229973049d 100644
--- a/lib/ubsan.c
+++ b/lib/ubsan.c
@@ -163,74 +163,6 @@ static void ubsan_epilogue(void)
}
 }
 
-static void handle_overflow(struct overflow_data *data, void *lhs,
-   void *rhs, char op)
-{
-
-   struct type_descriptor *type = data->type;
-   char lhs_val_str[VALUE_LENGTH];
-   char rhs_val_str[VALUE_LENGTH];
-
-   if (suppress_report(&data->location))
-   return;
-
-   ubsan_prologue(&data->location, type_is_signed(type) ?
-   "signed-integer-overflow" :
-   "unsigned-integer-overflow");
-
-   val_to_st

[PATCH] MAINTAINERS: Update Andrey Ryabinin's email address

2021-02-04 Thread Andrey Ryabinin
Update my email, @virtuozzo.com will stop working shortly.

Signed-off-by: Andrey Ryabinin 
---
 .mailmap| 1 +
 MAINTAINERS | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/.mailmap b/.mailmap
index 632700cee55c..b325d3c79725 100644
--- a/.mailmap
+++ b/.mailmap
@@ -40,6 +40,7 @@ Andrew Murray  

 Andrew Murray  
 Andrew Vasquez 
 Andrey Ryabinin  
+Andrey Ryabinin  
 Andy Adamson 
 Antoine Tenart  
 Antoine Tenart  
diff --git a/MAINTAINERS b/MAINTAINERS
index 00836f6452f0..f3126e88669d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9559,7 +9559,7 @@ F:Documentation/hwmon/k8temp.rst
 F: drivers/hwmon/k8temp.c
 
 KASAN
-M: Andrey Ryabinin 
+M: Andrey Ryabinin 
 R: Alexander Potapenko 
 R: Dmitry Vyukov 
 L: kasan-...@googlegroups.com
-- 
2.26.2



Re: [PATCH] ubsan: Require GCC-8+ or Clang to use UBSAN

2021-01-14 Thread Andrey Ryabinin



On 1/14/21 1:59 PM, Peter Zijlstra wrote:
> On Mon, Jan 04, 2021 at 04:13:17PM +0100, Peter Zijlstra wrote:
>> On Tue, Dec 22, 2020 at 11:04:54PM -0600, Josh Poimboeuf wrote:
>>> GCC 7 has a known bug where UBSAN ignores '-fwrapv' and generates false
>>> signed-overflow-UB warnings.  The type mismatch between 'i' and
>>> 'nr_segs' in copy_compat_iovec_from_user() is causing such a warning,
>>> which also happens to violate uaccess rules:
>>>
>>>   lib/iov_iter.o: warning: objtool: iovec_from_user()+0x22d: call to 
>>> __ubsan_handle_add_overflow() with UACCESS enabled
>>>
>>> Fix it by making the variable types match.
>>>
>>> This is similar to a previous commit:
>>>
>>>   29da93fea3ea ("mm/uaccess: Use 'unsigned long' to placate UBSAN warnings 
>>> on older GCC versions")
>>
>> Maybe it's time we make UBSAN builds depend on GCC-8+ ?
> 
> ---
> Subject: ubsan: Require GCC-8+ or Clang to use UBSAN
> 
> Just like how we require GCC-8.2 for KASAN due to compiler bugs, require
> a sane version of GCC for UBSAN.
> 
> Specifically, before GCC-8 UBSAN doesn't respect -fwrapv and thinks
> signed arithmetic is buggered.
> 

Actually removing CONFIG_UBSAN_SIGNED_OVERFLOW would give us the same effect 
without restricting GCC versions.



Re: [PATCH v8 1/5] kasan: support backing vmalloc space with real shadow memory

2019-10-18 Thread Andrey Ryabinin



On 10/16/19 4:22 PM, Mark Rutland wrote:
> Hi Andrey,
> 
> On Wed, Oct 16, 2019 at 03:19:50PM +0300, Andrey Ryabinin wrote:
>> On 10/14/19 4:57 PM, Daniel Axtens wrote:
>>>>> + /*
>>>>> +  * Ensure poisoning is visible before the shadow is made visible
>>>>> +  * to other CPUs.
>>>>> +  */
>>>>> + smp_wmb();
>>>>
>>>> I'm not quite understand what this barrier do and why it needed.
>>>> And if it's really needed there should be a pairing barrier
>>>> on the other side which I don't see.
>>>
>>> Mark might be better able to answer this, but my understanding is that
>>> we want to make sure that we never have a situation where the writes are
>>> reordered so that PTE is installed before all the poisioning is written
>>> out. I think it follows the logic in __pte_alloc() in mm/memory.c:
>>>
>>> /*
>>>  * Ensure all pte setup (eg. pte page lock and page clearing) are
>>>  * visible before the pte is made visible to other CPUs by being
>>>  * put into page tables.
>>>  *
>>>  * The other side of the story is the pointer chasing in the page
>>>  * table walking code (when walking the page table without locking;
>>>  * ie. most of the time). Fortunately, these data accesses consist
>>>  * of a chain of data-dependent loads, meaning most CPUs (alpha
>>>  * being the notable exception) will already guarantee loads are
>>>  * seen in-order. See the alpha page table accessors for the
>>>  * smp_read_barrier_depends() barriers in page table walking code.
>>>  */
>>> smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
>>>
>>> I can clarify the comment.
>>
>> I don't see how is this relevant here.
> 
> The problem isn't quite the same, but it's a similar shape. See below
> for more details.
> 
>> barrier in __pte_alloc() for very the following case:
>>
>> CPU 0                                 CPU 1
>> __pte_alloc():                        pte_offset_kernel(pmd_t *dir, unsigned long address):
>>   pgtable_t new = pte_alloc_one(mm);    pte_t *new = (pte_t *)pmd_page_vaddr(*dir) +
>>                                                      ((address >> PAGE_SHIFT) & (PTRS_PER_PAGE - 1));
>>   smp_wmb();                            smp_read_barrier_depends();
>>   pmd_populate(mm, pmd, new);           /* do something with pte, e.g. check if (pte_none(*new)) */
>>
>>
>> It's needed to ensure that if CPU1 sees pmd_populate() it also sees 
>> initialized contents of the 'new'.
>>
>> In our case the barrier would have been needed if we had the other side like 
>> this:
>>
>> if (!pte_none(*vmalloc_shadow_pte)) {
>>  shadow_addr = (unsigned long)__va(pte_pfn(*vmalloc_shadow_pte) << 
>> PAGE_SHIFT);
>>  smp_read_barrier_depends();
>>  *shadow_addr; /* read the shadow, barrier ensures that if we see 
>> installed pte, we will see initialized shadow memory. */
>> }
>>
>>
>> Without such other side the barrier is pointless.
> 
> The barrier isn't pointless, but we are relying on a subtlety that is
> not captured in LKMM, as one of the observers involved is the TLB (and
> associated page table walkers) of the CPU.
> 
> Until the PTE written by CPU 0 has been observed by the TLB of CPU 1, it
> is not possible for CPU 1 to satisfy loads from the memory that PTE
> maps, as it doesn't yet know which memory that is.
> 
> Once the PTE written by CPU has been observed by the TLB of CPU 1, it is
> possible for CPU 1 to satisfy those loads. At this instant, CPU 1 must
> respect the smp_wmb() before the PTE was written, and hence sees zeroes
 
s/zeroes/poison values

> written before this. Note that if this were not true, we could not
> safely swap userspace memory.
> 
> There is the risk (as laid out in [1]) that CPU 1 attempts to hoist the
> loads of the shadow memory above the load of the PTE, samples a stale
> (faulting) status from the TLB, then performs the load of the PTE and
> sees a valid value. In this case (on arm64) a spurious fault could be
> taken when the access is architecturally performed.
> 
> It is possible on arm64 to use a barrier here to prevent the spurious
> fault, but this is not smp_read_barrier_depends(), as that does nothing
> for everyone but al

Re: [PATCH v8 1/5] kasan: support backing vmalloc space with real shadow memory

2019-10-16 Thread Andrey Ryabinin


On 10/14/19 4:57 PM, Daniel Axtens wrote:
> Hi Andrey,
> 
> 
>>> +   /*
>>> +* Ensure poisoning is visible before the shadow is made visible
>>> +* to other CPUs.
>>> +*/
>>> +   smp_wmb();
>>
>> I'm not quite understand what this barrier do and why it needed.
>> And if it's really needed there should be a pairing barrier
>> on the other side which I don't see.
> 
> Mark might be better able to answer this, but my understanding is that
> we want to make sure that we never have a situation where the writes are
> reordered so that PTE is installed before all the poisioning is written
> out. I think it follows the logic in __pte_alloc() in mm/memory.c:
> 
>   /*
>* Ensure all pte setup (eg. pte page lock and page clearing) are
>* visible before the pte is made visible to other CPUs by being
>* put into page tables.
>*
>* The other side of the story is the pointer chasing in the page
>* table walking code (when walking the page table without locking;
>* ie. most of the time). Fortunately, these data accesses consist
>* of a chain of data-dependent loads, meaning most CPUs (alpha
>* being the notable exception) will already guarantee loads are
>* seen in-order. See the alpha page table accessors for the
>* smp_read_barrier_depends() barriers in page table walking code.
>*/
>   smp_wmb(); /* Could be smp_wmb__xxx(before|after)_spin_lock */
> 
> I can clarify the comment.
> 

I don't see how this is relevant here.

The barrier in __pte_alloc() is needed for the following case:

CPU 0                                 CPU 1
__pte_alloc():                        pte_offset_kernel(pmd_t *dir, unsigned long address):
  pgtable_t new = pte_alloc_one(mm);    pte_t *new = (pte_t *)pmd_page_vaddr(*dir) +
                                                     ((address >> PAGE_SHIFT) & (PTRS_PER_PAGE - 1));
  smp_wmb();                            smp_read_barrier_depends();
  pmd_populate(mm, pmd, new);           /* do something with pte, e.g. check if (pte_none(*new)) */


It's needed to ensure that if CPU1 sees pmd_populate() it also sees initialized 
contents of the 'new'.

In our case the barrier would have been needed if we had the other side like 
this:

if (!pte_none(*vmalloc_shadow_pte)) {
shadow_addr = (unsigned long)__va(pte_pfn(*vmalloc_shadow_pte) << 
PAGE_SHIFT);
smp_read_barrier_depends();
*shadow_addr; /* read the shadow, barrier ensures that if we see 
installed pte, we will see initialized shadow memory. */
}


Without such other side the barrier is pointless.


Re: [PATCH v8 1/5] kasan: support backing vmalloc space with real shadow memory

2019-10-11 Thread Andrey Ryabinin



On 10/1/19 9:58 AM, Daniel Axtens wrote:
 
>  core_initcall(kasan_memhotplug_init);
>  #endif
> +
> +#ifdef CONFIG_KASAN_VMALLOC
> +static int kasan_populate_vmalloc_pte(pte_t *ptep, unsigned long addr,
> +   void *unused)
> +{
> + unsigned long page;
> + pte_t pte;
> +
> + if (likely(!pte_none(*ptep)))
> + return 0;
> +
> + page = __get_free_page(GFP_KERNEL);
> + if (!page)
> + return -ENOMEM;
> +
> + memset((void *)page, KASAN_VMALLOC_INVALID, PAGE_SIZE);
> + pte = pfn_pte(PFN_DOWN(__pa(page)), PAGE_KERNEL);
> +
> + /*
> +  * Ensure poisoning is visible before the shadow is made visible
> +  * to other CPUs.
> +  */
> + smp_wmb();

I don't quite understand what this barrier does and why it is needed.
And if it's really needed, there should be a pairing barrier
on the other side, which I don't see.

> +
> + spin_lock(&init_mm.page_table_lock);
> + if (likely(pte_none(*ptep))) {
> + set_pte_at(_mm, addr, ptep, pte);
> + page = 0;
> + }
> + spin_unlock(&init_mm.page_table_lock);
> + if (page)
> + free_page(page);
> + return 0;
> +}
> +


...

> @@ -754,6 +769,8 @@ merge_or_add_vmap_area(struct vmap_area *va,
>   }
>  
>  insert:
> + kasan_release_vmalloc(orig_start, orig_end, va->va_start, va->va_end);
> +
>   if (!merged) {
>   link_va(va, root, parent, link, head);
>   augment_tree_propagate_from(va);
> @@ -2068,6 +2085,22 @@ static struct vm_struct *__get_vm_area_node(unsigned 
> long size,
>  
>   setup_vmalloc_vm(area, va, flags, caller);
>  
> + /*
> +  * For KASAN, if we are in vmalloc space, we need to cover the shadow
> +  * area with real memory. If we come here through VM_ALLOC, this is
> +  * done by a higher level function that has access to the true size,
> +  * which might not be a full page.
> +  *
> +  * We assume module space comes via VM_ALLOC path.
> +  */
> + if (is_vmalloc_addr(area->addr) && !(area->flags & VM_ALLOC)) {
> + if (kasan_populate_vmalloc(area->size, area)) {
> + unmap_vmap_area(va);
> + kfree(area);
> + return NULL;
> + }
> + }
> +
>   return area;
>  }
>  
> @@ -2245,6 +2278,9 @@ static void __vunmap(const void *addr, int 
> deallocate_pages)
>   debug_check_no_locks_freed(area->addr, get_vm_area_size(area));
>   debug_check_no_obj_freed(area->addr, get_vm_area_size(area));
>  
> + if (area->flags & VM_KASAN)
> + kasan_poison_vmalloc(area->addr, area->size);
> +
>   vm_remove_mappings(area, deallocate_pages);
>  
>   if (deallocate_pages) {
> @@ -2497,6 +2533,9 @@ void *__vmalloc_node_range(unsigned long size, unsigned 
> long align,
>   if (!addr)
>   return NULL;
>  
> + if (kasan_populate_vmalloc(real_size, area))
> + return NULL;
> +

KASAN itself uses __vmalloc_node_range() to allocate and map shadow in the
memory online callback. So we should either skip non-vmalloc and non-module
addresses here, or teach KASAN's memory online/offline callbacks to not use
__vmalloc_node_range() (do something similar to kasan_populate_vmalloc(),
perhaps?).


Re: [RFC PATCH] lib/ubsan: Don't seralize UBSAN report

2019-09-25 Thread Andrey Ryabinin



On 9/20/19 1:08 PM, Julien Grall wrote:
> At the moment, UBSAN report will be serialized using a spin_lock(). On
> RT-systems, spinlocks are turned to rt_spin_lock and may sleep. This will
> result to the following splat if the undefined behavior is in a context
> that can sleep:
> 
> [ 6951.484876] BUG: sleeping function called from invalid context at 
> /src/linux/kernel/locking/rtmutex.c:968
> [ 6951.484882] in_atomic(): 1, irqs_disabled(): 128, pid: 3447, name: make
> [ 6951.484884] 1 lock held by make/3447:
> [ 6951.484885]  #0: 9a966332 (&mm->mmap_sem){}, at: 
> do_page_fault+0x140/0x4f8
> [ 6951.484895] irq event stamp: 6284
> [ 6951.484896] hardirqs last  enabled at (6283): [] 
> _raw_spin_unlock_irqrestore+0x90/0xa0
> [ 6951.484901] hardirqs last disabled at (6284): [] 
> _raw_spin_lock_irqsave+0x30/0x78
> [ 6951.484902] softirqs last  enabled at (2430): [] 
> fpsimd_restore_current_state+0x60/0xe8
> [ 6951.484905] softirqs last disabled at (2427): [] 
> fpsimd_restore_current_state+0x28/0xe8
> [ 6951.484907] Preemption disabled at:
> [ 6951.484907] [] rt_mutex_futex_unlock+0x4c/0xb0
> [ 6951.484911] CPU: 3 PID: 3447 Comm: make Tainted: GW 
> 5.2.14-rt7-01890-ge6e057589653 #911
> [ 6951.484913] Call trace:
> [ 6951.484913]  dump_backtrace+0x0/0x148
> [ 6951.484915]  show_stack+0x14/0x20
> [ 6951.484917]  dump_stack+0xbc/0x104
> [ 6951.484919]  ___might_sleep+0x154/0x210
> [ 6951.484921]  rt_spin_lock+0x68/0xa0
> [ 6951.484922]  ubsan_prologue+0x30/0x68
> [ 6951.484924]  handle_overflow+0x64/0xe0
> [ 6951.484926]  __ubsan_handle_add_overflow+0x10/0x18
> [ 6951.484927]  __lock_acquire+0x1c28/0x2a28
> [ 6951.484929]  lock_acquire+0xf0/0x370
> [ 6951.484931]  _raw_spin_lock_irqsave+0x58/0x78
> [ 6951.484932]  rt_mutex_futex_unlock+0x4c/0xb0
> [ 6951.484933]  rt_spin_unlock+0x28/0x70
> [ 6951.484934]  get_page_from_freelist+0x428/0x2b60
> [ 6951.484936]  __alloc_pages_nodemask+0x174/0x1708
> [ 6951.484938]  alloc_pages_vma+0x1ac/0x238
> [ 6951.484940]  __handle_mm_fault+0x4ac/0x10b0
> [ 6951.484941]  handle_mm_fault+0x1d8/0x3b0
> [ 6951.484942]  do_page_fault+0x1c8/0x4f8
> [ 6951.484943]  do_translation_fault+0xb8/0xe0
> [ 6951.484945]  do_mem_abort+0x3c/0x98
> [ 6951.484946]  el0_da+0x20/0x24
> 
> The spin_lock() will protect against multiple CPUs to output a report
> together, I guess to prevent them to be interleaved. However, they can
> still interleave with other messages (and even splat from __migth_sleep).
> 
> So the lock usefulness seems pretty limited. Rather than trying to
> accomodate RT-system by switching to a raw_spin_lock(), the lock is now
> completely dropped.
> 
> Reported-by: Andre Przywara 
> Signed-off-by: Julien Grall 
> ---

Acked-by: Andrey Ryabinin 


>  lib/ubsan.c | 64 
> ++---
>  1 file changed, 23 insertions(+), 41 deletions(-)
> 
> diff --git a/lib/ubsan.c b/lib/ubsan.c
> index e7d31735950d..39d5952c4273 100644
> --- a/lib/ubsan.c
> +++ b/lib/ubsan.c
> @@ -140,25 +140,21 @@ static void val_to_string(char *str, size_t size, 
> struct type_descriptor *type,
>   }
>  }
>  
> -static DEFINE_SPINLOCK(report_lock);
> -
> -static void ubsan_prologue(struct source_location *location,
> - unsigned long *flags)
> +static void ubsan_prologue(struct source_location *location)
>  {
>   current->in_ubsan++;
> - spin_lock_irqsave(&report_lock, *flags);
>  
>   pr_err(""
>   "\n");
>   print_source_location("UBSAN: Undefined behaviour in", location);
>  }
>  
> -static void ubsan_epilogue(unsigned long *flags)
> +static void ubsan_epilogue(void)
>  {
>   dump_stack();
>   pr_err(""
>   "\n");
> - spin_unlock_irqrestore(&report_lock, *flags);
> +
>   current->in_ubsan--;
>  }
>  
> @@ -167,14 +163,13 @@ static void handle_overflow(struct overflow_data *data, 
> void *lhs,
>  {
>  
>   struct type_descriptor *type = data->type;
> - unsigned long flags;
>   char lhs_val_str[VALUE_LENGTH];
>   char rhs_val_str[VALUE_LENGTH];
>  
>   if (suppress_report(&data->location))
>   return;
>  
> - ubsan_prologue(&data->location, &flags);
> + ubsan_prologue(>location);
>  
>   val_to_string(lhs_val_str, sizeof(lhs_val_str), type, lhs);
>   val_to_string(rhs_val_str, sizeof(rhs_val_str), type, rhs);
> @@ -186,7 +181,7 @@ static void handle_overflo

Re: [PATCH] mm, debug, kasan: save and dump freeing stack trace for kasan

2019-09-25 Thread Andrey Ryabinin



On 9/23/19 11:20 AM, Vlastimil Babka wrote:
> On 9/16/19 5:57 PM, Andrey Ryabinin wrote:
>> I'd rather keep all logic in one place, i.e. "if (!page_owner_disabled && 
>> (IS_ENABLED(CONFIG_KASAN) || debug_pagealloc_enabled())"
>> With this no changes in early_debug_pagealloc() required and 
>> CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y should also work correctly.
> 
> OK.
> 
> 8<
> 
> From 7437c43f02682fdde5680fa83e87029f7529e222 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka 
> Date: Mon, 16 Sep 2019 11:28:19 +0200
> Subject: [PATCH] mm, debug, kasan: save and dump freeing stack trace for kasan
> 
> The commit "mm, page_owner, debug_pagealloc: save and dump freeing stack 
> trace"
> enhanced page_owner to also store freeing stack trace, when debug_pagealloc is
> also enabled. KASAN would also like to do this [1] to improve error reports to
> debug e.g. UAF issues. This patch therefore introduces a helper config option
> PAGE_OWNER_FREE_STACK, which is enabled when PAGE_OWNER and either of
> DEBUG_PAGEALLOC or KASAN is enabled. Boot-time, the free stack saving is
> enabled when booting a KASAN kernel with page_owner=on, or non-KASAN kernel
> with debug_pagealloc=on and page_owner=on.
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=203967
> 
> Suggested-by: Dmitry Vyukov 
> Suggested-by: Walter Wu 
> Suggested-by: Andrey Ryabinin 
> Signed-off-by: Vlastimil Babka 
> ---

Reviewed-by: Andrey Ryabinin 


Re: [PATCH v3] mm/kasan: dump alloc and free stack for page allocator

2019-09-16 Thread Andrey Ryabinin
On 9/16/19 12:42 PM, Vlastimil Babka wrote:
> On 9/12/19 7:05 PM, Andrey Ryabinin wrote:
>>
>> Or another alternative option (and actually easier one to implement), leave 
>> PAGE_OWNER as is (no "select"s in Kconfigs)
>> Make PAGE_OWNER_FREE_STACK like this:
>>
>> +config PAGE_OWNER_FREE_STACK
>> +def_bool KASAN || DEBUG_PAGEALLOC
>> +depends on PAGE_OWNER
>> +
>>
>> So, users that want alloc/free stack will have to enable CONFIG_PAGE_OWNER=y 
>> and add page_owner=on to boot cmdline.
>>
>>
>> Basically the difference between these alternative is whether we enable 
>> page_owner by default or not. But there is always a possibility to disable 
>> it.
> 
> OK, how about this?
> 
 ...

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c5d62f1c2851..d9e44671af3f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -710,8 +710,12 @@ static int __init early_debug_pagealloc(char *buf)
>   if (kstrtobool(buf, ))
>   return -EINVAL;
>  
> - if (enable)
> + if (enable) {
>   static_branch_enable(&_debug_pagealloc_enabled);
> +#ifdef CONFIG_PAGE_OWNER
> + page_owner_free_stack_disabled = false;

I think this won't work with CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y

> +#endif
> + }
>  
>   return 0;
>  }
> diff --git a/mm/page_owner.c b/mm/page_owner.c
> index dee931184788..b589bfbc4795 100644
> --- a/mm/page_owner.c
> +++ b/mm/page_owner.c
> @@ -24,13 +24,15 @@ struct page_owner {
>   short last_migrate_reason;
>   gfp_t gfp_mask;
>   depot_stack_handle_t handle;
> -#ifdef CONFIG_DEBUG_PAGEALLOC
> +#ifdef CONFIG_PAGE_OWNER_FREE_STACK
>   depot_stack_handle_t free_handle;
>  #endif
>  };
>  
>  static bool page_owner_disabled = true;
> +bool page_owner_free_stack_disabled = true;
>  DEFINE_STATIC_KEY_FALSE(page_owner_inited);
> +static DEFINE_STATIC_KEY_FALSE(page_owner_free_stack);
>  
>  static depot_stack_handle_t dummy_handle;
>  static depot_stack_handle_t failure_handle;
> @@ -46,6 +48,9 @@ static int __init early_page_owner_param(char *buf)
>   if (strcmp(buf, "on") == 0)
>   page_owner_disabled = false;
>  
> + if (!page_owner_disabled && IS_ENABLED(CONFIG_KASAN))

I'd rather keep all the logic in one place, i.e. "if (!page_owner_disabled &&
(IS_ENABLED(CONFIG_KASAN) || debug_pagealloc_enabled()))".
With this, no changes in early_debug_pagealloc() are required, and
CONFIG_DEBUG_PAGEALLOC_ENABLE_DEFAULT=y should also work correctly.

> + page_owner_free_stack_disabled = false;
> +
>   return 0;
>  }
>  early_param("page_owner", early_page_owner_param);
 


Re: [PATCH v3] mm/kasan: dump alloc and free stack for page allocator

2019-09-12 Thread Andrey Ryabinin



On 9/12/19 5:31 PM, Vlastimil Babka wrote:
> On 9/12/19 4:08 PM, Walter Wu wrote:
>>
>>>   extern void __reset_page_owner(struct page *page, unsigned int order);
>>> diff --git a/lib/Kconfig.kasan b/lib/Kconfig.kasan
>>> index 6c9682ce0254..dc560c7562e8 100644
>>> --- a/lib/Kconfig.kasan
>>> +++ b/lib/Kconfig.kasan
>>> @@ -41,6 +41,8 @@ config KASAN_GENERIC
>>>   select SLUB_DEBUG if SLUB
>>>   select CONSTRUCTORS
>>>   select STACKDEPOT
>>> +    select PAGE_OWNER
>>> +    select PAGE_OWNER_FREE_STACK
>>>   help
>>>     Enables generic KASAN mode.
>>>     Supported in both GCC and Clang. With GCC it requires version 4.9.2
>>> @@ -63,6 +65,8 @@ config KASAN_SW_TAGS
>>>   select SLUB_DEBUG if SLUB
>>>   select CONSTRUCTORS
>>>   select STACKDEPOT
>>> +    select PAGE_OWNER
>>> +    select PAGE_OWNER_FREE_STACK
>>>   help
>>
>> What is the difference between PAGE_OWNER+PAGE_OWNER_FREE_STACK and
>> DEBUG_PAGEALLOC?
> 
> Same memory usage, but debug_pagealloc means also extra checks and 
> restricting memory access to freed pages to catch UAF.
> 
>> If you directly enable PAGE_OWNER+PAGE_OWNER_FREE_STACK
>> PAGE_OWNER_FREE_STACK,don't you think low-memory device to want to use
>> KASAN?
> 
> OK, so it should be optional? But I think it's enough to distinguish no 
> PAGE_OWNER at all, and PAGE_OWNER+PAGE_OWNER_FREE_STACK together - I don't 
> see much point in PAGE_OWNER only for this kind of debugging.
> 
> So how about this? KASAN wouldn't select PAGE_OWNER* but it would be 
> recommended in the help+docs. When PAGE_OWNER and KASAN are selected by user, 
> PAGE_OWNER_FREE_STACK gets also selected, and both will be also runtime 
> enabled without explicit page_owner=on.
> I mostly want to avoid another boot-time option for enabling 
> PAGE_OWNER_FREE_STACK.
> Would that be enough flexibility for low-memory devices vs full-fledged 
> debugging?

Originally I thought that with your patch users could still disable page_owner
via the "page_owner=off" boot param.
But now I realize that this won't work. I think it should work; we should
allow users to disable it.



Another alternative option (and actually the easier one to implement): leave
PAGE_OWNER as is (no "select"s in Kconfigs)
and make PAGE_OWNER_FREE_STACK like this:

+config PAGE_OWNER_FREE_STACK
+   def_bool KASAN || DEBUG_PAGEALLOC
+   depends on PAGE_OWNER
+

So, users that want alloc/free stack will have to enable CONFIG_PAGE_OWNER=y 
and add page_owner=on to boot cmdline.


Basically the difference between these alternatives is whether we enable
page_owner by default or not. But there is always a possibility to disable it.



Re: [PATCH v3] mm/kasan: dump alloc and free stack for page allocator

2019-09-12 Thread Andrey Ryabinin



On 9/12/19 4:53 PM, Vlastimil Babka wrote:
> On 9/11/19 5:19 PM, Qian Cai wrote:
>>
>> The new config looks redundant and confusing. It looks to me more of a 
>> document update
>> in Documentation/dev-tools/kasan.txt to educate developers to select 
>> PAGE_OWNER and
>> DEBUG_PAGEALLOC if needed.
>  
> Agreed. But if you want it fully automatic, how about something
> like this (on top of mmotm/next)? If you agree I'll add changelog
> and send properly.
> 
> 8<
> 
> From a528d14c71d7fdf5872ca8ab3bd1b5bad26670c9 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka 
> Date: Thu, 12 Sep 2019 15:51:23 +0200
> Subject: [PATCH] make KASAN enable page_owner with free stack capture
> 
> ---
>  include/linux/page_owner.h |  1 +
>  lib/Kconfig.kasan  |  4 
>  mm/Kconfig.debug   |  5 +
>  mm/page_alloc.c|  6 +-
>  mm/page_owner.c| 37 -
>  5 files changed, 39 insertions(+), 14 deletions(-)
> 

Looks ok to me. This is certainly better than a full dependency on
DEBUG_PAGEALLOC, which we don't need.

 


Re: [PATCH v2 0/2] mm/kasan: dump alloc/free stack for page allocator

2019-09-10 Thread Andrey Ryabinin



On 9/9/19 4:07 PM, Vlastimil Babka wrote:
> On 9/9/19 10:24 AM, walter-zh...@mediatek.com wrote:
>> From: Walter Wu 
>>
>> This patch is KASAN report adds the alloc/free stacks for page allocator
>> in order to help programmer to see memory corruption caused by page.
>>
>> By default, KASAN doesn't record alloc and free stack for page allocator.
>> It is difficult to fix up page use-after-free or dobule-free issue.
>>
>> Our patchsets will record the last stack of pages.
>> It is very helpful for solving the page use-after-free or double-free.
>>
>> KASAN report will show the last stack of page, it may be:
>> a) If page is in-use state, then it prints alloc stack.
>>     It is useful to fix up page out-of-bound issue.
> 
> I still disagree with duplicating most of page_owner functionality for the 
> sake of using a single stack handle for both alloc and free (while page_owner 
> + debug_pagealloc with patches in mmotm uses two handles). It reduces the 
> amount of potentially important debugging information, and I really doubt the 
> u32-per-page savings are significant, given the rest of KASAN overhead.
> 
>> BUG: KASAN: slab-out-of-bounds in kmalloc_pagealloc_oob_right+0x88/0x90
>> Write of size 1 at addr ffc0d64ea00a by task cat/115
>> ...
>> Allocation stack of page:
>>   set_page_stack.constprop.1+0x30/0xc8
>>   kasan_alloc_pages+0x18/0x38
>>   prep_new_page+0x5c/0x150
>>   get_page_from_freelist+0xb8c/0x17c8
>>   __alloc_pages_nodemask+0x1a0/0x11b0
>>   kmalloc_order+0x28/0x58
>>   kmalloc_order_trace+0x28/0xe0
>>   kmalloc_pagealloc_oob_right+0x2c/0x68
>>
>> b) If page is freed state, then it prints free stack.
>>     It is useful to fix up page use-after-free or double-free issue.
>>
>> BUG: KASAN: use-after-free in kmalloc_pagealloc_uaf+0x70/0x80
>> Write of size 1 at addr ffc0d651c000 by task cat/115
>> ...
>> Free stack of page:
>>   kasan_free_pages+0x68/0x70
>>   __free_pages_ok+0x3c0/0x1328
>>   __free_pages+0x50/0x78
>>   kfree+0x1c4/0x250
>>   kmalloc_pagealloc_uaf+0x38/0x80
>>
>> This has been discussed, please refer below link.
>> https://bugzilla.kernel.org/show_bug.cgi?id=203967
> 
> That's not a discussion, but a single comment from Dmitry, which btw contains 
> "provide alloc *and* free stacks for it" ("it" refers to page, emphasis 
> mine). It would be nice if he or other KASAN guys could clarify.
> 

For slab objects we memorize both alloc and free stacks. You never know in
advance what information will be useful to fix an issue, so it is usually
better to provide more information. I don't think we should do anything
different for pages.

Given that we already have page_owner responsible for providing alloc/free
stacks for pages, all we should do in KASAN is enable the feature by default.
Free stack saving should be decoupled from debug_pagealloc into a separate
option so that it can be enabled by KASAN and/or debug_pagealloc.

 




Re: [PATCH 1/2] riscv: Add memmove string operation.

2019-08-27 Thread Andrey Ryabinin



On 8/27/19 12:07 PM, Nick Hu wrote:
> Hi Andrey
> 
> On Thu, Aug 22, 2019 at 11:59:02PM +0800, Andrey Ryabinin wrote:
>> On 8/7/19 10:19 AM, Nick Hu wrote:
>>> There are some features which need this string operation for compilation,
>>> like KASAN. So the purpose of this porting is for the features like KASAN
>>> which cannot be compiled without it.
>>>
>>
>> Compilation error can be fixed by diff bellow (I didn't test it).
>> If you don't need memmove very early (before kasan_early_init()) than 
>> arch-specific not-instrumented memmove()
>> isn't necessary to have.
>>
>> ---
>>  mm/kasan/common.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/mm/kasan/common.c b/mm/kasan/common.c
>> index 6814d6d6a023..897f9520bab3 100644
>> --- a/mm/kasan/common.c
>> +++ b/mm/kasan/common.c
>> @@ -107,6 +107,7 @@ void *memset(void *addr, int c, size_t len)
>>  return __memset(addr, c, len);
>>  }
>>  
>> +#ifdef __HAVE_ARCH_MEMMOVE
>>  #undef memmove
>>  void *memmove(void *dest, const void *src, size_t len)
>>  {
>> @@ -115,6 +116,7 @@ void *memmove(void *dest, const void *src, size_t len)
>>  
>>  return __memmove(dest, src, len);
>>  }
>> +#endif
>>  
>>  #undef memcpy
>>  void *memcpy(void *dest, const void *src, size_t len)
>> -- 
>> 2.21.0
>>
>>
>>
> I have confirmed that the string operations are not used before 
> kasan_early_init().
> But I can't make sure whether other ARCHs would need it before 
> kasan_early_init().
> Do you have any idea to check that? Should I cc all other ARCH maintainers?
 

This doesn't affect other ARCHes in any way. If other arches have their own 
not-instrumented
memmove implementation (and they do), they will continue to be able to use it 
early.


Re: [PATCH 2/2] riscv: Add KASAN support

2019-08-22 Thread Andrey Ryabinin



On 8/14/19 10:44 AM, Nick Hu wrote:

>>
>>> diff --git a/arch/riscv/kernel/vmlinux.lds.S 
>>> b/arch/riscv/kernel/vmlinux.lds.S
>>> index 23cd1a9..9700980 100644
>>> --- a/arch/riscv/kernel/vmlinux.lds.S
>>> +++ b/arch/riscv/kernel/vmlinux.lds.S
>>> @@ -46,6 +46,7 @@ SECTIONS
>>> KPROBES_TEXT
>>> ENTRY_TEXT
>>> IRQENTRY_TEXT
>>> +   SOFTIRQENTRY_TEXT
>>
>> Hmm.  What is the relation to kasan here?  Maybe we should add this
>> separately with a good changelog?
>>
> There is a commit for it:
> 
> Author: Alexander Potapenko 
> Date:   Fri Mar 25 14:22:05 2016 -0700
> 
> arch, ftrace: for KASAN put hard/soft IRQ entries into separate sections
> 
> KASAN needs to know whether the allocation happens in an IRQ handler.
> This lets us strip everything below the IRQ entry point to reduce the
> number of unique stack traces needed to be stored.
> 
> Move the definition of __irq_entry to  so that the
> users don't need to pull in .  Also introduce the
> __softirq_entry macro which is similar to __irq_entry, but puts the
> corresponding functions to the .softirqentry.text section.
> 
> After reading the patch I understand that soft/hard IRQ entries should be
> separated for KASAN to work, but why?
> 

KASAN doesn't need soft/hard IRQ entries separated. KASAN wants to know the
entry point of the IRQ (hard or soft) to filter out the random non-IRQ part of
the stacktrace before feeding it to stack_depot_save(). See filter_irq_stacks().
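
Roughly, filter_irq_stacks() walks the saved entries and truncates everything
below the first address that falls into the hard/soft IRQ entry text sections.
A simplified sketch from memory, not the exact upstream code:

#include <asm/sections.h>

static bool in_irqentry_text(unsigned long ptr)
{
	return (ptr >= (unsigned long)&__irqentry_text_start &&
		ptr < (unsigned long)&__irqentry_text_end) ||
	       (ptr >= (unsigned long)&__softirqentry_text_start &&
		ptr < (unsigned long)&__softirqentry_text_end);
}

unsigned int filter_irq_stacks(unsigned long *entries, unsigned int nr_entries)
{
	unsigned int i;

	for (i = 0; i < nr_entries; i++) {
		if (in_irqentry_text(entries[i])) {
			/* Keep the IRQ entry point itself, drop everything below it. */
			return i + 1;
		}
	}
	return nr_entries;
}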




Re: [PATCH 1/2] riscv: Add memmove string operation.

2019-08-22 Thread Andrey Ryabinin
On 8/7/19 10:19 AM, Nick Hu wrote:
> There are some features which need this string operation for compilation,
> like KASAN. So the purpose of this porting is for the features like KASAN
> which cannot be compiled without it.
> 

The compilation error can be fixed by the diff below (I didn't test it).
If you don't need memmove very early (before kasan_early_init()), then an
arch-specific non-instrumented memmove()
isn't necessary to have.

---
 mm/kasan/common.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 6814d6d6a023..897f9520bab3 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -107,6 +107,7 @@ void *memset(void *addr, int c, size_t len)
return __memset(addr, c, len);
 }
 
+#ifdef __HAVE_ARCH_MEMMOVE
 #undef memmove
 void *memmove(void *dest, const void *src, size_t len)
 {
@@ -115,6 +116,7 @@ void *memmove(void *dest, const void *src, size_t len)
 
return __memmove(dest, src, len);
 }
+#endif
 
 #undef memcpy
 void *memcpy(void *dest, const void *src, size_t len)
-- 
2.21.0





[PATCH v5] kasan: add memory corruption identification for software tag-based mode

2019-08-21 Thread Andrey Ryabinin
From: Walter Wu 

This patch adds memory corruption identification to the bug report for
software tag-based mode; the report shows whether it is a "use-after-free"
or an "out-of-bound" error instead of an "invalid-access" error. This will make
it easier for programmers to see the memory corruption problem.

We extend the slab to store five old free pointer tags and free backtraces,
so we can check if the tagged address is in the slab record and make a
good guess whether the object is more likely "use-after-free" or "out-of-bound".
Therefore every slab memory corruption can be identified as either
"use-after-free" or "out-of-bound".

[aryabi...@virtuozzo.com: simplify & cleanup code:
  https://lkml.kernel.org/r/3318f9d7-a760-3cc8-b700-f06108ae7...@virtuozzo.com]
Signed-off-by: Walter Wu 
Signed-off-by: Andrey Ryabinin 
---

== Changes
Change since v1:
- add feature option CONFIG_KASAN_SW_TAGS_IDENTIFY.
- change QUARANTINE_FRACTION to reduce quarantine size.
- change the qlist order in order to find the newest object in quarantine
- reduce the number of calling kmalloc() from 2 to 1 time.
- remove global variable to use argument to pass it.
- correct the amount of qobject cache->size into the bytes of qlist_head.
- only use kasan_cache_shrink() to shrink memory.

Change since v2:
- remove the shrinking memory function kasan_cache_shrink()
- modify the description of the CONFIG_KASAN_SW_TAGS_IDENTIFY
- optimize the quarantine_find_object() and qobject_free()
- fix the duplicating function name 3 times in the header.
- modify the function name set_track() to kasan_set_track()

Change since v3:
- change tag-based quarantine to extend slab to identify memory corruption

Changes since v4:
 - Simplify and cleanup code.

 lib/Kconfig.kasan  |  8 
 mm/kasan/common.c  | 22 +++--
 mm/kasan/kasan.h   | 14 +-
 mm/kasan/report.c  | 44 --
 mm/kasan/tags_report.c | 24 +++
 5 files changed, 99 insertions(+), 13 deletions(-)

diff --git a/lib/Kconfig.kasan b/lib/Kconfig.kasan
index 7fa97a8b5717..6c9682ce0254 100644
--- a/lib/Kconfig.kasan
+++ b/lib/Kconfig.kasan
@@ -134,6 +134,14 @@ config KASAN_S390_4_LEVEL_PAGING
  to 3TB of RAM with KASan enabled). This options allows to force
  4-level paging instead.
 
+config KASAN_SW_TAGS_IDENTIFY
+   bool "Enable memory corruption identification"
+   depends on KASAN_SW_TAGS
+   help
+ This option enables best-effort identification of bug type
+ (use-after-free or out-of-bounds) at the cost of increased
+ memory consumption.
+
 config TEST_KASAN
tristate "Module for testing KASAN for bug detection"
depends on m && KASAN
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 3b8cde0cb5b2..6814d6d6a023 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -304,7 +304,6 @@ size_t kasan_metadata_size(struct kmem_cache *cache)
 struct kasan_alloc_meta *get_alloc_info(struct kmem_cache *cache,
const void *object)
 {
-   BUILD_BUG_ON(sizeof(struct kasan_alloc_meta) > 32);
return (void *)object + cache->kasan_info.alloc_meta_offset;
 }
 
@@ -315,6 +314,24 @@ struct kasan_free_meta *get_free_info(struct kmem_cache 
*cache,
return (void *)object + cache->kasan_info.free_meta_offset;
 }
 
+
+static void kasan_set_free_info(struct kmem_cache *cache,
+   void *object, u8 tag)
+{
+   struct kasan_alloc_meta *alloc_meta;
+   u8 idx = 0;
+
+   alloc_meta = get_alloc_info(cache, object);
+
+#ifdef CONFIG_KASAN_SW_TAGS_IDENTIFY
+   idx = alloc_meta->free_track_idx;
+   alloc_meta->free_pointer_tag[idx] = tag;
+   alloc_meta->free_track_idx = (idx + 1) % KASAN_NR_FREE_STACKS;
+#endif
+
+   set_track(&alloc_meta->free_track[idx], GFP_NOWAIT);
+}
+
 void kasan_poison_slab(struct page *page)
 {
unsigned long i;
@@ -451,7 +468,8 @@ static bool __kasan_slab_free(struct kmem_cache *cache, 
void *object,
unlikely(!(cache->flags & SLAB_KASAN)))
return false;
 
-   set_track(&get_alloc_info(cache, object)->free_track, GFP_NOWAIT);
+   kasan_set_free_info(cache, object, tag);
+
quarantine_put(get_free_info(cache, object), cache);
 
return IS_ENABLED(CONFIG_KASAN_GENERIC);
diff --git a/mm/kasan/kasan.h b/mm/kasan/kasan.h
index 014f19e76247..35cff6bbb716 100644
--- a/mm/kasan/kasan.h
+++ b/mm/kasan/kasan.h
@@ -95,9 +95,19 @@ struct kasan_track {
depot_stack_handle_t stack;
 };
 
+#ifdef CONFIG_KASAN_SW_TAGS_IDENTIFY
+#define KASAN_NR_FREE_STACKS 5
+#else
+#define KASAN_NR_FREE_STACKS 1
+#endif
+
 struct kasan_alloc_meta {
struct kasan_track alloc_track;
-   struct kasan_track free_track;
+   struct kasan_track free_track[KASAN_NR_F

Re: [PATCH v4] kasan: add memory corruption identification for software tag-based mode

2019-08-21 Thread Andrey Ryabinin



On 8/20/19 8:37 AM, Walter Wu wrote:
> On Tue, 2019-08-06 at 13:43 +0800, Walter Wu wrote:
>> This patch adds memory corruption identification at bug report for
>> software tag-based mode, the report show whether it is "use-after-free"
>> or "out-of-bound" error instead of "invalid-access" error. This will make
>> it easier for programmers to see the memory corruption problem.
>>
>> We extend the slab to store five old free pointer tag and free backtrace,
>> we can check if the tagged address is in the slab record and make a
>> good guess if the object is more like "use-after-free" or "out-of-bound".
>> therefore every slab memory corruption can be identified whether it's
>> "use-after-free" or "out-of-bound".
>>
>> == Changes
>> Change since v1:
>> - add feature option CONFIG_KASAN_SW_TAGS_IDENTIFY.
>> - change QUARANTINE_FRACTION to reduce quarantine size.
>> - change the qlist order in order to find the newest object in quarantine
>> - reduce the number of calling kmalloc() from 2 to 1 time.
>> - remove global variable to use argument to pass it.
>> - correct the amount of qobject cache->size into the bytes of qlist_head.
>> - only use kasan_cache_shrink() to shrink memory.
>>
>> Change since v2:
>> - remove the shrinking memory function kasan_cache_shrink()
>> - modify the description of the CONFIG_KASAN_SW_TAGS_IDENTIFY
>> - optimize the quarantine_find_object() and qobject_free()
>> - fix the duplicating function name 3 times in the header.
>> - modify the function name set_track() to kasan_set_track()
>>
>> Change since v3:
>> - change tag-based quarantine to extend slab to identify memory corruption
> 
> Hi, Andrey,
> 
> Would you review the patch, please?


I didn't notice anything fundamentally wrong, but I find there are some
questionable implementation choices that make the code look weirder than
necessary and harder to understand. So I ended up cleaning it up; see the diff
below.
I'll send v5 with that diff folded.




diff --git a/lib/Kconfig.kasan b/lib/Kconfig.kasan
index 26cb3bcc9258..6c9682ce0254 100644
--- a/lib/Kconfig.kasan
+++ b/lib/Kconfig.kasan
@@ -140,7 +140,7 @@ config KASAN_SW_TAGS_IDENTIFY
help
  This option enables best-effort identification of bug type
  (use-after-free or out-of-bounds) at the cost of increased
- memory consumption for slab extending.
+ memory consumption.
 
 config TEST_KASAN
tristate "Module for testing KASAN for bug detection"
diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 2cdcb16b9c2d..6814d6d6a023 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -71,7 +71,7 @@ static inline depot_stack_handle_t save_stack(gfp_t flags)
return stack_depot_save(entries, nr_entries, flags);
 }
 
-void kasan_set_track(struct kasan_track *track, gfp_t flags)
+static inline void set_track(struct kasan_track *track, gfp_t flags)
 {
track->pid = current->pid;
track->stack = save_stack(flags);
@@ -304,8 +304,6 @@ size_t kasan_metadata_size(struct kmem_cache *cache)
 struct kasan_alloc_meta *get_alloc_info(struct kmem_cache *cache,
const void *object)
 {
-   if (!IS_ENABLED(CONFIG_KASAN_SW_TAGS_IDENTIFY))
-   BUILD_BUG_ON(sizeof(struct kasan_alloc_meta) > 32);
return (void *)object + cache->kasan_info.alloc_meta_offset;
 }
 
@@ -316,6 +314,24 @@ struct kasan_free_meta *get_free_info(struct kmem_cache 
*cache,
return (void *)object + cache->kasan_info.free_meta_offset;
 }
 
+
+static void kasan_set_free_info(struct kmem_cache *cache,
+   void *object, u8 tag)
+{
+   struct kasan_alloc_meta *alloc_meta;
+   u8 idx = 0;
+
+   alloc_meta = get_alloc_info(cache, object);
+
+#ifdef CONFIG_KASAN_SW_TAGS_IDENTIFY
+   idx = alloc_meta->free_track_idx;
+   alloc_meta->free_pointer_tag[idx] = tag;
+   alloc_meta->free_track_idx = (idx + 1) % KASAN_NR_FREE_STACKS;
+#endif
+
+   set_track(&alloc_meta->free_track[idx], GFP_NOWAIT);
+}
+
 void kasan_poison_slab(struct page *page)
 {
unsigned long i;
@@ -452,11 +468,8 @@ static bool __kasan_slab_free(struct kmem_cache *cache, 
void *object,
unlikely(!(cache->flags & SLAB_KASAN)))
return false;
 
-   if (IS_ENABLED(CONFIG_KASAN_SW_TAGS_IDENTIFY))
-   kasan_set_free_info(cache, object, tag);
-   else
-   kasan_set_track(&get_alloc_info(cache, object)->free_track,
-   GFP_NOWAIT);
+   kasan_set_free_info(cache, object, tag);
+
quarantine_put(get_free_info(cache, object), cache);
 
return IS_ENABLED(CONFIG_KASAN_GENERIC);
@@ -494,8 +507,7 @@ static void *__kasan_kmalloc(struct kmem_cache *cache, 
const void *object,
KASAN_KMALLOC_REDZONE);
 
if (cache->flags & SLAB_KASAN)
-   kasan_set_track(&get_alloc_info(cache, object)->alloc_track,
-   

[PATCH] mm/kasan: Fix false positive invalid-free reports with CONFIG_KASAN_SW_TAGS=y

2019-08-19 Thread Andrey Ryabinin
The code like this:

ptr = kmalloc(size, GFP_KERNEL);
page = virt_to_page(ptr);
offset = offset_in_page(ptr);
kfree(page_address(page) + offset);

may produce false-positive invalid-free reports on a kernel with
CONFIG_KASAN_SW_TAGS=y.

In the example above we lose the original tag assigned to 'ptr',
so kfree() gets the pointer with the 0xFF tag. In kfree() we check that the
0xFF tag is different from the tag in the shadow and hence print a false report.

Instead of just comparing tags, do the following:
 1) Check that the shadow doesn't contain KASAN_TAG_INVALID. Otherwise it's a
double-free and it doesn't matter what tag the pointer has.

 2) If the pointer tag is different from 0xFF, make sure that the tag in the
shadow is the same as in the pointer.

Fixes: 7f94ffbc4c6a ("kasan: add hooks implementation for tag-based mode")
Signed-off-by: Andrey Ryabinin 
Reported-by: Walter Wu 
Reported-by: Mark Rutland 
Cc: 
---
 mm/kasan/common.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 895dc5e2b3d5..3b8cde0cb5b2 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -406,8 +406,14 @@ static inline bool shadow_invalid(u8 tag, s8 shadow_byte)
if (IS_ENABLED(CONFIG_KASAN_GENERIC))
return shadow_byte < 0 ||
shadow_byte >= KASAN_SHADOW_SCALE_SIZE;
-   else
-   return tag != (u8)shadow_byte;
+
+   /* else CONFIG_KASAN_SW_TAGS: */
+   if ((u8)shadow_byte == KASAN_TAG_INVALID)
+   return true;
+   if ((tag != KASAN_TAG_KERNEL) && (tag != (u8)shadow_byte))
+   return true;
+
+   return false;
 }
 
 static bool __kasan_slab_free(struct kmem_cache *cache, void *object,
-- 
2.21.0



Re: [PATCHv2] lib/test_kasan: add roundtrip tests

2019-08-19 Thread Andrey Ryabinin



On 8/19/19 7:14 PM, Mark Rutland wrote:
> In several places we need to be able to operate on pointers which have
> gone via a roundtrip:
> 
>   virt -> {phys,page} -> virt
> 
> With KASAN_SW_TAGS, we can't preserve the tag for SLUB objects, and the
> {phys,page} -> virt conversion will use KASAN_TAG_KERNEL.
> 
> This patch adds tests to ensure that this works as expected, without
> false positives which have recently been spotted [1,2] in testing.
> 
> [1] 
> https://lore.kernel.org/linux-arm-kernel/20190819114420.2535-1-walter-zh...@mediatek.com/
> [2] 
> https://lore.kernel.org/linux-arm-kernel/20190819132347.gb9...@lakrids.cambridge.arm.com/
> 
> Signed-off-by: Mark Rutland 
> Reviewed-by: Andrey Konovalov 
> Tested-by: Andrey Konovalov 
> Cc: Alexander Potapenko 
> Cc: Andrew Morton 
> Cc: Andrey Ryabinin 
> Cc: Dmitry Vyukov 
> Cc: Will Deacon 
> ---


Acked-by: Andrey Ryabinin 


Re: [PATCH] arm64: kasan: fix phys_to_virt() false positive on tag-based kasan

2019-08-19 Thread Andrey Ryabinin



On 8/19/19 4:34 PM, Will Deacon wrote:
> On Mon, Aug 19, 2019 at 02:23:48PM +0100, Mark Rutland wrote:
>> On Mon, Aug 19, 2019 at 01:56:26PM +0100, Will Deacon wrote:
>>> On Mon, Aug 19, 2019 at 07:44:20PM +0800, Walter Wu wrote:
 __arm_v7s_unmap() call iopte_deref() to translate pyh_to_virt address,
 but it will modify pointer tag into 0xff, so there is a false positive.

 When enable tag-based kasan, phys_to_virt() function need to rewrite
 its original pointer tag in order to avoid kasan report an incorrect
 memory corruption.
>>>
>>> Hmm. Which tree did you see this on? We've recently queued a load of fixes
>>> in this area, but I /thought/ they were only needed after the support for
>>> 52-bit virtual addressing in the kernel.
>>
>> I'm seeing similar issues in the virtio blk code (splat below), atop of
>> the arm64 for-next/core branch. I think this is a latent issue, and
>> people are only just starting to test with KASAN_SW_TAGS.
>>
>> It looks like the virtio blk code will round-trip a SLUB-allocated pointer 
>> from
>> virt->page->virt, losing the per-object tag in the process.
>>
>> Our page_to_virt() seems to get a per-page tag, but this only makes
>> sense if you're dealing with the page allocator, rather than something
>> like SLUB which carves a page into smaller objects giving each object a
>> distinct tag.
>>
>> Any round-trip of a pointer from SLUB is going to lose the per-object
>> tag.
> 
> Urgh, I wonder how this is supposed to work?
> 

We are supposed to ignore pointers with 0xff tags. We do ignore them when a
memory access is checked, but not in the kfree() path.
This untested patch should fix the issue:



---
 mm/kasan/common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/kasan/common.c b/mm/kasan/common.c
index 895dc5e2b3d5..0a81cc328049 100644
--- a/mm/kasan/common.c
+++ b/mm/kasan/common.c
@@ -407,7 +407,7 @@ static inline bool shadow_invalid(u8 tag, s8 shadow_byte)
return shadow_byte < 0 ||
shadow_byte >= KASAN_SHADOW_SCALE_SIZE;
else
-   return tag != (u8)shadow_byte;
+   return (tag != KASAN_TAG_KERNEL) && (tag != (u8)shadow_byte);
 }
 
 static bool __kasan_slab_free(struct kmem_cache *cache, void *object,
-- 
2.21.0



Re: [PATCH v3] kasan: add memory corruption identification for software tag-based mode

2019-07-31 Thread Andrey Ryabinin



On 7/26/19 4:19 PM, Walter Wu wrote:
> On Fri, 2019-07-26 at 15:52 +0300, Andrey Ryabinin wrote:
>>
>> On 7/26/19 3:28 PM, Walter Wu wrote:
>>> On Fri, 2019-07-26 at 15:00 +0300, Andrey Ryabinin wrote:
>>>>
>>>
>>>>>
>>>>>
>>>>> I remember that there are already the lists which you concern. Maybe we
>>>>> can try to solve those problems one by one.
>>>>>
>>>>> 1. deadlock issue? cause by kmalloc() after kfree()?
>>>>
>>>> smp_call_on_cpu()
>>>
>>>>> 2. decrease allocation fail, to modify GFP_NOWAIT flag to GFP_KERNEL?
>>>>
>>>> No, this is not gonna work. Ideally we shouldn't have any allocations 
>>>> there.
>>>> It's not reliable and it hurts performance.
>>>>
>>> I dont know this meaning, we need create a qobject and put into
>>> quarantine, so may need to call kmem_cache_alloc(), would you agree this
>>> action?
>>>
>>
>> How is this any different from what you have now?
> 
> I originally thought you already agreed the free-list(tag-based
> quarantine) after fix those issue. If no allocation there,

If there is no allocation there, then it must be somewhere else.
We know exactly the amount of memory we need, so it's possible to preallocate
it in advance.
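
As a purely hypothetical sketch of what "preallocate in advance" could look
like (every name below is made up for illustration, none of this is existing
KASAN code): a fixed-size ring of recently-freed-object records, allocated at
boot, so the kfree() path never calls kmalloc():

#include <linux/spinlock.h>
#include <linux/stackdepot.h>
#include <linux/types.h>

#define KASAN_FREE_RECORDS 4096

/* One record per recent kfree(): which object, with what pointer tag,
 * and a handle to the free stack already saved in stackdepot.
 */
struct kasan_free_record {
	unsigned long object;
	depot_stack_handle_t free_stack;
	u8 tag;
};

static struct kasan_free_record free_ring[KASAN_FREE_RECORDS];
static unsigned int free_ring_idx;
static DEFINE_SPINLOCK(free_ring_lock);

/* Called from the kfree() path: no allocation, just overwrite the oldest slot. */
static void kasan_record_free(unsigned long object, u8 tag,
			      depot_stack_handle_t stack)
{
	unsigned long flags;

	spin_lock_irqsave(&free_ring_lock, flags);
	free_ring[free_ring_idx] = (struct kasan_free_record) {
		.object		= object,
		.free_stack	= stack,
		.tag		= tag,
	};
	free_ring_idx = (free_ring_idx + 1) % KASAN_FREE_RECORDS;
	spin_unlock_irqrestore(&free_ring_lock, flags);
}

The report path would then scan the ring for the faulting address, much like
the per-thread hwasan ring buffers mentioned elsewhere in this thread.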


> i think maybe
> only move generic quarantine into tag-based kasan, but its memory
> consumption is more bigger our patch. what do you think?
> 


Re: [PATCH v3] kasan: add memory corruption identification for software tag-based mode

2019-07-26 Thread Andrey Ryabinin



On 7/26/19 3:28 PM, Walter Wu wrote:
> On Fri, 2019-07-26 at 15:00 +0300, Andrey Ryabinin wrote:
>>
>
>>>
>>>
>>> I remember that there are already the lists which you concern. Maybe we
>>> can try to solve those problems one by one.
>>>
>>> 1. deadlock issue? cause by kmalloc() after kfree()?
>>
>> smp_call_on_cpu()
> 
>>> 2. decrease allocation fail, to modify GFP_NOWAIT flag to GFP_KERNEL?
>>
>> No, this is not gonna work. Ideally we shouldn't have any allocations there.
>> It's not reliable and it hurts performance.
>>
> I dont know this meaning, we need create a qobject and put into
> quarantine, so may need to call kmem_cache_alloc(), would you agree this
> action?
> 

How is this any different from what you have now?


Re: [PATCH v3] kasan: add memory corruption identification for software tag-based mode

2019-07-26 Thread Andrey Ryabinin



On 7/22/19 12:52 PM, Walter Wu wrote:
> On Thu, 2019-07-18 at 19:11 +0300, Andrey Ryabinin wrote:
>>
>> On 7/15/19 6:06 AM, Walter Wu wrote:
>>> On Fri, 2019-07-12 at 13:52 +0300, Andrey Ryabinin wrote:
>>>>
>>>> On 7/11/19 1:06 PM, Walter Wu wrote:
>>>>> On Wed, 2019-07-10 at 21:24 +0300, Andrey Ryabinin wrote:
>>>>>>
>>>>>> On 7/9/19 5:53 AM, Walter Wu wrote:
>>>>>>> On Mon, 2019-07-08 at 19:33 +0300, Andrey Ryabinin wrote:
>>>>>>>>
>>>>>>>> On 7/5/19 4:34 PM, Dmitry Vyukov wrote:
>>>>>>>>> On Mon, Jul 1, 2019 at 11:56 AM Walter Wu  
>>>>>>>>> wrote:
>>>>>>
>>>>>>>>>
>>>>>>>>> Sorry for delays. I am overwhelm by some urgent work. I afraid to
>>>>>>>>> promise any dates because the next week I am on a conference, then
>>>>>>>>> again a backlog and an intern starting...
>>>>>>>>>
>>>>>>>>> Andrey, do you still have concerns re this patch? This change allows
>>>>>>>>> to print the free stack.
>>>>>>>>
>>>>>>>> I 'm not sure that quarantine is a best way to do that. Quarantine is 
>>>>>>>> made to delay freeing, but we don't that here.
>>>>>>>> If we want to remember more free stacks wouldn't be easier simply to 
>>>>>>>> remember more stacks in object itself?
>>>>>>>> Same for previously used tags for better use-after-free identification.
>>>>>>>>
>>>>>>>
>>>>>>> Hi Andrey,
>>>>>>>
>>>>>>> We ever tried to use object itself to determine use-after-free
>>>>>>> identification, but tag-based KASAN immediately released the pointer
>>>>>>> after call kfree(), the original object will be used by another
>>>>>>> pointer, if we use object itself to determine use-after-free issue, then
>>>>>>> it has many false negative cases. so we create a lite quarantine(ring
>>>>>>> buffers) to record recent free stacks in order to avoid those false
>>>>>>> negative situations.
>>>>>>
>>>>>> I'm telling that *more* than one free stack and also tags per object can 
>>>>>> be stored.
>>>>>> If object reused we would still have information about n-last usages of 
>>>>>> the object.
>>>>>> It seems like much easier and more efficient solution than patch you 
>>>>>> proposing.
>>>>>>
>>>>> To make the object reused, we must ensure that no other pointers uses it
>>>>> after kfree() release the pointer.
>>>>> Scenario:
>>>>> 1). The object reused information is valid when no another pointer uses
>>>>> it.
>>>>> 2). The object reused information is invalid when another pointer uses
>>>>> it.
>>>>> Do you mean that the object reused is scenario 1) ?
>>>>> If yes, maybe we can change the calling quarantine_put() location. It
>>>>> will be fully use that quarantine, but at scenario 2) it looks like to
>>>>> need this patch.
>>>>> If no, maybe i miss your meaning, would you tell me how to use invalid
>>>>> object information? or?
>>>>>
>>>>
>>>>
>>>> KASAN keeps information about object with the object, right after payload 
>>>> in the kasan_alloc_meta struct.
>>>> This information is always valid as long as slab page allocated. Currently 
>>>> it keeps only one last free stacktrace.
>>>> It could be extended to record more free stacktraces and also record 
>>>> previously used tags which will allow you
>>>> to identify use-after-free and extract right free stacktrace.
>>>
>>> Thanks for your explanation.
>>>
>>> For extend slub object, if one record is 9B (sizeof(u8)+ sizeof(struct
>>> kasan_track)) and add five records into slub object, every slub object
>>> may add 45B usage after the system runs longer. 
>>> Slub object number is easy more than 1,000,000(maybe it may be more
>>> bigger), then the extending object memory usage should be 45MB, and
>>> unfortunately it is no limit. The memory usage is more bigger than our
>>> patch.
>>
>> No, it's not necessarily more.
>> And there are other aspects to consider such as performance, how simple 
>> reliable the code is.
>>
>>>
>>> We hope tag-based KASAN advantage is smaller memory usage. If it’s
>>> possible, we should spend less memory in order to identify
>>> use-after-free. Would you accept our patch after fine tune it?
>>
>> Sure, if you manage to fix issues and demonstrate that performance penalty 
>> of your
>> patch is close to zero.
> 
> 
> I remember that there are already the lists which you concern. Maybe we
> can try to solve those problems one by one.
> 
> 1. deadlock issue? cause by kmalloc() after kfree()?

smp_call_on_cpu()

> 2. decrease allocation fail, to modify GFP_NOWAIT flag to GFP_KERNEL?

No, this is not gonna work. Ideally we shouldn't have any allocations there.
It's not reliable and it hurts performance.


> 3. check whether slim 48 bytes (sizeof (qlist_object) +
> sizeof(kasan_alloc_meta)) and additional unique stacktrace in
> stackdepot?
> 4. duplicate struct 'kasan_track' information in two different places
> 

Yup.

> Would you have any other concern? or?
> 

It would be nice to see some performance numbers. Something that uses slab 
allocations a lot, e.g. netperf STREAM_STREAM test.




Re: [PATCH] kmemleak: Increase maximum early log entries to 1000000

2019-07-24 Thread Andrey Ryabinin



On 7/23/19 11:13 AM, Nicolas Boichat wrote:
> On Tue, Jul 23, 2019 at 3:46 PM Dmitry Vyukov  wrote:
>>
>> On Tue, Jul 23, 2019 at 9:26 AM Nicolas Boichat  
>> wrote:
>>>
>>> When KASan is enabled, a lot of memory is allocated early on,
>>> and kmemleak complains (this is on a 4GB RAM system):
>>> kmemleak: Early log buffer exceeded (129846), please increase
>>>   DEBUG_KMEMLEAK_EARLY_LOG_SIZE
>>>
>>> Let's increase the upper limit to 1M entry. That would take up
>>> 160MB of RAM at init (each early_log entry is 160 bytes), but
>>> the memory would later be freed (early_log is __initdata).
>>
>> Interesting. Is it on an arm64 system?
> 
> Yes arm64. And this is chromiumos-4.19 tree. I didn't try to track
> down where these allocations come from...
> 

Is this still a problem in the upstream tree? 4.19 doesn't have fed84c785270
("mm/memblock.c: skip kmemleak for kasan_init()").



Re: [PATCH] [v2] ubsan: build ubsan.c more conservatively

2019-07-22 Thread Andrey Ryabinin



On 7/22/19 3:50 PM, Arnd Bergmann wrote:
> objtool points out several conditions that it does not like, depending
> on the combination with other configuration options and compiler
> variants:
> 
> stack protector:
> lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch()+0xbf: call to 
> __stack_chk_fail() with UACCESS enabled
> lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch_v1()+0xbe: call 
> to __stack_chk_fail() with UACCESS enabled
> 
> stackleak plugin:
> lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch()+0x4a: call to 
> stackleak_track_stack() with UACCESS enabled
> lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch_v1()+0x4a: call 
> to stackleak_track_stack() with UACCESS enabled
> 
> kasan:
> lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch()+0x25: call to 
> memcpy() with UACCESS enabled
> lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch_v1()+0x25: call 
> to memcpy() with UACCESS enabled
> 
> The stackleak and kasan options just need to be disabled for this file
> as we do for other files already. For the stack protector, we already
> attempt to disable it, but this fails on clang because the check is
> mixed with the gcc specific -fno-conserve-stack option. According
> to Andrey Ryabinin, that option is not even needed, dropping it here
> fixes the stackprotector issue.
> 
> Fixes: d08965a27e84 ("x86/uaccess, ubsan: Fix UBSAN vs. SMAP")
> Link: https://lore.kernel.org/lkml/20190617123109.667090-1-a...@arndb.de/t/
> Link: https://lore.kernel.org/lkml/20190722091050.2188664-1-a...@arndb.de/t/
> Cc: sta...@vger.kernel.org
> Signed-off-by: Arnd Bergmann 


Reviewed-by: Andrey Ryabinin 


Re: [PATCH] [v2] kasan: remove clang version check for KASAN_STACK

2019-07-22 Thread Andrey Ryabinin



On 7/19/19 11:03 PM, Arnd Bergmann wrote:
> asan-stack mode still uses dangerously large kernel stacks of
> tens of kilobytes in some drivers, and it does not seem that anyone
> is working on the clang bug.
> 
> Turn it off for all clang versions to prevent users from
> accidentally enabling it once they update to clang-9, and
> to help automated build testing with clang-9.
> 
> Link: https://bugs.llvm.org/show_bug.cgi?id=38809
> Fixes: 6baec880d7a5 ("kasan: turn off asan-stack for clang-8 and earlier")
> Signed-off-by: Arnd Bergmann 
> ---

Reviewed-by: Andrey Ryabinin 


Re: [PATCH] [v3] page flags: prioritize kasan bits over last-cpuid

2019-07-22 Thread Andrey Ryabinin



On 7/22/19 2:55 PM, Arnd Bergmann wrote:
> ARM64 randconfig builds regularly run into a build error, especially
> when NUMA_BALANCING and SPARSEMEM are enabled but not SPARSEMEM_VMEMMAP:
> 
>  #error "KASAN: not enough bits in page flags for tag"
> 
> The last-cpuid bits are already conditional on the available space,
> so the result of the calculation is a bit random on whether they
> were already left out or not.
> 
> Adding the kasan tag bits before last-cpuid makes it much more likely
> to end up with a successful build here, and should be reliable for
> randconfig at least, as long as that does not randomize NR_CPUS
> or NODES_SHIFT but uses the defaults.
> 
> In order for the modified check to not trigger in the x86 vdso32 code
> where all constants are wrong (building with -m32), enclose all the
> definitions with an #ifdef.
> 
> Fixes: 2813b9c02962 ("kasan, mm, arm64: tag non slab memory allocated via 
> pagealloc")
> Link: https://lore.kernel.org/lkml/20190618095347.3850490-1-a...@arndb.de/
> Reviewed-by: Andrey Konovalov 
> Signed-off-by: Arnd Bergmann 

Reviewed-by: Andrey Ryabinin 


Re: [PATCH] ubsan: build ubsan.c more conservatively

2019-07-22 Thread Andrey Ryabinin



On 7/22/19 12:10 PM, Arnd Bergmann wrote:
> objtool points out several conditions that it does not like, depending
> on the combination with other configuration options and compiler
> variants:
> 
> stack protector:
> lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch()+0xbf: call to 
> __stack_chk_fail() with UACCESS enabled
> lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch_v1()+0xbe: call 
> to __stack_chk_fail() with UACCESS enabled
> 
> stackleak plugin:
> lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch()+0x4a: call to 
> stackleak_track_stack() with UACCESS enabled
> lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch_v1()+0x4a: call 
> to stackleak_track_stack() with UACCESS enabled
> 
> kasan:
> lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch()+0x25: call to 
> memcpy() with UACCESS enabled
> lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch_v1()+0x25: call 
> to memcpy() with UACCESS enabled
> 
> The stackleak and kasan options just need to be disabled for this file
> as we do for other files already. For the stack protector, we already
> attempt to disable it, but this fails on clang because the check is
> mixed with the gcc specific -fno-conserve-stack option, so we need to
> test them separately.
> 
> Fixes: 42440c1f9911 ("lib/ubsan: add type mismatch handler for new GCC/Clang")

There was no uaccess validation at that time, so the right Fixes line is
probably this:

Fixes: d08965a27e84 ("x86/uaccess, ubsan: Fix UBSAN vs. SMAP")

> Link: https://lore.kernel.org/lkml/20190617123109.667090-1-a...@arndb.de/t/
> Cc: sta...@vger.kernel.org
> Signed-off-by: Arnd Bergmann 
> ---
>  lib/Makefile | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/lib/Makefile b/lib/Makefile
> index 095601ce371d..320e3b632dd3 100644
> --- a/lib/Makefile
> +++ b/lib/Makefile
> @@ -279,7 +279,8 @@ obj-$(CONFIG_UCS2_STRING) += ucs2_string.o
>  obj-$(CONFIG_UBSAN) += ubsan.o
>  
>  UBSAN_SANITIZE_ubsan.o := n
> -CFLAGS_ubsan.o := $(call cc-option, -fno-conserve-stack -fno-stack-protector)
> +KASAN_SANITIZE_ubsan.o := n
> +CFLAGS_ubsan.o := $(call cc-option, -fno-conserve-stack) $(call cc-option, 
> -fno-stack-protector) $(DISABLE_STACKLEAK_PLUGIN)

$(call cc-option, -fno-conserve-stack) can be removed entirely. It's just a
copy-paste from the KASAN Makefile.
It was added in KASAN purely for performance reasons.

I'm not sure that it's needed even in the KASAN Makefile; the code which was
the reason for adding -fno-conserve-stack might not have made it into the final
version of the KASAN patches.


Re: kasan: paging percpu + kasan causes a double fault

2019-07-18 Thread Andrey Ryabinin



On 7/18/19 6:51 PM, Dmitry Vyukov wrote:
> On Mon, Jul 8, 2019 at 5:05 PM Dennis Zhou  wrote:
>>
>> Hi Andrey, Alexander, and Dmitry,
>>
>> It was reported to me that when percpu is ran with param
>> percpu_alloc=page or the embed allocation scheme fails and falls back to
>> page that a double fault occurs.
>>
>> I don't know much about how kasan works, but a difference between the
>> two is that we manually reserve vm area via vm_area_register_early().
>> I guessed it had something to do with the stack canary or the irq_stack,
>> and manually mapped the shadow vm area with kasan_add_zero_shadow(), but
>> that didn't seem to do the trick.
>>
>> RIP resolves to the fixed_percpu_data declaration.
>>
>> Double fault below:
>> [0.00] PANIC: double fault, error_code: 0x0
>> [0.00] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
>> 5.2.0-rc7-7-ge0afe6d4d12c-dirty #299
>> [0.00] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> 1.11.0-2.el7 04/01/2014
>> [0.00] RIP: 0010:no_context+0x38/0x4b0
>> [0.00] Code: df 41 57 41 56 4c 8d bf 88 00 00 00 41 55 49 89 d5 41 
>> 54 49 89 f4 55 48 89 fd 4c8
>> [0.00] RSP: :c828 EFLAGS: 00010096
>> [0.00] RAX: dc00 RBX: c850 RCX: 
>> 000b
>> [0.00] RDX: f5200030 RSI: 0003 RDI: 
>> c9000130
>> [0.00] RBP: c9a8 R08: 0001 R09: 
>> 
>> [0.00] R10:  R11:  R12: 
>> 0003
>> [0.00] R13: f5200030 R14:  R15: 
>> c9000130
>> [0.00] FS:  () GS:c900() 
>> knlGS:
>> [0.00] CS:  0010 DS:  ES:  CR0: 80050033
>> [0.00] CR2: c818 CR3: 02e0d001 CR4: 
>> 000606b0
>> [0.00] DR0:  DR1:  DR2: 
>> 
>> [0.00] DR3:  DR6: fffe0ff0 DR7: 
>> 0400
>> [0.00] Call Trace:
>> [0.00] Kernel panic - not syncing: Machine halted.
>> [0.00] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
>> 5.2.0-rc7-7-ge0afe6d4d12c-dirty #299
>> [0.00] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> 1.11.0-2.el7 04/01/2014
>> [0.00] Call Trace:
>> [0.00]  <#DF>
>> [0.00]  dump_stack+0x5b/0x90
>> [0.00]  panic+0x17e/0x36e
>> [0.00]  ? __warn_printk+0xdb/0xdb
>> [0.00]  ? spurious_kernel_fault_check+0x1a/0x60
>> [0.00]  df_debug+0x2e/0x39
>> [0.00]  do_double_fault+0x89/0xb0
>> [0.00]  double_fault+0x1e/0x30
>> [0.00] RIP: 0010:no_context+0x38/0x4b0
>> [0.00] Code: df 41 57 41 56 4c 8d bf 88 00 00 00 41 55 49 89 d5 41 
>> 54 49 89 f4 55 48 89 fd 4c8
>> [0.00] RSP: :c828 EFLAGS: 00010096
>> [0.00] RAX: dc00 RBX: c850 RCX: 
>> 000b
>> [0.00] RDX: f5200030 RSI: 0003 RDI: 
>> c9000130
>> [0.00] RBP: c9a8 R08: 0001 R09: 
>> 
>> [0.00] R10:  R11:  R12: 
>> 0003
>> [ 0.00] R13: f5200030 R14:  R15: c9000130
> 
> 
> Hi Dennis,
> 
> I don't have lots of useful info, but a naive question: could you stop
> using percpu_alloc=page with KASAN? That should resolve the problem :)
> We could even add a runtime check that will clearly say that this
> combintation does not work.
> 
> I see that setup_per_cpu_areas is called after kasan_init which is
> called from setup_arch. So KASAN should already map final shadow at
> that point.
> The only potential reason that I see is that setup_per_cpu_areas maps
> the percpu region at address that is not covered/expected by
> kasan_init. Where is page-based percpu is mapped? Is that covered by
> kasan_init?
> Otherwise, seeing the full stack trace of the fault may shed some light.
> 

percpu_alloc=page maps percpu areas into the vmalloc area, which doesn't have
read-write KASAN shadow memory.
IRQ stacks are percpu, thus we get a GPF on the attempt to poison stack
redzones in an IRQ.
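
As for the runtime check Dmitry suggested, a rough sketch (treat the placement
and wording as assumptions, the snippet is untested; pcpu_chosen_fc and the
PCPU_FC_* values do exist in the percpu code):

	/* Somewhere in the arch percpu setup, before the first chunk is set up. */
	if (IS_ENABLED(CONFIG_KASAN) && pcpu_chosen_fc == PCPU_FC_PAGE) {
		pr_warn("percpu: percpu_alloc=page is not supported with KASAN, using auto\n");
		pcpu_chosen_fc = PCPU_FC_AUTO;
	}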


Re: [PATCH] kasan: push back KASAN_STACK detection to clang-10

2019-07-18 Thread Andrey Ryabinin



On 7/18/19 5:14 PM, Arnd Bergmann wrote:
> asan-stack mode still uses dangerously large kernel stacks of
> tens of kilobytes in some drivers, and it does not seem that anyone
> is working on the clang bug.
> 
> Let's push this back to clang-10 for now so users don't run into
> this by accident, and we can test-build allmodconfig kernels using
> clang-9 without drowning in warnings.
> 
> Link: https://bugs.llvm.org/show_bug.cgi?id=38809
> Fixes: 6baec880d7a5 ("kasan: turn off asan-stack for clang-8 and earlier")
> Signed-off-by: Arnd Bergmann 
> ---
>  lib/Kconfig.kasan | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/Kconfig.kasan b/lib/Kconfig.kasan
> index 4fafba1a923b..2f260bb63d77 100644
> --- a/lib/Kconfig.kasan
> +++ b/lib/Kconfig.kasan
> @@ -106,7 +106,7 @@ endchoice
>  
>  config KASAN_STACK_ENABLE
>   bool "Enable stack instrumentation (unsafe)" if CC_IS_CLANG && 
> !COMPILE_TEST
> - default !(CLANG_VERSION < 9)
> + default !(CLANG_VERSION < 10)

Wouldn't it be better to disable this for any clang version? And only when the
bug is finally fixed, specify the clang version which can enable this safely.


>   depends on KASAN
>   help
> The LLVM stack address sanitizer has a know problem that
> 


Re: [PATCH v3] kasan: add memory corruption identification for software tag-based mode

2019-07-18 Thread Andrey Ryabinin



On 7/15/19 6:06 AM, Walter Wu wrote:
> On Fri, 2019-07-12 at 13:52 +0300, Andrey Ryabinin wrote:
>>
>> On 7/11/19 1:06 PM, Walter Wu wrote:
>>> On Wed, 2019-07-10 at 21:24 +0300, Andrey Ryabinin wrote:
>>>>
>>>> On 7/9/19 5:53 AM, Walter Wu wrote:
>>>>> On Mon, 2019-07-08 at 19:33 +0300, Andrey Ryabinin wrote:
>>>>>>
>>>>>> On 7/5/19 4:34 PM, Dmitry Vyukov wrote:
>>>>>>> On Mon, Jul 1, 2019 at 11:56 AM Walter Wu  
>>>>>>> wrote:
>>>>
>>>>>>>
>>>>>>> Sorry for delays. I am overwhelm by some urgent work. I afraid to
>>>>>>> promise any dates because the next week I am on a conference, then
>>>>>>> again a backlog and an intern starting...
>>>>>>>
>>>>>>> Andrey, do you still have concerns re this patch? This change allows
>>>>>>> to print the free stack.
>>>>>>
>>>>>> I 'm not sure that quarantine is a best way to do that. Quarantine is 
>>>>>> made to delay freeing, but we don't that here.
>>>>>> If we want to remember more free stacks wouldn't be easier simply to 
>>>>>> remember more stacks in object itself?
>>>>>> Same for previously used tags for better use-after-free identification.
>>>>>>
>>>>>
>>>>> Hi Andrey,
>>>>>
>>>>> We ever tried to use object itself to determine use-after-free
>>>>> identification, but tag-based KASAN immediately released the pointer
>>>>> after call kfree(), the original object will be used by another
>>>>> pointer, if we use object itself to determine use-after-free issue, then
>>>>> it has many false negative cases. so we create a lite quarantine(ring
>>>>> buffers) to record recent free stacks in order to avoid those false
>>>>> negative situations.
>>>>
>>>> I'm telling that *more* than one free stack and also tags per object can 
>>>> be stored.
>>>> If object reused we would still have information about n-last usages of 
>>>> the object.
>>>> It seems like much easier and more efficient solution than patch you 
>>>> proposing.
>>>>
>>> To make the object reused, we must ensure that no other pointers uses it
>>> after kfree() release the pointer.
>>> Scenario:
>>> 1). The object reused information is valid when no another pointer uses
>>> it.
>>> 2). The object reused information is invalid when another pointer uses
>>> it.
>>> Do you mean that the object reused is scenario 1) ?
>>> If yes, maybe we can change the calling quarantine_put() location. It
>>> will be fully use that quarantine, but at scenario 2) it looks like to
>>> need this patch.
>>> If no, maybe i miss your meaning, would you tell me how to use invalid
>>> object information? or?
>>>
>>
>>
>> KASAN keeps information about object with the object, right after payload in 
>> the kasan_alloc_meta struct.
>> This information is always valid as long as slab page allocated. Currently 
>> it keeps only one last free stacktrace.
>> It could be extended to record more free stacktraces and also record 
>> previously used tags which will allow you
>> to identify use-after-free and extract right free stacktrace.
> 
> Thanks for your explanation.
> 
> For extend slub object, if one record is 9B (sizeof(u8)+ sizeof(struct
> kasan_track)) and add five records into slub object, every slub object
> may add 45B usage after the system runs longer. 
> Slub object number is easy more than 1,000,000(maybe it may be more
> bigger), then the extending object memory usage should be 45MB, and
> unfortunately it is no limit. The memory usage is more bigger than our
> patch.

No, it's not necessarily more.
And there are other aspects to consider, such as performance and how simple
and reliable the code is.

> 
> We hope tag-based KASAN advantage is smaller memory usage. If it’s
> possible, we should spend less memory in order to identify
> use-after-free. Would you accept our patch after fine tune it?

Sure, if you manage to fix the issues and demonstrate that the performance
penalty of your patch is close to zero.


Re: [PATCH v3] kasan: add memory corruption identification for software tag-based mode

2019-07-12 Thread Andrey Ryabinin



On 7/11/19 1:06 PM, Walter Wu wrote:
> On Wed, 2019-07-10 at 21:24 +0300, Andrey Ryabinin wrote:
>>
>> On 7/9/19 5:53 AM, Walter Wu wrote:
>>> On Mon, 2019-07-08 at 19:33 +0300, Andrey Ryabinin wrote:
>>>>
>>>> On 7/5/19 4:34 PM, Dmitry Vyukov wrote:
>>>>> On Mon, Jul 1, 2019 at 11:56 AM Walter Wu  
>>>>> wrote:
>>
>>>>>
>>>>> Sorry for delays. I am overwhelm by some urgent work. I afraid to
>>>>> promise any dates because the next week I am on a conference, then
>>>>> again a backlog and an intern starting...
>>>>>
>>>>> Andrey, do you still have concerns re this patch? This change allows
>>>>> to print the free stack.
>>>>
>>>> I 'm not sure that quarantine is a best way to do that. Quarantine is made 
>>>> to delay freeing, but we don't that here.
>>>> If we want to remember more free stacks wouldn't be easier simply to 
>>>> remember more stacks in object itself?
>>>> Same for previously used tags for better use-after-free identification.
>>>>
>>>
>>> Hi Andrey,
>>>
>>> We ever tried to use object itself to determine use-after-free
>>> identification, but tag-based KASAN immediately released the pointer
>>> after call kfree(), the original object will be used by another
>>> pointer, if we use object itself to determine use-after-free issue, then
>>> it has many false negative cases. so we create a lite quarantine(ring
>>> buffers) to record recent free stacks in order to avoid those false
>>> negative situations.
>>
>> I'm telling that *more* than one free stack and also tags per object can be 
>> stored.
>> If object reused we would still have information about n-last usages of the 
>> object.
>> It seems like much easier and more efficient solution than patch you 
>> proposing.
>>
> To make the object reused, we must ensure that no other pointers uses it
> after kfree() release the pointer.
> Scenario:
> 1). The object reused information is valid when no another pointer uses
> it.
> 2). The object reused information is invalid when another pointer uses
> it.
> Do you mean that the object reused is scenario 1) ?
> If yes, maybe we can change the calling quarantine_put() location. It
> will be fully use that quarantine, but at scenario 2) it looks like to
> need this patch.
> If no, maybe i miss your meaning, would you tell me how to use invalid
> object information? or?
> 


KASAN keeps information about an object together with the object, right after
the payload, in the kasan_alloc_meta struct.
This information is always valid as long as the slab page is allocated.
Currently it keeps only the last free stacktrace.
It could be extended to record more free stacktraces and also previously used
tags, which would allow you to identify a use-after-free and extract the right
free stacktrace.
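
A sketch of that extension, following the field names the
CONFIG_KASAN_SW_TAGS_IDENTIFY patch in this thread ended up using (the
record_free() helper is illustrative, not the exact final code):

#define KASAN_NR_FREE_STACKS 5

struct kasan_alloc_meta {
	struct kasan_track alloc_track;
	struct kasan_track free_track[KASAN_NR_FREE_STACKS];
	u8 free_pointer_tag[KASAN_NR_FREE_STACKS];
	u8 free_track_idx;
};

/* On kfree(): remember the pointer tag and free stack in a small ring. */
static void record_free(struct kasan_alloc_meta *alloc_meta, u8 tag, gfp_t flags)
{
	u8 idx = alloc_meta->free_track_idx;

	alloc_meta->free_pointer_tag[idx] = tag;
	set_track(&alloc_meta->free_track[idx], flags);
	alloc_meta->free_track_idx = (idx + 1) % KASAN_NR_FREE_STACKS;
}

This mirrors the kasan_set_free_info() hunk in the v5 patch earlier in this
digest.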


Re: [PATCH v5 0/5] Add object validation in ksize()

2019-07-10 Thread Andrey Ryabinin



On 7/8/19 8:07 PM, Marco Elver wrote:
> This version fixes several build issues --
> Reported-by: kbuild test robot 
> 
> Previous version here:
> http://lkml.kernel.org/r/20190627094445.216365-1-el...@google.com
> 
> Marco Elver (5):
>   mm/kasan: Introduce __kasan_check_{read,write}
>   mm/kasan: Change kasan_check_{read,write} to return boolean
>   lib/test_kasan: Add test for double-kzfree detection
>   mm/slab: Refactor common ksize KASAN logic into slab_common.c
>   mm/kasan: Add object validation in ksize()
> 

Reviewed-by: Andrey Ryabinin 


Re: [PATCH v3] kasan: add memory corruption identification for software tag-based mode

2019-07-10 Thread Andrey Ryabinin



On 7/9/19 5:53 AM, Walter Wu wrote:
> On Mon, 2019-07-08 at 19:33 +0300, Andrey Ryabinin wrote:
>>
>> On 7/5/19 4:34 PM, Dmitry Vyukov wrote:
>>> On Mon, Jul 1, 2019 at 11:56 AM Walter Wu  wrote:

>>>
>>> Sorry for delays. I am overwhelm by some urgent work. I afraid to
>>> promise any dates because the next week I am on a conference, then
>>> again a backlog and an intern starting...
>>>
>>> Andrey, do you still have concerns re this patch? This change allows
>>> to print the free stack.
>>
>> I 'm not sure that quarantine is a best way to do that. Quarantine is made 
>> to delay freeing, but we don't that here.
>> If we want to remember more free stacks wouldn't be easier simply to 
>> remember more stacks in object itself?
>> Same for previously used tags for better use-after-free identification.
>>
> 
> Hi Andrey,
> 
> We ever tried to use object itself to determine use-after-free
> identification, but tag-based KASAN immediately released the pointer
> after call kfree(), the original object will be used by another
> pointer, if we use object itself to determine use-after-free issue, then
> it has many false negative cases. so we create a lite quarantine(ring
> buffers) to record recent free stacks in order to avoid those false
> negative situations.

I'm saying that *more* than one free stack and also tags per object can be
stored.
If the object is reused we would still have information about the n last
usages of the object.
It seems like a much easier and more efficient solution than the patch you are
proposing.

As for other concerns about this particular patch:
 - It wasn't tested. There is a deadlock (sleep in atomic) on the report path
   which would have been noticed if it had been tested.
   Also the GFP_NOWAIT allocation fails very noisily and very often, especially
   in the memory-constrained environments where tag-based KASAN is supposed to
   be used.

 - Inefficient usage of memory:
   48 bytes (sizeof(qlist_object) + sizeof(kasan_alloc_meta)) per kfree() call
   seems like a lot. It could be less.

   The same 'struct kasan_track' is stored twice in two different places (in
   the object and in the quarantine). Basically, at least some part of the
   quarantine always duplicates information that we already know about the
   recently freed object.

   Since we now call kmalloc() from the kfree() path, every unique kfree()
   stacktrace now generates an additional unique stacktrace that takes space
   in stackdepot.



Re: [PATCH v3] kasan: add memory corruption identification for software tag-based mode

2019-07-08 Thread Andrey Ryabinin



On 7/5/19 4:34 PM, Dmitry Vyukov wrote:
> On Mon, Jul 1, 2019 at 11:56 AM Walter Wu  wrote:
> This patch adds memory corruption identification at bug report for
> software tag-based mode, the report show whether it is 
> "use-after-free"
> or "out-of-bound" error instead of "invalid-access" error.This will 
> make
> it easier for programmers to see the memory corruption problem.
>
> Now we extend the quarantine to support both generic and tag-based 
> kasan.
> For tag-based kasan, the quarantine stores only freed object 
> information
> to check if an object is freed recently. When tag-based kasan reports 
> an
> error, we can check if the tagged addr is in the quarantine and make a
> good guess if the object is more like "use-after-free" or 
> "out-of-bound".
>


 We already have all the information and don't need the quarantine to 
 make such guess.
 Basically if shadow of the first byte of object has the same tag as 
 tag in pointer than it's out-of-bounds,
 otherwise it's use-after-free.

 In pseudo-code it's something like this:

 u8 object_tag = *(u8 *)kasan_mem_to_shadow(nearest_object(cacche, 
 page, access_addr));

 if (access_addr_tag == object_tag && object_tag != KASAN_TAG_INVALID)
   // out-of-bounds
 else
   // use-after-free
>>>
>>> Thanks your explanation.
>>> I see, we can use it to decide corruption type.
>>> But some use-after-free issues, it may not have accurate free-backtrace.
>>> Unfortunately in that situation, free-backtrace is the most important.
>>> please see below example
>>>
>>> In generic KASAN, it gets accurate free-backrace(ptr1).
>>> In tag-based KASAN, it gets wrong free-backtrace(ptr2). It will make
>>> programmer misjudge, so they may not believe tag-based KASAN.
>>> So We provide this patch, we hope tag-based KASAN bug report is the same
>>> accurate with generic KASAN.
>>>
>>> ---
>>> ptr1 = kmalloc(size, GFP_KERNEL);
>>> ptr1_free(ptr1);
>>>
>>> ptr2 = kmalloc(size, GFP_KERNEL);
>>> ptr2_free(ptr2);
>>>
>>> ptr1[size] = 'x';  //corruption here
>>>
>>>
>>> static noinline void ptr1_free(char* ptr)
>>> {
>>> kfree(ptr);
>>> }
>>> static noinline void ptr2_free(char* ptr)
>>> {
>>> kfree(ptr);
>>> }
>>> ---
>>>
>> We think of another question about deciding by that shadow of the first
>> byte.
>> In tag-based KASAN, it is immediately released after calling kfree(), so
>> the slub is easy to be used by another pointer, then it will change
>> shadow memory to the tag of new pointer, it will not be the
>> KASAN_TAG_INVALID, so there are many false negative cases, especially in
>> small size allocation.
>>
>> Our patch is to solve those problems. so please consider it, thanks.
>>
> Hi, Andrey and Dmitry,
>
> I am sorry to bother you.
> Would you tell me what you think about this patch?
> We want to use tag-based KASAN, so we hope its bug report is clear and
> correct as generic KASAN.
>
> Thanks your review.
> Walter

 Hi Walter,

 I will probably be busy till the next week. Sorry for delays.
>>>
>>> It's ok. Thanks your kindly help.
>>> I hope I can contribute to tag-based KASAN. It is a very important tool
>>> for us.
>>
>> Hi, Dmitry,
>>
>> Would you have free time to discuss this patch together?
>> Thanks.
> 
> Sorry for delays. I am overwhelm by some urgent work. I afraid to
> promise any dates because the next week I am on a conference, then
> again a backlog and an intern starting...
> 
> Andrey, do you still have concerns re this patch? This change allows
> to print the free stack.

I'm not sure that a quarantine is the best way to do that. A quarantine is made
to delay freeing, but we don't do that here.
If we want to remember more free stacks, wouldn't it be easier to simply
remember more stacks in the object itself?
Same for previously used tags, for better use-after-free identification.

> We also have a quarantine for hwasan in user-space. Though it works a
> bit differently then the normal asan quarantine. We keep a per-thread
> fixed-size ring-buffer of recent allocations:
> https://github.com/llvm-mirror/compiler-rt/blob/master/lib/hwasan/hwasan_report.cpp#L274-L284
> and scan these ring buffers during reports.
> 


Re: [PATCH] [v2] page flags: prioritize kasan bits over last-cpuid

2019-06-18 Thread Andrey Ryabinin



On 6/18/19 6:30 PM, Arnd Bergmann wrote:
> On Tue, Jun 18, 2019 at 4:30 PM Andrey Ryabinin  
> wrote:
>> On 6/18/19 12:53 PM, Arnd Bergmann wrote:
>>> ARM64 randconfig builds regularly run into a build error, especially
>>> when NUMA_BALANCING and SPARSEMEM are enabled but not SPARSEMEM_VMEMMAP:
>>>
>>>  #error "KASAN: not enough bits in page flags for tag"
>>>
>>> The last-cpuid bits are already conditional on the available space,
>>> so the result of the calculation is a bit random on whether they
>>> were already left out or not.
>>>
>>> Adding the kasan tag bits before last-cpuid makes it much more likely
>>> to end up with a successful build here, and should be reliable for
>>> randconfig at least, as long as that does not randomize NR_CPUS
>>> or NODES_SHIFT but uses the defaults.
>>>
>>> In order for the modified check to not trigger in the x86 vdso32 code
>>> where all constants are wrong (building with -m32), enclose all the
>>> definitions with an #ifdef.
>>>
>>
>> Why not keep "#error "KASAN: not enough bits in page flags for tag"" under 
>> "#ifdef CONFIG_KASAN_SW_TAGS" ?
> 
> I think I had meant the #error to leave out the mention of KASAN, as there
> might be other reasons for using up all the bits, but then I did not change
> it in the end.
> 
> Should I remove the "KASAN" word or add the #ifdef when resending?

It seems like changing the error message is a better choice.
Don't forget to remove "for tag" as well.


Re: [PATCH] [v2] page flags: prioritize kasan bits over last-cpuid

2019-06-18 Thread Andrey Ryabinin



On 6/18/19 12:53 PM, Arnd Bergmann wrote:
> ARM64 randconfig builds regularly run into a build error, especially
> when NUMA_BALANCING and SPARSEMEM are enabled but not SPARSEMEM_VMEMMAP:
> 
>  #error "KASAN: not enough bits in page flags for tag"
> 
> The last-cpuid bits are already conditional on the available space,
> so the result of the calculation is a bit random on whether they
> were already left out or not.
> 
> Adding the kasan tag bits before last-cpuid makes it much more likely
> to end up with a successful build here, and should be reliable for
> randconfig at least, as long as that does not randomize NR_CPUS
> or NODES_SHIFT but uses the defaults.
> 
> In order for the modified check to not trigger in the x86 vdso32 code
> where all constants are wrong (building with -m32), enclose all the
> definitions with an #ifdef.
> 

Why not keep "#error "KASAN: not enough bits in page flags for tag"" under 
"#ifdef CONFIG_KASAN_SW_TAGS" ?



Re: [PATCH] ubsan: mark ubsan_type_mismatch_common inline

2019-06-18 Thread Andrey Ryabinin



On 6/18/19 3:56 PM, Arnd Bergmann wrote:
> On Mon, Jun 17, 2019 at 4:02 PM Peter Zijlstra  wrote:
>>
>> On Mon, Jun 17, 2019 at 02:31:09PM +0200, Arnd Bergmann wrote:
>>> objtool points out a condition that it does not like:
>>>
>>> lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch()+0x4a: call to 
>>> stackleak_track_stack() with UACCESS enabled
>>> lib/ubsan.o: warning: objtool: __ubsan_handle_type_mismatch_v1()+0x4a: call 
>>> to stackleak_track_stack() with UACCESS enabled
>>>
>>> I guess this is related to the call ubsan_type_mismatch_common()
>>> not being inline before it calls user_access_restore(), though
>>> I don't fully understand why that is a problem.
>>
>> The rules are that when AC is set, one is not allowed to CALL schedule,
>> because scheduling does not save/restore AC.  Preemption, through the
>> exceptions is fine, because the exceptions do save/restore AC.
>>
>> And while most functions do not appear to call into schedule, function
>> trace ensures that every single call does in fact call into schedule.
>> Therefore any CALL (with AC set) is invalid.
> 
> I see that stackleak_track_stack is already marked 'notrace',
> since we must ensure we don't recurse when calling into it from
> any of the function trace logic.
> 
> Does that mean we could just mark it as another safe call?
> 
> --- a/tools/objtool/check.c
> +++ b/tools/objtool/check.c
> @@ -486,6 +486,7 @@ static const char *uaccess_safe_builtin[] = {
> "__ubsan_handle_type_mismatch",
> "__ubsan_handle_type_mismatch_v1",
> /* misc */
> +   "stackleak_track_stack",
> "csum_partial_copy_generic",
> "__memcpy_mcsafe",
> "ftrace_likely_update", /* CONFIG_TRACE_BRANCH_PROFILING */
> 
> 
>> Maybe we should disable stackleak when building ubsan instead? We
>> already disable stack-protector when building ubsan.
> 
> I couldn't find out how that is done.
> 

I guess this:
ccflags-y += $(DISABLE_STACKLEAK_PLUGIN)


[tip:x86/urgent] x86/kasan: Fix boot with 5-level paging and KASAN

2019-06-14 Thread tip-bot for Andrey Ryabinin
Commit-ID:  f3176ec9420de0c385023afa3e4970129444ac2f
Gitweb: https://git.kernel.org/tip/f3176ec9420de0c385023afa3e4970129444ac2f
Author: Andrey Ryabinin 
AuthorDate: Fri, 14 Jun 2019 17:31:49 +0300
Committer:  Thomas Gleixner 
CommitDate: Fri, 14 Jun 2019 16:37:30 +0200

x86/kasan: Fix boot with 5-level paging and KASAN

Since commit d52888aa2753 ("x86/mm: Move LDT remap out of KASLR region on
5-level paging") kernel doesn't boot with KASAN on 5-level paging machines.
The bug is actually in early_p4d_offset() and introduced by commit
12a8cc7fcf54 ("x86/kasan: Use the same shadow offset for 4- and 5-level paging")

early_p4d_offset() tries to convert pgd_val(*pgd) value to a physical
address. This doesn't make sense because pgd_val() already contains the
physical address.

It did work prior to commit d52888aa2753 because the result of
"__pa_nodebug(pgd_val(*pgd)) & PTE_PFN_MASK" was the same as "pgd_val(*pgd)
& PTE_PFN_MASK". __pa_nodebug() just set some high bits which were masked
out by applying PTE_PFN_MASK.

After the change of the PAGE_OFFSET offset in commit d52888aa2753
__pa_nodebug(pgd_val(*pgd)) started to return a value with more high bits
set and PTE_PFN_MASK wasn't enough to mask out all of them. So it returns a
wrong not even canonical address and crashes on the attempt to dereference
it.

Switch back to pgd_val() & PTE_PFN_MASK to cure the issue.

Fixes: 12a8cc7fcf54 ("x86/kasan: Use the same shadow offset for 4- and 5-level 
paging")
Reported-by: Kirill A. Shutemov 
Signed-off-by: Andrey Ryabinin 
Signed-off-by: Thomas Gleixner 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Alexander Potapenko 
Cc: Dmitry Vyukov 
Cc: kasan-...@googlegroups.com
Cc: sta...@vger.kernel.org
Cc: 
Link: https://lkml.kernel.org/r/20190614143149.2227-1-aryabi...@virtuozzo.com

---
 arch/x86/mm/kasan_init_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 8dc0fc0b1382..296da58f3013 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -199,7 +199,7 @@ static inline p4d_t *early_p4d_offset(pgd_t *pgd, unsigned 
long addr)
if (!pgtable_l5_enabled())
return (p4d_t *)pgd;
 
-   p4d = __pa_nodebug(pgd_val(*pgd)) & PTE_PFN_MASK;
+   p4d = pgd_val(*pgd) & PTE_PFN_MASK;
p4d += __START_KERNEL_map - phys_base;
return (p4d_t *)p4d + p4d_index(addr);
 }


[PATCH] x86/kasan: Fix boot with 5-level paging and KASAN

2019-06-14 Thread Andrey Ryabinin
Since commit d52888aa2753 ("x86/mm: Move LDT remap out of KASLR region on
5-level paging") kernel doesn't boot with KASAN on 5-level paging machines.
The bug is actually in early_p4d_offset() and introduced by commit
12a8cc7fcf54 ("x86/kasan: Use the same shadow offset for 4- and 5-level paging")

early_p4d_offset() tries to convert the pgd_val(*pgd) value to a physical
address. This doesn't make sense because pgd_val() already contains the
physical address.

It did work prior to commit d52888aa2753 because the result of
"__pa_nodebug(pgd_val(*pgd)) & PTE_PFN_MASK" was the same as
"pgd_val(*pgd) & PTE_PFN_MASK". __pa_nodebug() just set some high bits
which were masked out by applying PTE_PFN_MASK.

After the change of the PAGE_OFFSET offset in commit d52888aa2753,
__pa_nodebug(pgd_val(*pgd)) started to return a value with more high bits
set and PTE_PFN_MASK wasn't enough to mask out all of them. So we got a
wrong, not even canonical, address and crashed on the attempt to dereference it.

Fixes: 12a8cc7fcf54 ("x86/kasan: Use the same shadow offset for 4- and 5-level 
paging")
Reported-by: Kirill A. Shutemov 
Signed-off-by: Andrey Ryabinin 
Cc: 
---
 arch/x86/mm/kasan_init_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 8dc0fc0b1382..296da58f3013 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -199,7 +199,7 @@ static inline p4d_t *early_p4d_offset(pgd_t *pgd, unsigned 
long addr)
if (!pgtable_l5_enabled())
return (p4d_t *)pgd;
 
-   p4d = __pa_nodebug(pgd_val(*pgd)) & PTE_PFN_MASK;
+   p4d = pgd_val(*pgd) & PTE_PFN_MASK;
p4d += __START_KERNEL_map - phys_base;
return (p4d_t *)p4d + p4d_index(addr);
 }
-- 
2.21.0



Re: [PATCH v2 1/3] fs/fuse, splice_write: Don't access pipe->buffers without pipe_lock()

2019-06-13 Thread Andrey Ryabinin



On 6/12/19 11:57 AM, Vlastimil Babka wrote:
> On 7/17/18 6:00 PM, Andrey Ryabinin wrote:
>> fuse_dev_splice_write() reads pipe->buffers to determine the size of
>> 'bufs' array before taking the pipe_lock(). This is not safe as
>> another thread might change the 'pipe->buffers' between the allocation
>> and taking the pipe_lock(). So we end up with too small 'bufs' array.
>>
>> Move the bufs allocations inside pipe_lock()/pipe_unlock() to fix this.
>>
>> Fixes: dd3bb14f44a6 ("fuse: support splice() writing to fuse device")
>> Signed-off-by: Andrey Ryabinin 
>> Cc: 
> 
> BTW, why don't we need to do the same in fuse_dev_splice_read()?
> 

do_splice() already takes the pipe_lock() before calling ->splice_read()
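
To illustrate the point, a schematic sketch of the two paths (simplified and
hedged; not verbatim fs/splice.c or fs/fuse/dev.c code):

	/* splice read: do_splice() wraps ->splice_read() in the pipe lock,
	 * so fuse_dev_splice_read() can look at pipe->buffers safely */
	pipe_lock(pipe);
	ret = do_splice_to(in, &offset, pipe, len, flags);
	pipe_unlock(pipe);

	/* splice write: ->splice_write() is entered without the lock held,
	 * so fuse_dev_splice_write() must take pipe_lock() itself before
	 * sizing the 'bufs' array from pipe->buffers (see the diff below) */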

> Thanks,
> Vlastimil
> 
>> ---
>>  fs/fuse/dev.c | 7 +--
>>  1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index c6b88fa85e2e..702592cce546 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -1944,12 +1944,15 @@ static ssize_t fuse_dev_splice_write(struct 
>> pipe_inode_info *pipe,
>>  if (!fud)
>>  return -EPERM;
>>  
>> +pipe_lock(pipe);
>> +
>>  bufs = kmalloc_array(pipe->buffers, sizeof(struct pipe_buffer),
>>   GFP_KERNEL);
>> -if (!bufs)
>> +if (!bufs) {
>> +pipe_unlock(pipe);
>>  return -ENOMEM;
>> +}
>>  
>> -pipe_lock(pipe);
>>  nbuf = 0;
>>  rem = 0;
>>  for (idx = 0; idx < pipe->nrbufs && rem < len; idx++)
>>
> 


Re: [PATCH v3] kasan: add memory corruption identification for software tag-based mode

2019-06-13 Thread Andrey Ryabinin



On 6/13/19 4:05 PM, Dmitry Vyukov wrote:
> On Thu, Jun 13, 2019 at 2:27 PM Andrey Ryabinin  
> wrote:
>> On 6/13/19 11:13 AM, Walter Wu wrote:
>>> This patch adds memory corruption identification at bug report for
>>> software tag-based mode, the report show whether it is "use-after-free"
>>> or "out-of-bound" error instead of "invalid-access" error.This will make
>>> it easier for programmers to see the memory corruption problem.
>>>
>>> Now we extend the quarantine to support both generic and tag-based kasan.
>>> For tag-based kasan, the quarantine stores only freed object information
>>> to check if an object is freed recently. When tag-based kasan reports an
>>> error, we can check if the tagged addr is in the quarantine and make a
>>> good guess if the object is more like "use-after-free" or "out-of-bound".
>>>
>>
>>
>> We already have all the information and don't need the quarantine to make 
>> such a guess.
>> Basically, if the shadow of the first byte of the object has the same tag as 
>> the tag in the pointer, then it's out-of-bounds,
>> otherwise it's use-after-free.
>>
>> In pseudo-code it's something like this:
>>
>> u8 object_tag = *(u8 *)kasan_mem_to_shadow(nearest_object(cache, page, 
>> access_addr));
>>
>> if (access_addr_tag == object_tag && object_tag != KASAN_TAG_INVALID)
>> // out-of-bounds
>> else
>> // use-after-free
> 
> But we don't have redzones in tag mode (intentionally), so unless I am
> missing something we don't have the necessary info. Both cases look
> the same -- we hit a different tag.

We always have some redzone. We need a place to store 'struct kasan_alloc_meta',
and sometimes also kasan_free_meta plus alignment to the next object.


> There may only be a small trailer for kmalloc-allocated objects that
> is painted with a different tag. I don't remember if we actually use a
> different tag for the trailer. Since tag mode granularity is 16 bytes,
> for smaller objects the trailer is impossible at all.
> 

Objects smaller than 16 bytes still have 16 bytes of kasan_alloc_meta.
Redzones and freed objects are always painted with KASAN_TAG_INVALID.
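
For clarity, a compact C rendering of the heuristic sketched above (a hedged
sketch: get_tag(), nearest_object() and the report plumbing are assumed helpers
taken from this discussion, not the code that was eventually merged):

static const char *guess_bug_type(struct kmem_cache *cache, struct page *page,
				  void *access_addr)
{
	/* tag embedded in the top byte of the accessed pointer */
	u8 ptr_tag = get_tag(access_addr);
	/* tag stored in the shadow of the nearest object's first byte */
	u8 object_tag = *(u8 *)kasan_mem_to_shadow(nearest_object(cache, page,
								  access_addr));

	/* matching live tag: the access ran past the end of a live object */
	if (ptr_tag == object_tag && object_tag != KASAN_TAG_INVALID)
		return "out-of-bounds";

	/* otherwise the object was freed (or we hit an invalid-tagged redzone) */
	return "use-after-free";
}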


Re: [PATCH v3 3/3] asm-generic, x86: Add bitops instrumentation for KASAN

2019-06-13 Thread Andrey Ryabinin



On 5/31/19 6:08 PM, Marco Elver wrote:
> This adds a new header to asm-generic to allow optionally instrumenting
> architecture-specific asm implementations of bitops.
> 
> This change includes the required change for x86 as reference and
> changes the kernel API doc to point to bitops-instrumented.h instead.
> Rationale: the functions in x86's bitops.h are no longer the kernel API
> functions, but instead the arch_ prefixed functions, which are then
> instrumented via bitops-instrumented.h.
> 
> Other architectures can similarly add support for asm implementations of
> bitops.
> 
> The documentation text was derived from x86 and existing bitops
> asm-generic versions: 1) references to x86 have been removed; 2) as a
> result, some of the text had to be reworded for clarity and consistency.
> 
> Tested: using lib/test_kasan with bitops tests (pre-requisite patch).
> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=198439
> Signed-off-by: Marco Elver 

Reviewed-by: Andrey Ryabinin 


Re: [PATCH v3 2/3] x86: Use static_cpu_has in uaccess region to avoid instrumentation

2019-06-13 Thread Andrey Ryabinin



On 5/31/19 6:08 PM, Marco Elver wrote:
> This patch is a pre-requisite for enabling KASAN bitops instrumentation;
> using static_cpu_has instead of boot_cpu_has avoids instrumentation of
> test_bit inside the uaccess region. With instrumentation, the KASAN
> check would otherwise be flagged by objtool.
> 
> For consistency, kernel/signal.c was changed to mirror this change,
> however, is never instrumented with KASAN (currently unsupported under
> x86 32bit).
> 
> Signed-off-by: Marco Elver 
> Suggested-by: H. Peter Anvin 

Reviewed-by: Andrey Ryabinin 


Re: [PATCH v3] kasan: add memory corruption identification for software tag-based mode

2019-06-13 Thread Andrey Ryabinin



On 6/13/19 11:13 AM, Walter Wu wrote:
> This patch adds memory corruption identification at bug report for
> software tag-based mode, the report show whether it is "use-after-free"
> or "out-of-bound" error instead of "invalid-access" error.This will make
> it easier for programmers to see the memory corruption problem.
> 
> Now we extend the quarantine to support both generic and tag-based kasan.
> For tag-based kasan, the quarantine stores only freed object information
> to check if an object is freed recently. When tag-based kasan reports an
> error, we can check if the tagged addr is in the quarantine and make a
> good guess if the object is more like "use-after-free" or "out-of-bound".
> 


We already have all the information and don't need the quarantine to make such 
a guess.
Basically, if the shadow of the first byte of the object has the same tag as 
the tag in the pointer, then it's out-of-bounds,
otherwise it's use-after-free.

In pseudo-code it's something like this:

u8 object_tag = *(u8 *)kasan_mem_to_shadow(nearest_object(cache, page, 
access_addr));

if (access_addr_tag == object_tag && object_tag != KASAN_TAG_INVALID)
// out-of-bounds
else
// use-after-free


Re: [PATCH v2] kasan: Initialize tag to 0xff in __kasan_kmalloc

2019-05-23 Thread Andrey Ryabinin



On 5/2/19 7:30 PM, Nathan Chancellor wrote:
> When building with -Wuninitialized and CONFIG_KASAN_SW_TAGS unset, Clang
> warns:
> 
> mm/kasan/common.c:484:40: warning: variable 'tag' is uninitialized when
> used here [-Wuninitialized]
> kasan_unpoison_shadow(set_tag(object, tag), size);
>   ^~~
> 
> set_tag ignores tag in this configuration but clang doesn't realize it
> at this point in its pipeline, as it points to arch_kasan_set_tag as
> being the point where it is used, which will later be expanded to
> (void *)(object) without a use of tag. Initialize tag to 0xff, as it
> removes this warning and doesn't change the meaning of the code.
> 
> Link: https://github.com/ClangBuiltLinux/linux/issues/465
> Signed-off-by: Nathan Chancellor 

Fixes: 7f94ffbc4c6a ("kasan: add hooks implementation for tag-based mode")
Cc: 
Reviewed-by: Andrey Ryabinin 

> ---
> 
> v1 -> v2:
> 
> * Initialize tag to 0xff at Andrey's request
> 
>  mm/kasan/common.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/kasan/common.c b/mm/kasan/common.c
> index 36afcf64e016..242fdc01aaa9 100644
> --- a/mm/kasan/common.c
> +++ b/mm/kasan/common.c
> @@ -464,7 +464,7 @@ static void *__kasan_kmalloc(struct kmem_cache *cache, 
> const void *object,
>  {
>   unsigned long redzone_start;
>   unsigned long redzone_end;
> - u8 tag;
> + u8 tag = 0xff;
>  
>   if (gfpflags_allow_blocking(flags))
>   quarantine_reduce();
> 


Re: [PATCH v3] mm/kasan: Print frame description for stack bugs

2019-05-22 Thread Andrey Ryabinin
On 5/22/19 1:00 PM, Marco Elver wrote:
> This adds support for printing stack frame description on invalid stack
> accesses. The frame description is embedded by the compiler, which is
> parsed and then pretty-printed.
> 
> Currently, we can only print the stack frame info for accesses to the
> task's own stack, but not accesses to other tasks' stacks.
> 
> Example of what it looks like:
> 
> [   17.924050] page dumped because: kasan: bad access detected
> [   17.924908]
> [   17.925153] addr 8880673ef98a is located in stack of task insmod/2008 
> at offset 106 in frame:
> [   17.926542]  kasan_stack_oob+0x0/0xf5 [test_kasan]
> [   17.927932]
> [   17.928206] this frame has 2 objects:
> [   17.928783]  [32, 36) 'i'
> [   17.928784]  [96, 106) 'stack_array'
> [   17.929216]
> [   17.930031] Memory state around the buggy address:
> 
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=198435
> Signed-off-by: Marco Elver 

Reviewed-by: Andrey Ryabinin 


Re: [PATCH v2] mm/kasan: Print frame description for stack bugs

2019-05-21 Thread Andrey Ryabinin



On 5/21/19 7:07 PM, Marco Elver wrote:
> On Tue, 21 May 2019 at 17:53, Alexander Potapenko  wrote:
>>
>> On Tue, May 21, 2019 at 5:43 PM Andrey Ryabinin  
>> wrote:
>>>
>>> On 5/20/19 6:47 PM, Marco Elver wrote:
>>>
>>>> +static void print_decoded_frame_descr(const char *frame_descr)
>>>> +{
>>>> + /*
>>>> +  * We need to parse the following string:
>>>> +  *"n alloc_1 alloc_2 ... alloc_n"
>>>> +  * where alloc_i looks like
>>>> +  *"offset size len name"
>>>> +  * or "offset size len name:line".
>>>> +  */
>>>> +
>>>> + char token[64];
>>>> + unsigned long num_objects;
>>>> +
>>>> + if (!tokenize_frame_descr(&frame_descr, token, sizeof(token),
>>>> +   &num_objects))
>>>> + return;
>>>> +
>>>> + pr_err("\n");
>>>> + pr_err("this frame has %lu %s:\n", num_objects,
>>>> +num_objects == 1 ? "object" : "objects");
>>>> +
>>>> + while (num_objects--) {
>>>> + unsigned long offset;
>>>> + unsigned long size;
>>>> +
>>>> + /* access offset */
>>>> + if (!tokenize_frame_descr(&frame_descr, token, sizeof(token),
>>>> +   &offset))
>>>> + return;
>>>> + /* access size */
>>>> + if (!tokenize_frame_descr(&frame_descr, token, sizeof(token),
>>>> +   &size))
>>>> + return;
>>>> + /* name length (unused) */
>>>> + if (!tokenize_frame_descr(&frame_descr, NULL, 0, NULL))
>>>> + return;
>>>> + /* object name */
>>>> + if (!tokenize_frame_descr(&frame_descr, token, sizeof(token),
>>>> +   NULL))
>>>> + return;
>>>> +
>>>> + /* Strip line number, if it exists. */
>>>
>>>Why?
> 
> The filename is not included, and I don't think it adds much in terms
> of ability to debug; nor is the line number included with all
> descriptions. I think, the added complexity of separating the line
> number and parsing is not worthwhile here. Alternatively, I could not
> pay attention to the line number at all, and leave it as is -- in that
> case, some variable names will display as "foo:123".
> 

Either way is fine by me. But explain why in a comment if you decide
to keep the current code.  Something like
 /* Strip line number cause it's not very helpful. */


>>>
>>>> + strreplace(token, ':', '\0');
>>>> +
>>>
>>> ...
>>>
>>>> +
>>>> + aligned_addr = round_down((unsigned long)addr, sizeof(long));
>>>> + mem_ptr = round_down(aligned_addr, KASAN_SHADOW_SCALE_SIZE);
>>>> + shadow_ptr = kasan_mem_to_shadow((void *)aligned_addr);
>>>> + shadow_bottom = kasan_mem_to_shadow(end_of_stack(current));
>>>> +
>>>> + while (shadow_ptr >= shadow_bottom && *shadow_ptr != 
>>>> KASAN_STACK_LEFT) {
>>>> + shadow_ptr--;
>>>> + mem_ptr -= KASAN_SHADOW_SCALE_SIZE;
>>>> + }
>>>> +
>>>> + while (shadow_ptr >= shadow_bottom && *shadow_ptr == 
>>>> KASAN_STACK_LEFT) {
>>>> + shadow_ptr--;
>>>> + mem_ptr -= KASAN_SHADOW_SCALE_SIZE;
>>>> + }
>>>> +
>>>
>>> I suppose this won't work if stack grows up, which is fine because it grows 
>>> up only on parisc arch.
>>> But "BUILD_BUG_ON(IS_ENABLED(CONFIG_STACK_GROUWSUP))" somewhere wouldn't 
>>> hurt.
>> Note that KASAN was broken on parisc from day 1 because of other
>> assumptions on the stack growth direction hardcoded into KASAN
>> (e.g. __kasan_unpoison_stack() and __asan_allocas_unpoison()).

It's not broken, it doesn't exist.

>> So maybe this BUILD_BUG_ON can be added in a separate patch as it's
>> not specific to what Marco is doing here?
> 

I think it's fine to add it in this patch because BUILD_BUG_ON() is just a hint
for developers that this particular function depends on the stack growing down.
So it's more a property of the function than of KASAN in general.

Other functions you mentioned can be marked with BUILD_BUG_ON()s as well, but 
not in this patch indeed.

> Happy to send a follow-up patch, or add here. Let me know what you prefer.
> 

Send v3 please.


Re: [PATCH v2] mm/kasan: Print frame description for stack bugs

2019-05-21 Thread Andrey Ryabinin



On 5/20/19 6:47 PM, Marco Elver wrote:

> +static void print_decoded_frame_descr(const char *frame_descr)
> +{
> + /*
> +  * We need to parse the following string:
> +  *"n alloc_1 alloc_2 ... alloc_n"
> +  * where alloc_i looks like
> +  *"offset size len name"
> +  * or "offset size len name:line".
> +  */
> +
> + char token[64];
> + unsigned long num_objects;
> +
> + if (!tokenize_frame_descr(&frame_descr, token, sizeof(token),
> +   &num_objects))
> + return;
> +
> + pr_err("\n");
> + pr_err("this frame has %lu %s:\n", num_objects,
> +num_objects == 1 ? "object" : "objects");
> +
> + while (num_objects--) {
> + unsigned long offset;
> + unsigned long size;
> +
> + /* access offset */
> + if (!tokenize_frame_descr(&frame_descr, token, sizeof(token),
> +   &offset))
> + return;
> + /* access size */
> + if (!tokenize_frame_descr(&frame_descr, token, sizeof(token),
> +   &size))
> + return;
> + /* name length (unused) */
> + if (!tokenize_frame_descr(&frame_descr, NULL, 0, NULL))
> + return;
> + /* object name */
> + if (!tokenize_frame_descr(&frame_descr, token, sizeof(token),
> +   NULL))
> + return;
> +
> + /* Strip line number, if it exists. */

   Why?

> + strreplace(token, ':', '\0');
> +

...

> +
> + aligned_addr = round_down((unsigned long)addr, sizeof(long));
> + mem_ptr = round_down(aligned_addr, KASAN_SHADOW_SCALE_SIZE);
> + shadow_ptr = kasan_mem_to_shadow((void *)aligned_addr);
> + shadow_bottom = kasan_mem_to_shadow(end_of_stack(current));
> +
> + while (shadow_ptr >= shadow_bottom && *shadow_ptr != KASAN_STACK_LEFT) {
> + shadow_ptr--;
> + mem_ptr -= KASAN_SHADOW_SCALE_SIZE;
> + }
> +
> + while (shadow_ptr >= shadow_bottom && *shadow_ptr == KASAN_STACK_LEFT) {
> + shadow_ptr--;
> + mem_ptr -= KASAN_SHADOW_SCALE_SIZE;
> + }
> +

I suppose this won't work if stack grows up, which is fine because it grows up 
only on parisc arch.
But "BUILD_BUG_ON(IS_ENABLED(CONFIG_STACK_GROUWSUP))" somewhere wouldn't hurt.




Re: [ubsan] f0996bc297: netperf.Throughput_total_tps -7.6% regression

2019-05-20 Thread Andrey Ryabinin



On 5/20/19 8:38 AM, kernel test robot wrote:
> Greeting,
> 
> FYI, we noticed a -7.6% regression of netperf.Throughput_total_tps due to 
> commit:
> 
> 
> commit: f0996bc2978e02d2ea898101462b960f6119b18f ("ubsan: Fix nasty 
> -Wbuiltin-declaration-mismatch GCC-9 warnings")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> 

This can't be right. First of all, the commit only makes changes to lib/ubsan.c, 
which is compiled only when CONFIG_UBSAN=y.
In your config:
# CONFIG_UBSAN is not set 

But even with UBSAN enabled, that commit doesn't change the generated 
machine code at all.


Re: [PATCH 4.9 09/62] kasan: turn on -fsanitize-address-use-after-scope

2019-05-06 Thread Andrey Ryabinin



On 5/6/19 6:10 PM, Greg Kroah-Hartman wrote:
> On Mon, May 06, 2019 at 05:55:54PM +0300, Andrey Ryabinin wrote:
>>
>>
>> On 5/6/19 5:32 PM, Greg Kroah-Hartman wrote:
>>> From: Andrey Ryabinin 
>>>
>>> commit c5caf21ab0cf884ef15b25af234f620e4a233139 upstream.
>>>
>>> In the upcoming gcc7 release, the -fsanitize=kernel-address option at
>>> first implied new -fsanitize-address-use-after-scope option.  This would
>>> cause link errors on older kernels because they don't have two new
>>> functions required for use-after-scope support.  Therefore, gcc7 changed
>>> default to -fno-sanitize-address-use-after-scope.
>>>
>>> Now the kernel has everything required for that feature since commit
>>> 828347f8f9a5 ("kasan: support use-after-scope detection").  So, to make it
>>> work, we just have to enable use-after-scope in CFLAGS.
>>>
>>> Link: 
>>> http://lkml.kernel.org/r/1481207977-28654-1-git-send-email-aryabi...@virtuozzo.com
>>> Signed-off-by: Andrey Ryabinin 
>>> Acked-by: Dmitry Vyukov 
>>> Cc: Alexander Potapenko 
>>> Cc: Andrey Konovalov 
>>> Signed-off-by: Andrew Morton 
>>> Signed-off-by: Linus Torvalds 
>>> Signed-off-by: Andrey Konovalov 
>>> Signed-off-by: Greg Kroah-Hartman 
>>>
>>> ---
>>>  scripts/Makefile.kasan |2 ++
>>>  1 file changed, 2 insertions(+)
>>>
>>> --- a/scripts/Makefile.kasan
>>> +++ b/scripts/Makefile.kasan
>>> @@ -29,6 +29,8 @@ else
>>>  endif
>>>  endif
>>>  
>>> +CFLAGS_KASAN += $(call cc-option, -fsanitize-address-use-after-scope)
>>> +
>>>  CFLAGS_KASAN_NOSANITIZE := -fno-builtin
>>>  
>>>  endif
>>>
>>>
>>
>> This shouldn't be in the -stable.
> 
> Why not?  Does no one use gcc7 with this kernel and kasan?
> 

You don't need this patch to use kasan on this kernel with gcc7.
This patch only enables detection of use-after-scope bugs. This feature 
appeared to be useless,
hence it was disabled recently by commit 7771bdbbfd3d ("kasan: remove use after 
scope bugs detection.")

The link errors mentioned in the changelog were a problem only for some period 
of time in the development branch of GCC 7.
The released GCC 7 doesn't have this problem.


Re: [PATCH 4.9 10/62] kasan: rework Kconfig settings

2019-05-06 Thread Andrey Ryabinin



On 5/6/19 5:32 PM, Greg Kroah-Hartman wrote:
> From: Arnd Bergmann 
> 
> commit e7c52b84fb18f08ce49b6067ae6285aca79084a8 upstream.
> 

This is a fix/workaround for the previous patch c5caf21ab0cf "kasan: turn on 
-fsanitize-address-use-after-scope"
which shouldn't be in the -stable. So without c5caf21ab0cf we don't need this 
one.


Re: [PATCH 4.9 09/62] kasan: turn on -fsanitize-address-use-after-scope

2019-05-06 Thread Andrey Ryabinin



On 5/6/19 5:32 PM, Greg Kroah-Hartman wrote:
> From: Andrey Ryabinin 
> 
> commit c5caf21ab0cf884ef15b25af234f620e4a233139 upstream.
> 
> In the upcoming gcc7 release, the -fsanitize=kernel-address option at
> first implied new -fsanitize-address-use-after-scope option.  This would
> cause link errors on older kernels because they don't have two new
> functions required for use-after-scope support.  Therefore, gcc7 changed
> default to -fno-sanitize-address-use-after-scope.
> 
> Now the kernel has everything required for that feature since commit
> 828347f8f9a5 ("kasan: support use-after-scope detection").  So, to make it
> work, we just have to enable use-after-scope in CFLAGS.
> 
> Link: 
> http://lkml.kernel.org/r/1481207977-28654-1-git-send-email-aryabi...@virtuozzo.com
> Signed-off-by: Andrey Ryabinin 
> Acked-by: Dmitry Vyukov 
> Cc: Alexander Potapenko 
> Cc: Andrey Konovalov 
> Signed-off-by: Andrew Morton 
> Signed-off-by: Linus Torvalds 
> Signed-off-by: Andrey Konovalov 
> Signed-off-by: Greg Kroah-Hartman 
> 
> ---
>  scripts/Makefile.kasan |2 ++
>  1 file changed, 2 insertions(+)
> 
> --- a/scripts/Makefile.kasan
> +++ b/scripts/Makefile.kasan
> @@ -29,6 +29,8 @@ else
>  endif
>  endif
>  
> +CFLAGS_KASAN += $(call cc-option, -fsanitize-address-use-after-scope)
> +
>  CFLAGS_KASAN_NOSANITIZE := -fno-builtin
>  
>  endif
> 
> 

This shouldn't be in the -stable.


[PATCH 1/2] ubsan: Fix nasty -Wbuiltin-declaration-mismatch GCC-9 warnings

2019-05-06 Thread Andrey Ryabinin
Building lib/ubsan.c with gcc-9 results in a ton of nasty warnings
like this one:
lib/ubsan.c warning: conflicting types for built-in function
 ‘__ubsan_handle_negate_overflow’; expected ‘void(void *, void *)’ 
[-Wbuiltin-declaration-mismatch]

The kernel's declarations of __ubsan_handle_*() often use 'unsigned long'
types in parameters, while GCC declares these parameters as 'void *' types,
hence the mismatch.

Fix this by using 'void *' to match GCC's declarations.
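
For illustration, the shape of the mismatch the warning describes (a sketch;
the two declarations obviously cannot coexist — the first is the prototype GCC
assumes for its built-in, per the warning above, the second is the kernel's
old declaration):

	/* GCC-9's view of the built-in: */
	void __ubsan_handle_negate_overflow(void *data, void *old_val);

	/* the kernel's old declaration: */
	void __ubsan_handle_negate_overflow(struct overflow_data *data,
					    unsigned long old_val);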

Reported-by: Linus Torvalds 
Signed-off-by: Andrey Ryabinin 
Fixes: c6d308534aef ("UBSAN: run-time undefined behavior sanity checker")
Cc: 
---
 lib/ubsan.c | 49 +++--
 1 file changed, 23 insertions(+), 26 deletions(-)

diff --git a/lib/ubsan.c b/lib/ubsan.c
index e4162f59a81c..1e9e2ab25539 100644
--- a/lib/ubsan.c
+++ b/lib/ubsan.c
@@ -86,11 +86,13 @@ static bool is_inline_int(struct type_descriptor *type)
return bits <= inline_bits;
 }
 
-static s_max get_signed_val(struct type_descriptor *type, unsigned long val)
+static s_max get_signed_val(struct type_descriptor *type, void *val)
 {
if (is_inline_int(type)) {
unsigned extra_bits = sizeof(s_max)*8 - type_bit_width(type);
-   return ((s_max)val) << extra_bits >> extra_bits;
+   unsigned long ulong_val = (unsigned long)val;
+
+   return ((s_max)ulong_val) << extra_bits >> extra_bits;
}
 
if (type_bit_width(type) == 64)
@@ -99,15 +101,15 @@ static s_max get_signed_val(struct type_descriptor *type, 
unsigned long val)
return *(s_max *)val;
 }
 
-static bool val_is_negative(struct type_descriptor *type, unsigned long val)
+static bool val_is_negative(struct type_descriptor *type, void *val)
 {
return type_is_signed(type) && get_signed_val(type, val) < 0;
 }
 
-static u_max get_unsigned_val(struct type_descriptor *type, unsigned long val)
+static u_max get_unsigned_val(struct type_descriptor *type, void *val)
 {
if (is_inline_int(type))
-   return val;
+   return (unsigned long)val;
 
if (type_bit_width(type) == 64)
return *(u64 *)val;
@@ -116,7 +118,7 @@ static u_max get_unsigned_val(struct type_descriptor *type, 
unsigned long val)
 }
 
 static void val_to_string(char *str, size_t size, struct type_descriptor *type,
-   unsigned long value)
+   void *value)
 {
if (type_is_int(type)) {
if (type_bit_width(type) == 128) {
@@ -163,8 +165,8 @@ static void ubsan_epilogue(unsigned long *flags)
current->in_ubsan--;
 }
 
-static void handle_overflow(struct overflow_data *data, unsigned long lhs,
-   unsigned long rhs, char op)
+static void handle_overflow(struct overflow_data *data, void *lhs,
+   void *rhs, char op)
 {
 
struct type_descriptor *type = data->type;
@@ -191,8 +193,7 @@ static void handle_overflow(struct overflow_data *data, 
unsigned long lhs,
 }
 
 void __ubsan_handle_add_overflow(struct overflow_data *data,
-   unsigned long lhs,
-   unsigned long rhs)
+   void *lhs, void *rhs)
 {
 
handle_overflow(data, lhs, rhs, '+');
@@ -200,23 +201,21 @@ void __ubsan_handle_add_overflow(struct overflow_data 
*data,
 EXPORT_SYMBOL(__ubsan_handle_add_overflow);
 
 void __ubsan_handle_sub_overflow(struct overflow_data *data,
-   unsigned long lhs,
-   unsigned long rhs)
+   void *lhs, void *rhs)
 {
handle_overflow(data, lhs, rhs, '-');
 }
 EXPORT_SYMBOL(__ubsan_handle_sub_overflow);
 
 void __ubsan_handle_mul_overflow(struct overflow_data *data,
-   unsigned long lhs,
-   unsigned long rhs)
+   void *lhs, void *rhs)
 {
handle_overflow(data, lhs, rhs, '*');
 }
 EXPORT_SYMBOL(__ubsan_handle_mul_overflow);
 
 void __ubsan_handle_negate_overflow(struct overflow_data *data,
-   unsigned long old_val)
+   void *old_val)
 {
unsigned long flags;
char old_val_str[VALUE_LENGTH];
@@ -237,8 +236,7 @@ EXPORT_SYMBOL(__ubsan_handle_negate_overflow);
 
 
 void __ubsan_handle_divrem_overflow(struct overflow_data *data,
-   unsigned long lhs,
-   unsigned long rhs)
+   void *lhs, void *rhs)
 {
unsigned long flags;
char rhs_val_str[VALUE_LENGTH];
@@ -323,7 +321,7 @@ static void ubsan_type_mismatch_common(struct 
type_mismatch_data_common *data,
 }
 
 void __ubsan_handle_type_mismatch(struct type_mismatch_data *data,
-   unsigned long ptr)
+   void *ptr)
 

[PATCH 2/2] ubsan: Remove vla bound checks.

2019-05-06 Thread Andrey Ryabinin
The kernel has been built with -Wvla for some time, so it is not supposed
to have any variable length arrays. Remove VLA bounds checking from ubsan
since it's useless now.

Signed-off-by: Andrey Ryabinin 
---
 lib/ubsan.c| 18 --
 lib/ubsan.h|  5 -
 scripts/Makefile.ubsan |  1 -
 3 files changed, 24 deletions(-)

diff --git a/lib/ubsan.c b/lib/ubsan.c
index 1e9e2ab25539..c4859c2d2f1b 100644
--- a/lib/ubsan.c
+++ b/lib/ubsan.c
@@ -349,24 +349,6 @@ void __ubsan_handle_type_mismatch_v1(struct 
type_mismatch_data_v1 *data,
 }
 EXPORT_SYMBOL(__ubsan_handle_type_mismatch_v1);
 
-void __ubsan_handle_vla_bound_not_positive(struct vla_bound_data *data,
-   void *bound)
-{
-   unsigned long flags;
-   char bound_str[VALUE_LENGTH];
-
-   if (suppress_report(&data->location))
-   return;
-
-   ubsan_prologue(&data->location, &flags);
-
-   val_to_string(bound_str, sizeof(bound_str), data->type, bound);
-   pr_err("variable length array bound value %s <= 0\n", bound_str);
-
-   ubsan_epilogue(&flags);
-}
-EXPORT_SYMBOL(__ubsan_handle_vla_bound_not_positive);
-
 void __ubsan_handle_out_of_bounds(struct out_of_bounds_data *data, void *index)
 {
unsigned long flags;
diff --git a/lib/ubsan.h b/lib/ubsan.h
index f4d8d0bd4016..b8fa83864467 100644
--- a/lib/ubsan.h
+++ b/lib/ubsan.h
@@ -57,11 +57,6 @@ struct nonnull_arg_data {
int arg_index;
 };
 
-struct vla_bound_data {
-   struct source_location location;
-   struct type_descriptor *type;
-};
-
 struct out_of_bounds_data {
struct source_location location;
struct type_descriptor *array_type;
diff --git a/scripts/Makefile.ubsan b/scripts/Makefile.ubsan
index 38b2b4818e8e..019771b845c5 100644
--- a/scripts/Makefile.ubsan
+++ b/scripts/Makefile.ubsan
@@ -3,7 +3,6 @@ ifdef CONFIG_UBSAN
   CFLAGS_UBSAN += $(call cc-option, -fsanitize=shift)
   CFLAGS_UBSAN += $(call cc-option, -fsanitize=integer-divide-by-zero)
   CFLAGS_UBSAN += $(call cc-option, -fsanitize=unreachable)
-  CFLAGS_UBSAN += $(call cc-option, -fsanitize=vla-bound)
   CFLAGS_UBSAN += $(call cc-option, -fsanitize=signed-integer-overflow)
   CFLAGS_UBSAN += $(call cc-option, -fsanitize=bounds)
   CFLAGS_UBSAN += $(call cc-option, -fsanitize=object-size)
-- 
2.21.0



[PATCH 2/2] mm/page_alloc: fix never set ALLOC_NOFRAGMENT flag

2019-04-23 Thread Andrey Ryabinin
Commit 0a79cdad5eb2 ("mm: use alloc_flags to record if kswapd can wake")
removed setting of the ALLOC_NOFRAGMENT flag. Bring it back.

Fixes: 0a79cdad5eb2 ("mm: use alloc_flags to record if kswapd can wake")
Signed-off-by: Andrey Ryabinin 
---
 mm/page_alloc.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2b2c7065102f..a85b8252c5ad 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3465,7 +3465,7 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
return alloc_flags;
 
if (zone_idx(zone) != ZONE_NORMAL)
-   goto out;
+   return alloc_flags;
 
/*
 * If ZONE_DMA32 exists, assume it is the one after ZONE_NORMAL and
@@ -3474,9 +3474,9 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
 */
BUILD_BUG_ON(ZONE_NORMAL - ZONE_DMA32 != 1);
if (nr_online_nodes > 1 && !populated_zone(--zone))
-   goto out;
+   return alloc_flags;
 
-out:
+   alloc_flags |= ALLOC_NOFRAGMENT;
 #endif /* CONFIG_ZONE_DMA32 */
return alloc_flags;
 }
-- 
2.21.0



[PATCH 1/2] mm/page_alloc: avoid potential NULL pointer dereference

2019-04-23 Thread Andrey Ryabinin
ac.preferred_zoneref->zone passed to alloc_flags_nofragment() can be NULL.
The 'zone' pointer is unconditionally dereferenced in alloc_flags_nofragment().
Bail out on a NULL zone to avoid a potential crash.
Currently we don't see any crashes only because alloc_flags_nofragment()
has another bug which allows the compiler to optimize away all accesses to
'zone'.

Fixes: 6bb154504f8b ("mm, page_alloc: spread allocations across zones before 
introducing fragmentation")
Signed-off-by: Andrey Ryabinin 
---
 mm/page_alloc.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 933bd42899e8..2b2c7065102f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3461,6 +3461,9 @@ alloc_flags_nofragment(struct zone *zone, gfp_t gfp_mask)
alloc_flags |= ALLOC_KSWAPD;
 
 #ifdef CONFIG_ZONE_DMA32
+   if (!zone)
+   return alloc_flags;
+
if (zone_idx(zone) != ZONE_NORMAL)
goto out;
 
-- 
2.21.0



Re: [PATCH 1/4] net/skbuff: don't waste memory reserves

2019-04-19 Thread Andrey Ryabinin
On 4/19/19 4:41 PM, Eric Dumazet wrote:
> On 04/19/2019 06:17 AM, Andrey Ryabinin wrote:
>>
> 
>> I don't see why that would be a problem. If refill failed because we didn't 
>> have
>> access to reserves, then there going to be an another refill attempt, right?
>> And the next refill attempt will be with access to the reserves if memalloc 
>> socket was created.
>> We can't predict the future, so until the memalloc socket appeared we must 
>> assume that those
>> RX ring buffers won't be used to reclaim memory (and that is actually true 
>> in 99% of cases).
>>
> 
> I just said that the alloc might be attempted "in the past"
> 
> Yes, we can not predict the future, this is why we need to access the reserve 
> _now_ and not
> at the time the packet is received.
> 
> The 'being true in 99% of cases' argument is not really convincing.
> 
> You want the NIC to be ready to receive packets even before 
> sk_memalloc_socks() becomes true.

I completely understand that we allocate pages before the socket is created.

But why is that failed allocation such a problem?

1. sk_memalloc_socks() is false
2. The NIC driver tries to allocate pages and fails
3. swapon /nfs/swap_file /* memalloc_socket created */
4. We may be losing some packets because of 2. But there will be retransmits,
and I suppose the NIC driver will try to allocate pages again, this time with
access to the reserves, so eventually we are all good.

So what is the problem? A potential loss of some packets for some period of time 
after swapon?


> If a NIC driver has a bug, please fix it.
> 

I don't follow, what kind of bug are you talking about?
The problem I'm trying to fix is entirely in __dev_alloc_pages():
1. sk_memalloc_socks() is false all the time.
2. The NIC takes pages from memory reserves.
3. The fact that we are occupying memory reserves increases the chances that
the next page will also be taken from the reserves, which means more dropped
packets than we would have without using the reserves.
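
In code, the gist of what is being proposed (a hedged sketch; the full diff is
in the "[PATCH 1/4] net/skbuff: don't waste memory reserves" mail in this
thread):

	/* __dev_alloc_pages(): only dip into the emergency reserves when a
	 * SOCK_MEMALLOC socket actually exists; otherwise any pfmemalloc skb
	 * built from these pages is dropped by sk_filter_trim_cap() anyway.
	 */
	gfp_mask |= __GFP_COMP;
	if (sk_memalloc_socks())
		gfp_mask |= __GFP_MEMALLOC;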


[PATCH 1/4] net/skbuff: don't waste memory reserves

2019-04-18 Thread Andrey Ryabinin
In some workloads we have noticed packets being dropped by
sk_filter_trim_cap() because the 'skb' was allocated from pfmemalloc
reserves:

/*
 * If the skb was allocated from pfmemalloc reserves, only
 * allow SOCK_MEMALLOC sockets to use it as this socket is
 * helping free memory
 */
if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC)) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_PFMEMALLOCDROP);
return -ENOMEM;
}

Memalloc sockets are used for stuff like swap over NBD or NFS
and only memalloc sockets can process memalloc skbs. Since we
don't have any memalloc sockets in our setups we shouldn't have
memalloc skbs either. It simply doesn't make any sense to waste
memory reserves on an skb which will be dropped anyway.

It appears that __dev_alloc_pages() unconditionally uses
__GFP_MEMALLOC, so unless the caller added __GFP_NOMEMALLOC,
__dev_alloc_pages() may dive into memory reserves.
Later, build_skb() or __skb_fill_page_desc() sets skb->pfmemalloc = 1,
so this skb is always dropped by sk_filter_trim_cap().

Instead of wasting memory reserves, we simply shouldn't use them in the
absence of memalloc sockets in the system. Do this by adding
__GFP_MEMALLOC only when such a socket is present in the system.

Fixes: 0614002bb5f7 ("netvm: propagate page->pfmemalloc from skb_alloc_page to 
skb")
Signed-off-by: Andrey Ryabinin 
---
 include/linux/skbuff.h | 17 -
 include/net/sock.h | 15 ---
 2 files changed, 16 insertions(+), 16 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a06275a618f0..676e54f84de4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2784,6 +2784,19 @@ void napi_consume_skb(struct sk_buff *skb, int budget);
 void __kfree_skb_flush(void);
 void __kfree_skb_defer(struct sk_buff *skb);
 
+#ifdef CONFIG_NET
+DECLARE_STATIC_KEY_FALSE(memalloc_socks_key);
+static inline int sk_memalloc_socks(void)
+{
+   return static_branch_unlikely(&memalloc_socks_key);
+}
+#else
+static inline int sk_memalloc_socks(void)
+{
+   return 0;
+}
+#endif
+
 /**
  * __dev_alloc_pages - allocate page for network Rx
  * @gfp_mask: allocation priority. Set __GFP_NOMEMALLOC if not for network Rx
@@ -2804,7 +2817,9 @@ static inline struct page *__dev_alloc_pages(gfp_t 
gfp_mask,
 * 4.  __GFP_MEMALLOC is ignored if __GFP_NOMEMALLOC is set due to
 * code in gfp_to_alloc_flags that should be enforcing this.
 */
-   gfp_mask |= __GFP_COMP | __GFP_MEMALLOC;
+   gfp_mask |=  __GFP_COMP;
+   if (sk_memalloc_socks())
+   gfp_mask |= __GFP_MEMALLOC;
 
return alloc_pages_node(NUMA_NO_NODE, gfp_mask, order);
 }
diff --git a/include/net/sock.h b/include/net/sock.h
index bdd77bbce7d8..5b2138d47bd8 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -838,21 +838,6 @@ static inline bool sock_flag(const struct sock *sk, enum 
sock_flags flag)
return test_bit(flag, &sk->sk_flags);
 }
 
-#ifdef CONFIG_NET
-DECLARE_STATIC_KEY_FALSE(memalloc_socks_key);
-static inline int sk_memalloc_socks(void)
-{
-   return static_branch_unlikely(&memalloc_socks_key);
-}
-#else
-
-static inline int sk_memalloc_socks(void)
-{
-   return 0;
-}
-
-#endif
-
 static inline gfp_t sk_gfp_mask(const struct sock *sk, gfp_t gfp_mask)
 {
return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
-- 
2.21.0



[PATCH 2/4] net/skbuff: warn if kmalloc_reserve() fails to allocate memory.

2019-04-18 Thread Andrey Ryabinin
Commit c93bdd0e03e8 ("netvm: allow skb allocation to use PFMEMALLOC reserves")
removed potential memory allocation failure messages in the cases when
pfmemalloc reserves are not allowed to be used. Inability to allocate an
skb usually indicates some problem, e.g. these ones:

commit 5d4c9bfbabdb ("tcp: fix potential huge kmalloc() calls in 
TCP_REPAIR")
commit 19125c1a4fd8 ("net: use skb_clone to avoid alloc_pages failure.")

It makes sense to bring the warning back to make problems more obvious,
easier to spot. Also, allocations neighboring kmalloc_reserve() often
don't add __GFP_NOWARN. For example, __alloc_skb():

skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
...
data = kmalloc_reserve(size, gfp_mask, node, );

It seems unreasonable to allow kmem_cache_alloc_node() to fail loudly but
forbid the same for the kmalloc_reserve().
So, don't add __GFP_NOWARN unless we can use fallback allocation with
__GFP_MEMALLOC.

Also remove __GFP_NOMEMALLOC when usage of memory reserves isn't allowed.
Many allocations on receive path use plain GFP_ATOMIC without adding
__GFP_NOMEMALLOC (again, see __alloc_skb()). Such allocations have greater 
chances
to succeed because plain GFP_ATOMIC can use 1/4 of memory reserves, while
GFP_ATOMIC|__GFP_NOMEMALLOC can't. So it's seems more reasonable to add
__GFP_NOMEMALLOC only if we have fallback to memory reserves.

Signed-off-by: Andrey Ryabinin 
---
 net/core/skbuff.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a083e188374f..97557554e90e 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -133,16 +133,19 @@ static void *__kmalloc_reserve(size_t size, gfp_t flags, 
int node,
   unsigned long ip, bool *pfmemalloc)
 {
void *obj;
+   gfp_t gfp_mask = flags;
bool ret_pfmemalloc = false;
+   bool pfmemalloc_allowed = gfp_pfmemalloc_allowed(flags);
 
/*
 * Try a regular allocation, when that fails and we're not entitled
 * to the reserves, fail.
 */
-   obj = kmalloc_node_track_caller(size,
-   flags | __GFP_NOMEMALLOC | __GFP_NOWARN,
-   node);
-   if (obj || !(gfp_pfmemalloc_allowed(flags)))
+   if (pfmemalloc_allowed)
+   gfp_mask |= __GFP_NOMEMALLOC | __GFP_NOWARN;
+
+   obj = kmalloc_node_track_caller(size, gfp_mask, node);
+   if (obj || !pfmemalloc_allowed)
goto out;
 
/* Try again but now we are using pfmemalloc reserves */
-- 
2.21.0



Re: [PATCH v8 3/5] arm64: replace -pg with CC_FLAGS_FTRACE in mm/kasan Makefile

2019-04-08 Thread Andrey Ryabinin



On 2/8/19 6:10 PM, Torsten Duwe wrote:
>   In preparation for arm64 supporting ftrace built on other compiler
>   options, let's have makefiles remove the $(CC_FLAGS_FTRACE)
>   flags, whatever these may be, rather than assuming '-pg'.
> 
>   There should be no functional change as a result of this patch.
> 
> Signed-off-by: Torsten Duwe 
> 

Acked-by: Andrey Ryabinin 



Re: [PATCH] ubsan: Avoid unnecessary 128-bit shifts

2019-04-02 Thread Andrey Ryabinin



On 4/1/19 11:57 AM, George Spelvin wrote:
> Double-word sign-extending shifts by a variable amount are a
> non-trivial amount of code and complexity.  Doing signed long shifts
> before the cast to (s_max) greatly simplifies the object code.
> 
> (Yes, I know "signed" is redundant.  It's there for emphasis.)
> 
> The complex issue raised by this patch is that allows s390 (at
> least) to enable CONFIG_ARCH_SUPPORTS_INT128.
> 
> If you enable that option, s_max becomes 128 bits, and gcc compiles
> the pre-patch code with a call to __ashrti3.  (And, on some gcc
> versions, __ashlti3.)  Which isn't implemented, ergo link error.
> 
> Enabling that option allows 64-bit widening multiplies which
> greatly simplify a lot of timestamp scaling code in the kernel,
> so it's desirable.
> 
> But how to get there?
> 
> One option is to implement __ashrti3 on the platforms that need it.
> But I'm inclined to *not* do so, because it's inefficient, rare,
> and avoidable.  This patch fixes the sole instance in the entire
> kernel, which will make that implementation dead code, and I think
> its absence will encourage Don't Do That, Then going forward.
> 
> But if we don't implement it, we've created an awkward dependency
> between patches in different subsystems, and that needs handling.
> 
> Option 1: Submit this for 5.2 and turn on INT128 for s390 in 5.3.
> Option 2: Let the arches cherry-pick this patch pre-5.2.
> 
> My preference is for option 2, but that requires permission from
> ubsan's owner.  Andrey?
> 

Fine by me:
Acked-by: Andrey Ryabinin  

> Signed-off-by: George Spelvin 
> Cc: Andrey Ryabinin 
> Cc: linux-s...@vger.kernel.org
> Cc: Heiko Carstens 
> ---
>  lib/ubsan.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/ubsan.c b/lib/ubsan.c
> index e4162f59a81c..43ce177a5ca7 100644
> --- a/lib/ubsan.c
> +++ b/lib/ubsan.c
> @@ -89,8 +89,8 @@ static bool is_inline_int(struct type_descriptor *type)
>  static s_max get_signed_val(struct type_descriptor *type, unsigned long val)
>  {
>   if (is_inline_int(type)) {
> - unsigned extra_bits = sizeof(s_max)*8 - type_bit_width(type);
> - return ((s_max)val) << extra_bits >> extra_bits;
> + unsigned extra_bits = sizeof(val)*8 - type_bit_width(type);
> + return (s_max)((signed long)val << extra_bits >> extra_bits);

Cast to s_max is redundant.  
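
That is, a hedged sketch of the suggested form (the implicit conversion of the
return value to s_max already sign-extends):

	unsigned extra_bits = sizeof(val)*8 - type_bit_width(type);
	return (signed long)val << extra_bits >> extra_bits;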

>   }
>  
>   if (type_bit_width(type) == 64)
> 


Re: [PATCH 18/20] objtool: Add UACCESS validation

2019-03-07 Thread Andrey Ryabinin



On 3/7/19 11:33 PM, Peter Zijlstra wrote:
> On Thu, Mar 07, 2019 at 11:15:42PM +0300, Andrey Ryabinin wrote:
>>
>>
>> On 3/7/19 8:41 PM, Peter Zijlstra wrote:
>>> On Thu, Mar 07, 2019 at 08:33:26AM -0800, Linus Torvalds wrote:
>>>> On Thu, Mar 7, 2019 at 3:52 AM Peter Zijlstra  wrote:
>>>>>
>>>>> XXX: are we sure we want __memset marked AC-safe?
>>>>
>>>> It's certainly one of the safer functions to call with AC set, but it
>>>> sounds wrong anyway. It's not like it's likely to leak kernel data
>>>> (most memset's are with 0, and even the non-zero ones I can't imagine
>>>> are sensitive - more like poison values etc).
>>>>
>>>> What's the call site that made you go "just add __memset() to the list"?
>>>
>>> __asan_{,un}poinson_stack_memory()
>>
>> These two can be called only with CONFIG_KASAN_EXTRA=y which 
>> was removed very recently, so it should be safe to delete these functions.
> 
> Ooh shiny. Clearly my tree still has them; what commit do I need to look
> for?
> 


commit 7771bdbbfd3d6f204631b6fd9e1bbc30cd15918e


Re: [PATCH 18/20] objtool: Add UACCESS validation

2019-03-07 Thread Andrey Ryabinin



On 3/7/19 8:41 PM, Peter Zijlstra wrote:
> On Thu, Mar 07, 2019 at 08:33:26AM -0800, Linus Torvalds wrote:
>> On Thu, Mar 7, 2019 at 3:52 AM Peter Zijlstra  wrote:
>>>
>>> XXX: are we sure we want __memset marked AC-safe?
>>
>> It's certainly one of the safer functions to call with AC set, but it
>> sounds wrong anyway. It's not like it's likely to leak kernel data
>> (most memset's are with 0, and even the non-zero ones I can't imagine
>> are sensitive - more like poison values etc).
>>
>> What's the call site that made you go "just add __memset() to the list"?
> 
> __asan_{,un}poinson_stack_memory()

These two can be called only with CONFIG_KASAN_EXTRA=y which 
was removed very recently, so it should be safe to delete these functions.

>   kasan_{,un}poison_shadow()
> __memset()
> 
> 


Re: [PATCH] kasan: fix coccinelle warnings in kasan_p*_table

2019-03-05 Thread Andrey Ryabinin



On 3/4/19 8:04 PM, Andrey Konovalov wrote:
> kasan_p4d_table, kasan_pmd_table and kasan_pud_table are declared as
> returning bool, but return 0 instead of false, which produces a coccinelle
> warning. Fix it.
> 
> Fixes: 0207df4fa1a8 ("kernel/memremap, kasan: make ZONE_DEVICE with work with 
> KASAN")
> Reported-by: kbuild test robot 
> Signed-off-by: Andrey Konovalov 
> ---

Acked-by: Andrey Ryabinin 


>  mm/kasan/init.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/kasan/init.c b/mm/kasan/init.c
> index 45a1b5e38e1e..fcaa1ca03175 100644
> --- a/mm/kasan/init.c
> +++ b/mm/kasan/init.c
> @@ -42,7 +42,7 @@ static inline bool kasan_p4d_table(pgd_t pgd)
>  #else
>  static inline bool kasan_p4d_table(pgd_t pgd)
>  {
> - return 0;
> + return false;
>  }
>  #endif
>  #if CONFIG_PGTABLE_LEVELS > 3
> @@ -54,7 +54,7 @@ static inline bool kasan_pud_table(p4d_t p4d)
>  #else
>  static inline bool kasan_pud_table(p4d_t p4d)
>  {
> - return 0;
> + return false;
>  }
>  #endif
>  #if CONFIG_PGTABLE_LEVELS > 2
> @@ -66,7 +66,7 @@ static inline bool kasan_pmd_table(pud_t pud)
>  #else
>  static inline bool kasan_pmd_table(pud_t pud)
>  {
> - return 0;
> + return false;
>  }
>  #endif
>  pte_t kasan_early_shadow_pte[PTRS_PER_PTE] __page_aligned_bss;
> 


Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.

2019-03-04 Thread Andrey Ryabinin



On 3/2/19 1:20 AM, Johannes Weiner wrote:
> On Fri, Mar 01, 2019 at 10:46:34PM +0300, Andrey Ryabinin wrote:
>> On 3/1/19 8:49 PM, Johannes Weiner wrote:
>>> On Fri, Mar 01, 2019 at 01:38:26PM +0300, Andrey Ryabinin wrote:
>>>> On 2/26/19 3:50 PM, Andrey Ryabinin wrote:
>>>>> On 2/22/19 10:15 PM, Johannes Weiner wrote:
>>>>>> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
>>>>>>> In a presence of more than 1 memory cgroup in the system our reclaim
>>>>>>> logic is just suck. When we hit memory limit (global or a limit on
>>>>>>> cgroup with subgroups) we reclaim some memory from all cgroups.
>>>>>>> This is sucks because, the cgroup that allocates more often always wins.
>>>>>>> E.g. job that allocates a lot of clean rarely used page cache will push
>>>>>>> out of memory other jobs with active relatively small all in memory
>>>>>>> working set.
>>>>>>>
>>>>>>> To prevent such situations we have memcg controls like low/max, etc 
>>>>>>> which
>>>>>>> are supposed to protect jobs or limit them so they to not hurt others.
>>>>>>> But memory cgroups are very hard to configure right because it requires
>>>>>>> precise knowledge of the workload which may vary during the execution.
>>>>>>> E.g. setting memory limit means that job won't be able to use all memory
>>>>>>> in the system for page cache even if the rest the system is idle.
>>>>>>> Basically our current scheme requires to configure every single cgroup
>>>>>>> in the system.
>>>>>>>
>>>>>>> I think we can do better. The idea proposed by this patch is to reclaim
>>>>>>> only inactive pages and only from cgroups that have big
>>>>>>> (!inactive_is_low()) inactive list. And go back to shrinking active 
>>>>>>> lists
>>>>>>> only if all inactive lists are low.
>>>>>>
>>>>>> Yes, you are absolutely right.
>>>>>>
>>>>>> We shouldn't go after active pages as long as there are plenty of
>>>>>> inactive pages around. That's the global reclaim policy, and we
>>>>>> currently fail to translate that well to cgrouped systems.
>>>>>>
>>>>>> Setting group protections or limits would work around this problem,
>>>>>> but they're kind of a red herring. We shouldn't ever allow use-once
>>>>>> streams to push out hot workingsets, that's a bug.
>>>>>>
>>>>>>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec 
>>>>>>> *lruvec, struct mem_cgroup *memcg,
>>>>>>>  
>>>>>>> scan >>= sc->priority;
>>>>>>>  
>>>>>>> +   if (!sc->may_shrink_active && 
>>>>>>> inactive_list_is_low(lruvec,
>>>>>>> +   file, memcg, sc, false))
>>>>>>> +   scan = 0;
>>>>>>> +
>>>>>>> /*
>>>>>>>  * If the cgroup's already been deleted, make sure to
>>>>>>>  * scrape out the remaining cache.
>>>>>>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct 
>>>>>>> scan_control *sc)
>>>>>>> struct reclaim_state *reclaim_state = current->reclaim_state;
>>>>>>> unsigned long nr_reclaimed, nr_scanned;
>>>>>>> bool reclaimable = false;
>>>>>>> +   bool retry;
>>>>>>>  
>>>>>>> do {
>>>>>>> struct mem_cgroup *root = sc->target_mem_cgroup;
>>>>>>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct 
>>>>>>> scan_control *sc)
>>>>>>> };
>>>>>>> struct mem_cgroup *memcg;
>>>>>>>  
>>>>>>> +   retry = false;
>>>>>>> +
>>>>>>> memset(&sc->nr, 0, sizeof(sc->nr));
>>>>>>>  
>>>&

Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.

2019-03-01 Thread Andrey Ryabinin



On 3/1/19 8:49 PM, Johannes Weiner wrote:
> Hello Andrey,
> 
> On Fri, Mar 01, 2019 at 01:38:26PM +0300, Andrey Ryabinin wrote:
>> On 2/26/19 3:50 PM, Andrey Ryabinin wrote:
>>> On 2/22/19 10:15 PM, Johannes Weiner wrote:
>>>> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
>>>>> In a presence of more than 1 memory cgroup in the system our reclaim
>>>>> logic is just suck. When we hit memory limit (global or a limit on
>>>>> cgroup with subgroups) we reclaim some memory from all cgroups.
>>>>> This is sucks because, the cgroup that allocates more often always wins.
>>>>> E.g. job that allocates a lot of clean rarely used page cache will push
>>>>> out of memory other jobs with active relatively small all in memory
>>>>> working set.
>>>>>
>>>>> To prevent such situations we have memcg controls like low/max, etc which
>>>>> are supposed to protect jobs or limit them so they to not hurt others.
>>>>> But memory cgroups are very hard to configure right because it requires
>>>>> precise knowledge of the workload which may vary during the execution.
>>>>> E.g. setting memory limit means that job won't be able to use all memory
>>>>> in the system for page cache even if the rest the system is idle.
>>>>> Basically our current scheme requires to configure every single cgroup
>>>>> in the system.
>>>>>
>>>>> I think we can do better. The idea proposed by this patch is to reclaim
>>>>> only inactive pages and only from cgroups that have big
>>>>> (!inactive_is_low()) inactive list. And go back to shrinking active lists
>>>>> only if all inactive lists are low.
>>>>
>>>> Yes, you are absolutely right.
>>>>
>>>> We shouldn't go after active pages as long as there are plenty of
>>>> inactive pages around. That's the global reclaim policy, and we
>>>> currently fail to translate that well to cgrouped systems.
>>>>
>>>> Setting group protections or limits would work around this problem,
>>>> but they're kind of a red herring. We shouldn't ever allow use-once
>>>> streams to push out hot workingsets, that's a bug.
>>>>
>>>>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, 
>>>>> struct mem_cgroup *memcg,
>>>>>  
>>>>>   scan >>= sc->priority;
>>>>>  
>>>>> + if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
>>>>> + file, memcg, sc, false))
>>>>> + scan = 0;
>>>>> +
>>>>>   /*
>>>>>* If the cgroup's already been deleted, make sure to
>>>>>* scrape out the remaining cache.
>>>>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct 
>>>>> scan_control *sc)
>>>>>   struct reclaim_state *reclaim_state = current->reclaim_state;
>>>>>   unsigned long nr_reclaimed, nr_scanned;
>>>>>   bool reclaimable = false;
>>>>> + bool retry;
>>>>>  
>>>>>   do {
>>>>>   struct mem_cgroup *root = sc->target_mem_cgroup;
>>>>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct 
>>>>> scan_control *sc)
>>>>>   };
>>>>>   struct mem_cgroup *memcg;
>>>>>  
>>>>> + retry = false;
>>>>> +
>>>>>   memset(&sc->nr, 0, sizeof(sc->nr));
>>>>>  
>>>>>   nr_reclaimed = sc->nr_reclaimed;
>>>>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct 
>>>>> scan_control *sc)
>>>>>   }
>>>>>   } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>>>>>  
>>>>> + if ((sc->nr_scanned - nr_scanned) == 0 &&
>>>>> +  !sc->may_shrink_active) {
>>>>> + sc->may_shrink_active = 1;
>>>>> + retry = true;
>>>>> + continue;
>>>>> + }
>>>>
>>>> Using !scanned as the gate could be a problem. There might be a cgroup
>>>> that has inactive

[PATCH] mm-remove-zone_lru_lock-function-access-lru_lock-directly-fix

2019-03-01 Thread Andrey Ryabinin
A slightly better version of __split_huge_page();

Signed-off-by: Andrey Ryabinin 
Cc: Vlastimil Babka 
Cc: Mel Gorman 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Rik van Riel 
Cc: William Kucharski 
Cc: John Hubbard 
---
 mm/huge_memory.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 4ccac6b32d49..fcf657886b4b 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2440,11 +2440,11 @@ static void __split_huge_page(struct page *page, struct 
list_head *list,
pgoff_t end, unsigned long flags)
 {
struct page *head = compound_head(page);
-   struct zone *zone = page_zone(head);
+   pg_data_t *pgdat = page_pgdat(head);
struct lruvec *lruvec;
int i;
 
-   lruvec = mem_cgroup_page_lruvec(head, zone->zone_pgdat);
+   lruvec = mem_cgroup_page_lruvec(head, pgdat);
 
/* complete memcg works before add pages to LRU */
mem_cgroup_split_huge_fixup(head);
@@ -2475,7 +2475,7 @@ static void __split_huge_page(struct page *page, struct 
list_head *list,
xa_unlock(&head->mapping->i_pages);
}
 
-   spin_unlock_irqrestore(&page_pgdat(head)->lru_lock, flags);
+   spin_unlock_irqrestore(&pgdat->lru_lock, flags);
 
remap_page(head);
 
-- 
2.19.2



Re: [PATCH v2 2/4] mm: remove zone_lru_lock() function access ->lru_lock directly

2019-03-01 Thread Andrey Ryabinin



On 3/1/19 12:44 AM, John Hubbard wrote:
> On 2/28/19 12:33 AM, Andrey Ryabinin wrote:
>> We have a common pattern to access lru_lock from a page pointer:
>>  zone_lru_lock(page_zone(page))
>>
>> Which is silly, because it unfolds to this:
>>  
>> &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]->zone_pgdat->lru_lock
>> while we can simply do
>>  &NODE_DATA(page_to_nid(page))->lru_lock
>>
> 
> Hi Andrey,
> 
> Nice. I like it so much that I immediately want to tweak it. :)
> 
> 
>> Remove zone_lru_lock() function, since it's only complicate things.
>> Use 'page_pgdat(page)->lru_lock' pattern instead.
> 
> Here, I think the zone_lru_lock() is actually a nice way to add
> a touch of clarity at the call sites. How about, see below:
> 
> [snip]
> 
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 2fd4247262e9..22423763c0bd 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -788,10 +788,6 @@ typedef struct pglist_data {
>>  
>>  #define node_start_pfn(nid) (NODE_DATA(nid)->node_start_pfn)
>>  #define node_end_pfn(nid) pgdat_end_pfn(NODE_DATA(nid))
>> -static inline spinlock_t *zone_lru_lock(struct zone *zone)
>> -{
>> -return &zone->zone_pgdat->lru_lock;
>> -}
>>  
> 
> Instead of removing that function, let's change it, and add another
> (since you have two cases: either a page* or a pgdat* is available),
> and move it to where it can compile, like this:
> 
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 80bb6408fe73..cea3437f5d68 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1167,6 +1167,16 @@ static inline pg_data_t *page_pgdat(const struct page 
> *page)
> return NODE_DATA(page_to_nid(page));
>  }
>  
> +static inline spinlock_t *zone_lru_lock(pg_data_t *pgdat)
> +{
> +   return &pgdat->lru_lock;
> +}
> +


I don't think a wrapper for a simple plain access to a struct member is 
reasonable.
Besides, there are plenty of "spin_lock(&pgdat->lru_lock)" calls even without 
this patch,
so for consistency reasons &pgdat->lru_lock seems like a better choice to me.

Also "&pgdat->lru_lock" is just shorter than:
  "node_lru_lock(pgdat)"



> +static inline spinlock_t *zone_lru_lock_from_page(struct page *page)
> +{
> +   return zone_lru_lock(page_pgdat(page));
> +}
> +

I don't think such a function would find any use. Usually lru_lock is taken
to perform some manipulations with the page *and* pgdat, thus it's better to
remember page_pgdat(page) in a local variable, as in the sketch below.
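
A short illustrative sketch of that pattern, assuming the helpers of that
era's mm code (mem_cgroup_page_lruvec(), del_page_from_lru_list(), page_lru());
not code from the patch itself:

static void isolate_lru_page_sketch(struct page *page, struct list_head *list)
{
	pg_data_t *pgdat = page_pgdat(page);	/* remembered once */
	struct lruvec *lruvec;

	spin_lock_irq(&pgdat->lru_lock);
	lruvec = mem_cgroup_page_lruvec(page, pgdat);
	del_page_from_lru_list(page, lruvec, page_lru(page));
	list_add(&page->lru, list);
	spin_unlock_irq(&pgdat->lru_lock);
}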


Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.

2019-03-01 Thread Andrey Ryabinin



On 2/26/19 3:50 PM, Andrey Ryabinin wrote:
> 
> 
> On 2/22/19 10:15 PM, Johannes Weiner wrote:
>> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
>>> In the presence of more than one memory cgroup in the system our reclaim
>>> logic just sucks. When we hit a memory limit (global or a limit on a
>>> cgroup with subgroups) we reclaim some memory from all cgroups.
>>> This sucks because the cgroup that allocates more often always wins.
>>> E.g. a job that allocates a lot of clean, rarely used page cache will push
>>> out of memory other jobs with an active, relatively small, all-in-memory
>>> working set.
>>>
>>> To prevent such situations we have memcg controls like low/max, etc. which
>>> are supposed to protect jobs or limit them so they do not hurt others.
>>> But memory cgroups are very hard to configure right because it requires
>>> precise knowledge of the workload, which may vary during execution.
>>> E.g. setting a memory limit means that the job won't be able to use all
>>> memory in the system for page cache even if the rest of the system is idle.
>>> Basically our current scheme requires configuring every single cgroup
>>> in the system.
>>>
>>> I think we can do better. The idea proposed by this patch is to reclaim
>>> only inactive pages and only from cgroups that have a big
>>> (!inactive_is_low()) inactive list, and to go back to shrinking active
>>> lists only if all inactive lists are low.
>>
>> Yes, you are absolutely right.
>>
>> We shouldn't go after active pages as long as there are plenty of
>> inactive pages around. That's the global reclaim policy, and we
>> currently fail to translate that well to cgrouped systems.
>>
>> Setting group protections or limits would work around this problem,
>> but they're kind of a red herring. We shouldn't ever allow use-once
>> streams to push out hot workingsets, that's a bug.
>>
>>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, 
>>> struct mem_cgroup *memcg,
>>>  
>>> scan >>= sc->priority;
>>>  
>>> +   if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
>>> +   file, memcg, sc, false))
>>> +   scan = 0;
>>> +
>>> /*
>>>  * If the cgroup's already been deleted, make sure to
>>>  * scrape out the remaining cache.
>>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct 
>>> scan_control *sc)
>>> struct reclaim_state *reclaim_state = current->reclaim_state;
>>> unsigned long nr_reclaimed, nr_scanned;
>>> bool reclaimable = false;
>>> +   bool retry;
>>>  
>>> do {
>>> struct mem_cgroup *root = sc->target_mem_cgroup;
>>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct 
>>> scan_control *sc)
>>> };
>>> struct mem_cgroup *memcg;
>>>  
>>> +   retry = false;
>>> +
>>> memset(&sc->nr, 0, sizeof(sc->nr));
>>>  
>>> nr_reclaimed = sc->nr_reclaimed;
>>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct 
>>> scan_control *sc)
>>> }
>>> } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>>>  
>>> +   if ((sc->nr_scanned - nr_scanned) == 0 &&
>>> +!sc->may_shrink_active) {
>>> +   sc->may_shrink_active = 1;
>>> +   retry = true;
>>> +   continue;
>>> +   }
>>
>> Using !scanned as the gate could be a problem. There might be a cgroup
>> that has inactive pages on the local level, but when viewed from the
>> system level the total inactive pages in the system might still be low
>> compared to active ones. In that case we should go after active pages.
>>
>> Basically, during global reclaim, the answer for whether active pages
>> should be scanned or not should be the same regardless of whether the
>> memory is all global or whether it's spread out between cgroups.
>>
>> The reason this isn't the case is because we're checking the ratio at
>> the lruvec level - which is the highest level (and identical to the
>> node counters) when memory is global, but it's at the lowest level
>> when memory is cgrouped.

Re: [PATCH 1/8] kasan,x86: Frob kasan_report() in an exception

2019-02-28 Thread Andrey Ryabinin



On 2/28/19 6:52 PM, Dmitry Vyukov wrote:
> On Thu, Feb 28, 2019 at 4:46 PM Peter Zijlstra  wrote:
>>
>> On Thu, Feb 28, 2019 at 04:22:04PM +0100, Dmitry Vyukov wrote:
>>> On Thu, Feb 28, 2019 at 4:05 PM Peter Zijlstra  wrote:

 Because __asan_{load,store}{N,1,2,4,8,16}_noabort() get called from
 UACCESS context, and kasan_report() is most definitely _NOT_ safe to
 be called from there, move it into an exception much like BUG/WARN.

 *compile tested only*
>>>
>>>
>>> Please test it by booting KASAN kernel and then loading module
>>> produced by CONFIG_TEST_KASAN=y. There are too many subtle aspects to
>>> rely on "compile tested only", reviewers can't catch all of them
>>> either.
>>
>> Sure, I'll do that. I just wanted to share the rest of the patches.
>>
>> A quick test shows it dies _REALLY_ early, as in:
>>
>> "Booting the kernel."
>>
>> is the first and very last thing it says... I wonder how I did that :-)
> 
> One thing is that during early boot kasan_report is called multiple
> times, but these are false positives related to the fact that we don't
> have a proper shadow yet (setup later). So during early boot we set
> kasan_disable=1 (or some global or per-task flag), and then
> kasan_report checks it and returns.
> Once we setup proper shadow, the flag is reset and from now on
> kasan_report actually reports bug.
> 

Yup, see the report_enabled() function.
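
For reference, the gate works roughly like this (a simplified sketch from
memory, not the exact mm/kasan/report.c code; example_kasan_ready stands in
for the "proper shadow is set up" flag described above):

static bool example_kasan_ready;	/* set once the real shadow is installed */

static bool example_report_enabled(void)
{
	/* suppress reports until the proper shadow memory is set up ... */
	if (!READ_ONCE(example_kasan_ready))
		return false;
	/* ... and while the current task has reports disabled */
	if (current->kasan_depth)
		return false;
	return true;
}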


[PATCH v2 4/4] mm/vmscan: remove unused lru_pages argument

2019-02-28 Thread Andrey Ryabinin
Since commit 9092c71bb724 ("mm: use sc->priority for slab shrink targets")
the argument 'unsigned long *lru_pages' has been passed around with no purpose.
Remove it.

Signed-off-by: Andrey Ryabinin 
Acked-by: Johannes Weiner 
Acked-by: Vlastimil Babka 
Cc: Michal Hocko 
Cc: Rik van Riel 
Cc: Mel Gorman 
---

Changes since v1:
 - Changelog update
 - Added acks

 mm/vmscan.c | 17 +
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2d081a32c6a8..07f74e9507b6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2257,8 +2257,7 @@ enum scan_balance {
  * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
  */
 static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
-  struct scan_control *sc, unsigned long *nr,
-  unsigned long *lru_pages)
+  struct scan_control *sc, unsigned long *nr)
 {
int swappiness = mem_cgroup_swappiness(memcg);
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
@@ -2409,7 +2408,6 @@ static void get_scan_count(struct lruvec *lruvec, struct 
mem_cgroup *memcg,
fraction[1] = fp;
denominator = ap + fp + 1;
 out:
-   *lru_pages = 0;
for_each_evictable_lru(lru) {
int file = is_file_lru(lru);
unsigned long lruvec_size;
@@ -2525,7 +2523,6 @@ static void get_scan_count(struct lruvec *lruvec, struct 
mem_cgroup *memcg,
BUG();
}
 
-   *lru_pages += lruvec_size;
nr[lru] = scan;
}
 }
@@ -2534,7 +2531,7 @@ static void get_scan_count(struct lruvec *lruvec, struct 
mem_cgroup *memcg,
  * This is a basic per-node page freer.  Used by both kswapd and direct 
reclaim.
  */
 static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup 
*memcg,
- struct scan_control *sc, unsigned long *lru_pages)
+ struct scan_control *sc)
 {
struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
unsigned long nr[NR_LRU_LISTS];
@@ -2546,7 +2543,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, 
struct mem_cgroup *memc
struct blk_plug plug;
bool scan_adjusted;
 
-   get_scan_count(lruvec, memcg, sc, nr, lru_pages);
+   get_scan_count(lruvec, memcg, sc, nr);
 
/* Record the original scan target for proportional adjustments later */
memcpy(targets, nr, sizeof(nr));
@@ -2751,7 +2748,6 @@ static bool shrink_node(pg_data_t *pgdat, struct 
scan_control *sc)
.pgdat = pgdat,
.priority = sc->priority,
};
-   unsigned long node_lru_pages = 0;
struct mem_cgroup *memcg;
 
memset(&sc->nr, 0, sizeof(sc->nr));
@@ -2761,7 +2757,6 @@ static bool shrink_node(pg_data_t *pgdat, struct 
scan_control *sc)
 
memcg = mem_cgroup_iter(root, NULL, &reclaim);
do {
-   unsigned long lru_pages;
unsigned long reclaimed;
unsigned long scanned;
 
@@ -2798,8 +2793,7 @@ static bool shrink_node(pg_data_t *pgdat, struct 
scan_control *sc)
 
reclaimed = sc->nr_reclaimed;
scanned = sc->nr_scanned;
-   shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
-   node_lru_pages += lru_pages;
+   shrink_node_memcg(pgdat, memcg, sc);
 
if (sc->may_shrinkslab) {
shrink_slab(sc->gfp_mask, pgdat->node_id,
@@ -3332,7 +3326,6 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup 
*memcg,
.may_swap = !noswap,
.may_shrinkslab = 1,
};
-   unsigned long lru_pages;
 
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
@@ -3349,7 +3342,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup 
*memcg,
 * will pick up pages from other mem cgroup's as well. We hack
 * the priority and make it zero.
 */
-   shrink_node_memcg(pgdat, memcg, &sc, &lru_pages);
+   shrink_node_memcg(pgdat, memcg, &sc);
 
trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
 
-- 
2.19.2



[PATCH v2 2/4] mm: remove zone_lru_lock() function access ->lru_lock directly

2019-02-28 Thread Andrey Ryabinin
We have a common pattern to access lru_lock from a page pointer:
zone_lru_lock(page_zone(page))

Which is silly, because it unfolds to this:

&NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]->zone_pgdat->lru_lock
while we can simply do
&NODE_DATA(page_to_nid(page))->lru_lock

Remove the zone_lru_lock() function, since it only complicates things.
Use the 'page_pgdat(page)->lru_lock' pattern instead.

Signed-off-by: Andrey Ryabinin 
Acked-by: Vlastimil Babka 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Rik van Riel 
Cc: Mel Gorman 
---

Changes since v1:
 - Added ack.

 Documentation/cgroup-v1/memcg_test.txt |  4 ++--
 Documentation/cgroup-v1/memory.txt |  4 ++--
 include/linux/mm_types.h   |  2 +-
 include/linux/mmzone.h |  4 
 mm/compaction.c| 15 ---
 mm/filemap.c   |  4 ++--
 mm/huge_memory.c   |  6 +++---
 mm/memcontrol.c| 14 +++---
 mm/mlock.c | 14 +++---
 mm/page_idle.c |  8 
 mm/rmap.c  |  2 +-
 mm/swap.c  | 16 
 mm/vmscan.c| 16 
 13 files changed, 53 insertions(+), 56 deletions(-)

diff --git a/Documentation/cgroup-v1/memcg_test.txt 
b/Documentation/cgroup-v1/memcg_test.txt
index 5c7f310f32bb..621e29ffb358 100644
--- a/Documentation/cgroup-v1/memcg_test.txt
+++ b/Documentation/cgroup-v1/memcg_test.txt
@@ -107,9 +107,9 @@ Under below explanation, we assume 
CONFIG_MEM_RES_CTRL_SWAP=y.
 
 8. LRU
 Each memcg has its own private LRU. Now, its handling is under global
-   VM's control (means that it's handled under global zone_lru_lock).
+   VM's control (means that it's handled under global pgdat->lru_lock).
Almost all routines around memcg's LRU is called by global LRU's
-   list management functions under zone_lru_lock().
+   list management functions under pgdat->lru_lock.
 
A special function is mem_cgroup_isolate_pages(). This scans
memcg's private LRU and call __isolate_lru_page() to extract a page
diff --git a/Documentation/cgroup-v1/memory.txt 
b/Documentation/cgroup-v1/memory.txt
index 8e2cb1dabeb0..a33cedf85427 100644
--- a/Documentation/cgroup-v1/memory.txt
+++ b/Documentation/cgroup-v1/memory.txt
@@ -267,11 +267,11 @@ When oom event notifier is registered, event will be 
delivered.
Other lock order is following:
PG_locked.
mm->page_table_lock
-   zone_lru_lock
+   pgdat->lru_lock
  lock_page_cgroup.
   In many cases, just lock_page_cgroup() is called.
   per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
-  zone_lru_lock, it has no lock of its own.
+  pgdat->lru_lock, it has no lock of its own.
 
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fe0f672f53ce..9b9dd8350e26 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -79,7 +79,7 @@ struct page {
struct {/* Page cache and anonymous pages */
/**
 * @lru: Pageout list, eg. active_list protected by
-* zone_lru_lock.  Sometimes used as a generic list
+* pgdat->lru_lock.  Sometimes used as a generic list
 * by the page owner.
 */
struct list_head lru;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2fd4247262e9..22423763c0bd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -788,10 +788,6 @@ typedef struct pglist_data {
 
 #define node_start_pfn(nid)(NODE_DATA(nid)->node_start_pfn)
 #define node_end_pfn(nid) pgdat_end_pfn(NODE_DATA(nid))
-static inline spinlock_t *zone_lru_lock(struct zone *zone)
-{
-   return &zone->zone_pgdat->lru_lock;
-}
 
 static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
 {
diff --git a/mm/compaction.c b/mm/compaction.c
index 98f99f41dfdc..a3305f13a138 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -775,6 +775,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
unsigned long end_pfn, isolate_mode_t isolate_mode)
 {
struct zone *zone = cc->zone;
+   pg_data_t *pgdat = zone->zone_pgdat;
unsigned long nr_scanned = 0, nr_isolated = 0;
struct lruvec *lruvec;
unsigned long flags = 0;
@@ -839,8 +840,8 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 * if contended.
 */
if (!(low_pfn % SWAP_CLUSTER_MAX)
-   && compact_unlock_should_abort(zone_lru_lock(zone), flags,
-   

[PATCH v2 3/4] mm/compaction: pass pgdat to too_many_isolated() instead of zone

2019-02-28 Thread Andrey Ryabinin
too_many_isolated() in mm/compaction.c looks only at node state,
so it makes more sense to change its argument to pgdat instead of zone.

Signed-off-by: Andrey Ryabinin 
Acked-by: Vlastimil Babka 
Acked-by: Rik van Riel 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Mel Gorman 
---

Changes since v1:
 - Added acks

 mm/compaction.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index a3305f13a138..b2d02aba41d8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -738,16 +738,16 @@ isolate_freepages_range(struct compact_control *cc,
 }
 
 /* Similar to reclaim, but different enough that they don't share logic */
-static bool too_many_isolated(struct zone *zone)
+static bool too_many_isolated(pg_data_t *pgdat)
 {
unsigned long active, inactive, isolated;
 
-   inactive = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
-   node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
-   active = node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE) +
-   node_page_state(zone->zone_pgdat, NR_ACTIVE_ANON);
-   isolated = node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE) +
-   node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON);
+   inactive = node_page_state(pgdat, NR_INACTIVE_FILE) +
+   node_page_state(pgdat, NR_INACTIVE_ANON);
+   active = node_page_state(pgdat, NR_ACTIVE_FILE) +
+   node_page_state(pgdat, NR_ACTIVE_ANON);
+   isolated = node_page_state(pgdat, NR_ISOLATED_FILE) +
+   node_page_state(pgdat, NR_ISOLATED_ANON);
 
return isolated > (inactive + active) / 2;
 }
@@ -774,8 +774,7 @@ static unsigned long
 isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
unsigned long end_pfn, isolate_mode_t isolate_mode)
 {
-   struct zone *zone = cc->zone;
-   pg_data_t *pgdat = zone->zone_pgdat;
+   pg_data_t *pgdat = cc->zone->zone_pgdat;
unsigned long nr_scanned = 0, nr_isolated = 0;
struct lruvec *lruvec;
unsigned long flags = 0;
@@ -791,7 +790,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 * list by either parallel reclaimers or compaction. If there are,
 * delay for some time until fewer pages are isolated
 */
-   while (unlikely(too_many_isolated(zone))) {
+   while (unlikely(too_many_isolated(pgdat))) {
/* async migration should just abort */
if (cc->mode == MIGRATE_ASYNC)
return 0;
-- 
2.19.2



[PATCH v2 1/4] mm/workingset: remove unused @mapping argument in workingset_eviction()

2019-02-28 Thread Andrey Ryabinin
workingset_eviction() doesn't use and never did use the @mapping argument.
Remove it.

Signed-off-by: Andrey Ryabinin 
Acked-by: Johannes Weiner 
Acked-by: Rik van Riel 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Mel Gorman 
---

Changes since v1:
 - s/@mapping/@page->mapping in comment
 - Acks

 include/linux/swap.h | 2 +-
 mm/vmscan.c  | 2 +-
 mm/workingset.c  | 5 ++---
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 649529be91f2..fc50e21b3b88 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,7 +307,7 @@ struct vma_swap_readahead {
 };
 
 /* linux/mm/workingset.c */
-void *workingset_eviction(struct address_space *mapping, struct page *page);
+void *workingset_eviction(struct page *page);
 void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ac4806f0f332..a9852ed7b97f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -952,7 +952,7 @@ static int __remove_mapping(struct address_space *mapping, 
struct page *page,
 */
if (reclaimed && page_is_file_cache(page) &&
!mapping_exiting(mapping) && !dax_mapping(mapping))
-   shadow = workingset_eviction(mapping, page);
+   shadow = workingset_eviction(page);
__delete_from_page_cache(page, shadow);
xa_unlock_irqrestore(&mapping->i_pages, flags);
 
diff --git a/mm/workingset.c b/mm/workingset.c
index dcb994f2acc2..0bedf67502d5 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -215,13 +215,12 @@ static void unpack_shadow(void *shadow, int *memcgidp, 
pg_data_t **pgdat,
 
 /**
  * workingset_eviction - note the eviction of a page from memory
- * @mapping: address space the page was backing
  * @page: the page being evicted
  *
- * Returns a shadow entry to be stored in @mapping->i_pages in place
+ * Returns a shadow entry to be stored in @page->mapping->i_pages in place
  * of the evicted @page so that a later refault can be detected.
  */
-void *workingset_eviction(struct address_space *mapping, struct page *page)
+void *workingset_eviction(struct page *page)
 {
struct pglist_data *pgdat = page_pgdat(page);
struct mem_cgroup *memcg = page_memcg(page);
-- 
2.19.2



Re: [PATCH 5/6] objtool: Add UACCESS validation

2019-02-27 Thread Andrey Ryabinin



On 2/27/19 5:08 PM, Peter Zijlstra wrote:
> On Mon, Feb 25, 2019 at 01:43:35PM +0100, Peter Zijlstra wrote:
>> It is important that UACCESS regions are as small as possible;
>> furthermore the UACCESS state is not scheduled, so doing anything that
>> might directly call into the scheduler will cause random code to be
>> run with UACCESS enabled.
>>
>> Teach objtool too track UACCESS state and warn about any CALL made
>> while UACCESS is enabled. This very much includes the __fentry__()
>> tracing calls and __preempt_schedule() calls.
>>
>> Note that exceptions _do_ save/restore the UACCESS state, and therefore
>> they can drive preemption. This also means that all exception handlers
>> must have an otherwise redundant UACCESS disable instruction;
>> therefore ignore this warning for !STT_FUNC code (exception handlers
>> are not normal functions).
>>
>> It also provides a UACCESS_SAFE() annotation which allows explicit
>> annotation. This is meant to be used for future things like:
>> unsafe_copy_{to,from}_user().
>>
>> Signed-off-by: Peter Zijlstra (Intel) 
> 
> So KASAN is wildly unhappy..
> 
> I can't actually find any definitions of those functions, so I can't
> very well mark the safe, even if we wanted to.
> 

They are macro-generated. Use 'git grep DEFINE_ASAN'

> ---
> 
>>> arch/x86/kernel/signal.o: warning: objtool: restore_sigcontext()+0x59: call 
>>> to __asan_store8_noabort() with UACCESS enabled
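
For reference, those entry points are generated roughly like this (simplified
sketch from memory, not the exact mm/kasan/ source):

#define EXAMPLE_DEFINE_ASAN_LOAD(size)					\
	void __asan_load##size##_noabort(unsigned long addr)		\
	{								\
		check_memory_region(addr, size, false, _RET_IP_);	\
	}								\
	EXPORT_SYMBOL(__asan_load##size##_noabort)

EXAMPLE_DEFINE_ASAN_LOAD(1);
EXAMPLE_DEFINE_ASAN_LOAD(2);
EXAMPLE_DEFINE_ASAN_LOAD(4);
EXAMPLE_DEFINE_ASAN_LOAD(8);
EXAMPLE_DEFINE_ASAN_LOAD(16);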



Re: [PATCH] [v2] kasan: turn off asan-stack for clang-8 and earlier

2019-02-26 Thread Andrey Ryabinin



On 2/23/19 1:29 AM, Arnd Bergmann wrote:
> Building an arm64 allmodconfig kernel with clang results in over 140 warnings
> about overly large stack frames, the worst ones being:
> 
> drivers/gpu/drm/panel/panel-sitronix-st7789v.c:196:12: error: stack frame 
> size of 20224 bytes in function 'st7789v_prepare'
> drivers/video/fbdev/omap2/omapfb/displays/panel-tpo-td028ttec1.c:196:12: 
> error: stack frame size of 13120 bytes in function 'td028ttec1_panel_enable'
> drivers/usb/host/max3421-hcd.c:1395:1: error: stack frame size of 10048 bytes 
> in function 'max3421_spi_thread'
> drivers/net/wan/slic_ds26522.c:209:12: error: stack frame size of 9664 bytes 
> in function 'slic_ds26522_probe'
> drivers/crypto/ccp/ccp-ops.c:2434:5: error: stack frame size of 8832 bytes in 
> function 'ccp_run_cmd'
> drivers/media/dvb-frontends/stv0367.c:1005:12: error: stack frame size of 
> 7840 bytes in function 'stv0367ter_algo'
> 
> None of these happen with gcc today, and almost all of these are the result
> of a single known issue in llvm.  Hopefully it will eventually get fixed with
> the clang-9 release.
> 
> In the meantime, the best idea I have is to turn off asan-stack for clang-8
> and earlier, so we can produce a kernel that is safe to run.
> 
> I have posted three patches that address the frame overflow warnings that are
> not addressed by turning off asan-stack, so in combination with this change,
> we get much closer to a clean allmodconfig build, which in turn is necessary
> to do meaningful build regression testing.
> 
> It is still possible to turn on the CONFIG_ASAN_STACK option on all versions
> of clang, and it's always enabled for gcc, but when CONFIG_COMPILE_TEST is
> set, the option remains invisible, so allmodconfig and randconfig builds
> (which are normally done with a forced CONFIG_COMPILE_TEST) will still result
> in a mostly clean build.
> 
> Cc: Andrey Ryabinin 
> Cc: Dmitry Vyukov 
> Cc: Nick Desaulniers 
> Cc: Mark Brown 
> Cc: Qian Cai 
> Cc: Kostya Serebryany 
> Cc: Andrey Konovalov 
> Link: https://bugs.llvm.org/show_bug.cgi?id=38809
> Signed-off-by: Arnd Bergmann 
> ---

Acked-by: Andrey Ryabinin 


Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.

2019-02-26 Thread Andrey Ryabinin



On 2/25/19 7:03 AM, Roman Gushchin wrote:
> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
>> In the presence of more than one memory cgroup in the system our reclaim
>> logic just sucks. When we hit a memory limit (global or a limit on a
>> cgroup with subgroups) we reclaim some memory from all cgroups.
>> This sucks because the cgroup that allocates more often always wins.
>> E.g. a job that allocates a lot of clean, rarely used page cache will push
>> out of memory other jobs with an active, relatively small, all-in-memory
>> working set.
>>
>> To prevent such situations we have memcg controls like low/max, etc. which
>> are supposed to protect jobs or limit them so they do not hurt others.
>> But memory cgroups are very hard to configure right because it requires
>> precise knowledge of the workload, which may vary during execution.
>> E.g. setting a memory limit means that the job won't be able to use all
>> memory in the system for page cache even if the rest of the system is idle.
>> Basically our current scheme requires configuring every single cgroup
>> in the system.
>>
>> I think we can do better. The idea proposed by this patch is to reclaim
>> only inactive pages and only from cgroups that have a big
>> (!inactive_is_low()) inactive list, and to go back to shrinking active
>> lists only if all inactive lists are low.
> 
> Hi Andrey!
> 
> It's definitely an interesting idea! However, let me bring some concerns:
> 1) What's considered active and inactive depends on memory pressure inside
> a cgroup.

There is no such dependency. High memory pressure may be generated by both
active and inactive pages. We can also have a cgroup creating no pressure
with almost only active (or only inactive) pages.

> Actually active pages in one cgroup (e.g. just deleted) can be colder
> than inactive pages in an other (e.g. a memory-hungry cgroup with a tight
> memory.max).
> 

Well, yes, this is a drawback of having per-memcg LRUs.

> Also a workload inside a cgroup can to some extend control what's going
> to the active LRU. So it opens a way to get more memory unfairly by
> artificially promoting more pages to the active LRU. So a cgroup
> can get an unfair advantage over other cgroups.
> 

Unfair is usually a negative term, but in this case it very much depends on
the definition of what is "fair".

If fair means putting equal reclaim pressure on all cgroups, then yes, the patch
increases such unfairness, but that kind of unfairness is a good thing.
Obviously it's more valuable to keep an actively used page in memory than a
page that is not used.

> Generally speaking, now we have a way to measure the memory pressure
> inside a cgroup. So, in theory, it should be possible to balance
> scanning effort based on memory pressure.
> 

Simply by design, inactive pages are the first candidates for reclaim.
Any decision that doesn't take inactive pages into account would probably
be wrong.

E.g. cgroup A runs an active job with a big and active working set which
creates high memory pressure, and cgroup B is idle (no memory pressure) with
a huge unused cache. It's definitely preferable to reclaim from B rather
than from A.


Re: [PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.

2019-02-26 Thread Andrey Ryabinin



On 2/22/19 10:15 PM, Johannes Weiner wrote:
> On Fri, Feb 22, 2019 at 08:58:25PM +0300, Andrey Ryabinin wrote:
>> In the presence of more than one memory cgroup in the system our reclaim
>> logic just sucks. When we hit a memory limit (global or a limit on a
>> cgroup with subgroups) we reclaim some memory from all cgroups.
>> This sucks because the cgroup that allocates more often always wins.
>> E.g. a job that allocates a lot of clean, rarely used page cache will push
>> out of memory other jobs with an active, relatively small, all-in-memory
>> working set.
>>
>> To prevent such situations we have memcg controls like low/max, etc. which
>> are supposed to protect jobs or limit them so they do not hurt others.
>> But memory cgroups are very hard to configure right because it requires
>> precise knowledge of the workload, which may vary during execution.
>> E.g. setting a memory limit means that the job won't be able to use all
>> memory in the system for page cache even if the rest of the system is idle.
>> Basically our current scheme requires configuring every single cgroup
>> in the system.
>>
>> I think we can do better. The idea proposed by this patch is to reclaim
>> only inactive pages and only from cgroups that have a big
>> (!inactive_is_low()) inactive list, and to go back to shrinking active
>> lists only if all inactive lists are low.
> 
> Yes, you are absolutely right.
> 
> We shouldn't go after active pages as long as there are plenty of
> inactive pages around. That's the global reclaim policy, and we
> currently fail to translate that well to cgrouped systems.
> 
> Setting group protections or limits would work around this problem,
> but they're kind of a red herring. We shouldn't ever allow use-once
> streams to push out hot workingsets, that's a bug.
> 
>> @@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, 
>> struct mem_cgroup *memcg,
>>  
>>  scan >>= sc->priority;
>>  
>> +if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
>> +file, memcg, sc, false))
>> +scan = 0;
>> +
>>  /*
>>   * If the cgroup's already been deleted, make sure to
>>   * scrape out the remaining cache.
>> @@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct 
>> scan_control *sc)
>>  struct reclaim_state *reclaim_state = current->reclaim_state;
>>  unsigned long nr_reclaimed, nr_scanned;
>>  bool reclaimable = false;
>> +bool retry;
>>  
>>  do {
>>  struct mem_cgroup *root = sc->target_mem_cgroup;
>> @@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct 
>> scan_control *sc)
>>  };
>>  struct mem_cgroup *memcg;
>>  
>> +retry = false;
>> +
>> memset(&sc->nr, 0, sizeof(sc->nr));
>>  
>>  nr_reclaimed = sc->nr_reclaimed;
>> @@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct 
>> scan_control *sc)
>>  }
>> } while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
>>  
>> +if ((sc->nr_scanned - nr_scanned) == 0 &&
>> + !sc->may_shrink_active) {
>> +sc->may_shrink_active = 1;
>> +retry = true;
>> +continue;
>> +}
> 
> Using !scanned as the gate could be a problem. There might be a cgroup
> that has inactive pages on the local level, but when viewed from the
> system level the total inactive pages in the system might still be low
> compared to active ones. In that case we should go after active pages.
> 
> Basically, during global reclaim, the answer for whether active pages
> should be scanned or not should be the same regardless of whether the
> memory is all global or whether it's spread out between cgroups.
> 
> The reason this isn't the case is because we're checking the ratio at
> the lruvec level - which is the highest level (and identical to the
> node counters) when memory is global, but it's at the lowest level
> when memory is cgrouped.
> 
> So IMO what we should do is:
> 
> - At the beginning of global reclaim, use node_page_state() to compare
>   the INACTIVE_FILE:ACTIVE_FILE ratio and then decide whether reclaim
>   can go after active pages or not. Regardless of what the ratio is in
>   individual lruvecs.
> 
> - And likewise at the beginning of cgroup limit reclaim, walk the
>   subtree starting at sc->target_mem_cgroup, sum up the INACTIVE_FILE
>   and ACTIVE_FILE counters, and make inactive_is_low() decision on
>   those sums.
> 

Sounds reasonable.
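
Something along these lines for the global reclaim side (just a sketch of the
idea, not a tested patch; may_shrink_active is the flag from the RFC and the
threshold here is purely illustrative):

static void example_set_may_shrink_active(pg_data_t *pgdat,
					  struct scan_control *sc)
{
	unsigned long inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
	unsigned long active = node_page_state(pgdat, NR_ACTIVE_FILE);

	/*
	 * Decide once per global reclaim cycle, from the node-wide counters,
	 * whether active lists may be scanned at all.
	 */
	sc->may_shrink_active = inactive < active;
}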


Re: [PATCH 1/5] mm/workingset: remove unused @mapping argument in workingset_eviction()

2019-02-26 Thread Andrey Ryabinin



On 2/25/19 3:01 PM, Vlastimil Babka wrote:
> On 2/22/19 6:43 PM, Andrey Ryabinin wrote:
>> workingset_eviction() doesn't use and never did use the @mapping argument.
>> Remove it.
>>
>> Signed-off-by: Andrey Ryabinin 
>> Cc: Johannes Weiner 
>> Cc: Michal Hocko 
>> Cc: Vlastimil Babka 
>> Cc: Rik van Riel 
>> Cc: Mel Gorman 
>> ---
>>  include/linux/swap.h | 2 +-
>>  mm/vmscan.c  | 2 +-
>>  mm/workingset.c  | 3 +--
>>  3 files changed, 3 insertions(+), 4 deletions(-)
>>
>> diff --git a/include/linux/swap.h b/include/linux/swap.h
>> index 649529be91f2..fc50e21b3b88 100644
>> --- a/include/linux/swap.h
>> +++ b/include/linux/swap.h
>> @@ -307,7 +307,7 @@ struct vma_swap_readahead {
>>  };
>>  
>>  /* linux/mm/workingset.c */
>> -void *workingset_eviction(struct address_space *mapping, struct page *page);
>> +void *workingset_eviction(struct page *page);
>>  void workingset_refault(struct page *page, void *shadow);
>>  void workingset_activation(struct page *page);
>>  
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index ac4806f0f332..a9852ed7b97f 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -952,7 +952,7 @@ static int __remove_mapping(struct address_space 
>> *mapping, struct page *page,
>>   */
>>  if (reclaimed && page_is_file_cache(page) &&
>>  !mapping_exiting(mapping) && !dax_mapping(mapping))
>> -shadow = workingset_eviction(mapping, page);
>> +shadow = workingset_eviction(page);
>>  __delete_from_page_cache(page, shadow);
>> xa_unlock_irqrestore(&mapping->i_pages, flags);
>>  
>> diff --git a/mm/workingset.c b/mm/workingset.c
>> index dcb994f2acc2..0906137760c5 100644
>> --- a/mm/workingset.c
>> +++ b/mm/workingset.c
>> @@ -215,13 +215,12 @@ static void unpack_shadow(void *shadow, int *memcgidp, 
>> pg_data_t **pgdat,
>>  
>>  /**
>>   * workingset_eviction - note the eviction of a page from memory
>> - * @mapping: address space the page was backing
>>   * @page: the page being evicted
>>   *
>>   * Returns a shadow entry to be stored in @mapping->i_pages in place
> 
> The line above still references @mapping, I guess kerneldoc build will
> complain?
> 

Maybe. Will replace it with @page->mapping->i_pages


Re: [PATCH 5/5] mm/vmscan: don't forcely shrink active anon lru list

2019-02-26 Thread Andrey Ryabinin
On 2/22/19 9:22 PM, Johannes Weiner wrote:
> On Fri, Feb 22, 2019 at 08:43:37PM +0300, Andrey Ryabinin wrote:
>> shrink_node_memcg() always forcibly shrinks the active anon list.
>> This doesn't seem like correct behavior. If the system/memcg has no swap,
>> it's absolutely pointless to rebalance the anon lru lists.
>> And in case we did scan the active anon list above, it's unclear why we
>> would need this additional forced scan. If there are cases when we want a
>> more aggressive scan of the anon lru we should just change the scan target
>> in get_scan_count() (and better explain such cases in the comments).
>>
>> Remove this forced shrink and let get_scan_count() decide how
>> much of the active anon we want to shrink.
> 
> This change breaks the anon pre-aging.
> 
> The idea behind this is that the VM maintains a small batch of anon
> reclaim candidates with recent access information. On every reclaim,
> even when we just trim cache, which is the most common reclaim mode,
> but also when we just swapped out some pages and shrunk the inactive
> anon list, at the end of it we make sure that the list of potential
> anon candidates is refilled for the next reclaim cycle.
> 
> The comments for this are above inactive_list_is_low() and the
> age_active_anon() call from kswapd.
> 
> Re: no swap, you are correct. We should gate that rebalancing on
> total_swap_pages, just like age_active_anon() does.
> 


I think we should leave anon aging only for the !SCAN_FILE cases.
At least, aging was definitely invented for the SCAN_FRACT mode, which was the
main mode at the time it was added by the commit:

556adecba110bf5f1db6c6b56416cfab5bcab698
Author: Rik van Riel 
Date:   Sat Oct 18 20:26:34 2008 -0700

vmscan: second chance replacement for anonymous pages


Later we've got more of the SCAN_FILE mode usage, commit:

e9868505987a03a26a3979f27b82911ccc003752
Author: Rik van Riel 
Date:   Tue Dec 11 16:01:10 2012 -0800

mm,vmscan: only evict file pages when we have plenty


and I think it would be reasonable to avoid the anon aging in the SCAN_FILE case.
Because if the workload generates enough inactive file pages we never go to
SCAN_FRACT, so aging is just as useless as in the no-swap case.

So, how about something like below on top of the patch?

---
 mm/vmscan.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index efd10d6b9510..6c63adfee37b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2525,6 +2525,15 @@ static void get_scan_count(struct lruvec *lruvec, struct 
mem_cgroup *memcg,
 
nr[lru] = scan;
}
+
+   /*
+* Even if we did not try to evict anon pages at all, we want to
+* rebalance the anon lru active/inactive ratio to maintain
+* enough reclaim candidates for the next reclaim cycle.
+*/
+   if (scan_balance != SCAN_FILE && inactive_list_is_low(lruvec,
+   false, memcg, sc, false))
+   nr[LRU_ACTIVE_ANON] += SWAP_CLUSTER_MAX;
 }
 
 /*
-- 
2.19.2






[PATCH RFC] mm/vmscan: try to protect active working set of cgroup from reclaim.

2019-02-22 Thread Andrey Ryabinin
In the presence of more than one memory cgroup in the system our reclaim
logic just sucks. When we hit a memory limit (global or a limit on a
cgroup with subgroups) we reclaim some memory from all cgroups.
This sucks because the cgroup that allocates more often always wins.
E.g. a job that allocates a lot of clean, rarely used page cache will push
out of memory other jobs with an active, relatively small, all-in-memory
working set.

To prevent such situations we have memcg controls like low/max, etc. which
are supposed to protect jobs or limit them so they do not hurt others.
But memory cgroups are very hard to configure right because it requires
precise knowledge of the workload, which may vary during execution.
E.g. setting a memory limit means that the job won't be able to use all
memory in the system for page cache even if the rest of the system is idle.
Basically our current scheme requires configuring every single cgroup
in the system.

I think we can do better. The idea proposed by this patch is to reclaim
only inactive pages and only from cgroups that have a big
(!inactive_is_low()) inactive list, and to go back to shrinking active
lists only if all inactive lists are low.

Now, a simple test case to demonstrate the effect of the patch.
A job in one memcg repeatedly compresses one file:

 perf stat -n --repeat 20 gzip -ck sample > /dev/null

while 'dd' runs in parallel in another cgroup, reading from the disk.

Before:
Performance counter stats for 'gzip -ck sample' (20 runs):
  17.673572290 seconds time elapsed 
 ( +-  5.60% )
After:
Performance counter stats for 'gzip -ck sample' (20 runs):
  11.426193980 seconds time elapsed 
 ( +-  0.20% )

The more often the dd cgroup allocates memory, the more gzip suffers.
With 4 parallel dd instances instead of one:

Before:
Performance counter stats for 'gzip -ck sample' (20 runs):
  499.976782013 seconds time elapsed
  ( +- 23.13% )
After:
Performance counter stats for 'gzip -ck sample' (20 runs):
  11.307450516 seconds time elapsed 
 ( +-  0.27% )

It would be possible to achieve a similar effect by
setting memory.low on the gzip cgroup, but the best value for memory.low
depends on the size of the 'sample' file. It's also possible
to limit the 'dd' job, but just imagine something more sophisticated
than 'dd': a job that would benefit from occupying all available
memory. The best limit for such a job would be something like
'total_memory' - 'sample size', which is again unknown.

Signed-off-by: Andrey Ryabinin 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Rik van Riel 
Cc: Mel Gorman 
Cc: Roman Gushchin 
Cc: Shakeel Butt 
---
 mm/vmscan.c | 18 +-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index efd10d6b9510..2f562c3358ab 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -104,6 +104,8 @@ struct scan_control {
/* One of the zones is ready for compaction */
unsigned int compaction_ready:1;
 
+   unsigned int may_shrink_active:1;
+
/* Allocation order */
s8 order;
 
@@ -2489,6 +2491,10 @@ static void get_scan_count(struct lruvec *lruvec, struct 
mem_cgroup *memcg,
 
scan >>= sc->priority;
 
+   if (!sc->may_shrink_active && inactive_list_is_low(lruvec,
+   file, memcg, sc, false))
+   scan = 0;
+
/*
 * If the cgroup's already been deleted, make sure to
 * scrape out the remaining cache.
@@ -2733,6 +2739,7 @@ static bool shrink_node(pg_data_t *pgdat, struct 
scan_control *sc)
struct reclaim_state *reclaim_state = current->reclaim_state;
unsigned long nr_reclaimed, nr_scanned;
bool reclaimable = false;
+   bool retry;
 
do {
struct mem_cgroup *root = sc->target_mem_cgroup;
@@ -2742,6 +2749,8 @@ static bool shrink_node(pg_data_t *pgdat, struct 
scan_control *sc)
};
struct mem_cgroup *memcg;
 
+   retry = false;
+
memset(&sc->nr, 0, sizeof(sc->nr));
 
nr_reclaimed = sc->nr_reclaimed;
@@ -2813,6 +2822,13 @@ static bool shrink_node(pg_data_t *pgdat, struct 
scan_control *sc)
}
} while ((memcg = mem_cgroup_iter(root, memcg, &reclaim)));
 
+   if ((sc->nr_scanned - nr_scanned) == 0 &&
+!sc->may_shrink_active) {
+   sc->may_shrink_active = 1;
+   retry = true;
+   continue;
+   }
+
if (reclaim_state) {
sc->nr_reclaimed += reclaim_state->reclaimed_slab;
reclaim_state->reclaimed_slab = 0;
@@ -28

[PATCH 5/5] mm/vmscan: don't forcely shrink active anon lru list

2019-02-22 Thread Andrey Ryabinin
shrink_node_memcg() always forcibly shrinks the active anon list.
This doesn't seem like correct behavior. If the system/memcg has no swap, it's
absolutely pointless to rebalance the anon lru lists.
And in case we did scan the active anon list above, it's unclear why we would
need this additional forced scan. If there are cases when we want a more
aggressive scan of the anon lru we should just change the scan target
in get_scan_count() (and better explain such cases in the comments).

Remove this forced shrink and let get_scan_count() decide how
much of the active anon we want to shrink.

Signed-off-by: Andrey Ryabinin 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Rik van Riel 
Cc: Mel Gorman 
---
 mm/vmscan.c | 12 ++--
 1 file changed, 2 insertions(+), 10 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 07f74e9507b6..efd10d6b9510 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2563,8 +2563,8 @@ static void shrink_node_memcg(struct pglist_data *pgdat, 
struct mem_cgroup *memc
 sc->priority == DEF_PRIORITY);
 
blk_start_plug(&plug);
-   while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
-   nr[LRU_INACTIVE_FILE]) {
+   while (nr[LRU_ACTIVE_ANON] || nr[LRU_INACTIVE_ANON] ||
+   nr[LRU_ACTIVE_FILE] || nr[LRU_INACTIVE_FILE]) {
unsigned long nr_anon, nr_file, percentage;
unsigned long nr_scanned;
 
@@ -2636,14 +2636,6 @@ static void shrink_node_memcg(struct pglist_data *pgdat, 
struct mem_cgroup *memc
}
blk_finish_plug(&plug);
sc->nr_reclaimed += nr_reclaimed;
-
-   /*
-* Even if we did not try to evict anon pages at all, we want to
-* rebalance the anon lru active/inactive ratio.
-*/
-   if (inactive_list_is_low(lruvec, false, memcg, sc, true))
-   shrink_active_list(SWAP_CLUSTER_MAX, lruvec,
-  sc, LRU_ACTIVE_ANON);
 }
 
 /* Use reclaim/compaction for costly allocs or under memory pressure */
-- 
2.19.2



[PATCH 1/5] mm/workingset: remove unused @mapping argument in workingset_eviction()

2019-02-22 Thread Andrey Ryabinin
workingset_eviction() doesn't use and never did use the @mapping argument.
Remove it.

Signed-off-by: Andrey Ryabinin 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Rik van Riel 
Cc: Mel Gorman 
---
 include/linux/swap.h | 2 +-
 mm/vmscan.c  | 2 +-
 mm/workingset.c  | 3 +--
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 649529be91f2..fc50e21b3b88 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -307,7 +307,7 @@ struct vma_swap_readahead {
 };
 
 /* linux/mm/workingset.c */
-void *workingset_eviction(struct address_space *mapping, struct page *page);
+void *workingset_eviction(struct page *page);
 void workingset_refault(struct page *page, void *shadow);
 void workingset_activation(struct page *page);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index ac4806f0f332..a9852ed7b97f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -952,7 +952,7 @@ static int __remove_mapping(struct address_space *mapping, 
struct page *page,
 */
if (reclaimed && page_is_file_cache(page) &&
!mapping_exiting(mapping) && !dax_mapping(mapping))
-   shadow = workingset_eviction(mapping, page);
+   shadow = workingset_eviction(page);
__delete_from_page_cache(page, shadow);
xa_unlock_irqrestore(&mapping->i_pages, flags);
 
diff --git a/mm/workingset.c b/mm/workingset.c
index dcb994f2acc2..0906137760c5 100644
--- a/mm/workingset.c
+++ b/mm/workingset.c
@@ -215,13 +215,12 @@ static void unpack_shadow(void *shadow, int *memcgidp, 
pg_data_t **pgdat,
 
 /**
  * workingset_eviction - note the eviction of a page from memory
- * @mapping: address space the page was backing
  * @page: the page being evicted
  *
  * Returns a shadow entry to be stored in @mapping->i_pages in place
  * of the evicted @page so that a later refault can be detected.
  */
-void *workingset_eviction(struct address_space *mapping, struct page *page)
+void *workingset_eviction(struct page *page)
 {
struct pglist_data *pgdat = page_pgdat(page);
struct mem_cgroup *memcg = page_memcg(page);
-- 
2.19.2



[PATCH 4/5] mm/vmscan: remove unused lru_pages argument

2019-02-22 Thread Andrey Ryabinin
The argument 'unsigned long *lru_pages' is passed around with no purpose;
remove it.

Signed-off-by: Andrey Ryabinin 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Rik van Riel 
Cc: Mel Gorman 
---
 mm/vmscan.c | 17 +
 1 file changed, 5 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 2d081a32c6a8..07f74e9507b6 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2257,8 +2257,7 @@ enum scan_balance {
  * nr[2] = file inactive pages to scan; nr[3] = file active pages to scan
  */
 static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
-  struct scan_control *sc, unsigned long *nr,
-  unsigned long *lru_pages)
+  struct scan_control *sc, unsigned long *nr)
 {
int swappiness = mem_cgroup_swappiness(memcg);
struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
@@ -2409,7 +2408,6 @@ static void get_scan_count(struct lruvec *lruvec, struct 
mem_cgroup *memcg,
fraction[1] = fp;
denominator = ap + fp + 1;
 out:
-   *lru_pages = 0;
for_each_evictable_lru(lru) {
int file = is_file_lru(lru);
unsigned long lruvec_size;
@@ -2525,7 +2523,6 @@ static void get_scan_count(struct lruvec *lruvec, struct 
mem_cgroup *memcg,
BUG();
}
 
-   *lru_pages += lruvec_size;
nr[lru] = scan;
}
 }
@@ -2534,7 +2531,7 @@ static void get_scan_count(struct lruvec *lruvec, struct 
mem_cgroup *memcg,
  * This is a basic per-node page freer.  Used by both kswapd and direct 
reclaim.
  */
 static void shrink_node_memcg(struct pglist_data *pgdat, struct mem_cgroup 
*memcg,
- struct scan_control *sc, unsigned long *lru_pages)
+ struct scan_control *sc)
 {
struct lruvec *lruvec = mem_cgroup_lruvec(pgdat, memcg);
unsigned long nr[NR_LRU_LISTS];
@@ -2546,7 +2543,7 @@ static void shrink_node_memcg(struct pglist_data *pgdat, 
struct mem_cgroup *memc
struct blk_plug plug;
bool scan_adjusted;
 
-   get_scan_count(lruvec, memcg, sc, nr, lru_pages);
+   get_scan_count(lruvec, memcg, sc, nr);
 
/* Record the original scan target for proportional adjustments later */
memcpy(targets, nr, sizeof(nr));
@@ -2751,7 +2748,6 @@ static bool shrink_node(pg_data_t *pgdat, struct 
scan_control *sc)
.pgdat = pgdat,
.priority = sc->priority,
};
-   unsigned long node_lru_pages = 0;
struct mem_cgroup *memcg;
 
memset(&sc->nr, 0, sizeof(sc->nr));
@@ -2761,7 +2757,6 @@ static bool shrink_node(pg_data_t *pgdat, struct 
scan_control *sc)
 
memcg = mem_cgroup_iter(root, NULL, &reclaim);
do {
-   unsigned long lru_pages;
unsigned long reclaimed;
unsigned long scanned;
 
@@ -2798,8 +2793,7 @@ static bool shrink_node(pg_data_t *pgdat, struct 
scan_control *sc)
 
reclaimed = sc->nr_reclaimed;
scanned = sc->nr_scanned;
-   shrink_node_memcg(pgdat, memcg, sc, &lru_pages);
-   node_lru_pages += lru_pages;
+   shrink_node_memcg(pgdat, memcg, sc);
 
if (sc->may_shrinkslab) {
shrink_slab(sc->gfp_mask, pgdat->node_id,
@@ -3332,7 +3326,6 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup 
*memcg,
.may_swap = !noswap,
.may_shrinkslab = 1,
};
-   unsigned long lru_pages;
 
sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
(GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
@@ -3349,7 +3342,7 @@ unsigned long mem_cgroup_shrink_node(struct mem_cgroup 
*memcg,
 * will pick up pages from other mem cgroup's as well. We hack
 * the priority and make it zero.
 */
-   shrink_node_memcg(pgdat, memcg, &sc, &lru_pages);
+   shrink_node_memcg(pgdat, memcg, &sc);
 
trace_mm_vmscan_memcg_softlimit_reclaim_end(sc.nr_reclaimed);
 
-- 
2.19.2



[PATCH 2/5] mm: remove zone_lru_lock() function access ->lru_lock directly

2019-02-22 Thread Andrey Ryabinin
We have a common pattern to access lru_lock from a page pointer:
zone_lru_lock(page_zone(page))

Which is silly, because it unfolds to this:

&NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]->zone_pgdat->lru_lock
while we can simply do
&NODE_DATA(page_to_nid(page))->lru_lock

Remove the zone_lru_lock() function, since it only complicates things.
Use the 'page_pgdat(page)->lru_lock' pattern instead.

Signed-off-by: Andrey Ryabinin 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Rik van Riel 
Cc: Mel Gorman 
---
 Documentation/cgroup-v1/memcg_test.txt |  4 ++--
 Documentation/cgroup-v1/memory.txt |  4 ++--
 include/linux/mm_types.h   |  2 +-
 include/linux/mmzone.h |  4 
 mm/compaction.c| 15 ---
 mm/filemap.c   |  4 ++--
 mm/huge_memory.c   |  6 +++---
 mm/memcontrol.c| 14 +++---
 mm/mlock.c | 14 +++---
 mm/page_idle.c |  8 
 mm/rmap.c  |  2 +-
 mm/swap.c  | 16 
 mm/vmscan.c| 16 
 13 files changed, 53 insertions(+), 56 deletions(-)

diff --git a/Documentation/cgroup-v1/memcg_test.txt 
b/Documentation/cgroup-v1/memcg_test.txt
index 5c7f310f32bb..621e29ffb358 100644
--- a/Documentation/cgroup-v1/memcg_test.txt
+++ b/Documentation/cgroup-v1/memcg_test.txt
@@ -107,9 +107,9 @@ Under below explanation, we assume 
CONFIG_MEM_RES_CTRL_SWAP=y.
 
 8. LRU
 Each memcg has its own private LRU. Now, its handling is under global
-   VM's control (means that it's handled under global zone_lru_lock).
+   VM's control (means that it's handled under global pgdat->lru_lock).
Almost all routines around memcg's LRU is called by global LRU's
-   list management functions under zone_lru_lock().
+   list management functions under pgdat->lru_lock.
 
A special function is mem_cgroup_isolate_pages(). This scans
memcg's private LRU and call __isolate_lru_page() to extract a page
diff --git a/Documentation/cgroup-v1/memory.txt 
b/Documentation/cgroup-v1/memory.txt
index 8e2cb1dabeb0..a33cedf85427 100644
--- a/Documentation/cgroup-v1/memory.txt
+++ b/Documentation/cgroup-v1/memory.txt
@@ -267,11 +267,11 @@ When oom event notifier is registered, event will be 
delivered.
Other lock order is following:
PG_locked.
mm->page_table_lock
-   zone_lru_lock
+   pgdat->lru_lock
  lock_page_cgroup.
   In many cases, just lock_page_cgroup() is called.
   per-zone-per-cgroup LRU (cgroup's private LRU) is just guarded by
-  zone_lru_lock, it has no lock of its own.
+  pgdat->lru_lock, it has no lock of its own.
 
 2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fe0f672f53ce..9b9dd8350e26 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -79,7 +79,7 @@ struct page {
struct {/* Page cache and anonymous pages */
/**
 * @lru: Pageout list, eg. active_list protected by
-* zone_lru_lock.  Sometimes used as a generic list
+* pgdat->lru_lock.  Sometimes used as a generic list
 * by the page owner.
 */
struct list_head lru;
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 2fd4247262e9..22423763c0bd 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -788,10 +788,6 @@ typedef struct pglist_data {
 
 #define node_start_pfn(nid)(NODE_DATA(nid)->node_start_pfn)
 #define node_end_pfn(nid) pgdat_end_pfn(NODE_DATA(nid))
-static inline spinlock_t *zone_lru_lock(struct zone *zone)
-{
-   return &zone->zone_pgdat->lru_lock;
-}
 
 static inline struct lruvec *node_lruvec(struct pglist_data *pgdat)
 {
diff --git a/mm/compaction.c b/mm/compaction.c
index 98f99f41dfdc..a3305f13a138 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -775,6 +775,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
unsigned long end_pfn, isolate_mode_t isolate_mode)
 {
struct zone *zone = cc->zone;
+   pg_data_t *pgdat = zone->zone_pgdat;
unsigned long nr_scanned = 0, nr_isolated = 0;
struct lruvec *lruvec;
unsigned long flags = 0;
@@ -839,8 +840,8 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 * if contended.
 */
if (!(low_pfn % SWAP_CLUSTER_MAX)
-   && compact_unlock_should_abort(zone_lru_lock(zone), flags,
-   &locked, cc))
+   

[PATCH 3/5] mm/compaction: pass pgdat to too_many_isolated() instead of zone

2019-02-22 Thread Andrey Ryabinin
too_many_isolated() in mm/compaction.c looks only at node state,
so it makes more sense to change its argument to pgdat instead of zone.

Signed-off-by: Andrey Ryabinin 
Cc: Johannes Weiner 
Cc: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Rik van Riel 
Cc: Mel Gorman 
---
 mm/compaction.c | 19 +--
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index a3305f13a138..b2d02aba41d8 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -738,16 +738,16 @@ isolate_freepages_range(struct compact_control *cc,
 }
 
 /* Similar to reclaim, but different enough that they don't share logic */
-static bool too_many_isolated(struct zone *zone)
+static bool too_many_isolated(pg_data_t *pgdat)
 {
unsigned long active, inactive, isolated;
 
-   inactive = node_page_state(zone->zone_pgdat, NR_INACTIVE_FILE) +
-   node_page_state(zone->zone_pgdat, NR_INACTIVE_ANON);
-   active = node_page_state(zone->zone_pgdat, NR_ACTIVE_FILE) +
-   node_page_state(zone->zone_pgdat, NR_ACTIVE_ANON);
-   isolated = node_page_state(zone->zone_pgdat, NR_ISOLATED_FILE) +
-   node_page_state(zone->zone_pgdat, NR_ISOLATED_ANON);
+   inactive = node_page_state(pgdat, NR_INACTIVE_FILE) +
+   node_page_state(pgdat, NR_INACTIVE_ANON);
+   active = node_page_state(pgdat, NR_ACTIVE_FILE) +
+   node_page_state(pgdat, NR_ACTIVE_ANON);
+   isolated = node_page_state(pgdat, NR_ISOLATED_FILE) +
+   node_page_state(pgdat, NR_ISOLATED_ANON);
 
return isolated > (inactive + active) / 2;
 }
@@ -774,8 +774,7 @@ static unsigned long
 isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
unsigned long end_pfn, isolate_mode_t isolate_mode)
 {
-   struct zone *zone = cc->zone;
-   pg_data_t *pgdat = zone->zone_pgdat;
+   pg_data_t *pgdat = cc->zone->zone_pgdat;
unsigned long nr_scanned = 0, nr_isolated = 0;
struct lruvec *lruvec;
unsigned long flags = 0;
@@ -791,7 +790,7 @@ isolate_migratepages_block(struct compact_control *cc, 
unsigned long low_pfn,
 * list by either parallel reclaimers or compaction. If there are,
 * delay for some time until fewer pages are isolated
 */
-   while (unlikely(too_many_isolated(zone))) {
+   while (unlikely(too_many_isolated(pgdat))) {
/* async migration should just abort */
if (cc->mode == MIGRATE_ASYNC)
return 0;
-- 
2.19.2



Re: [PATCH] kasan: turn off asan-stack for clang-8 and earlier

2019-02-21 Thread Andrey Ryabinin



On 2/21/19 6:19 PM, Arnd Bergmann wrote:
> On Thu, Feb 21, 2019 at 11:06 AM Andrey Ryabinin
>  wrote:
>> On 2/20/19 8:35 PM, Arnd Bergmann wrote:
>>> On Wed, Feb 20, 2019 at 6:00 PM Andrey Ryabinin  
>>> wrote:
>>>> On 2/20/19 5:51 PM, Arnd Bergmann wrote:
> 
>>> Maybe bringing it back would be a compromise? That way it's hidden from
>>> all the build testing bots (because of the !CONFIG_COMPILE_TEST dependency),
>>> but anyone who really wants it can still have the option, and set
>>> CONFIG_FRAME_WARN
>>> to whichever value they like.
>>>
>>
>>
>> I think there is much simpler solution:
>>
>> ---
>>  lib/Kconfig.kasan | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/lib/Kconfig.kasan b/lib/Kconfig.kasan
>> index 219cddc913ac..6cd035f06cee 100644
>> --- a/lib/Kconfig.kasan
>> +++ b/lib/Kconfig.kasan
>> @@ -105,6 +105,8 @@ endchoice
>>
>>  config KASAN_STACK
>> int
>> +   range 0 1
>> +   prompt "Enable stack instrumentation (unsafe)" if CC_IS_CLANG && 
>> (CLANG_VERSION < 9)
>> default 0 if CC_IS_CLANG && (CLANG_VERSION < 9)
>> default 1
>> help
>> --
>>
>>
>> AFAIK, randconfig is not able to randomize int config options, so it will be 
>> disabled for build robots,
>> but users still will be able to enable it.
> 
> Right, this will work, but I find it a bit awkward to require users to
> enter 0 or 1.
> 
> My assumption is that build bots turn on CONFIG_COMPILE_TEST, so
> having a bool option that depends on COMPILE_TEST would be more
> conventional. We can debate whether it should also depend on
> CONFIG_EXPERT or not. Something like
> 
> config KASAN_STACK
>  bool "Enable stack instrumentation (unsafe)" if CC_IS_CLANG
> && !COMPILE_TEST
>  default CC_IS_GCC || (CLANG_VERSION >= 9)
> 
> And then a simpler Makefile logic (could also be done in Kconfig) to turn
> that bool symbol into an integer argument for asan-stack=
> 

Sounds good.


Re: [PATCH 4.19 23/30] x86_64: increase stack size for KASAN_EXTRA

2019-02-21 Thread Andrey Ryabinin



On 2/21/19 6:26 PM, Greg Kroah-Hartman wrote:
> On Thu, Feb 21, 2019 at 06:00:06PM +0300, Andrey Ryabinin wrote:
>> On 2/21/19 5:36 PM, Greg Kroah-Hartman wrote:
>>> 4.19-stable review patch.  If anyone has any objections, please let me know.
>>>
>>
>> Drop it please. It increases the kernel stack to order-4, which is above
>> PAGE_ALLOC_COSTLY_ORDER.
>> This changes the behavior of fork() by making it fail with -ENOMEM due to
>> high memory fragmentation.
>>
>> The patch that removes KASAN_EXTRA (not in Linus's tree yet) would be a 
>> better fix for the problem.
> 
> Ok, should this also be dropped from 4.20.y?
> 

Yes.
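
For context, the problem with the order-4 stack is roughly this (illustrative
sketch only, not the actual fork() path; the order value stands in for the
THREAD_SIZE_ORDER that KASAN_EXTRA would require):

static unsigned long example_alloc_thread_stack(int node)
{
	/*
	 * Without CONFIG_VMAP_STACK the stack is one physically contiguous
	 * allocation. Order 4 is above PAGE_ALLOC_COSTLY_ORDER (3), so the
	 * allocator gives up quickly on a fragmented system and fork()
	 * ends up returning -ENOMEM.
	 */
	struct page *page = alloc_pages_node(node, THREADINFO_GFP, 4);

	return page ? (unsigned long)page_address(page) : 0;
}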

